Part 4 - Using Cursor and Claude to Create Automated Tests with Playwright
Here is the GitHub source code repo for this project: https://github.com/joegiglio/cursor2-kb
It's been a while...
Last time we spoke, Cursor seemed to be breaking down and the index of the file structure had become corrupted. After waiting 23 days for a fix, I am happy to report that the problem does finally seem to be addressed! The UI is still quirky and it seems to slow down with some type of memory leak after a while which can be fixed with a restart. But at least now it is making changes to the correct files and I am not faced with a mysteriously corrupted file index.
While waiting for a fix, I noticed that many Cursor users were complaining about the instability of Cursor and/or Claude. Yes, it works sometimes and it is impressive when it does but it is a black box of moving parts, constant updates behind the scenes and inconsistent results.
I decided to put the Help Desk project to the side and wanted to see if I can build a Knowledge Base application instead. This might be a little simpler - no need for real time updates or complex integration points.
I used the same technology stack that I am most familiar with - Python, Flask and Bootstrap with a SQLite database. Everything was coming along swimmingly until I hit a snag with "database migrations". Similar to the previous chapter, it generated constraint code that SQLite did not like and it threw an error:
Exception during DB migration:
raise ValueError("Constraint must have a name")
ValueError: Constraint must have a name
That should have been easy enough for Claude to fix but it started spiraling and I ended up rebuilding my database from scratch. I will blame this on my inexperience with debugging flask-migrate but Claude was not able to clean up the mess without just wiping the database and starting over.
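For context, Alembic (which flask-migrate wraps) raises this ValueError when a batch migration has to touch a constraint that has no name, which is easy to hit on SQLite. A commonly suggested remedy is a naming convention, so every constraint gets a predictable name. The sketch below only previews the names such a convention would produce. The `preview_name` helper is hypothetical, and the wiring shown in the comments is an assumption, not something taken from this project.

```python
# Convention keys follow the pattern SQLAlchemy documents for MetaData.
NAMING_CONVENTION = {
    "ix": "ix_%(column_0_label)s",
    "uq": "uq_%(table_name)s_%(column_0_name)s",
    "ck": "ck_%(table_name)s_%(constraint_name)s",
    "fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
    "pk": "pk_%(table_name)s",
}


def preview_name(kind: str, **tokens: str) -> str:
    """Hypothetical helper: show the name a convention entry would yield."""
    return NAMING_CONVENTION[kind] % tokens


# Assumed wiring (not from the project):
#   metadata = MetaData(naming_convention=NAMING_CONVENTION)
#   db = SQLAlchemy(metadata=metadata)
#   migrate = Migrate(app, db, render_as_batch=True)
print(preview_name("fk", table_name="article",
                   column_0_name="topic_id", referred_table_name="topic"))
```

With named constraints in place, a batch migration has something to refer to when it drops or recreates a foreign key.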
After things settled down, I ended up with a mostly functional knowledge base system, along with an Admin panel, dark / light mode, search (just using the SQL queries) and test data generator. It was built with Quill which is an open source text editor with support for formatting and images.
I say "mostly functional" because I purposely did not complicate it with authentication or multi-tenancy. I kept the architecture simple with basic Flask routes and CRUD operations. No API yet... Perhaps I will revisit these features in phase 2.
As I was adding features, I kept noticing that Claude would fix one thing but break another - just like a real developer. Also, there was CSS and Javascript scattered all over the code. Sometimes, adding new articles would work but editing would not work and vice versa. Sometimes article formatting code would appear here and there and I had to ask Claude to clean it up. It was becoming whack-a-mole.
The project was getting complex enough that it needed test automation which was briefly discussed in the last chapter. In most companies, there is very little budget for QE and most Developers hate writing tests, so maybe AI can help?
I have a respectable amount of experience with Selenium and Cucumber but it has been a while. Playwright has been gaining traction and supposedly has better handling for locators and retries so I asked Claude to create some Playwright tests. The VS-Code Playwright extensions all seem to assume you are using Javascript and they do not have direct support for Python so I will need to run the tests from the Terminal, like an animal.
[PROMPT]
I would like to add end to end browser tests to this project using Playwright and Python. I would like the tests to use fixtures to seed the necessary data and then delete the data at the end of each test run. The tests need to clean up after themselves and not leave test data behind. If you can configure it to setup a test database while running the tests that is even better. Please let me know which libraries to install, create tests for the top 10 scenarios by looking at the main features of the product and let me know how to run these tests from within Cursor.
[/PROMPT]
The first attempt was pretty good, providing instructions on how to install the necessary libraries. The file structure for the tests also seemed accurate and it even added a test for responsiveness by changing the viewport size. Nice touch... But I did not like the initial set of tests it generated so I asked it for further refinement:
[PROMPT]
Authentication has not been implemented in this product yet. Please do not create tests for those features. Instead, focus on Admin functionality like creating, editing and deleting topics; adding, editing and deleting articles. Also, focus on end user functionality such as searching and viewing articles. Look at the code to figure out the top 10 features of the product and create tests for them.
[/PROMPT]
The next batch of tests looked a little better but still no Admin functionality.
[PROMPT]
Most of those tests look pretty good but I am not seeing anything for Admin functionality in the @admin.html file. We need to have tests for adding, editing, deleting topics and adding, editing and deleting articles.
[/PROMPT]
Now it looks like it got it right and created code to test the following app scenarios:
These look like great starting points. Now let's see if Claude's execution instructions are correct. Well, unfortunately not...
[PROMPT]
I followed your instructions for running the tests and I am running into a couple of problems.
I opened a new terminal and ran the commands "FLASK_ENV=testing flask run" followed by "pytest tests/e2e/". I see the following problems:
- When I navigate to the home page, I am seeing the same data I was seeing before. It did not seem to switch to the test database.
- An exception: ImportError while loading conftest '/Users/joe/cursor2-kb/tests/e2e/conftest.py'. tests/e2e/conftest.py:12: in from models import Topic, Article ModuleNotFoundError: No module named 'models'
[/PROMPT]
Claude created a models.py file, duplicating the code in the app.py file and I questioned why...
The initial code for this application was all done in the app.py file, including the models. This works fine but can leave you with an unwieldy single file application. Best practice says to separate the models and routes from the config and this is where Claude was taking me, even though it took a long winded route:
[PROMPT]
The models.py file you are proposing looks like it is duplicating code from app.py. Would it make more sense to use the models in app.py instead?
[/PROMPT]
[RESPONSE]
Actually, it's the other way around - we should move the models out of app.py into models.py to follow the separation of concerns principle. This is a common pattern in Flask applications where:
app.py handles application configuration and initialization
models.py contains all database models
routes are often in separate blueprint files
[/RESPONSE]
Fair enough. Let's go with it!
Well, not so fast. Upon closer inspection...
[PROMPT]
I like where you are going with separating the models.py file but you also deleted all the routes from the app.py file. I do not see them in any other file. They are just gone...
[/PROMPT]
Now Claude pulled all the routes out of the app.py file and moved them into a separate file for blueprints. It is making way too many architectural changes and I want it to revert... Beware the brilliant new developer that wants to rewrite everything! Although the suggested changes are considered best practices for a professional Flask codebase, I will not trust those changes until the tests are in place so we can verify they are working properly.
[PROMPT]
I understand that using blueprints is the preferred method for professional level Flask code. However, I am not comfortable making such refactors right now. I am first trying to get end to end tests in place. Then we can look at the project structure. Please do not move the routes out of the app.py file.
[/PROMPT]
Cursor attempted to revert but the merge UI was a sea of green and red and I was not comfortable with the change. New strategy: Let's get some simple Playwright tests in place and then work on incrementally improving them. Maybe this will be better...
I stashed the proposed test list since I liked where that was going, then closed the Cursor Composer window and started a new Composer, which I believe removes all context from the previous Composer.
Let's try again with a simpler starting point. Iterative development is usually best anyway.
[PROMPT]
I would like to add end to end browser tests to this project using Playwright and Python. Please let me know which libraries to install, create tests for the top 10 scenarios by looking at the main features of the product and let me know how to run these tests from within Cursor.
[/PROMPT]
We had similar problems this time with Claude wanting to test functionality that does not yet exist. Time to scold Claude!
[PROMPT]
I see you are creating tests for authentication functionality. Although we have a login page, it is only a placeholder and does not actually function. It does not even have fields to allow anyone to login. Please review the project code again and come up with the top 10 test scenarios for Playwright.
[/PROMPT]
This caused Claude to apologize and create a bunch of test scenarios all around the Search functionality on the homepage. I was hoping not to hand feed the tests I wanted, but instead wanted it to find the features by looking at the code. No luck though, so I went back with:
[PROMPT]
This looks better but is focused too much on search. We need tests for the following features: search, create topic, edit topic, delete topics, create article, edit article, delete article, view article and some responsive tests to make sure everything looks OK on different screen resolutions.
[/PROMPT]
We are starting to spiral already... It created the basic framework for some more tests but it is making too many assumptions about data that will exist. Make no assumptions. This is why it is best to have a test database that is pre-seeded but we will try to get there eventually. For now...
[PROMPT]
Some of these tests make assumptions that certain links will be valid and certain content will exist. This may not be the case if articles or topics are deleted. Please make each test self contained by creating the data it needs and then deleting it when the test completes.
[/PROMPT]
So now Claude is right back to creating fixtures anyway! I guess there is no way around it if you want to do it right...
Another problem is using test strings like "Test Subject" and "Test Article" which may clash if we run parallel tests which is a common approach to speed up execution. I wanted the tests to use randomized data so I prompted with this:
[PROMPT]
Please use random data for the topic name, article name and article content. It would also be a good idea to add a test for foreign language characters to make sure they appear OK during browsing and searching.
[/PROMPT]
I have a feeling we are going to need a lot more tweaks but this is looking like a very reasonable start. Claude returned the following:
These changes were made:
Use Faker to generate random data in multiple languages
Add specific tests for multilingual content
Verify proper display of various character sets
Test search functionality with non-Latin characters
Verify no character encoding issues in display
Test content creation in multiple languages
The tests now verify that your application correctly handles:
Japanese (日本語)
Chinese (中文)
Korean (한국어)
Russian (Русский)
And regular English content
Each test creates its own random data and cleans up after itself, making the tests more reliable and independent.
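To illustrate the random-data idea without depending on Faker, here is a minimal standard-library sketch. The `unique_title` helper and the sample strings are hypothetical and are not the code Claude generated. The point is only that a short random suffix keeps parallel tests from colliding on names like "Test Article".

```python
import uuid

# Sample fragments in several scripts; hypothetical test content.
SAMPLE_TEXT = {
    "en": "Knowledge base",
    "ja": "日本語",
    "zh": "中文",
    "ko": "한국어",
    "ru": "Русский",
}


def unique_title(prefix: str = "Topic", lang: str = "en") -> str:
    """Return a title that should not collide with one from a parallel test."""
    suffix = uuid.uuid4().hex[:8]  # 8 random hex characters
    return f"{prefix} {SAMPLE_TEXT[lang]} {suffix}"


if __name__ == "__main__":
    print(unique_title())
    print(unique_title("Article", lang="ja"))
```

A test can then search for the exact title it created, which also makes cleanup easier to target.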
These are the tests that were created:
Now to run them. First, I need to install the faker library which Claude instructed me to do...
I was given a bunch of different flags and options for running the tests. From previous experience, all of these instructions look right but I am going to use the pytest.ini file Claude created and pass in a few flags for verbose output, screenshots and a slowed down test execution so I can follow along visually...
And now to run pytest tests/e2e/ to actually run the tests!
All of the tests failed with this same error:
playwright._impl._errors.Error: Protocol error (Page.navigate): Cannot navigate to invalid URL
In pytest.ini, the base_url is set to http://localhost:5000 which looks right to me.
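Playwright reports "Cannot navigate to invalid URL" when `page.goto` receives something it cannot resolve to an absolute address, for example a relative path such as `/admin` when no base URL has reached the browser context. Whether that was the cause here is an assumption. One way to rule it out is to build the address explicitly and fail early. `absolute_url` below is a hypothetical helper that uses only the standard library.

```python
from urllib.parse import urljoin, urlparse


def absolute_url(base_url: str, path: str) -> str:
    """Join base_url and path, failing loudly if the result is not absolute."""
    url = urljoin(base_url or "", path)
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r} (base_url={base_url!r})")
    return url


if __name__ == "__main__":
    print(absolute_url("http://localhost:5000", "/admin"))
```

If the helper raises inside a test, the base URL never arrived, which points at configuration (for example, which pytest.ini was actually picked up) rather than at the test itself.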
I couldn't figure out why it was failing so I asked Claude. He made a few small tweaks and then suggested I move the pytest.ini file to the ROOT directory of the project and out of the tests/e2e directory where he originally placed it. I pushed back on this and he said it was fine where it is. OK then...
He also suggested I use absolute URLs in the tests instead of the URL set in the INI file and test fixtures. Wrong. I also pushed back on that and he agreed and fixed it. Losing faith here...
With these changes, I ran the tests again and now I see the browser open up and go to the base URL which is the homepage of the knowledge base but all tests are still failing. I went a few more rounds with Claude and it looks like the root page issues may be fixed but now we are failing with:
FAILED tests/e2e/test_admin.py::test_create_topic[chromium] - playwright._impl._errors.TimeoutError: Timeout 30000ms exceeded.
Which looks like more of a locator issue.
[PROMPT]
Now all the admin tests in test_admin.py are failing with this error: admin_page.fill("#topic-title", data["title"]). That is not the correct locator. Please fix.
[/PROMPT]
Claude fixed one test but the others are still failing with locator errors. I got more explicit:
[PROMPT]
Other than test_create_topic, the rest of the tests in @test_admin.py fail with locator errors. Please review all the locators in the tests and correct them.
[/PROMPT]
Claude made a few changes and now we have another locator issue...
[PROMPT]
Now all tests are failing with:
tests/e2e/test_admin.py::test_create_topic[chromium] - AssertionError: Locator expected to be visible
[/PROMPT]
It is obvious we will be here for a while, going round and round while Claude makes the necessary fixes. What is also obvious and unfortunate is that Claude does not have the proper context of the application and page elements without me having to be very explicit. Just asking Claude to "look at the code and create tests" is not good enough. You will need to be very familiar with feature functionality and page structure so you can nurse it along.
Claude was helpful in setting up the architecture of tests with pytest.ini and conftest.py but even so, it needed several tweaks for a basic test to run without an immediate failure.
This project uses Bootstrap on the front end along with Javascript confirmation and modal dialogs. Claude seems to have a very difficult time with these if there are multiple modals in the source code, where only one is visible, which is a common design pattern. After several iterations and guiding it with pseudo-code, we eventually came to an understanding and a passing test.
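As a rough illustration of the two patterns that caused trouble, the sketch below shows one way to handle a JavaScript confirmation and to target only the open modal. It assumes Bootstrap 4 or 5, which adds the `show` class to a modal while it is open, and it takes a Playwright `page` object as a parameter. The helper names and any selectors passed in are hypothetical, not the project's real ones.

```python
def confirm_and_click(page, selector: str) -> None:
    """Accept the next JavaScript confirm() dialog, then click `selector`."""
    # Register the handler before clicking. With no listener registered,
    # Playwright dismisses dialogs automatically.
    page.once("dialog", lambda dialog: dialog.accept())
    page.click(selector)


def open_modal(page):
    """Return a locator for the Bootstrap modal that is currently shown."""
    # Hidden modals in the same page source lack the `show` class,
    # so they are not matched by this selector.
    return page.locator(".modal.show")
```

Scoping later lookups to the returned locator, for example `open_modal(page).locator("button[type=submit]")`, avoids matching a button with the same markup inside a hidden modal.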
Throughout these tests, Claude made many assumptions about the names of locators, most of which did not exist anywhere in the source code. Only after explicitly telling it to look in certain files was it able to clean up but why make assumptions? Perhaps I did not feed it the correct files to begin with? I am still not sure how all of that is supposed to work.
Sometimes it seems to have access to the entire project while other times I need to be much more explicit. If it needs access to a file it should ask and not just create bullshit code. Remember, each back and forth costs you API credits.
Besides lots of test failures, I am also having issues with screenshots not being saved even though I am passing the flag in the pytest.ini file. Claude offered me a few permutations but none of them fixed the problem. I looked at the Playwright documentation and still no resolution. This seems like another one of those issues that "should be easy to fix" but you can easily burn a dozen Claude requests and get nowhere quickly. I will move on for now, using a workaround that Claude suggested.
After spending WAY TOO much time making tweaks with Claude, I now have only two tests that are passing and a third that starts out as PASS but ultimately ends up as FAIL due to some conflict between the test code and fixture code. It is becoming exhausting to debug and I feel like I am having circular conversations. I have gotten to a point where I am just blindly merging suggested code changes and hoping for a miracle.
Earlier in this chapter, my intention was to build a test harness that seeded and removed test data along with reusable fixtures and functions to keep the data and test code clean. Those intentions are long out the window and now I just want the tests to pass and I will try to clean them up later. Let me know if any of that sounds familiar...
Even something seemingly as simple as maximizing the browser window when launched is proving to be way too difficult. Claude has made attempts with Python and Javascript and it just doesn't work. He doesn't know why and the geniuses on Stack Overflow have answered that question several times and their suggestions do not work either. My guess is there is some code conflict among the files but I don't know the code well enough to figure it out right now. Where is AI when you need it!?
I found that some of the generated test code had more than one expect/assert. I believe it to be best practice to have only ONE assert per test but know that is not always a realistic case especially when the tests are new and you don't trust them yet.
I found myself being a āprompt engineerā trying to figure out how to nudge things in the right direction. Since I did not write any of the product code myself I was not familiar with its inner workings and was hoping Claude could pick it up and run with it. After all, he wrote it but he seems to have a short memory. He gives you enough to get going but certainly not enough to ask once and collect the money.
What would make life easier is to have APIs to call for CRUD functions that create and delete data, leaving the end to end tests to only test their intended functionality. The shortcomings and missing features of this test project may sound all too familiar if you have ever written automated tests for a software application. You often end up having to work around product shortcomings before you actually get around to testing product functionality. You may recall way back in the first chapter that I intended to build the product API-First and that quickly spiraled out of control. Absent of APIs, perhaps direct SQL calls would have made more sense for setup and teardown.
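A minimal sketch of the direct-SQL idea, using only Python's built-in `sqlite3` module. The table and column names come from the query in the traceback later in this chapter; the column values and the `temporary_topic` helper are assumptions for illustration.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def temporary_topic(conn: sqlite3.Connection, name: str):
    """Insert a topic row, yield its id, and delete it afterwards."""
    cur = conn.execute(
        "INSERT INTO topic (name, sort_order, description) VALUES (?, ?, ?)",
        (name, 0, "created by a test"),
    )
    conn.commit()
    topic_id = cur.lastrowid
    try:
        yield topic_id
    finally:
        # Remove dependent articles first so no orphan rows are left behind.
        conn.execute("DELETE FROM article WHERE topic_id = ?", (topic_id,))
        conn.execute("DELETE FROM topic WHERE id = ?", (topic_id,))
        conn.commit()
```

A browser test could wrap its steps in `with temporary_topic(conn, title) as topic_id:` and use the UI only for the behavior under test. The cleanup runs even when an assertion fails.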
If you have automated tests working in another project and are familiar with the code and page structure of a new project, copy/paste/tweak will likely get you further faster than the Claude Playwright approach outlined here. At least then you will have some familiarity with the code and can tweak it with some confidence. As of now (November, 2024), I find asking Claude to generate end to end tests with Playwright to be a painful exercise.
Claude20241022 Supplement
While writing this article, Anthropic released a new version of Claude 3.5 Sonnet, dated 10/22/2024, which also became available in Cursor. They claim there are some important advancements in coding so I figured I would re-do my testing to see if there were any improvements.
After clearing up some virtual environment and dependency issues caused by changes Claude made to the requirements.txt file, I was on my way. I wiped out my database, created a new branch and went with the same prompt we started with earlier in the chapter:
[PROMPT]
I would like to add end to end browser tests to this project using Playwright and Python. I would like the tests to use fixtures to seed the necessary data and then delete the data at the end of each test run. The tests need to clean up after themselves and not leave test data behind. If you can configure it to setup a test database while running the tests that is even better. Please let me know which libraries to install, create tests for the top 10 scenarios by looking at the main features of the product and let me know how to run these tests from within Cursor.
[/PROMPT]
This resulted in Claude generating a tests/e2e directory and all the expected files but when I attempted to run the tests, they all failed. If I try to go to the website, all pages returned a 500. I noticed conftest.py has code that is supposed to switch the application to use a test database. This is great but it looks like it never initialized. The 'sqlite:///test.db' does not exist even though I see it referenced in the code.
If I restart the server, this also throws exceptions. I noticed all the tables and all my data has been removed from the original topics.db file. All my data and schema has been wiped without warning! Thankfully, this was just test data.
After complaining to Claude, he made some changes and created a Python file to initialize the database which I ran.
Now when I run the tests, they all throw an exception about missing constraints.
When I hit the site manually, I get a database exception:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: topic
[SQL: SELECT topic.id AS topic_id, topic.name AS topic_name, topic.sort_order AS topic_sort_order, topic.description AS topic_description, article_1.id AS article_1_id, article_1.title AS article_1_title, article_1.content AS article_1_content, article_1.keywords AS article_1_keywords, article_1.topic_id AS article_1_topic_id, article_1.sort_order AS article_1_sort_order
FROM topic LEFT OUTER JOIN article AS article_1 ON topic.id = article_1.topic_id ORDER BY topic.sort_order]
The act of running the tests has wiped my database again without warning! Something ain't right around here...
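One plausible explanation, and it is only an assumption since the fixture code is not shown, is that a setup or teardown step dropped tables while the app was still pointed at the real topics.db. A cheap safeguard is to refuse any destructive step unless the database URI clearly belongs to a test database. Both helper names below are hypothetical.

```python
import os


def is_test_database(uri: str) -> bool:
    """True for in-memory SQLite or a SQLite file whose name starts with 'test'."""
    if uri in ("sqlite://", "sqlite:///:memory:"):
        return True
    prefix = "sqlite:///"
    if not uri.startswith(prefix):
        return False
    return os.path.basename(uri[len(prefix):]).startswith("test")


def require_test_database(uri: str) -> None:
    """Raise before any destructive call if uri is not a test database."""
    if not is_test_database(uri):
        raise RuntimeError(f"refusing to reset non-test database: {uri}")
```

Calling `require_test_database(...)` with the app's configured URI at the top of any fixture that creates or drops tables turns a silent wipe into an immediate, readable failure.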
[PROMPT]
I am still having some serious database issues: If I use the @init_db.py script, it restores my database tables and the app functions fine. However if I run the tests with pytest tests/e2e/ each test fails with: ERROR tests/e2e/test_app.py::test_reorder_topics[chromium] - sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: article.topic_id. Then when I try to use my main DB again, all the tables are removed again! It does not seem to properly setup the test database which also breaks my other database somehow!
[/PROMPT]
Claude made some changes. Next prompt:
[PROMPT]
We are still having issues. Now all tests fail with ERROR tests/e2e/test_app.py::test_create_article[chromium] - sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: topic andit is still siping out my other database!
[/PROMPT]
Claude made a bunch of changes and when I try to run the database initialization script again, it fails with:
(.venv) joe@Joes-MacBook-Pro cursor2-kb % python init_db.py
Traceback (most recent call last):
File "/Users/joe/cursor2-kb/init_db.py", line 1, in <module>
from app import app, db
File "/Users/joe/cursor2-kb/app.py", line 40, in <module>
app = create_app()
^^^^^^^^^^^^
File "/Users/joe/cursor2-kb/app.py", line 34, in create_app
db.init_app(app)
^^
NameError: name 'db' is not defined
I also can not start the server since it generates the same error.
TIME FOR A RESET!
Let's simplify our initial prompt, again re-using one from earlier in this chapter. I restored the database manually, created a new branch, fixed the virtual environment and went with:
[PROMPT]
I would like to add end to end browser tests to this project using Playwright and Python. Please let me know which libraries to install, create tests for the top 10 scenarios by looking at the main features of the product and let me know how to run these tests from within Cursor.
[/PROMPT]
The initial list of tests looks pretty good:
Two passed without any intervention, another positive sign:
There is an odd quirk though where passing the --headed flag does not allow me to view the tests as they run. Claude was able to find a conflict between the pytest.ini and one of the conftest.py fixtures:
Claude instructed me to change the code here if I wanted to not run headless. I pushed back:
[PROMPT]
This change is working but I do not want to pass a headless directive here. I want it to be handled in the @pytest.ini file in the addopts section.
[/PROMPT]
Claude then recommended I make changes in two files every time I want to run headless. Why?
[PROMPT]
This is not correct. I want to control the head/headless flag in the @pytest.ini file. Your solution will force me to change this setting in two places. Please fix.
[/PROMPT]
We then ended up with this snippet of code which seemed odd to me. Why do we care about the --help flag?
This resulted in a new error which Claude was able to fix when prompted:
TypeError: argument of type 'bool' is not iterable
tests/conftest.py:45: TypeError
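For what it is worth, Python raises this TypeError whenever the `in` operator is applied to a boolean. The actual line 45 of conftest.py is not shown, so the following is a guess at the kind of mistake involved: if an option such as --headed arrives as a plain True/False flag, it should be used directly rather than searched like a list. `wants_headless` is a hypothetical helper.

```python
def wants_headless(headed_option) -> bool:
    """Translate a --headed option value into a headless setting."""
    # A boolean flag is already True or False. A membership test such as
    # `"--headed" in headed_option` would raise TypeError on it.
    return not bool(headed_option)


if __name__ == "__main__":
    print(wants_headless(True))   # headed requested, so not headless
    print(wants_headless(False))  # no flag, so headless
```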
[PROMPT]
Now the tests are running but I am still not seeing a visible browser when they run.
[/PROMPT]
The new Sonnet version seems to be stumbling like the previous version. I am sure I can keep plugging away at it and eventually it will tighten up but I am not seeing any dramatic differences for our purposes here.
I am frustrated with this experiment and will end it here. Next chapter, letās see if Claude does any better with Selenium which has been around for much longer than Playwright. Perhaps there are more examples out there that Claude was able to consume. Playwright may be preferred these days but I find the Playwright / Claude pairing to be exhausting.
ABOUT THE AUTHOR
Joe Giglio is a 25+ year veteran of the software industry. In his career, he has worked for a local dial-up ISP, startups of all sizes and a well established, slow moving enterprise. There were important lessons learned along every stop.
His recent projects include:
Remote work training and consulting: chiefremoteofficer.com
Kindle book "Making Remote Work, Work For You"
Remote company rankings. Think "Glassdoor for remote companies": remotescorecard.com
Joe's guiding principles are: Remote first, quality driven, customer champion and lifelong learning. He was an introvert and practiced social distancing long before it was cool!
He currently resides in North Carolina and is always interested in new opportunities, meeting quality people and learning about exciting projects.
Contact Joe Giglio: joe -AT- joegiglio.org | Twitter | LinkedIn