Web scraping is the practice of programmatically collecting data from a website. This article discusses using Playwright for Python web scraping. The most popular web-scraping packages for Python are requests and Beautiful Soup, used together. This combination is powerful and straightforward to use for most web pages. However, it has limitations because it relies on making server requests and reading the static HTML returned. It can be challenging to scrape single-page applications (SPAs) or websites where the objects to scrape are only available after some JavaScript interactions. Playwright circumvents these limitations by interacting with web pages as a human would to find the data that needs scraping.
The problem
Fun fact about me: I swing dance competitively (although I'm still a novice). Competitive swing dancing events award points to competitors who place well in competitions. As a dancer gains points, they can move up through divisions ("newcomer", "novice", "intermediate", "advanced", or "allstar"). Those points are stored with the World Swing Dance Council (WSDC). The WSDC has a web page where you can look up any dancer by their name or dancer ID to see what division the dancer is in and how many points they have in that division.
I wanted to create a web page allowing users to search for dancers and return the same data as the WSDC dancer lookup page. The problem is that the lookup on that web page uses JavaScript to make AJAX requests that update the page with dancer point data asynchronously. This type of request is difficult to replicate and scrape with the Python "requests" and "Beautiful Soup" libraries. So I'm going to scrape the page with Playwright.
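To make the limitation concrete, here is a minimal sketch using only Python's standard-library HTML parser. The HTML string is a made-up stand-in for a page whose results container is filled in later by JavaScript; it is not the real WSDC markup, and the names STATIC_HTML, ResultsTextParser, and static_results_text are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical static HTML, as a plain HTTP request might return it.
# The results container is empty until JavaScript fills it in.
STATIC_HTML = """
<html><body>
  <input placeholder="Search by Name or WSDC #">
  <div id="lookup_results"></div>
  <script>/* results are filled in later by an AJAX call */</script>
</body></html>
"""

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ResultsTextParser(HTMLParser):
    """Collect text found inside the element with id="lookup_results"."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # greater than 0 while inside the target element
        self.text: list = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == "lookup_results":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.text.append(data.strip())


def static_results_text(html: str) -> list:
    """Return the text a static parser can see inside the results container."""
    parser = ResultsTextParser()
    parser.feed(html)
    return parser.text


print(static_results_text(STATIC_HTML))
```

A static parser only sees what the server sent, so the results container comes back empty here. A browser-driving tool can wait for the JavaScript to run and then read the populated element.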
What is Playwright?
Playwright bills itself as a framework for "... end-to-end testing modern web apps". It is a tool like Selenium that allows the user to write Python code (or Node.js, Java, or .NET code) to open a web browser and interact with a web application as a human would. Playwright can programmatically perform any action a human user can perform, such as typing into an input box and clicking a submit button. I recommend checking out my other blog post on Playwright, "End-to-end website testing with Playwright", which goes into depth on how Playwright works and how you can use it for testing your website.
Setup
You can find all the code for this web scraping example on GitHub here. I wrote the code for Python 3.10, but it should work in Python versions 3.9+ (or lower versions of Python if type hints are removed). You can also see a finished code version toward the end of the article, here. To get started with the examples, first create a virtual environment. Then pip install the packages we'll be using into that virtual environment:
pip install flask playwright pytest
Playwright requires an additional installation step to install the browsers it uses for interacting with websites:
playwright install
And we're all set.
Generating the initial script with "codegen"
Playwright has a nifty tool called "codegen" to help you start writing a Playwright script. Here's the command I typed to get an outline of my script:
playwright codegen https://www.worldsdc.com/registry-points/
This command opened two windows: a web browser to the page https://www.worldsdc.com/registry-points/ that I want to web-scrape and a second window with some buttons and Python code. I then interacted with the website exactly as I would to get a swing dancer's data. I clicked the search box, typed in my name, and clicked the dropdown link for myself, filling the page with my competitive swing dance data. Here's a GIF of how those steps looked:
The Playwright codegen tool wrote an initial script.
As I performed these actions, the second window filled out a Playwright script to replicate my steps. Here's the script it generated:
from playwright.sync_api import Playwright, sync_playwright, expect


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.worldsdc.com/registry-points/")
    page.frame_locator("iframe[name=\"myiFrame\"]").get_by_placeholder("Search by Name or WSDC #").click()
    page.frame_locator("iframe[name=\"myiFrame\"]").get_by_placeholder("Search by Name or WSDC #").fill("theodore williams")
    page.frame_locator("iframe[name=\"myiFrame\"]").get_by_text("Theodore Williams (11612)").click()

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
This auto-generated script was a great first step. I saved the script to a file named check_dancer_points.py. When I ran the script, it did what I wanted it to do, but so quickly that it was hard to see what was happening. So I adjusted the "browser" line to make the steps easier to follow: browser = playwright.chromium.launch(headless=False, slow_mo=1000). Now when I run the script, there is a 1-second (1000 millisecond) pause between each action, and I can see that the script is performing the expected actions. Here's what it looks like when I run the script:
Running the script generated by the Playwright codegen tool.
Updating the script for my needs
The Playwright "codegen" generated script was a great starting point for web-scraping swing dancer data. However, it has two problems. First, the input is specific to my name. I want the script to accept any name (or dancer ID) as input. Second, the script doesn't retrieve the dancer's data from the page after performing the actions. It just exits immediately. Let's address these two problems one at a time.
Accepting dynamic user input
I want the script to accept a name or dancer ID as input, so I'll adjust the script like so:
# TimeoutError here is Playwright's own exception, not Python's built-in one.
from playwright.sync_api import Playwright, Page, Browser, BrowserContext, TimeoutError, sync_playwright


def check_points_inner(
    page: Page, name_or_id: str
) -> None:
    """Check dancer points with the World Swing Dance Council site."""
    i_frame = page.frame_locator('iframe[name="myiFrame"]')
    search_bar = i_frame.get_by_placeholder("Search by Name or WSDC #")
    search_bar.click()
    search_bar.fill(name_or_id)
    search_results = i_frame.locator(".tt-selectable")
    try:
        search_results.first.click(timeout=2000)
    except TimeoutError:
        return


def setup_playwright(playwright: Playwright) -> tuple[Browser, BrowserContext, Page]:
    """Set up playwright."""
    browser = playwright.chromium.launch(headless=False, slow_mo=1000)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.worldsdc.com/registry-points/")
    return browser, context, page


def teardown_playwright(browser: Browser, context: BrowserContext) -> None:
    """Tear down playwright."""
    context.close()
    browser.close()


def check_points(name_or_id: str) -> None:
    """Check dancer points with the World Swing Dance Council site."""
    with sync_playwright() as playwright:
        browser, context, page = setup_playwright(playwright)
        try:
            check_points_inner(page, name_or_id)
        finally:
            teardown_playwright(browser, context)


if __name__ == "__main__":
    name_or_id = input("Name or ID: ")
    check_points(name_or_id)
This script runs and works as intended. It performs identically to the auto-generated script when my name, "Theodore Williams", is used for the input. So what changes did I make? I'll explain the changes from the bottom to the top since the changes make more sense in this direction.
if __name__ == "__main__":
    name_or_id = input("Name or ID: ")
    check_points(name_or_id)
The line if __name__ == "__main__" is a typical if block for Python scripts. It says, "Only run this bit of code if the Python file is called as a script" (perhaps with python FILE_NAME on the command line). The line name_or_id = input("Name or ID: ") prompts the user to input a "Name or ID", and we assign that input to the variable name_or_id. Then we pass the name_or_id variable to a new function, check_points. Let's look at the function check_points.
def check_points(name_or_id: str) -> None:
    """Check dancer points with the World Swing Dance Council site."""
    with sync_playwright() as playwright:
        browser, context, page = setup_playwright(playwright)
        try:
            check_points_inner(page, name_or_id)
        finally:
            teardown_playwright(browser, context)
The line with sync_playwright() as playwright: starts a synchronous Playwright context block, where we perform all our Playwright actions, just like in the auto-generated script. Inside the context block, we call three functions: setup_playwright, which establishes our Playwright environment; check_points_inner, which performs the actions on the page; and teardown_playwright, which closes the resources generated by setup_playwright. To ensure we always close resources, even if an error occurs, we call check_points_inner in a try block and teardown_playwright in a finally block. Therefore, no matter what happens in the try block, the finally block will run. Let's look at each of those functions individually.
def setup_playwright(playwright: Playwright) -> tuple[Browser, BrowserContext, Page]:
    """Set up playwright."""
    browser = playwright.chromium.launch(headless=False, slow_mo=1000)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.worldsdc.com/registry-points/")
    return browser, context, page
The setup_playwright function copies the start of the run function we automatically generated earlier. First, it launches a new browser, browser context, and page, and then it goes to the page we are interested in for web scraping. Finally, it returns all the resources we generated so we can use them in our other functions.
def teardown_playwright(browser: Browser, context: BrowserContext) -> None:
    """Tear down playwright."""
    context.close()
    browser.close()
The teardown_playwright function copies the end of the run function we automatically generated earlier, closing the browser and browser context we opened in setup_playwright.
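As an aside, the setup and teardown pair could also be folded into a single context manager, so the cleanup cannot be forgotten by a caller. The sketch below is one possible arrangement and not the approach this article uses; open_page is a hypothetical name, and it relies only on the Playwright calls already shown in this article (chromium.launch, new_context, new_page, goto, close).

```python
from contextlib import contextmanager


@contextmanager
def open_page(playwright, url, **launch_kwargs):
    """Yield a page opened at `url`; always close the context and browser.

    `playwright` is expected to be the object produced by
    `with sync_playwright() as playwright:`.
    """
    browser = playwright.chromium.launch(**launch_kwargs)
    try:
        context = browser.new_context()
        try:
            page = context.new_page()
            page.goto(url)
            yield page
        finally:
            # Runs even if the body of the `with` block raises.
            context.close()
    finally:
        browser.close()


# Possible usage (assumes Playwright is installed):
#
# with sync_playwright() as playwright:
#     url = "https://www.worldsdc.com/registry-points/"
#     with open_page(playwright, url, headless=False, slow_mo=1000) as page:
#         check_points_inner(page, "theodore williams")
```

The nested try/finally blocks mirror the order used in teardown_playwright: the browser context is closed first, then the browser.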
def check_points_inner(
    page: Page, name_or_id: str
) -> None:
    """Check dancer points with the World Swing Dance Council site."""
    i_frame = page.frame_locator('iframe[name="myiFrame"]')
    search_bar = i_frame.get_by_placeholder("Search by Name or WSDC #")
    search_bar.click()
    search_bar.fill(name_or_id)
    search_results = i_frame.locator(".tt-selectable")
    try:
        search_results.first.click(timeout=2000)
    except TimeoutError:
        return
The check_points_inner function is the meat of our web-scraping work. All the work is performed on items located on the page object we generated in the setup_playwright function. This function is an update of the body of our auto-generated run function from earlier, which looked like this:
page.frame_locator("iframe[name=\"myiFrame\"]").get_by_placeholder("Search by Name or WSDC #").click()
page.frame_locator("iframe[name=\"myiFrame\"]").get_by_placeholder("Search by Name or WSDC #").fill("theodore williams")
page.frame_locator("iframe[name=\"myiFrame\"]").get_by_text("Theodore Williams (11612)").click()
First, I noticed that the code page.frame_locator("iframe[name=\"myiFrame\"]") is used three times in the original script. So I pulled it into its own variable for reusability and readability: i_frame = page.frame_locator('iframe[name="myiFrame"]'). Next, the code .get_by_placeholder("Search by Name or WSDC #") is used twice, so I pulled this into its own variable as well: search_bar = i_frame.get_by_placeholder("Search by Name or WSDC #"). Next, we mouse-click the search bar and fill it in with the name_or_id we received as user input.
Filling in the search bar causes a dropdown menu to appear on the website. Previously, we clicked on the item from the dropdown with .get_by_text("Theodore Williams (11612)").click(). However, since the user input won't always be "Theodore Williams", this selector won't work. So how can we find that first search result without knowing its text contents? With CSS selectors.
If I go to the page I want to scrape in my browser, manually fill in the search bar, and hover over the dropdown result, I can use the browser developer tools to find the CSS selector for the element I want. I do this in Chrome by right-clicking the dropdown result and clicking "Inspect". Chrome pulls up its developer tools and highlights the inspected element in the page's HTML. The HTML element highlighted by the inspect tool looks like this:
<div class="tt-suggestion tt-selectable">Theodore Williams (11612)</div>
Therefore, to get the search result element with Playwright, we can chain a locator from the outer iframe using the CSS class tt-selectable. Remember, CSS selectors for element classes have a leading period ("."), so the result is search_results = i_frame.locator(".tt-selectable"). Finally, we click the search result with .click(). But what about the other changes?
try:
    search_results.first.click(timeout=2000)
except TimeoutError:
    return
The above code handles the problem we get if we don't find precisely one dancer. If the search finds more than one dancer, the locator i_frame.locator(".tt-selectable") would correspond to a list of search results, and .first selects the first result in that list. But what if the search turns up zero dancers? In this case, the dropdown won't appear, meaning there will be zero elements with the tt-selectable class.
Playwright is clever and understands that JavaScript is not instant, so the element we want to click on might only be available after some time. In this case, Playwright will keep trying to click the element until the click succeeds. If the click never succeeds, the click method will eventually throw a TimeoutError. Note that this is Playwright's own TimeoutError, imported from playwright.sync_api, not Python's built-in exception of the same name. By default, the click method will throw a TimeoutError after 30 seconds (or 30,000 milliseconds). However, since we are confident the dropdown menu will find a dancer in the first two seconds after typing a name (probably in less than one second), we can lower this timeout to 2 seconds (or 2,000 milliseconds). Then, we can catch the error with a try/except block if a timeout occurs. For now, we'll return nothing, failing silently. In the next step, we'll update the return value so the error isn't silent.
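The retry-then-give-up behavior described above can be illustrated with a small standard-library sketch. This is only a conceptual model and not how Playwright is implemented internally; the names retry_until and ActionTimeout are made up for this example.

```python
import time


class ActionTimeout(Exception):
    """Raised when an action did not succeed before the deadline (illustrative name)."""


def retry_until(action, timeout_ms: int = 30_000, interval_ms: int = 50) -> None:
    """Call `action` until it returns True, or raise ActionTimeout.

    `action` is any zero-argument callable that returns True once it succeeded
    and False while the thing it needs is not ready yet.
    """
    deadline = time.monotonic() + timeout_ms / 1000
    while not action():
        if time.monotonic() >= deadline:
            raise ActionTimeout(f"action did not succeed within {timeout_ms} ms")
        time.sleep(interval_ms / 1000)


# Example: an action that only succeeds on its third attempt.
attempts = []


def flaky_click() -> bool:
    attempts.append("try")
    return len(attempts) >= 3


retry_until(flaky_click, timeout_ms=2000, interval_ms=10)
print(f"succeeded after {len(attempts)} attempts")
```

A shorter timeout, as in the article's 2,000 millisecond choice, simply moves the deadline closer, so a search with no results fails fast instead of waiting the full default.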
Web-scraping dancer data
So far, we've had Playwright perform actions that bring a dancer's data into view. Next, we'll scrape that data from the page and return the data as JSON. Here's how I adjusted the script to scrape dancer data:
from typing import Union
import json

# TimeoutError here is Playwright's own exception, not Python's built-in one.
from playwright.sync_api import Playwright, Page, Browser, BrowserContext, TimeoutError, sync_playwright


def check_points_inner(
    page: Page, name_or_id: str
) -> dict[str, Union[int, str, None]]:
    """Check dancer points with the World Swing Dance Council site."""
    # Search for dancer
    i_frame = page.frame_locator('iframe[name="myiFrame"]')
    search_bar = i_frame.get_by_placeholder("Search by Name or WSDC #")
    search_bar.click()
    search_bar.fill(name_or_id)
    search_results = i_frame.locator(".tt-selectable")
    try:
        search_results.first.click(timeout=2000)
    except TimeoutError:
        return {"error": "No results found."}

    # Scrape results
    results = i_frame.locator("#lookup_results")
    name_and_id = results.locator("h1").inner_text()
    name, dancer_id = name_and_id.split(" (")
    dancer_id = dancer_id.strip(")")
    lower_level = results.locator(".lead").first.locator(".label-success").inner_text()
    upper_level_loc = results.locator(".lead").first.locator(".label-warning")
    upper_level = upper_level_loc.inner_text() if upper_level_loc.is_visible() else None
    div_and_points = results.locator("h3").first.inner_text()
    highest_pointed_division, points_in_division, _ = div_and_points.split(" ")
    return {
        "name": name,
        "id": int(dancer_id),
        "lower_level": lower_level,
        "upper_level": upper_level,
        "highest_pointed_division": highest_pointed_division,
        "points_in_division": int(points_in_division),
    }


def setup_playwright(playwright: Playwright) -> tuple[Browser, BrowserContext, Page]:
    """Set up playwright."""
    browser = playwright.chromium.launch(headless=False, slow_mo=1000)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.worldsdc.com/registry-points/")
    return browser, context, page


def teardown_playwright(browser: Browser, context: BrowserContext) -> None:
    """Tear down playwright."""
    context.close()
    browser.close()


def check_points(name_or_id: str) -> dict[str, Union[int, str, None]]:
    """Check dancer points with the World Swing Dance Council site."""
    with sync_playwright() as playwright:
        browser, context, page = setup_playwright(playwright)
        try:
            return check_points_inner(page, name_or_id)
        finally:
            teardown_playwright(browser, context)


if __name__ == "__main__":
    name_or_id = input("Name or ID: ")
    results = check_points(name_or_id)
    print(json.dumps(results, indent=4))
What changed? First, check_points now returns the result of check_points_inner, and check_points_inner returns a dictionary. We'll talk about the dictionary's contents momentarily. Then, in our if __name__ == "__main__": block for running the script, we put the dictionary returned by check_points into a variable named results, and we print out that dictionary as a JSON string.
Now let's look at how we generate the dictionary in check_points_inner:
def check_points_inner(
    page: Page, name_or_id: str
) -> dict[str, Union[int, str, None]]:
    """Check dancer points with the World Swing Dance Council site."""
    # Search for dancer
    i_frame = page.frame_locator('iframe[name="myiFrame"]')
    search_bar = i_frame.get_by_placeholder("Search by Name or WSDC #")
    search_bar.click()
    search_bar.fill(name_or_id)
    search_results = i_frame.locator(".tt-selectable")
    try:
        search_results.first.click(timeout=2000)
    except TimeoutError:
        return {"error": "No results found."}

    # Scrape results
    results = i_frame.locator("#lookup_results")
    name_and_id = results.locator("h1").inner_text()
    name, dancer_id = name_and_id.split(" (")
    dancer_id = dancer_id.strip(")")
    lower_level = results.locator(".lead").first.locator(".label-success").inner_text()
    upper_level_loc = results.locator(".lead").first.locator(".label-warning")
    upper_level = upper_level_loc.inner_text() if upper_level_loc.is_visible() else None
    div_and_points = results.locator("h3").first.inner_text()
    highest_pointed_division, points_in_division, _ = div_and_points.split(" ")
    return {
        "name": name,
        "id": int(dancer_id),
        "lower_level": lower_level,
        "upper_level": upper_level,
        "highest_pointed_division": highest_pointed_division,
        "points_in_division": int(points_in_division),
    }
This function is identical to the function from the last section until our except clause. Now, since we always want to return a dictionary, we return a dictionary explaining "No results found" if the dancer search returns no dancers to click before the two-second timeout.
Then, we scrape the results from the dancer search using chained Playwright locators to select the data and return those values in a dictionary. The locators find the data with CSS selectors that I looked up using the browser's "inspect" developer tool, just like before. Here's a GIF showing how I found the CSS selectors for the locators with Chrome's inspect tool.
Scrape page data using Playwright locators and CSS selectors.
Side note: notice the "upper_level" and "lower_level" dictionary keys. Swing dancers from one division (for instance, Novice) gain points until they achieve a threshold that allows them to compete in a higher division (for example, Intermediate). However, until they earn points in the higher-level division, they can compete in both higher- and lower-level divisions. The website will list both divisions (or levels) when this happens. When the website lists both divisions, we scrape and return both values. We put the lower division in the "lower_level" field and the higher division in the "upper_level" field.
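The string handling for the heading text can be tried out without a browser. The sketch below is a hypothetical alternative to the split(" (") and strip(")") steps, assuming the heading looks like "Theodore Williams (11612)" as in the dropdown text shown earlier; parse_name_and_id and HEADING_RE are made-up names, and the regular expression is my own assumption about the format.

```python
from __future__ import annotations

import re
from typing import Optional

# Assumed heading format: "<name> (<numeric dancer id>)".
HEADING_RE = re.compile(r"^(?P<name>.+?)\s*\((?P<id>\d+)\)\s*$")


def parse_name_and_id(heading: str) -> Optional[tuple[str, int]]:
    """Split a heading such as "Theodore Williams (11612)" into (name, id).

    Returns None when the text does not match the assumed format, instead of
    raising an unpacking error the way a bare split would.
    """
    match = HEADING_RE.match(heading.strip())
    if match is None:
        return None
    return match["name"], int(match["id"])


print(parse_name_and_id("Theodore Williams (11612)"))
```

Anchoring the numeric ID at the end of the string means a name that itself contains parentheses would still parse, whereas splitting on " (" would produce more than two pieces.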
The final product
So far, I have introduced a script for scraping dancer data from the World Swing Dance Council website. Next, we will create a small Flask application with an endpoint that uses our web-scraping functions to return dancer data over the web. Here's the final code, including the Flask application endpoint.
"""check_dancer_points: Check dancer points with the World Swing Dance Council site."""
from typing import Union
from playwright.sync_api import sync_playwright, TimeoutError, Page, Browser, BrowserContext, Playwright
import json
from flask import Flask, request

TIMEOUT = 2000  # milliseconds to wait for a search result to appear

app = Flask(__name__)


@app.route("/")
def points_route() -> dict:
    """Return the points for a dancer as JSON (flask converts dictionary to JSON)."""
    name_or_id = request.args.get("name_or_id")
    return (
        check_points(name_or_id) if name_or_id else {"error": "No name or ID provided."}
    )


def check_points_inner(
    page: Page, name_or_id: str
) -> dict[str, Union[int, str, None]]:
    """Check dancer points with the World Swing Dance Council site."""
    # Search for dancer
    i_frame = page.frame_locator('iframe[name="myiFrame"]')
    search_bar = i_frame.get_by_placeholder("Search by Name or WSDC #")
    search_bar.click()
    search_bar.fill(name_or_id)
    search_results = i_frame.locator(".tt-selectable")
    try:
        search_results.first.click(timeout=TIMEOUT)
    except TimeoutError:
        return {"error": "No results found."}

    # Scrape results
    results = i_frame.locator("#lookup_results")
    name_and_id = results.locator("h1").inner_text()
    name, dancer_id = name_and_id.split(" (")
    dancer_id = dancer_id.strip(")")
    lower_level = results.locator(".lead").first.locator(".label-success").inner_text()
    upper_level_loc = results.locator(".lead").first.locator(".label-warning")
    upper_level = upper_level_loc.inner_text() if upper_level_loc.is_visible() else None
    div_and_points = results.locator("h3").first.inner_text()
    highest_pointed_division, points_in_division, _ = div_and_points.split(" ")
    return {
        "name": name,
        "id": int(dancer_id),
        "lower_level": lower_level,
        "upper_level": upper_level,
        "highest_pointed_division": highest_pointed_division,
        "points_in_division": int(points_in_division),
    }


def setup_playwright(playwright: Playwright) -> tuple[Browser, BrowserContext, Page]:
    """Set up playwright."""
    browser = playwright.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.worldsdc.com/registry-points/")
    return browser, context, page


def teardown_playwright(browser: Browser, context: BrowserContext) -> None:
    """Tear down playwright."""
    context.close()
    browser.close()


def check_points(name_or_id: str) -> dict[str, Union[int, str, None]]:
    """Check dancer points with the World Swing Dance Council site."""
    with sync_playwright() as playwright:
        browser, context, page = setup_playwright(playwright)
        try:
            return check_points_inner(page, name_or_id)
        finally:
            teardown_playwright(browser, context)


if __name__ == "__main__":  # pragma: no cover
    name_or_id = input("Name or ID: ")
    results = check_points(name_or_id)
    print(json.dumps(results, indent=4))
What changed? Here's the most noticeable change.
app = Flask(__name__)


@app.route("/")
def points_route() -> dict:
    """Return the points for a dancer as JSON (flask converts dictionary to JSON)."""
    name_or_id = request.args.get("name_or_id")
    return (
        check_points(name_or_id) if name_or_id else {"error": "No name or ID provided."}
    )
This code snippet creates a simple Flask application and a single endpoint at /. The endpoint expects a query string with the parameter "name_or_id". The function passes the provided "name_or_id" value into our check_points function, which returns a dictionary with the dancer's data. If no "name_or_id" query string parameter is provided, the endpoint returns an error dictionary instead. Finally, Flask converts the dictionary to JSON and returns that JSON in the response body.
The other notable change is that we replaced the setup_playwright function line browser = playwright.chromium.launch(headless=False, slow_mo=1000) with browser = playwright.chromium.launch(headless=True). We launch the browser in headless mode without "slow_mo" because we don't need to watch a browser window open and perform the web scraping. This change makes the scraping much faster.
To launch the Flask application, run the following:
flask --app check_dancer_points run --reload
Then open a web browser to http://localhost:5000. When we do, we'll see the error {"error": "No name or ID provided."}. We must provide the required "name_or_id" query string parameter. Let's try opening the page with this query string: http://localhost:5000/?name_or_id=theodore%20williams (note: "%20" is the URL encoding for a space). With the query string provided, the resulting output body looks like this (after indenting):
{
    "highest_pointed_division": "Novice",
    "id": 11612,
    "lower_level": "NOV",
    "name": "Theodore Williams",
    "points_in_division": 8,
    "upper_level": null
}
Success! We have a working Flask application capable of scraping dancer data from the World Swing Dance Council website using Playwright.
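If you would rather build the lookup URL in code than type the "%20" by hand, the standard library can do the encoding. This is a small sketch; lookup_url is a hypothetical helper name, and the base URL assumes the Flask development server address used above.

```python
from urllib.parse import quote, urlencode


def lookup_url(name_or_id: str, base: str = "http://localhost:5000/") -> str:
    """Build the endpoint URL with a percent-encoded name_or_id parameter."""
    # quote_via=quote encodes a space as "%20" (the default would use "+").
    query = urlencode({"name_or_id": name_or_id}, quote_via=quote)
    return f"{base}?{query}"


print(lookup_url("theodore williams"))
# http://localhost:5000/?name_or_id=theodore%20williams
```

Both "%20" and "+" are decoded as a space when Flask parses the query string, so either form should reach the endpoint as the same name.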
Testing the application
No code is complete without automated tests to ensure everything is working as expected. Here's a test file to test our Flask endpoint under a few conditions.
"""Tests for check_dancer_points.py."""
import pytest
from pytest import MonkeyPatch
from flask.testing import FlaskClient
from check_dancer_points import app
import check_dancer_points


@pytest.fixture()
def client() -> FlaskClient:
    """Create a fresh flask client for a function."""
    return app.test_client()


def test_no_name_or_id(client: FlaskClient) -> None:
    """Test that no name or ID provided returns an error."""
    response = client.get("/")
    assert response.json == {"error": "No name or ID provided."}


def test_no_results(client: FlaskClient, monkeypatch: MonkeyPatch) -> None:
    """Test that no results found returns an error."""
    monkeypatch.setattr(check_dancer_points, "TIMEOUT", 1)
    response = client.get("/?name_or_id=not%20a%20dancer")
    assert response.json == {"error": "No results found."}


def test_name_and_id(client: FlaskClient) -> None:
    """Test that name and ID are returned."""
    response = client.get("/?name_or_id=Theodore%20Williams")  # That's me!
    assert response.json and "error" not in response.json
    assert response.json["name"] == "Theodore Williams"
    assert response.json["id"] == 11612
If you're not familiar with testing Python code with pytest, check out my blog post "9 pytest tips and tricks to take your tests to the next level". The above tests make use of a single fixture called client. The fixture uses an import of our Flask app to create and return a Flask test client. The three test functions use the Flask test client by accepting client as a parameter.
All three tests call the / endpoint. The first test calls the endpoint without the required query string and asserts that the appropriate error is returned. The second test provides a query string for a non-existent dancer and asserts the relevant error is returned. It also monkey patches TIMEOUT from 2000 milliseconds down to 1 millisecond to speed up the test. You can read about monkey patching global variables in my pytest blog post. The final test checks the endpoint with a working dancer name (my name) and asserts that the returned name and dancer ID are as expected.
Conclusions
In this blog post, we learned how to scrape a website's data with Playwright and how to test a Flask application that makes use of Playwright web scraping. Playwright is a valuable tool for scraping websites that rely on JavaScript and require human-like page interactions to expose the desired data. I hope you found this tutorial useful!