How to Scrape With Headless Firefox
In this guide, we'll explain how to install and use headless Firefox with Selenium, Playwright, and Puppeteer. Additionally, we'll go over a practical example of automating each of these libraries for common tasks when scraping web pages.
Headless Firefox With Selenium
Let's start our guide by exploring Selenium headless Firefox. First, we'll have to install Selenium using the following pip command:
pip install selenium
The above command installs Selenium 4, which can download the WebDriver binaries automatically for either Chrome or Firefox:
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
# selenium firefox browser options
options = FirefoxOptions()
options.add_argument("-headless")
# initiate the browser and download the WebDriver
with webdriver.Firefox(options=options) as driver:
    # go to the target web page
    driver.get("https://httpbin.dev/user-agent")
    print(driver.page_source)
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
In the above code, we start by defining the basic browser configuration using the FirefoxOptions class. Then, we use the webdriver.Firefox constructor to create a Selenium Firefox instance, which also downloads the Firefox WebDriver binaries automatically. Finally, we request the target web page and print its HTML content.
The above code uses the -headless argument to run Selenium Firefox headless (without a graphical user interface). To run it in headful mode, we can simply remove the argument and optionally set the browser window size:
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
# selenium firefox browser options
options = FirefoxOptions()
# browser viewport size
options.add_argument("--width=1920")
options.add_argument("--height=1080")
# initiate the browser and download the WebDriver
with webdriver.Firefox(options=options) as driver:
    ...  # the rest of the automation code goes here
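Since the only difference between the two modes is the list of arguments, the toggle can be wrapped in a small helper. The following is a minimal sketch and not part of Selenium: the firefox_arguments function and its parameters are hypothetical names, and it only builds the same command-line flags used in the examples above.

```python
from typing import List, Optional


def firefox_arguments(
    headless: bool = True,
    width: Optional[int] = None,
    height: Optional[int] = None,
) -> List[str]:
    """Build the Firefox command-line arguments used in this guide.

    Hypothetical helper: it returns plain strings and does not import Selenium.
    """
    arguments = []
    if headless:
        # same flag the headless example passes to FirefoxOptions.add_argument
        arguments.append("-headless")
    if width is not None:
        arguments.append(f"--width={width}")
    if height is not None:
        arguments.append(f"--height={height}")
    return arguments


# usage with Selenium (assumes selenium is installed):
# options = FirefoxOptions()
# for argument in firefox_arguments(headless=False, width=1920, height=1080):
#     options.add_argument(argument)
```

Keeping the flags in one place makes it easy to switch to headful mode while debugging and back to headless mode for production runs.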
Now that we can spin up a headless Firefox browser, we can automate it with the regular Selenium API.
Basic Selenium Firefox Navigation
In this guide, we'll create a headless Firefox scraping script to automate the login process on web-scraping.dev/login. We'll request the target page URL, accept the cookie policy, fill in the login credentials, and click the login button:
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# run Firefox in the headless mode
options = FirefoxOptions()
options.add_argument("-headless")

with webdriver.Firefox(options=options) as driver:
    # go to the target web page
    driver.get("https://web-scraping.dev/login?cookies=")
    # define a timeout
    wait = WebDriverWait(driver, timeout=5)
    # accept the cookie policy
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#cookie-ok")))
    driver.find_element(By.CSS_SELECTOR, "button#cookie-ok").click()
    # wait for the login form
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']")))
    # fill in the login credentials
    username_input = driver.find_element(By.CSS_SELECTOR, "input[name='username']")
    username_input.clear()
    username_input.send_keys("user123")
    password_input = driver.find_element(By.CSS_SELECTOR, "input[name='password']")
    password_input.clear()
    password_input.send_keys("password")
    # click the login submit button
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    # wait for an element on the login redirected page
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#secret-message")))
    secret_message = driver.find_element(By.CSS_SELECTOR, "div#secret-message").text
    print(f"The secret message is: {secret_message}")
"The secret message is: 🤫"
Here, we define a timeout and use Selenium's expected conditions to wait for specific elements to appear. Then, we use the find_element method to locate the elements before clicking them or typing into them.
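Conceptually, an explicit wait polls a condition until it returns a truthy value or the timeout expires. The sketch below illustrates that idea in plain Python. It is a simplified stand-in and not Selenium's actual WebDriverWait implementation: the wait_until name, its parameters, and the 0.1-second polling interval are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 5.0, poll: float = 0.1) -> T:
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # hand the truthy result back to the caller
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # pause briefly before checking again
        time.sleep(poll)
```

With Selenium, the condition could be something like a lambda that calls driver.find_elements(By.CSS_SELECTOR, "div#secret-message"), which returns an empty (falsy) list until the element exists.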
For further details on using Selenium for web scraping, refer to our dedicated guide.
Headless Firefox With Playwright
Let's explore headless Firefox scraping with Playwright, a popular web browser automation tool with straightforward APIs.
We'll cover using Playwright headless Firefox in both Python and Node.js APIs. First, install Playwright using the following command:
Python:
pip install playwright
Node.js
npm install playwright
Next, install the Firefox browser binaries using the following command:
Python
playwright install firefox
Node.js
npx playwright install firefox
To start headless Firefox with Playwright, we have to explicitly select the browser type:
Python
from playwright.sync_api import sync_playwright
with sync_playwright() as playwright:
    # launch playwright firefox browser
    browser = playwright.firefox.launch(headless=True)
    # new browser session with the default settings
    context = browser.new_context()
    # new browser tab
    page = context.new_page()
    # request the target page url
    page.goto("https://httpbin.dev/user-agent")
    # get the page HTML
    print(page.content())
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0"
Node.js
const { firefox } = require('playwright');
(async () => {
  // launch playwright firefox browser
  const browser = await firefox.launch({ headless: true });
  // new browser session with the default settings
  const context = await browser.newContext();
  // new browser tab
  const page = await context.newPage();
  // request the target page url
  await page.goto('https://httpbin.dev/user-agent');
  // get the page HTML
  const content = await page.content();
  console.log(content);
  // "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
  // close the browser
  await browser.close();
})();
Here, we start a Playwright headless Firefox and create a new browser context with the default settings, including headers and localization. Then, we open a Playwright page and request the target page URL.
The above code runs the browser instance in headless mode. To use headful mode, we can disable the headless option and define the browser viewport:
Python
from playwright.sync_api import sync_playwright
with sync_playwright() as playwright:
    # disable the headless mode
    browser = playwright.firefox.launch(headless=False)
    # define the browser viewport
    context = browser.new_context(
        viewport={"width": 1280, "height": 1024}
    )
Node.js
const { firefox } = require('playwright');
(async () => {
  // disable the headless mode
  const browser = await firefox.launch({ headless: false });
  // define the browser viewport
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 }
  });
})();
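Besides the viewport, a browser context accepts other settings, such as the locale and extra headers. The snippet below is a hedged sketch: context_options is a hypothetical helper, its default values are arbitrary assumptions, and the returned keys mirror the viewport, locale, and extra_http_headers keyword arguments of browser.new_context() in Playwright's Python API.

```python
from typing import Any, Dict, Optional


def context_options(
    width: int = 1920,
    height: int = 1080,
    locale: str = "en-US",
    extra_headers: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Return keyword arguments for Playwright's browser.new_context()."""
    options: Dict[str, Any] = {
        "viewport": {"width": width, "height": height},
        "locale": locale,
    }
    if extra_headers:
        # only add custom headers when the caller provides some
        options["extra_http_headers"] = dict(extra_headers)
    return options


# usage (assumes playwright is installed and a browser was launched):
# context = browser.new_context(**context_options(width=1280, height=1024))
```

Building the settings as a plain dictionary keeps them reusable across several contexts of the same browser.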
Next, let's explore automating the Playwright Firefox browser for scraping.
Basic Playwright Firefox Navigation
Let's automate the previous web-scraping.dev/login example using Playwright:
Python
from playwright.sync_api import sync_playwright
with sync_playwright() as playwright:
    browser = playwright.firefox.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    # request the target web page
    page.goto("https://web-scraping.dev/login?cookies=")
    # accept the cookie policy
    page.click("button#cookie-ok")
    # wait for the login form
    page.wait_for_selector("button[type='submit']")
    # wait for the page to fully load
    page.wait_for_load_state("networkidle")
    # fill in the login credentials
    page.fill("input[name='username']", "user123")
    page.fill("input[name='password']", "password")
    # click the login submit button
    page.click("button[type='submit']")
    # wait for an element on the login redirected page
    page.wait_for_selector("div#secret-message")
    secret_message = page.inner_text("div#secret-message")
    print(f"The secret message is {secret_message}")
"The secret message is 🤫"
Node.js
const { firefox } = require('playwright');
(async () => {
  const browser = await firefox.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();
  // request the target web page
  await page.goto('https://web-scraping.dev/login?cookies=');
  // wait for the page to fully load
  await page.waitForLoadState('networkidle');
  // accept the cookie policy
  await page.click("button#cookie-ok");
  // wait for the login form
  await page.waitForSelector("button[type='submit']");
  // fill in the login credentials
  await page.fill("input[name='username']", "user123");
  await page.fill("input[name='password']", "password");
  // click the login submit button
  await page.click("button[type='submit']");
  // wait for an element on the login redirected page
  await page.waitForSelector("div#secret-message");
  const secretMessage = await page.innerText("div#secret-message");
  console.log(`The secret message is ${secretMessage}`);
  // "The secret message is 🤫"
  // close the browser
  await browser.close();
})();
Let's break down the above Playwright Firefox scraping code. We start by initiating a Firefox browser and navigating to the target page URL. Next, we use a combination of Playwright page methods to:
- Wait for specific selectors, as well as the load state.
- Select, fill, and click elements.
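The wait, fill, and click steps listed above can also be expressed as data and replayed against a page. This is a minimal sketch under assumptions: LOGIN_STEPS and run_steps are hypothetical names, and the runner only relies on the click, fill, and wait_for_selector methods of Playwright's synchronous Python page used in the example.

```python
from typing import Any, List, Tuple

# (action, selector, value) tuples mirroring the login example
LOGIN_STEPS: List[Tuple[str, str, str]] = [
    ("click", "button#cookie-ok", ""),
    ("wait", "button[type='submit']", ""),
    ("fill", "input[name='username']", "user123"),
    ("fill", "input[name='password']", "password"),
    ("click", "button[type='submit']", ""),
    ("wait", "div#secret-message", ""),
]


def run_steps(page: Any, steps: List[Tuple[str, str, str]]) -> None:
    """Replay each step on a Playwright-like page object."""
    for action, selector, value in steps:
        if action == "click":
            page.click(selector)
        elif action == "fill":
            page.fill(selector, value)
        elif action == "wait":
            page.wait_for_selector(selector)
        else:
            raise ValueError(f"unknown action: {action}")
```

Calling run_steps(page, LOGIN_STEPS) after page.goto() would perform the same sequence as the script above, and a different form only needs a different list of steps.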
Check our dedicated guide for more details on web scraping with Playwright.
Headless Firefox With Puppeteer
Finally, let's explore using Puppeteer for headless Firefox. First, install the puppeteer package using npm:
npm install puppeteer
Next, install the Firefox browser binaries:
npx puppeteer browsers install firefox
To use headless Firefox with Puppeteer, we can specify firefox as the product (Puppeteer v23 and later name this launch option browser instead):
const puppeteer = require('puppeteer');
(async () => {
  // launch the puppeteer browser
  const browser = await puppeteer.launch({
    // use firefox as the browser name
    product: 'firefox',
    // run in the headless mode
    headless: true
  });
  // start a browser page
  const page = await browser.newPage();
  // go to the target web page
  await page.goto('https://httpbin.dev/user-agent');
  // get the page HTML
  console.log(await page.content());
  // "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0"
  // close the browser
  await browser.close();
})();
The above code runs Firefox in headless mode. To run it in headful mode, we can disable the headless parameter and define the browser viewport:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({
    product: 'firefox',
    headless: false
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
})();
Next, let's explore headless Firefox scraping with Puppeteer through our previous example.
Basic Puppeteer Firefox Navigation
Here's how we can wait, click, and fill elements with Puppeteer:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({
    product: 'firefox',
    headless: true
  });
  // create a browser page
  const page = await browser.newPage();
  // go to the target web page
  await page.goto(
    'https://web-scraping.dev/login?cookies=',
    { waitUntil: 'domcontentloaded' }
  );
  // wait for 500 milliseconds
  await new Promise(resolve => setTimeout(resolve, 500));
  // accept the cookie policy
  await page.click('button#cookie-ok');
  // wait for the login form
  await page.waitForSelector('button[type="submit"]');
  // fill in the login credentials
  await page.$eval('input[name="username"]', (el, value) => el.value = value, 'user123');
  await page.$eval('input[name="password"]', (el, value) => el.value = value, 'password');
  await new Promise(resolve => setTimeout(resolve, 500));
  // click the login button and wait for the redirected page
  await page.click('button[type="submit"]');
  await page.waitForSelector('div#secret-message');
  const secretMessage = await page.$eval('div#secret-message', node => node.innerHTML);
  console.log(`The secret message is ${secretMessage}`);
  // close the browser
  await browser.close();
})();
Let's break down the above Puppeteer scraping flow. We start by launching a headless Puppeteer browser and requesting the target web page. Then, we click and fill in the required elements, using fixed timeouts and selector waits to give the elements and the page time to load. Finally, we use a CSS selector to extract the secret message element from the HTML.
For more details on web scraping with Puppeteer, refer to our dedicated guide, as well as our Puppeteer-stealth guide, which covers how to avoid Puppeteer scraper blocking.
FAQ
How to block resources with Firefox headless browsers?
Blocking headless browser resources can significantly increase web scraping speed. For full details, refer to our dedicated article pages on blocking resources for each browser automation library: Selenium, Playwright, and Puppeteer.
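As a rough illustration of the idea, Playwright's Python API can intercept requests with page.route() and abort the ones that are not needed. The sketch below is an example under stated assumptions rather than the method from the dedicated articles: the BLOCKED_RESOURCE_TYPES set and the block_heavy_resources handler are hypothetical names, and which resource types to block depends on the target site.

```python
# resource types to skip; an assumed selection, tune it per target site
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


def block_heavy_resources(route) -> None:
    """Route handler: abort blocked resource types, let everything else through."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()


# usage (assumes a Playwright page object, registered before page.goto):
# page.route("**/*", block_heavy_resources)
```

The handler must be registered before navigating, since only requests issued after the route is set up are intercepted.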
How to scrape background requests with Firefox headless browser?
Inspecting background requests with Firefox is natively supported in Playwright and Puppeteer. As for Selenium, it's available through selenium-wire.
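For example, with Playwright's Python API, a response event listener can record the background calls a page makes. The following is a minimal sketch: make_response_collector is a hypothetical helper, and filtering on the xhr and fetch resource types is an assumption about which requests count as background requests.

```python
from typing import Callable, Dict, List, Tuple


def make_response_collector() -> Tuple[List[Dict], Callable]:
    """Return a list and a handler that appends background responses to it."""
    captured: List[Dict] = []

    def on_response(response) -> None:
        # keep only XHR/fetch calls and skip documents, images, and other assets
        if response.request.resource_type in ("xhr", "fetch"):
            captured.append({"url": response.url, "status": response.status})

    return captured, on_response


# usage (assumes a Playwright page object, attached before page.goto):
# captured, on_response = make_response_collector()
# page.on("response", on_response)
```

After the page finishes loading, the captured list holds the URL and status code of each background response, which can then be inspected or requested directly.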
Summary
In this quick guide, we went step by step through scraping with headless Firefox in Selenium, Playwright, and Puppeteer.
Furthermore, we have explored common browser navigation mechanisms to perform web scraping with Firefox:
- Waiting for load states, page navigation, and selectors.
- Selecting elements, clicking buttons, and filling out forms.