How to Scrape With Headless Firefox

Published at
4/18/2024
Categories
headlessbrowsers
puppeteer
selenium
node
Author
scrapfly_dev
In this guide, we'll explain how to install and use headless Firefox with Selenium, Playwright, and Puppeteer. Additionally, we'll go over a practical example of automating each of these libraries for common tasks when scraping web pages.

Headless Firefox With Selenium

Let's start our guide by exploring Selenium headless Firefox. First, we'll have to install Selenium using the following pip command:

pip install selenium

The above command installs Selenium 4. Since version 4.6, Selenium ships with Selenium Manager, which downloads the matching WebDriver binary automatically (geckodriver for Firefox, chromedriver for Chrome):

from selenium import webdriver 
from selenium.webdriver import FirefoxOptions

# selenium firefox browser options
options = FirefoxOptions()
options.add_argument("-headless")

# initiating the browser and download the webdriver 
with webdriver.Firefox(options=options) as driver: 
    # go to the target web page
    driver.get("https://httpbin.dev/user-agent")

    print(driver.page_source)
    # the printed page source includes the browser's user agent, for example:
    # "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"

In the above code, we start by defining basic browser configuration using the FirefoxOptions class. Then, we use the webdriver.Firefox constructor to create a Selenium Firefox instance, which also downloads the Firefox WebDriver binaries automatically. Finally, we request the target web page and return the HTML content.
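Note that page_source holds the whole rendered document, not only the user agent string. The following is a minimal sketch of pulling the value out with the standard library. The extract_user_agent helper and the sample markup are hypothetical, and the sketch assumes the user agent appears in the page between double quotes and starts with "Mozilla/".

```python
import html
import re
from typing import Optional

# Assumption: the user agent shows up as a double-quoted "Mozilla/..." string.
UA_PATTERN = re.compile(r'"(Mozilla/[^"]+)"')


def extract_user_agent(page_source: str) -> Optional[str]:
    """Return the first user agent string found in the page source, or None."""
    # decode HTML entities first in case the quotes were escaped as &quot;
    match = UA_PATTERN.search(html.unescape(page_source))
    return match.group(1) if match else None


# Illustrative input only; the real markup around the JSON body may differ.
sample_source = (
    '<html><body><pre>{"user-agent": "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) '
    'Gecko/20100101 Firefox/124.0"}</pre></body></html>'
)
print(extract_user_agent(sample_source))
```

With a live browser, the same helper could be called as extract_user_agent(driver.page_source).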

The above code uses the -headless argument to run Selenium Firefox headless (without a graphical user interface). To run it in headful mode, we can simply remove the argument and, optionally, set the browser viewport size:

from selenium import webdriver 
from selenium.webdriver import FirefoxOptions

# selenium firefox browser options
options = FirefoxOptions()
# browser viewport size
options.add_argument("--width=1920")
options.add_argument("--height=1080")

# initiating the browser and download the webdriver 
with webdriver.Firefox(options=options) as driver: 
    # go to the target web page
    driver.get("https://httpbin.dev/user-agent")

Now that we can spin up a headless Firefox browser, we can automate it with the regular Selenium API.
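The headless and headful configurations shown so far differ only in their command-line arguments, so they can be folded into one helper. This is a sketch rather than a required pattern: firefox_arguments and launch_firefox are hypothetical names, and Selenium is imported lazily so the argument builder can be tried without a browser installed.

```python
from typing import List


def firefox_arguments(headless: bool = True, width: int = 1920, height: int = 1080) -> List[str]:
    """Build the Firefox command-line arguments used in the snippets above."""
    args = [f"--width={width}", f"--height={height}"]
    if headless:
        args.insert(0, "-headless")
    return args


def launch_firefox(headless: bool = True):
    """Start Firefox through Selenium with the arguments built above."""
    # imported here so firefox_arguments() works even without Selenium installed
    from selenium import webdriver
    from selenium.webdriver import FirefoxOptions

    options = FirefoxOptions()
    for argument in firefox_arguments(headless=headless):
        options.add_argument(argument)
    return webdriver.Firefox(options=options)


print(firefox_arguments(headless=True))
print(firefox_arguments(headless=False, width=1280, height=720))
```

The returned driver can be used as a context manager, just like in the earlier examples: with launch_firefox(headless=False) as driver.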

Basic Selenium Firefox Navigation

In this section, we'll create a headless Firefox scraping script to automate the login process on web-scraping.dev/login. We'll request the target page URL, accept the cookie policy, fill in the login credentials, and click the login button:

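from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# run Firefox in the headless mode
options = FirefoxOptions()
options.add_argument("-headless")

with webdriver.Firefox(options=options) as driver: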
    # go to the target web page
    driver.get("https://web-scraping.dev/login?cookies=")

    # define a timeout
    wait = WebDriverWait(driver, timeout=5)

    # accept the cookie policy
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#cookie-ok")))
    driver.find_element(By.CSS_SELECTOR, "button#cookie-ok").click()

    # wait for the login form
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']")))

    # fill in the login credentials
    username_input = driver.find_element(By.CSS_SELECTOR, "input[name='username']")
    username_input.clear()
    username_input.send_keys("user123")

    password_input = driver.find_element(By.CSS_SELECTOR, "input[name='password']")
    password_input.clear()
    password_input.send_keys("password")

    # click the login submit button
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

    # wait for an element on the login redirected page
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#secret-message")))

    secret_message = driver.find_element(By.CSS_SELECTOR, "div#secret-message").text
    print(f"The secret message is: {secret_message}")
    # prints: The secret message is: 🤫

Here, we define a timeout and use Selenium's expected conditions to wait for specific elements to appear. Then, we use the find_element method to locate the elements, fill in the form fields, and click the buttons.
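To make the waiting idea concrete, here is a simplified, standard-library illustration of what an explicit wait does: it polls a condition until it returns something truthy or the timeout expires. This is not Selenium's implementation (WebDriverWait also handles ignored exceptions, for instance), and wait_until and element_appeared are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 5.0, poll_interval: float = 0.1) -> T:
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "is the element there yet?": becomes truthy on the third check.
attempts = {"count": 0}


def element_appeared():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_appeared, timeout=2, poll_interval=0.01))
```

Polling with a deadline is generally more reliable than a fixed sleep, because the script continues as soon as the element is ready and fails loudly when it never shows up.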

For further details on using Selenium for web scraping, refer to our dedicated guide.

Headless Firefox With Playwright

Let's explore headless Firefox scraping with Playwright, a popular web browser automation tool with straightforward APIs.

We'll cover using Playwright headless Firefox in both Python and Node.js APIs. First, install Playwright using the following command:

Python

pip install playwright

Node.js

npm install playwright

Next, install the Firefox browser binaries that Playwright drives, using the following command:

Python

playwright install firefox

Node.js

npx playwright install firefox

To start headless Firefox with Playwright, we have to explicitly select the browser type:

Python

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # launch playwright firefox browser
    browser = playwright.firefox.launch(headless=True)

    # new browser session with the default settings
    context = browser.new_context()

    # new browser tab
    page = context.new_page()

    # request the target page url
    page.goto("https://httpbin.dev/user-agent")

    # get the page HTML
    print(page.content())
    # the printed page content includes the browser's user agent, for example:
    # "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0"

Node.js

const { firefox } = require('playwright');

(async () => {
    // launch playwright firefox browser
    const browser = await firefox.launch({ headless: true });

    // new browser session with the default settings
    const context = await browser.newContext();

    // new browser tab
    const page = await context.newPage();

    // request the target page url
    await page.goto('https://httpbin.dev/user-agent');

    // get the page HTML
    const content = await page.content();
    console.log(content);
    // the logged page content includes the browser's user agent, for example:
    // "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"

    // close the browser
    await browser.close();
})();

Here, we start a Playwright headless Firefox and create a new browser context with the default settings, including headers and localization. Then, we open a Playwright page and request the target page URL.
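The context defaults mentioned above can be overridden when the context is created. Below is a minimal sketch, assuming we want to change the locale, the user agent, and some request headers; context_options and open_page are hypothetical helper names, while locale, user_agent, and extra_http_headers are keyword arguments of Playwright's new_context. Playwright is imported lazily so the option builder can be tried on its own.

```python
from typing import Any, Dict, Optional


def context_options(
    locale: str = "en-US",
    user_agent: Optional[str] = None,
    extra_headers: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments for browser.new_context()."""
    options: Dict[str, Any] = {"locale": locale}
    if user_agent:
        options["user_agent"] = user_agent
    if extra_headers:
        options["extra_http_headers"] = dict(extra_headers)
    return options


def open_page(url: str, **overrides: Any) -> str:
    """Load a URL in headless Firefox using the customized context."""
    # imported here so context_options() works without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.firefox.launch(headless=True)
        context = browser.new_context(**context_options(**overrides))
        page = context.new_page()
        page.goto(url)
        content = page.content()
        browser.close()
        return content


print(context_options(locale="de-DE", extra_headers={"Accept-Language": "de-DE,de;q=0.9"}))
```

For example, open_page("https://httpbin.dev/user-agent", user_agent="my-scraper/1.0") would be expected to report the custom user agent instead of the Firefox default.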

The above code runs the browser instance in the headless mode. To use the headful mode, we can disable the headless option and define the browser viewport:

Python

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # disable the headless mode
    browser = playwright.firefox.launch(headless=False)

    # define the browser viewport
    context = browser.new_context(
        viewport = { "width": 1280, "height": 1024 }
    )

Node.js

const { firefox } = require('playwright');

(async () => {
    // disable the headless mode
    const browser = await firefox.launch({ headless: false });

    // define the browser viewport
    const context = await browser.newContext({
        viewport: { width: 1920, height: 1080 }
    });

    // ... automate the browser here

    // close the browser
    await browser.close();
})();

Next, let's explore automating the Playwright Firefox browser for scraping.

Basic Playwright Firefox Navigation

Let's automate the previous web-scraping.dev/login example using Playwright:

Python

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.firefox.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()

    # request the target web page
    page.goto("https://web-scraping.dev/login?cookies=")

    # accept the cookie policy
    page.click("button#cookie-ok")

    # wait for the login form
    page.wait_for_selector("button[type='submit']")

    # wait for the page to fully load
    page.wait_for_load_state("networkidle")

    # fill in the login credentials
    page.fill("input[name='username']", "user123")
    page.fill("input[name='password']", "password")

    # click the login submit button    
    page.click("button[type='submit']")

    # wait for an element on the login redirected page
    page.wait_for_selector("div#secret-message")

    secret_message = page.inner_text("div#secret-message")
    print(f"The secret message is {secret_message}")
    # prints: The secret message is 🤫

Node.js

const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // request the target web page
  await page.goto('https://web-scraping.dev/login?cookies=');

  // wait for the page to fully load
  await page.waitForLoadState('networkidle');

  // accept the cookie policy
  await page.click("button#cookie-ok");

  // wait for the login form
  await page.waitForSelector("button[type='submit']");

  // fill in the login credentials
  await page.fill("input[name='username']", "user123");
  await page.fill("input[name='password']", "password");

  // click the login submit button    
  await page.click("button[type='submit']");

  // wait for an element on the login redirected page
  await page.waitForSelector("div#secret-message");

  const secretMessage = await page.innerText("div#secret-message");
  console.log(`The secret message is ${secretMessage}`);
  // prints: The secret message is 🤫

  // close the browser
  await browser.close();
})();

Let's break down the above Playwright Firefox scraping code. We start by initiating a Firefox browser and navigating to the target page URL. Next, we use a combination of Playwright page methods to:

  • Wait for specific selectors, as well as the load state.
  • Select, fill, and click elements.
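The same steps can be wrapped in a reusable function that accepts any Playwright page. This is a sketch under stated assumptions: login and RecordingPage are hypothetical names, and RecordingPage is only a stand-in that records calls so the flow can be checked without launching a browser.

```python
LOGIN_URL = "https://web-scraping.dev/login?cookies="


def login(page, username: str, password: str) -> str:
    """Run the login flow from the example above and return the secret message."""
    page.goto(LOGIN_URL)
    page.click("button#cookie-ok")
    page.wait_for_selector("button[type='submit']")
    page.wait_for_load_state("networkidle")
    page.fill("input[name='username']", username)
    page.fill("input[name='password']", password)
    page.click("button[type='submit']")
    page.wait_for_selector("div#secret-message")
    return page.inner_text("div#secret-message")


class RecordingPage:
    """Hypothetical stand-in for a Playwright page that records every call."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # any page method that is looked up becomes a recorder
        def method(*args):
            self.calls.append((name, *args))
            return "fake secret" if name == "inner_text" else None

        return method


fake_page = RecordingPage()
print(login(fake_page, "user123", "password"))
for call in fake_page.calls:
    print(call)
```

With a real browser, the page returned by context.new_page() would be passed in place of fake_page.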

Check our dedicated guide for more details on web scraping with Playwright.

Headless Firefox With Puppeteer

Finally, let's explore using Puppeteer for headless Firefox. First, install the puppeteer library package using npm:

npm install puppeteer

Next, install the Firefox browser binaries:

npx puppeteer browsers install firefox

To use headless Firefox with Puppeteer, we can specify firefox as the product (newer Puppeteer releases, v23 and later, rename this launch option to browser):

const puppeteer = require('puppeteer');

(async () => {
    // launch the puppeteer browser 
    const browser = await puppeteer.launch({
        // use firefox as the browser name
        product: 'firefox',
        // run in the headless mode
        headless: true
    })

    // start a browser page
    const page = await browser.newPage();

    // goto the target web page
    await page.goto('https://httpbin.dev/user-agent');

    // get the page HTML
    console.log(await page.content());
    // the logged page content includes the browser's user agent, for example:
    // "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0"

    // close the browser
    await browser.close();
})();

The above code runs Firefox in headless mode. To run it in headful mode, we can disable the headless parameter and define the browser viewport:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        product: 'firefox',
        headless: false
    })
    const page = await browser.newPage();
    await page.setViewport({width: 1920, height: 1080});
})();

Next, let's explore headless Firefox scraping with Puppeteer through our previous example.

Basic Puppeteer Firefox Navigation

Here's how we can wait, click, and fill elements with Puppeteer:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        product: 'firefox',
        headless: true
    })

    // create a browser page
    const page = await browser.newPage();

    // go to the target web page
    await page.goto(
        'https://web-scraping.dev/login?cookies=',
        { waitUntil: 'domcontentloaded' }
    );

    // wait for 500 milliseconds        
    await new Promise(resolve => setTimeout(resolve, 500));

    // accept the cookie policy
    await page.click('button#cookie-ok')    

    // wait for the login form
    await page.waitForSelector('button[type="submit"]')

    // fill in the login credentials
    await page.$eval('input[name="username"]', (el, value) => el.value = value, 'user123');
    await page.$eval('input[name="password"]', (el, value) => el.value = value, 'password');    
    await new Promise(resolve => setTimeout(resolve, 500));

    // click the login button and wait for navigation
    await page.click('button[type="submit"]');
    await page.waitForSelector('div#secret-message');    

    const secretMessage = await page.$eval('div#secret-message', node => node.innerHTML);
    console.log(`The secret message is ${secretMessage}`);

    // close the browser
    await browser.close();    
})();

Let's break down the above Puppeteer scraping execution flow. We start by launching a Puppeteer headless browser and then request the target web page. Then, we click and fill in the required elements, using fixed timeouts and selector waits to give the elements and the page time to load. Finally, we use a CSS selector to extract the secret message element from the HTML.

For more details on web scraping with Puppeteer, refer to our dedicated guide, as well as our Puppeteer-stealth guide, which prevents Puppeteer scraper blocking.

FAQ

To wrap up this guide, let's look at some frequently asked questions about scraping with headless Firefox.

How to block resources with Firefox headless browsers?

Blocking headless browser resources can significantly increase web scraping speed. For full details, refer to our dedicated article pages on blocking resources for each browser automation library: Selenium, Playwright, and Puppeteer.
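As a rough illustration with Playwright, requests can be intercepted with page.route and aborted based on their resource type. This is a sketch, not the approach from the linked articles: the set of blocked types is an assumption to tune per target, and should_block, handle_route, and scrape_without_heavy_resources are hypothetical names.

```python
# Assumption: these resource types are not needed to extract the data.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


def should_block(resource_type: str) -> bool:
    """Decide whether a request of this type should be aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def handle_route(route) -> None:
    """Route handler: abort blocked resource types, let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()


def scrape_without_heavy_resources(url: str) -> str:
    """Load a page in headless Firefox while skipping the blocked resources."""
    # imported here so the helpers above work without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.firefox.launch(headless=True)
        page = browser.new_page()
        # intercept every request made by the page
        page.route("**/*", handle_route)
        page.goto(url)
        content = page.content()
        browser.close()
        return content


print(should_block("image"), should_block("document"))
```

Blocking stylesheets can change how some pages render, so it is worth checking that the target data still loads with the chosen set.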

How to scrape background requests with Firefox headless browser?

Inspecting background requests with Firefox is natively supported in Playwright and Puppeteer. As for Selenium, it's available through selenium-wire.
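With Playwright, for instance, background requests can be observed by subscribing to the page's response event. The sketch below keeps only XHR and fetch responses. BackgroundRequestLog and capture_background_requests are hypothetical names, and the fake response at the end exists only to demonstrate the filter without a browser.

```python
from types import SimpleNamespace
from typing import Dict, List

# Resource types that usually correspond to background (API) calls.
BACKGROUND_TYPES = {"xhr", "fetch"}


class BackgroundRequestLog:
    """Collect the URL and status of XHR/fetch responses seen by a page."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, object]] = []

    def on_response(self, response) -> None:
        if response.request.resource_type in BACKGROUND_TYPES:
            self.entries.append({"url": response.url, "status": response.status})


def capture_background_requests(url: str) -> List[Dict[str, object]]:
    """Load a page in headless Firefox and return its background responses."""
    # imported here so the log class works without Playwright installed
    from playwright.sync_api import sync_playwright

    log = BackgroundRequestLog()
    with sync_playwright() as playwright:
        browser = playwright.firefox.launch(headless=True)
        page = browser.new_page()
        page.on("response", log.on_response)
        page.goto(url)
        page.wait_for_load_state("networkidle")
        browser.close()
    return log.entries


# Demonstration with a fake response object (illustrative URL).
demo_log = BackgroundRequestLog()
demo_log.on_response(
    SimpleNamespace(
        url="https://example.com/api/items",
        status=200,
        request=SimpleNamespace(resource_type="xhr"),
    )
)
print(demo_log.entries)
```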

Summary

In this quick guide, we walked step by step through scraping with headless Firefox in Selenium, Playwright, and Puppeteer.

Furthermore, we have explored common browser navigation mechanisms to perform web scraping with Firefox:

  • Waiting for load states, page navigation, and selectors.
  • Selecting elements, clicking buttons, and filling out forms.