User browser vs. Puppeteer

Published at
4/21/2024
Categories
crawling
automation
puppeteer
Author
adaschevici

Intro

When crawling the web nowadays, most of the pages you run into are SPAs built with various JS frameworks and libraries that render their content dynamically. This means that the easiest way to crawl them is to use some type of headless browser.

There are several options that I know for doing this:

For the sake of simplicity, I have chosen not to look at tools like Cypress, since the focus here is automation rather than testing.

I will focus mostly on Puppeteer.

How it works

Puppeteer communicates with Chromium using the Chrome DevTools Protocol (CDP) over a WebSocket. In theory this is possible from any programming language, not only Node.js, but in practice the most comprehensive implementation is the one the fine people working on Puppeteer have built. What that means is that you have access to many of the features that are accessible from Chrome (cookies, storage, DOM, screenshots, etc.).
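To make the "CDP via a websocket" part a bit more concrete: CDP commands are JSON messages with an id, a method and params, and the browser's reply echoes the same id. Below is a minimal sketch of that framing only. It is not how Puppeteer is implemented; createCdpSession and the sendRaw callback are names I made up for illustration, and no real socket is opened.

```javascript
// Sketch of CDP-style command framing. No browser or WebSocket is involved;
// `sendRaw` stands in for whatever would write a frame to the socket.
// createCdpSession is a hypothetical helper, not part of Puppeteer.
function createCdpSession(sendRaw) {
  let nextId = 1
  const pending = new Map() // command id -> { resolve, reject }

  return {
    // Serialize a command as a JSON frame and remember who is waiting on it.
    send(method, params = {}) {
      const id = nextId++
      const promise = new Promise((resolve, reject) => {
        pending.set(id, { resolve, reject })
      })
      sendRaw(JSON.stringify({ id, method, params }))
      return promise
    },

    // Feed an incoming frame back in. Responses carry the id of their command;
    // frames without a known id (for example events) are reported as unhandled.
    receive(raw) {
      const message = JSON.parse(raw)
      const waiter = pending.get(message.id)
      if (!waiter) return false
      pending.delete(message.id)
      if (message.error) waiter.reject(new Error(message.error.message))
      else waiter.resolve(message.result)
      return true
    },
  }
}
```

With a real connection, sendRaw would be the socket's send function and receive would be called from its message handler. Puppeteer takes care of all of this for you, which is a big part of why it is so convenient.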

The controversial case for web scraping

Web scraping is a bit of a controversial topic, and many websites tend to clamp down on automated browsing.

There is a wide range of methods for figuring out whether a visitor is real or one of our machine overlords. They range from checking browser capabilities and cookies to captchas and even more advanced behavioral analysis.
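To give a feel for the "browser capabilities" end of that range, here is a rough sketch of the kind of checks a site might run. The function name and the exact set of signals are my own assumptions for illustration; real detection scripts look at far more than this and do not publish their rules.

```javascript
// Illustrative only: a few capability checks a site *might* perform.
// `nav` is any object shaped like the browser's `navigator`.
// findAutomationSignals is a hypothetical name, not a real library function.
function findAutomationSignals(nav) {
  const signals = []
  // Browsers driven by automation report navigator.webdriver === true.
  if (nav.webdriver === true) signals.push('webdriver flag is set')
  // Headless Chrome has historically advertised itself in the user agent.
  if (/HeadlessChrome/.test(nav.userAgent ?? '')) signals.push('headless user agent')
  // Empty language or plugin lists are unusual for a regular desktop browser.
  if ((nav.languages ?? []).length === 0) signals.push('no languages')
  if ((nav.plugins ?? []).length === 0) signals.push('no plugins')
  return signals
}

const suspicious = findAutomationSignals({
  webdriver: true,
  userAgent: 'Mozilla/5.0 HeadlessChrome/124.0.0.0',
  languages: ['en-US'],
  plugins: [],
})
console.log(suspicious)
// → [ 'webdriver flag is set', 'headless user agent', 'no plugins' ]
```

In a real page the same function could be handed window.navigator directly. Taking a plain object just keeps the sketch runnable in Node.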

Warning: past this point proceed at your own risk

An overview of your current browser's capabilities, which some websites inspect and may use to block you if you don't play nice, can be found here.

The following snippet shows how to check Puppeteer's default profile.

import puppeteer from 'puppeteer'

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})

Now, if you want the site to believe that you are playing nice, you need to find a way to get these checks passing. For that you need a few more modules.

# with pnpm you can install the required packages as follows
pnpm i puppeteer-extra puppeteer-extra-plugin-stealth

Then do the same thing again, this time using the stealth plugin.

import puppeteer from 'puppeteer-extra'

// add stealth plugin and use defaults (all evasion techniques)
import StealthPlugin from 'puppeteer-extra-plugin-stealth'
puppeteer.use(StealthPlugin())

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})

Conclusions

  • When crawling you should behave as a human would
  • There is no way to fully pretend...but it is fun to try.
  • Be polite and don't do this at mega scale, so that you don't crash servers
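
One way to act on the "be polite" point is to space requests out instead of firing them as fast as possible. Here is a minimal sketch, assuming a two second minimum gap plus some random jitter. Both numbers and both function names are arbitrary choices of mine, not a recommendation from any site, and `visit` is whatever you use to load a page (for example a function wrapping page.goto).

```javascript
// Hypothetical helper: how many milliseconds to wait before the next request,
// given when the previous one started and the minimum gap we want to keep.
function msUntilNextRequest(lastRequestAt, now, minIntervalMs, jitterMs = 0) {
  const earliest = lastRequestAt + minIntervalMs + jitterMs
  return Math.max(0, earliest - now)
}

// Visit URLs one at a time, never faster than minIntervalMs apart.
// `visit` is an async function supplied by the caller.
async function crawlPolitely(urls, visit, { minIntervalMs = 2000, maxJitterMs = 1000 } = {}) {
  let lastRequestAt = -Infinity // the first request never has to wait
  for (const url of urls) {
    const jitterMs = Math.random() * maxJitterMs
    const wait = msUntilNextRequest(lastRequestAt, Date.now(), minIntervalMs, jitterMs)
    if (wait > 0) await new Promise(resolve => setTimeout(resolve, wait))
    lastRequestAt = Date.now()
    await visit(url)
  }
}
```

The jitter is there so the requests don't arrive on a perfectly regular beat, which is also a bit closer to how a human would browse.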
