Web Crawler in Action: How to use Webspot to implement automatic recognition and data extraction of list web pages

Published: 4/9/2023
Categories: ai, crawler, programming
Author: tikazyq
Introduction

Extracting data from list web pages with a web crawler is one of the most common web data extraction tasks. For the engineers who write these crawlers, generating extraction rules efficiently matters a great deal; otherwise, most of their time is spent writing CSS selector and XPath rules. To address this issue, this article walks through an example of using the open-source tool Webspot to automatically recognize and extract data from list web pages.

Webspot

Webspot is an open-source project aimed at automating web page data extraction. Currently, it supports recognizing list pages and pagination and extracting their crawling rules. In addition, it provides a web UI for users to visually inspect the recognition results, and it exposes an API that developers can call to obtain those results programmatically.

Installing Webspot is quite easy: you can refer to the official documentation for the installation tutorial with Docker and Docker Compose. Execute the commands below to install and start Webspot.

# clone git repo
git clone https://github.com/crawlab-team/webspot

# start docker containers
docker-compose up -d

Wait for it to start up; initializing the application might take about half a minute.

After initialization, visit http://localhost:9999 in a browser. You should see the user interface below, which means Webspot has started successfully.

Webspot initialization screen
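
If you prefer to check readiness from a script rather than the browser, here is a minimal sketch that polls the UI endpoint; it assumes the root URL returns an HTTP 200 response once the application is ready.

import time

import requests

# poll the Webspot UI until it responds (assumption: it returns HTTP 200 once ready)
for _ in range(30):
    try:
        if requests.get('http://localhost:9999', timeout=3).ok:
            print('Webspot is up')
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(2)
else:
    print('Webspot did not start within the expected time')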

Now, we can create a new recognition request. Click "New Request", enter https://quotes.toscrape.com, and click "Submit". After a short wait, we should see the page below.

Webspot list page recognition

Use the API to automatically extract data

We will now use Python to call the Webspot API and automatically extract data.

The whole process is as below; a minimal sketch of how these steps fit together follows the list.

  1. Call the Webspot API to obtain the extraction rules for list pages and pagination. The extraction rules are CSS Selectors.

  2. Define the retrieval target based on the extraction rules of the list page, that is, each item and its corresponding field on the list page.

  3. Determine the target for crawling the next page based on the extraction rules of pagination, and let the crawler automatically crawl the data of the next page.
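
Concretely, the skeleton of the crawler we are about to build looks roughly like this (assuming the api_endpoint and url variables defined in the next snippet); the function bodies are developed step by step in the sections below.

# step 1: ask Webspot for the extraction rules (CSS selectors) of the target URL
results = requests.post(f'{api_endpoint}/requests', json={'url': url}).json()

# step 2: parse each list item and its fields using the list page rules
def get_data(soup: BeautifulSoup) -> list:
    ...

# step 3: follow the pagination rule to request page after page
def crawl(url: str) -> list:
    ...

all_data = crawl(url)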

Call API

Calling the API is very simple: just pass the URL to be recognized in the request body. The code is as follows.

import requests
from bs4 import BeautifulSoup
from pprint import pprint

# API endpoint
api_endpoint = 'http://localhost:9999/api'

# url to extract
url = 'https://quotes.toscrape.com'

# call API to recognize list page and pagination elements
res = requests.post(f'{api_endpoint}/requests', json={
    'url': url
})
results = res.json()
pprint(results)

Running the code above in a Python console yields recognition result data similar to the following.

{...
 'method': 'request',
 'no_async': True,
 'results': {'pagination': [{'detector': 'pagination',
                             'name': 'Next',
                             'score': 1.0,
                             'scores': {'score': 1.0},
                             'selectors': {'next': {'attribute': None,
                                                    'name': 'pagination',
                                                    'node_id': 120,
                                                    'selector': 'li.next > a',
                                                    'type': 'css'}}}],
...
             'plain_list': [{...
                             'fields': [{'attribute': '',
                                         'name': 'Field_text_1',
                                         'node_id': None,
                                         'selector': 'div.quote > span.text',
                                         'type': 'text'},
                                        ...],
                             ...}],
             },
...}

The recognition results include CSS Selectors for list pages and pagination, as well as the corresponding fields for each item on the list page.

List page and field extraction logic

Next, we will write the logic for extracting list pages and fields.

First, we can obtain the list items selector and the fields from results.

# list result
list_result = results['results']['plain_list'][0]

# list items selector
list_items_selector = list_result['selectors']['full_items']['selector']
print(list_items_selector)

# fields
fields = list_result['fields']
print(fields)

Then, we can write the logic for parsing list page items.

def get_data(soup: BeautifulSoup) -> list:
    # data
    data = []

    # items
    items_elements = soup.select(list_items_selector)
    for el in items_elements:
        # row data
        row = {}

        # iterate fields
        for f in fields:
            # field name
            field_name = f['name']

            # field element
            field_element = el.select_one(f['selector'])

            # skip if field element not found
            if not field_element:
                continue

            # add field value to row
            if f['type'] == 'text':
                row[field_name] = field_element.text
            else:
                row[field_name] = field_element.attrs.get(f['attribute'])

        # add row to data
        data.append(row)

    return data

In the function get_data above, we pass a BeautifulSoup instance as the parameter and use list_items_selector and fields to parse the list data, which is then returned to the caller.
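
As a quick sanity check, get_data can be tried on a single page before adding pagination; this sketch assumes url, list_items_selector and fields from the snippets above are already defined.

# fetch the first list page and parse it with the recognized rules
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
pprint(get_data(soup)[:2])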

Request list page and pagination logic

Then, we need to write the logic for requesting list pages and handling pagination, that is, to request a given URL, call the get_data function above, and follow the pagination link to the next page.

We first need to obtain the pagination's CSS selector.

# pagination next selector
next_selector = results['results']['pagination'][0]['selectors']['next']['selector']
print(next_selector)

Then, we write the crawler logic, which continuously crawls data from the website's list pages.

from urllib.parse import urljoin  # resolve the relative next-page link


def crawl(url: str) -> list:
    # all data to crawl
    all_data = []

    while True:
        print(f'requesting {url}')

        # request url
        res = requests.get(url)

        # beautiful soup of html
        soup = BeautifulSoup(res.content, 'html.parser')

        # add parsed data
        data = get_data(soup)
        all_data += data

        # pagination next element
        next_el = soup.select_one(next_selector)

        # end if pagination next element not found
        if not next_el:
            break

        # url of next page
        url = urljoin(url, next_el.attrs.get('href'))

    return all_data

So we have completed all the coding parts.

Putting them all together

The following is the complete code for the entire crawl logic.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from pprint import pprint


def get_data(soup: BeautifulSoup) -> list:
    # data
    data = []

    # items
    items_elements = soup.select(list_items_selector)
    for el in items_elements:
        # row data
        row = {}

        # iterate fields
        for f in fields:
            # field name
            field_name = f['name']

            # field element
            field_element = el.select_one(f['selector'])

            # skip if field element not found
            if not field_element:
                continue

            # add field value to row
            if f['type'] == 'text':
                row[field_name] = field_element.text
            else:
                row[field_name] = field_element.attrs.get(f['attribute'])

        # add row to data
        data.append(row)

    return data


def crawl(url: str) -> list:
    # all data to crawl
    all_data = []

    while True:
        print(f'requesting {url}')

        # request url
        res = requests.get(url)

        # beautiful soup of html
        soup = BeautifulSoup(res.content, 'html.parser')

        # add parsed data
        data = get_data(soup)
        all_data += data

        # pagination next element
        next_el = soup.select_one(next_selector)

        # end if pagination next element not found
        if not next_el:
            break

        # url of next page
        url = urljoin(url, next_el.attrs.get('href'))

    return all_data


if __name__ == '__main__':
    # API endpoint
    api_endpoint = 'http://localhost:9999/api'

    # url to extract
    url = 'https://quotes.toscrape.com'

    # call API to recognize list page and pagination elements
    res = requests.post(f'{api_endpoint}/requests', json={
        'url': url
    })
    results = res.json()
    pprint(results)

    # list result
    list_result = results['results']['plain_list'][0]

    # list items selector
    list_items_selector = list_result['selectors']['full_items']['selector']
    print(list_items_selector)

    # fields
    fields = list_result['fields']
    print(fields)

    # pagination next selector
    next_selector = results['results']['pagination'][0]['selectors']['next']['selector']
    print(next_selector)

    # start crawling
    all_data = crawl(url)

    # print crawled results
    pprint(all_data[:50])

Run the code, and we obtain result data like the following.

[{'Field_link_url_6': '/author/Albert-Einstein',
  'Field_link_url_8': '/tag/change/page/1/',
  'Field_text_1': '“The world as we have created it is a process of our '
                  'thinking. It cannot be changed without changing our '
                  'thinking.”',
  'Field_text_2': '“The world as we have created it is a process of our '
                  'thinking. It cannot be changed without changing our '
                  'thinking.”',
  'Field_text_3': '\n'
                  '            Tags:\n'
                  '            \n'
                  'change\n'
                  'deep-thoughts\n'
                  'thinking\n'
                  'world\n',
  'Field_text_4': 'Albert Einstein',
  'Field_text_5': '(about)',
  'Field_text_7': 'change'},
  ...

Now we have accomplished the task of automatically extracting list data using Webspot. This way, there is no need to explicitly define CSS selectors or XPath expressions: just call the Webspot API and we can obtain the list page data.
