Web Crawler in Action: How to use Webspot to implement automatic recognition and data extraction of list web pages
Introduction
Extracting data from list pages is one of the most common web data extraction tasks. For engineers who write web crawlers, generating extraction rules efficiently matters; otherwise, much of their time is spent writing CSS selectors and XPath rules by hand. This article walks through an example of using the open source tool Webspot to automatically recognize list pages and extract their data.
Webspot
Webspot is an open source project aimed at automating web page data extraction. Currently, it supports recognizing list pages and pagination and extracting crawling rules for them. In addition, it provides a web UI where users can view the recognition results visually, and an API that developers can call to obtain those results.
Installing Webspot is straightforward; the official documentation has an installation tutorial that uses Docker and Docker Compose. Execute the commands below to install and start Webspot.
# clone git repo
git clone https://github.com/crawlab-team/webspot
# enter the cloned repo so docker-compose can find its compose file
cd webspot
# start docker containers
docker-compose up -d
Wait for the application to start; initialization may take about half a minute.
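Instead of waiting a fixed amount of time, a script can poll the web UI address until it answers. The following is a minimal sketch, assuming that any HTTP response from http://localhost:9999 means the application is up (the Webspot documentation may define a more precise health check). The names `http_probe` and `wait_until_ready` are hypothetical helpers introduced here, not part of Webspot.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=3):
            return True
    except urllib.error.HTTPError:
        # the server answered, even if with an error status
        return True
    except (urllib.error.URLError, OSError):
        # connection refused, timeout, DNS failure, and similar
        return False


def wait_until_ready(url: str, timeout: float = 60.0, interval: float = 2.0,
                     probe=http_probe) -> bool:
    """Poll `url` until `probe` succeeds or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# usage (assumed address: the web UI URL used in this article):
# wait_until_ready('http://localhost:9999')
```

The `probe` parameter exists only so the polling loop can be exercised without a running container.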
Once initialization finishes, visit http://localhost:9999. If the Webspot user interface appears, the application has started successfully.
Now, we can create a new request for page recognition. Click "New Request", enter https://quotes.toscrape.com, and click "Submit". After a short wait, the recognition results for the page should appear.
Use API to auto extract data
Next, we use Python to call the Webspot API and extract data automatically.
The whole process is as follows.
1. Call the Webspot API to obtain the extraction rules for list pages and pagination. The extraction rules are CSS selectors.
2. Define the retrieval target based on the extraction rules of the list page, that is, each item on the list page and its corresponding fields.
3. Determine the next page to crawl based on the extraction rules of pagination, and let the crawler follow it automatically.
Call API
Calling the API is simple: pass the URL to be recognized in the request body. The code is as follows.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
# API endpoint
api_endpoint = 'http://localhost:9999/api'
# url to extract
url = 'https://quotes.toscrape.com'
# call API to recognize list page and pagination elements
res = requests.post(f'{api_endpoint}/requests', json={
    'url': url
})
results = res.json()
pprint(results)
Running the code above in a Python console produces recognition results similar to the following.
{...
'method': 'request',
'no_async': True,
'results': {'pagination': [{'detector': 'pagination',
'name': 'Next',
'score': 1.0,
'scores': {'score': 1.0},
'selectors': {'next': {'attribute': None,
'name': 'pagination',
'node_id': 120,
'selector': 'li.next > a',
'type': 'css'}}}],
...
'plain_list': [{...
'fields': [{'attribute': '',
'name': 'Field_text_1',
'node_id': None,
'selector': 'div.quote > span.text',
'type': 'text'},
...],
...}],
},
...}
The recognition results include CSS Selectors for list pages and pagination, as well as the corresponding fields for each item on the list page.
List page and field extraction logic
Next, we will write the logic for extracting list pages and fields.
First, we can obtain the list page selector and fields from results.
# list result
list_result = results['results']['plain_list'][0]
# list items selector
list_items_selector = list_result['selectors']['full_items']['selector']
print(list_items_selector)
# fields
fields = list_result['fields']
print(fields)
Then, we can write the logic for parsing list page items.
def get_data(soup: BeautifulSoup) -> list:
    # data
    data = []
    # items
    items_elements = soup.select(list_items_selector)
    for el in items_elements:
        # row data
        row = {}
        # iterate fields
        for f in fields:
            # field name
            field_name = f['name']
            # field element
            field_element = el.select_one(f['selector'])
            # skip if field element not found
            if not field_element:
                continue
            # add field value to row
            if f['type'] == 'text':
                row[field_name] = field_element.text
            else:
                row[field_name] = field_element.attrs.get(f['attribute'])
        # add row to data
        data.append(row)
    return data
The get_data function in the code above takes a BeautifulSoup instance as its parameter and uses list_items_selector and fields to parse the list data, which it then returns to the caller.
Request list page and pagination logic
Then, we need to write the logic for requesting list pages and following pagination: request a given URL, parse its pagination, and call the get_data function above.
We first need to obtain the pagination's CSS selector.
# pagination next selector
next_selector = results['results']['pagination'][0]['selectors']['next']['selector']
print(next_selector)
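The direct lookups used in this article raise IndexError or KeyError when Webspot detects no list or no pagination on a page. The following is a defensive sketch, assuming the response shape shown in the sample output earlier; `pick_selectors` is a hypothetical helper name introduced here, not part of Webspot.

```python
def pick_selectors(results: dict) -> dict:
    """Pull the list, field, and pagination selectors out of a recognition result.

    Assumes the response shape printed earlier in this article.
    """
    detected = results.get('results') or {}
    lists = detected.get('plain_list') or []
    if not lists:
        raise ValueError('no list was recognized on the page')
    first_list = lists[0]

    # pagination is treated as optional: a single-page list has no "next" link
    pagination = detected.get('pagination') or []
    next_selector = None
    if pagination:
        next_selector = pagination[0]['selectors']['next']['selector']

    return {
        'items': first_list['selectors']['full_items']['selector'],
        'fields': first_list['fields'],
        'next': next_selector,
    }
```

For the sample output shown earlier, `pick_selectors(results)['next']` would be `li.next > a`. Returning `None` for a missing pagination selector lets the caller decide whether to crawl a single page or stop.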
Then, we write the crawler logic, which keeps crawling data from the website's list pages until there is no next page.
from urllib.parse import urljoin

def crawl(url: str) -> list:
    # all data to crawl
    all_data = []
    while True:
        print(f'requesting {url}')
        # request url
        res = requests.get(url)
        # beautiful soup of html (explicit parser avoids a bs4 warning)
        soup = BeautifulSoup(res.content, 'html.parser')
        # add parsed data
        data = get_data(soup)
        all_data += data
        # pagination next element
        next_el = soup.select_one(next_selector)
        # end if pagination next element not found
        if not next_el:
            break
        # url of next page
        url = urljoin(url, next_el.attrs.get('href'))
    return all_data
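The loop above ends only when a page has no "next" element, so a site whose pagination links cycle would keep it running forever. The following sketch adds a page limit and a visited set, under the assumption that fetching and parsing are supplied by the caller; `crawl_pages`, `parse_page`, and `max_pages` are hypothetical names introduced here for illustration.

```python
from urllib.parse import urljoin


def crawl_pages(start_url: str, parse_page, max_pages: int = 100) -> list:
    """Follow "next" links from start_url, collecting rows from each page.

    parse_page(url) is a caller-supplied function returning (rows, next_href),
    where next_href is None on the last page.
    """
    all_rows = []
    visited = set()
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        rows, next_href = parse_page(url)
        all_rows.extend(rows)
        # resolve relative links such as '/page/2/' against the current url
        url = urljoin(url, next_href) if next_href else None
    return all_rows
```

To plug this into the article's code, `parse_page` could request the URL, build the soup, and return `get_data(soup)` together with the `href` of `soup.select_one(next_selector)`, or `None` when no such element exists.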
That completes all the coding parts.
Putting them all together
The following is the complete code for the entire crawl logic.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
from pprint import pprint


def get_data(soup: BeautifulSoup) -> list:
    # data
    data = []
    # items
    items_elements = soup.select(list_items_selector)
    for el in items_elements:
        # row data
        row = {}
        # iterate fields
        for f in fields:
            # field name
            field_name = f['name']
            # field element
            field_element = el.select_one(f['selector'])
            # skip if field element not found
            if not field_element:
                continue
            # add field value to row
            if f['type'] == 'text':
                row[field_name] = field_element.text
            else:
                row[field_name] = field_element.attrs.get(f['attribute'])
        # add row to data
        data.append(row)
    return data


def crawl(url: str) -> list:
    # all data to crawl
    all_data = []
    while True:
        print(f'requesting {url}')
        # request url
        res = requests.get(url)
        # beautiful soup of html (explicit parser avoids a bs4 warning)
        soup = BeautifulSoup(res.content, 'html.parser')
        # add parsed data
        data = get_data(soup)
        all_data += data
        # pagination next element
        next_el = soup.select_one(next_selector)
        # end if pagination next element not found
        if not next_el:
            break
        # url of next page
        url = urljoin(url, next_el.attrs.get('href'))
    return all_data


if __name__ == '__main__':
    # API endpoint
    api_endpoint = 'http://localhost:9999/api'
    # url to extract
    url = 'https://quotes.toscrape.com'
    # call API to recognize list page and pagination elements
    res = requests.post(f'{api_endpoint}/requests', json={
        'url': url
    })
    results = res.json()
    pprint(results)
    # list result
    list_result = results['results']['plain_list'][0]
    # list items selector (module-level name, read by get_data)
    list_items_selector = list_result['selectors']['full_items']['selector']
    print(list_items_selector)
    # fields (module-level name, read by get_data)
    fields = list_result['fields']
    print(fields)
    # pagination next selector (module-level name, read by crawl)
    next_selector = results['results']['pagination'][0]['selectors']['next']['selector']
    print(next_selector)
    # start crawling
    all_data = crawl(url)
    # print crawled results
    pprint(all_data[:50])
Running the code produces result data like the following.
[{'Field_link_url_6': '/author/Albert-Einstein',
'Field_link_url_8': '/tag/change/page/1/',
'Field_text_1': '“The world as we have created it is a process of our '
'thinking. It cannot be changed without changing our '
'thinking.”',
'Field_text_2': '“The world as we have created it is a process of our '
'thinking. It cannot be changed without changing our '
'thinking.”',
'Field_text_3': '\n'
' Tags:\n'
' \n'
'change\n'
'deep-thoughts\n'
'thinking\n'
'world\n',
'Field_text_4': 'Albert Einstein',
'Field_text_5': '(about)',
'Field_text_7': 'change'},
...
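The raw rows above contain stray whitespace (Field_text_3) and fields that repeat the same value (Field_text_1 and Field_text_2). A small post-processing step can tidy them. This is a sketch, assuming that two fields with identical text in one row are duplicates rather than meaningful repeats; `clean_row` is a hypothetical helper introduced here.

```python
def clean_row(row: dict) -> dict:
    """Collapse whitespace in string values and drop fields repeating an earlier value."""
    cleaned = {}
    seen_values = set()
    for name, value in row.items():
        if isinstance(value, str):
            # collapse runs of spaces and newlines into single spaces
            value = ' '.join(value.split())
            # keep only the first field that carries a given string
            if value in seen_values:
                continue
            seen_values.add(value)
        cleaned[name] = value
    return cleaned


# usage with the crawl results from this article:
# tidy_data = [clean_row(row) for row in all_data]
```

Non-string values, such as a missing attribute stored as `None`, are passed through unchanged.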
We have now completed the crawling task of automatically extracting lists using Webspot. With this approach, there is no need to explicitly define CSS selectors or XPaths: call the Webspot API, and use the rules it returns to obtain the list page data.