
Building an Async E-Commerce Web Scraper with Pydantic, Crawl4ai & Gemini

Published: 1/11/2025
Categories: python, beginners, webscraping, tutorial
Author: async_dime

TLDR:

Learn how to build an E-commerce scraper using crawl4ai's LLM-based extraction and Pydantic models. The scraper fetches both listing data (names, prices) and detailed product information (specs, reviews) asynchronously.

Try the full code in Google Colab


Ever wanted to analyze E-commerce product data but found traditional web scraping too complex? In this guide, I'll show you how to build a reliable scraper using modern Python tools. We'll use crawl4ai for intelligent extraction and Pydantic for clean data modeling.

Why Crawl4AI and Pydantic?

  • Crawl4AI: A robust library that simplifies web crawling and scraping by leveraging AI-based extraction strategies.
  • Pydantic: A Python library for data validation and settings management, ensuring the scraped data adheres to predefined schemas.

Why Scrape Tokopedia?

Tokopedia is one of Indonesia’s largest e-commerce platforms. I’m Indonesian and use the platform a lot, though I’m not an employee or affiliate :). You can apply the same approach to any e-commerce site you like. If you’re a developer interested in e-commerce analytics, market research, or automated data gathering, scraping these listings can be quite useful.

What Makes This Approach Different?

Instead of wrestling with complex CSS selectors or XPath queries, we're using crawl4ai's LLM-based extraction. This means:

  • More resilient to website changes
  • Cleaner, structured data output
  • Less maintenance headache

Setting Up Your Environment

First, let's install our required packages:

%pip install -U crawl4ai
%pip install nest_asyncio
%pip install pydantic

We'll also need nest_asyncio to run async code inside a notebook, along with the imports used by the rest of the snippets:

import os
import json
import asyncio
import nest_asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

nest_asyncio.apply()

Defining Our Data Models

We'll use Pydantic to define exactly what data we want to extract. Here are our two main models:

from pydantic import BaseModel, Field
from typing import List, Optional

class TokopediaListingItem(BaseModel):
    """One product card from a search-results (listing) page."""
    product_name: str = Field(..., description="Name of the product in listing.")
    product_url: str = Field(..., description="URL link to product detail.")
    price: Optional[str] = Field(None, description="Price displayed in listing.")
    store_name: Optional[str] = Field(None, description="Store name from listing.")
    rating: Optional[str] = Field(None, description="Rating displayed in listing.")
    image_url: Optional[str] = Field(None, description="Primary image from listing.")

class TokopediaProductDetail(BaseModel):
    """Full information from an individual product page."""
    product_name: str = Field(..., description="Name of product on detail page.")
    all_images: List[str] = Field(default_factory=list, description="List of all product image URLs.")
    specs: Optional[str] = Field(None, description="Technical specifications or short info.")
    description: Optional[str] = Field(None, description="Long product description.")
    variants: List[str] = Field(default_factory=list, description="List of variants or color options.")
    satisfaction_percentage: Optional[str] = Field(None, description="Percentage of satisfied customers.")
    total_ratings: Optional[str] = Field(None, description="Number of ratings.")
    total_reviews: Optional[str] = Field(None, description="Number of reviews.")
    stock: Optional[str] = Field(None, description="Stock availability.")

These models act as a contract for what data we expect to extract. They also provide automatic validation and clear documentation.
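As a quick illustration of that contract (the product values below are made up), missing required fields raise a ValidationError, while optional fields simply default to None:

from pydantic import ValidationError

# Optional fields fall back to their defaults (hypothetical values)
item = TokopediaListingItem(
    product_name="Wireless Mouse X1",
    product_url="https://www.tokopedia.com/some-store/wireless-mouse-x1",
)
print(item.price)  # None

# Missing required fields are rejected before any scraping logic runs
try:
    TokopediaListingItem(price="Rp150.000")
except ValidationError as exc:
    print(exc)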

The Scraping Process

Our scraper works in two stages:

1. Crawling Product Listings

First, we fetch search results pages:

async def crawl_tokopedia_listings(query: str = "mouse-wireless", max_pages: int = 1):
    # LLM-based extraction: describe the fields we want instead of writing CSS selectors
    listing_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaListingItem.model_json_schema(),
        instruction=(
            "Extract structured data for each product in the listing. "
            "Each product should have: product_name, product_url, price, "
            "store_name, rating (scale 1-5), image_url."
        ),
        verbose=True,
    )

    all_results = []

    async with AsyncWebCrawler(verbose=True) as crawler:
        for page in range(1, max_pages + 1):
            url = f"https://www.tokopedia.com/find/{query}?page={page}"
            result = await crawler.arun(
                url=url,
                extraction_strategy=listing_strategy,
                word_count_threshold=1,
                cache_mode=CacheMode.DISABLED,
            )
            # extracted_content is a JSON string; parse it into a list of dicts
            data = json.loads(result.extracted_content)
            all_results.extend(data)

    return all_results
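If you want to smoke-test this stage on its own before wiring up the detail crawler, something like this works in the notebook (the query is just an example):

# Stage 1 only: fetch one page of listings and peek at the first result
listings = await crawl_tokopedia_listings(query="mouse-wireless", max_pages=1)
print(f"Found {len(listings)} products")
print(listings[0]["product_name"], listings[0].get("price"))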

2. Fetching Product Details

Then, for each product URL we found, we fetch its detailed information:

async def crawl_tokopedia_detail(product_url: str):
    detail_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaProductDetail.model_json_schema(),
        instruction=(
            "Extract fields like product_name, all_images (list), specs, "
            "description, variants (list), satisfaction_percentage, "
            "total_ratings, total_reviews, stock availability."
        ),
        verbose=False,
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=product_url,
            extraction_strategy=detail_strategy,
            word_count_threshold=1,
            cache_mode=CacheMode.DISABLED,
        )

        # Validate the LLM output against our Pydantic model
        parsed_data = json.loads(result.extracted_content)
        return TokopediaProductDetail(**parsed_data)

Putting It All Together

Finally, we combine both stages into a single function:

async def run_full_scrape(query="mouse-wireless", max_pages=2, limit=15):
    # Stage 1: collect product listings, then cap how many details we fetch
    listings = await crawl_tokopedia_listings(query=query, max_pages=max_pages)
    listings_subset = listings[:limit]

    all_data = []
    for i, item in enumerate(listings_subset, start=1):
        # Stage 2: fetch the detail page for each listed product
        detail_data = await crawl_tokopedia_detail(item["product_url"])
        combined_data = {
            "listing_data": item,
            "detail_data": detail_data.model_dump(),
        }
        all_data.append(combined_data)
        print(f"[Detail] Scraped {i}/{len(listings_subset)}")

    return all_data
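One possible refinement that isn't in the original notebook: the detail loop above runs sequentially, so you could fetch several product pages concurrently with asyncio.gather, capped by a semaphore to stay polite:

import asyncio

async def run_full_scrape_concurrent(query="mouse-wireless", max_pages=2, limit=15, max_concurrency=3):
    listings = await crawl_tokopedia_listings(query=query, max_pages=max_pages)
    listings_subset = listings[:limit]
    semaphore = asyncio.Semaphore(max_concurrency)  # cap simultaneous detail requests

    async def fetch_one(item):
        async with semaphore:
            detail = await crawl_tokopedia_detail(item["product_url"])
            return {"listing_data": item, "detail_data": detail.model_dump()}

    # Run the detail fetches concurrently instead of one at a time
    return await asyncio.gather(*(fetch_one(item) for item in listings_subset))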

Running the Scraper

Here's how to use it:

# Scrape first 5 products from page 1
results = await run_full_scrape("mouse-wireless", max_pages=1, limit=5)

# Print results nicely formatted
for result in results:
    print(json.dumps(result, indent=4))

Pro Tips

  1. Rate Limiting: Be respectful of Tokopedia's servers. Add delays between requests if scraping many pages (see the sketch after this list).

  2. Caching: Enable crawl4ai's cache during development:

cache_mode=CacheMode.ENABLED
  3. Error Handling: The snippets above keep error handling minimal; add try/except blocks and retries before relying on this in production.

  4. API Keys: Store your Gemini API key in an environment variable, not in the code.
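Here's a rough sketch covering the rate-limiting and API-key tips together (the two-second delay and the getpass prompt are my own choices, not part of the original notebook):

import asyncio
import os
from getpass import getpass

# Keep the key out of the code: read it from the environment, prompt if absent
if not os.getenv("GEMINI_API_KEY"):
    os.environ["GEMINI_API_KEY"] = getpass("Enter your Gemini API key: ")

async def polite_detail_crawl(product_urls, delay_seconds=2.0):
    """Fetch product details one by one, pausing between requests."""
    details = []
    for url in product_urls:
        details.append(await crawl_tokopedia_detail(url))
        await asyncio.sleep(delay_seconds)  # be gentle with Tokopedia's servers
    return details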

What's Next?

You could extend this scraper to:

  • Save data to a database (see the sketch after this list)
  • Track price changes over time
  • Analyze product trends
  • Compare prices across stores
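
For the first idea, here's a minimal sketch using SQLite from the standard library; the table layout is just an illustration, not part of the original code:

import json
import sqlite3

def save_results(results, db_path="tokopedia.db"):
    """Persist the combined listing + detail records into a simple SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "product_url TEXT PRIMARY KEY, product_name TEXT, "
        "price TEXT, detail_json TEXT)"
    )
    for record in results:
        listing = record["listing_data"]
        conn.execute(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?)",
            (
                listing["product_url"],
                listing["product_name"],
                listing.get("price"),
                json.dumps(record["detail_data"]),
            ),
        )
    conn.commit()
    conn.close()

# Example: save_results(results) after running the scraper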

Wrapping up

Using crawl4ai with LLM-based extraction makes web scraping much more maintainable than traditional methods. The combination with Pydantic ensures your data is well-structured and validated.

Remember to always check a website's robots.txt and terms of service before scraping. Happy coding!


Important links:

Crawl4AI

Pydantic


Note: The complete code is available in the Colab notebook. Feel free to try it out and adapt it for your needs.
