
Building an Async E-Commerce Web Scraper with Pydantic, Crawl4ai & Gemini

Published: 1/11/2025
Categories: python, beginners, webscraping, tutorial
Author: async_dime

TLDR:

Learn how to build an E-commerce scraper using crawl4ai's LLM-based extraction and Pydantic models. The scraper fetches both listing data (names, prices) and detailed product information (specs, reviews) asynchronously.

Try the full code in Google Colab


Ever wanted to analyze E-commerce product data but found traditional web scraping too complex? In this guide, I'll show you how to build a reliable scraper using modern Python tools. We'll use crawl4ai for intelligent extraction and Pydantic for clean data modeling.

Why Crawl4AI and Pydantic?

  • Crawl4AI: A robust library that simplifies web crawling and scraping by leveraging AI-based extraction strategies.
  • Pydantic: A Python library for data validation and settings management, ensuring the scraped data adheres to predefined schemas.

Why Scrape Tokopedia?

Tokopedia is one of Indonesia’s largest e-commerce platforms. I’m Indonesian and use the platform a lot, though I’m not an employee or affiliate :). You can apply the same approach to any e-commerce site you like. If you’re a developer interested in e-commerce analytics, market research, or automated data gathering, scraping these listings can be quite useful.

What Makes This Approach Different?

Instead of wrestling with complex CSS selectors or XPath queries, we're using crawl4ai's LLM-based extraction. This means:

  • More resilient to website changes
  • Cleaner, structured data output
  • Less maintenance headache

Setting Up Your Environment

First, let's install our required packages:

%pip install -U crawl4ai
%pip install nest_asyncio
%pip install pydantic

We'll also need nest_asyncio to run async code inside a notebook, along with the imports used by the rest of the snippets:

import os
import json
import asyncio
import nest_asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

nest_asyncio.apply()

Defining Our Data Models

We'll use Pydantic to define exactly what data we want to extract. Here are our two main models:

from pydantic import BaseModel, Field
from typing import List, Optional

class TokopediaListingItem(BaseModel):
    """One product card from a search-results (listing) page."""
    product_name: str = Field(..., description="Name of the product in listing.")
    product_url: str = Field(..., description="URL link to product detail.")
    price: Optional[str] = Field(None, description="Price displayed in listing.")
    store_name: Optional[str] = Field(None, description="Store name from listing.")
    rating: Optional[str] = Field(None, description="Rating displayed in listing.")
    image_url: Optional[str] = Field(None, description="Primary image from listing.")

class TokopediaProductDetail(BaseModel):
    """Full information from an individual product page."""
    product_name: str = Field(..., description="Name of product on detail page.")
    all_images: List[str] = Field(default_factory=list, description="List of all product image URLs.")
    specs: Optional[str] = Field(None, description="Technical specifications or short info.")
    description: Optional[str] = Field(None, description="Long product description.")
    variants: List[str] = Field(default_factory=list, description="List of variants or color options.")
    satisfaction_percentage: Optional[str] = Field(None, description="Percentage of satisfied customers.")
    total_ratings: Optional[str] = Field(None, description="Number of ratings.")
    total_reviews: Optional[str] = Field(None, description="Number of reviews.")
    stock: Optional[str] = Field(None, description="Stock availability.")

These models act as a contract for what data we expect to extract. They also provide automatic validation and clear documentation.
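As a quick illustration of that contract (the product values below are made up), missing required fields raise a ValidationError, while optional fields simply default to None:

from pydantic import ValidationError

# Optional fields fall back to their defaults (hypothetical values)
item = TokopediaListingItem(
    product_name="Wireless Mouse X1",
    product_url="https://www.tokopedia.com/some-store/wireless-mouse-x1",
)
print(item.price)  # None

# Missing required fields are rejected before any scraping logic runs
try:
    TokopediaListingItem(price="Rp150.000")
except ValidationError as exc:
    print(exc)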

The Scraping Process

Our scraper works in two stages:

1. Crawling Product Listings

First, we fetch search results pages:

async def crawl_tokopedia_listings(query: str = "mouse-wireless", max_pages: int = 1):
    # LLM-based extraction: describe the fields we want instead of writing CSS selectors
    listing_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaListingItem.model_json_schema(),
        instruction=(
            "Extract structured data for each product in the listing. "
            "Each product should have: product_name, product_url, price, "
            "store_name, rating (scale 1-5), image_url."
        ),
        verbose=True,
    )

    all_results = []

    async with AsyncWebCrawler(verbose=True) as crawler:
        for page in range(1, max_pages + 1):
            url = f"https://www.tokopedia.com/find/{query}?page={page}"
            result = await crawler.arun(
                url=url,
                extraction_strategy=listing_strategy,
                word_count_threshold=1,
                cache_mode=CacheMode.DISABLED,
            )
            # extracted_content is a JSON string; parse it into a list of dicts
            data = json.loads(result.extracted_content)
            all_results.extend(data)

    return all_results
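If you want to smoke-test this stage on its own before wiring up the detail crawler, something like this works in the notebook (the query is just an example):

# Stage 1 only: fetch one page of listings and peek at the first result
listings = await crawl_tokopedia_listings(query="mouse-wireless", max_pages=1)
print(f"Found {len(listings)} products")
print(listings[0]["product_name"], listings[0].get("price"))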

2. Fetching Product Details

Then, for each product URL we found, we fetch its detailed information:

async def crawl_tokopedia_detail(product_url: str):
    detail_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaProductDetail.model_json_schema(),
        instruction=(
            "Extract fields like product_name, all_images (list), specs, "
            "description, variants (list), satisfaction_percentage, "
            "total_ratings, total_reviews, stock availability."
        ),
        verbose=False,
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=product_url,
            extraction_strategy=detail_strategy,
            word_count_threshold=1,
            cache_mode=CacheMode.DISABLED,
        )

        # Validate the LLM output against our Pydantic model
        parsed_data = json.loads(result.extracted_content)
        return TokopediaProductDetail(**parsed_data)

Putting It All Together

Finally, we combine both stages into a single function:

async def run_full_scrape(query="mouse-wireless", max_pages=2, limit=15):
    # Stage 1: collect product listings, then cap how many details we fetch
    listings = await crawl_tokopedia_listings(query=query, max_pages=max_pages)
    listings_subset = listings[:limit]

    all_data = []
    for i, item in enumerate(listings_subset, start=1):
        # Stage 2: fetch the detail page for each listed product
        detail_data = await crawl_tokopedia_detail(item["product_url"])
        combined_data = {
            "listing_data": item,
            "detail_data": detail_data.model_dump(),
        }
        all_data.append(combined_data)
        print(f"[Detail] Scraped {i}/{len(listings_subset)}")

    return all_data
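One possible refinement that isn't in the original notebook: the detail loop above runs sequentially, so you could fetch several product pages concurrently with asyncio.gather, capped by a semaphore to stay polite:

import asyncio

async def run_full_scrape_concurrent(query="mouse-wireless", max_pages=2, limit=15, max_concurrency=3):
    listings = await crawl_tokopedia_listings(query=query, max_pages=max_pages)
    listings_subset = listings[:limit]
    semaphore = asyncio.Semaphore(max_concurrency)  # cap simultaneous detail requests

    async def fetch_one(item):
        async with semaphore:
            detail = await crawl_tokopedia_detail(item["product_url"])
            return {"listing_data": item, "detail_data": detail.model_dump()}

    # Run the detail fetches concurrently instead of one at a time
    return await asyncio.gather(*(fetch_one(item) for item in listings_subset))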

Running the Scraper

Here's how to use it:

# Scrape first 5 products from page 1
results = await run_full_scrape("mouse-wireless", max_pages=1, limit=5)

# Print results nicely formatted
for result in results:
    print(json.dumps(result, indent=4))

Pro Tips

  1. Rate Limiting: Be respectful of Tokopedia's servers. Add delays between requests if scraping many pages (see the sketch after this list).

  2. Caching: Enable crawl4ai's cache during development:

cache_mode=CacheMode.ENABLED
  3. Error Handling: The snippets above keep error handling minimal; add try/except blocks and retries before relying on this in production.

  4. API Keys: Store your Gemini API key in an environment variable, not in the code.
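Here's a rough sketch covering the rate-limiting and API-key tips together (the two-second delay and the getpass prompt are my own choices, not part of the original notebook):

import asyncio
import os
from getpass import getpass

# Keep the key out of the code: read it from the environment, prompt if absent
if not os.getenv("GEMINI_API_KEY"):
    os.environ["GEMINI_API_KEY"] = getpass("Enter your Gemini API key: ")

async def polite_detail_crawl(product_urls, delay_seconds=2.0):
    """Fetch product details one by one, pausing between requests."""
    details = []
    for url in product_urls:
        details.append(await crawl_tokopedia_detail(url))
        await asyncio.sleep(delay_seconds)  # be gentle with Tokopedia's servers
    return details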

What's Next?

You could extend this scraper to:

  • Save data to a database (see the sketch after this list)
  • Track price changes over time
  • Analyze product trends
  • Compare prices across stores
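
For the first idea, here's a minimal sketch using SQLite from the standard library; the table layout is just an illustration, not part of the original code:

import json
import sqlite3

def save_results(results, db_path="tokopedia.db"):
    """Persist the combined listing + detail records into a simple SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "product_url TEXT PRIMARY KEY, product_name TEXT, "
        "price TEXT, detail_json TEXT)"
    )
    for record in results:
        listing = record["listing_data"]
        conn.execute(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?)",
            (
                listing["product_url"],
                listing["product_name"],
                listing.get("price"),
                json.dumps(record["detail_data"]),
            ),
        )
    conn.commit()
    conn.close()

# Example: save_results(results) after running the scraper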

Wrapping up

Using crawl4ai with LLM-based extraction makes web scraping much more maintainable than traditional methods. The combination with Pydantic ensures your data is well-structured and validated.

Remember to always check a website's robots.txt and terms of service before scraping. Happy coding!


Important links:

Crawl4AI

Pydantic


Note: The complete code is available in the Colab notebook. Feel free to try it out and adapt it for your needs.
