Unlock the Power of Google News Scraping with Python

Published: 1/9/2025
Category: webscraping
Author: swiftproxy_residential

You want the latest news, fast and structured. No problem. Scraping Google News is one of the quickest ways to gather up-to-the-minute headlines, monitor emerging trends, and dive deep into sentiment analysis. In this post, I’ll walk you through how to scrape Google News using Python—no fluff, just actionable insights.
By the end of this tutorial, you'll know how to efficiently pull headlines and links from Google News, cleanly store them in JSON format, and even avoid blocks using proxies and headers.

Step 1: Python Environment Setup

First, make sure you have Python installed on your system. Then, we’ll install the key libraries: requests and lxml.
Run this in your terminal:

pip install requests
pip install lxml

These tools will handle HTTP requests and parse the HTML content of Google News, giving you the power to extract exactly what you need.

Step 2: Get to Know Your Target URL and XPath

Now, we need to understand the structure of Google News’ webpage. Here's the URL of the page we’ll scrape:

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"

This page displays multiple news articles with titles and links to related stories. To grab this information, we need to understand how the HTML is organized. Here's a simplified breakdown of the XPath structure:
Main News Container: //c-wiz[@jsrenderer="ARwRbe"]
Main News Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()
Main News Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/@href
Related News Container: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article
Related News Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article/a/text()
Related News Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article/a/@href
Now we know where to look for the data we need.
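Before running these XPaths against the live page, it can help to sanity-check them on a small hand-written fragment. The markup below is a hypothetical stand-in that simply mirrors the nesting described above, not real Google News output:

```python
from lxml import html

# Hypothetical fragment mirroring the c-wiz structure described above
sample = """
<c-wiz jsrenderer="ARwRbe">
  <c-wiz>
    <div>
      <article><a href="./articles/abc">Main headline</a></article>
      <div>
        <article><a href="./articles/def">Related headline</a></article>
      </div>
    </div>
  </c-wiz>
</c-wiz>
"""

tree = html.fromstring(sample)
titles = tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()')
related = tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article/a/text()')
print(titles, related)
```

If the expressions match your fragment the way you expect, you can be more confident they describe the structure correctly before pointing them at the real page.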

Step 3: Fetch Google News Content

We’ll fetch the page content using requests. Here’s the code to do that:

import requests

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
response = requests.get(url)

if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

This sends a GET request to the Google News URL and stores the HTML content. If something goes wrong (like a 404 or 500 error), we’ll know.
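For a more resilient fetch, you can let requests retry transient failures automatically instead of giving up on the first error. This sketch mounts urllib3's Retry on a Session via an HTTPAdapter; the retry count, backoff factor, and status codes are illustrative choices, not requirements:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 3, backoff: float = 1.0) -> requests.Session:
    """Build a Session that retries transient errors with exponential backoff."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=[429, 500, 502, 503, 504],  # rate limits and server errors
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

session = make_session()
# response = session.get(url)  # same call as before, now with automatic retries
```

This keeps the scraping code itself unchanged: only the object you call `.get()` on is different.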

Step 4: Analyze the HTML with lxml

Once we have the raw HTML, we need to parse it to make sense of the structure. That’s where lxml comes in. Here’s how to parse the page:

from lxml import html

# Parse the HTML content
parser = html.fromstring(page_content)

This command turns the raw HTML into an object we can query using XPath.

Step 5: Extract News Data

Now, we get to the fun part: extracting headlines and links. We’ll first extract the main news headlines, then dive into the related articles. Here’s how:

# Extract main news articles
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')

news_data = []

for element in main_news_elements[:10]:  # Grab the first 10 main headlines
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')

    # Only append when both a title and a link were found
    if titles and links:
        news_data.append({
            "main_title": titles[0],
            # hrefs are relative ("./..."), so strip the dot and prepend the domain
            "main_link": "https://news.google.com" + links[0][1:],
        })

With this, we’ve pulled the main headlines and links from Google News. But we’re not done yet—let's go deeper.

Step 6: Extract Related Articles

For each main headline, there are often related articles. Let’s pull those too.

# Extract related articles within each main article
for i, element in enumerate(main_news_elements[:10]):
    related_articles = []
    related_news_elements = element.xpath('.//c-wiz/div/div/article')

    for related_element in related_news_elements:
        related_titles = related_element.xpath('.//a/text()')
        related_links = related_element.xpath('.//a/@href')
        if related_titles and related_links:
            related_articles.append({
                "title": related_titles[0],
                "link": "https://news.google.com" + related_links[0][1:],
            })

    # Attach to the matching main article (not always the last one)
    if i < len(news_data):
        news_data[i]["related_articles"] = related_articles

Now, each main news article in our news_data list includes related articles—giving us a more comprehensive set of data.

Step 7: Export Your Data as JSON

All this data needs to be stored somewhere. We’ll save it to a JSON file so you can reuse it later for trend tracking or sentiment analysis.

import json

# Save the extracted data to a JSON file
with open('google_news_data.json', 'w') as f:
    json.dump(news_data, f, indent=4)

Now you have a file named google_news_data.json filled with news headlines and links, ready for further analysis.
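To confirm the file round-trips cleanly, you can load it straight back with json.load. The single record below is a hypothetical sample that just mirrors the structure the script produces:

```python
import json

# Hypothetical record mirroring the structure produced above
news_data = [{
    "main_title": "Example headline",
    "main_link": "https://news.google.com/articles/example",
    "related_articles": [],
}]

# Write the data out, then read it back to verify nothing was lost
with open("google_news_data.json", "w") as f:
    json.dump(news_data, f, indent=4)

with open("google_news_data.json") as f:
    loaded = json.load(f)

print(loaded[0]["main_title"])
```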

Additional Tips: Working with Proxies and Custom Headers

Working with Proxies
If you're scraping a lot of data, sites like Google News might block you. To avoid that, use proxies. Here’s how:

proxies = {
    "http": "http://your_proxy_ip:port",
    "https": "https://your_proxy_ip:port",
}

response = requests.get(url, proxies=proxies)

By routing your requests through different IPs, you can scrape more efficiently without being blocked.
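If you have several proxy endpoints, a simple way to spread requests across them is to cycle through a pool. The addresses below are placeholders to substitute with your own:

```python
import itertools

# Placeholder endpoints -- substitute your own proxy addresses
proxy_pool = itertools.cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])

def next_proxies() -> dict:
    """Return the proxies mapping requests expects, advancing the pool."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

# response = requests.get(url, proxies=next_proxies())
```

Each call hands requests the next proxy in the rotation, so consecutive requests leave from different IPs.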
Customizing Headers
Websites often block requests that look like they're from bots. To avoid detection, you can add custom headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}

response = requests.get(url, headers=headers)

These headers make your requests look like they’re coming from an actual browser.
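If you send many requests, a requests.Session lets you set the headers once and reuse the underlying connection; every request made through the session then carries them automatically:

```python
import requests

session = requests.Session()
# Headers set here are sent on every request made through this session
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})

# response = session.get(url)  # headers are applied automatically
```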

Complete Code Sample

Here’s everything wrapped up in one script:

import requests
from lxml import html
import json

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}
proxies = {"http": "http://your_proxy_ip:port", "https": "https://your_proxy_ip:port"}

response = requests.get(url, headers=headers, proxies=proxies)

if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    exit()

# Parse HTML
parser = html.fromstring(page_content)

# Extract news
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')
news_data = []

for element in main_news_elements[:10]:
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')
    if not (titles and links):
        continue  # skip containers that don't match the expected structure

    # Extract related articles
    related_articles = []
    for related_element in element.xpath('.//c-wiz/div/div/article'):
        related_titles = related_element.xpath('.//a/text()')
        related_links = related_element.xpath('.//a/@href')
        if related_titles and related_links:
            related_articles.append({
                "title": related_titles[0],
                "link": "https://news.google.com" + related_links[0][1:],
            })

    news_data.append({
        "main_title": titles[0],
        "main_link": "https://news.google.com" + links[0][1:],
        "related_articles": related_articles
    })

# Save to JSON
with open("google_news_data.json", "w") as json_file:
    json.dump(news_data, json_file, indent=4)

print("Data extraction complete. Saved to google_news_data.json")

Conclusion

Scraping Google News with Python is an efficient way to gather real-time news data. Whether you’re tracking trends, analyzing sentiment, or just curious about the latest headlines, this method provides a solid foundation. Use proxies and custom headers to avoid being blocked, and save your data in JSON format for easy access.
