Turning search results into Markdown for LLMs

Published at
12/6/2024
Categories
webscraping
markdown
serpapi
llm
Author
nate_serpapi
Intro

This article will walk through converting search results into Markdown format, suitable for use in large language models (LLMs) and other applications.

Markdown is a lightweight markup language that provides a simple, readable way to format text with plain-text syntax. Check out the Markdown Guide for more information:

Markdown Guide

Use Case

Markdown's simple, readable format allows for the transformation of raw webpage data into clean, actionable information across different use cases:

  1. LLM Training: Generate Q&A datasets or custom knowledge bases.
  2. Content Aggregation: Create training datasets or compile research.
  3. Market Research: Monitor competitors or gather product information.

SerpApi

SerpApi is a web scraping company that allows developers to extract search engine results and data from various search engines, including Google, Bing, Yahoo, Baidu, Yandex, and others. It provides a simple way to access search engine data programmatically without dealing directly with the complexities of web scraping.

This guide focuses on the Google Search API, but the concepts and techniques discussed can be adapted for use with SerpApi’s other APIs.

Google Search API

The Google Search API lets developers programmatically retrieve structured JSON data from live Google searches. Key benefits include:

  • CAPTCHA and browser automation: Handled for you, so you avoid manual intervention and IP blocks.
  • Structured data: Output is clean and easy to parse.
  • Global and multilingual support: Search in specific languages or regions.
  • Scalability: Perform high-volume searches without disruptions.

Google Search Engine Results API

Getting Started

This section provides a complete code example for fetching Google search results using SerpApi, parsing the webpage content, and converting it to Markdown. While this example uses Node.js (JavaScript), the same principles apply in other languages.

Required Packages

Install the following packages in your Node.js project. The script also imports dotenv and node-fetch, so install those as well.

SerpApi JavaScript: Scrape and parse search engine results using SerpApi. Get search results from Google, Bing, Baidu, Yandex, Yahoo, Home Depot, eBay and more.

SerpApi JavaScript

Cheerio: A fast, flexible, and elegant library for parsing and manipulating HTML and XML.

Cheerio

Turndown: Convert HTML into Markdown with JavaScript.

Turndown

Importing Packages

First, we must import all of our required packages:

import dotenv from "dotenv";
import fetch from "node-fetch";
import fs from "fs/promises";
import path from "path";
import { getJson } from "serpapi";
import * as cheerio from "cheerio";
import TurndownService from "turndown";

// Load variables from .env into process.env
dotenv.config();

Fetching Search Results

The fetchSearchResults function retrieves search results using SerpApi’s Google Search API:

const fetchSearchResults = async (query) => {
  return await getJson("google", {
    api_key: process.env.SERPAPI_KEY,
    q: query,
    num: 5,
  });
};
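
For orientation, the part of the response this guide relies on is the organic_results array, where each entry carries fields such as position, title, link, and snippet. The sketch below uses a hand-written sample object (not real API output) and a hypothetical extractLinks helper to show how the result URLs could be pulled out:

```javascript
// Hand-written sample mirroring the shape of a SerpApi Google response.
// The values are illustrative, not real search results.
const sampleResponse = {
  organic_results: [
    { position: 1, title: "Example A", link: "https://example.com/a", snippet: "..." },
    { position: 2, title: "Example B", link: "https://example.org/b", snippet: "..." },
    { position: 3, title: "Entry without a link" },
  ],
};

// Hypothetical helper: collect the URLs of organic results,
// skipping entries that have no usable link.
const extractLinks = (response) =>
  (response.organic_results ?? [])
    .map((result) => result.link)
    .filter((link) => typeof link === "string" && link.length > 0);

console.log(extractLinks(sampleResponse));
```

Filtering up front like this keeps later steps from being handed an undefined URL.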


Create a .env file containing your SerpApi key (for example, SERPAPI_KEY=your_api_key) and install the dotenv package. Alternatively, if you are only running the script locally, replace process.env.SERPAPI_KEY with your API key.
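
A missing or misnamed key otherwise only surfaces as a failed API request. Here is a minimal sketch of a startup check, assuming the key lives in SERPAPI_KEY as above. The requireEnv helper is hypothetical and not part of dotenv or SerpApi:

```javascript
// Hypothetical helper: read a required environment variable or fail fast.
// `env` defaults to process.env and can be injected to make the check testable.
const requireEnv = (name, env = process.env) => {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value.trim();
};

// Usage, once the .env file has been loaded:
// const apiKey = requireEnv("SERPAPI_KEY");
```

Failing at startup gives a clearer message than an authentication error returned halfway through a run.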

Parsing Webpage Content

The parseUrl function fetches the HTML of a given URL, cleans it, and converts it to Markdown:

const parseUrl = async (url) => {
  try {
    // Configure fetch request with browser-like headers
    const response = await fetch(url, {
      headers: {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        Accept:
          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
      },
    });

    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }

    const html = await response.text();

    // Initialize HTML parser and markdown converter
    const $ = cheerio.load(html);
    const turndown = new TurndownService({
      headingStyle: "atx",
      codeBlockStyle: "fenced",
    });

    // Clean up HTML by removing unnecessary elements
    $("script, style, nav, footer, iframe, .ads").remove();

    // Extract title and main content
    const title = $("title").text().trim() || $("h1").first().text().trim();
    const mainContent =
      $("article, main, .content, #content, .post").first().html() ||
      $("body").html();
    const content = turndown.turndown(mainContent || "");

    return { title, content };
  } catch (error) {
    console.error(`Failed to parse ${url}:`, error.message);
    return null;
  }
};


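This function produces clean, readable Markdown by removing non-essential elements such as scripts, navigation, and ads before conversion. If the request or parsing fails, it logs the error and returns null.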
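
Converted pages often still contain long runs of blank lines and trailing spaces where elements were removed. As an optional post-processing step (not part of the original script), a small dependency-free helper could tidy the Markdown string before it is written. The name tidyMarkdown is hypothetical:

```javascript
// Hypothetical post-processing helper for the Markdown returned by parseUrl.
const tidyMarkdown = (markdown) =>
  markdown
    .replace(/\r\n/g, "\n") // Normalize Windows line endings
    .replace(/[ \t]+$/gm, "") // Strip trailing spaces and tabs on each line
    .replace(/\n{3,}/g, "\n\n") // Collapse runs of blank lines into a single one
    .trim();

// Usage inside parseUrl, for example:
// const content = tidyMarkdown(turndown.turndown(mainContent || ""));
```

Two trailing spaces mark a hard line break in Markdown, and the blank-line collapse also applies inside fenced code, so drop either step if your content depends on them.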

Sanitizing Keywords

To prevent filename issues, we can sanitize keywords before using them in filenames:

const sanitizeKeyword = (keyword) => {
  return keyword
    .replace(/\s+/g, "_") // Replace whitespace runs with underscores
    .substring(0, 15) // Truncate to 15 characters
    .toLowerCase(); // Convert to lowercase
};
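
Keywords can also contain characters that are not allowed in filenames on some systems, such as /, :, or ?. Here is a stricter variant, sketched under the assumption that lowercase ASCII letters, digits, hyphens, and underscores are enough for your filenames. The name toSafeSlug is hypothetical:

```javascript
// Hypothetical stricter variant of sanitizeKeyword.
const toSafeSlug = (keyword, maxLength = 15) =>
  keyword
    .toLowerCase()
    .trim()
    .replace(/\s+/g, "_") // Replace whitespace runs with underscores
    .replace(/[^a-z0-9_-]/g, "") // Drop everything except a-z, 0-9, "_" and "-"
    .substring(0, maxLength) || // Truncate to maxLength characters
  "keyword"; // Fall back to a placeholder if nothing is left

console.log(toSafeSlug("PlayStation 5: Pro/Slim?"));
```

Note that this drops non-ASCII characters entirely, which may not suit keywords in other languages.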


Writing to Markdown

This function writes the parsed content to a Markdown file in the output directory, using sanitizeKeyword to build the filename:

const writeToMarkdown = async (data, keyword, index, url) => {
  const sanitizedKeyword = sanitizeKeyword(keyword);
  const filename = path.join(
    "output",
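    // Replace ":" and "." in the timestamp, since ":" is not allowed in Windows filenames
    `${new Date().toISOString().replace(/[:.]/g, "-")}_${sanitizedKeyword}_${index + 1}.md`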
  );
  const content = `[//]: # (Source: ${url})\n\n# ${data.title}\n\n${data.content}`;
  await fs.writeFile(filename, content, "utf-8");
  return filename;
};
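
The [//]: # (...) line is a common Markdown idiom for a comment: it is parsed as a link reference definition, so most renderers do not display it, while the source URL stays in the file. To make the layout concrete, here is a sketch that builds the same string as a pure function. buildMarkdownDocument is a hypothetical helper and the sample values are made up:

```javascript
// Hypothetical helper: assemble the file body that writeToMarkdown saves.
// Sections are separated by a blank line: source comment, title heading, content.
const buildMarkdownDocument = ({ title, content }, url) =>
  [`[//]: # (Source: ${url})`, `# ${title}`, content].join("\n\n");

const sample = buildMarkdownDocument(
  { title: "Example Page", content: "Some **converted** text." },
  "https://example.com/page"
);
console.log(sample);
```

Keeping the string assembly separate from the file write makes the format easy to check without touching the disk.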


Main Execution

The main script ties the steps together. Update the keywords array with terms relevant to your use case:

// Example Keyword array
const keywords = ["coffee", "playstation 5", "web scraping"];

// Main execution block
(async () => {
  try {
    // Create output directory if it doesn't exist
    await fs.mkdir("output", { recursive: true });

    // Process each keyword
    for (const keyword of keywords) {
      const results = await fetchSearchResults(keyword);

      // Process search results if available
      if (results.organic_results && results.organic_results.length > 0) {
        for (const [index, result] of results.organic_results.entries()) {
          try {
            const data = await parseUrl(result.link);
            // parseUrl returns null on failure (already logged), so skip this result
            if (!data) continue;
            const filename = await writeToMarkdown(
              data,
              keyword,
              index,
              result.link
            );
            console.log(`Written to: ${filename}`);
          } catch (err) {
            console.error(`Failed to process ${result.link}:`, err.message);
            continue;
          }
        }
      } else {
        console.log(`No organic results found for keyword: ${keyword}`);
      }
    }
  } catch (error) {
    console.error(error);
  }
})();



To summarize the above, we:

  • Set up the output directory: Ensures files are saved to an appropriate location.
  • Fetch and parse results: Process each search result URL for relevant content.
  • Error handling: Prevents the entire process from failing due to individual errors.
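
The loop above processes URLs one at a time. If throughput matters, one possible variation (not what the original script does) is to start the requests together and let Promise.allSettled collect successes and failures, which keeps the property that one bad URL does not abort the rest. In this sketch, processAll and fakeWorker are hypothetical, with fakeWorker standing in for parseUrl:

```javascript
// Hypothetical helper: run an async worker over all items concurrently and
// split the outcomes into successes and failures.
const processAll = async (items, worker) => {
  const settled = await Promise.allSettled(items.map((item) => worker(item)));
  const succeeded = [];
  const failed = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      succeeded.push({ item: items[i], value: outcome.value });
    } else {
      const reason = outcome.reason;
      failed.push({ item: items[i], reason: String(reason?.message ?? reason) });
    }
  });
  return { succeeded, failed };
};

// Stand-in worker: rejects for one URL to simulate a failed fetch.
const fakeWorker = async (url) => {
  if (url.includes("broken")) throw new Error(`HTTP error for ${url}`);
  return `markdown for ${url}`;
};

processAll(["https://example.com/a", "https://example.com/broken"], fakeWorker).then(
  ({ succeeded, failed }) => {
    console.log(`ok: ${succeeded.length}, failed: ${failed.length}`);
  }
);
```

Firing many requests at once can trigger rate limits on target sites, so a sequential loop or a small concurrency cap may still be the safer default.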

Next Steps

While the above should get you started, you may need to configure Cheerio or Turndown further to dial in the sections you're scraping.

You can find a repository for the above code here:

NateSkiles/search-results-to-markdown

Conclusion

SerpApi returns search engine results as structured JSON, so a short script is enough to fetch result URLs, convert each page to Markdown, and save the output in a format suited to LLM workflows, data collection, and analysis.
