dev-resources.site

AI-pipe: Pipeline for generating/storing embeddings from AI models to DB with data scraped from sites using custom scripts

Published at
12/30/2024
Categories
devchallenge
brightdatachallenge
ai
webdev
Author
ogbotemi_ogungbamila_3ad3

This is a submission for the Bright Data Web Scraping Challenge: Most Creative Use of Web Data for AI Models

What I Built

A web page to quickly create a pipeline to feed AI models data scraped from a provided webpage.

Features

Custom scripting

Custom scripts, with templates provided, give you total control over the kind, type and form of data scraped from webpages.

Embeddings generation

The web service supports generating embeddings from OpenAI and Ollama AI models. For users without access to AI models running on a remote server, it falls back to PostgresML.

Demo

This is coming in rather late, but here is a link to a deployed demo of the webapp:

https://ai-pipe.vercel.app/

GitHub logo ogbotemi-2000 / ai-pipe

A webapp that scrapes the data you specify from the internet and lets you cleanse and format it before it is fed to an AI model to generate embeddings


AI model providers

Ollama

Via a remote deployment like Koyeb

OpenAI

Support for adding the API key along with the request body

PostgresML

Used as a fallback
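
As a rough illustration of how the first two providers differ, the sketch below builds the HTTP request for an embeddings call. It is not taken from the ai-pipe source: the helper name buildEmbeddingRequest, the model names and the base URL are assumptions, while the endpoint paths follow the public OpenAI and Ollama REST APIs.

```javascript
// Hypothetical helper (not part of ai-pipe): describes the HTTP request
// needed to ask a provider for an embedding of `text`.
function buildEmbeddingRequest(provider, { text, model, apiKey, baseUrl }) {
  if (provider === 'openai') {
    return {
      url: 'https://api.openai.com/v1/embeddings',
      options: {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          // the API key travels in the Authorization header
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify({ model, input: text }),
      },
    };
  }
  if (provider === 'ollama') {
    return {
      // baseUrl is wherever the Ollama server is deployed (assumed value below)
      url: `${baseUrl.replace(/\/+$/, '')}/api/embeddings`,
      options: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model, prompt: text }),
      },
    };
  }
  throw new Error(`Unsupported provider: ${provider}`);
}

// Usage sketch (Node 18+ ships a global fetch):
// const { url, options } = buildEmbeddingRequest('ollama', {
//   text: 'hello', model: 'nomic-embed-text', baseUrl: 'http://localhost:11434'
// });
// const json = await fetch(url, options).then(r => r.json());
```

Keeping the request description separate from the actual fetch call makes it easy to swap providers or inspect what will be sent before any network traffic happens.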

Scraping data

You provide a URL and specify which nodes to target as well as what kind of data to extract from them, all of which gets sent to the backend. Once the response arrives, you can work on the result for each targeted node by writing scripts that format and cleanse the data, then preview the outcome before generating embeddings for it via an AI model of your choosing.
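
To give a feel for the kind of cleansing step described above, here is a minimal sketch. The function name cleanse and the specific rules are assumptions made for illustration and are not one of the templates shipped with the webapp.

```javascript
// Hypothetical cleansing script: tidies raw scraped text before embedding.
function cleanse(raw) {
  return raw
    .replace(/[\u200B-\u200D\uFEFF]/g, '')          // strip zero-width characters
    .split(/\r?\n/)                                 // work line by line
    .map(line => line.replace(/\s+/g, ' ').trim())  // collapse runs of whitespace
    .filter(Boolean)                                // drop lines that ended up empty
    .join('\n');
}

console.log(cleanse('  Hello   world \n\n\t Second\u200B line '));
```

Previewing the output of a step like this before generating embeddings helps avoid spending model calls on navigation text, stray whitespace or invisible characters.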

Embedding

The generated embedding is provided to be copied

In Addition

The webpage features links to useful resources on AI and…

How I Used Bright Data

Scraping browser

I used Puppeteer with a WebSocket URL that points to a browser provided by Bright Data to access websites, then mutate and traverse the DOM while applying custom scripts to scrape data from it.

Here is the code that handles the above:

const puppeteer = require('puppeteer-core'),
      path   = require('path'),
      fs     = require('fs'),
      both = require('../js/both'),
      file   = path.join(require('../utils/rootDir')(__dirname), './config.json'),
      config = fs.existsSync(file)&&require(file)||{...process.env};

module.exports = async function(request, response) {
  let browser, result;

  try {
    /** the request body carries a JSON string holding the target URL and the scripts for each node */
    const { nodes, url } = JSON.parse(request.body.data),
    /**serialize the needed function in the imported object for usage in puppeteer */
          _both = { asText: both.asText.toString() };

    /** connect to the remote browser behind the web socket URL instead of launching one locally */
    browser = await puppeteer.connect({ browserWSEndpoint: config.BROWSER_WS });
    const page = await browser.newPage();

    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36');
    await page.goto(url, { waitUntil:'load', timeout:1e3*60 });
    // await page.waitForFunction(() => document.readyState === 'complete');
    // await page.screenshot({path: 'example.png'});

    result = await page.evaluate((nodes, both) => {
      /** convert serialized function string back into a function to execute it */
      both.asText = new Function(`return ${both.asText}`)();
      /**remove needless nodes from the DOM */
      document.head.remove();
      ['link', 'script', 'style', 'svg'].forEach(tag=>document.body.querySelectorAll(tag).forEach(el=>el.remove()));
      /**defined "node" - the variable present in the dynamic scripts locally to make it available in the 
        custom function context when created with new Function */
      let page = {}, node, fxns = Object.keys(nodes).map(key=>
        /**slip in the local variable - page and prepend a return keyword to make the function string work 
         * as expected when made into a function
        */
        nodes[key] = new Function(`return ${nodes[key].replace(',', ', page, ')}`)()
      );
      /** apply the functions for the nodes to retrieve data as the DOM is being traversed */
      both.asText(document.body, (_node, res)=>fxns.find(fxn=>res=fxn(node=_node, page)) && /*handle fetching media assets later here*/res || '');
      return page
    }, nodes, _both);
  } catch (err) {
    /** errors whose string form starts with "[" are reported as a network disconnect */
    const str = err.toString(), name = err.constructor.name;
    result = {
      error: /^\[/.test(str) ? `${name}: A sudden network disconnect occurred` : str
    };
  } finally {
    /** always release the remote browser, even when scraping fails */
    if (browser) await browser.close().catch(() => {});
    response.json(result);
  }
}
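
The trick of shipping a function as a string and reviving it with new Function can be tried in isolation. The snippet below is a standalone sketch in plain Node, outside Puppeteer; the sample script and its (node, out) signature are assumptions made for illustration.

```javascript
// Hypothetical user script, serialized as a string the way it would travel in JSON.
const source = 'function (node, out) { page.seen = (page.seen || 0) + 1; return node.text.trim(); }';

// A string pattern in replace() only touches the first match, so this slips
// `page` in as the second parameter of the serialized function.
const withPage = source.replace(',', ', page, ');

// Wrapping the text in `return ...` makes new Function hand back the function itself.
const revived = new Function(`return ${withPage}`)();

const page = {};
const text = revived({ text: '  Hello  ' }, page);
console.log(text, page);
```

One caveat of this approach: a script that declares a single parameter has no comma in its parameter list, so the first comma found may sit inside the function body instead. Scripts are therefore expected to declare at least two parameters.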

Web Unlocker

For stubborn sites that use Cloudflare Turnstile to prevent scraping, I tested some code using Bright Data's proxy API and it worked!
In the future, depending on how useful people find this service, I will implement a workaround whereby the downloaded HTML of such sites gets sent to the client side to be scraped via scripts.

Qualified Prompts

AI pipeline

My submission focuses primarily on this prompt. However, it also offers a solution to businesses that have always wanted to control and format the data they scrape from sites.

Thanks for reading

I built this for the Bright Data challenge, but I will improve on it if it turns out to be something useful.
