Easy site Crawling in Elixir with ex_crawlzy
History
Last year (2022) I made a New Year's resolution (the tradition is to do nothing about it) to create something that generates income. It sounds like the default "I want to create my startup" story that never succeeds, but there has to be a starting point, so I decided to build my first service, which would start out free and progressively add paid plans.
That super effective brand-new product was never released. I had some personal issues with my free time, so it's on hold, or at least I want to believe that xD. Still, I coded a lot of stuff and learned a lot, because the service was focused on crawling. Yes, crawling.
My favorite language is Elixir, so all my personal projects are in Elixir. I'm also a believer in open source, so I decided my code shouldn't stay isolated in a private repo. I had created something good, and I figured that somewhere in the world some lonely dev who has lost hope in humanity might need help crawling a site with Elixir. Everyone, at some moment in their career, has to defeat the crawling monster, so my library should make that task easy.
The solution
Enough history, let's move on to the solution.
Crawling already exists in Elixir: there's the classic HTTP request, plus the libraries that make it "easier".
The thing is, all the libraries I found do the same thing: they make the call and return the HTML as text, so the parsing and all the hard stuff is still on you.
What's the difference between making the call myself with some HTTP request library and using your library? NOTHING!!!
So I created a library that calls the site and returns the content parsed as a Map. Sounds great, and it is great.
The library is ex_crawlzy. It gives a direct solution based on CSS selectors, which is what almost everyone uses when crawling is needed, so you just bring your selectors and the library does the hard work.
The code
Same as with any library, add the dependency
def deps do
  [
    {:ex_crawlzy, "~> 0.1.1"}
  ]
end
It includes the classic approach that every crawling library offers, just returning the HTML as text
site = "https://example.site"
{:ok, html_content} = ExCrawlzy.crawl(site)
But here comes the interesting part: the library includes a function to parse html_content if you give it the CSS selectors in a map using a key: selector format
fields = %{
  # shortcut for using a function from ExCrawlzy.Utils
  body: {"div#the_body", :text}
  # module/function way
  # body: {"div#the_body", {ExCrawlzy.Utils, :text}}
  # anonymous function way
  # body: {"div#the_body", fn content ->
  #   ExCrawlzy.Utils.text(content)
  # end}
}
{:ok, %{body: body}} = ExCrawlzy.parse(fields, html_content)
You can parse with a shortcut to one of the parsers in the ExCrawlzy.Utils module, with a {Module, :function} tuple, or with an anonymous function that is called when the field is parsed
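To picture how those three parser forms could be told apart, here is a minimal sketch. It is not ex_crawlzy's actual implementation: ParserSketch, apply_parser/2 and DemoUtils are hypothetical names made up for this example, and the parsers work on plain strings instead of HTML nodes.

```elixir
defmodule DemoUtils do
  # Hypothetical stand-in for a utils module; it only trims a string.
  def text(content), do: String.trim(content)
end

defmodule ParserSketch do
  # Atom shortcut: look the function up in the default utils module.
  def apply_parser(shortcut, content) when is_atom(shortcut) do
    apply(DemoUtils, shortcut, [content])
  end

  # {Module, :function} tuple: call that function on that module.
  def apply_parser({module, function}, content)
      when is_atom(module) and is_atom(function) do
    apply(module, function, [content])
  end

  # Anonymous or captured function of arity 1: call it directly.
  def apply_parser(fun, content) when is_function(fun, 1) do
    fun.(content)
  end
end

# The three forms side by side (hypothetical usage).
ParserSketch.apply_parser(:text, "  hello  ")
ParserSketch.apply_parser({DemoUtils, :text}, "  hello  ")
ParserSketch.apply_parser(fn content -> String.trim(content) end, "  hello  ")
```

Guards on function heads keep the three cases apart without any if/else, which is probably the most common way to write this kind of dispatch in Elixir.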
Want more?
The solution is there, but I know you'll need more, maybe more organization when you have a big map of data to fill. For that I added the module ExCrawlzy.Client.Json, which lets you define your crawler directly in a module and just call YourModule.crawl/1. It makes the call, parses the response and returns your data. The implementation is easy.
First, let's define the module
defmodule ExampleCrawler do
  use ExCrawlzy.Client.Json

  add_field(:title, "head title", :text)
  add_field(:body, "div#the_body", :text)
  add_field(:inner_field, "div#the_body div#inner_field", :text)
  add_field(:inner_second_field, "div#inner_second_field", :text_alt)
  add_field(:number, "div#the_number", :text)
  add_field(:exist, "div#the_body div#exist", :exist)
  add_field(:not_exist, "div#the_body div#not_exist", :exist)
  add_field(:link, "a.link_class", :link)
  add_field(:img, "img.img_class", :img)

  def text_alt(sub_doc) do
    ExCrawlzy.Utils.text(sub_doc)
  end
end
If you look at the code, in this case the callback can be a function from ExCrawlzy.Utils or a function defined directly in your module. The function text_alt/1 is defined in the module, and the crawler automatically checks whether a name is a function defined in your module or a parser from the utils module
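A lookup like that can be sketched with Kernel.function_exported?/3. This is only an assumption about how such a check might work, not the library's real code: ResolveSketch, MyCrawlerSketch and FallbackUtils are hypothetical modules, and the parsers take plain strings.

```elixir
defmodule FallbackUtils do
  # Hypothetical stand-in for a shared utils module.
  def text(content), do: String.trim(content)
end

defmodule MyCrawlerSketch do
  # Hypothetical crawler module that defines one custom parser.
  def text_alt(content), do: String.upcase(content)
end

defmodule ResolveSketch do
  # Prefer a 1-arity function exported by the crawler module,
  # otherwise fall back to the utils module.
  def resolve(module, name) do
    if function_exported?(module, name, 1) do
      {module, name}
    else
      {FallbackUtils, name}
    end
  end

  # Resolve the parser and run it on the content.
  def run(module, name, content) do
    {mod, fun} = resolve(module, name)
    apply(mod, fun, [content])
  end
end

ResolveSketch.run(MyCrawlerSketch, :text_alt, "abc")
ResolveSketch.run(MyCrawlerSketch, :text, " abc ")
```

One thing to keep in mind with this approach: function_exported?/3 only sees public functions of modules that are already loaded, so a parser defined with defp would not be found.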
And then just use it
site = "https://example.site"
{:ok, data} = ExampleCrawler.crawl(site)
List of elements
Let's suppose you're crawling a store and want to list its products. You can define that crawler using the module ExCrawlzy.Client.JsonList
defmodule ExampleCrawlerList do
  use ExCrawlzy.Client.JsonList

  list_size(2)
  list_selector("div.possible_value")
  add_field(:field_1, "div.field_1", :text)
  add_field(:field_2, "div.field_2", :text)
end
This module adds 2 new definition macros: list_selector/1, where you define the parent selector, the one that holds the list of elements, and list_size/1, which defines how many elements of the list are taken when parsing
Example of the HTML pattern of the list
<div class="parent_class">
  <div class="child_class"> ...content </div>
  <div class="child_class"> ...content </div>
  <div class="child_class"> ...content </div>
</div>
It follows the same rules as the first crawler module
site = "https://example_list.site"
{:ok, data} = ExampleCrawlerList.crawl(site)
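To make the effect of list_size/1 concrete, here is a rough sketch of "take the first N matched elements, then parse the same fields out of each one". It is an illustration under assumptions, not what the library does internally: ListSketch and parse_list/3 are hypothetical, and each matched element is faked as a map of raw strings instead of an HTML node.

```elixir
defmodule ListSketch do
  # `items` stands in for the elements matched by the list selector.
  # `fields` maps each field name to a 1-arity parser function.
  def parse_list(items, size, fields) do
    items
    |> Enum.take(size)
    |> Enum.map(fn item ->
      Map.new(fields, fn {key, parser} ->
        {key, parser.(Map.fetch!(item, key))}
      end)
    end)
  end
end

# Three fake "elements", but only the first two are kept (size = 2).
items = [
  %{field_1: " a ", field_2: " 1 "},
  %{field_1: " b ", field_2: " 2 "},
  %{field_1: " c ", field_2: " 3 "}
]

fields = %{field_1: &String.trim/1, field_2: &String.trim/1}

ListSketch.parse_list(items, 2, fields)
```

With a size of 2, the third element is simply never parsed, which is the behavior the list_size(2) line in the crawler above describes.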
Adding HTTP clients
Speaking of security, a lot of sites detect robots (crawlers) in the calls and return a forbidden response, or some other kind of response, to make sure only real traffic comes in. For this, the library ships with some pre-coded simulated clients that help get past some robot detectors, and if your site needs other specific browser request headers, you can define them
site = "https://example.site"
clients = [
  [
    {"referer", "https://your_site.com"},
    {"user-agent", "Custom User Agent"}
  ]
]
{:ok, content} = ExCrawlzy.crawl(site, clients)
Or use the macro add_browser_client/1 in your crawler module, passing a list of tuples
defmodule ExampleCrawler do
  use ExCrawlzy.Client.Json

  add_browser_client([
    {"referer", "https://your_site.com"},
    {"user-agent", "Custom User Agent"}
  ])

  add_field(:field_1, "div.field_1", :text)
end
Testing
Testing is really easy: it's an HTTP client, so you test it like an HTTP client. You can use tesla for the testing part
First add this line to your test.exs file. It has to name your own crawler module specifically, so replace ExampleCrawler with the module you are testing
config :tesla, ExampleCrawler, adapter: Tesla.Mock
This is a test example. The HTML is saved in the priv folder to keep things organized, a step I strongly recommend
defmodule ExampleCrawlerTest do
  use ExUnit.Case
  import Tesla.Mock

  setup do
    {:ok, content} =
      :your_app
      |> :code.priv_dir()
      |> then(&"#{&1}/test.html")
      |> File.read()

    mock(fn
      %{method: :get, url: "https://example_list.site"} ->
        %Tesla.Env{status: 200, body: content}
    end)

    :ok
  end

  test "list things" do
    site = "https://example_list.site"
    assert {:ok, data} = ExampleCrawlerList.crawl(site)
  end
end
Conclusions
The library is useful, easy to implement, and it solves a lot of stuff. The traditional way, you'd need to add the floki library yourself, which is an HTML parser with CSS selector support, so you can skip that
There's still stuff to get done. This library is based on declarative development, so there's a lot left to build to make the crawling more flexible and complete, like nested crawlers. For now it's all just a concept, and I'm trying to find the time to add more and more to this library and to the other stuff I want to work on