dev-resources.site
for different kinds of informations.
Exploring OSM changesets via DuckDB
Is the statement, "DuckDB - A game changer for analyzing OpenStreetMap" true?
Let's find out together!
There were so many positive posts about this analytical database system for some time, especially for geospatial data, so I decided: it's time for doubts and it's better to test it by myself. Together with out-of-curiosity testing for dbt outside prod environment, it ended up as a small ETL pet project.
What's the idea?
Letβs take data from OpenStreetMap, load it into DuckDB, run some analytical queries, and show the result. Quite simple, right?
But first of all, people do that all the time, so itβs already boring.
What about not only geodata, but βmetadataβ? Every time someone draws a house on a map, a changeset is created with additional information like user_id, timestamp, software used, tags, and some others.
Looks interesting! If I can combine that with the data itself, it can literally βleverage cross-functional expertise to catalyze impactful decisions for key imaginary stakeholders!β
Where should we start?
I started sitting in my IKEA chair behind an old 5-year-old laptop with 8 cores and 7.24G RAM in WSL, downloading 6.8 GB of the latest .osm.bz2 changeset file with XML data inside from planet.openstreetmap.org.
Let the project begin!
Changesets
What do we have?
We have 7 gigs of archived XML with data starting from 2005-04-09
osmaboardbt$ bzcat data/in/osm_data/changesets-latest.osm.bz2 | head -n 5
<?xml version="1.0" encoding="UTF-8"?>
<osm license="http://opendatacommons.org/licenses/odbl/1-0/" copyright="OpenStreetMap and contributors" version="0.6" generator="planet-dump-ng 1.2.4" attribution="http://www.openstreetmap.org/copyright" timestamp="2024-11-25T00:59:59Z">
<bound box="-90,-180,90,180" origin="http://www.openstreetmap.org/api/0.6"/>
<changeset id="1" created_at="2005-04-09T19:54:13Z" closed_at="2005-04-09T20:54:39Z" open="false" user="Steve" uid="1" min_lat="51.5288506" min_lon="-0.1465242" max_lat="51.5288620" max_lon="-0.1464925" num_changes="2" comments_count="48"/>
<changeset id="2" created_at="2005-04-17T14:45:48Z" closed_at="2005-04-17T15:51:14Z" open="false" user="nickw" uid="94" min_lat="51.0025063" min_lon="-1.0052705" max_lat="51.0047760" max_lon="-0.9943439" num_changes="11" comments_count="5"/>
What do I want?
I want a folder with Parquet files, each one close to 1GB (following some best practices Iβve heard somewhere I donβt remember), with structured data keeping all columns from the source: id, created_at, closed_at, uid, user, num_changes, bbox, tags.
I couldnβt find a tool that would do this for me, so it seems like this isnβt a common thing people do. Maybe my Python script might help someone in the future.
It was also a good idea to prepare test data first. What Iβve learned in the past is that I might have dozens of test runs, and saving minutes by running them on a slice of data will end up saving hours in the end. So, I also made a bash script to take just one month from every year and process it.
Together with GNU Parallel utilizing almost all my hardware, I was able to split the source file into 12 .osm.bz2 ones and later into 12 Parquet files in reasonable time, where just the latest file should be updated/processed next week.
Cool! Well, not 100% cool. There are duplicates across files due to the logic of the --after and --before filters of the osmium-changeset-filter tool. But I decided to follow Unix philosophy here and handle deduplication at the dbt models level.
D select filename, count(*) cnt from read_parquet('data/out/parquet/changesets_*.parquet', filename = true) group by 1 order by 1;
βββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββββ
β filename β cnt β
β varchar β int64 β
βββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββ€
β data/out/parquet/changesets_2005_2012.parquet β 13777528 β
β data/out/parquet/changesets_2013_2015.parquet β 20731429 β
β data/out/parquet/changesets_2016.parquet β 8420095 β
β data/out/parquet/changesets_2017.parquet β 10091179 β
β data/out/parquet/changesets_2018.parquet β 10756788 β
β data/out/parquet/changesets_2019.parquet β 13068229 β
β data/out/parquet/changesets_2020.parquet β 17588235 β
β data/out/parquet/changesets_2021.parquet β 18808719 β
β data/out/parquet/changesets_2022.parquet β 15036354 β
β data/out/parquet/changesets_2023.parquet β 14889321 β
β data/out/parquet/changesets_2024.parquet β 12775670 β
β data/out/parquet/changesets_latest.parquet β 659191 β
βββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββββ€
β 12 rows 2 columns β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
OSM PBF
To be honest, I thought that processing changeset data would be the hardest part here. But it turned out that loading PBF files into DuckDB might have even more issues.
Issue 1: Data Sources
There are several sources for these files. The most popular one, Geofabrik, provides two "types" of dumps: the regular one from download.geofabrik.de and a second one from osm-internal.download.geofabrik.de, which includes full personal metadata. I needed this metadata for joining with the changeset data. Honestly, I donβt understand why changeset_id is considered personal information, but it is what it is.
Issue 2: Missing Metadata in DuckDB.
Even having the *internal.osm.pbf file doesnβt make things easier becauseβ¦ neither QuackOSM nor loading the file with the Spatial extension shows you that information! A little disappointing for a "game-changer," as far as Iβm concerned.
I also encountered some issues documented on a GitHub page:
WSL usage: sometimes code can break since DuckDB is trying to use all available memory, that can be occupied by Windows.
Temporary files: Around 14G(!) of temporary files were created during the process.
Issue 3: Osmium Export and DuckDB
Next, I turned to the good old team player Osmium, where you can update the config file to get everything you need. But yet againβ¦
At first, I thought I could use a simple CSV-like PostgreSQL output and load it into the table like Iβve done dozens of times. Unfortunately, DuckDB couldnβt handle large lines (the max_line_size=3600000 parameter helped) or complex escape symbols with things like four slashes and multiple quotes. Sure, I could use sed or chain replace() functions over and over, but I donβt want workarounds. A game-changer was promised to me!
Then, I tried GeoJSON (RFC7946) option: a 15GB file containing one FeatureCollection with complex tags and so on. Nope. Errors like "maximum_object_size of bla-bla bytes exceeded while reading file," plus some others I canβt even remember.
Okay, maybe GeoJSON with it's maximum_object_size limits isnβt ideal. Enter GeoJSON Text Sequence (RFC8142), also known as geojsonseq or jsonseq. Each line (beginning with an RS (0x1E, record separator) and ending with a linefeed character) contains one GeoJSON object, making it suitable for streaming. Sounds perfect, right? Nope. DuckDB couldnβt read it.
Believe me, I tried dozens of options and approaches. I donβt know how someone new to geospatial data or even data engineering could find this situation acceptable.
The Solution: shellfs extension
Finally, after hours of googling, I found a gem: the Shellfs extension for DuckDB. This made DuckDB finally useful for my case. Seriously, give it a star on GitHub!
With this, I could load my PBF file, complete with all the data I needed, without creating large temporary files. Just a short one-liner:
./duckdb poland_internal.duckdb -c "load shellfs; set memory_limit = '$(duckdb_max_memory)'; create table poland_internal_raw as select json_extract(json, '$.properties') as properties, json_extract(json, '$.geometry') as geom_geojson from read_json_objects('OSMIUM_POOL_THREADS=$(osmium_pool_threads) osmium export -i sparse_file_array -c osmium.config.json -f geojsonseq data/in/osm_data/poland-latest-internal.osm.pbf | jq -c |', format='auto');"
Make sure both duckdb and osmium have enough RAM to run at the same time.
dbt and export to S3
Moving furter. Now we have 2 .duckdb files with raw osm data. It's super easy to add both to profiles.yml file.
osmaboar_dbt:
materialized: table
outputs:
dev:
type: duckdb
path: ../osmaboardbt.duckdb
attach:
- path: ../poland_internal.duckdb
extensions:
- spatial
- json
- shellfs
threads: 6
memory_limit: '4G'
I donβt want to spend much time here; the reader is welcome to check GitHub for details. I created two models to load raw OSM data into a more useful form, with some casts and adjustments. One model defines the userβs first changeset, which was impossible to get from the source data: changesets_count, the number of changesets the user has made before the current one, only works for iD and Rapid OSM editors.
Another model removes duplicates (mentioned above) from Parquet files using a standart row_number() approach.
Three other models are for aggregations and analytics, where DuckDB really shines:
- Number of changesets per month and editor
- Number of changesets by hosts
- Filtered data for amenities with focused on editor, shops and opening hours.
What I want to highlight here is that dbt macros are a great feature! Iβve used them before, and theyβre a fantastic tool for reusing code and keeping business logic centralized, which is essential for larger teams. For example, I used macros to clean up software versions and retain only editor names.
Once the data is ready, itβs pushed to S3 and is accessible for you!
All links, along with the dbt project, are available on GitHub and in the notebook in the next chapter.
Showing results
Data processed, CSV is ready, what can we do now? Grapics!
I've created a notebook in Databricks Community Edition so you can check it out. Nothing special about the code β loading the CSV, casting types, and showing results.
And the results are quite interesting!
For example, I didnβt think StreetComplete was so popular for improving maps. Even the stats for people making their first edits look great. It seems like gamification works!
I also noticed that Maps.me is still a tool for OSM contributors, even though Organic Maps has been on the market for quite some time. Really, why are people still using Maps.me? Itβs okay for overall stats, but for first edits? I donβt get it.
Another interesting one is OsmBiz. It looks like businesses are aware of OSM. Iβd love to learn more about them.
When it comes to everyday usage, I care about the working hours of places. Letβs take shops as an example. If we separate them into two groups β those with working hours and those without β itβs quite disappointing. 94% of places donβt have working hours. But when shops do have them, StreetComplete, Every Door, and B-Jazz show up again.
You can explore the data yourself. Just clone the notebook! Or leave me a message here or on GitHub issues β it would be great to know how people might use it!
Summary and some thoughts.
Is the statement "DuckDB - A game changer for analyzing OpenStreetMap" true? I don't think so, at least not for now.
On one hand, I came to the same conclusion as the author of Should You Ditch Spark for DuckDb or Polars? (very good post, take a look!).
The issues with OOM errors, the hype surrounding the tool, and its primary use case as an interactive, ad-hoc query tool for data exploration rather than for ETL workflowsβall of these seem true to me. But even for OSM data exploration, DuckDB lacks flexibility when it comes to handling metadata, and there are some minor issues with loading geodata into the database.
It seems like PostGIS is still the better, more mature solution here. Even if I can run some queries in three seconds instead of a couple of minutes, PostGIS still offers more functions, better logging, improved observability, and a robust ecosystem and community. Thereβs also the option to explore pg_parquet. I'm also looking forward to seeing more progress with Iceberg. :)
It's no surprise that no technology is a silver bullet. As always, consider your specific workload and pay close attention to DBMS types. The distinction between multi-user and single-user systems isn't just about the number of connected agents; it's also about accommodating different human users and their needs.
DuckDB + Spatial is promising, but it's still a bit too young to replace PostgreSQL + PostGIS + Osmium at this stage. Perhaps in a couple of yearsβ¦
Baits and hot-takes
One more thing I wanted to add to make the post a bit more fun :)
After seeing all the errors and illogical decisions, here are my shout-outs:
Every tool and script based on DuckDB should have a max_memory option! Seriously, I want to limit memory usage in my WSL/Docker environment, and I want control over it.
As one of the main rules in business goes, "Get money from clients in any way they use," every new database should be able to ingest my data in any form I provide. Long strings? Handle them! WKT/WKB for geography? No problem, ingest them! Some other kind of JSON? Eat it! I want my data inside the system without any hiccups at that stage.
- It's almost 2025, and OpenStreetMap should start sharing data in Parquet file format. Archived XML is so "nineties".
What do you think?
Featured ones: