A Polars exploration into Kedro

Published at: 5/17/2023
Categories: kedro, python, polars, datascience
Author: astrojuanlu

One year ago I travelled to Lithuania for the first time to present at PyCon/PyData Lithuania, and I had a great time there. The topic of my talk was an evaluation of some alternative dataframe libraries, including Polars, the one that I ended up enjoying the most.

I enjoyed it so much that this week I’m in Vilnius again, and I’ll be delivering a workshop at PyCon Lithuania 2023 called “Analyze your data at the speed of light with Polars and Kedro”.

In this blog post you will learn how using Polars in Kedro can make your data pipelines much faster, what the current status of Polars support in Kedro is, and what to expect in the near future. In case it’s the first time you’ve heard about Polars, I have included a short introduction at the beginning.

Let’s dive in!

What is the Polars library?

Polars is an open-source library for Python, Rust, and NodeJS that provides in-memory dataframes, out-of-core processing capabilities, and more. It is based on the Rust implementation of the Apache Arrow columnar data format (you can read more about Arrow in my earlier blog post “Demystifying Apache Arrow”), and it is optimised to be blazing fast.

[Figure: snippet of Polars code]

The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly became popular thanks to its easy installation process and its “lightning fast” performance.

I started experimenting with Polars one year ago, and it has now become my go-to data manipulation library. I have given several talks about it, for example at PyData NYC, where the room was full.

How do Polars and Kedro get used together?

If you want to learn more about Kedro, you can watch a video introduction on our YouTube channel.

Traditionally Kedro has favoured pandas as a dataframe library because of its ubiquity and popularity. This means that, for example, to read a CSV file, you would add a corresponding entry to the catalog:

openrepair-0_3-categories:
  type: pandas.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv

And then, you would use that dataset as input for your node functions, which would, in turn, receive pandas DataFrame objects:

def join_events_categories(
    events: pd.DataFrame,
    categories: pd.DataFrame,
) -> pd.DataFrame:
    ...

(This is just one of the formats supported by Kedro datasets of course! You can also load Parquet, GeoJSON, images… have a look at the kedro-datasets reference for a list of datasets maintained by the core team, or the #kedro-plugin topic on GitHub for some contributed by the community!)

The idea of this blog post is to teach you how you can use Polars instead of pandas for your catalog entries, which in turn allows you to write all your data transformation pipelines using Polars dataframes. For that, I crafted some examples that use the Open Repair Alliance dataset, containing more than 80 000 records of repair events across Europe.

And if you’re ready to start, let’s go!

Get started with Polars for Kedro

First of all, you will need to add kedro-datasets[polars.CSVDataSet] to your requirements. At the time of writing (May 2023), the code below requires development versions of both kedro and kedro-datasets, which you can declare in your requirements.txt or pyproject.toml as follows:

# requirements.txt

kedro @ git+https://github.com/kedro-org/kedro@3ea7231
kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets

# pyproject.toml

[project]
dependencies = [
    "kedro @ git+https://github.com/kedro-org/kedro@3ea7231",
    "kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets",
]

If you are using the legacy setup.py files, the syntax is very similar:

from setuptools import setup

setup(
    # install_requires is the setuptools keyword for runtime dependencies
    install_requires=[
        "kedro @ git+https://github.com/kedro-org/kedro@3ea7231",
        "kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets",
    ]
)

After you install these dependencies, you can start using the polars.CSVDataSet by setting the appropriate type in your catalog entries:

openrepair-0_3-categories:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv

And that’s it!

Reading real world CSV files with polars.CSVDataSet

It turns out that reading CSV files is not always that easy. The good news is that you can use the load_args parameter of the catalog entry to pass extra options to the polars.CSVDataSet, which mirror the function arguments of polars.read_csv. For example, if you want to attempt parsing the date columns in the CSV, you can set the try_parse_dates option to true:

openrepair-0_3-categories:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv
  load_args:
    # Doesn't make much sense in this case,
    # but serves for demonstration purposes
    try_parse_dates: true
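Conceptually, the load_args mapping from the catalog ends up as keyword arguments of the underlying reader. The sketch below imitates that idea using only the standard library: `TinyCSVDataSet` is a hypothetical stand-in, not Kedro's implementation, and it forwards to `csv.DictReader` instead of polars.read_csv so that it runs without extra dependencies.

```python
import csv
import io
from typing import Any, Dict, List, Optional


class TinyCSVDataSet:
    """Hypothetical stand-in for a catalog dataset (not Kedro's implementation)."""

    def __init__(self, text: str, load_args: Optional[Dict[str, Any]] = None):
        self._text = text
        self._load_args = load_args or {}

    def load(self) -> List[Dict[str, str]]:
        # The load_args mapping is unpacked into keyword arguments of the reader,
        # in the same way that `try_parse_dates: true` would become
        # `try_parse_dates=True` in a call to the real reader function.
        reader = csv.DictReader(io.StringIO(self._text), **self._load_args)
        return list(reader)


# Equivalent of a catalog entry with `load_args: {delimiter: ";"}`
dataset = TinyCSVDataSet(
    "id;name\n1;Laptop\n2;Toaster\n",
    load_args={"delimiter": ";"},
)
rows = dataset.load()
print(rows)
```

Because the options are passed through untouched, any keyword the reader accepts can be set from the catalog, as long as its value can be written in YAML.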

Some of these parameters must be Python objects: for example, polars.read_csv takes an optional dtypes parameter that can be used to specify the dtypes of the columns, as follows:

pl.read_csv(
    "data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv",
    dtypes={
        "product_age": pl.Float64,
        "group_identifier": pl.Utf8,
    }
)

Kedro catalog files only support primitive types. But fear not! You can use more sophisticated configuration loaders in Kedro that allow you to tweak how such files are parsed and loaded.

To pass the appropriate dtypes to read this CSV file, you can use the TemplatedConfigLoader, or alternatively the shiny new OmegaConfigLoader with a custom omegaconf resolver. Such a resolver will take care of parsing the strings in the YAML catalog and transforming them into the objects Polars needs. Place this code in your settings.py:

# settings.py

import polars as pl
from omegaconf import OmegaConf
from kedro.config import OmegaConfigLoader

if not OmegaConf.has_resolver("polars"):
    OmegaConf.register_new_resolver("polars", lambda attr: getattr(pl, attr))

CONFIG_LOADER_CLASS = OmegaConfigLoader

And now you can use the special OmegaConf syntax in the catalog:

openrepair-0_3-events-raw:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv
  load_args:
    dtypes:
      # Notice the OmegaConf resolver syntax!
      product_age: ${polars:Float64}
      group_identifier: ${polars:Utf8}
    try_parse_dates: true

Now you can access Polars data types with ease from the catalog!
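To make the mechanism less magical, here is a dependency-free imitation of what the resolver does to the parsed catalog. It does not use OmegaConf: `fake_pl`, `RESOLVERS`, and `resolve` are hypothetical names, and the placeholder strings stand in for the real Polars dtype objects.

```python
import re
from types import SimpleNamespace
from typing import Any

# Stand-in for the `polars` module so the sketch runs without dependencies
fake_pl = SimpleNamespace(Float64="<Float64 dtype>", Utf8="<Utf8 dtype>")

# Same idea as the lambda registered in settings.py: look up an attribute by name
RESOLVERS = {"polars": lambda attr: getattr(fake_pl, attr)}
PATTERN = re.compile(r"^\$\{(\w+):(\w+)\}$")


def resolve(value: Any) -> Any:
    """Recursively replace "${name:arg}" strings with the resolver's result."""
    if isinstance(value, dict):
        return {key: resolve(item) for key, item in value.items()}
    if isinstance(value, str):
        match = PATTERN.match(value)
        if match:
            name, arg = match.groups()
            return RESOLVERS[name](arg)
    return value


# What the YAML parser would produce for the load_args above
load_args = {
    "dtypes": {
        "product_age": "${polars:Float64}",
        "group_identifier": "${polars:Utf8}",
    },
    "try_parse_dates": True,
}
resolved = resolve(load_args)
print(resolved)
```

With the real resolver, the values under dtypes become the actual `pl.Float64` and `pl.Utf8` objects, while plain values such as try_parse_dates pass through unchanged.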

Future plans for Polars integration in Kedro

This all looks very promising, but it’s only the tip of the iceberg. First of all, these changes need to land in stable versions of kedro and kedro-datasets. More importantly, we are working on a generic Polars dataset that will be able to read other file formats, for example Parquet, which is faster to read, more compact, and easier to use than CSV.

Polars makes me so excited about the future of data manipulation in Python, and I hope that all Kedro users are able to leverage this amazing project in their data pipelines very soon!
