Quick tip: Using SingleStore for Iceberg Catalog Storage

Published at
7/4/2024
Categories
singlestoredb
apacheiceberg
catalog
jdbc
Author
veryfatboy

Abstract

SingleStore recently announced bi-directional support for Apache Iceberg. Catalogs are an integral part of the Iceberg table format, which is designed to manage large-scale tabular data efficiently and reliably. Catalogs store metadata and track the location of tables, enabling data discovery, access, and management. Iceberg supports multiple catalog backends, including Hive Metastore, AWS Glue, Hadoop, and relational database systems accessed through JDBC. This allows users to choose the most suitable backend for their data infrastructure. In this short article, we'll implement an Iceberg catalog using SingleStore and JDBC.

The notebook file used in this article is available on GitHub.

Introduction

The JDBC catalog in Apache Iceberg is a specialised catalog implementation that uses a relational database system to store metadata about Iceberg tables. This option relies on the transactional guarantees and scalability of relational database systems to manage and query metadata efficiently. The JDBC catalog is a good choice for environments where relational database systems are already in use or preferred. The database behind the JDBC connection must support atomic transactions.
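To see why atomic transactions matter, here is a minimal sketch of a catalog commit written as a conditional update. It is an illustration only: an in-memory SQLite database stands in for the relational backend, the column names follow the iceberg_tables output shown later in this article, and the primary key, the metadata file names and the commit_metadata helper are assumptions made for the example rather than Iceberg's actual implementation.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the relational backend.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE iceberg_tables (
        catalog_name TEXT NOT NULL,
        table_namespace TEXT NOT NULL,
        table_name TEXT NOT NULL,
        metadata_location TEXT,
        previous_metadata_location TEXT,
        PRIMARY KEY (catalog_name, table_namespace, table_name)
    )
""")

# Hypothetical metadata file names, used only for this illustration.
V0 = "warehouse/db/iris/metadata/v0.metadata.json"
V1 = "warehouse/db/iris/metadata/v1.metadata.json"
V1B = "warehouse/db/iris/metadata/v1b.metadata.json"

conn.execute(
    "INSERT INTO iceberg_tables VALUES (?, ?, ?, ?, NULL)",
    ("s2_catalog", "db", "iris", V0),
)
conn.commit()


def commit_metadata(conn, expected, new):
    """Move the metadata pointer to `new` only if it still equals `expected`."""
    with conn:  # one transaction: committed on success, rolled back on error
        cur = conn.execute(
            """
            UPDATE iceberg_tables
               SET metadata_location = ?, previous_metadata_location = ?
             WHERE catalog_name = ? AND table_namespace = ? AND table_name = ?
               AND metadata_location = ?
            """,
            (new, expected, "s2_catalog", "db", "iris", expected),
        )
    return cur.rowcount == 1


# Two writers both start from V0; only the first commit can succeed.
first = commit_metadata(conn, V0, V1)
second = commit_metadata(conn, V0, V1B)
print(first, second)
```

The second writer's update matches no row because the pointer has already moved, so it would need to refresh and retry instead of silently overwriting the first commit.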

Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:

  • Workspace Group Name: Iceberg Demo Group
  • Cloud Provider: AWS
  • Region: US East 1 (N. Virginia)
  • Workspace Name: iceberg-demo
  • Size: S-00

We'll make a note of the password and store it in the secrets vault using the name password.

Import the notebook

We'll download the notebook from GitHub.

From the left navigation pane in the SingleStore cloud portal, we'll select DEVELOP > Data Studio.

In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.

Run the notebook

After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.

We'll use Apache Spark to create a tiny Iceberg Lakehouse in the SingleStore portal for testing purposes.

For production environments, please use a robust file system for your Lakehouse.

For the SparkSession, we'll need two packages (SingleStore JDBC Client and Iceberg Spark Runtime), as follows:

from pyspark.sql import SparkSession

# List of Maven coordinates for all required packages
maven_packages = [
    "com.singlestore:singlestore-jdbc-client:1.2.3",
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2"
]

# Create Spark session with all required packages
spark = (SparkSession
             .builder
             .config("spark.jars.packages", ",".join(maven_packages))
             .appName("Spark Iceberg Catalog Test")
             .getOrCreate()
        )

spark.sparkContext.setLogLevel("ERROR")

In the Iceberg Lakehouse, we'll store the Iris flower data set. We'll first download the Iris CSV file into a pandas DataFrame and then convert this to a Spark DataFrame.
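The loading step could look something like the following sketch. It assumes a headerless CSV whose columns match the names in the example output later in this article; the load_iris and to_spark helpers are hypothetical, and two inline sample rows stand in for the downloaded file.

```python
import io

import pandas as pd

# Column names taken from the example query output in this article.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def load_iris(csv_source):
    """Read a headerless Iris CSV (path, URL, or file-like object) into pandas."""
    return pd.read_csv(csv_source, header=None, names=COLUMNS)


def to_spark(spark, pandas_df):
    """Convert the pandas DataFrame to a Spark DataFrame using an existing session."""
    return spark.createDataFrame(pandas_df)


# Two inline sample rows stand in for the real file (assumption for this sketch).
sample = io.StringIO(
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "5.8,2.7,5.1,1.9,Iris-virginica\n"
)
pandas_df = load_iris(sample)
print(pandas_df.dtypes)
```

With the SparkSession created earlier, iris_df = to_spark(spark, pandas_df) would give the Spark DataFrame used in the rest of the article.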

We'll need to create a SingleStore database to use with Iceberg:

DROP DATABASE IF EXISTS iceberg;
CREATE DATABASE IF NOT EXISTS iceberg;

A quick and easy way to find the connection details for the database is to use the following:

from sqlalchemy import create_engine

# connection_url is assumed to be provided by the notebook environment
db_connection = create_engine(connection_url)
url = db_connection.url

The url will contain the host, the port, and the database name. We can use all these details to configure Spark:

spark.conf.set("spark.sql.catalog.s2_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.s2_catalog.type", "jdbc")
spark.conf.set("spark.sql.catalog.s2_catalog.warehouse", "warehouse")

# SSL/TLS configuration
spark.conf.set("spark.sql.catalog.s2_catalog.jdbc.useSSL", "true")
spark.conf.set("spark.sql.catalog.s2_catalog.jdbc.trustServerCertificate", "true")

# JDBC connection URL
spark.conf.set("spark.sql.catalog.s2_catalog.uri", f"jdbc:singlestore://{url.host}:{url.port}/{url.database}")

# JDBC credentials (password holds the secret stored earlier under the name "password")
spark.conf.set("spark.sql.catalog.s2_catalog.jdbc.user", "admin")
spark.conf.set("spark.sql.catalog.s2_catalog.jdbc.password", password)
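Outside a notebook, the same JDBC URI can be derived from a connection string with the standard library alone. The sketch below is a hedged illustration: the jdbc_uri_from helper, the example host name and the example credentials are all hypothetical, and the default port of 3306 is used only when the connection string omits one.

```python
from urllib.parse import urlsplit


def jdbc_uri_from(connection_url: str, default_port: int = 3306) -> str:
    """Build a jdbc:singlestore:// URI from a SQLAlchemy-style connection URL."""
    parts = urlsplit(connection_url)
    port = parts.port or default_port
    database = parts.path.lstrip("/")
    # Credentials are deliberately left out of the URI; they are passed
    # separately through the jdbc.user and jdbc.password catalog properties.
    return f"jdbc:singlestore://{parts.hostname}:{port}/{database}"


# Hypothetical connection string, for illustration only.
example = "singlestoredb://admin:secret@svc-example.singlestore.com:3306/iceberg"
print(jdbc_uri_from(example))
```

Keeping the user name and password out of the URI mirrors the configuration above, where they are set as separate catalog properties.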

Finally, we can test our setup.

First, we'll store the data from the Spark DataFrame in the Lakehouse, partitioned by the species column:

(iris_df.write
    .format("iceberg")
    .partitionBy("species")
    .save("s2_catalog.db.iris")
)

Next, we'll check what's stored, as follows:

spark.sql("""
    SELECT file_path, file_format, partition, record_count
    FROM s2_catalog.db.iris.files
""").show()

Example output:

+--------------------+-----------+-----------------+------------+
|           file_path|file_format|        partition|record_count|
+--------------------+-----------+-----------------+------------+
|warehouse/db/iris...|    PARQUET| {Iris-virginica}|          50|
|warehouse/db/iris...|    PARQUET|    {Iris-setosa}|          50|
|warehouse/db/iris...|    PARQUET|{Iris-versicolor}|          50|
+--------------------+-----------+-----------------+------------+

We can run queries on our tiny Lakehouse:

spark.sql("""
    SELECT * FROM s2_catalog.db.iris LIMIT 5
""").show()

Example output:

+------------+-----------+------------+-----------+--------------+
|sepal_length|sepal_width|petal_length|petal_width|       species|
+------------+-----------+------------+-----------+--------------+
|         6.3|        3.3|         6.0|        2.5|Iris-virginica|
|         5.8|        2.7|         5.1|        1.9|Iris-virginica|
|         7.1|        3.0|         5.9|        2.1|Iris-virginica|
|         6.3|        2.9|         5.6|        1.8|Iris-virginica|
|         6.5|        3.0|         5.8|        2.2|Iris-virginica|
+------------+-----------+------------+-----------+--------------+

We'll now delete all Iris-virginica records:

spark.sql("""
    DELETE FROM s2_catalog.db.iris
    WHERE species = 'Iris-virginica'
""")

We'll then check the Lakehouse again:

spark.sql("""
    SELECT file_path, file_format, partition, record_count
    FROM s2_catalog.db.iris.files
""").show()

Example output:

+--------------------+-----------+-----------------+------------+
|           file_path|file_format|        partition|record_count|
+--------------------+-----------+-----------------+------------+
|warehouse/db/iris...|    PARQUET|    {Iris-setosa}|          50|
|warehouse/db/iris...|    PARQUET|{Iris-versicolor}|          50|
+--------------------+-----------+-----------------+------------+
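The Iris-virginica file disappears from the listing rather than shrinking, which is what we would expect when a delete filter matches a whole partition: the data file can simply be dropped from the table's metadata without rewriting any rows. The toy model below illustrates that idea in plain Python. It is not the Iceberg API; the DataFile class and delete_partition helper are hypothetical, and the partitions and record counts are taken from the example output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    partition: str
    record_count: int


# One data file per species, as in the first files listing.
files = [
    DataFile("Iris-virginica", 50),
    DataFile("Iris-setosa", 50),
    DataFile("Iris-versicolor", 50),
]


def delete_partition(files, species):
    """Drop every file whose partition equals `species`; no rows are rewritten."""
    kept = [f for f in files if f.partition != species]
    removed_rows = sum(f.record_count for f in files if f.partition == species)
    return kept, removed_rows


kept, removed_rows = delete_partition(files, "Iris-virginica")
print([f.partition for f in kept], removed_rows)
```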

We can also check the metadata stored in SingleStore:

SELECT * FROM iceberg_tables;

Example output:

+--------------+-----------------+------------+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
| catalog_name | table_namespace | table_name | metadata_location                                                                   | previous_metadata_location                                                          |
+--------------+-----------------+------------+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
| s2_catalog   | db              | iris       | warehouse/db/iris/metadata/00001-6ea55045-6162-4462-9f8c-597ddbc5b846.metadata.json | warehouse/db/iris/metadata/00000-39743969-9e4b-4875-81ad-d8310656d28f.metadata.json |
+--------------+-----------------+------------+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
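In this output, the numeric prefix of each metadata file name looks like a commit counter: the current pointer starts with 00001 and the previous one with 00000. The short sketch below extracts that prefix from the two locations shown above. Reading it as a version number is an assumption based on this output, and the metadata_version helper is hypothetical.

```python
import re

# <number>-<uuid>.metadata.json at the end of the path
METADATA_FILE = re.compile(r"(?:^|/)(\d+)-([0-9a-f-]{36})\.metadata\.json$")


def metadata_version(location: str) -> int:
    """Return the numeric prefix of an Iceberg metadata file name."""
    match = METADATA_FILE.search(location)
    if match is None:
        raise ValueError(f"not a metadata file location: {location}")
    return int(match.group(1))


current = "warehouse/db/iris/metadata/00001-6ea55045-6162-4462-9f8c-597ddbc5b846.metadata.json"
previous = "warehouse/db/iris/metadata/00000-39743969-9e4b-4875-81ad-d8310656d28f.metadata.json"
print(metadata_version(current), metadata_version(previous))  # prints: 1 0
```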

Summary

In this short article, we've seen how to configure SingleStore to manage an Iceberg Lakehouse catalog. Using a simple example, we've run some queries on our Lakehouse and SingleStore has managed the metadata for us using JDBC.
