Logo

dev-resources.site

for different kinds of informations.

Unlearning what you know about relational databases to unlock the power of Redshift

Published at
9/2/2023
Categories
redshift
aws
dataengineering
data
Author
nelyanne
Categories
4 categories in total
redshift
open
aws
open
dataengineering
open
data
open
Author
8 person written this
nelyanne
open
Unlearning what you know about relational databases to unlock the power of Redshift

My first encounter with the concept of columnar databases was when I learned about BigQuery offered by GCP. Initially, I wasn't sure what made them so different, as they didn't appear to be very different from standard relational databases. You are still creating tables where the row structure is strictly defined, and you can query them using familiar SQL syntax. However, those databases are architectured in a totally different way, and it affects how you should be working with them. After I gained practical experience with Redshift I understood that I needed to turn some mental switches in my brain in order to use it effectively. Here's a short list of important differences between Redshift and the traditional databases, and how they impact the optimal schema design.

No primary keys

In relational databases, defining a primary key is often one of the first things to define when designing a new table. Redshift doesn't have a concept of primary keys, or constraints in general. What does it mean in practice? There is no straightforward way to enforce the uniqueness of values in your columns. You need to be very careful in your ingestion process to ensure that you are not inserting duplicates. Additionally, it's good to have validation rules in place that would routinely check for the presence of duplicates, and have a remediation plan if they are detected.

No foreign keys

In relational databases, joining even large tables can be quick and easy - thanks to the existence of foreign keys. You cannot define foreign keys in Redshift. As a consequence, joins can become painfully slow, especially if tables on two sides of the join are distributed using a different column. Consider a situation where you are running a Redshift on a 6-node cluster and you are trying to join the table ORDERS (distributed by customer ID) with the table DELIVERIES (distributed by location). It will mean that each of the 6 chunks of the ORDERS table will have to be matched against 6 pieces of the DELIVERIES table, which will require a lot of inter-node communication.

How to join the tables effectively if you cannot define foreign keys?

  • Distribute both tables used in the join by the same column (the one used in the join).
  • For small tables (like static configuration, dictionary tables), use distribution style ALL, so each Redshift node will hold a copy of the entire table.
  • If the options above are not possible:
    • Do the JOIN before the ingestion step (for example, during extraction from the source database) and store the rows in a denormalized format.
    • Define the join logic in a materialized view.

Only one SORTKEY and DISTKEY per table

In a relational database, adding another index to a table is often the first option considered when experiencing performance issues (which doesn't mean it's always the best solution). As a result, you may end up with a single table where most of columns are indexed, to support queries with conditions applying to different columns.

In Redshift, you don't have this luxury - each table can have only one key used for sorting and one key used to distribute the table between nodes. You have to be mindful of your choice! Understand your query patterns and design the sort key and distribution key accordingly. If you have use cases where the same table needs to be queried using different filtering criteria, you can create a materialized view that will use a different distribution key than the source table.

Impact of SORT KEY choice

Let's say that you decided that the records in your table will be sorted by the update_date. As a consequence:

  • Redshift will work great if you want to query all rows last updated within a certain time frame and calculate aggregated statistics for them.
  • Redshift will not work so great when you want to find a row with a specific ID, and you don't know when was the last time it was updated. (This task would be trivial on a relational database with an index on ID.)

Impact of DIST KEY choice

Let's say that you want to distribute data in your ORDERS table based on customer_id. It would result in all orders done by a single customer to be stored on the same Redshift node.

  • If you want to calculate the number of orders per customer, each part of the aggregation will take place locally on a single node. Calculations will be spread between the nodes, fully utilizing your cluster's resources.
  • If you want to calculate the statistics for a single customer, Redshift will be able to use the power of only one node. It can result in sub-optimal performance. You can solve it by extending distribution key definition to include another column (for example, based on the date of the order).
  • If some customers have significantly more orders than the orders, your data can become skewed and the load on different nodes will be uneven (some nodes will store more orders than the others). If this happens, Redshift should be able to solve it by redistributing the keys between the nodes.

Conclusion

This list is by no means exhaustive, especially as Redshift is still a developing technology, with new features released every quarter. Before you can fully enjoy the benefits it offers (scalability, cost-effectiveness, high-performance), you need to forget some of your old routines and embrace its uniqueness.

redshift Article's
30 articles in total
Favicon
Securing Amazon Redshift - Best Practices for Access Control
Favicon
Migrate from Native Google to AWS Redshift: Benefits and Best Practices
Favicon
A Comprehensive Guide to Establishing a Successful Connection to Amazon Redshift Using the ODBC Driver
Favicon
Cloud Data Warehouse Comparison: Who’s the Real MVP?
Favicon
Move Data from DynamoDB to Redshift Using Estuary
Favicon
Amazon Redshift Workload Management (WLM): A Step-by-Step Guide
Favicon
Building a Scalable Data Platform: Addressing Uncertainty in Data Requirements with AWS
Favicon
Step by Step process to setup Redshift datashare across Redshift clusters
Favicon
Amazon Redshift guia de estudio (Comunidad AWS ML Latam)
Favicon
Data Governance on AWS using DataZone
Favicon
How to Migrate Amazon Redshift to a Different Account and Region: Step-by-Step Guide
Favicon
Channel Your Inner Scrooge with Redshift Reserved Instances: Slash Your Cloud Bill Like a Boss
Favicon
Streamline SSO Access to AWS Redshift Query Editor with Okta and Terraform
Favicon
Two ways to manage secrets for AWS Redshift Serverless with AWS Secrets Manager !!
Favicon
Three Ways to Retrieve Row Counts in Redshift Tables and Views
Favicon
Optimising Sentiment Analysis Workflows: AWS Zero-ETL and Amazon Redshift Synergy-Part 1
Favicon
Leveraging AWS Redshift for Your Organization's Needs
Favicon
From Relational to Analytical: The Power of Redshift Data Warehousing and Analytics
Favicon
Interactions Tracker, Part 3: Why I stopped and Lessons Learned
Favicon
A Comprehensive Guide to AWS DynamoDB vs. Redshift for Databases and Data Warehouses
Favicon
Understanding Redshift
Favicon
Unlearning what you know about relational databases to unlock the power of Redshift
Favicon
4 reasons why your lambda function cannot communicate with RedShift
Favicon
Query Amazon Redshift from YugabyteDB though PostgreSQL Foreign Data Wrapper and VPC peering
Favicon
Run analytical queries on SAP data with Redshift Serverless powered by Amazon AppFlow
Favicon
AWS Redshift: Robust and Scalable Data Warehousing
Favicon
How to build your own data platform. Episode 2: authorization layer. Data Warehouse implementation.
Favicon
Data Analysis with Redshift Serverless and Quicksight - Part 1
Favicon
DBT + REDSHIFT = ❤
Favicon
Analytics on AWS — Amazon Redshift

Featured ones: