dev-resources.site
for different kinds of informations.
Data Analysis with Redshift Serverless and Quicksight - Part 1
In the first part of this blog, we will focus mainly on setting up Redshift serverless cluster and configuring access to external WorldWide Event Attendance Data Exchange via Redshift DataSharing Feature so it can be accessed from the database in the cluster provisioned. We will then run some queries and unload data to S3. The other features provided by Redshift such as cluster performance monitoring, data recovery and guarding against surprise bills will also be touched upon. In the second part, we will setup Quicksight and connect to our Redshift cluster, to access the data and build dashboards to generate some interesting insights.
Note that the queries would cost between $20-$30 as we are using the entire dataset with minimal filtering. Serverless will try and optimise the computation by scaling to more RPUs which will increase cost. In addition, you will also be charged for data storage. This is still well within the free trial. You can adjust the RPU base capacity or set usage limits which is explained more in the Data Recovery, Monitoring and Cost Management
section.
Worldwide Event Attendance is a free product available in AWS Marketplace, which allows subscribers to query, analyse and build applications quickly. Instructions on how to subscribe to this product can be found here
Once subscribed to, we need to create a datashare in the Redshift cluster to access the data immediately.
We will use Redshift Serverless, the serverless offering of Redshift which removes the need for setting up and managing underlying cluster specs and scaling. All new users get free $300 credits for a trial period of 3 months. Sign in to the Redshift console and select Serverless free trial. You are only billed according to capacity used in a given duration (RPU hours), which scales automatically to optimise running the query. There is also a charge for Redshift Managed Storage (RMS).
Redshift Serverless Setup
If this is the first time using Redshift Serverless, then need to create a default workgroup. A workgroup is a collection of compute resources (VPC subnet groups, security groups, RPUs) and can be associated with one namespace - collection of database objects and users comprising tables, users, schemas, KMS keys etc.
- Sign in to the Redshift Serverless console and choose Create workgroup.
- Specify a value for Workgroup name: e.g. default
- In Network and Security option, choose VPC and security groups. I have chosen the default VPC and associated security group. In addition, I also created another security group to allow access to Quicksight, which we will need to later connect to our database to access the data and create dashboards. Setting up the security group for this can be found here
The security group you choose should have an inbound rule to allow traffic to port 5439 (default Redshift port) from CIDR address range where Quicksight was created. This can be looked up here. For example, if Quicksight is configured in us-east-1
, the associated IP address range for data source connectivity is 52.23.63.224/27
- Select Create a New namespace and specify the name
- Under admin user credentials, setup a username and password to connect to the database in the cluster. We will use this later when manually connecting to the redshift endpoint from Quicksight.
We will now need to create a role to associate with redshift serverless endpoint so it can unload data into S3.
- In the Permissions section, Click the Create IAM role option in the Manage IAM roles dropdown in the Associate Roles subsection. Select the option Specific S3 buckets and select the S3 bucket created for unloading the nodes and relationship data to. Then click Create IAM role as default. This will create an IAM role as the default with AWS managed policy AmazonRedshiftAllCommandsFullAccessattached. This includes permissions to run SQL commands to COPY, UNLOAD, and query data with Amazon Redshift Serverless.
- Now back in the Associate IAM roles section, select Associate IAM role and tick the role just created.
- By default, KMS encryption is provided with AWS owned key. The Encryption and security section can be skipped unless you want to provide your own KMS key for encryption and enable database logging.
- Click Next
- In the Review Step, check that all the options and configuration are set correctly and select Create.
Now go back to the Serverless Dashboard and check the workgroup list to see the workgroup and namespace created, with status showing as 'Available'.
We will also need to connect Quicksight to our Redshift endpoint. Hence we will need to make the database publicly accessible to allow access from outside the VPC. The VPC security group we configured earlier should have an inbound rule to only allow access from the Quicksight.
- In the navigation panel on the left, click on Workgroup configuration and select the Workgroup created
- In the Network and security panel under the Data access tab, the Publicly accessible option by default is turned off. Click the edit button (highlighted in yellow in the screenshot) and Turn on Public Accessible option.
Note if our Redshift cluster was in a private subnet, we would need to create a private connection from Quicksight to the VPC in which the cluster is located, as described here.
Accessing the DataShare
Redshift data sharing allows you to share live data across clusters with different accounts and regions with relative ease. This also decouples storage and compute and ensures access to live data and consistency, without the need to copy or move data.
We now want to create a datashare to access the Worldwide Event Attendance data exchange from our cluster. Carry out the following steps:
- Navigate to the Redshift Serverless dashboard in the Amazon Redshift console and select default namespace.
- Navigate to the Datashares tab and scroll down to subscriptions to AWS Data Exchange datashares.
- Click on the datashare
worldwide_event_test_data
. - Choose Create database from datashare.
- In the Create database from datashare pop-up, specify
worldwide_event_test_data
as the Database name. - Choose Create. You will see a message confirming successful database creation. You are now ready to run read-only queries on this database.
Running Example Queries in the Editor
In order to successfully query the datashare database, you are required to connect to redshift cluster using the cluster native database first and then use cross-database query notation <shareddatabase>.<schema>.<object>
to query the data in shared database.
- Navigate to the Redshift query editor v2 page. Select your Serverless workgroup (default), and you will be presented a window to select authentication as a Federated User (which will generate temporary credentials) or provide Database Username and Password that was setup during the cluster setup.
- If selecting Federated User, set database name to the database in your cluster 'dev'. Then click save
- For Database Username and Password authentication, set the database name 'dev' and input your usernmame and password. Then click save.
This will also save the credentials in AWS Secrets In the Secrets Manager console, choose Secrets, and then choose the secret. Scroll down to the Secret Value section in the Secrets Detail page and click Retrieve Secret Value
on the right hand side. You will see the username and password saved as key value pairs for the associated engine
, dbname
, port
and dbClusterIdentifer
.
You will wonder why we just do not connect directly to the worldwide_event_data_exchange
datashare. This is because Amazon Redshift data sharing has the following considerations, as detailed in the docs :
- Connecting directly to datashare database is not possible.
- As a datashare user, you can still only connect to your local cluster database.
- Creating databases from a datashare does not allow you to connect to them, but you can read from them.
If you try and connect directly as a federated user for example, then you will get the following error.
We are now ready to run queries in the editor. Before moving on, as mentioned previously running all the queries once to join all the tables with most of the data will incur a cost (probably less than $20) and still within the free credits. Should you wish to control this you can reduce the RPU base capacity which defaults to 128 RPUs when creating the cluster (where 1 RPU provides 16 GB memory). This is discussed in the Data Recovery, Monitoring and Cost Management
section.
In the query editor, run the sql block below. This will join the event, sales, venue, category and date tables and create a view of the results for the aggregated ticket price, commissions and total number of ticket sold for a given event in venue on a calendar date.
CREATE VIEW worldwide_events_vw AS
SELECT caldate, day, week, month, qtr, year, holiday, eventname, catname, venuename, venuecity, venueseats, SUM(pricepaid) AS total_price, SUM(commission) AS total_commission, SUM(qtysold) AS total_tickets
FROM "worldwide_event_data_exchange"."public"."event" as event
JOIN "worldwide_event_data_exchange"."public"."sales" as sales
on event.eventid=sales.eventid
JOIN (SELECT * FROM "worldwide_event_data_exchange"."public"."venue" WHERE venueseats > 0) as venue
on event.venueid=venue.venueid
JOIN "worldwide_event_data_exchange"."public"."category" as cat
on event.catid=cat.catid
JOIN "worldwide_event_data_exchange"."public"."date" AS datetable
ON datetable.dateid=event.dateid
GROUP BY caldate, day, week, month, qtr, year, holiday, eventname, catname, venuename, venuecity, venueseats
with no schema binding;
By default, creating views from external tables is not supported and will throw an error as below. The with no schema binding
statement at the end of the query is to allow us to create the view successfully.
If the query ran successfully, the view should be visible from the check that the view is visible under the view dropdown after refresh (highlighted in blue in the screenshot below). Now lets check the data in the view, by running a SELECT * FROM "public"."worldwide_events_vw"
. The results should be similar to the screenshot below.
Unloading data to S3
Create a S3 bucket or use an existing one if you wish. I have created a new S3 bucket redshift-worldwide-events with an events folder. The UNLOAD SQL command allows users to unload the results of the query to a S3 bucket. This requires the query, S3 path and IAM role arn to allow permissions to write to the S3 bucket. In addition we will be adding the following extra options below with the command, to create a single csv file to include the header.
CSV DELIMITER AS: to use csv format with delimiter as ','
HEADER: specify first row as header row
CLEANPATH: to remove any existing S3 file before unloading new query
PARALLEL OFF: turn off parallel writes as we want a single CSV files rather than multiple partitions.
So the query will look like below.
unload ('SELECT * FROM worldwide_events_vw')
to 's3://redshift-worldwide-events/events/worldwide_events'
iam_role '<your-iam-role-arn>'
CSV DELIMITER AS ','
HEADER
cleanpath
parallel off
The s3 bucket path needs to be in the format <s3://object-path/name-prefix>
. So if the bucket created is redshift-worldwide-events
with an events folder, then the object path becomes redshift-worldwide-events/events
. The 'name-prefix' is the object name prefix to set which gets concatenated with a slice number (if it is a single file then 000). So if the 'name-prefix' is set as worldwide_events
then the object stored will be names worldwide-events000
. The iam role arn will be the arn of the role we associated with the redshift cluster when setting up. You can find it in the dashboard by navigating to namespace configuration (highlighted in yellow in screenshot below).
If the query is successful, you should see a success message in the editor as below.
Navigate to the s3 bucket and check that the file is visible. This can now be downloaded or accessed via other AWS services for further analysis as required.
Data Recovery, Monitoring and Cost Management
Recovery points in Amazon Redshift Serverless are created approximately every 30 minutes and saved for 24 hours.
In the Redshift Serverless console, the data backup tab shows the list of database backups which can be restored if required in case there is a failure. Given that backups are only available for 24 hours, we can also create a snapshot from a backup to be used at a later time if required.
We can monitor cluster performance via CloudWatch metrics in the Redshift console and CloudWatch, such as CPU utilization, latency etc for monitoring cluster performance. In addition, we can also monitor database query and load events directly in the console at a 1 minute resolution. The screenshot below shows the RPU capacity used for some of the queries executed. As the number of queries increases, Redshift Serverless automatically scales to optimise performance.
Based on usage patterns, we can monitor usage and control billing by updating the Base Capacity and Max RPU hours per day, week or month in the respective sections of the workgroup configuration dashboard as shown in the screenshot below. The default RPU is 128 which can be reduced to a minimum of 8 for simpler queries on smaller data. Under the Manage Usage limits
we can set a maximum RPU-hours limit by frequency and decide what action needs to be taken if breached (e..g turn off user queries, send an alert). Details on how to configure this is described in more detail in the docs
Cloudwatch alarms can also be set up from the Redshift serverless dashboard by choosing alarms from the navigation menu and selecting create alarm. We can then set a alarm for a metric to track in either the namespace or workgroup. In the screenshot below, I have set an alarm to monitor compute capacity and trigger when it exceeds a threshold of 80 RPU for 10 periods of 1 minute each. However, you can lower the threshold and/or number of consecutive periods if required. The minimum duration of a period is 1 minute and can be increased in denominations from the dropdown as shown below.
In the serverless dashboard, you can now see the status of the alarm. It will first start off in INSUFFICIENT_DATA
state until it gathers more data to evaluate and change state to either OK
or ALARM
.
Clicking on the View in CloudWatch
widget allows users to view more details in the CloudWatch dashboard like the alarm status over the most recent time period and history of the alarm state changes as shown in the screenshot below. For a more detailed explanation on Alarm Periods and Evaluation Periods, please refer to the AWS documentation
Clicking on the date for the state change in the history tab brings up a summary of the state change in json format. Here we can diagnose further why an alarm changed state from OK to ALARM status for example in the screenshot below. The threshold exceeded 80 RPUs for 10 consecutive datapoints (each evaluated after 1 min period), which was the configuration we set. This was probably a longer running query which required a higher compute capacity (128 RPUs) in this case.
That concludes the first part of this tutorial. Continue to part 2 for creating visualisations in Quicksight.
Featured ones: