How to build your own data platform. Episode 2: authorization layer. Data Warehouse implementation.
Introduction.
This article is the second part of the episode about building an authorization layer for your data platform. You can find the whole list of articles at this link: https://medium.com/@gu.martinm/list/how-to-build-your-own-data-platform-9e6f85e4ce39
In the previous article we talked about how to implement the authorization layer in the Data Lake. In this second part we cover the same topic for the Data Warehouse.
Authorization layer.
This diagram shows the Lakehouse with its metastore and the Data Warehouse. We already covered the authorization layer for the Lakehouse in the previous article. Now it is the Data Warehouse's turn.
Because we will be using Amazon Web Services with AWS Redshift, we will be implementing this layer using Lake Formation.
Processing layer.
Human users and processes access the stored data through the authorization layer. Examples of such machines and processes are Zeppelin notebooks, AWS Athena for SQL, AWS EMR clusters, Databricks, and so on.
The problem with the authorization.
Data engineers, data analysts and data scientists work in different and sometimes isolated teams. They do not want their data to be deleted or changed by tools or people outside their teams.
Data owners are typically in charge of granting access to their data.
Owner-consumer relationship.
A data consumer requests access to some data owned by a different team in a different domain. For example, a table in a database.
The data owner grants access by approving the access request.
Upon the approval of an access request, a new permission is added to the specific table.
Our authorization layer must provide this capability if we want to implement a data mesh successfully.
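The request, approval and permission flow described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how Lake Formation or Redshift store permissions; the class names, team names and table name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AccessRequest:
    """A consumer asks for read access to a table owned by another team."""
    consumer: str
    table: str
    owner: str


@dataclass
class AuthorizationLayer:
    # Maps a table name to the set of consumers allowed to read it.
    permissions: dict = field(default_factory=dict)

    def approve(self, request: AccessRequest, approver: str) -> None:
        # Only the data owner may approve a request for its own table.
        if approver != request.owner:
            raise PermissionError(f"{approver} does not own {request.table}")
        # Approval adds a new permission to that specific table.
        self.permissions.setdefault(request.table, set()).add(request.consumer)

    def can_read(self, consumer: str, table: str) -> bool:
        return consumer in self.permissions.get(table, set())


# Hypothetical teams and table, for illustration only.
layer = AuthorizationLayer()
request = AccessRequest(consumer="analytics_team", table="sales.orders", owner="sales_team")

print(layer.can_read("analytics_team", "sales.orders"))  # False
layer.approve(request, approver="sales_team")
print(layer.can_read("analytics_team", "sales.orders"))  # True
```

The point of the sketch is that the permission is attached to the table and only the owner can create it, which is the behaviour the real authorization layer has to reproduce.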
Data Warehouse, AWS Redshift.
The Data Warehouse is implemented on top of AWS Redshift. A few years ago Amazon released a new generation of Redshift nodes called RA3. What makes RA3 different from the older Redshift node types is that compute and storage are separated. Before RA3, users who needed more storage also had to pay for more compute, even when compute was not a bottleneck. Conversely, users who needed more compute also had to pay for more storage. As a result, Redshift costs were typically high.
We will be using AWS Redshift RA3. Here are some useful links that explain in more detail what AWS Redshift and AWS Redshift RA3 are:
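A small arithmetic sketch may help to see why coupling storage and compute is expensive. The capacities and workload numbers below are made up for illustration and are not real Redshift node specifications or prices.

```python
import math


def coupled_nodes(storage_tb, compute_units, tb_per_node, units_per_node):
    """Nodes needed when every node bundles a fixed amount of storage and compute."""
    by_storage = math.ceil(storage_tb / tb_per_node)
    by_compute = math.ceil(compute_units / units_per_node)
    # The cluster must satisfy the larger of the two requirements.
    return max(by_storage, by_compute)


def decoupled_nodes(compute_units, units_per_node):
    """Nodes needed when storage is billed separately and nodes only provide compute."""
    return math.ceil(compute_units / units_per_node)


# Hypothetical workload: a lot of data, modest compute needs.
storage_tb, compute_units = 40, 4
# Hypothetical node: 2 TB of storage and 2 compute units each.
tb_per_node, units_per_node = 2, 2

print(coupled_nodes(storage_tb, compute_units, tb_per_node, units_per_node))  # 20
print(decoupled_nodes(compute_units, units_per_node))                         # 2
```

In the coupled model the storage requirement forces 20 nodes even though 2 would cover the compute. With separated storage, the cluster is sized for compute only and storage is paid for by volume.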
- https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html
- https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/
Data Warehouse, AWS Redshift RA3.
Amazon Redshift data sharing allows you to securely and easily share data for read purposes across different Amazon Redshift clusters without the complexity and delays associated with data copies and data movement. Data can be shared at many levels, including schemas, tables, views, and user-defined functions, providing fine-grained access controls that can be tailored for different users and businesses that all need access to the data.
Lake Formation can be integrated with data sharing.
For further information visit the following links:
- https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-data-sharing-preview/
- https://aws.amazon.com/blogs/big-data/centrally-manage-access-and-permissions-for-amazon-redshift-data-sharing-with-aws-lake-formation/
Authorization, Federated Lake Formation.
Using Lake Formation with AWS Redshift RA3, we can manage permissions across different accounts from a single central account in a federated way. We delegate permissions to other accounts but keep control of them.
Authorization, implementation.
To implement federated authorization with AWS Redshift RA3, you can follow these steps:
AWS Redshift RA3, producer account:
- CREATE DATASHARE producer_sharing;
- GRANT USAGE ON DATASHARE producer_sharing TO ACCOUNT 'FEDERATED_GOVERNANCE' VIA DATA CATALOG;
- ALTER DATASHARE producer_sharing ADD SCHEMA producer_schema;
AWS Redshift RA3, consumer account:
- CREATE DATASHARE consumer_sharing;
- GRANT USAGE ON DATASHARE consumer_sharing TO ACCOUNT 'FEDERATED_GOVERNANCE' VIA DATA CATALOG;
- ALTER DATASHARE consumer_sharing ADD SCHEMA consumer_schema;
AWS Redshift RA3, main federated account:
- Through the Lake Formation console, allow access from the consumer account to producer_sharing. A screenshot of this configuration is shown below.
With the above configuration, a query from the consumer account will only see the column brand_id.
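The column-level grant made in the Lake Formation console can also be expressed programmatically. The sketch below only builds the request for the Lake Formation `GrantPermissions` API and does not call AWS. The account IDs, database name and table name are hypothetical placeholders; only the column brand_id comes from the example above.

```python
def build_column_grant(catalog_id, principal, database, table, columns):
    """Build keyword arguments for Lake Formation's GrantPermissions call.

    The result grants SELECT on the listed columns only, so every other
    column of the table stays hidden from the principal.
    """
    return {
        "CatalogId": catalog_id,
        "Principal": {"DataLakePrincipalIdentifier": principal},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical identifiers, for illustration only.
request = build_column_grant(
    catalog_id="111122223333",   # central governance account (assumed)
    principal="444455556666",    # consumer account (assumed)
    database="producer_db",      # database created from producer_sharing (assumed)
    table="products",            # hypothetical table name
    columns=["brand_id"],
)
print(request["Resource"]["TableWithColumns"]["ColumnNames"])  # ['brand_id']

# With boto3 installed and credentials for the central account, the request
# could be sent like this (not executed here):
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)
```

Keeping the request construction separate from the API call makes it easy to review or test which columns are exposed before anything is applied in the central account.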
Conclusion.
In this article we have explained how you can implement an authorization layer using AWS Redshift RA3 and AWS Lake Formation.
With this authorization layer we will be able to solve the following problems:
- Producers and consumers from different domains must be able to work in isolation (if they wish to) if we want to implement a data mesh successfully.
- Producers must be able to decide how consumers can access their data. They are the data owners, and they decide how others use their data.
- Fine-grained permissions can be established at column level and, if we want, even at row level. This is of great interest if we want to be GDPR compliant. How to implement the GDPR in your own data platform will be explained in future articles.
Stay tuned for the next article about how to implement your own Data Platform successfully.
I hope this article was useful. If you enjoy messing around with Big Data, Microservices, reverse engineering or any other computer stuff and want to share your experiences with me, just follow me.