Data Transfer from S3 to Cloud Storage using GCP Storage Transfer Service
Storage Transfer Service automates the transfer of data to, from, and between object and file storage systems, including Google Cloud Storage, Amazon S3, Azure Storage, on-premises data, and more. It can be used to transfer large amounts of data quickly and reliably, without the need to write any code. Depending on your source type, you can easily create and run Google-managed transfers, or configure self-hosted transfers that give you full control over network routing and bandwidth usage. For cloud-to-cloud transfers, Storage Transfer Service only moves data into Google Cloud; it does not support transfers in the other direction, e.g. from GCP to AWS.
In this blog, we will demonstrate how to create a one-off storage transfer job to transfer data from an S3 bucket to GCP Cloud Storage. In addition, we will demonstrate how to set up an event-driven transfer job that transfers objects by continuously listening to event notifications for objects added to or modified in the source S3 bucket.
Prerequisites
Before you begin, make sure you have the following prerequisites:
- A GCP account with the necessary permissions to create and manage storage buckets and transfer jobs.
- An AWS account with the necessary permissions to create and manage S3 buckets.
- The AWS CLI installed and configured on your local machine.
- The gcloud CLI installed and configured on your local machine.
- The necessary IAM roles and permissions set up in both AWS and GCP.
Create a source S3 bucket demo-s3-transfer and a destination Cloud Storage bucket demo-storage-transfer. In the source S3 bucket, we will upload some parquet files under the prefix 2024/12. We will be transferring the parquet files in this prefix into the demo-storage-transfer bucket.
Storage Transfer REST API
Storage Transfer Service uses a Google-managed service account to move your data. This service account is created automatically the first time you create a transfer job, call googleServiceAccounts.get, or visit the job creation page in the Google Cloud console. The service account's format is typically project-PROJECT_NUMBER@storage-transfer-service.iam.gserviceaccount.com.
googleServiceAccounts.get
We can use the googleServiceAccounts.get method to retrieve the Google-managed service account that Storage Transfer Service uses to access buckets in the project where transfers run or in other projects. Each Google service account is associated with one Google Cloud project. Navigate to the googleServiceAccounts.get reference page.
On the right, you will see a window where you can enter the project ID under the request parameters. Executing the request returns the subjectId in the response, along with the Storage Transfer Service account email. Keep a note of the subject ID and the Google-managed service account email, as we will need them in later sections.
Alternatively, we can do the same from the CLI, using curl and passing the bearer token in the header.
curl -X GET -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "x-goog-user-project: <project-id>" https://storagetransfer.googleapis.com/v1/googleServiceAccounts/<project-id>
The x-goog-user-project header is required to set the quota project for the request (see the troubleshooting guide). If it is omitted, you may get the following error: The storagetransfer.googleapis.com API requires a quota project, which is not set by default
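The same request can also be made from Python. The following is a minimal sketch using only the standard library; the helper names build_request and get_transfer_service_account are made up for illustration, and the second function assumes the gcloud CLI is installed and authenticated.

```python
import json
import subprocess
import urllib.request

API_ROOT = "https://storagetransfer.googleapis.com/v1"


def build_request(project_id: str, access_token: str) -> urllib.request.Request:
    """Build the same GET request that the curl command sends."""
    return urllib.request.Request(
        f"{API_ROOT}/googleServiceAccounts/{project_id}",
        headers={
            "Authorization": f"Bearer {access_token}",
            # Quota project header, as in the curl command.
            "x-goog-user-project": project_id,
        },
        method="GET",
    )


def get_transfer_service_account(project_id: str) -> dict:
    """Send the request and return the parsed JSON body.

    Needs network access and an authenticated gcloud CLI.
    """
    token = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout.strip()
    with urllib.request.urlopen(build_request(project_id, token)) as response:
        return json.load(response)
```

Calling get_transfer_service_account("<project-id>") should return a dictionary containing the accountEmail and subjectId fields of the Google-managed service account.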
AWS IAM role permissions
In the AWS console, navigate to IAM and create a new role.
Select Custom trust policy and paste the following trust policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "accounts.google.com:sub": "<subject-id>"
        }
      }
    }
  ]
}
- Replace <subject-id> with the subjectId of the Google-managed service account that you retrieved in the previous section using the googleServiceAccounts.get reference page. It should look like the screenshot below.
- Paste the following JSON policy to grant the role permission to list the bucket and get objects from the S3 bucket.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": ["*"]
    }
  ]
}
- Once the role is created, note down its ARN, which will be passed to Storage Transfer Service when initiating the transfer programmatically in Python.
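The console steps above could also be scripted. The sketch below builds the two policy documents in Python and, assuming boto3 is installed and AWS credentials are configured, creates the role. The function names and the inline policy name are hypothetical. Note that build_read_policy is a narrower variant of the policy shown above: it scopes the permissions to a single bucket instead of using "*".

```python
import json


def build_trust_policy(subject_id: str) -> dict:
    """Trust policy letting the Google-managed service account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {"accounts.google.com:sub": subject_id}
                },
            }
        ],
    }


def build_read_policy(bucket_name: str) -> dict:
    """Read-only permissions, scoped to one bucket (an assumption, not the '*' used above)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # ListBucket applies to the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            # GetObject applies to the objects inside the bucket.
            {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": [f"{bucket_arn}/*"]},
        ],
    }


def create_transfer_role(role_name: str, subject_id: str, bucket_name: str) -> str:
    """Create the role with both policies and return its ARN (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the policy builders work without boto3

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy(subject_id)),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="s3-read-for-transfer",  # hypothetical inline policy name
        PolicyDocument=json.dumps(build_read_policy(bucket_name)),
    )
    return role["Role"]["Arn"]
```

The returned ARN is the value passed as --role_arn when running the transfer script.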
Transfer permissions in GCP
The GCP service account used to create the transfer job needs to be granted the Storage Transfer User role (roles/storagetransfer.user) and roles/iam.roleViewer. In addition, we need to give the Google-managed service account retrieved in the previous section access to the resources needed to complete transfers.
- Navigate to the Cloud Storage bucket demo-storage-transfer. In the Permissions tab, click Grant access.
- In the new window, enter the Google-managed Storage Transfer Service account email as the principal. Assign the Storage Admin role.
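The same bucket-level grant can be sketched with the google-cloud-storage client library. The function names below are illustrative, and grant_bucket_access assumes the library is installed and the caller is allowed to change the bucket's IAM policy.

```python
def build_binding(service_account_email: str, role: str = "roles/storage.admin") -> dict:
    """IAM binding that gives the service account the Storage Admin role."""
    return {"role": role, "members": {f"serviceAccount:{service_account_email}"}}


def grant_bucket_access(bucket_name: str, service_account_email: str) -> None:
    """Append the binding to the bucket's IAM policy (needs google-cloud-storage)."""
    from google.cloud import storage  # imported lazily; only needed for the API call

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(build_binding(service_account_email))
    bucket.set_iam_policy(policy)
```

For example, grant_bucket_access("demo-storage-transfer", "<transfer-service-account-email>") would mirror the two console steps above.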
Create one-off batch Storage Transfer Job
We can interact with Storage Transfer Service programmatically with Python.
- Copy this folder, which contains the requirements.txt and the script for initiating the storage transfer job, checking its status and verifying completion.
- In a terminal window, run the following command to install the google-cloud-storage-transfer and google-cloud-storage libraries:
pip install -r requirements.txt
- If you use a service account JSON key, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the key file. Otherwise, use one of the other GCP authentication options.
- Now, run the following command to execute the storage_transfer_batch.py job script in a terminal of your choosing. This will transfer the data from the 2024/12 prefix in the S3 bucket to the GCP bucket under a Data prefix. We pass in the ARN of the role we created earlier, which will be assumed during the transfer to generate temporary credentials with the required permissions.
python python/storage_transfer.py --gcp_project_id <your-gcp-project-id> --gcp_bucket <your-gcp-bucket> --s3_bucket <your-s3-bucket> --s3_prefix <s3-prefix> --gcp_prefix <gcp-prefix> --role_arn <aws-role-arn>
- You should see the logs as in the screenshot below. Wait for the job to show as completed.
Navigate to the Cloud Storage bucket and you should see the data under the Data prefix.
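The script itself is not reproduced in this post. As a rough sketch of what such a script might do (not the actual script), the code below parses the same command-line flags, builds a transfer job definition, and submits it with the google-cloud-storage-transfer client. All function names are hypothetical.

```python
import argparse


def normalize_prefix(prefix: str) -> str:
    """Transfer paths must be empty or end with '/', without a leading '/'."""
    prefix = prefix.strip("/")
    return f"{prefix}/" if prefix else ""


def build_transfer_job(project_id, s3_bucket, s3_prefix, gcp_bucket, gcp_prefix, role_arn) -> dict:
    """Describe a job that copies one S3 prefix into one Cloud Storage prefix."""
    return {
        "project_id": project_id,
        "description": f"One-off transfer from s3://{s3_bucket} to gs://{gcp_bucket}",
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": s3_bucket,
                "path": normalize_prefix(s3_prefix),
                # Role assumed by Storage Transfer Service to read the bucket.
                "role_arn": role_arn,
            },
            "gcs_data_sink": {
                "bucket_name": gcp_bucket,
                "path": normalize_prefix(gcp_prefix),
            },
        },
    }


def run_one_off_transfer(job: dict) -> None:
    """Create the job, run it once and wait for it to finish (needs GCP credentials)."""
    from google.cloud import storage_transfer  # imported lazily

    client = storage_transfer.StorageTransferServiceClient()
    job = dict(job, status=storage_transfer.TransferJob.Status.ENABLED)
    created = client.create_transfer_job({"transfer_job": job})
    # The job has no schedule, so it only runs when triggered explicitly.
    operation = client.run_transfer_job(
        {"job_name": created.name, "project_id": job["project_id"]}
    )
    operation.result()  # blocks until the transfer operation completes
    print(f"Transfer job {created.name} completed")


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the same flags used in the command above."""
    parser = argparse.ArgumentParser(description="Run a one-off S3 to GCS transfer")
    for flag in ("gcp_project_id", "gcp_bucket", "s3_bucket", "s3_prefix", "gcp_prefix", "role_arn"):
        parser.add_argument(f"--{flag}", required=True)
    return parser.parse_args(argv)
```

Passing the parsed arguments to build_transfer_job and the result to run_one_off_transfer would tie the pieces together. Keeping the job definition in a plain dictionary makes it easy to inspect before anything is sent to the API.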
You can monitor and check your transfer jobs from the Google Cloud Console UI. Open the Google Cloud Console and navigate to "Transfer Service". The jobs executed will be listed.
In the Monitoring tab, we can see plots of performance metrics (bytes transferred, objects processed, transfer rate, etc.).
In the Operations and Configuration tabs, we can get more details about the transfer, e.g. run history, data transferred and the other configuration options we set for the transfer job.
Create event driven transfer job
Event-driven transfers listen to Amazon S3 Event Notifications sent to Amazon SQS to know when objects in the source bucket have been modified or added.
Create an SQS queue in AWS
- In the AWS Management Console, go to the SQS service, click "Create queue" and provide a name for the queue.
- In the Access policy section, select Advanced. A JSON object is displayed. Paste the policy below, replacing the values for <SQS-RESOURCE-ARN>, <AWS-ACCOUNT-ID> and <S3_BUCKET_ARN>. This permits only the SQS:SendMessage action on the SQS queue, and only from the S3 bucket in the AWS account.
{
  "Version": "2012-10-17",
  "Id": "example-ID",
  "Statement": [
    {
      "Sid": "example-statement-ID",
      "Effect": "Allow",
      "Principal": {
        "Service": "s3.amazonaws.com"
      },
      "Action": "SQS:SendMessage",
      "Resource": "<SQS-RESOURCE-ARN>",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "<AWS-ACCOUNT-ID>"
        },
        "ArnLike": {
          "aws:SourceArn": "<S3_BUCKET_ARN>"
        }
      }
    }
  ]
}
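For those who prefer scripting, the queue and its access policy could be created along these lines. This is a sketch that assumes boto3 is installed and AWS credentials are configured; the function names are hypothetical, and the queue ARN is derived from the region, account ID and queue name.

```python
import json


def build_queue_policy(queue_arn: str, account_id: str, bucket_arn: str) -> dict:
    """Access policy allowing only S3 notifications from one bucket to reach the queue."""
    return {
        "Version": "2012-10-17",
        "Id": "example-ID",
        "Statement": [
            {
                "Sid": "example-statement-ID",
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "SQS:SendMessage",
                "Resource": queue_arn,
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "ArnLike": {"aws:SourceArn": bucket_arn},
                },
            }
        ],
    }


def create_queue(queue_name: str, region: str, account_id: str, bucket_name: str) -> str:
    """Create the queue with the policy attached and return its ARN (needs boto3)."""
    import boto3  # imported lazily so the policy builder works without boto3

    queue_arn = f"arn:aws:sqs:{region}:{account_id}:{queue_name}"
    policy = build_queue_policy(queue_arn, account_id, f"arn:aws:s3:::{bucket_name}")
    sqs = boto3.client("sqs", region_name=region)
    sqs.create_queue(QueueName=queue_name, Attributes={"Policy": json.dumps(policy)})
    return queue_arn
```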
Now we need to enable notifications in the S3 bucket, setting the SQS queue as destination.
- Navigate to the S3 bucket and select the Properties tab. In the Event notifications section, click Create event notification.
- Specify a name for this event. In the Event types section, select "All object create events", as in the screenshot below.
- As the Destination, select SQS queue and choose the queue you created previously.
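The notification setup above can also be expressed with boto3. The sketch below is one possible way to do it, with hypothetical function names. Be aware that put_bucket_notification_configuration replaces the bucket's existing notification configuration rather than adding to it.

```python
def build_notification_config(queue_arn: str) -> dict:
    """Send every object-created event in the bucket to the SQS queue."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                # Equivalent to "All object create events" in the console.
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


def enable_notifications(bucket_name: str, queue_arn: str) -> None:
    """Apply the configuration to the bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the config builder works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=build_notification_config(queue_arn),
    )
```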
Create an event driven Storage transfer job
We will now use the Google Cloud console to create an event-driven transfer job. Navigate to the GCP Transfer Service page and click Create transfer job.
- Select Amazon S3 as the source type, and Cloud Storage as the destination.
- For the Scheduling mode select Event-driven and click Next.
- Enter the S3 bucket name. We will use the same bucket we used previously for the one-off transfer but you can use a different one if you wish.
- Enter the ARN of the Amazon SQS queue that you created earlier, as in the screenshot below.
- Select the destination Cloud Storage bucket path (which can optionally include a prefix) as in the screenshot below.
- Leave the rest of the options as defaults and click create.
- The transfer job starts running and an event listener waits for notifications on the SQS queue.
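The console steps above could also be approximated with the Python client. In the sketch below, the job definition carries an event_stream entry pointing at the SQS queue ARN instead of being run manually. The function names are hypothetical, and authenticating to S3 through a role ARN is an assumption carried over from the one-off transfer.

```python
def build_event_driven_job(project_id, s3_bucket, gcp_bucket, role_arn, sqs_queue_arn, gcp_prefix="") -> dict:
    """Describe a job that replicates objects as S3 notifications arrive on the queue."""
    gcp_prefix = gcp_prefix.strip("/")
    return {
        "project_id": project_id,
        "description": f"Event-driven transfer from s3://{s3_bucket} to gs://{gcp_bucket}",
        "transfer_spec": {
            "aws_s3_data_source": {"bucket_name": s3_bucket, "role_arn": role_arn},
            "gcs_data_sink": {
                "bucket_name": gcp_bucket,
                # Sink paths must be empty or end with '/'.
                "path": f"{gcp_prefix}/" if gcp_prefix else "",
            },
        },
        # The event stream makes the job listen to the queue for changes.
        "event_stream": {"name": sqs_queue_arn},
    }


def create_event_driven_job(job: dict) -> str:
    """Create the enabled job and return its name (needs GCP credentials)."""
    from google.cloud import storage_transfer  # imported lazily

    client = storage_transfer.StorageTransferServiceClient()
    job = dict(job, status=storage_transfer.TransferJob.Status.ENABLED)
    created = client.create_transfer_job({"transfer_job": job})
    return created.name
```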
We can test this by uploading some data to the source S3 bucket. You should then see the objects being replicated from the S3 bucket to the GCS bucket. You can also view monitoring details for the SQS queue.
Conclusion
GCP's Storage Transfer Service is a powerful tool for transferring data from S3 to GCS. It offers a cost-effective, scalable, and secure solution for data migration, with flexible scheduling and data filtering options. In this practical blog, we walked you through the steps required to set up GCP's Storage Transfer Service for transferring data from S3 to GCS. By following these steps, you can easily migrate your data from S3 to GCS with minimal effort and maximum efficiency.