Building a "Real-Time" Data Integration Platform on AWS
Introduction
During my work on several projects as a Cloud Architect, I noticed a recurring challenge:
Integrating diverse data sources and delivering real-time updates to multiple clients
Whether it's monitoring environmental conditions in smart agriculture, tracking assets in logistics, or managing connected devices in smart cities, the ability to process and act on data instantly can make a significant difference.
In this post, I'll share my journey of building a serverless Near Real-Time Data Integration Platform using the AWS Serverless Application Model (SAM).
All the infrastructure code is available in a GitHub repository:
https://github.com/acriado-dev/sam-real-time-websockets
Overview
Understanding Real-Time vs Near Real-Time
Before diving into the details and architectural components, it's important to clarify a core concept for the platform that sometimes causes confusion. In the context of data processing, what is the difference between "real-time" and "near real-time"?
Real-Time: Data is processed and delivered as soon as it is generated, with virtually no delay. This is critical in scenarios where immediate action is required. Examples include industrial automation, connected vehicles, and healthcare monitoring.
Near Real-Time: This refers to a slight delay in data processing, typically ranging from milliseconds to seconds. Examples include data analytics, CDNs, and social media monitoring.
About this Project
To answer the question,
is this platform Real-time or Near Real-Time?
We have to take into account that true real-time implies that data is processed and delivered with minimal latency, often microseconds or milliseconds, under strict timing constraints. The cloud resources used in this project do not reach that level of immediacy. Let's analyze the most significant ones:
AWS Lambda: Introduces some latency due to cold starts and the execution time of the function. While this latency is generally small (typically within milliseconds to a couple of seconds), it does mean that it cannot be considered "true" Real-Time but rather operates in near Real-Time.
WebSocket API: This component allows low-latency communication, which is closer to real-time, but the overall latency is influenced by the processing times of the connected services.
DynamoDB and Streams: Provide very fast data storage and retrieval, but the Lambda function triggered by the stream still adds processing time.
Given these factors, the most accurate description of the platform is 'near real-time'. This approach still brings benefits, from cost efficiency to scalability and simplicity, and depending on your application's use case, near real-time may well be the best choice.
Architecture
In this section I explore the project's architecture, detailing its components and explaining how the serverless approach offered an appropriate and efficient solution. A workflow example is also described to better show how the project works:
Components:
Client Integration: Frontend application (Mobile & WebApp) that is able to connect to the WebSocket API and receive the real time data updates.
WebSocket API Gateway: Entry point of the application. It's responsible for handling the WebSocket connections and the messages that are sent to the clients.
OnConnect Lambda Function: Is triggered when a client connects to the WebSocket API Gateway. Stores the connection information in the DynamoDB table.
OnDisconnect Lambda Function: Is triggered when a client disconnects from the WebSocket API Gateway. Removes the connection information from the DynamoDB table.
OnReceiveRealTimeItem Lambda Function: Is triggered when a new item is added to the DynamoDB table. Sends the real time data updates to the clients that are interested in the item.
RealTimeData DynamoDB Table: Used to store the real time data items. Each item has a unique key and a value that represents the data that is sent to the clients.
WebSocketConnectionManager DynamoDB Table: Stores the connection information of the clients that are connected to the WebSocket API Gateway. Each item has a unique connectionId and a realTimeItemKey that represents the item that the client is interested in.
Integration sources: Data sources responsible for sending the real-time data updates to the platform. For this project, the following sources have been considered:
- SDK: The AWS SDK (in any supported language) and the AWS CLI.
- Data transfer: Any data transfer mechanism that is able to send the data to the platform. For example, data replication from another database.
- Device Location & Sensors: IoT devices mainly sending sensor data and telemetry. For example, GPS location of a vehicle.
- REST API: For example, a weather API that sends the weather data to the platform.
- Async (SNS, SQS): Asynchronous mechanism that is able to send the data to the platform. For example, an SNS topic that sends the data to the platform.
- S3 Events: Event triggered by an S3 bucket. For example, when a new file is uploaded to the bucket.
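To make the connection bookkeeping described above more concrete, here is a minimal Python sketch of what the OnConnect and OnDisconnect handlers might look like. It is not the code from the repository. It assumes that the connection table uses connectionId as its key, that the subscription key arrives in a query parameter named realTimeItemKey, and that the table object is injected (in a real Lambda it would be a boto3 DynamoDB Table resource). The InMemoryTable class is a hypothetical stand-in that only exists so the sketch can be run locally.

```python
from typing import Any, Dict

# Assumption: name of the query parameter carrying the key the client subscribes to.
ITEM_KEY_PARAM = "realTimeItemKey"


def on_connect(event: Dict[str, Any], table: Any) -> Dict[str, Any]:
    """Store the new connection together with the item key it is interested in."""
    connection_id = event["requestContext"]["connectionId"]
    # API Gateway sends None here when the client passes no query string.
    params = event.get("queryStringParameters") or {}
    item_key = params.get(ITEM_KEY_PARAM)
    if not item_key:
        # Reject the handshake when the subscription key is missing.
        return {"statusCode": 400, "body": f"missing {ITEM_KEY_PARAM}"}
    table.put_item(Item={"connectionId": connection_id, "realTimeItemKey": item_key})
    return {"statusCode": 200, "body": "connected"}


def on_disconnect(event: Dict[str, Any], table: Any) -> Dict[str, Any]:
    """Remove the connection so no further updates are sent to it."""
    connection_id = event["requestContext"]["connectionId"]
    table.delete_item(Key={"connectionId": connection_id})
    return {"statusCode": 200, "body": "disconnected"}


class InMemoryTable:
    """Hypothetical local stand-in exposing the two Table methods used above."""

    def __init__(self) -> None:
        self.items: Dict[str, Dict[str, Any]] = {}

    def put_item(self, Item: Dict[str, Any]) -> None:
        self.items[Item["connectionId"]] = Item

    def delete_item(self, Key: Dict[str, Any]) -> None:
        self.items.pop(Key["connectionId"], None)
```

Passing the table in as an argument keeps the handlers easy to exercise without AWS credentials. In a deployed function, a thin wrapper would create the real table once and forward the event to these functions.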
Workflow:
Next, we are going to describe the different numbered steps marked in the diagram:
The client integration connects to the WebSocket API Gateway.
The WebSocket API Gateway triggers the OnConnect Lambda function.
The OnConnect and OnDisconnect Lambda functions store and remove the connection information from the WebSocketConnectionManager DynamoDB table.
RealTimeData items are added to the RealTimeData DynamoDB table from the integration sources.
The OnReceiveRealTimeItem lambda function is triggered through DynamoDB Streams.
The OnReceiveRealTimeItem lambda function sends the real time data updates to the clients that are interested in the item.
The client integration receives the real time data updates from the WebSocket API Gateway.
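The last steps of the workflow (stream record in, WebSocket messages out) can be sketched as follows. This is an illustrative Python outline under stated assumptions rather than the project's actual handler: the function name fan_out, the injected find_connections and send callables, and the decision to forward both INSERT and MODIFY records are all assumptions. It also assumes the stream is configured to include the new item image, and it only unwraps scalar attribute values.

```python
import json
from typing import Any, Callable, Dict, Iterable


def plain_value(attr: Dict[str, Any]) -> Any:
    """Unwrap a DynamoDB attribute value such as {"S": "3"} or {"N": "42"}.

    Only numbers are converted; other types are returned as they appear in the event.
    """
    (type_tag, value), = attr.items()
    if type_tag == "N":
        try:
            return int(value)
        except ValueError:
            return float(value)
    return value


def fan_out(
    event: Dict[str, Any],
    key_attribute: str,
    find_connections: Callable[[str], Iterable[str]],
    send: Callable[[str, bytes], None],
) -> int:
    """Forward each new or changed item to the connections subscribed to its key."""
    sent = 0
    for record in event.get("Records", []):
        # Assumption: deletions are not forwarded to clients.
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"].get("NewImage", {})
        if key_attribute not in image:
            continue
        item = {name: plain_value(attr) for name, attr in image.items()}
        payload = json.dumps(item).encode("utf-8")
        for connection_id in find_connections(str(item[key_attribute])):
            send(connection_id, payload)
            sent += 1
    return sent
```

In a deployed function, send would typically wrap the post_to_connection call of the boto3 apigatewaymanagementapi client, and find_connections would query the WebSocketConnectionManager table by realTimeItemKey. Both are left as injected callables here so the routing logic can be checked in isolation.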
Setup and Test the Platform
To test the platform, we are going to use the wscat utility to simulate API clients and monitor the WebSocket API. This way each test client can receive the data as soon as it is sent.
npm install -g wscat
We can retrieve the WebSocket API endpoint from the AWS Console, although it is useful to expose it as an output of the SAM project:
The Endpoint should have the following composition:
wscat -c wss://[api-id].execute-api.[aws-region-id].amazonaws.com/[environment]?realTimeItemKey=[item-key]
Notice that the realTimeItemKey query parameter is required to connect to the WebSocket API. This parameter is used to filter the messages that are sent to the client. Its value must match the realTimeItemKey attribute of the item that you want to receive updates for.
According to the project specification, the realTimeItemKey is a dynamic value and can be adapted to the needs of each client or integration.
In our example, we use vehicleId as the real-time item that multiple clients receive updates for. The realTimeItemKey is therefore vehicleId, and the command to connect is:
wscat -c wss://svcz00plil.execute-api.eu-central-1.amazonaws.com/develop?vehicleId=3
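If you need to build this URL from a script, for example to start several test clients, a small helper along these lines may be handy. It is only a sketch: the function name websocket_url and its parameters are hypothetical, and it simply follows the endpoint composition shown above.

```python
from urllib.parse import urlencode


def websocket_url(api_id: str, region: str, stage: str, key_name: str, key_value: str) -> str:
    """Compose the WebSocket endpoint, URL-encoding the subscription query parameter."""
    query = urlencode({key_name: key_value})
    return f"wss://{api_id}.execute-api.{region}.amazonaws.com/{stage}?{query}"


# Reproduces the example endpoint used in this post.
url = websocket_url("svcz00plil", "eu-central-1", "develop", "vehicleId", "3")
```

The resulting string can be passed directly to wscat -c.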
Let's see some examples of the three main events of the platform:
- OnConnect:
%[https://youtu.be/lUFf14zZsxg]
- OnDisconnect
%[https://youtu.be/x8ZYvpD_Q1k]
- OnReceiveRealTimeItem
%[https://youtu.be/8YyZ7hJ0Ft4]
Conclusion
This project is a fantastic example of a simple near real-time integration in a Serverless architecture. It can be used as a starting point for implementing a more complex integration or added to another project to handle real-time processing.
In my opinion, it's important to optimize these kinds of data integrations and make them as efficient as possible. Many projects overuse polling, which can be considered an anti-pattern for proper real-time implementation.
Additionally, note that a special part of the project focuses on IaC, as it's crucial to implement workloads that can be reused and reproduced in any Cloud environment (in this case, AWS).