
Steps of Big Data Pipeline

Published at 12/28/2023
Categories: bigdata, aws, datalake
Author: andreyai

With the increase in computational and storage power, companies have been collecting more data than ever, which has created new tasks and job opportunities. To extract value from this data, companies rely on data pipelines. These pipelines consist of stages such as collecting, storing, processing, and analyzing data.

Collection

This step is responsible for ingesting data from different sources for later analysis. This data comes mainly from real-time and batch sources.

In real-time platforms, we have those who produce data (producers) and those who consume data (consumers). A typical example is how Netflix and Spotify stream data to millions of users. Streaming services include Kafka, AWS Kinesis, and AWS SQS.
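
As a minimal sketch, this is how a producer could publish events to an AWS Kinesis stream and a consumer could read them back with boto3 (the stream name and event fields are hypothetical, and the stream is assumed to already exist):

```python
import json

import boto3

# Hypothetical stream name; assumes the Kinesis stream already exists.
STREAM_NAME = "user-activity-stream"

kinesis = boto3.client("kinesis")

# Producer: publish one event to the stream.
kinesis.put_record(
    StreamName=STREAM_NAME,
    Data=json.dumps({"user_id": 42, "event": "play", "track": "song-123"}),
    PartitionKey="42",  # events with the same key land on the same shard
)

# Consumer: read records from the start of the first shard.
shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)[
    "StreamDescription"
]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator)["Records"]:
    print(json.loads(record["Data"]))
```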

The batch collection step may involve migrating data from an existing database, for example, ingesting data from a transactional database like PostgreSQL, MySQL, Oracle, or Aurora (often running on RDS) into a data lake or a data warehouse like AWS Redshift. In AWS, you can use the AWS Database Migration Service (DMS) for that.
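
Assuming the DMS replication task, source endpoint, and target endpoint have already been set up, starting the migration from code might look like this (the task ARN is hypothetical):

```python
import boto3

dms = boto3.client("dms")

# Hypothetical task ARN; assumes the replication task (source endpoint,
# target endpoint, and replication instance) was already created.
TASK_ARN = "arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK"

# Kick off a full load from the transactional database into the data lake.
response = dms.start_replication_task(
    ReplicationTaskArn=TASK_ARN,
    StartReplicationTaskType="start-replication",
)
print(response["ReplicationTask"]["Status"])
```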

Storage

Once we collect our data, it needs a place to be stored. By knowing how frequently data is accessed and how long it is needed, we can control its lifecycle: keeping frequently accessed data readily available and archiving or deleting the rest.
A service that helps with that is AWS S3.
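
For example, an S3 lifecycle rule can transition data to a cheaper archival tier and eventually delete it. A sketch with boto3, using a hypothetical bucket and prefix:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name for the raw layer of the data lake.
BUCKET = "my-data-lake-raw"

# Move objects to Glacier after 90 days and delete them after 5 years.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```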

Process

This step deals with ETL: the process of cleaning, enriching, and transforming raw data into more refined layers.
Some services that help with that are AWS Glue, AWS EMR, and AWS Lambda.
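
As an illustration, a Glue ETL job is typically a PySpark script that reads from the Glue Data Catalog, transforms the data, and writes it back to S3. A minimal sketch with hypothetical database, table, and field names (it only runs inside the Glue Spark environment):

```python
# A minimal AWS Glue ETL script (runs inside the Glue Spark environment).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical database/table registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Clean: drop an unneeded field and filter out rows missing an order id.
cleaned = raw.drop_fields(["debug_info"]).filter(
    lambda row: row["order_id"] is not None
)

# Write the transformed data to the curated layer in S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake-curated/orders/"},
    format="parquet",
)
```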

Governance

Data governance consists of data management, data quality, and data stewardship. It helps manage data access policies, data discovery, and the accuracy, validation, and completeness of data.
Some services that help with that are the AWS Glue Data Catalog and AWS Lake Formation.
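
A small sketch of both sides: registering a database in the Glue Data Catalog so data becomes discoverable, and granting a role read access to a table through Lake Formation (the role ARN and names are hypothetical):

```python
import boto3

glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

# Register a database in the Glue Data Catalog for data discovery.
glue.create_database(DatabaseInput={"Name": "sales_db"})

# Grant an analyst role read-only access to one table via Lake Formation.
# The role ARN, database, and table names are hypothetical.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"
    },
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "raw_orders"}},
    Permissions=["SELECT"],
)
```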

Analyze

This part is responsible for extracting value from data through data analysis, machine learning, and data visualization: understanding how the data is organized, grouping it, and making predictions from it.
Some services that help with that are AWS SageMaker and AWS QuickSight.
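
As one possible sketch, training a prediction model on the curated data with the SageMaker Python SDK and its built-in XGBoost image might look like this (the IAM role and S3 paths are hypothetical):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Hypothetical IAM role; the training data is assumed to be the curated
# CSV output of the processing step.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Use the built-in XGBoost image to train a prediction model.
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.7-1"
)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-data-lake-models/",
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Launch the training job against the curated dataset in S3.
estimator.fit(
    {
        "train": TrainingInput(
            "s3://my-data-lake-curated/orders/", content_type="text/csv"
        )
    }
)
```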
