dev-resources.site

Template for design document of Apache Spark project

Published at
4/2/2024
Categories
spark
pyspark
Author
pankaj_chikhalwale

In an Apache Spark based data engineering/analytics project, what would a design document template look like?

Of course, the answer depends on the business and project requirements.

But should the design document "template" cover the following aspects?

Am I missing something?

My current list of aspects in the design template:

(1) Tables/views to be created (if any) in the source system to facilitate my project's pipeline(s). For some of the pipelines, a Kafka topic is the source.
(2) Pipeline: schema of the data, estimated data volume per call, format (CSV etc.), Kafka topic, frequency of pulling data from the source system (daily, weekly etc.), whether the data is pulled as needed, on a schedule, or based on an event, connectivity, etc.
(3) What kind of data objects will be created to persist the data in the data lake?
(4) High-level statement of all code changes, config changes, and data changes (including data movement).
(5) Which design standards/best practices are being followed? Critical design decisions to optimize/improve pipeline performance.
(6) Which regulatory compliance standards are being applied, and how?
(7) Which aggregation objects/views are to be created so that data and analytics reports can be served?
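
To make the list easier to review, here is one possible way to capture the seven aspects as a structured template that can be checked for gaps. This is only a sketch: the class names (PipelineSpec, DesignDoc), every field name, and the helper missing_sections are hypothetical and chosen for illustration, and a real project would likely need more detail per section.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class PipelineSpec:
    """Aspect (2): one ingestion pipeline. All field names are hypothetical."""
    name: str
    source: str                              # source table/view or Kafka topic
    data_format: str                         # e.g. "csv", "json"
    trigger: str                             # "on-demand", "schedule" or "event"
    frequency: Optional[str] = None          # e.g. "daily"; for scheduled pulls
    estimated_volume: Optional[str] = None   # e.g. "2 GB per call"
    schema: List[str] = field(default_factory=list)  # "column: type" entries


@dataclass
class DesignDoc:
    """One list per aspect (1) to (7); an empty list means 'not written yet'."""
    source_objects: List[str] = field(default_factory=list)       # (1)
    pipelines: List[PipelineSpec] = field(default_factory=list)   # (2)
    lake_objects: List[str] = field(default_factory=list)         # (3)
    change_summary: List[str] = field(default_factory=list)       # (4)
    design_decisions: List[str] = field(default_factory=list)     # (5)
    compliance: List[str] = field(default_factory=list)           # (6)
    aggregation_objects: List[str] = field(default_factory=list)  # (7)


def missing_sections(doc: DesignDoc) -> List[str]:
    """Return the names of sections that are still empty, in template order."""
    return [f.name for f in fields(doc) if not getattr(doc, f.name)]


if __name__ == "__main__":
    # Example values are made up for illustration.
    draft = DesignDoc(
        source_objects=["orders_export_view"],
        pipelines=[
            PipelineSpec(
                name="orders",
                source="kafka:orders",
                data_format="json",
                trigger="event",
            )
        ],
    )
    print(missing_sections(draft))
```

A check like this only reports which sections are empty; whether the list of aspects itself is complete is still a question for reviewers.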
