
Template for design document of Apache Spark project

Published at

4/2/2024

Categories

spark

pyspark

Author

pankaj_chikhalwale

Main Article

https://dev.to/pankaj_chikhalwale/template-for-design-document-of-apache-spark-project-4c6e


In an Apache Spark based data engineering or analytics project, what would a design document template look like?

Of course, the answer depends on the business and project requirements.

But should the design document template cover the following aspects? Am I missing anything?

My current list of aspects in the design template:

(1) Tables/views to be created (if any) in the source system to facilitate my project's pipeline(s). For some of the pipelines, a Kafka topic is the source.
(2) Pipeline: schema of the data, estimated data volume per call, format (CSV etc.), Kafka topic, frequency of pulling data from the source system (daily, weekly etc.), whether the data is pulled as needed, on a schedule, or in response to an event, connectivity, etc.
(3) What kind of data objects will be created to persist the data in the data lake?
(4) High-level statement of all code changes, config changes, and data changes (including movement).
(5) Which design standards/best practices are being followed? Critical design decisions to optimize/improve pipeline performance.
(6) Which regulatory compliance standards are being applied, and how?
(7) Which aggregation objects/views are to be created so that data and analytics reports can be served?
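
One way to make the pipeline aspect in item (2) concrete is to record each pipeline as a small structured entry that can be checked for gaps before the design is reviewed. The following is only a minimal sketch in plain Python, not a Spark API: the names `PipelineSpec`, `Trigger`, and `missing_fields`, as well as the example values, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Trigger(Enum):
    """How a pull from the source system is started (hypothetical categories)."""
    ON_DEMAND = "on_demand"
    SCHEDULED = "scheduled"
    EVENT = "event"


@dataclass
class PipelineSpec:
    """One pipeline entry of the design document (illustrative fields only)."""
    name: str
    source_format: str                      # e.g. "csv" or "kafka"
    trigger: Trigger
    estimated_rows_per_call: int
    schema: Dict[str, str] = field(default_factory=dict)  # column -> type name
    frequency: Optional[str] = None         # e.g. "daily"; used when scheduled
    kafka_topic: Optional[str] = None       # used when the source is Kafka

    def missing_fields(self) -> List[str]:
        """Return the names of design details that are still unspecified."""
        missing = []
        if not self.schema:
            missing.append("schema")
        if self.source_format == "kafka" and not self.kafka_topic:
            missing.append("kafka_topic")
        if self.trigger is Trigger.SCHEDULED and not self.frequency:
            missing.append("frequency")
        return missing


# Example entry with assumed values; a complete entry reports no gaps.
orders = PipelineSpec(
    name="orders_ingest",
    source_format="kafka",
    trigger=Trigger.SCHEDULED,
    estimated_rows_per_call=50_000,
    schema={"order_id": "string", "amount": "double"},
    frequency="daily",
    kafka_topic="orders",
)
print(orders.missing_fields())
```

A record like this keeps the template's questions (schema, volume, format, topic, frequency, trigger) in one place per pipeline, so an incomplete design entry is easy to spot.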
