PySpark optimization techniques
Published at: 8/28/2024
Categories: pyspark, dataengineering, spark, optimization
Author: rado_mayank
There are many different parts of a Spark job that you might want to optimize, and it is valuable to be specific about which one you are targeting. Some of the main areas are listed below, followed by a short configuration sketch:
- Code-level design choices (e.g., RDDs versus DataFrames)
- Joins (e.g., prefer broadcast joins and avoid Cartesian joins or even full outer joins)
- Aggregations (e.g., preferring reduceByKey over groupByKey when possible)
- Individual application properties
- Inside of the Java Virtual Machine (JVM) of an executor
- Worker nodes
- Cluster and deployment properties
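At the application, JVM, and cluster level, most of these knobs are plain Spark configuration properties. A minimal sketch of setting them when building the SparkSession follows; the specific values are illustrative assumptions, not tuned recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative values only; tune them for your own workload and cluster.
spark = (
    SparkSession.builder
    .appName("optimization-demo")
    .config("spark.executor.memory", "4g")                        # heap size of each executor JVM
    .config("spark.executor.cores", "4")                          # CPU cores per executor
    .config("spark.sql.shuffle.partitions", "200")                # partitions produced by shuffles
    .config("spark.sql.autoBroadcastJoinThreshold", "10485760")   # 10 MB auto-broadcast limit for joins
    .getOrCreate()
)
```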
- Using efficient data storage formats like Parquet or ORC can significantly reduce storage size and improve read/write performance.
- Efficient Storage: Using formats like Parquet or ORC compresses the data, reducing storage costs and improving disk I/O performance.
- Faster Query Performance: These formats are optimized for large-scale processing, leading to faster query execution times due to their columnar storage structure.
- Row-based file formats (e.g., CSV, JSON) store data by rows. Each row contains all the fields for a particular record, making it efficient for writing and retrieving whole records.
- Columnar-based file formats (e.g., Parquet, ORC) store data by columns. Each column contains all the values for a particular field, making it more efficient for analytical queries that involve aggregation and filtering.
ORC (Optimized Row Columnar) and Parquet are popular columnar storage file formats used in big data processing frameworks like Apache Spark and Hadoop. They are optimized for storage and query performance in distributed data environments. Both ORC and Parquet are binary formats, which means you cannot read them directly as plain text the way you can a CSV file. For example, consider the following query; a short PySpark read/write sketch follows the comparison below.
SELECT AVG(salary) FROM employees WHERE age > 30;
- Row-Based (CSV): Reads all rows, including unnecessary data, resulting in higher I/O.
- Columnar-Based (Parquet): Reads only the age and salary columns, reducing I/O.
- Columnar-Based (ORC): Reads only the age and salary columns, but with additional optimization due to lightweight indexing, it skips irrelevant rows faster, resulting in even better query performance.
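As a rough PySpark sketch of the same query against a columnar file (the path and the pre-existing `employees` DataFrame are assumptions for illustration), Spark prunes the unused columns and pushes the filter down to the Parquet reader:

```python
from pyspark.sql import functions as F

# Assume `employees` is an existing DataFrame with `age` and `salary` columns.
# Write it once in a columnar format (the path is illustrative).
employees.write.mode("overwrite").parquet("/data/employees_parquet")

# On read, Spark scans only the `age` and `salary` columns (column pruning)
# and uses Parquet min/max statistics to skip row groups (predicate pushdown).
avg_salary = (
    spark.read.parquet("/data/employees_parquet")
    .where(F.col("age") > 30)
    .agg(F.avg("salary").alias("avg_salary"))
)
avg_salary.show()
```

The same code works for ORC by swapping `.parquet(...)` for `.orc(...)`.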
- Broadcast joins improve join performance when one of the tables is small enough to fit into the memory of each worker node; see the sketch after this list.
- Improved Join Performance: Broadcasting a small table to all nodes minimizes the need for shuffling large datasets, significantly speeding up the join operation.
- Memory Efficiency: This method works best when the small table fits in memory, avoiding expensive disk I/O operations.
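A minimal sketch of a broadcast join, assuming a large `orders` DataFrame and a small `countries` dimension DataFrame that share a `country_code` column:

```python
from pyspark.sql.functions import broadcast

# `orders` is large, `countries` is small enough to fit in each executor's memory.
# The broadcast() hint ships `countries` to every node instead of shuffling `orders`.
joined = orders.join(broadcast(countries), on="country_code", how="left")
```

Spark also broadcasts automatically when a table's estimated size is below `spark.sql.autoBroadcastJoinThreshold`; the explicit hint simply makes the intent visible and works even when the size estimate is off.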
- Caching is useful when a DataFrame is reused multiple times. It avoids recomputation and speeds up the workflow; a short example follows below.
- Avoids Recomputations: Caching prevents the need to recompute DataFrames multiple times during a workflow, saving time.
- Increases Performance: By storing DataFrames in memory, subsequent actions on the DataFrame are executed much faster.
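A short sketch of caching a DataFrame that several actions reuse (the `events` DataFrame and its columns are assumed for illustration):

```python
from pyspark import StorageLevel

# Cache the result of an expensive transformation that will be reused.
purchases = events.where("event_type = 'purchase'").cache()   # MEMORY_AND_DISK by default

purchases.count()                              # first action materializes the cache
purchases.groupBy("user_id").count().show()    # reuses cached data, no recomputation

# An explicit storage level can be chosen instead:
# purchases.persist(StorageLevel.MEMORY_ONLY)

purchases.unpersist()                          # free the memory once it is no longer needed
```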
- Proper partitioning of DataFrames can improve parallelism and reduce shuffling, enhancing performance (illustrated in the sketch below).
- Enhanced Parallelism: Proper repartitioning ensures that the workload is evenly distributed across nodes, improving parallel processing.
- Reduced Shuffling: By partitioning data based on key columns, you minimize costly shuffle operations during joins or aggregations.
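A small sketch of repartitioning before an aggregation and compacting before a write (partition counts, column names, and paths are illustrative):

```python
# Repartition by the key so rows with the same customer land in the same partition;
# a following aggregation or join on that key can then reuse the partitioning
# instead of shuffling the data again.
orders_by_customer = orders.repartition(200, "customer_id")
totals = orders_by_customer.groupBy("customer_id").count()

# When only reducing the partition count (e.g., before writing), coalesce avoids a full shuffle.
totals.coalesce(50).write.mode("overwrite").parquet("/data/order_totals")

print(orders_by_customer.rdd.getNumPartitions())  # inspect the resulting partition count
```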
- DataFrames are optimized for performance and provide a higher level of abstraction compared to RDDs; a side-by-side sketch follows this list.
- Higher Abstraction: DataFrames provide a more user-friendly API compared to RDDs, with automatic optimization under the hood.
- Performance Optimization: The Catalyst optimizer in Spark SQL optimizes DataFrame operations, making them faster than equivalent RDD operations.
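A brief side-by-side sketch of the same aggregation written against the RDD API and the DataFrame API (the `employees` DataFrame and its `department`/`salary` columns are assumed); only the DataFrame version goes through the Catalyst optimizer:

```python
from pyspark.sql import functions as F

# RDD style: the Python lambdas are opaque to Spark, so the plan cannot be optimized.
rdd_totals = (
    employees.rdd
    .map(lambda row: (row["department"], row["salary"]))
    .reduceByKey(lambda a, b: a + b)
)

# DataFrame style: the same aggregation expressed declaratively and optimized by Catalyst.
df_totals = employees.groupBy("department").agg(F.sum("salary").alias("total_salary"))
df_totals.show()
```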
- User-defined functions (UDFs) are often slower because they operate row by row. Use built-in functions whenever possible; see the comparison sketch below.
- Performance Overhead: UDFs can slow down processing since they operate on each row individually and bypass many of Spark's internal optimizations.
- Leverage Built-in Functions: Built-in functions are optimized for distributed processing and often execute much faster than UDFs.
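An illustrative contrast between a Python UDF and the equivalent built-in function (the `employees` DataFrame and its `name` column are assumed):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Python UDF: every row is serialized to a Python worker, bypassing Catalyst optimizations.
upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
slow = employees.withColumn("name_upper", upper_udf(F.col("name")))

# Built-in function: runs inside the JVM and is fully optimized by Catalyst.
fast = employees.withColumn("name_upper", F.upper(F.col("name")))
```

When a UDF is unavoidable, a vectorized pandas UDF (`pandas_udf`) is usually a faster alternative to a plain Python UDF.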