PySpark optimization techniques

Published at: 8/28/2024
Categories: pyspark, dataengineering, spark, optimization
Author: rado_mayank

Many different parts of a Spark job can be optimized, so it helps to be specific about which one you are targeting. Some of the main areas are:
  • Code-level design choices (e.g., RDDs versus DataFrames)
  • Joins (e.g., use broadcast joins and avoid Cartesian joins or even full outer joins)
  • Aggregations (e.g., using reduceByKey when possible over groupByKey)
  • Individual application properties
  • The Java Virtual Machine (JVM) of an executor
  • Worker nodes
  • Cluster and deployment properties

  • Using efficient data storage formats like Parquet or ORC can significantly reduce storage size and improve read performance compared with text formats such as CSV or JSON.
  • Efficient Storage: Parquet and ORC encode and compress data column by column, which reduces storage costs and the amount of data read from disk.
  • Faster Query Performance: Because data is stored by column, a query reads only the columns it needs, which shortens execution time for large analytical workloads.

  • Row-based file formats (e.g., CSV, JSON) store data by rows. Each row holds all the fields of one record, which makes these formats efficient for writing and retrieving whole records.
  • Columnar file formats (e.g., Parquet, ORC) store data by columns. Each column holds all the values of one field, which makes these formats more efficient for analytical queries that aggregate or filter on a few columns.
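
The difference between the two layouts can be pictured with plain Python structures. This is only a conceptual sketch with made-up records: real file formats add encoding, compression, and metadata on top, and the function names here are hypothetical.

```python
# Row-oriented layout: one complete record per entry, like the lines of a CSV file.
rows = [
    {"name": "Ana", "age": 34, "salary": 5200.0},
    {"name": "Ben", "age": 28, "salary": 4100.0},
    {"name": "Caro", "age": 45, "salary": 6900.0},
]

# Column-oriented layout: one list per field, values kept in the same record order.
columns = {
    "name": [r["name"] for r in rows],
    "age": [r["age"] for r in rows],
    "salary": [r["salary"] for r in rows],
}


def avg_salary_from_rows(records, min_age):
    """Walk whole records; a row-based reader must load each one to inspect it."""
    picked = [r["salary"] for r in records if r["age"] > min_age]
    return sum(picked) / len(picked)


def avg_salary_from_columns(cols, min_age):
    """Use only the two relevant columns; the "name" column is never accessed."""
    picked = [s for a, s in zip(cols["age"], cols["salary"]) if a > min_age]
    return sum(picked) / len(picked)


row_result = avg_salary_from_rows(rows, 30)
column_result = avg_salary_from_columns(columns, 30)
```

Both functions return the same average; the columnar version simply never needs the fields it does not use, which is where the I/O savings come from on disk.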

ORC (Optimized Row Columnar) and Parquet are popular columnar storage file formats used in big data processing frameworks like Apache Spark and Hadoop. They are optimized for storage and query performance in distributed data environments. Both are binary formats, which means you cannot open them in a text editor the way you can a CSV file; you need a tool that understands the format, such as Spark.

Consider the following query:
SELECT AVG(salary) FROM employees WHERE age > 30;
  • Row-Based (CSV): Reads every field of every row, including columns the query never uses, resulting in higher I/O.
  • Columnar-Based (Parquet): Reads only the age and salary columns, reducing I/O.
  • Columnar-Based (ORC): Also reads only the age and salary columns. In addition, its lightweight built-in indexes (min/max statistics) let the reader skip blocks of rows in which no value can satisfy age > 30. Parquet keeps similar statistics per row group, so which format is faster depends on the data and the engine.

  • Broadcast joins improve join performance when one of the tables is small enough to fit into the memory of each executor.
  • Improved Join Performance: Spark sends a full copy of the small table to every executor, so the large table can be joined in place without being shuffled across the network, which often speeds up the join significantly.
  • Memory Efficiency: This method only works well when the small table fits comfortably in memory on the driver and on every executor. Broadcasting a table that is too large can cause out-of-memory errors instead of a speedup.

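  • Caching is useful when a DataFrame is reused multiple times. It avoids recomputation and speeds up the workflow.
  • Avoids Recomputations: Without caching, every action re-runs the full chain of transformations that produced the DataFrame. Caching keeps the result after the first action so later actions can reuse it.
  • Increases Performance: Subsequent actions read the cached data from memory (spilling to disk if it does not fit) instead of recomputing it. Cache only what you reuse, and release it with unpersist() when done, since cached data competes with execution memory.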

  • Proper partitioning of DataFrames can improve parallelism and reduce shuffling, enhancing performance.
  • Enhanced Parallelism: Repartitioning into a suitable number of evenly sized partitions spreads the workload across executors, so no task is left processing a much larger share of the data than the others.
  • Reduced Shuffling: Repartitioning is itself a shuffle, but partitioning data once by the key columns can let later joins or aggregations on those keys reuse that layout instead of shuffling again.

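  • DataFrames are optimized for performance and provide a higher level of abstraction compared to RDDs.
  • Higher Abstraction: DataFrames provide a declarative, more user-friendly API than RDDs: you describe what result you want, and Spark decides how to compute it.
  • Performance Optimization: The Catalyst optimizer in Spark SQL optimizes DataFrame operations, which usually makes them faster than equivalent RDD operations. The gap tends to be larger in PySpark, because RDD operations run Python functions on each record while DataFrame operations execute inside the JVM.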

  • User-defined functions (UDFs) are often slower than built-in functions. Use built-in functions whenever possible.
  • Performance Overhead: A Python UDF is called once per row, requires data to be serialized between the JVM and a Python worker process, and is a black box to the Catalyst optimizer, so it bypasses many of Spark's internal optimizations.
  • Leverage Built-in Functions: The functions in pyspark.sql.functions run inside the JVM, can be optimized by Catalyst, and often execute much faster than an equivalent UDF.
