
Leveraging PySpark.Pandas for Efficient Data Pipelines

Published at: 7/4/2024
Categories: dataengineering, spark, pandas, python
Author: felipe_de_godoy

In the world of big data, Spark has become a pivotal tool for handling and processing large datasets efficiently. However, if you're a data scientist or a data analyst accustomed to the simplicity and power of Pandas, you might find transitioning to Spark a bit daunting. That's where the Pandas API on Spark comes in! It brings the familiar Pandas syntax to the Spark ecosystem, allowing you to leverage the distributed computing power of Spark while working with a Pandas-like interface.

Why Use Pandas API on Spark?

The Pandas API on Spark allows you to:

  1. Handle Larger-Than-Memory Data: Work with datasets that exceed the memory capacity of a single machine.
  2. Leverage Distributed Computing: Benefit from the parallel processing power of a Spark cluster.
  3. Use Familiar Syntax: Transition smoothly from Pandas to Spark without having to learn a completely new API (see the short sketch after this list).
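
As a quick illustration of point 3, here is a minimal sketch (the tiny DataFrame and column names are made up for the example) showing how a pandas-on-Spark DataFrame behaves like a pandas one and how to move between the two worlds:

import pyspark.pandas as ps

# A pandas-on-Spark DataFrame looks like a pandas DataFrame,
# but it is backed by a distributed Spark DataFrame under the hood.
demo = ps.DataFrame({'name': ['ana', 'bob', 'carol'], 'gpa': [3.2, 3.8, 2.9]})

# Familiar pandas-style operations are executed by Spark.
print(demo[demo['gpa'] > 3.0].sort_values('gpa', ascending=False))

# Move between the two worlds when needed.
local_pdf = demo.to_pandas()               # collect to local pandas (must fit in memory)
back_on_spark = ps.from_pandas(local_pdf)  # distribute a pandas DataFrame onto Spark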

Setting Up Your Environment

To get started, we'll use Docker to set up a local PySpark environment. Open your terminal and run the following command:


docker run -it -p 8888:8888 jupyter/pyspark-notebook


Once the container is running, open one of the tokenized URLs printed in the container logs (for example, the one starting with http://127.0.0.1:8888) in your browser to access your PySpark-enabled Jupyter environment.
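
Once a notebook is open, an optional sanity check confirms that PySpark is available and a Spark session can start:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session and confirm the version in use
spark = SparkSession.builder.appName("SanityCheck").getOrCreate()
print(spark.version)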


Getting the Data

We'll use a dataset from Kaggle for this example: the Students Performance Dataset. Download the CSV file and place it in the appropriate location inside your Docker container (you can drag and drop it into the Jupyter file browser tab in your browser).
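
If you want to confirm the file landed where the notebook can see it, a quick check from a notebook cell works (files dragged into the Jupyter file browser are uploaded to the directory shown there, the home directory by default):

import os

# List CSV files in the notebook's working directory
print(os.getcwd())
print([f for f in os.listdir() if f.endswith('.csv')])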

Processing Data with Pandas API on Spark

With the environment set up and the file in place, you can run the following code to read, clean, visualize, and save the data to S3.

Step 1: Import Libraries and Initialize Spark Session

# Install the extra packages used later in the walkthrough (run once per container)
!pip install boto3 plotly

import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession
import boto3

# Create (or reuse) the Spark session that backs the pandas-on-Spark API
spark = SparkSession.builder.appName("PandasOnSparkExample").getOrCreate()
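
Optionally, pyspark.pandas exposes configuration options worth knowing about. For example, the default index type can be switched from the sequential index (which can funnel data through a single partition) to a distributed one; this is an optional tweak, not required for the rest of the walkthrough:

# A distributed default index avoids a single-partition bottleneck,
# at the cost of index values that are not sequential
ps.set_option("compute.default_index_type", "distributed")
print(ps.get_option("compute.default_index_type"))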

Step 2: Read Data from CSV

columns = ['StudentID', 'Age', 'Gender', 'Ethnicity', 'ParentalEducation',
'StudyTimeWeekly', 'Absences', 'Tutoring', 'ParentalSupport',
'Extracurricular', 'Sports', 'Music', 'Volunteering', 'GPA', 'GradeClass']
psdf = ps.read_csv('Student_performance_data _.csv', names=columns, header=0)
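
If you only need part of the file while exploring, read_csv in pyspark.pandas accepts several of the familiar pandas arguments; a small sketch (the column subset is just an example):

# Read only a few columns and the first 100 rows for a quick look
preview = ps.read_csv(
    'Student_performance_data _.csv',
    usecols=['StudentID', 'GPA', 'Absences'],
    nrows=100,
)
print(preview.head())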

Step 3: Exploring the Data

Check the first few rows of the dataset to ensure it's loaded correctly:

print(psdf.head())


Print column names and data types:

print(psdf.columns)
print(psdf.dtypes)
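
Two more quick checks are the overall shape and the usual summary statistics, both computed by Spark:

# Row/column counts and summary statistics for the numeric columns
print(psdf.shape)
print(psdf.describe())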

Step 4: Handling Missing Data

Handle missing data by either dropping rows with missing values:

psdf_cleaned = psdf.dropna()
print(psdf_cleaned.head())

Or filling them with a specific value:

psdf_filled = psdf.fillna(value=0)
print(psdf_filled.head())
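
fillna also accepts a per-column mapping when a single constant is too blunt; a small sketch (the columns and fill values below are arbitrary examples):

# Fill selected columns with column-specific defaults
psdf_filled_cols = psdf.fillna({'StudyTimeWeekly': 0.0, 'Absences': 0})
print(psdf_filled_cols.head())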

Step 5: Data Manipulations and Insights

Group your data and apply aggregate functions:

grouped_psdf = psdf.groupby('Gender').mean()
print(grouped_psdf)
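
groupby also supports a different aggregate per column through agg, computed in a single pass over the data; for example:

# Mean GPA and study time plus the maximum absences per gender
agg_psdf = psdf.groupby('Gender').agg(
    {'GPA': 'mean', 'StudyTimeWeekly': 'mean', 'Absences': 'max'}
)
print(agg_psdf)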


Sort your DataFrame by values:

sorted_psdf = psdf.sort_values(by='GPA', ascending=False)
print(sorted_psdf.head())
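
If all you need is a top-N slice, nlargest avoids spelling out the sort explicitly:

# Ten students with the highest GPA
print(psdf.nlargest(10, 'GPA'))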

Step 6: Visualization

Plot the weekly study time distribution. Here the column is pulled to pandas and plotted with pandas' built-in (matplotlib) plotting:

# Collect the single column to the driver as pandas and draw a histogram
psdf['StudyTimeWeekly'].to_pandas().plot(kind='hist')
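
pandas API on Spark also ships its own plotting integration, backed by plotly by default (plotly was installed in Step 1), so you can plot the GPA distribution without converting to pandas first; a small sketch:

# Histogram of GPA using the built-in plotly backend
psdf['GPA'].plot.hist(bins=20)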


Step 7: Save as Compressed Parquet and Upload to S3

Save the DataFrame as a single compressed Parquet file. Note that psdf.to_parquet goes through Spark and writes a directory of part files, so to get one local file for the boto3 upload below, collect this small dataset to pandas first:

parquet_file = 'student_data.parquet.gzip'
# Collecting to pandas yields a single local file (requires pyarrow)
psdf.to_pandas().to_parquet(parquet_file, compression='gzip')

Upload the Parquet file to S3 using boto3:

s3_bucket = 'your-s3-bucket-name'
s3_key = 'path/to/save/student_data.parquet.gzip'

# Initialize a session using Amazon S3
s3 = boto3.client('s3')

# Upload the file to S3
s3.upload_file(parquet_file, s3_bucket, s3_key)
print(f"File uploaded to s3://{s3_bucket}/{s3_key}")
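
Alternatively, if the Spark environment has the Hadoop S3A connector and AWS credentials configured (an assumption; the stock jupyter/pyspark-notebook image does not set this up), the DataFrame can be written straight to S3 without a local file or boto3:

# Spark writes the compressed Parquet part files directly to the bucket
psdf.to_parquet(
    f's3a://{s3_bucket}/path/to/save/student_data_parquet/',
    compression='gzip',
)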

Conclusion

The Pandas API on Spark bridges the gap between Pandas and Spark, offering you the best of both worlds. Whether you're handling massive datasets or looking to scale your data processing pipelines effortlessly, this API empowers you to harness the full power of Spark with the simplicity of Pandas.

Try it out and supercharge your data analytics workflow today!

For more details, you can refer to Spark's official documentation.

Happy data wrangling!

Repo: https://github.com/felipe-de-godoy/spark_with_pandas
