
Machine Learning with Spark and Groovy

Published: 7/21/2024
Categories: machinelearning, groovy, spark
Author: jagedn

In a previous post, machine-learning-groovy.html (in Spanish), I played with grouping similar customers using Groovy + Apache Ignite.

To be honest, although it works, I think that script is very obscure, so I was looking for another approach and ended up with Spark. In this post I'll cover the same problem using Spark, but in local mode: I won't use multiple nodes, as my data is a small file with roughly 1000 records and fits comfortably on my computer.

To recap:

We have collected different features for each customer, such as number of users, number of documents finished, time to complete, web or API usage, and so on.

As our main data comes from a big MySQL database, we have created a "feature" table fed by several GROUP BY queries, so we have something similar to:

| CustomerId | Users | Finished | Days | API  | Web   | … |
| 1          | 2     | 21221    | 22   | 2212 | 18000 | … |
| 2          | 1     | 221      | 2    | 21   | 200   | … |
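
For illustration, the kind of aggregation that feeds this table might look like the sketch below. The source tables (customers, documents) and their columns are hypothetical; mlDB is the groovy.sql.Sql connection that the later snippets reuse, with placeholder host, schema and credentials:

// mlDB: a groovy.sql.Sql connection to the MySQL server (placeholders)
def mlDB = groovy.sql.Sql.newInstance(
        'jdbc:mysql://localhost:3306/ml',
        'user', 'password',
        'com.mysql.jdbc.Driver')

// hypothetical source tables feeding the real feature columns
mlDB.execute '''
    INSERT INTO ml.clientes (nombre, nusers, ndocuments, docusfinished)
    SELECT c.nombre,
           COUNT(DISTINCT d.user_id),
           COUNT(d.id),
           SUM(CASE WHEN d.finished = 1 THEN 1 ELSE 0 END)
    FROM customers c
    JOIN documents d ON d.customer_id = c.id
    GROUP BY c.nombre
'''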

Project

This time I'll use a Gradle project (you can also use Maven) instead of a Groovy script. I ran into some problems with Ivy downloading the dependencies, so I decided to create a project with only one class (yes, I know, I know…)

build.gradle

dependencies {
    // Use the latest Groovy version for building this library
    implementation 'org.apache.groovy:groovy-all:4.0.11'

    implementation 'mysql:mysql-connector-java:5.0.5'
    implementation 'org.apache.spark:spark-core_2.13:3.5.1'
    implementation 'org.apache.spark:spark-mllib_2.13:3.5.1'
    implementation 'org.apache.spark:spark-sql_2.13:3.5.1'

}
java {
    toolchain {
        languageVersion = JavaLanguageVersion.of(11)
    }
}

(Java 17 didn't work for me with Spark, hence the Java 11 toolchain.)
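
Since everything lives in a single class, these are the imports the snippets below rely on (the standard Spark ML, Spark SQL and Groovy packages):

import groovy.json.JsonOutput
import groovy.sql.Sql
import org.apache.spark.SparkConf
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession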

Features

We'll define a mapping between "visual labels" and MySQL columns:

def labels = [
    'Users' : 'nusers',
    'Documents': 'ndocuments',
    'Finished' : 'docusfinished',
    'Days' : 'daystofinish',
    'API' : 'template',
    'Web' : 'web',
    'Workflow' : 'workflow'
]

And we’ll generate a CSV file:

def rows = mlDB.rows('select * from ml.clientes order by nombre')

def file = new File("out.csv")
file.text = (["Company"] + labels.keySet()).join(";") + "\n"
rows.eachWithIndex { row, idx ->
    // one value per label column, defaulting nulls to 0.0
    def details = labels.values().collect { column ->
        (row[column] ?: 0.0).toString()
    }
    file << "${idx + 1};" + details.join(';') + "\n"
}

So far nothing special: just a CSV file with a header, using ";" as the separator.
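
For reference, the first lines of out.csv end up looking roughly like this (the values mirror the sample rows shown further down; your data will differ):

Company;Users;Documents;Finished;Days;API;Web;Workflow
1;238.0;26906.0;20987.0;4.0;0.0;1.0;0.0
2;1.0;16.0;9.0;0.0;0.0;0.0;0.0
3;80.0;0.0;0.0;0.0;0.0;0.0;0.0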

Spark

Next we'll create a "local" Spark session and read the CSV:

def spark = SparkSession
        .builder()
        .appName("CustomersKMeans")
        .config(new SparkConf().setMaster("local"))
        .getOrCreate()

def dataset = spark.read()
        .option("delimiter", ";")
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("out.csv")

Transforming the original dataset

We need to transform our dataset, adding some new columns and normalizing others:

def assembler = new VectorAssembler(inputCols: labels.keySet(), outputCol: "features")

dataset = assembler.transform(dataset)

def scaler = new StandardScaler(inputCol: "features", outputCol: "scaledFeatures", withStd: true, withMean: true)

def scalerModel = scaler.fit(dataset)

dataset = scalerModel.transform(dataset)

We add a new "features" column that packs all the label columns (defined at the beginning) into a single vector, so features will contain "Users, Documents, API, …"

Also, we transform the data using a StandardScaler, so every column is standardized using its mean and standard deviation.
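
To double-check the transformation you can print the schema and peek at the two new columns (a quick sanity check; the output depends on your data):

dataset.printSchema()
dataset.select("features", "scaledFeatures").show(3, false)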

Running k-means

def kmeans = new KMeans(k:5 ,seed:1, predictionCol: "Cluster", featuresCol: "scaledFeatures" )

def kmeansModel = kmeans.fit(dataset)

// Make predictions
def predictions = kmeansModel.transform(dataset)

We want to have 5 groups (this is a "business" requirement; there are ways to find the optimal number). We "create" a new column called "Cluster" that indicates which group each row falls into (by default the column is called "prediction"), and we also indicate which column to use as features, in this case the "scaledFeatures" column created previously.
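
If you ever need to validate that business number, a common approach is to compare silhouette scores across several values of k. A minimal sketch using Spark's ClusteringEvaluator (higher scores are better):

def evaluator = new org.apache.spark.ml.evaluation.ClusteringEvaluator(
        predictionCol: "Cluster", featuresCol: "scaledFeatures")

(2..8).each { k ->
    def model = new KMeans(k: k, seed: 1,
            predictionCol: "Cluster", featuresCol: "scaledFeatures").fit(dataset)
    // silhouette is in [-1, 1]; closer to 1 means better-separated clusters
    println "k=$k silhouette=${evaluator.evaluate(model.transform(dataset))}"
}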

Showing the cluster

We'll create a copy of the original dataset and join onto it the new Cluster column from predictions:

def copy = dataset.alias("copy")
copy = copy.join(predictions.select("Company", "Cluster"), "Company", "inner")
copy.show(3)

+-------+-----+---------+--------+----+---+---+--------+--------------------+--------------------+-------+
|Company|Users|Documents|Finished|Days|API|Web|Workflow| features| scaledFeatures|Cluster|
+-------+-----+---------+--------+----+---+---+--------+--------------------+--------------------+-------+
|      1|238.0|  26906.0| 20987.0| 4.0|0.0|1.0|     0.0|[238.0,26906.0,20...|[5.09496005230008...|      0|
|      2|  1.0|     16.0|     9.0| 0.0|0.0|0.0|     0.0|(7,[0,1,2],[1.0,1...|[-0.3286794172192...|      0|
|      3| 80.0|      0.0|     0.0| 0.0|0.0|0.0|     0.0|      (7,[0],[80.0])|[1.47920040595383...|      0|
+-------+-----+---------+--------+----+---+---+--------+--------------------+--------------------+-------+
only showing top 3 rows

As you can see, we now have a copy of the dataset with the original data plus a Cluster column indicating which cluster each customer belongs to.

From here you can use this information to iterate over the dataset and generate some diagrams, update a database, … or create some HTML charts.
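
For instance, a minimal sketch that pulls the joined rows back to the driver and prints each assignment (you could update MySQL here instead):

copy.select("Company", "Cluster").collectAsList().each { row ->
    println "customer ${row.getAs('Company')} -> cluster ${row.getAs('Cluster')}"
}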

Chart.js

In this post we want to create an HTML visualization using chart.js, so we'll generate a data.js file containing a JavaScript object to embed in an index.html:

def json = [
    labels:labels.keySet(),
    datasets:[]
]
kmeansModel.clusterCenters().eachWithIndex { v, i ->
    json.datasets << [
            label:"Cluster ${i+1}",
            data: v.toArray(),
            fill: true
    ]
}
new File("data.js").text = "const dataArr = "+JsonOutput.prettyPrint(JsonOutput.toJson(json))

The most important part here is kmeansModel.clusterCenters().

We create a JSON object where datasets is an array of objects (the structure chart.js requires), each containing an array of doubles with the coordinates of one cluster center.
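
For reference, the generated data.js has this shape; with k=5 there will be five entries in datasets, and the numbers below are purely illustrative:

const dataArr = {
    "labels": ["Users", "Documents", "Finished", "Days", "API", "Web", "Workflow"],
    "datasets": [
        { "label": "Cluster 1", "data": [0.51, 0.48, 0.47, 0.12, 0.09, 0.33, 0.05], "fill": true },
        { "label": "Cluster 2", "data": [-0.33, -0.29, -0.28, -0.08, -0.11, -0.25, -0.04], "fill": true }
    ]
}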

Lastly, we have an index.html that includes chart.js and data.js:

<head>
    <meta charset="UTF-8">
    <title>Clustering Customers</title>
    <script src="https://momentjs.com/downloads/moment.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <script src="./data.js"></script>
</head>
...
    <div style="width:75%; height: 40rem">
        <canvas id="canvas"></canvas>
    </div>
...
<script>
    const config = {
        type: 'radar',
        data: dataArr,
        options: {
            elements: {
                line: {
                    borderWidth: 3
                }
            }
        },
    };

    window.onload = function() {
        var ctx = document.getElementById('canvas').getContext('2d');
        window.myLine = new Chart(ctx, config);
    };
</script>
...

Nothing special, but visually very attractive:

[Image: radar chart of the k-means cluster centers]

Source

Here you can find the source code:

https://gist.github.com/jagedn/184302ac4f89def14410f8a6f54a93ea
