dev-resources.site
for different kinds of informations.
Mastering MongoDB: Top 5 Essential Aggregation Pipelines
Introduction
MongoDB is a leading NoSQL database with a schema-less nature, and the ability to handle large volumes of data for modern web applications.
Today MongoDB is been used in:
Streamline building AI-enriched applications.
Video Games that are global, reliable, and scalable.
Real-time analytical views of all the most important data.
Analyze and act on data from the physical world i.e. Internet of Things (IoT).
Store, edit, and present all types of content management.
Prerequisites
Basic knowledge of MongoDB. You can install MongoDB version 6.0 from here and can also use the MongoDB Atlas to try out these examples.
You can run the examples by using the MongoDB Playground.
What is Aggregation and its pipelines?
Aggregation:
MongoDB inherits capabilities to perform complex transformations and calculations, extracting meaningful information we need to aggregate data.
Aggregation simply refers to the process of collecting and processing data to produce a single consolidated output or result. E.g. calculating the sum, average, count, or other statistical measures from the dataset.
Aggregation Pipelines:
In MongoDB, the aggregation pipeline processes documents i.e. the data records, and returns computed results. A pipeline is made up of one or more stages, and each stage transforms the data i.e. documents as it moves through the pipeline.
The basic flow for any aggregation looks something like this as shown in the image below.
This framework allows for flexible, complex, and powerful operations on data, enabling complex queries and data transformations to be performed directly in databases.
Top 5 Essential Aggregation Pipelines
The top 5 essential aggregation pipelines which every MongoDB developer should know are
$match
$group
$project
$lookup
$unwind
We will discuss each aggregation in detail - from prototype form i.e. syntax to examples in the following context. We will be using some sample datasets, so let's deep dive into this.
Sample Dataset
Let's imagine we are handling data for an e-commerce platform. Below is the sample dataset for the products
and orders
collection to understand the top 5 essential aggregation pipelines.
Sample Dataset for products
collection:
// Products Collection
[
{
"_id": 101,
"productName": "Laptop",
"brand": "TechCorp",
"specs": ["16GB RAM", "512GB SSD", "Intel i7"]
},
{
"_id": 102,
"productName": "Headphones",
"brand": "AudioTech",
"specs": ["Noise Cancelling", "Bluetooth", "8hr Battery"]
},
{
"_id": 103,
"productName": "Mouse",
"brand": "TechCorp",
"specs": ["Wireless", "Optical"]
},
{
"_id": 104,
"productName": "Keyboard",
"brand": "TechWorks",
"specs": ["Mechanical", "RGB Backlit"]
}
]
Sample dataset for orders
collection:
// Orders Collection
[
{
"_id": 1,
"customerName": "Alice",
"productId": 101,
"quantity": 2,
"price": 1000,
"orderDate": "2023-08-01",
"shippingCity": "New York"
},
{
"_id": 2,
"customerName": "Bob",
"productId": 102,
"quantity": 5,
"price": 50,
"orderDate": "2023-08-05",
"shippingCity": "Los Angeles"
},
{
"_id": 3,
"customerName": "Charlie",
"productId": 101,
"quantity": 1,
"price": 1000,
"orderDate": "2023-08-05",
"shippingCity": "New York"
},
{
"_id": 4,
"customerName": "Alice",
"productId": 103,
"quantity": 3,
"price": 20,
"orderDate": "2023-08-08",
"shippingCity": "Boston"
},
{
"_id": 5,
"customerName": "Eve",
"productId": 104,
"quantity": 1,
"price": 70,
"orderDate": "2023-08-09",
"shippingCity": "San Francisco"
}
]
Please note: productId
in orders
collection refers to the product's _id
$match
Filters the documents to pass only the document that matches the specified condition(s) to the next pipeline stage.
The $match
has the following prototype form:
{$match: { <query> } }
Now based on our sample data from orders
collection, let's write a $match
query to filter orders that are shipped from New York City.
Check the query below:
db.orders.aggregate([
{$match: { shippingCity: "New York" } }
)]
Based on the dataset of orders
collection there will be two documents as a result. The result for the above aggregation pipeline will be as follows:
[
{
"_id": 1,
"customerName": "Alice",
"orderDate": "2023-08-01",
"price": 1000,
"productId": 101,
"quantity": 2,
"shippingCity": "New York"
},
{
"_id": 3,
"customerName": "Charlie",
"orderDate": "2023-08-05",
"price": 1000,
"productId": 101,
"quantity": 1,
"shippingCity": "New York"
}
]
$group
This stage separates documents into groups according to the "group key". The output is one document for each unique group key. A group key is often a field or group of fields e.g _id
, productId
💡 $group does not order its output documents.
The $group
stage has the following prototype form:
{
$group: {
_id: <expression>, // Group Key
<field1> : { <accumulator1> : <expression1> }, // Optional
...
}
}
Let's determine the total orders for each customer. Based on the orders
collection this can be determined by the following query:
db.orders.aggregate([
{
$group: {
_id: "customerName",
totalOrders: { $sum : 1 }
}
}
])
The above query will give us the following result:
[
{
"_id": "Bob",
"totalOrders": 1
},
{
"_id": "Eve",
"totalOrders": 1
},
{
"_id": "Alice",
"totalOrders": 2
},
{
"_id": "Charlie",
"totalOrders": 1
}
]
You can see from the above results that the results are not in order and also only the _id
i.e. customerName
and the accumulators i.e. totalOrders
are displayed as output.
$project
This aggregation is used to pass along the documents with the requested fields to the next stage in the pipeline. The specified fields can be existing fields from the input documents or newly computed fields.
The $project
stage has the following prototype form:
{ $project: { <specification(s)> } }
Based on the orders
collection, let's process the documents to get the total order value of each order.
In the orders
collection, we have two fields price
and quantity
. Let's use the $multiply
operator to get the total order value.
db.orders.aggregate([
{
$project : {
orderValue: { $multiply: ['$quantity', 'price'] },
customerName: 1
}
}
])
The output of the above query will be as follows: We can see from the results that only customerName
, orderValue
, together with _id
fields are projected in the next pipeline stage.
[
{
"_id": 1,
"customerName": "Alice",
"orderValue": 2000
},
{
"_id": 2,
"customerName": "Bob",
"orderValue": 250
},
{
"_id": 3,
"customerName": "Charlie",
"orderValue": 1000
},
{
"_id": 4,
"customerName": "Alice",
"orderValue": 60
},
{
"_id": 5,
"customerName": "Eve",
"orderValue": 70
}
]
💡 Suppression of the _id
field can be done by using _id: 0
in the $project stage.
$lookup
One of the most important aggregations. If you have basic SQL knowledge, it performs a left outer join to a collection in the same database to filter in documents from the "joined" collection for processing.
The $lookup
stage adds a new array field to each input document. The new array field contains the matching documents from the "joined" collection.
The $lookup
has three types of prototype forms:
Equality Match with a Single Join collection : To perform an equality match between a field from the input documents with a field from the documents of the "joined" collection.
Join Conditions and Subqueries on a Joined Collection : To perform correlated and uncorrelated subqueries with two collections, and perform other join conditions besides a single equality match, use the below syntax.
Correlated Subqueries Using Concise Syntax : The below syntax is used which removes the requirement for an equality match on the foreign local fields inside of an
$expr
operator:
Let's write a query to join products and orders using $lookup
. We are using the reference id productId
from the orders
collection
db.orders.aggregate([
{
$lookup: {
from: "products",
localField: "productId",
foreignField: "_id",
as: "productDetails"
}
}
]);
The above query will display the result as follows:
[
{
"_id": 1,
"customerName": "Alice",
"orderDate": "2023-08-01",
"price": 1000,
"productDetails": [
{
"_id": 101,
"brand": "TechCorp",
"productName": "Laptop",
"specs": [
"16GB RAM",
"512GB SSD",
"Intel i7"
]
}
],
"productId": 101,
"quantity": 2,
"shippingCity": "New York"
},
{
"_id": 2,
"customerName": "Bob",
"orderDate": "2023-08-05",
"price": 50,
"productDetails": [
{
"_id": 102,
"brand": "AudioTech",
"productName": "Headphones",
"specs": [
"Noise Cancelling",
"Bluetooth",
"8hr Battery"
]
}
],
"productId": 102,
"quantity": 5,
"shippingCity": "Los Angeles"
},
{
"_id": 3,
"customerName": "Charlie",
"orderDate": "2023-08-05",
"price": 1000,
"productDetails": [
{
"_id": 101,
"brand": "TechCorp",
"productName": "Laptop",
"specs": [
"16GB RAM",
"512GB SSD",
"Intel i7"
]
}
],
"productId": 101,
"quantity": 1,
"shippingCity": "New York"
},
{
"_id": 4,
"customerName": "Alice",
"orderDate": "2023-08-08",
"price": 20,
"productDetails": [
{
"_id": 103,
"brand": "TechCorp",
"productName": "Mouse",
"specs": [
"Wireless",
"Optical"
]
}
],
"productId": 103,
"quantity": 3,
"shippingCity": "Boston"
},
{
"_id": 5,
"customerName": "Eve",
"orderDate": "2023-08-09",
"price": 70,
"productDetails": [
{
"_id": 104,
"brand": "TechWorks",
"productName": "Keyboard",
"specs": [
"Mechanical",
"RGB Backlit"
]
}
],
"productId": 104,
"quantity": 1,
"shippingCity": "San Francisco"
}
]
You can see from the above response that an additional field productDetails
is added which itself is an array and includes the matching documents based on the "joined" collection.
$unwind
This aggregation deconstructs an array field from the input documents to output a document for each element.
Each output document is the input document with the value of the array field replaced by the element.
The prototype form used for $unwind
aggregation is as follows:
{
$unwind:
{
path: <field path>,
includeArrayIndex: <string>, // Optional
preserveNullAndEmptyArrays: <boolean> // Optional, default value: false
}
}
Let's use $unwind
aggregation on the Product specs field of products
collection: The syntax for the query will be as follows:
db.products.aggregate([
{
$unwind: {path: "$specs" }
}
]);
The response to using $unwind
will be deconstructed form of specs
array field:
[
{
"_id": 101,
"brand": "TechCorp",
"productName": "Laptop",
"specs": "16GB RAM"
},
{
"_id": 101,
"brand": "TechCorp",
"productName": "Laptop",
"specs": "512GB SSD"
},
{
"_id": 101,
"brand": "TechCorp",
"productName": "Laptop",
"specs": "Intel i7"
},
{
"_id": 102,
"brand": "AudioTech",
"productName": "Headphones",
"specs": "Noise Cancelling"
},
{
"_id": 102,
"brand": "AudioTech",
"productName": "Headphones",
"specs": "Bluetooth"
},
{
"_id": 102,
"brand": "AudioTech",
"productName": "Headphones",
"specs": "8hr Battery"
},
{
"_id": 103,
"brand": "TechCorp",
"productName": "Mouse",
"specs": "Wireless"
},
{
"_id": 103,
"brand": "TechCorp",
"productName": "Mouse",
"specs": "Optical"
},
{
"_id": 104,
"brand": "TechWorks",
"productName": "Keyboard",
"specs": "Mechanical"
},
{
"_id": 104,
"brand": "TechWorks",
"productName": "Keyboard",
"specs": "RGB Backlit"
}
]
Best Practices
Like any other high-performance tool, MongoDB also performs best when it is used properly. Here are some of the best practices for making your MongoDB perform best.
Schema Design : Use embedded documents for data that are used more often for read operations. Use of references for data that won't be accessed every time.
Indexes : Always index fields that are often required, for e.g
productId
in theorders
collection.Aggregation : Use
$match
early to reduce the amount of data that subsequent stages need to process.Data Modelling : Avoid the usage of large arrays that will grow over time. For e.g
Product Specs
Limit the Result : Use
$limit
the operator to limit the result from large datasets, wherever applicable.
More Aggregation Queries to Try Out
Do you want to try out more on aggregation, below are some of the queries for you to try out at your end.
Calculate Total Sales for Each Product Spec.
Group by date to find the total sales for each day.
Filter orders that exceed a certain price (let's assume $500).
Find out the total revenue from each city.
Sort customers based on the number of items they purchased.
Let me know the results in the comments section, you can also look into the solutions provided by me, check the References section, or click here.
How to run the examples
You can run the examples by using the MongoDB Playground. You can follow the instructions here to get started with MongoDB Playground.
Conclusion
MongoDB is a NoSQL document database for modern developers working on high-performance applications. With its JSON-like documents, MongoDB allows developers to model data however the application requires. With nesting arrays and embedded sub-documents, it allows the modeling of complex relationships between data.
I hope you have learned something new as I did. If so, kindly like and share the article and also follow me to read more exciting articles. You can check my social links here.
References
Use MongoDB Playground to run the examples in your browser without installing anything.
You can find the sample datasets and also the solutions to examples in my repo https://github.com/amanpreet-dev/essential-aggregation-pipelines
Do not forget to give a star for the repo and also follow me on GitHub
Featured ones: