The Adventures of Blink S2e9: Gathering Metrics with Prometheus and Grafana
Hey friends, and welcome to the next Adventure of Blink! If you've been following along, we've done a ton of cool stuff this season:
- We learned about Docker
- We configured a MongoDB (in a Docker container with persistent storage)
- We made a Flask API for the database (also in a Docker container)
- We've explored Test-Driven Development practices with PyTest
- We made our tests run every time we commit to our repository using GitHub Actions
- We created a graphical interface for our program using Tkinter
- We scanned our project for security vulnerabilities using Snyk
...suffice it to say, we've been really busy. But we're not through yet! Today we're covering another oft-overlooked topic:
Observability!
Observability is a core component of the DevOps mindset... because it's a place where Dev and Ops can easily interact. Ops is usually on the receiving end of support tickets and user complaints... but it's hard to diagnose something like "the application is slow!" without firm evidence of what happened. But if your developers aren't considering metrics and observability behavior when they're coding, you're not going to have those metrics for Ops to confirm a user's complaint.
When to apply metrics
The answer to this is... as early as possible! Build metrics into your code while you're writing it, and become accustomed to using them throughout the development process.
Why are they important?
Using metrics in the development process ensures that you understand from the beginning how the application behaves. You'll want to consider things like load testing as you complete your work, ensuring that you see how your code behaves when there are lots of users running it at once.
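If you'd like to try a bit of load testing yourself, here's a minimal sketch using only the Python standard library. It fires a batch of concurrent requests and reports a few timing numbers. The URL (`http://localhost:5001/getall`) is an assumption about how the API is exposed on your machine, and `fetch`, `run_load`, and `summarize` are hypothetical helper names, so adjust them to fit your setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url: str) -> int:
    """Issue one GET request and return the HTTP status code."""
    with urlopen(url, timeout=5) as response:
        return response.status


def run_load(call, total_requests: int, workers: int) -> list[float]:
    """Invoke `call` total_requests times across a thread pool.

    Returns the wall-clock duration of each call, in seconds.
    """
    def timed_call(_):
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def summarize(durations: list[float]) -> dict:
    """Reduce a list of durations to a few headline numbers."""
    ordered = sorted(durations)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": sum(ordered) / len(ordered),
    }


if __name__ == "__main__":
    # Assumed URL: change the host, port, and route to match your setup.
    url = "http://localhost:5001/getall"
    print(summarize(run_load(lambda: fetch(url), total_requests=200, workers=20)))
```

A dedicated load-testing tool will give you far more control, but even a small script like this generates enough traffic to make the metrics move.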
How metrics are created
We're going to introduce two products to our application environment: Prometheus and Grafana.
Prometheus is a collection mechanism for the metrics we define in our code. It runs in a Docker container as part of our environment and periodically scrapes (pulls) metrics from an HTTP endpoint that our code exposes... yes, that means we have some code changes to make, but it should be pretty easy work. Grafana, also running in a container, reads from Prometheus and turns those metrics into dashboards.
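To make that a little less abstract: what Prometheus collects from an application is a plain-text page of metric names and values. Here's a hand-rolled sketch of that text format for a single counter. The `render_counter` function is purely illustrative (it's not part of any library); in the real app the `prometheus_client` package produces this output for us, usually with a few extra lines.

```python
def render_counter(name: str, help_text: str, value: float) -> str:
    """Render one unlabeled counter in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name} {float(value)}\n"
    )


print(render_counter(
    "flask_app_requests_total",
    "Total number of requests to the app",
    3,
))
# Prints:
# # HELP flask_app_requests_total Total number of requests to the app
# # TYPE flask_app_requests_total counter
# flask_app_requests_total 3.0
```

Every time Prometheus collects from the app, it reads a page like this and stores each value with a timestamp. That's how a single number turns into a graph over time.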
What metrics are important to us?
In our Hangman game, there's not a lot of processing going on. As a result, adding metrics for the performance of the application itself? Probably not all that useful.
A place where metrics would be useful would be around the API calls. If anything's going to malfunction, it's going to be the data extraction code... after all, that's the place where multiple containers get involved and where data has to flow from one system to another seamlessly. So we'll add our metrics instrumentation to the API code.
Setting up the tools
First, let's add the Prometheus, MongoDB exporter, and Grafana containers to our docker-compose.yml file:
  # Add these alongside the existing services in docker-compose.yml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      # This prometheus.yml file we will create shortly 😉
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  mongo-exporter:
    image: bitnami/mongodb-exporter:latest
    environment:
      # Note the name of our mongo container here
      MONGODB_URI: "mongodb://mongo:27017"
    depends_on:
      - mongo
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
Next, let's build the prometheus.yml file that establishes the configuration for Prometheus:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'flask-api'
    # We'll have to create a /metrics endpoint in the API...
    metrics_path: '/metrics' # Endpoint from Flask app
    static_configs:
      # This target is our api container and port
      - targets: ['hangman-api:5001']
  - job_name: 'mongo'
    metrics_path: '/metrics' # Endpoint for the MongoDB Exporter
    static_configs:
      - targets: ['mongo-exporter:9216'] # Default port for MongoDB Exporter
That leads us to make our code changes in the API:
from flask import Flask, jsonify, request
# Adding in prometheus_client to help us build the metrics additions
from prometheus_client import generate_latest, Counter, Gauge, Histogram
from prometheus_client import multiprocess, CollectorRegistry
from pymongo import MongoClient
from pymongo.errors import PyMongoError
from bson.objectid import ObjectId
from datetime import datetime
import os
app = Flask(__name__)
# MongoDB connection
mongo_uri = os.getenv("MONGO_URI_API")
db_name = os.getenv("DB_NAME")
collection_name = os.getenv("COLLECTION_NAME")
# When testing locally, we bypass the .env and load the variables manually
# mongo_uri = "mongodb://blink:theadventuresofblink@localhost:27017/hangman?authSource=admin"
# db_name = "hangman"
# collection_name = "phrases"
client = MongoClient(mongo_uri)
db = client[db_name]
collection = db[collection_name]
# Here's where we set up the metrics objects we're going to need:
REQUEST_COUNT = Counter('flask_app_requests_total', 'Total number of requests to the app')
REQUEST_LATENCY = Histogram('flask_app_request_latency_seconds', 'Latency of requests to the app')
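# This route is the one Prometheus scrapes to collect the metrics.
# generate_latest() is a library function that serializes every
# metric in the default registry into the Prometheus text format.
@app.route('/metrics')
def metrics():
    # Label the response as Prometheus text; Flask would otherwise send it as text/html
    return generate_latest(), 200, {'Content-Type': 'text/plain; version=0.0.4; charset=utf-8'}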
@app.route('/getall', methods=['GET'])
def get_all_items():
    # This is an example of how to instrument a method.
    # Notice we increment the request count, and then
    # the request latency is measured by putting the entire
    # method's code inside a with statement that captures
    # its timing
    REQUEST_COUNT.inc()
    with REQUEST_LATENCY.time():
        try:
            # Find all records in the collection
            words = list(collection.find({}, {"_id": 0}))  # Exclude _id field from the response
            return jsonify(words), 200
        except Exception as e:
            return jsonify({"error": str(e)}), 500
For brevity's sake I didn't include the rest of the API code... but each route needs to be instrumented individually. You can add more metrics if you'd like to observe different behaviors separately.
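If repeating the counter and timer lines in every route feels tedious, one option is to wrap them in a decorator. This is a sketch of that idea rather than what the project's code actually does: `instrument` is a hypothetical helper name, and it assumes the objects passed in behave like `prometheus_client`'s `Counter` (an `inc()` method) and `Histogram` (a `time()` method that returns a context manager).

```python
from functools import wraps


def instrument(counter, histogram):
    """Build a decorator that counts calls and times each one.

    `counter` needs an inc() method and `histogram` needs a time()
    method returning a context manager.
    """
    def decorator(func):
        @wraps(func)  # keep the original function name for Flask's routing
        def wrapper(*args, **kwargs):
            counter.inc()
            with histogram.time():
                return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical usage in the API, with @app.route kept outermost so that
# Flask registers the instrumented function:
#
# @app.route('/getall', methods=['GET'])
# @instrument(REQUEST_COUNT, REQUEST_LATENCY)
# def get_all_items():
#     ...
```

Sharing one counter and one histogram across every route gives you totals for the whole API. If you want numbers per route, you could pass a different pair of metric objects to each decorator.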
Another note: make sure you add prometheus_client to the API's requirements.txt!
blinker==1.8.2
click==8.1.7
dnspython==2.7.0
Flask==3.0.3
itsdangerous==2.2.0
Jinja2==3.1.4
MarkupSafe==3.0.2
prometheus_client==0.21.0
pymongo==4.10.1
Werkzeug==3.1.1
Validating that it all works
Now that we've finished setup, we can start up our application:
# Compose v1 (standalone docker-compose binary)
docker-compose up --build
# Compose v2 (plugin bundled with current Docker installs, on any OS)
docker compose up --build
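Once the containers are up, a quick way to confirm that Prometheus can actually reach both scrape targets is to ask its HTTP API. Here's a small sketch that does so with the standard library. It assumes Prometheus is reachable at `http://localhost:9090` (the port we mapped in the compose file); `summarize_targets` and `fetch_targets` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def summarize_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Extract (job, health, last error) for each active scrape target."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "?")
        rows.append((job, target["health"], target.get("lastError", "")))
    return rows


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Call the Prometheus targets API and return the decoded JSON body."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


if __name__ == "__main__":
    for job, health, error in summarize_targets(fetch_targets()):
        print(f"{job:12} {health:8} {error}")
```

If a job reports a health other than "up", the last error usually points at the cause, such as a wrong container name or port in prometheus.yml. The same information is available in the Prometheus web UI under Status > Targets.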
We can see our new containers on ports 9090 (Prometheus) and 3000 (Grafana). Let's start in Prometheus, and set up some metrics queries:
...
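As a starting point for queries, here are two PromQL expressions built from the metric names we defined in the API code, along with a sketch that runs them through the Prometheus HTTP API. These are suggestions rather than the exact queries used above: the time windows (`1m`, `5m`), the 95th percentile, the `http://localhost:9090` address, and the helper names are all assumptions you can change.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Candidate queries based on the metric names from the API code.
QUERIES = {
    # Per-second request rate, averaged over the last minute
    "requests_per_second": "rate(flask_app_requests_total[1m])",
    # Estimated 95th percentile latency over the last five minutes
    "p95_latency_seconds": (
        "histogram_quantile(0.95, "
        "sum(rate(flask_app_request_latency_seconds_bucket[5m])) by (le))"
    ),
}


def query_url(promql: str, base_url: str = "http://localhost:9090") -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def run_query(promql: str, base_url: str = "http://localhost:9090") -> list:
    """Run an instant query and return the list of result series."""
    with urlopen(query_url(promql, base_url), timeout=5) as response:
        return json.load(response)["data"]["result"]


if __name__ == "__main__":
    for label, promql in QUERIES.items():
        print(label, run_query(promql))
```

You can paste the same expressions straight into the query box of the Prometheus UI, and later into a Grafana panel.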
Then once we've got them created, we can head over to Grafana to visualize them:
...
Wrapping up
These examples are small and somewhat contrived, in keeping with our theme of exploring these concepts in a small app so we can see how they work without the distraction of scale and complexity. But hopefully you can see from even these examples how much power you have as a developer to see what's happening within your code! This may seem like a lot of work for something that doesn't actually make our game any better or more interesting to play, but the value of metrics is in being able to diagnose more easily when something isn't right.
I hope you've learned a lot this week! We are nearly to the end of Season 2, and I'll tell ya what... it has been such a ride. Our season finale is going to be the long-awaited AI integration - we're going to let a Large Language Model build hangman games for us to play! So tune in next week for another Adventure of Blink!