AIOps Powered by AWS: Developing Intelligent Alerting with CloudWatch & Built-In Capabilities

Published at
1/5/2025
Categories
aiops
aws
awscloudoperations
sre
Author
indika_wimalasuriya

AIOps is no longer the next big thing; the journey has already started, and you need to get on board as quickly as possible. This is the first post in a four-part series on implementing a comprehensive AIOps framework on AWS. The series consists of:

Part 1: AIOps Powered by AWS: Developing Intelligent Alerting with CloudWatch & Built-In Capabilities
This is the post I’m going to walk you through today.

Part 2: AIOps with AWS: Building Custom Machine Learning Models for Enhanced Alerting and Insights
This post will explore how to implement AIOps using your own data and custom ML models for tailored intelligence.

Part 3: AIOps via AWS: Enabling Intelligent Resolution with Self-Healing Bots and Automation
This post will cover self-healing bots and other automated solutions that resolve issues without human intervention.

Part 4: AIOps with AWS: Leveraging GenAI for Smarter, AI-Powered Solutions in IT Operations
This post will focus on how Generative AI can offer advanced solutions for operational intelligence and automation.

AIOps Powered by AWS: Developing Intelligent Alerting with CloudWatch & Built-In Capabilities

AIOps stands for Artificial Intelligence for IT Operations. Distributed systems keep growing more complex as monolithic applications are broken into microservices and hosted in the cloud. Every service adds its own telemetry sources, so data volume surges and the number of possible failure scenarios rises sharply. As a result, humans can no longer manage these systems alone and need support from AI.

AIOps in a nutshell

AWS offers several services for implementing AIOps. Before we go further, let's identify some of the most common AIOps use cases.

Anomaly Detection:

  • Metric Anomaly Detection – This involves identifying anomalies in metrics (e.g., anomalies in order failure rate or API response time).

[Figure: AWS Metric Anomaly Detection. Source: AWS]

  • Log Anomaly Detection – This involves identifying new error messages appearing in logs or a rise in the occurrence of errors.

[Figure: AWS Log Anomaly Detection. Source: AWS]
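
To make the "new error messages appearing in logs" idea concrete, here is a minimal sketch that reduces log lines to patterns and reports patterns that never appeared in a baseline period. This is an illustration of the concept only, not how CloudWatch implements log anomaly detection; the masking rules, function names, and sample log lines are all assumptions.

```python
import re
from collections import Counter

# Mask tokens that vary between otherwise identical messages.
_HEX_ID = re.compile(r"\b[0-9a-f]{8,}\b")
_NUMBER = re.compile(r"\d+")


def to_pattern(message):
    """Reduce a log line to a pattern by masking long hex IDs and numbers."""
    masked = _HEX_ID.sub("<id>", message.lower())
    return _NUMBER.sub("<n>", masked)


def new_patterns(baseline_logs, recent_logs):
    """Return patterns seen in recent_logs but never in baseline_logs, with counts."""
    known = {to_pattern(line) for line in baseline_logs}
    recent = Counter(to_pattern(line) for line in recent_logs)
    return {pattern: count for pattern, count in recent.items() if pattern not in known}


# Hypothetical log lines for illustration.
baseline = [
    "ERROR timeout calling payment after 3000 ms",
    "INFO order 1001 created",
]
recent = [
    "ERROR timeout calling payment after 4500 ms",
    "ERROR connection refused by inventory host 10",
    "ERROR connection refused by inventory host 12",
]
print(new_patterns(baseline, recent))
```

The timeout message is ignored because its pattern already exists in the baseline, while the two "connection refused" lines collapse into one new pattern with a count of 2.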

Forecasting:

  • Metric Forecasting – This involves forecasting a metric value, such as predicting when you will run out of open sessions for a particular service or when you will run out of system resources.

[Figure: AWS Forecasting. Source: AWS]
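
As a rough illustration of metric forecasting, the sketch below fits a straight line to recent samples and estimates when the trend will hit a limit. A managed forecasting service would use far richer models (seasonality, for example); the linear fit, the function names, and the session numbers are assumptions chosen to keep the example small.

```python
def fit_line(values):
    """Least-squares slope and intercept for values sampled at x = 0, 1, 2, ...

    Needs at least two samples.
    """
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    variance = sum((x - x_mean) ** 2 for x in range(n))
    slope = covariance / variance
    return slope, y_mean - slope * x_mean


def intervals_until_limit(values, limit):
    """Estimate how many intervals after the last sample the trend reaches limit.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical hourly counts of open sessions against a pool limit of 100.
open_sessions = [40, 45, 50, 55, 60]
print(intervals_until_limit(open_sessions, 100))  # 8.0 (hours, in this example)
```

With sessions growing by 5 per hour from 60, the pool of 100 would be exhausted in about 8 hours, which is the kind of early warning forecasting is meant to give.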

Correlation:

Your system produces a lot of metrics, logs, traces, other telemetry data, and alerts. Sometimes, finding the root cause is like finding a needle in a haystack. We need a way to reduce noise and pinpoint the actual problem. AI can correlate alerts, reduce noise, and guide us to the cause.
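
A very simple form of correlation can be sketched without any ML: fold alerts for the same service that fire close together into one incident. Real AIOps correlation looks at much more (topology, traces, learned relationships); the alert shape, the 300-second gap, and the function name here are assumptions for illustration.

```python
def correlate(alerts, max_gap_seconds=300):
    """Fold alerts into incidents.

    Alerts join the open incident for their service when they fire within
    max_gap_seconds of that incident's most recent alert; otherwise they
    start a new incident. Each alert is a dict with 'service', 'name', and
    'timestamp' (epoch seconds).
    """
    incidents = []
    open_incident = {}  # service -> index of its latest incident
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        service = alert["service"]
        idx = open_incident.get(service)
        if idx is not None and alert["timestamp"] - incidents[idx]["last_seen"] <= max_gap_seconds:
            incidents[idx]["alerts"].append(alert["name"])
            incidents[idx]["last_seen"] = alert["timestamp"]
        else:
            open_incident[service] = len(incidents)
            incidents.append(
                {"service": service, "alerts": [alert["name"]], "last_seen": alert["timestamp"]}
            )
    return incidents


# Hypothetical alerts for illustration.
alerts = [
    {"service": "checkout", "name": "HighLatency", "timestamp": 1000},
    {"service": "checkout", "name": "5xxErrors", "timestamp": 1090},
    {"service": "search", "name": "HighCPU", "timestamp": 1100},
    {"service": "checkout", "name": "QueueBacklog", "timestamp": 1200},
    {"service": "checkout", "name": "HighLatency", "timestamp": 5000},
]
for incident in correlate(alerts):
    print(incident["service"], incident["alerts"])
```

Five alerts become three incidents, and the on-call engineer sees the three related checkout alerts as one problem instead of three pages.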

These are some of the most widely used AIOps use cases.

As you may have noticed, a key part of AIOps is intelligent alerting. Static threshold-based alerts are no longer sufficient on their own. For traffic, errors, latency, and resource usage, we instead need to establish a baseline of normal behavior and receive an alert when a metric breaches that baseline. In these cases, AI supplies the intelligence behind the alert.
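
The difference between a static threshold and a baseline can be shown with a small sketch. It learns a band from recent values (mean plus or minus a multiple of the standard deviation) and flags values outside it. This is a deliberately simplified stand-in for a learned baseline, not the model CloudWatch uses; the band width and the latency samples are assumptions.

```python
from statistics import mean, stdev


def baseline_band(history, width=2.0):
    """Return (lower, upper) bounds: mean +/- width * sample standard deviation."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread


def breaches_baseline(history, latest, width=2.0):
    """True when the latest value falls outside the band learned from history."""
    lower, upper = baseline_band(history, width)
    return latest < lower or latest > upper


# Hypothetical API latency samples in milliseconds.
latency_ms = [120, 118, 125, 122, 119, 121, 124, 120]
print(breaches_baseline(latency_ms, 123))  # False: within normal variation
print(breaches_baseline(latency_ms, 180))  # True: well outside the learned band
```

A static alert set at, say, 500 ms would stay silent at 180 ms, even though that is a 50% jump over what this service normally does.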

Once the alert is triggered:

Self-Healing / Remediation Bots: We can develop self-healing or remediation bots that provide solutions. These bots can be rule-based, or we can leverage GenAI to provide smarter solutions as well.
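
A rule-based remediation bot can be as simple as a lookup from alert name to action. The sketch below shows only that dispatch step; the alert names and handlers are hypothetical, and the handlers return a description instead of calling any real automation.

```python
def restart_service(alert):
    """Placeholder action: a real handler would call your automation tooling."""
    return f"restart requested for {alert['service']}"


def scale_out(alert):
    """Placeholder action: a real handler would add capacity."""
    return f"scale-out requested for {alert['service']}"


# Rule table mapping alert names to remediation actions (hypothetical names).
RUNBOOK = {
    "ProcessDown": restart_service,
    "HighCPU": scale_out,
}


def remediate(alert):
    """Run the matching rule, or escalate to a human when no rule exists."""
    action = RUNBOOK.get(alert["name"])
    if action is None:
        return f"no rule for {alert['name']}; escalating to on-call"
    return action(alert)


print(remediate({"name": "HighCPU", "service": "checkout"}))
print(remediate({"name": "DiskCorruption", "service": "orders-db"}))
```

Keeping an explicit escalation path for unknown alerts is a deliberate choice: the bot acts only where a rule has been agreed in advance.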

Now that you have a good understanding of what AIOps is and its use cases, let’s look at how we can leverage AWS to implement a comprehensive AIOps framework.

CloudWatch Offering

Instrumentation and Collection

Whether you run on instances and containers (EC2, ECS, or EKS) or on serverless (AWS Lambda), you can use the CloudWatch agent, AWS Distro for OpenTelemetry, or X-Ray to have your application emit telemetry data such as metrics, logs, traces, and events.
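
As one example of what emitting a metric can look like, here is a minimal sketch using the CloudWatch embedded metric format (EMF), where a structured JSON log line carries a metric that CloudWatch extracts once the line reaches CloudWatch Logs. EMF is an option I am adding for illustration, not one of the tools named above; the namespace, dimension, and metric name are hypothetical.

```python
import json
import time


def emf_record(namespace, service, metric_name, value, unit="Count"):
    """Build one embedded metric format log record as a JSON string."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["Service"]],  # keys that act as dimensions
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        "Service": service,   # dimension value
        metric_name: value,   # metric value
    }
    return json.dumps(record)


# Hypothetical namespace and metric; the printed line is what the application would log.
print(emf_record("Shop/Orders", "checkout", "OrderFailures", 3))
```

Because the metric travels in a log line, the same record is available both as a metric to alarm on and as a log entry for later analysis.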

Visualizations

On top of the foundational telemetry data, you can build your observability dashboards.

Insights and Analytics

AWS provides various out-of-the-box insights, such as:

  • Container Insights
  • Lambda Insights
  • Logs Insights
  • Application Insights
  • EC2 Health
  • AWS CloudTrail

Digital Experience Monitoring

To enable digital experience monitoring, you can leverage tools such as RUM (Real User Monitoring) and Synthetics.

All of the above are hooked into AWS CloudWatch for centralized observability.

AWS Built-in AIOps Capabilities Integrated with CloudWatch

The beauty of AWS is that it provides key AIOps use cases out of the box, such as:

  • Metric anomaly detection
  • Log anomaly detection
  • AI-driven natural language query generation
  • Intelligent insights

That’s it. You can leverage these capabilities to develop the intelligent alerting we discussed earlier.
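
To show how metric anomaly detection turns into an alert, here is a hedged sketch that builds the parameters for a CloudWatch alarm based on an anomaly detection band. The helper only assembles the request; the alarm name, namespace, metric, dimension, and the choice of three evaluation periods are assumptions, and `create_alarm` needs the third-party `boto3` package plus AWS credentials.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, dimensions, band_width=2):
    """Build keyword arguments for CloudWatch put_metric_alarm using an anomaly band.

    band_width is the number of standard deviations used for the expected range.
    """
    return {
        "AlarmName": alarm_name,
        # Alarm when the metric leaves the expected range in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # points at the band expression below
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": [
                            {"Name": name, "Value": value}
                            for name, value in dimensions.items()
                        ],
                    },
                    "Period": 300,  # seconds
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "ReturnData": True,
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "expected range",
            },
        ],
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported here so the builder above works without it

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Hypothetical API Gateway latency alarm.
params = anomaly_alarm_params(
    "orders-latency-anomaly", "AWS/ApiGateway", "Latency", {"ApiName": "orders-api"}
)
print(params["Metrics"][1]["Expression"])
```

Calling `create_alarm(params)` would register the alarm. A wider band (a larger `band_width`) makes the alarm less sensitive, which is usually the first knob to adjust if it proves noisy.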

Metric Forecasting

For metric forecasting, you can use Amazon Forecast, the forecasting service provided by AWS. CloudWatch metric data can be exported and used as input to Amazon Forecast to meet your forecasting needs.

AWS DevOps Guru

Now that you’ve built some of the most common AIOps use cases, wouldn’t it be cool if AWS could monitor your entire AWS account and provide insights? Well, AWS provides AWS DevOps Guru, which can do just that. It’s based on machine learning, and some of the key use cases DevOps Guru brings to the table are:

  • Anomaly Detection: Automatically detects unusual patterns in metrics, logs, and events using machine learning.
  • Root Cause Analysis: Identifies the root cause of operational issues by correlating data from multiple sources, reducing resolution time.
  • Proactive Insights: Offers recommendations to prevent potential issues based on best practices and historical data.
  • Resource Optimization: Suggests ways to optimize resource utilization to lower costs and improve performance.
  • Database Monitoring: Provides performance insights for both relational (e.g., RDS, Redshift) and non-relational databases (e.g., DynamoDB, ElastiCache).
  • Capacity Planning: Forecasts future resource needs based on traffic patterns and usage trends.

Yes, DevOps Guru is your one-stop shop to get most of your AIOps requirements done.

What’s New from AWS at re:Invent 2024?

Yes, these are exciting times! I’ve been tracking the following awesome capabilities released by AWS, which will greatly enhance your AIOps implementation journey.

Amazon CloudWatch Enhancements:

  • Contextual Observability Data – Automatically visualizes relationships between metrics, logs, and AWS resources, improving troubleshooting and root cause analysis.
  • Network Performance Monitoring – Provides near real-time monitoring of network performance across workloads using flow monitors.
  • Database Insights for Amazon Aurora – Offers deeper insights for Amazon Aurora PostgreSQL and MySQL, designed for DevOps engineers and DBAs.
  • Enhanced Observability for ECS – Adds detailed metrics from cluster to container level to improve troubleshooting for ECS workloads.
  • CloudWatch Observability Solutions for AWS Services – Pre-configured solutions for AWS services and for common workloads such as JVM, Apache Kafka, and NGINX.
  • Centralized Telemetry Configuration Visibility – Provides centralized auditing and visibility for AWS telemetry configurations (e.g., VPC Flow Logs, EC2 metrics) to ensure complete monitoring coverage.
  • Amazon CloudWatch Application Signals – Provides complete visibility into application transaction spans, enhancing performance analysis and root cause identification.

That's a wrap for the series opener. With these capabilities, you can build a comprehensive AIOps framework to elevate your application reliability to the next level.
