
AWS CloudWatch

CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health.

Key Concepts

  • Metrics: CloudWatch collects and tracks metrics, which are time-ordered sets of data points (variables) about your resources and applications. Metric data is retained for up to 15 months, with older data available only at coarser granularity (for example, 1-minute data points are kept for 15 days and 1-hour data points for 15 months), so you can view historical data and see how your application or service performs over time. Metrics are identified by a name, a namespace, and zero or more dimensions.
    • Dimensions: Name/value pairs that are part of a metric's identity; the same metric name with different dimension values is a different metric. They help you filter and segment metric data (e.g., InstanceId, FunctionName).
    • Custom Metrics: You can publish your own custom metrics to CloudWatch from your applications or services, allowing you to monitor any data point relevant to your business.
  • Alarms: You can create alarms that watch a single metric or an expression based on multiple metrics. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. Actions can include:
    • EC2 Actions: Stop, terminate, or recover an EC2 instance.
    • Auto Scaling Actions: Add or remove EC2 instances from an Auto Scaling group.
    • SNS Notifications: Send notifications to an SNS topic.
    • Lambda Invocations: Invoke a Lambda function.
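    • Composite Alarms: Strictly an alarm type rather than an action. A composite alarm combines the states of several other alarms through a rule expression and can carry its own actions, which reduces alarm noise.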
  • Logs: CloudWatch Logs enables you to centralize the logs from all of your systems, applications, and AWS services that you use, in a single, highly scalable service. You can monitor, store, and access your log files.
    • Log Groups: Logical groupings of log streams that share the same retention, monitoring, and access control settings.
    • Log Streams: Sequences of log events from a single source (e.g., an EC2 instance, or one execution environment of a Lambda function).
    • Logs Insights: An interactive query feature, with its own query language, for searching and analyzing your log data in CloudWatch Logs.
    • Subscription Filters: Allow you to set up real-time feeds of logs to other AWS services (e.g., Lambda, Kinesis Data Firehose) for custom processing, analytics, or archiving.
  • Events (now Amazon EventBridge): CloudWatch Events, now superseded by Amazon EventBridge, delivers a near real-time stream of system events that describe changes in AWS resources. You can set up rules that match events and route them to one or more targets. EventBridge extends CloudWatch Events with custom and partner event buses, SaaS integrations, and a schema registry.
    • Event Patterns: Define the criteria for matching incoming events.
    • Targets: The AWS service or resource that EventBridge invokes when an event matches a rule (e.g., Lambda functions, SQS queues, SNS topics, EC2 instances).
  • Dashboards: CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even across different regions. You can create widgets to display metrics, logs, and alarms, providing a consolidated operational view.
  • CloudWatch Agent: A unified agent that can collect metrics and logs from EC2 instances, on-premises servers, and virtual machines. It can collect system-level metrics (CPU, memory, disk, network), custom metrics, and logs from various sources.
  • Anomaly Detection: CloudWatch uses machine learning algorithms to analyze past metric data and create a model of expected behavior. It then identifies anomalies (deviations from the expected pattern) in real-time, helping you detect unusual activity without setting static thresholds.
  • Contributor Insights: Analyzes log data to identify top contributors to system behavior (e.g., top N customers, most active IP addresses, highest error-producing URLs). This helps in quickly isolating and troubleshooting operational issues.
  • Synthetic Monitoring (Canaries): CloudWatch Synthetics allows you to create configurable scripts (canaries) to monitor your endpoints and APIs from outside your application. Canaries simulate user actions and check for availability, latency, and correctness, alerting you to issues before your customers are affected.
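
The alarm behavior described under Alarms (a metric compared with a threshold over a number of periods) can be sketched in plain Python. This is a simplified model for illustration, not CloudWatch's implementation: it ignores missing-data handling and the INSUFFICIENT_DATA state, and the function and parameter names are hypothetical.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """Return 'ALARM' or 'OK' for the most recent evaluation window.

    Simplified model: the last `evaluation_periods` datapoints are examined,
    and the alarm fires when at least `datapoints_to_alarm` of them are
    strictly greater than `threshold`.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# One-minute average CPU readings (made-up values).
cpu = [55.0, 72.0, 74.0, 71.0]

# All 4 of the last 4 datapoints must breach: 55.0 does not, so no alarm.
print(alarm_state(cpu, threshold=70.0, evaluation_periods=4, datapoints_to_alarm=4))

# "3 out of 4" tolerates one non-breaching datapoint.
print(alarm_state(cpu, threshold=70.0, evaluation_periods=4, datapoints_to_alarm=3))
```

Requiring several breaching datapoints rather than one is what keeps a short spike from triggering actions such as scaling or paging.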

Use Cases

  • Application Monitoring: Collect and track metrics from your applications (e.g., CPU utilization, network I/O, request counts, error rates, latency). Create alarms to notify you of performance issues, resource bottlenecks, or operational failures. Use custom metrics to monitor application-specific KPIs.
  • Log Analysis and Troubleshooting: Centralize and analyze logs from various sources (EC2, Lambda, containers, custom applications) using CloudWatch Logs. Use Logs Insights to quickly query and filter log data to troubleshoot issues, identify security threats, and gain operational insights.
  • Event-Driven Architectures and Automation: Respond to changes in your AWS environment or custom application events using EventBridge. Set up rules to trigger actions like invoking Lambda functions, sending messages to SQS/SNS, or starting/stopping EC2 instances, enabling automated responses and workflows.
  • Resource Optimization and Cost Management: Monitor resource utilization metrics (CPU, memory, network) to identify underutilized or overutilized resources. Use this data to optimize resource allocation, right-size instances, and manage costs effectively.
  • Operational Health and Dashboards: Create comprehensive CloudWatch Dashboards to get a unified view of the operational health of your applications and infrastructure. Combine metrics, logs, and alarms from various services into a single pane of glass for quick assessment.
  • Proactive Issue Detection: Implement Anomaly Detection on key metrics to automatically identify unusual behavior without manual threshold configuration. Use Synthetic Monitoring (Canaries) to proactively check application availability and performance from an end-user perspective.
  • Security and Compliance Monitoring: Monitor CloudTrail logs for API activity, VPC Flow Logs for network traffic, and other security-related logs in CloudWatch Logs. Create alarms for suspicious activities or compliance violations.
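
For the security monitoring use case, one common approach is a metric filter that turns matching CloudTrail log events into a metric an alarm can watch. The following is a minimal sketch, assuming CloudTrail already delivers to a CloudWatch Logs log group. The log group, filter, metric, and namespace names are placeholders, and the Boto3 call itself is left commented out so the snippet runs without credentials.

```python
def build_access_denied_filter(log_group_name: str) -> dict:
    """Build keyword arguments for logs_client.put_metric_filter()."""
    return {
        "logGroupName": log_group_name,
        "filterName": "AccessDeniedCalls",  # hypothetical name
        # JSON filter pattern matching CloudTrail records with these error codes.
        "filterPattern": (
            '{ ($.errorCode = "AccessDenied*") || '
            '($.errorCode = "*UnauthorizedOperation") }'
        ),
        "metricTransformations": [
            {
                "metricName": "AccessDeniedCount",    # hypothetical name
                "metricNamespace": "MyApp/Security",  # hypothetical namespace
                "metricValue": "1",                   # each matching event counts as 1
            }
        ],
    }


params = build_access_denied_filter("/cloudtrail/management-events")  # placeholder log group
print(params["filterPattern"])

# With credentials configured, the filter could be created like this:
# import boto3
# boto3.client("logs").put_metric_filter(**params)
```

An alarm on the resulting metric (Sum over a short period, threshold of your choosing) would then cover the "alarms for suspicious activities" part of the use case.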

Interview Questions

Conceptual Questions

  1. What is AWS CloudWatch and what are its core components?
    • AWS CloudWatch is a monitoring and observability service that provides data and actionable insights for monitoring applications, responding to system-wide performance changes, optimizing resource utilization, and getting a unified view of operational health. Its core components are:
      • Metrics: Time-ordered data points about resources and applications.
      • Logs: Centralized service for collecting, monitoring, and storing log files.
      • Alarms: Watches metrics and initiates actions based on thresholds.
      • Events (EventBridge): Delivers a near real-time stream of system events to trigger actions.
      • Dashboards: Customizable home pages for monitoring resources.
  2. Explain the concepts of metrics, dimensions, and custom metrics in CloudWatch.
    • Metrics: Fundamental data points that represent a time-ordered set of observations. They describe how a system is performing.
    • Dimensions: Name/value pairs that are part of a metric's identity. They allow you to filter and segment metrics (e.g., InstanceId, FunctionName). A metric can have up to 30 dimensions.
    • Custom Metrics: Metrics that you define and publish to CloudWatch. This allows you to monitor any data point relevant to your application or business, beyond the default metrics provided by AWS services.
  3. What are CloudWatch Alarms and what types of actions can they perform?
    • CloudWatch Alarms watch a single metric (or an expression of multiple metrics) and perform one or more actions based on the value relative to a threshold over a specified period. Actions can include:
      • Sending notifications via SNS.
      • Stopping, terminating, or recovering EC2 instances.
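      • Triggering Auto Scaling actions (scale out/in).
      • Invoking Lambda functions.
      • In addition, every alarm state change is emitted as an event to EventBridge (automatically, not as a configured action), where rules can route it to further targets.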
  4. Describe CloudWatch Logs and its key features like Log Groups, Log Streams, and Logs Insights.
    • CloudWatch Logs: A service for centralizing, monitoring, and storing log files from various sources (EC2, Lambda, containers, applications).
    • Log Groups: Logical groupings of log streams that share common settings such as retention policies, monitoring configurations, and access controls.
    • Log Streams: Sequences of log events from a common source within a Log Group (e.g., logs from a single EC2 instance, or from one execution environment of a Lambda function).
    • Logs Insights: An interactive query feature, with its own query language, that lets you search, analyze, and visualize your log data in CloudWatch Logs, helping with troubleshooting and operational intelligence.
  5. What is Amazon EventBridge and how does it relate to CloudWatch Events? Provide examples of its use.
    • Amazon EventBridge is a serverless event bus service that makes it easy to connect applications together using data from your own applications, integrated SaaS applications, and AWS services. It is an extension of CloudWatch Events, offering more features like event buses, SaaS integrations, and a schema registry.
    • Use Cases: Triggering Lambda functions on S3 object uploads, sending SNS notifications for EC2 state changes, orchestrating workflows with Step Functions in response to events, and integrating with third-party SaaS applications.
  6. What is the CloudWatch Agent and why would you use it?
    • The CloudWatch Agent is a unified agent that can collect system-level metrics (e.g., CPU, memory, disk, network) and logs from EC2 instances, on-premises servers, and virtual machines. You would use it to get deeper visibility into the performance and health of your operating systems and applications, beyond the default metrics provided by EC2, and to centralize logs from custom applications.
  7. Explain CloudWatch Anomaly Detection and Contributor Insights.
    • Anomaly Detection: CloudWatch uses machine learning algorithms to analyze past metric data, create a model of expected behavior, and then identifies real-time deviations (anomalies) from that pattern. This helps proactively detect unusual activity without needing to set static thresholds.
    • Contributor Insights: Analyzes log data to identify the top N contributors to system activity or problems (e.g., top N users consuming bandwidth, top N error-producing APIs, most active IP addresses). It helps to quickly find the root cause of operational issues.
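
The idea behind Contributor Insights (ranking the top N contributors found in log data) can be illustrated with a few lines of plain Python. This is only a conceptual sketch over made-up, already-parsed log records. The real feature is configured through rules in CloudWatch, and the field names here are hypothetical.

```python
from collections import Counter


def top_contributors(events: list, key: str, n: int) -> list:
    """Count events per value of `key` and return the n most frequent."""
    counts = Counter(event[key] for event in events if key in event)
    return counts.most_common(n)


# Made-up log events.
events = [
    {"client_ip": "10.0.0.1", "status": 500},
    {"client_ip": "10.0.0.2", "status": 200},
    {"client_ip": "10.0.0.1", "status": 500},
    {"client_ip": "10.0.0.3", "status": 500},
    {"client_ip": "10.0.0.1", "status": 200},
]

# Keep only server errors, then rank the clients that produced them.
errors = [e for e in events if e["status"] >= 500]
print(top_contributors(errors, key="client_ip", n=2))
```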

Scenario-Based Questions

  1. You have a web application running on EC2 instances behind an ALB. You need to monitor CPU utilization and send an alert if it consistently exceeds 70% for 5 minutes. Additionally, you want to automatically scale out your Auto Scaling Group if this happens. How would you configure this in CloudWatch?
    • I would create a CloudWatch Alarm on the CPUUtilization metric in the AWS/EC2 namespace, using the AutoScalingGroupName dimension so that it reflects the whole group rather than a single instance. The alarm would evaluate the Average statistic with a 1-minute period (which requires detailed monitoring on the instances) and go into ALARM when the value is greater than 70% for 5 consecutive periods. Its actions would be to send a notification to an SNS topic and, critically, to invoke an Auto Scaling policy that increases the desired capacity of the Auto Scaling Group.
  2. Your serverless application built with AWS Lambda and API Gateway is experiencing intermittent errors. You need a way to quickly find and troubleshoot the errors. How would you use CloudWatch Logs for this?
    • Lambda sends function logs to CloudWatch Logs by default (provided the execution role allows it), and API Gateway does so once access or execution logging is enabled on the stage. I would use CloudWatch Logs Insights to query these logs: select the relevant log group (e.g., /aws/lambda/my-function), search for specific error messages ("ERROR"), restrict the time range, and visualize error trends. A query such as fields @timestamp, @message | filter @message like /Error/ | sort @timestamp desc quickly surfaces the most recent error logs. Note that the regular expression is case-sensitive, so it should match the casing the application actually logs.
  3. You want to perform proactive monitoring of your API endpoint's availability and latency from an external perspective, rather than relying solely on server-side metrics. How would you achieve this using CloudWatch?
    • I would use CloudWatch Synthetics to create a canary. The canary would be a script (e.g., Node.js or Python) that simulates a user's interaction with my API endpoint, checking for availability, response time, and correct content. A canary runs on a schedule in the Region where it is created, so to check from several geographic locations I would create the canary in multiple Regions. Synthetics publishes metrics for each run, and I would create a CloudWatch Alarm on them so that I am notified when the canary detects issues (e.g., the API is down or the response time is too high).
  4. You have multiple AWS accounts within your organization, and you want a centralized dashboard to view the operational health of critical applications across these accounts. How can CloudWatch help?
    • I would configure cross-account observability in CloudWatch. This involves designating a central monitoring account and configuring source accounts to share their metrics, logs, and traces with the monitoring account. Once shared, I can create a single CloudWatch Dashboard in the monitoring account that displays metrics and visualizes logs from all linked accounts, providing a unified operational view.
  5. Your application is generating custom application logs on EC2 instances, and you also want to collect system-level metrics (e.g., memory utilization) that are not natively emitted by EC2 into CloudWatch. How would you set this up?
    • I would install and configure the CloudWatch Agent on each EC2 instance. The agent's configuration file would specify which log files to collect (e.g., /var/log/my-app.log) and send to CloudWatch Logs. It would also be configured to collect specific system-level metrics (e.g., mem_used_percent) and publish them as custom metrics to CloudWatch. I would then create CloudWatch Alarms and Dashboards for these new logs and metrics.
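
The agent configuration described in the last answer might look like the following, built here as a Python dictionary and serialized to JSON. This is a sketch under assumptions: the log group name is a placeholder, and only a small subset of the agent's configuration schema is shown.

```python
import json

agent_config = {
    "metrics": {
        "metrics_collected": {
            # Memory is not reported by EC2 itself; the agent collects it.
            "mem": {"measurement": ["mem_used_percent"]},
        },
        # Tag each metric with the instance it came from.
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/my-app.log",
                        "log_group_name": "my-app-logs",     # placeholder
                        "log_stream_name": "{instance_id}",  # one stream per instance
                    }
                ]
            }
        }
    },
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

The resulting JSON would be saved on each instance (or distributed centrally) and loaded by the agent. By default the agent publishes its metrics under the CWAgent namespace, which is where the alarms and dashboards would look for mem_used_percent.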

Coding/CLI Examples

Here are some common CloudWatch operations using the AWS CLI and Python (Boto3).

AWS CLI Examples

  1. Publish a custom metric to CloudWatch:

    ```bash
    aws cloudwatch put-metric-data \
      --metric-name "CustomTransactionCount" \
      --namespace "MyApp/Backend" \
      --value 1 \
      --dimensions Service=Auth,Region=us-east-1
    ```

  2. Create a CloudWatch Alarm that triggers on a custom metric:

    ```bash
    aws cloudwatch put-metric-alarm \
      --alarm-name "HighCustomTransactionCount" \
      --comparison-operator GreaterThanThreshold \
      --evaluation-periods 1 \
      --metric-name "CustomTransactionCount" \
      --namespace "MyApp/Backend" \
      --period 60 \
      --statistic Sum \
      --threshold 5 \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlarmTopic \
      --dimensions Name=Service,Value=Auth Name=Region,Value=us-east-1
    ```

  3. Put log events to a CloudWatch Log Stream:

    ```bash
    LOG_GROUP_NAME="/aws/lambda/MyLambdaFunction"
    # Single quotes keep the shell from expanding $LATEST as a variable
    LOG_STREAM_NAME='2023/10/26/[$LATEST]abcdef12345'

    # First, create the Log Group and Log Stream
    # (each command returns an error if the resource already exists)
    aws logs create-log-group --log-group-name "$LOG_GROUP_NAME"
    aws logs create-log-stream --log-group-name "$LOG_GROUP_NAME" --log-stream-name "$LOG_STREAM_NAME"

    # Milliseconds since the epoch (%3N requires GNU date)
    TIMESTAMP=$(date +%s%3N)

    aws logs put-log-events \
      --log-group-name "$LOG_GROUP_NAME" \
      --log-stream-name "$LOG_STREAM_NAME" \
      --log-events "timestamp=$TIMESTAMP,message='This is a log message from CLI.'"
    # --sequence-token is no longer needed; CloudWatch Logs now ignores it
    ```

  4. Create an EventBridge rule to trigger a Lambda function on an EC2 state change:

    ```bash
    LAMBDA_FUNCTION_ARN="arn:aws:lambda:us-east-1:123456789012:function:MyEventLambda"

    # 1. Create EventBridge Rule
    aws events put-rule \
      --name "Ec2InstanceStateChangeRule" \
      --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"],"detail":{"state":["stopped"]}}'

    # 2. Add Lambda as a target
    aws events put-targets \
      --rule "Ec2InstanceStateChangeRule" \
      --targets "Id=1,Arn=$LAMBDA_FUNCTION_ARN"

    # 3. Grant permissions to EventBridge to invoke Lambda (if not already done)
    aws lambda add-permission \
      --function-name "MyEventLambda" \
      --statement-id "EventBridgeInvokePermission" \
      --action "lambda:InvokeFunction" \
      --principal events.amazonaws.com \
      --source-arn "arn:aws:events:us-east-1:123456789012:rule/Ec2InstanceStateChangeRule"
    ```
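
Before deploying a rule like the one in example 4, it can help to check the pattern against a sample event. The sketch below is a deliberately simplified matcher written for illustration: it handles only exact-value lists and nested objects, not EventBridge's full pattern syntax (prefix, numeric, anything-but, and so on), and the sample event is made up.

```python
import json


def matches(pattern: dict, event: dict) -> bool:
    """Simplified pattern check: every pattern key must match the event.

    A list in the pattern means "the event value must equal one of these";
    a dict means "recurse into the nested object".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Same pattern as in the put-rule command.
pattern = json.loads(
    '{"source":["aws.ec2"],'
    '"detail-type":["EC2 Instance State-change Notification"],'
    '"detail":{"state":["stopped"]}}'
)

# Made-up event with only the fields relevant to the pattern.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0abcdef1234567890", "state": "stopped"},
}
running_event = {**sample_event, "detail": {**sample_event["detail"], "state": "running"}}

print(matches(pattern, sample_event))   # stopped instance: pattern matches
print(matches(pattern, running_event))  # running instance: pattern does not match
```

For a check against the service's actual matching logic, EventBridge also provides a TestEventPattern API (test_event_pattern in Boto3), which requires credentials and a complete event envelope.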

Python (Boto3) Examples

First, ensure you have Boto3 installed (pip install boto3) and your AWS credentials configured.

  1. Publish a custom metric to CloudWatch:

    ```python
    import boto3
    from datetime import datetime, timezone

    cloudwatch_client = boto3.client('cloudwatch')

    try:
        response = cloudwatch_client.put_metric_data(
            Namespace='MyApp/Backend',
            MetricData=[
                {
                    'MetricName': 'ProcessedOrders',
                    'Dimensions': [
                        {'Name': 'Region', 'Value': 'us-east-1'},
                        {'Name': 'Service', 'Value': 'OrderProcessing'}
                    ],
                    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                    'Timestamp': datetime.now(timezone.utc),
                    'Value': 15,
                    'Unit': 'Count'
                },
            ]
        )
        print(f"Metric data published: {response}")
    except Exception as e:
        print(f"Error publishing metric data: {e}")
    ```

  2. Create a CloudWatch Alarm that monitors CPU Utilization:

    ```python
    import boto3

    cloudwatch_client = boto3.client('cloudwatch')
    sns_topic_arn = "arn:aws:sns:us-east-1:123456789012:MyAlarmTopic"  # REPLACE with your SNS Topic ARN

    try:
        response = cloudwatch_client.put_metric_alarm(
            AlarmName='EC2-High-CPU-Alarm',
            ComparisonOperator='GreaterThanOrEqualToThreshold',
            EvaluationPeriods=2,  # 2 periods x 5 minutes = 10 minutes
            MetricName='CPUUtilization',
            Namespace='AWS/EC2',
            Period=300,  # 5 minutes
            Statistic='Average',
            Threshold=80.0,
            ActionsEnabled=True,
            AlarmActions=[sns_topic_arn],
            AlarmDescription='Alarm when EC2 CPU is at or above 80% for 10 minutes',
            Dimensions=[
                {'Name': 'InstanceId', 'Value': 'i-0abcdef1234567890'}  # REPLACE with your Instance ID
            ]
        )
        print("Alarm 'EC2-High-CPU-Alarm' created.")
    except Exception as e:
        print(f"Error creating alarm: {e}")
    ```

  3. Query CloudWatch Logs Insights:

    ```python
    import boto3
    import time

    logs_client = boto3.client('logs')

    log_group_name = '/aws/lambda/MyLambdaFunction'  # REPLACE with your Log Group Name

    # Define the query
    query_string = "fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20"

    try:
        now = int(time.time())
        start_query_response = logs_client.start_query(
            logGroupNames=[log_group_name],
            startTime=now - 3600,  # Last 1 hour, as epoch seconds
            endTime=now,
            queryString=query_string
        )
        query_id = start_query_response['queryId']
        print(f"Started Logs Insights query with ID: {query_id}")

        # Wait for query completion and get results
        response = None
        while response is None or response['status'] in ('Scheduled', 'Running'):
            time.sleep(1)
            response = logs_client.get_query_results(queryId=query_id)

        if response['status'] == 'Complete':
            print("Query Results:")
            for row in response['results']:
                # Each row is a list of {'field': ..., 'value': ...} pairs
                log_entry = {}
                for field in row:
                    log_entry[field['field']] = field['value']
                print(log_entry)
        else:
            print(f"Query ended with status: {response['status']}")

    except Exception as e:
        print(f"Error querying logs: {e}")
    ```
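
The result handling in the Logs Insights query example can be factored into a small helper. get_query_results returns each row as a list of {'field': ..., 'value': ...} pairs; the sketch below flattens them into dictionaries and drops the internal @ptr field. The sample rows are made up and the helper name is hypothetical.

```python
def rows_to_dicts(results: list, skip_fields: tuple = ("@ptr",)) -> list:
    """Convert Logs Insights result rows into plain dictionaries."""
    return [
        {cell["field"]: cell["value"] for cell in row if cell["field"] not in skip_fields}
        for row in results
    ]


# Made-up rows in the shape returned under response['results'].
sample_results = [
    [
        {"field": "@timestamp", "value": "2023-10-26 12:00:01.000"},
        {"field": "@message", "value": "ERROR: payment declined"},
        {"field": "@ptr", "value": "opaque-pointer-1"},
    ],
    [
        {"field": "@timestamp", "value": "2023-10-26 11:59:58.000"},
        {"field": "@message", "value": "ERROR: timeout calling inventory"},
        {"field": "@ptr", "value": "opaque-pointer-2"},
    ],
]

for entry in rows_to_dicts(sample_results):
    print(entry)
```

Keeping this conversion separate from the polling loop makes it possible to test the parsing without calling AWS.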