Serverless Slic Watch Plugin

Automatically create CloudWatch alarms and dashboards for Lambda, SQS, Kinesis, DynamoDB, API Gateway and Step Functions

slic-watch

SLIC Watch provides a CloudWatch Dashboard and Alarms for:

  1. AWS Lambda
  2. API Gateway
  3. DynamoDB
  4. Kinesis Data Streams
  5. SQS Queues
  6. Step Functions

Currently, SLIC Watch is available as a Serverless Framework plugin.

Getting Started

  1. 📦 Install the plugin:
npm install serverless-slic-watch-plugin --save-dev
  2. 🖋️ Add the plugin to the plugins section of serverless.yml:
plugins:
  - serverless-slic-watch-plugin
  3. 🪛 Optionally, configure the plugin in the custom -> slicWatch section of serverless.yml. Here you can reference the SNS topic that alarms are sent to. The topic is optional, but you will usually want one so that you can receive alarm notifications via email, Slack, etc.
custom:
  slicWatch:
    topicArn: { Ref: myTopic }

See the Configuration section below for more detailed instructions on fine tuning SLIC Watch to your needs.

  4. 🚢 Deploy your application in the usual way, for example:
sls deploy
  5. 👀 Head to the CloudWatch section of the AWS Console to check out your new dashboards 📊 and alarms ⏰!
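
The topic referenced in step 3 has to be defined somewhere. As a minimal sketch, assuming the topic is declared in the same stack under the logical ID myTopic and that ops@example.com is a placeholder address, serverless.yml could contain something like this:

```yaml
custom:
  slicWatch:
    # Ref on an AWS::SNS::Topic resource resolves to the topic ARN
    topicArn: { Ref: myTopic }

resources:
  Resources:
    myTopic:
      Type: AWS::SNS::Topic
      Properties:
        Subscription:
          # Placeholder endpoint; replace with a real address
          - Protocol: email
            Endpoint: ops@example.com
```

Email subscriptions have to be confirmed by the recipient before any alarm notification is delivered.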

Features

CloudWatch Alarms and Dashboard widgets are created for all supported resources in the CloudFormation stack generated by the Serverless Framework. This includes generated resources as well as resources specified explicitly in the resources section. Any feature can be configured or disabled completely; see the Configuration section for details.

Lambda Functions

Lambda Function alarms are created for:

  1. Errors
  2. Throttles, as a percentage of the number of invocations
  3. Duration, as a percentage of the function's configured timeout
  4. Invocations, disabled by default
  5. IteratorAge, for functions triggered by an Event Source Mapping
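
Because the throttle and duration alarms are percentages, their thresholds can be tuned without knowing each function's timeout or traffic. The sketch below uses the ThrottlesPc and DurationPc keys from the Configuration section; the values 80 and 1 are illustrative assumptions, not recommendations:

```yaml
custom:
  slicWatch:
    alarms:
      Lambda:
        DurationPc:
          Threshold: 80   # alarm once duration exceeds 80% of the configured timeout
        ThrottlesPc:
          Threshold: 1    # tolerate throttles on up to 1% of invocations
```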

Lambda dashboard widgets show:

  1. Errors
  2. Throttles
  3. Duration (Average, P95 and Maximum)
  4. Invocations
  5. Concurrent Executions
  6. Iterator Age

API Gateway

API Gateway alarms are created for:

  1. 5XX Errors
  2. 4XX Errors
  3. Latency
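
These alarms can be adjusted with the ApiGateway keys listed in the Configuration section. As a hedged sketch, with illustrative values chosen for an API where some client errors are expected:

```yaml
custom:
  slicWatch:
    alarms:
      ApiGateway:
        4XXError:
          Threshold: 0.1         # Average above 0.1, i.e. more than 10% of requests
        Latency:
          ExtendedStatistic: p95 # percentile used instead of the default p99
          Threshold: 3000        # milliseconds
```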

API Gateway dashboard widgets show:

  1. 5XX Errors
  2. 4XX Errors
  3. Latency
  4. Count

DynamoDB

DynamoDB alarms are created for:

  1. Read Throttle Events (Table and GSI)
  2. Write Throttle Events (Table and GSI)
  3. UserErrors
  4. SystemErrors

Dashboard widgets are created for tables and GSIs:

  1. ReadThrottleEvents (Table)
  2. WriteThrottleEvents (Table)
  3. ReadThrottleEvents (GSI)
  4. WriteThrottleEvents (GSI)

Kinesis Data Streams

Kinesis data stream alarms are created for:

  1. Iterator Age
  2. Read Provisioned Throughput Exceeded
  3. Write Provisioned Throughput Exceeded
  4. PutRecord.Success
  5. PutRecords.Success
  6. GetRecords.Success

Kinesis data stream dashboard widgets show:

  1. Iterator Age
  2. Read Provisioned Throughput Exceeded
  3. Write Provisioned Throughput Exceeded
  4. Put/Get Success

SQS Queues

SQS Queue alarms are created for:

  1. Age Of Oldest Message (disabled by default). If enabled, a threshold in seconds should be specified.
  2. In Flight Messages Percentage. This is a percentage of the AWS hard limits (20,000 messages for FIFO queues and 120,000 for standard queues).
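
As a sketch of how both alarms might be configured, using the SQS keys from the Configuration section: the values below are illustrative assumptions. A threshold of 50 for InFlightMessagesPc corresponds to 60,000 in-flight messages on a standard queue (50% of 120,000) and 10,000 on a FIFO queue (50% of 20,000).

```yaml
custom:
  slicWatch:
    alarms:
      SQS:
        AgeOfOldestMessage:
          enabled: true   # disabled by default, so it must be enabled explicitly
          Threshold: 60   # seconds; a threshold is required alongside enabled: true
        InFlightMessagesPc:
          Threshold: 50   # percentage of the AWS hard limit for the queue type
```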

SQS queue dashboard widgets show:

  1. Messages Sent, Received and Deleted
  2. Messages Visible
  3. Age of Oldest Message

Step Functions

Step Function alarms are created for:

  1. Executions Throttled
  2. Executions Failed
  3. Executions Timed Out

The dashboard contains one widget per Step Function, showing ExecutionsFailed, ExecutionsThrottled and ExecutionsTimedOut.

Configuration

Configuration is entirely optional - SLIC Watch provides defaults that work out of the box.

Note: Alarm configuration is cascading. This means that configuration properties are automatically propagated from parent nodes to child nodes, unless an override is present at the given node.

You can customize the configuration:

  • at the top level, for all resources in each service, and/or
  • at the level of individual functions.
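
To illustrate the cascade, here is a minimal sketch with illustrative values. It assumes that Period propagates at each level in the way the note above describes:

```yaml
custom:
  slicWatch:
    alarms:
      Period: 60          # inherited by every alarm unless overridden below
      Lambda:
        Period: 300       # overrides the top-level value for all Lambda alarms
        Errors:
          Threshold: 5    # overrides the threshold for the Errors alarm only
```

Under that assumption, the Lambda Errors alarm would use a period of 300 and a threshold of 5, while alarms for other services would keep the period of 60.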

Plugin configuration

Top-level plugin configuration can be specified in the custom -> slicWatch section of serverless.yml.

  • The topicArn may be optionally provided as an SNS Topic destination for all alarms. If you omit the topic, alarms are still created but are not sent to any destination.
  • Alarms or dashboards can be disabled at any level in the configuration by adding enabled: false. You can even disable all plugin functionality by specifying enabled: false at the top-level plugin configuration.
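
As a sketch of disabling features at different levels, the fragment below switches off the dashboard, all DynamoDB alarms, and a single Kinesis alarm. Which features to disable is an illustrative choice; the key names come from the option list in this section:

```yaml
custom:
  slicWatch:
    dashboard:
      enabled: false       # no dashboard is created
    alarms:
      DynamoDB:
        enabled: false     # no DynamoDB alarms are created
      Kinesis:
        GetRecords.Success:
          enabled: false   # only this Kinesis alarm is switched off
```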

Supported options along with their defaults are shown below.

# ...

custom:
  slicWatch:
    topicArn: SNS_TOPIC_ARN  # This is optional but recommended so you can receive alarms via email, Slack, etc.
    enabled: true

    alarms:
      enabled: true
      Period: 60
      EvaluationPeriods: 1
      TreatMissingData: notBreaching
      ComparisonOperator: GreaterThanThreshold
      Lambda: # Lambda Functions
        Errors:
          Threshold: 0
          Statistic: Sum
        ThrottlesPc: # Throttles are evaluated as a percentage of invocations
          Threshold: 0
        DurationPc: # Duration is evaluated as a percentage of the function timeout
          Threshold: 95
          Statistic: Maximum
        Invocations: # No invocation alarms are created by default. Override threshold to create alarms
          enabled: false # Note: this one requires both `enabled: true` and `Threshold: someValue` to be effectively enabled
          Threshold: null
          Statistic: Sum
        IteratorAge:
          Threshold: 10000
          Statistic: Maximum
      ApiGateway: # API Gateway REST APIs
        5XXError:
          Statistic: Average
          Threshold: 0
        4XXError:
          Statistic: Average
          Threshold: 0.05
        Latency:
          ExtendedStatistic: p99
          Threshold: 5000
      States: # Step Functions
        Statistic: Sum
        ExecutionsThrottled:
          Threshold: 0
        ExecutionsFailed:
          Threshold: 0
        ExecutionsTimedOut:
          Threshold: 0
      DynamoDB:
        # Consumed read/write capacity units are not alarmed. These should either
        # be part of an auto-scaling configuration for provisioned mode or should be automatically
        # avoided for on-demand mode. Instead, we rely on persistent throttling
        # to alert failures in these scenarios.
        # Throttles can occur in normal operation and are handled with retries. Threshold should
        # therefore be configured to provide meaningful alarms based on higher than average throttling.
        Statistic: Sum
        ReadThrottleEvents:
          Threshold: 10
        WriteThrottleEvents:
          Threshold: 10
        UserErrors:
          Threshold: 0
        SystemErrors:
          Threshold: 0
      Kinesis:
        GetRecords.IteratorAgeMilliseconds:
          Statistic: Maximum
          Threshold: 10000
        ReadProvisionedThroughputExceeded:
          Statistic: Maximum
          Threshold: 0
        WriteProvisionedThroughputExceeded:
          Statistic: Maximum
          Threshold: 0
        PutRecord.Success:
          ComparisonOperator: LessThanThreshold
          Statistic: Average
          Threshold: 1
        PutRecords.Success:
          ComparisonOperator: LessThanThreshold
          Statistic: Average
          Threshold: 1
        GetRecords.Success:
          ComparisonOperator: LessThanThreshold
          Statistic: Average
          Threshold: 1
      SQS:
        # approximate age of the oldest message in the queue above threshold: messages aren't processed fast enough
        AgeOfOldestMessage:
          Statistic: Maximum
          enabled: false # Note: this one requires both `enabled: true` and `Threshold: someValue` to be effectively enabled
          Threshold: null
        # approximate number of messages in flight above threshold (in percentage of hard limit: 120000 for regular queues and 20000 for FIFO queues)
        InFlightMessagesPc:
          Statistic: Maximum
          Threshold: 80 # 80% of 120000 for regular queues or 80% of 20000 for FIFO queues

    dashboard:
      enabled: true
      timeRange:
        # For possible 'start' and 'end' values, see
        # https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/CloudWatch-Dashboard-Body-Structure.html
        start: -PT3H
      metricPeriod: 300
      widgets:
        metricPeriod: 300
        width: 8
        height: 6
        Lambda:
          # Metrics per Lambda Function
          Errors:
            Statistic: ['Sum']
          Throttles:
            Statistic: ['Sum']
          Duration:
            Statistic: ['Average', 'p95', 'Maximum']
          Invocations:
            Statistic: ['Sum']
          ConcurrentExecutions:
            Statistic: ['Maximum']
          IteratorAge:
            Statistic: ['Maximum']
        ApiGateway:
          5XXError:
            Statistic: ['Sum']
          4XXError:
            Statistic: ['Sum']
          Latency:
            Statistic: ['Average', 'p95']
          Count:
            Statistic: ['Sum']
        States:
          # Step Functions
          ExecutionsFailed:
            Statistic: ['Sum']
          ExecutionsThrottled:
            Statistic: ['Sum']
          ExecutionsTimedOut:
            Statistic: ['Sum']
        DynamoDB:
          # Tables and GSIs
          ReadThrottleEvents:
            Statistic: ['Sum']
          WriteThrottleEvents:
            Statistic: ['Sum']
        Kinesis:
          # Kinesis Data Streams
          GetRecords.IteratorAgeMilliseconds:
            Statistic: ['Maximum']
          ReadProvisionedThroughputExceeded:
            Statistic: ['Sum']
          WriteProvisionedThroughputExceeded:
            Statistic: ['Sum']
          PutRecord.Success:
            Statistic: ['Average']
          PutRecords.Success:
            Statistic: ['Average']
          GetRecords.Success:
            Statistic: ['Average']
        SQS:
          # SQS Queues
          NumberOfMessagesSent:
            Statistic: ["Sum"]
          NumberOfMessagesReceived:
            Statistic: ["Sum"]
          NumberOfMessagesDeleted:
            Statistic: ["Sum"]
          ApproximateAgeOfOldestMessage:
            Statistic: ["Maximum"]
          ApproximateNumberOfMessagesVisible:
            Statistic: ["Maximum"]

An example project is provided for reference: serverless-test-project

Function-level configuration

For each function, add the slicWatch property to configure specific overrides for alarms and dashboards relating to the AWS Lambda Function resource.

functions:
  hello:
    handler: basic-handler.hello
    slicWatch:
      dashboard:
        enabled: false    # No Lambda widgets will be created for this function
      alarms:
        Lambda:
          Invocations:
            Threshold: 2  # The invocation threshold is specific to
                          # this function's expected invocation count

To disable all alarms for any given function, use:

functions:
  hello:
    handler: basic-handler.hello
    slicWatch:
      alarms:
        Lambda:
          enabled: false

A note on CloudWatch cost

This plugin creates additional CloudWatch resources that, apart from a limited free tier, have an associated cost. Depending on what you enable, SLIC Watch creates one dashboard and multiple alarms per stack. The total depends on the number of resources in your stack and the number of stacks you have.

Check out the AWS CloudWatch Pricing page to understand the cost impact of creating CloudWatch resources.

References

Other Projects

  1. serverless-plugin-aws-alerts
  2. Real World Serverless Application - Serverless Operations
  3. CDK Watchful
  4. CDK Patterns - The CloudWatch Dashboard

Reading

  1. AWS Well Architected Serverless Applications Lens
  2. How to Monitor Lambda with CloudWatch Metrics - Yan Cui

LICENSE

Apache - LICENSE