Monitoring, Ops & DevOps

All The Metrics - A Cloud Monitoring Blueprint

Andy Warzon | Jun 22 2018

For companies coming from an on-premise or traditional architecture, the landscape of monitoring for cloud-based distributed architectures can be pretty bewildering. What metrics should you be monitoring? What tools and providers should you use? How can you centralize the data and correlate it to gain actionable insights about your production systems?

If you’re asking these questions, this post is for you and hopefully it can be a useful primer. We’ll also share a few opinions from Trek10’s experience running our CloudOps managed service.

One caveat: This advice is focused on all but the largest scale systems. If you have a particularly large and complex distributed system, you may need to think differently about monitoring & observability. That said, our experience has been that the majority of even the largest enterprises are focused on apps that don’t meet this threshold and a focus on traditional aggregate metrics is appropriate.

For starters, we find that there are six primary categories of metrics that you need to be thinking about. In future posts we’ll dive deeper into each of these, but let’s start with a high level overview.

Six Pieces of the Cloud Monitoring Pie

Mmm, pie.

VM Metrics

Just like an on-premise VM environment, if you’re running VMs (EC2 in AWS) you need an agent running on your virtual machine to collect traditional system metrics: CPU, RAM, disk, and network interface metrics. There are some differences from the on-premise world:

Focus more on aggregate metrics for autoscaling groups / clusters. When you start treating your pets like cattle, you really only need to focus on individual VMs when they are outliers.
Make sure your system alerts can gracefully handle virtual machines that are expected to be ephemeral (in other words, don’t alert me that the virtual machine is offline when autoscaling was supposed to shut it down!).
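As a rough illustration of both points, here is a minimal sketch in Python. It assumes you already have a per-instance CPU reading for each member of an autoscaling group and a set of instance IDs that autoscaling terminated on purpose. The function names, the 25-point tolerance, and the instance IDs are hypothetical, not part of any particular monitoring tool.

```python
from statistics import median


def find_outliers(cpu_by_instance, tolerance=25.0):
    """Return instance IDs whose CPU % is more than `tolerance` points
    away from the group median (hypothetical threshold)."""
    if not cpu_by_instance:
        return []
    group_median = median(cpu_by_instance.values())
    return sorted(
        instance_id
        for instance_id, cpu in cpu_by_instance.items()
        if abs(cpu - group_median) > tolerance
    )


def should_alert_offline(instance_id, offline, expected_terminations):
    """Alert only when an instance is offline and was NOT scaled in on purpose."""
    return instance_id in offline and instance_id not in expected_terminations


# Made-up sample readings for one autoscaling group.
cpu = {"i-aaa": 41.0, "i-bbb": 44.5, "i-ccc": 97.0, "i-ddd": 39.0}
print(find_outliers(cpu))
```

The group median is the aggregate you would normally chart and alert on. The outlier list is the short set of individual machines that still deserves a look.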

Cloud Provider Metrics

In the AWS world this is CloudWatch. These are critical for EC2 instances (they give you the hypervisor view of your systems) but especially so for all of the other AWS managed services where they are really your only way to get any deeper insight into service health and your application’s operating profile. The AWS tools for exploring and dashboarding these metrics are getting better and better, but you can also export CloudWatch metrics to other monitoring tools.
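To make the export idea concrete, the sketch below builds the parameters for a CloudWatch GetMetricStatistics request for EC2 CPU utilization and flattens the returned datapoints into a time series that another tool could ingest. The helper names are hypothetical, and the sample datapoints are made up. The actual API call is left as a comment because it assumes boto3 is installed and AWS credentials are configured.

```python
from datetime import datetime, timedelta, timezone


def cpu_query(instance_id, end=None, minutes=60, period=300):
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    end = end or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Average", "Maximum"],
    }


def to_series(datapoints, stat="Average"):
    """Turn CloudWatch datapoints into (timestamp, value) pairs sorted by time."""
    ordered = sorted(datapoints, key=lambda point: point["Timestamp"])
    return [(point["Timestamp"], point[stat]) for point in ordered]


# With boto3 and credentials available (assumption), the call would be:
#   import boto3
#   response = boto3.client("cloudwatch").get_metric_statistics(**cpu_query("i-aaa"))
#   series = to_series(response["Datapoints"])

# Made-up datapoints in the shape CloudWatch returns them (unordered).
t0 = datetime(2018, 6, 22, 12, 0, tzinfo=timezone.utc)
sample = [
    {"Timestamp": t0 + timedelta(minutes=5), "Average": 52.0, "Maximum": 80.0},
    {"Timestamp": t0, "Average": 40.0, "Maximum": 61.0},
]
print(to_series(sample))
```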

External Uptime/Ping Metrics

It is always critical to have a “last line of defense” in terms of monitoring… if all other metrics fail to notify you of an impending problem, or if the metrics themselves are having problems, an external tool that is pinging your public endpoints can independently notify you of an outage as fast as possible.
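A minimal sketch of such an external check, using only the Python standard library, might look like the following. The function name and the rule that any 2xx or 3xx status counts as "up" are assumptions for illustration. The `opener` parameter exists only so the check can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url, timeout=5.0, opener=urllib.request.urlopen):
    """Request `url` once and report whether it answered, and how quickly."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        status = exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error, and so on.
        status = None
    latency_ms = (time.monotonic() - started) * 1000.0
    up = status is not None and 200 <= status < 400
    return {"url": url, "up": up, "status": status, "latency_ms": latency_ms}
```

To serve as a last line of defense, a check like this should run from somewhere that shares as little as possible with the system it watches, and report its results to a separate alerting path.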

Application Performance Monitoring (APM)

Whether you are in the cloud or on premises, the APM view is critical: from your customer’s perspective, how is your application performing? And if there are problems, where in the flow of a customer transaction are the bottlenecks?

Log Aggregation

First and foremost, when you move to the cloud you have to ship off your logs. No useful logs should ever be stuck on an instance. But taking it a step further, once all of your logs are aggregated into a single tool you can start charting and dashboarding trends from these logs and correlating to your other metrics.
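As a small example of turning aggregated logs into a trend, the sketch below counts error lines per minute. It assumes a hypothetical log format of an ISO 8601 timestamp, a level, and a message separated by spaces. Real log tools do this with their own query languages, so treat this only as an illustration of the idea.

```python
from collections import Counter


def errors_per_minute(lines, level="ERROR"):
    """Count lines of the given level, bucketed by minute."""
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 2 or parts[1] != level:
            continue
        # The first 16 characters of the timestamp are "YYYY-MM-DDTHH:MM".
        counts[parts[0][:16]] += 1
    return dict(sorted(counts.items()))


# Made-up log lines in the assumed format.
logs = [
    "2018-06-22T12:00:01Z INFO request served",
    "2018-06-22T12:00:09Z ERROR payment failed",
    "2018-06-22T12:00:41Z ERROR payment failed",
    "2018-06-22T12:01:03Z ERROR upstream timeout",
]
print(errors_per_minute(logs))
```

A series like this can then be charted next to CPU or latency metrics to see whether an error spike lines up with a resource problem.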

Custom Metrics

This final category is often overlooked but critical. Instrumenting your app with a few custom metrics will often really get to the heart of what matters: are the right business events happening and how long are they taking? Custom metrics are also a great way to track your background jobs and other back-end activities.
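One lightweight way to answer "did the business event happen, and how long did it take?" is to wrap the relevant code in a timer. The sketch below is an assumption about how that could look: `timed_event` and the `record` callback are hypothetical names, and `record` stands in for whatever metrics client you use.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_event(name, record, clock=time.monotonic):
    """Time a block of code and report (name, outcome, duration in ms) to `record`."""
    started = clock()
    outcome = "success"
    try:
        yield
    except Exception:
        outcome = "failure"
        raise
    finally:
        record(name, outcome, (clock() - started) * 1000.0)


# Usage: collect measurements in a list instead of a real metrics backend.
recorded = []
with timed_event("checkout.completed", lambda *event: recorded.append(event)):
    pass  # the business logic or background job would run here
```

Because the outcome is recorded alongside the duration, the same wrapper covers both questions: whether the event succeeded and how long it took.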

Trek10’s Monitoring Toolbelt

We’ll cut to the chase here and explain some of our choices of tools & providers.

Datadog: We’re huge fans of Datadog for its ability to handle several of the above categories in a single tool: an agent for VM metrics, CloudWatch integration, and (newer) APM and logging offerings. Also critical is Datadog’s large library of pre-built integrations, so when you need to select another tool for some of your metrics, like New Relic, Pingdom, or SumoLogic, that elusive “single pane of glass” is just a few clicks away. Datadog also has one of the simplest and most powerful custom metric feature sets around, with multiple ways to push metrics in just a few lines of code. The icing on the cake is a beautiful UI and a deep set of features for power users.
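As one example of pushing a custom metric in a few lines, the sketch below formats a metric in the DogStatsD text format and sends it over UDP to a locally running Datadog Agent. It deliberately uses only the standard library rather than Datadog's own client libraries. The metric name and tags are hypothetical, and it assumes the Agent is listening on its default DogStatsD port, 8125.

```python
import socket


def format_datagram(name, value, metric_type="g", tags=None):
    """Build a DogStatsD line: <name>:<value>|<type>|#<tag1>,<tag2>.
    Common types: "c" (count), "g" (gauge), "ms" (timer), "h" (histogram)."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to the local agent (assumed default port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


line = format_datagram("checkout.duration", 250, "ms", ["env:prod", "service:web"])
print(line)
```

Calling `send_metric(line)` would then hand the measurement to the agent. Since UDP is fire-and-forget, a missing agent does not slow down or break the application.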

Trek10 Pinger: There’s no lack of options in this space: Pingdom is probably the best-known external uptime monitoring tool, and an interesting newer option is Apex Ping. However, at Trek10 we decided to build our own to handle our unique requirements as an MSP and to integrate it tightly with Datadog. We’ve also added custom features for monitoring SSL and domain expiration, which have saved more than one client from the dreaded SSL expiration outage! As heavy Serverless users, we of course built it with Lambda. It runs in five AWS regions globally, so it shares no dependencies with the workloads it is monitoring, or even continents!
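The Pinger's implementation is not shown here, but the idea behind an SSL expiration check can be sketched with Python's standard `ssl` module. The function names below are hypothetical and this is not Trek10's code: one function reads the certificate's `notAfter` field from a live endpoint, and the other converts it into days remaining so you can alert well before zero.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days between `now` and a certificate's notAfter string,
    e.g. "Jun 22 12:00:00 2018 GMT"."""
    now = now or datetime.now(timezone.utc)
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now.timestamp()) / 86400.0


def fetch_not_after(hostname, port=443, timeout=5.0):
    """Connect over TLS and return the server certificate's notAfter field.
    With the default context, an already expired certificate fails the
    handshake, which is itself a signal worth alerting on."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with a made-up expiry date and a fixed "now".
remaining = days_until_expiry(
    "Jun 22 12:00:00 2018 GMT",
    now=datetime(2018, 6, 12, 12, 0, 0, tzinfo=timezone.utc),
)
print(remaining)
```

In practice you would call `days_until_expiry(fetch_not_after("your-domain"))` on a schedule and raise an alert when the result drops below a threshold such as 30 days. That threshold is an illustrative choice, not one taken from the post.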

Figure: some of our Pinger metrics in a Datadog dashboard.

APM: New Relic and AppDynamics are the market leaders for traditional architectures, and we find those tools to still be the simplest and most robust for EC2-based applications. However, at Trek10 we spend a lot of our time with Serverless / platform-service-based applications, and this is a much more wide-open space. We’re watching trends closely, including interesting startups like IOPipe and Thundra as well as AWS’s X-Ray service.

There’s a lot more to say about all of these categories, so look for more posts to come. In the meantime, check out the rest of our blog, follow us @Trek10Inc, and let us know if we can help you with your cloud monitoring.

This is the first in a series of posts about monitoring production workloads in AWS.

Author

Andy Warzon

Founder & CTO, Andy has been building on AWS for over a decade and is an AWS Certified Solutions Architect - Professional.
