
A Simplified Guide to OpenTelemetry

Digital services are increasingly built as a collection of components working in concert to deliver significant business functions. Understanding how the components of a system are behaving is crucial to delivering a service reliably. With many systems interacting, it can be difficult, if not impossible, to understand the state of your services and their dependencies without detailed data about how they function. This kind of data is known as telemetry, and OpenTelemetry (OTEL) is a vendor-neutral observability framework for instrumenting, generating, and collecting such data to help organizations gain deeper insights across their systems. In this guide, I’ll discuss what to know about OpenTelemetry, including specifications around types of telemetry data, how it works, and the crucial role OpenTelemetry plays in observability platforms.

What is Telemetry Data?

OpenTelemetry is designed to help service designers and operators, such as DevOps, SREs, and other IT teams, collect, standardize, and streamline data about their systems using the OpenTelemetry Protocol (OTLP). Some of the most important types of data collected about service operations are logs, metrics, and traces, which can be defined as:
  • Logs are messages emitted by services and components describing some state or other useful information about a system. Logs include timestamps reflecting the point in time the message was generated. Logs are useful for collecting service-specific information, such as the start or completion of a response to a request or a warning about an unexpected input.
  • Metrics are aggregate, numeric values that describe the state of a system over some period. Metrics can describe utilizations, such as the percent of CPU utilization over the past five minutes, and rates, such as the number of service requests per minute. Metrics are sometimes used to measure the state of a service from the perspective of a user: for example, the average time to load a page. These metrics are used as service level indicators to help you understand how services are performing relative to business objectives.
  • Traces, also called distributed traces, are the third type of telemetry data, or signal. Tracing involves tracking the path of a service request through a distributed system, such as an application built on a microservices architecture. Traces are especially important for isolating problems in operations that involve a series of requests to multiple services, since the conditions inside a distributed system can be difficult to reproduce when debugging.
It's important to note that traces are organized into spans. The root span represents a request from beginning to end. Within the root span is a series of shorter spans that reflect requests made during the scope of the root span. For example, a root span may describe a checkout process, with more detailed spans providing information on authentication and payment processing. Organizing performance information into spans is particularly helpful in identifying a component that is causing long latencies in an operation, such as checking out on an e-commerce website; a short code sketch after Figure 1 shows how this structure looks in practice.

Figure 1. Traces are organized into spans, which show the distribution of time spent in services.
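To make the three signal types concrete, here is a minimal sketch in Python, assuming the opentelemetry-api and opentelemetry-sdk packages are installed and using console exporters so it runs standalone. It emits a nested trace mirroring the checkout example above, a metric counter, and a log message; the names (checkout-demo, checkouts, payment.method) are illustrative rather than part of any particular service.

    import logging

    from opentelemetry import metrics, trace
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # Wire the SDK to console exporters so the example is self-contained.
    tracer_provider = TracerProvider()
    tracer_provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(tracer_provider)

    metrics.set_meter_provider(
        MeterProvider(metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())])
    )

    tracer = trace.get_tracer("checkout-demo")
    meter = metrics.get_meter("checkout-demo")
    checkouts = meter.create_counter("checkouts", unit="1", description="Completed checkouts")

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("checkout-demo")

    # A root span for the whole checkout, with child spans for each step,
    # mirroring the structure shown in Figure 1.
    with tracer.start_as_current_span("checkout"):
        with tracer.start_as_current_span("authenticate"):
            log.info("user authenticated")                # a log signal
        with tracer.start_as_current_span("process-payment"):
            checkouts.add(1, {"payment.method": "card"})  # a metric signal

Running this prints each finished span, including its parent relationship, along with the counter value: the same data an exporter would normally send to a backend.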

How Does OpenTelemetry Work?

OpenTelemetry includes several components, starting with APIs used to add instrumentation code to programs so they emit telemetry data. Instrumentation is language-specific: OpenTelemetry provides APIs for commonly used programming languages, including Python, Java, JavaScript, and Go, along with instrumentation for popular libraries in those ecosystems. OpenTelemetry also provides language-specific software development kits (SDKs) that implement the APIs and add functionality when needed. The SDKs support both automatic and custom instrumentation of services. With auto-instrumentation, a developer can rapidly incorporate the collection of metrics, logs, and traces into a service; custom instrumentation allows developers to add functionality not natively included in OpenTelemetry instrumentation.

OpenTelemetry uses exporters, which are modules that transmit data using OTLP from instrumented services to a backend service where the data can be persisted and analyzed. Sending data directly from an exporter to a backend service can work well in development or in lightweight production environments, but in general, it’s a good practice to use a collector. OpenTelemetry collectors are services that run alongside instrumented services, receive data from exporters, and make sure that data is reliably delivered to the backend. Collectors are designed to batch data efficiently, retry failed transmissions, encrypt data, and perform related operations.
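As an illustration of the exporter-to-collector path, here is a minimal sketch assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc Python packages and a collector listening on the default OTLP/gRPC port (4317) on the same host; the service name and endpoint are illustrative.

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Identify the service so the backend can attribute its telemetry.
    provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))

    # The OTLP exporter hands spans to a local collector; the batch processor
    # groups spans before export instead of sending them one at a time.
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("checkout-service")
    with tracer.start_as_current_span("handle-request"):
        pass  # application work happens here

For auto-instrumentation, installing the opentelemetry-instrumentation packages and launching a service through the opentelemetry-instrument command can produce similar telemetry from supported libraries without code changes.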

Differences between OpenTelemetry and OpenTracing

Comparing OpenTelemetry and OpenTracing is like comparing peanut butter to a PB&J, as OpenTracing is actually one of the foundational open-source projects included in OpenTelemetry. Hosted by the Cloud Native Computing Foundation (CNCF), which is also responsible for Kubernetes and Prometheus, the OpenTelemetry project began as a way to address the need to collect telemetry data more efficiently across a wide range of services by merging the functionalities of the existing open-source projects OpenCensus and OpenTracing:
  • OpenTracing, as the name implies, is a vendor-neutral API for collecting and reporting traces.
  • OpenCensus is a set of libraries for collecting traces and metrics that uses the same metadata for both signal types.
By combining OpenTracing and OpenCensus, we now have a single standard for instrumentation instead of two different but overlapping methods of collecting traces. In addition, sharing a set of metadata across metrics and traces streamlines analysis, since one set of code can work with both signals.

Benefits of Using OpenTelemetry

OpenTelemetry provides a single framework for working with telemetry data. This contrasts with earlier application performance monitoring (APM) systems, which offered a view into some of the data types (metrics, traces, or logs) but not all of them. This matters because, in the past, APM often depended on vendor-specific monitoring solutions, which led to tool sprawl as teams adopted disparate analysis tools to solve specific pain points instead of centralizing performance insights in a single solution. For those of us who have struggled to correlate metrics collected in one system with log messages from another, having standardized telemetry in one place is key to understanding the performance of the entire ecosystem and unlocks opportunities to streamline troubleshooting, automation, and alerting. By using OpenTelemetry, developers have a common framework that standardizes siloed telemetry data and reduces vendor lock-in, allowing a more integrated and holistic view of network performance, security issues, and more in an observability platform.

Another important feature of OpenTelemetry is that it’s a vendor-neutral, open-source project, which brings two significant advantages:
  1. No single vendor can make decisions that would undermine use of the standard by others.
  2. The project is governed by community members who use the standard in a wide array of use cases.
Because OpenTelemetry is open source, it helps promote advances that benefit the community at large rather than narrow interest groups.

How SolarWinds Helps Make OpenTelemetry More Secure

With any application or service, administrators must be concerned with security. In the case of OpenTelemetry, you should consider how to prevent the exposure of sensitive data, avoid spoofing, prevent unauthorized data from being collected as if it were legitimate, and provide safeguards against denial-of-service attacks. The SolarWinds® Platform was created with first-class support for OTLP. Since OTLP is an open format, our goal is to help improve the security of OpenTelemetry by implementing best practices for protecting sensitive information when you use our agent to send data to our platform. Learn more about how the SolarWinds Platform came to be in this blog post.
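As one example of the kind of safeguard to consider, the sketch below, which is not specific to SolarWinds or any other vendor and uses a placeholder endpoint and token, configures the Python OTLP/gRPC exporter to encrypt data in transit with TLS and to identify the sender with an authorization header.

    import grpc

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    exporter = OTLPSpanExporter(
        endpoint="collector.example.com:4317",                # placeholder endpoint
        credentials=grpc.ssl_channel_credentials(),           # TLS: encrypt data in transit
        headers=(("authorization", "Bearer <api-token>"),),   # placeholder token to authenticate the sender
    )

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

Scrubbing sensitive attributes before export and limiting what a collector will accept can further address the data-exposure and denial-of-service concerns mentioned above.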
Dan Sullivan