Demystifying OpenTelemetry: A Deep Dive into Observability’s Game-Changer

OpenTelemetry is an observability framework for creating and managing telemetry data, including traces, metrics, and logs. The project is hosted by the Cloud Native Computing Foundation (CNCF).

OpenTelemetry is the result of merging two earlier projects, OpenTracing and OpenCensus. Both addressed the same problem: the lack of a standardized way to instrument code and send telemetry data to an observability backend.

Why OpenTelemetry?

With the rise of cloud computing, microservices architectures, and increasingly complex business requirements, the need for observability has never been greater.

Observability is the ability to understand a system’s internal state by examining its outputs. In software, this means understanding a system’s internal state by analyzing its telemetry data: traces, metrics, and logs.

To make a system observable, it must be instrumented: the code must emit traces, metrics, and logs, and that telemetry data must then be sent to an observability backend.

OpenTelemetry does two important things:

  1. It lets organizations own the data they generate, rather than being locked into a proprietary data format or tool.
  2. It requires learning only a single set of APIs and standardized practices.

Together, these give teams and organizations the flexibility they need in today’s computing world.

Defining Observability

Observability lets you understand a system’s performance and characteristics, and ask questions about how it operates, without knowing its inner workings.

It also helps you identify and resolve “unknown unknowns” and answer the question, “Why is this happening?”

To ask such questions effectively, the application code must be instrumented to emit traces, metrics, and logs as signals. OpenTelemetry is the mechanism through which application code is instrumented, making the system observable.
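To make the idea of instrumentation concrete, here is a toy sketch in Python. It is not the real OpenTelemetry API; the `traced` helper and the `emitted` list are hypothetical stand-ins for a tracer and an exporter, showing only the core idea of wrapping an operation so it emits a structured, timestamped record:

```python
import time
import uuid

emitted = []  # stand-in for an exporter that would ship records to a backend

# Hypothetical minimal "instrumentation": wrap an operation so it emits a
# structured record (a stand-in for a real span) describing what ran and when.
def traced(operation_name, fn, *args):
    record = {
        "name": operation_name,
        "span_id": uuid.uuid4().hex[:16],  # OTel span IDs are 8 bytes (16 hex chars)
        "start_ns": time.time_ns(),
    }
    try:
        return fn(*args)
    finally:
        record["end_ns"] = time.time_ns()
        emitted.append(record)

def handle_request(path):
    return f"200 OK {path}"

result = traced("GET /", handle_request, "/")
```

A real SDK does far more (context propagation, sampling, batching), but the shape is the same: instrumented code produces signal data as a side effect of doing its normal work.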

Important Definitions

Telemetry – Data emitted from a system about its behavior, encompassing traces, metrics, and logs.

Reliability – The extent to which a service performs as expected by users. A system may have high uptime but still be considered unreliable if it fails to fulfill user expectations or requirements.

Metrics – Aggregated numerical data over a specific time period, providing insights into the performance and behavior of infrastructure or applications. Examples include system error rate, CPU utilization, and request rate for a service.

SLI (Service Level Indicator) – A measurement that reflects the behavior of a service, typically from the perspective of users. For example, the loading speed of a web page can be an SLI.

SLO (Service Level Objective) – A target or goal that defines the desired level of reliability communicated to an organization or other teams. SLOs are often based on one or more SLIs and are tied to the business value of the service.

Table 1 – OTel Definitions

Exploring the concept of Distributed Tracing

Let’s begin with the fundamentals of distributed tracing.

1. Logs

A log is a timestamped message emitted by a service or other component. Unlike traces, logs are not necessarily associated with any particular user request or transaction. Logs are ubiquitous and have historically been essential in helping developers and operators understand system behavior.

Sample Log:

I, [2021-02-23T13:26:23.505892 #22473]  INFO -- : [6459ffe1-ea53-4044-aaa3-bf902868f730] Started GET "/" for ::1 at 2021-02-23 13:26:23 -0800

Unfortunately, logs on their own are of limited use for tracking code execution, because they typically lack context, such as where in a request they were emitted.

They become far more useful when they are included in a span, or correlated with a trace and a span.
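As a sketch of what that correlation looks like, the snippet below enriches a log line with trace and span IDs using only Python’s standard `logging` module. The ID values and line format are illustrative, not the official OTel log format; in a real setup the active span context would supply the IDs automatically:

```python
import logging

# Format that embeds trace/span IDs so a backend can link the log to its trace.
fmt = logging.Formatter(
    "%(levelname)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(message)s"
)

def correlated_log(message, trace_id, span_id):
    record = logging.LogRecord(
        name="app", level=logging.INFO, pathname=__file__, lineno=0,
        msg=message, args=None, exc_info=None,
    )
    # Hypothetical: a real instrumentation layer would read these from the
    # current span context rather than taking them as arguments.
    record.trace_id = trace_id
    record.span_id = span_id
    return fmt.format(record)

line = correlated_log(
    'Started GET "/"',
    "4bf92f3577b34da6a3ce929d0e0e4736",  # 16-byte trace ID (hex)
    "00f067aa0ba902b7",                   # 8-byte span ID (hex)
)
```

With IDs like these attached, a backend can jump from a single log line to the full request trace it belongs to.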

2. Spans

A span represents a single unit of work or operation. It tracks the specific operations a request performs, painting a picture of what happened during the time that operation executed.

A span contains a name, time-related data, structured log messages, and other metadata, known as attributes, that describe the operation being tracked.
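Those components can be sketched as a small data structure. This is a hypothetical simplification, not the official OTel span model; field names like `start_ns` and the `add_event` helper are illustrative:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the data a span carries: identity, timing,
# attributes, and timestamped events (structured log messages).
@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None  # None means this is a root span
    start_ns: int = 0
    end_ns: int = 0
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def add_event(self, message):
        # Events are log messages tied to a moment within the span's lifetime.
        self.events.append((time.time_ns(), message))

span = Span(name="HTTP GET /", trace_id="abc123", span_id="def456")
span.start_ns = time.time_ns()
span.attributes["http.method"] = "GET"
span.attributes["http.status_code"] = 200
span.add_event("cache miss")
span.end_ns = time.time_ns()
```

Attributes carry the request-specific context (method, status code, and so on) that bare logs lack.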

3. Distributed Traces

A distributed trace, often simply called a trace, records the path a request takes as it propagates through a complex service-oriented architecture, such as a microservices or serverless application.

Without tracing, pinpointing the root cause of performance problems in a distributed system is extremely difficult.

Tracing improves the visibility of our application’s or system’s health, making it easier to diagnose behavior that is hard to reproduce locally. It is essential for distributed systems, which often suffer from nondeterministic problems or are simply too complex to reproduce on a single machine.

Tracing makes distributed systems easier to debug and understand by breaking down what happens within a request as it flows through the system.

A trace consists of one or more spans. The first span is the root span; each root span represents a request from start to finish. The spans beneath the root provide a more detailed picture of what happens during the request, step by step.

Many observability backends visualize traces as waterfall diagrams, which can be depicted as follows:
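As a rough text-mode stand-in for such a diagram, the sketch below indents each span under its parent; the span data (service names, timings) is entirely made up for illustration:

```python
# Hypothetical trace: a root span with nested child spans (times in ms).
spans = [
    {"id": "a", "parent": None, "name": "GET /checkout",   "start": 0,  "end": 120},
    {"id": "b", "parent": "a",  "name": "auth-service",    "start": 5,  "end": 30},
    {"id": "c", "parent": "a",  "name": "payment-service", "start": 35, "end": 110},
    {"id": "d", "parent": "c",  "name": "db query",        "start": 40, "end": 90},
]

def waterfall(spans, parent=None, depth=0):
    # Walk the parent/child hierarchy, indenting children under their parent,
    # the way a waterfall diagram nests spans under the root span.
    lines = []
    for s in spans:
        if s["parent"] == parent:
            lines.append(f"{'  ' * depth}{s['name']} [{s['start']}ms - {s['end']}ms]")
            lines.extend(waterfall(spans, s["id"], depth + 1))
    return lines

chart = "\n".join(waterfall(spans))
print(chart)
```

Reading top to bottom, the root span `GET /checkout` covers the entire request, while its children reveal where the time actually went.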


Conclusion

In conclusion, OpenTelemetry is an essential observability framework for managing telemetry data in today’s complex computing landscape. By merging OpenTracing and OpenCensus, it provides a standardized approach for instrumenting code and sending telemetry data to an observability backend.

As businesses grow more complex and cloud-based microservices architectures proliferate, observability has become more important than ever. OpenTelemetry lets organizations own the data they generate and learn a single set of APIs and standardized practices, flexibility that is vital in today’s computing world.

We have explored the concept of distributed tracing, which plays a crucial role in enhancing the visibility and understanding of system performance. Distributed traces help identify the root cause of performance issues in complex service-oriented architectures.

In part 2 of this series, we will delve further into the world of OpenTelemetry and enable an OTel tracing experiment. Stay tuned for more exciting insights into the untapped potential of observability with OpenTelemetry.
