From: aidotengineer

OpenTelemetry is a large open-source project maintained by the CNCF (Cloud Native Computing Foundation) that standardizes observability in cloud environments [00:00:28]. It is one of the largest CNCF projects, second only to Kubernetes [00:00:33]. All major observability platforms support it, including Splunk, Datadog, Dynatrace, New Relic, Grafana, and Honeycomb [00:00:50], so OpenTelemetry can be used with any of them [00:00:57].

What OpenTelemetry Standardizes

At its core, OpenTelemetry is a protocol that standardizes how cloud applications handle logs, metrics, and traces [00:01:05].

Logging

Logging refers to sending arbitrary events at any point in an application’s lifecycle [00:01:29]. These logs are emitted as-is and can be viewed later, optionally with attached metadata [00:01:37].

Metrics

Metrics are designed for aggregate-level viewing, allowing users to observe behavior across days or users [00:01:44]. In traditional cloud environments, metrics typically include CPU usage, memory usage, and latency [00:01:56]. For Gen AI applications, relevant metrics include token usage, latency, and error rate [00:02:08].
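
As a rough illustration of the aggregate view described above, the sketch below rolls individual requests up into the three Gen AI metrics named in the talk. It uses only the Python standard library and is not the OpenTelemetry metrics API; `RequestRecord` and `summarize` are hypothetical names introduced for this example, and the sample values are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """One LLM call as a hypothetical application might record it."""
    tokens: int
    latency_ms: float
    ok: bool


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate individual requests into a total, an average, and a rate."""
    if not records:
        return {"total_tokens": 0, "avg_latency_ms": 0.0, "error_rate": 0.0}
    errors = sum(1 for r in records if not r.ok)
    return {
        "total_tokens": sum(r.tokens for r in records),
        "avg_latency_ms": mean(r.latency_ms for r in records),
        "error_rate": errors / len(records),
    }


# Made-up sample data: three successful calls and one failure.
records = [
    RequestRecord(tokens=120, latency_ms=300.0, ok=True),
    RequestRecord(tokens=80, latency_ms=500.0, ok=True),
    RequestRecord(tokens=0, latency_ms=100.0, ok=False),
    RequestRecord(tokens=200, latency_ms=700.0, ok=True),
]
summary = summarize(records)
```

The point of the sketch is that a metric discards the individual events and keeps only the aggregate, which is what makes it cheap to chart across days or users.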

Tracing

Tracing tracks multi-step processes and was the first capability OpenTelemetry defined [00:02:21]. In cloud environments, tracing is used to monitor processes that span multiple microservices, showing how a user request is processed across them [00:02:35]. For Gen AI, tracing is particularly common because multi-step processes such as chains, workflows, and tool-using agents are so prevalent [00:02:55].
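
To make the idea of a trace concrete, here is a minimal, single-threaded sketch of how nested steps can be linked together: every step becomes a span, child spans share the root's trace ID, and each child records its parent's span ID. This is a toy model built on the standard library, not the OpenTelemetry tracing API; the `Span` and `Tracer` classes and the step names are assumptions made for illustration.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None


class Tracer:
    """Toy tracer: keeps a stack of open spans and a list of finished ones."""

    def __init__(self) -> None:
        self._stack: list[Span] = []
        self.finished: list[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        current = Span(
            name=name,
            # Child spans inherit the trace ID; a root span starts a new trace.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            # Close the span even if the wrapped step raised.
            current.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("agent.run") as root:
    with tracer.span("llm.plan"):
        pass  # a model call would go here
    with tracer.span("tool.search"):
        pass  # a tool call would go here
```

After this runs, `tracer.finished` holds three spans with one shared trace ID, and the two inner spans point at `agent.run` as their parent. That parent/child linkage is what lets a backend render an agent run as a tree of steps.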

The OpenTelemetry Ecosystem

Beyond being a protocol, OpenTelemetry is also an ecosystem comprising SDKs, instrumentations, and collectors [00:03:23].

SDKs

SDKs (Software Development Kits) allow developers to manually send logs, metrics, and traces from their applications [00:03:32]. OpenTelemetry currently provides SDKs in 11 different languages, including Python, TypeScript, Go, and C++ [00:03:40].

Instrumentations

Instrumentations provide an automated way to gather observability data [00:03:55]. Unlike SDKs, which require manual implementation, instrumentations automatically capture logs, metrics, and traces from specific parts of an application [00:03:58]. They work by “monkey patching” the client library used within an application so that it emits the relevant data [00:04:54]. These instrumentations are engineered to have a negligible latency impact [00:05:09].
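
The monkey-patching technique mentioned above can be sketched in a few lines: at startup, a method on the client class is swapped for a wrapper that times the call, records the outcome, and then delegates to the original. The example below is a minimal sketch and not how any particular OpenTelemetry instrumentation is implemented; `FakeLLMClient`, `instrument`, and the `captured` list are hypothetical stand-ins for a real client library and a real telemetry exporter.

```python
import functools
import time


class FakeLLMClient:
    """Hypothetical stand-in for a third-party client library."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


captured = []  # stands in for emitted telemetry


def instrument(cls, method_name: str) -> None:
    """Replace cls.method_name with a wrapper that records each call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return original(self, *args, **kwargs)
        except Exception as exc:
            error = type(exc).__name__
            raise  # never swallow the application's own errors
        finally:
            captured.append({
                "name": f"{cls.__name__}.{method_name}",
                "duration_s": time.perf_counter() - start,
                "error": error,
            })

    setattr(cls, method_name, wrapper)


instrument(FakeLLMClient, "complete")

# Application code is unchanged; it still calls the client as before.
result = FakeLLMClient().complete("hello")
```

Because the patch is applied to the class rather than to call sites, the application needs no code changes, and the only added work per call is two clock reads and one list append.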

Collectors

Collectors are self-deployable components that can be placed in a cloud environment (e.g., Kubernetes) to preprocess observability data before it is sent to an observability platform [00:05:30]. They can be used for tasks such as filtering out unimportant data, redacting Personally Identifiable Information (PII), or hiding other sensitive data [00:05:40]. Collectors are open source and come with many built-in features [00:05:54]. They can also send observability data to multiple providers simultaneously [00:06:00].
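
Conceptually, a collector is a pipeline: records come in, pass through a chain of processors, and are fanned out to one or more exporters. The sketch below models that flow in plain Python to show the filtering, redaction, and multi-destination behavior described above. It is not the OpenTelemetry Collector or its configuration format; the function names, the record shape, and the email-only notion of PII are simplifying assumptions.

```python
import re
from typing import Callable, Iterable, Optional

Record = dict
Processor = Callable[[Record], Optional[Record]]
Exporter = Callable[[Record], None]

# Deliberately simple pattern; real PII detection needs far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def drop_debug(record: Record) -> Optional[Record]:
    """Filter step: discard low-value records by returning None."""
    return None if record.get("level") == "debug" else record


def redact_emails(record: Record) -> Optional[Record]:
    """Redaction step: mask email addresses in the record body."""
    return {**record, "body": EMAIL.sub("[REDACTED]", record["body"])}


def run_pipeline(
    records: Iterable[Record],
    processors: list[Processor],
    exporters: list[Exporter],
) -> None:
    for record in records:
        current: Optional[Record] = record
        for process in processors:
            current = process(current)
            if current is None:
                break  # record was filtered out; skip export
        else:
            # Fan out: every exporter receives the same processed record.
            for export in exporters:
                export(current)


backend_a: list = []  # stand-ins for two observability platforms
backend_b: list = []
run_pipeline(
    records=[
        {"level": "debug", "body": "cache warm"},
        {"level": "info", "body": "prompt from alice@example.com"},
    ],
    processors=[drop_debug, redact_emails],
    exporters=[backend_a.append, backend_b.append],
)
```

Both backends end up with the single `info` record, with the address replaced by `[REDACTED]`. Doing this in one place, before data leaves the environment, is the main reason to run a collector rather than configuring each application separately.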

Extension for Generative AI: OpenLLMetry

OpenLLMetry is an open-source project that extends OpenTelemetry to support Generative AI (Gen AI) frameworks, foundation models, and vector databases [00:06:16]. By building on the existing OpenTelemetry standard, OpenLLMetry enables observability on any platform that supports OpenTelemetry [00:06:42].

OpenLLMetry has developed over 40 different instrumentations for various providers [00:07:09]:

  • Foundation models: OpenAI, Anthropic, Cohere, Gemini, Bedrock, and others [00:07:16].
  • Vector databases: Pinecone, Chroma, and others [00:07:23].
  • Frameworks: LangChain, LlamaIndex, CrewAI, and Haystack [00:07:28].

These instrumentations automatically emit logs, metrics, and traces that can be connected to any desired platform for out-of-the-box observability [00:07:34]. For example, a Pinecone instrumentation can show queries and indexing activity, and it lets users inspect the returned vectors, including distances, scores, and latencies, all in the standard OpenTelemetry format [00:07:54].

Because OpenTelemetry is a standard protocol, it helps avoid vendor lock-in: users can switch between platforms with a simple configuration change [00:08:41].
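
One way to picture that configuration change: the OpenTelemetry specification defines standard environment variables for OTLP exporters, including `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS`. The sketch below only mimics how such settings could be read, to show that the destination lives in configuration rather than in application code. The `exporter_config` helper, the hostnames, and the header values are hypothetical, and real SDKs read these variables themselves.

```python
def exporter_config(env: dict[str, str]) -> dict:
    """Resolve export settings from environment-style key/value pairs."""
    # Assumed fallback: the conventional local OTLP-over-HTTP address.
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
    headers: dict[str, str] = {}
    # Headers are given as comma-separated key=value pairs.
    for pair in env.get("OTEL_EXPORTER_OTLP_HEADERS", "").split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            headers[key.strip()] = value.strip()
    return {"endpoint": endpoint, "headers": headers}


# Placeholder vendors: only the environment differs, not the application.
vendor_a = exporter_config({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://otlp.vendor-a.example",
    "OTEL_EXPORTER_OTLP_HEADERS": "api-key=abc123",
})
vendor_b = exporter_config({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://ingest.vendor-b.example",
    "OTEL_EXPORTER_OTLP_HEADERS": "x-token=xyz, x-team=ml",
})
```

In practice the same effect is usually achieved by changing those variables in the deployment environment, or by editing the exporter section of a collector's configuration.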