Integrating Datadog Instrumented Apps in your OpenTelemetry Stack

Published at
5/16/2024
Categories
tracetest
traces
opentelemetry
Author
danielbdias

At Tracetest, we encountered an issue while assisting a customer to integrate legacy APIs, which were instrumented with older SDKs, with new APIs using OpenTelemetry (OTel) SDKs. We've documented our findings in this article, outlining how such a problem can arise in an observability stack and how it can be resolved using the OpenTelemetry Collector.

This guide also shows how to establish an observability pipeline that can process data from Datadog-instrumented apps. This allows you to receive and correlate data within OpenTelemetry stacks without having to send it to Datadog.

The Legacy SDK Issue

Have you ever struggled to standardize telemetry data coming from the Datadog SDKs? Mixing Datadog SDKs with OpenTelemetry SDKs can complicate an observability system because the telemetry they send to Datadog is inconsistent. As a result, detailed data such as traces cannot be analyzed before it reaches Datadog, which makes it harder to set up an efficient observability pipeline.

How Datadog SDKs Work

Datadog is a monitoring and analytics platform that allows developers and IT operators to observe the performance of their applications, servers, databases, and tools, all in one place. It is one of the major players in the observability landscape.

To let developers integrate with the platform quickly, Datadog provides a set of SDKs that enable applications to send telemetry to Datadog automatically. These SDKs are commonly found in apps that were instrumented before OpenTelemetry standardized how telemetry is sent, using the OTLP protocol. Because of that, systems that use both Datadog SDKs and OTel SDKs can be complex in terms of observability, since their combined data is only available for analysis at the server level. In this scenario, developers cannot use other OTel-compatible solutions to analyze data, such as traces, before sending it to Datadog.

Structure an Observability Stack with Datadog

When you instrument an API with Datadog SDKs, the API sends its telemetry to an agent, and this agent then forwards the telemetry to Datadog servers.

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715868912/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.14.54_zay5mp.png

However, with the rise of OpenTelemetry as a standard, new APIs usually centralize their telemetry in an OpenTelemetry Collector, which then sends the data to Datadog. This creates mixed environments where data is processed and correlated only on Datadog servers.

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869011/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.16.38_oxenuh.png

To simplify this structure, we can centralize all the communication in an OpenTelemetry Collector and set up a datadog receiver that works like an agent, receiving traces and metrics in Datadog proprietary format:

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869122/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.18.30_jhr2ac.png

To set up your OpenTelemetry Collector with this receiver, you must use the OTel Collector Contrib distribution and add the datadog receiver in the configuration file:

receivers:
  otlp:
    protocols:
      grpc:
      http:

  # Accepts traces and metrics in Datadog's format, like a Datadog Agent would
  datadog:
    endpoint: 0.0.0.0:8126
    read_timeout: 60s

# ...

service:
  pipelines:
    traces:
      receivers: [otlp, datadog]
      # ...
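
For the Datadog-instrumented app to reach this receiver, its tracer has to point at the Collector instead of a Datadog Agent. A minimal Ruby sketch of one way to do that, assuming the tracer honors the DD_AGENT_HOST and DD_TRACE_AGENT_PORT environment variables and that the Collector is reachable under the hypothetical hostname otel-collector (in a Docker Compose setup, these variables would more commonly be set in the compose file itself):

```ruby
# Build the environment variables that Datadog tracers read to locate their agent.
# The default port matches the datadog receiver endpoint in the Collector config.
def datadog_agent_env(host:, port: 8126)
  {
    'DD_AGENT_HOST' => host,
    'DD_TRACE_AGENT_PORT' => port.to_s
  }
end

# 'otel-collector' is a hypothetical hostname; use the name of your Collector service.
# Values that are already set in the environment are left untouched.
datadog_agent_env(host: 'otel-collector').each do |name, value|
  ENV[name] ||= value
end

puts ENV['DD_AGENT_HOST']
puts ENV['DD_TRACE_AGENT_PORT']
```

For this to take effect, the variables need to be in place before the tracer is configured, for example at the very start of the application boot.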

Sending Datadog Data to Jaeger

Using the proposed architecture as a starting point, let's download a demo that shows how to adapt it to send data to Jaeger. This lets you run the APIs locally and make any observability changes needed before publishing to production.

First, open a terminal and download the demo with the following commands:

git clone git@github.com:kubeshop/tracetest.git
cd ./tracetest/examples/datadog-propagation

docker compose -f ./docker-compose.step1.yaml up -d

This action starts two Ruby on Rails APIs, one instrumented with ddtrace and another with OpenTelemetry SDK, both connecting to an OpenTelemetry Collector that sends data to Jaeger:

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869348/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.20.43_el3us5.png

Next, you can call the OTelSDK-instrumented API, which executes an internal endpoint on the Datadog-instrumented API and returns an output. It is expected that this operation generates just one trace. Try it by executing the command:

> curl http://localhost:8080/remotehello
# it returns:
# {"hello":"world!","from":"quick-start-api-otel","to":"quick-start-api-datadog"}

However, when accessing Jaeger locally (via the link http://localhost:16686/), you see that two disjointed traces were generated, one for each API:

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869464/Blogposts/datadog-instr-otel/step1-otel-api_cioprm.png

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869497/Blogposts/datadog-instr-otel/step1-datadog-api_shxj8x.png

To correct this problem, you need to change your stack a little bit. Read on to learn how to apply these changes.

Understand trace_id Representations for Datadog and OpenTelemetry

This issue happens due to a difference in how trace_ids are represented in the Datadog format and the OTel SDK format. Datadog treats a trace_id as an unsigned 64-bit integer, while the OpenTelemetry SDK treats it as a 128-bit value, written as 32 hexadecimal digits:

  • Datadog TraceID (int64): 4289707906147384633
  • Datadog TraceID (hex): 3b881a9ce0197d39
  • OTelSDK TraceID (hex): f3c18530c08e00a43b881a9ce0197d39

Since you have two TraceID representations, you have two Traces in Jaeger.

To propagate its traces through an OpenTelemetry stack, Datadog keeps an internal value that can be used to reconstruct the full TraceID. Internally, the Datadog SDK calls it the Upper TraceID, and it is stored in the _dd.p.tid attribute attached to the first span of the trace:

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870223/Blogposts/datadog-instr-otel/Untitled_33_cx09ei.png

Concatenating this Upper TraceID with Datadog's TraceID in hexadecimal representation (called Lower TraceID), you have the exact TraceID representation for an OpenTelemetry stack:

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870267/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.37.40_ieqruf.png
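
The concatenation can be checked with a few lines of Ruby. This is only a sketch that uses the example IDs listed earlier; the helper names are made up for illustration and are not part of any SDK:

```ruby
# Convert a Datadog 64-bit trace ID (an integer) into its 16-digit hex form,
# which is the Lower TraceID.
def lower_trace_id_hex(datadog_trace_id)
  format('%016x', datadog_trace_id)
end

# Join the Upper TraceID (the _dd.p.tid value) and the Lower TraceID into the
# 32-digit hex trace ID used by OpenTelemetry.
def otel_trace_id(upper_hex, datadog_trace_id)
  upper_hex + lower_trace_id_hex(datadog_trace_id)
end

puts lower_trace_id_hex(4289707906147384633)
# prints 3b881a9ce0197d39
puts otel_trace_id('f3c18530c08e00a4', 4289707906147384633)
# prints f3c18530c08e00a43b881a9ce0197d39
```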

Now, to create this representation, we will reconstruct the TraceID at the OpenTelemetry Collector level, using the transform processor.

Reconstructing TraceID for Datadog Spans

The OTel Collector transform processor is a component that transforms span data as it passes through the OpenTelemetry Collector. It can modify attributes of a span such as name, kind, attributes, resources, and instrumentation library, among others.

In this context, the transform processor is used to reconstruct the TraceID for Datadog spans to facilitate a unified tracing environment.

In the OTel Collector config, you can configure the transform processor as follows:

processors:
  # ...

  transform:
    trace_statements:
      - context: span
        statements:
          # transformation statements
          - set(cache["upper_trace_id"], attributes["_dd.p.tid"]) where attributes["_dd.p.tid"] != nil
          - set(cache["lower_trace_id"], Substring(trace_id.string, 16, 16)) where cache["upper_trace_id"] != nil
          - set(cache["combined_trace_id"], Concat([cache["upper_trace_id"], cache["lower_trace_id"]],"")) where cache["upper_trace_id"] != nil
          - set(trace_id.string, cache["combined_trace_id"]) where cache["combined_trace_id"] != nil

This configuration uses four transformation statements:

  • set(cache["upper_trace_id"], attributes["_dd.p.tid"]) where attributes["_dd.p.tid"] != nil, where we will capture the Upper TraceID from the _dd.p.tid attribute if it is defined and set it to a temporary cache;
  • set(cache["lower_trace_id"], Substring(trace_id.string, 16, 16)) where cache["upper_trace_id"] != nil, where we read the span's TraceID in hexadecimal format (sixteen zeros followed by the Lower TraceID), drop the "zeros" segment, and keep the second half (this is why the substring takes only the last 16 hex digits of the TraceID);
  • set(cache["combined_trace_id"], Concat([cache["upper_trace_id"], cache["lower_trace_id"]],"")) where cache["upper_trace_id"] != nil, where we concatenate upper_trace_id and lower_trace_id into the OpenTelemetry TraceID;
  • set(trace_id.string, cache["combined_trace_id"]) where cache["combined_trace_id"] != nil and finally we set the trace_id for this span with the string concatenation.
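
To make the effect of these statements concrete, here is a rough Ruby simulation of them applied to one span at a time. It is a sketch, not the Collector's implementation: the span is a plain hash invented for illustration, and a fresh cache per span is assumed.

```ruby
# Apply the four transform statements to a span represented as a hash with
# :trace_id (32 hex digits) and :attributes.
def apply_transform(span)
  attributes = span[:attributes]
  cache = {}

  # 1. Capture the Upper TraceID when the span carries _dd.p.tid
  cache['upper_trace_id'] = attributes['_dd.p.tid'] unless attributes['_dd.p.tid'].nil?
  # 2. Keep the last 16 hex digits of the current trace ID (the Lower TraceID)
  cache['lower_trace_id'] = span[:trace_id][16, 16] unless cache['upper_trace_id'].nil?
  # 3. Concatenate both halves
  unless cache['upper_trace_id'].nil?
    cache['combined_trace_id'] = [cache['upper_trace_id'], cache['lower_trace_id']].join
  end
  # 4. Replace the trace ID only when a combined value was built
  return span if cache['combined_trace_id'].nil?

  span.merge(trace_id: cache['combined_trace_id'])
end

datadog_span = {
  trace_id: ('0' * 16) + '3b881a9ce0197d39',
  attributes: { '_dd.p.tid' => 'f3c18530c08e00a4' }
}

puts apply_transform(datadog_span)[:trace_id]
# prints f3c18530c08e00a43b881a9ce0197d39
```

A span without the _dd.p.tid attribute passes through unchanged, because every statement is guarded by a nil check.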

To see it working, run your example again, with a different docker compose file and then execute the same API call as before:

> docker compose -f ./docker-compose.step2.yaml up -d

> curl http://localhost:8080/remotehello
# it returns:
# {"hello":"world!","from":"quick-start-api-otel","to":"quick-start-api-datadog"}

Looking at the Jaeger UI, you can see that the problem was partially solved. Now you have a trace propagated between the Datadog-instrumented API and the OpenTelemetry-instrumented API.

However, all child spans generated by Datadog are segregated in a different trace, defined only as the Lower TraceID. This happens because ddtrace only sends a _dd.p.tid attribute for the first span generated internally, which makes your transform statements skip the child spans.

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870335/Blogposts/datadog-instr-otel/step2-otel-api_oywxkr.png

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870337/Blogposts/datadog-instr-otel/step2-datadog-api_i4yyys.png

Patch Datadog Trace to Send the Upper TraceID to Every Child Span

Since the OpenTelemetry Collector was designed to process spans in distributed systems, there's no way to keep internal state in it to replicate the _dd.p.tid attribute to the child spans received by the datadog receiver.

You can solve this problem directly in the Datadog-instrumented API by applying a minor patch to ddtrace to replicate _dd.p.tid to all child spans.

In the Ruby version of ddtrace, you can patch the trace serialization to send this data, as shown below:

# Load ddtrace first so that this reopens the existing class
require 'ddtrace'

module Datadog
  module Tracing
    module Transport
      class SerializableTrace
        def to_msgpack(packer = nil)
          # Find the span that carries the Upper TraceID (the _dd.p.tid tag), if any
          tagged_span = trace.spans.find { |span| span.meta.has_key?('_dd.p.tid') }

          # Inject only when enabled through INJECT_UPPER_TRACE_ID and when the tag was found
          if ENV.has_key?('INJECT_UPPER_TRACE_ID') && tagged_span
            trace.spans.each do |span|
              span.meta['propagation.upper_trace_id'] = tagged_span.meta['_dd.p.tid']
            end
          end

          trace.spans.map { |s| SerializableSpan.new(s) }.to_msgpack(packer)
        end
      end
    end
  end
end

Since this is a custom change to the traces, the patch reads the _dd.p.tid attribute and injects it into each span as the propagation.upper_trace_id attribute. Then you can change the transform processor in the OpenTelemetry Collector to read this attribute:

  transform:
    trace_statements:
      - context: span
        statements:
          - set(cache["upper_trace_id"], attributes["propagation.upper_trace_id"]) where attributes["propagation.upper_trace_id"] != nil
          - set(cache["lower_trace_id"], Substring(trace_id.string, 16, 16)) where cache["upper_trace_id"] != nil
          - set(cache["combined_trace_id"], Concat([cache["upper_trace_id"], cache["lower_trace_id"]],"")) where cache["upper_trace_id"] != nil
          - set(trace_id.string, cache["combined_trace_id"]) where cache["combined_trace_id"] != nil
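
The combined effect of the patch and the updated statements can be simulated in plain Ruby. The sketch below models spans as hashes rather than ddtrace objects, and the child span name is hypothetical. It compares how many distinct trace IDs come out of the Collector-side reconstruction without and with the propagated attribute:

```ruby
UPPER_TAG = '_dd.p.tid'
PROPAGATED_TAG = 'propagation.upper_trace_id'

# Application side: copy the Upper TraceID from the span that carries it to
# every span of the trace, as the ddtrace patch does.
def propagate_upper_trace_id(spans)
  tagged = spans.find { |span| span[:meta].key?(UPPER_TAG) }
  return spans if tagged.nil?

  spans.map do |span|
    span.merge(meta: span[:meta].merge(PROPAGATED_TAG => tagged[:meta][UPPER_TAG]))
  end
end

# Collector side: rebuild the 128-bit trace ID from the given attribute,
# leaving the span's trace ID untouched when the attribute is missing.
def rebuilt_trace_id(span, attribute)
  upper = span[:meta][attribute]
  upper.nil? ? span[:trace_id] : upper + span[:trace_id][16, 16]
end

datadog_trace_id = ('0' * 16) + '3b881a9ce0197d39'
spans = [
  { name: 'rack.request', trace_id: datadog_trace_id, meta: { UPPER_TAG => 'f3c18530c08e00a4' } },
  { name: 'child.operation', trace_id: datadog_trace_id, meta: {} } # hypothetical child span
]

without_patch = spans.map { |span| rebuilt_trace_id(span, UPPER_TAG) }.uniq
with_patch = propagate_upper_trace_id(spans).map { |span| rebuilt_trace_id(span, PROPAGATED_TAG) }.uniq

puts without_patch.size # 2: the child span stays in a separate trace
puts with_patch.size    # 1: every span lands in the same trace
```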

With these changes done, let's run a new setup with our example:

> docker compose -f ./docker-compose.step3.yaml up -d

> curl http://localhost:8080/remotehello
# it returns:
# {"hello":"world!","from":"quick-start-api-otel","to":"quick-start-api-datadog"}

Opening the Jaeger UI again, you see that the problem is solved! Now you have a single trace between both APIs and can evaluate a process as a whole on a developer machine.

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869641/Blogposts/datadog-instr-otel/step3_rdfol3.png

Testing the Observability Stack

Until now, we have been testing the telemetry manually, which can be time-consuming if both APIs are constantly changing. To evaluate both APIs automatically and guarantee that everything is working properly, we can create trace-based tests that trigger HTTP calls against the APIs and validate whether our traces are logged as intended.

To do that, we will use Tracetest, which triggers service calls (in our case, HTTP calls like our curl calls) and validates the emitted traces to ensure that our observability stack is working as intended.

First, we will create a new account on Tracetest.io, by accessing https://app.tracetest.io, and then create a new environment. Once we have an API Key for our agent, we will start our local stack with a new container with a Tracetest Agent:

> TRACETEST_API_KEY=your-api-key docker compose -f ./docker-compose.step4.yaml up -d

Then, we will install the Tracetest CLI and configure it to access our environment. The following command guides you through connecting to your personal-org and environment:

> tracetest configure

# This command will print some instructions interactively to help to connect to your env:

# What tracetest server do you want to use? (default: https://app.tracetest.io/)

# What Organization do you want to use?:
#  > personal-org (ttorg_000000000000000)
# What Environment do you want to use?:
#  > OTel (ttenv_000000000000000)
# SUCCESS  Successfully configured Tracetest CLI

Now we will configure our agent to connect to our local Jaeger, using the following command:

> tracetest apply datastore -f ./tracebased-tests/tracing-backend.yaml 

# It will send the following output, which means that our environment was correctly configured:

# type: DataStore
# spec:
#  id: current
#  name: Jaeger Tracing Backend
#  type: jaeger
#  default: true
#  createdAt: 2023-10-31T00:30:47.137194Z
#  jaeger:
#    endpoint: jaeger:16685
#    tls:
#      insecure: true

Next, write a test that checks the trace generated by calling the http://localhost:8080/remotehello URL and validates whether it has at least one span for each API. To do that, we will create a test file called ./tracebased-tests/test_integration.yaml with the following contents:

type: Test
spec:
  id: CdhJp_xIR
  name: Test observability integration between OTelSDK-instrumented API to Datadog-instrumented API
  trigger:
    type: http
    httpRequest:
      method: GET
      url: http://api-otel:8080/remotehello
      headers:
      - key: Content-Type
        value: application/json
  specs:
  - selector: "span[tracetest.span.type=\"http\" name=\"HelloController#remote_hello\" http.target=\"/remotehello\" http.method=\"GET\"]"
    name: OpenTelemetryAPI-instrumented API has been called
    assertions:
    - attr:http.status_code = 200
  - selector: span[tracetest.span.type="http" name="rack.request" http.method="GET"]
    name: Datadog-instrumented API has been called
    assertions:
    - attr:http.status_code = 200

This test will call our API from inside our Docker Compose network at the endpoint http://api-otel:8080/remotehello, as specified in the trigger section, and will validate that the resulting trace contains two spans for this operation, using two specs:

  • One that checks if there is an http span called HelloController#remote_hello with status code 200, defined in the first selector item, validating if the OpenTelemetryAPI-instrumented API has been called.
  • Another that checks if there is an http span called rack.request with status code 200, defined in the second selector item, validating if the Datadog-instrumented API has been called.
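
As a mental model of what the two specs check, the following Ruby sketch matches selectors against spans by plain attribute equality. This is a simplification of Tracetest's selector language, not its implementation: the span hashes are hypothetical, and treating a selector with no matching spans as a failure is an assumption of this sketch.

```ruby
# Hypothetical spans, reduced to the attributes the two selectors look at.
spans = [
  { 'tracetest.span.type' => 'http', 'name' => 'HelloController#remote_hello',
    'http.target' => '/remotehello', 'http.method' => 'GET', 'http.status_code' => 200 },
  { 'tracetest.span.type' => 'http', 'name' => 'rack.request',
    'http.method' => 'GET', 'http.status_code' => 200 }
]

# A span matches when every attribute named in the selector has the given value.
def matches?(span, selector)
  selector.all? { |attribute, value| span[attribute] == value }
end

# A spec passes when at least one span matches the selector and every matched
# span has the expected HTTP status code.
def spec_passes?(spans, selector, expected_status)
  selected = spans.select { |span| matches?(span, selector) }
  selected.any? && selected.all? { |span| span['http.status_code'] == expected_status }
end

otel_selector = { 'tracetest.span.type' => 'http', 'name' => 'HelloController#remote_hello',
                  'http.target' => '/remotehello', 'http.method' => 'GET' }
datadog_selector = { 'tracetest.span.type' => 'http', 'name' => 'rack.request', 'http.method' => 'GET' }

puts spec_passes?(spans, otel_selector, 200)
puts spec_passes?(spans, datadog_selector, 200)
```

If the Datadog spans ended up in a separate trace, as in the earlier steps, the second selector would find no rack.request span in the trace under test, and that spec would fail in this model.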

We can run the test with the command tracetest run test -f ./tracebased-tests/test_integration.yaml and see everything working:

> tracetest run test -f ./tracebased-tests/test_integration.yaml  

# It returns the following output

# ✔ RunGroup: #mDjJ2ZPIR (https://app.tracetest.io/organizations/ttorg_000000000/environments/ttenv_000000000/run/mDjJ2ZPIR)
# Summary: 1 passed, 0 failed, 0 pending
#  ✔ Test observability integration between OTelSDK-instrumented API to Datadog-instrumented API (https://app.tracetest.io/organizations/ttorg_1cbdabae7b8fd1c6/environments/ttenv_cfbdc98ade85ac15/test/CdhJp_xIR/run/4/test) - trace id: 64af63b48c2895fc1f91b811ef1d0ca3
#        ✔ OpenAPI-instrumented API has been called
#        ✔ Datadog-instrumented API has been called

We can open this result in the Tracetest UI to see the entire trace:

https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870407/Blogposts/datadog-instr-otel/Untitled_34_z5m8j7.png

Final Remarks

We have discussed integrating Datadog-instrumented apps in an OpenTelemetry stack. We saw the problem of different TraceID representations for Datadog and OpenTelemetry, which can lead to disjointed traces. To solve that, we used an OpenTelemetry Collector with a transform processor to reconstruct the TraceID for Datadog spans, facilitating a unified tracing environment.

Additionally, we can use trace-based tests with Tracetest to automate the validation of these changes on the observability stack and ensure that the application works as intended.

As a team focused on building an open source tool in the observability space, the opportunity to help the overall OpenTelemetry community is important to us. That's why we are researching and finding new ways of collecting traces from different tools and frameworks and making them work with the OpenTelemetry ecosystem.

The example sources used in this article and setup instructions are available in the Tracetest GitHub repository.

Would you like to learn more about Tracetest and what it brings to the table? Visit the Tracetest docs and try it out by signing up today!

Also, please feel free to join our Slack Community, give Tracetest a star on GitHub, or schedule a time to chat 1:1.
