🧬 Copilot-LD

An intelligent agent leveraging GitHub Copilot and Linked Data

Observability Guide

Practical guide for operators to leverage Copilot-LD's distributed tracing system in production. Covers trace analysis, integration with external tools, performance monitoring, and troubleshooting workflows.

Overview

Copilot-LD includes comprehensive distributed tracing that records every service call, tool execution, and decision made by the agent. This guide shows production operators how to use traces for monitoring, debugging, and optimization.

Understanding Traces

What Gets Traced

Every request through the system generates a trace, a complete record of all operations performed to handle that request: service-to-service calls, LLM completions and embeddings, tool executions, vector searches, and memory reads and writes.

Trace Structure

A trace consists of spans arranged in a hierarchical tree:

agent.ProcessRequest (SERVER)
├── memory.GetWindow (CLIENT)
│   └── memory.GetWindow (SERVER)
├── llm.CreateCompletions (CLIENT)
│   └── llm.CreateCompletions (SERVER)
├── tool.CallTool (CLIENT)
│   └── tool.CallTool (SERVER)
│       └── vector.SearchContent (SERVER)
│           └── llm.CreateEmbeddings (CLIENT)
│               └── llm.CreateEmbeddings (SERVER)
└── memory.AppendMemory (CLIENT)
    └── memory.AppendMemory (SERVER)

Each span includes trace, span, and parent span identifiers, the operation name and kind (CLIENT or SERVER), start and end timestamps in nanoseconds, attributes such as service.name, recorded events with their own attributes, and a status code.
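
An abridged span record as stored on disk might look like the following; the identifiers and values are invented for illustration (kind 2 denotes a SERVER span, matching the JMESPath example later in this guide):

{
  "trace_id": "f6a4a4d0d3e91",
  "span_id": "a1b2c3d4e5f6",
  "parent_span_id": "0011223344",
  "name": "tool.CallTool",
  "kind": 2,
  "start_time_unix_nano": { "high": 410176000, "low": 123456789 },
  "end_time_unix_nano": { "high": 410176000, "low": 987654321 },
  "attributes": { "service.name": "tool" },
  "events": [
    {
      "name": "request.sent",
      "attributes": { "request.function.name": "search_content" }
    }
  ],
  "status": { "code": "STATUS_CODE_OK" }
}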

Accessing Traces

Interactive Trace Visualization

The visualize CLI tool provides interactive trace analysis with Mermaid sequence diagrams:

# Launch visualization REPL
npm run cli:visualize

# Visualize specific trace by ID
--trace f6a4a4d0d3e91

# Visualize all traces for a conversation
--resource common.Conversation.abc123

# Query traces with JMESPath expressions
> [?kind==`2`]  # All SERVER spans
> [?contains(name, 'llm')]  # LLM operations
> [?attributes."service.name"=='agent']  # Agent service spans
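
As an illustration only (not necessarily the tool's literal output), the example trace above might render as a Mermaid sequence diagram along these lines:

sequenceDiagram
    participant Agent
    participant Memory
    participant LLM
    participant Tool
    participant Vector
    Agent->>Memory: GetWindow
    Agent->>LLM: CreateCompletions
    Agent->>Tool: CallTool
    Tool->>Vector: SearchContent
    Vector->>LLM: CreateEmbeddings
    Agent->>Memory: AppendMemory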

Visualization Features:

The REPL renders traces as Mermaid sequence diagrams, supports lookup by trace ID or by conversation resource, and accepts JMESPath expressions for filtering spans (for example by kind, name, or service attributes).

Local File Access

Traces are stored in data/traces/ with daily rotation:

# View today's traces
cat data/traces/2025-10-29.jsonl

# Format a single trace for readability
tail -n 1 data/traces/2025-10-29.jsonl | jq .

# Count traces by service
cat data/traces/2025-10-29.jsonl | jq -r '.attributes."service.name"' | sort | uniq -c

# Find traces with errors
cat data/traces/2025-10-29.jsonl | jq 'select(.status.code == "STATUS_CODE_ERROR")'

Common Analysis Patterns

Find All Spans for a Trace:

# Get all spans with specific trace_id
TRACE_ID="f6a4a4d0d3e91"
cat data/traces/2025-10-29.jsonl | jq "select(.trace_id == \"$TRACE_ID\")"

Analyze Performance:

# Calculate span durations in milliseconds
cat data/traces/2025-10-29.jsonl | jq -r '
  (((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
    (.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low))
   / 1000000) as $ms |
  "\(.name): \($ms) ms"
' | head -n 10

# Find slowest operations
cat data/traces/2025-10-29.jsonl | jq '
  {
    name: .name,
    duration_ms: (((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
                   (.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low))
                  / 1000000)
  }
' | jq -s 'sort_by(.duration_ms) | reverse | .[:10]'

Extract Token Usage:

# Get total token usage by conversation
cat data/traces/2025-10-29.jsonl | jq -r '
  select(.events[].attributes."response.usage.total_tokens") |
  {
    conversation: .events[].attributes."resource.id",
    tokens: .events[].attributes."response.usage.total_tokens"
  }
'

Find Tool Executions:

# See which tools were called and how often
cat data/traces/2025-10-29.jsonl | jq -r '
  select(.name == "tool.CallTool") |
  .events[].attributes."request.function.name"
' | sort | uniq -c | sort -nr

External Tool Integration

Jaeger

Jaeger provides a powerful UI for trace visualization and analysis.

Setup with Docker Compose:

Add Jaeger to your docker-compose.yml with OTLP collection enabled. The UI will be available on port 16686 and the OTLP HTTP endpoint on port 4318:

services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"
      - "4318:4318"
    networks:
      - backend

  trace:
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318

Enable Export:

The trace service includes a stub OTLP exporter. To implement actual export, modify services/trace/exporter.js:

/* eslint-env node */
/* eslint-disable no-undef */
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

export function createOTLPExporter(config) {
  const endpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
  if (!endpoint) return null;

  const exporter = new OTLPTraceExporter({
    url: `${endpoint}/v1/traces`,
  });

  return {
    async export(span) {
      // convertToOTLP maps the internal span format to an OTLP ReadableSpan.
      // OTLPTraceExporter.export is callback-based, so wrap it in a promise.
      await new Promise((resolve) => {
        exporter.export([convertToOTLP(span)], resolve);
      });
    },
    async shutdown() {
      await exporter.shutdown();
    },
  };
}

Querying Traces in Jaeger:

Open the Jaeger UI at http://localhost:16686 and search by service (for example agent, llm, or tool), operation name, tags, and duration to locate the traces of interest.

Grafana Tempo

Tempo is designed for high-volume trace storage and integrates well with Grafana.

Setup with Docker Compose:

services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/var/tempo
    ports:
      - "4318:4318" # OTLP HTTP
    networks:
      - backend

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - backend

volumes:
  tempo-data:
  grafana-data:

Tempo Configuration (tempo.yaml):

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces

Grafana Configuration:

  1. Add Tempo data source: http://tempo:3200
  2. Create dashboard for trace metrics
  3. Use TraceQL for advanced queries (see the examples below)
  4. Set up alerts for error rates or latency
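
A couple of illustrative TraceQL queries, assuming the exported spans carry the standard resource.service.name attribute corresponding to the service.name values used elsewhere in this guide:

{ resource.service.name = "agent" && duration > 5s }
{ name = "llm.CreateCompletions" && status = error }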

AWS X-Ray

For AWS deployments, export traces to X-Ray for integrated CloudWatch monitoring.

Setup:

/* eslint-env node */
// services/trace/exporter.js
import { XRayClient, PutTraceSegmentsCommand } from "@aws-sdk/client-xray";

export function createOTLPExporter(config) {
  const region = process.env.AWS_REGION;
  if (!region) return null;

  const xray = new XRayClient({ region });

  return {
    async export(span) {
      const document = convertToXRay(span);
      // X-Ray expects each segment as a JSON-encoded document string
      await xray.send(
        new PutTraceSegmentsCommand({
          TraceSegmentDocuments: [JSON.stringify(document)],
        }),
      );
    },
    async shutdown() {
      // No cleanup needed for AWS SDK client
    },
  };
}

// Convert a { high, low } nanosecond timestamp to seconds
function toSeconds(ts) {
  return (ts.high * 4294967296 + ts.low) / 1e9;
}

function convertToXRay(span) {
  // Note: X-Ray requires trace IDs in its own 1-<epoch>-<random> format,
  // so the raw trace_id may need to be re-encoded before sending.
  return {
    id: span.span_id,
    trace_id: span.trace_id,
    parent_id: span.parent_span_id,
    name: span.name,
    start_time: toSeconds(span.start_time_unix_nano),
    end_time: toSeconds(span.end_time_unix_nano),
    annotations: span.attributes,
  };
}

CloudWatch Integration:

Once segments are delivered, traces appear in the X-Ray console and in CloudWatch, where service maps, latency statistics, and alarms can be layered on top of them.

Datadog

Datadog provides end-to-end observability with APM, logs, and metrics.

Setup:

services:
  datadog-agent:
    image: gcr.io/datadoghq/agent:latest
    environment:
      - DD_API_KEY=${DD_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_APM_ENABLED=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318
    ports:
      - "4318:4318"
    networks:
      - backend

  trace:
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4318

Benefits:

Spans arrive in Datadog APM alongside logs and infrastructure metrics, so agent behavior can be correlated with host-level data and wired into dashboards and monitors.

Production Monitoring

Key Metrics to Track

Request Performance:

End-to-end latency of agent.ProcessRequest (for example p50, p95, and p99) and overall request throughput.

Service Health:

Error rate per service (spans with a status.code of STATUS_CODE_ERROR) and the latency of individual service calls such as llm.CreateCompletions and vector.SearchContent.

Agent Behavior:

Which tools are called and how often, token usage per conversation, and the number of LLM and tool calls made per request.
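
For a quick check without a metrics backend, request latency percentiles can be approximated straight from the local trace files, reusing the duration arithmetic from the analysis patterns above:

# Approximate p95 latency (ms) for agent.ProcessRequest
cat data/traces/2025-10-29.jsonl | jq '
  select(.name == "agent.ProcessRequest") |
  (((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
    (.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low))
   / 1000000)
' | jq -s 'sort | .[(length * 0.95 | floor)]'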

Creating Dashboards

Grafana Dashboard Example:

{
  "dashboard": {
    "title": "Copilot-LD Operations",
    "panels": [
      {
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(trace_duration_ms[5m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(trace_errors_total[5m])"
          }
        ]
      },
      {
        "title": "Token Usage by Conversation",
        "targets": [
          {
            "expr": "sum by (resource_id) (trace_tokens_total)"
          }
        ]
      }
    ]
  }
}

Alert Configuration

High Error Rate:

alerts:
  - name: high_error_rate
    condition: error_rate > 0.05
    duration: 5m
    severity: critical
    message: "Error rate above 5% for 5 minutes"

High Latency:

alerts:
  - name: high_p99_latency
    condition: p99_latency > 10000ms
    duration: 5m
    severity: warning
    message: "p99 latency above 10 seconds"

LLM Service Failures:

alerts:
  - name: llm_service_failures
    condition: llm_error_rate > 0.01
    duration: 2m
    severity: critical
    message: "LLM service experiencing failures"

Troubleshooting Workflows

Slow Requests

Step 1: Identify Slow Traces

# Find traces with duration > 10 seconds
cat data/traces/2025-10-29.jsonl | jq -r '
  select(.name == "agent.ProcessRequest") |
  select(
    ((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
     (.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low))
    / 1000000000 > 10
  ) |
  .trace_id
'

Step 2: Analyze Trace Tree

# Get all spans for slow trace
TRACE_ID="<identified-trace-id>"
cat data/traces/2025-10-29.jsonl | jq "select(.trace_id == \"$TRACE_ID\")"

Step 3: Identify Bottleneck

Look for disproportionately long llm.CreateCompletions spans, slow or repeated tool.CallTool executions, and slow vector.SearchContent lookups, and compare each child span's duration against the overall agent.ProcessRequest span.
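
One way to spot the bottleneck is to rank the spans of the slow trace by duration, reusing the timestamp arithmetic from the analysis patterns above:

# Rank spans in the slow trace by duration
cat data/traces/2025-10-29.jsonl | jq "select(.trace_id == \"$TRACE_ID\")" | jq -s '
  map({
    name,
    duration_ms: (((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
                   (.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low))
                  / 1000000)
  }) | sort_by(.duration_ms) | reverse
'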

Step 4: Optimize

Failed Requests

Step 1: Find Error Spans

# Find all error spans
cat data/traces/2025-10-29.jsonl | jq '
  select(.status.code == "STATUS_CODE_ERROR")
'

Step 2: Examine Error Context

# Get full trace for error
ERROR_TRACE_ID="<error-trace-id>"
cat data/traces/2025-10-29.jsonl | jq "
  select(.trace_id == \"$ERROR_TRACE_ID\")
" | jq -s 'sort_by(.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low)'

Step 3: Identify Root Cause

Check the failing span's status.code and name, the owning service (attributes."service.name"), the events recorded around the failure, and whether other spans in the same trace also report errors.
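
To pull the failing span's status, owning service, and recorded events out of that trace in one step:

# Show the failing span's status, service, and events
cat data/traces/2025-10-29.jsonl | jq "
  select(.trace_id == \"$ERROR_TRACE_ID\") |
  select(.status.code == \"STATUS_CODE_ERROR\") |
  { name, service: .attributes.\"service.name\", status, events }
"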

Step 4: Remediate

Tool Execution Issues

Step 1: Find Tool Call Traces

# Find all tool calls for specific function
TOOL_NAME="search_content"
cat data/traces/2025-10-29.jsonl | jq "
  select(.name == \"tool.CallTool\") |
  select(.events[].attributes.\"request.function.name\" == \"$TOOL_NAME\")
"

Step 2: Analyze Tool Performance

# Calculate average duration for tool
cat data/traces/2025-10-29.jsonl | jq -r "
  select(.name == \"tool.CallTool\") |
  select(.events[].attributes.\"request.function.name\" == \"$TOOL_NAME\") |
  ((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
   (.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low))
  / 1000000
" | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n }'

Step 3: Review Tool Parameters

# Check parameters used in tool calls
cat data/traces/2025-10-29.jsonl | jq '
  select(.name == "tool.CallTool") |
  .events[] | select(.name == "request.sent") | .attributes
'

Step 4: Optimize Tool Usage

Memory Pressure

Step 1: Track Token Usage

# Sum token usage by conversation
cat data/traces/2025-10-29.jsonl | jq -r '
  select(.events[].attributes."response.usage.total_tokens") |
  {
    conversation: .events[].attributes."resource.id",
    tokens: (.events[].attributes."response.usage.total_tokens" | tonumber)
  }
' | jq -s 'group_by(.conversation) | map({ conversation: .[0].conversation, total_tokens: map(.tokens) | add })'

Step 2: Identify High-Usage Conversations

Look for conversations whose summed token usage far exceeds the configured budget, or whose totals keep growing with every turn of a long-running conversation.

Step 3: Adjust Budget Allocation

Review config/assistants.yml:

assistants:
  - id: default
    budget:
      tokens: 100000 # Adjust based on usage patterns
      tools_percent: 0.3
      resources_percent: 0.5

Step 4: Implement Conversation Pruning

Consider adding conversation history limits:
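
A minimal sketch of what history pruning could look like, assuming conversation history is available as an array of messages that each carry a token count (the message shape and tokens field here are hypothetical):

/* eslint-env node */
// Hypothetical helper: keep only the most recent messages within a token budget
function pruneHistory(messages, maxTokens) {
  const kept = [];
  let total = 0;
  // Walk backwards so the newest messages are preferred
  for (let i = messages.length - 1; i >= 0; i--) {
    const tokens = messages[i].tokens ?? 0;
    if (total + tokens > maxTokens) break;
    kept.unshift(messages[i]);
    total += tokens;
  }
  return kept;
}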

Best Practices

Retention Policies

Local Storage:

# Delete traces older than 30 days
find data/traces -name "*.jsonl" -mtime +30 -delete

# Compress traces older than 7 days
find data/traces -name "*.jsonl" -mtime +7 -exec gzip {} \;
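
These can be automated from cron; an illustrative crontab (the install path and schedule are assumptions, and note that compressed files end in .jsonl.gz, so the deletion pattern must match that):

# Compress traces older than 7 days, then delete compressed traces older than 30 days
0 3 * * *  find /opt/copilot-ld/data/traces -name "*.jsonl" -mtime +7 -exec gzip {} \;
30 3 * * * find /opt/copilot-ld/data/traces -name "*.jsonl.gz" -mtime +30 -delete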

S3 Storage (if using AWS):

# Upload to S3 with lifecycle policy
aws s3 cp data/traces/ s3://copilot-ld-traces/ --recursive

Create S3 lifecycle policy:

{
  "Rules": [
    {
      "ID": "archive-and-expire-traces",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}
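
Assuming the policy above is saved as lifecycle.json, it can be applied with the AWS CLI:

aws s3api put-bucket-lifecycle-configuration \
  --bucket copilot-ld-traces \
  --lifecycle-configuration file://lifecycle.json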

Sampling Strategies

For high-volume production systems, implement sampling:

/* eslint-env node */
// services/trace/index.js - RecordSpan method implementation
function recordSpanWithSampling(req, config, traceIndex) {
  const span = req;
  const sampleRate = config.sample_rate || 1.0;
  if (Math.random() > sampleRate) {
    return { success: true };
  }

  // Normal recording
  traceIndex.add(span);
  return { success: true };
}

Configure per service in config/config.json:

{
  "services": {
    "trace": {
      "sample_rate": 0.1
    }
  }
}

Privacy Considerations

Remove Sensitive Data:

/* eslint-env node */
// Sanitize spans before storage
function sanitizeSpan(span) {
  // Remove user input from attributes
  if (span.attributes["request.text"]) {
    span.attributes["request.text"] = "[REDACTED]";
  }

  // Remove message content from events
  span.events = (span.events || []).map((event) => {
    if (event.attributes["message.content"]) {
      event.attributes["message.content"] = "[REDACTED]";
    }
    return event;
  });

  return span;
}

Access Control:

Restrict trace file access to authorized operators:

# Set restrictive permissions
chmod 600 data/traces/*.jsonl
chown operator:operator data/traces/*.jsonl

Performance Tuning

Buffering Configuration:

Adjust trace service buffer settings based on load:

{
  "services": {
    "trace": {
      "flush_interval": 5000,
      "max_buffer_size": 1000
    }
  }
}

Async Recording:

The trace service uses buffered writes for efficiency. Monitor buffer flush frequency:

# Check trace service logs
DEBUG=trace:* npm run dev

Look for frequent flushes indicating buffer saturation.
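
As a rough illustration of how flush_interval and max_buffer_size interact (not the trace service's actual implementation, and assuming a storage helper with an append method):

/* eslint-env node */
// Illustrative buffered span recorder
export function createBufferedRecorder(storage, config) {
  const buffer = [];
  const flushInterval = config.flush_interval ?? 5000;
  const maxBufferSize = config.max_buffer_size ?? 1000;

  async function flush() {
    if (buffer.length === 0) return;
    const batch = buffer.splice(0, buffer.length);
    // One write per batch keeps file I/O off the request path
    await storage.append(batch.map((s) => JSON.stringify(s)).join("\n") + "\n");
  }

  const timer = setInterval(flush, flushInterval);

  return {
    record(span) {
      buffer.push(span);
      // Flush early when the buffer fills up between intervals
      if (buffer.length >= maxBufferSize) flush();
    },
    async shutdown() {
      clearInterval(timer);
      await flush();
    },
  };
}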

Summary

Distributed tracing provides complete visibility into agent behavior in production. By following this guide, operators can analyze traces locally and in the interactive visualizer, integrate with external backends such as Jaeger, Tempo, AWS X-Ray, and Datadog, monitor key metrics and alerts, troubleshoot slow or failed requests and tool executions, and apply retention, sampling, and privacy controls.

For more information: