Observability Guide
Practical guide for operators to leverage Copilot-LD's distributed tracing system in production. Covers trace analysis, integration with external tools, performance monitoring, and troubleshooting workflows.
Overview
Copilot-LD includes comprehensive distributed tracing that records every service call, tool execution, and decision made by the agent. This guide shows production operators how to use traces for monitoring, debugging, and optimization.
Understanding Traces
What Gets Traced
Every request through the system generates a trace, a complete record of all operations performed to handle that request:
- Service Calls: Every gRPC call between services (Agent → Memory, Agent → LLM, etc.)
- Tool Executions: All tool calls including parameters and results
- Resource Access: Vector searches, graph queries, and memory retrievals
- Timing Information: Nanosecond-precision timestamps for performance analysis
- Metadata: Message counts, token usage, resource IDs, conversation context
Trace Structure
A trace consists of spans arranged in a hierarchical tree:
agent.ProcessRequest (SERVER)
├── memory.GetWindow (CLIENT)
│   └── memory.GetWindow (SERVER)
├── llm.CreateCompletions (CLIENT)
│   └── llm.CreateCompletions (SERVER)
├── tool.CallTool (CLIENT)
│   └── tool.CallTool (SERVER)
│       ├── vector.SearchContent (SERVER)
│       └── llm.CreateEmbeddings (CLIENT)
│           └── llm.CreateEmbeddings (SERVER)
└── memory.AppendMemory (CLIENT)
    └── memory.AppendMemory (SERVER)
Each span includes:
- Trace ID: Shared by all spans in the request
- Span ID: Unique identifier for this operation
- Parent Span ID: Links to parent operation
- Timestamps: Start and end time in nanoseconds
- Attributes: Structured metadata about the operation
- Events: Point-in-time markers (request.sent, response.received)
- Status: Success (OK) or failure (ERROR)
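To see these fields as they are stored on disk, inspect the most recent span record. The sketch below assumes the on-disk field names referenced throughout this guide (trace_id, span_id, parent_span_id, kind, the high/low-encoded nanosecond timestamps, and status.code):
# Show the core fields of the most recently recorded span
tail -n 1 data/traces/2025-10-29.jsonl | jq '{
  trace_id,
  span_id,
  parent_span_id,
  name,
  kind,
  start: .start_time_unix_nano,
  end: .end_time_unix_nano,
  status: .status.code
}'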
Accessing Traces
Interactive Trace Visualization
The visualize CLI tool provides interactive trace analysis with Mermaid sequence diagrams:
# Launch visualization REPL
npm run cli:visualize
# Visualize specific trace by ID
--trace f6a4a4d0d3e91
# Visualize all traces for a conversation
--resource common.Conversation.abc123
# Query traces with JMESPath expressions
> [?kind==`2`] # All SERVER spans
> [?contains(name, 'llm')] # LLM operations
> [?attributes."service.name"=='agent'] # Agent service spans
Visualization Features:
- Mermaid sequence diagrams: Shows service interaction timelines
- Request/response attributes: Displays parameters and results
- Error highlighting: Shows error status and messages
- Complete trace context: Includes all related service calls
- JMESPath filtering: Powerful query language for complex conditions
Local File Access
Traces are stored in data/traces/ with daily rotation:
# View today's traces
cat data/traces/2025-10-29.jsonl
# Format a single trace for readability
tail -n 1 data/traces/2025-10-29.jsonl | jq .
# Count traces by service
cat data/traces/2025-10-29.jsonl | jq -r '.attributes."service.name"' | sort | uniq -c
# Find traces with errors
cat data/traces/2025-10-29.jsonl | jq 'select(.status.code == "STATUS_CODE_ERROR")'
Common Analysis Patterns
Find All Spans for a Trace:
# Get all spans with specific trace_id
TRACE_ID="f6a4a4d0d3e91"
cat data/traces/2025-10-29.jsonl | jq "select(.trace_id == \"$TRACE_ID\")"
Analyze Performance:
# Calculate span durations in milliseconds
cat data/traces/2025-10-29.jsonl | jq -r '
  (((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
    (.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low)) / 1000000) as $ms |
  "\(.name): \($ms) ms"
' | head -n 10
# Find slowest operations
cat data/traces/2025-10-29.jsonl | jq '
  { name: .name,
    duration_ms: (((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
                   (.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low)) / 1000000) }
' | jq -s 'sort_by(.duration_ms) | reverse | .[:10]'
Extract Token Usage:
# Get total token usage by conversation
cat data/traces/2025-10-29.jsonl | jq -r '
select(.events[].attributes."response.usage.total_tokens") |
{
conversation: .events[].attributes."resource.id",
tokens: .events[].attributes."response.usage.total_tokens"
}
'
Find Tool Executions:
# See which tools were called and how often
cat data/traces/2025-10-29.jsonl | jq -r '
select(.name == "tool.CallTool") |
.events[].attributes."request.function.name"
' | sort | uniq -c | sort -nr
External Tool Integration
Jaeger
Jaeger provides a powerful UI for trace visualization and analysis.
Setup with Docker Compose:
Add Jaeger to your docker-compose.yml with OTLP collection enabled. The UI will be available on port 16686 and the OTLP HTTP endpoint on port 4318:
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"
      - "4318:4318"
    networks:
      - backend

  trace:
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318
Enable Export:
The trace service includes a stub OTLP exporter. To implement actual
export, modify services/trace/exporter.js:
/* eslint-env node */
/* eslint-disable no-undef */
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

export function createOTLPExporter(config) {
  const endpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
  if (!endpoint) return null;

  const exporter = new OTLPTraceExporter({
    url: `${endpoint}/v1/traces`,
  });

  return {
    async export(span) {
      // OTLPTraceExporter.export() is callback-based; wrap it in a promise.
      // convertToOTLP() must map the internal span format to an
      // OpenTelemetry ReadableSpan before handing it to the exporter.
      await new Promise((resolve, reject) => {
        exporter.export([convertToOTLP(span)], (result) => {
          // ExportResultCode.SUCCESS === 0
          if (result.code === 0) resolve();
          else reject(result.error);
        });
      });
    },
    async shutdown() {
      await exporter.shutdown();
    },
  };
}
Querying Traces in Jaeger:
- Navigate to http://localhost:16686
- Select a service: agent, memory, llm, etc.
- Filter by operation: ProcessRequest, CreateCompletions, etc.
- Search by tags: resource.id, request.function.name
- View trace timelines with service dependencies
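For scripted checks, the all-in-one image also exposes the HTTP query API that backs the Jaeger UI. It is not a formally versioned interface, so treat the following as a convenience sketch rather than a stable contract:
# List recent trace IDs for the agent service via the Jaeger query API
curl -s "http://localhost:16686/api/traces?service=agent&limit=20" | jq '.data[].traceID'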
Grafana Tempo
Tempo is designed for high-volume trace storage and integrates well with Grafana.
Setup with Docker Compose:
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/var/tempo
    ports:
      - "4318:4318" # OTLP HTTP
    networks:
      - backend

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - backend

volumes:
  tempo-data:
  grafana-data:
Tempo Configuration (tempo.yaml):
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
Grafana Configuration:
- Add a Tempo data source: http://tempo:3200
- Create dashboards for trace metrics
- Use TraceQL for advanced queries
- Set up alerts for error rates or latency
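Rather than adding the data source through the UI, it can be provisioned declaratively. A minimal sketch, written from the shell and mounted into the grafana container at /etc/grafana/provisioning/datasources/:
# Write a Tempo data source provisioning file for Grafana
cat > grafana-datasources.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
EOF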
AWS X-Ray
For AWS deployments, export traces to X-Ray for integrated CloudWatch monitoring.
Setup:
/* eslint-env node */
// services/trace/exporter.js
import { XRayClient, PutTraceSegmentsCommand } from "@aws-sdk/client-xray";

export function createOTLPExporter(config) {
  const region = process.env.AWS_REGION;
  if (!region) return null;

  const xray = new XRayClient({ region });

  return {
    async export(span) {
      const segment = convertToXRay(span);
      await xray.send(new PutTraceSegmentsCommand(segment));
    },
    async shutdown() {
      // No cleanup needed for AWS SDK client
    },
  };
}

// Convert a {high, low} 64-bit nanosecond timestamp into epoch seconds
function toSeconds(nanos) {
  return (nanos.high * 4294967296 + nanos.low) / 1e9;
}

function convertToXRay(span) {
  // PutTraceSegments expects JSON-encoded segment documents. Note that X-Ray
  // requires its own trace ID format (1-<epoch hex>-<24 hex digits>), so a
  // real implementation must also map trace IDs into that format.
  return {
    TraceSegmentDocuments: [
      JSON.stringify({
        id: span.span_id,
        trace_id: span.trace_id,
        parent_id: span.parent_span_id,
        name: span.name,
        start_time: toSeconds(span.start_time_unix_nano),
        end_time: toSeconds(span.end_time_unix_nano),
        annotations: span.attributes,
      }),
    ],
  };
}
CloudWatch Integration:
- View traces in X-Ray console
- Create CloudWatch dashboards with trace metrics
- Set up alarms for error rates or p99 latency
- Use X-Ray Service Map for dependency visualization
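Traces can also be queried from the command line. A small sketch using the X-Ray CLI (GNU date syntax assumed for the one-hour window):
# List error traces from the last hour
aws xray get-trace-summaries \
  --start-time "$(date -u -d '1 hour ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --filter-expression 'error'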
Datadog
Datadog provides end-to-end observability with APM, logs, and metrics.
Setup:
services:
  datadog-agent:
    image: gcr.io/datadoghq/agent:latest
    environment:
      - DD_API_KEY=${DD_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_APM_ENABLED=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318
    ports:
      - "4318:4318"
    networks:
      - backend

  trace:
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4318
Benefits:
- Automatic service map generation
- Built-in anomaly detection
- Correlation with logs and metrics
- Real-time alerting and dashboards
Production Monitoring
Key Metrics to Track
Request Performance:
- p50, p95, p99 Latency: Percentile distribution of request durations
- Throughput: Requests per second
- Error Rate: Percentage of failed requests
Service Health:
- LLM Service Latency: Track external API call durations
- Vector Search Performance: Monitor query execution times
- Memory Service Load: Track conversation storage operations
Agent Behavior:
- Tool Call Frequency: Which tools are used most often
- Token Usage: LLM consumption by conversation
- Tool Call Success Rate: Percentage of successful tool executions
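Several of these metrics can be approximated straight from the local trace files before any external tooling is in place. A minimal sketch for the error rate, assuming each request produces exactly one agent.ProcessRequest span:
# Approximate today's request error rate from local traces
FILE=data/traces/2025-10-29.jsonl
TOTAL=$(jq -r 'select(.name == "agent.ProcessRequest") | .trace_id' "$FILE" | wc -l)
ERRORS=$(jq -r 'select(.name == "agent.ProcessRequest" and .status.code == "STATUS_CODE_ERROR") | .trace_id' "$FILE" | wc -l)
echo "error rate: $(echo "scale=4; $ERRORS / $TOTAL" | bc)"
The same pattern (filter spans with jq, aggregate in the shell) works for throughput and tool call frequency.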
Creating Dashboards
Grafana Dashboard Example:
{
  "dashboard": {
    "title": "Copilot-LD Operations",
    "panels": [
      {
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(trace_duration_ms[5m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(trace_errors_total[5m])"
          }
        ]
      },
      {
        "title": "Token Usage by Conversation",
        "targets": [
          {
            "expr": "sum by (resource_id) (trace_tokens_total)"
          }
        ]
      }
    ]
  }
}
Alert Configuration
High Error Rate:
alerts:
  - name: high_error_rate
    condition: error_rate > 0.05
    duration: 5m
    severity: critical
    message: "Error rate above 5% for 5 minutes"
High Latency:
alerts:
  - name: high_p99_latency
    condition: p99_latency > 10000ms
    duration: 5m
    severity: warning
    message: "p99 latency above 10 seconds"
LLM Service Failures:
alerts:
  - name: llm_service_failures
    condition: llm_error_rate > 0.01
    duration: 2m
    severity: critical
    message: "LLM service experiencing failures"
Troubleshooting Workflows
Slow Requests
Step 1: Identify Slow Traces
# Find traces with duration > 10 seconds
cat data/traces/2025-10-29.jsonl | jq -r '
select(.name == "agent.ProcessRequest") |
select(
((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
(.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low))
/ 1000000000 > 10
) |
.trace_id
'
Step 2: Analyze Trace Tree
# Get all spans for slow trace
TRACE_ID="<identified-trace-id>"
cat data/traces/2025-10-29.jsonl | jq "select(.trace_id == \"$TRACE_ID\")"
Step 3: Identify Bottleneck
Look for:
- Long-running LLM completions (check token counts)
- Multiple tool call iterations
- Large vector search result sets
- Memory service contention
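One way to spot the bottleneck is to rank the spans of the slow trace by duration, reusing the timestamp arithmetic from earlier (trace ID placeholder as in Step 2):
# Rank spans within the slow trace by duration (longest first)
TRACE_ID="<identified-trace-id>"
cat data/traces/2025-10-29.jsonl | jq -s --arg t "$TRACE_ID" '
  map(select(.trace_id == $t) |
      { name,
        duration_ms: (((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
                       (.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low)) / 1000000) })
  | sort_by(.duration_ms) | reverse
'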
Step 4: Optimize
- Reduce token budgets if LLM calls are slow
- Adjust vector search thresholds to return fewer results
- Cache frequently accessed resources
- Review tool implementations for efficiency
Failed Requests
Step 1: Find Error Spans
# Find all error spans
cat data/traces/2025-10-29.jsonl | jq '
select(.status.code == "STATUS_CODE_ERROR")
'
Step 2: Examine Error Context
# Get full trace for error
ERROR_TRACE_ID="<error-trace-id>"
cat data/traces/2025-10-29.jsonl | jq "
select(.trace_id == \"$ERROR_TRACE_ID\")
" | jq -s 'sort_by(.start_time_unix_nano.high, .start_time_unix_nano.low)'
Step 3: Identify Root Cause
Check:
- Which service failed (look at the service.name attribute)
- Error message in span status
- Request parameters that triggered the error
- Parent spans to understand context
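These details can be pulled directly from the error spans. The sketch below assumes the error message is stored under .status.message, mirroring the OTLP status shape used for .status.code:
# Show which service failed and why, for each error span in the trace
ERROR_TRACE_ID="<error-trace-id>"
cat data/traces/2025-10-29.jsonl | jq --arg t "$ERROR_TRACE_ID" '
  select(.trace_id == $t and .status.code == "STATUS_CODE_ERROR") |
  { service: .attributes."service.name", operation: .name, message: .status.message }
'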
Step 4: Remediate
- Fix service implementation if bug identified
- Adjust configuration if resource limits hit
- Review input validation if bad data caused error
- Scale service if capacity issue detected
Tool Execution Issues
Step 1: Find Tool Call Traces
# Find all tool calls for specific function
TOOL_NAME="search_content"
cat data/traces/2025-10-29.jsonl | jq "
select(.name == \"tool.CallTool\") |
select(.events[].attributes.\"request.function.name\" == \"$TOOL_NAME\")
"
Step 2: Analyze Tool Performance
# Calculate average duration for tool
cat data/traces/2025-10-29.jsonl | jq -r "
select(.name == \"tool.CallTool\") |
select(.events[].attributes.\"request.function.name\" == \"$TOOL_NAME\") |
((.end_time_unix_nano.high * 4294967296 + .end_time_unix_nano.low) -
(.start_time_unix_nano.high * 4294967296 + .start_time_unix_nano.low))
/ 1000000
" | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n }'
Step 3: Review Tool Parameters
# Check parameters used in tool calls
cat data/traces/2025-10-29.jsonl | jq '
select(.name == "tool.CallTool") |
.events[] | select(.name == "request.sent") | .attributes
'
Step 4: Optimize Tool Usage
- Adjust filter parameters (threshold, limit, max_tokens)
- Review tool prompt engineering if LLM not calling tools correctly
- Consider caching for frequently called tools with same parameters
- Implement result streaming for large result sets
Memory Pressure
Step 1: Track Token Usage
# Sum token usage by conversation
cat data/traces/2025-10-29.jsonl | jq -r '
select(.events[].attributes."response.usage.total_tokens") |
{
conversation: .events[].attributes."resource.id",
tokens: (.events[].attributes."response.usage.total_tokens" | tonumber)
}
' | jq -s 'group_by(.conversation) | map({ conversation: .[0].conversation, total_tokens: map(.tokens) | add })'
Step 2: Identify High-Usage Conversations
Look for conversations consuming excessive tokens:
- Long conversation histories
- Large tool result sets
- Many resource retrievals
Step 3: Adjust Budget Allocation
Review config/assistants.yml:
assistants:
  - id: default
    budget:
      tokens: 100000 # Adjust based on usage patterns
      tools_percent: 0.3
      resources_percent: 0.5
Step 4: Implement Conversation Pruning
Consider adding conversation history limits:
- Maximum message count per conversation
- Token budget per conversation lifetime
- Automatic archival of old conversations
Best Practices
Retention Policies
Local Storage:
# Compress traces older than 7 days
find data/traces -name "*.jsonl" -mtime +7 -exec gzip {} \;
# Delete traces older than 30 days (compressed or not)
find data/traces -name "*.jsonl*" -mtime +30 -delete
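To run this on a schedule, equivalent cron entries look like the following (the install path is an example; adjust to where Copilot-LD is deployed):
# Example crontab entries (run daily at 03:00 and 03:30)
0 3 * * *  find /opt/copilot-ld/data/traces -name "*.jsonl" -mtime +7 -exec gzip {} \;
30 3 * * * find /opt/copilot-ld/data/traces -name "*.jsonl*" -mtime +30 -delete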
S3 Storage (if using AWS):
# Upload to S3 with lifecycle policy
aws s3 cp data/traces/ s3://copilot-ld-traces/ --recursive
Create S3 lifecycle policy:
{
  "Rules": [
    {
      "ID": "archive-and-expire-traces",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}
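Assuming the policy above is saved as lifecycle.json, apply it to the bucket used in the upload step:
# Attach the lifecycle policy to the trace bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket copilot-ld-traces \
  --lifecycle-configuration file://lifecycle.json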
Sampling Strategies
For high-volume production systems, implement sampling:
/* eslint-env node */
// services/trace/index.js - RecordSpan method implementation
function recordSpanWithSampling(req, config, traceIndex) {
  const span = req;
  const sampleRate = config.sample_rate || 1.0;

  if (Math.random() > sampleRate) {
    return { success: true };
  }

  // Normal recording
  traceIndex.add(span);
  return { success: true };
}
Configure per service in config/config.json:
{
  "services": {
    "trace": {
      "sample_rate": 0.1
    }
  }
}
Privacy Considerations
Remove Sensitive Data:
/* eslint-env node */
// Sanitize spans before storage
function sanitizeSpan(span) {
  // Remove user input from attributes
  if (span.attributes["request.text"]) {
    span.attributes["request.text"] = "[REDACTED]";
  }

  // Remove message content from events
  span.events = span.events.map((event) => {
    if (event.attributes["message.content"]) {
      event.attributes["message.content"] = "[REDACTED]";
    }
    return event;
  });

  return span;
}
Access Control:
Restrict trace file access to authorized operators:
# Set restrictive permissions
chmod 600 data/traces/*.jsonl
chown operator:operator data/traces/*.jsonl
Performance Tuning
Buffering Configuration:
Adjust trace service buffer settings based on load:
{
  "services": {
    "trace": {
      "flush_interval": 5000,
      "max_buffer_size": 1000
    }
  }
}
Async Recording:
The trace service uses buffered writes for efficiency. Monitor buffer flush frequency:
# Check trace service logs
DEBUG=trace:* npm run dev
Look for frequent flushes indicating buffer saturation.
Summary
Distributed tracing provides complete visibility into agent behavior in production. By following this guide, operators can:
- Understand request flows through the microservices architecture
- Identify performance bottlenecks and optimize critical paths
- Debug failures with complete context across service boundaries
- Monitor agent decision-making patterns over time
- Integrate with industry-standard observability platforms
- Establish effective alerting and incident response workflows
For more information:
- Concepts Guide: Understanding why tracing is essential for agentic systems
- Reference Guide: Technical details of the libtelemetry package and OpenTelemetry compatibility