🧬 Copilot-LD

An intelligent agent leveraging GitHub Copilot and Linked Data

Evaluation Guide

Guide to evaluating agent response quality using automated scenarios.

Overview

The evaluation system validates agent behavior through automated scenarios defined in config/eval.yml. Each scenario sends a prompt to the agent and evaluates the response against defined criteria.
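A scenario is a named entry with a type, a prompt, and a list of evaluations. The sketch below is illustrative and only uses field names that appear in the examples later in this guide:

example_scenario:
  type: criteria          # criteria | recall | trace
  prompt: |
    The question sent to the agent
  evaluations:
    - label: Human-readable description of one check
      data: |
        Criterion text, subject URI, or JMESPath query, depending on the type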

Running Evaluations

# Run all scenarios
npm run eval

# Run specific scenario
npm run eval -- --scenario search_drug_discovery_platform

# Run with concurrency and iterations
npm run eval -- --concurrency 5 --iterations 3

# Generate reports from stored results
npm run eval:report

Results are stored in data/eval/ with per-scenario markdown reports and a summary.
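The exact file layout is not fixed here, but a run might produce something like the following (file names are hypothetical):

data/eval/
  summary.md                            # aggregate results across scenarios
  search_drug_discovery_platform.md     # per-scenario report
  recall_drug_discovery_team.md
  trace_search_execution.md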

Configuration

Configure evaluation behavior under the evals key in config/config.json:

{
  "evals": {
    "models": ["gpt-4o"],
    "judge_model": "gpt-5"
  }
}
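Because models is an array, adding more entries presumably runs each scenario against every listed model, while judge_model is used only to score criteria scenarios; the expanded config below is an illustrative assumption rather than documented behavior:

{
  "evals": {
    "models": ["gpt-4o", "gpt-4o-mini"],
    "judge_model": "gpt-5"
  }
}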

Scenario Types

Criteria Evaluation

Validates response content using an LLM judge. The judge model is configured via evals.judge_model in config/config.json. Each criterion is evaluated independently.

search_drug_discovery_platform:
  type: criteria
  prompt: |
    What is MolecularForge and what capabilities does it provide?
  evaluations:
    - label: Response identifies MolecularForge as chemistry platform
      data: |
        Response is identifying MolecularForge as a computational chemistry 
        platform for drug discovery and development.
    - label: Response describes core capabilities
      data: |
        Response is describing at least two core capabilities such as 
        structure-based docking or multi-target scoring.

Recall Evaluation

Validates that specific resources are retrieved into the agent's context. The scenario passes only when every listed subject is found.

recall_drug_discovery_team:
  type: recall
  prompt: |
    Who are the members of the Drug Discovery Team?
  evaluations:
    - label: Retrieves Drug Discovery Team item
      data: https://bionova.example/id/org/drug-discovery-team

Trace Evaluation

Validates tool execution patterns using JMESPath queries on trace spans.

trace_search_execution:
  type: trace
  prompt: |
    Search for information about clinical trials
  evaluations:
    - label: Agent calls search tool
      data: "[?name==`tool.CallTool`]"
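Conceptually, a trace check runs the JMESPath expression from data against the recorded spans and, presumably, passes when the result is non-empty. A minimal Node.js sketch using the jmespath package illustrates the mechanics; the span objects are hypothetical, and only the tool.CallTool name comes from the example above:

// npm install jmespath
const jmespath = require("jmespath");

// Hypothetical spans collected from a single eval run
const spans = [
  { name: "agent.Chat", durationMs: 1200 },
  { name: "tool.CallTool", durationMs: 340 },
];

// Raw-string form of the scenario's backtick expression
const matches = jmespath.search(spans, "[?name=='tool.CallTool']");

// The check passes when at least one span matches
console.log(matches.length > 0); // true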

Writing Scenarios

Validation Workflow

Before writing a scenario, verify expected behavior:

# Check entity exists
echo "https://schema.org/Type" | npm -s run cli:subjects

# Explore relationships
echo "entity:name ? ?" | npm -s run cli:query

# Verify search retrieval
echo "query terms" | npm -s run cli:search

# Test agent response
echo "Your eval prompt" | npm -s run cli:chat

Best Practices

- Validate expected behavior with the CLI workflow above before writing a scenario.
- Keep each criterion focused on a single, independently verifiable claim.
- Match the scenario type to what you want to verify: criteria for response content, recall for retrieval, trace for tool execution.

Debugging Failures

Check the retrieved context (replace <resource-id> with the actual ID from the eval output):

cat data/memories/<resource-id>.jsonl | jq -c 'select(.identifier.subjects | length > 0) | .identifier.subjects'

Visualize trace spans:

echo "[]" | npm -s run cli:visualize -- --resource <resource-id>

Filter for specific operations:

echo "[?name==\`tool.CallTool\`]" | npm -s run cli:visualize -- --resource <resource-id>