# Evaluation Guide
Guide to evaluating agent response quality using automated scenarios.
## Overview
The evaluation system validates agent behavior through automated scenarios defined in `config/eval.yml`. Each scenario sends a prompt to the agent and evaluates the response against defined criteria.
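For orientation, every scenario follows the same basic shape. A minimal sketch (the scenario name and field values here are illustrative, not shipped examples):

```yaml
# config/eval.yml -- each top-level key names one scenario
example_scenario:              # hypothetical scenario name
  type: criteria               # one of: criteria, recall, trace
  prompt: |
    The prompt sent to the agent
  evaluations:                 # checks applied to the response
    - label: Short description of the check
      data: |
        What the judge or matcher should verify
```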
## Running Evaluations
```bash
# Run all scenarios
npm run eval

# Run specific scenario
npm run eval -- --scenario search_drug_discovery_platform

# Run with concurrency and iterations
npm run eval -- --concurrency 5 --iterations 3

# Generate reports from stored results
npm run eval:report
```
Results are stored in `data/eval/` with per-scenario markdown reports and a summary.
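The exact file names depend on your scenarios; as a rough sketch of the layout (the names below are assumptions, check your own `data/eval/` output):

```text
data/eval/
├── summary.md                            # aggregate summary across scenarios
└── search_drug_discovery_platform.md     # one markdown report per scenario
```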
## Configuration
Configure eval behavior in `config/config.json`:
```json
{
  "evals": {
    "models": ["gpt-4o"],
    "judge_model": "gpt-5"
  }
}
```
- `models`: List of models to evaluate (matrix testing)
- `judge_model`: Model used for criteria/judge evaluations
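Because `models` is a list, each scenario runs once per listed model. A two-model matrix might look like this (the second model name is illustrative):

```json
{
  "evals": {
    "models": ["gpt-4o", "gpt-4o-mini"],
    "judge_model": "gpt-5"
  }
}
```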
## Scenario Types
### Criteria Evaluation
Validates response content using an LLM judge. The judge model is configured via `evals.judge_model` in `config/config.json`. Each criterion is evaluated independently.
```yaml
search_drug_discovery_platform:
  type: criteria
  prompt: |
    What is MolecularForge and what capabilities does it provide?
  evaluations:
    - label: Response identifies MolecularForge as chemistry platform
      data: |
        Response identifies MolecularForge as a computational chemistry
        platform for drug discovery and development.
    - label: Response describes core capabilities
      data: |
        Response describes at least two core capabilities such as
        structure-based docking or multi-target scoring.
```
### Recall Evaluation
Validates that specific resources are retrieved into context. All listed subjects must be found.
```yaml
recall_drug_discovery_team:
  type: recall
  prompt: |
    Who are the members of the Drug Discovery Team?
  evaluations:
    - label: Retrieves Drug Discovery Team item
      data: https://bionova.example/id/org/drug-discovery-team
```
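Because every listed subject must be found, multiple `data` entries act as a conjunction. A sketch with two subjects (the second URL is hypothetical):

```yaml
recall_drug_discovery_leads:
  type: recall
  prompt: |
    Which compounds is the Drug Discovery Team currently evaluating?
  evaluations:
    - label: Retrieves Drug Discovery Team item
      data: https://bionova.example/id/org/drug-discovery-team
    - label: Retrieves lead compound item
      # hypothetical subject URL, for illustration only
      data: https://bionova.example/id/compound/lead-001
```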
### Trace Evaluation
Validates tool execution patterns using JMESPath queries on trace spans.
```yaml
trace_search_execution:
  type: trace
  prompt: |
    Search for information about clinical trials
  evaluations:
    - label: Agent calls search tool
      data: "[?name==`tool.CallTool`]"
```
## Writing Scenarios
### Validation Workflow
Before writing a scenario, verify expected behavior:
```bash
# Check entity exists
echo "https://schema.org/Type" | npm -s run cli:subjects

# Explore relationships
echo "entity:name ? ?" | npm -s run cli:query

# Verify search retrieval
echo "query terms" | npm -s run cli:search

# Test agent response
echo "Your eval prompt" | npm -s run cli:chat
```
### Best Practices
- Ground scenarios in data from `examples/knowledge/`
- Use one evaluation type per scenario
- Write criteria that are specific and testable
- Include only essential subjects in recall scenarios
- Use scenario names that describe what is being tested (a combined sketch follows this list)
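Putting these together, a scenario that follows the practices above might look like this sketch (it reuses the MolecularForge example from earlier; the criteria text is illustrative):

```yaml
# Descriptive name, one evaluation type, specific and testable criteria
criteria_molecularforge_docking:
  type: criteria
  prompt: |
    How does MolecularForge perform structure-based docking?
  evaluations:
    - label: Response describes structure-based docking
      data: |
        Response describes structure-based docking as a core MolecularForge
        capability for drug discovery.
```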
### Debugging Failures
Check retrieved context (replace `<resource-id>` with an actual ID from the eval output):
```bash
cat data/memories/<resource-id>.jsonl | jq -c 'select(.identifier.subjects | length > 0) | .identifier.subjects'
```
Visualize trace spans:
echo "[]" | npm -s run cli:visualize -- --resource <resource-id>
Filter for specific operations:
echo "[?name==\`tool.CallTool\`]" | npm -s run cli:visualize -- --resource <resource-id>
## Related Documentation
- Observability Guide – Understanding trace structure
- Development Guide – Local setup and testing workflows