Processing Guide
Complete guide to processing knowledge bases, resources, tools, and vectors in Copilot-LD. This covers the offline pipeline that transforms raw knowledge into searchable, embedded content.
Prerequisites
- Configuration Guide completed
- Basic understanding of HTML microdata. See Using microdata in HTML on MDN Web Docs.
Overview
The Copilot-LD processing pipeline transforms your knowledge base through several stages in sequence:
- Storage Setup: Prepare local storage directories for data processing
- Knowledge Base Setup: Organize HTML content with structured microdata
- Resource Processing: Extract and index content from HTML files
- Tool Processing: Generate tool schemas and documentation
- Vector Processing: Create embeddings for semantic search
- Data Management: Upload/download processed data for deployment
1. Storage Setup
Prepare local storage directories for data processing. This ensures a consistent workspace for all pipeline stages.
mkdir -p data/{memories,policies,resources,vectors}
2. Knowledge Base Structure
Copilot-LD uses HTML files with structured microdata to organize knowledge. This approach provides semantic context and enables accurate content extraction.
HTML with Microdata
Knowledge files should use HTML5 microdata with Schema.org vocabularies to structure content:
<!-- Example: data/knowledge/security-practices.html -->
<html>
  <head>
    <title>Security Best Practices</title>
  </head>
  <body>
    <article itemscope itemtype="http://schema.org/Article">
      <h1 itemprop="headline">Docker Security Best Practices</h1>
      <div itemprop="articleBody">
        <p>Always use specific image tags instead of 'latest' to ensure reproducible builds.</p>
        <p>Implement multi-stage builds to reduce attack surface and image size.</p>
        <p>Run containers as non-root users whenever possible.</p>
      </div>
    </article>
    <article itemscope itemtype="http://schema.org/Article">
      <h1 itemprop="headline">Container Registry Security</h1>
      <div itemprop="articleBody">
        <p>Scan container images for vulnerabilities before deployment.</p>
        <p>Use private registries for proprietary or sensitive applications.</p>
      </div>
    </article>
  </body>
</html>
Schema.org Types
The processing pipeline works with any Schema.org type. Common examples include:
- Article: Technical articles, best practices, guides
- HowTo: Step-by-step procedures and tutorials
- FAQPage: Frequently asked questions
- TechArticle: Technical documentation
You can use any Schema.org type that fits your content structure and semantic needs.
Example Knowledge Base
Copilot-LD includes example knowledge files to demonstrate the HTML microdata structure:
# Copy example knowledge base to your data directory
cp -r examples/knowledge data/
3. Resource Processing
Resource processing extracts structured content from HTML files and creates searchable resources stored in the data/resources/ directory.
Processing HTML Content
The resource processor scans HTML files for elements with microdata attributes and extracts them as individual knowledge items:
node scripts/resources.js
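To make the extraction step concrete, here is a minimal sketch of scraping [itemscope] elements from a knowledge file. It uses the cheerio package purely for illustration; the actual implementation inside scripts/resources.js may differ.
// Illustrative sketch: extract [itemscope] items from a knowledge file.
// Assumes the "cheerio" package; the real resources.js pipeline may differ.
import { readFile } from "node:fs/promises";
import * as cheerio from "cheerio";

const html = await readFile("data/knowledge/security-practices.html", "utf8");
const $ = cheerio.load(html);

// Each element with an itemscope attribute becomes one candidate resource
const items = $("[itemscope]")
  .map((_, el) => ({
    type: $(el).attr("itemtype"),
    headline: $(el).find('[itemprop="headline"]').text().trim(),
    body: $(el).find('[itemprop="articleBody"]').text().trim(),
  }))
  .get();

console.log(items);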
Custom CSS Selectors
By default, the processor looks for [itemscope] elements. You can specify custom selectors:
# Process only article elements
node scripts/resources.js --selector "article[itemscope]"
# Process multiple content types
node scripts/resources.js --selector "[itemtype*='Article'], [itemtype*='HowTo']"
Output Structure
Resource processing creates individual JSON files in data/resources/ with Copilot-LD (CLD) identifiers:
data/resources/
├── cld:common.MessageV2.{hash}.json # Individual message resources
├── cld:common.Assistant.{name}.json # Assistant configurations
├── cld:common.ToolFunction.{name}.json # Tool definitions
└── cld:common.Conversation.{uuid}.json # Conversation metadata
Resource Format
Each extracted resource contains:
- Identifier: Unique resource ID based on content hash
- Content: Extracted text content from the HTML element
- Metadata: Schema.org type, source file, extraction timestamp
- Descriptor: AI-generated description of the content's purpose and applicability
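Conceptually, each resource can be pictured as an object like the hypothetical sketch below; the property names are assumptions based on the fields listed above, not the authoritative file format.
// Hypothetical sketch of one extracted resource; property names are
// assumptions based on the field list above, not the actual file format.
const exampleResource = {
  id: { type: "common.MessageV2", name: "<content-hash>" }, // Identifier
  content: "Always use specific image tags instead of 'latest' ...", // Content
  metadata: {
    schemaType: "http://schema.org/Article", // Schema.org type
    source: "data/knowledge/security-practices.html", // source file
    extractedAt: "2024-01-01T00:00:00Z", // extraction timestamp (placeholder)
  },
  descriptor: {
    purpose: "AI-generated summary of what this content covers.",
    applicability: "AI-generated guidance on when the content applies.",
  },
};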
Assistant Processing
The resource processor also processes assistant configurations, creating resources for each defined assistant persona. This enables the system to search and select appropriate assistants based on context.
4. Tool Processing
Tool processing generates OpenAI-compatible JSON schemas from Protocol Buffer definitions, enabling dynamic tool registration and validation.
Generate Tool Schemas
node scripts/tools.js
Protocol Buffer Tool Definitions
Tools are defined using Protocol Buffer messages that describe their parameters and functionality. The tool processor scans the tools/ directory for *.proto files:
// examples/tools/hash.proto
syntax = "proto3";

package hash;

service Hash {
  rpc Sha256(HashRequest) returns (HashResponse);
  rpc Md5(HashRequest) returns (HashResponse);
}

message HashRequest {
  string input = 1;
}

message HashResponse {
  string hash = 1;
  string algorithm = 2;
}
JSON Schema Generation
The tool processor converts Protocol Buffer definitions into OpenAI-compatible JSON schemas that can be used for LLM tool calling. Each RPC method in the service definition becomes a separate tool function:
// Generated schema for sha256_hash tool
{
  "type": "object",
  "properties": {
    "input": {
      "type": "string",
      "description": "input field"
    }
  },
  "required": ["input"]
}
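The sketch below shows one way such a schema could be derived from the proto definition using the protobufjs package. It illustrates the mapping only and is not the actual implementation of scripts/tools.js (which also derives tool names such as sha256_hash and AI-generated descriptors).
// Illustrative sketch: derive an OpenAI-style parameters schema from a proto
// service using the "protobufjs" package. scripts/tools.js may differ.
import protobuf from "protobufjs";

const root = await protobuf.load("examples/tools/hash.proto");
const service = root.lookupService("Hash");

// Minimal proto-scalar to JSON Schema type map, enough for this example
const typeMap = { string: "string", int32: "integer", bool: "boolean", double: "number" };

for (const method of service.methodsArray) {
  const request = root.lookupType(method.requestType);
  const properties = Object.fromEntries(
    request.fieldsArray.map((f) => [f.name, { type: typeMap[f.type] ?? "string" }])
  );
  // Treat every request field as required, matching the generated example above
  const schema = {
    type: "object",
    properties,
    required: request.fieldsArray.map((f) => f.name),
  };
  console.log(method.name, JSON.stringify(schema, null, 2));
}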
Tool Configuration
Generated tool schemas are stored as individual JSON files in data/resources/ with the pattern cld:common.ToolFunction.{name}.json and automatically registered with the Tool service during startup. Each tool entry includes:
- Tool Name: Method name from the Protocol Buffer service (e.g., sha256_hash, md5_hash)
- Parameters Schema: JSON schema for validating tool parameters
- Purpose Description: AI-generated description of tool functionality
- Usage Instructions: Detailed instructions for proper tool usage
- Applicability Guidelines: When and when not to use the tool
Example Tool Resource
The hash tool generates separate resources for each RPC method. Here's the generated sha256_hash tool resource:
{
  "id": {
    "type": "common.ToolFunction",
    "name": "sha256_hash"
  },
  "descriptor": {
    "tokens": 89,
    "purpose": "Create deterministic SHA-256 hash of input text.",
    "instructions": "Input: Text string in 'input' field. Output: 64-character hexadecimal SHA-256 hash.",
    "applicability": "Use ONLY when user explicitly requests SHA-256 hashing. DO NOT use for search or content analysis.",
    "evaluation": "Returns exactly 64-character hexadecimal string."
  },
  "name": "sha256_hash",
  "parameters": {
    "type": "object",
    "properties": {
      "input": {
        "type": "string",
        "description": "input field"
      }
    },
    "required": ["input"]
  }
}
5. Vector Processing
Vector processing creates embeddings of resource content and descriptors for efficient similarity search and retrieval-augmented generation.
Generate Vector Embeddings
node scripts/vectors.js
Embedding Strategy
The vector processor creates two types of embeddings:
Content Embeddings
- Purpose: Direct semantic search of actual content
- Source: Full text content extracted from HTML elements
- Use Case: Finding specific information, facts, and detailed explanations
Descriptor Embeddings
- Purpose: Conceptual and categorical search
- Source: AI-generated descriptions of content purpose and applicability
- Use Case: Finding relevant content types, methodologies, and approaches
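As a rough illustration of this two-pass strategy, the sketch below embeds a resource's content and its descriptor in separate inputs. It assumes the openai Node.js SDK with an OPENAI_API_KEY in the environment; the real scripts/vectors.js may obtain embeddings through a different client.
// Illustrative sketch: create separate content and descriptor embeddings.
// Assumes the "openai" SDK and OPENAI_API_KEY; the real vectors.js may differ.
import OpenAI from "openai";

const client = new OpenAI();

const resource = {
  id: "cld:common.MessageV2.example", // hypothetical identifier
  content: "Run containers as non-root users whenever possible.",
  descriptor: "Guidance on container runtime hardening practices.",
};

// Embedding content and descriptor separately keeps the two indices distinct
const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: [resource.content, resource.descriptor],
});

const [contentEmbedding, descriptorEmbedding] = response.data.map((d) => d.embedding);
console.log(resource.id, contentEmbedding.length, descriptorEmbedding.length); // 1536 1536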
Vector Storage
Embeddings are stored in data/vectors/ as JSONL files:
data/vectors/
├── content.jsonl # Content-based embeddings
└── descriptors.jsonl # Descriptor-based embeddings
Each vector entry contains:
- Identifier: Links back to the original resource
- Embedding: 1536-dimensional vector from OpenAI text-embedding-3-small
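A minimal sketch of how such a JSONL index can be queried with cosine similarity is shown below; the id and embedding field names are assumptions based on the entry description above.
// Illustrative sketch: brute-force cosine similarity over data/vectors/content.jsonl.
// The "id" and "embedding" field names are assumptions for this example.
import { readFile } from "node:fs/promises";

function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const lines = (await readFile("data/vectors/content.jsonl", "utf8")).trim().split("\n");
const index = lines.map((line) => JSON.parse(line));

// queryEmbedding comes from embedding the user's question (see the sketch above)
function topK(queryEmbedding, k = 3) {
  return index
    .map((entry) => ({ id: entry.id, score: cosine(queryEmbedding, entry.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}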
6. Data Management Utilities
Copilot-LD includes utilities for managing processed data across development and deployment environments. These commands require that the Deployment Guide be completed first for proper S3 configuration.
Upload Processed Data
Upload all processed data from local storage to S3-compatible remote storage:
npx @copilot-ld/libutil upload
Upload Process
The upload utility synchronizes these storage areas:
- config/: Configuration files and secrets
- generated/: Generated code and Protocol Buffer artifacts
- memories/: Conversation history and chat memories
- resources/: Processed knowledge base resources
- vectors/: Embedding indices for semantic search
S3 Configuration Requirements
Upload requires S3-compatible storage configuration. See the Storage Configuration section in the Configuration Guide for complete setup details including environment variables and MinIO options.
Download Processed Data
Download pre-processed data bundle from remote storage:
npx @copilot-ld/libutil download
Download Process
The download utility retrieves and extracts a bundle.tar.gz archive containing generated code and processed data. This is useful for:
- Quick Setup: Skip processing steps with pre-processed data
- CI/CD Pipelines: Download consistent data sets for automated deployments
- Team Synchronization: Share processed knowledge base across team members
Bundle Configuration
Configure the download source in config/config.json:
tool:
  download:
    owner: "your-organization"
    repo: "your-knowledge-repository"
Data Management Workflow
Typical workflow for managing processed data across environments:
Development Environment
# Process knowledge base locally
node scripts/resources.js
node scripts/tools.js
node scripts/vectors.js
# Upload processed data to S3
npx @copilot-ld/libutil upload
Production Environment
# Download pre-processed data bundle
npx @copilot-ld/libutil download
# Or synchronize from S3 if using upload/download pattern
# Deploy with processed data available
Storage Monitoring
Monitor data storage usage and processing status:
# Check local storage sizes
du -sh data/*/
# Monitor S3 bucket usage (if using AWS)
aws s3 ls s3://your-copilot-ld-bucket --recursive --human-readable --summarize
# Check resource count by type
ls data/resources/ | grep "MessageV2" | wc -l
ls data/resources/ | grep "ToolFunction" | wc -l
ls data/resources/ | grep "Assistant" | wc -l
Next Steps
Once processing is complete, proceed to:
- Deployment Guide - Deploy the system with your processed knowledge base
- Development Guide - Set up local development environment for further customization
- Architecture Overview - Understand how processing fits into the overall system