The Measure stage is where you quantify the quality and effectiveness of your AI capability. Instead of relying on anecdotal checks, this stage uses a systematic process called an eval to score your capability’s performance against a known set of correct examples (ground truth). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.
The Eval function
Coming soon
The primary tool for the Measure stage is the Eval function, which will be available in the axiom/ai package. It provides a simple, declarative way to define a test suite for your capability directly in your codebase.
An Eval is structured around a few key parameters:
data: An async function that returns your collection of { input, expected } pairs, which serve as your ground truth.
task: The function that executes your AI capability, taking an input and producing an output.
scorers: An array of grader functions that score the output against the expected value.
threshold: A score between 0 and 1 that determines the pass/fail condition for the evaluation.
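Because the Eval function is not yet released, the following is only a minimal sketch of how the four parameters could fit together. The types, the runEval helper, and the rule of averaging all scorer results before comparing against the threshold are assumptions made for illustration; none of them is the axiom/ai API.

```typescript
// Hypothetical local types; the real axiom/ai Eval API is not yet published.
type Example<I, E> = { input: I; expected: E };
type Scorer<I, O, E> = (args: {
  input: I;
  output: O;
  expected: E;
}) => number | Promise<number>;

interface EvalDefinition<I, O, E> {
  data: () => Promise<Example<I, E>[]>; // ground-truth pairs
  task: (input: I) => O | Promise<O>; // the capability under test
  scorers: Scorer<I, O, E>[]; // graders comparing output to expected
  threshold: number; // pass/fail cutoff between 0 and 1
}

interface EvalResult {
  meanScore: number;
  passed: boolean;
}

// Assumption: the suite passes when the mean of all scorer results
// across all examples is at least the threshold.
async function runEval<I, O, E>(
  def: EvalDefinition<I, O, E>
): Promise<EvalResult> {
  const examples = await def.data();
  const scores: number[] = [];
  for (const { input, expected } of examples) {
    const output = await def.task(input);
    for (const scorer of def.scorers) {
      scores.push(await scorer({ input, output, expected }));
    }
  }
  const meanScore =
    scores.length === 0
      ? 0
      : scores.reduce((a, b) => a + b, 0) / scores.length;
  return { meanScore, passed: meanScore >= def.threshold };
}

// A toy capability that upper-cases its input, scored by exact match.
const textMatchEval: EvalDefinition<string, string, string> = {
  data: async () => [
    { input: "hello", expected: "HELLO" },
    { input: "axiom", expected: "AXIOM" },
  ],
  task: (input) => input.toUpperCase(),
  scorers: [({ output, expected }) => (output === expected ? 1 : 0)],
  threshold: 0.8,
};
```

Calling runEval(textMatchEval) resolves to an object holding the mean score and a passed flag. Keeping data, task, and scorers as separate fields means each can be swapped independently, for example to run the same ground truth against a new version of the task.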
/evals/text-match.eval.ts
Grading with scorers
Coming soon
A scorer is a function that scores a capability’s output. Axiom will provide a library of built-in scorers for common tasks (e.g., checking for semantic similarity, factual correctness, or JSON validity). You can also provide your own custom functions to measure domain-specific logic. Each scorer receives the input, the generated output, and the expected value, and must return a score.
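As a rough illustration of custom scorers, the sketch below defines three small grading functions. The Scorer type, the argument shape, the function names, and the convention that scores fall between 0 and 1 are assumptions inferred from the description above; they are not the built-in scorers Axiom plans to ship.

```typescript
// Hypothetical scorer shape based on the description above; not the axiom/ai API.
type ScorerArgs<I, O, E> = { input: I; output: O; expected: E };
type Scorer<I, O, E> = (args: ScorerArgs<I, O, E>) => number;

// Returns 1 when output and expected match after trimming whitespace
// and lower-casing, otherwise 0.
const exactMatch: Scorer<string, string, string> = ({ output, expected }) =>
  output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;

// Returns 1 when the output parses as JSON, otherwise 0.
// The expected value is ignored by this scorer.
const validJson: Scorer<string, string, unknown> = ({ output }) => {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
};

// Partial credit: the fraction of expected keywords found in the output.
const keywordCoverage: Scorer<string, string, string[]> = ({
  output,
  expected,
}) => {
  if (expected.length === 0) return 1;
  const text = output.toLowerCase();
  const hits = expected.filter((kw) => text.includes(kw.toLowerCase())).length;
  return hits / expected.length;
};
```

Binary scorers such as exactMatch suit tasks with a single correct answer, while graded scorers such as keywordCoverage give partial credit for domain-specific checks.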
Running evaluations
Coming soon
You will run your evaluation suites from your terminal using the axiom CLI.
The CLI will run your evaluations using vitest in the background. Note that vitest will be a peer dependency for this functionality.
Analyzing results in the console
Coming soon
When you run an eval, the Axiom SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. These traces are enriched with eval.* attributes, allowing you to deeply analyze results in the Axiom Console.
The Console will feature leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.