LangChain, a leading player in the field of artificial intelligence, has recently unveiled two new offerings, OpenEvals and AgentEvals, with the aim of simplifying the evaluation process for large language models (LLMs). These packages provide developers with a comprehensive framework and a range of evaluators to streamline the assessment of LLM-powered applications and agents, according to LangChain.
Understanding the Importance of Evaluations
Evaluations, often referred to as evals, play a critical role in determining the quality of LLM outputs. They consist of two key components: the data being evaluated and the metrics used for evaluation. The quality of the data has a significant impact on the evaluation’s ability to accurately reflect real-world usage. LangChain emphasizes the importance of curating a high-quality dataset tailored to specific use cases.
The metrics for evaluation are typically customized based on the goals of the application. To address common evaluation needs, LangChain has developed OpenEvals and AgentEvals, providing pre-built solutions that highlight prevalent evaluation trends and best practices.
Common Evaluation Types and Best Practices
OpenEvals and AgentEvals focus on two main approaches to evaluations:
- Customizable Evaluators: These LLM-as-a-judge evaluations apply broadly across applications, and developers can adapt the pre-built example prompts to their specific requirements.
- Specific Use Case Evaluators: These evaluators are designed for particular applications, such as extracting structured content from documents or managing tool calls and agent trajectories. LangChain plans to expand these libraries to include more targeted evaluation techniques.
LLM-as-a-Judge Evaluations
LLM-as-a-judge evaluations are widely used for assessing natural language outputs. They can be run reference-free, scoring outputs without the need for ground-truth answers. OpenEvals facilitates this process with customizable starter prompts, support for few-shot examples, and generated reasoning comments that make each judgment transparent.
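As a rough illustration, the sketch below wires up a reference-free conciseness judge. It assumes the `create_llm_as_judge` helper and bundled `CONCISENESS_PROMPT` described in the OpenEvals launch materials; exact names and parameters may differ in current releases.

```python
# A minimal sketch, assuming openevals exposes create_llm_as_judge and a
# CONCISENESS_PROMPT starter prompt (names follow the launch README and may
# differ in current releases).
from openevals.llm import create_llm_as_judge
from openevals.prompts import CONCISENESS_PROMPT

# Build a reference-free judge that scores conciseness with an LLM.
conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    feedback_key="conciseness",
    model="openai:o3-mini",
)

# Score a single input/output pair; the result includes a score and the
# judge's reasoning comment for transparency.
result = conciseness_evaluator(
    inputs="How is the weather in San Francisco?",
    outputs="Thanks for asking! The weather in San Francisco is sunny today.",
)
print(result)
```

The starter prompt can be swapped for a customized one, and few-shot examples can be folded into the prompt to steer the judge toward an application's own quality bar.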
Structured Data Evaluations
For applications that require structured output, OpenEvals offers tools to ensure that the model’s output adheres to a predefined format. This is crucial for tasks such as extracting structured information from documents or validating the parameters of tool calls. OpenEvals supports both exact-match comparison and LLM-as-a-judge validation for structured outputs.
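Below is a minimal sketch of the exact-match path for structured output. It assumes OpenEvals exposes a `create_json_match_evaluator` helper with an `aggregator` option, as described in the launch materials; the names are assumptions and should be checked against the current documentation.

```python
# A minimal sketch of an exact-match check on structured output, assuming
# openevals exposes create_json_match_evaluator (name and parameters follow
# the launch README and may change).
from openevals.json import create_json_match_evaluator

# Fields extracted from a document by the model vs. the expected values.
outputs = {"company": "Acme Corp", "employees": 250, "public": False}
reference_outputs = {"company": "Acme Corp", "employees": 250, "public": True}

# Compare each key exactly and average the per-key results into one score.
evaluator = create_json_match_evaluator(aggregator="average")

result = evaluator(outputs=outputs, reference_outputs=reference_outputs)
print(result)  # aggregate score reflecting two of three matching fields
```

For fields where exact string equality is too strict, such as free-text summaries, the same structured output can instead be routed through an LLM-as-a-judge evaluator.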
Agent Evaluations: Trajectory Evaluations
Agent evaluations focus on the sequence of actions an agent takes to accomplish a task. This means checking which tools the agent selects and the order in which it calls them. AgentEvals provides mechanisms to verify that agents use the correct tools and follow an appropriate trajectory.
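The sketch below compares an agent's trajectory against a reference trajectory. It assumes AgentEvals provides a `create_trajectory_match_evaluator` helper with a `trajectory_match_mode` option, per the launch materials; these names are assumptions to verify against the current docs.

```python
# A minimal sketch of a trajectory match with agentevals, assuming the
# create_trajectory_match_evaluator helper and its trajectory_match_mode
# option (names follow the launch README; verify against current docs).
from agentevals.trajectory.match import create_trajectory_match_evaluator

# Agent trajectories as OpenAI-style message lists, including tool calls.
outputs = [
    {"role": "user", "content": "What is the weather in San Francisco?"},
    {
        "role": "assistant",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": '{"city": "San Francisco"}'}}
        ],
    },
    {"role": "tool", "content": "Sunny, 21C"},
    {"role": "assistant", "content": "It is sunny and 21C in San Francisco."},
]
reference_outputs = [
    {"role": "user", "content": "What is the weather in San Francisco?"},
    {
        "role": "assistant",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": '{"city": "San Francisco"}'}}
        ],
    },
    {"role": "tool", "content": "Sunny, 21C"},
    {"role": "assistant", "content": "Sunny, around 21C."},
]

# "unordered" mode checks that the same tools were called, ignoring order;
# stricter modes can also enforce the exact call sequence.
evaluator = create_trajectory_match_evaluator(trajectory_match_mode="unordered")
result = evaluator(outputs=outputs, reference_outputs=reference_outputs)
print(result)
```

Matching strictness is a design choice: an ordered match catches agents that call the right tools in the wrong order, while an unordered match tolerates harmless reordering.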
Tracking and Future Developments
LangChain recommends using LangSmith for tracking evaluations over time. LangSmith offers tools for tracing, evaluation, and experimentation, supporting the development of production-grade LLM applications. Notable companies like Elastic and Klarna utilize LangSmith to evaluate their GenAI applications.
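For illustration, the sketch below logs an OpenEvals judge as a LangSmith experiment. It assumes a `LANGSMITH_API_KEY` is configured and that a dataset named "qa-dataset" (hypothetical) already exists; the `evaluate()` call follows the LangSmith Python SDK, but exact signatures should be confirmed against the current documentation.

```python
# A minimal sketch of tracking an OpenEvals judge as a LangSmith experiment.
# Assumes LANGSMITH_API_KEY is set and a dataset named "qa-dataset"
# (hypothetical) exists; check current LangSmith SDK docs for exact signatures.
from langsmith import Client
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

client = Client()

correctness_judge = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    model="openai:o3-mini",
)

def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    # Adapt the OpenEvals judge to the evaluator signature LangSmith expects.
    return correctness_judge(
        inputs=inputs, outputs=outputs, reference_outputs=reference_outputs
    )

def my_app(inputs: dict) -> dict:
    # Placeholder for the LLM application under test (hypothetical).
    return {"answer": "42"}

# Run the app over the dataset, score each example, and log the results as a
# LangSmith experiment so quality can be compared across runs over time.
client.evaluate(
    my_app,
    data="qa-dataset",
    evaluators=[correctness_evaluator],
    experiment_prefix="openevals-correctness",
)
```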
LangChain’s commitment to codifying best practices continues, with plans to introduce more specific evaluators for common use cases. Developers are encouraged to contribute their own evaluators or suggest improvements via GitHub.