The Evaluator
Your go-to blog for insights on AI observability and evaluation.

Build More Accurate AI Apps through Fast Experimentation with Arize Phoenix, Langflow, and NVIDIA
One of the biggest challenges AI app developers face is ensuring the apps they build provide accurate answers. When the AI isn’t accurate, customers quickly lose trust in the app….

Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies
LLMs have revolutionized natural language processing, showcasing remarkable versatility and capabilities. But individual LLMs often exhibit distinct strengths and weaknesses, influenced by differences in their training corpora. This diversity poses…

Arize Release Notes: Copilot Enhancements, Experiment Projects, and More
Welcome to our regular update on new releases, enhancements, and changes. What’s New: Copilot Enhancements. Span Chat: The Copilot Span Chat skill makes getting value from spans faster and easier….

AI Agent Workflows and Architectures Masterclass
While popular imagination and industry discourse can paint AI agents as complex autonomous systems with a mind of their own, practical implementations are far more straightforward. While specific definitions vary,…

Building an AI Agent that Thrives in the Real World
Building an AI agent and keeping it running smoothly in production can feel like a daunting task. When it comes to working with LLMs, it’s still a bit of uncharted…

Agent-as-a-Judge: Evaluate Agents with Agents
This week we dive into a paper that presents the “Agent-as-a-Judge” framework, a new paradigm for evaluating agent systems. Where typical evaluation methods focus solely on outcomes or demand extensive…

Instrumenting Your LLM Application: Arize Phoenix and Vercel AI SDK
Instrumentation is an important tool for developers building with LLMs. It provides insight into application performance, behavior, and impact. This blog will cover: why instrumentation matters for LLM applications; benefits…

What is AutoGen?
Thanks to Ali Saleh for his contributions to this piece. AutoGen is a framework that helps you easily create multi-agent applications. Multi-agent applications are a relatively recent idea that involves…

Introduction to OpenAI’s Realtime API
We break down OpenAI’s Realtime API. Sally-Ann DeLucia and Aparna Dhinakaran cover how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement….

How to Improve LLM Safety and Reliability
As large language models (LLMs) become integral to customer-facing applications, safety is a requirement. When AI systems are trusted to provide information, guidance, and customer support, lapses in safety can…