How AfterQuery Expert Data Drives Model Performance on τ²-bench
Michael E.Spencer MateegaArya F.
Michael E., Spencer M., Arya F.·April 7, 2026

Using just 1,057 rollouts from our off-the-shelf τ² dataset, AfterQuery researchers were able to improve Llama-3.1-8B-Instruct performance on the official τ² benchmark by up to 4.33× in some domains.

τ²-bench measures how well models operate as customer-service agents in dual-control scenarios, where success requires not just reasoning and tool use, but coordinating with a user who is actively modifying shared state.

The benchmark

τ²-bench simulates realistic customer-service interactions. Each domain defines an agent policy, a set of tools and APIs, a task suite, and a user simulator. The agent must navigate multi-turn conversations, call the right tools with the right arguments, and follow domain-specific policies, all while interacting with a simulated customer.

For example, in one official τ²-bench task a customer tries to cancel a flight reservation past the allowed window and pushes back when the refund is denied. The agent must hold firm and refuse the cancellation per airline policy.

Our data builds upon all 3 public τ² domains with new scenarios, and introduces 2 new domains built by AfterQuery:

Airline
Retail
Telecom
BankingAfterQuery
HealthcareAfterQuery

Our evaluation methodology used the test split for each domain where applicable (e.g. tau2-airline-splits): retail (40 tasks), telecom (40 tasks), airline (33 tasks). This differs from AA's methodology, which used all 114 tasks for telecom.

Training

We fine-tuned Llama-3.1-8B-Instruct on 1,057 rollouts from AfterQuery's off-the-shelf dataset, covering 500 unique tasks across 6 domain variants. Rollouts were filtered via rejection sampling to keep only passing trajectories. Every training sample was disjoint from both our validation set and the official τ² benchmark.

LoRA SFT Configuration
ParameterValue
Base Modelmeta-llama/Meta-Llama-3.1-8B-Instruct
Training Infrastructure8x H100
LoRA Rank32
Learning Rate2e-5
LR ScheduleCosine
Batch Size16
Epochs3
Max Sequence Length32,768

The full training run completed in under 20 minutes on 8x H100s. Training loss dropped by roughly 45%, with eval loss tracking closely and no significant overfitting, suggesting the model learned generalizable patterns rather than memorizing specific conversations.

Results

We evaluated each checkpoint against the unmodified base model. Both used gpt-4.1 as the user simulator, with temperature set to 0.0.

Retail saw the largest relative gain at 4.33x. Telecom improvement peaked early at step 60 then declined, as later checkpoints appeared to overfit away from telecom-specific patterns. Airline improved steadily through training.

Line chart showing τ² benchmark scores over training at Base, Step 60, 100, 160, and Final checkpoints for retail, telecom, and airline domains
Bar chart showing Llama-3.1-8B-Instruct Base vs Peak scores: retail 4.33x improvement (7.5% to 32.5%), telecom 3.50x (5.0% to 17.5%), airline 1.37x (33.3% to 45.5%)

Evaluation setup

Our Checkpoints

terminal
tau2 run --domain retail --agent llm_agent \
--agent-llm "openai/sft-s100" \
--agent-llm-args '{"api_base":"http://localhost:8000/v1","temperature":0.0}' \
--user-llm openrouter/openai/gpt-4.1 \
--user-llm-args '{"temperature":0.0}' \
--task-split-name test --num-trials 1 --max-concurrency 5

Base Model

terminal
tau2 run --domain retail --agent llm_agent \
--agent-llm openrouter/meta-llama/llama-3.1-8b-instruct \
--agent-llm-args '{"temperature":0.0}' \
--user-llm openrouter/openai/gpt-4.1 \
--user-llm-args '{"temperature":0.0}' \
--task-split-name test --num-trials 1 --max-concurrency 5

How model behavior changed

The most visible difference between the base model and our fine-tuned model is workflow, not raw capability. The base model tends to guess at tool arguments immediately, fabricating order IDs and placeholder emails. Our model consistently begins by asking the user for identifying information, then calls tools with real values.

Behavioral Metrics: Base vs SFT
MetricBaseSFT
Uses placeholder/fabricated tool arguments91% of tasks5% of tasks
Asks user for identity before acting6% of tasks95% of tasks
Avg tool calls per task8.68.0
Avg conversation length29 messages27 messages

What the improvement looks like in practice

Retail, Task 17: A user wants to update the delivery address on a pending order but doesn't remember her email. The agent must verify her identity via name and zip code, look up the correct order across multiple results, and apply the address change.

Base model: passes, but stumbles first

User

I need to update the delivery address for an order I placed.

Agent

[TOOL] modify_pending_order_address(order_id="#W0000000", ...)

Tool Response

Error: Order not found

Agent

[TOOL] find_user_id_by_email(email="your_email@example.com")

Tool Response

Error: User not found

The base model fabricates an order ID and a placeholder email—both fail. It eventually recovers but wastes multiple turns.

SFT model: passes, clean workflow

User

I need to update the delivery address for an order I placed.

Agent

I'd be happy to help! To get started, I'll need to verify your identity. Could you provide me with either your email or your name and zip code?

User

My name is Fatima Johnson and my zip code is 78712.

Agent

[TOOL] find_user_id_by_name_zip(first_name="Fatima", last_name="Johnson", zip="78712")

Tool Response

fatima_johnson_7581

The SFT model asks for real identifying information, then calls the correct tool with actual values.

Data is the lever

These results are driven primarily by data quality. Using only a small fraction of a single curated AfterQuery dataset, one SFT stage was enough to shift model behavior and generalize to unseen tasks. We believe adding RL via τ²'s gymnasium is a clear next step to push performance further.

Get in touch here to access our off-the-shelf τ² and agentic post-training datasets, or reach out to us directly at research@afterquery.com.

AfterQuery is an applied research lab curating data solutions to accelerate foundation model development.

Evaluation on τ² benchmark test splits. Retail: 40 tasks, Telecom: 40 tasks, Airline: 33 tasks. User simulator: gpt-4.1. Temperature 0.0 for both agent and user. Single trial.