APRIL 8, 2026

How Evals Changed the Way We Build AI Flashcards

We thought agentic AI would solve AI flashcard quality. Generate, review, update: the classic loop. It produced beautiful cards. Then our users told us the truth. Evals told us why.

AI/ML ENGINEERING

Allan Tan

Chief AI Scientist at Better-ed

The agentic starting point

When we first built our AI flashcard generation pipeline, we reached for the pattern that felt most natural: generate-review-update. An LLM generates the initial flashcard set from a teacher’s curriculum document. A second pass reviews each card for accuracy, clarity, and pedagogical soundness. A third pass rewrites anything the reviewer flagged.
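
To make the shape of that loop concrete, here is a minimal sketch of the three passes. It is an illustration under assumptions, not our production code: the `Flashcard` and `Review` types, the function names, and the three injected callables (each of which would wrap one LLM call) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Flashcard:
    question: str
    answer: str
    hint: str = ""


@dataclass
class Review:
    card: Flashcard
    # An empty list means the reviewer found nothing to flag.
    issues: List[str] = field(default_factory=list)


# Hypothetical signatures: in a real pipeline each callable wraps one LLM call.
GenerateFn = Callable[[str], List[Flashcard]]
ReviewFn = Callable[[Flashcard, str], Review]
RewriteFn = Callable[[Review, str], Flashcard]


def generate_review_update(
    document: str,
    generate: GenerateFn,
    review: ReviewFn,
    rewrite: RewriteFn,
) -> List[Flashcard]:
    """Run the three passes over one curriculum document."""
    # Pass 1: generate the initial flashcard set from the source document.
    cards = generate(document)
    # Pass 2: review every card against the same document.
    reviews = [review(card, document) for card in cards]
    # Pass 3: rewrite only the flagged cards; keep the others unchanged.
    return [rewrite(r, document) if r.issues else r.card for r in reviews]
```

Note what the structure implies: every card costs at least two model calls (generate and review), and flagged cards cost a third.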

The results looked excellent. Cards were well-structured, answers were grounded in the source material, and learning objectives were cleanly mapped. Early pilot teachers loved them. We shipped it.

Then the feedback arrived

Within weeks, teachers started asking questions we didn’t expect:

“Why am I only getting 10 cards for a 24-page module?”

“This took two minutes to generate. Can it be faster?”

“The hint basically tells the student the answer.”

“This question is confusing — a student could answer three different things.”

We had optimized for what looked good in a demo. We hadn’t measured what actually worked in a classroom. We needed a systematic way to separate signal from noise — a way to know, with data, whether a change made things better or worse.

We needed evals.

Building our evaluation framework

We designed three evaluation suites, each testing a different dimension of flashcard quality:

Model	Ungrounded	Grounded	Grounded + Hints	Review Pass	Avg Issues
Gemini 3 Flash	1.000	0.938	0.936	7/7	1.3
GPT 5.4 Mini	0.986	0.944	0.642	0/7	7.1
MiniMax M1	0.971	0.854	0.711	1/7	6.1
Gemini 3 Flash Lite	0.943	0.800	0.800	3/7	2.7
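
One way to read the table is to look at how much each model's score changes between the Grounded column and the score column to its right. The sketch below does only that arithmetic on the numbers copied from the table. The dictionary keys and the helper name `rank_by_gap` are hypothetical, and the code assumes nothing about how each score is computed.

```python
from typing import Dict, List, Tuple

# Scores copied from the table above. Keys are illustrative labels only:
# "grounded_plus" stands for the score column to the right of "Grounded".
RESULTS: Dict[str, Dict[str, float]] = {
    "Gemini 3 Flash": {"ungrounded": 1.000, "grounded": 0.938, "grounded_plus": 0.936},
    "GPT 5.4 Mini": {"ungrounded": 0.986, "grounded": 0.944, "grounded_plus": 0.642},
    "MiniMax M1": {"ungrounded": 0.971, "grounded": 0.854, "grounded_plus": 0.711},
    "Gemini 3 Flash Lite": {"ungrounded": 0.943, "grounded": 0.800, "grounded_plus": 0.800},
}


def rank_by_gap(results: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (model, gap) pairs sorted by largest score drop first."""
    gaps = [
        (model, round(row["grounded"] - row["grounded_plus"], 3))
        for model, row in results.items()
    ]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_gap(RESULTS)
for model, gap in ranked:
    print(f"{model:<20} gap={gap:.3f}")
```

On these numbers the largest gap is GPT 5.4 Mini at 0.302, followed by MiniMax M1 at 0.143. Gemini 3 Flash barely moves (0.002) and Gemini 3 Flash Lite does not move at all.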

How Evals Changed the Way We Build AI Flashcards

The agentic starting point

Then the feedback arrived

Building our evaluation framework

The surprising cost of “review”

Cutting the loop: one-shot generation

The hint problem

Model comparison: four models, three evals

Gemini 3 Flash: consistent quality at scale

GPT 5.4 Mini: reliable format, fragile hints

MiniMax M1: high volume, hint struggles

Gemini 3 Flash Lite: quality ceiling

What we learned