Stop Trusting LLM JSON: Build an AI Repair Engine

Labas: https://labas.rogasper.com

Repo: https://github.com/rogasper/labas-bahasa

Most AI apps trust the LLM output. Ours has a 434-line defensive layer that feels like a linter + spellcheck + babysitter for your AI.

When you ask an LLM to generate exam questions in structured JSON, it will lie to you. Not maliciously, just confidently wrong. It'll return "Option A" as an option text. It'll answer "T" when you asked for "TRUE". It'll generate three options when you requested four. And it'll do all of this with perfect JSON formatting.

We built a repair engine to catch every one of these failures before they reach users.

Key Takeaways

• LLMs are unreliable at structured output. A 434-line repair engine (repair.ts) catches placeholder text, coerces answer formats, deduplicates options, and provides exam-specific fallbacks before Zod validation even runs.

• The agentic pipeline uses a retry loop: generate -> validate -> repair -> regenerate the invalid subset only, with context about why previous attempts failed. This "self-healing" pattern turns brittle AI output into production-grade content.

• Real-world edge cases include DeepSeek reasoning tags (<think>), HTML error pages from proxies, and JSON truncated mid-stream, each requiring specific defensive handling in the API client.

The Problem: LLMs Are Confidently Wrong

Even the best LLMs produce structurally valid but semantically incorrect JSON roughly 15-25% of the time when generating complex nested objects. For exam questions, that means as many as 1 in 4 questions might have a wrong answer, a missing option, or placeholder text.

Our experience matched this. When we first built the AI question generator for Labas, an open-source exam practice platform supporting IELTS, TOEFL, JLPT, HSK, and German exams, we assumed the LLM would return perfect output. It didn't.

Here's what an LLM actually returns when asked to generate a multiple-choice question:
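As an illustration, a response of this kind might look like the sketch below. The field names (`questionText`, `options`, `correctAnswer`, `explanation`) and the question itself are assumptions made for this example, not the project's actual schema.

```typescript
// Hypothetical raw LLM output. Field names are illustrative only.
const rawResponse = `{
  "format": "multiple_choice",
  "questionText": "According to the passage, why did the museum close early?",
  "options": [
    { "key": "A", "text": "Option A" },
    { "key": "B", "text": "Option B" },
    { "key": "C", "text": "Option C" },
    { "key": "D", "text": "Option D" }
  ],
  "correctAnswer": "a",
  "explanation": ""
}`;

// The payload parses without error, which is why these problems are easy to miss.
const parsedResponse = JSON.parse(rawResponse);
console.log(parsedResponse.options.length);
```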

Four problems in one response:

1. Placeholder options: "Option A" through "Option D" are meaningless.

2. Lowercase answer: "a" instead of "A".

3. Empty explanation: violates our requirement for Bahasa Indonesia explanations.

4. No passage text: the question references a passage that wasn't included.

Zod validation would catch #3 (an empty string fails .min(1)). But #1, #2, and #4 are structurally valid; they just produce bad questions.

Our finding: Before the repair engine, roughly 40% of AI-generated questions required manual correction. After, fewer than 5% needed human review. The repair engine handles the mechanical fixes; humans review only semantic quality.

The Repair Engine: Trust But Verify

The repair engine lives in packages/ai/src/repair.ts: 434 lines of defensive programming. It runs before Zod validation, maximizing the chance that raw LLM output passes structural checks.

Step 1: Detect Placeholder Text

The first check catches lazy AI outputs:

This regex catches "Option A", "Pilihan B", "Choice 1", "opsi C", and even "placeholder". The multilingual pattern (pilihan and opsi are both Indonesian) reflects our user base: Indonesian students practicing exams in English, Japanese, Chinese, and German.

When a placeholder is detected, the question fails semantic validation and enters the regeneration queue.
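A minimal sketch of such a check follows. The exact pattern in repair.ts is not shown in this article, so `PLACEHOLDER_PATTERN`, `isPlaceholderText`, and `hasPlaceholderOptions` are hypothetical names and the regex is an assumption built from the examples above.

```typescript
// Assumed pattern: a label word followed by a single letter or a 1-2 digit number,
// or text that begins with the word "placeholder".
const PLACEHOLDER_PATTERN =
  /^(?:(?:option|pilihan|choice|opsi)\s+(?:[a-z]|\d{1,2})|placeholder\b.*)$/i;

function isPlaceholderText(text: string): boolean {
  return PLACEHOLDER_PATTERN.test(text.trim());
}

// A question is flagged if any of its options is a placeholder.
function hasPlaceholderOptions(options: { key: string; text: string }[]): boolean {
  return options.some((option) => isPlaceholderText(option.text));
}
```

Anchoring the pattern to the whole string keeps legitimate answers such as "Optional modules are graded separately" from being flagged.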

Step 2: Coerce Answer Formats

Different exam formats expect different answer conventions. The repair engine normalizes them:

An LLM might return "T", "true", "True", or even "yes" for a True/False question. The coercion function maps all of these to the canonical "TRUE". The same goes for "F" -> "FALSE" and "NG" -> "NOT_GIVEN".

For the author_view format (a different IELTS question type), the valid answers are "YES", "NO", or "NOT_GIVEN", and the coercion handles those too.
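The coercion described above could be sketched with per-format alias tables. Only `author_view` is a format name taken from this article; `true_false_not_given`, `multiple_choice`, the alias lists, and `coerceAnswer` itself are assumptions for illustration.

```typescript
type AnswerFormat = "true_false_not_given" | "author_view" | "multiple_choice";

// Assumed alias tables: keys are upper-cased inputs, values are canonical answers.
const TFNG_ALIASES: Record<string, string> = {
  T: "TRUE", TRUE: "TRUE", YES: "TRUE",
  F: "FALSE", FALSE: "FALSE", NO: "FALSE",
  NG: "NOT_GIVEN", NOT_GIVEN: "NOT_GIVEN",
};

const AUTHOR_VIEW_ALIASES: Record<string, string> = {
  Y: "YES", YES: "YES",
  N: "NO", NO: "NO",
  NG: "NOT_GIVEN", NOT_GIVEN: "NOT_GIVEN",
};

// Returns the canonical answer, or null when the input cannot be coerced.
function coerceAnswer(raw: string, format: AnswerFormat): string | null {
  // Normalize case and separators so "not given" and "Not-Given" match "NOT_GIVEN".
  const normalized = raw.trim().toUpperCase().replace(/[\s-]+/g, "_");
  if (format === "multiple_choice") {
    return /^[A-Z]$/.test(normalized) ? normalized : null;
  }
  const aliases = format === "author_view" ? AUTHOR_VIEW_ALIASES : TFNG_ALIASES;
  return aliases[normalized] ?? null;
}
```

Returning null rather than guessing lets the caller send uncoercible answers to regeneration instead of silently accepting them.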

Step 3: Deduplicate Options

LLMs sometimes generate duplicate option keys:

If an LLM returns options with keys ["A", "B", "B", "C"], deduplication reduces them to ["A", "B", "C"]. The question then fails the "minimum options" check and enters regeneration.
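A small sketch of this step, assuming options are objects with a `key` and a `text` and that the first occurrence of each key wins. `AnswerOption`, `dedupeOptions`, and `hasEnoughOptions` are hypothetical names, and the default of four required options is taken from the multiple-choice example earlier.

```typescript
interface AnswerOption {
  key: string;
  text: string;
}

// Keeps the first option seen for each key (case-insensitive) and drops later repeats.
function dedupeOptions(options: AnswerOption[]): AnswerOption[] {
  const seen = new Set<string>();
  const unique: AnswerOption[] = [];
  for (const option of options) {
    const key = option.key.trim().toUpperCase();
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push({ ...option, key });
  }
  return unique;
}

// The "minimum options" check runs on the deduplicated list.
function hasEnoughOptions(options: AnswerOption[], required = 4): boolean {
  return dedupeOptions(options).length >= required;
}
```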

Step 4: Exam-Specific Fallbacks

When question text is missing or too short, the repair engine inserts language-specific defaults:

For JLPT (Japanese), it inserts a standard Japanese question. For HSK (Chinese), a Chinese one. For TOPIK (Korean), a Korean one. This ensures that even if the LLM fails to generate question text, the output is still exam-appropriate.
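One way to sketch the fallback is a lookup table keyed by exam. The default sentences, the `MIN_QUESTION_LENGTH` threshold, and the `ensureQuestionText` helper below are all assumptions for illustration; the actual defaults in repair.ts are not shown in this article.

```typescript
type ExamType = "JLPT" | "HSK" | "TOPIK" | "IELTS";

// Illustrative defaults only: each roughly means "Answer the following question."
const FALLBACK_QUESTION_TEXT: Record<ExamType, string> = {
  JLPT: "次の質問に答えなさい。",
  HSK: "请回答下面的问题。",
  TOPIK: "다음 질문에 답하십시오.",
  IELTS: "Answer the following question.",
};

// Assumed threshold below which question text counts as "too short".
const MIN_QUESTION_LENGTH = 5;

function ensureQuestionText(text: string | undefined, exam: ExamType): string {
  const trimmed = (text ?? "").trim();
  return trimmed.length >= MIN_QUESTION_LENGTH ? trimmed : FALLBACK_QUESTION_TEXT[exam];
}
```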

The Agentic Pipeline: Self-Healing Through Retries

The repair engine is one piece of a larger system: the agentic generation pipeline in packages/ai/src/agentic.ts (585 lines). It's a 4-step workflow with a retry loop: passage generation, passage validation, question generation, and self-validation.

After Step 4, the repair engine runs. If any questions fail validation, the pipeline enters a regeneration loop:

The key insight: only regenerate the invalid subset, not all questions. And pass the LLM context about why the previous attempts failed: the repair log becomes part of the regeneration prompt.

This is the "self-healing" pattern in action. The pipeline doesn't just detect failures; it fixes them automatically, retrying up to 2 times (in the "full" strategy) before surfacing results to the user.
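The loop described above could be sketched as follows. This is not the code in agentic.ts: `healInvalidSubset`, the `Draft` and `Failure` shapes, and the injected `validate` and `regenerate` callbacks are hypothetical, with `regenerate` standing in for the LLM call that receives the failure reasons.

```typescript
interface Draft { id: string; text: string }
interface Failure { id: string; reason: string }

// Returns a failure reason, or null when the draft is valid.
type Validate = (draft: Draft) => string | null;
// Stand-in for the LLM call: receives only the failing items and their reasons.
type Regenerate = (failures: Failure[]) => Promise<Draft[]>;

async function healInvalidSubset(
  drafts: Draft[],
  validate: Validate,
  regenerate: Regenerate,
  maxRetries = 2,
): Promise<{ valid: Draft[]; stillFailing: Failure[] }> {
  const valid: Draft[] = [];
  let pending = drafts;
  let failures: Failure[] = [];

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    failures = [];
    for (const draft of pending) {
      const reason = validate(draft);
      if (reason === null) valid.push(draft);
      else failures.push({ id: draft.id, reason });
    }
    // Stop when everything passed or the retry budget is spent.
    if (failures.length === 0 || attempt === maxRetries) break;
    // Valid drafts are kept as-is; only the failures go back to the model.
    pending = await regenerate(failures);
  }
  return { valid, stillFailing: failures };
}
```

With `maxRetries = 2` the model is called at most twice for regeneration, and anything still failing after that is returned separately so the caller can decide how to surface it.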

The API Client: Defensive Layers Beyond the Repair Engine

The repair engine handles LLM output. But the API client (packages/ai/src/clients.ts, 317 lines) handles everything else that can go wrong between your code and the LLM.

SSRF Protection

Users can bring their own AI provider, including local models running on localhost. But in production, we block requests to metadata endpoints and private IPs:

This prevents a malicious user from setting their base_url to http://169.254.169.254/latest/meta-data/ and reading cloud credentials.
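A minimal sketch of such a guard, assuming the check is done on the hostname of the user-supplied base URL. `isBlockedHost` and `isAllowedBaseUrl` are hypothetical names, and the blocked ranges listed are the standard loopback, private, and link-local IPv4 ranges rather than the project's actual list.

```typescript
// True for loopback, private, and link-local hosts (assumed block list).
function isBlockedHost(hostname: string): boolean {
  const host = hostname.toLowerCase().replace(/^\[|\]$/g, ""); // strip IPv6 brackets
  if (host === "localhost" || host === "::1" || host === "metadata.google.internal") {
    return true;
  }
  const match = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!match) return false;
  const a = Number(match[1]);
  const b = Number(match[2]);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 169 && b === 254) ||          // link-local, includes cloud metadata
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

// Local hosts stay usable outside production; in production they are rejected.
function isAllowedBaseUrl(baseUrl: string, isProduction: boolean): boolean {
  let url: URL;
  try {
    url = new URL(baseUrl);
  } catch {
    return false; // not a parseable URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  return !(isProduction && isBlockedHost(url.hostname));
}
```

A hostname check like this does not cover a public domain name that resolves to a private address, so a fuller defense would also validate the resolved IP.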

HTML Error Page Detection

Some API proxies return HTML error pages instead of JSON. The client detects this:

Without this check, the JSON parser would fail with an opaque error. With it, the user gets a clear message about what went wrong.
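A sketch of the detection, assuming it looks at the Content-Type header and the start of the body. `looksLikeHtml` and `parseJsonBody` are hypothetical helpers, and the error wording is illustrative.

```typescript
// Treats the response as HTML if the header says so or the body starts like a page.
function looksLikeHtml(body: string, contentType = ""): boolean {
  if (contentType.toLowerCase().includes("text/html")) return true;
  return /^\s*(?:<!doctype html|<html[\s>])/i.test(body);
}

// Fails early with a readable message instead of a raw JSON syntax error.
function parseJsonBody(body: string, contentType = ""): unknown {
  if (looksLikeHtml(body, contentType)) {
    throw new Error(
      "The AI provider returned an HTML page instead of JSON. " +
        "Check the base URL and any proxy in front of it.",
    );
  }
  return JSON.parse(body);
}
```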

Reasoning Block Stripping

Reasoning models (DeepSeek, GLM) wrap their chain-of-thought in <think> tags. The client strips these before JSON parsing:

Without stripping, the LLM response would be <think>...reasoning...</think>{"questions": [...]}, which fails JSON parsing.
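The stripping step can be sketched with a single non-greedy regex. `stripReasoningBlocks` is a hypothetical name, and this version assumes the `<think>` block is properly closed.

```typescript
// Removes every <think>...</think> block, including multi-line ones.
function stripReasoningBlocks(text: string): string {
  return text.replace(/<think>[\s\S]*?<\/think>/gi, "").trim();
}

const cleanedOutput = stripReasoningBlocks(
  '<think>...reasoning...</think>{"questions": []}',
);
const parsedOutput = JSON.parse(cleanedOutput); // parses once the block is gone
```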

Truncation Retry

If the LLM's response looks like truncated JSON (ends mid-object), the client doubles max_tokens and retries:

This handles the common case where the LLM runs out of tokens mid-JSON.
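One way to sketch this is to scan the response for unclosed brackets or an unterminated string, then retry once with twice the budget. `looksTruncated`, `completeWithTruncationRetry`, and the injected `complete` callback (standing in for the real API call) are hypothetical, and the heuristic is an assumption rather than the check used in clients.ts.

```typescript
// Heuristic: the text ends inside a string or with unclosed { or [.
function looksTruncated(text: string): boolean {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{" || ch === "[") {
      depth++;
    } else if (ch === "}" || ch === "]") {
      depth--;
    }
  }
  return inString || depth > 0;
}

// Stand-in for the LLM request; takes the max_tokens budget to use.
type Complete = (maxTokens: number) => Promise<string>;

async function completeWithTruncationRetry(
  complete: Complete,
  maxTokens: number,
): Promise<string> {
  const first = await complete(maxTokens);
  if (!looksTruncated(first)) return first;
  // Single retry with a doubled budget; the second result is returned as-is.
  return complete(maxTokens * 2);
}
```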

The Philosophy: LLM as Junior Dev

The entire system (repair engine, agentic pipeline, defensive API client) embodies one principle: treat the LLM as a junior developer who writes fast but needs review.

You wouldn't ship code from a junior dev without review. Don't ship LLM output without validation either.

The repair engine is the code review. The agentic pipeline is the pair programming session. The API client is the CI pipeline that catches infrastructure issues.

Contrarian take: The biggest mistake in AI app development isn't using the wrong model; it's trusting the model's output without a validation layer. Zod schemas catch structural errors. Semantic checks catch logical errors. Retry loops catch transient errors. Build all three, or your users will find the bugs for you.

Frequently Asked Questions

Why not just use a better prompt instead of a repair engine?

Prompts help, but they're not reliable. Even with detailed prompts, LLMs occasionally produce placeholder text, wrong answer formats, or missing fields. The repair engine is a safety net: it catches what prompts miss. In our testing, better prompts reduced failures from 40% to 25%. The repair engine brought that down to under 5%.

Does the repair engine add latency?

Minimal. The repair engine runs synchronously in-process, with no network calls. For a batch of 100 questions, repair and parsing take under 5ms. The regeneration loop adds latency only when questions fail validation, and even then it's usually just one additional LLM call (typically 2-5 seconds).

Can this pattern work for non-exam AI applications?

Absolutely. The repair engine pattern (detect, coerce, fallback, retry) applies to any domain where LLMs produce structured output. Form validation, data extraction, code generation, and content moderation all benefit from a "trust but verify" layer between the LLM and your database.

What about the "lean" vs "full" strategy?

The agentic pipeline supports two strategies. "Full" runs all 4 steps (passage generation, passage validation, question generation, self-validation) with up to 2 regeneration attempts. "Lean" skips passage validation and self-validation, with only 1 regeneration attempt. Lean is faster and cheaper; full is higher quality. Users choose based on their needs.
