Building a RAG pipeline for Minecraft coding

The base LLM said PlayerPickItemEvent doesn’t exist. It does. It’s in Paper 1.21.3 (the source I used for v1 of this RAG approach), it fires when a player uses pick-item (middle-clicking a block to pull the matching item onto the hotbar), and the base model scored 1/5 on that question in my benchmark. With MCP retrieval context, it scored 5/5.

That’s the whole pitch. But the pipeline that gets there is more interesting than the result.

Why I scrapped the web app

The original plan was a SvelteKit chat UI – custom LLM client, MongoDB for conversations, BetterAuth, the usual. Three weeks in I hadn’t touched the thing that actually mattered, which was the retrieval engine. About 80% of what I was building already existed in Claude Code and Cursor.

So I deleted the web app and built an MCP server instead. Plug it into whatever tool you’re already using. The server does one thing: take a query, search the Paper API knowledge base, return relevant code chunks. The host LLM does the rest.

This was the right call, and I should have made it sooner.

How the knowledge base is built

4,504 Java source files from the Paper 1.21.3 API. The first question is how to chunk them.

Sliding-window chunking – split every N tokens, step by M – is what most RAG tutorials reach for. It’s wrong here. A Java method is a natural semantic unit. Splitting one across a chunk boundary means your retrieval might return half a method signature without its parameter types, or a class docstring without the class declaration. The boundaries matter.

Tree-sitter parses each file into an AST. I chunk at class, method, and enum boundaries – 14,015 chunks total. Each chunk carries rich metadata: package, fully-qualified class name, parent class, implemented interfaces, method name and parameter types (for method chunks), return type, visibility, is_static, is_deprecated, a since_version tag, and a list of related types referenced by the chunk body. That metadata does work later at query time.
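To make the payload shape concrete, here is a rough sketch of what one chunk might look like as a TypeScript type. The field names follow the list above (parent_class, interfaces and related_types also appear in the retrieval code later on); the exact types, the optional markers, the kind discriminator and the signatureOf helper are my assumptions for illustration, not the project's real schema.

```typescript
// Hypothetical payload shape. Field names follow the prose; the exact types
// and optionality are assumptions, not the project's real schema.
interface ChunkMetadata {
	kind: 'class' | 'method' | 'enum';
	package: string;
	class_name: string; // fully-qualified
	parent_class: string | null;
	interfaces: string[];
	method_name?: string; // method chunks only
	param_types?: string[]; // method chunks only
	return_type?: string;
	visibility: 'public' | 'protected' | 'private' | 'package-private';
	is_static: boolean;
	is_deprecated: boolean;
	since_version: string | null;
	related_types: string[];
}

interface ChunkPayload {
	text: string;
	metadata: ChunkMetadata;
}

// Illustrative values for a method chunk; hierarchy fields are left empty for brevity.
const teleportChunk: ChunkPayload = {
	text: 'boolean teleport(@NotNull Location location);',
	metadata: {
		kind: 'method',
		package: 'org.bukkit.entity',
		class_name: 'org.bukkit.entity.Entity',
		parent_class: null,
		interfaces: [],
		method_name: 'teleport',
		param_types: ['Location'],
		return_type: 'boolean',
		visibility: 'public',
		is_static: false,
		is_deprecated: false,
		since_version: null,
		related_types: ['Location']
	}
};

// Hypothetical helper: a readable label built purely from metadata.
function signatureOf(m: ChunkMetadata): string {
	if (m.kind !== 'method' || !m.method_name) return m.class_name;
	return `${m.class_name}#${m.method_name}(${(m.param_types ?? []).join(', ')})`;
}
```

Keeping the hierarchy fields flat on every chunk is what lets the later lookups run as plain payload filters rather than a second parse of the source.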

[Placeholder Figure 1 – Qdrant explorer screenshot: a single class chunk showing the full payload (text plus every metadata field) so the shape is visible at a glance.]

src/indexer/parser.ts

export async function parseJavaFile(
	source: string,
	filePath: string,
	mcVersion: string,
	sourceType: SourceType,
	library: string | null = null
): Promise<Chunk[]> {
	const p = await initParser();
	const tree = p.parse(source);
	const root = tree.rootNode;

	const pkg = extractPackage(root);
	const imports = extractImports(root);
	const ctx: ClassContext = { pkg, imports, filePath, mcVersion, sourceType, library };
	const chunks: Chunk[] = [];

	for (const child of root.children) {
		if (
			child.type === 'class_declaration' ||
			child.type === 'interface_declaration' ||
			child.type === 'enum_declaration'
		) {
			chunks.push(...processTypeDeclaration(child, ctx));
		}
	}

	return chunks;
}

The parser is the web-tree-sitter WASM build, not native tree-sitter. Native requires C++ compilation, which is reliably painful on Windows. WASM works everywhere, the overhead only matters at index time, and indexing is offline.

Each chunk then goes through fastembed – BGE-base-en-v1.5, local ONNX, no external API calls. 768-dimensional dense vectors. I also compute BM25 sparse vectors myself: tokenise every chunk at index time, build the IDF table, generate per-chunk sparse vectors. Both land in Qdrant. One collection, two vector indexes per chunk.

Building the BM25 sparse vectors

Dense embeddings handle semantics. They don’t handle exact identifier matches well – if the query says PlayerPickItemEvent, you want that exact string to score high, not just “things that feel related to item pickup”. BM25 is the other half of the retrieval: classic TF-IDF with document-length normalisation, built from the same corpus at index time.

The tokeniser is the important bit for code. Off-the-shelf BM25 libraries split on whitespace and punctuation. That would leave ItemStack as a single token, which is wrong – queries like “make an item” should still match ItemStack chunks. So the tokeniser splits on camelCase and PascalCase boundaries before lowercasing:

src/retrieval/bm25.ts

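export function tokenize(text: string): string[] {
	// Insert a space at lowercase-to-uppercase boundaries, then between an
	// acronym run and the capitalised word that follows it.
	const split = text.replace(/([a-z])([A-Z])/g, '$1 $2').replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2');

	return split
		.split(/[^a-zA-Z0-9]+/) // break on anything that isn't alphanumeric
		.map((t) => t.toLowerCase())
		.filter((t) => t.length >= 2); // drop empty strings and single characters
}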

That one regex pair (getAction → get Action, ItemStack → Item Stack, HTTPServer → HTTP Server) matters more than any BM25 parameter choice. The vocabulary that falls out covers every class, method and enum name in the corpus alongside the natural-language tokens from Javadoc comments.

The scorer uses standard BM25 with k1=1.2, b=0.75. IDF is the log-scaled inverse document frequency per term. Documents get the full length-normalised TF term; queries use a simpler normalisation, because a query’s “length” isn’t comparable to a document’s. Everything serialises to a single JSON file that the MCP server deserialises on start.

The per-chunk scores are stored in Qdrant as sparse vectors – two arrays, indices and values, where each index is a vocabulary slot and each value is that token’s BM25 score for this chunk.
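A minimal sketch of how those per-chunk sparse vectors could be computed, assuming documents arrive already tokenised. k1 and b are the values quoted above; the IDF formula is one common smoothed variant, and the binary query weighting is just my stand-in for the "simpler normalisation", so treat both as assumptions. buildIndex, docVector, queryVector and dot are illustrative names, not the project's.

```typescript
interface SparseVector { indices: number[]; values: number[]; }
interface Bm25Index { vocab: Map<string, number>; idf: number[]; avgLen: number; }

const K1 = 1.2;
const B = 0.75;

// Assign each distinct token a vocabulary slot and compute its IDF.
function buildIndex(docs: string[][]): Bm25Index {
	const vocab = new Map<string, number>();
	const df: number[] = [];
	let totalLen = 0;
	for (const tokens of docs) {
		totalLen += tokens.length;
		const seen = new Set<string>();
		for (const t of tokens) {
			if (seen.has(t)) continue; // count each token once per document
			seen.add(t);
			let slot = vocab.get(t);
			if (slot === undefined) {
				slot = vocab.size;
				vocab.set(t, slot);
			}
			df[slot] = (df[slot] ?? 0) + 1;
		}
	}
	const n = docs.length;
	// Smoothed IDF (an assumed variant); always positive.
	const idf = df.map((d) => Math.log(1 + (n - d + 0.5) / (d + 0.5)));
	return { vocab, idf, avgLen: n > 0 ? totalLen / n : 0 };
}

// Document side: IDF times the length-normalised TF term, one value per slot.
function docVector(tokens: string[], index: Bm25Index): SparseVector {
	const tf = new Map<number, number>();
	for (const t of tokens) {
		const slot = index.vocab.get(t);
		if (slot !== undefined) tf.set(slot, (tf.get(slot) ?? 0) + 1);
	}
	const norm = K1 * (1 - B + B * (tokens.length / (index.avgLen || 1)));
	const indices: number[] = [];
	const values: number[] = [];
	tf.forEach((f, slot) => {
		indices.push(slot);
		values.push((index.idf[slot] ?? 0) * ((f * (K1 + 1)) / (f + norm)));
	});
	return { indices, values };
}

// Query side (assumption): weight 1 for every known, distinct query token.
function queryVector(tokens: string[], index: Bm25Index): SparseVector {
	const slots = new Set<number>();
	for (const t of tokens) {
		const slot = index.vocab.get(t);
		if (slot !== undefined) slots.add(slot);
	}
	const indices = Array.from(slots);
	return { indices, values: indices.map(() => 1) };
}

// Sparse dot product: with the weighting above this equals the BM25 score.
function dot(a: SparseVector, b: SparseVector): number {
	const bMap = new Map<number, number>();
	b.indices.forEach((slot, i) => bMap.set(slot, b.values[i] ?? 0));
	let sum = 0;
	a.indices.forEach((slot, i) => {
		sum += (a.values[i] ?? 0) * (bMap.get(slot) ?? 0);
	});
	return sum;
}
```

Because all the BM25 weighting is baked into the document side, the query side can stay trivial and the vector store only has to compute a dot product over shared slots.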

Hybrid search and why code needs both

Dense vector search is semantic. “Make a player fly” finds PlayerToggleFlightEvent because the embedding space knows they’re related, even though “fly” and “Flight” aren’t the same string. But if you ask about PlayerPickItemEvent by name, a pure vector search might miss it – the class name is a specific identifier, and semantic similarity doesn’t help much when you’re hunting for an exact match.

BM25 inverts this. Exact identifier matches score high. But “give a player a diamond sword” won’t surface ItemStack and PlayerInventory reliably because BM25 just sees token frequencies.

So both run in parallel on every query. Results merge via score normalisation plus a bonus for chunks found in both sets. After that, the retrieval engine walks the type graph – if PlayerToggleFlightEvent is in the results, pull in its parent classes and related interfaces from the chunk metadata. The LLM gets the full picture of the type hierarchy, not just the matching chunk in isolation.

[Placeholder Figure 2 – pipeline flow diagram: query → keyword extraction → term expansion → (dense + BM25 in parallel) → score merge → metadata graph traversal → optional reranker → context cap → return.]

Nothing in the retrieval pipeline calls an LLM. Query embedding, BM25 scoring, hybrid fusion, graph traversal – all deterministic. Latency is sub-500ms on a laptop. The host LLM only sees the final context window.

A query end-to-end: “how do I apply knockback to a player”

Worked example. The user query lands in the MCP server’s search_api tool.

Keyword extraction runs first. PascalCase regex finds no class names. camelCase finds none. Dot-notation finds none. The tokeniser splits the query into [how, do, i, apply, knockback, to, a, player], then the stemmer runs. apply stays apply, knockback stays knockback, player stays player (none of them match a suffix rule).
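Roughly, the three extraction passes could look like the sketch below. The regexes are my approximation of "PascalCase, camelCase, dot-notation", not the project's actual patterns, and the stemmer is left out entirely. The Keywords field names mirror the ones used when the BM25 query is assembled; extractKeywords itself is a hypothetical name.

```typescript
interface Keywords {
	classNames: string[];
	methodNames: string[];
	qualifiedNames: string[];
}

// Assumed patterns: each needs at least two "humps" so plain words don't match.
const PASCAL_CASE = /\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]*)+\b/g; // ItemStack
const CAMEL_CASE = /\b[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+\b/g; // setVelocity
const DOT_NOTATION = /\b[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+\b/g; // Entity.setVelocity

function extractKeywords(query: string): Keywords {
	return {
		classNames: query.match(PASCAL_CASE) ?? [],
		methodNames: query.match(CAMEL_CASE) ?? [],
		qualifiedNames: query.match(DOT_NOTATION) ?? []
	};
}

// The knockback query contains no identifiers, so every list comes back empty.
const plain = extractKeywords('how do I apply knockback to a player');

// A query that names things directly populates all three lists.
const named = extractKeywords('does PlayerPickItemEvent call getItem or Entity.setVelocity');
```

A single capitalised word such as "Entity" is deliberately not treated as a class name here, since ordinary sentence-initial words would otherwise flood the list.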

Term-expansion lookup is where the interesting work happens. The domain dictionary has a knockback entry:

data/terms/core.json

"knockback": {
  "primary": ["EntityKnockbackEvent", "Vector", "LivingEntity"],
  "secondary": ["Entity.setVelocity"],
  "category": "entity"
}

Primary terms are the ones the Paper API actually uses. Secondary are supporting types. A stemmed single-word lookup in a pre-built Map pulls this entry in O(1). Multi-word lookups are supported too (e.g. “action bar”) via pre-compiled regexes, with a proximity fallback so non-adjacent words still match if they sit inside a six-token window.
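A sketch of the lookup, under a few assumptions: the knockback entry is copied from the dictionary above, while the "action bar" entry (its contents and its category) is made up for illustration. Stemming is skipped, and expandTerms and withinWindow are hypothetical names.

```typescript
interface TermEntry { primary: string[]; secondary: string[]; category: string; }

// Single-word terms: a plain Map gives the O(1) lookup.
const singleWordTerms = new Map<string, TermEntry>([
	['knockback', {
		primary: ['EntityKnockbackEvent', 'Vector', 'LivingEntity'],
		secondary: ['Entity.setVelocity'],
		category: 'entity'
	}]
]);

// Multi-word terms: a pre-compiled regex for the adjacent case, plus the word
// list for the proximity fallback. This entry is illustrative only.
const multiWordTerms: Array<{ words: string[]; pattern: RegExp; entry: TermEntry }> = [
	{
		words: ['action', 'bar'],
		pattern: /\baction\s+bar\b/i,
		entry: { primary: ['Audience.sendActionBar'], secondary: [], category: 'ui' }
	}
];

// True if some run of `windowSize` consecutive tokens contains every word.
function withinWindow(tokens: string[], words: string[], windowSize = 6): boolean {
	for (let start = 0; start < tokens.length; start++) {
		const window = tokens.slice(start, start + windowSize);
		if (words.every((w) => window.indexOf(w) !== -1)) return true;
	}
	return false;
}

function expandTerms(query: string): TermEntry[] {
	const tokens = query.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
	const hits: TermEntry[] = [];
	for (const t of tokens) {
		const entry = singleWordTerms.get(t);
		if (entry && hits.indexOf(entry) === -1) hits.push(entry);
	}
	for (const term of multiWordTerms) {
		if (term.pattern.test(query) || withinWindow(tokens, term.words)) hits.push(term.entry);
	}
	return hits;
}
```

The regex handles the common adjacent phrasing cheaply; the window check only matters for queries that split the phrase, such as "put the action text in the bar".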

Query construction for BM25. Primary expansions are repeated twice to give them a 2x term-frequency weight; secondary expansions appear once. Extracted class names get pulled in too, and if any class name already lives in the Qdrant index, its parent class, interfaces and related types get concatenated onto the end of the BM25 query string:

src/retrieval/search.ts

const bm25Query = [
	query,
	...keywords.classNames,
	...keywords.methodNames,
	...keywords.qualifiedNames.flatMap((q) => q.split('.')),
	...keywords.primaryExpansions,
	...keywords.primaryExpansions, // 2x TF boost for primary
	...keywords.secondaryExpansions,
	...classContext
].join(' ');

Parallel search. Dense vector search runs against the query embedding. Sparse BM25 runs against the expanded query. Both return their top K chunks. They merge: scores are normalised to [0, 1] within each set, and chunks that appear in both get their scores added – a genuine cross-index agreement signal.
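The merge step can be sketched in a few lines. I'm assuming min-max normalisation here, since the only stated requirement is that scores land in [0, 1] within each set; Hit, MergedHit and mergeHybrid are illustrative names rather than the project's.

```typescript
interface Hit { id: string; score: number; }
type Source = 'vector' | 'bm25';
interface MergedHit { id: string; score: number; sources: Source[]; }

// Min-max normalise one result set to [0, 1] (assumed normalisation).
function normalise(hits: Hit[]): Hit[] {
	if (hits.length === 0) return [];
	const scores = hits.map((h) => h.score);
	const min = Math.min(...scores);
	const range = Math.max(...scores) - min;
	return hits.map((h) => ({ id: h.id, score: range === 0 ? 1 : (h.score - min) / range }));
}

// Sum normalised scores per chunk id, so a chunk found by both searches
// ends up with a higher score than either search gave it alone.
function mergeHybrid(dense: Hit[], sparse: Hit[]): MergedHit[] {
	const merged = new Map<string, MergedHit>();
	const add = (hits: Hit[], source: Source): void => {
		for (const h of normalise(hits)) {
			const existing = merged.get(h.id);
			if (existing) {
				existing.score += h.score;
				existing.sources.push(source);
			} else {
				merged.set(h.id, { id: h.id, score: h.score, sources: [source] });
			}
		}
	};
	add(dense, 'vector');
	add(sparse, 'bm25');
	return Array.from(merged.values()).sort((a, b) => b.score - a.score);
}
```

Normalising per set matters because raw cosine similarities and raw BM25 scores live on different scales; adding them directly would let whichever has the larger range dominate.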

Graph traversal. The top merged hits get their related_types pulled. For the knockback query, EntityKnockbackEvent chunks reference Entity, Vector and LivingEntity in their metadata – those class-level chunks get added to the result set with a flat 0.3 score, below the search hits but high enough to stay in the context window.
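In sketch form, the expansion step looks something like this. The 0.3 flat score comes from the description above; the topN cutoff, the synchronous lookupClass callback (the real lookup would presumably be an async index query) and the type names are assumptions.

```typescript
interface Candidate {
	className: string;
	score: number;
	source: 'vector' | 'bm25' | 'metadata';
	relatedTypes: string[];
}

// Pull class-level chunks for the related types of the top hits and append
// them with a flat score, skipping anything already in the result set.
function addRelatedTypes(
	hits: Candidate[],
	lookupClass: (name: string) => Candidate | undefined, // hypothetical lookup
	topN = 5, // assumed cutoff for "top merged hits"
	flatScore = 0.3
): Candidate[] {
	const seen = new Set<string>(hits.map((h) => h.className));
	const extras: Candidate[] = [];
	for (const hit of hits.slice(0, topN)) {
		for (const name of hit.relatedTypes) {
			if (seen.has(name)) continue;
			seen.add(name);
			const chunk = lookupClass(name);
			if (chunk) extras.push({ ...chunk, score: flatScore, source: 'metadata' });
		}
	}
	return hits.concat(extras);
}
```

The flat score keeps graph-derived chunks from outranking anything the search actually matched, while the seen set stops a type referenced by several hits from being added more than once.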

Reranking and capping. A cross-encoder reranker (optional via USE_RERANKER=true) scores each candidate against the original query properly; final results get capped at ~20KB of text so noisy retrieval can’t flood the LLM’s context window.
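The cap itself is simple to sketch. This version counts characters as a stand-in for bytes and skips any chunk that would overflow the budget rather than truncating it; both choices are assumptions on my part, as is the capContext name.

```typescript
interface ContextChunk { text: string; score: number; }

// Keep the highest-scoring chunks that fit within the size budget.
function capContext(chunks: ContextChunk[], maxChars = 20000): ContextChunk[] {
	const ranked = chunks.slice().sort((a, b) => b.score - a.score);
	const kept: ContextChunk[] = [];
	let used = 0;
	for (const chunk of ranked) {
		// Skip (rather than truncate) a chunk that would overflow the budget,
		// so a method body is never returned half-cut.
		if (used + chunk.text.length > maxChars) continue;
		kept.push(chunk);
		used += chunk.text.length;
	}
	return kept;
}
```

Skipping whole chunks keeps the cap consistent with the chunking rule from earlier: a chunk is either present in full or absent.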

The host LLM sees a flat text block: each chunk prefixed with its source (vector, bm25, or metadata) and score. That’s all.

Type-graph traversal

The get_method tool does something the search_api tool doesn’t: it walks the inheritance hierarchy. If you ask for Player#teleport and the method isn’t declared directly on Player, you probably want the inherited method off LivingEntity or Entity:

src/retrieval/search.ts

for (let depth = 0; depth < 5 && queue.length > 0; depth++) {
	const batch = queue.splice(0, queue.length);
	const lookups = await Promise.all(batch.map((cls) => this.methodLookup(cls, methodName)));

	const found = lookups.flat();
	if (found.length > 0) return found;

	// Get parent types for next level
	const parentLookups = await Promise.all(batch.map((cls) => this.getClass(cls)));
	for (const results of parentLookups) {
		for (const r of results) {
			const { parent_class, interfaces } = r.chunk.metadata;
			if (parent_class && !visited.has(parent_class)) {
				queue.push(parent_class);
				visited.add(parent_class);
			}
			for (const iface of interfaces) {
				if (!visited.has(iface)) {
					queue.push(iface);
					visited.add(iface);
				}
			}
		}
	}
}

BFS up to five levels, per-level parallel method lookups, return on the first hit. Five levels is deep enough for every Paper type I’ve checked. The visited set stops cycles through interface inheritance.

The MCP surface

Four tools, all narrow:

src/mcp/server.ts

server.tool("search_api",       /* ... */, { query: z.string(), limit: z.number().optional() }, handler);
server.tool("get_class",        /* ... */, { className: z.string() },                             handler);
server.tool("get_method",       /* ... */, { className: z.string(), methodName: z.string() },     handler);
server.tool("get_related_types",/* ... */, { className: z.string() },                             handler);

The LLM decides which tool fits the question. Natural-language queries land on search_api. “What’s the signature of Player#sendTitle?” routes to get_method. “Show me everything related to PlayerInteractEvent” goes through get_related_types, which pulls the class plus every type it references.

There’s also one MCP resource – a static system prompt with Paper-specific conventions (use Adventure components not legacy ChatColor, never block the main thread, use PersistentDataContainer over metadata hacks). The host can pin it into every session if it wants. Most tools just read it once.

Stdio transport, no auth, one process. Plug it into Claude Code or Cursor with a three-line config entry.

The evaluation harness

The benchmark isn’t hand-scored. 100 Paper/Spigot questions live in src/eval/benchmark.ts with metadata: difficulty, tags, expected classes. Each question runs twice – once against the base LLM with just the system prompt, once with the full retrieval context injected. A judge model then scores both answers 1-5 on API correctness.

The judge prompt is deliberately strict and structured:

src/eval/compare.ts (judge prompt fragment)

// Rate each code answer 1-5 for API correctness:
// 1 = Wrong/hallucinated Java API class names or methods
// 2 = Vaguely correct approach but wrong class/method specifics
// 3 = Mostly correct API usage, minor inaccuracies
// 4 = Correct classes and methods, good code explanation
// 5 = Perfect — exact API names, correct method signatures, working code examples
//
// Output your review in exactly this format:
// SCORE_A: [1-5]
// SCORE_B: [1-5]
// REASON: [one sentence explaining the difference]

The “exactly this format” bit matters: a regex parses SCORE_A, SCORE_B and REASON out of the judge’s response. If either score parses as 0 the run logs the raw text for inspection; GLM’s thinking-mode responses occasionally drop the plain content field and put the answer in reasoning_content instead, which the extractor handles.
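A sketch of that parsing step, assuming the judge's reply arrives as an object with optional content and reasoning_content strings. The regexes, the tolerance for bracketed scores and the function names are illustrative; only the SCORE_A / SCORE_B / REASON labels and the "0 means parse failure" convention come from the description above.

```typescript
interface JudgeMessage { content?: string | null; reasoning_content?: string | null; }
interface Verdict { scoreA: number; scoreB: number; reason: string; }

// Prefer the plain content field; fall back to reasoning_content when the
// model left content empty.
function extractText(message: JudgeMessage): string {
	const content = (message.content ?? '').trim();
	return content.length > 0 ? content : (message.reasoning_content ?? '').trim();
}

// A score of 0 means "could not parse", which the caller can log and inspect.
function parseScore(raw: string, label: string): number {
	const match = new RegExp(label + ':\\s*\\[?([1-5])\\]?').exec(raw);
	return match && match[1] ? Number(match[1]) : 0;
}

function parseVerdict(raw: string): Verdict {
	const reasonMatch = /REASON:\s*(.+)/.exec(raw);
	return {
		scoreA: parseScore(raw, 'SCORE_A'),
		scoreB: parseScore(raw, 'SCORE_B'),
		reason: reasonMatch && reasonMatch[1] ? reasonMatch[1].trim() : ''
	};
}
```

Returning 0 instead of throwing keeps a single malformed judge reply from aborting a long benchmark run.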

Using the same model as both answerer and judge (GLM-5.1 on both sides) is less clean than using a different model for judging. It’s how this particular evaluation is set up because I was on a single coding-plan API quota and didn’t want to pay for two. That’s a known weakness – the judge has the same blind spots as the answerer. The absolute scores are softer than they look; deltas (base vs MCP) still mean something because both answers face the same judge.

Every run writes a full JSON log to logs/compare/ – every prompt, every response, every parse – so I can audit a case after the fact.

The benchmark numbers

v1 ran 20 questions against GLM-5. MCP won 9, base won 4, 5 ties. +7.9% overall. Fine.

Then I overhauled the domain term matching – stemming, synonym expansion, proximity matching for multi-word terms, ~370 curated terms. The idea: if the query says “knockback” and the Paper API uses EntityKnockbackByEntityEvent, the term dictionary bridges that gap. v2 re-ran on 70 questions (38 marked hard).

Hard cases: MCP wins 25, base wins 5, 8 ties. +27.1%. Win:loss ratio 5:1.

Full run (70 cases): +17.3%, 2.6:1 win:loss.

[Placeholder Figure 3 – benchmark bar chart: base vs MCP average scores across v1, v2-full and v2-hard, with win:loss ratio annotations. Data in docs/COMPARISON.md.]

The strongest wins were exactly the queries where the base model has nothing: PlayerPickItemEvent (+4), BeaconEffectChangeEvent (+3), DatapackManager API (+3), PlayerProfile/PlayerTextures (+3). Paper-specific APIs that postdate most training data, or are obscure enough to have never been well-covered.

Where it’s still wrong

Two notable losses in v2 were cases 78 and 84 – chunk generation and armour trims, both 4→3. The pattern was familiar from v1: case 28, also chunk generation, had 41KB of retrieval context and the LLM hallucinated generateNoise() and generateSurface(), methods that don’t exist. The context noise caused it to fabricate things that looked plausible given what it was reading.

More context is not better. There’s a point where the context window is full of loosely-related code and the model starts confabulating. Case 27 in v1 had the same pattern – retrieval surfaced similar-sounding event classes, and the LLM picked the wrong one: PlayerOpenSignEvent instead of PlayerSignOpenEvent. Two words swapped, different event, wrong answer.

The fix for this class of problem is a reranker. Right now the retrieval engine returns the top-N hybrid search results, and “top-N” is doing a lot of work – some of those chunks are genuinely relevant, some are structurally similar but semantically off. A cross-encoder reranker scores each candidate directly against the original query, instead of relying on the fused retrieval scores alone.

The 5:1 win ratio looks good, but the losses are the part worth watching. Both v1 and v2 worst-case losses came from noisy retrieval, not from retrieval failing to find anything. The pipeline is more dangerous when it’s confidently wrong than when it returns nothing.

The reranker is actually built – bge-reranker-base via Transformers.js with DirectML GPU, combined 60/40 with the hybrid score. But the v2 numbers above are pre-reranker; they came from the term-expansion overhaul alone. The reranker is optional (USE_RERANKER=false by default) and its uplift on the failing cases is still unmeasured. That’s the next A/B – flip the flag, re-run, see if 78 and 84 stop losing.
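For reference, a 60/40 blend could be as small as the sketch below. I'm assuming the 60% goes to the reranker and 40% to the hybrid score, and that both scores have already been scaled to [0, 1]; neither detail is spelled out above, and the names are illustrative.

```typescript
interface RerankCandidate { id: string; hybridScore: number; rerankScore: number; }

// Weighted blend of reranker and hybrid scores, sorted best-first.
// Assumes both inputs are already in [0, 1]; the 0.6 default goes to the reranker.
function blendScores(
	candidates: RerankCandidate[],
	rerankWeight = 0.6
): Array<{ id: string; score: number }> {
	return candidates
		.map((c) => ({
			id: c.id,
			score: rerankWeight * c.rerankScore + (1 - rerankWeight) * c.hybridScore
		}))
		.sort((a, b) => b.score - a.score);
}
```

Keeping part of the hybrid score in the mix means an exact identifier match from BM25 still counts for something even when the cross-encoder is lukewarm about the chunk.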

The benchmark sits at 70 cases right now. I want 100 hard-only cases before I’d call the evaluation credible. src/eval/ has the tooling for it – just needs more questions written.
