How mcp.sv Does Multilingual Search and Structured RAG

mcp.sv is the national MCP for El Salvador — a knowledge service that lets
an AI assistant answer grounded questions about Salvadoran laws, decrees, taxes, and government
procedures, with citations. It has two hard problems baked in:

The corpus is almost entirely in Spanish (decretos, leyes, normativas, DTE specs), but
users ask in Spanish, English, and Russian.
It's a retrieval system for legal text, where "close enough" isn't good enough — you need
the right article of the right decree, and you need to cite it.

The interesting part is how it solves this. Plain "embed everything, do a vector search" — the
default RAG everyone starts with — is the floor, not the ceiling. mcp.sv stacks several
layers on top, and crucially, it does almost all the expensive work at ingest time so the
public query path does zero model inference. This post walks through every layer, with the
real code.

Architecture in one line: a private "Plan A" box (Mac Studio + LM Studio) does all the
crawling, OCR, embedding, summarizing, and translating; a public "Plan B" server (Postgres +
pgvector + an ASP.NET Core API + the MCP endpoint) only reads and blends. The query side
never calls a model.

Part 1 — Digesting queries in three languages

The naive approach is "detect the query language, translate it, then search." mcp.sv
deliberately avoids runtime language detection and runtime translation on the server. Instead it
makes the documents findable in all three languages ahead of time, and leans on Postgres
full-text search. Four techniques stack up.

1. Synthetic-text aliases (the workhorse)

Every document gets an extra, invisible field — synthetic_text — full of English and Russian
keywords mapped from the Spanish content. The mapping comes from one shared file,
term_dictionary.json: 92 curated Spanish→(English, Russian) term pairs covering the legal and
financial vocabulary that actually shows up in queries.

{
  "ley":     ["law statute",              "закон"],
  "decreto": ["decree",                   "декрет указ"],
  "iva":     ["vat value added tax",      "ндс налог добавленную стоимость"],
  "bitcoin": ["bitcoin",                  "биткоин криптовалюта"],
  "trabajo": ["work labor employment",    "труд работа занятость"]
}

The same dictionary feeds two code paths so they can never drift. At ingest, the C# crawler
builds the alias string in-process — longest keys first, with word-boundary matching so iva
doesn't fire inside privado:

// SyntheticTextBuilder.cs — runs during ingest
foreach (var (key, en, ru) in _entriesByDescendingKeyLength)
{
    var pattern = $@"(?<!{spanishLetter}){Regex.Escape(key)}(?!{spanishLetter})";
    if (Regex.IsMatch(combined, pattern))
    {
        enBuilder.Append(en);
        ruBuilder.Append(ru);
    }
}

A Python script, synthetic_text_backfill.py, can run the same mapping as a backfill — and it
has a smarter qa mode that asks a local LLM to generate the questions users would
actually type, in all three languages:

You generate likely user questions about legal and government documents from
El Salvador. Given a document's title and summary, output 5-7 natural-language
questions in EACH of three languages: English, Spanish, and Russian.

So a Spanish decree about electronic invoicing becomes findable by an English speaker typing
"how do I register as a DTE issuer?" — because that exact question was generated and indexed
against the document ahead of time. The synthetic_text column is indexed under Postgres's
'simple' FTS config (case-folding only, no stemming — it's language-agnostic on purpose).

2. Stored translations with language-specific stemmers

For the title and summary, mcp.sv goes further than aliases and stores real translations in
dedicated columns — title_en/summary_en, title_ru/summary_ru — each indexed under the
matching Postgres stemmer:

CREATE INDEX ix_documents_en_fts ON documents USING gin (
    to_tsvector('english', coalesce(title_en,'') || ' ' || coalesce(summary_en,'')));

CREATE INDEX ix_documents_ru_fts ON documents USING gin (
    to_tsvector('russian', coalesce(title_ru,'') || ' ' || coalesce(summary_ru,'')));

Why per-language configs matter: the 'russian' Snowball stemmer knows that законов,
закону, and закон are the same word. A generic index wouldn't, and inflected languages would
quietly fail to match.

3. Unaccent folding for Spanish

Spanish speakers type without accents constantly — "constitucion," "codigo," "informacion." A
custom immutable Postgres function wraps both sides of the comparison so accents stop
mattering:

EF.Functions.ToTsVector(TsConfig, UnaccentImmutable(d.Title))
    .Matches(EF.Functions.WebSearchToTsQuery(TsConfig, UnaccentImmutable(trimmed)))

Now "constitucion" matches "Constitución" without anyone having to think about it.

4. OR-rewriting the alias query for recall

There's a subtle mismatch: full-text queries default to AND semantics, but the synthetic
field holds compact aliases, not full sentences. So when searching the alias index, mcp.sv
rewrites the query to OR:

"Как биткоин работает"  →  Как OR биткоин OR работает

That trades a little precision for a lot of recall on the cross-language layer — exactly where
you want recall, because a separate, higher-weighted layer (the real title match) handles
precision.

How it ranks

All four signals run in one query and are summed with deliberate weights — the Spanish title
dominates, aliases are deliberately discounted so a synonym-only hit never outranks a real one:

Score = SpanishTitleRank + SpanishSummaryRank
      + 0.5 * SyntheticTextRank      (EN/RU aliases — recall, discounted)
      + 1.0 * EnglishTranslationRank
      + 1.0 * RussianTranslationRank

And the MCP tool itself nudges the client to stay query-side: its description literally says
"Free-text query in Spanish. Translate from English/Russian/other languages first." The
multilingual heavy lifting is in the index, not in a per-request model call.

Part 2 — Better retrieval through structure and extra layers

Cross-language is half the story. The other half is making retrieval good — and that's where
"structured RAG" comes in. Vector search alone is fuzzy; legal text rewards structure.

Heading-aware, page-aware chunking

Documents aren't split into blind 2000-character windows. The chunker (mirrored in C# and
Python) carries structure into every chunk: the page it came from, and the nearest legal
heading above it.

MAX_CHARS  = 2000
OVERLAP    = 200

HEADING_RE = re.compile(r'(?i)(?:'
    r'(?:art(?:\.|[íi]culo)\s+\d+\b\.?-?)'   # Art. 7, Artículo 35
    r'|(?:cap[íi]tulo\s+[IVXLCDM\d]+\b)'     # CAPÍTULO IV
    r'|(?:t[íi]tulo\s+[IVXLCDM\d]+\b)'       # TÍTULO III
    r'|(?:secci[óo]n\s+[\wáéíóúñ]+\b)'       # SECCIÓN PRIMERA
    r'|(?:libro\s+[IVXLCDM\d]+\b)'           # LIBRO PRIMERO
    r'|(?:decreto\s+n[º°]?\s*\d+)')          # DECRETO N° 233
)
HEADING_LOOKBACK = 4000   # look back ~2 pages for the governing heading

Page numbers come from the form-feed (\x0C) markers that pdftotext/Tesseract leave
between pages, so the chunker can compute which page each chunk lives on. Each chunk ends up
with metadata like:

{ "page": 3, "chars": 1847, "offset": 4200, "heading": "Art. 57" }

That metadata is the difference between "the model said something about taxes" and "Art. 57 of
Decreto N° 233, page 3 says…" — real, checkable citations. The 200-char overlap exists so a
sentence split across a chunk boundary still survives intact in at least one chunk.

The vector layer

Chunks are embedded with nomic-embed-text-v1.5 (768-dimensional, good ES/EN quality) via
LM Studio's OpenAI-compatible /v1/embeddings, and stored in pgvector with an HNSW index
for fast approximate nearest-neighbor search using cosine distance:

CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

The embedding model name and dimension count are pinned in one constant (Embeddings.cs)
because changing them means re-embedding the entire corpus and a schema migration — it's a
load-bearing decision, treated as one.

Hybrid: keyword + vector, blended

Pure vector search drifts; pure keyword search misses paraphrases. mcp.sv blends three signals
and lets the caller tune the mix:

var keywordSide = Math.Max(docKeywordRank, chunkKeywordRank);  // best of doc- or chunk-level FTS
var blended = hasEmbedding
    ? alpha * keywordSide + (1 - alpha) * vectorSimilarity
    : keywordSide;                                             // alpha defaults to 0.5

The two keyword sub-signals matter: document-level FTS (title weighted heavily) wins for
literal lookups like "Ley Bitcoin," while chunk-level FTS rescues natural-language questions
whose words appear in the body but not the title. Vector similarity then adds semantic recall on
top.

The extra information layers

Beyond the chunk text, several layers ride along to sharpen results:

Generated summaries. During ingest, a local LLM writes a neutral 2–3 sentence Spanish
summary per document. It's searchable (FTS weight B, below the title) and doubles as the
result snippet when nothing better is available.
Typed legal metadata. decreto_number, decreto_date, last_reforma_date, regulator
(DGII, SSF, BCR…), audience, and category — all nullable, btree-indexed — so you can
filter to "fiscal docs from the DGII" before ranking.
Trust / provenance. Each source has a TrustLevel (Official › Verified › Community) and
a registry entry (name, URL, country), so citations carry authority, not just a link.

Why precompute everything?

The thread running through all of it: aliases, translations, summaries, embeddings, chunk
metadata — every expensive, model-driven step happens once, at ingest, on the private box. The
public server is a read-only Postgres + a thin API; a query is just a few indexed tsvector
matches plus an optional pgvector lookup, blended in SQL. That's what keeps the national MCP
fast, cheap, and resilient: the slow, messy, GPU-hungry work is done long before a user ever
types a question.

The takeaway

If your RAG is "chunk → embed → cosine search," you've built the floor. The ceiling looks like
this:

Bridge languages in the index, not per request — synthetic aliases + stored translations +
language-specific stemmers beat a translate-on-every-query loop.
Carry structure into chunks — headings and page numbers turn answers into citations.
Blend keyword and vector — they fail in opposite ways; together they cover each other.
Layer metadata and summaries — typed filters and generated summaries add precision a raw
embedding can't.
Precompute the expensive parts — keep the read path inference-free.

None of these are exotic; the win is in stacking them. Curious about a specific layer, or
building something similar? Find me on the links on the about page.