We replaced our LLM-powered search with a grep. Here's what happened.

I want to tell you about the worst engineering decision I was party to last year, partly because it cost us six weeks, partly because the correct decision was obvious in retrospect, and partly because the blog posts about LLM-powered search have been uniformly written by the people selling the LLM-powered search.

Here’s the setup. We run a small SaaS for industrial inspection reports. Customers upload PDFs, the system extracts structured data, and there’s a search box. The search box, as of mid-2025, was Postgres full-text. It worked. It was not great. The “ask a question” feature in our competitor’s product was, by all accounts, lovely.

So we replaced it. Of course we did.

The RAG arc

I will spare you the full vendor tour, but the shape of it is the shape everyone else’s took.

Vendor one was a hosted service that promised “drop-in natural language search.” The drop-in took three days, the natural language worked about 60% of the time, and the moment a customer phrased a question slightly differently from how the docs phrased it, we were back to “0 results.” We could not tune our way out of this. We tried.

Vendor two was an embedding model we ran ourselves. We embedded the documents, embedded the queries, did cosine similarity. This was better, in the sense that it was worse in different places. The “I want the report from the Q3 turbine inspection” query, which had been a one-token match against metadata, now returned three paragraph-long results about Q3 turbines, semantically clustered. The user could not have cared less.

Vendor three was a custom pipeline. We did the responsible thing: we built an evaluation set of 200 real queries with hand-labeled expected results. We instrumented the pipeline. We tracked the metric. We watched the number go up. Then we shipped it to a small percentage of customers, watched them abandon the search box at a higher rate than before, and rolled it back.

The pipeline was, by our own evaluation set and every LLM benchmark we ran, correct. The customers did not care. They wanted the report. They knew the date and the asset ID. They did not want a paragraph.

What we built instead

The thing we shipped in February is, in essence, a slightly fancier grep over a structured index. The query parser does three things, in order:

1. Try to match a date range, an asset ID, or an inspector name against indexed metadata.
2. Try to match a phrase against the document title or the auto-generated summary.
3. Fall back to Postgres FTS on the body.
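
For the concretely minded, the cascade above can be sketched in a few dozen lines of Python. This is an illustration, not the production code: the Report fields, the asset ID pattern, the ISO date format, and every function name are assumptions, and the body match is a plain substring stand-in for Postgres FTS.

```python
import re
from dataclasses import dataclass


@dataclass
class Report:
    asset_id: str
    inspector: str
    date: str      # ISO date string, e.g. "2025-08-14" (assumed format)
    title: str
    summary: str
    body: str


# Hypothetical formats: asset IDs like "TRB-0042", dates as YYYY-MM-DD.
ASSET_ID = re.compile(r"\b[A-Z]{2,4}-\d{2,6}\b")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def match_metadata(query, reports):
    """Step 1: asset ID, date (or date range), or inspector name."""
    asset_ids = set(ASSET_ID.findall(query.upper()))
    dates = sorted(ISO_DATE.findall(query))  # one date = that day, two = a range
    q = query.lower()
    hits = []
    for r in reports:
        in_range = bool(dates) and dates[0] <= r.date <= dates[-1]
        by_name = bool(r.inspector) and r.inspector.lower() in q
        if r.asset_id in asset_ids or in_range or by_name:
            hits.append(r)
    return hits


def match_phrase(query, reports):
    """Step 2: the whole query as a phrase in the title or summary."""
    q = query.lower().strip()
    return [r for r in reports
            if q and (q in r.title.lower() or q in r.summary.lower())]


def match_body(query, reports):
    """Step 3: stand-in for Postgres FTS; every query term must be in the body."""
    terms = query.lower().split()
    return [r for r in reports
            if terms and all(t in r.body.lower() for t in terms)]


def search(query, reports):
    """Run the steps in order and stop at the first one that returns anything."""
    for step in (match_metadata, match_phrase, match_body):
        hits = step(query, reports)
        if hits:
            return step.__name__, hits
    return "none", []
```

The point of returning the step name alongside the hits is that you can log which rung of the ladder answered each query. A cheaper step that answers is never overridden by a fancier one.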

There is a fourth, optional step: if the user types a question, we route it to a small model that extracts the structured predicate (“reports about turbine X in Q3”) and run that against step 1. The model never sees the documents. It only sees the query. The retrieval is grep. Always grep.
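
A rough sketch of that boundary, under stated assumptions: the Predicate fields, the JSON contract, and the function names are all hypothetical, and fake_extractor stands in for whatever small model you would actually call. What matters is the shape: the extractor gets the question string and nothing else, its output is validated against a fixed set of fields, and the retrieval that follows is an ordinary filter.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Predicate:
    asset_id: Optional[str] = None
    date_from: Optional[str] = None   # ISO dates, inclusive (assumed format)
    date_to: Optional[str] = None
    inspector: Optional[str] = None


ALLOWED = {"asset_id", "date_from", "date_to", "inspector"}


def parse_predicate(raw: str) -> Optional[Predicate]:
    """Validate the model's JSON; anything unexpected means 'no predicate'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not data or set(data) - ALLOWED:
        return None
    if not all(isinstance(v, str) for v in data.values()):
        return None
    return Predicate(**data)


def matches(row: dict, p: Predicate) -> bool:
    """Plain metadata filter; ISO date strings compare correctly as text."""
    if p.asset_id and row["asset_id"] != p.asset_id:
        return False
    if p.date_from and row["date"] < p.date_from:
        return False
    if p.date_to and row["date"] > p.date_to:
        return False
    if p.inspector and row["inspector"].lower() != p.inspector.lower():
        return False
    return True


def answer_question(question: str, rows: list, extract: Callable[[str], str]) -> list:
    # The extractor receives the question text only, never the documents.
    predicate = parse_predicate(extract(question))
    if predicate is None:
        return []   # caller falls back to the ordinary search steps
    return [row for row in rows if matches(row, predicate)]


def fake_extractor(question: str) -> str:
    """Stand-in for the small model; a real one would be an API or local call."""
    return json.dumps({"asset_id": "TRB-0042",
                       "date_from": "2025-07-01",
                       "date_to": "2025-09-30"})
```

If the model returns garbage, or a field nobody asked for, the question quietly degrades to a normal search. The worst a bad extraction can do is produce a wrong filter, which the user can see and correct.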

It is not magical. The “ask a question” affordance is gone, replaced by a more honest “search.” A handful of customers have complained, mostly in the form of “I miss the AI search.” We’ve had a long internal debate about whether to put it back.

We have not put it back.

What I actually learned

There are three things I think are true after this exercise, and I’d like them to be true for you also, because they were expensive.

First: the gap between “the LLM got the right answer” and “the user found what they wanted” is much larger than LLM evaluation tools suggest. Every benchmark we ran said we were winning. Every customer behavior metric said we were losing. The benchmarks were measuring the model’s accuracy. The customers were measuring the product’s usefulness. These are not the same number, and we had no good way to measure the second one until we shipped.

Second: the right question is almost never “what if we added a model to step X.” The right question is “what is the user actually trying to do, and what is the cheapest system that does that.” A model is a candidate answer to the second question, but it is rarely the cheapest. The reason it gets picked anyway is that, when the team is standing around in a planning meeting, “use an LLM” is the proposal that fills the silence. It is a default, not a conclusion.

Third: there is no version of this story where the postmortem makes us look good, and I’m writing it anyway. The honest version of “we tried two vendors and a custom pipeline and ended up with grep” is not a story with a hero. But it is a real story, and I think there are people at companies right now who are spending weeks on the same arc, in part because the only stories on the internet about LLM search are the success stories, and the success stories are mostly from the vendors.

If you are in week two of your RAG project, you are not in trouble. If you are in week six and the evaluation metric looks fine but the customer behavior does not, you are in trouble, and the answer is probably not a better model.

The answer is probably grep.