
The gap between a compelling GenAI demo and a working production system is wider than most organisations expect. And the reason is almost never the technology.
It is governance. Or rather, the absence of it.
We have built and shipped GenAI products at Xomnia across telecom, healthcare, logistics, and maritime over the past few years. The same three failure modes show up every single time, regardless of industry, company size, or the sophistication of the team involved. This article addresses all three directly, with the approaches we actually use. Not theory. Approaches you can start applying this quarter.
The three areas are: Access and Control, Unstructured Data Quality, and LLM Ops and Monitoring. Together they form a practical framework for getting GenAI out of demo mode and into production that holds up under real conditions.
Shadow AI is already happening in your organisation. When employees do not have the right tools, they find their own. That means sensitive company data ends up in consumer-grade ChatGPT interfaces with no oversight, no audit trail, and no policy enforcement. The answer is not to restrict access. The answer is to give people something better to use.
Start by giving business users an AI accelerator: a place to experiment with AI and use it safely. Large vendors such as Anthropic and OpenAI offer good options for building with generative AI without writing any code. These are well suited to finding out whether an idea works the way you expect.
If you prefer to be less dependent on technology providers, there are plenty of open-source alternatives. Tools like Open WebUI, which we have deployed internally, offer the full functionality of enterprise GenAI solutions: custom model configurations, local model hosting, knowledge base management, and pipeline integration, all in an open-source platform you actually control. When business users can experiment with system prompts and structured outputs in a safe environment, use-case discovery accelerates fast. You stop waiting for IT to find the right use cases and start getting them from the people closest to the work.
One thing we have learned from this: steer users toward simple, well-scoped process steps rather than full autonomous agents. The right first question is "can I use an LLM to scan Excel files or PowerPoint presentations at scale?" Building a multi-step autonomous agent from week one is how projects stall.
Adoption, though, is as much about placement as it is about access. Asking someone to leave their primary workflow, navigate three login screens, and open a separate chatbot tab is functionally asking them not to use GenAI at all. In a healthcare project we delivered, clinicians were managing patient interactions across a chat application, a medical information system, and several supporting tools simultaneously. We integrated LLM-generated responses, summarisations, and priority assessments directly into their primary application window via a Chrome extension and custom UI components. Adoption improved significantly. Not because the model was better. Because the capability was right where the work was happening.
Once GenAI applications reach production, model management becomes a governance challenge of its own. Teams build their own PII-redaction logic in isolation. Compliance standards get implemented inconsistently. Model updates require changes in multiple places across multiple teams. An API gateway resolves this by routing all application calls through a centralised layer, whether you are running on Azure, AWS, or Google Vertex AI. PII filtering, safety guardrails, and compliance standards get enforced once and distributed everywhere. API key management becomes centralised. Token quotas become team-specific. Policy becomes code. We route our own internal traffic through OpenRouter for exactly this reason.
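To make "policy becomes code" concrete, here is a minimal sketch of what a gateway layer might enforce before a prompt reaches any model. It is not how OpenRouter or any cloud gateway works internally. The `Gateway` class, the `call_model` callable, the two regular expressions, and the word-count token estimate are all illustrative assumptions; a real deployment would use a dedicated PII detector and the provider's own tokeniser.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Illustrative patterns only (assumption): real PII detection needs a dedicated tool.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


class QuotaExceeded(Exception):
    """Raised when a team would exceed its token allowance."""


@dataclass
class Gateway:
    team_quotas: dict[str, int]                      # tokens allowed per team
    used: dict[str, int] = field(default_factory=dict)

    def handle(self, team: str, prompt: str, call_model: Callable[[str], str]) -> str:
        clean = redact_pii(prompt)                   # policy applied once, for every caller
        cost = len(clean.split())                    # crude token estimate (assumption)
        spent = self.used.get(team, 0)
        if spent + cost > self.team_quotas.get(team, 0):
            raise QuotaExceeded(team)
        self.used[team] = spent + cost
        return call_model(clean)                     # only the redacted prompt leaves the gateway
```

Every application calls `Gateway.handle` instead of a model SDK directly, so redaction rules and quotas change in one place rather than in each team's codebase.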
A knowledge agent that retrieves the wrong information confidently is worse than no knowledge agent at all. And most organisations underestimate how quickly a poor knowledge base destroys trust in an otherwise solid system.
The typical starting point for a RAG project is a data dump: thousands of documents, multiple versions, inconsistent metadata, and statements that contradict each other depending on which document you read. The LLM has no way to resolve these conflicts. It picks one answer or blends them together into something that sounds plausible but is not quite right. Users notice. They stop trusting the system. That trust is hard to rebuild.
We experienced this directly with Genie, our internal knowledge agent. Genie answers questions about consultant availability and past project experience by querying CVs and project records. It works well, until it hits conflicting knowledge. An English CV and a Dutch CV for the same consultant may contain different information. Both get retrieved for the same query. The output becomes ambiguous, and a human has to verify the answer manually. That defeats the purpose entirely.
The fix is a data quality experiment before you ingest at scale. Even 10 to 50 documents is enough to surface systemic issues. The approach has three steps.
First: run a similarity search. For each chunk in your vector database, retrieve its nearest neighbours and flag candidate pairs above a set similarity threshold. These are chunks that may conflict with each other.
Second: run a redundancy check. Pass similar chunk pairs through a structured prompt to identify full or partial duplicates. Start with high-confidence findings and work down.
Third: run a contradiction check. For chunks that are not exact duplicates, check for disagreeing factual statements. A simple contradiction prompt, taking two chunks and asking the LLM to identify conflicting claims, surfaces issues you would never find manually at scale.
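The first and third steps can be sketched in a few lines of plain Python. This is a brute-force illustration under stated assumptions, not our production pipeline: embeddings are assumed to be precomputed and held in a dictionary, the 0.85 threshold is an arbitrary starting point, and the prompt wording is only an example. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings: dict[str, list[float]], threshold: float = 0.85):
    """Compare every chunk with every other chunk and keep pairs above the threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    # Highest similarity first, so review starts with the most likely conflicts.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Example wording only (assumption); tune it for your own documents.
CONTRADICTION_PROMPT = """You are checking a knowledge base for conflicts.
Chunk A:
{a}

Chunk B:
{b}

List every factual claim in Chunk A that disagrees with a claim in Chunk B.
Reply with a JSON list of objects with the keys claim_a and claim_b,
or an empty list if the chunks are consistent."""


def build_contradiction_prompt(chunk_a: str, chunk_b: str) -> str:
    """Fill the template for one candidate pair; send the result to your LLM of choice."""
    return CONTRADICTION_PROMPT.format(a=chunk_a, b=chunk_b)
```

Comparing all pairs grows quadratically with the number of chunks, which is acceptable for a 10 to 50 document experiment but becomes the bottleneck at full scale.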
This experiment also gives you an objective baseline for evaluating vendors who offer data quality tooling. In our own work, a brute-force chunk comparison approach outperformed a commercial SaaS solution on detection quality. Scalability is the remaining challenge we are actively working through.
The goal is a single source of truth. A knowledge base where the LLM only ever encounters consistent, fresh, and reliable information. That is the only version of RAG that holds up in production.
Most GenAI projects start with a print statement. That is fine in a notebook. It does not scale.
Once your application is live, you need observability across the full pipeline. Not just token costs and usage counts, but full trace visibility. Did the right information get retrieved at this step? Did the tool call execute correctly? Is the answer faithful to what is in the knowledge base? Without this, debugging a broken pipeline is guesswork. You know something went wrong somewhere. You do not know where.
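As an illustration of what step-level tracing captures, here is a small standard-library sketch. It does not use the API of any tracing product; the `Trace` class, the step names, and the example inputs are hypothetical. The point is only that each pipeline step records its inputs, output, duration, and any error, so a failure can be pinned to a specific step.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects one record per pipeline step for a single request."""

    def __init__(self, name: str):
        self.name = name
        self.spans = []

    @contextmanager
    def span(self, step: str, **inputs):
        record = {"step": step, "inputs": inputs, "output": None, "error": None}
        start = time.perf_counter()
        try:
            yield record                      # the step writes its output into the record
        except Exception as exc:
            record["error"] = repr(exc)       # keep the failure, then let it propagate
            raise
        finally:
            record["seconds"] = time.perf_counter() - start
            self.spans.append(record)

    def first_failure(self):
        """Name of the first step that raised, or None if every step succeeded."""
        return next((s["step"] for s in self.spans if s["error"]), None)


# Usage with made-up steps: retrieval succeeds, the tool call fails.
trace = Trace("availability-question")
with trace.span("retrieval", query="who is available in March?") as s:
    s["output"] = ["cv_en", "cv_nl"]
try:
    with trace.span("tool_call", tool="calendar") as s:
        raise TimeoutError("calendar lookup timed out")
except TimeoutError:
    pass
```

After this run, `trace.first_failure()` returns `"tool_call"`, and the retrieval span still shows which chunks were fetched before the failure.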
We use Langfuse to evaluate and optimise Genie. It provides granular tracing across RAG calls, tool calls, and LLM completions. You can see exactly where a response went wrong, whether the issue sits in retrieval, tool selection, or prompt instruction, and optimise accordingly. Beyond tracing, Langfuse enables continuous evaluation against a golden data set: a curated benchmark covering faithfulness, answer completeness, retrieval accuracy, and custom metrics you define based on what your application actually needs to do.
The golden data set is worth spending time on. Building a meaningful benchmark requires covering distinct sources, edge cases, rare entities, and paraphrased queries. Testing only against straightforward single-source questions is like writing unit tests while ignoring integration tests. Coverage looks good on paper and fails in production.
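One way to see whether a benchmark is lopsided is to report results per query category rather than as a single number. The sketch below computes a retrieval hit rate per category. It assumes each benchmark case is a dictionary with `question`, `category`, and `expected_source` keys, and that `retrieve` is your own function returning a ranked list of source ids; these names are illustrative, not part of any tool mentioned here.

```python
from collections import defaultdict
from typing import Callable


def hit_rate_by_category(cases: list[dict], retrieve: Callable[[str], list[str]], k: int = 5) -> dict[str, float]:
    """Share of cases per category whose expected source appears in the top-k results."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in cases:
        retrieved = retrieve(case["question"])[:k]
        totals[case["category"]] += 1
        if case["expected_source"] in retrieved:
            hits[case["category"]] += 1
    return {category: hits[category] / totals[category] for category in totals}
```

A high score on single-source questions next to a low score on paraphrased or rare-entity questions is exactly the gap that an aggregate metric hides.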
We recommend synthetic generation with human validation. Define diverse query templates that reflect real user intent. Generate synthetic Q&A pairs across your knowledge base. Have domain experts validate and correct them. Assign clear ownership: the person who knows the data best should own the benchmark for that data domain. Set update triggers so that when a CV is updated, new Q&A pairs are generated for validation. When a policy document changes, the relevant benchmark entries get refreshed. Tools like DSPy can use these data sets to systematically optimise prompt configurations as your application and knowledge base grow. Without this discipline, you end up with static benchmarks reporting green while production quality quietly degrades.
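An update trigger can be as simple as storing a fingerprint of the source document alongside each validated Q&A pair. The sketch below is one possible approach, not a feature of Langfuse or DSPy: `BenchmarkEntry`, its fields, and `stale_entries` are hypothetical names, and SHA-256 over the raw text is an assumed definition of "the document changed".

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    """Stable hash of a source document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class BenchmarkEntry:
    question: str
    expected_answer: str
    source_id: str      # document the answer was validated against
    source_hash: str    # fingerprint of that document at validation time
    owner: str          # domain expert responsible for this entry


def stale_entries(entries: list[BenchmarkEntry], current_sources: dict[str, str]) -> list[BenchmarkEntry]:
    """Entries whose source document has changed or disappeared since validation."""
    stale = []
    for entry in entries:
        text = current_sources.get(entry.source_id)
        if text is None or fingerprint(text) != entry.source_hash:
            stale.append(entry)
    return stale
```

Running `stale_entries` after each ingestion gives every owner a concrete list of Q&A pairs to regenerate and revalidate, instead of relying on someone remembering that a CV or policy changed.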
You do not need to be at an optimised level across all three pillars before you ship your first application. What you do need is to make the right foundational choices early. How you structure your API layer, how you design your ingestion pipeline, and what you choose to measure are all far easier to get right upfront than to fix once you are running in production.
Pick one domain this quarter and map where you currently stand across these three areas. Is shadow AI a real risk in your organisation? Do your knowledge bases have a single source of truth? Can you trace a failed LLM response back to its root cause?
Those three questions tell you more about your production readiness than any maturity framework will.
If the answers are unclear, that is the starting point.
Analytics Translator at Xomnia
