How to transform any website into a chatbot using LLMs?

Tue Jun 10 2025
Technology
LLM
Topic
Data Science
Generative AI
A common frustration among expats working at Xomnia is the occasional struggle with the communication style of the Dutch government. When a colleague accidentally missed the deadline on a fine - which itself turned out to be a fine for missing an earlier fine - this “fineception” led us to wonder how great it would be to have a chatbot that could answer all questions related to the City of Amsterdam. After all, the relevant information is on their website, but it’s spread across multiple pages that are several clicks deep.

This inspired a discussion on how to use Large Language Models (LLM) to turn any website into a chatbot. By extracting the content from a website and using LLMs to understand and respond to natural language questions about that content, we could create a much more accessible interface to the information.

While our initial motivation came from the challenges with the Amsterdam municipality website, the approach we'll describe can be applied to virtually any website – from corporate knowledge bases and product documentation to educational resources and government portals. The fundamental problem is universal: information exists but is difficult to access efficiently. Our solution provides a more intuitive way to interact with that information through natural language questions.

This blog explains what building such a chatbot involves, how to do it, and the peculiarities of creating a chatbot based on LLMs.

Note: This is an updated version of our 2023 blog post. The field of AI and LLMs has evolved rapidly since then (with new models and more efficient implementations), so we've refreshed this guide to reflect the current state of the art in 2025 while maintaining our original approach. Our solution now dramatically reduces API costs by caching responses to commonly asked questions (making website chatbots economically practical for most organizations while maintaining effectiveness).

From Traditional NLP to Modern Approaches

Traditional NLP-based chatbots typically rely on rule-based systems, pattern matching, template-based responses, and neural networks trained on specific question-answer pairs. These techniques continue to work for specific applications but require either extensive datasets of question-answer pairs or human-defined rules.

In contrast, modern LLM-based approaches offer greater flexibility by leveraging general language understanding capabilities combined with contextual information. Rather than requiring exhaustive question-answer pairs for every possible query, these systems can process and reason about content dynamically when provided with relevant context.

For websites like the City of Amsterdam portal, this means we can leverage the wealth of information already published without creating a new dataset of specific questions and answers. The same approach applies to any content-rich website, from corporate documentation to e-commerce platforms.

Choosing the Right LLM and Hosting Option

Throughout 2025, the LLM landscape has continued to evolve rapidly, with increasingly powerful models released by major AI labs and tech companies. When building a website chatbot, you have two main options:

  1. Fully Managed LLMs: Cloud-based API services like OpenAI's GPT, Anthropic's Claude, or similar offerings from Google, Cohere, and others
  2. Self-Hosted Models: Open-source models like Llama, Mistral, or various other options you can run on your own infrastructure

After experimenting with both approaches, we've found that managed LLM services make more sense for most website chatbot projects. Think of it like choosing between generating your own electricity versus connecting to the power grid - the latter is simply more practical for most homes.

Managed APIs eliminate infrastructure headaches while providing seamless scaling, automatic improvements, and generally higher quality responses. Yes, there are trade-offs: usage-based costs, less control over the model, and data privacy considerations. But with proper implementation strategies (discussed later), these concerns can be mitigated while preserving the quality and simplicity advantages.

Self-hosting does make sense in some scenarios - particularly for data sovereignty requirements or implementing more advanced caching strategies that require access to model internals. However, just be prepared for substantial hardware investments, specialized expertise requirements, and development time spent on infrastructure rather than your core product.

As we noted in our 2023 blog, and as remains true today: "Only self-host an LLM if you really need to!"

Figure 1: Self-Hosted GPUs vs. API-Based LLM Access (Image courtesy of S. Anand)

Prompt Crafting: The Foundation of LLM Applications

One of the most crucial aspects of working with LLMs is effective prompt crafting. When building a website chatbot, we use a template-based approach that combines the user's question with relevant context:

Answer this question:

{question}

By using this context:

{document 1}
{document 2}
{document 3}
{document …}

The LLM doesn't need to be fine-tuned on your website's content. Instead, it uses its general language understanding capabilities to interpret the question and generate an answer based on the provided context.
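As a rough illustration, here is a minimal sketch of how the template could be filled in and sent to a hosted LLM API in Python. The model name, the temperature, and the idea of joining the retrieved documents into a single context string are illustrative assumptions, not a prescribed implementation.

```python
from openai import OpenAI  # any hosted chat-completion API would work similarly

client = OpenAI()  # assumes an API key is configured in the environment

PROMPT_TEMPLATE = """Answer this question:

{question}

By using this context:

{context}
"""

def answer_question(question: str, documents: list[str]) -> str:
    """Fill the template with the user's question and the retrieved documents, then call the LLM."""
    prompt = PROMPT_TEMPLATE.format(question=question, context="\n\n".join(documents))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep answers grounded in the provided context
    )
    return response.choices[0].message.content
```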

Extracting and Processing Website Content

To build an effective website chatbot, we need to extract and process the relevant content. Our approach involves the following steps (see the sketch after this list):

  • Web Crawling: Using a crawler to systematically navigate the website, always respecting the rules in robots.txt
  • Content Extraction: Isolating the meaningful content from navigational elements, headers, footers, and other non-essential parts
  • Text Processing: Applying techniques such as removing repetitive elements, eliminating navigation snippets, standardizing formats, and normalizing text
  • Content Organization: Using URL structure to assign topics to different pages, creating a logical content hierarchy
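To make these steps concrete, here is a minimal crawling-and-extraction sketch using requests and BeautifulSoup. The example URL, the list of tags to strip, and the topic-from-URL rule are assumptions for illustration; a production crawler (such as Crawl4AI) would also need to honour robots.txt and rate limits.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def extract_page(url: str) -> dict:
    """Download one page and keep only its meaningful content."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Content extraction: drop navigation, headers, footers, scripts and other chrome
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()

    # Text processing: collapse whitespace in the remaining text
    text = " ".join(soup.get_text(separator=" ").split())

    # Content organization: derive a topic from the URL structure,
    # e.g. https://example.org/parking/permits -> "parking" (illustrative rule)
    topic = urlparse(url).path.strip("/").split("/")[0] or "home"

    return {"url": url, "topic": topic, "text": text}

pages = [extract_page(url) for url in ["https://example.org/parking/permits"]]  # placeholder URL
```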

Our Approach: RAG with Multi-tier Caching

Before explaining our implementation, Table 1 below summarizes the landscape of techniques for building website chatbots.

Understanding Different Approaches

Table 1: Key Characteristics of RAG, CAG, and Simple Caching

As we evaluated these methods, we found that none provided the ideal balance we were seeking, since each has limitations for practical website chatbot implementations. RAG requires complex infrastructure and has higher ongoing costs, CAG needs access to model internals that commercial APIs don't provide, and simple response caching lacks semantic understanding. These limitations inspired us to develop our hybrid approach. Here's how it works:

1. Content Processing and Vector Caching

We first build and cache a searchable knowledge base (see the sketch after this list):

  1. Website Crawling: As mentioned above, we extract content from the target website
  2. Content Processing: This content is split into semantic chunks that fit within context windows
  3. Vector Embedding: Each chunk is converted into a numerical vector representation
  4. Vector Store Creation: These vectors are organized into a searchable index
  5. Index Caching: The complete vector store is saved to disk with expiration metadata
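A minimal sketch of this pipeline is shown below; it builds on the `pages` produced by the crawling sketch above. The chunk size, the weekly expiry, the file paths, and the `embed` placeholder (standing in for whichever embedding model you choose) are all illustrative assumptions.

```python
import json
import time
import numpy as np

CHUNK_SIZE = 800           # characters per chunk; illustrative value
INDEX_TTL = 7 * 24 * 3600  # refresh the index weekly (assumption)

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for your embedding model (OpenAI, open-source, etc.)."""
    raise NotImplementedError

def build_vector_store(pages: list[dict], path: str = "vector_store") -> None:
    # 2. Content processing: split extracted text into chunks that fit the context window
    chunks = []
    for page in pages:
        for i in range(0, len(page["text"]), CHUNK_SIZE):
            chunks.append({"url": page["url"], "text": page["text"][i:i + CHUNK_SIZE]})

    # 3. Vector embedding: convert each chunk into a numerical vector
    vectors = embed([chunk["text"] for chunk in chunks])

    # 4.-5. Vector store creation and index caching: persist to disk with expiration metadata
    np.save(f"{path}.npy", vectors)
    with open(f"{path}.json", "w") as f:
        json.dump({"chunks": chunks, "expires_at": time.time() + INDEX_TTL}, f)
```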

Our approach implements RAG with multi-tier caching to improve efficiency. Unlike “true” CAG, which would require direct access to the LLM's internal key-value cache tensors (not possible with commercial LLM APIs, as we’ve noted), our system achieves performance gains by caching at the retrieval application level. We preserve processed vector embeddings and retrieval results to avoid redundant computation.

Our caching system uses simple, standard storage technologies rather than specialized vector databases. It’s important to note that while we are still using vector embeddings, we found that dedicated vector databases were unnecessarily complex for many website chatbots, and this simpler storage approach offers several key advantages:

  • Simplified Infrastructure: A basic key-value store like Redis or even a simple database table is sufficient for our needs
  • Efficient Performance: Hash-based lookups are extremely fast, providing near-instant responses for previously asked questions compared to vector similarity searches
  • Lower Resource Requirements: No need for specialized infrastructure to compute embeddings or perform vector searches

This simpler storage and structural approach translates to lower infrastructure costs and easier maintenance compared to full vector database implementations (as we noted in our original blog, they were probably a bit overkill).

We've found this approach particularly effective for small to medium websites (up to a few thousand pages) with relatively stable content that changes infrequently. It really shines in budget-conscious implementations where many users ask similar questions but still need semantic understanding, and of course in use cases where response time affects user satisfaction. That being said, vector databases remain incredibly powerful and are still the right choice when you need to scale to billions of vectors, apply complex metadata filtering, or use advanced ANN (Approximate Nearest Neighbor) algorithms optimized for high-dimensional vectors.

2. Three-Tier Question Processing

At the core of our system lies a progressive question-handling strategy to balance speed, cost, and accuracy. The idea is to avoid immediately calling the expensive LLM API for every query when simpler, faster methods might suffice. 

First, we start with an exact match approach. After normalizing the question by removing stop words, stripping punctuation, and standardizing case, we check whether this exact question exists in our cache using a simple hash lookup. This lightning-fast, straightforward method instantly returns cached answers for questions we've seen before.
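A minimal sketch of this first tier is shown below, using an in-memory dictionary where a production system might use Redis or a database table; the stop-word list is an illustrative assumption.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "for"}  # illustrative list

def normalize(question: str) -> str:
    """Lowercase, strip punctuation, and drop stop words."""
    words = re.sub(r"[^\w\s]", " ", question.lower()).split()
    return " ".join(word for word in words if word not in STOP_WORDS)

answer_cache: dict[str, str] = {}  # normalized question -> answer; swap for Redis in production

def exact_match(question: str) -> str | None:
    """Tier 1: hash-based lookup of the normalized question."""
    return answer_cache.get(normalize(question))
```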

That being said, of course, people rarely phrase questions identically - which is why our second tier employs similarity matching. When no exact match exists, we use Jaccard similarity to identify questions that share substantial keyword overlap. If we find a question that exceeds our 90% similarity threshold, we deliver the cached answer along with a note about the match. This maintains the speed advantage while accommodating natural language variations. 
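The second tier could look like the sketch below, reusing `normalize` and `answer_cache` from the previous snippet; the keyword-set definition of Jaccard similarity and the linear scan over cached questions are assumptions.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of keywords that two questions have in common."""
    return len(a & b) / len(a | b) if a and b else 0.0

def similar_match(question: str, threshold: float = 0.9) -> str | None:
    """Tier 2: reuse the answer of a previously seen question with enough keyword overlap."""
    words = set(normalize(question).split())
    for cached_question, answer in answer_cache.items():
        if jaccard(words, set(cached_question.split())) >= threshold:
            return answer
    return None
```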

Only when these faster methods fail do we activate our third tier, which combines vector search with LLM generation. This process begins with a semantic search through our vector index to find the most relevant text chunks from the website. We then insert these retrieved passages into our prompt template alongside the user's original question. The complete prompt goes to the LLM API, and the freshly generated answer joins our cache for future use.
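Continuing the same sketch, the third tier ties together the cached vector index and the prompt template from earlier; `embed`, `answer_question`, the cosine-similarity scoring, and the top-3 retrieval are the same illustrative assumptions as before.

```python
import numpy as np

def vector_search(question: str, vectors: np.ndarray, chunks: list[dict], k: int = 3) -> list[str]:
    """Find the k most relevant chunks via cosine similarity against the cached index."""
    q = embed([question])[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i]["text"] for i in np.argsort(scores)[::-1][:k]]

def answer_with_llm(question: str, vectors: np.ndarray, chunks: list[dict]) -> str:
    """Tier 3: retrieve context, call the LLM, and cache the fresh answer for future lookups."""
    documents = vector_search(question, vectors, chunks)
    answer = answer_question(question, documents)  # prompt template + LLM call from earlier
    answer_cache[normalize(question)] = answer     # the new answer joins the cache
    return answer

def handle_question(question: str, vectors: np.ndarray, chunks: list[dict]) -> str:
    """Progressive strategy: exact match, then similarity match, then retrieval + generation."""
    return (
        exact_match(question)
        or similar_match(question)
        or answer_with_llm(question, vectors, chunks)
    )
```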

Although implementing a comprehensive feedback system was out of scope for our project, we recommend adding this feature to any production deployment. A simple thumbs up/down mechanism after each response can provide invaluable data for improving your chatbot over time.

What Tools You’ll Need

  • Website Access – Sitemap or list of URLs
  • Crawler – Crawl4AI
  • Indexing – LlamaIndex for embeddings and retrieval
  • LLM API – GPT-4, Claude, Mistral, etc.
  • Embeddings – OpenAI or open-source model
  • Caching – Redis, SQLite, or file-based

The evolution of website chatbots since 2023 has been focused not just on new capabilities, but on making implementations more practical, cost-effective, and maintainable. 

What's most exciting is how accessible this technology has become—organizations of all sizes can now implement effective website chatbots without massive AI expertise or infrastructure.

If you are interested in what we can do for your company, contact us!

Written by 

Andy Ho

Analytics Engineer at Xomnia

 

Folkert Ritsma

Data Scientist at Xomnia 
