

Developing GOV.UK Chat: Our data science and AI engineering journey

Our tests of GOV.UK Chat, the experimental AI-powered chatbot we’re developing at the Government Digital Service (GDS), have given us clear evidence that it saves people time and provides them with helpful information. This has given us the confidence to begin opening up access more widely – and to get to this point, we’ve been on a data science and AI engineering journey.

At its core, GOV.UK Chat is a conversational retrieval-augmented generation (RAG) system, combining semantic search with generative AI to deliver clear, contextualised answers. The work is one of the Prime Minister's AI Exemplars and part of the roadmap for modern digital government.

In this post, we explore the technical details behind GOV.UK Chat, how we used insights from live pilots and user testing to guide improvements, and the iterations that led to the version tested in the GOV.UK app in 2025.

Evaluation-driven development

GOV.UK Chat has changed significantly since its first prototype in July 2023. Initially built with Langchain (an open-source framework for developing Large Language Model, or LLM, applications) and Gradio (a Python library for creating web interfaces), it has since transitioned to a Ruby application running on AWS (Amazon Web Services) and powered by Anthropic models. Throughout this evolution, we focused on improving query processing and response quality through advanced retrieval, better question handling, and robust validation, guided by an evaluation-driven approach that assessed both individual components and the system end-to-end. It has been a collaborative journey, working closely with our teammates with expertise in content and interaction design, and software engineering.

Our evaluation framework builds on 3 pillars:

Automated evaluation: the backbone of our iterative process, testing changes against predefined metrics and ground-truth datasets. We used LLM-as-a-judge and custom metrics to evaluate answer quality, and established classification and information retrieval metrics to optimise other system components.

Manual evaluation: expert analysis of Chat’s conversational logs has been used to identify issues and refine the system. This has included collaborations with content designers and subject matter experts across GDS and other departments to assess outputs against our quality framework, and red teaming conducted with the Incubator for Artificial Intelligence and the AI Security Institute (AISI) to stress-test safety and resilience.

Automated live monitoring: continuous tracking of real-world performance, ensuring potential safety and quality issues are flagged for human review in a timely manner.

Across all 3 pillars, qualitative log analysis (systematic review of conversational logs, model outputs and failure cases) has been central to identifying pain points and driving iteration.
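
To make the first pillar concrete, here is a minimal sketch of an LLM-as-a-judge check in Python. The rubric wording, the 1-to-5 scale and the injected call_llm function are illustrative assumptions, not our production setup.

```python
# Minimal LLM-as-a-judge sketch. The rubric, score scale and the injected
# `call_llm` function are illustrative assumptions for this post.
import json
import re
from typing import Callable

JUDGE_PROMPT = """You are grading a chatbot answer against retrieved GOV.UK content.

Question: {question}
Retrieved content: {context}
Answer: {answer}

Score groundedness from 1 (unsupported) to 5 (fully supported by the content).
Reply as JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def judge_groundedness(question: str, context: str, answer: str,
                       call_llm: Callable[[str], str]) -> dict:
    """Ask a judge model to rate one answer; returns {"score": ..., "reason": ...}."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate prose around the JSON
    return json.loads(match.group(0)) if match else {"score": None, "reason": raw}
```

In practice a check like this is run over a ground-truth dataset, with scores aggregated per release rather than judged one answer at a time.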

Turning insights into iterations

Pilots over the past 2 years with different user groups showed promising results. As far back as 2023, nearly 70% of users said they found the prototype version of GOV.UK Chat useful to them. But this early testing also highlighted clear areas for improvement, particularly around accuracy and safety.

As data scientists, our task was to turn these insights into actionable, testable iterations. We began by defining what needed improvement and how to measure success, establishing 6 evaluation criteria. These were:

groundedness – responses strictly follow retrieved GOV.UK content
relevance and answer rate – responses address queries and correctly handle out-of-scope questions
factual accuracy – all facts verified against GOV.UK sources
factual completeness – responses include all necessary details
reliability – consistent performance and answers
reputational safety – mitigating risks of inappropriate or inaccurate AI-generated content

Each iteration aimed to improve at least one of these aspects without compromising any of the others.
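
As an illustration of that rule, a release gate could be sketched as below; the metric names and the tolerance value are assumptions for the example, not our actual thresholds.

```python
# Illustrative gate for "improve at least one criterion, regress none".
# Metric names and the tolerance are assumptions for this sketch.
CRITERIA = ["groundedness", "relevance", "factual_accuracy",
            "factual_completeness", "reliability", "safety"]

def passes_gate(baseline: dict, candidate: dict, tolerance: float = 0.01) -> bool:
    """True if the candidate improves at least one criterion and regresses
    none beyond the tolerance."""
    no_regression = all(candidate[c] >= baseline[c] - tolerance for c in CRITERIA)
    improvement = any(candidate[c] > baseline[c] for c in CRITERIA)
    return no_regression and improvement
```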

Five areas for improvement

We focused on 5 main areas for iterative improvement. These act as building blocks, working together to enhance GOV.UK Chat’s overall performance.

1. Data quality

The underlying principle is simple: an LLM can only generate accurate answers if it is given the right information to work with.

GOV.UK Chat exclusively draws on guidance published on GOV.UK. Before retrieval, we filter GOV.UK content using metadata to prioritise authoritative, up-to-date sources and exclude any content containing personally identifiable information. We also use hierarchical semantic chunking, splitting pages into coherent sections while preserving the HTML header structure, so that content is well-structured and contextually meaningful before it reaches the retrieval stage.
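
As a rough illustration of hierarchical semantic chunking, the sketch below splits a page at its HTML headings and keeps the heading trail with each chunk. It is a simplified stand-in for the real pipeline, whose details differ.

```python
# Sketch of hierarchical chunking: split a GOV.UK page at its HTML headings
# while keeping the heading trail as context. Details of the real pipeline
# are assumed, not documented here.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def chunk_by_headings(html: str) -> list[dict]:
    """Return one chunk per section, each carrying its h1 > h2 > h3 heading trail."""
    soup = BeautifulSoup(html, "html.parser")
    trail: dict[str, str] = {}
    chunks, buffer = [], []

    def flush():
        if buffer:
            chunks.append({"headings": [trail[t] for t in ("h1", "h2", "h3") if t in trail],
                           "text": " ".join(buffer)})
            buffer.clear()

    for el in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        if el.name in ("h1", "h2", "h3"):
            flush()
            trail[el.name] = el.get_text(strip=True)
            # a new h2 invalidates any previous h3, and so on down the hierarchy
            for deeper in {"h1": ("h2", "h3"), "h2": ("h3",), "h3": ()}[el.name]:
                trail.pop(deeper, None)
        else:
            buffer.append(el.get_text(strip=True))
    flush()
    return chunks
```

Keeping the heading trail with each chunk means a retrieved section still carries the context of the page it came from.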

2. Retrieval

Retrieval (selecting the most relevant information for a user query) is a critical part of GOV.UK Chat, as it determines what content the LLM sees.

We use semantic search to match user queries to GOV.UK pages based on meaning rather than exact wording – this approach copes well with variation in how users phrase their questions, though some edge cases remain. To further improve result quality, we introduced a metadata-based re-ranking layer: following a comprehensive review of all document types on GOV.UK, we developed a weighting schema (in collaboration with GDS colleagues specialising in content design and information architecture) to surface more current and contextually relevant content alongside semantic similarity.
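
A minimal sketch of that re-ranking idea follows; the document-type weights and freshness decay are invented for illustration and are not our actual schema.

```python
# Illustrative metadata-based re-ranking: blend semantic similarity with
# document-type and freshness weights. All weight values here are invented.
from datetime import datetime, timezone

TYPE_WEIGHTS = {"guide": 1.0, "answer": 0.95, "news_article": 0.6}  # hypothetical

def rerank(results: list[dict], now: datetime | None = None) -> list[dict]:
    """Each result has 'score' (semantic similarity), 'doc_type' and a
    timezone-aware 'updated_at' datetime."""
    now = now or datetime.now(timezone.utc)

    def weighted(r: dict) -> float:
        age_years = (now - r["updated_at"]).days / 365
        freshness = max(0.5, 1.0 - 0.1 * age_years)  # gentle decay, floored at 0.5
        return r["score"] * TYPE_WEIGHTS.get(r["doc_type"], 0.8) * freshness

    return sorted(results, key=weighted, reverse=True)
```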

To optimise retrieval, we also tested several dense embedding models and tuned retrieval parameters, settling on Amazon's Titan model for its balance of quality and infrastructure efficiency. Retrieval is still being actively developed. Past experiments with hybrid search and cross-encoder re-ranking have informed our approach, and improvements are ongoing.
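
For illustration, embedding a query with Titan through Amazon Bedrock looks roughly like the sketch below; the model ID follows AWS's published naming, but treat it and the region as assumptions to verify against your own setup.

```python
# Sketch of embedding a query with Amazon Titan via Bedrock (boto3).
# The model ID and region are assumptions; check availability in your account.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")  # region assumed

def embed(text: str) -> list[float]:
    """Return the dense embedding vector for one piece of text."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",  # assumed Titan embeddings model ID
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]
```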

3. Question handling

We introduced question routing as a vital component of GOV.UK Chat, enabling us to classify user intent and direct queries to the most appropriate response strategy. This functionality has 3 main purposes: helping users ask clearer, more actionable questions, and so get better answers; protecting the system from generating responses to harmful or irrelevant queries; and providing users with a constructive way forward even when we cannot fulfil their request (for example, requests for advice).

We use a tool-calling approach to classify distinct user intents, such as greetings, requests for advice, or genuine questions best answered with GOV.UK content. Rather than following a fixed decision path, the LLM autonomously selects the most suitable response strategy from a predefined set based on its understanding of the user's intent. This offers a good balance of performance, cost, user experience and scalability.
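
A simplified sketch of tool-calling routing with the Anthropic Messages API is shown below; the route names, descriptions and model ID are illustrative, not our production configuration.

```python
# Sketch of tool-calling intent routing. Tool names, descriptions and the
# model ID are illustrative assumptions, not the production configuration.
import anthropic

client = anthropic.Anthropic()

ROUTES = [
    {"name": "answer_with_govuk_content",
     "description": "A genuine question answerable from GOV.UK guidance.",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "greeting",
     "description": "A greeting or small talk.",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "advice_request",
     "description": "A request for personal or professional advice.",
     "input_schema": {"type": "object", "properties": {}}},
]

def route(question: str) -> str:
    """Let the model pick one response strategy; return the chosen route name."""
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # model ID is an assumption
        max_tokens=100,
        tools=ROUTES,
        tool_choice={"type": "any"},       # force the model to select a tool
        messages=[{"role": "user", "content": question}],
    )
    return next(b.name for b in message.content if b.type == "tool_use")
```

Because each route is a named tool, adding a new response strategy is a schema change rather than a rewrite of a fixed decision tree.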

4. Generation

We have iterated on the system prompt, incorporating techniques like chain of thought and goal-oriented prompting, and structured JSON input and output to ensure the LLM receives and outputs data in a consistent format. We also give clear instructions to base answers only on GOV.UK guidance and to write in GOV.UK style.

This approach has significantly improved factual accuracy, completeness, and user focus. It has also enhanced reliability by ensuring the LLM cites the GOV.UK content it uses and avoids hallucinated hyperlinks. To balance helpfulness with accuracy, where a complete answer cannot be found on GOV.UK, we provide “partial” answers to guide users towards the most relevant available guidance.
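
The sketch below illustrates the general pattern of structured JSON output with downstream citation validation; the prompt wording and field names are assumptions for the example.

```python
# Minimal sketch of structured generation: chain of thought plus JSON output,
# with cited sources validated downstream. Field names are assumptions.
import json

SYSTEM_PROMPT = """Answer only from the numbered GOV.UK extracts provided.
Think step by step, then reply with JSON only:
{"answer": "<answer in GOV.UK style>",
 "sources": [<extract numbers actually used>],
 "confidence": "full" | "partial" | "none"}"""

def parse_answer(raw: str, allowed_sources: set[int]) -> dict:
    """Parse the model's JSON and drop any cited source not in the retrieved
    set, which also guards against hallucinated links."""
    data = json.loads(raw[raw.index("{"): raw.rindex("}") + 1])
    data["sources"] = [s for s in data.get("sources", []) if s in allowed_sources]
    return data
```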

5. Minimising safety risks with guardrails

To ensure GOV.UK Chat remains reliable and secure, we developed 2 separate layers of safety guardrails. These mitigate risks, with the ‘pre-generation’ guardrail blocking inappropriate user queries and the ‘post-generation’ guardrail validating system responses. These were developed in consultation with content designers, red teams (including AISI), and legal advisors. While no LLM application is completely foolproof, these measures significantly reduce risks and maintain GOV.UK Chat's integrity.
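
In outline, the two layers compose like the minimal sketch below, where the classifier functions are injected stand-ins rather than our real guardrail models.

```python
# Illustrative two-layer guardrail flow: a pre-generation check on the query
# and a post-generation check on the draft answer. The classifiers and the
# fallback messages are stand-ins, not the production implementations.
from typing import Callable

def answer_safely(query: str,
                  generate: Callable[[str], str],
                  query_is_safe: Callable[[str], bool],
                  answer_is_valid: Callable[[str], bool]) -> str:
    """Run the pre-generation guardrail, generate, then validate the output."""
    if not query_is_safe(query):          # pre-generation guardrail
        return "Sorry, I can't help with that question."
    draft = generate(query)
    if not answer_is_valid(draft):        # post-generation guardrail
        return "Sorry, I couldn't produce a reliable answer to that."
    return draft
```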

Key lessons and next steps

Of the many lessons developing GOV.UK Chat has taught us, the 4 main ones are:

Evaluation-driven development and error analysis are essential. Defining clear success criteria, measuring every change, and being willing to abandon what does not work ensures AI applications evolve responsibly and effectively. Looking closely at outputs, not just metrics, is what drives meaningful iteration.

Balancing safety with a natural user experience is a central system design challenge. Users expect a conversational, intuitive interaction, while the system must maintain robust safety measures and minimise inaccurate or inappropriate outputs. Delivering both simultaneously is an ongoing tension we actively design for.

Be realistic about what the system can answer. GOV.UK Chat's answers rely solely on guidance published on GOV.UK. Where guidance on a topic is limited, or if a question is out of scope, it is better for Chat to provide a partial response acknowledging the limitation, or no answer at all, than to give misleading information.

Solutions must generalise. GOV.UK Chat covers the entire landscape of published GOV.UK guidance – an array of topics across hundreds of thousands of pages. That means we cannot make ad-hoc fixes to the product targeted at specific queries. Every improvement must work consistently and reliably at scale.

Based on the findings of our pilots, we're opening up access to GOV.UK Chat more widely, starting by making it available to all users of the GOV.UK app. At the same time, we will continue to refine retrieval, evaluate new models, and experiment with agentic AI to enhance the user experience, support more complex queries, and extend to new tools and content.

As ever, our aim is to ensure any change to GOV.UK Chat meets user needs responsibly – balancing value, accuracy, trustworthiness and scalability as we continue to make it easier for people to find the information they need on GOV.UK.

Subscribe to Inside GOV.UK to get the latest updates about our work.
