When using generative AI solutions, domain-specific, fine-tuned large language models, or LLMs, can greatly reduce “hallucinations” – the confidently delivered but demonstrably incorrect responses you can sometimes get from these tools. Well-known examples include generative AI miscounting the number of R’s in the word “strawberry” or inventing court cases that a New York attorney unknowingly cited in a legal brief.

For government agencies, hallucinations are a major concern, especially in high-stakes environments like national security, homeland defense, aviation security, healthcare benefits decisions, or medical research.

So how do hallucinations happen? And how can we avoid them?

The answer starts with understanding the foundational LLMs that underpin generative AI. These models, trained on massive general-purpose datasets, deliver advanced language capabilities – from text classification to question answering – that push the boundaries of what was previously possible. However, foundational LLMs do not contain domain-specific data. One might ask a foundational LLM about the Family and Medical Leave Act and get a correct and useful answer; that same model, however, will have no insight into your organization’s leave benefits. Fine-tuning these models with domain-specific data can significantly enhance their performance in that domain and thereby reduce hallucinations.

Further, fine-tuning leverages the inherent capabilities of foundational models – such as sentiment analysis, semantic search, summarization, question answering, and text generation – and applies these capabilities to domain-specific content, resulting in more accurate responses.

With most federal agencies experimenting with generative AI, at least at the proof-of-concept stage, it’s important for their data science teams to understand how to fine-tune LLMs for maximum accuracy and performance. Below is a simple, six-step process; the short Python sketches under each step illustrate one way it might look in practice using Hugging Face’s open-source tooling:

1. Choose a Foundational LLM and Dataset
Start by selecting a suitable, pre-trained model for your task. Then, obtain a high-quality dataset relevant to your specific task.
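
As a rough sketch, the snippet below uses Python with the open-source Hugging Face transformers and datasets libraries; the distilbert-base-uncased checkpoint and the CSV file names are illustrative placeholders, not recommendations, and the data is assumed to be labeled text.

```python
# Minimal sketch: pick a pre-trained checkpoint and load a domain-specific dataset.
# The checkpoint name and CSV file names below are illustrative placeholders; the
# data is assumed to contain a "text" column and an integer "label" column.
from datasets import load_dataset
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-uncased"  # foundational model chosen for the task
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Domain-specific, labeled examples split into train / validation / test files.
dataset = load_dataset(
    "csv",
    data_files={"train": "train.csv", "validation": "validation.csv", "test": "test.csv"},
)
print(dataset["train"][0])  # inspect a record before moving on
```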

2. Data Preparation
To prepare your data for use, break it into smaller units or tokens to ensure it’s in the correct format for the model.
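
Continuing the sketch, tokenization might look like the following; it assumes the tokenizer and dataset objects from step 1 and a “text” column in the data.

```python
# Sketch: convert raw text into the token IDs the model expects.
# Assumes the `tokenizer` and `dataset` objects from step 1 are in scope.
def tokenize_batch(batch):
    # Truncate long documents to the model's maximum input length.
    return tokenizer(batch["text"], truncation=True)

tokenized_dataset = dataset.map(tokenize_batch, batched=True)
```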

3. Model Initialization
Next, load the foundational LLM and specify task-specific parameters such as how many labels you want to assign to your data.
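
For a text-classification task, initialization could be as simple as this; num_labels=3 is purely illustrative and should match the number of categories in your own data.

```python
# Sketch: load the foundational model with a new, task-specific classification head.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",  # same checkpoint chosen in step 1
    num_labels=3,               # illustrative; use the label count from your dataset
)
```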

4. Define Evaluation Metrics
From there, set up evaluation functions to measure the model’s performance during training.
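
One common approach, sketched here with the Hugging Face evaluate library, is to track accuracy and a weighted F1 score; other metrics may fit your task better.

```python
# Sketch: a metrics function the trainer can call on each evaluation pass.
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=predictions, references=labels)["accuracy"],
        "f1": f1.compute(predictions=predictions, references=labels, average="weighted")["f1"],
    }
```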

5. Training
Fine-tune the model on the prepared dataset using a training framework like Hugging Face’s Trainer and evaluate regularly on a validation set.
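
A bare-bones Trainer setup might look like the following; it assumes the model, tokenizer, tokenized dataset, and compute_metrics function from the earlier steps, and the hyperparameter values are illustrative starting points rather than tuned recommendations.

```python
# Sketch: fine-tune with Hugging Face's Trainer, evaluating on the validation
# split at the end of every epoch. Hyperparameters here are starting points only.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="fine_tuned_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
```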

6. Model Evaluation & Optimization
Post-training, assess the model’s performance on a test set and optimize hyperparameters as needed.
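
Wrapping up the sketch, a final check against the held-out test split might look like this; if the scores fall short, adjust hyperparameters such as the learning rate, batch size, or number of epochs and retrain.

```python
# Sketch: measure performance on data the model never saw during training,
# then save the fine-tuned weights for deployment or further experiments.
test_results = trainer.evaluate(eval_dataset=tokenized_dataset["test"])
print(test_results)  # e.g. eval_loss, eval_accuracy, eval_f1

trainer.save_model("fine_tuned_model")
```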

While this process is straightforward, the reality is that it takes time, compute resources and staff – all things that mission partners like GDIT are equipped to provide so that agencies can take advantage of the considerable benefits of fine-tuning LLMs. This approach not only enhances the accuracy of AI models, making them more reliable for critical tasks, but also reduces costs, as fine-tuning is significantly more economical than training a model from scratch. Additionally, it improves performance on specific tasks by leveraging the vast knowledge of foundational models and increases a model’s flexibility and adaptability across domains and applications.

As agencies continue to experiment with AI and work to adhere to the principles set out in the Executive Order on safe, secure and trustworthy AI, the conversation on fine-tuning LLMs is an important one. Fine-tuning facilitates the “trustworthy” development and use of AI solutions, and because reducing hallucinations makes LLMs – and, by extension, generative AI solutions – more effective, it sets a foundation for continued experimentation and integration.

Looking ahead, GDIT is collaborating with customers and partners, leveraging our Luna AI Digital Accelerator, to develop new capabilities that make it continually easier for agencies to integrate AI into their missions. Specifically, we have implemented iterative, adaptive processes that correct hallucinations in real time, as they’re detected, ensuring that AI models remain accurate and trustworthy throughout their use.