Author: Chris Van Buren
Published: 2 May 2024
Form Number: LP1953
PDF size: 7 pages, 72 KB
Abstract
Organizations are increasingly adopting large language models (LLMs) to generate near-human quality text for many use cases. However, LLMs alone are not sufficient for generative AI applications that must meet enterprise needs for domain specificity and up-to-date information. Retrieval-augmented generation (RAG) is a technique that brings relevant data to LLMs to yield more informed AI-generated outputs.
In this series, we discuss the process of creating an enterprise-level RAG application using LLMs. This first article gives an overview of what to expect in each article of the series and the deliverables you can build by following them, and it provides guidance on model selection and performance measures for an effective RAG system.
Articles in this series:
- Overview, LLM Selection & Performance Measures (this article)
- RAG Fine-tuning Dataset Creation
- Generative LLM Fine-tuning for RAG (coming soon)
- Embeddings Model Fine-tuning for RAG (coming soon)
- RAG Search Enhancements (coming soon)
Introduction
Building an effective retrieval-augmented generation (RAG) application requires careful planning and execution. To achieve this, you’ll need to follow these essential steps:
- Choose the right large language models (LLMs) for generating synthetic data, generating responses to end users, and evaluating performance.
- Create a comprehensive dataset that covers the range of user inputs and desired responses.
- Fine-tune an LLM for domain-specific RAG applications to ensure accurate results.
- Optimize embeddings models, which convert text into semantically representative numbers (vectors), to improve context retrieval (a short embeddings sketch follows this list).
- Implement context retrieval enhancements for an optimal user experience.
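To make the embeddings step concrete, the short sketch below shows how a sentence-embeddings model turns text into vectors whose similarity reflects meaning. It is a minimal illustration assuming the sentence-transformers library; the model name and example texts are placeholders, not a recommendation from this series.

```python
# Minimal embeddings sketch (assumes the sentence-transformers library).
# The model name is illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "How do I reset the management controller?",
    "Steps to restart the baseboard management controller (BMC).",
]
embeddings = model.encode(texts)   # one vector per input text
print(embeddings.shape)            # (2, 384) for this model

# Semantically similar texts map to nearby vectors, which is what
# makes embeddings useful for retrieving relevant context.
print(util.cos_sim(embeddings[0], embeddings[1]))
```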
When choosing LLMs, consider factors such as model size, licensing terms, the domains represented in the training data, and context window (maximum prompt) size. Using the largest models that fit within the available hardware constraints can significantly improve application performance. Benchmarks should also be consulted to ensure the selected LLMs exhibit robust capabilities for handling RAG question answering tasks.
Throughout the development process, it's vital to quantitatively evaluate both retrieval and generation components of the RAG application using a suitable framework such as Ragas. Ragas is an open-source toolkit that uses LLMs to scalably score the accuracy of both the retrieval and generation components of RAG systems. By adhering to these guidelines and employing best practices, an enterprise LLM RAG application can deliver superior results in handling user queries while enhancing overall system efficiency. Stay tuned for the next article where we dive into creating a dataset for validating and evaluating the RAG application.
Series Overview
This article is the first in a five-part series that covers the process of creating an enterprise LLM RAG application. The series can be followed in chronological order as a guide. Table 1 shows the main deliverables that can result from following the instructions in each article.
| # | Article Title | Deliverables |
|---|---|---|
| 1 | Overview, LLM Selection & Performance Measures (this article) | |
| 2 | RAG Fine-tuning Dataset Creation | |
| 3 | Generative LLM Fine-tuning for RAG (coming soon) | |
| 4 | Embeddings Model Fine-tuning for RAG (coming soon) | |
| 5 | RAG Search Enhancements (coming soon) | |
LLM Selection
This process uses LLMs in three ways:
- Synthetic dataset generation: An LLM is inferenced to synthetically generate a dataset.
- Fine-tuning: An LLM undergoes parameter-efficient fine-tuning (PEFT) and will be inferenced in production (a minimal PEFT sketch follows this list).
- Evaluation: An LLM is inferenced to evaluate outputs from the RAG application.
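To illustrate the fine-tuning role, the sketch below attaches LoRA adapters to a base model, one common form of PEFT, using the Hugging Face transformers and peft libraries. The model name and hyperparameters are illustrative placeholders; the fine-tuning article in this series covers the actual procedure.

```python
# Minimal PEFT sketch: wrap a base model with LoRA adapters so that only a
# small set of added weights is trained. Model name and hyperparameters are
# illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # only a small fraction of weights train
```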
Model Sizing
The three roles can be filled by the same base model, but to make the most of development resources it is better to use a larger LLM for the inference-only roles and to fine-tune a smaller LLM. For example, if a single A100 80GB GPU is available for development, it can inference Llama 2 13B but can only fine-tune Llama 2 7B, since inference requires less VRAM than fine-tuning. In general, larger models perform better, so it is important to maximize the model size at each step given the available resources.
Most LLMs can be loaded in 16-bit datatypes (i.e., fp16 or bf16). For inference, the amount of VRAM needed, in bytes, is equal to the number of parameters multiplied by 2 (16 bits equals 2 bytes). Therefore, on a GPU with 80GB VRAM, it is possible to inference an LLM of up to 40 billion parameters. For training, the amount of VRAM needed is quadrupled compared to inference, so an LLM of up to 10 billion parameters could be fine-tuned on the same GPU.
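The following sketch applies these rules of thumb to a few common model sizes. It is only a rough estimate: actual memory use also depends on activations, optimizer state, batch size, and sequence length.

```python
# Back-of-the-envelope VRAM estimate for 16-bit (fp16/bf16) weights:
# ~2 bytes per parameter for inference, and roughly 4x that for fine-tuning.
def estimate_vram_gb(params_billions: float, training: bool = False) -> float:
    bytes_per_param = 2                 # 16-bit weights
    multiplier = 4 if training else 1   # rough fine-tuning overhead
    return params_billions * 1e9 * bytes_per_param * multiplier / 1e9

for size in (7, 13, 40, 70):
    print(f"{size}B params: inference ~{estimate_vram_gb(size):.0f} GB, "
          f"fine-tuning ~{estimate_vram_gb(size, training=True):.0f} GB")
# On an 80 GB GPU this suggests inference up to ~40B parameters and
# fine-tuning up to ~10B parameters, matching the guidance above.
```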
In production, the fine-tuned LLM can be inferenced on a smaller GPU than the one used for fine-tuning, following these same guidelines.
Benchmark Performance
Once the maximum model sizes have been determined, models should be selected that score well on relevant benchmarks. Table 2 below shows common tasks and corresponding benchmarks.
| Task Type | Benchmark | Benchmark Summary |
|---|---|---|
| Reasoning | HellaSwag | Common sense reasoning that is trivial for humans |
| | WinoGrande | Common sense reasoning by resolving ambiguities in sentences |
| | PIQA | Physical commonsense in everyday scenarios |
| Reading comprehension/question answering | BoolQ | Yes/no question answering based on questions and corresponding passages |
| | TriviaQA | Answering trivia questions based on very long context |
| | NaturalQuestions | Answering real user Google searches based on Wikipedia articles |
| Math word problems | MATH | Challenging competition mathematics problems |
| | GSM8K | Grade school math word problems |
| | SVAMP | Elementary-level math word problems |
| Coding | HumanEval | Simple interview-type software questions |
| | MBPP | Basic Python programming problems |
| Multi-task | MMLU | Question answering on a wide range of academic subjects |
| | GLUE | Reading comprehension and logic |
| Separating fact from fiction in training data | TruthfulQA | Truthfulness and avoiding common human misconceptions on a wide range of topics |
For RAG question answering applications, reading comprehension/question answering tasks are the most relevant, so BoolQ, TriviaQA, and NaturalQuestions are all benchmarks to consider. A good resource for viewing leaderboards of high-performing LLMs on notable benchmarks is https://paperswithcode.com/.
Other Considerations
- Training type: For both synthetic data generation and fine-tuning, it is best to use base models instead of already fine-tuned models. For evaluation, an instruction-fine-tuned model (also called a “chat” model) will generally perform better.
- Context size: Each model has a fixed maximum context (prompt) size. Models with larger context windows allow more context to be included in each prompt. The synthetic dataset generation LLM needs a context window large enough to hold several examples of questions, answers, and supporting context in each prompt when generating a new data point. The LLM selected for fine-tuning needs a context window large enough to fit a user question plus enough retrieved context to answer it. The evaluation LLM needs a context size similar to the fine-tuned LLM's, since it must fit a question, supporting context, and answers to evaluate the full performance of the application (see the context-fit sketch after this list).
- License terms: It is important to review the license of any model selected. Allowed uses of the model should be reviewed, since some model licenses may forbid using the model’s outputs to train another model.
- Domain: If the application focuses on a certain domain, such as finance or medicine, consider models that have been pretrained with a larger portion of data in that domain.
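As a concrete illustration of the context-size consideration, the sketch below counts prompt tokens with the model's tokenizer and checks them against a context budget. It assumes the Hugging Face transformers library; the model name and token limits are illustrative and should be replaced with the values for the model you select.

```python
# Minimal context-fit check (assumes the Hugging Face transformers library).
# Model name and token budgets are illustrative only.
from transformers import AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # Llama 2 has a 4,096-token context window
MAX_CONTEXT_TOKENS = 4096
RESERVED_FOR_ANSWER = 512                # leave room for the generated answer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def fits_in_context(question: str, retrieved_chunks: list[str]) -> bool:
    """Return True if the question plus retrieved context fits in the prompt budget."""
    prompt = question + "\n\n" + "\n\n".join(retrieved_chunks)
    num_tokens = len(tokenizer.encode(prompt))
    return num_tokens + RESERVED_FOR_ANSWER <= MAX_CONTEXT_TOKENS
```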
Performance Measures
Throughout the development process, it is important to quantitatively evaluate both the retrieval and generation components of the RAG application. The second article in this series discusses the process of creating a dataset that covers the scope of user inputs the application should be able to respond to, and the desired responses to those inputs. A subset of that dataset can be used for validation of the application.
Ragas is a framework for evaluating RAG applications that uses an LLM to score the outputs of a RAG application. The Ragas framework measures performance of retrieval and generation components independently, so developers can identify which step in the application needs improvement. Using Ragas with a validation dataset can yield the following metrics, as defined by the Ragas documentation, which should be tracked with every iteration of the application’s development to measure improvements:
- Retrieval
  - Context precision: the signal-to-noise ratio of the retrieved context
  - Context recall: whether all of the relevant information needed to answer the question was retrieved
- Generation
  - Faithfulness: how factually accurate the generated answer is
  - Answer relevancy: how relevant the generated answer is to the question
- End-to-end evaluation
  - Answer semantic similarity: the semantic resemblance between the generated answer and the ground truth
  - Answer correctness: the accuracy of the generated answer compared to the ground truth
It is important to measure the performance of RAG applications quantitatively to support metrics-driven development. Ragas enables the use of an LLM for measuring performance, so the process is scalable and repeatable, generating a set of actionable metrics.
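The sketch below shows how such an evaluation might be wired up with Ragas on a small validation set. It assumes a recent Ragas release with an evaluation LLM configured separately (for example, through environment variables); column names and metric imports can vary between Ragas versions, so consult the Ragas documentation for the release you use.

```python
# Minimal Ragas evaluation sketch. The data is a placeholder; in practice the
# questions, contexts, and answers come from the validation dataset and the
# RAG application's own retrieval and generation steps.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

eval_data = Dataset.from_dict({
    "question": ["What GPU is needed to fine-tune a 7B model?"],
    "contexts": [["An A100 80GB GPU can fine-tune models of roughly 7-10B parameters."]],
    "answer": ["A single A100 80GB GPU is sufficient to fine-tune a 7B model."],
    "ground_truth": ["A 7B model can be fine-tuned on one A100 80GB GPU."],
})

results = evaluate(
    eval_data,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(results)  # per-metric scores between 0 and 1
```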
Conclusion
Creating an enterprise LLM RAG application involves several crucial steps, each producing deliverables that contribute to the overall performance and effectiveness of the system. These steps include selecting appropriate LLMs for synthetic dataset generation, fine-tuning an LLM for domain-specific retrieval-augmented generation, fine-tuning embeddings models, implementing search enhancements, and evaluating the application using quantitative measures.
When creating an enterprise-level RAG application with LLMs, consider factors such as model size, license terms, training type, context window size, and domain coverage. Using the largest models that fit within the available hardware constraints improves performance, and benchmarks are essential for verifying that the selected LLMs can handle RAG question answering tasks.
Quantitative evaluation of both retrieval and generation components is essential throughout the RAG application development process, employing frameworks such as Ragas for accurate assessment. Following these guidelines and best practices ensures informed development for continuous system improvement. In the next article, we'll discuss defining the scope of user inputs and creating and augmenting a dataset that can be used for fine-tuning an LLM.
Read the next article in this series, Part 2: RAG Fine-Tuning Dataset Creation.
For more information on Lenovo offerings for Generative AI, see the Reference Architecture for Generative AI Based on Large Language Models (LLMs), available from https://lenovopress.lenovo.com/lp1798-reference-architecture-for-generative-ai-based-on-large-language-models.
Author
Chris Van Buren is a Staff Data Scientist at Lenovo. He researches generative AI for enterprise use cases and has developed retrieval augmented generation (RAG) applications with open source, on-premises LLMs.
Trademarks
Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.
The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
Other company, product, or service names may be trademarks or service marks of others.