Making LLMs smarter with Dynamic Knowledge Access

This guide shows how to use Retrieval Augmented Generation (RAG) to enhance a large language model (LLM). RAG lets an LLM reference context outside of its initial training data before generating a response. Training a model for your own domain-specific purposes can be extremely expensive in both time and computing power, so RAG is a cost-effective way to extend the capabilities of an existing LLM. To demonstrate RAG in this guide, we'll give Llama 3.2 access to Nitric's documentation so that it can answer specific questions about it. You can adapt this guide to use another data source that meets your needs.

Prerequisites

  • uv - for Python dependency management
  • The Nitric CLI
  • (optional) An AWS account

Getting started

We'll start by creating a new project using Nitric's Python starter template.

If you want to take a look at the finished code, it can be found here.

nitric new llama-rag py-starter
cd llama-rag

Next, let's install our base dependencies, then add the LlamaIndex libraries. We'll be using LlamaIndex because it makes building RAG applications straightforward and supports running local Llama 3.2 models.

# Install the base dependencies
uv sync
# Add Llama index dependencies
uv add llama-index llama-index-embeddings-huggingface llama-index-llms-llama-cpp

We'll organize our project structure like so:

+-- common/
|   +-- __init__.py
|   +-- model_parameters.py
+-- model/
|   +-- Llama-3.2-1B-Instruct-Q4_K_M.gguf
+-- services/
|   +-- chat.py
+-- .gitignore
+-- .python-version
+-- build_query_engine.py
+-- pyproject.toml
+-- python.dockerfile
+-- python.dockerignore
+-- nitric.yaml
+-- README.md

Setting up our LLM

Before we start writing code for our LLM, we'll download the model into our project. For this project we'll use Llama 3.2 with the Q4_K_M quantization.

mkdir model
cd model
curl -OL https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
cd ..

Now that we have our model we can load it in our code, in common/model_parameters.py. We'll also define our embed model, using a recommended embedding model from Hugging Face. At this point we can also create a prompt template for queries to our query engine. The template helps curb hallucinations: if the model doesn't know an answer, it will say so rather than pretend it does.

from llama_index.core import ChatPromptTemplate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

# Load the locally stored Llama model
llm = LlamaCPP(
    model_url=None,
    model_path="./model/Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    temperature=0.7,
    verbose=False,
)

# Load the embed model from Hugging Face
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5", trust_remote_code=True)

# Set the location where we will persist our embeds
persist_dir = "query_engine_vectors"

# Create the prompt query templates to sanitise hallucinations
text_qa_template = ChatPromptTemplate.from_messages([
    (
        "system",
        "If the context is not useful, respond with 'I'm not sure'.",
    ),
    (
        "user",
        (
            "Context information is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Given the context information and not prior knowledge "
            "answer the question: {query_str}\n."
        )
    ),
])

Building a Query Engine

The next step is to embed our context so the model can retrieve it when answering queries. For this example we'll embed the Nitric documentation, which is open source on GitHub, so we can clone it into our project.

git clone https://github.com/nitrictech/docs.git nitric-docs

We can then create our embeddings and store them locally in build_query_engine.py.

from common.model_parameters import llm, embed_model, persist_dir

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings

# Set global settings for llama index
Settings.llm = llm
Settings.embed_model = embed_model

# Load data from the documents directory
loader = SimpleDirectoryReader(
    # The location of the documents you want to embed
    input_dir="./nitric-docs/",
    # Set the extension to match the format of your documents
    required_exts=[".mdx"],
    # Search through documents recursively
    recursive=True,
)
docs = loader.load_data()

# Embed the docs and build the vector index
index = VectorStoreIndex.from_documents(docs, show_progress=True)

# Save the query engine index to the local machine
index.storage_context.persist(persist_dir)

You can then run this script using the following command. It will write the embeddings to your persist_dir.

uv run build_query_engine.py
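
If you'd like to sanity-check the persisted index before wiring up the WebSocket, a small sketch like the one below (reusing the same common.model_parameters values; the test question is just an example) loads the vectors back and runs a quick query:

# Quick sanity check (sketch): load the persisted index and run a test query.
from common.model_parameters import llm, embed_model, persist_dir, text_qa_template
from llama_index.core import StorageContext, load_index_from_storage, Settings

# Use the same models that built the index
Settings.llm = llm
Settings.embed_model = embed_model

# Rebuild the index from the persisted vectors
storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
index = load_index_from_storage(storage_context)

# Ask a simple question to confirm retrieval is working
response = index.as_query_engine(text_qa_template=text_qa_template).query("What is Nitric?")
print(response)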

Creating a WebSocket for querying our model

With our LLM ready for querying, we can create a WebSocket to handle prompts.

import os

from common.model_parameters import embed_model, llm, persist_dir, text_qa_template

from nitric.resources import websocket
from nitric.context import WebsocketContext
from nitric.application import Nitric
from llama_index.core import StorageContext, load_index_from_storage, Settings

# Set global settings for llama index
Settings.llm = llm
Settings.embed_model = embed_model

socket = websocket("socket")

# Handle socket connections
@socket.on("connect")
async def on_connect(ctx):
    print(f"socket connected with {ctx.req.connection_id}")
    return ctx

# Handle socket disconnections
@socket.on("disconnect")
async def on_disconnect(ctx):
    print(f"socket disconnected with {ctx.req.connection_id}")
    return ctx

# Handle socket messages
@socket.on("message")
async def on_message(ctx: WebsocketContext):
    # Query the model with the requested prompt
    prompt = ctx.req.data.decode("utf-8")
    response = await query_model(prompt)
    # Send a response to the open connection
    await socket.send(ctx.req.connection_id, response.encode("utf-8"))
    return ctx

async def query_model(prompt: str):
    print(f"Querying model: \"{prompt}\"")
    # Load the index from the persisted local storage
    if os.path.exists(persist_dir):
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
        # Get the query engine from the index, and use the prompt template for sanitisation
        query_engine = index.as_query_engine(
            streaming=False,
            similarity_top_k=4,
            text_qa_template=text_qa_template,
        )
    else:
        print("model does not exist")
        return "model does not exist"

    # Query the model
    query_response = query_engine.query(prompt)
    print(f"Response: \n{query_response}")
    return query_response.response

Nitric.run()

Test it locally

Now that we have the WebSocket defined, we can test it locally. Run nitric start, then connect to the WebSocket through either the Nitric Dashboard or another WebSocket client. Once connected, you can send a message with a prompt to the model. Sending a prompt like "What is Nitric?" should produce an output similar to:

Nitric is a cloud-agnostic framework designed to aid developers in building full cloud applications, including infrastructure.
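
If you'd rather script the test than use the dashboard, a minimal sketch using the third-party websockets package (an assumed extra dependency, not part of this project) could look like the following; the URL is a placeholder, so use the local address shown when nitric start is running:

# Minimal local test client (sketch). Assumes `websockets` is installed, e.g. `uv add websockets`.
import asyncio
import websockets

async def main():
    # Placeholder URL: check the Nitric Dashboard for the actual local address and port
    async with websockets.connect("ws://localhost:4001") as ws:
        # Send a prompt and print the model's reply
        await ws.send("What is Nitric?")
        reply = await ws.recv()
        print(reply)

asyncio.run(main())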

Get ready for deployment

Now that it's tested locally, we can get our project ready for containerization. The default python.dockerfile uses python3.11-bookworm-slim as its base container image, which doesn't have the dependencies needed to load the Llama model. All we need to do is update the Dockerfile to use python3.11-bookworm (the non-slim version) instead.

Update line 2:

- FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder
+ FROM ghcr.io/astral-sh/uv:python3.11-bookworm AS builder

And line 18:

- FROM python:3.11-slim-bookworm
+ FROM python:3.11-bookworm

When you're ready to deploy the project, we can create a new Nitric stack file that will target AWS:

nitric stack new dev aws

Update the stack file nitric.dev.yaml with the appropriate AWS region and memory allocation to handle the model:

Note: WebSockets are supported across all AWS regions.

provider: nitric/aws@1.14.0
region: us-east-1
config:
  # How services will be deployed by default. If you have other services not running models,
  # you can add them here too so they don't use the same configuration.
  default:
    lambda:
      # Set the memory to 6GB to handle the model; this automatically increases the CPU allocation
      memory: 6144
      # Set a timeout of 30 seconds (the longest API Gateway will wait for a response)
      timeout: 30
      # Add more ephemeral storage to the Lambda function so it can store the model
      ephemeral-storage: 1024

We can then deploy using the following command:

nitric up

To test on AWS, we'll need to use a WebSocket client or the AWS Console. You can verify it in the same way as locally: connect to the WebSocket and send a message with a prompt for the model.

Once you're finished querying the model, you can destroy the deployment using nitric down.

Summary

In this project we've successfully augmented an LLM using Retrieval Augmented Generation (RAG) with LlamaIndex and Nitric. You can modify this project to use any LLM, change the prompt template to make responses more specific, or swap in a different data source to suit your own requirements. We could also extend this project to maintain context between requests over the WebSocket for a more chat-like experience with the model.
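
As a starting point for that extension, here's a minimal sketch (assuming the same common.model_parameters module and persisted index) that uses LlamaIndex's chat engine with an in-memory buffer so follow-up questions keep earlier exchanges as context. In the WebSocket service you'd likely want one memory buffer per connection_id, which isn't shown here.

# Sketch: a chat engine that remembers earlier messages (an assumed extension, not part of the guide's code).
from common.model_parameters import llm, embed_model, persist_dir
from llama_index.core import StorageContext, load_index_from_storage, Settings
from llama_index.core.memory import ChatMemoryBuffer

Settings.llm = llm
Settings.embed_model = embed_model

# Load the persisted index
index = load_index_from_storage(StorageContext.from_defaults(persist_dir=persist_dir))

# Keep recent conversation history in memory, bounded by a token limit
memory = ChatMemoryBuffer.from_defaults(token_limit=2048)
chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)

print(chat_engine.chat("What is Nitric?"))
print(chat_engine.chat("How do I deploy it to AWS?"))  # follow-up question reuses prior context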
