Building a simple RAG for learning

Introduction

There are many articles that explain RAG (Retrieval-Augmented Generation), but they often lean heavily on vector databases and additional libraries, which can make RAG look complex to beginners.

Therefore, in this article, we build a RAG with a deliberately simple structure for learning purposes, making its basic operation easier to understand.

The code in this article uses Node.js, the OpenAI API, and the Vercel AI SDK.

Does an LLM Know About Me?

Can an LLM answer the question, "What is my favorite food?"
No, it can't. That's because an LLM doesn't know what my favorite food is.

Let's try it out with the following code.
Of course, the input is "What is my favorite food?"

import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";

export const generateTextWithoutRAG = async (input: string) => {
  const { text } = await generateText({
    model: openai("gpt-4o"),
    messages: [{ role: "user", content: input }],
  });

  return text;
};
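
To try it, a call like the following should work (a minimal sketch; it assumes your OPENAI_API_KEY environment variable is set and that the file runs as an ES module, so top-level await is available):

const answer = await generateTextWithoutRAG("What is my favorite food?");
console.log(answer);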

When executed, the generated text was as follows:
“You haven't mentioned your favorite food yet. What is it?”

Getting an LLM to Know About Me

If an LLM doesn't know about me, I can just provide the information. So, I created a brief self-introduction.

By giving this self-introduction along with the question, an LLM should be able to answer the question based on the provided information.

Let's try it out.

const resources = `My favorite food is sushi.
My hobby is reading.
I often go hiking on weekends.
My favorite movie is "Spirited Away."
I like jazz music.
I enjoy traveling with my family.
I have fun chatting with friends at cafes.
I love finding new restaurants.
I go jogging every morning.
`;

// Unlike generateTextWithoutRAG, this version passes the entire self-introduction along with the question.
export const generateTextWithFullContext = async (input: string) => {
  const { text } = await generateText({
    model: openai("gpt-4o"),
    messages: [
      { role: "user", content: input },
      {
        role: "system",
        content: `Here is some relevant information about the user:
      ${resources}
      `,
      },
    ],
  });

  return text;
};

When executed, the generated text was as follows:
“Based on the information provided, your favorite food is sushi.”
Great.

How Much Information to Provide?

Wait a minute.

For a short self-introduction like this, it's easy to include it with the question. But what if the self-introduction is very long?

We need a way to filter the information we provide, passing the model only the parts relevant to the question.

How to Find Relevant Information

To find the relevant content, we can use embeddings.
Embeddings are numerical representations of text, used to compare the similarity between different texts.
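
Concretely, generating a single embedding with the AI SDK looks like this (a minimal sketch; embed is the single-value counterpart of the embedMany function used later, and it assumes OPENAI_API_KEY is set):

import { openai } from "@ai-sdk/openai";
import { embed } from "ai";

// Convert one string into a vector of numbers (1536 dimensions for this model).
const { embedding } = await embed({
  model: openai.embedding("text-embedding-ada-002"),
  value: "bus",
});
console.log(embedding.length); // 1536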

For example, if you had to pick the word most similar to "bus" from ["bike", "cat", "lemon"], which would it be?
The answer is "bike", as follows:

[
  { content: "bike", similarity: 0.880791523658849 },
  { content: "cat", similarity: 0.8036448043753511 },
  { content: "lemon", similarity: 0.7747867483955078 },
];

This list shows each word's similarity to the text "bus": the higher the similarity value, the closer the two texts are.
The process is straightforward:

  1. Create embeddings for "bus", "bike", "cat", and "lemon".
  2. Compute the cosine similarity between the "bus" embedding and each of the others, storing each value in similarity.
  3. Sort by similarity in descending order.
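
For reference, cosine similarity is just the dot product of two vectors divided by the product of their magnitudes; the closer two embeddings point in the same direction, the closer the value is to 1. The AI SDK ships a cosineSimilarity helper, so you don't need to write this yourself, but a hand-rolled sketch looks like this:

// Illustrative only; the AI SDK's cosineSimilarity helper does the same job.
const cosine = (a: number[], b: number[]): number => {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
};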

Here's the code. The Vercel AI SDK provides convenient functions.

import { openai } from "@ai-sdk/openai";
import { cosineSimilarity, embedMany, generateText } from "ai";

type Embedding = {
  content: string;
  embedding: number[];
};

const embeddingModel = openai.embedding("text-embedding-ada-002");

async function main() {
  const resource = `
	bike
	cat
	lemon
	`;
  const targetText = "bus";
  const result = await findRelevantContent(targetText, resource);
  // Print the ranked similarity list shown above.
  console.log(result);
}

// Embed both the query and the resource, then return the chunks most similar to the query.
const findRelevantContent = async (input: string, resource: string) => {
  const embeddingsFromTarget = await generateEmbeddings(input);
  const embeddingsFromResource = await generateEmbeddings(resource);

  const topSimilarEmbeddings = findTopSimilarEmbeddings(
    embeddingsFromTarget[0],
    embeddingsFromResource
  );

  return topSimilarEmbeddings;
};

// Rank resource chunks by cosine similarity to the target and keep the top matches.
const findTopSimilarEmbeddings = (
  targetEmbeddings: Embedding,
  resourceEmbeddings: Embedding[],
  top = 3
) => {
  const similarities = resourceEmbeddings.map((e) => ({
    content: e.content,
    similarity: cosineSimilarity(targetEmbeddings.embedding, e.embedding),
  }));

  similarities.sort((a, b) => b.similarity - a.similarity);

  return similarities.slice(0, top);
};

// Split the text into chunks and embed them all in one batched API call.
const generateEmbeddings = async (value: string) => {
  const chunks = generateChunks(value);
  const { embeddings } = await embedMany({
    model: embeddingModel,
    values: chunks,
  });

  return embeddings.map((e, i) => ({ content: chunks[i], embedding: e }));
};

// Split the input into one trimmed chunk per non-empty line.
const generateChunks = (input: string): string[] => {
  return input
    .trim()
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "");
};

// Run the demo.
main();

An LLM Knows About Me

Let's apply this to the original self-introduction to create a new function.

const resource = `My favorite food is sushi.
My hobby is reading.
I often go hiking on weekends.
My favorite movie is "Spirited Away."
I like jazz music.
I enjoy traveling with my family.
I have fun chatting with friends at cafes.
I love finding new restaurants.
I go jogging every morning.
`;

export const generateTextWithRAG = async (input: string) => {
  const topSimilarEmbeddings = await findRelevantContent(input, resource);

  const { text } = await generateText({
    model: openai("gpt-4o"),
    messages: [
      {
        role: "system",
        content:
          "Respond to the user's prompt using only the provided context.",
      },
      { role: "user", content: input },
      {
        role: "system",
        content: `Here is some relevant information about the user:
         ${topSimilarEmbeddings.map((e) => e.content).join("\n")}`,
      },
    ],
  });

  return text;
};
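
Calling it with the same question as before (a hypothetical invocation, again assuming OPENAI_API_KEY is set):

const answer = await generateTextWithRAG("What is my favorite food?");
console.log(answer);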

The relevant content looks like this:

[
  {
    content: "My favorite food is sushi.",
    similarity: 0.8662724175188703,
  },
  {
    content: "My hobby is reading.",
    similarity: 0.801633592562232,
  },
  {
    content: "I love finding new restaurants.",
    similarity: 0.79013408442398,
  },
];

And the generated text was:
“Your favorite food is sushi.”
Wonderful.

Even with an incredibly long self-introduction, this approach still works: the retrieval step passes only the most relevant lines to the model, so an LLM can answer questions about you without seeing the whole text.

Conclusion

In this article, we built a RAG with a simple structure for learning purposes. I hope this has helped you understand the potential and basic workings of RAG.

Please feel free to check out the repository I created during my learning process as well. You can easily find more advanced code samples by searching online.

If you have any feedback or questions, please feel free to let me know.
Thank you for your continued support.

Reference Links

Here are some sites and libraries I used to learn about RAG.