The Rise of Vector Databases in AI and Beyond

Inmeta
16 min read · Mar 12, 2024
“The Rise of Vector Database in AI minimalistic graphic”, generated with AI (Bing Copilot) on March 12, 2024 at 0:28 p.m.

In my previous article, I talked about how transformers are highly efficient for model training and how Large Language Models (LLMs) can have trillions of parameters, enabling human-like responses in Natural Language Processing (NLP). We’ve explored how Semantic Kernel, an SDK, manages prompts for AI services using LLMs for C#, Python, and Java developers. You can read the article here.

In this post, I will discuss vectors, embeddings, vector databases, and an architecture for AI applications that you must be familiar with.

This article is written by Devlin Duldulao, a senior consultant at the IT consultancy company Inmeta.

Devlin Duldulao

Devlin Duldulao is an experienced full-stack developer at Inmeta, recognized as a Microsoft MVP, Certified Trainer, and Azure Developer Associate. With over 10 years of experience, he assists businesses in modernizing and innovating digital services, focusing on Enterprise Cloud, web and mobile development, as well as system design and cloud services. Devlin shares his expertise through courses and talks at national and international conferences. He has also authored three books, including “Spring Boot and Angular,” published in 2023.

Keyword Search vs Semantic Search

Let’s start by showing the difference between a keyword search and a vector search with an example, and then I will discuss in detail what each of them is.

I searched an online drugstore that uses keyword search for the phrase “eye pain.”

Online store using keyword search

The image above shows three results for the query “eye pain”: a joint pain product, eye drops, and loratadine for itchy eyes. Every result contains either the word “eye” or the word “pain.” This method retrieves documents that contain exact matches or close variations of the terms in the query. That’s how keyword search works.

We can improve on these results. Let’s try Weaviate’s demo of an online drugstore.

Online store using semantic search and vector search

The image above shows three results related to eye pain or discomfort. This is an improvement because we no longer see results about joint pain or pain in other parts of the body. Vector search is a method that retrieves documents that are conceptually similar to the query, even if they don’t contain the exact terms. It relies on the semantic understanding of the content.

Vector

What is a vector? Simply put, a vector is just an array of numbers. In mathematics, a vector is a quantity with both magnitude and direction. Writing a vector in your favorite programming language is easy. You can hard code an array of numbers.
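
For example, here is a vector hard-coded in Python, the language used later in this article; the values are arbitrary:

# A three-dimensional vector, hard-coded as a plain Python list
my_vector = [0.12, -0.53, 0.91]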

Embedding (vector embedding)

What about vector embedding, also known as just embedding? An embedding is a specific type of vector representation where high-dimensional data is mapped to vectors of significantly lower dimensions. If this is hard to grasp, think of an embedding as an array of numbers with meaning: it acts as coordinates that place structured and unstructured data into groups. The idea is to capture the semantic meaning of the data in a dense vector form where similar items are placed closer together in the vector space.

How do you generate or create embeddings?

To generate embeddings, you need an embedding model, either from a paid provider such as OpenAI’s text-embedding-ada-002 or an open-source one such as Hugging Face’s SentenceTransformers, which you can install as one of your Python app’s dependencies.

An embedding model generating vector embeddings

The image above shows an embedding model generating embeddings from unstructured data. We will talk more about embeddings later in this article.
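
To make this concrete, here is a minimal sketch using the open-source SentenceTransformers library mentioned above; the model name and example sentences are just illustrative choices:

from sentence_transformers import SentenceTransformer

# Load an open-source embedding model (downloaded on first use)
encoder = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I have eye pain",
    "My eyes are sore and irritated",
    "The stock market fell today",
]
embeddings = encoder.encode(sentences)

print(embeddings.shape)  # (3, 384): one 384-dimensional vector per sentence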

Now that we have embeddings, we need a system to store them and the data they represent.

Databases

Before we focus on what a vector database is, let’s have a quick look at the popular databases we usually use in the applications we build.

Relational database

The image above shows a diagram of a relational database. You can see that a relational database is built to store structured data in columns and rows.

Another popular database type would be a NoSQL database, which looks like the image below.

NoSQL

The document database structure, which you see in the image above, has collections and sub-collections instead of rows and columns. A document database is a NoSQL database that focuses on storing, retrieving, and managing document-oriented information.

Now, let’s see what a vector database would look like.

What is a vector database?

A vector database is a specialized database where you can store large numbers of vectors alongside the structured and unstructured data they represent, and then retrieve them when needed.

And here’s a better way of looking at it.

Clusters of embeddings based on their similarity

In the image above, the embeddings are grouped depending on how closely the words are related. A query such as “desktop” might return a MacBook and a laptop because they are all personal computers or gadgets.

Embeddings of gadgets and herbs

Take a look at the image above. Thyme, rosemary, and oregano are located in the herbs area. Just as in GPS, the embeddings act like latitude and longitude coordinates.

Vector Indexing

Vector indexing involves organizing and structuring high-dimensional data, typically vectors, to facilitate efficient querying. It’s about how the data is stored and accessed.

Here are some common vector indexing techniques, grouped with the search algorithms they suit.

· KD-Trees — uses a data structure that organizes points in a k-dimensional space, enhancing the efficiency of nearest neighbor search by reducing the search space.

· Ball Trees — organizes data in nested hyperspheres, improving nearest neighbor search efficiency, particularly in higher-dimensional spaces where KD-Trees are less effective.

· Tree-Based Indexing — includes structures like ANNOY, KD-trees, and Ball trees, organizing data in tree-like formats for efficient spatial searches.

· Hashing (LSH) — Locality-Sensitive Hashing (LSH) uses hash functions to group similar items into ‘buckets’, speeding up similarity searches.

· Clustering-Based Index (FAISS) — FAISS clusters high-dimensional vectors, enabling faster similarity searches through efficient grouping (see the sketch after this list).

· Proximity Graph-Based Indexing (HNSW) — HNSW forms layered graphs for quick navigation and retrieval in high-dimensional space.

· Compression-Based Index (PQ, SCANN) — these methods compress data into compact representations, balancing search efficiency with accuracy.
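
To give you a feel for how such an index is built and queried, here is a minimal sketch using FAISS, assuming the faiss-cpu package is installed; the dimensionality, number of clusters, and nprobe value are arbitrary choices for illustration:

import numpy as np
import faiss

dim = 384  # vector dimensionality
vectors = np.random.random((10_000, dim)).astype("float32")  # vectors to index
queries = np.random.random((5, dim)).astype("float32")       # query vectors

# Clustering-based (IVF) index: vectors are assigned to 100 clusters,
# and only a few clusters are scanned per query.
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 100)
index.train(vectors)   # learn the cluster centroids
index.add(vectors)
index.nprobe = 10      # clusters to visit per query (speed vs. accuracy trade-off)

distances, ids = index.search(queries, 5)  # top-5 approximate nearest neighbors
print(ids[0], distances[0])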

So, vector indexing is about organizing and storing high-dimensional data, but how do we efficiently search for the stored data? Let’s move to the next topic, which is search algorithms.

Search Algorithms

Search algorithms are the techniques to retrieve the most relevant vectors from the database in response to a query. It’s about how the data is queried and retrieved. These algorithms, such as k-Nearest Neighbors (KNN) and Approximate Nearest Neighbor (ANN), leverage advanced indexing structures to quickly sift through large datasets, identifying vectors closest or most similar to the given query.

KNN is used when the dataset is manageable and the accuracy of the result is paramount, while ANN is preferred when the dataset is large and the application can tolerate a degree of approximation in exchange for speed and scalability. The reason is that with KNN, the time required to return results increases linearly with the data size. ANN trades a bit of accuracy for speed, which is why it is called approximate.
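
To see why exact KNN scales linearly, here is a minimal brute-force sketch with NumPy in which every query is compared against every stored vector:

import numpy as np

def knn(query, vectors, k=5):
    # Brute-force KNN: compute the distance to every stored vector (linear in data size)
    distances = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(distances)[:k]  # indices of the k closest vectors

vectors = np.random.random((10_000, 384)).astype("float32")
query = np.random.random(384).astype("float32")
print(knn(query, vectors))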

After collecting the candidate matches, we need a metric to check how similar or dissimilar the vectors are. That is the next topic.

Distance Metrics

A distance metric is a function that defines the distance between two points (or vectors) in a space. This distance reflects how similar or dissimilar these points are. Here are the most common distance metrics.

Euclidean

Euclidean distance is the “ordinary” straight-line distance between two points in Euclidean space. For vectors, it’s the square root of the sum of the squared differences between corresponding elements.

Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors. It determines whether two vectors are pointing in roughly the same direction.

Dot product

The dot product (also known as the scalar product) measures the product of the magnitudes of the two vectors and the cosine of the angle between them. Algebraically, it is the sum of the products of the corresponding entries of the two sequences of numbers.
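
Here is a small NumPy sketch showing all three metrics side by side, using two arbitrary example vectors:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 4.0])

euclidean = np.linalg.norm(a - b)                                # straight-line distance
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # direction similarity
dot = np.dot(a, b)                                               # combines magnitude and angle

print(euclidean, cosine, dot)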

So those are the distance metrics that evaluate how similar two points are. Let’s look at the options for vector database vendors in the next section.

Vendors

The vector databases below are widely used for managing high-dimensional data, particularly in machine learning and AI applications.

  • Pinecone excels at simplicity and user-friendliness, providing a cloud-native, managed service that abstracts away the complexities of vectorization and indexing.
  • Weaviate is engineered for an excellent developer experience and performance, supporting fast, sub-millisecond search results and keyword and vector-based search functionalities.
  • Built in Rust, Qdrant promises efficient resource utilization and is emerging as a first-choice vector search backend, with rapid community growth and a strong focus on search accuracy & scalability.
  • Milvus is a highly mature and scalable database, offering a comprehensive set of algorithms and the unique advantage of a DiskANN implementation for efficient on-disk vector indexing. Zilliz is the commercial parent entity of Milvus and provides a fully managed cloud solution built on top of Milvus.
  • Chroma enables rapid prototyping with its Python/JavaScript interface and embedded architecture, offering hosting options and unique features like quantifying query relevance.
  • LanceDB innovates with a serverless architecture and a novel columnar data format, simplifying infrastructure complexity and facilitating semantic search applications directly linked to data lakes.
  • Vald specializes in handling multi-modal data through a highly distributed architecture, focusing uniquely on fast ANN search algorithms like NGT.
  • Elasticsearch, Redis, and pgvector offer the advantage of integrating vector search functionalities into existing data stores, providing a smooth transition into semantic search capabilities for current users.

Those are most of the vector database vendors at the time of writing this article. I highly recommend trying them all to get a feel for the developer experience (DX) of each one. Now, let’s go through the roles of vector databases in AI applications.

Roles of vector databases in AI applications

Now that we have explained what a vector database is, let’s see how it can help AI applications.

Let’s do a quick recap of what an LLM is.

Large Language Models

Large language models are machine learning models trained to predict the next word of a sentence. The application appears to talk to you like a human because the output is grammatically correct, relevant to the input, shows apparent reasoning, and so on. You will notice this in ChatGPT.

Evolution of Large Language Models

Why are vector databases hot right now? Look at the image above. It is a diagram of the evolution of LLMs. We are now in a period where businesses are bringing generative AI into their applications.

But there’s a problem.

Limitations of LLMs

LLMs are not perfect. They have shortcomings, and here they are.

• Limited to the information used to train the model (knowledge up to April 2023, even with the latest GPT-4 Turbo at the time of writing)

• Hallucinations (incorrect responses), simply because LLMs are trained to predict the next word. The model is just completing the text, even if that means adding non-factual things

The solutions to the problems of LLMs

Here are a few solutions for dealing with LLMs’ problems.

A custom model for your company is any model that has been tailored to meet specific requirements.

Fine-tuning is a process in ML where you start with a pre-trained model and continue the training process with a new dataset that is typically smaller and more domain-specific.

RAG is a design pattern for augmenting a model’s capabilities by combining it with a retrieval component. You are not training a model here.

Let’s see how RAG solves this problem by using a vector database.

Retrieval-Augmented Generation (RAG)

From the user side, you convert the prompt into embeddings for a similarity search in the vector database. Then, you arrange the original prompt plus the vector database’s results into a prompt template before sending it to the LLM provider.

Enough said. Here’s a quick and dirty demo of a vector database, followed by a demo of the RAG design pattern.

Vector database demo

In this quick example, I will use the Qdrant vector database in a Docker container, the Python programming language, and LangChain in a Jupyter notebook.

Installation

Let’s install an open-source embedding model and a vector database.

% pip install sentence-transformers
% pip install qdrant-client

Import embedding models from the vendor

Now, we import the Qdrant client and the SentenceTransformer model.

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

Loading the model and the data

Let’s now load the model and hard-code the information that we will pass to the embedding model and later save in the vector database.

encoder = SentenceTransformer(model_name_or_path='all-MiniLM-L6-v2') 

documents = [
    {
        "name": "The Time Machine",
        "description": "A man travels through time and witnesses the evolution of humanity.",
        "author": "H.G. Wells",
        "year": 1895,
    },
    {
        "name": "Ender's Game",
        "description": "A young boy is trained to become a military leader in a war against an alien race.",
        "author": "Orson Scott Card",
        "year": 1985,
    },
    # ...and the rest of the documents
]

Creating collection and records in Qdrant

The demo below is about connecting to the database using the Qdrant client, creating a collection with the name my_books, assigning the cosine distance metric, and, lastly, saving or uploading the documents using records.

qdrant = QdrantClient("localhost", port=6333)
collection_name = "my_books"

qdrant.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)

qdrant.upload_records(
    collection_name=collection_name,
    records=[
        models.Record(
            id=idx,
            vector=encoder.encode(doc["description"]).tolist(),
            payload=doc,
        )
        for idx, doc in enumerate(documents)
    ],
)

Querying and showing the output

Now we query the collection, asking for only the top two results (limit=2) for the sake of the demo, and print them in the console.

query = "movies about meals"

hits = qdrant.search(
    collection_name=collection_name,
    query_vector=encoder.encode(query).tolist(),
    limit=2,
)

for hit in hits:
    print(hit.payload, "score:", hit.score)

{'author': 'Laura Esquivel', 'description': 'A tale where each chapter is a month in the life of the protagonist and is preceded by a recipe. Food and love intermingle in this passionate novel.', 'name': 'Like Water for Chocolate', 'year': 1989} score: 0.45089203

{'author': 'Michael Pollan', 'description': 'A comprehensive look into the modern food chain, questioning what we should eat and why.', 'name': "The Omnivore's Dilemma", 'year': 2006} score: 0.4049138

Retrieval-Augmented Generation (RAG) demo

In this demo, I will show you prompts with and without RAG so you can see the problem that RAG solves.

Not using RAG

The model is not aware of today’s date. It still thinks the present time is the cutoff of its training data.

% pip install langchain

from langchain.llms import OpenAI

llm = OpenAI(openai_api_key="sk-…")

prompt = "When is 294 UFC going to happen?"

print(llm(prompt))

Output:
As of October 2020, there is no event scheduled for UFC 294. The next scheduled event is UFC 254, which is set to take place on October 24, 2020.

The next example demonstrates the model hallucinating: it confidently explains an event that never happened.

from langchain.llms import OpenAI

prompt = "Why the main card of 294 ufc fight was cancelled?"

print(llm(prompt))

Output:
The main card of UFC 294 was canceled due to a positive COVID-19 test from one of the fighters on the card. Health and safety protocols were followed, and the event was canceled out of an abundance of caution.

Using RAG with data and a prompt template

The sample below improves the quality of the response.

user_input = 'Why the main card of 294 ufc fight was cancelled?'
vector_db_result = 'Charles Oliveira suffered a nasty laceration that forced him out of the UFC 294 main event vs. Islam Makhachev. Shortly after UFC CEO Dana White confirmed Alexander Volkanovski stepped in, Oliveira (34-9 MMA, 22-9 UFC) shared a pair of images of the gash, as well as a brief apology video where he sported a bandage on his left brow. “Sorry to everyone but you know everything,” Oliveira wrote in Portuguese.'
note = 'Be concise.'

prompt = f"""Act as a search copilot, be helpful and informative. \n
-------------- \n
Based on the user's query below: \n
'{user_input}'. \n
Here is some information about the query. It has the following information: \n
{vector_db_result} \n
------------- \n
note: {note}"""

print(llm(prompt))

Output:
It appears that the main card of UFC 294 was canceled because Charles Oliveira suffered a laceration to his face, which forced him out of the fight. Alexander Volkanovski stepped in to take his place.

Using RAG with no data to support the response

The sample below makes the model say that it does not know the answer when no supporting information is provided.

user_input = 'Why the main card of 294 ufc fight was cancelled?'
vector_db_result = 'No information found'
note = 'Be concise and dont add any other details if you don\'t know about it.'

prompt = f"""Act as a search copilot, be helpful and informative. \n
-------------- \n
Based on the user's query below: \n
'{user_input}'. \n
Here is some information about the query. It has the following information: \n
{vector_db_result}.
------------- \n
note: {note}"""

print(llm(prompt))

Output:
I’m sorry, I’m not able to provide any information about the main card of the 294 UFC fight being cancelled. However, you may be able to find more information by searching through news articles or through the UFC website.

The RAG design pattern enhances the chat LLM by supplementing it with access to diverse data sources like databases, player bios, and news feeds. This integration allows the AI to offer more timely, contextually relevant, and accurate information.

However, RAG alone does not guarantee acceptable results. Sometimes, a keyword search is also essential when feeding information to the LLM. Let me show you.

Hybrid RAG

Below is a diagram of a hybrid RAG, which uses a keyword search as a supplement for improving results.

So, we combine the vector and keyword search results before sending them to the LLM. This improves the results because sometimes you are looking for an exact match, like email addresses, measurements, or medical, legal, and technical terminology that has no synonyms.
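
The diagram does not prescribe how the two result lists are merged; one common technique is reciprocal rank fusion (RRF). Here is a minimal sketch with made-up document IDs, where the constant k=60 is the value typically used in the literature:

def reciprocal_rank_fusion(result_lists, k=60):
    # Each result list is a ranked list of document IDs, best match first.
    # Documents ranked highly in several lists receive the highest fused score.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]   # from the vector (semantic) search
keyword_hits = ["doc1", "doc9", "doc3"]  # from the keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['doc1', 'doc3', 'doc9', 'doc7']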

There is still room for improvement here by adding a cross-encoder model that will check the results of our vector and keyword searches.

Hybrid RAG + Re-ranking:

The goal of re-ranking is to improve the relevance of the results returned by an initial retrieval query. So, we have an extra stage here where we send the results from both the semantic and keyword searches to the re-ranking model. A cross-encoder scores and re-orders these results for us before they are sent to the LLM, together with our original prompt, in a prompt template. Hybrid RAG plus re-ranking is an excellent combination for handling searches.
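
Here is a minimal sketch of that re-ranking stage using a cross-encoder model from the SentenceTransformers library; the model name and candidate texts are illustrative, and in practice the candidates would come from the hybrid search above:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "eye pain"
candidates = [
    "Eye drops for dry and irritated eyes",
    "Loratadine tablets for itchy eyes",
    "Joint pain relief gel",
]

# The cross-encoder scores each (query, candidate) pair jointly, which is
# slower than a vector search but more precise, so it is applied only to
# the small set of candidates returned by the hybrid search.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked)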

Here’s a table to show you a study made by Microsoft:

The table summarizes evaluations on Microsoft’s approved customer datasets using relevant industry benchmarks. It shows the improvements in search relevance, measured by the NDCG metric, across various search configurations. Normalized Discounted Cumulative Gain (NDCG) is a measure used to evaluate the quality of the results returned by a search algorithm or recommendation system. The table highlights a significant 22-point increase in relevance when advancing from a basic keyword search to a hybrid search enhanced with a semantic re-ranker. This enhancement is crucial for generating more relevant content efficiently and reducing computational costs in retrieval-augmented generation systems.

Now, let’s move to the last section of this article.

LLM application architecture

As a bonus, it’s good to also learn what an ideal architecture for an LLM application could look like. Aside from placing the hybrid RAG with re-ranking solution, there are other services we should consider. Here they are:

· Data filtering — ensures that the LLM isn’t processing unauthorized data, like personally identifiable information, after getting the database results. e.g. amoffat/HeimdaLLM

· Prompt optimization tool — minimizes token complexity of the prompt template to save money. e.g. vaibkumr/prompt-optimizer

· LLM cache — retrieves outputs that have been used for similar queries, reduces latency, computational costs, and variability in suggestions. e.g. zilliztech/GPTCache

· Content classifier — detection of harmful language, sanitization, data leakage, prompt injection attacks, appropriateness, and manipulativeness before sending the LLM output back to the user. e.g. derwiki/llm-prompt-injection-filtering and laiyer-ai/llm-guard

· Telemetry service — enables the assessment of your app’s performance by transparently tracking user interactions, such as acceptance and modification of suggestions, providing insights for enhancements. e.g. OpenTelemetry

There you go. That was an overview of a system design for an LLM with a search query.

Let’s sum up everything we have learned here.

Conclusion

We have learned that embeddings are vectors with meaning, and that you can use a vector database to store embeddings. We also know that indexing is the process of organizing the embeddings in a vector database for faster querying, that search algorithms retrieve the most relevant results, and that distance metrics are used to compare the relevancy of two embeddings.

We have also examined different vector database vendors and the limitations of LLMs, such as hallucinations and outdated knowledge.

Lastly, we have learned about RAG, hybrid RAG, and re-ranking to solve the limitations of LLMs.

Repository

github.com/webmasterdevlin/vector-database-presentation


I hope this article will help you get started with vector databases. Start building AI applications that bring value to your users.

Until next time.

Happy coding. Peace out! ✌️

