Building a Vector Database with MongoDB and Python

Vector databases have emerged as a practical way to manage the high-dimensional data that is common in AI applications. In this post, we walk through building a simple vector database with MongoDB and Python.

What is a Vector Database?

A vector database stores data in the form of vectors, which can represent anything from text embeddings to image features. These vectors are crucial for various machine learning tasks, including similarity search, recommendation systems, and anomaly detection.

Why MongoDB?

MongoDB is a NoSQL database known for its flexibility, scalability, and ease of use. Its document model stores arrays and nested structures natively, so a vector and its metadata can live together in a single document. That makes it a reasonable starting point for a simple vector database.

Prerequisites

Before we start, ensure you have the following installed:

Python 3.x
MongoDB
pymongo (Python MongoDB driver)
numpy

You can install pymongo and numpy using pip:

pip install pymongo numpy

Step 1: Setting Up MongoDB

First, make sure MongoDB is installed and running on your system. You can download it from the official MongoDB website.

Start the MongoDB server using the following command:

mongod

Step 2: Connecting to MongoDB with Python

Create a Python script to connect to your MongoDB instance:

from pymongo import MongoClient

# Connect to the MongoDB server
client = MongoClient('localhost', 27017)

# Create (or switch to) the database
db = client['vector_database']

# Create (or switch to) the collection
collection = db['vectors']

Step 3: Storing Vectors

Let’s define a function to insert vectors into our MongoDB collection. We will use numpy to generate and manage vectors:

import numpy as np

def insert_vector(vector, metadata):
    document = {
        # Convert the numpy array to a plain list, since BSON cannot encode numpy arrays
        'vector': vector.tolist(),
        'metadata': metadata
    }
    collection.insert_one(document)

# Example vector and metadata
vector = np.random.rand(128)
metadata = {'description': 'Sample vector'}

# Insert the vector into the collection
insert_vector(vector, metadata)
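
One possible extension of insert_vector, sketched under the assumption that every vector in the collection should share one dimension (128 here, matching the example above). The helper name make_vector_document and its expected_dim parameter are hypothetical, not part of pymongo. The helper builds documents without touching the database, so a batch can be validated first and then written with pymongo's insert_many:

```python
import numpy as np

def make_vector_document(vector, metadata, expected_dim=128):
    """Build a MongoDB-ready document (hypothetical helper, not a pymongo API)."""
    arr = np.asarray(vector, dtype=float)
    # Reject anything that is not a flat vector of the expected length
    if arr.ndim != 1 or arr.shape[0] != expected_dim:
        raise ValueError(
            f"expected a 1-D vector of length {expected_dim}, got shape {arr.shape}"
        )
    # Store a plain list of Python floats, since BSON cannot encode numpy arrays
    return {'vector': arr.tolist(), 'metadata': metadata}

# Build a small batch of documents in memory
documents = [
    make_vector_document(np.random.rand(128), {'description': f'Sample vector {i}'})
    for i in range(3)
]

# With the collection from Step 2, the batch could then be written in one call:
# collection.insert_many(documents)
```

Checking the dimension at insert time keeps a malformed vector from breaking the similarity search later, where every stored vector is compared against the query.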

Step 4: Retrieving Vectors

To retrieve vectors from the database, we can query the collection:

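def get_vector_by_metadata(description):
    # Match on the nested 'description' field of the metadata sub-document
    document = collection.find_one({'metadata.description': description})
    if document:
        # Convert the stored list back into a numpy array
        return np.array(document['vector'])
    return None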

# Retrieve the vector
retrieved_vector = get_vector_by_metadata('Sample vector')
print(retrieved_vector)
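
The lookup in get_vector_by_metadata filters on metadata.description, which requires scanning the collection unless that field is indexed. Below is a minimal sketch, assuming descriptions are queried often enough to justify an index. The name ensure_description_index is a hypothetical helper, while create_index is the pymongo collection method it wraps:

```python
def ensure_description_index(collection):
    """Create an ascending index on metadata.description (hypothetical helper).

    `collection` is assumed to be a pymongo Collection, such as the one
    created in Step 2. The call returns the name of the index.
    """
    # 1 means ascending order (the value of pymongo.ASCENDING)
    return collection.create_index([('metadata.description', 1)])

# Usage with the collection from Step 2:
# index_name = ensure_description_index(collection)
```

Requesting an index that already exists with the same definition does nothing, so the helper is safe to call at startup.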

Step 5: Vector Similarity Search

A core feature of a vector database is similarity search. We can implement a basic version with cosine similarity, comparing the query against every stored vector:

from numpy.linalg import norm

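def cosine_similarity(vec1, vec2):
    denominator = norm(vec1) * norm(vec2)
    # A zero-length vector has no direction; treat its similarity as 0
    if denominator == 0:
        return 0.0
    return np.dot(vec1, vec2) / denominator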

def find_most_similar(vector, top_n=5):
    all_vectors = collection.find()
    similarities = []

    for document in all_vectors:
        db_vector = np.array(document['vector'])
        similarity = cosine_similarity(vector, db_vector)
        similarities.append((similarity, document['metadata']))

    # Sort by similarity and get top_n results
    similarities.sort(reverse=True, key=lambda x: x[0])
    return similarities[:top_n]

# Find the most similar vectors
similar_vectors = find_most_similar(vector)
print(similar_vectors)
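
The loop above computes one dot product per document in Python. As a rough sketch of an alternative, the same ranking can be done with a single matrix-vector product once the documents are loaded into memory. The function name rank_by_cosine is hypothetical, and the sample documents are in-memory stand-ins for what collection.find() would return:

```python
import numpy as np

def rank_by_cosine(query, documents, top_n=5):
    """Rank documents by cosine similarity to `query` (hypothetical helper).

    `documents` is any iterable of dicts shaped like the ones stored in
    Step 3, for example list(collection.find()).
    """
    docs = list(documents)
    if not docs:
        return []
    # Stack all stored vectors into one (num_docs, dim) matrix
    matrix = np.array([doc['vector'] for doc in docs], dtype=float)
    query = np.asarray(query, dtype=float)

    # One matrix-vector product replaces the per-document dot products
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Zero-length vectors get similarity 0 instead of a division by zero
    scores = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

    # Indices of the highest scores, best first
    order = np.argsort(scores)[::-1][:top_n]
    return [(float(scores[i]), docs[i]['metadata']) for i in order]

# In-memory stand-ins for documents returned by collection.find()
sample_docs = [
    {'vector': [1.0, 0.0], 'metadata': {'description': 'east'}},
    {'vector': [0.0, 1.0], 'metadata': {'description': 'north'}},
    {'vector': [1.0, 1.0], 'metadata': {'description': 'north-east'}},
]
results = rank_by_cosine([1.0, 0.0], sample_docs, top_n=2)
print(results)
```

This is still a brute-force scan over every stored vector, so it only suits collections small enough to fit in memory.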

Vectors are the language of LLMs and their memory. As the saying goes: humans speak in text, computers in binary, and LLMs in vectors.

Conclusion

In this post, we’ve demonstrated how to build a vector database using MongoDB and Python. This basic implementation covers vector storage, retrieval, and similarity search. Depending on your use case, you can extend this system to include more sophisticated indexing, retrieval, and storage mechanisms.

Building a vector database opens up numerous possibilities for applications in machine learning, recommendation systems, and more. With MongoDB’s flexibility and Python’s extensive libraries, you have a robust foundation to explore these advanced capabilities.

Happy coding!

For more advanced implementations and optimizations, consider exploring MongoDB’s aggregation framework and indexing capabilities. Stay tuned for more posts on AI and data engineering!