Experimenting with Vector Databases: Chromadb, Pinecone, Weaviate and Pgvector

Vishnu Sivan
Published in CoinsBench
10 min read · Nov 14, 2023

Vector databases are specialized systems designed for storing, managing, and searching embedding vectors. Embeddings encode unstructured data such as text, audio, and video as numeric vectors that machine learning models can work with, and their adoption has surged as AI has become more effective at tasks like natural language processing and image recognition. Enterprises use vector databases to scale and deliver solutions for these use cases. They also serve as foundational tools in fields such as GIS, computer graphics, data science, and machine learning, where datasets with intricate spatial and geometric relationships need efficient storage and retrieval.

In this article, we will try out four vector databases: Chromadb, Pinecone, Weaviate and Pgvector.

Getting Started

1. Chromadb

ChromaDB is an open-source vector database designed for storing vector embeddings, aimed specifically at building large language model (LLM) applications. It provides a fast and effective way to store the information and factual data that LLM applications rely on. It also enables semantic search engines to operate efficiently over textual data.

Installing the dependencies

To get started, install Chromadb. Make sure you have Python 3.7 to 3.10, the latest version of pip, and your preferred IDE installed on your machine.

  • Create and activate the virtual environment by executing the following commands.
python -m venv venv
source venv/bin/activate #for ubuntu
venv\Scripts\activate #for windows
  • To upgrade pip, run the following command.
python -m pip install --upgrade pip
  • Install the chromadb, openai, wget, numpy and pandas libraries using pip.
pip install chromadb==0.3.29 openai==0.28.1 wget numpy pandas

In this article, specific versions of chromadb and openai are pinned to avoid exceptions such as:

RuntimeError: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0.

AttributeError: module 'openai' has no attribute 'Embedding'
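
The first error comes from the SQLite library bundled with the Python interpreter rather than from Chroma itself. A quick way to check which SQLite version your interpreter links against is sketched below. The helper name `sqlite_is_supported` is made up for illustration, and the minimum version is taken from the error message above.

```python
import sqlite3

# Minimum SQLite version quoted in the Chroma error message above
REQUIRED_SQLITE = (3, 35, 0)

def sqlite_is_supported(required=REQUIRED_SQLITE):
    """Return True if the SQLite library linked into Python is new enough."""
    return sqlite3.sqlite_version_info >= required

print("SQLite version:", sqlite3.sqlite_version)
print("Meets requirement:", sqlite_is_supported())
```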

Importing the libraries

Create a file named app.py and import the libraries needed to experiment with the Chroma database.

import openai
import pandas as pd
import os
import wget
import zipfile
from ast import literal_eval

# Chroma's client library for Python
import chromadb

# I've set this to our new embeddings model, this can be changed to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-ada-002"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

Loading and importing data

Load the Wikipedia articles with embeddings precomputed by OpenAI. The head() method displays the first five rows of the dataset.

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")

article_df = pd.read_csv('data/vector_database_wikipedia_articles_embedded.csv')
print(article_df.head())

# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)

print(article_df.info(show_counts=True))
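
The CSV stores each embedding as the text of a Python list, which is why `literal_eval` is applied above. Here is a small standalone sketch of that conversion, using a short made-up string in place of a real 1536-dimensional embedding:

```python
from ast import literal_eval

# Made-up stand-in for one cell of the title_vector column
raw_cell = "[0.25, -0.5, 0.125]"

# literal_eval parses Python literals only; it does not execute arbitrary code
vector = literal_eval(raw_cell)

print(type(vector).__name__, len(vector))
print(vector)
```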

Creating collections

Chroma collections offer the capability to store and filter data using flexible metadata, simplifying the process of querying specific subsets of embedded data.

Chroma integrates with OpenAI’s embedding functions. The simplest way to use them is to pass an embedding function when creating a collection, as outlined below.

  • Start with creating the Chroma client.
chroma_client = chromadb.Client()
  • Create collection using openai embeddings.
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

os.environ["OPENAI_API_KEY"] = 'your-openai-key'
if os.getenv("OPENAI_API_KEY") is not None:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")

embedding_function = OpenAIEmbeddingFunction(api_key=os.environ.get('OPENAI_API_KEY'), model_name=EMBEDDING_MODEL)

wikipedia_content_collection = chroma_client.create_collection(name='wikipedia_content', embedding_function=embedding_function)
wikipedia_title_collection = chroma_client.create_collection(name='wikipedia_titles', embedding_function=embedding_function)

Filling the collections

Chroma collections enable you to populate and filter data based on any desired metadata. Chroma also has the capability to store text alongside vectors and retrieve all relevant information in a single query call.

wikipedia_content_collection.add(
    ids=article_df.vector_id.tolist(),
    embeddings=article_df.content_vector.tolist(),
)

wikipedia_title_collection.add(
    ids=article_df.vector_id.tolist(),
    embeddings=article_df.title_vector.tolist(),
)
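
Both calls above send every row in a single request. If that proves too large for your machine or Chroma version, one option is to add the rows in slices. The sketch below shows only the slicing logic. `batched` and the batch size are illustrative choices, not part of Chroma's API, and the `collection.add` call is left as a comment.

```python
def batched(ids, embeddings, batch_size):
    """Yield (ids, embeddings) slices of at most batch_size items each."""
    if len(ids) != len(embeddings):
        raise ValueError("ids and embeddings must have the same length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], embeddings[start:end]

# Tiny made-up data standing in for article_df.vector_id and content_vector
demo_ids = ["0", "1", "2", "3", "4"]
demo_embeddings = [[float(i)] * 3 for i in range(5)]

for id_batch, embedding_batch in batched(demo_ids, demo_embeddings, batch_size=2):
    # In the article's setting this would be:
    # wikipedia_content_collection.add(ids=id_batch, embeddings=embedding_batch)
    print(id_batch, len(embedding_batch))
```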

Searching the collections

Chroma handles embedding the query text automatically when an embedding function is configured. The code below defines a helper function called query_collection that runs a query against a collection and returns the matches as a DataFrame.

def query_collection(collection, query, max_results, dataframe):
    results = collection.query(query_texts=query, n_results=max_results, include=['distances'])
    ids = results['ids'][0]
    # Look rows up by id so that titles and text stay aligned with the ranked ids
    matched = dataframe.set_index('vector_id').loc[ids]
    df = pd.DataFrame({
        'id': ids,
        'score': results['distances'][0],
        'title': matched['title'].tolist(),
        'content': matched['text'].tolist(),
    })

    return df

title_query_result = query_collection(
    collection=wikipedia_title_collection,
    query="modern art in India",
    max_results=10,
    dataframe=article_df
)

print(title_query_result.head())


content_query_result = query_collection(
    collection=wikipedia_content_collection,
    query="Famous battles in Indian history",
    max_results=10,
    dataframe=article_df
)
print(content_query_result.head())
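
One detail worth checking in a helper like `query_collection` is row alignment. Results come back ordered by distance, whereas filtering a table by membership returns rows in the table's own order. The plain-Python sketch below, with made-up ids, titles and distances, looks each ranked id up individually so that every title stays paired with the right score.

```python
# Made-up rows standing in for article_df (id -> title)
titles_by_id = {"1": "Art", "2": "Battle of Plassey", "3": "Mughal painting"}

# Made-up query result: ids ranked by increasing distance
ranked_ids = ["3", "1"]
distances = [0.21, 0.34]

def align_results(ranked_ids, distances, titles_by_id):
    """Pair each ranked id with its distance and its own title."""
    return [
        {"id": rid, "score": dist, "title": titles_by_id[rid]}
        for rid, dist in zip(ranked_ids, distances)
    ]

for row in align_results(ranked_ids, distances, titles_by_id):
    print(row)
```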

For more usage examples, refer to the Chroma documentation.

2. Pinecone

Pinecone is a cloud-based, fully-managed vector database designed to offer long-term memory for high-performance AI applications. Specifically tailored for the storage and retrieval of high-dimensional vectors, Pinecone facilitates rapid and efficient semantic search across these vector embeddings.

Installing the dependencies

Pinecone provides a straightforward REST API for interacting with the vector database. You can call the API directly or use one of the official Pinecone clients, such as the Python client or the Node.js client.

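Install the dependencies using pip. In this section, the Python client is used as the Pinecone client. The code below relies on pinecone.init, which was removed in version 3.0 of the client, so install an earlier release.

pip install "pinecone-client<3"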

Getting your API key

To initiate API calls to your Pinecone project, you’ll require an API key and an environment name.

Follow the below steps to obtain the key and environment:

  1. Open the Pinecone Console.
  2. Go to API Keys.
  3. Copy your API key and environment.

Initializing your connection

Initialize your client connection to Pinecone using the following code.

import pinecone
pinecone.init(api_key="xxxxxxxxxxx", environment="xxxxxxxxx")
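
Rather than pasting the key into source code, you may prefer to read both values from environment variables. This is a minimal sketch that assumes the variable names `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT`. These names are a convention chosen here, not something the client requires, and `load_pinecone_settings` is a hypothetical helper. The returned dictionary can then be passed as `pinecone.init(**settings)`.

```python
import os

def load_pinecone_settings(env=os.environ):
    """Collect the two values pinecone.init() expects, failing early if one is missing."""
    settings = {
        "api_key": env.get("PINECONE_API_KEY"),
        "environment": env.get("PINECONE_ENVIRONMENT"),
    }
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"Missing Pinecone settings: {', '.join(missing)}")
    return settings

# Demonstration with placeholder values instead of the real environment
demo_env = {"PINECONE_API_KEY": "xxxxxxxxxxx", "PINECONE_ENVIRONMENT": "xxxxxxxxx"}
settings = load_pinecone_settings(demo_env)
print(sorted(settings))
```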

Creating an index

To create an 8-dimensional vector index called “quickstart”, use the following code:

pinecone.create_index("quickstart", dimension=8, metric="euclidean")
pinecone.describe_index("quickstart")

Inserting vectors

Use the upsert operation to insert 5 vectors into the index:

index = pinecone.Index("quickstart")
index.upsert(
    vectors=[
        {"id": "A", "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
        {"id": "B", "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]},
        {"id": "C", "values": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]},
        {"id": "D", "values": [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]},
        {"id": "E", "values": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]}
    ]
)

Performing a search

The following code runs a nearest neighbor search, asking the index for the three vectors closest to the query vector.

res = index.query(
    vector=[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
    top_k=3,
    include_values=True
)
print(res)
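
To see which vectors this query should return, the same search can be reproduced without Pinecone. The sketch below is a brute-force Euclidean nearest neighbor search over the five vectors inserted earlier. It illustrates the ranking only and says nothing about how Pinecone indexes data or reports scores internally.

```python
import math

vectors = {
    "A": [0.1] * 8,
    "B": [0.2] * 8,
    "C": [0.3] * 8,
    "D": [0.4] * 8,
    "E": [0.5] * 8,
}

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, vectors, top_k):
    """Return the top_k (id, distance) pairs closest to the query."""
    scored = [(vec_id, euclidean(query, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]

for vec_id, distance in nearest([0.3] * 8, vectors, top_k=3):
    print(vec_id, round(distance, 4))
```

C is an exact match, while B and D sit at essentially the same distance on either side of it, so those three ids make up the result.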

Cleaning up

Pinecone’s Starter plan supports only 1 index. To remove the “quickstart” index, use the delete_index operation.

pinecone.delete_index("quickstart")

3. Weaviate

Weaviate is an open-source vector database designed for storing data objects and vector embeddings generated by your preferred machine learning models. It seamlessly scales to handle billions of data objects, functioning as a cloud-native, modular, real-time vector search engine tailored to enhance the scalability of your machine learning models. It is accessible through GraphQL, REST, and various language clients.

Creating a Weaviate Cluster

  • Open the Weaviate Console, and click on Register to create a new account.
  • To create a new Weaviate cluster, click the Create cluster button, select the Free sandbox tier, provide a cluster name, and click the Create button.
  • Click Details to view the Weaviate URL and API key.

Installing a client library

Install the python client using pip.

pip install "weaviate-client==3.*"

Connecting to Weaviate

Create a file named app.py. Add the following code to it for connecting to Weaviate. Provide Weaviate URL, API key and OpenAI key for establishing a connection.

import weaviate
import json

client = weaviate.Client(
    url="https://some-endpoint.weaviate.network",  # Replace with your endpoint
    auth_client_secret=weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace w/ your Weaviate instance API key
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"  # Replace with your inference API key
    }
)

Defining a class

Define a data class to store objects.

class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    "moduleConfig": {
        "text2vec-openai": {},
        "generative-openai": {}  # Ensure the `generative-openai` module is used for generative queries
    }
}

client.schema.create_class(class_obj)

Adding objects

Add the following code to import the data, together with its pre-computed vectors, into your Weaviate database.

import requests

fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json" # This file includes pre-generated vectors
url = f'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}'
resp = requests.get(url)
data = json.loads(resp.text) # Load data

client.batch.configure(batch_size=100) # Configure batch
with client.batch as batch:  # Configure a batch process
    for i, d in enumerate(data):  # Batch import all Questions
        print(f"importing question: {i+1}")
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        batch.add_data_object(
            data_object=properties,
            class_name="Question",
            vector=d["vector"]  # Add custom vector
        )

Semantic search

Performing a nearText search in Weaviate involves identifying objects whose vectors closely resemble the vector associated with the provided input text.

Execute the following code to search for objects exhibiting vectors most similar to those of “biology.”

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["biology"]})
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))
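
The response is a nested dictionary that mirrors the GraphQL query, with the matched objects under `data` → `Get` → the class name. Below is a small sketch for pulling them out. `extract_objects` is a hypothetical helper, and `sample_response` is a hand-written placeholder with dummy values, not real output from the Jeopardy dataset.

```python
def extract_objects(response, class_name):
    """Return the list of objects for class_name from a Get query response."""
    if "errors" in response:
        raise RuntimeError(f"Query failed: {response['errors']}")
    return response["data"]["Get"][class_name]

# Hand-written placeholder in the shape of a Get response (dummy values)
sample_response = {
    "data": {
        "Get": {
            "Question": [
                {"question": "q1", "answer": "a1", "category": "c1"},
                {"question": "q2", "answer": "a2", "category": "c2"},
            ]
        }
    }
}

for obj in extract_objects(sample_response, "Question"):
    print(obj["category"], "|", obj["question"], "->", obj["answer"])
```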

4. Pgvector

Pgvector is an open-source PostgreSQL extension that lets you store, query, and index vectors. Because PostgreSQL has no native vector capabilities, pgvector fills the gap by storing vectors alongside your other data and supporting both exact and approximate nearest neighbor search. It supports L2 distance, inner product, and cosine distance.
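
For reference, pgvector exposes these three measures as SQL operators: `<->` for L2 distance, `<#>` for negative inner product, and `<=>` for cosine distance. The plain-Python sketch below computes the same quantities on two small made-up vectors so the numbers are easy to verify by hand. It is not how pgvector implements them.

```python
import math

def l2_distance(a, b):
    """Counterpart of pgvector's <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """Counterpart of <#>, which returns the inner product negated."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """Counterpart of <=>: one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two made-up 2-dimensional vectors
a = [3.0, 4.0]
b = [4.0, 3.0]
print(l2_distance(a, b))
print(negative_inner_product(a, b))
print(cosine_distance(a, b))
```

With all three operators, smaller values mean closer vectors, which is why queries sort in ascending order.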

Installing the extension

  • Install PostgreSQL version 15 or later. Download the installer for your platform from the PostgreSQL download page.
  • Double-click the installer file and provide the necessary details, such as the root password, to install Postgres on your machine.
  • Open a command prompt as administrator and run the following command to set up the Visual Studio build environment needed to compile pgvector. Ensure that C++ support in Visual Studio is installed.
call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
  • Execute the following commands to build and install pgvector using nmake.
set "PGROOT=C:\Program Files\PostgreSQL\15"
git clone --branch v0.5.1 https://github.com/pgvector/pgvector.git
cd pgvector
nmake /F Makefile.win
nmake /F Makefile.win install

Installing the dependencies

Create a virtual environment and install the required dependencies to experiment with the pgvector functionalities.

pip install pgvector psycopg openai==0.28.1 psycopg-binary

We use openai version 0.28.1 because openai.Embedding is not available in later versions of the library (openai>=1.0.0).

Creating a database

Open pgAdmin 4, click on Databases → Create → Database. Provide the name as pgvector_example and click on Save to create the database.

Connecting with the database

  • Create a file named app.py. Add the following code to it to establish a connection with the database using psycopg.
import openai
from pgvector.psycopg import register_vector
import psycopg

conn = psycopg.connect(
    host='localhost',
    port=5432,
    user='postgres',
    password='postgres-password',
    dbname='pgvector_example',
    autocommit=True)

Registering the vector type

Enable and register the pgvector extension with your connection using the following code.

conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
from pgvector.psycopg import register_vector
register_vector(conn)

Creating the table

Create a documents table to store the content and the vector embeddings.

conn.execute('DROP TABLE IF EXISTS documents')
conn.execute('CREATE TABLE documents (id bigserial PRIMARY KEY, content text, embedding vector(1536))')
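
The `vector(1536)` column accepts only vectors with exactly 1536 components, which matches the output size of `text-embedding-ada-002`. A small check before inserting can surface a mismatch early, for example after switching embedding models. This is a sketch, and `check_dimensions` is a hypothetical helper, not part of pgvector.

```python
EXPECTED_DIMENSIONS = 1536  # must match the vector(1536) column definition

def check_dimensions(embeddings, expected=EXPECTED_DIMENSIONS):
    """Raise ValueError if any embedding has the wrong number of components."""
    for position, embedding in enumerate(embeddings):
        if len(embedding) != expected:
            raise ValueError(
                f"embedding {position} has {len(embedding)} dimensions, expected {expected}"
            )
    return True

# Made-up vectors: zeros stand in for real embeddings
good = [[0.0] * 1536, [0.0] * 1536]
print(check_dimensions(good))
```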

Creating embeddings

Create the embeddings using the OpenAI embeddings API and insert them into the database table.

input = [
    'The dog is barking',
    'The cat is purring',
    'The bear is growling',
    'The bird is singing',
    'The elephant is trumpeting',
    'The lion is roaring',
    'The horse is neighing',
    'The monkey is chattering'
]

import os
os.environ["OPENAI_API_KEY"] = 'your-openai-key'
if os.getenv("OPENAI_API_KEY") is not None:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")

response = openai.Embedding.create(input=input, model='text-embedding-ada-002')
embeddings = [v['embedding'] for v in response['data']]

for content, embedding in zip(input, embeddings):
    conn.execute('INSERT INTO documents (content, embedding) VALUES (%s, %s)', (content, embedding))

Querying the vector

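Query the data based on the embedding vector. The query below uses the <=> operator, which computes cosine distance, to return the five documents closest to the document with id 1.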

document_id = 1
neighbors = conn.execute('SELECT content FROM documents WHERE id != %(id)s ORDER BY embedding <=> (SELECT embedding FROM documents WHERE id = %(id)s) LIMIT 5', {'id': document_id}).fetchall()
for neighbor in neighbors:
    print(neighbor[0])

You can also view the embedding vectors on pgAdmin 4.

Thanks for reading this article.

Thanks Gowri M Bhatt for reviewing the content.

The full source code for this tutorial can be found here,

The article is also available on Dev.
