Tired of sifting through mountains of data, desperately searching for that one elusive piece of information? Or perhaps you've heard whispers of the transformative power of vector databases but find yourself swimming in a sea of confusion when it comes to putting them into practice? Fear not, intrepid explorer of the digital realm, for we have just the solution to your woes!
In this blog, we'll take you on a journey through the ins and outs of semantic search using Pinecone, enabling you to work on vast data with ease and finesse. So grab your metaphorical compass, buckle up, and get ready to embark on a voyage of discovery unlike any other!
Pinecone
Pinecone is a cloud-native vector database designed to streamline and accelerate the process of building and deploying applications that rely on similarity search and recommendation systems. It provides a robust infrastructure for storing, indexing, and querying high-dimensional vector embeddings, allowing developers to efficiently retrieve nearest neighbors or similar items based on vector representations.
At its core, Pinecone leverages state-of-the-art indexing techniques and scalable infrastructure to handle large volumes of high-dimensional data effectively. It offers a simple yet powerful API that abstracts away the complexities of managing vector databases, making it easy for developers to integrate semantic search capabilities into their applications with minimal effort.
In essence, Pinecone stands as a versatile ally for those eager to tap into the potential of vector databases. Whether you're crafting recommendation engines, curating content discovery platforms, or fine-tuning personalized search experiences, Pinecone acts as a catalyst, swiftly transforming raw data into tangible insights with remarkable speed and efficiency.
Now that we've familiarized ourselves with Pinecone, it's time to roll up our sleeves and delve into the practical aspects of this powerful tool. So, without further ado, let's start this tutorial on using Pinecone for semantic search.
A Practical Guide to Semantic Search using Pinecone
Step 1: Install Required Libraries
Alright, let's kick things off by making sure we have all the tools we need to dive into semantic search using Pinecone. To get started, we'll need to install a few essential libraries. These libraries are like the secret sauce that makes everything tick behind the scenes.
We'll be installing three key ingredients: pinecone-client, pinecone-datasets, and sentence-transformers. These are the building blocks we'll rely on throughout our semantic search adventure: the Pinecone client to talk to the database, the datasets package for ready-made data, and sentence-transformers to turn text into vectors.
Code:
!pip install -qU \
    pinecone-client==3.0.0 \
    pinecone-datasets==0.7.0 \
    sentence-transformers==2.2.2
(Now, here's a little trick: if you're working within a Jupyter notebook, you can simply run the provided pip install command with an exclamation mark (!). But if you're coding in a different environment, just drop the ! and run the command as is.)
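For reference, here's the same command as you'd run it in a plain terminal:
Code:
pip install -qU \
    pinecone-client==3.0.0 \
    pinecone-datasets==0.7.0 \
    sentence-transformers==2.2.2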
Step 2: Downloading the Data
Now that we've got our toolkit all set up, it's time to get our hands on some data. But hold your horses! We're going to take a shortcut here to save us some precious time.
Instead of tediously preparing our own dataset (which can be incredibly time-consuming and might even warrant its own separate blog post), we're going to tap into Pinecone's treasure trove of prebuilt datasets. Think of it like having access to a ready-made library of information.
The dataset used in this tutorial is derived from Quora questions and answers, which have been preprocessed and encoded using the MiniLM-L6 model and BM25 ranking.
Code:
from pinecone_datasets import load_dataset

dataset = load_dataset('quora_all-MiniLM-L6-bm25')
# drop the metadata column and use the raw-text blob column as metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# keep 80K rows of the dataset, between rows 240K and 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)
dataset.head()
Each entry in the dataset is a question from Quora, treated as a single document, with the question text itself preserved in the metadata column we just renamed. The dataset has already been preprocessed for the semantic search task: each question was encoded into a dense vector with the MiniLM-L6 model and into a sparse vector with the BM25 algorithm, so we don't need to run any embedding step ourselves before upserting.
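If you'd like a quick sanity check before moving on, you can peek at a single record. This sketch assumes the column layout we just created above (id, values, sparse_values, and the renamed metadata column):
Code:
# inspect one record to confirm the structure
sample = dataset.documents.iloc[0]
print(sample['id'])           # unique document ID
print(len(sample['values']))  # dense vector dimensionality (384 for MiniLM-L6)
print(sample['metadata'])     # the original question text (from the former blob column)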
Step 3: Serverless or Pod-based?
Now that our dataset is ready to go, it's time to decide how we want to deploy our index. This decision boils down to whether we want a serverless or a pod-based index.
If you're unfamiliar with these terms, here's the breakdown. A serverless index is fully managed by Pinecone: it scales automatically with your usage and carries minimal maintenance overhead. A pod-based index also runs on Pinecone's infrastructure, but on dedicated pods that you choose and size yourself, giving you more control over resources and more predictable performance for high-throughput applications.
To make this decision, we need to consider factors like scalability, resource management, and cost. If you're just getting started and want to keep things simple, a serverless index is the way to go. But if you have specific requirements or need more control over the deployment environment, a pod-based index might be the better choice.
In this tutorial, we are using a serverless approach.
Code:
import os

# read the deployment choice from an environment variable; defaults to pod-based
use_serverless = os.environ.get("USE_SERVERLESS", "False").lower() == "true"
Step 4: Creating an Index
With our dataset prepared and our deployment approach decided, it's time to set up our index in Pinecone.
First, we'll initialize our connection to Pinecone using our API key. If you haven't already obtained your API key, you can sign up for a free account on the Pinecone website to get one.
Code:
from pinecone import Pinecone
# initialize connection to Pinecone (get your API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# the environment variable is only needed for pod-based indexes
environment = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'
# configure client
pc = Pinecone(api_key=api_key)
Next, we'll specify the configuration for our index. This includes parameters such as the dimensionality of the vectors, the distance metric to use for similarity calculations, and the deployment specification (serverless or pod-based).
Code:
from pinecone import ServerlessSpec, PodSpec

if use_serverless:
    spec = ServerlessSpec(cloud='aws', region='us-west-2')
else:
    spec = PodSpec(environment=environment)
Once our index specification is ready, we'll create the index. Pinecone will handle the deployment and initialization process, ensuring that our index is ready for use.
We'll give it a name, define the dimensionality of the vectors (matching the embeddings we'll be using), specify the distance metric, and provide the deployment specification.
Code:
import time

index_name = 'semantic-search-fast'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if the index already exists (it shouldn't if this is your first run)
if index_name not in existing_indexes:
    # if it does not exist, create the index
    pc.create_index(
        index_name,
        dimension=384,  # dimensionality of MiniLM-L6 embeddings
        metric='dotproduct',
        spec=spec
    )
    # wait for the index to be fully initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to the index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()
We have now created a new index called 'semantic-search-fast'. It's important that we align the index dimension and metric parameters with those required by the MiniLM-L6 model.
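If you want to verify that alignment yourself, sentence-transformers can report a model's output dimensionality directly. This is just a sanity check, loading the same all-MiniLM-L6-v2 model we'll use for queries in Step 6:
Code:
from sentence_transformers import SentenceTransformer

# confirm the model's embedding size matches the index dimension
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print(model.get_sentence_embedding_dimension())  # 384, matching dimension=384 above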
Step 5: Upserting Data into the Index
Now that our index is created, it's time to populate it with our dataset. This process, known as "upserting," involves inserting new data points into the index or updating existing ones if they already exist.
Once the dataset is upserted into the index, we'll be ready to perform semantic search queries to retrieve relevant documents based on input queries. This marks a crucial step in our journey towards building a powerful semantic search engine using Pinecone.
We'll iterate through our dataset in batches and upsert each batch into the index. This ensures efficient processing and minimizes resource usage.
Code:
from tqdm.auto import tqdm

# upsert in batches of 500 documents; 80,000 rows / 500 = 160 batches
for batch in tqdm(dataset.iter_documents(batch_size=500), total=160):
    index.upsert(batch)
Note: The tqdm module is used to display a progress bar, providing visibility into the upserting process.
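If your data isn't packaged as a Pinecone dataset, you can build the records yourself. Here's a minimal sketch of that format (the ID and question text are made up for illustration, and we borrow the same MiniLM model that's loaded properly in the next step):
Code:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# each record is an ID, a 384-dimensional vector, and optional metadata
index.upsert(vectors=[{
    "id": "example-1",  # hypothetical ID
    "values": model.encode("What is the tallest mountain?").tolist(),
    "metadata": {"text": "What is the tallest mountain?"}
}])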
Step 6: Making Queries
With our dataset upserted into the index, we can now start making queries to retrieve relevant documents based on input queries. In this step, we'll perform semantic search queries using Pinecone to find similar documents to a given input query.
First, we'll need to prepare our query by encoding it into a vector representation using a pre-trained sentence embedding model. We'll use the SentenceTransformer library to encode our query.
Code:
from sentence_transformers import SentenceTransformer
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)
query = "which city has the highest population in the world?"
# create the query vector
xq = model.encode(query).tolist()
Next, we'll use the encoded query vector to perform a semantic search query on our index. We'll specify the number of similar documents (top_k) to retrieve and include metadata in the query response to obtain additional information about the retrieved documents.
Now, let's query.
Code:
# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
Upon receiving the response in variable xc, we observe a collection of questions closely related to our initial query. While we may not have exact matches, it's evident that the retrieved questions share common themes and topics.
To enhance readability, we can opt to reformat this response for better clarity and comprehension.
Code:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
Let us look at one more example query search.
Code:
query = "which metropolis has the highest number of people?"
# create the query vector
xq = model.encode(query).tolist()
# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
for result in xc['matches']:
print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
In this demonstration, we intentionally worded the query differently from the documents we hoped to retrieve. Specifically, we replaced "city" with "metropolis" and "population" with "number of people."
Remarkably, despite the considerable disparity in terms and the absence of direct term overlap between the query and the retrieved documents, the results remained highly relevant. This exemplifies the remarkable capability of semantic search algorithms to comprehend contextual similarities beyond literal word matching.
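You can see this numerically as well. As a small aside (using the util module that ships with sentence-transformers), comparing the embeddings of our two differently worded queries shows how close they sit in vector space despite sharing almost no words:
Code:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# encode both phrasings of the same underlying question
a = model.encode("which city has the highest population in the world?")
b = model.encode("which metropolis has the highest number of people?")

# a cosine similarity close to 1.0 means the model treats them as near-paraphrases
print(util.cos_sim(a, b))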
Step 7: Index Deletion
In this final step, we will tidy up our environment by deleting the index. After completing our experimentation and queries, it's essential to clean up resources to avoid unnecessary costs and clutter.
To delete the index, we'll utilize the Pinecone client library to issue a deletion command:
Code:
pc.delete_index(index_name)
With the index successfully deleted, we've concluded our tutorial on semantic search using Pinecone.
Ta-da! 🎉 So, with this, we wrap up our tutorial journey. We sincerely hope that this tutorial has been both insightful and practical for you. By now, you've gained valuable hands-on experience in leveraging Pinecone for semantic search, unlocking the potential of vector databases in your projects.
Don't hesitate to experiment, innovate, and apply what you've learned here to your projects and endeavors.
Thank you for joining us on this tutorial adventure! If you have any questions, feedback, or just want to share your experiences, feel free to reach out.
If you require assistance with the implementation of vector databases, or if you need help with related projects, please don't hesitate to reach out to us.