
Retrieval-Driven Generative QnA with OpenAI and Pinecone

Welcome aboard, knowledge seekers! Ever wonder how to keep language models from spinning tales out of thin air? Well, here's the scoop – we're about to unravel the magic behind Retrieval Enhanced Generative Question Answering with none other than OpenAI and the powerhouse Pinecone vector database.

In this tutorial, we will work toward more accurate and reliable AI responses. Get ready to explore how querying relevant contexts through Pinecone and feeding them into OpenAI's generative model can ground your answers in real data. It's time to level up your understanding of language models – OpenAI and Pinecone style!


Retrieval Augmented Generation (RAG)

First, let's understand what Retrieval Augmented Generation (RAG) is all about.

Picture this: you have a question, and you want an answer – but not just any answer, one that's not only accurate but also rooted in real-world data. That's where RAG swoops in like a superhero.

It's a groundbreaking approach that marries the power of retrieval – think fetching relevant information from vast databases like Pinecone's vector databases – with the finesse of generative models, such as those crafted by OpenAI.

In simpler terms, RAG is like having a super-smart librarian who not only knows where to find the right books but can also write custom responses based on them. Intrigued? Let's dive deeper into the mechanics of RAG and discover how it's revolutionizing the world of question answering.

At its core, RAG represents a paradigm shift in how we approach question answering. Rather than relying solely on generative models to conjure responses from scratch, RAG leverages the vast reservoirs of knowledge stored in retrieval databases, such as the robust vector databases that we will be dealing with in this tutorial.
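To make the idea concrete, here is a minimal, self-contained sketch of the retrieve-then-prompt loop in plain Python. It is a toy under stated assumptions. The hypothetical toy_retrieve() scores documents by counting shared words instead of using embeddings. The three-sentence corpus is made up. The final call to a generative model is left out. The prompt follows the same "context, then question, then answer" layout used later in this tutorial.

```python
import re

# Hypothetical toy retriever: ranks documents by the number of words they share
# with the question. (The tutorial itself uses OpenAI embeddings + Pinecone.)
def toy_retrieve(question, corpus, top_k=1):
    q_words = set(re.findall(r"[a-z]+", question.lower()))

    def overlap(doc):
        return len(q_words & set(re.findall(r"[a-z]+", doc.lower())))

    # sorted() is stable even with reverse=True, so ties keep corpus order
    return sorted(corpus, key=overlap, reverse=True)[:top_k]

# Augmentation step: place the retrieved text in front of the question.
def build_prompt(question, contexts):
    return (
        "Answer the question based on the context below.\n\n"
        + "\n\n---\n\n".join(contexts)
        + f"\n\nQuestion: {question}\nAnswer:"
    )

corpus = [
    "Pinecone is a vector database for similarity search.",
    "Bananas are rich in potassium.",
    "OpenAI embeddings turn text into vectors.",
]
question = "What does a vector database do?"
prompt = build_prompt(question, toy_retrieve(question, corpus))
print(prompt)  # this prompt would then be sent to a generative model
```

If you swap toy_retrieve() for an embedding search and send the prompt to a chat model, you get the pipeline that the rest of this tutorial builds.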



Prerequisites

To get started with the tutorial, make sure you have the following in place:

Python Environment: Ensure you have Python installed on your system. This tutorial assumes familiarity with Python programming.

API Keys:

  • OpenAI API Key: You'll need an API key from OpenAI, which you can obtain from the OpenAI platform. The code below reads it from the OPENAI_API_KEY environment variable.

  • Pinecone API Key: Similarly, you'll need a Pinecone API key, which you can obtain from the Pinecone platform. The code below reads it from the PINECONE_API_KEY environment variable.
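The tutorial's code reads both keys from environment variables. When a variable is missing, the code quietly falls back to a placeholder string, and the problem only shows up later as an authentication error. If you'd rather fail early, a small hypothetical helper like the one below can check the variables up front. Note that require_env() is our own name and is not part of either library.

```python
import os

# Hypothetical helper: return an environment variable's value, or raise a
# clear error if it is missing or empty. `env` defaults to the real environment
# but can be any mapping, which makes the helper easy to test.
def require_env(name, env=os.environ):
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Please set the {name} environment variable.")
    return value

# Example usage (run once both keys are exported in your shell):
# openai_key = require_env("OPENAI_API_KEY")
# pinecone_key = require_env("PINECONE_API_KEY")
```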


Step 1: Installing Libraries

Before we can proceed with building our question answering system, we need to make sure that we have all the necessary libraries installed. In this step, we'll install the required dependencies using pip.


# installing required libraries
!pip install -qU \
    openai==0.27.7 \
    pinecone-client==3.0.0 \
    pinecone-datasets==0.7.0

This ensures that we have access to the necessary functionality and tools needed to build and run our question answering system. Once the installation is complete, we can move on to the next steps of the process.

Step 2: Building a Knowledge Base

In this step, we lay the groundwork for our question answering system by loading a knowledge base. We don't embed documents ourselves. Instead, we use the pinecone-datasets library to load a ready-made dataset of YouTube transcripts whose embeddings were already computed with OpenAI's text-embedding-ada-002 model.


from pinecone_datasets import load_dataset

# load YouTube transcript chunks with precomputed text-embedding-ada-002 vectors
dataset = load_dataset('youtube-transcripts-text-embedding-ada-002')

# drop the original metadata column and use the 'blob' column, which holds
# the transcript text, as our metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
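If the drop-and-rename step feels abstract, the sketch below replays it on a made-up two-row pandas DataFrame that uses the same column names (dataset.documents is a pandas DataFrame). The vectors and text are invented. The only point is to show which column ends up holding the transcript text that we later read as metadata['text'].

```python
import pandas as pd

# Made-up stand-in for dataset.documents, using the same column names
docs = pd.DataFrame({
    "id": ["doc-0", "doc-1"],
    "values": [[0.1, 0.2], [0.3, 0.4]],                  # tiny fake embeddings
    "metadata": [{"source": "old"}, {"source": "old"}],  # column we discard
    "blob": [{"text": "first transcript chunk"}, {"text": "second transcript chunk"}],
})

# the same two operations as in the tutorial
docs.drop(["metadata"], axis=1, inplace=True)
docs.rename(columns={"blob": "metadata"}, inplace=True)

print(docs.columns.tolist())            # ['id', 'values', 'metadata']
print(docs.loc[0, "metadata"]["text"])  # first transcript chunk
```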

Step 3: Initializing Pinecone Index

The next step is to initialize our connection to Pinecone and create a new index to store our embeddings. This index will serve as the backbone of our retrieval system, allowing us to efficiently query and retrieve relevant information.

We first initialize our connection to Pinecone using its API key.


import os
from pinecone import Pinecone

# initialize connection to Pinecone (get an API key from the Pinecone platform)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# only needed for pod-based deployments
environment = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

# configure client
pc = Pinecone(api_key=api_key)

Before initializing the Pinecone index, we need to configure the specifications for our deployment. Depending on whether we choose a serverless or pod-based deployment, the configuration parameters will vary.

Serverless and pod-based deployments are two different ways of hosting an index. In a serverless deployment, the provider manages the underlying infrastructure. Resources scale up or down automatically with demand, and you are billed for usage rather than for fixed capacity. In a pod-based deployment, you provision capacity yourself by choosing the type and number of pods (pre-configured units of hardware). This gives you more direct control over performance, but you pay for that capacity whether or not it is busy. Serverless deployments are usually simpler and more cost-effective for small applications with unpredictable traffic, while pod-based deployments offer more control for larger, steadier workloads.

In this tutorial we will be using the serverless deployment. For serverless, we specify the cloud provider and region, while for pod-based deployment, we use the environment variable.

Let's take a closer look at how we set up these specifications:


import os
from pinecone import ServerlessSpec, PodSpec

use_serverless = True

if use_serverless:
    # serverless: specify the cloud provider and region
    spec = ServerlessSpec(cloud='aws', region='us-west-2')
else:
    # pod-based: use the environment read in the previous step
    spec = PodSpec(environment=environment)

Here, we import the ServerlessSpec and PodSpec classes from the Pinecone library. These classes represent the specifications for serverless and pod-based deployments, respectively.

We then define the name of our index as 'gen-qa-openai-fast'.


index_name = 'gen-qa-openai-fast'

Before creating the index, we need to make sure it doesn't already exist. Here's how we handle this check:


# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if it does not exist, create the index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of text-embedding-ada-002
        metric='cosine',
        spec=spec
    )

Once the index exists, we connect to it and view its stats.


# connect to index
index = pc.Index(index_name)

# view index stats
index.describe_index_stats()

Once the index is created (or if it already exists), we connect to it using pc.Index(index_name). This allows us to interact with the index and perform operations such as querying and updating.

Finally, to ensure that the index was successfully created and to gather some basic statistics about it, we use the describe_index_stats() method on the index object. This provides information such as the number of vectors stored in the index and its current utilization, helping us confirm that the index setup was successful.

Step 4: Populating the Index with Data

In this step, we add our precomputed text-embedding-ada-002 embeddings to the Pinecone index. By populating the index with our data, we create a searchable database that will enable us to efficiently retrieve relevant information for our question answering system.


# upsert the documents in batches of 100 records
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)

By iterating through the dataset in batches and adding the embeddings to the index using the upsert() method, we create a searchable database that forms the foundation of our question answering system.

The upsert() method both inserts new documents and updates existing ones if they are already present in the index.
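To see what batching and upsert semantics mean in practice, here is a sketch that needs no Pinecone account. batched() and toy_upsert() are hypothetical helpers written for this illustration. The first splits records into lists of at most batch_size items, the same grouping iter_documents(batch_size=100) gives us. The second uses a plain dict keyed by ID to mimic upsert behavior: unseen IDs are inserted and existing ones are overwritten.

```python
from itertools import islice

# Hypothetical helper: yield lists of at most `batch_size` records.
def batched(records, batch_size=100):
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical in-memory stand-in for index.upsert(): a dict keyed by ID.
def toy_upsert(store, batch):
    for record in batch:
        store[record["id"]] = record  # new ID -> insert, existing ID -> overwrite

records = [
    {"id": str(i), "values": [0.0, 1.0], "metadata": {"text": f"chunk {i}"}}
    for i in range(250)
]

store = {}
batch_sizes = []
for batch in batched(records, batch_size=100):
    toy_upsert(store, batch)
    batch_sizes.append(len(batch))

# upserting an existing ID replaces the record instead of adding a new one
toy_upsert(store, [{"id": "0", "values": [1.0, 0.0], "metadata": {"text": "updated"}}])

print(batch_sizes)                     # [100, 100, 50]
print(len(store))                      # 250
print(store["0"]["metadata"]["text"])  # updated
```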

This indexing process ensures that our system can efficiently retrieve relevant information when queried, enabling accurate and timely responses to user questions.

Step 5: Retrieval

In this step, we use OpenAI's text-embedding model to find the most relevant contexts for our queries.

First things first, we need to set up our OpenAI environment: we set our API key and choose the text-embedding model. Now, imagine you've got a burning question, something like, "Which training method should I use for sentence transformers when I only have pairs of related sentences?" That's our query.


import os
import openai

# get an API key from the OpenAI platform
openai.api_key = os.getenv('OPENAI_API_KEY') or 'sk-...'

embed_model = "text-embedding-ada-002"

Using OpenAI's text-embedding-ada-002 model, we transform this query into a numerical representation called a query vector. Think of it as a unique fingerprint that captures the essence of our question. Now, armed with this vector, we're ready to search our Pinecone index for the juiciest bits of information.


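query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

# embed the query with the same model used to build the knowledge base
res = openai.Embedding.create(
    input=[query],
    model=embed_model
)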

We then fire off our query to the index, asking it to find the top two most relevant contexts.


# extract the query vector from the OpenAI response
xq = res['data'][0]['embedding']

# get the two most relevant contexts, including their metadata
res = index.query(vector=xq, top_k=2, include_metadata=True)

Once Pinecone does its thing, we get back a response with the relevant contexts neatly packaged up for us. It's like having the most relevant passages from a library handed to us on a silver platter. Each match carries its ID, a similarity score, and its metadata (which includes the transcript text), so these passages are ready to be handed to the generative model.
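Pinecone performs this search for us, but a brute-force version helps show what comes back. The sketch below is a hypothetical in-memory stand-in for index.query(). It assumes cosine similarity as the metric and scores every stored vector exactly, whereas Pinecone itself uses approximate nearest-neighbor search to stay fast at scale. It returns matches shaped like the ones our code reads, with id, score, and metadata keys. The 3-dimensional vectors are made up; real text-embedding-ada-002 vectors have 1536 dimensions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-in for index.query(): exact search over every record.
def toy_query(records, vector, top_k=2):
    matches = [
        {"id": r["id"], "score": cosine(vector, r["values"]), "metadata": r["metadata"]}
        for r in records
    ]
    matches.sort(key=lambda m: m["score"], reverse=True)
    return {"matches": matches[:top_k]}

records = [
    {"id": "a", "values": [1.0, 0.0, 0.0], "metadata": {"text": "passage A"}},
    {"id": "b", "values": [0.0, 1.0, 0.0], "metadata": {"text": "passage B"}},
    {"id": "c", "values": [0.7, 0.7, 0.0], "metadata": {"text": "passage C"}},
]

res = toy_query(records, [0.9, 0.1, 0.0], top_k=2)
contexts = [m["metadata"]["text"] for m in res["matches"]]
print(contexts)  # ['passage A', 'passage C']
```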

Step 6: Handling Retrieval and Completion

Now that we've retrieved the relevant contexts from our Pinecone index, it's time to handle the retrieval and completion steps of the question answering process.

In this step, we'll define functions to manage the retrieval and completion steps seamlessly.

We've defined a nifty function called retrieve(). Its job is to fetch the relevant contexts for a given query and pack them into a prompt for the model. Imagine it as our trusty assistant scouring a library for the perfect books for our research.


import time

limit = 3750  # maximum number of characters of context to include in the prompt

def retrieve(query):
    # embed the query with the same model used to build the knowledge base
    res = openai.Embedding.create(
        input=[query],
        model=embed_model
    )
    xq = res['data'][0]['embedding']

    # query Pinecone, retrying every 15 seconds (for up to 12 minutes) in case
    # freshly upserted vectors are not yet searchable
    contexts = []
    time_waited = 0
    while time_waited < 60 * 12:
        res = index.query(vector=xq, top_k=3, include_metadata=True)
        contexts = [x['metadata']['text'] for x in res['matches']]
        if len(contexts) >= 3:
            break
        print(f"Retrieved {len(contexts)} contexts, sleeping for 15 seconds...")
        time.sleep(15)
        time_waited += 15

    if not contexts:
        print("Timed out waiting for contexts to be retrieved.")
        contexts = ["No contexts retrieved. Try to answer the question yourself!"]

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n" +
        "Context:\n"
    )
    prompt_end = f"\n\nQuestion: {query}\nAnswer:"

    # append contexts in ranked order until the next one would exceed the limit
    separator = "\n\n---\n\n"
    included = []
    for context in contexts:
        if len(separator.join(included + [context])) > limit:
            break
        included.append(context)
    return prompt_start + separator.join(included) + prompt_end
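The packing loop at the end of retrieve() is easiest to understand with small numbers. The sketch below isolates that idea in a hypothetical pack_contexts() helper (our name, not part of any library). Contexts are added in ranked order and joined with the same "---" separator (surrounded by blank lines) until the next one would push the text past the limit. Note that the limit counts characters, which is only a rough proxy for the model's token budget.

```python
# separator placed between contexts, as in retrieve()
SEPARATOR = "\n\n---\n\n"  # 7 characters

# Hypothetical helper: join contexts in ranked order, stopping before the
# joined text would exceed `limit` characters.
def pack_contexts(contexts, limit):
    included = []
    for context in contexts:
        if len(SEPARATOR.join(included + [context])) > limit:
            break
        included.append(context)
    return SEPARATOR.join(included)

contexts = ["a" * 10, "b" * 10, "c" * 10]

print(len(pack_contexts(contexts, limit=3750)))  # 44: all three fit (30 chars + 2 separators)
print(repr(pack_contexts(contexts, limit=25)))   # only the first; adding the second needs 27
```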

Alright, now that retrieve() has handed us a prompt packed with relevant contexts, it's time to put it to good use with our complete() function. This function is the finishing touch: it sends the prompt to OpenAI's chat model and returns the generated response.


def complete(prompt):
    # instructions for the assistant
    sys_prompt = "You are a helpful assistant that always answers questions."
    # query an OpenAI chat model through the ChatCompletion API
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed model name; substitute the chat model you use
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    return res['choices'][0]['message']['content'].strip()
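The two non-network parts of complete() can be checked without an API key. In the sketch below, build_messages() and extract_answer() are hypothetical helpers (our names). build_messages() reproduces how complete() lays out the system and user messages. extract_answer() reads the answer from the nested choices/message/content fields. The response dict is hand-written, not real model output.

```python
SYS_PROMPT = "You are a helpful assistant that always answers questions."

# Hypothetical helper: the message list complete() sends to the chat model
def build_messages(prompt, sys_prompt=SYS_PROMPT):
    return [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": prompt},
    ]

# Hypothetical helper: pull the answer text out of a ChatCompletion-style dict
def extract_answer(response):
    return response["choices"][0]["message"]["content"].strip()

# hand-written response with the same nesting complete() indexes into
fake_response = {
    "choices": [{"message": {"role": "assistant", "content": "  An example answer.  "}}]
}

messages = build_messages("example prompt")
print([m["role"] for m in messages])  # ['system', 'user']
print(extract_answer(fake_response))  # An example answer.
```

In the real function, these two steps sit on either side of the openai.ChatCompletion.create(...) call.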

And there you have it! With our retrieve() and complete() functions working in tandem, we've got a system that answers questions using the transcripts it retrieves rather than relying on the model's memory alone. It's like having your own personal research assistant at your beck and call!

Now that we have defined these utility functions, it's time to put them to use.

We use the retrieve() function to fetch relevant contexts based on a user query. This function takes the query as input, generates a prompt with the retrieved contexts, and returns the formatted prompt.


# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)

Next, we pass the retrieved prompt to the complete() function, which generates a response using OpenAI's ChatCompletion API. This function completes the context-infused query and returns the generated response.


# then we complete the context-infused query
complete(query_with_contexts)

Step 7: Finalizing and Cleaning Up

In this final step, we'll wrap things up by finalizing our system and cleaning up any resources we no longer need. It's always good practice to tidy up after ourselves and ensure everything is in order.



# delete the index once we no longer need it
pc.delete_index(index_name)

After we've completed our question-answering tasks and no longer need the Pinecone index, we call the delete_index() method to remove it. Deleting the index helps to free up resources and avoid unnecessary costs associated with maintaining unused indexes.

This step ensures that we're being efficient with our resources and keeping our environment tidy.


Conclusion

In this tutorial, we explored retrieval-enhanced generative question answering using OpenAI and Pinecone. By combining the strengths of retrieval-based search with generative language models, we've created a powerful system capable of providing accurate and contextually relevant answers to a wide range of queries.

Throughout our journey, we've learned how to set up a Pinecone index to store and efficiently retrieve information, utilize OpenAI's text-embedding model to retrieve relevant contexts, and use advanced language models to generate responses based on these contexts. We've seen how each step contributes to the overall effectiveness of our question-answering system, from data indexing to response generation.

With this newfound knowledge, the possibilities are endless. Whether it's enhancing customer support systems, building intelligent chatbots, or facilitating information retrieval in complex datasets, retrieval-enhanced generative question answering opens up a world of opportunities for innovation and problem-solving.

As we conclude this tutorial, remember that the key to success lies in experimentation and iteration. Don't be afraid to explore different approaches, tweak parameters such as top_k or the context limit, and adapt your system to your specific needs. With dedication and creativity, you can tap the full potential of retrieval-enhanced generative question answering and perhaps transform the way we interact with information in the digital age.

So, what are you waiting for? Try your hand at retrieval and generative AI to access new possibilities and make a difference in the world of question answering today!


If you require assistance with the implementation of vector databases, or if you need help with related projects, please don't hesitate to reach out to us.

