Document Question Answering with BERT Embeddings
This Python script answers questions about a PDF document using BERT embeddings. It extracts text from a PDF file, preprocesses it with spaCy and the BERT tokenizer, generates sentence embeddings with a pre-trained BERT model, and lets users enter queries about the document. For each query, the script finds the sentences most similar to it and uses the OpenAI API to produce an answer based on those sentences.
Category:
Natural Language Processing (NLP)
Sub-category:
Chatpdf
Overview:
The script takes a PDF file as input and processes it in stages: it extracts the text, preprocesses it with spaCy and the BERT tokenizer, and generates an embedding for each sentence with a pre-trained BERT model. It then accepts user queries about the document. For each query, it retrieves the most similar sentences and passes them to the OpenAI API, which answers the question based on that context.
Description:
The script starts by importing the required libraries: argparse for command-line argument parsing, PyPDF2 for PDF text extraction, openai for calling the OpenAI API, nltk for natural language processing, torch for handling BERT embeddings, spacy for text preprocessing, and transformers for loading the pre-trained BERT model and tokenizer. It also imports the OPEN_API_KEY value required to authenticate with the OpenAI API.
The script defines a function that extracts text from the input PDF file using PyPDF2. Another function uses spaCy and the BERT tokenizer to preprocess the text and generates a BERT embedding for each sentence.
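The extraction and embedding steps could look roughly like the sketch below. It is not the script's actual code: the function names, the bert-base-uncased and en_core_web_sm model names, the normalize_whitespace helper, and the choice of mean pooling over token vectors are all assumptions made for illustration. The heavy libraries are imported inside the functions so the module can be loaded without them.

```python
import re


def extract_text(pdf_path):
    """Return the text of every page in the PDF, joined by newlines."""
    from PyPDF2 import PdfReader  # lazy import; assumes PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    # Pages without a text layer may yield an empty result, so fall back to "".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def normalize_whitespace(text):
    """Collapse whitespace runs left by PDF extraction (hypothetical helper)."""
    return re.sub(r"\s+", " ", text).strip()


def embed_sentences(text, model_name="bert-base-uncased",
                    spacy_model="en_core_web_sm"):
    """Split text into sentences with spaCy and embed each one with BERT.

    Assumption: each sentence vector is the mean of BERT's final-layer
    token vectors; the original script may pool differently.
    """
    import spacy
    import torch
    from transformers import BertModel, BertTokenizer

    nlp = spacy.load(spacy_model)
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()

    doc = nlp(normalize_whitespace(text))
    sentences = [s.text.strip() for s in doc.sents if s.text.strip()]

    embeddings = []
    with torch.no_grad():
        for sentence in sentences:
            inputs = tokenizer(sentence, return_tensors="pt",
                               truncation=True, max_length=512)
            hidden = model(**inputs).last_hidden_state  # (1, tokens, hidden)
            embeddings.append(hidden.mean(dim=1).squeeze(0))  # (hidden,)
    return sentences, embeddings
```

Returning the sentences alongside their embeddings keeps the two lists aligned by index, which makes it straightforward to map a similarity score back to the sentence it belongs to.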
A further function takes a user query and the text, uses OpenAI's text-davinci-003 engine to find the sentence most relevant to the query, and returns the answer from the API.
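A minimal sketch of such a function follows. It assumes the legacy Completion interface of the openai package (versions before 1.0), which is the interface that accepted an engine name such as text-davinci-003; newer package versions use a different client, and this model may no longer be available. The function names, the prompt wording, and the max_tokens and temperature values are hypothetical.

```python
def build_prompt(query, context_sentences):
    """Combine the context sentences and the user's question into one prompt."""
    context = "\n".join(f"- {s}" for s in context_sentences)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer_query(query, context_sentences, api_key, max_tokens=200):
    """Send the prompt to the legacy Completion endpoint and return the text."""
    import openai  # lazy import; assumes openai < 1.0

    openai.api_key = api_key
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=build_prompt(query, context_sentences),
        max_tokens=max_tokens,  # illustrative value, not from the script
        temperature=0,          # illustrative value, favors repeatable answers
    )
    return response.choices[0].text.strip()
```

Keeping the prompt construction in its own function makes it possible to inspect or test the prompt without making a network call.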
A separate function finds the top N sentences most similar to the user's query, ranked by cosine similarity between the query's BERT embedding and the embeddings of the sentences in the document.
The main function uses argparse to read the path of the input PDF file from the command line. It then processes the PDF, generates a BERT embedding for each sentence, and enters a loop in which the user can enter queries. For each query, the code finds the most similar sentences and uses the OpenAI API to answer the question based on them.
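The retrieval step and the command-line entry point can be sketched as follows. This version uses plain Python lists and the standard library instead of torch or numpy so that it runs on its own; the function names, the pdf_path argument name, and the default of three results are assumptions rather than details taken from the script.

```python
import argparse
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_n_similar(query_vec, sentence_vecs, sentences, n=3):
    """Return the n (score, sentence) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), sentence)
        for vec, sentence in zip(sentence_vecs, sentences)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def parse_args(argv=None):
    """Read the PDF path from the command line (argument name is assumed)."""
    parser = argparse.ArgumentParser(
        description="Ask questions about a PDF document."
    )
    parser.add_argument("pdf_path", help="path to the input PDF file")
    return parser.parse_args(argv)
```

With BERT embeddings held as torch tensors, each vector would first need converting with its tolist() method, or the comparison could be done with the tensor library's own cosine similarity routine instead.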
Programming Language:
Python
Libraries:
argparse, PyPDF2, openai, nltk, torch, spacy, transformers, numpy