This repository contains tutorials and code implementations for Multi-Hop Question Answering (QA) using tools such as DSPy, ColBERT, HotPotQA, TuneAPI, and Qwen 2.5 72B. Multi-hop QA is a task in which answering a question requires synthesizing information from multiple documents.
The project demonstrates how to combine powerful retrieval and language-modeling techniques to tackle multi-hop question-answering tasks. DSPy structures the data flow and multi-step reasoning, ColBERT performs dense retrieval, and Qwen 2.5 72B serves as the generative backbone that produces the final answers. The HotPotQA dataset is used for training and evaluation.
- DSPy: A modular framework for structuring multi-hop question-answering pipelines.
- ColBERT: A dense retrieval model that fetches relevant passages for complex QA.
- TuneAPI: A proxy API for interacting with language models like Qwen.
- Qwen 2.5 72B: A state-of-the-art large language model for reasoning and text generation.
- HotPotQA: A dataset specifically designed for multi-hop QA.
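To show how these components fit together, here is a minimal configuration sketch using DSPy's settings API. The endpoint URLs and the Qwen model identifier are placeholders, and the exact client class (`dspy.OpenAI` vs. the newer `dspy.LM`) depends on your DSPy version, so treat this as an illustration rather than the repository's exact code.

```python
import os
import dspy

# Placeholder endpoints -- substitute the ColBERT retrieval server and TuneAPI base URL you actually use.
COLBERT_URL = "http://your-colbert-server:8893/api/search"
TUNE_API_BASE = "https://your-tune-proxy.example.com/v1"

# Dense retriever: DSPy's ColBERTv2 client queries a hosted index over HTTP.
colbert = dspy.ColBERTv2(url=COLBERT_URL)

# Generative model: this sketch assumes TuneAPI exposes an OpenAI-compatible interface,
# so an OpenAI-style client is pointed at the proxy.
qwen = dspy.OpenAI(
    model="qwen/qwen-2.5-72b",       # model identifier is an assumption
    api_key=os.environ["API_KEY"],   # exported as an environment variable (see setup below)
    api_base=TUNE_API_BASE,
)

# Register both so every DSPy module in the pipeline can use them.
dspy.settings.configure(lm=qwen, rm=colbert)
```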
Clone the repository to your local machine:
git clone https://github.com/aryankargwal/genai-tutorials.git
cd genai-tutorials/nlp-tutorials
Install the required dependencies:
pip install -r requirements.txt
To use the TuneAPI for Qwen, export your API key as an environment variable:
export API_KEY="your_api_key_here"
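Inside Python, the exported key can be read back from the environment before building the TuneAPI client. A small sketch, assuming the same `API_KEY` variable name as above:

```python
import os

# Read the TuneAPI key exported above and fail early with a clear message if it is missing.
api_key = os.environ.get("API_KEY")
if not api_key:
    raise RuntimeError('API_KEY is not set. Run: export API_KEY="your_api_key_here"')
```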
Run the main script to perform multi-hop QA:
python multihopqa.py
This will initiate the multi-hop reasoning process by:
- Loading the HotPotQA dataset.
- Using ColBERT for dense passage retrieval.
- Utilizing Qwen 2.5 72B to generate answers based on the retrieved contexts.
- Data Loading: The HotPotQA dataset is loaded and split into train/dev sets.
- Retrieval: The ColBERT model retrieves relevant passages from a knowledge base using the input question.
- Reasoning: The Qwen 2.5 72B model, via the TuneAPI, processes the retrieved context to answer the question.
- Prediction: The final answer and the retrieved contexts are returned (see the sketch below).
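These steps map onto only a few lines of DSPy. The sketch below uses DSPy's bundled HotPotQA loader and a single-hop retrieve-then-answer module to illustrate the flow; the split sizes and `k` are illustrative, and it assumes the language model and retriever have been configured as shown earlier. The multi-hop version is covered under Simplified Baleen below.

```python
import dspy
from dspy.datasets import HotPotQA

# 1. Data loading: DSPy ships a HotPotQA wrapper that returns train/dev splits.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)
trainset = [ex.with_inputs("question") for ex in dataset.train]
devset = [ex.with_inputs("question") for ex in dataset.dev]

# 2. Retrieval: pull the top-k passages for a question from the ColBERT index.
retrieve = dspy.Retrieve(k=3)
question = devset[0].question
passages = retrieve(question).passages

# 3.-4. Reasoning and prediction: the configured LM (Qwen 2.5 72B via TuneAPI)
# answers from the retrieved context.
generate_answer = dspy.ChainOfThought("context, question -> answer")
prediction = generate_answer(context=passages, question=question)
print(prediction.answer)
```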
- Multi-Hop Reasoning: Tackles complex QA tasks that require synthesizing information from multiple sources.
- Dense Retrieval with ColBERT: Efficient passage retrieval from large knowledge bases.
- State-of-the-Art Generative Model: Uses Qwen 2.5 72B to process and answer questions.
- HotPotQA Support: Loads and evaluates on the popular QA dataset tailored for multi-hop reasoning.
Simplified Baleen is a key component of this repository, designed to streamline the combination of retrieval and generation for multi-hop reasoning tasks. It integrates ColBERT for retrieval with Qwen 2.5 72B for reasoning and answering. The name Baleen is inspired by the baleen plates of whales, which efficiently filter food from large volumes of water, just as the system filters large corpora to retrieve the relevant passages.
- Unified Interface: Simplified Baleen abstracts the retrieval and reasoning process into a single pipeline, making it easier to use for complex QA tasks.
- Retrieval-Augmented Generation: Baleen leverages retrieval models like ColBERT to provide context for the generative model, allowing the language model to answer multi-hop questions more effectively.
- Customizable Pipeline: It allows users to define the retrieval method and the language model, creating a flexible question-answering pipeline (see the sketch below).
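The sketch below follows the simplified Baleen pattern from the DSPy documentation rather than this repository's exact code: each hop asks the language model for a search query, retrieves passages with ColBERT, and the accumulated context is finally passed to an answer generator. The hop count, passages per hop, and the inline deduplication are illustrative defaults.

```python
import dspy

class SimplifiedBaleen(dspy.Module):
    """Multi-hop QA: generate a search query per hop, retrieve, then answer."""

    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        # One query generator per hop, so each hop conditions on the context gathered so far.
        self.generate_query = [
            dspy.ChainOfThought("context, question -> search_query") for _ in range(max_hops)
        ]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            # Ask the LM for the next search query given everything retrieved so far.
            query = self.generate_query[hop](context=context, question=question).search_query
            passages = self.retrieve(query).passages
            # Keep passages unique while preserving order.
            context = list(dict.fromkeys(context + passages))
        answer = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=answer.answer)
```

With the language model and retriever configured, the pipeline is called directly, for example `pred = SimplifiedBaleen()(question="...")`; `pred.answer` and `pred.context` then hold the prediction and the supporting passages.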
This repository uses the HotPotQA dataset, designed for multi-hop QA. You can find more details on the [HotPotQA website](https://hotpotqa.github.io/).
- Fine-Tuning: Future releases will include scripts for fine-tuning the model on custom datasets.
- Enhanced Retrieval: Improved passage retrieval techniques are in progress to further enhance the accuracy of multi-hop reasoning.
This project is licensed under the Apache 2.0 License. See the LICENSE file in the repository for the full text.