
Commit

Merge pull request #687 from Tanish2207/fix-issue-681
Add a text-based RAG chatbot
UppuluriKalyani authored Oct 31, 2024
2 parents e2cf40d + a4df74a commit 52719c9
Showing 4 changed files with 294 additions and 0 deletions.
146 changes: 146 additions & 0 deletions Generative Models/RAG-Chatbot/ML_Nexus.txt
@@ -0,0 +1,146 @@
Welcome to the ML-NEXUS Zone⚙️⏳

ML-Nexus
A dynamic hub of Machine Learning innovations, where hands-on projects and collaborative experiments come together to inspire open-source contributions and foster a community of shared learning.

This repository is a diverse collection of projects ranging from beginner-friendly models to advanced AI applications. Whether you're new to the field or a seasoned expert, there's something for everyone to contribute to. Dive into neural networks, computer vision, natural language processing (NLP), and more. Join our vibrant community, share your ideas, and help shape the future of AI—together!

NOTE: You're limited to earning a maximum of 200 points from this repo. Additionally, we can't accept any ideas or features if your score already exceeds 200 points.

Join the official Discord channel for discussion


Natural Language Processing (NLP)
Projects in this area involve working with text data, such as sentiment analysis, language translation, text summarization, and chatbot development, using techniques like tokenization, word embeddings, and transformers.



Computer Vision
Contributors can explore projects related to image classification, object detection, facial recognition, and image segmentation using tools like OpenCV, convolutional neural networks (CNNs), and transfer learning.



Neural Networks
Neural networks power most deep learning models. Contributions could include creating models for image classification, regression tasks, sequence prediction, and generative models using frameworks like TensorFlow or PyTorch.



Generative Models
This includes working on projects related to Generative Adversarial Networks (GANs) for image generation, text-to-image models, or style transfer, contributing to fields like art creation and synthetic data generation.



Time Series Analysis
Contributors can work on analyzing temporal data, building models for stock price prediction, climate forecasting, or IoT sensor data analysis using LSTM or GRU networks.




Transfer Learning
Explore projects where pre-trained models are fine-tuned for specific tasks, such as custom object detection or domain-specific text classification, reducing the need for extensive training data.


📚 Machine Learning Resources
This project uses a number of key libraries to implement machine learning models and data processing pipelines. To help you better understand these libraries and their roles in the project, we've created a dedicated guide.

For an in-depth overview of the most important libraries used in this project, including their features and functionalities, check out the Machine Learning Libraries Overview.

This guide covers:

NumPy 🧮 for numerical computations.
Pandas 📊 for data manipulation.
TensorFlow 🤖 and PyTorch 🔥 for deep learning.
And more!
We encourage you to explore this document to gain a deeper understanding of the tools that power our machine learning workflows.

📚 Generative AI resources
For an in-depth overview and a roadmap to learn Generative AI, check out the Generative AI Roadmap.

This guide covers:

Overview of generative AI
Roadmap to learn Generative AI
LLM models 🤖
Retrieval-Augmented Generation (RAG)
Vector and graph databases
Embedding models
Inference APIs
PDF scraping 🗒️
AI agents 🤖

📚 Deep Learning Roadmap
To get an in-depth overview and roadmap to learn Deep Learning, check out Deep Learning Roadmap.

This guide covers:

Overview of deep learning
Roadmap to learn deep learning
Types of neural networks 🧠
Key deep learning concepts
Regularization techniques 💡
Model optimization 🔧
Transfer learning 🚀
Deep learning applications 📷📝🔊
Best practices and resources


⭐ How to get started with open source?


You can refer to the following articles on the basics of Git and GitHub.

If you have no clue about open source, watch this video to get started
Forking a Repo
Cloning a Repo
How to create a Pull Request
Getting started with Git and GitHub



💥 How to Contribute to ML-Nexus?
Take a look at the Existing Issues or create your own Issues!
Wait for the Issue to be assigned to you.
Fork the repository to your own GitHub account by clicking the Fork button at the top of the page.

Clone the repository to your local machine:

git clone https://github.com/<your-username>/ML-Nexus.git
Navigate into the directory:

cd ML-Nexus
Install dependencies (if applicable):

npm install
Create a new branch for your changes:

git checkout -b <your-branch-name>
Make your changes, commit, and push:

git add .
git commit -m "Your message here"
git push origin <your-branch-name>
Submit a pull request:

Go to the original repository on GitHub.
Click on the "Pull Requests" tab.
Click the "New Pull Request" button.
Select your feature branch and submit the pull request.
Wait for review and feedback.

Address any comments or requested changes.
Once approved, your feature will be merged into the main branch.
Have a look at Contributing Guidelines
Read the Code of Conduct

❤️ Project Admin

👑 Admin
Kalyani

💻 Project Mentors

Sai Nivedh V 🔧 Mentor
Pratyay Banerjee 🔧 Mentor
Binary file added Generative Models/RAG-Chatbot/image.png
119 changes: 119 additions & 0 deletions Generative Models/RAG-Chatbot/rag_llama.py
@@ -0,0 +1,119 @@
import ollama
import faiss
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

modelPath = "sentence-transformers/all-MiniLM-l6-v2"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}

embeddings = HuggingFaceEmbeddings(
    model_name=modelPath, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)


emb = embeddings.embed_query("Hello World")
# print(len(emb))


with open("ML_Nexus.txt", "r", encoding="utf-8") as file:
document_text = file.read()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=90, chunk_overlap=30)
chunks = text_splitter.split_text(document_text)
# print(len(chunks))


from langchain_core.documents import Document

# Wrap each chunk in a LangChain Document object
document_obj = []
for doc_content in chunks:
    temp = Document(page_content=doc_content)
    document_obj.append(temp)


from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(document_obj))]


# print(document_obj)
# print(len(uuids))


# Embed every chunk so the vectors can be added to the FAISS index
final_emb = []
for doc in document_obj:
    embedding = embeddings.embed_query(doc.page_content)
    final_emb.append(embedding)
final_emb = np.array(final_emb, dtype=np.float32)  # FAISS works with float32 vectors
# print(final_emb.shape)


from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

# Create the Faiss index
# dimension = document_embeddings.shape[1]
# print(dimension)
emb = embeddings.embed_query("Hello World")
index = faiss.IndexFlatL2(len(emb)) # Using L2 distance for simplicity

index.add(final_emb)

# Map each FAISS index position to the UUID of the matching document
index_to_docstore_id = {i: uuids[i] for i in range(len(uuids))}

# print(index.d)

# The vectors are already in the index, so register the matching Document
# objects in the docstore directly instead of calling add_documents
# (which would embed and insert every chunk a second time).
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(dict(zip(uuids, document_obj))),
    index_to_docstore_id=index_to_docstore_id,
)


def get_context(query):
    # Retrieve the 5 most similar chunks and join them into one context string
    retriever = vector_store.as_retriever(search_kwargs={"k": 5})
    similar_search_result = retriever.invoke(query)
    compiled_context = "\n\n".join(doc.page_content for doc in similar_search_result)
    return compiled_context


def llm(question):
    compiled_context = get_context(question)

    # Build a prompt that places the retrieved context ahead of the question
    formatted_prompt = """Answer the question below using the context.

Context:

{}

----

Question: {}

Write an answer based on the context.
If the context provides insufficient information, reply
"The information given in the context is insufficient. Thus, answering without context: "
and then answer the question with your existing knowledge.
If quotes are present and relevant, use them in the answer.
""".format(compiled_context, question)

    # Ask the local Llama 3.1 model through Ollama and print its reply
    res = ollama.chat(
        model="llama3.1:latest", messages=[{"role": "user", "content": formatted_prompt}]
    )
    print(res["message"]["content"])
    print("-" * 100)
    # return res["message"]["content"]


queries = [
    "Who are the mentors of the given project?",
    "What is the maximum number of points I can earn from this repository?",
]
for query in queries:
    print(f"Question: {query}\n")
    llm(query)
    print()
29 changes: 29 additions & 0 deletions Generative Models/RAG-Chatbot/readme.md
@@ -0,0 +1,29 @@
# RAG Chatbot
Most large language models can only provide information based on the corpus of data they were trained on.
These models may hallucinate when they lack the required data or context.
This is where Retrieval-Augmented Generation (RAG) helps.

By incorporating a retriever, RAG pulls relevant information from external knowledge sources, such as databases or documents, to enrich the generated output with up-to-date, contextually accurate information. This approach helps to mitigate the limitations of static model training data, enabling real-time responses that adapt to the specific needs of each query.
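
Concretely, the pipeline in `rag_llama.py` boils down to a retrieve-augment-generate loop. The sketch below is a simplified outline of that flow, not the full script: it assumes a FAISS `vector_store` built the way `rag_llama.py` builds one from `ML_Nexus.txt`, and a local Ollama server with the `llama3.1` model available.

```python
import ollama


def answer(question, vector_store, k=5):
    """Minimal retrieve-augment-generate loop, mirroring rag_llama.py."""
    # 1. Retrieve: pull the k chunks most similar to the question from FAISS
    retriever = vector_store.as_retriever(search_kwargs={"k": k})
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. Augment: place the retrieved context ahead of the question in the prompt
    prompt = f"Answer the question using the context.\n\nContext:\n{context}\n\nQuestion: {question}"

    # 3. Generate: ask the local Llama 3.1 model via Ollama
    res = ollama.chat(
        model="llama3.1:latest",
        messages=[{"role": "user", "content": prompt}],
    )
    return res["message"]["content"]
```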

---

The following technologies are used in the making of this RAG-based chatbot:

- `sentence-transformers/all-MiniLM-l6-v2` as the sentence embedding model
- `RecursiveCharacterTextSplitter` for chunking text (see the chunking sketch below)
- Llama 3.1 as the LLM
- FAISS as the vector DB
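
As a small illustration of the chunking step, the snippet below mirrors the settings used in `rag_llama.py` (`chunk_size=90`, `chunk_overlap=30`); the sample sentence is only illustrative.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=90, chunk_overlap=30)
sample = (
    "ML-Nexus is a dynamic hub of Machine Learning innovations, where hands-on "
    "projects and collaborative experiments come together."
)
# Each chunk is at most 90 characters, with roughly 30 characters of overlap
for chunk in splitter.split_text(sample):
    print(repr(chunk))
```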

---
### Outputs

I made a sample text file from the README of the [ML-Nexus](https://github.com/UppuluriKalyani/ML-Nexus) repo.

* Questions I asked:
1. Who are the mentors of the given project?
2. What is the maximum number of points I can earn from this repository?

* Answers:


![](image.png)
