SciQu is a tool that streamlines literature review by automating data extraction and question answering over PDF files. It uses Streamlit for the user interface and LangChain components for backend processing, so that information can be retrieved from scientific documents efficiently and accurately.
- PDF Upload and Processing: Users can upload PDF files, which are processed using the UnstructuredPDFLoader to extract text.
- Text Chunking: The extracted text is split into manageable chunks using RecursiveCharacterTextSplitter, with a chunk size of 700 characters and an overlap of 100 characters.
- Embedding and Storage: Chunks are embedded using OllamaEmbeddings and stored in a Chroma vector database.
- Dynamic Query Handling: Users can query the contents of the uploaded documents through a text input field.
- Multi-Perspective Retrieval: Queries are processed using a MultiQueryRetriever, generating multiple perspectives to enhance retrieval accuracy.
- Contextual Response Generation: Retrieved contexts are passed to a ChatOllama model to generate responses, which are displayed to the user.
- Session History Tracking: Query-answer pairs are saved in the session state for history tracking.
- ML Training Integration: Demonstrates training a machine learning model (a Random Forest Regressor) to predict a material property, the refractive index.
- Upload PDF Files: Users can upload PDF files through the Streamlit interface.
- Text Extraction: Uploaded PDFs are processed using UnstructuredPDFLoader to extract text.
- Text Chunking: The extracted text is split into chunks of 700 characters with a 100-character overlap using RecursiveCharacterTextSplitter.
- Embedding: The text chunks are embedded using OllamaEmbeddings.
- Storage: Embedded chunks are stored in a Chroma vector database.
- Query Input: Users input queries through a text field.
- MultiQuery Retrieval: Queries are processed using MultiQueryRetriever to generate multiple perspectives.
- Response Generation: Contexts retrieved are passed to a ChatOllama model to generate responses.
- Session State: Query-answer pairs are stored for session history tracking.
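
To make the chunk size and overlap in the steps above concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the RecursiveCharacterTextSplitter that SciQu uses: the real splitter prefers to break on separators such as paragraphs and sentences, while this sketch cuts at fixed character offsets. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 700, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Simplified illustration only; it ignores the separator-aware logic of
    LangChain's RecursiveCharacterTextSplitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # each new chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With the defaults, consecutive chunks start 600 characters apart, so the last 100 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.
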
- Materials: The dataset covers 20 materials, whose properties serve as input descriptors for predicting the refractive index. The materials are K2Te, K2O, BaS, Na2Te, SnSe, CaS, MgS, CdI2, CdBr2, YN, HgF2, SnO, BN, PtO2, K2S, BeS, MgI2, RbBr, VCl2, and Na2S.
- Library Installation: Install necessary libraries.
- Dataset Loading: Load the dataset containing materials and their properties.
- Attribute Extraction: Extract selected attributes, including refractive index, band gap, ferroelectricity, etc.
- Data Preprocessing: Check the dataset for any missing values.
- Feature Selection: Define input features (X) and the target variable (y), selecting relevant columns.
- Data Splitting: Split the data into training and testing sets (70-30 split).
- Model Training: Create and train a Random Forest Regressor model with 100 estimators on the training data.
- Model Evaluation: Make predictions on the test set and evaluate the model's performance using RMSE and R-squared score.
- Visualization: Generate regression and residual plots using Seaborn to visualize model performance.
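
The splitting and evaluation steps above can be sketched in plain Python. This is an illustration of what a 70-30 split, RMSE, and the R-squared score compute, not the project's training script, which presumably relies on a machine learning library for these steps. The function names and the fixed seed are assumptions made for the example.

```python
import math
import random


def train_test_split_indices(n_samples: int, test_fraction: float = 0.3, seed: int = 0):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1 - ss_res / ss_tot
```

For the 20 materials listed above, a 70-30 split leaves 14 samples for training and 6 for testing. With a test set that small, RMSE and R-squared can shift noticeably from one random split to another.
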
To set up and run the SciQu tool, follow these steps:
- Clone the repository:

      git clone https://github.com/yourusername/sciqu.git
      cd sciqu

- Create a virtual environment and activate it:

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install the required libraries:

      pip install -r requirements.txt

- Run the application:

      streamlit run app.py
- Upload a PDF: Use the file uploader to select a PDF document.
- Query the Document: Enter your query in the text input field and submit.
- View Responses: The response generated by the ChatOllama model will be displayed, and the query-answer pairs will be saved in the session history.
- ML Training: Follow the provided steps to train the ML model using the sample dataset.
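
The session history mentioned above can be pictured as a list of query-answer pairs kept in a dictionary-like state object. The sketch below uses a plain `dict` as a stand-in for Streamlit's session state; the key name `"history"` and the helper `record_exchange` are hypothetical and may differ from what the app uses.

```python
from collections.abc import MutableMapping


def record_exchange(state: MutableMapping, query: str, answer: str) -> None:
    """Append one query-answer pair to the history stored in `state`."""
    if "history" not in state:
        state["history"] = []  # first query of the session
    state["history"].append({"query": query, "answer": answer})


# Plain dict used here in place of Streamlit's session state.
session = {}
record_exchange(session, "first query", "first answer")
record_exchange(session, "second query", "second answer")
```

Because pairs are appended in order, iterating over `session["history"]` replays the conversation from the oldest exchange to the newest.
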
Contributions are welcome! Please submit a pull request or open an issue to discuss any changes.
This project is licensed under the MIT License. See the LICENSE file for details.
Special thanks to Prof. Dipankar Mandal for helpful discussions.