SciQu - Automated Data Mining and ML Training Integration

Project Overview

SciQu streamlines the literature review process by automating data extraction and query handling for PDF files. It uses Streamlit for the user interface and LangChain components for backend processing, so that users can retrieve information from scientific documents by asking questions about them.

Key Features

PDF Upload and Processing: Users can upload PDF files, which are processed using the UnstructuredPDFLoader to extract text.
Text Chunking: The extracted text is split into manageable chunks using RecursiveCharacterTextSplitter, with a chunk size of 700 characters and an overlap of 100 characters.
Embedding and Storage: Chunks are embedded using OllamaEmbeddings and stored in a Chroma vector database.
Dynamic Query Handling: Users can query the contents of the uploaded documents through a text input field.
Multi-Perspective Retrieval: Queries are processed using a MultiQueryRetriever, generating multiple perspectives to enhance retrieval accuracy.
Contextual Response Generation: Retrieved contexts are passed to a ChatOllama model to generate responses, which are displayed to the user.
Session History Tracking: Query-answer pairs are saved in the session state for history tracking.
ML Training Integration: Demonstrates training a Random Forest regressor to predict a material property (the refractive index) from other material properties.
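
To make the chunk size and overlap concrete, here is a minimal standard-library sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the RecursiveCharacterTextSplitter itself: the real splitter prefers to break on separators such as paragraphs, lines, and spaces, and treats the chunk size as an upper bound. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 700, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    it cuts at exact character offsets and ignores natural separators.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # with 700/100, each window starts 600 characters later
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one of the two neighbouring chunks, as long as it is shorter than the overlap.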

Project Structure

1. SciQu for Automated Data Mining

Steps:

Upload PDF Files: Users can upload PDF files through the Streamlit interface.
Text Extraction: Uploaded PDFs are processed using UnstructuredPDFLoader to extract text.
Text Chunking: The extracted text is split into chunks of 700 characters with a 100-character overlap using RecursiveCharacterTextSplitter.
Embedding: The text chunks are embedded using OllamaEmbeddings.
Storage: Embedded chunks are stored in a Chroma vector database.
Query Input: Users input queries through a text field.
MultiQuery Retrieval: Queries are processed using MultiQueryRetriever to generate multiple perspectives.
Response Generation: Contexts retrieved are passed to a ChatOllama model to generate responses.
Session State: Query-answer pairs are stored for session history tracking.
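
The steps above can be sketched roughly as follows. This is a hedged outline, not the code in this repository: the function names, the prompt wording, and the Ollama model names are assumptions, and LangChain import paths differ between versions. The LangChain imports are placed inside the function so that the prompt helper can be read and tried without the heavy dependencies installed.

```python
CHUNK_SIZE = 700      # characters per chunk, as described above
CHUNK_OVERLAP = 100   # characters shared by neighbouring chunks


def format_prompt(question: str, contexts: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt.

    The wording is an assumption made for this sketch.
    """
    joined = "\n\n".join(contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}"
    )


def build_retriever(pdf_path: str,
                    embed_model: str = "nomic-embed-text",  # assumed model name
                    chat_model: str = "llama3"):            # assumed model name
    """Load a PDF, chunk it, embed it into Chroma, and wrap it in a MultiQueryRetriever."""
    # Import paths are version-dependent; adjust to the installed LangChain release.
    from langchain_community.document_loaders import UnstructuredPDFLoader
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_community.chat_models import ChatOllama
    from langchain.retrievers.multi_query import MultiQueryRetriever
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    docs = UnstructuredPDFLoader(file_path=pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
    )
    chunks = splitter.split_documents(docs)

    store = Chroma.from_documents(
        documents=chunks, embedding=OllamaEmbeddings(model=embed_model)
    )
    llm = ChatOllama(model=chat_model)
    # The LLM rewrites the query into several variants; results are merged.
    retriever = MultiQueryRetriever.from_llm(retriever=store.as_retriever(), llm=llm)
    return retriever, llm


def answer(question: str, retriever, llm) -> str:
    """Retrieve context for the question and ask the chat model for a response."""
    docs = retriever.invoke(question)
    prompt = format_prompt(question, [d.page_content for d in docs])
    return llm.invoke(prompt).content
```

Running `build_retriever` and `answer` requires a local Ollama server with the chosen models pulled, plus the `unstructured` and `chromadb` packages.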

2. Integration of ML Training with SciQu

Dataset:

Materials: 20 materials and their properties are used as input descriptors for predicting the refractive index. The materials include K2Te, K2O, BaS, Na2Te, SnSe, CaS, MgS, CdI2, CdBr2, YN, HgF2, SnO, BN, PtO2, K2S, BeS, MgI2, RbBr, VCl2, Na2S.

Steps:

Library Installation: Install necessary libraries.
Dataset Loading: Load the dataset containing materials and their properties.
Attribute Extraction: Extract selected attributes, including refractive index, band gap, ferroelectricity, etc.
Data Preprocessing: Check the dataset for any missing values.
Feature Selection: Define input features (X) and the target variable (y), selecting relevant columns.
Data Splitting: Split the data into training and testing sets (70-30 split).
Model Training: Create and train a Random Forest Regressor model with 100 estimators on the training data.
Model Evaluation: Make predictions on the test set and evaluate the model's performance using RMSE and R-squared score.
Visualization: Generate regression and residual plots using Seaborn to visualize model performance.
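
The splitting, training, and evaluation steps above might look like the sketch below. It assumes scikit-learn is installed and that `X` and `y` have already been prepared from the dataset; the function names and the fixed random seed are assumptions, not taken from the repository. RMSE and R-squared are written out by hand to make their definitions explicit.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


def train_and_evaluate(X, y, seed=42):
    """Train a 100-tree Random Forest on a 70-30 split and score the test set.

    `seed` is an assumption added for reproducibility.
    """
    # Imported here so the metric helpers above work without scikit-learn.
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)

    y_pred = [float(v) for v in model.predict(X_test)]
    y_true = [float(v) for v in y_test]
    return model, rmse(y_true, y_pred), r_squared(y_true, y_pred)
```

With only 20 materials, a 70-30 split leaves about six samples for testing, so the reported RMSE and R-squared can vary noticeably between random splits.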

Installation

To set up and run the SciQu tool, follow these steps:

Clone the repository:

```
git clone https://github.com/yourusername/sciqu.git
cd sciqu
```

Create a virtual environment and activate it:

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install the required libraries:
```
pip install -r requirements.txt
```
Run the application:
```
streamlit run app.py
```

Usage

Upload a PDF: Use the file uploader to select a PDF document.
Query the Document: Enter your query in the text input field and submit.
View Responses: The response generated by the ChatOllama model will be displayed, and the query-answer pairs will be saved in the session history.
ML Training: Follow the provided steps to train the ML model using the sample dataset.
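
As an illustration of how query-answer pairs could be kept in the session history, here is a small sketch. The key name `history` and the function `record_exchange` are hypothetical. The function accepts any mutable mapping, which is how Streamlit's `st.session_state` behaves, so it can be tried with a plain dictionary.

```python
from collections.abc import MutableMapping

HISTORY_KEY = "history"  # hypothetical key name, not taken from the repository


def record_exchange(state: MutableMapping, query: str, answer: str) -> list:
    """Append one query-answer pair to the history stored in `state`."""
    if HISTORY_KEY not in state:
        state[HISTORY_KEY] = []  # first query in this session
    state[HISTORY_KEY].append({"query": query, "answer": answer})
    return state[HISTORY_KEY]


# In a Streamlit app this might be used as (assumed usage):
#   import streamlit as st
#   record_exchange(st.session_state, query, response)
#   for item in st.session_state[HISTORY_KEY]:
#       st.write(item["query"], item["answer"])
```

Because the session state lives only as long as the browser session, the history is lost when the session ends unless it is written elsewhere.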

Contributions

Contributions are welcome! Please submit a pull request or open an issue to discuss any changes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Special thanks to Prof. Dipankar Mandal for helpful discussions.
