This project leverages large language model (LLM) agents within LangGraph to develop workflows for processing academic manuscripts. Users can execute these workflows individually through the provided Jupyter notebook (currently the more reliable option) or orchestrate the entire process via a robust LLM accessed through a Streamlit interface. The ultimate aim is to create an app that simplifies some of the more tedious aspects of academic research for researchers. For instance, a tangible goal would be to summarize the state of the art and the historical development of a chosen topic, providing a comprehensive survey. A longer-term goal will be to add the ability for the workflow to explain mathematical proofs and even categorize them.
Before you begin, ensure you have the following installed:
- Git
- Python 3.8 or later
- pip (Python package installer)
-
Arxiv Paper Retrieval: Given a list of keywords, this workflow retrieves the most relevant papers from arXiv.
-
PDF to Text: Utilizes Nougat from Meta for OCR conversion of papers. This method, while effective, requires significant computational resources.
-
OCR Enhancer: Since Nougat sometimes distorts citation formats, this workflow uses MuPDF to generate a secondary text file. Although MuPDF's quality is lower (especially for mathematical content), it corrects citation formats. By comparing the two text files, the workflow merges the best aspects of both.
-
Proof Remover: Aiming to clean texts by removing proofs before summarization, this workflow has been the least successful. Suggestions for improvement are welcome.
-
Keyword Extraction and Topic Summarization: This workflow extracts keywords and summarizes the content of an article, page by page. There is potential for improvement by performing multiple passes over the paper.
-
Citation_Extraction: This workflow extracts citations from a paper. You can provide a context file (like summary) and a type of citations (Full citations, most important, etc etc)
-
Context-Based Translation: This workflow has been particularly successful. It uses the summarization from the previous step to translate the article into different languages. Context-based translation ensures that the terminology is appropriate for the specific academic community.
-
Folder awareness: The manager is aware of the conents of the two folders (pdfs/markdowns) and can guide you to open the right files. Also minimizing the possibility of using the tools wrong
-
Take A Peak: It looks inside a file for some quick information. Good for getting citations out of a file to push to the ArXiv retrieval
-
Survey Creation: The LLM will identify the most relevant citations in a paper, retrieve related papers from arXiv or other sources, process each paper using the current workflows, and compile a comprehensive survey. This feature aims to streamline the creation of complex surveys.
-
Proof Explainer: This ambitious feature intends to analyze mathematical texts, generate context, and explain proofs. A secondary goal is to categorize proofs with similar arguments, enhancing understanding and learning.
This project continues to evolve, aiming to significantly aid academic researchers by automating and improving various manuscript processing tasks.
git clone https://github.com/artnoage/Langgraph_Manuscript_Workflows.git
cd Langgraph_Manuscript_Workflows
python -m venv Langgraph_Manuscript_Workflows
conda create -n Langgraph_Manuscript_Workflows python=3.8
- Windows
.\Langgraph_Manuscript_Workflows\Scripts\activate
- macOS/Linux
source Langgraph_Manuscript_Workflows/bin/activate
conda activate Langgraph_Manuscript_Workflows
- Windows
pip install -r requirements_win.txt
- macOS/Linux
pip install -r requirements_linux.txt
pip install -r requirements_conda.txt
python create_env.py
python create_env.py
If you don't have an OPENAI_API_KEY, get one from OpenAI and add it to the .env
file:
OPENAI_API_KEY="your_key"
If you don't have an NVIDIA_API_KEY, get one and some free credits from NVIDIA and add it to the .env
file:
NVIDIA_API_KEY="your_key"
You have three options to work with the various mini-workflows in the project:
-
Jupyter Notebook: Use
playground.ipynb
to run the workflows independently. This option provides more control over the LLMs used and gives insight into the inner workings of each workflow. Recommended for first-time users. -
Terminal: Run the
metaworkflow.py
file:python metaworkflow.py
-
Streamlit UI: Run the Streamlit app:
streamlit run streamlit_app.py
Streamlit has various issues 1. LLM APIs have difficulty sending the cancelling signal to the tool calls while they are running. 2. The tool printouts are not currently forwarded to the Streamlit interface. For these reasons, it is recommended to use only with small files and only for fun. For main use, try the playground or metaworflow.py
If you have key errors, deactivate and activate the enviroment.
Contributions are welcome! You can contribute significantly without any coding knowledge by modifying the system prompts in the prompts.py
file until the corresponding workflow behaves better. This process is known as "prompt engineering."
If you have experience with Streamlit, you can suggest ways to stream the printouts from the tools to the Streamlit interface.
You can also create different workflows that you think are relevant to manipulating academic manuscripts.
- Slide template: Srikar Sharma Sadhu (https://tinyurl.com/3n2cfpda)
- OCR: Nougat from Meta (https://github.com/facebookresearch/nougat)
- Assistance from ChatGPT and Claude Opus
Special thanks to:
- My girlfriend for her patience during the preparation of this project for the NVIDIA competition.
- My friend Anna for participating in the promotional video.
- Nikos Mouratidis and Kostas Kostopoulos for beta testing.
This project is licensed under the MIT License - see the LICENSE file for details.