Extracting Structured Data from Organic Synthesis Procedures Using a Fine-Tuned Large Language Model
Organic synthesis procedures are traditionally written as free-form text. This project explores how large language models can convert such unstructured text into structured data that can be used in downstream data science and machine learning applications.
- workplace_data: Data downloading and processing. Most of the organic synthesis procedures, free text or structured, came from the Open Reaction Database.
- workplace_cde: Comparing with chemdataextractor2.
- workplace_evaluation: Model evaluations.
- workplace_finetune: Fine-tuning using LLaMA-Adapter.
- workplace_rclf: Reaction role classification.
This project was conceived during the LLM Hackathon on 2023/03/29. We thank Ben Blaiszik for his generous financial support of this project.
For more details, see
- The 2-min demo video by Marcus Schwarting.
- Section I.C.b of the preprint arXiv:2306.06283.
- The demo app on GitHub Pages.
- This app, demo_apps/github_page, shows precomputed inference results from an OpenAI davinci model. It is a static page from Dash using Epix Zhang's code, and is synced to the github_page branch.
- Data processing and inference scripts for OpenAI models can be found in the folder models_openai. These models were fine-tuned on 300 data points and evaluated on a separate set of 50 data points.
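
The 300/50 split described above can be sketched as follows. This is only an illustration, not the code in models_openai: the function names `split_records` and `to_jsonl`, the placeholder records, and the fixed seed are all assumptions. The prompt/completion JSONL layout is the one used by OpenAI's legacy fine-tuning for davinci-class models; whether this project used exactly that layout is also an assumption.

```python
import json
import random


def split_records(records, n_train=300, n_eval=50, seed=0):
    """Shuffle once, then take disjoint train and eval slices."""
    if len(records) < n_train + n_eval:
        raise ValueError("not enough records for the requested split")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_eval]


def to_jsonl(pairs):
    """Serialize (procedure_text, structured_dict) pairs as JSONL lines."""
    lines = []
    for text, structured in pairs:
        # The structured record is stored as a JSON string in "completion".
        line = {"prompt": text, "completion": json.dumps(structured)}
        lines.append(json.dumps(line))
    return "\n".join(lines)


# Hypothetical placeholder records; real ones would come from workplace_data.
records = [(f"procedure text {i}", {"id": i}) for i in range(400)]
train, evaluation = split_records(records)
train_jsonl = to_jsonl(train)
```

Shuffling once and slicing keeps the two sets disjoint, so no evaluation procedure is seen during fine-tuning.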