Skip to content

Latest commit

 

History

History

omniparser

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Thanks for sharing! I’ll align the OmniParser project details to this template for the README.


OmniParser - Multimodal Document and Screen Parsing Application

made-with-python License

This repository features an advanced multimodal parsing application using OmniParser by Microsoft, a robust vision-language model (VLM) developed to interpret complex visual and text elements in documents and screen layouts. OmniParser offers precise, context-aware parsing, making it ideal for applications in automated workflows and agent pipelines.

Overview

OmniParser is a sophisticated model for extracting data from documents, screens, and layouts that integrate both text and visuals. Its specialized architecture allows it to recognize both what an element is and where it is located, significantly enhancing tasks like form automation and screen parsing. OmniParser handles structured, semi-structured, and unstructured data, adapting easily to diverse formats.

This repo includes:

  • Streamlit applications for image parsing and screen layout analysis.
  • Simple setup scripts to enable you to deploy OmniParser with minimal configuration.

Tools & Frameworks:

  • OmniParser: Vision-Language Model specialized for parsing multimodal layouts.
  • Streamlit: Web app framework to demonstrate OmniParser’s parsing and visual layout capabilities.

Steps to Run

1. Clone the Repository

git clone https://github.com/aryankargwal/genai-tutorials
cd omniparser

2. Install Dependencies

Use the command below to install the necessary requirements:

pip install -e .

3. Download Weights

Then download the model ckpts files in: https://huggingface.co/microsoft/OmniParser, and put them under weights/, default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2.

4. Run the Streamlit Applications

For document parsing and layout analysis:

streamlit run app.py

## Features

- **Multimodal Parsing**: OmniParser interprets text, tables, forms, and other layout components with high accuracy.
- **Screen Layout Analysis**: Extracts and visualizes text and element positions, ideal for automating software interaction and support workflows.
- **Efficient and Scalable**: Optimized for real-time performance, providing rapid insights even with complex layouts.

## Future Work
- **Enhanced Fine-tuning**: Additional tutorials for fine-tuning OmniParser on specialized datasets.
- **Expanded Use Cases**: Further explorations into applications for customer service, virtual assistance, and automated form-filling.

## License
This project is licensed under the Apache 2.0 License. See the full license [here](LICENSE).