Add support for TrOCR models (for Optical Character Recognition) (#375)
* Add support for `TrOCR` models

* Add example code snippet for OCR
xenova authored Nov 8, 2023
1 parent f7321ad commit 6b6e03b
Showing 5 changed files with 50 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -302,6 +302,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -47,6 +47,7 @@
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
9 changes: 8 additions & 1 deletion scripts/supported_models.py
@@ -200,7 +200,7 @@
'distilbert-base-uncased',
'distilbert-base-cased',
],
'donut': [ # NOTE: also a `vision-encoder-decoder`
# Image-to-text
'naver-clova-ix/donut-base-finetuned-cord-v2',
'naver-clova-ix/donut-base-finetuned-zhtrainticket',
@@ -417,6 +417,13 @@
'hkunlp/instructor-base',
'hkunlp/instructor-large',
],
'trocr': [ # NOTE: also a `vision-encoder-decoder`
# Image-to-text
'microsoft/trocr-small-printed',
'microsoft/trocr-base-printed',
'microsoft/trocr-small-handwritten',
'microsoft/trocr-base-handwritten',
],
'vision-encoder-decoder': [
# Image-to-text
'nlpconnect/vit-gpt2-image-captioning',
32 changes: 32 additions & 0 deletions src/models.js
@@ -3707,6 +3707,37 @@ export class SpeechT5HifiGan extends PreTrainedModel {
}
//////////////////////////////////////////////////


//////////////////////////////////////////////////
// TrOCR models
export class TrOCRPreTrainedModel extends PreTrainedModel {
/**
* Creates a new instance of the `TrOCRPreTrainedModel` class.
* @param {Object} config The configuration of the model.
* @param {any} session The ONNX session containing the model weights.
* @param {GenerationConfig} generation_config The generation configuration.
*/
constructor(config, session, generation_config) {
super(config, session);
this.generation_config = generation_config;

// config doesn't contain pad_token_id, so we assume it is the eos_token_id
this.config.pad_token_id = this.config.eos_token_id;

this.num_encoder_layers = this.num_decoder_layers = this.config.decoder_layers;
this.num_encoder_heads = this.num_decoder_heads = this.config.decoder_attention_heads;
this.encoder_dim_kv = this.decoder_dim_kv = this.config.d_model / this.num_decoder_heads;
}
}

/**
* The TrOCR Decoder with a language modeling head.
*/
export class TrOCRForCausalLM extends TrOCRPreTrainedModel { }

//////////////////////////////////////////////////
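The `TrOCRPreTrainedModel` constructor above fills in two gaps in the exported decoder config: a missing `pad_token_id` (falling back to `eos_token_id`) and the layer/head/key-value dimensions needed to pre-allocate the KV cache. The following is a standalone sketch of that derivation, not the library's actual code path; the config numbers in the example are illustrative values in the shape of a small TrOCR decoder config, not taken from a real checkpoint.

```javascript
// Sketch of the derived fields computed in TrOCRPreTrainedModel's constructor.
function deriveTrocrDims(config) {
    // The TrOCR config has no pad_token_id, so fall back to eos_token_id.
    const pad_token_id = config.pad_token_id ?? config.eos_token_id;

    // Only the decoder is exported as a standalone causal LM, so the
    // "encoder" and "decoder" dimensions both come from the decoder config.
    const num_layers = config.decoder_layers;
    const num_heads = config.decoder_attention_heads;
    const dim_kv = config.d_model / num_heads; // per-head key/value size

    return { pad_token_id, num_layers, num_heads, dim_kv };
}

// Illustrative (made-up) config values:
const dims = deriveTrocrDims({
    eos_token_id: 2,
    decoder_layers: 6,
    decoder_attention_heads: 8,
    d_model: 256,
});
// dims.dim_kv === 256 / 8 === 32
```

The per-head `dim_kv` is what determines the shape of each cached key/value tensor during generation, which is why the constructor computes it up front rather than at each decoding step.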


//////////////////////////////////////////////////
// AutoModels, used to simplify construction of PreTrainedModels
// (uses config to instantiate correct class)
@@ -3897,6 +3928,7 @@ const MODEL_WITH_LM_HEAD_MAPPING_NAMES = new Map([
['mpt', ['MptForCausalLM', MptForCausalLM]],
['opt', ['OPTForCausalLM', OPTForCausalLM]],
['mbart', ['MBartForCausalLM', MBartForCausalLM]],
['trocr', ['TrOCRForCausalLM', TrOCRForCausalLM]],
]);

const MODEL_FOR_MASKED_LM_MAPPING_NAMES = new Map([
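Registering `['trocr', ['TrOCRForCausalLM', TrOCRForCausalLM]]` in `MODEL_WITH_LM_HEAD_MAPPING_NAMES` is what lets the auto-model machinery pick the right class from `config.model_type`. A minimal sketch of that dispatch pattern, with a hypothetical stand-in class rather than the library's real one:

```javascript
// Stand-in for an actual model class (hypothetical, for illustration only).
class FakeTrOCRForCausalLM {
    constructor(config) { this.config = config; }
}

// A mapping from `config.model_type` to model class, in the spirit of
// MODEL_WITH_LM_HEAD_MAPPING_NAMES above.
const LM_HEAD_MAPPING = new Map([
    ['trocr', FakeTrOCRForCausalLM], // entry added by this commit
]);

// AutoModel-style factory: look up the class by model type and instantiate it.
function autoModelFromConfig(config) {
    const cls = LM_HEAD_MAPPING.get(config.model_type);
    if (!cls) throw new Error(`Unsupported model type: ${config.model_type}`);
    return new cls(config);
}

const model = autoModelFromConfig({ model_type: 'trocr' });
// model is an instance of FakeTrOCRForCausalLM
```

Because dispatch is keyed on `model_type`, supporting a new architecture only requires exporting the class and adding one mapping entry, which is exactly the shape of this diff.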
8 changes: 8 additions & 0 deletions src/pipelines.js
@@ -1332,6 +1332,14 @@ export class AutomaticSpeechRecognitionPipeline extends Pipeline {
* let output = await captioner(url);
* // [{ generated_text: 'a cat laying on a couch with another cat' }]
* ```
*
* **Example:** Optical Character Recognition (OCR) w/ `Xenova/trocr-small-handwritten`.
* ```javascript
* let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/handwriting.jpg';
* let captioner = await pipeline('image-to-text', 'Xenova/trocr-small-handwritten');
* let output = await captioner(url);
* // [{ generated_text: 'Mr. Brown commented icily.' }]
* ```
*/
export class ImageToTextPipeline extends Pipeline {
/**
