diff --git a/README.md b/README.md
index 50044322f..2933a31e8 100644
--- a/README.md
+++ b/README.md
@@ -302,6 +302,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
diff --git a/docs/snippets/6_supported-models.snippet b/docs/snippets/6_supported-models.snippet
index 0e69bba41..b35b03496 100644
--- a/docs/snippets/6_supported-models.snippet
+++ b/docs/snippets/6_supported-models.snippet
@@ -47,6 +47,7 @@
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
diff --git a/scripts/supported_models.py b/scripts/supported_models.py
index 03711b806..5d74b1bce 100644
--- a/scripts/supported_models.py
+++ b/scripts/supported_models.py
@@ -200,7 +200,7 @@
         'distilbert-base-uncased',
         'distilbert-base-cased',
     ],
-    'donut': [
+    'donut': [ # NOTE: also a `vision-encoder-decoder`
         # Image-to-text
         'naver-clova-ix/donut-base-finetuned-cord-v2',
         'naver-clova-ix/donut-base-finetuned-zhtrainticket',
@@ -417,6 +417,13 @@
         'hkunlp/instructor-base',
         'hkunlp/instructor-large',
     ],
+    'trocr': [ # NOTE: also a `vision-encoder-decoder`
+        # Image-to-text
+        'microsoft/trocr-small-printed',
+        'microsoft/trocr-base-printed',
+        'microsoft/trocr-small-handwritten',
+        'microsoft/trocr-base-handwritten',
+    ],
     'vision-encoder-decoder': [
         # Image-to-text
         'nlpconnect/vit-gpt2-image-captioning',
diff --git a/src/models.js b/src/models.js
index 7c3b55964..57c8a8cf8 100644
--- a/src/models.js
+++ b/src/models.js
@@ -3707,6 +3707,37 @@ export class SpeechT5HifiGan extends PreTrainedModel {
 }
 //////////////////////////////////////////////////
 
+
+//////////////////////////////////////////////////
+// TrOCR models
+export class TrOCRPreTrainedModel extends PreTrainedModel {
+    /**
+     * Creates a new instance of the `TrOCRPreTrainedModel` class.
+     * @param {Object} config The configuration of the model.
+     * @param {any} session The ONNX session containing the model weights.
+     * @param {GenerationConfig} generation_config The generation configuration.
+     */
+    constructor(config, session, generation_config) {
+        super(config, session);
+        this.generation_config = generation_config;
+
+        // config doesn't contain pad_token_id, so we assume it is the eos_token_id
+        this.config.pad_token_id = this.config.eos_token_id;
+
+        this.num_encoder_layers = this.num_decoder_layers = this.config.decoder_layers;
+        this.num_encoder_heads = this.num_decoder_heads = this.config.decoder_attention_heads;
+        this.encoder_dim_kv = this.decoder_dim_kv = this.config.d_model / this.num_decoder_heads;
+    }
+}
+
+/**
+ * The TrOCR Decoder with a language modeling head.
+ */
+export class TrOCRForCausalLM extends TrOCRPreTrainedModel { }
+
+//////////////////////////////////////////////////
+
+
 //////////////////////////////////////////////////
 // AutoModels, used to simplify construction of PreTrainedModels
 // (uses config to instantiate correct class)
@@ -3897,6 +3928,7 @@ const MODEL_WITH_LM_HEAD_MAPPING_NAMES = new Map([
     ['mpt', ['MptForCausalLM', MptForCausalLM]],
     ['opt', ['OPTForCausalLM', OPTForCausalLM]],
     ['mbart', ['MBartForCausalLM', MBartForCausalLM]],
+    ['trocr', ['TrOCRForCausalLM', TrOCRForCausalLM]],
 ]);
 
 const MODEL_FOR_MASKED_LM_MAPPING_NAMES = new Map([
diff --git a/src/pipelines.js b/src/pipelines.js
index 4873f35a9..b2b3b19f1 100644
--- a/src/pipelines.js
+++ b/src/pipelines.js
@@ -1332,6 +1332,14 @@ export class AutomaticSpeechRecognitionPipeline extends Pipeline {
  * let output = await captioner(url);
  * // [{ generated_text: 'a cat laying on a couch with another cat' }]
  * ```
+ *
+ * **Example:** Optical Character Recognition (OCR) w/ `Xenova/trocr-small-handwritten`.
+ * ```javascript
+ * let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/handwriting.jpg';
+ * let captioner = await pipeline('image-to-text', 'Xenova/trocr-small-handwritten');
+ * let output = await captioner(url);
+ * // [{ generated_text: 'Mr. Brown commented icily.' }]
+ * ```
  */
 export class ImageToTextPipeline extends Pipeline {
     /**
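
Taken together, these changes wire TrOCR into the existing `image-to-text` task: the new `trocr` entry in `MODEL_WITH_LM_HEAD_MAPPING_NAMES` lets the auto-model machinery resolve a TrOCR decoder to `TrOCRForCausalLM`, while the vision encoder is served by the existing `vision-encoder-decoder` path. A minimal end-to-end sketch of how the new support would be exercised, mirroring the JSDoc example added in `src/pipelines.js` (it assumes the package is installed as `@xenova/transformers` and that the converted `Xenova/trocr-small-handwritten` checkpoint is available on the Hub):

```javascript
// Minimal sketch, mirroring the JSDoc example added in src/pipelines.js.
// Assumes `npm install @xenova/transformers` and that the converted
// `Xenova/trocr-small-handwritten` ONNX checkpoint exists on the Hugging Face Hub.
import { pipeline } from '@xenova/transformers';

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/handwriting.jpg';

// 'image-to-text' routes through ImageToTextPipeline; the TrOCR decoder is
// resolved via the new 'trocr' entry in MODEL_WITH_LM_HEAD_MAPPING_NAMES.
const ocr = await pipeline('image-to-text', 'Xenova/trocr-small-handwritten');
const output = await ocr(url);

console.log(output);
// Per the docs example above: [{ generated_text: 'Mr. Brown commented icily.' }]
```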
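One non-obvious detail in `TrOCRPreTrainedModel` is that the decoder dimensions are mirrored onto the encoder-side fields: a standalone TrOCR decoder config (`decoder_layers`, `decoder_attention_heads`, `d_model`) describes no encoder, and downstream generation code reads both sets of fields when sizing past key/value tensors. A sketch of that arithmetic, using illustrative config values (the `[batch, heads, seq_len, dim_kv]` cache layout is an assumption about the library's convention, not something this diff itself shows):

```javascript
// Illustrative values only; real checkpoints define these in config.json.
const config = {
    decoder_layers: 6,
    decoder_attention_heads: 8,
    d_model: 256,
};

// Mirrors the assignments in the TrOCRPreTrainedModel constructor:
const num_decoder_layers = config.decoder_layers;         // 6
const num_decoder_heads = config.decoder_attention_heads; // 8
const dim_kv = config.d_model / num_decoder_heads;        // 256 / 8 = 32

// Assumed cache layout: one key and one value tensor per decoder layer,
// each shaped [batch_size, num_heads, seq_len, dim_kv].
const batch_size = 1;
const seq_len = 0; // the cache is empty before the first generation step
const pastKeyValueShape = [batch_size, num_decoder_heads, seq_len, dim_kv];
console.log(num_decoder_layers, pastKeyValueShape); // 6 [ 1, 8, 0, 32 ]
```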