Seamless is a family of AI models that enable more natural and authentic communication across languages. SeamlessM4T is a massive multilingual multimodal machine translation model supporting around 100 languages. SeamlessM4T serves as the foundation for SeamlessExpressive, a model that preserves elements of prosody and voice style across languages, and SeamlessStreaming, a model supporting simultaneous translation and streaming ASR for around 100 languages. SeamlessExpressive and SeamlessStreaming are combined into Seamless, a unified model featuring multilinguality, real-time and expressive translations.
| | SeamlessM4T v2 | SeamlessExpressive | SeamlessStreaming |
---|---|---|---|
Demo | SeamlessM4T v2 Demo | SeamlessExpressive Demo | |
HuggingFace Space Demo | 🤗 SeamlessM4T v2 Space | 🤗 SeamlessExpressive Space | 🤗 SeamlessStreaming Space |
An exhaustive tutorial was given at the NeurIPS 2023 Seamless EXPO. It is a one-stop shop for learning how to use the entire suite of Seamless models. Please feel free to play with the notebook.
SeamlessM4T is our foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages.
SeamlessM4T models support the tasks of:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
🌟 We are releasing SeamlessM4T v2, an updated version with our novel UnitY2 architecture. This new model improves over SeamlessM4T v1 in quality as well as inference latency in speech generation tasks.
To learn more about the collection of SeamlessM4T models, the approach used in each, their language coverage and their performance, visit the SeamlessM4T README or 🤗 Model Card.
Note
SeamlessM4T is also available in the 🤗 Transformers library. Visit this section for more details.
SeamlessExpressive is a speech-to-speech translation model that captures certain underexplored aspects of prosody such as speech rate and pauses, while preserving the style of one's voice and high content translation quality.
To learn more about SeamlessExpressive models, visit the SeamlessExpressive README or 🤗 Model Card.
SeamlessStreaming is a streaming translation model. The model supports speech as input modality and speech/text as output modalities.
The SeamlessStreaming model supports the following tasks:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Automatic speech recognition (ASR)
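The task lists above can be summarized as a mapping from task to input and output modality. The sketch below is only an illustration built from the lists in this README; the names `SEAMLESS_M4T_TASKS` and `speech_input_tasks` are hypothetical and not part of the package. It shows that the SeamlessStreaming tasks are exactly the SeamlessM4T tasks that take speech as input.

```python
# Task -> (input modality, output modality), as listed in this README.
SEAMLESS_M4T_TASKS = {
    "S2ST": ("speech", "speech"),
    "S2TT": ("speech", "text"),
    "T2ST": ("text", "speech"),
    "T2TT": ("text", "text"),
    "ASR": ("speech", "text"),
}


def speech_input_tasks(tasks):
    """Return the sorted names of the tasks whose input modality is speech."""
    return sorted(name for name, (src, _tgt) in tasks.items() if src == "speech")


# SeamlessStreaming supports speech input only, so its tasks are this subset.
STREAMING_TASKS = speech_input_tasks(SEAMLESS_M4T_TASKS)
print(STREAMING_TASKS)  # ['ASR', 'S2ST', 'S2TT']
```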
To learn more about SeamlessStreaming models, visit the SeamlessStreaming README or 🤗 Model Card.
The Seamless model is the unified model for expressive streaming speech-to-speech translations.
- [12/18/2023] We are open-sourcing our Conformer-based W2v-BERT 2.0 speech encoder as described in Section 3.2.1 of the paper, which is at the core of our Seamless models.
- [12/14/2023] We are releasing the Seamless tutorial given at NeurIPS 2023.
Installation with AIME MLC
Easy installation within an AIME ML-Container.
Clone this repo:
git clone https://github.com/aime-labs/seamless_communication
Create an ML container:
mlc-create -arch CUDA_AMPERE sc_container Pytorch 2.1.1-aime
Run the container:
mlc-open sc_container
Navigate to the destination of this repo and install the requirements:
pip install --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/pt2.1.1/cu118 -r requirements.txt
Note
Transcribing inference audio for computing metrics uses Whisper, which is installed automatically. Whisper in turn requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers.
To run SeamlessM4T as an HTTP/HTTPS API with the AIME API Server, start the following Python script from the command line:
python3 run_with_api_server.py --api_server <address of api server>
This starts SeamlessM4T as a worker that waits for job requests from the AIME API Server.
Here’s an example of using the CLI from the root directory to run inference.
S2ST task:
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio>
T2TT task:
m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang>
Please refer to the inference README for detailed instructions on how to run inference and for the list of languages supported on the source and target sides for the speech and text modalities.
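When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags shown in the two commands above; `build_m4t_predict_command` is a hypothetical helper, not part of the package. Requiring `--src_lang` for every text-input task is an assumption extrapolated from the T2TT example.

```python
import shlex


def build_m4t_predict_command(input_value, task, tgt_lang, src_lang=None, output_path=None):
    """Assemble an m4t_predict argument list (hypothetical helper)."""
    task = task.lower()
    cmd = ["m4t_predict", input_value, "--task", task, "--tgt_lang", tgt_lang]
    if task.startswith("t2"):
        # Assumption: text-input tasks need the source language, as in the T2TT example.
        if src_lang is None:
            raise ValueError("text-input tasks need src_lang")
        cmd += ["--src_lang", src_lang]
    if output_path is not None:
        # Used in the S2ST example to choose where the generated audio is saved.
        cmd += ["--output_path", output_path]
    return cmd


s2st_cmd = build_m4t_predict_command("input.wav", "s2st", "spa", output_path="output.wav")
t2tt_cmd = build_m4t_predict_command("Hello world", "t2tt", "fra", src_lang="eng")

# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(s2st_cmd))
print(shlex.join(t2tt_cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the package is installed; the sketch itself only builds and prints the commands.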
For running S2TT/ASR natively (without Python) using GGML, please refer to the unity.cpp section.
Note
Please check the section on how to download the model.
Here’s an example of using the CLI from the root directory to run SeamlessExpressive inference.
expressivity_predict <path_to_input_audio> --tgt_lang <tgt_lang> --model_name seamless_expressivity --vocoder_name vocoder_pretssel --output_path <path_to_save_audio>
The Streaming Evaluation README has detailed instructions for running evaluations for the SeamlessStreaming and Seamless models. The CLI has a --no-scoring option that can be used to skip the scoring part and just run inference.
You can duplicate the SeamlessStreaming HF space to run the streaming demo.
You can also run the demo locally, by cloning the space from here. See the README of the SeamlessStreaming HF repo for more details on installation.
Running SeamlessM4T & SeamlessExpressive Gradio demos locally
To launch the same demo Space we host on Hugging Face locally:
cd demo
pip install -r requirements.txt
python app.py
Model Name | #params | checkpoint | metrics |
---|---|---|---|
SeamlessM4T-Large v2 | 2.3B | 🤗 Model card - checkpoint | metrics |
SeamlessM4T-Large (v1) | 2.3B | 🤗 Model card - checkpoint | metrics |
SeamlessM4T-Medium (v1) | 1.2B | 🤗 Model card - checkpoint | metrics |
To access and download SeamlessExpressive, please request the model artifacts through this request form. Upon approval, you will then receive an email with download links to each model artifact.
Please note that SeamlessExpressive is made available under its own License and Acceptable Use Policy.
Model Name | #params | checkpoint | metrics |
---|---|---|---|
SeamlessStreaming | 2.5B | 🤗 Model card - monotonic decoder checkpoint - streaming UnitY2 checkpoint | metrics |
The Seamless model is simply the SeamlessStreaming model with the non-expressive vocoder_v2 swapped out for the expressive vocoder_pretssel. Please check out the section above on how to acquire the vocoder_pretssel checkpoint.
Model Name | #params | checkpoint |
---|---|---|
W2v-BERT 2.0 | 600M | 🤗 Model card - checkpoint |
Here's how you should do a forward pass through the speech encoder:
import torch
from fairseq2.data.audio import AudioDecoder, WaveformToFbankConverter
from fairseq2.memory import MemoryBlock
from fairseq2.nn.padding import get_seqs_and_padding_mask
from fairseq2.data import Collater
from pathlib import Path
from seamless_communication.models.conformer_shaw import load_conformer_shaw_model

# Placeholders: set these to your audio file path, torch device and dtype.
audio_wav_path, device, dtype = ...

# Decode the audio file and convert the waveform to 80-bin filterbank features.
audio_decoder = AudioDecoder(dtype=torch.float32, device=device)
fbank_converter = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=device,
    dtype=dtype,
)
collater = Collater(pad_value=1)

model = load_conformer_shaw_model("conformer_shaw", device=device, dtype=dtype)
model.eval()

with Path(audio_wav_path).open("rb") as fb:
    block = MemoryBlock(fb.read())

decoded_audio = audio_decoder(block)
src = collater(fbank_converter(decoded_audio))["fbank"]
seqs, padding_mask = get_seqs_and_padding_mask(src)

# Run the frontend and the encoder without tracking gradients.
with torch.inference_mode():
    seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
    seqs, padding_mask = model.encoder(seqs, padding_mask)
To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out the README here.
Below is the script for efficient batched evaluation.
export MODEL_DIR="/path/to/SeamlessExpressive/model"
export TEST_SET_TSV="input.tsv" # Your dataset in a TSV file, with headers "id", "audio"
export TGT_LANG="spa" # Target language to translate into, options including "fra", "deu", "eng" ("cmn" and "ita" are experimental)
export OUTPUT_DIR="tmp/" # Output directory for generated text/unit/waveform
export TGT_TEXT_COL="tgt_text" # The column in your ${TEST_SET_TSV} for reference target text to calculate BLEU score. You can skip this argument.
export DFACTOR="1.0" # Duration factor for model inference to tune predicted duration (preddur=DFACTOR*preddur) per each position which affects output speech rate. Greater value means slower speech rate (default to 1.0). See expressive evaluation README for details on duration factor we used.
expressivity_evaluate ${TEST_SET_TSV} \
--gated-model-dir ${MODEL_DIR} --task s2st --tgt_lang ${TGT_LANG} \
--audio_root_dir "" --output_path ${OUTPUT_DIR} --ref_field ${TGT_TEXT_COL} \
--model_name seamless_expressivity --vocoder_name vocoder_pretssel \
--text_unk_blocking True --duration_factor ${DFACTOR}
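The script above expects a TSV file with the headers "id" and "audio", plus an optional reference column such as "tgt_text". As a minimal sketch, such a file could be written as follows; `write_test_set_tsv`, the sample ids, the audio paths and the reference sentences are all hypothetical, and how audio paths are resolved against `--audio_root_dir` is not covered here.

```python
import tempfile
from pathlib import Path


def write_test_set_tsv(rows, tsv_path, ref_col="tgt_text"):
    """Write (id, audio, reference text) rows to a tab-separated file with a header."""
    lines = ["\t".join(["id", "audio", ref_col])]
    for sample_id, audio_path, ref_text in rows:
        fields = [str(sample_id), str(audio_path), str(ref_text)]
        # Tabs or newlines inside a field would corrupt the column layout.
        if any("\t" in field or "\n" in field for field in fields):
            raise ValueError(f"field contains a tab or newline: {fields!r}")
        lines.append("\t".join(fields))
    Path(tsv_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Hypothetical sample rows; replace with your own ids, audio paths and references.
rows = [
    ("utt_0001", "audio/utt_0001.wav", "Hola, ¿cómo estás?"),
    ("utt_0002", "audio/utt_0002.wav", "Nos vemos mañana."),
]

# Written to a temporary directory here so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = Path(tmp_dir) / "input.tsv"
    write_test_set_tsv(rows, tsv_path)
    demo_text = tsv_path.read_text(encoding="utf-8")

print(demo_text)
```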
Please check out this README section.
Streaming Evaluation README has detailed instructions for running evaluations on the SeamlessStreaming and Seamless models.
To enable Seamless Communication Everywhere, we implemented unity.cpp so that users can run SeamlessM4T models in GGML, a C tensor library that allows easier integration on various platforms.
To transcribe/translate a given audio file:
./ggml/bin/unity --model seamlessM4T_medium.ggml input.wav
For build details and more usage examples, please check out unity.cpp.
We created two expressive speech-to-speech translation datasets, mExpresso and mDRAL, between English and five other languages: French, German, Italian, Mandarin and Spanish. We currently open-source the speech-to-text portion of mExpresso for out-of-English directions, and we will open-source the remaining parts of the datasets soon. For details, please check out the README.
We’re introducing the first expressive speech alignment procedure. Starting with raw data, the expressive alignment procedure automatically discovers pairs of audio segments sharing not only the same meaning, but the same overall expressivity. To showcase this procedure, we are making metadata available to create a benchmarking dataset called SeamlessAlignExpressive, that can be used to validate the quality of our alignment method. SeamlessAlignExpressive is the first large-scale (11k+ hours) collection of multilingual audio alignments for expressive translation. More details can be found on the SeamlessAlignExpressive README.
Please check out the README here. Note that the SeamlessM4T v1 model uses reduced units, while the other models use non-reduced units.
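To make the distinction concrete, here is a minimal sketch that assumes "reduced" follows the common convention of collapsing each run of consecutive identical units into a single unit. That reading is an assumption; the README linked above is the authority on the exact definition. `reduce_units` is a hypothetical helper, and the unit ids are made up.

```python
from itertools import groupby


def reduce_units(units):
    """Collapse each run of consecutive identical units into a single unit."""
    return [unit for unit, _run in groupby(units)]


non_reduced = [5, 5, 5, 12, 12, 7, 5]
reduced = reduce_units(non_reduced)
print(reduced)  # [5, 12, 7, 5]
```

Only adjacent repeats are merged, so the trailing `5` survives: it starts a new run.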
Seamless Communication depends on 4 libraries developed by Meta.
fairseq2 is our next-generation open-source library of sequence modeling components that provides researchers and developers with building blocks for machine translation, language modeling, and other sequence generation tasks. All SeamlessM4T models in this repository are powered by fairseq2.
SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) is a new multilingual and multimodal sentence embedding space which outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. SONAR provides text and speech encoders for many languages. SeamlessAlign was mined based on SONAR embeddings.
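As a rough illustration of what a similarity search task over sentence embeddings measures, the sketch below computes a search error rate on toy vectors. It is not the xsim tooling and does not use SONAR: the two-dimensional embeddings are made up, the function names are hypothetical, and plain cosine similarity is an assumed scoring function (the real benchmark may score pairs differently).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def search_error_rate(src_embs, tgt_embs):
    """Fraction of sources whose nearest target is not the one at the same index."""
    errors = 0
    for i, src in enumerate(src_embs):
        # Index of the target with the highest cosine similarity to this source.
        best = max(range(len(tgt_embs)), key=lambda j: cosine(src, tgt_embs[j]))
        errors += best != i
    return errors / len(src_embs)


# Toy "aligned" embeddings: src_embs[i] is meant to match tgt_embs[i].
src_embs = [[1.0, 0.0], [0.0, 1.0], [0.3, 0.7]]
tgt_embs = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
print(search_error_rate(src_embs, tgt_embs))
```

A lower error rate means that aligned sentences sit closer together in the embedding space, which is the property that mining aligned pairs relies on.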
BLASER 2.0 is our latest model-based evaluation metric for multimodal translation. It is an extension of BLASER, supporting both speech and text. It operates directly on the source signal, and as such, does not require any intermediate ASR system like ASR-BLEU. As in the first version, BLASER 2.0 leverages the similarity between input and output sentence embeddings. SONAR is the underlying embedding space for BLASER 2.0. Scripts to run evaluation with BLASER 2.0 can be found in the SONAR repo.
As part of the Seamless Communication project, we've extended the stopes library. Version 1 provided a text-to-text mining tool to build training datasets for translation models. Version 2 has been extended, thanks to SONAR, to support tasks around training large speech translation models. In particular, we provide tools to read/write the fairseq audiozip datasets and a new mining pipeline that can do speech-to-speech, text-to-speech, speech-to-text and text-to-text mining, all based on the new SONAR embedding space.
SimulEval is a library used for evaluating simultaneous translation models. SimulEval also provides a backend for generation using partial/incremental inputs with flexible/extensible states, which is used to implement streaming inference. Users define agents that implement SimulEval's interface, and these agents can be connected together in a pipeline. You can find agents implemented for SeamlessStreaming here.
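The idea of stateful agents that consume partial input and are chained into a pipeline can be sketched in plain Python. This is deliberately not SimulEval's API: `WaitKBuffer`, `Upper` and `ToyPipeline` are hypothetical classes, and upper-casing stands in for a real model step.

```python
class WaitKBuffer:
    """Toy agent: read k chunks before emitting, then emit one chunk per new chunk."""

    def __init__(self, k):
        self.k = k
        self.seen = 0
        self.pending = []  # state kept between incremental calls

    def push(self, chunk, finished):
        if chunk is not None:
            self.pending.append(chunk)
            self.seen += 1
        if finished:
            # End of stream: flush everything still buffered.
            out, self.pending = self.pending, []
            return out
        if self.seen >= self.k and self.pending:
            return [self.pending.pop(0)]
        return []


class Upper:
    """Toy stateless agent standing in for a model step."""

    def push(self, chunk, finished):
        return [] if chunk is None else [chunk.upper()]


class ToyPipeline:
    """Feed every output of one agent into the next agent."""

    def __init__(self, *agents):
        self.agents = agents

    def push(self, chunk, finished=False):
        chunks = [chunk]
        for agent in self.agents:
            produced = []
            for i, item in enumerate(chunks):
                is_last = finished and i == len(chunks) - 1
                produced.extend(agent.push(item, is_last))
            # Still signal end-of-stream downstream when nothing was produced.
            chunks = produced if produced or not finished else [None]
        return [c for c in chunks if c is not None]


pipeline = ToyPipeline(WaitKBuffer(k=2), Upper())
words = ["a", "b", "c", "d"]
outputs = [pipeline.push(w, finished=(i == len(words) - 1)) for i, w in enumerate(words)]
print(outputs)  # [[], ['A'], ['B'], ['C', 'D']]
```

The first call produces nothing because the buffer is still waiting for its second chunk. Later calls trail the input by one chunk, and the final call flushes what remains.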
Please check out the README here.
Apart from the SeamlessM4T large (2.3B) and medium (1.2B) models, we are also releasing a small model (281M) targeted for on-device inference. To learn more about the usage and model details, check out the README here.
We open-source the metadata to SeamlessAlign, the largest open dataset for multimodal translation, totaling 270k+ hours of aligned Speech and Text data. The dataset can be rebuilt by the community based on the SeamlessAlign readme.
If you use Seamless in your work or any models/datasets/artifacts published in Seamless, please cite:
@inproceedings{seamless2023,
title="Seamless: Multilingual Expressive and Streaming Speech Translation",
author="{Seamless Communication}, Lo{\"i}c Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-juss{\`a}, Maha Elbayad, Hongyu Gong, Francisco Guzm{\'a}n, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson",
journal={ArXiv},
year={2023}
}
We have three license categories.
The following non-generative components are MIT licensed as found in MIT_LICENSE:
- W2v-BERT 2.0 speech encoder
- Code
- Text only part of the mExpresso dataset found in the SeamlessExpressive README.
- UnitY2 forced alignment extractor found in the UnitY2 Aligner README.
- Speech toxicity tool with the etox dataset found in the Toxicity README.
The following models are CC-BY-NC 4.0 licensed as found in the LICENSE:
- SeamlessM4T models (v1 and v2).
- SeamlessStreaming models.
The following models are Seamless licensed as found in SEAMLESS_LICENSE:
- Seamless models.
- SeamlessExpressive models.