Seamless Speech-to-Speech Translation with Voice Replication (S3TVR) is an advanced cascaded AI framework for real-time speech-to-speech translation that preserves the speaker's voice characteristics in a zero-shot fashion. The project balances latency against output quality, currently targets English and Spanish, and combines multiple open-source models and algorithms. The system is optimized for local execution, enabling dynamic and efficient voice translation with an average latency of ~3 seconds per sentence.
- Voice Activity Detection (VAD): Differentiates speech from silence for efficient segmentation.
- Noise Reduction Model: Enhances audio clarity.
- Automatic Speech Recognition (ASR): Converts speech to text.
- Machine Translation Model (MT): Translates text between languages.
- Text-to-Speech (TTS) Synthesis: Converts translated text back to speech with voice replication.
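The stages above form a single cascade: VAD segments the incoming audio, and each speech segment flows through ASR, MT, and TTS in turn. A minimal sketch of that flow (all class and method names here are illustrative placeholders, not the project's actual API):

```python
# Illustrative sketch of the S3TVR cascade. Every class and method name
# below is a stand-in, not the project's real API.
class Pipeline:
    def __init__(self, vad, asr, mt, tts):
        self.vad, self.asr, self.mt, self.tts = vad, asr, mt, tts

    def run(self, audio_stream):
        outputs = []
        # 1. VAD splits the stream into speech segments, dropping silence.
        for segment in self.vad.split(audio_stream):
            # 2. ASR converts the speech segment to text.
            text = self.asr.transcribe(segment)
            # 3. MT translates the text (e.g. English -> Spanish).
            translated = self.mt.translate(text)
            # 4. TTS synthesizes the translation, cloning the original voice
            #    from the same segment.
            outputs.append(self.tts.synthesize(translated, speaker=segment))
        return outputs
```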
- Zero-shot speech-to-speech translation with voice replication
- Multilingual support: SOON
- Clustered fine-tuned XTTS_V2: SOON
- Fine-tuning XTTS_V2 structure
- Automatic integration of fine-tuned XTTS_V2: SOON
- Direct model downloading: SOON
- Hugging Face Space: SOON
- NVIDIA GPU: The system is tested with an NVIDIA GPU that supports CUDA, specifically on CUDA 12.1.
- Note: On Windows, newer CUDA versions may not be supported by DeepSpeed. Please refer to the official DeepSpeed tutorial for further information.
- Memory: A minimum of 16 GB of RAM is necessary to handle the model's computational requirements efficiently.
- Python Version: The framework requires Python 3.
- Note: Specific libraries might require particular versions. For best compatibility and performance, the framework has been tested on Python 3.11.7.
Create a New Conda Environment:
conda create --name s3tvr_env python=3.11
conda activate s3tvr_env
pip install -r requirements.txt
- For Linux users, follow this installation guide.
- For Windows users, follow this tutorial.
- For advanced TTS installation, follow the installation guide in the Coqui TTS repository.
The project utilizes several models for different tasks, including:
- Automatic Speech Recognition (ASR):
- Parakeet CTC 1.1B Model (by NVIDIA)
- STT Es FastConformer Hybrid Transducer-CTC Large P&C Model (by NVIDIA)
- Machine Translation (MT):
- NLLB-200 (by Meta)
- Text-to-Speech (TTS):
- XTTS V2 (by Coqui)
Each model is stored in the models folder. To add a new model:
- Place the model files in the models folder.
- Create a function to initialize the model.
- Create another function to process the input with the model.
- Modify run.py and stream_VAD.py to use the new model.
Example initialization and processing functions for a model (SomeModelClass is a placeholder for the actual model class):

def initialize_model(model_path):
    # Load the pretrained weights from the models/ folder.
    model = SomeModelClass.from_pretrained(model_path)
    return model

def process_input(model, input_data):
    # Run inference on a single input.
    output = model(input_data)
    return output
- Manual Download: XTTS_V2 needs to be downloaded manually and passed to the main workflow.
- Alternatively, the model weights can be downloaded automatically from Hugging Face by running:
python xtts_v2_download.py
Using the TTS API: the implemented methods can also use the Coqui TTS API, with the necessary workflow modifications.
To run the Framework, use the following command:
python run.py
Example command with arguments:
python run.py /path/to/xtts /path/to/xtts_config en --record_temp record_temp.json --record_per record_per.json --record_path audio_segments/ --result_dir results --segments_dir audio_segments/
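Based on that example command, run.py's command-line interface could be parsed roughly as follows. This is a sketch inferred from the flags shown above; the argument names and defaults in the actual run.py may differ:

```python
import argparse

def build_parser():
    # Sketch of a parser matching the example command line above
    # (names and defaults inferred from the README, not from run.py itself).
    p = argparse.ArgumentParser(description="S3TVR speech-to-speech translation")
    p.add_argument("xtts_path", help="path to the XTTS_V2 model weights")
    p.add_argument("xtts_config", help="path to the XTTS_V2 config file")
    p.add_argument("language", help="source language code, e.g. 'en' or 'es'")
    p.add_argument("--record_temp", default="record_temp.json",
                   help="temporary recording metadata file")
    p.add_argument("--record_per", default="record_per.json",
                   help="per-segment recording metadata file")
    p.add_argument("--record_path", default="audio_segments/",
                   help="directory for recorded audio")
    p.add_argument("--result_dir", default="results",
                   help="directory for translated output")
    p.add_argument("--segments_dir", default="audio_segments/",
                   help="directory for speech segments")
    return p
```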
This project is part of my bachelor's thesis for Computer Science and Artificial Intelligence at IE University. It aims to reduce linguistic barriers in real-time communication by integrating various AI models for seamless speech-to-speech translation and voice replication.
Latency is a critical metric for evaluating the efficiency of real-time speech-to-speech translation systems. The S3TVR Framework has been benchmarked for both English and Spanish translations, providing insights into its performance under various conditions. The following table summarizes the average, best-case, and worst-case latencies recorded for the Framework:
| Metric | English | Spanish |
|---|---|---|
| Average | 3.09 seconds | 3.27 seconds |
| Best Case | 1.92 seconds | 1.88 seconds |
| Worst Case | 6.95 seconds | 7.95 seconds |
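The average, best-case, and worst-case figures above can be derived from raw per-sentence latencies. A small helper illustrating the computation (the sample values below are made up for demonstration and are not the thesis measurements):

```python
from statistics import mean

def latency_summary(latencies):
    # Summarize per-sentence latencies (in seconds) into the three
    # benchmark metrics reported above.
    return {
        "average": round(mean(latencies), 2),
        "best": min(latencies),
        "worst": max(latencies),
    }

# Made-up sample data for illustration only.
sample = [2.1, 3.4, 2.9, 6.0]
print(latency_summary(sample))
```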
Understanding the inherent characteristics of the S3TVR Framework is essential for appreciating its adaptability and performance in different scenarios. The table below outlines key features of the Framework, including its adaptability, customizability, latency control, and resource efficiency. These features highlight the Framework's design philosophy and its suitability for various applications:
| Feature | Details |
|---|---|
| Adaptability | Designed for easy adaptation to new models |
| Customizability | Supports adjustments to model parameters and configurations to meet specific needs |
| Latency Control | Implements strategies to minimize processing times and maintain low latency |
| Resource Efficiency | Optimized for local execution with manageable resource requirements, suitable for production environments |
To provide a comprehensive evaluation, the S3TVR Framework is compared against the Seamless Streaming model. This comparison focuses on key performance indicators such as latency, memory usage, translation quality, flexibility, and resource efficiency. By examining these aspects, we can understand the strengths and limitations of each model, guiding potential improvements and use-case considerations:
| Feature | S3TVR Framework (English) | S3TVR Framework (Spanish) | Seamless Streaming Model |
|---|---|---|---|
| Latency | Average: 3.09 seconds | Average: 3.27 seconds | 4.59 to 4.73 seconds, depending on threshold settings |
| Memory Usage | 5618 MB | 3641 MB | Large: 2326 MB, Medium: 1151 MB |
| Translation Quality (BLEU Score) | 0.36 | 0.41 | 0.198 to 0.203, depending on threshold |
| Flexibility | High: modular design allows easy updates | High: modular design allows easy updates | Less modular, more rigid design |
| Resource Efficiency | Optimized for local execution | Optimized for local execution | Requires more computational resources; less optimized for local execution |
Each of the models integrated into S3TVR is trained on its own dataset and may carry biases inherent in that data. S3TVR does not alter these models; it serves as a framework that orchestrates them in an optimized structure for efficient, seamless speech-to-speech translation.
I am more than happy to receive suggestions, pull requests, and constructive criticism. Your feedback is invaluable in improving the S3TVR framework. If you have any ideas, encounter issues, or want to contribute to the project, please feel free to submit a pull request or open an issue.
I would like to thank my supervisor, Adrian Carrio, for his guidance and support throughout this project.
This project is licensed under the MIT License.