This repository contains the codebase for my bachelor's Major Technical Project 2023-24. The project aims to develop a spoken language diarisation system for Indian native languages codemixed with English.
The problem addressed in this project is the lack of robust language diarisation systems for Indian native languages, especially when they are codemixed with English. Existing diarisation systems often struggle to accurately identify and distinguish between different languages spoken within the same conversation, leading to poor performance in multilingual environments.
- Develop a language diarisation system capable of accurately identifying and segmenting speech in Indian native languages codemixed with English.
- Investigate the effectiveness of different feature extraction and modelling techniques, including Wav2Vec2, X-Vector, and U-Vector, for language diarisation in multilingual contexts.
- Implement the system using efficient parallel processing techniques to handle large-scale datasets and real-time applications.
In our experiments, the language diarisation system shows promising results in identifying and segmenting speech in Indian native languages codemixed with English. Using a deep learning model such as Wav2Vec2, together with parallel processing, improved performance over traditional approaches; see the Final Report for details.
- Further optimize the system for real-time performance and scalability, particularly for handling larger datasets and streaming data.
- Explore additional feature extraction and modelling techniques to enhance the robustness and adaptability of the system to diverse linguistic contexts.
- Extend the system to support a wider range of Indian native languages and dialects, considering the linguistic diversity present in the region.
- Overview
- Problem Statement
- Objectives
- Conclusion
- Future Work
- Installation
- Usage
- Files Description
- Final Report
- Contributing
- License
- Clone the repository:
  git clone https://github.com/yourusername/your-repository.git
- Install the required dependencies:
  pip install -r requirements.txt
To use this codebase, follow the steps below:
- Training the Wav2Vec2 model:
  To fine-tune the Wav2Vec2 model, run:
  python src/wav2vec2/main_v1_1.py
- Hidden feature extraction:
  To extract hidden features from the Wav2Vec2 model, run:
  python src/tdnn/fasterHiddenFeatures-ddp.py
- X-Vector and U-Vector training:
  Train the X-Vector and U-Vector models on the extracted hidden features:
  python src/tdnn/xVectorTraining-ddp.py
  python src/uVector/uVectorTraining-ddp.py
- Language diarization:
  Evaluate language diarization using either the X-Vector or the U-Vector approach:
  python src/evaluate/languageDiarizer-fast-uvector.py
  python src/evaluate/languageDiarizer-fast-xvector.py
These scripts generate the final RTTM files. Paths and several variables are hardcoded in the scripts above, so update them for your environment before running.
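For readers unfamiliar with RTTM, the following is a minimal sketch of how per-frame language labels could be collapsed into RTTM segment lines. It is not taken from the evaluation scripts in this repository. The 20 ms frame hop, the use of the speaker-name field to carry the language label, the file ID `utt1`, and the function names are all assumptions made for illustration.

```python
from itertools import groupby

# Assumed frame hop in seconds; the repository's scripts may use a different value.
FRAME_HOP_S = 0.02


def labels_to_segments(labels, hop=FRAME_HOP_S):
    """Collapse consecutive identical frame labels into (onset, duration, label) tuples."""
    segments = []
    start = 0  # index of the first frame of the current run
    for lang, run in groupby(labels):
        n = len(list(run))
        segments.append((round(start * hop, 3), round(n * hop, 3), lang))
        start += n
    return segments


def segments_to_rttm(file_id, segments):
    """Format segments as RTTM SPEAKER lines.

    Assumption: the language label is written in the speaker-name field.
    """
    return [
        f"SPEAKER {file_id} 1 {onset:.3f} {dur:.3f} <NA> <NA> {lang} <NA> <NA>"
        for onset, dur, lang in segments
    ]


if __name__ == "__main__":
    # Hypothetical frame labels: three Hindi frames followed by two English frames.
    frame_labels = ["hi", "hi", "hi", "en", "en"]
    for line in segments_to_rttm("utt1", labels_to_segments(frame_labels)):
        print(line)
```

Each output line describes one contiguous stretch of a single language, with its onset and duration in seconds.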
- src/common/datasetPrebuild.py: Generates prebuilt datasets in the required Hugging Face format.
- src/wav2vec2/main_v1_1.py: Fine-tunes the Wav2Vec2 model.
- src/tdnn/fasterHiddenFeatures-ddp.py: Extracts hidden features from the Wav2Vec2 model.
- src/tdnn/xVectorTraining-ddp.py: Trains the X-Vector model using the extracted hidden features.
- src/uVector/uVectorTraining-ddp.py: Trains the U-Vector model using the extracted hidden features.
- src/evaluate/languageDiarizer-fast-uvector.py: Evaluates language diarization using the U-Vector approach.
- src/evaluate/languageDiarizer-fast-xvector.py: Evaluates language diarization using the X-Vector approach.
- scripts/: Contains scripts for running the codebase on HPC. They are tailored to the Param-Siddhi HPC; if you are on a different cluster, adapt them to its specifications.
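Outside a scheduler, the stage order from the usage steps can be captured in a small driver. The sketch below is hypothetical and not part of the repository: the names `STAGES`, `build_commands`, and `run_pipeline` are invented here, and it uses the plain `python` invocation shown in the usage steps. The `-ddp` suffix suggests that some scripts may expect a distributed launcher, which this sketch does not handle.

```python
import subprocess
from pathlib import Path

# Stage order taken from the usage steps; paths are relative to the repository root.
STAGES = [
    ("fine-tune Wav2Vec2", ["src/wav2vec2/main_v1_1.py"]),
    ("extract hidden features", ["src/tdnn/fasterHiddenFeatures-ddp.py"]),
    ("train X-Vector and U-Vector", [
        "src/tdnn/xVectorTraining-ddp.py",
        "src/uVector/uVectorTraining-ddp.py",
    ]),
    ("language diarization", [
        "src/evaluate/languageDiarizer-fast-uvector.py",
        "src/evaluate/languageDiarizer-fast-xvector.py",
    ]),
]


def build_commands(python="python"):
    """Return one command (as an argument list) per script, in stage order."""
    return [[python, script] for _, scripts in STAGES for script in scripts]


def run_pipeline(repo_root=".", dry_run=True):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    root = Path(repo_root)
    for cmd in build_commands():
        if dry_run:
            # Flag scripts that are not found under repo_root.
            note = "" if (root / cmd[1]).exists() else " (missing)"
            print(" ".join(cmd) + note)
            continue
        subprocess.run(cmd, cwd=root, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With `dry_run=True` nothing is executed, which makes it easy to check the order and the paths after the hardcoded values in the scripts have been updated.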
For detailed information about the project, refer to the Final Report.