Spoken Language Diarisation for Indian Native Languages Codemixed with English

Overview

This repository contains the codebase for my bachelor's Major Technical Project 2023-24. The project aims to develop a spoken language diarisation system for Indian native languages codemixed with English.

Problem Statement

Robust language diarisation systems are lacking for Indian native languages, especially when they are codemixed with English. Existing diarisation systems often fail to identify and separate the languages spoken within a single conversation, and so perform poorly in multilingual environments.

Objectives

Develop a language diarisation system capable of accurately identifying and segmenting speech in Indian native languages codemixed with English.
Investigate the effectiveness of different feature extraction and modelling techniques, including wav2vec 2.0, X-Vector, and U-Vector, for language diarisation in multilingual contexts.
Implement the system using efficient parallel processing techniques to handle large-scale datasets and real-time applications.

Conclusion

In our experiments, the developed language diarisation system shows promising results in identifying and segmenting speech in Indian native languages codemixed with English. Using wav2vec 2.0 features together with parallel processing improved the system's performance over traditional approaches; the Final Report gives the details.

Future Work

Further optimize the system for real-time performance and scalability, particularly for handling larger datasets and streaming data.
Explore additional feature extraction and modelling techniques to enhance the robustness and adaptability of the system to diverse linguistic contexts.
Extend the system to support a wider range of Indian native languages and dialects, considering the linguistic diversity present in the region.

Installation

Clone the repository:

```
git clone https://github.com/yourusername/your-repository.git
```

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

To use this codebase, follow the steps below:

Training the wav2vec 2.0 Model:

To fine-tune the wav2vec 2.0 model, run:
```
python src/wav2vec2/main_v1_1.py
```
Hidden Feature Extraction:

To extract hidden features from the fine-tuned wav2vec 2.0 model, run:
```
python src/tdnn/fasterHiddenFeatures-ddp.py
```
X-Vector and U-Vector Training:

Train X-Vector and U-Vector models using the extracted hidden features:
```
python src/tdnn/xVectorTraining-ddp.py
```
```
python src/uVector/uVectorTraining-ddp.py
```
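
As background for the step above: X-Vector style models usually turn a variable-length sequence of frame-level features into one fixed-size vector with a statistics pooling layer (the per-dimension mean and standard deviation over time). The sketch below illustrates that idea in plain Python. It is an assumption about the standard X-Vector design, not a copy of this repository's training scripts, and `statistics_pooling` is a hypothetical helper name.

```python
from statistics import fmean, pstdev


def statistics_pooling(frames):
    """Pool a list of frame vectors into [means..., stds...].

    frames: list of equal-length lists of floats, one per time step.
    Uses the population standard deviation; real implementations may
    use the sample version or add a small epsilon for stability.
    """
    dims = list(zip(*frames))          # transpose: one tuple per feature dimension
    means = [fmean(d) for d in dims]   # average over time
    stds = [pstdev(d) for d in dims]   # spread over time
    return means + stds


# Toy input (illustrative only): 3 frames with 2 feature dimensions each.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
pooled = statistics_pooling(frames)
print(pooled)
```

The pooled vector has twice the feature dimension regardless of how many frames went in, which is what lets a fixed-size classifier sit on top of variable-length speech segments.
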
Language Diarisation:

Evaluate language diarisation using either the U-Vector or the X-Vector approach:
```
python src/evaluate/languageDiarizer-fast-uvector.py
```
```
python src/evaluate/languageDiarizer-fast-xvector.py
```
These scripts generate the final RTTM files. The paths and several variables are hardcoded in all of the scripts above, so update them for your environment before running.
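
For readers unfamiliar with RTTM, the sketch below shows how frame-level language predictions could be merged into RTTM segment lines. It is a minimal illustration, not the repository's code: the function name `frames_to_rttm`, the language tags, the file ID, and the 20 ms frame shift are all assumptions, and the language label is written into the field that RTTM normally uses for the speaker name.

```python
from itertools import groupby


def frames_to_rttm(labels, file_id, frame_shift=0.02):
    """Merge runs of identical frame labels into RTTM lines.

    labels: one language tag per frame, in time order.
    frame_shift: seconds per frame (0.02 is an assumed value).
    """
    lines = []
    start = 0  # index of the first frame in the current run
    for lang, run in groupby(labels):
        n = len(list(run))
        onset = start * frame_shift
        duration = n * frame_shift
        # Fields: type, file, channel, onset, duration,
        # orthography, speaker type, name, confidence, lookahead.
        lines.append(
            f"SPEAKER {file_id} 1 {onset:.3f} {duration:.3f} "
            f"<NA> <NA> {lang} <NA> <NA>"
        )
        start += n
    return lines


# Hypothetical predictions: 3 Hindi frames, 2 English frames, 1 Hindi frame.
rttm_lines = frames_to_rttm(
    ["hin", "hin", "hin", "eng", "eng", "hin"], "utt001"
)
print("\n".join(rttm_lines))
```

Each output line describes one contiguous single-language segment, so a codemixed utterance produces one line per language switch.
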

Files Description

src/common/datasetPrebuild.py: Generates prebuilt datasets in the required Hugging Face format.
src/wav2vec2/main_v1_1.py: Fine-tunes the wav2vec 2.0 model.
src/tdnn/fasterHiddenFeatures-ddp.py: Extracts hidden features from the wav2vec 2.0 model.
src/tdnn/xVectorTraining-ddp.py: Trains the X-Vector model using the extracted hidden features.
src/uVector/uVectorTraining-ddp.py: Trains the U-Vector model using the extracted hidden features.
src/evaluate/languageDiarizer-fast-uvector.py: Evaluates language diarisation using the U-Vector approach.
src/evaluate/languageDiarizer-fast-xvector.py: Evaluates language diarisation using the X-Vector approach.
scripts/: Contains scripts for running the codebase on the Param-Siddhi HPC. If you are using a different cluster, adapt them to its specifications.

Final Report

For detailed information about the project, refer to the Final Report.
