- Motivation
- Project Objectives
- Key Features
- Dataset
- Speech-to-Text Conversion
- Personalized Image Generation
- Model Specifications
- Training Strategy
- Technologies Used
- Results
- Contributors
- Future Work
- How to Run
- Project Attachments
## Motivation

With the growing demand for personalized content, this project bridges audio and visual modalities by generating images from speech. It has real-world applications in accessibility, content creation, and immersive environments like AR/VR.

## Project Objectives
- Develop a system that generates high-quality personalized images from speech inputs.
- Seamlessly combine speech-to-text conversion and fine-tuned image generation models.
- Optimize training efficiency and ensure high output quality.
## Key Features

- Accurate speech-to-text conversion using OpenAI's Whisper model.
- Fine-tuned Stable Diffusion with DreamBooth to generate personalized images.
- Modular workflow ensuring efficiency, flexibility, and high-quality results.
## Dataset

We created a custom dataset by capturing our own images to fine-tune the Stable Diffusion model. This small, targeted dataset ensures the generated images are highly personalized and contextually accurate.
## Speech-to-Text Conversion

The OpenAI Whisper model accurately transcribes spoken input into text prompts. It handles multiple languages, various audio formats, and noisy environments effectively. These text prompts act as inputs for the image generation process.
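For reference, transcription takes only a few lines with the open-source `whisper` package; the model size ("base") and audio file name below are illustrative:

```python
# Minimal sketch: turn a spoken prompt into a text prompt with Whisper.
# Assumes `pip install -U openai-whisper`; larger models trade speed for accuracy.
import whisper

model = whisper.load_model("base")
result = model.transcribe("spoken_prompt.wav")  # file name is illustrative
prompt = result["text"].strip()
print(prompt)
```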
## Personalized Image Generation

Using Stable Diffusion v2, we fine-tuned the model with DreamBooth to generate personalized and context-specific images. By leveraging our dataset, the model produces results that are visually relevant and high-quality.
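A minimal inference sketch using Hugging Face `diffusers`, assuming the fine-tuned weights live in the "Model Files" folder referenced under How to Run (the exact path and prompt are illustrative):

```python
# Sketch: generate an image from a transcribed prompt with the fine-tuned
# Stable Diffusion v2 checkpoint. The model path and prompt are illustrative.
import torch
from diffusers import StableDiffusionPipeline

model_path = "/content/drive/MyDrive/ece5831-2024-final-project/Model Files"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # requires a GPU runtime

prompt = "a photo of a person at the beach"  # in practice, the Whisper transcription
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("generated.png")
```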
## Model Specifications

Stable Diffusion uses a latent diffusion process to generate images from text prompts. Its components, shown in the inspection sketch after this list, include:
- U-Net: Core neural network for noise prediction and removal.
- Variational Autoencoder (VAE): Compresses images into latent space.
- CLIP Text Encoder: Translates text prompts into numeric representations.
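These components map directly onto attributes of a loaded `diffusers` pipeline, which makes them easy to inspect; the model id below is the public Stable Diffusion v2 checkpoint:

```python
# Inspect the three core components of a Stable Diffusion v2 pipeline.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2")
print(type(pipe.unet).__name__)          # UNet2DConditionModel: predicts and removes noise
print(type(pipe.vae).__name__)           # AutoencoderKL: encodes/decodes the latent space
print(type(pipe.text_encoder).__name__)  # CLIPTextModel: embeds the text prompt
```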
## Training Strategy

DreamBooth enables efficient fine-tuning for personalized image generation. Key training parameters include prior preservation, batch size control, and memory-efficient optimizations like mixed precision.
The model was fine-tuned with optimized configurations such as a low learning rate (1e-6), small batch sizes, and prior preservation techniques. Training involved iterative evaluations to monitor quality and ensure computational efficiency.
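For reference, a run along these lines could be launched from a Colab cell with the DreamBooth example script from the `diffusers` repository. The data directories, prompts, and step count below are illustrative assumptions; the learning rate, batch size, prior preservation, and mixed precision mirror the configuration described above:

```python
# Colab cell (illustrative): DreamBooth fine-tuning with prior preservation,
# a small batch size, a 1e-6 learning rate, and fp16 mixed precision.
!accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2" \
  --instance_data_dir="data/instance_images" \
  --class_data_dir="data/class_images" \
  --output_dir="Model Files" \
  --instance_prompt="a photo of sks person" \
  --class_prompt="a photo of a person" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --train_batch_size=1 \
  --learning_rate=1e-6 \
  --max_train_steps=800 \
  --mixed_precision="fp16"
```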
## Technologies Used

- Speech-to-Text: OpenAI Whisper
- Image Generation: Stable Diffusion v2
- Fine-Tuning Framework: DreamBooth
- Optimizations: Mixed precision training and memory-efficient optimizers
## Contributors

- Adil Qureshi: Fine-tuning Stable Diffusion and model training
- Jaskirat Sudan: Speech-to-text integration, testing, and image generation analysis
- Shubham Jagtap: Dataset creation
## Future Work

- Expand the dataset for better generalization and diversity.
- Implement advanced metrics for cross-modal evaluation.
- Integrate the model into augmented and virtual reality applications.
## How to Run

To run the `final_project.ipynb` notebook for inference in Google Colab:

- Mount Google Drive: open the notebook and run the Drive-mounting cell.

  ```python
  from google.colab import drive

  drive.mount('/content/drive')
  ```

- Specify the model path: provide the path to the "Model Files" folder inside your `ece5831-2024-final-project` directory. For easier access, you can create a shortcut to the shared `ece5831-2024-final-project` folder in your own Google Drive.
- Run all cells of `final_project.ipynb`: ensure the Colab session is connected to a free GPU runtime. Check the runtime type under Runtime > Change runtime type and select GPU.
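A quick cell to confirm the GPU is actually attached before running inference (PyTorch comes preinstalled on Colab):

```python
# Sanity check: fail fast if no GPU is attached to the session.
import torch

assert torch.cuda.is_available(), "No GPU found; select GPU under Runtime > Change runtime type."
print(torch.cuda.get_device_name(0))  # e.g., "Tesla T4" on the free tier
```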