ece5831-2024-final-project

Speech-to-Image Generation Using Fine-Tuned Latent Diffusion Models

Table of Contents

  1. Motivation
  2. Project Objectives
  3. Key Features
  4. Dataset
  5. Speech-to-Text Conversion
  6. Personalized Image Generation
  7. Model Specifications
  8. Training Strategy
  9. Technologies Used
  10. Results
  11. Contributors
  12. Future Work
  13. How to Run
  14. Project Attachments

Motivation

With the growing demand for personalized content, this project bridges audio and visual modalities by generating images from speech. It has real-world applications in accessibility, content creation, and immersive environments like AR/VR.

Project Objectives

  • Develop a system that generates high-quality personalized images from speech inputs.
  • Seamlessly combine speech-to-text conversion and fine-tuned image generation models.
  • Optimize training efficiency and ensure high output quality.

Key Features

  • Accurate speech-to-text conversion using OpenAI's Whisper model.
  • Fine-tuned Stable Diffusion with DreamBooth to generate personalized images.
  • Modular workflow ensuring efficiency, flexibility, and high-quality results.

Dataset

We created a custom dataset by capturing our own images to fine-tune the Stable Diffusion model. This small, targeted dataset ensures the generated images are highly personalized and contextually accurate.

*(Figure: collage of the captured dataset images)*


Speech-to-Text Conversion

The OpenAI Whisper model accurately transcribes spoken input into text prompts. It handles multiple languages, various audio formats, and noisy environments effectively. These text prompts act as inputs for the image generation process.
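A minimal sketch of this stage, assuming the open-source `openai-whisper` package; the audio filename and the prompt-cleaning helper are illustrative, not the project's exact code:

```python
# Sketch of the speech-to-text stage using the open-source
# `openai-whisper` package (pip install openai-whisper).
import re


def clean_transcript(text: str) -> str:
    """Normalize a raw transcript into a text-to-image prompt:
    collapse internal whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def transcribe(audio_path: str) -> str:
    """Transcribe an audio file with Whisper and return a clean prompt."""
    import whisper  # deferred import: only needed at inference time

    model = whisper.load_model("base")     # small, fast checkpoint
    result = model.transcribe(audio_path)  # handles many formats and languages
    return clean_transcript(result["text"])


# Usage (in Colab, with a recorded clip):
#   prompt = transcribe("speech_input.wav")
```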


Personalized Image Generation

We fine-tuned Stable Diffusion v2 with DreamBooth to generate personalized, context-specific images. By leveraging our custom dataset, the model produces results that are visually relevant and high-quality.


Model Specifications

Stable Diffusion Architecture

Stable Diffusion uses a latent diffusion process to generate images from text prompts. Its components include:

  • U-Net: Core neural network for noise prediction and removal.
  • Variational Autoencoder (VAE): Compresses images into latent space.
  • CLIP Text Encoder: Translates text prompts into numeric representations.
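The VAE is what makes the process "latent": diffusion runs on a compressed tensor rather than raw pixels. A small sketch of the resulting shapes (the 8× spatial factor and 4 latent channels are standard for Stable Diffusion's VAE):

```python
# Illustration of the latent-space compression performed by the VAE.
# Stable Diffusion's VAE downsamples each spatial dimension by 8x and
# produces 4 latent channels, so the U-Net denoises a far smaller tensor.

def latent_shape(height: int, width: int, channels: int = 4, factor: int = 8):
    """Shape of the latent tensor the U-Net denoises for a given image size."""
    assert height % factor == 0 and width % factor == 0
    return (channels, height // factor, width // factor)


# A 768x768 image (Stable Diffusion v2's native resolution) becomes a
# 4 x 96 x 96 latent -- roughly 48x fewer elements than the RGB pixels.
print(latent_shape(768, 768))  # (4, 96, 96)
```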

Fine-Tuning with DreamBooth

DreamBooth enables efficient fine-tuning for personalized image generation. Key training parameters include prior preservation, batch size control, and memory-efficient optimizations like mixed precision.
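Prior preservation works by adding a second loss term computed on generic class images, which keeps the fine-tuned model from forgetting the broader class while it learns the subject. A pure-Python stand-in for that objective (the default weight of 1.0 mirrors common DreamBooth implementations and is an assumption, not this repo's exact setting):

```python
# Sketch of DreamBooth's prior-preservation objective: the total loss is
# the denoising loss on the subject ("instance") images plus a weighted
# denoising loss on generic class-prior images.

def prior_preservation_loss(instance_mse: float,
                            prior_mse: float,
                            prior_weight: float = 1.0) -> float:
    """Combined DreamBooth loss for one training step."""
    return instance_mse + prior_weight * prior_mse
```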


Training Strategy

The model was fine-tuned with optimized configurations such as a low learning rate (1e-6), small batch sizes, and prior preservation techniques. Training involved iterative evaluations to monitor quality and ensure computational efficiency.
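An illustrative configuration for this setup; the learning rate and prior preservation match the values stated above, while the remaining fields are plausible placeholders rather than the project's exact settings:

```python
# Illustrative DreamBooth fine-tuning configuration. Only the learning
# rate and prior preservation come from the write-up above; the other
# values are assumed placeholders for the sketch.
dreambooth_config = {
    "learning_rate": 1e-6,            # low LR stated above
    "train_batch_size": 1,            # small batch size (assumed value)
    "with_prior_preservation": True,  # prior preservation enabled
    "prior_loss_weight": 1.0,         # assumed common default
    "mixed_precision": "fp16",        # memory-efficient training
    "max_train_steps": 800,           # assumed; tuned via iterative evaluation
}
```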


Technologies Used

  • Speech-to-Text: OpenAI Whisper
  • Image Generation: Stable Diffusion v2
  • Fine-Tuning Framework: DreamBooth
  • Optimizations: Mixed precision training and memory-efficient optimizers

Results

*(Figure: sample images generated from speech prompts)*

Contributors

  • Adil Qureshi: Stable Diffusion fine-tuning and model training.
  • Jaskirat Sudan: Speech-to-text integration, testing, and image-generation analysis.
  • Shubham Jagtap: Dataset creation.

Future Work

  • Expand the dataset for better generalization and diversity.
  • Implement advanced metrics for cross-modal evaluation.
  • Integrate the model into augmented and virtual reality applications.

How to Run

To run the final_project.ipynb file for inference in Google Colab:

  1. Mount Google Drive:
    Open the notebook and run the Drive mounting cell.

    from google.colab import drive
    drive.mount('/content/drive')
  2. Specify the Model Path: Set the path to the "Model Files" folder inside your ece5831-2024-final-project directory. For easier access, you can add a shortcut to the shared ece5831-2024-final-project folder in your own Google Drive.

  3. Run all cells of final_project.ipynb. Ensure the Colab session is connected to a GPU runtime (the free tier works): under Runtime > Change runtime type, select GPU.
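The inference step inside the notebook could be sketched as follows, assuming the Hugging Face `diffusers` API; the Drive layout mirrors step 2 above, and the prompt is a placeholder:

```python
# Sketch of loading the fine-tuned weights from the mounted Drive and
# generating one image. Assumes the Hugging Face `diffusers` package;
# the folder layout follows step 2 above.
import os


def model_dir(drive_root: str = "/content/drive/MyDrive") -> str:
    """Path to the fine-tuned weights inside the mounted Drive."""
    return os.path.join(drive_root, "ece5831-2024-final-project", "Model Files")


def generate(prompt: str, out_path: str = "output.png") -> None:
    """Load the fine-tuned pipeline and render one image (needs a GPU)."""
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_dir(), torch_dtype=torch.float16
    ).to("cuda")  # requires the Colab GPU runtime from step 3
    pipe(prompt).images[0].save(out_path)


# Usage (in Colab, after mounting Drive):
#   generate("a photo of the subject at the beach")
```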

Project Attachments
