Cover Generator

🎶 ➡ 🧠 ➡ 🖼️

Update

The pipeline described here can also generate covers for books, podcasts, music albums, meetings, documents, story books, and theatre scripts, so it serves a multi-purpose role.

Table of Contents
  1. Description
  2. Our Approach with AI Tools
  3. Python Script
  4. Websites Deployed
  5. Papers Reviewed

Description


  • This is a simple application that uses the Stable Diffusion model to generate images from song lyrics.
  • We apply a large multilingual language model to open-ended generation of English song lyrics, and evaluate the resulting lyrics for coherence and creativity using human reviewers.
  • We find that current computational metrics for evaluating large language model outputs are limited when applied to creative writing. The human concept of creativity requires lyrics to be both comprehensible and distinctive, and human reviewers rate certain types of machine-generated lyrics more highly than real lyrics by popular artists.
  • Inspired by the inherently multimodal nature of album releases, we use an English-language Stable Diffusion model to produce high-quality, lyric-guided album art, demonstrating a creative approach for an artist seeking inspiration for an album or single.

Pipeline


The pipeline generates music album covers using the latest AI tools, namely:

1. Stable Diffusion and DALL·E

  • DALL·E is a neural network trained by OpenAI that creates images from text captions for a wide range of concepts expressible in natural language. It is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs. OpenAI reports that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.

  • Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another.

  • This training procedure allows DALL·E not only to generate an image from scratch, but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.
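
The single-stream layout described above can be illustrated with a small sketch. The split into 256 text tokens and 1024 image tokens (a 32×32 grid) follows OpenAI's published description of DALL·E; the function names and the padding token id are hypothetical and exist only for illustration, not as part of any real API.

```python
TEXT_TOKENS = 256             # text positions in the stream
GRID = 32                     # the image is a 32x32 grid of discrete tokens
IMAGE_TOKENS = GRID * GRID    # 1024 image positions
PAD = 0                       # hypothetical padding token id (assumption)


def build_stream(text_ids, image_ids):
    """Concatenate padded text tokens and image tokens into one sequence."""
    if len(text_ids) > TEXT_TOKENS:
        raise ValueError("caption too long")
    if len(image_ids) != IMAGE_TOKENS:
        raise ValueError("expected a full 32x32 grid of image tokens")
    padded_text = list(text_ids) + [PAD] * (TEXT_TOKENS - len(text_ids))
    return padded_text + list(image_ids)


def positions_to_regenerate(top, left):
    """Raster-order image positions of the rectangle from (top, left)
    down to the bottom-right corner of the grid."""
    return [r * GRID + c for r in range(top, GRID) for c in range(left, GRID)]


stream = build_stream([5, 9, 2], list(range(IMAGE_TOKENS)))
print(len(stream))                            # 1280
print(len(positions_to_regenerate(16, 16)))   # 256 (the lower-right quadrant)
```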

How are lyrics transcribed?

This notebook uses OpenAI's Whisper model for automatic speech recognition. OpenAI released several sizes of the model, each with its own trade-offs. The notebook uses the largest Whisper model to transcribe the lyrics and the smallest model to segment them. Neither model is perfect, but the results so far are reasonably good.
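
A minimal sketch of the transcription step, assuming the `openai-whisper` package and `ffmpeg` are installed. The file name `song.mp3` and the helper `segments_to_lines` are hypothetical; the notebooks may organise this differently.

```python
def transcribe(audio_path, model_size="large"):
    """Transcribe an audio file with openai-whisper (imported lazily so the
    helper below can be used without the package installed)."""
    import whisper  # pip install openai-whisper
    model = whisper.load_model(model_size)
    return model.transcribe(audio_path)


def segments_to_lines(segments):
    """Turn Whisper-style segments into stripped, non-empty lyric lines."""
    return [s["text"].strip() for s in segments if s["text"].strip()]


# Usage (downloads model weights on first run):
# result = transcribe("song.mp3", model_size="large")
# lyrics = segments_to_lines(result["segments"])

# Dummy segments in the shape Whisper returns, for illustration only.
demo_segments = [
    {"start": 0.0, "end": 2.5, "text": " First line of the song"},
    {"start": 2.5, "end": 3.0, "text": " "},
    {"start": 3.0, "end": 5.0, "text": " Second line"},
]
print(segments_to_lines(demo_segments))
```

Passing `model_size="tiny"` selects the smallest model instead of the largest.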

2. OpenAI Whisper for transcription

  • Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. According to OpenAI, such a large and diverse dataset improves robustness to accents, background noise and technical language. It also enables transcription in multiple languages, as well as translation from those languages into English. OpenAI open-sourced the models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

  • The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
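
The 30-second chunking step can be sketched in plain Python. This is a simplified illustration, not Whisper's actual code: it assumes 16 kHz mono audio held as a list of floats, uses fixed non-overlapping windows, and zero-pads the final chunk. The function name is hypothetical, and the log-Mel spectrogram step is omitted.

```python
SAMPLE_RATE = 16_000                          # assumed sample rate (16 kHz)
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per chunk


def split_into_chunks(samples):
    """Split a sequence of samples into 30-second chunks,
    zero-padding the last chunk to full length."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        chunk += [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


audio = [0.1] * (SAMPLE_RATE * 70)    # 70 seconds of dummy audio
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[-1]))   # 3 480000
```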

3. ChatGPT and GPT-2 models

  • OpenAI trained ChatGPT using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. An initial model was trained using supervised fine-tuning: human AI trainers provided conversations in which they played both sides, the user and an AI assistant. The trainers had access to model-written suggestions to help them compose their responses. This new dialogue dataset was mixed with the InstructGPT dataset, which was transformed into a dialogue format.

  • To create a reward model for reinforcement learning, OpenAI collected comparison data, which consisted of two or more model responses ranked by quality. To collect this data, they took conversations that AI trainers had with the chatbot, randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, the model can be fine-tuned with Proximal Policy Optimization. Several iterations of this process were performed.
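
To make the comparison data concrete, here is a minimal sketch of how a ranking can be expanded into preferred/rejected pairs and scored. The text above does not state the loss; the pairwise logistic loss used here is a common choice for reward models and is an assumption, as are the function names.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked):
    """Expand responses ordered best-to-worst into (preferred, rejected) pairs."""
    return list(combinations(ranked, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(r_preferred - r_rejected)).
    It is small when the preferred response gets the higher reward."""
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))


pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)                                # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(round(pairwise_loss(0.0, 0.0), 4))    # 0.6931 (log 2: no preference learned yet)
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))   # True
```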

Notebooks:

The whole process is divided into three sections:

  • Generating the lyrics/transcript from a given audio file

Notebook for meeting audio ---> transcript

Notebook for music audio ---> lyrics

  • Generating the prompt from the lyrics

Notebook for lyrics ---> prompt

  • Generating the Stable Diffusion image from the prompt

Notebook for prompt ---> image
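
The three stages can be wired together roughly as follows. This is a sketch under stated assumptions, not the code in the notebooks: `build_prompt` is a deliberately simple stand-in for the language-model prompt stage, the model id `runwayml/stable-diffusion-v1-5` is an assumed checkpoint, and all function names are hypothetical. The image stage relies on the Hugging Face `diffusers` package, imported lazily.

```python
def build_prompt(lyrics, style="album cover art, highly detailed", max_words=40):
    """Stand-in for the prompt stage: keep the first words of the lyrics
    and append a style suffix. The real pipeline uses a language model."""
    words = lyrics.split()[:max_words]
    return " ".join(words) + ", " + style


def generate_cover(prompt, model_id="runwayml/stable-diffusion-v1-5"):
    """Generate one image with Stable Diffusion (model id is an assumption)."""
    from diffusers import StableDiffusionPipeline  # pip install diffusers
    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    return pipe(prompt).images[0]


def run_pipeline(audio_path, transcribe, make_prompt, make_image):
    """audio -> lyrics/transcript -> prompt -> image, with pluggable stages."""
    lyrics = transcribe(audio_path)
    prompt = make_prompt(lyrics)
    return make_image(prompt)


# Dry run with stub stages, so nothing is downloaded:
result = run_pipeline(
    "song.mp3",                               # hypothetical file name
    transcribe=lambda path: "rain on the city streets",
    make_prompt=build_prompt,
    make_image=lambda prompt: prompt,         # stub: return the prompt itself
)
print(result)   # rain on the city streets, album cover art, highly detailed
```

Keeping each stage as a plain function makes it easy to swap the transcript source (meeting audio versus music) without touching the rest of the pipeline.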

Final Python Script

Python Script - Click here

scripts/meeting_cover_baseline.py

Notebook for creation of meeting/book/document covers using transcript

Meeting/Book/Document covers notebook - click here

final-baseline/meeting-cover-baseline.ipynb

Notebook for creation of music covers using lyrics

final-baseline/music-cover-baseline.ipynb

final-baseline/meeting-final-pipeline.ipynb

Websites deployed

Papers reviewed:

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.
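
The idea of injecting information through a gate while the original weights stay frozen can be illustrated with a toy example. The abstract does not give the formula, so the `tanh`-gated residual below, with a learnable scalar `gamma` starting at zero, is an assumption chosen for illustration; the function name and the feature values are hypothetical.

```python
import math


def gated_update(frozen_features, grounding_features, gamma):
    """Add grounding information to frozen features through a scalar gate.
    With gamma = 0 the gate is closed and the frozen output is unchanged."""
    gate = math.tanh(gamma)
    return [f + gate * g for f, g in zip(frozen_features, grounding_features)]


frozen = [1.0, 2.0]        # output of a frozen pre-trained layer (dummy values)
grounding = [0.5, -0.5]    # output of a new trainable layer (dummy values)

print(gated_update(frozen, grounding, gamma=0.0))   # [1.0, 2.0]
opened = gated_update(frozen, grounding, gamma=5.0)
print(opened[0] > 1.0 and opened[1] < 2.0)          # True
```

Starting with a closed gate means training begins from the behaviour of the pre-trained model, and the new layers only gradually change the output.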