Training for dummies: Creating voices off video game files. #2434
Replies: 2 comments 1 reply
-
You're trying to do almost exactly what I'm trying to do -- this is voice "cloning" and/or fine-tuning. I've been struggling with it myself and have recently posted about it.

While we wait for assistance, the one constant I've heard is that you need a lot of samples. They don't have to be long, but you will need quite a lot. I've not played the Persona games before -- roughly how many hours of audio are there per character? You will probably want to extract ~100 WAV samples of anywhere from 5-15 seconds each, and transcribe them, for each voice you want to model.

Yes, you have the LJSpeech structure mostly right. One gotcha I noticed: I'm not sure whether it is required, but the LJSpeech metadata CSV repeats the transcript text in two columns (raw and normalized). I did that myself and was able to run some trainings; however, all attempts resulted in dismal failure. My next goal is to fine-tune the YourTTS model as a starting point with my samples and see whether it produces more than noise. I'll let you know how that goes -- it's on my list of things to do this week. Voice modeling / AI voices for gaming is an area of interest for me, so I'm glad to see someone doing it!

For Windows, download the Python Windows Launcher, which makes it easier to use multiple Python versions side by side. I can't recall which version I was using for my experimentation. I did not bother with WSL or trying to run things in Linux; I just used VS Code and Python directly on Windows. I did have to install the CUDA toolkit from NVIDIA. Be sure to use CUDA 11 -- CUDA 12 is not yet supported by PyTorch (one of the underlying libraries that Coqui uses).
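For reference, the repeated-text gotcha mentioned above comes from the original LJSpeech dataset, whose metadata file is pipe-delimited with the transcript in both a raw and a normalized column. A minimal sketch of writing that format (the clip IDs and texts here are made-up placeholders, not actual game lines):

```python
from pathlib import Path

# Hypothetical clips: (clip_id, transcript). In LJSpeech's metadata.csv each
# line is "clip_id|raw transcript|normalized transcript" -- with no audio
# normalization done, the same text simply appears twice.
clips = [
    ("yukiko_0001", "Welcome to the Amagi Inn."),
    ("yukiko_0002", "Let's do our best today."),
]

lines = [f"{clip_id}|{text}|{text}" for clip_id, text in clips]
Path("metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")
print(Path("metadata.csv").read_text(encoding="utf-8"))
```

The clip_id is the WAV filename without its extension, so each row points at `wavs/<clip_id>.wav`.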
-
You're of course free to do it your own way, but you could easily use Coqui Studio to create or clone any voice and then use the API to voice anything you want: https://coqui.ai/ Just FYI.
-
Hello!
So, this title isn't the most intuitive, but there is so much going on that I just couldn't decide on a better one. Right now I am working with a friend - well, mainly me... - to come up with a plan to add voice to dialogue in Persona 4 Golden and Persona 5 Royal that wasn't previously voiced. The method of getting there is where I am, in parts, a little stumped.
Using Reloaded-II we can hook into the game and several of its methods. Right now I am waiting for a reply about reading the current dialogue box so that I can pass it to the NVDA screen reader via its nvdaController API (the example only showcases 32-bit usage, but a 64-bit version of the library with the same methods is also available).

But I want to take it a step further in the long run: I want to be able to reproduce - even if slightly scuffed and/or imperfect - the character voices, so I can generate any unspoken dialogue and then play the result at the appropriate times.
As my main modding "experience" is with Persona 4 Golden - the Steam version, mainly the 32-bit release - the methodology for getting all the resources out of Persona 5 Royal is still a little less known to me, so I will focus on that first.
Within the game there is a table containing all the strings, plus scripts that call the dialogues in events. So it is entirely possible to get a text-voice pair for each voiced dialogue. In turn, it is also possible to find out which dialogues have not been voiced, and then use the Reloaded-II based mod to insert the generated voice lines where there normally wouldn't be any.
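Finding the unvoiced lines boils down to a set difference between extracted text IDs and existing voice-clip IDs. A minimal sketch under assumed paths and naming (`strings/<Character>/text_N.txt` and `voices/<Character>/text_N.wav` are illustrative, not the game's real layout):

```python
from pathlib import Path

# Demo data (assumed layout): three text lines, of which only two are voiced.
strings_dir = Path("strings/Yukiko")
voice_dir = Path("voices/Yukiko")
strings_dir.mkdir(parents=True, exist_ok=True)
voice_dir.mkdir(parents=True, exist_ok=True)
for i in range(3):
    (strings_dir / f"text_{i}.txt").write_text(f"line {i}", encoding="utf-8")
for i in range(2):
    (voice_dir / f"text_{i}.wav").write_bytes(b"")  # placeholder clips

# The stems (filenames without extension) act as dialogue IDs; any text ID
# without a matching clip ID is an unvoiced line.
text_ids = {p.stem for p in strings_dir.glob("*.txt")}
voiced_ids = {p.stem for p in voice_dir.glob("*.wav")}
unvoiced = sorted(text_ids - voiced_ids)
print(unvoiced)  # IDs a TTS model would need to generate audio for
```

The same loop also yields the text-voice pairs (the intersection of the two sets), which is exactly what a training dataset needs.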
This means it is possible for me to generate, for each character, a dataset of what is seen on screen and heard.
From what I have learned so far about training TTS models, this covers the basics of what I need. So far, so good. Let's say I have several folders like this:
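For illustration, an LJSpeech-style layout with one folder per character could look like the comments below; every name here (the dataset root, speaker names, file layout) is an assumption, not the actual structure. A minimal script that creates such a skeleton:

```python
from pathlib import Path

# Hypothetical per-character layout:
#
#   dataset/Yukiko/
#   ├── metadata.csv   # clip_id|transcript|normalized transcript
#   └── wavs/
#       ├── yukiko_0001.wav
#       └── ...
for speaker in ["Yukiko", "Chie"]:
    wav_dir = Path("dataset") / speaker / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)
    (Path("dataset") / speaker / "metadata.csv").touch()
```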
How would I go about training a new model off this, and then using the resulting model to create voices for the un-voiced dialogue? Let's say I had a listing of each un-voiced dialogue that Yukiko has (.../strings/Yukiko/text_N.txt). What would the process be?

This whole project exists because my friend is nearly blind and has an incredibly hard time reading text - he lost a lot of his vision over the past few years and could not finish Persona 5 Royal because of that and its absolute onslaught of unvoiced text.
And because we are extremely tired of avoiding talking about spoilery content when he is in the voice chat...

Here are my PC's specs:
CPU: AMD Ryzen 9 3900X (stock)
RAM: 32GB @ 3200MHz
Storage: A lot. xD
GPU: NVIDIA RTX 2080 Ti (Founders Edition)
I also have WSL2 set up and running, and I can see my GPU from within it.
I did see the link to the Windows installation instructions, but Python 3.8 is quite old by now (as far as I know, anyway), so it would be great to hear from the experts. :)
I hope to use my experience from this to help others make similar mods and make games more accessible in the future! Like my friend, I too have a visual impairment, but it is not as severe; whilst it takes me a little longer, I can survive a textbox-hell. :) So I hope to help those who can not.
Thank you for reading!

Kind regards,
Ingwie