Training for dummies: Creating voices off video game files. #2434
Replies: 2 comments 1 reply
-
You're trying to do almost exactly what I'm trying to do -- this is voice "cloning" and/or fine-tuning. I've been struggling with it myself and have recently posted about it.

While we wait for assistance, the one constant I've heard is that you need a lot of samples. They don't have to be long, but you will need quite a lot. I've not played the Persona games before -- roughly how many hours of audio are there per character? You will probably want to extract ~100 WAV samples of anywhere from 5-15 seconds each, and transcribe them, for each voice you want to model.

Yes, you have the LJSpeech structure mostly right. One gotcha I noticed: I'm not sure whether it is required, but the LJSpeech metadata CSV repeats the transcript text in two columns (raw and normalized). I did that myself and was able to run some trainings; however, all attempts resulted in dismal failure. My next goal is to fine-tune the YourTTS model as a starting point with my samples and see whether it produces more than noise. I'll let you know how that goes -- it's on my list of things to do this week. Voice modeling / AI voices for gaming is an area of interest for me, so I'm glad to see someone doing it!

For Windows, download the Python Windows Launcher, which makes it easier to use multiple Python versions side by side. I can't recall which version I was using for my experimentation. I did not bother with WSL or trying to run things in Linux; I just used VS Code and Python directly on Windows. I did have to install the CUDA toolkit from NVIDIA. Be sure to use CUDA 11 -- CUDA 12 is not yet supported by PyTorch (one of the underlying libraries that Coqui uses).
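For reference, the repeated-text gotcha mentioned above comes from the original LJSpeech dataset, whose metadata file is pipe-delimited with the transcript in both a raw and a normalized column. A minimal sketch of writing that format (the clip IDs and texts here are made-up placeholders, not actual game lines):

```python
from pathlib import Path

# Hypothetical clips: (clip_id, transcript). In LJSpeech's metadata.csv each
# line is "clip_id|raw transcript|normalized transcript" -- with no audio
# normalization done, the same text simply appears twice.
clips = [
    ("yukiko_0001", "Welcome to the Amagi Inn."),
    ("yukiko_0002", "Let's do our best today."),
]

lines = [f"{clip_id}|{text}|{text}" for clip_id, text in clips]
Path("metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")
print(Path("metadata.csv").read_text(encoding="utf-8"))
```

The clip_id is the WAV filename without its extension, so each row points at `wavs/<clip_id>.wav`.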
-
You're of course free to do it your own way, but you could easily use Coqui Studio to create or clone any voice and then use the API to voice anything you want: https://coqui.ai/ Just FYI.
-
Hello!
So, this title isn't the most intuitive, but there is so much going on that I just couldn't decide on a better one. Right now I am working with a friend - well, mainly me... - to come up with a plan to add voice to dialogue in Persona 4 Golden and Persona 5 Royal that wasn't previously voiced. The method of getting there is where I am, in parts, a little stumped.
Using Reloaded-II we can hook into the game and several of its methods. Right now I am waiting for a reply about reading the current dialogue box so that I can pass it to the NVDA screen reader via its nvdaController API (the example only showcases 32-bit usage, but a 64-bit version of the library with the same methods is also available).

But I want to take it a step further in the long run: I want to be able to reproduce - even if slightly scuffed and/or imperfect - the character voices, so I can generate any unspoken dialogue and then play the result at the appropriate times.
As my main modding "experience" is with Persona 4 Golden - the Steam version, mainly the 32-bit release - the methodology for getting all the resources out of Persona 5 Royal is still a little less known to me, so I will focus on that first.
Within the game there is a table containing all the strings, plus scripts that call the dialogues in events. So it is entirely possible to get a text-voice pair for each voiced dialogue. In turn, it is also possible to find out which dialogues have not been voiced, and then use the Reloaded-II based mod to insert the generated voice lines where there normally wouldn't be any.
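Finding the unvoiced lines boils down to a set difference between extracted text IDs and existing voice-clip IDs. A minimal sketch under assumed paths and naming (`strings/<Character>/text_N.txt` and `voices/<Character>/text_N.wav` are illustrative, not the game's real layout):

```python
from pathlib import Path

# Demo data (assumed layout): three text lines, of which only two are voiced.
strings_dir = Path("strings/Yukiko")
voice_dir = Path("voices/Yukiko")
strings_dir.mkdir(parents=True, exist_ok=True)
voice_dir.mkdir(parents=True, exist_ok=True)
for i in range(3):
    (strings_dir / f"text_{i}.txt").write_text(f"line {i}", encoding="utf-8")
for i in range(2):
    (voice_dir / f"text_{i}.wav").write_bytes(b"")  # placeholder clips

# The stems (filenames without extension) act as dialogue IDs; any text ID
# without a matching clip ID is an unvoiced line.
text_ids = {p.stem for p in strings_dir.glob("*.txt")}
voiced_ids = {p.stem for p in voice_dir.glob("*.wav")}
unvoiced = sorted(text_ids - voiced_ids)
print(unvoiced)  # IDs a TTS model would need to generate audio for
```

The same loop also yields the text-voice pairs (the intersection of the two sets), which is exactly what a training dataset needs.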
This means it is possible for me to generate, for each character, a dataset of what is seen on screen and heard.
From what I have learned so far about training TTS models, this covers the basics of what I need. So far, so good. Let's say I have several folders like this:
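For illustration, an LJSpeech-style layout with one folder per character could look like the comments below; every name here (the dataset root, speaker names, file layout) is an assumption, not the actual structure. A minimal script that creates such a skeleton:

```python
from pathlib import Path

# Hypothetical per-character layout:
#
#   dataset/Yukiko/
#   ├── metadata.csv   # clip_id|transcript|normalized transcript
#   └── wavs/
#       ├── yukiko_0001.wav
#       └── ...
for speaker in ["Yukiko", "Chie"]:
    wav_dir = Path("dataset") / speaker / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)
    (Path("dataset") / speaker / "metadata.csv").touch()
```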
How would I go about training a new model off this, and then using the resulting model to create voices for the un-voiced dialogue? Let's say I had a listing of each un-voiced dialogue that Yukiko has (.../strings/Yukiko/text_N.txt). What would the process be?

This whole project exists because my friend is nearly blind and has an incredibly hard time reading text - he lost a lot of his vision over the past few years and could not finish Persona 5 Royal because of that and its absolute onslaught of unvoiced text.
And because we are extremely tired of avoiding talking about spoilery content when he is in the voice chat...

Here are my PC's specs:
CPU: AMD Ryzen 9 3900X (stock)
RAM: 32GB @ 3200MHz
Storage: A lot. xD
GPU: NVIDIA RTX 2080 Ti (Founders Edition)
I also have WSL2 set up and running, and I can see my GPU from within it.
I did see the link to the Windows installation instructions, but Python 3.8 is quite old by now (as far as I know, anyway), so it would be great to hear from the experts. :)
I hope to use my experience from this to help others make similar mods and make games more accessible in the future! Like my friend, I too have a visual impairment, but it is not as severe; whilst it takes me a little longer, I can survive a textbox-hell. :) So I hope to help those who can not.
Thank you for reading!

Kind regards,
Ingwie