The English language can be narrowed down to just 39 basic sounds, called "phonemes". Each phoneme has a specific character representation, and the set of representations is called the ARPABET, which is described in more detail here. Fun fact: you might have heard the sentence "the quick brown fox jumps over the lazy dog", which includes every English letter. Pronunciation has its own version of this, a story called Arthur the Rat, which includes all English pronunciations. If you're interested, check out recordings in different dialects here. Okay, I digress.
While the 39 phonemes are the only sounds, each can be pronounced in slightly different ways by varying syllable stress, cadence, etc. These slight modifications make pronunciation a little more nuanced than just stringing together the phonemes that make up a word. This project was prompted by my curiosity about how accurate a text-to-speech bot would be given just one pronunciation of each of the 39 phonemes. With such a simple setup, a text-to-speech bot can easily be made to mimic almost anyone, as long as they've been recorded saying words that cover each of the 39 phonemes at least once.
At a high level, this is what my program does:
Gathers user input
Splits user input into individual words
Stores the ARPABET pronunciation for each word using the "pronouncing" package
Plays all phonemes in order, accounting for punctuation with pauses
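The steps above can be sketched roughly like this. It is a minimal illustration and not the project's actual source: the names `text_to_sequence`, `strip_stress`, and `PAUSE` are made up for this example, and the `lookup` argument is assumed to behave like `pronouncing.phones_for_word`, which returns a list of space-separated phoneme strings for a word.

```python
import re

PAUSE = "<pause>"  # hypothetical marker meaning "insert a short silence here"
PUNCTUATION = set(".,;:!?")


def strip_stress(phone):
    """Drop the trailing ARPABET stress digit, e.g. 'AH0' -> 'AH'."""
    return phone.rstrip("012")


def text_to_sequence(text, lookup):
    """Turn text into a flat list of phonemes and PAUSE markers.

    `lookup` takes a lowercase word and returns a list of pronunciations,
    each a space-separated string such as 'HH AH0 L OW1'.
    """
    sequence = []
    # Words (letters and apostrophes) and single punctuation marks, in order.
    for token in re.findall(r"[A-Za-z']+|[.,;:!?]", text):
        if token in PUNCTUATION:
            sequence.append(PAUSE)
            continue
        candidates = lookup(token.lower())
        if not candidates:
            continue  # unknown word: simply skipped in this sketch
        # Use the first listed pronunciation and ignore stress.
        sequence.extend(strip_stress(p) for p in candidates[0].split())
    return sequence


if __name__ == "__main__":
    # A tiny stand-in dictionary so the sketch runs without extra packages.
    table = {"hello": ["HH AH0 L OW1"], "world": ["W ER1 L D"]}
    print(text_to_sequence("Hello, world!", lambda w: table.get(w, [])))
```

With the pronouncing package installed, you could pass `pronouncing.phones_for_word` as `lookup` instead of the stand-in table.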
Besides the source code, which you can find here, this project needs a few external libraries and a bit of additional setup.
pronouncing: A Python library that returns the ARPABET pronunciations of words. You can install it with "pip install pronouncing" and read the documentation here.
playsound: This Python library only adds one function, "playsound", which is used to play audio (MP3s, in this project) from a Python program. You can install it with "pip install playsound". To use it on my Mac, I also had to install PyObjC, which playsound relies on for macOS playback, so if you have an issue running playsound, try running "pip3 install -U PyObjC".
Here is where you customize the voice: In the same folder as voice_emulator.py, make a folder that has all of the pronunciation audio files in it. For example, for the phoneme "AA", you need its pronunciation stored in a file at "folder_name/AA.mp3". If any of this is confusing, take a look at the project on my GitHub and just copy the format there.
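Before running anything, it can help to check that the voice folder is complete. Here is a small sketch of such a check, assuming the "folder_name/AA.mp3" naming described above and the 39 stress-free phonemes of the CMU-style ARPABET; the function name `missing_phonemes` is just something I made up for this example.

```python
from pathlib import Path

# The 39 phonemes, without stress digits.
PHONEMES = (
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH"
).split()


def missing_phonemes(folder):
    """Return the phonemes that have no '<PHONEME>.mp3' file in `folder`."""
    folder = Path(folder)
    return [p for p in PHONEMES if not (folder / f"{p}.mp3").is_file()]


if __name__ == "__main__":
    import sys

    missing = missing_phonemes(sys.argv[1])
    if missing:
        print("Missing recordings for:", " ".join(missing))
    else:
        print("All 39 phonemes are covered.")
```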
After completing the setup, simply run the program from your terminal with: $ python3 voice_emulator.py folder_name
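For a rough idea of what happens with that folder name at playback time, here is a hedged sketch. It is not the real voice_emulator.py: `speak` and `PAUSE` are hypothetical names, the 0.3 second pause is an arbitrary guess, and the `play` argument is assumed to be something like the `playsound` function, which takes a file path and plays it.

```python
import time
from pathlib import Path

PAUSE = "<pause>"  # hypothetical marker for punctuation


def speak(sequence, folder, play, pause_seconds=0.3):
    """Play each phoneme clip in order, sleeping briefly at PAUSE markers.

    `play` is any callable that takes a file path string and plays it.
    """
    for item in sequence:
        if item == PAUSE:
            time.sleep(pause_seconds)
        else:
            play(str(Path(folder) / f"{item}.mp3"))


if __name__ == "__main__":
    import sys

    # Dry run: print the clip paths instead of playing them.
    speak(["HH", "AH", "L", "OW", PAUSE], sys.argv[1], print)
```

Passing `play` in as an argument keeps the sketch testable without speakers; to actually hear it, you would pass `playsound` (from `from playsound import playsound`) instead of `print`.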
Conclusions & Next Steps
Okay, so, I am well aware this is not the best voice emulator in the world LOL. However, I would say you can pretty much understand everything that is said as well as whose voice it is, and that's pretty cool! Considering I recorded all the sounds in 20 minutes using my default MacBook mic, used a single example word as guidance for each ARPABET pronunciation, and spent 10 minutes clipping each recording to scale, I don't think it turned out too bad! With this super simple format, it would be easy to make a bot of someone with even a small amount of available audio.
Beyond the obvious and pedantic things like using a better mic, pronouncing the sounds better, and clipping the audio better, the best way to improve the project would be to simply increase the number of sounds. While I ignored them in my implementation, ARPABET also uses stress markers, which indicate which syllable carries the primary or secondary stress when you speak a word. Adding these would require each vowel phoneme to be recorded multiple times (once for each tier of stress) but would, obviously, make the result sound more accurate or "human". I could also account for punctuation, such as a questioning tone (?), exclamation (!), or trailing off (...). Lastly, I could switch from ARPABET to the "International Phonetic Alphabet" or "IPA", as IPA accounts for a few more sounds than the version of ARPABET I'm using in this project.
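To make the stress idea a bit more concrete, here is one way it could work, purely as a sketch. In the pronunciations that pronouncing returns, vowels carry a trailing digit (0 for no stress, 1 for primary, 2 for secondary). The file naming below ("AH1.mp3" next to "AH.mp3") and the function name `clip_for` are assumptions of mine, not something the current project does.

```python
from pathlib import Path


def clip_for(phone, folder):
    """Pick the most specific clip for an ARPABET phone such as 'AH1'.

    Prefers a stress-specific recording (e.g. 'AH1.mp3') and falls back
    to the plain one (e.g. 'AH.mp3') when no such file exists.
    """
    folder = Path(folder)
    stressed = folder / f"{phone}.mp3"
    if stressed.is_file():
        return stressed
    return folder / f"{phone.rstrip('012')}.mp3"
```

The fallback means you could record stressed versions gradually, starting with the most common vowels, and the rest would keep using the single recording.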