Hey, some of you might remember me from the janky vocal synthesizer I attempted to make in Sonic Pi a while ago. Since the developers seemed interested in exploring this possibility further, I’ve been digging into the SuperCollider documentation to see what it has to offer.
I would say the two main questions regarding vocal synthesis in Sonic Pi are whether a given vocal synthesis engine can run effectively (i.e. without constant buffer underruns) on a Raspberry Pi’s hardware, and whether a Ruby front-end for it could be made as easy to learn as any of the other Sonic Pi synths.
FRAM, the speech synthesizer I made, was a “cascade” formant synthesizer, i.e. one that runs three band-pass filters in series. The advantage of this architecture is that the relative amplitude and phase of each formant fall out of the filtering automatically, producing a vowel-like sound with no extra tuning. The disadvantage is that cascade formant synthesizers tend to be quite bad at consonants (as you could probably tell if you ran the code yourself). A “parallel” formant synthesizer (which also uses three BPFs, but side by side) may be better at consonants, but it requires choosing a separate amplitude and phase for each filter. Perhaps that wouldn’t be such a problem for the SuperCollider back-end, but somebody who knows more about the physics and math of acoustics than I do would have to be the one to implement it.
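To make the distinction concrete, here’s a minimal sketch of both architectures in SuperCollider. The formant frequencies and per-band amplitudes are rough textbook values for an /a/ vowel, not anything taken from FRAM:

```
(
{
    var src = Saw.ar(110);
    // cascade: the band-pass filters run in series, so the relative
    // formant levels fall out of the filtering automatically
    var cascade = BPF.ar(BPF.ar(BPF.ar(src, 700, 0.2), 1220, 0.2), 2600, 0.2);
    // parallel: every filter sees the raw source, so each band needs
    // its own hand-chosen amplitude
    var parallel = (BPF.ar(src, [700, 1220, 2600], 0.2) * [1, 0.5, 0.25]).sum;
    (parallel * 0.5).dup  // swap in `cascade` to compare
}.play;
)
```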
Doing a cursory scroll through the SuperCollider class documentation, I’ve found a handful of objects that might be useful if vocal synthesis in Sonic Pi is pursued further. The “FormantTable” object in particular seems handy, as it “returns a set of frequencies+resonances+amplitudes for a set of 5 bandpass filters, useful for emulating the main 5 formants of a vocal tract”. I have a feeling something like this may already underpin Sonic Pi’s existing vowel FX, though as it stands, that FX can only produce static vowels and cannot interpolate between them. (FRAM originally started as a self-made vowel filter that could “glide” from one vowel to another, but I got a bit carried away.)
That said, FormantTable only stores formant filter data; it does no synthesis itself. Using a bank of BPFs in Sonic Pi might also prove problematic: my first post on this forum to mention FRAM was, in fact, a complaint about a “possible buffer underrun issue”. I’m not sure whether that was because I was building a cascade model, or whether anything involving complex filtering is simply too computationally expensive for the Pi’s limited hardware. I wrote FRAM on a 2015 MacBook Pro, so perhaps the problem is even worse on the Pi? (I should test this once I can find my Pi again.)
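For reference, the “glide” idea itself is cheap to express. Here’s a sketch using a plain three-band BPF bank rather than FormantTable’s five; the /a/ and /i/ formant frequencies are rough textbook figures and the amplitudes are guesses:

```
(
{
    var src = Saw.ar(110);
    var aFreqs = [700, 1220, 2600];   // roughly /a/
    var iFreqs = [300, 2300, 3000];   // roughly /i/
    var mix = LFTri.kr(0.25).range(0, 1);            // sweep between the two vowels
    var freqs = aFreqs + (mix * (iFreqs - aFreqs));  // linear interpolation per formant
    ((BPF.ar(src, freqs, 0.1) * [1, 0.6, 0.3]).sum * 0.5).dup
}.play;
)
```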
SuperCollider also has an object simply called “Formant”, which “generates a set of harmonics around a formant frequency at a given fundamental frequency”. I have not tested this in Supercollider (I had issues getting the program to even RUN the last time I attempted to use it), but perhaps an oscillator which directly allows formant generation (instead of being put through a series of filters) might be less computationally expensive. Unfortunately, the documentation seems to indicate that it is only able to generate one formant at a time, which is no good for synthesizing something as complex as speech.
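An untested sketch of that summing idea (rough /a/ values again; note that the documentation says the bandwidth argument must be at least the fundamental frequency):

```
(
{
    var f0 = 80;  // fundamental; kept below all the bandwidth values
    var sig = (Formant.ar(f0, [700, 1220, 2600], [130, 90, 160]) * [0.5, 0.3, 0.2]).sum;
    (sig * 0.5).dup
}.play;
)
```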
The “Formlet” object seems to be based on FOF synthesis, a technique that directly generates short decaying sine grains; repeated at a high enough rate, these sound much like the output of a resonant band-pass formant filter. I would imagine sine waves are among the cheapest things for a digital audio program to generate, so perhaps this could reduce CPU load compared to a filter-based model. FOF has been used for vocal synthesis (most notably in IRCAM’s “CHANT” software), but I’m not sure how well it would do with consonants (which are honestly difficult to synthesize in general).
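A sketch of how Formlet might be driven: an impulse train at the fundamental excites one ringing filter per formant. Same rough /a/ values; the decay times, which effectively set each formant’s bandwidth, are guesses:

```
(
{
    var f0 = 110;
    var exc = Impulse.ar(f0);  // one excitation per fundamental period
    var sig = (Formlet.ar(exc, [700, 1220, 2600], 0.001, [0.02, 0.015, 0.01]) * [1, 0.5, 0.25]).sum;
    (sig * 0.5).dup
}.play;
)
```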
SuperCollider also includes an “LPCSynth” object. Linear predictive coding was a widely used form of speech synthesis back before storing digital audio was practical, as LPC data takes up far less space than the audio it describes. It’s also well known in the field of computer music, having been used by composers such as Charles Dodge, Paul Lansky, Paul DeMarinis, and, most recently, Neil Cicierega. While it’s not impossible to make a “by-rule” speech synthesizer (i.e. one that can synthesize arbitrary output) with LPC, it’s mostly used for resynthesis of recorded audio, which is why it’s often confused with a vocoder or with pitch-correction software. Most of the LPC-related objects in SuperCollider seem tailored to that latter function, so it may be impractical to build a “text-to-speech” style interface on top of them. SuperCollider also has a “PVSynth” phase vocoder object, but this too appears to be designed for resynthesis rather than synthesis by rule.
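For anyone unfamiliar with why LPC is so compact: the textbook formulation predicts each sample from a weighted sum of the p samples before it, so only the coefficients and a small excitation/residual signal need to be stored instead of the audio itself:

```
% Textbook LPC prediction: p coefficients a_k model the vocal tract,
% and e[n] is the (small) residual/excitation that gets stored or modeled.
s[n] = \sum_{k=1}^{p} a_k \, s[n-k] + e[n]
```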
Finally, there is a “VosimOsc” object in SuperCollider. While VOSIM does stand for “vocal simulation”, it is less useful for synthesizing speech than its name implies. Plogue Art et Technologie, a music software development studio based out of Montreal, attempted to create a VOSIM-based speech synthesizer, but realized prior to its release that its clarity of speech was lacking. Tyler Koziol, a friend and occasional collaborator of mine, did release a song showcasing Plogue’s VOSIM-based speech engine some years ago, which ironically commented on its lack of speech clarity (“Everybody runs away from me, screaming/Why do they do that? It hurts my feelings”). Despite not being terribly useful for speech, however, VOSIM certainly creates interesting and musically useful timbres, and it has been implemented in a number of DSP-based synthesizers, most notably Mutable Instruments’ Braids and Plaits macro-oscillator modules. In my opinion, VOSIM might make a good “conventional” synth for Sonic Pi; I’ve used Plogue’s VOSIM engine myself for quasi-string pads and buzzy sawtooth bass sounds.
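I haven’t checked VosimOsc’s actual arguments, so rather than guess at them, here’s a hand-rolled approximation of the basic VOSIM idea: each fundamental period contains a burst of squared-sine pulses that decays away, leaving a silent gap before the next trigger. All the numbers are arbitrary:

```
(
{
    var f0 = 110;                                    // retrigger (fundamental) rate
    var fp = 660;                                    // pulse ("formant") frequency
    var trig = Impulse.ar(f0);
    var phase = Sweep.ar(trig) * fp;                 // cycles elapsed since the last trigger
    var pulses = (phase * 2pi).sin.squared;          // the sin^2 pulse train
    var decay = EnvGen.ar(Env.perc(0.0001, 0.6 / f0), trig);  // decay plus gap per period
    (pulses * decay * 0.3).dup
}.play;
)
```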
Wall of text on the mechanics of vocal synthesis in SuperCollider aside, creating a “user-friendly” interface for speech synthesis in Sonic Pi may be even more problematic than the synthesis method itself. Text-to-speech, especially for a language as inconsistent as English, relies on huge dictionaries to convert typed text into phonetic data before any synthesis is even attempted. As you may have noticed, the code I wrote for FRAM doesn’t bother converting text into speech at all: I had to write a separate definition for every diphone used (a TINY fraction of all the diphones English contains) and name them according to a rough phonetic code. FRAM was not designed to sing anything other than the lyrics of “Harder, Better, Faster, Stronger”.

I ran into the same problem with my college capstone project, MYNA, another singing vocal synthesizer, this one built in Max for Live. Even though its output is much clearer (largely because it was concatenated from recordings I made of my friend Neal Anderson at a single pitch, then run through a pitch-correction algorithm in Max driven by incoming MIDI data), the bottleneck was identical. MYNA as it exists now cannot sing anything other than “Daisy Bell”: I did not have time, even over an entire semester, to index every diphone we recorded, and it can only take phonetic code as input, since I certainly did not have time to write a dictionary covering every word in the English language.
I appreciate that my experiment with speech synthesis in Sonic Pi caught the developers’ interest, and I would love a text-to-speech robot vocalist in the program, but if they’re serious about implementing this, I’d just like to lay out what they would be getting into. That said, I haven’t yet done any research on third-party “Quarks” for SuperCollider, so if an “off-the-shelf” text-to-vocal-synthesis solution exists there, it would likely be the most practical way forward.