More thoughts for potential vocal synthesis in Sonic Pi [LONG]

Hey, some of you might remember me from the janky vocal synthesizer I attempted to make in Sonic Pi a while ago. Since the developers seemed interested in pursuing this further, I’ve been looking into the SuperCollider documentation to see what the options are.

I would say that the two main questions regarding vocal synthesis on Sonic Pi are whether a given vocal synthesis engine can run effectively (i.e. without constant underruns) on the hardware of a Raspberry Pi, and whether the Ruby front-end could be as easy to learn as any of the other Sonic Pi synths.

The speech synthesizer I created, called FRAM, was a “cascade” formant synthesizer, i.e. one that uses three band-pass filters in series. The advantage of this architecture is that the relative phase and amplitude of each formant fall out of the filter chain automatically, which creates a vowel-like sound. The disadvantage is that cascade formant synthesizers tend not to be very good at consonant sounds (as you could probably tell if you ran the code yourself). A “parallel” formant synthesizer (which also uses three BPFs, but in parallel) may be better at consonants, but would also involve having to calculate a discrete phase and amplitude for each filter. Perhaps that wouldn’t be such a problem for the SuperCollider back-end, but somebody who knows more about the physics and math of acoustics than I do would have to be the one to implement that.
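
To make the cascade/parallel distinction concrete, here is a minimal sketch in plain Ruby (not Sonic Pi or SuperCollider code). The two-pole resonator standing in for a band-pass filter, the formant frequencies and bandwidths, and the per-branch gains are all assumptions chosen for illustration; none of them are FRAM's actual values.

```ruby
SAMPLE_RATE = 44_100.0

# Simple two-pole resonator (an assumed stand-in for a band-pass filter):
#   y[n] = x[n] + 2r*cos(w)*y[n-1] - r^2*y[n-2]
def resonate(input, freq, bandwidth)
  r = Math.exp(-Math::PI * bandwidth / SAMPLE_RATE)
  a1 = 2.0 * r * Math.cos(2.0 * Math::PI * freq / SAMPLE_RATE)
  a2 = -(r * r)
  y1 = 0.0
  y2 = 0.0
  input.map do |x|
    y = x + a1 * y1 + a2 * y2
    y2 = y1
    y1 = y
    y
  end
end

# Cascade: each resonator filters the output of the previous one, so the
# relative formant levels come out of the chain without extra controls.
def cascade(input, formants)
  formants.reduce(input) { |signal, (freq, bw)| resonate(signal, freq, bw) }
end

# Parallel: every resonator filters the same input, and each branch needs
# its own explicit gain before the branches are summed.
def parallel(input, formants, gains)
  branches = formants.map { |freq, bw| resonate(input, freq, bw) }
  input.each_index.map do |i|
    branches.each_index.sum { |k| gains[k] * branches[k][i] }
  end
end

# Illustrative [frequency, bandwidth] pairs in Hz (assumed values).
FORMANTS = [[700.0, 80.0], [1200.0, 90.0], [2600.0, 120.0]].freeze
GAINS = [1.0, 0.5, 0.25].freeze # assumed per-branch gains

impulse = [1.0] + [0.0] * 255
cascade_out = cascade(impulse, FORMANTS)
parallel_out = parallel(impulse, FORMANTS, GAINS)
puts "cascade peak:  #{cascade_out.map(&:abs).max.round(3)}"
puts "parallel peak: #{parallel_out.map(&:abs).max.round(3)}"
```

The extra `gains` argument is the point of the comparison: the parallel version has one more set of numbers to get right per formant, which is the bookkeeping described above.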

Doing a cursory scroll through the SuperCollider class documentation, I’ve found a handful of objects that might be of use if vocal synthesis in Sonic Pi is to be pursued further. The “FormantTable” object in particular seems handy, as it “returns a set of frequencies+resonances+amplitudes for a set of 5 bandpass filters, useful for emulating the main 5 formants of a vocal tract”. I have a feeling this might be part of Sonic Pi’s existing vowel FX, though as it stands, this FX can only reproduce static vowels and is unable to interpolate between them. (FRAM originally started as a self-made vowel filter which included the ability to “glide” between one vowel and another, but I got a bit carried away.)
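
As a rough idea of what “gliding” between vowels means numerically, here is a small plain-Ruby sketch that interpolates formant frequencies between two vowels. The frequency values are approximate textbook-style figures used only for illustration (they are not read from FormantTable or from Sonic Pi), and interpolating linearly in Hz is an assumption rather than how FRAM or the :vowel FX works.

```ruby
# Approximate first three formant frequencies in Hz. Rough illustrative
# values only, not data taken from FormantTable or Sonic Pi.
VOWEL_FORMANTS = {
  a: [800.0, 1150.0, 2900.0],
  i: [270.0, 2140.0, 2950.0]
}.freeze

# Linearly interpolate between two vowels. position runs from 0.0
# (entirely `from`) to 1.0 (entirely `to`).
def glide(from, to, position)
  VOWEL_FORMANTS.fetch(from).zip(VOWEL_FORMANTS.fetch(to)).map do |start, finish|
    start + (finish - start) * position
  end
end

# Formant targets for a glide from "a" to "i" in five steps.
steps = (0..4).map { |n| glide(:a, :i, n / 4.0) }
steps.each { |formants| puts formants.map(&:round).inspect }
```

Each row of output would be one set of filter centre frequencies; a real implementation would also have to interpolate the bandwidths and amplitudes that FormantTable supplies.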

That said, FormantTable is an object that only stores formant filter data, so it has nothing to do with the actual synthesis itself. Using a bank of BPF filters in Sonic Pi might prove problematic, as my first post on this forum which mentioned FRAM was me complaining about a “possible buffer underrun issue”. I’m not sure whether that’s because I was making a cascade model, or whether anything involving complex filtering is too computationally expensive with the limited hardware of the Pi. I wrote FRAM on a 2015 MacBook Pro, so perhaps the problem is even worse on the Pi? (I should test this once I can find my Pi again.)

SuperCollider also has an object simply called “Formant”, which “generates a set of harmonics around a formant frequency at a given fundamental frequency”. I have not tested this in SuperCollider (I had issues getting the program to even RUN the last time I attempted to use it), but perhaps an oscillator which generates formants directly (instead of being put through a series of filters) might be less computationally expensive. Unfortunately, the documentation seems to indicate that it is only able to generate one formant at a time, which is no good for synthesizing something as complex as speech.

The “Formlet” object seems to be based on FOF synthesis, a technique which approximates the output of a BPF-based resonant formant filter by directly synthesizing decaying sine waves and repeating them at a high enough rate. I would imagine that sine waves would be one of the easiest things for a digital audio program to generate, so perhaps this could reduce CPU load over a filter-based model. FOF has been used for vocal synthesis (most notably in IRCAM’s “CHANT” software), but I’m not sure how well it would do with consonants (which are honestly difficult to synthesize in general).
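
The decaying-sine idea can be sketched in a few lines of plain Ruby. This is a simplified illustration, not Formlet's or CHANT's actual algorithm: real FOF grains also have a smooth attack, and the names `fof_grain`, `fof_tone` and `decay_rate` are made up for this example.

```ruby
SAMPLE_RATE = 44_100

# One FOF-style grain: a sine at the formant frequency whose amplitude
# decays exponentially. decay_rate (in 1/seconds) is an assumed parameter;
# a faster decay corresponds to a wider formant.
def fof_grain(formant_hz, decay_rate, length)
  Array.new(length) do |n|
    t = n.to_f / SAMPLE_RATE
    Math.exp(-decay_rate * t) * Math.sin(2.0 * Math::PI * formant_hz * t)
  end
end

# Overlap-add one grain per pitch period to build a voiced tone.
def fof_tone(pitch_hz, formant_hz, decay_rate, total_samples)
  period = (SAMPLE_RATE / pitch_hz).round
  grain = fof_grain(formant_hz, decay_rate, period * 4)
  out = Array.new(total_samples, 0.0)
  (0...total_samples).step(period) do |start|
    grain.each_with_index do |sample, i|
      break if start + i >= total_samples
      out[start + i] += sample
    end
  end
  out
end

# A tenth of a second at 110 Hz with a single formant near 800 Hz.
tone = fof_tone(110.0, 800.0, 300.0, SAMPLE_RATE / 10)
puts "#{tone.length} samples, peak #{tone.map(&:abs).max.round(3)}"
```

A vowel would need several of these grain streams (one per formant) summed together, so the cost grows with the number of formants and the amount of grain overlap.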

SuperCollider also includes an “LPCSynth” object. Linear predictive coding was a widely-used form of speech synthesis prior to the storage of digital audio being practical, as LPC data takes up much less storage space than digital audio does. It’s also well-known in the field of computer music, having been used by composers such as Charles Dodge, Paul Lansky, Paul DeMarinis, and, most recently, Neil Cicierega. While it is not impossible to make a “by-rule” speech synthesizer (i.e. one that can synthesize arbitrary output) with an LPC synthesizer, it is mostly used for resynthesis of recorded audio, which is why it is often confused with a vocoder or with pitch correction software. Most of the LPC-related objects in SuperCollider seem to be tailored to this latter function, so it may be impractical to make a “text-to-speech” style interface for it. SuperCollider also has a “PVSynth” phase vocoder object, but this also appears to have been designed for resynthesis rather than synthesis by rule.

Finally, there is a “VosimOsc” object in SuperCollider. While VOSIM does stand for “vocal simulation”, it is less useful at synthesizing speech than its name would imply. Plogue Art et Technologie, a music software development studio based out of Montreal, has attempted to create a VOSIM-based speech synthesizer, but quickly realized prior to its release that its clarity of speech was lacking. Tyler Koziol, a friend and occasional collaborator of mine, did release a song showcasing Plogue’s VOSIM-based speech engine some years ago, which ironically commented on its lack of speech clarity (“Everybody runs away from me, screaming/Why do they do that? It hurts my feelings”). Despite not being terribly useful for speech synthesis, however, VOSIM certainly does create interesting and musically useful timbres and has been implemented in a number of DSP-based synthesizers, most notably Mutable Instruments’ Braids and Plaits macro-oscillator modules. In my opinion, VOSIM might make a good “conventional” synth for Sonic Pi, and I myself have used Plogue’s VOSIM engine for quasi-string pads and buzzy sawtooth bass sounds.

Wall of text on the mechanics of vocal synthesis in SuperCollider aside, creating a “user-friendly” interface for speech synthesis in Sonic Pi may be even more problematic than the synthesis method itself. Text-to-speech, especially for a language as inconsistent as English, relies on huge dictionaries to convert typed text into phonetic data before the synthesis is even attempted. As you may have noticed, the code I wrote for FRAM does not bother with converting text into speech at all. I had to write a different definition for every diphone used (which was only a TINY portion compared to every phoneme that exists in the English language), and named them according to a rough phonetic code. FRAM was not designed to sing anything other than the lyrics of “Harder, Better, Faster, Stronger”.

I ran into this problem previously for my college capstone project, another singing vocal synthesizer, created in Max for Live, called MYNA. Even though the quality of the output is much clearer (largely because it was concatenated from recordings I made of my friend Neal Anderson at a single pitch, then put through a pitch correction algorithm in Max which changed with incoming MIDI data), I ran into the exact same issue. MYNA as it exists now cannot sing anything other than “Daisy Bell”, as I did not have the time even during an entire semester of college to index every diphone we recorded, and furthermore, it can only take phonetic code as input, as I certainly did not have the time to write a dictionary including every word in the English language.
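
To show why the dictionary is the bottleneck, here is a toy plain-Ruby sketch of the text → phonemes → diphones chain. The phonetic codes are an informal ARPAbet-like notation assumed for this example (not FRAM's or MYNA's actual naming), and the two-word `LEXICON` is hypothetical.

```ruby
# Split a phoneme sequence into diphones (adjacent phoneme pairs), the
# units a concatenative engine needs. "_" marks silence at either end.
def diphones(phonemes)
  (["_"] + phonemes + ["_"]).each_cons(2).map { |a, b| "#{a}-#{b}" }
end

# A tiny hand-written pronunciation dictionary (hypothetical entries).
# A real text-to-speech front-end needs many thousands of these.
LEXICON = {
  "daisy" => %w[d ey z iy],
  "bell" => %w[b eh l]
}.freeze

# Convert text to diphones, failing loudly on any word with no entry.
def text_to_diphones(text)
  text.downcase.split.flat_map do |word|
    phonemes = LEXICON.fetch(word) { raise KeyError, "no entry for #{word.inspect}" }
    diphones(phonemes)
  end
end

puts text_to_diphones("Daisy Bell").inspect
```

Two short words already call for nine diphones, and any word missing from the dictionary stops the whole pipeline, which is the wall both FRAM and MYNA ran into.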

I appreciate that my experiment with speech synthesis in Sonic Pi caught the interest of the developers, and I would love a text-to-speech robot vocalist in the program, but if the developers are serious about trying to implement this, I would just like to explain what they would be getting into. That said, I haven’t yet done any research on third-party “Quarks” in SuperCollider, so if there is an “off-the-shelf” solution for text-to-vocal-synthesis, it would likely be the most practical way to move forward.

Wow! Thanks for the summary here. You could turn this into an academic literature review paper :slight_smile:

> can run effectively

I’m sure it can, it’s just a case of figuring out how

> whether the Ruby front-end could be as easy to learn

Again, possible but we need to find a way.

I implemented the vowel effect, porting it over from the Tidal project but I have to admit at the time I didn’t really understand it and I don’t think it works particularly well. We could do with something better.

> In my opinion, VOSIM might make a good “conventional” synth for Sonic Pi

This sounds like a good idea. Would anyone like to take a shot at designing a basic version in SuperCollider? We (the developers) can give some tips on how to get it ready for inclusion into Sonic Pi. My tips would be, think about whether it will be with_fx :vosim or synth :vosim. Sketching out a rough idea of how you imagine using it from within Sonic Pi is a good starting point.

Shortly after reading this post I did see something relevant in the SuperCollider forums: “FM formant synthesis (e.g. Chowning, "Phone")” (SynthDefs category, scsynth forum). The code there is fun, sounds good and seems to use 1.5% CPU on my MacBook, which is pretty good. Worth trying it out on an RPi. The only problem is that it only handles vowels, but it’s still better than nothing.

I was also thinking about whether neural networks could handle this. I suspect that they could, and that lots of research will have been done into making them run on mobile devices but I doubt that it will handle text to singing voice synthesis very easily - it would likely only be text for now.

Re: “Harder, Better, Faster, Stronger”: I’ve been listening to that tune a lot recently because my 5-year-old has just discovered Daft Punk (there are worse things!). I’ve noticed that the effect they use there (and elsewhere) is probably autotune, as opposed to a vocoder. I think they got a sample of a vocoder on one pitch and then used the MIDI tracking feature of autotune to repitch it accordingly. I’ve done a lot of work on implementing the Autotune patent recently so I’m familiar with it at the moment. Hoping to finish my patch to get it into Sonic Pi soon! (I have an older version of Autotune already in Sonic Pi but it doesn’t work as well.)

p.s. To give a flavour of what’s possible with neural networks, this is one option for singing voice synthesis: “Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens” (NVIDIA ADLR).

Sheesh, that sounds like a pretty tall order to me. I’ve fiddled around with audio-generating neural nets before (specifically OpenAI Jukebox), and I absolutely love the things Jukebox can come up with, but there are two huge problems with trying to use something like that in Sonic Pi: the first being the sheer amount of time it takes to render even just a few seconds worth of audio, and the second being that it sort of has a tendency to go off the rails.

I’m no expert in neural networks, I’m just a tech-minded artist exploring what they can come up with, but the thing I’ve noticed about Jukebox or any number of image-generating neural networks that I’ve used is that they really tend to work in broad strokes. Jukebox can certainly output things that are recognizable as music, it can even emulate the styles and voices of particular artists and performers to an uncanny degree, but whenever I try to give it a lyrical prompt, it really feels like trying to herd cats. It’s often able to synthesize the first few words of the “lyrics” prompt just fine, but afterward it tends to do things like repeat the same set of lyrics or even devolve into vocal gibberish, rather than continue on with the lyrics in the prompt (even when I’m using the semi-supervised “co-composer” mode which renders only a couple seconds at a time).

Maybe there’s an application in Sonic Pi for a much smaller neural network for a much simpler audio synthesizer, but as far as I can tell (and, again, I’m not at all an expert), speech/singing is inherently too complicated of an output for a neural network that could fit on board a Pi system to generate. If I’m wrong, though, I’d love to hear about it!

I’ve also wondered whether the existing open-source eSpeak engine would be of any use at all? From the (rather unstructured) research of myself and a few friends of mine, eSpeak seems to be a hybrid formant and concatenative engine, with the former part handling vowels and voiced consonants, and the latter using a small bank of samples of unvoiced consonants (a bit like a version of the “linear arithmetic” synthesis in the Roland D-50 that is tailored specifically to speech).

Somebody on GitHub has already made a front-end program that turns eSpeak output into “sung” vocals called eCantorix, though it can only take .kar files as input, and from the way it’s explained in the readme, it sounds like it adjusts pitch and timing after the eSpeak speech is rendered, rather than utilizing any scripting parameters for pitch and timing in the eSpeak engine itself? I haven’t looked at its code in any meaningful level of depth yet, so I’m not sure whether that is indeed the case, or if it was just worded awkwardly in the readme.

I think it would have to be the latter (synth :vosim), since VOSIM is a technique based on frequency and amplitude modulation and oscillator sync. I suppose the former could be achieved by figuring out how to sync the two sine waves it outputs to the frequency of the input audio, but that seems like a bit more trouble than it’s worth. Might be interesting to try, though. It would probably be a bit chaotic and glitchy, since a VOSIM fx would have to figure out the sync trigger itself (think, like, the Moog Freqbox), but chaotic and glitchy isn’t always a BAD thing :slight_smile:

By the way, here’s Werner Kaegi’s original paper on VOSIM, in case anyone wanted to check it out. Be warned, though, it’s full of a ton of equations that I cannot make heads or tails of.

You know, I’ve been watching a YouTube channel called PhilosophyTube lately, and there was something the channel’s host Abigail Thorn said in one of her videos critiquing a certain best-selling pop-philosopher who seems to have trouble following his own rules (though really, isn’t that all of us?).

“On this channel, we don’t care about being ‘right’. Every episode of PhilosophyTube is ‘wrong’…in fact, I refuse to be ‘correct’. The goal is to be ‘wrong’ in interesting ways.”

I’ve been thinking about this concept a lot. It reminds me a lot of the things Brian Eno has said about the “ugliest” features of a medium becoming the most prized–think of the asymmetry of tube distortion, or the saturation of magnetic tape, or the surreal glassiness of that one Yamaha DX7 piano patch (you know the one).

The thing that struck me about your janky :vowel fx is that…while it is way too resonant to emulate the human voice in a useful way, that same resonance is perfect for modal synthesis! Like…tune those sine partials to ratios other than vowel formant frequencies, and you could get some really nice sounding “tuned percussion” noises in varying degrees of inharmonicity!

On that note, I unfortunately haven’t played with any of the audio input devices enough yet to know whether Sonic Pi can do real-time processing of signals, but if you can get the latency low enough, you could make an INCREDIBLY responsive virtual percussion instrument with it, something like Applied Acoustics Systems’ Objeq app for iOS! I personally find the idea of physically actuated physical modeling synthesizers incredibly fascinating, so this might be another possibility to consider…:thinking:

So, I fiddled around a little with the :vowel FX as a modal resonator, and my suspicions that it could produce semi-tuned percussion noises were confirmed! The code here is a bit spaghetti, but I really do like how it turned out sonically…

define :exciter do |cut|
  with_fx :hpf, cutoff: cut do
    sample :bd_ada
  end
end

define :modal do |vow, inst|
  with_fx :vowel, vowel_sound: vow, voice: inst do
    exciter (rand_i 40) + 50
  end
end

# i can probably figure out a better way to do this than if-then but
define :vowel_scale do |v_note|
  if v_note == 0
    return [3, 0]
  end
  if v_note == 1
    return [5, 0]
  end
  if v_note == 2
    return [5, 3]
  end
  if v_note == 3
    return [2, 0]
  end
  if v_note == 4
    return [5, 2]
  end
  if v_note == 5
    return [4, 0]
  end
  if v_note == 6
    return [1, 0]
  end
end

define :playmodal do |i|
  v = vowel_scale i
  modal v[0], v[1]
end

in_thread do
  loop do
    with_fx :pan, pan: -0.3 do
      playmodal rand_i(7) if (spread 7, 11).tick
      sleep 0.25
    end
  end
end

in_thread do
  sleep 8
  loop do
    with_fx :pan, pan: 0.3 do
      playmodal rand_i(7) if (spread 11, 13).tick
      sleep 0.25
    end
  end
end

in_thread do
  sleep 16
  loop do
    with_fx :pan, pan: -0.5, amp: 0.7 do
      playmodal rand_i(7) if (spread 13, 17).tick
      sleep 0.125
    end
  end
end

in_thread do
  sleep 16
  loop do
    with_fx :pan, pan: 0.5, amp: 0.7 do
      playmodal rand_i(7) if (spread 17, 19).tick
      sleep 0.125
    end
  end
end
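
On the “better way than if-then” comment above: one possible option is a lookup table indexed by scale step. This is a sketch in plain Ruby, so it uses `def` where the piece uses `define`; the pairs are copied from the if chain, and the guard keeps the original behaviour of returning nil for anything outside 0 to 6.

```ruby
# Maps a scale step (0..6) to the same [vowel_sound, voice] pairs as the
# if chain. Returns nil for anything else, as the if chain does.
def vowel_scale(v_note)
  table = [[3, 0], [5, 0], [5, 3], [2, 0], [5, 2], [4, 0], [1, 0]]
  return nil unless v_note.is_a?(Integer) && v_note.between?(0, table.length - 1)
  table[v_note]
end

7.times { |step| puts "#{step} -> #{vowel_scale(step).inspect}" }
```

Adding or reordering scale steps then only means editing the one array.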

Definitely sounds interesting! My poor old computer struggled to keep up a bit though :joy: :sweat_smile:

Hmm. Is there a way to “choke” polyphony in Sonic Pi? Like, limit the number of audio renders going at once? I suppose it wouldn’t be TOO hard to do manually in Ruby, as long as there’s some means of getting the number of voices currently playing in_thread and some way to “kill” a playing voice. Something kinda like…

in_thread do
  # play some notes
  # (voices_playing and .oldest are hypothetical, not existing Sonic Pi calls)
  if voices_playing.length > 4
    kill voices_playing.oldest
  end
end

Unfortunately, I don’t know enough about every command in Sonic Pi to know whether the program can do something like this or not.
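
For what it’s worth, the bookkeeping half of that idea can be sketched in plain Ruby without knowing anything about Sonic Pi’s internals. `VoiceChoke` is a made-up name, and the “voices” in the demo are just strings standing in for whatever handle a playing sound would have.

```ruby
# Keeps at most `limit` voices alive. Whenever a new voice pushes the
# count over the limit, the oldest voice is handed to the stop callback.
class VoiceChoke
  def initialize(limit, &stop)
    @limit = limit
    @stop = stop
    @voices = [] # oldest voice first
  end

  # Register a newly started voice; returns the voices that were choked.
  def add(voice)
    @voices << voice
    choked = []
    choked << @voices.shift while @voices.length > @limit
    choked.each { |old| @stop.call(old) }
    choked
  end

  def count
    @voices.length
  end
end

# Stand-in demo: voices are labels and "stopping" just records the label.
stopped = []
choke = VoiceChoke.new(4) { |voice| stopped << voice }
6.times { |n| choke.add("voice #{n}") }
puts "still playing: #{choke.count}, stopped: #{stopped.inspect}"
```

In Sonic Pi the voice would presumably be the node returned by `play` or `sample`, with the callback calling `kill` on it. That wiring is an untested assumption here, and it is also unverified whether killing a sample frees the `with_fx` nodes wrapped around it.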

I just tried consolidating :exciter and :modal into one definition. It seems to improve performance on my end, but I don’t know about other hardware.

# FAILURE (the GOOD KIND)
# by Alex Hauptmann
# 2/15/22
# v0.2 created 2/17/22

# CHANGELOG: consolidating :exciter and :modal into one definition to see if performance improves

# NOTES: the :vowel FX in Sonic Pi only has 7 distinct "vowels". This piece involves
# a semi-tonal "scale" of these vowels, actuated by a bass drum sample. While the
# :vowel FX object has limited use for synthesis of vocal sounds, it is FANTASTIC
# as a modal resonator.

define :modal do |vow, inst|
  with_fx :vowel, vowel_sound: vow, voice: inst do
    with_fx :hpf, cutoff: (rand_i 40) + 50 do
      sample :bd_ada
    end
  end
end

# i can probably figure out a better way to do this than if-then but
define :vowel_scale do |v_note|
  if v_note == 0
    return [3, 0]
  end
  if v_note == 1
    return [5, 0]
  end
  if v_note == 2
    return [5, 3]
  end
  if v_note == 3
    return [2, 0]
  end
  if v_note == 4
    return [5, 2]
  end
  if v_note == 5
    return [4, 0]
  end
  if v_note == 6
    return [1, 0]
  end
end

define :playmodal do |i|
  v = vowel_scale i
  modal v[0], v[1]
end

in_thread do
  loop do
    with_fx :pan, pan: -0.3 do
      playmodal rand_i(7) if (spread 7, 11).tick
      sleep 0.25
    end
  end
end

in_thread do
  sleep 8
  loop do
    with_fx :pan, pan: 0.3 do
      playmodal rand_i(7) if (spread 11, 13).tick
      sleep 0.25
    end
  end
end

in_thread do
  sleep 16
  loop do
    with_fx :pan, pan: -0.5, amp: 0.7 do
      playmodal rand_i(7) if (spread 13, 17).tick
      sleep 0.125
    end
  end
end

in_thread do
  sleep 16
  loop do
    with_fx :pan, pan: 0.5, amp: 0.7 do
      playmodal rand_i(7) if (spread 17, 19).tick
      sleep 0.125
    end
  end
end

EDIT: okay so the SECOND time I ran it I ended up getting a ton of late notes? Inconsistent failure is the most frustrating sort :grimacing: