Possible buffer underrun issue?

Hey, I’m sorry if this isn’t the right forum for this. I do see that there’s a “feature request” forum for Patreon supporters (which I’m not, currently), but this feels like more of a quality-of-life improvement than “please put this One Particular Synth In”.

So, for context, I’m messing around with trying to make a speech synthesizer in Sonic Pi. The included vowel filter is fine for choir pads and stuff, but I felt limited by the fact that it couldn’t glide between vowels, so I knocked out a cascade Klatt formant filter with three :bpfs in series that could. It sounded…a lot better than I expected my own code to, so I kept going with it and tried to add some consonants.

My problem happened when I tried to code up a phonetic sequence (“Harder Better Faster Stronger” by Daft Punk because Of Course It Was). The words came out…fine (they sound a bit like an old Votrax chip), but they weren’t consistent with the tempo of the project. Whenever the sequence lagged, I got a message in the log stating “Timing warning: running slightly behind…”

I could totally be misunderstanding how Sonic Pi works here, but from my own experiences with DAW software and environments like Pd and Max/MSP, I’m wondering if this issue could be related to buffer underruns? I tried looking for a way to increase the size of the audio buffer to allow Sonic Pi more time to calculate the result at the expense of more overall latency, but I couldn’t find one.

Additionally, I didn’t hear any sort of “crackling” in the output that tends to be a sign of audio buffer underruns, so perhaps the problem lies elsewhere? I’d just like to better understand Sonic Pi’s audio engine so that I can optimize my code for it. Calculating the output of multiple filters in series seems like a fairly hardcore DSP task anyway, so I might just be pushing the program to its limits with this.

Hello @IMLXH,

It might be handy to see the code that you are attempting to use, to get an idea of what it is you’re trying to get Sonic Pi to process.

As for the exact cause of your issue, I am not 100% sure. I do know, for example, that FX such as :reverb and :echo are fairly ‘expensive’, and this is made more obvious by the fact that Sonic Pi’s sound engine, AKA SuperCollider’s scsynth, can only use one CPU core at a time. It may be a similar situation with your new filter.

I’m not using :reverb or :echo, but yeah I’ll go ahead and post what I have so far of the source code.

# what's new: NOW WITH CONSONANTS

# might wanna make the unvoiced consonant addition
# more robust

use_bpm 65

define :voxcords do |note, length|
  use_synth :pulse
  use_synth_defaults pulse_width: 0.1, cutoff: 80, resonance: 0.1 # thin pulse with some rolloff
  play note, sustain: length, release: 0.1 # sustain: (not duration:) is the play opt that holds the note
end


define :klatt do |note, init_consonant, f1_a, f2_a, f3_a, f1_b, f2_b, f3_b, end_consonant, glide, breakpoint, length|
  with_fx :normaliser do
    with_fx :bpf, res: 0.7, centre: f3_a, centre_slide: (glide * breakpoint) do |c3|
      if init_consonant == "S"
        use_synth :noise
        play 40, cutoff: 130, amp: 0.01, attack: 0.125, decay: 0.125/8, release: 0.125/8
      end
      if init_consonant == "D"
        use_synth :pnoise
        play 40, cutoff: 130, amp: 0.1, attack: 0, decay: 0.125/8, release: 0.125/8
      end
      
      with_fx :bpf, res: 0.8, centre: f2_a, centre_slide: (glide * breakpoint) do |c2|
        
        if init_consonant == "K"
          use_synth :pnoise
          play 40, cutoff: 130, amp: 0.5, decay: 0.125/8, release: 0.125/8
        end
        
        with_fx :bpf, res: 0.9, centre: f1_a, centre_slide: (glide * breakpoint) do |c1|
          voxcords note, length
          sleep breakpoint # transition to next phoneme
          control c1, centre: f1_b
        end
        control c2, centre: f2_b
        if end_consonant == "K"
          sleep (length*0.9)
          use_synth :pnoise
          play 40, cutoff: 130, amp: 0.5, decay: 0.125/8, release: 0.125/8
        end
      end
      control c3, centre: f3_b
      if end_consonant == "T"
        sleep (length*0.9)
        use_synth :noise
        play 40, cutoff: 130, amp: 0.4, attack: 0, decay: 0.125/8, release: 0.125/8
      end
      if end_consonant == "S"
        sleep (length*0.5)
        use_synth :noise
        play 40, cutoff: 130, res: 0.5, amp: 0.005, attack: 0.125/2, decay: 0.125/8, release: 0.125/8
      end
    end
  end
end

define :klatt_vowels do |vowel|
  if vowel == "AW"
    return [99, 80, 74]
  end
  
  if vowel == "OO"
    return [62, 81, 97]
  end
  
  if vowel == "MM"
    return [62, 62, 74] #stabbing in the dark here lol
  end
  
  if vowel == "OU"
    return [69, 84, 97]
  end
  
  if vowel == "AH"
    return [78, 85, 99]
  end
  
  if vowel == "UH"
    return [72, 86, 98]
  end
  
  if vowel == "ER"
    return [71, 88, 92]
  end
  
  if vowel == "AE"
    return [76, 93, 98]
  end
  
  if vowel == "EH"
    return [72, 94, 99]
  end
  
  if vowel == "IH"
    return [67, 95, 99]
  end
  
  if vowel == "EE"
    return [61, 98, 102]
  end
end

define :klatt_diphones do |note, init_consonant, diphone1, diphone2, end_consonant, glide, breakpoint, sustain|
  d1 = klatt_vowels diphone1 # [f1, f2, f3] of vowel lookup table
  d2 = klatt_vowels diphone2 # arrays, h*ck yeah
  
  klatt note, init_consonant, d1[0], d1[1], d1[2], d2[0], d2[1], d2[2], end_consonant, glide, breakpoint, sustain
  
end

# MAYBE i oughta put slide value data in the phonemes themselves?
# "i might do it @$*!ing later."

klatt_diphones 40, 0, "OO", "ER", 0, 1, 0.125/4, 0.25
sleep 0.25
klatt_diphones 40, "K", "IH", "IH", "T", 0, 0.125, 0.25
sleep 0.75
klatt_diphones 52, 0, "MM", "EH", 0, 0, 0.125/2, 0.25
sleep 0.25
klatt_diphones 52, "K", "IH", "IH", "T", 0, 0.125, 0.25
sleep 0.75
klatt_diphones 43, "D", "OO", "OO", 0, 0, 0.125, 0.25
sleep 0.25
klatt_diphones 43, 0, "IH", "IH", "T", 0, 0.125, 0.25
sleep 0.75
klatt_diphones 55, 0, "MM", "EH", 0, 0, 0.125/2, 0.25
sleep 0.25
klatt_diphones 55, "K", "UH", "UH", "S", 0, 0.125, 0.25
sleep 0.75

Just tested the code. Aside from any timing problems that you may have perceived, it sounds cool :grinning_face_with_smiling_eyes:

*Wonders if we can create this speech system as a built-in synth…* :thinking:


Whoa, I would be HONORED to work on something like that for inclusion in Sonic Pi! My only concern is whether it would be a bit heavy on the CPU, since it’s three band-pass filters in series. Dennis Klatt also described a parallel-BPF speech synthesizer, but that seems a bit trickier to do because you’d need to automate a bunch of amplitude and phase values for every filter for each different vowel, whereas a “cascade” model is closer to the actual human vocal tract in its calculations.

I’ll admit that I don’t even have much coding experience, I just know a lot about synthesis and knocked this out in my free time just to see if I could. I do happen to know at least one person who has been involved in creating a Klatt synth in VST format though, though I’m pretty sure he’s busy with other projects currently. But yeah, I’m delighted to hear that someone other than me finds this cheap robot voice impressive :smiley:


For what it’s worth, I just made a GitHub page for this script if anyone else wants to give pointers and/or play around with it. Maybe it needs a better name than “Fram”, but I couldn’t really think of anything else. :upside_down_face:


Oh yes, this is indeed very impressive! Would not have expected that this can be achieved with the built-in SP synths and FXes. :+1:


Honestly I was astounded that it sounded as good as it did. Human? no, but intelligible? At least…kinda! I mean, for what it’s worth my parents gave me a copy of “Radio-Activity” by Kraftwerk on my 9th birthday, so that’s my supervillain origin story :stuck_out_tongue:

I’ll agree with mlange here, it is definitely impressive :grinning_face_with_smiling_eyes:

As far as possibly including it as a built-in synth goes, the biggest challenges are things like: can it be done in a 10-year-old-friendly manner (e.g. can the syntax to invoke the synth be relatively simple)? And is the final implementation feasible?
If we can solve things like these, then I think it would definitely be fun to bundle it with the app :grin:


Amazing - great to see stuff like this being made with the default synths!

I made the vowel fx but at the time I didn’t really have a clue about formants etc. - I was mainly just copying it from the TidalCycles project :sweat_smile: I understand stuff a little better now. Looking at what you’ve done, it seems plausible to me that we could port it to a SuperCollider synth for a bit of a performance boost.

In terms of a UI, I think it would also be totally possible to have something like

synth :robot, phrase: "work it, make it, do it, makes us", note: :a4

and use a gem like hilarysk/string_to_ipa (a Ruby gem that converts a string to the International Phonetic Alphabet, and a string of IPA characters back to American English) to break up the phrase into a list of phonemes. We’d then need to have a lookup table of all the possible phonemes to map them to an integer index (because SuperCollider doesn’t accept strings as arguments via OSC directly). A reasonable amount of work to get right, but I don’t see any issues with that approach.
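That lookup-table step could be sketched in plain Ruby something like this (a minimal sketch; the names and index assignments here are hypothetical, not an existing Sonic Pi API):

```ruby
# Hypothetical sketch: map IPA phoneme strings onto integer indices, since
# scsynth takes numeric synth arguments rather than strings.
PHONEME_INDEX = {
  "h" => 0, "ɑ" => 1, "ɹ" => 2, "d" => 3, "ə" => 4,
  "b" => 5, "ɛ" => 6, "t" => 7, "f" => 8, "æ" => 9, "s" => 10
}.freeze

# Convert a list of IPA phonemes into the integer arguments a SynthDef could take.
def phonemes_to_args(phonemes)
  phonemes.map do |p|
    PHONEME_INDEX.fetch(p) { raise ArgumentError, "no index for phoneme #{p.inspect}" }
  end
end

phonemes_to_args(["h", "ɑ", "ɹ", "d", "ə", "ɹ"]) # => [0, 1, 2, 3, 4, 2]
```

Failing loudly on an unknown phoneme (rather than returning nil) would make it easier to spot gaps in the table while building it out.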

Going back to your original question about buffer underruns: it is sort of like that, but not quite :smiley: Sonic Pi isn’t rendering audio directly; instead it is handling the sequencing and triggering of events (either MIDI, OSC or SuperCollider synth events). In order to trigger the events in an accurate way it uses a technique very similar to the audio-programming technique known as double buffering. When you see the timing warnings, you’re hitting a buffer underrun in the event timing buffer. It’s also possible to overwhelm SuperCollider and get real audio buffer underruns there, but it’s less likely.

If it is really bugging you, you can increase the length of the event timing buffer by running

use_sched_ahead_time 1
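To picture why the warning can fire without any audible crackle, here is a toy plain-Ruby model of the sched-ahead idea (purely illustrative, not Sonic Pi’s actual implementation): virtual time advances by `sleep`, real time advances by computation, and the sched-ahead margin is how far real time may lag before an event would actually reach scsynth late.

```ruby
# Toy model of an event-timing buffer (illustrative only).
# warnings: real time has fallen behind virtual time ("running slightly behind").
# misses:   the lag exceeds the sched-ahead margin, so the event would be late.
class ToyScheduler
  attr_reader :warnings, :misses

  def initialize(sched_ahead_time)
    @sched_ahead = sched_ahead_time
    @virtual = 0.0 # time the program thinks has passed (advanced by sleep)
    @real = 0.0    # wall-clock time actually consumed
    @warnings = 0
    @misses = 0
  end

  # One play-then-sleep step: compute_cost is how long the event took to
  # compute; sleep_time is the sleep call that follows it.
  def step(compute_cost, sleep_time)
    @real += compute_cost
    @virtual += sleep_time
    if @real > @virtual
      @warnings += 1
      @misses += 1 if @real > @virtual + @sched_ahead
    else
      @real = @virtual # waiting inside sleep re-syncs real time with virtual time
    end
  end
end
```

With a 0.5 s margin, steps costing 0.3 s against 0.25 s sleeps rack up warnings but no misses, which matches hearing clean audio while the log complains; raising the margin (as `use_sched_ahead_time` does) buys more slack at the cost of responsiveness.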

I think Xav’s idea is great! with the added suggestion of making the synthesiser an FX rather than a synth - that way we could hopefully make it easy to change the ‘voice’ simply by wrapping the synth of your choice with the robot FX :slightly_smiling_face:
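In use, that FX idea might look something like this (purely hypothetical syntax: no :robot FX exists, and the opt names are made up):

```ruby
# Hypothetical sketch only: there is no :robot FX in Sonic Pi (yet).
# The idea is that whatever tone source you wrap becomes the "voice".
with_fx :robot, phrase: "work it, make it, do it, makes us" do
  synth :saw, note: :a4, sustain: 4      # buzzy voice
  # synth :hollow, note: :a4, sustain: 4 # or swap in a breathier one
end
```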


@ethancrawford Okay yeah, I was ABSOLUTELY going to make the :voxcords wave switchable. Kraftwerk got some HAUNTING sounds from switching out the tone source on their Votrax chip (or putting it through a vocoder, depending on who you ask).

@xavierriley I…had no idea about the string_to_ipa gem?? Like. All of my experience with Ruby is just from messing around with Sonic Pi. Yeah, that would be honestly really helpful to have, though I’d prefer being able to sequence phonemes themselves. Though it wouldn’t hurt to have a way to type in either plain text OR phonetic code? A lot of old speech synths (DecTalk and the Software Automatic Mouth come to mind) include the ability to enter some sort of phonetic/stress code in order to make clearer output in certain contexts.

And dang, thanks for the note about use_sched_ahead_time! I hadn’t heard about that. Maybe it’s in the manual somewhere but I didn’t see it there. (I think it would kinda be nice for the IDE to have a way to search the documentation, actually. Or something like Max/MSP has where you can right click an object or command in the editor and have the option to pull up the documentation for it but…one thing at a time. :P)


This is actually possible using Ctrl+I.

+1 for the searchability suggestion.


Alright, so I’ve messed around with the code a bit more and tried to comment a bit better on some of the issues and how it works.

# FRAM v0.1.1
# written by Alex Hauptmann
# for Sonic Pi v3.3.1
# last updated 04/08/21

# FRAM is a set of Sonic Pi scripts
# describing and sequencing a cascade
# formant synthesizer to emulate
# speech output. Why? Because I can.

# CHANGELOG
# * Attempting to use "set_sched_ahead_time!" to make sequence timing more consistent (greetz to @xavierriley)
# * THREE NEW WORDS: "Harder", "Better" and "Faster" (no "Stronger" YET because I have NOT optimized this for Consonant Clusters)

use_bpm 65
set_sched_ahead_time! 2

# Simulating vocal cords
# (This can be changed for Sonovox-style effects)
define :voxcords do |note, length|
  use_synth :pulse
  use_synth_defaults pulse_width: 0.1, cutoff: 80, resonance: 0.1 # thin pulse with some rolloff
  play note, sustain: length, release: 0.1 # sustain: (not duration:) is the play opt that holds the note
end

# Defining a cascade formant filter as described by Dennis Klatt
# "consonant" conditionals are still...problematic at this point
define :klatt do |note, init_consonant, f1_a, f2_a, f3_a, f1_b, f2_b, f3_b, end_consonant, glide, breakpoint, length|
  with_fx :normaliser do # normalizing :bpfs after the series chain decreases distortion
    with_fx :bpf, res: 0.7, centre: f3_a, centre_slide: (glide * breakpoint) do |c3| # lips & teeth
      if init_consonant == "S"
        use_synth :noise
        play 40, cutoff: 130, amp: 0.01, attack: 0.125, decay: 0.125/8, release: 0.125/8
      end
      if init_consonant == "F"
        use_synth :noise
        play 40, cutoff: 90, amp: 0.01, attack: 0.125/2, decay: 0.125/16, release: 0
      end
      if init_consonant == "D"
        use_synth :pnoise
        play 40, cutoff: 130, amp: 0.1, attack: 0, decay: 0.125/16, release: 0.125/8
      end
      if init_consonant == "B"
        use_synth :pnoise
        play 40, cutoff: 70, amp: 0.1, attack: 0, decay: 0.125/16, release: 0.125/8
      end
      
      with_fx :bpf, res: 0.8, centre: f2_a, centre_slide: (glide * breakpoint) do |c2| # tongue
        
        if init_consonant == "K"
          use_synth :pnoise
          play 40, cutoff: 130, amp: 0.5, decay: 0.125/8, release: 0.125/8
        end
        
        with_fx :bpf, res: 0.9, centre: f1_a, centre_slide: (glide * breakpoint) do |c1| # throat
          if init_consonant == "H"
            use_synth :noise
            play 40, cutoff: 130, amp: 0.05, attack: 0.125, decay: 0.125/8, release: 0.125/8
          end
          
          voxcords note, length
          sleep breakpoint # transition to next phoneme
          control c1, centre: f1_b
        end
        control c2, centre: f2_b
        if end_consonant == "K"
          sleep (length*0.9)
          use_synth :pnoise
          play 40, cutoff: 130, amp: 0.5, decay: 0.125/8, release: 0.125/8
        end
      end
      control c3, centre: f3_b
      if end_consonant == "T"
        sleep (length*0.9)
        use_synth :noise
        play 40, cutoff: 130, amp: 0.4, attack: 0, decay: 0.125/8, release: 0.125/8
      end
      if end_consonant == "S"
        sleep (length*0.5)
        use_synth :noise
        play 40, cutoff: 130, res: 0.5, amp: 0.005, attack: 0.125/2, decay: 0.125/8, release: 0.125/8
      end
    end
  end
end

# Table of :bpf centre values to simulate formants
# Formants taken from "Musical Signal Processing with LabVIEW" by Ed Doering
# Rounded to the nearest MIDI note because HAHAHA WHAT THE HECK
define :klatt_vowels do |vowel|
  if vowel == "AW"
    return [99, 80, 74]
  end
  
  if vowel == "OO"
    return [62, 81, 97]
  end
  
  if vowel == "MM"
    return [62, 62, 74] # stabbing in the dark here for formant values of "MM". it sounds. fine for now
  end
  
  if vowel == "OU"
    return [69, 84, 97]
  end
  
  if vowel == "AH"
    return [78, 85, 99]
  end
  
  if vowel == "UH"
    return [72, 86, 98]
  end
  
  if vowel == "ER"
    return [71, 88, 92]
  end
  
  if vowel == "AE"
    return [76, 93, 98]
  end
  
  if vowel == "EH"
    return [72, 94, 99]
  end
  
  if vowel == "IH"
    return [67, 95, 99]
  end
  
  if vowel == "EE"
    return [61, 98, 102]
  end
end

# Definition of input format for phonetic sequencing
# too many functions calling functions? maybe this is inefficient
define :klatt_diphones do |note, init_consonant, diphone1, diphone2, end_consonant, glide, breakpoint, sustain|
  d1 = klatt_vowels diphone1 # [f1, f2, f3] of vowel lookup table
  d2 = klatt_vowels diphone2 # arrays, h*ck yeah
  
  klatt note, init_consonant, d1[0], d1[1], d1[2], d2[0], d2[1], d2[2], end_consonant, glide, breakpoint, sustain
  
end

# Sequencing the phonemes
# NOTE: the phonetic sequence is not optimal here,
# but I've run into problems with the engine forgetting
# how to pronounce vowels after being given initial
# consonant arguments? I may simply need to find
# a more elegant format
klatt_diphones 40, 0, "OO", "ER", 0, 1, 0.125/4, 0.25
sleep 0.25
klatt_diphones 40, "K", "IH", "IH", "T", 0, 0.125, 0.25
sleep 0.5
klatt_diphones 52, 0, "MM", "EH", 0, 0, 0.125/2, 0.25
sleep 0.25
klatt_diphones 52, "K", "IH", "IH", "T", 0, 0.125, 0.25
sleep 0.5
klatt_diphones 43, "D", "OO", "OO", 0, 0, 0.125, 0.25
sleep 0.2 # for some reason i needed to make this "sleep" shorter in order for the next word to not be late??
klatt_diphones 43, 0, "IH", "IH", "T", 0, 0.125, 0.25
sleep 0.5
klatt_diphones 55, 0, "MM", "EH", 0, 0, 0.125/2, 0.25 # using "K" as end_consonant seems to mess with "EH" formant values here??
sleep 0.25
klatt_diphones 55, "S", "UH", "UH", "S", 0, 0.125, 0.25
sleep 0.75
klatt_diphones 59, "H", "AH", "AH", 0, 0, 0.125, 0.25
sleep 0.25
klatt_diphones 57, "D", "ER", "ER", 0, 0, 0.125, 0.25
sleep 0.5
klatt_diphones 57, "B", "EH", "EH", 0, 0, 0.125, 0.25
sleep 0.25
klatt_diphones 55, "D", "ER", "ER", 0, 0, 0.125, 0.25
sleep 0.5
klatt_diphones 54, "F", "AE", "AE", "S", 0, 0.125, 0.2
sleep 0.25
klatt_diphones 52, "D", "ER", "ER", 0, 0, 0.125, 0.25
sleep 0.5

I’ve also updated the GitHub page for this project. Basically the way this thing works is that a buzz source goes into three band-pass filters in series, which is a…very oversimplified model of how the human vocal tract works. The first filter in the series tends to have the lowest centre frequency (notated “F1” in speech analysis), and it seems to model the part of the throat immediately above the vocal cords, which is why I used it to model the “H” sound with a noise burst. The second filter (F2) tends to have a higher formant than F1, and it approximates the position of the mouth and tongue. The final filter (F3) is the highest in formant frequency, and seems to be the best place to model sibilance, fricatives and plosives (aka the stuff you do with your lips when you talk).
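Since the vowel table stores formants as MIDI note numbers rather than Hz, the centre frequencies can be sanity-checked with the standard equal-temperament conversion (plain Ruby, outside Sonic Pi):

```ruby
# Standard conversion: MIDI note 69 = A4 = 440 Hz.
def midi_to_hz(note)
  440.0 * 2.0 ** ((note - 69) / 12.0)
end

# The "EE" entry [61, 98, 102] works out to roughly 277, 2349 and 2960 Hz,
# which is in the right neighbourhood of commonly cited F1/F2/F3
# measurements for /i/ (around 270, 2290 and 3010 Hz).
ee = [61, 98, 102].map { |n| midi_to_hz(n).round }
```

So the "rounded to the nearest MIDI note" approximation lands within a semitone of the published values, which is probably fine for a cheap robot voice.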

I will admit, this is still an experiment for me. I have no formal training in linguistics or acoustics or even coding at all; I’m just a synth geek with a weird obsession with creating non-human vocals. ¯\_(ツ)_/¯


okay so omg i JUST reread this post and i had NO clue there was a ruby gem that converted strings to IPA? that’s gonna take a LOT of work off, especially considering English as a language is just COMPLETELY ridiculous regarding phonotactics.

i think the reason vocal synths like Vocaloid are so popular in Japan is because Japanese is MUCH more phonotactically simple than English is, and thus MUCH easier to synthesize. it sort of makes me wonder whether it would be worth allowing Japanese support for this “:robot” instrument? i’m one of Plogue’s beta testers and social media managers, and they included Japanese support for Chipspeech from day one, even though a lot of the voices can’t pronounce the Japanese “r” phoneme correctly because they were designed for American English (or British English, in the case of Rotten.ST).

It’s been a little while. I’m still super keen to include Fram in Sonic Pi if we ever manage to work out a way to port it to a SuperCollider SynthDef :grinning_face_with_smiling_eyes:

I think this too is a great idea. We have a whole bunch of Japanese Sonic Pi enthusiasts for one thing :grinning_face_with_smiling_eyes: