
Music, Tech and Machine Learning — Interview with Daniel Rothmann

Songs, radio or a film at the cinema: do you ever stop to think about what's behind the sounds we hear? Do you ever wonder about what role modern technology — including Machine Learning — plays in the production of the sounds you consume and take for granted?

Photo by Pawel Czerwinski on Unsplash.

To look into the topic, we recently talked to Daniel Rothmann. Daniel is one of a rare breed: he lived a first life as an audio engineer and music producer, working for the big players in the media and entertainment industry for ten years. Then his career took a big turn when he decided he wanted to go back to tweaking synth knobs and building software for sound processing. For instance, he created a popular VST plugin called ROTH-AIR, which lets you add breathiness to a sound: it's really cool! Then he discovered the Machine Learning space, and since then he has worked as a Machine Learning and Data Engineer at companies like Kanda, LEGO, and Dynaudio. Last year he released SynthVR: a modular synthesizer environment built from the ground up for virtual reality.

You can find some of his writings on his blog, where he explains his take on the promise of AI in signal processing, as well as his collaboration with Engadget to find out whether AI can really make smart speakers sound better or whether it's all a marketing stunt.


All in all, we got to ask him some questions, and here are his thoughts.


You've always had an interest in exploring the intersection of technology and sound, but that doesn't necessarily involve Machine Learning. What made you focus your attention on ML coming from such a sound-centric background?


It was the "WaveNet moment". WaveNet is an approach for generating audio with deep neural networks, proposed by DeepMind in 2016. At the time, most audio-processing neural networks relied on intermediate representations of sound, mostly some kind of spectrogram. Think of splitting a sound wave into short frames (typically tens of milliseconds long) and then applying a Fourier-like transform to each frame to represent the signal in the frequency domain.


Why, you might ask? A raw waveform is pretty much just time-series-like data: the amplitude of a wave at each point in time. But the sampling of this time series is pretty dense: we're normally talking about more than 40k samples per second! Now imagine you feed that kind of data into a Recurrent Neural Network: things happening one second apart are... 40 thousand samples apart! That's a pretty long-range dependency for vanishing gradients to overcome... And working directly on raw samples is close to what DeepMind did with WaveNet, but using stacked 1D convolutions. This felt like a pretty important step, and it got me excited about Machine Learning on sound.
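To put the scale of that dependency in numbers, here is a minimal sketch of how many past samples a stack of 1D convolutions can "see". It assumes dilated convolutions with kernel size 2 and dilations doubling from 1 to 512, the pattern described in the WaveNet paper; the function name `receptive_field` and the 44.1 kHz sample rate are illustrative choices, not details from the interview.

```python
def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


SAMPLE_RATE = 44_100  # assumed CD-quality sample rate, in samples per second
one_stack = [2 ** i for i in range(10)]  # dilations 1, 2, 4, ..., 512

single = receptive_field(2, one_stack)      # one stack of ten layers
triple = receptive_field(2, one_stack * 3)  # the same stack repeated three times

print(single, "samples =", single / SAMPLE_RATE, "seconds")
print(triple, "samples =", triple / SAMPLE_RATE, "seconds")
```

With doubling dilations the span grows exponentially with depth, so ten layers already cover 1,024 samples. At this sample rate that is still only a few hundredths of a second of audio, which gives a feel for why modelling raw waveforms is hard.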


One of the issues with spectrogram-based representations of sound is that they're naturally based on complex numbers with two components. Intuitively, a spectrogram tells you something like "how much 100 Hz is in this signal?" (the magnitude) but also "how is this frequency aligned in time?" (the phase of the complex number). Many ML models work only with the magnitudes, discard the phase altogether, and then reconstruct the signal without that original information. To some extent, our perception of sound is invariant to phase shifts, but some information is definitely lost. This is one of the issues that "raw-sample" models like WaveNet address.
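The magnitude and phase split can be illustrated with a small sketch. It computes a single discrete Fourier transform coefficient by hand for two 100 Hz tones that differ only by a phase shift. The 8 kHz sample rate, the 0.1 s length, and the helper names `dft_bin` and `tone` are assumptions made for the example.

```python
import cmath
import math


def dft_bin(signal, k):
    """Complex DFT coefficient of `signal` at frequency bin k."""
    n_total = len(signal)
    return sum(
        x * cmath.exp(-2j * math.pi * k * n / n_total)
        for n, x in enumerate(signal)
    )


SAMPLE_RATE = 8000  # assumed, chosen to keep the arithmetic simple
N = 800             # 0.1 s of audio
FREQ = 100          # Hz
K = FREQ * N // SAMPLE_RATE  # DFT bin that corresponds to 100 Hz


def tone(phase):
    """A 100 Hz cosine with the given starting phase, in radians."""
    return [
        math.cos(2 * math.pi * FREQ * n / SAMPLE_RATE + phase)
        for n in range(N)
    ]


for phase in (0.0, math.pi / 3):
    coeff = dft_bin(tone(phase), K)
    print("magnitude:", abs(coeff), "phase:", cmath.phase(coeff))
```

Both tones give the same magnitude at the 100 Hz bin, and only the phase differs. A model that keeps just the magnitudes cannot tell these two signals apart.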


After having worked on various ML projects related to sound, and knowing many artists and music industry folks, what directions are you excited about in the combination of ML and music?


I think a lot of autoencoder-based modeling is fascinating. For instance, autoencoders can be used for style-transfer-type applications, where a melody played in one timbre is converted into a target timbre (e.g. check out recent works on the topic). Let's say you whistle a melody and then play it back as a string ensemble. This is tremendously interesting because melodies are not only notes, rhythm, and dynamics. The depth of sound often comes from an expressiveness that largely depends on the physical properties of each instrument and the player. The standard protocol for telling a computer what, when, and how to produce musical sounds is called MIDI (Musical Instrument Digital Interface), and it has its expressive limitations. Being able to transfer or blend sounds beyond what's possible with existing MIDI-based tech could open up avenues of new sounds and expressions that artists can experiment with. How would a trumpet glissando sound with the timbre of a piano? Those wild sounds might make it into the mainstream and into our ears sooner rather than later!
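For a sense of how little a basic MIDI message carries, here is a small sketch that builds a MIDI 1.0 note-on message: three bytes holding a channel, a note number, and a velocity, with each data value limited to the range 0 to 127. The helper name `note_on` is hypothetical and not part of any particular library.

```python
def note_on(channel, note, velocity):
    """Build a 3-byte MIDI 1.0 note-on message."""
    if not 0 <= channel <= 15:
        raise ValueError("channel must be in 0..15")
    if not (0 <= note <= 127 and 0 <= velocity <= 127):
        raise ValueError("note and velocity must be in 0..127")
    # Status byte: 0x90 marks note-on, the low four bits select the channel.
    return bytes([0x90 | channel, note, velocity])


message = note_on(channel=0, note=60, velocity=100)  # note 60 is middle C
print(message.hex())
```

Everything about how the note starts is squeezed into one pitch number and one velocity number. That is part of why the nuances of a whistled or bowed melody are hard to express this way.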


Still, most of these approaches are not "production-ready" yet: they're either unreliable or too computationally expensive. One idea that does have the potential to impact mainstream audio production, however, is Differentiable Digital Signal Processing (or Differentiable DSP). This is basically an approach where whatever signal processing tool you're using is differentiable with respect to all of its parameters, so you can apply gradient-descent-type optimization techniques to them. This could be used to learn the parameters of a processing component so that it produces certain target sonic characteristics. For instance, you could fit a digital amp emulation to a raw guitar sound so that it resembles the guitar tone of your favorite rock star.
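As a toy illustration of the idea, the sketch below treats a plain gain control as the "processing component" and uses gradient descent to learn the gain that makes a dry signal match a target. This is a deliberately simplified assumption: the gradient is written out by hand, whereas a real Differentiable DSP setup would rely on an automatic differentiation framework and far richer processors. All names and values here are illustrative.

```python
import math


def apply_gain(signal, gain):
    """The 'effect': scale every sample by a single gain parameter."""
    return [gain * x for x in signal]


def loss_and_grad(signal, target, gain):
    """Mean squared error and its derivative with respect to gain."""
    n = len(signal)
    errors = [gain * x - y for x, y in zip(signal, target)]
    loss = sum(e * e for e in errors) / n
    grad = sum(2 * e * x for e, x in zip(errors, signal)) / n
    return loss, grad


# Dry input: five cycles of a sine wave over 100 samples.
dry = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
# Hypothetical reference sound: the same signal at half the level.
target = apply_gain(dry, 0.5)

gain = 1.0           # initial guess for the parameter
LEARNING_RATE = 0.5  # assumed step size
for _ in range(50):
    loss, grad = loss_and_grad(dry, target, gain)
    gain -= LEARNING_RATE * grad

print("learned gain:", gain, "final loss:", loss)
```

The learned gain converges towards 0.5, the value used to create the target. The same loop structure applies when the parameters belong to a filter or an amp model instead of a single gain, as long as the output stays differentiable with respect to them.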


Finally, which parts of audio production do you see being automated, and which not?


I don't see much opportunity for fully automating music and sound production workflows. I see new tech more as a helper or productivity booster for sound engineers, or even as a way of empowering the creative expression of artists, rather than as a replacement for people in the pipeline of making a record. One example of an existing machine-learning-based processing tool is iZotope's assistive audio technology, which tries to predict which settings you will want to use for a certain audio clip. Another interesting example is Landr, an automated mastering service. Broadly speaking, mastering is the final polish applied to an audio production: small adjustments that perfect the characteristics of the sound. This service has made mastering available to a whole new set of consumers, such as "bedroom producers", which is great, but it is still far from replacing professional human mastering engineers.


Thanks for your time and thoughts!

Thanks for having me.

 

We hope you enjoyed this little detour into the world of music and Machine Learning! We are quite excited about the kinds of sound that will emerge from the collaboration between AI and creativity. Could the next big pop hit have a style-transfer-based synth solo on it? Time will tell!


In the meantime, you can find the 6 essential papers that Daniel recommends reading about ML and sound.
