Meeting
Synthesis in digital audio signal processing
Scientific axes:
- Audio, Vision and Perception
Organizers:
Please note that, in order to guarantee access to the meeting rooms for all registrants, registration for meetings is free but mandatory.
The meeting will also be accessible remotely.
Registration
7 members of the GdR ISIS, and 0 non-members of the GdR, are registered for this meeting.
Room capacity: 70 people. 63 places remaining.
Announcement
As part of the "Audio, Vision, Perception" axis of the GdR IASIS, we are organizing a one-day workshop dedicated to audio synthesis. The workshop will take place on Thursday, November 7, at Ircam, in Paris.
For this occasion, we have invited four speakers from French public and private research:
- Fanny Roche, ARTURIA France
- Neil Zeghidour, Kyutai
- Judy Najnudel, UVI
- Philippe Esling, Ircam
After these four invited talks, the day will close with a poster session open to all researchers working on audio signal processing in general. This includes synthesis, but also other applications such as detection, classification, regression, transcription, temporal alignment, structure analysis, localization, spatialization, source separation, denoising, compression, visualization, and interaction design. We particularly encourage young researchers at the master's or doctoral level to present their work, even if it is still to appear, under review, or in preparation.
If you wish to take part in the poster session, please write to Vincent Lostanlen before October 10: vincent.lostanlen (@) ls2n.fr. In preparation for the day, you will be asked for a title as well as an "audiocarnet", that is, an audio file of about one minute presenting the sounds your poster is about. For examples of previously published audiocarnets, see: https://hal.science/AUDIOCARNET
Program
09:30 Welcome (coffee)
10:00 Introduction
10:15 Fanny Roche: Music sound synthesis using machine learning
11:15 Neil Zeghidour: Audio Language Models
12:15 Lunch
14:00 Judy Najnudel: Grey-box modelling informed by physics: Application to commercial digital audio effects
15:00 Philippe Esling: AI in 64Kbps: Lightweight neural audio synthesis for embedded instruments
16:00 Brief presentation of the posters in the room
16:30 Posters and coffee
17:30 Closing
Abstracts of the contributions
Music sound synthesis using machine learning
Fanny Roche, ARTURIA France
One of the major challenges of the synthesizer market and sound synthesis today lies in proposing new forms of synthesis allowing the creation of brand new sonorities while offering musicians more intuitive and perceptually meaningful control to help them find the perfect sound more easily. Indeed, today's synthesizers are very powerful tools offering musicians a wide range of possibilities for creating sound textures, but parameter control still lacks user-friendliness and generally requires expert knowledge. This presentation will focus on machine learning methods for sound synthesis, enabling the generation of new, high-quality sounds while providing perceptually relevant control parameters.
In the first part of this talk, we will focus on the perceptual characterization of synthetic musical timbre, highlighting a set of verbal descriptors frequently and consensually used by musicians. Second, we will explore the use of machine learning algorithms for sound synthesis, in particular several models of the "autoencoder" type, for which we carried out an in-depth comparative study on two different datasets. The presentation will then turn to the perceptual regularization of the proposed model, based on the perceptual characterization of synthetic timbre presented in the first part, to enable (at least partially) perceptually relevant control of sound synthesis. Finally, we will briefly present some of the latest experiments we conducted with more recent neural synthesis models.
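To fix ideas, here is a minimal Python (PyTorch) sketch of the general kind of model the abstract describes: an autoencoder on spectrogram frames whose latent code is regularized toward perceptual descriptor ratings. The dimensions, loss weighting, and descriptor targets below are illustrative assumptions, not the model studied in the talk.

# Minimal sketch: autoencoder on spectrogram frames with a perceptual
# regularization term. All shapes and targets are illustrative assumptions.
import torch
import torch.nn as nn

class PerceptualAE(nn.Module):
    def __init__(self, n_bins=128, n_latent=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_bins, 256), nn.ReLU(),
            nn.Linear(256, n_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(),
            nn.Linear(256, n_bins),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = PerceptualAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch: spectrogram frames and hypothetical descriptor ratings
# (e.g. "brightness", "warmth"), one value per latent dimension.
frames = torch.rand(32, 128)
perceptual_targets = torch.rand(32, 8)

recon, z = model(frames)
loss = nn.functional.mse_loss(recon, frames) \
     + 0.1 * nn.functional.mse_loss(z, perceptual_targets)  # perceptual regularizer
loss.backward()
opt.step()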
Audio Language Models
Neil Zeghidour, Kyutai
Audio analysis and audio synthesis require modeling long-term, complex phenomena and have historically been tackled in an asymmetric fashion, with specific analysis models that differ from their synthesis counterpart. In this presentation, we will introduce the concept of audio language models, a recent innovation aimed at overcoming these limitations. By discretizing audio signals using a neural audio codec, we can frame both audio generation and audio comprehension as similar autoregressive sequence-to-sequence tasks, capitalizing on the well-established Transformer architecture commonly used in language modeling. This approach unlocks novel capabilities in areas such as textless speech modeling, zero-shot voice conversion, text-to-music generation and even real-time spoken dialogue. Furthermore, we will illustrate how the integration of analysis and synthesis within a single model enables the creation of versatile audio models capable of handling a wide range of tasks involving audio as inputs or outputs. We will conclude by highlighting the promising prospects offered by these models and discussing the key challenges that lie ahead in their development.
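As a rough sketch of this framing (a toy illustration, not Kyutai's models): replace the learned neural codec with a uniform quantizer, then model the resulting token sequence with a causal Transformer trained by next-token prediction. The vocabulary size, shapes, and architecture below are assumptions.

# Minimal sketch: audio as a sequence of discrete "codec" tokens, modeled
# autoregressively with a causal Transformer. The codec is a toy uniform
# quantizer standing in for a learned neural audio codec.
import torch
import torch.nn as nn

VOCAB = 1024  # size of the (assumed) codec codebook

def toy_codec_encode(waveform, vocab=VOCAB):
    """Stand-in for a neural codec: uniformly quantize samples to tokens."""
    return ((waveform.clamp(-1, 1) + 1) / 2 * (vocab - 1)).long()

class AudioLM(nn.Module):
    def __init__(self, vocab=VOCAB, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        T = tokens.shape[1]
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)  # next-token logits

tokens = toy_codec_encode(torch.randn(2, 128))   # (batch, time)
logits = AudioLM()(tokens[:, :-1])               # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))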
Grey-box modelling informed by physics: Application to commercial digital audio effects
Judy Najnudel, UVI
In an ever-expanding and competitive market, commercial digital audio effects face significant constraints. Their computational load must be kept low so that they can operate in real time. They must be easily controllable through a small number of parameters that relate to clear features. They must also be robust and safe over large combinations of inputs and controls, to leave room for user creativity. Effects based on existing systems (acoustic or electronic devices) must in addition sound realistic and capture their expected idiosyncrasies.
For this last category of effects, a full physical model is not always available or even desirable, as it can be too complex to run and to use efficiently. In this talk, we explore grey-box approaches that combine strong physically based priors with identification from measurement data. The priors impose a model structure that preserves fundamental properties such as passivity and dissipativity, while the measurements make it possible to bridge gaps in the model. This produces reduced, macroscopic, power-balanced models of complex physical systems that can be fitted to data and result in numerically stable simulations. The approach is illustrated on real electronic components and circuits, with audio demonstrations of the corresponding effects to complete the presentation.
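As a toy illustration of the grey-box idea (the model structure and data here are assumptions, not the speaker's method): a nonlinear resistor is given the structural prior i = a·sinh(b·v) with a, b > 0, which guarantees v·i >= 0 for any parameter values (the element can only dissipate power), and the two parameters are then identified from measurement data, here synthetic.

# Toy grey-box identification sketch (NumPy/SciPy): impose a dissipative
# structural prior i = a*sinh(b*v) with a, b > 0, then fit a and b to
# measured current/voltage pairs. Not the speaker's actual models.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)

# Synthetic "measurements" of a nonlinear resistor (stand-in for real data).
v_meas = np.linspace(-1.0, 1.0, 200)
i_meas = 2e-3 * np.sinh(3.0 * v_meas) + 1e-4 * rng.standard_normal(v_meas.size)

def residuals(theta):
    # Positivity enforced through exp(): the dissipative structure holds
    # for any unconstrained theta, so the fit cannot break passivity.
    a, b = np.exp(theta)
    return a * np.sinh(b * v_meas) - i_meas

fit = least_squares(residuals, x0=np.log([1e-3, 1.0]))
a_hat, b_hat = np.exp(fit.x)
print(f"identified a = {a_hat:.2e}, b = {b_hat:.2f}")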
AI in 64Kbps: Lightweight neural audio synthesis for embedded instruments
Philippe Esling, Ircam
The research project led by the ACIDS group at IRCAM aims to model musical creativity by extending probabilistic learning approaches to multivariate and multimodal time series. Our main object of study lies in the properties and perception of musical synthesis and artificial creativity. In this context, we experiment with deep AI models applied to creative materials, aiming to develop artificial creative intelligence. Over the past years, we have developed several objects that embed this research directly as real-time objects usable in MaxMSP. Our team has produced many prototypes of innovative instruments and musical pieces in collaboration with renowned composers. However, the often-overlooked downside of deep models is their massive complexity and tremendous computation cost. This aspect is especially critical in audio applications, which heavily rely on specialized embedded hardware with real-time constraints. Hence, the lack of work on efficient lightweight deep models is a significant limitation for the real-life use of deep models on resource-constrained hardware. We show how these objectives can be attained through several recent theories (the lottery ticket hypothesis (Frankle and Carbin, 2018), mode connectivity (Garipov et al., 2018), and information bottleneck theory), and demonstrate how our research has led to lightweight and embedded deep audio models, namely: 1/ Neurorack, the first deep AI-based Eurorack synthesizer; 2/ FlowSynth, a learning-based device that lets users travel the auditory spaces of synthesizers simply by moving their hand; 3/ RAVE on Raspberry Pi, 48 kHz real-time embedded deep synthesis.
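As a loose illustration of the kind of model compression involved (not the ACIDS implementation), the sketch below applies one round of global magnitude pruning, in the spirit of the lottery ticket hypothesis, to a tiny stand-in audio model; the architecture and the 80% pruning ratio are arbitrary assumptions.

# Minimal sketch: one round of global magnitude pruning (lottery-ticket
# spirit) on a tiny stand-in model. Architecture and ratio are arbitrary.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(           # stand-in for a neural audio synthesizer
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=9, padding=4),
)

# ... (train the dense model here, keeping a copy of its initial weights) ...

# Remove the 80% smallest-magnitude weights across all conv layers at once.
to_prune = [(m, "weight") for m in model if isinstance(m, nn.Conv1d)]
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.8)

# In the lottery-ticket procedure, the surviving weights would then be reset
# to their initial values and the sparse subnetwork retrained from there.
kept = sum(int(m.weight_mask.sum()) for m, _ in to_prune)
total = sum(m.weight_mask.numel() for m, _ in to_prune)
print(f"remaining weights: {kept}/{total} ({100 * kept / total:.1f}%)")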
Biographies:
Fanny Roche, ARTURIA France
Fanny Roche is a Research & Development Engineer at ARTURIA, where she heads the Digital Signal Processing - Machine Learning team. She holds an M.Sc. in signal processing from Grenoble Institute of Technology (Grenoble INP) and the KTH Royal Institute of Technology in Stockholm, and graduated with a Ph.D. in signal processing and machine learning from Université Grenoble Alpes, in collaboration with ARTURIA. She has been working in industrial research for about eight years, addressing topics such as music sound synthesis, clustering, and transformation applied to digital synthesizers and audio effects.
Neil Zeghidour, Kyutai
Neil is co-founder and Chief Modeling Officer of the Kyutai non-profit research lab. Neil and his team recently presented Moshi (https://moshi.chat), the first real-time spoken dialogue model. He was previously at Google DeepMind, where he started and led a team working on generative audio, with contributions including Google's first text-to-music API, a voice-preserving speech-to-speech translation system, and the first neural audio codec that outperforms general-purpose audio codecs. Before that, Neil spent three years at Facebook AI Research, working on automatic speech recognition and audio understanding. He graduated with a PhD in machine learning from École Normale Supérieure (Paris), and holds an MSc in machine learning from École Normale Supérieure (Saclay) and an MSc in quantitative finance from Université Paris Dauphine. In parallel with his research activities, Neil teaches speech processing technologies at the École Normale Supérieure (Saclay).
Judy Najnudel, UVI
Judy Najnudel received a Master's degree in sound engineering from École Nationale Supérieure Louis Lumière in 2007. After ten years as a post-production manager, she retrained as an R&D audio engineer. She received an M.Sc. in Acoustics, Signal Processing and Computer Science Applied to Music from Sorbonne Université in 2018, and a PhD in Control Systems and Signal Processing from Sorbonne Université in 2022, funded by the software company UVI. She now works full time at UVI, where she develops digital audio effects. Her research interests revolve around physical modelling and nonlinear audio circuits.
Philippe Esling, Ircam
Philippe Esling received a B.Sc. in mathematics and computer science in 2007, an M.Sc. in acoustics and signal processing in 2009, and a PhD in data mining and machine learning in 2012. He was a post-doctoral fellow in the Department of Genetics and Evolution at the University of Geneva in 2012. He has been a tenured associate professor at the Ircam laboratory and Sorbonne Université since 2013. In this short time span, he has authored and co-authored over 20 peer-reviewed papers in prestigious journals. He received a young researcher award for his work on audio querying in 2011, a PhD award for his work on multiobjective time series data mining in 2013, and several best paper awards since 2014. In applied research, he developed and released the first computer-aided orchestration software, Orchids, commercialized in fall 2014, which already has a worldwide community of thousands of users and has led to musical pieces by renowned composers played at international venues. He is the lead investigator of machine learning applied to music generation and orchestration, and leads the recently created Artificial Creative Intelligence and Data Science (ACIDS) group at IRCAM.