This is a list of all publications that I was involved in as primary or co-author. PDFs are linked where I can provide them. Publications where I am the primary author are marked with ➥.
➥
The ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge
(Lorenz Diener, Solomiya Branets, Ando Saabas, Ross Cutler),
at ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, April 2024
Abstract: Audio packet loss concealment is the hiding of gaps in VoIP audio streams caused by network packet loss. With the ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge, we build on the success of the previous Audio PLC Challenge held at INTERSPEECH 2022. We evaluate models on an overall harder dataset, and use the new ITU-T P.804 evaluation procedure to more closely evaluate the performance of systems specifically on the PLC task. We evaluate a total of 9 systems, 8 of which satisfy the strict real-time performance requirements of the challenge, using both P.804 and Word Accuracy evaluations.
BibTeX: @inproceedings{diener2024icassp, title = {The ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge}, author = {Diener, Lorenz and Branets, Solomiya and Saabas, Ando and Cutler, Ross}, year = 2024, month = apr, booktitle = {{ICASSP} 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing}, abstract = {Audio packet loss concealment is the hiding of gaps in VoIP audio streams caused by network packet loss. With the ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge, we build on the success of the previous Audio PLC Challenge held at INTERSPEECH 2022. We evaluate models on an overall harder dataset, and use the new ITU-T P.804 evaluation procedure to more closely evaluate the performance of systems specifically on the PLC task. We evaluate a total of 9 systems, 8 of which satisfy the strict real-time performance requirements of the challenge, using both P.804 and Word Accuracy evaluations.}, code = {https://aka.ms/plc-challenge}, url = {https://halcy.de/cites/pdf/diener2024icassp.pdf}, }
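The Word Accuracy score mentioned in this abstract is derived from ASR transcripts of the concealed audio. As an illustrative sketch only (not the challenge's official scoring code), word accuracy can be computed as 1 minus the word error rate, i.e. the word-level edit distance between a reference transcript and the ASR hypothesis, normalized by the reference length:

```python
# Illustrative sketch: word accuracy = 1 - WER, with WER computed via a
# word-level edit distance (substitutions, insertions, deletions).

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def word_accuracy(reference: str, hypothesis: str) -> float:
    return 1.0 - word_error_rate(reference, hypothesis)

print(word_accuracy("the cat sat on the mat", "the cat sat on mat"))  # ~0.833
```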
➥
PLCMOS - a data-driven non-intrusive metric for the evaluation of packet loss concealment algorithms
(Lorenz Diener, Sten Sootla, Marju Purin, Ando Saabas, Robert Aichner, Ross Cutler),
at INTERSPEECH 2023 - 24th Annual Conference of the International Speech Communication Association, August 2023
Abstract: Speech quality assessment is a problem for every researcher working on models that produce or process speech. Human subjective ratings, the gold standard in speech quality assessment, are expensive and time-consuming to acquire in a quantity that is sufficient to get reliable data, while automated objective metrics show a low correlation with gold standard ratings. This paper presents PLCMOS, a non-intrusive data-driven tool for generating a robust, accurate estimate of the mean opinion score a human rater would assign an audio file that has been processed by being transmitted over a degraded packet-switched network with missing packets being healed by a packet loss concealment algorithm. Our new model shows a model-wise Pearson's correlation of 0.97 and rank correlation of 0.95 with human ratings, substantially above all other available intrusive and non-intrusive metrics. The model is released as an ONNX model for other researchers to use when building PLC systems.
BibTeX: @inproceedings{diener2023plcmos, title = {PLCMOS - a data-driven non-intrusive metric for the evaluation of packet loss concealment algorithms}, author = {Diener, Lorenz and Sootla, Sten and Purin, Marju and Saabas, Ando and Aichner, Robert and Cutler, Ross}, year = 2023, month = aug, booktitle = {{INTERSPEECH} 2023 - 24th Annual Conference of the International Speech Communication Association}, abstract = {Speech quality assessment is a problem for every researcher working on models that produce or process speech. Human subjective ratings, the gold standard in speech quality assessment, are expensive and time-consuming to acquire in a quantity that is sufficient to get reliable data, while automated objective metrics show a low correlation with gold standard ratings. This paper presents PLCMOS, a non-intrusive data-driven tool for generating a robust, accurate estimate of the mean opinion score a human rater would assign an audio file that has been processed by being transmitted over a degraded packet-switched network with missing packets being healed by a packet loss concealment algorithm. Our new model shows a model-wise Pearson's correlation of ~0.97 and rank correlation of ~0.95 with human ratings, substantially above all other available intrusive and non-intrusive metrics. The model is released as an ONNX model for other researchers to use when building PLC systems.}, doi = {10.21437/Interspeech.2023-1532}, code = {https://aka.ms/plcmos}, url = {https://halcy.de/cites/pdf/diener2023plcmos.pdf}, poster = {https://halcy.de/cites/pdf/diener2023plcmos_poster.pdf}, }
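Since the model is distributed as an ONNX file, it can be scored with any ONNX runtime. The sketch below shows the generic onnxruntime pattern; the file name, input features, and tensor shapes are assumptions made for illustration, so check the documentation shipped with the released model before relying on it:

```python
# Minimal sketch of running a released ONNX MOS estimator with onnxruntime.
# The model path, expected input features and tensor layout are assumptions.
import numpy as np
import onnxruntime as ort
import soundfile as sf

session = ort.InferenceSession("plcmos.onnx")        # hypothetical file name
audio, sample_rate = sf.read("healed_output.wav")    # hypothetical file name

# Assumption: the model takes a batch of mono float32 samples.
features = audio.astype(np.float32)[np.newaxis, :]
input_name = session.get_inputs()[0].name
mos = session.run(None, {input_name: features})[0]
print("estimated MOS:", float(np.squeeze(mos)))
```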
➥
INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
(Lorenz Diener, Sten Sootla, Solomiya Branets, Ando Saabas, Robert Aichner, Ross Cutler),
at INTERSPEECH 2022 - 23rd Annual Conference of the International Speech Communication Association, September 2022
Abstract: Audio Packet Loss Concealment (PLC) is the hiding of gaps in audio streams caused by data transmission failures in packet switched networks. This is a common problem, and of increasing importance as end-to-end VoIP telephony and teleconference systems become the default and ever more widely used form of communication in business as well as in personal usage. This paper presents the INTERSPEECH 2022 Audio Deep Packet Loss Concealment challenge. We first give an overview of the PLC problem, and introduce some classical approaches to PLC as well as recent work. We then present the open source dataset released as part of this challenge as well as the evaluation methods and metrics used to determine the winner. We also briefly introduce PLCMOS, a novel data-driven metric that can be used to quickly evaluate the performance of PLC systems. Finally, we present the results of the INTERSPEECH 2022 Audio Deep PLC Challenge, and provide a summary of important takeaways.
BibTeX: @inproceedings{diener2022interspeech, title = {INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge}, author = {Diener, Lorenz and Sootla, Sten and Branets, Solomiya and Saabas, Ando and Aichner, Robert and Cutler, Ross}, year = 2022, month = sep, booktitle = {{INTERSPEECH} 2022 - 23rd Annual Conference of the International Speech Communication Association}, video = {https://www.youtube.com/watch?v=W9gYjas9lB0}, doi = {10.21437/Interspeech.2022-10829}, abstract = {Audio Packet Loss Concealment (PLC) is the hiding of gaps in audio streams caused by data transmission failures in packet switched networks. This is a common problem, and of increasing importance as end-to-end VoIP telephony and teleconference systems become the default and ever more widely used form of communication in business as well as in personal usage. This paper presents the INTERSPEECH 2022 Audio Deep Packet Loss Concealment challenge. We first give an overview of the PLC problem, and introduce some classical approaches to PLC as well as recent work. We then present the open source dataset released as part of this challenge as well as the evaluation methods and metrics used to determine the winner. We also briefly introduce PLCMOS, a novel data-driven metric that can be used to quickly evaluate the performance PLC systems. Finally, we present the results of the INTERSPEECH 2022 Audio Deep PLC Challenge, and provide a summary of important takeaways.}, code = {https://aka.ms/plc-challenge}, url = {https://halcy.de/cites/pdf/diener2022interspeech.pdf}, }
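To make the task described above concrete, here is an illustrative sketch (not challenge code) of what a PLC system has to do: packets arrive as short frames, some frames are lost, and the gaps must be filled with something plausible. The two trivial baselines below, silence and repeat-last-frame, are exactly what learned PLC models aim to improve on:

```python
# Illustrative sketch: simulate packet loss on 20 ms frames of a 16 kHz signal
# and conceal the gaps with the simplest possible strategies.
import numpy as np

def conceal(audio: np.ndarray, lost: np.ndarray, frame: int = 320, mode: str = "repeat") -> np.ndarray:
    """audio: mono float signal; lost: bool array with one flag per 20 ms frame."""
    out = audio.copy()
    for i, is_lost in enumerate(lost):
        if not is_lost:
            continue
        start, end = i * frame, (i + 1) * frame
        if mode == "zero" or start == 0:
            out[start:end] = 0.0                        # fill the gap with silence
        else:
            out[start:end] = out[start - frame:start]   # repeat the previous frame
    return out

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000).astype(np.float32)   # 1 s of placeholder audio
loss_trace = rng.random(50) < 0.1                         # ~10% of frames lost
healed = conceal(signal, loss_trace, mode="repeat")
```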
Sprechen durch Vorstellen
(Miguel Angrick, Maarten Ottenhoff, Lorenz Diener),
in Sprache · Stimme · Gehör, volume 46, pages 62–63, June 2022
Abstract (translated from German): In an international project, computer scientists have succeeded in realizing a so-called speech neuroprosthesis. With it, imagined speech can be made acoustically audible, in real time and without delay. The development can help people who have lost the ability to speak due to neurological disease and cannot communicate with the outside world without assistance.
BibTeX: @article{angrick2022sprechen, title = {Sprechen durch Vorstellen}, author = {Angrick, Miguel and Ottenhoff, Maarten and Diener, Lorenz}, year = 2022, month = jun, journal = {Sprache{\textperiodcentered} Stimme{\textperiodcentered} Geh{\"o}r}, volume = 46, pages = {62--63}, doi = {10.1055/a-1666-7303}, abstract = {In einem internationalen Projekt ist es Informatikerinnen und Informatikern gelungen, eine sogenannte Neurosprachprothese zu realisieren. Damit kann vorgestellte Sprache akustisch hörbar gemacht werden – ohne Verzögerung in Echtzeit. Die Entwicklung kann Menschen helfen, die aufgrund neuronaler Erkrankungen verstummt sind und ohne fremde Hilfe nicht mit der Außenwelt kommunizieren können.}, }
Towards Closed-Loop Speech Synthesis from Stereotactic EEG: A Unit Selection Approach
(Miguel Angrick, Maarten Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sophocles Goulis, Albert J. Colon, Louis Wagner, Dean J. Krusienski, Pieter L. Kubben, Tanja Schultz, Christian Herff),
at ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2022
Abstract: Neurological disorders can severely impact speech communication. Recently, neural speech prostheses have been proposed that reconstruct intelligible speech from neural signals recorded superficially on the cortex. Thus far, it has been unclear whether similar reconstruction is feasible from deeper brain structures, and whether audible speech can be directly synthesized from these reconstructions with low-latency, as required for a practical speech neuroprosthetic. The present study aims to address both challenges. First, we implement a low-latency unit selection based synthesizer that converts neural signals into audible speech. Second, we evaluate our approach on open-loop recordings from 5 patients implanted with stereotactic depth electrodes who conducted a read-aloud task of Dutch utterances. We achieve correlation coefficients significantly higher than chance level of up to 0.6 and an average computational cost of 6.6 ms for each 10 ms frame. While the current reconstructed utterances are not intelligible, our results indicate promising decoding and run-time capabilities that are suitable for investigations of speech processes in closed-loop experiments.
BibTeX: @inproceedings{angrick2022towards, title = {Towards Closed-Loop Speech Synthesis from Stereotactic EEG: A Unit Selection Approach}, author = {Angrick, Miguel and Ottenhoff, Maarten and Diener, Lorenz and Ivucic, Darius and Ivucic, Gabriel and Goulis, Sophocles and Colon, Albert J. and Wagner, Louis and Krusienski, Dean J. and Kubben, Pieter L. and Schultz, Tanja and Herff, Christian}, year = 2022, month = may, booktitle = {{ICASSP} 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing}, pages = {1296--1300}, doi = {10.1109/ICASSP43922.2022.9747300}, issn = {2379-190X}, abstract = {Neurological disorders can severely impact speech communication. Recently, neural speech prostheses have been proposed that reconstruct intelligible speech from neural signals recorded superficially on the cortex. Thus far, it has been unclear whether similar reconstruction is feasible from deeper brain structures, and whether audible speech can be directly synthesized from these reconstructions with low-latency, as required for a practical speech neuroprosthetic. The present study aims to address both challenges. First, we implement a low-latency unit selection based synthesizer that converts neural signals into audible speech. Second, we evaluate our approach on open-loop recordings from 5 patients implanted with stereotactic depth electrodes who conducted a read-aloud task of Dutch utterances. We achieve correlation coefficients significantly higher than chance level of up to 0.6 and an average computational cost of 6.6 ms for each 10 ms frames. While the current reconstructed utterances are not intelligible, our results indicate promising decoding and run-time capabilities that are suitable for investigations of speech processes in closed-loop experiments.}, url = {https://halcy.de/cites/pdf/angrick2022towards.pdf}, }
Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity
(Miguel Angrick, Maarten Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sophocles Goulis, Jeremy Saal, Albert Colon, Louis Wagner, Dean Krusienski, Pieter Kubben, Tanja Schultz, Christian Herff),
in Nature Communications Biology, volume 4, number 1, pages 1–10, December 2021
Abstract: Speech neuroprosthetics aim to provide a natural communication channel to individuals who are unable to speak due to physical or neurological impairments. Real-time synthesis of acoustic speech directly from measured neural activity could enable natural conversations and notably improve quality of life, particularly for individuals who have severely limited means of communication. Recent advances in decoding approaches have led to high quality reconstructions of acoustic speech from invasively measured neural activity. However, most prior research utilizes data collected during open-loop experiments of articulated speech, which might not directly translate to imagined speech processes. Here, we present an approach that synthesizes audible speech in real-time for both imagined and whispered speech conditions. Using a participant implanted with stereotactic depth electrodes, we were able to reliably generate audible speech in real-time. The decoding models rely predominately on frontal activity suggesting that speech processes have similar representations when vocalized, whispered, or imagined. While reconstructed audio is not yet intelligible, our real-time synthesis approach represents an essential step towards investigating how patients will learn to operate a closed-loop speech neuroprosthesis based on imagined speech.
BibTeX: @article{angrick2021real, title = {Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity}, author = {Angrick, Miguel and Ottenhoff, Maarten and Diener, Lorenz and Ivucic, Darius and Ivucic, Gabriel and Goulis, Sophocles and Saal, Jeremy and Colon, Albert and Wagner, Louis and Krusienski, Dean and Kubben, Pieter and Schultz, Tanja and Herff, Christian}, year = 2021, month = dec, journal = {Nature Communications Biology}, publisher = {Nature Publishing Group}, volume = 4, number = 1, pages = {1--10}, video = {https://www.youtube.com/watch?v=2m8bUYZP-Eo}, doi = {10.1101/2020.12.11.421149}, abstract = {Speech neuroprosthetics aim to provide a natural communication channel to individuals who are unable to speak due to physical or neurological impairments. Real-time synthesis of acoustic speech directly from measured neural activity could enable natural conversations and notably improve quality of life, particularly for individuals who have severely limited means of communication. Recent advances in decoding approaches have led to high quality reconstructions of acoustic speech from invasively measured neural activity. However, most prior research utilizes data collected during open-loop experiments of articulated speech, which might not directly translate to imagined speech processes. Here, we present an approach that synthesizes audible speech in real-time for both imagined and whispered speech conditions. Using a participant implanted with stereotactic depth electrodes, we were able to reliably generate audible speech in real-time. The decoding models rely predominately on frontal activity suggesting that speech processes have similar representations when vocalized, whispered, or imagined. While reconstructed audio is not yet intelligible, our real-time synthesis approach represents an essential step towards investigating how patients will learn to operate a closed-loop speech neuroprosthesis based on imagined speech}, url = {https://halcy.de/cites/pdf/angrick2021real.pdf}, }
➥
The Impact of Audible Feedback on EMG-to-Speech Conversion
(Lorenz Diener),
PhD thesis, University of Bremen, May 2021
Abstract: Research interest in speech interfaces that can function even when an audible acoustic signal is not present – so-called Silent Speech Interfaces – has grown dramatically in recent years, as the field presents many barely exploded avenues for research and huge potential for applications in user interfaces and prosthetics. EMG-to-Speech conversion is a type of silent speech interface, based on electromyography: It is the direct conversion of a facial electrical speech muscle activity signal to audible speech without an intermediate textual representation. Such a direct conversion approach is well suited to speech prosthesis and silent telephony applications and could be used as a pre-processing step to enable a user to use a regular acoustic speech interface silently. To enable these applications in practice, one requirement is that EMG-to-Speech conversion systems must be capable of producing output in real time and with low latency, and work on EMG signals recorded during silently produced speech. The overall objective of this dissertation is to move EMG-to-Speech conversion further towards practical usability by building a real-time low-latency capable EMG-to-Speech conversion system and then use it to evaluate the effect of audible feedback, provided in real-time, on silent speech production.
BibTeX: @phdthesis{diener2021impact, title = {The Impact of Audible Feedback on EMG-to-Speech Conversion}, author = {Diener, Lorenz}, year = 2021, month = may, doi = {10.26092/elib/556}, school = {University of Bremen}, supervisor = {Schultz, Tanja and Hueber, Thomas}, abstract = {Research interest in speech interfaces that can function even when an audible acoustic signal is not present -- so-called {Silent Speech Interfaces} -- has grown dramatically in recent years, as the field presents many barely exploded avenues for research and huge potential for applications in user interfaces and prosthetics. EMG-to-Speech conversion is a type of silent speech interface, based on electromyography: It is the direct conversion of a facial electrical speech muscle activity signal to audible speech without an intermediate textual representation. Such a direct conversion approach is well suited to speech prosthesis and silent telephony applications and could be used as a pre-processing step to enable a user to use a regular acoustic speech interface silently. To enable these applications in practice, one requirement is that EMG-to-Speech conversion systems must be capable of producing output in real time and with low latency, and work on EMG signals recorded during silently produced speech. The overall objective of this dissertation is to move EMG-to-Speech conversion further towards practical usability by building a real-time low-latency capable EMG-to-Speech conversion system and then use it to evaluate the effect of audible feedback, provided in real-time, on silent speech production.}, code = {https://github.com/cognitive-systems-lab/EMG-GUI}, url = {https://halcy.de/cites/pdf/diener2021impact.pdf}, }
Voice Restoration with Silent Speech Interfaces (ReSSInt)
(Inma Hernaez, Jose Andrés González-López, Eva Navas, Jose Luis Pérez Córdoba, Ibon Saratxaga, Gonzalo Olivares, Jon Sánchez de la Fuente, Alberto Galdón, Víctor García Romillo, Míriam González-Atienza, Tanja Schultz, Phil Green, Michael Wand, Ricard Marxer, Lorenz Diener),
at Proc. IberSPEECH 2021, March 2021
Abstract: ReSSInt aims at investigating the use of silent speech interfaces (SSIs) for restoring communication to individuals who have been deprived of the ability to speak. SSIs are devices which capture non-acoustic biosignals generated during the speech production process and use them to predict the intended message. Two are the biosignals that will be investigated in this project: electromyography (EMG) signals representing electrical activity driving the facial muscles and invasive electroencephalography (iEEG) neural signals captured by means of invasive electrodes implanted on the brain. From the whole spectrum of speech disorders which may affect a person’s voice, ReSSInt will address two particular conditions: (i) voice loss after total laryngectomy and (ii) neurodegenerative diseases and other traumatic injuries which may leave an individual paralyzed and, eventually, unable to speak. To make this technology truly beneficial for these persons, this project aims at generating intelligible speech of reasonable quality. This will be tackled by recording large databases and the use of state-of-the-art generative deep learning techniques. Finally, different voice rehabilitation scenarios are foreseen within the project, which will lead to innovative research solutions for SSIs and a real impact on society by improving the life of people with speech impediments.
BibTeX: @inproceedings{hernaez2021voice, title = {Voice Restoration with Silent Speech Interfaces ({ReSSInt})}, author = {Hernaez, Inma and González-López, Jose Andrés and Navas, Eva and {Pérez Córdoba}, Jose Luis and Saratxaga, Ibon and Olivares, Gonzalo and {Sánchez de la Fuente}, Jon and Galdón, Alberto and {García Romillo}, Víctor and González-Atienza, Míriam and Schultz, Tanja and Green, Phil and Wand, Michael and Marxer, Ricard and Diener, Lorenz}, year = 2021, month = mar, booktitle = {Proc. IberSPEECH 2021}, pages = {130--134}, doi = {10.21437/IberSPEECH.2021-28}, abstract = {ReSSInt aims at investigating the use of silent speech interfaces (SSIs) for restoring communication to individuals who have been deprived of the ability to speak. SSIs are devices which capture non-acoustic biosignals generated during the speech production process and use them to predict the intended message. Two are the biosignals that will be investigated in this project: electromyography (EMG) signals representing electrical activity driving the facial muscles and invasive electroencephalography (iEEG) neural signals captured by means of invasive electrodes implanted on the brain. From the whole spectrum of speech disorders which may affect a person’s voice, ReSSInt will address two particular conditions: (i) voice loss after total laryngectomy and (ii) neurodegenerative diseases and other traumatic injuries which may leave an individual paralyzed and, eventually, unable to speak. To make this technology truly beneficial for these persons, this project aims at generating intelligible speech of reasonable quality. This will be tackled by recording large databases and the use of state-of-the-art generative deep learning techniques. Finally, different voice rehabilitation scenarios are foreseen within the project, which will lead to innovative research solutions for SSIs and a real impact on society by improving the life of people with speech impediments.}, url = {https://halcy.de/cites/pdf/hernaez2021voice.pdf}, }
Towards Speech Synthesis from Intracranial Signals
(Christian Herff, Lorenz Diener, Emily Mugler, Marc Slutzky, Dean Krusienski, Tanja Schultz),
chapter of "Brain--Computer Interface Research", pages 47--54, October 2020
Abstract: Brain-computer interfaces (BCIs) are envisioned to enable individuals with severe disabilities to regain the ability to communicate. Early BCIs have provided users with the ability to type messages one letter at a time, providing an important, but slow, means of communication for locked-in patients. However, natural speech contains substantially more information than a textual representation and can convey many important markers of human communication in addition to the sequence of words. A BCI that directly synthesizes speech from neural signals could harness this full expressive power of speech. In this study with motor-intact patients undergoing glioma removal, we demonstrate that high-quality audio signals can be synthesized from intracranial signals using a method from the speech synthesis community called Unit Selection. The Unit Selection approach concatenates speech units of the user to form new audio output and thereby produces natural speech in the user’s own voice.
BibTeX: @incollection{herff2020towards, title = {Towards Speech Synthesis from Intracranial Signals}, author = {Herff, Christian and Diener, Lorenz and Mugler, Emily and Slutzky, Marc and Krusienski, Dean and Schultz, Tanja}, year = 2020, month = oct, booktitle = {Brain--Computer Interface Research}, publisher = {Springer}, pages = {47--54}, doi = {10.1007/978-3-030-49583-1_5}, abstract = {Brain-computer interfaces (BCIs) are envisioned to enable individuals with severe disabilities to regain the ability to communicate. Early BCIs have provided users with the ability to type messages one letter at a time, providing an important, but slow, means of communication for locked-in patients. However, natural speech contains substantially more information than a textual representation and can convey many important markers of human communication in addition to the sequence of words. A BCI that directly synthesizes speech from neural signals could harness this full expressive power of speech. In this study with motor-intact patients undergoing glioma removal, we demonstrate that high-quality audio signals can be synthesized from intracranial signals using a method from the speech synthesis community called Unit Selection. The Unit Selection approach concatenates speech units of the user to form new audio output and thereby produces natural speech in the user’s own voice.}, }
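The core idea of unit selection, choosing stored speech units that best match the incoming features and concatenating them, can be sketched in a few lines. The sketch below is a deliberately simplified nearest-neighbour version under stated assumptions: it uses only a target cost (a full unit-selection synthesizer also weighs a concatenation cost between adjacent units), and all data is random placeholder data:

```python
# Simplified unit-selection sketch: match each frame of (neural) features to its
# nearest training frame and concatenate the audio units aligned to those frames.
import numpy as np

def select_units(test_feats: np.ndarray, train_feats: np.ndarray, train_audio: np.ndarray) -> np.ndarray:
    """test_feats: (T, D); train_feats: (N, D); train_audio: (N, unit_len) aligned audio units."""
    output = []
    for frame in test_feats:
        dists = np.linalg.norm(train_feats - frame, axis=1)   # target cost only
        output.append(train_audio[np.argmin(dists)])          # pick the best unit
    return np.concatenate(output)

# Toy usage with placeholder data: 500 training units of 10 ms at 16 kHz.
rng = np.random.default_rng(1)
train_feats = rng.standard_normal((500, 32))
train_audio = rng.standard_normal((500, 160))
test_feats = rng.standard_normal((20, 32))
waveform = select_units(test_feats, train_feats, train_audio)
```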
➥
Towards Silent Paralinguistics: Deriving Speaking Mode and Speaker ID from Electromyographic Signals
( , , , , , , ),
at INTERSPEECH 2020 - 21st Annual Conference of the International Speech Communication Association, September 2020
Abstract: Silent Computational Paralinguistics (SCP) - the assessment of speaker states and traits from non-audibly spoken communication - has rarely been targeted in the rich body of either Computational Paralinguistics or Silent Speech Processing. Here, we provide first steps towards this challenging but potentially highly rewarding endeavour: Paralinguistics can enrich spoken language interfaces, while Silent Speech Processing enables confidential and unobtrusive spoken communication for everybody, including mute speakers. We approach SCP by using speech-related biosignals stemming from facial muscle activities captured by surface electromyography (EMG). To demonstrate the feasibility of SCP, we select one speaker trait (speaker identity) and one speaker state (speaking mode). We introduce two promising strategies for SCP: (1) deriving paralinguistic speaker information directly from EMG of silently produced speech versus (2) first converting EMG into an audible speech signal followed by conventional computational paralinguistic methods. We compare traditional feature extraction and decision making approaches to more recent deep representation and transfer learning by convolutional and recurrent neural networks, using openly available EMG data. We find that paralinguistics can be assessed not only from acoustic speech but also from silent speech captured by EMG.
BibTeX: @inproceedings{diener2020towards, title = {Towards Silent Paralinguistics: Deriving Speaking Mode and Speaker ID from Electromyographic Signals}, author = {Diener, Lorenz and Amiriparian, Shahin and Botelho, Catarina and Scheck, Kevin and Küster, Dennis and Trancoso, Isabel and Schuller, Björn W. and Schultz, Tanja}, year = 2020, month = sep, booktitle = {{INTERSPEECH} 2020 - 21st Annual Conference of the International Speech Communication Association}, video = {https://www.youtube.com/watch?v=sy7MeEmEusY}, doi = {10.21437/interspeech.2020-2848}, abstract = {Silent Computational Paralinguistics (SCP) - the assessment of speaker states and traits from non-audibly spoken communication - has rarely been targeted in the rich body of either Computational Paralinguistics or Silent Speech Processing. Here, we provide first steps towards this challenging but potentially highly rewarding endeavour: Paralinguistics can enrich spoken language interfaces, while Silent Speech Processing enables confidential and unobtrusive spoken communication for everybody, including mute speakers. We approach SCP by using speech-related biosignals stemming from facial muscle activities captured by surface electromyography (EMG). To demonstrate the feasibility of SCP, we select one speaker trait (speaker identity) and one speaker state (speaking mode). We introduce two promising strategies for SCP: (1) deriving paralinguistic speaker information directly from EMG of silently produced speech versus (2) first converting EMG into an audible speech signal followed by conventional computational paralinguistic methods. We compare traditional feature extraction and decision making approaches to more recent deep representation and transfer learning by convolutional and recurrent neural networks, using openly available EMG data. We find that paralinguistics can be assessed not only from acoustic speech but also from silent speech captured by EMG.}, url = {https://halcy.de/cites/pdf/diener2020towards.pdf}, }
➥
CSL-EMG_Array: An Open Access Corpus for EMG-to-Speech Conversion
(Lorenz Diener, Mehrdad Roustay Vishkasougheh, Tanja Schultz),
at INTERSPEECH 2020 - 21st Annual Conference of the International Speech Communication Association, September 2020
Abstract: We present a new open access corpus for the training and evaluation of EMG-to-Speech conversion systems based on array electromyographic recordings. The corpus is recorded with a recording paradigm closely mirroring realistic EMG-to-Speech usage scenarios, and includes evaluation data recorded from both audible as well as silent speech. The corpus consists of 9.5 hours of data, split into 12 sessions recorded from 8 speakers. Based on this corpus, we present initial benchmark results with a realistic online EMG-to-Speech conversion use case, both for the audible and silent speech subsets. We also present a method for drastically improving EMG-to-Speech system stability and performance in the presence of time-related artifacts.
BibTeX: @inproceedings{diener2020csl, title = {{CSL-EMG\_Array}: An Open Access Corpus for EMG-to-Speech Conversion}, author = {Diener, Lorenz and Roustay Vishkasougheh, Mehrdad and Schultz, Tanja}, year = 2020, month = sep, booktitle = {{INTERSPEECH} 2020 - 21st Annual Conference of the International Speech Communication Association}, video = {https://www.youtube.com/watch?v=houE7c2zEko}, doi = {10.21437/Interspeech.2020-2859}, abstract = {We present a new open access corpus for the training and evaluation of EMG-to-Speech conversion systems based on array electromyographic recordings. The corpus is recorded with a recording paradigm closely mirroring realistic EMG-to-Speech usage scenarios, and includes evaluation data recorded from both audible as well as silent speech. The corpus consists of 9.5 hours of data, split into 12 sessions recorded from 8 speakers. Based on this corpus, we present initial benchmark results with a realistic online EMG-to-Speech conversion use case, both for the audible and silent speech subsets. We also present a method for drastically improving EMG-to-Speech system stability and performance in the presence of time-related artifacts.}, code = {https://www.uni-bremen.de/csl/forschung/lautlose-sprachkommunikation/csl-emg-array-corpus}, url = {https://halcy.de/cites/pdf/diener2020csl.pdf}, }
Toward Silent Paralinguistics: Speech-to-EMG - Retrieving Articulatory Muscle Activity from Speech
(Catarina Botelho, Lorenz Diener, Dennis Küster, Kevin Scheck, Shahin Amiriparian, Björn W. Schuller, Tanja Schultz, Alberto Abad, Isabel Trancoso),
at INTERSPEECH 2020 - 21st Annual Conference of the International Speech Communication Association, September 2020
Abstract: Electromyographic (EMG) signals recorded during speech production encode information on articulatory muscle activity and also on the facial expression of emotion, thus representing a speech-related biosignal with strong potential for paralinguistic applications. In this work, we estimate the electrical activity of the muscles responsible for speech articulation directly from the speech signal. To this end, we first perform a neural conversion of speech features into electromyographic time domain features, and then attempt to retrieve the original EMG signal from the time domain features. We propose a feed forward neural network to address the first step of the problem (speech features to EMG features) and a neural network composed of a convolutional block and a bidirectional long short-term memory block to address the second problem (true EMG features to EMG signal). We observe that four out of the five originally proposed time domain features can be estimated reasonably well from the speech signal. Further, the five time domain features are able to predict the original speech-related EMG signal with a concordance correlation coefficient of 0.663. We further compare our results with the ones achieved on the inverse problem of generating acoustic speech features from EMG features.
BibTeX: @inproceedings{botelho2020silent, title = {Toward Silent Paralinguistics: Speech-to-EMG - Retrieving Articulatory Muscle Activity from Speech}, author = {Botelho, Catarina and Diener, Lorenz and Küster, Dennis and Scheck, Kevin and Amiriparian, Shahin and Schuller, Björn W. and Schultz, Tanja and Abad, Alberto and Trancoso, Isabel}, year = 2020, month = sep, booktitle = {{INTERSPEECH} 2020 - 21st Annual Conference of the International Speech Communication Association}, doi = {10.21437/Interspeech.2020-2926}, abstract = {Electromyographic (EMG) signals recorded during speech production encode information on articulatory muscle activity and also on the facial expression of emotion, thus representing a speech-related biosignal with strong potential for paralinguistic applications. In this work, we estimate the electrical activity of the muscles responsible for speech articulation directly from the speech signal. To this end, we first perform a neural conversion of speech features into electromyographic time domain features, and then attempt to retrieve the original EMG signal from the time domain features. We propose a feed forward neural network to address the first step of the problem (speech features to EMG features) and a neural network composed of a convolutional block and a bidirectional long short-term memory block to address the second problem (true EMG features to EMG signal). We observe that four out of the five originally proposed time domain features can be estimated reasonably well from the speech signal. Further, the five time domain features are able to predict the original speech-related EMG signal with a concordance correlation coefficient of 0.663. We further compare our results with the ones achieved on the inverse problem of generating acoustic speech features from EMG features.}, url = {https://halcy.de/cites/pdf/botelho2020silent.pdf}, }
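The concordance correlation coefficient quoted above (Lin's CCC) rewards both high correlation and agreement in mean and scale between the predicted and reference signals, which is why it is preferred over plain Pearson correlation for signal regression tasks. A small sketch of the standard formula:

```python
# Lin's concordance correlation coefficient (CCC) for two 1-D signals:
# CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
import numpy as np

def concordance_cc(x: np.ndarray, y: np.ndarray) -> float:
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                  # population variances
    cov = ((x - mx) * (y - my)).mean()         # population covariance
    return float(2.0 * cov / (vx + vy + (mx - my) ** 2))

a = np.array([0.1, 0.4, 0.35, 0.8])
print(concordance_cc(a, a))          # 1.0 -- perfect agreement
print(concordance_cc(a, a + 0.5))    # < 1 -- same shape, but shifted mean
```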
➥
Improving Fundamental Frequency Generation in EMG-to-Speech Conversion using a Quantization Approach
(Lorenz Diener, Tejas Umesh, Tanja Schultz),
at ASRU 2019 - IEEE Workshop on Automatic Speech Recognition and Understanding, December 2019
Abstract: We present a novel approach to generating fundamental frequency (intonation and voicing) trajectories in an EMG-to-Speech conversion Silent Speech Interface, based on quantizing the EMG-to-F0 mappings target values and thus turning a regression problem into a recognition problem. We present this method and evaluate its performance with regard to the accuracy of the voicing information obtained as well as the performance in generating plausible intonation trajectories within voiced sections of the signal. To this end, we also present a new measure for overall F0 trajectory plausibility, the trajectory-label accuracy (TLAcc), and compare it with human evaluations. Our new F0 generation method achieves a significantly better performance than a baseline approach in terms of voicing accuracy, correlation of voiced sections, trajectory-label accuracy and, most importantly, human evaluations.
BibTeX: @inproceedings{diener2019improving, title = {Improving Fundamental Frequency Generation in EMG-to-Speech Conversion using a Quantization Approach}, author = {Diener, Lorenz and Umesh, Tejas and Schultz, Tanja}, year = 2019, month = dec, booktitle = {{ASRU} 2019 - IEEE Workshop on Automatic Speech Recognition and Understanding}, doi = {10.1109/ASRU46091.2019.9003804}, abstract = {We present a novel approach to generating fundamental frequency (intonation and voicing) trajectories in an EMG-to-Speech conversion Silent Speech Interface, based on quantizing the EMG-to-F0 mappings target values and thus turning a regression problem into a recognition problem. We present this method and evaluate its performance with regard to the accuracy of the voicing information obtained as well as the performance in generating plausible intonation trajectories within voiced sections of the signal. To this end, we also present a new measure for overall F0 trajectory plausibility, the trajectory-label accuracy (TLAcc), and compare it with human evaluations. Our new F0 generation method achieves a significantly better performance than a baseline approach in terms of voicing accuracy, correlation of voiced sections, trajectory-label accuracy and, most importantly, human evaluations.}, url = {https://halcy.de/cites/pdf/diener2019improving.pdf}, }
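The quantization idea from this abstract, turning continuous F0 regression into frame-wise classification, can be illustrated with a few lines of numpy. The bin layout below (one "unvoiced" class plus log-spaced F0 bins) is illustrative and not the paper's exact configuration:

```python
# Sketch: quantize per-frame F0 targets into discrete classes so an
# EMG-to-F0 mapping can be trained as a classifier instead of a regressor.
import numpy as np

def quantize_f0(f0_hz: np.ndarray, n_bins: int = 32, fmin: float = 60.0, fmax: float = 400.0) -> np.ndarray:
    """f0_hz: per-frame F0 in Hz, 0 for unvoiced. Returns 0 = unvoiced, 1..n_bins = voiced bins."""
    edges = np.logspace(np.log10(fmin), np.log10(fmax), n_bins + 1)   # log-spaced bins
    labels = np.digitize(np.clip(f0_hz, fmin, fmax), edges[1:-1]) + 1
    labels[f0_hz <= 0] = 0
    return labels

def dequantize_f0(labels: np.ndarray, n_bins: int = 32, fmin: float = 60.0, fmax: float = 400.0) -> np.ndarray:
    edges = np.logspace(np.log10(fmin), np.log10(fmax), n_bins + 1)
    centers = np.sqrt(edges[:-1] * edges[1:])                          # geometric bin centers
    return np.where(labels > 0, centers[np.maximum(labels - 1, 0)], 0.0)

frames = np.array([0.0, 110.0, 118.0, 0.0, 220.0])
print(quantize_f0(frames))                    # class labels (exact indices depend on the bin edges)
print(dequantize_f0(quantize_f0(frames)))     # reconstructed, slightly quantized F0 contour
```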
Generating natural, intelligible speech from brain activity in motor, premotor, and inferior frontal cortices
(Christian Herff, Lorenz Diener, Miguel Angrick, Emily Mugler, Matthew C Tate, Matthew A Goldrick, Dean J Krusienski, Marc W Slutzky, Tanja Schultz),
in Frontiers in Neuroscience, volume 13, pages 1267, November 2019
Abstract: Neural interfaces that directly produce intelligible speech from brain activity would allow people with severe impairment from neurological disorders to communicate more naturally. Here, we record neural population activity in motor, premotor and inferior frontal cortices during speech production using electrocorticography (ECoG) and show that ECoG signals alone can be used to generate intelligible speech output that can preserve conversational cues. To produce speech directly from neural data, we adapted a method from the field of speech synthesis called unit selection, in which units of speech are concatenated to form audible output. In our approach, which we call Brain-To-Speech, we chose subsequent units of speech based on the measured ECoG activity to generate audio waveforms directly from the neural recordings. Brain-To-Speech employed the user's own voice to generate speech that sounded very natural and included features such as prosody and accentuation. By investigating the brain areas involved in speech production separately, we found that speech motor cortex provided more information for the reconstruction process than the other cortical areas.
BibTeX: @article{herff2019generating, title = {Generating natural, intelligible speech from brain activity in motor, premotor, and inferior frontal cortices}, author = {Herff, Christian and Diener, Lorenz and Angrick, Miguel and Mugler, Emily and Tate, Matthew C and Goldrick, Matthew A and Krusienski, Dean J and Slutzky, Marc W and Schultz, Tanja}, year = 2019, month = nov, journal = {Frontiers in neuroscience}, publisher = {Frontiers Media SA}, volume = 13, pages = 1267, doi = {10.3389/fnins.2019.01267}, abstract = {Neural interfaces that directly produce intelligible speech from brain activity would allow people with severe impairment from neurological disorders to communicate more naturally. Here, we record neural population activity in motor, premotor and inferior frontal cortices during speech production using electrocorticography (ECoG) and show that ECoG signals alone can be used to generate intelligible speech output that can preserve conversational cues. To produce speech directly from neural data, we adapted a method from the field of speech synthesis called unit selection, in which units of speech are concatenated to form audible output. In our approach, which we call Brain-To-Speech, we chose subsequent units of speech based on the measured ECoG activity to generate audio waveforms directly from the neural recordings. Brain-To-Speech employed the user's own voice to generate speech that sounded very natural and included features such as prosody and accentuation. By investigating the brain areas involved in speech production separately, we found that speech motor cortex provided more information for the reconstruction process than the other cortical areas.}, url = {https://halcy.de/cites/pdf/herff2019generating.pdf}, }
Towards Restoration of Articulatory Movements: Functional Electrical Stimulation of Orofacial Muscles
(Tanja Schultz, Miguel Angrick, Lorenz Diener, Dennis Küster, Moritz Meier, Dean Krusienski, Christian Herff, Jonathan Brumberg),
at EMBC 2019 – 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society, July 2019
Abstract: Millions of individuals suffer from impairments that significantly disrupt or completely eliminate their ability to speak. An ideal intervention would restore one's natural ability to physically produce speech. Recent progress has been made in decoding speech-related brain activity to generate synthesized speech. Our vision is to extend these recent advances toward the goal of restoring physical speech production using decoded speech-related brain activity to modulate the electrical stimulation of the orofacial musculature involved in speech. In this pilot study we take a step toward this vision by investigating the feasibility of stimulating orofacial muscles during vocalization in order to alter acoustic production. The results of our study provide necessary foundation for eventual orofacial stimulation controlled directly from decoded speech-related brain activity.
BibTeX: @inproceedings{schultz2019towards, title = {Towards Restoration of Articulatory Movements: Functional Electrical Stimulation of Orofacial Muscles}, author = {Schultz, Tanja and Angrick, Miguel and Diener, Lorenz and Küster, Dennis and Meier, Moritz and Krusienski, Dean and Herff, Christian and Brumberg, Jonathan}, year = 2019, month = jul, booktitle = {{EMBC} 2019 -- 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society}, pages = {3111--3114}, doi = {10.1109/EMBC.2019.8857670}, issn = {1557-170X}, keywords = {brain;muscle;neuromuscular stimulation;speech coding;speech synthesis;decoding speech-related brain activity;physical speech production;decoded speech-related brain activity;eventual orofacial stimulation;functional electrical stimulation;synthesized speech generation;physical speech restoration;electrical stimulation;orofacial muscles stimulation;acoustic production;articulatory movement restoration;Muscles;Production;Electromyography;Spectrogram;Correlation;Electrodes;Brain}, abstract = {Millions of individuals suffer from impairments that significantly disrupt or completely eliminate their ability to speak. An ideal intervention would restore one's natural ability to physically produce speech. Recent progress has been made in decoding speech-related brain activity to generate synthesized speech. Our vision is to extend these recent advances toward the goal of restoring physical speech production using decoded speech-related brain activity to modulate the electrical stimulation of the orofacial musculature involved in speech. In this pilot study we take a step toward this vision by investigating the feasibility of stimulating orofacial muscles during vocalization in order to alter acoustic production. The results of our study provide necessary foundation for eventual orofacial stimulation controlled directly from decoded speech-related brain activity.}, url = {https://halcy.de/cites/pdf/schultz2019towards.pdf}, }
➥
Session-Independent Array-Based EMG-to-Speech Conversion using Convolutional Neural Networks
(Lorenz Diener, Gerrit Felsch, Miguel Angrick, Tanja Schultz),
at 13th ITG Conference on Speech Communication, October 2018
Abstract: This paper presents an evaluation of the performance of EMG-to-Speech conversion based on convolutional neural networks. We present an analysis of two different architectures and network design considerations and evaluate CNN-based systems for their within-session and cross-session performance. We find that they are able to perform on par with feedforward neural networks when trained and evaluated on a single session and outperform them in cross session evaluations.
BibTeX: @inproceedings{diener2018session, title = {Session-Independent Array-Based EMG-to-Speech Conversion using Convolutional Neural Networks}, author = {Diener, Lorenz and Felsch, Gerrit and Angrick, Miguel and Schultz, Tanja}, year = 2018, month = oct, booktitle = {13th {ITG} Conference on Speech Communication}, isbn = {978-3-8007-4767-2}, abstract = {This paper presents an evaluation of the performance of EMG-to-Speech conversion based on convolutional neural networks. We present an analysis of two different architectures and network design considerations and evaluate CNN-based systems for their within-session and cross-session performance. We find that they are able to perform on par with feedforward neural networks when trained and evaluated on a single session and outperform them in cross session evaluations.}, url = {https://halcy.de/cites/pdf/diener2018session.pdf}, }
➥
A comparison of EMG-to-Speech Conversion for Isolated and Continuous Speech
(Lorenz Diener, Sebastian Bredehöft, Tanja Schultz),
at 13th ITG Conference on Speech Communication, October 2018
Abstract: This paper presents initial results of performing EMG-to-Speech conversion within our new EMG-to-Speech corpus. This new corpus consists of parallel facial array sEMG and read audible speech signals recorded from multiple speakers. It contains different styles of utterances - continuous sentences, isolated words, and isolated consonant-vowel combinations - which allows us to evaluate the performance of EMG-to-Speech conversion when trying to convert these different styles of utterance as well as the effect of training systems on one style to convert another. We find that our system deals with isolated-word/consonant-vowel utterances better than with continuous speech. We also find that it is possible to use a model trained on one style to convert utterances from another - however, performance suffers compared to training within that style, especially when going from isolated to continuous speech.
BibTeX: @inproceedings{diener2018comparison, title = {A comparison of EMG-to-Speech Conversion for Isolated and Continuous Speech}, author = {Lorenz Diener and Sebastian Bredehöft and Tanja Schultz}, year = 2018, month = oct, booktitle = {13th {ITG} Conference on Speech Communication}, isbn = {978-3-8007-4767-2}, abstract = {This paper presents initial results of performing EMG-to-Speech conversion within our new EMG-to-Speech corpus. This new corpus consists of parallel facial array sEMG and read audible speech signals recorded from multiple speakers. It contains different styles of utterances - continuous sentences, isolated words, and isolated consonant-vowel combinations - which allows us to evaluate the performance of EMG-to-Speech conversion when trying to convert these different styles of utterance as well as the effect of training systems on one style to convert another. We find that our system deals with isolated-word/consonant-vowel utterances better than with continuous speech. We also find that it is possible to use a model trained on one style to convert utterances from another - however, performance suffers compared to training within that style, especially when going from isolated to continuous speech.}, url = {https://halcy.de/cites/pdf/diener2018comparison.pdf}, }
➥
Investigating Objective Intelligibility in Real-Time EMG-to-Speech Conversion
(Lorenz Diener, Tanja Schultz),
at INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, September 2018
Abstract: This paper presents an analysis of the influence of various system parameters on the output quality of our neural network based real-time EMG-to-Speech conversion system. This EMG-to-Speech system allows for the direct conversion of facial surface electromyographic signals into audible speech in real time, allowing for a closed-loop setup where users get direct audio feedback. Such a setup opens new avenues for research and applications through co-adaptation approaches. In this paper, we evaluate the influence of several parameters on the output quality, such as time context, EMG-Audio delay, network-, training data- and Mel spectrogram size. The resulting output quality is evaluated based on the objective output quality measure STOI.
BibTeX: @inproceedings{diener2018investigating, title = {Investigating Objective Intelligibility in Real-Time EMG-to-Speech Conversion}, author = {Lorenz Diener and Tanja Schultz}, year = 2018, month = sep, booktitle = {{INTERSPEECH} 2018 -- 19th Annual Conference of the International Speech Communication Association}, abstract = {This paper presents an analysis of the influence of various system parameters on the output quality of our neural network based real-time EMG-to-Speech conversion system. This EMG-to-Speech system allows for the direct conversion of facial surface electromyographic signals into audible speech in real time, allowing for a closed-loop setup where users get direct audio feedback. Such a setup opens new avenues for research and applications through co-adaptation approaches. In this paper, we evaluate the influence of several parameters on the output quality, such as time context, EMG-Audio delay, network-, training data- and Mel spectrogram size. The resulting output quality is evaluated based on the objective output quality measure STOI.}, url = {https://halcy.de/cites/pdf/diener2018investigating.pdf}, }
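STOI, the objective intelligibility measure named above, is intrusive: it compares the converted audio against a time-aligned reference recording of the same utterance and yields a score roughly between 0 and 1. The sketch below uses the third-party pystoi package; the paper does not prescribe a specific implementation, and the file names here are placeholders:

```python
# Sketch: computing STOI between a reference utterance and converted output.
import soundfile as sf
from pystoi import stoi

reference, fs = sf.read("reference_utterance.wav")      # hypothetical file names
converted, fs2 = sf.read("emg_to_speech_output.wav")
assert fs == fs2, "both signals must share the same sample rate"

n = min(len(reference), len(converted))                  # crude length alignment
score = stoi(reference[:n], converted[:n], fs, extended=False)
print("STOI:", score)                                    # higher = more intelligible
```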
EMG-to-Speech: Direct Generation of Speech From Facial Electromyographic Signals
(Matthias Janke, Lorenz Diener),
in TASLP – IEEE/ACM Transactions on Audio, Speech and Language Processing, volume 25, number 12, pages 2375–2385, November 2017
Abstract: Silent speech interfaces are systems that enable speech communication even when an acoustic signal is unavailable. Over the last years, public interest in such interfaces has intensified. They provide solutions for some of the challenges faced by today's speech-driven technologies, such as robustness to noise and usability for people with speech impediments. In this paper, we provide an overview over our silent speech interface. It is based on facial surface electromyography (EMG), which we use to record the electrical signals that control muscle contraction during speech production. These signals are then converted directly to an audible speech waveform, retaining important paralinguistic speech cues for information such as speaker identity and mood. This paper gives an overview over our state-of-the-art direct EMG-to-speech transformation system. This paper describes the characteristics of the speech EMG signal, introduces techniques for extracting relevant features, presents different EMG-to-speech mapping methods, and finally, presents an evaluation of the different methods for real-time capability and conversion quality.
BibTeX: @article{janke2017emg, title = {EMG-to-Speech: Direct Generation of Speech From Facial Electromyographic Signals}, author = {Janke, Matthias and Diener, Lorenz}, year = 2017, month = nov, day = 23, journal = {{TASLP} -- {IEEE/ACM} Transactions on Audio, Speech and Language Processing}, volume = 25, number = 12, pages = {2375--2385}, doi = {10.1109/TASLP.2017.2738568}, abstract = {Silent speech interfaces are systems that enable speech communication even when an acoustic signal is unavailable. Over the last years, public interest in such interfaces has intensified. They provide solutions for some of the challenges faced by today's speech-driven technologies, such as robustness to noise and usability for people with speech impediments. In this paper, we provide an overview over our silent speech interface. It is based on facial surface electromyography (EMG), which we use to record the electrical signals that control muscle contraction during speech production. These signals are then converted directly to an audible speech waveform, retaining important paralinguistic speech cues for information such as speaker identity and mood. This paper gives an overview over our state-of-the-art direct EMG-to-speech transformation system. This paper describes the characteristics of the speech EMG signal, introduces techniques for extracting relevant features, presents different EMG-to-speech mapping methods, and finally, presents an evaluation of the different methods for real-time capability and conversion quality.}, url = {https://halcy.de/cites/pdf/janke2017emg.pdf}, }
Biosignal-based Spoken Communication: A Survey
(Tanja Schultz, Michael Wand, Thomas Hueber, Dean J Krusienski, Christian Herff, Jonathan S Brumberg),
in TASLP – IEEE/ACM Transactions on Audio, Speech and Language Processing, volume 25, number 12, pages 2257–2271, November 2017
Note: Lorenz Diener credited in Acknowledgements as additional author.
Abstract: Speech is a complex process involving a wide range of biosignals, including but not limited to acoustics. These biosignals-stemming from the articulators, the articulator muscle activities, the neural pathways, and the brain itself-can be used to circumvent limitations of conventional speech processing in particular, and to gain insights into the process of speech production in general. Research on biosignal-based speech processing is a wide and very active field at the intersection of various disciplines, ranging from engineering, computer science, electronics and machine learning to medicine, neuroscience, physiology, and psychology. Consequently, a variety of methods and approaches have been used to investigate the common goal of creating biosignal-based speech processing devices for communication applications in everyday situations and for speech rehabilitation, as well as gaining a deeper understanding of spoken communication. This paper gives an overview of the various modalities, research approaches, and objectives for biosignal-based spoken communication.
BibTeX: @article{schultz2017biosignal, title = {Biosignal-based Spoken Communication: A Survey}, author = {Schultz, Tanja and Wand, Michael and Hueber, Thomas and Krusienski, Dean J and Herff, Christian and Brumberg, Jonathan S}, year = 2017, note = {Lorenz Diener credited in Acknowledgements as additional author}, month = nov, day = 23, journal = {{TASLP} -- {IEEE/ACM} Transactions on Audio, Speech and Language Processing}, volume = 25, number = 12, pages = {2257--2271}, doi = {10.1109/TASLP.2017.2752365}, abstract = {Speech is a complex process involving a wide range of biosignals, including but not limited to acoustics. These biosignals-stemming from the articulators, the articulator muscle activities, the neural pathways, and the brain itself-can be used to circumvent limitations of conventional speech processing in particular, and to gain insights into the process of speech production in general. Research on biosignal-based speech processing is a wide and very active field at the intersection of various disciplines, ranging from engineering, computer science, electronics and machine learning to medicine, neuroscience, physiology, and psychology. Consequently, a variety of methods and approaches have been used to investigate the common goal of creating biosignal-based speech processing devices for communication applications in everyday situations and for speech rehabilitation, as well as gaining a deeper understanding of spoken communication. This paper gives an overview of the various modalities, research approaches, and objectives for biosignal-based spoken communication.}, url = {https://halcy.de/cites/pdf/schultz2017biosignal.pdf}, }
Bremen Big Data Challenge 2017: Predicting University Cafeteria Load
(Jochen Weiner, Lorenz Diener, Simon Stelter, Eike Externest, Sebastian Kühl, Christian Herff, Felix Putze, Timo Schulze, Mazen Salous, Hui Liu, Dennis Küster, Tanja Schultz),
at KI 2017: Advances in Artificial Intelligence - 40th Annual German Conference on AI, September 2017
Abstract: Big data is a hot topic in research and industry. The availability of data has never been as high as it is now. Making good use of the data is a challenging research topic in all aspects of industry and society. The Bremen Big Data Challenge invites students to dig deep into big data. In this yearly event students are challenged to use the month of March to analyze a big dataset and use the knowledge they gained to answer a question. In this year's Bremen Big Data Challenge students were challenged to predict the load of the university cafeteria from the load of past years. The best of 24 teams predicted the load with a root mean squared error of 8.6 receipts issued in five minutes, with a fusion system based on the top 5 entries achieving an even better result of 8.28.
BibTeX: @inproceedings{weiner2017bremen, title = {Bremen Big Data Challenge 2017: Predicting University Cafeteria Load}, author = {Weiner, Jochen and Diener, Lorenz and Stelter, Simon and Externest, Eike and K{\"u}hl, Sebastian and Herff, Christian and Putze, Felix and Schulze, Timo and Salous, Mazen and Liu, Hui and K{\"u}ster, Dennis and Schultz, Tanja}, year = 2017, month = sep, booktitle = {{KI} 2017: Advances in Artificial Intelligence - 40th Annual German Conference on AI}, publisher = {Springer International Publishing}, address = {Cham}, pages = {380--386}, doi = {10.1007/978-3-319-67190-1_35}, isbn = {978-3-319-67190-1}, editor = {Kern-Isberner, Gabriele and F{\"u}rnkranz, Johannes and Thimm, Matthias}, abstract = {Big data is a hot topic in research and industry. The availability of data has never been as high as it is now. Making good use of the data is a challenging research topic in all aspects of industry and society. The Bremen Big Data Challenge invites students to dig deep into big data. In this yearly event students are challenged to use the month of March to analyze a big dataset and use the knowledge they gained to answer a question. In this year's Bremen Big Data Challenge students were challenged to predict the load of the university cafeteria from the load of past years. The best of 24 teams predicted the load with a root mean squared error of 8.6 receipts issued in five minutes, with a fusion system based on the top 5 entries achieving an even better result of 8.28.}, code = {https://bbdc.csl.uni-bremen.de/index.php/2017}, url = {https://halcy.de/cites/pdf/weiner2017bremen.pdf}, }
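The two numbers quoted above are a root mean squared error for the best single submission and a lower error for a fusion of the top submissions. The sketch below, on random placeholder data rather than the challenge data, shows why simple prediction averaging tends to reduce RMSE when the individual errors are not fully correlated:

```python
# Sketch: RMSE of single predictions vs. a simple mean-fusion of several submissions.
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(42)
truth = rng.poisson(20, size=1000).astype(float)                   # receipts per 5 minutes (placeholder)
submissions = [truth + rng.normal(0, 10, size=1000) for _ in range(5)]

print("best single submission:", min(rmse(truth, s) for s in submissions))
print("mean fusion of all 5:  ", rmse(truth, np.mean(submissions, axis=0)))   # usually lower
```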
|
Towards direct speech synthesis from ECoG: A pilot study
( , , , , , ),
at EMBC 2016 - 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, August 2016
AbstractMost current Brain-Computer Interfaces (BCIs) achieve high information transfer rates using spelling paradigms based on stimulus-evoked potentials. Despite the success of these interfaces, this mode of communication can be cumbersome and unnatural. Direct synthesis of speech from neural activity represents a more natural mode of communication that would enable users to convey verbal messages in real-time. In this pilot study with one participant, we demonstrate that electrocorticography (ECoG) intracranial activity from temporal areas can be used to resynthesize speech in real-time. This is accomplished by reconstructing the audio magnitude spectrogram from neural activity and subsequently creating the audio waveform from these reconstructed spectrograms. We show that significant correlations between the original and reconstructed spectrograms and temporal waveforms can be achieved. While this pilot study uses audibly spoken speech for the models, it represents a first step towards speech synthesis from speech imagery.@inproceedings{herff2016towards, title = {Towards direct speech synthesis from ECoG: A pilot study}, author = {Herff, C. and Johnson, G. and Diener, L. and Shih, J. and Krusienski, D. and Schultz, T.}, year = 2016, month = aug, booktitle = {{EMBC} 2016 - 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society}, doi = {10.1109/EMBC.2016.7591004}, abstract = {Most current Brain-Computer Interfaces (BCIs) achieve high information transfer rates using spelling paradigms based on stimulus-evoked potentials. Despite the success of these interfaces, this mode of communication can be cumbersome and unnatural. Direct synthesis of speech from neural activity represents a more natural mode of communication that would enable users to convey verbal messages in real-time. In this pilot study with one participant, we demonstrate that electrocorticography (ECoG) intracranial activity from temporal areas can be used to resynthesize speech in real-time. This is accomplished by reconstructing the audio magnitude spectrogram from neural activity and subsequently creating the audio waveform from these reconstructed spectrograms. We show that significant correlations between the original and reconstructed spectrograms and temporal waveforms can be achieved. While this pilot study uses audibly spoken speech for the models, it represents a first step towards speech synthesis from speech imagery.}, url = {https://halcy.de/cites/pdf/herff2016towards.pdf}, poster = {https://halcy.de/cites/pdf/herff2016towards_poster.pdf}, } |
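The abstract describes a two-step reconstruction: predict the audio magnitude spectrogram from neural activity, then create a waveform from it. The sketch below is not the paper's method; it only illustrates the second step using Griffin-Lim phase estimation as implemented in librosa, with a synthetic chirp standing in for a spectrogram predicted from ECoG.

```python
# Illustrative sketch only: turning a (reconstructed) magnitude spectrogram back
# into a waveform via Griffin-Lim phase estimation. The actual synthesis
# front-end of the paper is not reproduced here.
import numpy as np
import librosa

sr, n_fft, hop = 16000, 512, 128

# Stand-in for a magnitude spectrogram predicted from neural activity --
# taken from a real signal here so the example is self-contained.
y = librosa.chirp(fmin=100, fmax=4000, sr=sr, duration=1.0)
magnitude = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

# Griffin-Lim iteratively estimates a phase consistent with the magnitudes.
y_hat = librosa.griffinlim(magnitude, n_iter=60, hop_length=hop, win_length=n_fft)

# Correlation between original and resynthesized waveform (trimmed to equal length).
n = min(len(y), len(y_hat))
print("waveform correlation:", np.corrcoef(y[:n], y_hat[:n])[0, 1])
```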
|
An Initial Investigation into the Real-Time Conversion of Facial Surface EMG Signals to Audible Speech
( , , , ),
at EMBC 2016 - 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, August 2016
AbstractThis paper presents early-stage results of our investigations into the direct conversion of facial surface electromyographic (EMG) signals into audible speech in a real-time setting, enabling novel avenues for research and system improvement through real-time feedback. The system uses a pipeline approach to enable online acquisition of EMG data, extraction of EMG features, mapping of EMG features to audio features, synthesis of audio waveforms from audio features and output of the audio waveforms via speakers or headphones. Our system allows for performing EMG-to-Speech conversion with low latency and on a continuous stream of EMG data, enabling near instantaneous audio output during audible as well as silent speech production. In this paper, we present an analysis of our system's components for latency incurred, as well as the trade-offs between conversion quality, latency and training duration required.@inproceedings{diener2016initial, title = {An Initial Investigation into the Real-Time Conversion of Facial Surface EMG Signals to Audible Speech}, author = {Diener, L. and Herff, C. and Janke, M. and Schultz, T.}, year = 2016, month = aug, booktitle = {{EMBC} 2016 - 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society}, doi = {10.1109/EMBC.2016.7590843}, abstract = {This paper presents early-stage results of our investigations into the direct conversion of facial surface electromyographic (EMG) signals into audible speech in a real-time setting, enabling novel avenues for research and system improvement through real-time feedback. The system uses a pipeline approach to enable online acquisition of EMG data, extraction of EMG features, mapping of EMG features to audio features, synthesis of audio waveforms from audio features and output of the audio waveforms via speakers or headphones. Our system allows for performing EMG-to-Speech conversion with low latency and on a continuous stream of EMG data, enabling near instantaneous audio output during audible as well as silent speech production. In this paper, we present an analysis of our system's components for latency incurred, as well as the trade-offs between conversion quality, latency and training duration required.}, url = {https://halcy.de/cites/pdf/diener2016initial.pdf}, poster = {https://halcy.de/cites/pdf/diener2016initial_poster.pdf}, } |
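For readers unfamiliar with the pipeline structure described above (EMG acquisition, EMG feature extraction, EMG-to-audio feature mapping, waveform synthesis), the toy sketch below shows a frame-by-frame streaming loop with per-stage latency timing. All stages are stand-ins; the paper's actual features and mapping model are not reproduced here.

```python
# Toy sketch of a frame-based streaming pipeline with per-stage latency timing.
# Every stage is a stand-in for illustration only.
import time
import numpy as np

FRAME_LEN = 160  # samples of EMG per processing frame (assumption)

def acquire_frame(rng):            # stand-in for the EMG hardware driver
    return rng.standard_normal(FRAME_LEN)

def extract_features(frame):       # stand-in for EMG feature extraction
    return np.array([frame.mean(), frame.std(), np.abs(frame).mean()])

def map_to_audio_features(feats):  # stand-in for the learned EMG-to-audio mapping
    return np.tile(feats, 8)

def synthesize_audio(audio_feats): # stand-in for waveform synthesis
    return np.sin(np.cumsum(audio_feats))

rng = np.random.default_rng(0)
timings = {"features": [], "mapping": [], "synthesis": []}

for _ in range(100):               # process a short stream frame by frame
    frame = acquire_frame(rng)
    t0 = time.perf_counter(); feats = extract_features(frame)
    t1 = time.perf_counter(); afeats = map_to_audio_features(feats)
    t2 = time.perf_counter(); _audio = synthesize_audio(afeats)
    t3 = time.perf_counter()
    timings["features"].append(t1 - t0)
    timings["mapping"].append(t2 - t1)
    timings["synthesis"].append(t3 - t2)

for stage, ts in timings.items():
    print(f"{stage:>9}: {1e3 * np.mean(ts):.3f} ms/frame on average")
```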
|
➥ |
Direct Conversion from Facial Myoelectric Signals to Speech using Deep Neural Networks
( , , ),
at IJCNN 2015 - 2015 International Joint Conference on Neural Networks, October 2015
AbstractThis paper presents our first results using Deep Neural Networks for surface electromyographic (EMG) speech synthesis. The proposed approach enables a direct mapping from EMG signals captured from the articulatory muscle movements to the acoustic speech signal. Features are processed from multiple EMG channels and are fed into a feed forward neural network to achieve a mapping to the target acoustic speech output. We show that this approach is feasible to generate speech output from the input EMG signal and compare the results to a prior mapping technique based on Gaussian mixture models. The comparison is conducted via objective Mel-Cepstral distortion scores and subjective listening test evaluations. It shows that the proposed Deep Neural Network approach gives substantial improvements for both evaluation criteria.@inproceedings{diener2015direct, title = {Direct Conversion from Facial Myoelectric Signals to Speech using Deep Neural Networks}, author = {Diener, Lorenz and Janke, Matthias and Schultz, Tanja}, year = 2015, month = oct, booktitle = {{IJCNN} 2015 - 2015 International Joint Conference on Neural Networks}, pages = {1--7}, doi = {10.1109/IJCNN.2015.7280404}, abstract = {This paper presents our first results using Deep Neural Networks for surface electromyographic (EMG) speech synthesis. The proposed approach enables a direct mapping from EMG signals captured from the articulatory muscle movements to the acoustic speech signal. Features are processed from multiple EMG channels and are fed into a feed forward neural network to achieve a mapping to the target acoustic speech output. We show that this approach is feasible to generate speech output from the input EMG signal and compare the results to a prior mapping technique based on Gaussian mixture models. The comparison is conducted via objective Mel-Cepstral distortion scores and subjective listening test evaluations. It shows that the proposed Deep Neural Network approach gives substantial improvements for both evaluation criteria.}, keywords = {electromyography, silent speech interface, deep neural networks}, url = {https://halcy.de/cites/pdf/diener2015direct.pdf}, } |
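The objective score mentioned above, Mel-Cepstral Distortion (MCD), is a standard log-spectral distance between aligned mel-cepstral frames. The sketch below uses the common convention of excluding the 0th (energy) coefficient and assumes the two sequences are already time-aligned; the paper's exact configuration is not restated here.

```python
# Sketch of the standard Mel-Cepstral Distortion (MCD) between two aligned
# mel-cepstral sequences. Dropping the energy coefficient and assuming
# pre-aligned frames are conventions adopted here for illustration.
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """ref, syn: arrays of shape (frames, coeffs), already time-aligned."""
    ref, syn = np.asarray(ref, float), np.asarray(syn, float)
    diff = ref[:, 1:] - syn[:, 1:]  # drop the 0th (energy) coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))  # in dB, averaged over frames

rng = np.random.default_rng(1)
reference = rng.standard_normal((200, 25))
converted = reference + 0.1 * rng.standard_normal((200, 25))
print(f"MCD: {mel_cepstral_distortion(reference, converted):.2f} dB")
```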
➥ |
Codebook Clustering for Unit Selection Based EMG-to-Speech Conversion
( , , ),
at INTERSPEECH 2015 - 16th Annual Conference of the International Speech Communication Association, September 2015
AbstractThis paper reports on our recent advances in using Unit Selection to directly synthesize speech from facial surface electromyographic (EMG) signals generated by movement of the articulatory muscles during speech production. We achieve a robust Unit Selection mapping by using a more sophisticated unit codebook. This codebook is generated from a set of base units using a two stage unit clustering process. The units are first clustered based on the audio and afterwards on the EMG feature vectors they cover, and a new codebook is generated using these cluster assignments. We evaluate different cluster counts for both stages and revisit our evaluation of unit sizes in light of this clustering approach. Our final system achieves a significantly better Mel-Cepstral distortion score than the Unit Selection based EMG-to-Speech conversion system from our previous work while, due to the reduced codebook size, taking less time to perform the conversion.@inproceedings{diener2015codebook, title = {Codebook Clustering for Unit Selection Based EMG-to-Speech Conversion}, author = {Diener, Lorenz and Janke, Matthias and Schultz, Tanja}, year = 2015, month = sep, booktitle = {{INTERSPEECH} 2015 - 16th Annual Conference of the International Speech Communication Association}, pages = {2420--2424}, doi = {10.21437/Interspeech.2015-523}, abstract = {This paper reports on our recent advances in using Unit Selection to directly synthesize speech from facial surface electromyographic (EMG) signals generated by movement of the articulatory muscles during speech production. We achieve a robust Unit Selection mapping by using a more sophisticated unit codebook. This codebook is generated from a set of base units using a two stage unit clustering process. The units are first clustered based on the audio and afterwards on the EMG feature vectors they cover, and a new codebook is generated using these cluster assignments. We evaluate different cluster counts for both stages and revisit our evaluation of unit sizes in light of this clustering approach. Our final system achieves a significantly better Mel-Cepstral distortion score than the Unit Selection based EMG-to-Speech conversion system from our previous work while, due to the reduced codebook size, taking less time to perform the conversion.}, keywords = {electromyography, silent speech interface, unit selection}, url = {https://halcy.de/cites/pdf/diener2015codebook.pdf}, } |
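One possible way to realize the two-stage codebook clustering described above is sketched below: units are first clustered on their audio features, each audio cluster is then sub-clustered on the EMG features it covers, and the paired centroids form the reduced codebook. The cluster counts and the use of centroids as codebook entries are assumptions made for this sketch, not the paper's exact setup.

```python
# Illustrative two-stage clustering of (EMG, audio) unit pairs into a smaller
# codebook; the numbers and the centroid-as-entry choice are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n_units = 2000
emg_feats = rng.standard_normal((n_units, 32))    # per-unit EMG feature vectors
audio_feats = rng.standard_normal((n_units, 25))  # per-unit audio feature vectors

N_AUDIO_CLUSTERS, N_EMG_SUBCLUSTERS = 50, 4

codebook = []  # list of (emg_centroid, audio_centroid) pairs
audio_labels = KMeans(n_clusters=N_AUDIO_CLUSTERS, n_init=10,
                      random_state=0).fit_predict(audio_feats)

for a in range(N_AUDIO_CLUSTERS):
    idx = np.flatnonzero(audio_labels == a)
    k = min(N_EMG_SUBCLUSTERS, len(idx))
    emg_labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(emg_feats[idx])
    for e in range(k):
        members = idx[emg_labels == e]
        codebook.append((emg_feats[members].mean(axis=0),
                         audio_feats[members].mean(axis=0)))

print(f"reduced {n_units} units to a codebook of {len(codebook)} entries")
```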
➥ |
Improving Unit Selection based EMG-to-Speech Conversion
( ),
Master's Thesis, July 2015
AbstractThis master’s thesis introduces a new approach to improve the unit-selection based conversion of facial myoelectric signals to audible speech. Surface electromyography is the recording of electric signals generated by muscle activity using surface electrodes attached to the skin. Past work has shown that it is feasible to generate audible speech signals from facial electromyographic activity generated during speech production, using several different approaches. This work focuses on the unit-selection approach to conversion, where the speech signal is reconstructed by concatenating pieces of target audio data selected by a similarity criterion calculated on the parallel sequence of source electromyographic data. A novel approach, based on optimizing the database that units are selected from by using unit clustering to generate more prototypical units and improve the selection process, is introduced and evaluated. In total, we obtain a qualitative improvement of up to 14.92 percent relative over a baseline unit selection system, while improving the time taken for conversion by up to 98%.@mastersthesis{diener2015improving, title = {Improving Unit Selection based EMG-to-Speech Conversion}, author = {Diener, Lorenz}, year = 2015, month = jul, school = {Karlsruher Institut für Technologie}, supervisor = {Janke, Matthias and Schultz, Tanja}, abstract = {This master’s thesis introduces a new approach to improve the unit-selection based conversion of facial myoelectric signals to audible speech. Surface electromyography is the recording of electric signals generated by muscle activity using surface electrodes attached to the skin. Past work has shown that it is feasible to generate audible speech signals from facial electromyographic activity generated during speech production, using several different approaches. This work focuses on the unit-selection approach to conversion, where the speech signal is reconstructed by concatenating pieces of target audio data selected by a similarity criterion calculated on the parallel sequence of source electromyographic data. A novel approach, based on optimizing the database that units are selected from by using unit clustering to generate more prototypical units and improve the selection process, is introduced and evaluated. In total, we obtain a qualitative improvement of up to 14.92 percent relative over a baseline unit selection system, while improving the time taken for conversion by up to 98%.}, url = {https://halcy.de/cites/pdf/diener2015improving.pdf}, } |
|
A runtime cache for interactive procedural modeling
( , , , , , ),
in Computers & Graphics, volume 36, number 5, pages 366–375, August 2012
AbstractWe present an efficient runtime cache to accelerate the display of procedurally displaced and textured implicit surfaces, exploiting spatio-temporal coherence between consecutive frames. We cache evaluations of implicit textures covering a conceptually infinite space. Rotating objects, zooming onto surfaces, and locally deforming shapes now requires minor cache updates per frame and benefits from mostly cached values, avoiding expensive re-evaluations. A novel parallel hashing scheme supports arbitrarily large data records and allows for an automated deletion policy: new information may evict information no longer required from the cache, resulting in an efficient usage. This sets our solution apart from previous caching techniques, which do not dynamically adapt to view changes and interactive shape modifications. We provide a thorough analysis on cache behavior for different procedural noise functions to displace implicit base shapes, during typical modeling operations.@article{reiner2012runtime, title = {A runtime cache for interactive procedural modeling}, author = {Reiner, Tim and Lefebvre, Sylvain and Diener, Lorenz and Garc{\'\i}a, Ismael and Jobard, Bruno and Dachsbacher, Carsten}, year = 2012, month = aug, journal = {Computers \& Graphics}, publisher = {Elsevier}, volume = 36, number = 5, pages = {366--375}, doi = {10.1016/j.cag.2012.03.031}, abstract = {We present an efficient runtime cache to accelerate the display of procedurally displaced and textured implicit surfaces, exploiting spatio-temporal coherence between consecutive frames. We cache evaluations of implicit textures covering a conceptually infinite space. Rotating objects, zooming onto surfaces, and locally deforming shapes now requires minor cache updates per frame and benefits from mostly cached values, avoiding expensive re-evaluations. A novel parallel hashing scheme supports arbitrarily large data records and allows for an automated deletion policy: new information may evict information no longer required from the cache, resulting in an efficient usage. This sets our solution apart from previous caching techniques, which do not dynamically adapt to view changes and interactive shape modifications. We provide a thorough analysis on cache behavior for different procedural noise functions to displace implicit base shapes, during typical modeling operations.}, url = {https://halcy.de/cites/pdf/reiner2012runtime.pdf}, } |
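The caching idea described above can be illustrated, in a heavily simplified serial form, as memoizing expensive procedural evaluations keyed by a quantized query position and evicting old entries once the cache is full. The paper's actual scheme is a parallel GPU hash with its own deletion policy; the sketch below only shows the general caching pattern.

```python
# Much-simplified serial sketch of the caching pattern: memoize expensive
# procedural-noise evaluations keyed by a quantized 3D position, evicting the
# least recently used entries when the cache is full. Cell size and capacity
# are arbitrary values chosen for illustration.
import math
from collections import OrderedDict

CELL = 0.01       # quantization step for cache keys (assumption)
CAPACITY = 4096   # maximum number of cached evaluations (assumption)
cache = OrderedDict()

def expensive_noise(x, y, z):
    """Stand-in for an expensive procedural noise / displacement evaluation."""
    return math.sin(12.9898 * x + 78.233 * y + 37.719 * z) * 43758.5453 % 1.0

def cached_noise(x, y, z):
    key = (round(x / CELL), round(y / CELL), round(z / CELL))
    if key in cache:
        cache.move_to_end(key)       # mark as recently used
        return cache[key]
    value = expensive_noise(x, y, z)
    cache[key] = value
    if len(cache) > CAPACITY:
        cache.popitem(last=False)    # evict the least recently used entry
    return value

# Re-querying nearby points in consecutive "frames" mostly hits the cache.
for frame in range(3):
    size_before = len(cache)
    for i in range(1000):
        cached_noise(0.001 * i, 0.002 * i, 0.0)
    print(f"frame {frame}: cache size {len(cache)} (was {size_before})")
```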
|
➥ |
Procedural modeling with signed distance functions
( ),
Bachelor's Thesis, February 2012
AbstractProcedural modeling is the modeling of scenes using algorithms instead of explicit lists of geometry specified vertex by vertex. The implicit procedural approach to modeling has several advantages over describing scenes in an explicit fashion, such as the possibility to have levels of detail that would be impossible to store explicitly, as the memory requirements would be prohibitive – even an infinite level of detail is possible when the scene description can simply provide the detail as soon as it becomes necessary during the rendering process. It is obvious, then, that describing scenes or objects procedurally is desirable. However, while intuitively accessible modeling tools for the creation of explicit geometry abound, there are only very few and hardly any mature tools or frameworks for the procedural modeling of objects or scenes. This thesis will give an overview over the current state of procedural modeling frameworks. After explaining the theoretical concepts required for its understanding, it will go into detail about a specific type of procedural modeling – modeling with implicit surfaces, with rendering based on distance functions – and introduce a tool which can be used to accomplish this task. It will then introduce improvements made to this tool throughout the course of this thesis, including the development of a cache enabling the real-time use of previously prohibitively expensive noise functions, and finally discuss and summarize its now extended capabilities.@bachelorsthesis{diener2012procedural, title = {Procedural modeling with signed distance functions}, author = {Diener, Lorenz}, year = 2012, month = feb, school = {Karlsruher Institut für Technologie}, supervisor = {Dachsbacher, Carsten and Reiner, Tim}, abstract = {Procedural modeling is the modeling of scenes using algorithms instead of explicit lists of geometry specified vertex by vertex. The implicit procedural approach to modeling has several advantages over describing scenes in an explicit fashion, such as the possibility to have levels of detail that would be impossible to store explicitly, as the memory requirements would be prohibitive – even an infinite level of detail is possible when the scene description can simply provide the detail as soon as it becomes necessary during the rendering process. It is obvious, then, that describing scenes or objects procedurally is desirable. However, while intuitively accessible modeling tools for the creation of explicit geometry abound, there are only very few and hardly any mature tools or frameworks for the procedural modeling of objects or scenes. This thesis will give an overview over the current state of procedural modeling frameworks. After explaining the theoretical concepts required for its understanding, it will go into detail about a specific type of procedural modeling – modeling with implicit surfaces, with rendering based on distance functions – and introduce a tool which can be used to accomplish this task. It will then introduce improvements made to this tool throughout the course of this thesis, including the development of a cache enabling the real-time use of previously prohibitively expensive noise functions, and finally discuss and summarize its now extended capabilities.}, url = {https://halcy.de/cites/pdf/diener2012procedural.pdf}, } |