Publication list

This is a list of all publications that I was involved in as primary or co-author. PDFs are linked where I can provide them. Publications where I am the primary author are marked with ➥.


➥ The ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge (Lorenz Diener, Solomiya Branets, Ando Saabas, Ross Cutler), at ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, April 2024

Abstract

Audio packet loss concealment is the hiding of gaps in VoIP audio streams caused by network packet loss. With the ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge, we build on the success of the previous Audio PLC Challenge held at INTERSPEECH 2022. We evaluate models on an overall harder dataset, and use the new ITU-T P.804 evaluation procedure to more closely evaluate the performance of systems specifically on the PLC task. We evaluate a total of 9 systems, 8 of which satisfy the strict real-time performance requirements of the challenge, using both P.804 and Word Accuracy evaluations.
@inproceedings{diener2024icassp,
  title        = {The ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge},
  author       = {Diener, Lorenz and Branets, Solomiya and Saabas, Ando and Cutler, Ross},
  year         = 2024,
  month        = apr,
  booktitle    = {{ICASSP} 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal
    Processing},
  abstract     = {Audio packet loss concealment is the hiding of gaps in VoIP audio streams caused
    by network packet loss. With the ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge,
    we build on the success of the previous Audio PLC Challenge held at INTERSPEECH 2022. We
    evaluate models on an overall harder dataset, and use the new ITU-T P.804 evaluation procedure
    to more closely evaluate the performance of systems specifically on the PLC task. We evaluate a
    total of 9 systems, 8 of which satisfy the strict real-time performance requirements of the
    challenge, using both P.804 and Word Accuracy evaluations.},
  code         = {https://aka.ms/plc-challenge},
  url          = {https://halcy.de/cites/pdf/diener2024icassp.pdf},
}
➥ PLCMOS - a data-driven non-intrusive metric for the evaluation of packet loss concealment algorithms (Lorenz Diener, Sten Sootla, Marju Purin, Ando Saabas, Robert Aichner, Ross Cutler), at INTERSPEECH 2023 - 24th Annual Conference of the International Speech Communication Association, August 2023

Abstract

Speech quality assessment is a problem for every researcher working on models that produce or process speech. Human subjective ratings, the gold standard in speech quality assessment, are expensive and time-consuming to acquire in a quantity that is sufficient to get reliable data, while automated objective metrics show a low correlation with gold standard ratings. This paper presents PLCMOS, a non-intrusive data-driven tool for generating a robust, accurate estimate of the mean opinion score a human rater would assign an audio file that has been processed by being transmitted over a degraded packet-switched network with missing packets being healed by a packet loss concealment algorithm. Our new model shows a model-wise Pearson's correlation of ~0.97 and rank correlation of ~0.95 with human ratings, substantially above all other available intrusive and non-intrusive metrics. The model is released as an ONNX model for other researchers to use when building PLC systems.
@inproceedings{diener2023plcmos,
  title        = {PLCMOS - a data-driven non-intrusive metric for the evaluation of packet loss
    concealment algorithms},
  author       = {Diener, Lorenz and Sootla, Sten and Purin, Marju and Saabas, Ando and Aichner,
    Robert and Cutler, Ross},
  year         = 2023,
  month        = aug,
  booktitle    = {{INTERSPEECH} 2023 - 24th Annual Conference of the International Speech
    Communication Association},
  abstract     = {Speech quality assessment is a problem for every researcher working on models that
    produce or process speech. Human subjective ratings, the gold standard in speech quality
    assessment, are expensive and time-consuming to acquire in a quantity that is sufficient to get
    reliable data, while automated objective metrics show a low correlation with gold standard
    ratings. This paper presents PLCMOS, a non-intrusive data-driven tool for generating a robust,
    accurate estimate of the mean opinion score a human rater would assign an audio file that has
    been processed by being transmitted over a degraded packet-switched network with missing packets
    being healed by a packet loss concealment algorithm. Our new model shows a model-wise Pearson's
    correlation of ~0.97 and rank correlation of ~0.95 with human ratings, substantially above all
    other available intrusive and non-intrusive metrics. The model is released as an ONNX model for
    other researchers to use when building PLC systems.},
  doi          = {10.21437/Interspeech.2023-1532},
  code         = {https://aka.ms/plcmos},
  url          = {https://halcy.de/cites/pdf/diener2023plcmos.pdf},
  poster       = {https://halcy.de/cites/pdf/diener2023plcmos_poster.pdf},
}
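Since the PLCMOS checkpoint above is distributed as an ONNX model, it can be scored with any generic ONNX runtime. The sketch below only illustrates that workflow: the feature layout, shapes and the helper name plcmos_score are assumptions, so the actual preprocessing expected by the released model should be taken from https://aka.ms/plcmos.

# Minimal usage sketch (assumed feature layout, not the official preprocessing)
import numpy as np
import onnxruntime as ort

def plcmos_score(model_path, features):
    """Run one forward pass of a PLCMOS-style ONNX model and return a float MOS estimate."""
    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name          # query the model's real input name
    batch = features.astype(np.float32)[np.newaxis]    # add a batch dimension
    outputs = session.run(None, {input_name: batch})
    return float(np.asarray(outputs[0]).squeeze())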
➥ INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge (Lorenz Diener, Sten Sootla, Solomiya Branets, Ando Saabas, Robert Aichner, Ross Cutler), at INTERSPEECH 2022 - 23rd Annual Conference of the International Speech Communication Association, September 2022

Abstract

Audio Packet Loss Concealment (PLC) is the hiding of gaps in audio streams caused by data transmission failures in packet switched networks. This is a common problem, and of increasing importance as end-to-end VoIP telephony and teleconference systems become the default and ever more widely used form of communication in business as well as in personal usage. This paper presents the INTERSPEECH 2022 Audio Deep Packet Loss Concealment challenge. We first give an overview of the PLC problem, and introduce some classical approaches to PLC as well as recent work. We then present the open source dataset released as part of this challenge as well as the evaluation methods and metrics used to determine the winner. We also briefly introduce PLCMOS, a novel data-driven metric that can be used to quickly evaluate the performance of PLC systems. Finally, we present the results of the INTERSPEECH 2022 Audio Deep PLC Challenge, and provide a summary of important takeaways.
@inproceedings{diener2022interspeech,
  title        = {INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge},
  author       = {Diener, Lorenz and Sootla, Sten and Branets, Solomiya and Saabas, Ando and
    Aichner, Robert and Cutler, Ross},
  year         = 2022,
  month        = sep,
  booktitle    = {{INTERSPEECH} 2022 - 23rd Annual Conference of the International Speech
    Communication Association},
  video        = {https://www.youtube.com/watch?v=W9gYjas9lB0},
  doi          = {10.21437/Interspeech.2022-10829},
  abstract     = {Audio Packet Loss Concealment (PLC) is the hiding of gaps in audio streams caused
    by data transmission failures in packet switched networks. This is a common problem, and of
    increasing importance as end-to-end VoIP telephony and teleconference systems become the default
    and ever more widely used form of communication in business as well as in personal usage. This
    paper presents the INTERSPEECH 2022 Audio Deep Packet Loss Concealment challenge. We first give
    an overview of the PLC problem, and introduce some classical approaches to PLC as well as recent
    work. We then present the open source dataset released as part of this challenge as well as the
    evaluation methods and metrics used to determine the winner. We also briefly introduce PLCMOS, a
    novel data-driven metric that can be used to quickly evaluate the performance of PLC systems.
    Finally, we present the results of the INTERSPEECH 2022 Audio Deep PLC Challenge, and provide a
    summary of important takeaways.},
  code         = {https://aka.ms/plc-challenge},
  url          = {https://halcy.de/cites/pdf/diener2022interspeech.pdf},
}
Sprechen durch Vorstellen (Miguel Angrick, Maarten Ottenhoff, Lorenz Diener), in Sprache · Stimme · Gehör, volume 46, pages 62–63, June 2022

Abstract

In an international project, computer scientists have succeeded in realizing a so-called neural speech prosthesis. With it, imagined speech can be made acoustically audible, in real time and without delay. This development can help people who have lost the ability to speak due to neurological disease and who cannot communicate with the outside world without assistance.
@article{angrick2022sprechen,
  title        = {Sprechen durch Vorstellen},
  author       = {Angrick, Miguel and Ottenhoff, Maarten and Diener, Lorenz},
  year         = 2022,
  month        = jun,
  journal      = {Sprache{\textperiodcentered} Stimme{\textperiodcentered} Geh{\"o}r},
  volume       = 46,
  pages        = {62--63},
  doi          = {10.1055/a-1666-7303},
  abstract     = {In einem internationalen Projekt ist es Informatikerinnen und Informatikern
    gelungen, eine sogenannte Neurosprachprothese zu realisieren. Damit kann vorgestellte Sprache
    akustisch hörbar gemacht werden – ohne Verzögerung in Echtzeit. Die Entwicklung kann Menschen
    helfen, die aufgrund neuronaler Erkrankungen verstummt sind und ohne fremde Hilfe nicht mit der
    Außenwelt kommunizieren können.},
}
Towards Closed-Loop Speech Synthesis from Stereotactic EEG: A Unit Selection Approach (Miguel Angrick, Maarten Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sophocles Goulis, Albert J. Colon, Louis Wagner, Dean J. Krusienski, Pieter L. Kubben, Tanja Schultz, Christian Herff), at ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2022

Abstract

Neurological disorders can severely impact speech communication. Recently, neural speech prostheses have been proposed that reconstruct intelligible speech from neural signals recorded superficially on the cortex. Thus far, it has been unclear whether similar reconstruction is feasible from deeper brain structures, and whether audible speech can be directly synthesized from these reconstructions with low-latency, as required for a practical speech neuroprosthetic. The present study aims to address both challenges. First, we implement a low-latency unit selection based synthesizer that converts neural signals into audible speech. Second, we evaluate our approach on open-loop recordings from 5 patients implanted with stereotactic depth electrodes who conducted a read-aloud task of Dutch utterances. We achieve correlation coefficients significantly higher than chance level of up to 0.6 and an average computational cost of 6.6 ms for each 10 ms frame. While the current reconstructed utterances are not intelligible, our results indicate promising decoding and run-time capabilities that are suitable for investigations of speech processes in closed-loop experiments.
@inproceedings{angrick2022towards,
  title        = {Towards Closed-Loop Speech Synthesis from Stereotactic EEG: A Unit Selection
    Approach},
  author       = {Angrick, Miguel and Ottenhoff, Maarten and Diener, Lorenz and Ivucic, Darius and
    Ivucic, Gabriel and Goulis, Sophocles and Colon, Albert J. and Wagner, Louis and Krusienski,
    Dean J. and Kubben, Pieter L. and Schultz, Tanja and Herff, Christian},
  year         = 2022,
  month        = may,
  booktitle    = {{ICASSP} 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal
    Processing},
  pages        = {1296--1300},
  doi          = {10.1109/ICASSP43922.2022.9747300},
  issn         = {2379-190X},
  abstract     = {Neurological disorders can severely impact speech communication. Recently, neural
    speech prostheses have been proposed that reconstruct intelligible speech from neural signals
    recorded superficially on the cortex. Thus far, it has been unclear whether similar
    reconstruction is feasible from deeper brain structures, and whether audible speech can be
    directly synthesized from these reconstructions with low-latency, as required for a practical
    speech neuroprosthetic. The present study aims to address both challenges. First, we implement a
    low-latency unit selection based synthesizer that converts neural signals into audible speech.
    Second, we evaluate our approach on open-loop recordings from 5 patients implanted with
    stereotactic depth electrodes who conducted a read-aloud task of Dutch utterances. We achieve
    correlation coefficients significantly higher than chance level of up to 0.6 and an average
    computational cost of 6.6 ms for each 10 ms frame. While the current reconstructed utterances
    are not intelligible, our results indicate promising decoding and run-time capabilities that are
    suitable for investigations of speech processes in closed-loop experiments.},
  url          = {https://halcy.de/cites/pdf/angrick2022towards.pdf},
}
Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity (Miguel Angrick, Maarten Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sophocles Goulis, Jeremy Saal, Albert Colon, Louis Wagner, Dean Krusienski, Pieter Kubben, Tanja Schultz, Christian Herff), in Nature Communications Biology, volume 4, number 1, pages 1–10, December 2021

Abstract

Speech neuroprosthetics aim to provide a natural communication channel to individuals who are unable to speak due to physical or neurological impairments. Real-time synthesis of acoustic speech directly from measured neural activity could enable natural conversations and notably improve quality of life, particularly for individuals who have severely limited means of communication. Recent advances in decoding approaches have led to high quality reconstructions of acoustic speech from invasively measured neural activity. However, most prior research utilizes data collected during open-loop experiments of articulated speech, which might not directly translate to imagined speech processes. Here, we present an approach that synthesizes audible speech in real-time for both imagined and whispered speech conditions. Using a participant implanted with stereotactic depth electrodes, we were able to reliably generate audible speech in real-time. The decoding models rely predominately on frontal activity suggesting that speech processes have similar representations when vocalized, whispered, or imagined. While reconstructed audio is not yet intelligible, our real-time synthesis approach represents an essential step towards investigating how patients will learn to operate a closed-loop speech neuroprosthesis based on imagined speech
@article{angrick2021real,
  title        = {Real-time synthesis of imagined speech processes from minimally invasive
    recordings of neural activity},
  author       = {Angrick, Miguel and Ottenhoff, Maarten and Diener, Lorenz and Ivucic, Darius and
    Ivucic, Gabriel and Goulis, Sophocles and Saal, Jeremy and Colon, Albert and Wagner, Louis and
    Krusienski, Dean and Kubben, Pieter and Schultz, Tanja and Herff, Christian},
  year         = 2021,
  month        = dec,
  journal      = {Nature Communications Biology},
  publisher    = {Nature Publishing Group},
  volume       = 4,
  number       = 1,
  pages        = {1--10},
  video        = {https://www.youtube.com/watch?v=2m8bUYZP-Eo},
  doi          = {10.1101/2020.12.11.421149},
  abstract     = {Speech neuroprosthetics aim to provide a natural communication channel to
    individuals who are unable to speak due to physical or neurological impairments. Real-time
    synthesis of acoustic speech directly from measured neural activity could enable natural
    conversations and notably improve quality of life, particularly for individuals who have
    severely limited means of communication. Recent advances in decoding approaches have led to high
    quality reconstructions of acoustic speech from invasively measured neural activity. However,
    most prior research utilizes data collected during open-loop experiments of articulated speech,
    which might not directly translate to imagined speech processes. Here, we present an approach
    that synthesizes audible speech in real-time for both imagined and whispered speech conditions.
    Using a participant implanted with stereotactic depth electrodes, we were able to reliably
    generate audible speech in real-time. The decoding models rely predominately on frontal activity
    suggesting that speech processes have similar representations when vocalized, whispered, or
    imagined. While reconstructed audio is not yet intelligible, our real-time synthesis approach
    represents an essential step towards investigating how patients will learn to operate a
    closed-loop speech neuroprosthesis based on imagined speech},
  url          = {https://halcy.de/cites/pdf/angrick2021real.pdf},
}
➥ The Impact of Audible Feedback on EMG-to-Speech Conversion (Lorenz Diener), PhD Thesis, May 2021

Abstract

Research interest in speech interfaces that can function even when an audible acoustic signal is not present – so-called Silent Speech Interfaces – has grown dramatically in recent years, as the field presents many barely explored avenues for research and huge potential for applications in user interfaces and prosthetics. EMG-to-Speech conversion is a type of silent speech interface, based on electromyography: It is the direct conversion of a facial electrical speech muscle activity signal to audible speech without an intermediate textual representation. Such a direct conversion approach is well suited to speech prosthesis and silent telephony applications and could be used as a pre-processing step to enable a user to use a regular acoustic speech interface silently. To enable these applications in practice, one requirement is that EMG-to-Speech conversion systems must be capable of producing output in real time and with low latency, and work on EMG signals recorded during silently produced speech. The overall objective of this dissertation is to move EMG-to-Speech conversion further towards practical usability by building a real-time low-latency capable EMG-to-Speech conversion system and then use it to evaluate the effect of audible feedback, provided in real-time, on silent speech production.
@phdthesis{diener2021impact,
  title        = {The Impact of Audible Feedback on EMG-to-Speech Conversion},
  author       = {Diener, Lorenz},
  year         = 2021,
  month        = may,
  doi          = {10.26092/elib/556},
  school       = {University of Bremen},
  supervisor   = {Schultz, Tanja and Hueber, Thomas},
  abstract     = {Research interest in speech interfaces that can function even when an audible
    acoustic signal is not present -- so-called {Silent Speech Interfaces} -- has grown dramatically
    in recent years, as the field presents many barely explored avenues for research and huge
    potential for applications in user interfaces and prosthetics. EMG-to-Speech conversion is a
    type of silent speech interface, based on electromyography: It is the direct conversion of a
    facial electrical speech muscle activity signal to audible speech without an intermediate
    textual representation. Such a direct conversion approach is well suited to speech prosthesis
    and silent telephony applications and could be used as a pre-processing step to enable a user to
    use a regular acoustic speech interface silently. To enable these applications in practice, one
    requirement is that EMG-to-Speech conversion systems must be capable of producing output in real
    time and with low latency, and work on EMG signals recorded during silently produced speech. The
    overall objective of this dissertation is to move EMG-to-Speech conversion further towards
    practical usability by building a real-time low-latency capable EMG-to-Speech conversion system
    and then use it to evaluate the effect of audible feedback, provided in real-time, on silent
    speech production.},
  code         = {https://github.com/cognitive-systems-lab/EMG-GUI},
  url          = {https://halcy.de/cites/pdf/diener2021impact.pdf},
}
Voice Restoration with Silent Speech Interfaces (ReSSInt) (Inma Hernaez, Jose Andrés González-López, Eva Navas, Jose Luis Pérez Córdoba, Ibon Saratxaga, Gonzalo Olivares, Jon Sánchez de la Fuente, Alberto Galdón, Víctor García Romillo, Míriam González-Atienza, Tanja Schultz, Phil Green, Michael Wand, Ricard Marxer, Lorenz Diener), at IberSPEECH 2021, March 2021

Abstract

ReSSInt aims at investigating the use of silent speech interfaces (SSIs) for restoring communication to individuals who have been deprived of the ability to speak. SSIs are devices which capture non-acoustic biosignals generated during the speech production process and use them to predict the intended message. Two are the biosignals that will be investigated in this project: electromyography (EMG) signals representing electrical activity driving the facial muscles and invasive electroencephalography (iEEG) neural signals captured by means of invasive electrodes implanted on the brain. From the whole spectrum of speech disorders which may affect a person’s voice, ReSSInt will address two particular conditions: (i) voice loss after total laryngectomy and (ii) neurodegenerative diseases and other traumatic injuries which may leave an individual paralyzed and, eventually, unable to speak. To make this technology truly beneficial for these persons, this project aims at generating intelligible speech of reasonable quality. This will be tackled by recording large databases and the use of state-of-the-art generative deep learning techniques. Finally, different voice rehabilitation scenarios are foreseen within the project, which will lead to innovative research solutions for SSIs and a real impact on society by improving the life of people with speech impediments.
@inproceedings{hernaez2021voice,
  title        = {Voice Restoration with Silent Speech Interfaces ({ReSSInt})},
  author       = {Hernaez, Inma and González-López, Jose Andrés and Navas, Eva and {Pérez Córdoba},
    Jose Luis and Saratxaga, Ibon and Olivares, Gonzalo and {Sánchez de la Fuente}, Jon and Galdón,
    Alberto and {García Romillo}, Víctor and González-Atienza, Míriam and Schultz, Tanja and Green,
    Phil and Wand, Michael and Marxer, Ricard and Diener, Lorenz},
  year         = 2021,
  month        = mar,
  booktitle    = {Proc. IberSPEECH 2021},
  pages        = {130--134},
  doi          = {10.21437/IberSPEECH.2021-28},
  abstract     = {ReSSInt aims at investigating the use of silent speech interfaces (SSIs) for
    restoring communication to individuals who have been deprived of the ability to speak. SSIs are
    devices which capture non-acoustic biosignals generated during the speech production process and
    use them to predict the intended message. Two are the biosignals that will be investigated in
    this project: electromyography (EMG) signals representing electrical activity driving the facial
    muscles and invasive electroencephalography (iEEG) neural signals captured by means of invasive
    electrodes implanted on the brain. From the whole spectrum of speech disorders which may affect
    a person’s voice, ReSSInt will address two particular conditions: (i) voice loss after total
    laryngectomy and (ii) neurodegenerative diseases and other traumatic injuries which may leave an
    individual paralyzed and, eventually, unable to speak. To make this technology truly beneficial
    for these persons, this project aims at generating intelligible speech of reasonable quality.
    This will be tackled by recording large databases and the use of state-of-the-art generative
    deep learning techniques. Finally, different voice rehabilitation scenarios are foreseen within
    the project, which will lead to innovative research solutions for SSIs and a real impact on
    society by improving the life of people with speech impediments.},
  url          = {https://halcy.de/cites/pdf/hernaez2021voice.pdf},
}
Towards Speech Synthesis from Intracranial Signals (Christian Herff, Lorenz Diener, Emily Mugler, Marc Slutzky, Dean Krusienski, Tanja Schultz), chapter of "Brain-Computer Interface Research", pages 47–54, October 2020

Abstract

Brain-computer interfaces (BCIs) are envisioned to enable individuals with severe disabilities to regain the ability to communicate. Early BCIs have provided users with the ability to type messages one letter at a time, providing an important, but slow, means of communication for locked-in patients. However, natural speech contains substantially more information than a textual representation and can convey many important markers of human communication in addition to the sequence of words. A BCI that directly synthesizes speech from neural signals could harness this full expressive power of speech. In this study with motor-intact patients undergoing glioma removal, we demonstrate that high-quality audio signals can be synthesized from intracranial signals using a method from the speech synthesis community called Unit Selection. The Unit Selection approach concatenates speech units of the user to form new audio output and thereby produces natural speech in the user’s own voice.
@incollection{herff2020towards,
  title        = {Towards Speech Synthesis from Intracranial Signals},
  author       = {Herff, Christian and Diener, Lorenz and Mugler, Emily and Slutzky, Marc and
    Krusienski, Dean and Schultz, Tanja},
  year         = 2020,
  month        = oct,
  booktitle    = {Brain--Computer Interface Research},
  publisher    = {Springer},
  pages        = {47--54},
  doi          = {10.1007/978-3-030-49583-1_5},
  abstract     = {Brain-computer interfaces (BCIs) are envisioned to enable individuals with severe
    disabilities to regain the ability to communicate. Early BCIs have provided users with the
    ability to type messages one letter at a time, providing an important, but slow, means of
    communication for locked-in patients. However, natural speech contains substantially more
    information than a textual representation and can convey many important markers of human
    communication in addition to the sequence of words. A BCI that directly synthesizes speech from
    neural signals could harness this full expressive power of speech. In this study with
    motor-intact patients undergoing glioma removal, we demonstrate that high-quality audio signals
    can be synthesized from intracranial signals using a method from the speech synthesis community
    called Unit Selection. The Unit Selection approach concatenates speech units of the user to form
    new audio output and thereby produces natural speech in the user’s own voice.},
}
➥ Towards Silent Paralinguistics: Deriving Speaking Mode and Speaker ID from Electromyographic Signals (Lorenz Diener, Shahin Amiriparian, Catarina Botelho, Kevin Scheck, Dennis Küster, Isabel Trancoso, Björn W. Schuller, Tanja Schultz), at INTERSPEECH 2020 - 21st Annual Conference of the International Speech Communication Association, September 2020

Abstract

Silent Computational Paralinguistics (SCP) - the assessment of speaker states and traits from non-audibly spoken communication - has rarely been targeted in the rich body of either Computational Paralinguistics or Silent Speech Processing. Here, we provide first steps towards this challenging but potentially highly rewarding endeavour: Paralinguistics can enrich spoken language interfaces, while Silent Speech Processing enables confidential and unobtrusive spoken communication for everybody, including mute speakers. We approach SCP by using speech-related biosignals stemming from facial muscle activities captured by surface electromyography (EMG). To demonstrate the feasibility of SCP, we select one speaker trait (speaker identity) and one speaker state (speaking mode). We introduce two promising strategies for SCP: (1) deriving paralinguistic speaker information directly from EMG of silently produced speech versus (2) first converting EMG into an audible speech signal followed by conventional computational paralinguistic methods. We compare traditional feature extraction and decision making approaches to more recent deep representation and transfer learning by convolutional and recurrent neural networks, using openly available EMG data. We find that paralinguistics can be assessed not only from acoustic speech but also from silent speech captured by EMG.
@inproceedings{diener2020towards,
  title        = {Towards Silent Paralinguistics: Deriving Speaking Mode and Speaker ID from
    Electromyographic Signals},
  author       = {Diener, Lorenz and Amiriparian, Shahin and Botelho, Catarina and Scheck, Kevin and
    Küster, Dennis and Trancoso, Isabel and Schuller, Björn W. and Schultz, Tanja},
  year         = 2020,
  month        = sep,
  booktitle    = {{INTERSPEECH} 2020 - 21st Annual Conference of the International Speech
    Communication Association},
  video        = {https://www.youtube.com/watch?v=sy7MeEmEusY},
  doi          = {10.21437/interspeech.2020-2848},
  abstract     = {Silent Computational Paralinguistics (SCP) - the assessment of speaker states and
    traits from non-audibly spoken communication - has rarely been targeted in the rich body of
    either Computational Paralinguistics or Silent Speech Processing. Here, we provide first steps
    towards this challenging but potentially highly rewarding endeavour: Paralinguistics can enrich
    spoken language interfaces, while Silent Speech Processing enables confidential and unobtrusive
    spoken communication for everybody, including mute speakers. We approach SCP by using
    speech-related biosignals stemming from facial muscle activities captured by surface
    electromyography (EMG). To demonstrate the feasibility of SCP, we select one speaker trait
    (speaker identity) and one speaker state (speaking mode). We introduce two promising strategies
    for SCP: (1) deriving paralinguistic speaker information directly from EMG of silently produced
    speech versus (2) first converting EMG into an audible speech signal followed by conventional
    computational paralinguistic methods. We compare traditional feature extraction and decision
    making approaches to more recent deep representation and transfer learning by convolutional and
    recurrent neural networks, using openly available EMG data. We find that paralinguistics can be
    assessed not only from acoustic speech but also from silent speech captured by EMG.},
  url          = {https://halcy.de/cites/pdf/diener2020towards.pdf},
}
➥ CSL-EMG_Array: An Open Access Corpus for EMG-to-Speech Conversion (Lorenz Diener, Mehrdad Roustay Vishkasougheh, Tanja Schultz), at INTERSPEECH 2020 - 21st Annual Conference of the International Speech Communication Association, September 2020

Abstract

We present a new open access corpus for the training and evaluation of EMG-to-Speech conversion systems based on array electromyographic recordings. The corpus is recorded with a recording paradigm closely mirroring realistic EMG-to-Speech usage scenarios, and includes evaluation data recorded from both audible as well as silent speech. The corpus consists of 9.5 hours of data, split into 12 sessions recorded from 8 speakers. Based on this corpus, we present initial benchmark results with a realistic online EMG-to-Speech conversion use case, both for the audible and silent speech subsets. We also present a method for drastically improving EMG-to-Speech system stability and performance in the presence of time-related artifacts.
@inproceedings{diener2020csl,
  title        = {{CSL-EMG\_Array}: An Open Access Corpus for EMG-to-Speech Conversion},
  author       = {Diener, Lorenz and Roustay Vishkasougheh, Mehrdad and Schultz, Tanja},
  year         = 2020,
  month        = sep,
  booktitle    = {{INTERSPEECH} 2020 - 21st Annual Conference of the International Speech
    Communication Association},
  video        = {https://www.youtube.com/watch?v=houE7c2zEko},
  doi          = {10.21437/Interspeech.2020-2859},
  abstract     = {We present a new open access corpus for the training and evaluation of
    EMG-to-Speech conversion systems based on array electromyographic recordings. The corpus is
    recorded with a recording paradigm closely mirroring realistic EMG-to-Speech usage scenarios,
    and includes evaluation data recorded from both audible as well as silent speech. The corpus
    consists of 9.5 hours of data, split into 12 sessions recorded from 8 speakers. Based on this
    corpus, we present initial benchmark results with a realistic online EMG-to-Speech conversion
    use case, both for the audible and silent speech subsets. We also present a method for
    drastically improving EMG-to-Speech system stability and performance in the presence of
    time-related artifacts.},
  code         = {https://www.uni-bremen.de/csl/forschung/lautlose-sprachkommunikation/csl-emg-array-corpus},
  url          = {https://halcy.de/cites/pdf/diener2020csl.pdf},
}
Toward Silent Paralinguistics: Speech-to-EMG - Retrieving Articulatory Muscle Activity from Speech (Catarina Botelho, Lorenz Diener, Dennis Küster, Kevin Scheck, Shahin Amiriparian, Björn W. Schuller, Tanja Schultz, Alberto Abad, Isabel Trancoso), at INTERSPEECH 2020 - 21st Annual Conference of the International Speech Communication Association, September 2020

Abstract

Electromyographic (EMG) signals recorded during speech production encode information on articulatory muscle activity and also on the facial expression of emotion, thus representing a speech-related biosignal with strong potential for paralinguistic applications. In this work, we estimate the electrical activity of the muscles responsible for speech articulation directly from the speech signal. To this end, we first perform a neural conversion of speech features into electromyographic time domain features, and then attempt to retrieve the original EMG signal from the time domain features. We propose a feed forward neural network to address the first step of the problem (speech features to EMG features) and a neural network composed of a convolutional block and a bidirectional long short-term memory block to address the second problem (true EMG features to EMG signal). We observe that four out of the five originally proposed time domain features can be estimated reasonably well from the speech signal. Further, the five time domain features are able to predict the original speech-related EMG signal with a concordance correlation coefficient of 0.663. We further compare our results with the ones achieved on the inverse problem of generating acoustic speech features from EMG features.
@inproceedings{botelho2020silent,
  title        = {Toward Silent Paralinguistics: Speech-to-EMG - Retrieving Articulatory Muscle
    Activity from Speech},
  author       = {Botelho, Catarina and Diener, Lorenz and Küster, Dennis and Scheck, Kevin and
    Amiriparian, Shahin and Schuller, Björn W. and Schultz, Tanja and Abad, Alberto and Trancoso,
    Isabel},
  year         = 2020,
  month        = sep,
  booktitle    = {{INTERSPEECH} 2020 - 21st Annual Conference of the International Speech
    Communication Association},
  doi          = {10.21437/Interspeech.2020-2926},
  abstract     = {Electromyographic (EMG) signals recorded during speech production encode
    information on articulatory muscle activity and also on the facial expression of emotion, thus
    representing a speech-related biosignal with strong potential for paralinguistic applications.
    In this work, we estimate the electrical activity of the muscles responsible for speech
    articulation directly from the speech signal. To this end, we first perform a neural conversion
    of speech features into electromyographic time domain features, and then attempt to retrieve the
    original EMG signal from the time domain features. We propose a feed forward neural network to
    address the first step of the problem (speech features to EMG features) and a neural network
    composed of a convolutional block and a bidirectional long short-term memory block to address
    the second problem (true EMG features to EMG signal). We observe that four out of the five
    originally proposed time domain features can be estimated reasonably well from the speech
    signal. Further, the five time domain features are able to predict the original speech-related
    EMG signal with a concordance correlation coefficient of 0.663. We further compare our results
    with the ones achieved on the inverse problem of generating acoustic speech features from EMG
    features.},
  url          = {https://halcy.de/cites/pdf/botelho2020silent.pdf},
}
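For reference, the concordance correlation coefficient reported above can be computed directly from paired frames of predicted and measured signals. The few lines below are a generic NumPy version of Lin's CCC, not the authors' evaluation code.

import numpy as np

def concordance_cc(prediction, reference):
    """Lin's concordance correlation coefficient between two 1-D signals."""
    x, y = np.ravel(prediction), np.ravel(reference)
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    return 2.0 * covariance / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)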
➥ Improving Fundamental Frequency Generation in EMG-to-Speech Conversion using a Quantization Approach (Lorenz Diener, Tejas Umesh, Tanja Schultz), at ASRU 2019 - IEEE Workshop on Automatic Speech Recognition and Understanding, December 2019

Abstract

We present a novel approach to generating fundamental frequency (intonation and voicing) trajectories in an EMG-to-Speech conversion Silent Speech Interface, based on quantizing the EMG-to-F0 mapping's target values and thus turning a regression problem into a recognition problem. We present this method and evaluate its performance with regard to the accuracy of the voicing information obtained as well as the performance in generating plausible intonation trajectories within voiced sections of the signal. To this end, we also present a new measure for overall F0 trajectory plausibility, the trajectory-label accuracy (TLAcc), and compare it with human evaluations. Our new F0 generation method achieves a significantly better performance than a baseline approach in terms of voicing accuracy, correlation of voiced sections, trajectory-label accuracy and, most importantly, human evaluations.
@inproceedings{diener2019improving,
  title        = {Improving Fundamental Frequency Generation in EMG-to-Speech Conversion using a
    Quantization Approach},
  author       = {Diener, Lorenz and Umesh, Tejas and Schultz, Tanja},
  year         = 2019,
  month        = dec,
  booktitle    = {{ASRU} 2019 - IEEE Workshop on Automatic Speech Recognition and Understanding},
  doi          = {10.1109/ASRU46091.2019.9003804},
  abstract     = {We present a novel approach to generating fundamental frequency (intonation and
    voicing) trajectories in an EMG-to-Speech conversion Silent Speech Interface, based on
    quantizing the EMG-to-F0 mapping's target values and thus turning a regression problem into a
    recognition problem. We present this method and evaluate its performance with regard to the
    accuracy of the voicing information obtained as well as the performance in generating plausible
    intonation trajectories within voiced sections of the signal. To this end, we also present a new
    measure for overall F0 trajectory plausibility, the trajectory-label accuracy (TLAcc), and
    compare it with human evaluations. Our new F0 generation method achieves a significantly better
    performance than a baseline approach in terms of voicing accuracy, correlation of voiced
    sections, trajectory-label accuracy and, most importantly, human evaluations.},
  url          = {https://halcy.de/cites/pdf/diener2019improving.pdf},
}
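The core trick of the paper above, recasting F0 regression as classification over discrete target bins, can be sketched in a few lines. The bin count and log-spaced bin edges below are illustrative assumptions, not the exact configuration used in the paper.

import numpy as np

def quantize_f0(f0_hz, n_bins=32, f_min=60.0, f_max=400.0):
    """Map F0 values in Hz to class labels; unvoiced frames (F0 <= 0) become class 0."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    edges = np.geomspace(f_min, f_max, n_bins + 1)          # log-spaced bin edges
    labels = np.clip(np.digitize(f0_hz, edges), 1, n_bins)  # classes 1..n_bins
    labels[f0_hz <= 0] = 0                                   # dedicated unvoiced class
    return labels

def dequantize_f0(labels, n_bins=32, f_min=60.0, f_max=400.0):
    """Map class labels back to F0 values at the geometric center of each bin."""
    labels = np.asarray(labels)
    edges = np.geomspace(f_min, f_max, n_bins + 1)
    centers = np.sqrt(edges[:-1] * edges[1:])
    f0 = np.zeros(labels.shape, dtype=float)
    voiced = labels > 0
    f0[voiced] = centers[labels[voiced] - 1]
    return f0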
Generating natural, intelligible speech from brain activity in motor, premotor, and inferior frontal cortices (Christian Herff, Lorenz Diener, Miguel Angrick, Emily Mugler, Matthew C. Tate, Matthew A. Goldrick, Dean J. Krusienski, Marc W. Slutzky, Tanja Schultz), in Frontiers in Neuroscience, volume 13, pages 1267, November 2019

Abstract

Neural interfaces that directly produce intelligible speech from brain activity would allow people with severe impairment from neurological disorders to communicate more naturally. Here, we record neural population activity in motor, premotor and inferior frontal cortices during speech production using electrocorticography (ECoG) and show that ECoG signals alone can be used to generate intelligible speech output that can preserve conversational cues. To produce speech directly from neural data, we adapted a method from the field of speech synthesis called unit selection, in which units of speech are concatenated to form audible output. In our approach, which we call Brain-To-Speech, we chose subsequent units of speech based on the measured ECoG activity to generate audio waveforms directly from the neural recordings. Brain-To-Speech employed the user's own voice to generate speech that sounded very natural and included features such as prosody and accentuation. By investigating the brain areas involved in speech production separately, we found that speech motor cortex provided more information for the reconstruction process than the other cortical areas.
@article{herff2019generating,
  title        = {Generating natural, intelligible speech from brain activity in motor, premotor,
    and inferior frontal cortices},
  author       = {Herff, Christian and Diener, Lorenz and Angrick, Miguel and Mugler, Emily and
    Tate, Matthew C and Goldrick, Matthew A and Krusienski, Dean J and Slutzky, Marc W and Schultz,
    Tanja},
  year         = 2019,
  month        = nov,
  journal      = {Frontiers in neuroscience},
  publisher    = {Frontiers Media SA},
  volume       = 13,
  pages        = 1267,
  doi          = {10.3389/fnins.2019.01267},
  abstract     = {Neural interfaces that directly produce intelligible speech from brain activity
    would allow people with severe impairment from neurological disorders to communicate more
    naturally. Here, we record neural population activity in motor, premotor and inferior frontal
    cortices during speech production using electrocorticography (ECoG) and show that ECoG signals
    alone can be used to generate intelligible speech output that can preserve conversational cues.
    To produce speech directly from neural data, we adapted a method from the field of speech
    synthesis called unit selection, in which units of speech are concatenated to form audible
    output. In our approach, which we call Brain-To-Speech, we chose subsequent units of speech
    based on the measured ECoG activity to generate audio waveforms directly from the neural
    recordings. Brain-To-Speech employed the user's own voice to generate speech that sounded very
    natural and included features such as prosody and accentuation. By investigating the brain areas
    involved in speech production separately, we found that speech motor cortex provided more
    information for the reconstruction process than the other cortical areas.},
  url          = {https://halcy.de/cites/pdf/herff2019generating.pdf},
}
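As a rough, toy-level illustration of the unit-selection idea described above (nearest-neighbour target cost only, ignoring the join costs and voice-specific details of the published Brain-To-Speech pipeline), selection and concatenation can be written as:

import numpy as np

def select_and_concatenate(test_features, train_features, train_audio_units):
    """For each neural feature frame, pick the closest training unit and concatenate its audio."""
    output = []
    for frame in test_features:
        distances = np.linalg.norm(train_features - frame, axis=1)  # distance to every stored unit
        output.append(train_audio_units[int(np.argmin(distances))])  # take the best unit's audio
    return np.concatenate(output)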
Towards Restoration of Articulatory Movements: Functional Electrical Stimulation of Orofacial Muscles (Tanja Schultz, Miguel Angrick, Lorenz Diener, Dennis Küster, Moritz Meier, Dean Krusienski, Christian Herff, Jonathan Brumberg), at EMBC 2019 – 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society, July 2019

Abstract

Millions of individuals suffer from impairments that significantly disrupt or completely eliminate their ability to speak. An ideal intervention would restore one's natural ability to physically produce speech. Recent progress has been made in decoding speech-related brain activity to generate synthesized speech. Our vision is to extend these recent advances toward the goal of restoring physical speech production using decoded speech-related brain activity to modulate the electrical stimulation of the orofacial musculature involved in speech. In this pilot study we take a step toward this vision by investigating the feasibility of stimulating orofacial muscles during vocalization in order to alter acoustic production. The results of our study provide necessary foundation for eventual orofacial stimulation controlled directly from decoded speech-related brain activity.
@inproceedings{schultz2019towards,
  title        = {Towards Restoration of Articulatory Movements: Functional Electrical Stimulation
    of Orofacial Muscles},
  author       = {Schultz, Tanja and Angrick, Miguel and Diener, Lorenz and Küster, Dennis and
    Meier, Moritz and Krusienski, Dean and Herff, Christian and Brumberg, Jonathan},
  year         = 2019,
  month        = jul,
  booktitle    = {{EMBC} 2019 -- 41st Annual International Conference of the IEEE Engineering in
    Medicine and Biology Society},
  pages        = {3111--3114},
  doi          = {10.1109/EMBC.2019.8857670},
  issn         = {1557-170X},
  keywords     = {brain;muscle;neuromuscular stimulation;speech coding;speech synthesis;decoding
    speech-related brain activity;physical speech production;decoded speech-related brain
    activity;eventual orofacial stimulation;functional electrical stimulation;synthesized speech
    generation;physical speech restoration;electrical stimulation;orofacial muscles
    stimulation;acoustic production;articulatory movement
    restoration;Muscles;Production;Electromyography;Spectrogram;Correlation;Electrodes;Brain},
  abstract     = {Millions of individuals suffer from impairments that significantly disrupt or
    completely eliminate their ability to speak. An ideal intervention would restore one's natural
    ability to physically produce speech. Recent progress has been made in decoding speech-related
    brain activity to generate synthesized speech. Our vision is to extend these recent advances
    toward the goal of restoring physical speech production using decoded speech-related brain
    activity to modulate the electrical stimulation of the orofacial musculature involved in speech.
    In this pilot study we take a step toward this vision by investigating the feasibility of
    stimulating orofacial muscles during vocalization in order to alter acoustic production. The
    results of our study provide necessary foundation for eventual orofacial stimulation controlled
    directly from decoded speech-related brain activity.},
  url          = {https://halcy.de/cites/pdf/schultz2019towards.pdf},
}
➥ Session-Independent Array-Based EMG-to-Speech Conversion using Convolutional Neural Networks (Lorenz Diener, Gerrit Felsch, Miguel Angrick, Tanja Schultz), at 13th ITG Conference on Speech Communication, October 2018

Abstract

This paper presents an evaluation of the performance of EMG-to-Speech conversion based on convolutional neural networks. We present an analysis of two different architectures and network design considerations and evaluate CNN-based systems for their within-session and cross-session performance. We find that they are able to perform on par with feedforward neural networks when trained and evaluated on a single session and outperform them in cross session evaluations.
@inproceedings{diener2018session,
  title        = {Session-Independent Array-Based EMG-to-Speech Conversion using Convolutional
    Neural Networks},
  author       = {Diener, Lorenz and Felsch, Gerrit and Angrick, Miguel and Schultz, Tanja},
  year         = 2018,
  month        = oct,
  booktitle    = {13th {ITG} Conference on Speech Communication},
  isbn         = {978-3-8007-4767-2},
  abstract     = {This paper presents an evaluation of the performance of EMG-to-Speech conversion
    based on convolutional neural networks. We present an analysis of two different architectures
    and network design considerations and evaluate CNN-based systems for their within-session and
    cross-session performance. We find that they are able to perform on par with feedforward neural
    networks when trained and evaluated on a single session and outperform them in cross session
    evaluations.},
  url          = {https://halcy.de/cites/pdf/diener2018session.pdf},
}
➥ A comparison of EMG-to-Speech Conversion for Isolated and Continuous Speech (Lorenz Diener, Sebastian Bredehöft, Tanja Schultz), at 13th ITG Conference on Speech Communication, October 2018

Abstract

This paper presents initial results of performing EMG-to-Speech conversion within our new EMG-to-Speech corpus. This new corpus consists of parallel facial array sEMG and read audible speech signals recorded from multiple speakers. It contains different styles of utterances - continuous sentences, isolated words, and isolated consonant-vowel combinations - which allows us to evaluate the performance of EMG-to-Speech conversion when trying to convert these different styles of utterance as well as the effect of training systems on one style to convert another. We find that our system deals with isolated-word/consonant-vowel utterances better than with continuous speech. We also find that it is possible to use a model trained on one style to convert utterances from another - however, performance suffers compared to training within that style, especially when going from isolated to continuous speech.
@inproceedings{diener2018comparison,
  title        = {A comparison of EMG-to-Speech Conversion for Isolated and Continuous Speech},
  author       = {Diener, Lorenz and Bredehöft, Sebastian and Schultz, Tanja},
  year         = 2018,
  month        = oct,
  booktitle    = {13th {ITG} Conference on Speech Communication},
  isbn         = {978-3-8007-4767-2},
  abstract     = {This paper presents initial results of performing EMG-to-Speech conversion within
    our new EMG-to-Speech corpus. This new corpus consists of parallel facial array sEMG and read
    audible speech signals recorded from multiple speakers. It contains different styles of
    utterances - continuous sentences, isolated words, and isolated consonant-vowel combinations -
    which allows us to evaluate the performance of EMG-to-Speech conversion when trying to convert
    these different styles of utterance as well as the effect of training systems on one style to
    convert another. We find that our system deals with isolated-word/consonant-vowel utterances
    better than with continuous speech. We also find that it is possible to use a model trained on
    one style to convert utterances from another - however, performance suffers compared to training
    within that style, especially when going from isolated to continuous speech.},
  url          = {https://halcy.de/cites/pdf/diener2018comparison.pdf},
}
➥ Investigating Objective Intelligibility in Real-Time EMG-to-Speech Conversion (Lorenz Diener, Tanja Schultz), at INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, September 2018

Abstract

This paper presents an analysis of the influence of various system parameters on the output quality of our neural network based real-time EMG-to-Speech conversion system. This EMG-to-Speech system allows for the direct conversion of facial surface electromyographic signals into audible speech in real time, allowing for a closed-loop setup where users get direct audio feedback. Such a setup opens new avenues for research and applications through co-adaptation approaches. In this paper, we evaluate the influence of several parameters on the output quality, such as time context, EMG-Audio delay, network-, training data- and Mel spectrogram size. The resulting output quality is evaluated based on the objective output quality measure STOI.
@inproceedings{diener2018investigating,
  title        = {Investigating Objective Intelligibility in Real-Time EMG-to-Speech Conversion},
  author       = {Diener, Lorenz and Schultz, Tanja},
  year         = 2018,
  month        = sep,
  booktitle    = {{INTERSPEECH} 2018 -- 19th Annual Conference of the International Speech
    Communication Association},
  abstract     = {This paper presents an analysis of the influence of various system parameters on
    the output quality of our neural network based real-time EMG-to-Speech conversion system. This
    EMG-to-Speech system allows for the direct conversion of facial surface electromyographic
    signals into audible speech in real time, allowing for a closed-loop setup where users get
    direct audio feedback. Such a setup opens new avenues for research and applications through
    co-adaptation approaches. In this paper, we evaluate the influence of several parameters on the
    output quality, such as time context, EMG-Audio delay, network-, training data- and Mel
    spectrogram size. The resulting output quality is evaluated based on the objective output
    quality measure STOI.},
  url          = {https://halcy.de/cites/pdf/diener2018investigating.pdf},
}
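Objective intelligibility scores of the kind used above can be reproduced with an off-the-shelf STOI implementation. The snippet below uses the pystoi and soundfile packages with placeholder file names; it is a generic example, not the evaluation script from the paper.

import soundfile as sf
from pystoi import stoi

reference, fs = sf.read("reference_speech.wav")         # placeholder: clean target recording
converted, fs2 = sf.read("emg_converted_speech.wav")    # placeholder: EMG-to-Speech output
assert fs == fs2, "sampling rates must match"

length = min(len(reference), len(converted))            # crude length alignment
print("STOI:", stoi(reference[:length], converted[:length], fs, extended=False))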
EMG-to-Speech: Direct Generation of Speech From Facial Electromyographic Signals (Matthias Janke, Lorenz Diener), in TASLP – IEEE/ACM Transactions on Audio, Speech and Language Processing, volume 25, number 12, pages 2375–2385, November 2017

Abstract

Silent speech interfaces are systems that enable speech communication even when an acoustic signal is unavailable. Over the last years, public interest in such interfaces has intensified. They provide solutions for some of the challenges faced by today's speech-driven technologies, such as robustness to noise and usability for people with speech impediments. In this paper, we provide an overview over our silent speech interface. It is based on facial surface electromyography (EMG), which we use to record the electrical signals that control muscle contraction during speech production. These signals are then converted directly to an audible speech waveform, retaining important paralinguistic speech cues for information such as speaker identity and mood. This paper gives an overview over our state-of-the-art direct EMG-to-speech transformation system. This paper describes the characteristics of the speech EMG signal, introduces techniques for extracting relevant features, presents different EMG-to-speech mapping methods, and finally, presents an evaluation of the different methods for real-time capability and conversion quality.
@article{janke2017emg,
  title        = {EMG-to-Speech: Direct Generation of Speech From Facial Electromyographic Signals},
  author       = {Janke, Matthias and Diener, Lorenz},
  year         = 2017,
  month        = nov,
  day          = 23,
  journal      = {{TASLP} -- {IEEE/ACM} Transactions on Audio, Speech and Language Processing},
  volume       = 25,
  number       = 12,
  pages        = {2375--2385},
  doi          = {10.1109/TASLP.2017.2738568},
  abstract     = {Silent speech interfaces are systems that enable speech communication even when an
    acoustic signal is unavailable. Over the last years, public interest in such interfaces has
    intensified. They provide solutions for some of the challenges faced by today's speech-driven
    technologies, such as robustness to noise and usability for people with speech impediments. In
    this paper, we provide an overview over our silent speech interface. It is based on facial
    surface electromyography (EMG), which we use to record the electrical signals that control
    muscle contraction during speech production. These signals are then converted directly to an
    audible speech waveform, retaining important paralinguistic speech cues for information such as
    speaker identity and mood. This paper gives an overview over our state-of-the-art direct
    EMG-to-speech transformation system. This paper describes the characteristics of the speech EMG
    signal, introduces techniques for extracting relevant features, presents different EMG-to-speech
    mapping methods, and finally, presents an evaluation of the different methods for real-time
    capability and conversion quality.},
  url          = {https://halcy.de/cites/pdf/janke2017emg.pdf},
}
Bremen Big Data Challenge 2017: Predicting University Cafeteria Load (Jochen Weiner, Lorenz Diener, Simon Stelter, Eike Externest, Sebastian Kühl, Christian Herff, Felix Putze, Timo Schulze, Mazen Salous, Hui Liu, Dennis Küster, Tanja Schultz), at KI 2017: Advances in Artificial Intelligence - 40th Annual German Conference on AI, September 2017

Abstract

Big data is a hot topic in research and industry. The availability of data has never been as high as it is now. Making good use of the data is a challenging research topic in all aspects of industry and society. The Bremen Big Data Challenge invites students to dig deep into big data. In this yearly event students are challenged to use the month of March to analyze a big dataset and use the knowledge they gained to answer a question. In this year's Bremen Big Data Challenge students were challenged to predict the load of the university cafeteria from the load of past years. The best of 24 teams predicted the load with a root mean squared error of 8.6 receipts issued in five minutes, with a fusion system based on the top 5 entries achieving an even better result of 8.28.
@inproceedings{weiner2017bremen,
  title        = {Bremen Big Data Challenge 2017: Predicting University Cafeteria Load},
  author       = {Weiner, Jochen and Diener, Lorenz and Stelter, Simon and Externest, Eike and
    K{\"u}hl, Sebastian and Herff, Christian and Putze, Felix and Schulze, Timo and Salous, Mazen
    and Liu, Hui and K{\"u}ster, Dennis and Schultz, Tanja},
  year         = 2017,
  month        = sep,
  booktitle    = {{KI} 2017: Advances in Artificial Intelligence - 40th Annual German Conference on
    AI},
  publisher    = {Springer International Publishing},
  address      = {Cham},
  pages        = {380--386},
  doi          = {10.1007/978-3-319-67190-1_35},
  isbn         = {978-3-319-67190-1},
  editor       = {Kern-Isberner, Gabriele and F{\"u}rnkranz, Johannes and Thimm, Matthias},
  abstract     = {Big data is a hot topic in research and industry. The availability of data has
    never been as high as it is now. Making good use of the data is a challenging research topic in
    all aspects of industry and society. The Bremen Big Data Challenge invites students to dig deep
    into big data. In this yearly event students are challenged to use the month of March to analyze
    a big dataset and use the knowledge they gained to answer a question. In this year's Bremen Big
    Data Challenge students were challenged to predict the load of the university cafeteria from the
    load of past years. The best of 24 teams predicted the load with a root mean squared error of
    8.6 receipts issued in five minutes, with a fusion system based on the top 5 entries achieving
    an even better result of 8.28.},
  code         = {https://bbdc.csl.uni-bremen.de/index.php/2017},
  url          = {https://halcy.de/cites/pdf/weiner2017bremen.pdf},
}
[details] [abstract v] [bibtex v] [pdf] [doi] [isbn] [code/datasets]
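As a small aside on the numbers quoted above, the sketch below shows how the root mean squared error metric is computed and how a naive fusion by averaging several submissions can beat the best individual one. The averaging step is purely illustrative; the abstract does not specify how the challenge's actual fusion system was built.

import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, here in receipts issued per five-minute interval."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

rng = np.random.default_rng(1)
truth = rng.poisson(lam=30, size=500).astype(float)                     # toy cafeteria load series
submissions = [truth + rng.normal(0, 10, size=500) for _ in range(5)]   # five noisy predictions

individual_scores = [rmse(truth, s) for s in submissions]
fused_score = rmse(truth, np.mean(submissions, axis=0))                 # simple average of the entries
print(min(individual_scores), fused_score)                              # the average is usually better
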
Towards direct speech synthesis from ECoG: A pilot study (, , , , , ), at EMBC 2016 - 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, August 2016

Abstract

Most current Brain-Computer Interfaces (BCIs) achieve high information transfer rates using spelling paradigms based on stimulus-evoked potentials. Despite the success of these interfaces, this mode of communication can be cumbersome and unnatural. Direct synthesis of speech from neural activity represents a more natural mode of communication that would enable users to convey verbal messages in real-time. In this pilot study with one participant, we demonstrate that electrocorticography (ECoG) intracranial activity from temporal areas can be used to resynthesize speech in real-time. This is accomplished by reconstructing the audio magnitude spectrogram from neural activity and subsequently creating the audio waveform from these reconstructed spectrograms. We show that significant correlations between the original and reconstructed spectrograms and temporal waveforms can be achieved. While this pilot study uses audibly spoken speech for the models, it represents a first step towards speech synthesis from speech imagery.
@inproceedings{herff2016towards,
  title        = {Towards direct speech synthesis from ECoG: A pilot study},
  author       = {Herff, C. and Johnson, G. and Diener, L. and Shih, J. and Krusienski, D. and
    Schultz, T.},
  year         = 2016,
  month        = aug,
  booktitle    = {{EMBC} 2016 - 38th Annual International Conference of the IEEE Engineering in
    Medicine and Biology Society},
  doi          = {10.1109/EMBC.2016.7591004},
  abstract     = {Most current Brain-Computer Interfaces (BCIs) achieve high information transfer
    rates using spelling paradigms based on stimulus-evoked potentials. Despite the success of these
    interfaces, this mode of communication can be cumbersome and unnatural. Direct synthesis of
    speech from neural activity represents a more natural mode of communication that would enable
    users to convey verbal messages in real-time. In this pilot study with one participant, we
    demonstrate that electrocorticography (ECoG) intracranial activity from temporal areas can be
    used to resynthesize speech in real-time. This is accomplished by reconstructing the audio
    magnitude spectrogram from neural activity and subsequently creating the audio waveform from
    these reconstructed spectrograms. We show that significant correlations between the original and
    reconstructed spectrograms and temporal waveforms can be achieved. While this pilot study uses
    audibly spoken speech for the models, it represents a first step towards speech synthesis from
    speech imagery.},
  url          = {https://halcy.de/cites/pdf/herff2016towards.pdf},
  poster       = {https://halcy.de/cites/pdf/herff2016towards_poster.pdf},
}
[details] [abstract v] [bibtex v] [pdf] [poster] [doi]
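The general recipe described above (predict a magnitude spectrogram from neural features, then turn it into a waveform) can be sketched in a few lines. The linear least-squares mapping and the Griffin-Lim-style phase estimation below are stand-ins chosen for brevity; they are not the models used in the study, and all shapes and parameters are illustrative assumptions.

import numpy as np
from scipy.signal import stft, istft

FS, NPERSEG = 16000, 512

def spectrogram_to_waveform(mag, n_iter=32):
    """Recover a waveform from a magnitude spectrogram (freq x frames) by iterative phase estimation."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))        # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, fs=FS, nperseg=NPERSEG)     # to time domain with current phase
        _, _, spec = stft(x, fs=FS, nperseg=NPERSEG)          # and back, keeping only the new phase
        spec = spec[:, :mag.shape[1]]                         # align frame counts defensively
        spec = np.pad(spec, ((0, 0), (0, mag.shape[1] - spec.shape[1])))
        phase = np.exp(1j * np.angle(spec))
    _, x = istft(mag * phase, fs=FS, nperseg=NPERSEG)
    return x

# Toy usage: a linear least-squares mapping from stand-in neural features to spectrogram frames.
rng = np.random.default_rng(0)
neural = rng.standard_normal((200, 64))                        # 200 frames of neural features
target_mag = np.abs(rng.standard_normal((NPERSEG // 2 + 1, 200)))
W, *_ = np.linalg.lstsq(neural, target_mag.T, rcond=None)
reconstructed = spectrogram_to_waveform(np.maximum(neural @ W, 0.0).T)
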
An Initial Investigation into the Real-Time Conversion of Facial Surface EMG Signals to Audible Speech (, , , ), at EMBC 2016 - 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, August 2016

Abstract

This paper presents early-stage results of our investigations into the direct conversion of facial surface electromyographic (EMG) signals into audible speech in a real-time setting, enabling novel avenues for research and system improvement through real-time feedback. The system uses a pipeline approach to enable online acquisition of EMG data, extraction of EMG features, mapping of EMG features to audio features, synthesis of audio waveforms from audio features and output of the audio waveforms via speakers or headphones. Our system allows for performing EMG-to-Speech conversion with low latency and on a continuous stream of EMG data, enabling near instantaneous audio output during audible as well as silent speech production. In this paper, we present an analysis of our system's components for latency incurred, as well as the trade-offs between conversion quality, latency and training duration required.
@inproceedings{diener2016initial,
  title        = {An Initial Investigation into the Real-Time Conversion of Facial Surface EMG
    Signals to Audible Speech},
  author       = {Diener, L. and Herff, C. and Janke, M. and Schultz, T.},
  year         = 2016,
  month        = aug,
  booktitle    = {{EMBC} 2016 - 38th Annual International Conference of the IEEE Engineering in
    Medicine and Biology Society},
  doi          = {10.1109/EMBC.2016.7590843},
  abstract     = {This paper presents early-stage results of our investigations into the direct
    conversion of facial surface electromyographic (EMG) signals into audible speech in a real-time
    setting, enabling novel avenues for research and system improvement through real-time feedback.
    The system uses a pipeline approach to enable online acquisition of EMG data, extraction of EMG
    features, mapping of EMG features to audio features, synthesis of audio waveforms from audio
    features and output of the audio waveforms via speakers or headphones. Our system allows for
    performing EMG-to-Speech conversion with low latency and on a continuous stream of EMG data,
    enabling near instantaneous audio output during audible as well as silent speech production. In
    this paper, we present an analysis of our system's components for latency incurred, as well as
    the trade-offs between conversion quality, latency and training duration required.},
  url          = {https://halcy.de/cites/pdf/diener2016initial.pdf},
  poster       = {https://halcy.de/cites/pdf/diener2016initial_poster.pdf},
}
[details] [abstract v] [bibtex v] [pdf] [poster] [doi]
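To make the pipeline structure above concrete, here is a toy block-wise processing loop in the same spirit: acquire a block of EMG samples, extract features, map them to acoustic features with a pre-trained model, synthesize a bit of audio, and emit it. Every stage here is a deliberately simplistic placeholder; the block size, feature set, linear mapping and sine-based "synthesis" are assumptions for illustration only.

import numpy as np

BLOCK = 160   # EMG samples per processing block (hypothetical)

def acquire_blocks(emg, block=BLOCK):
    """Stand-in for online acquisition: yield fixed-size EMG blocks from a recording."""
    for start in range(0, emg.shape[0] - block + 1, block):
        yield emg[start:start + block]

def extract_features(block):
    """Per-block features: mean absolute value and power per channel."""
    return np.concatenate([np.mean(np.abs(block), axis=0), np.mean(block ** 2, axis=0)])

def map_to_audio_features(feat, W):
    """Map EMG features to acoustic features with a pre-trained linear model."""
    return feat @ W

def synthesize(audio_feat):
    """Placeholder synthesis: a short sine burst scaled by the first acoustic feature."""
    t = np.arange(BLOCK) / 16000.0
    return float(audio_feat[0]) * np.sin(2 * np.pi * 200 * t)

rng = np.random.default_rng(0)
emg_stream = rng.standard_normal((16000, 6))     # stand-in for a live 6-channel EMG stream
W = rng.standard_normal((12, 8))                 # 12 EMG features -> 8 acoustic features

output = []
for block in acquire_blocks(emg_stream):         # runs once per incoming block
    feat = extract_features(block)
    audio_feat = map_to_audio_features(feat, W)
    output.append(synthesize(audio_feat))        # would be sent to the sound card in practice
waveform = np.concatenate(output)
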
Direct Conversion from Facial Myoelectric Signals to Speech using Deep Neural Networks (, , ), at IJCNN 2015 - 2015 International Joint Conference on Neural Networks, October 2015

Abstract

This paper presents our first results using Deep Neural Networks for surface electromyographic (EMG) speech synthesis. The proposed approach enables a direct mapping from EMG signals captured from the articulatory muscle movements to the acoustic speech signal. Features are processed from multiple EMG channels and are fed into a feed forward neural network to achieve a mapping to the target acoustic speech output. We show that this approach is feasible to generate speech output from the input EMG signal and compare the results to a prior mapping technique based on Gaussian mixture models. The comparison is conducted via objective Mel-Cepstral distortion scores and subjective listening test evaluations. It shows that the proposed Deep Neural Network approach gives substantial improvements for both evaluation criteria.
@inproceedings{diener2015direct,
  title        = {Direct Conversion from Facial Myoelectric Signals to Speech using Deep Neural
    Networks},
  author       = {Diener, Lorenz and Janke, Matthias and Schultz, Tanja},
  year         = 2015,
  month        = oct,
  booktitle    = {{IJCNN} 2015 - 2015 International Joint Conference on Neural Networks},
  pages        = {1--7},
  doi          = {10.1109/IJCNN.2015.7280404},
  abstract     = {This paper presents our first results using Deep Neural Networks for surface
    electromyographic (EMG) speech synthesis. The proposed approach enables a direct mapping from
    EMG signals captured from the articulatory muscle movements to the acoustic speech signal.
    Features are processed from multiple EMG channels and are fed into a feed forward neural network
    to achieve a mapping to the target acoustic speech output. We show that this approach is
    feasible to generate speech output from the input EMG signal and compare the results to a prior
    mapping technique based on Gaussian mixture models. The comparison is conducted via objective
    Mel-Cepstral distortion scores and subjective listening test evaluations. It shows that the
    proposed Deep Neural Network approach gives substantial improvements for both evaluation
    criteria.},
  keywords     = {electromyography, silent speech interface, deep neural networks},
  url          = {https://halcy.de/cites/pdf/diener2015direct.pdf},
}
[details] [abstract v] [bibtex v] [pdf] [doi]
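A compact way to see the overall setup (though not the paper's actual architecture or data) is a small feed-forward regressor from EMG feature frames to mel-cepstral frames, scored with mel-cepstral distortion. The layer sizes, feature dimensions and random stand-in data below are assumptions for the example; scikit-learn's MLPRegressor is used purely for brevity.

import numpy as np
from sklearn.neural_network import MLPRegressor

def mel_cepstral_distortion(c_ref, c_est):
    """Frame-averaged MCD in dB, excluding the 0th (energy) coefficient."""
    diff = c_ref[:, 1:] - c_est[:, 1:]
    return float(np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 32))              # EMG feature frames (toy data)
Y = rng.standard_normal((2000, 25))              # parallel mel-cepstral frames (toy data)

net = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=200, random_state=0)
net.fit(X[:1500], Y[:1500])                      # train on parallel EMG/audio frames
print(mel_cepstral_distortion(Y[1500:], net.predict(X[1500:])))
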
Codebook Clustering for Unit Selection Based EMG-to-Speech Conversion (, , ), at INTERSPEECH 2015 - 16th Annual Conference of the International Speech Communication Association, September 2015

Abstract

This paper reports on our recent advances in using Unit Selection to directly synthesize speech from facial surface electromyographic (EMG) signals generated by movement of the articulatory muscles during speech production. We achieve a robust Unit Selection mapping by using a more sophisticated unit codebook. This codebook is generated from a set of base units using a two stage unit clustering process. The units are first clustered based on the audio and afterwards on the EMG feature vectors they cover, and a new codebook is generated using these cluster assignments. We evaluate different cluster counts for both stages and revisit our evaluation of unit sizes in light of this clustering approach. Our final system achieves a significantly better Mel-Cepstral distortion score than the Unit Selection based EMG-to-Speech conversion system from our previous work while, due to the reduced codebook size, taking less time to perform the conversion.
@inproceedings{diener2015codebook,
  title        = {Codebook Clustering for Unit Selection Based EMG-to-Speech Conversion},
  author       = {Diener, Lorenz and Janke, Matthias and Schultz, Tanja},
  year         = 2015,
  month        = sep,
  booktitle    = {{INTERSPEECH} 2015 - 16th Annual Conference of the International Speech
    Communication Association},
  pages        = {2420--2424},
  doi          = {10.21437/Interspeech.2015-523},
  abstract     = {This paper reports on our recent advances in using Unit Selection to directly
    synthesize speech from facial surface electromyographic (EMG) signals generated by movement of
    the articulatory muscles during speech production. We achieve a robust Unit Selection mapping by
    using a more sophisticated unit codebook. This codebook is generated from a set of base units
    using a two stage unit clustering process. The units are first clustered based on the audio and
    afterwards on the EMG feature vectors they cover, and a new codebook is generated using these
    cluster assignments. We evaluate different cluster counts for both stages and revisit our
    evaluation of unit sizes in light of this clustering approach. Our final system achieves a
    significantly better Mel-Cepstral distortion score than the Unit Selection based EMG-to-Speech
    conversion system from our previous work while, due to the reduced codebook size, taking less
    time to perform the conversion.},
  keywords     = {electromyography, silent speech interface, unit selection},
  url          = {https://halcy.de/cites/pdf/diener2015codebook.pdf},
}
[details] [abstract v] [bibtex v] [pdf] [doi]
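The two-stage clustering idea can be illustrated with a few lines of scikit-learn: cluster the base units by their audio features first, then cluster each audio cluster again by EMG features, and keep one prototype per final cluster as the new codebook entry. The cluster counts, feature dimensions and random data below are illustrative assumptions, not the configuration evaluated in the paper.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emg_feats = rng.standard_normal((5000, 32))      # one row per base unit
audio_feats = rng.standard_normal((5000, 25))

# Stage 1: cluster units by their audio features.
audio_clusters = KMeans(n_clusters=64, n_init=4, random_state=0).fit_predict(audio_feats)

# Stage 2: within each audio cluster, cluster by EMG features and keep one prototype per cluster.
codebook = []                                    # list of (EMG prototype, audio prototype) pairs
for a in range(64):
    idx = np.flatnonzero(audio_clusters == a)
    if len(idx) == 0:
        continue
    k = min(8, len(idx))                         # second-stage cluster count
    emg_clusters = KMeans(n_clusters=k, n_init=4, random_state=0).fit_predict(emg_feats[idx])
    for e in range(k):
        members = idx[emg_clusters == e]
        codebook.append((emg_feats[members].mean(axis=0), audio_feats[members].mean(axis=0)))

print(len(codebook))                             # reduced codebook size vs. 5000 base units
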
Improving Unit Selection based EMG-to-Speech Conversion (), Master's Thesis, July 2015

Abstract

This master’s thesis introduces a new approach to improve the unit-selection based conversion of facial myoelectric signals to audible speech. Surface electromyography is the recording of electric signals generated by muscle activity using surface electrodes attached to the skin. Past work has shown that it is feasible to generate audible speech signals from facial electromyographic activity generated during speech production, using several different approaches. This work focuses on the unit-selection approach to conversion, where the speech signal is reconstructed by concatenating pieces of target audio data selected by a similarity criterion calculated on the parallel sequence of source electromyographic data. A novel approach, based on optimizing the database that units are selected from by using unit clustering to generate more prototypical units and improve the selection process, is introduced and evaluated. In total, we obtain a qualitative improvement of up to 14.92 percent relative over a baseline unit selection system, while improving the time taken for conversion by up to 98%.
@mastersthesis{diener2015improving,
  title        = {Improving Unit Selection based EMG-to-Speech Conversion},
  author       = {Diener, Lorenz},
  year         = 2015,
  month        = jul,
  school       = {Karlsruher Institut für Technologie},
  supervisor   = {Janke, Matthias and Schultz, Tanja},
  abstract     = {This master’s thesis introduces a new approach to improve the unit-selection based
    conversion of facial myoelectric signals to audible speech. Surface electromyography is the
    recording of electric signals generated by muscle activity using surface electrodes attached to
    the skin. Past work has shown that it is feasible to generate audible speech signals from facial
    electromyographic activity generated during speech production, using several different
    approaches. This work focuses on the unit-selection approach to conversion, where the speech
    signal is reconstructed by concatenating pieces of target audio data selected by a similarity
    criterion calculated on the parallel sequence of source electromyographic data. A novel
    approach, based on optimizing the database that units are selected from by using unit clustering
    to generate more prototypical units and improve the selection process, is introduced and
    evaluated. In total, we obtain a qualitative improvement of up to 14.92 percent relative over a
    baseline unit selection system, while improving the time taken for conversion by up to 98%.},
  url          = {https://halcy.de/cites/pdf/diener2015improving.pdf},
}
[details] [abstract v] [bibtex v] [pdf]
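For completeness, the selection step itself can be sketched as a nearest-neighbour lookup: for each incoming EMG feature frame, pick the codebook unit whose EMG features are closest and concatenate that unit's audio. A full unit-selection system also adds a concatenation cost between consecutive units; that part, and all shapes below, are simplified assumptions for illustration.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
codebook_emg = rng.standard_normal((500, 32))       # EMG features of the codebook units
codebook_audio = rng.standard_normal((500, 160))    # audio samples belonging to each unit

tree = cKDTree(codebook_emg)                        # index for fast nearest-neighbour queries
incoming_emg = rng.standard_normal((300, 32))       # EMG feature frames to convert
_, selected = tree.query(incoming_emg)              # best-matching unit index per frame

synthesized = codebook_audio[selected].ravel()      # concatenate the selected units' audio
print(synthesized.shape)                            # (300 * 160,)
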
A runtime cache for interactive procedural modeling (, , , , , ), in Computers & Graphics, volume 36, number 5, pages 366–375, August 2012

Abstract

We present an efficient runtime cache to accelerate the display of procedurally displaced and textured implicit surfaces, exploiting spatio-temporal coherence between consecutive frames. We cache evaluations of implicit textures covering a conceptually infinite space. Rotating objects, zooming onto surfaces, and locally deforming shapes now requires minor cache updates per frame and benefits from mostly cached values, avoiding expensive re-evaluations. A novel parallel hashing scheme supports arbitrarily large data records and allows for an automated deletion policy: new information may evict information no longer required from the cache, resulting in an efficient usage. This sets our solution apart from previous caching techniques, which do not dynamically adapt to view changes and interactive shape modifications. We provide a thorough analysis on cache behavior for different procedural noise functions to displace implicit base shapes, during typical modeling operations.
@article{reiner2012runtime,
  title        = {A runtime cache for interactive procedural modeling},
  author       = {Reiner, Tim and Lefebvre, Sylvain and Diener, Lorenz and Garc{\'\i}a, Ismael and
    Jobard, Bruno and Dachsbacher, Carsten},
  year         = 2012,
  month        = aug,
  journal      = {Computers \& Graphics},
  publisher    = {Elsevier},
  volume       = 36,
  number       = 5,
  pages        = {366--375},
  doi          = {10.1016/j.cag.2012.03.031},
  abstract     = {We present an efficient runtime cache to accelerate the display of procedurally
    displaced and textured implicit surfaces, exploiting spatio-temporal coherence between
    consecutive frames. We cache evaluations of implicit textures covering a conceptually infinite
    space. Rotating objects, zooming onto surfaces, and locally deforming shapes now requires minor
    cache updates per frame and benefits from mostly cached values, avoiding expensive
    re-evaluations. A novel parallel hashing scheme supports arbitrarily large data records and
    allows for an automated deletion policy: new information may evict information no longer
    required from the cache, resulting in an efficient usage. This sets our solution apart from
    previous caching techniques, which do not dynamically adapt to view changes and interactive
    shape modifications. We provide a thorough analysis on cache behavior for different procedural
    noise functions to displace implicit base shapes, during typical modeling operations.},
  url          = {https://halcy.de/cites/pdf/reiner2012runtime.pdf},
}
[details] [abstract v] [bibtex v] [pdf] [doi]
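The caching idea can be conveyed with a toy CPU-side analogue (the paper's implementation is a GPU-parallel hash, which this sketch does not attempt to reproduce): expensive procedural evaluations are cached by quantized surface position in a fixed-size table, and a new entry simply evicts whatever previously occupied its slot. All names and parameters here are illustrative.

import math

class EvictingCache:
    """Fixed-size hash table; a colliding insert overwrites the previous entry."""
    def __init__(self, slots=1 << 16):
        self.slots = slots
        self.keys = [None] * slots
        self.values = [None] * slots
        self.hits = self.misses = 0

    def get_or_compute(self, key, compute):
        slot = hash(key) % self.slots
        if self.keys[slot] == key:
            self.hits += 1
            return self.values[slot]
        self.misses += 1
        value = compute(key)                                # expensive procedural evaluation
        self.keys[slot], self.values[slot] = key, value     # evict whatever was in this slot
        return value

def expensive_noise(p):
    """Stand-in for a costly procedural displacement / texture function."""
    x, y, z = p
    return math.sin(12.9898 * x + 78.233 * y + 37.719 * z) * 0.5

cache = EvictingCache()

def displaced(p, step=0.01):
    """Evaluate the noise at a position quantized to the cache grid."""
    q = (round(p[0] / step), round(p[1] / step), round(p[2] / step))
    return cache.get_or_compute(q, lambda k: expensive_noise((k[0] * step, k[1] * step, k[2] * step)))

# Consecutive "frames" query mostly the same surface points, so cache hits dominate.
for frame in range(3):
    for i in range(1000):
        displaced((i * 0.001, 0.5, frame * 0.0001))
print(cache.hits, cache.misses)
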
Procedural modeling with signed distance functions (), Bachelor's Thesis, February 2012

Abstract

Procedural modeling is the modeling of scenes using algorithms instead of explicit lists of geometry specified vertex by vertex. The implicit procedural approach to modeling has several advantages over describing scenes in an explicit fashion, such as the possibility to have levels of detail that would be impossible to store explicitly, as the memory requirements would be prohibitive – even an infinite level of detail is possible when the scene description can simply provide the detail as soon as it becomes necessary during the rendering process. It is obvious, then, that describing scenes or objects procedurally is desirable. However, while intuitively accessible modeling tools for the creation of explicit geometry abound, there are only very few and hardly any mature tools or frameworks for the procedural modeling of objects or scenes. This thesis will give an overview of the current state of procedural modeling frameworks. After explaining the theoretical concepts required for its understanding, it will go into detail about a specific type of procedural modeling – modeling with implicit surfaces, with rendering based on distance functions – and introduce a tool which can be used to accomplish this task. It will then introduce improvements made to this tool throughout the course of this thesis, including the development of a cache enabling the real-time use of previously prohibitively expensive noise functions, and finally discuss and summarize its now extended capabilities.
@bachelorsthesis{diener2012procedural,
  title        = {Procedural modeling with signed distance functions},
  author       = {Diener, Lorenz},
  year         = 2012,
  month        = feb,
  school       = {Karlsruher Institut für Technologie},
  supervisor   = {Dachsbacher, Carsten and Reiner, Tim},
  abstract     = {Procedural modeling is the modeling of scenes using algorithms instead of explicit
    lists of geometry specified vertex by vertex. The implicit procedural approach to modeling has
    several advantages over describing scenes in an explicit fashion, such as the possibility to
    have levels of detail that would be impossible to store explicitly, as the memory requirements
    would be prohibitive – even an infinite level of detail is possible when the scene description
    can simply provide the detail as soon as it becomes necessary during the rendering process. It
    is obvious, then, that describing scenes or objects procedurally is desirable. However, while
    intuitively accessible modeling tools for the creation of explicit geometry abound, there are
    only very few and hardly any mature tools or frameworks for the procedural modeling of objects
    or scenes. This thesis will give an overview of the current state of procedural modeling
    frameworks. After explaining the theoretical concepts required for its understanding, it will go
    into detail about a specific type of procedural modeling – modeling with implicit surfaces, with
    rendering based on distance functions – and introduce a tool which can be used to accomplish
    this task. It will then introduce improvements made to this tool throughout the course of this
    thesis, including the development of a cache enabling the real-time use of previously
    prohibitively expensive noise functions, and finally discuss and summarize its now extended
    capabilities.},
  url          = {https://halcy.de/cites/pdf/diener2012procedural.pdf},
}
[details] [abstract v] [bibtex v] [pdf]
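As background for readers unfamiliar with the technique, here is a generic, textbook-style sketch of rendering with signed distance functions via sphere tracing. It is not the tool developed in the thesis; the scene, displacement and camera setup are arbitrary illustrative choices.

import numpy as np

def scene_sdf(p):
    """Signed distance to a unit sphere at the origin, displaced by a simple ripple."""
    sphere = np.linalg.norm(p) - 1.0
    displacement = 0.05 * np.sin(8.0 * p[0]) * np.sin(8.0 * p[1]) * np.sin(8.0 * p[2])
    return sphere + displacement

def sphere_trace(origin, direction, max_steps=128, eps=1e-4, max_dist=20.0):
    """March along the ray, stepping by the SDF value, until the surface is hit or the ray escapes."""
    t = 0.0
    for _ in range(max_steps):
        d = scene_sdf(origin + t * direction)
        if d < eps:
            return t                         # hit: distance along the ray
        t += d
        if t > max_dist:
            break
    return None                              # miss

# Render a tiny depth image by shooting one ray per pixel.
width = height = 32
image = np.zeros((height, width))
camera = np.array([0.0, 0.0, -3.0])
for y in range(height):
    for x in range(width):
        direction = np.array([x / width - 0.5, y / height - 0.5, 1.0])
        direction /= np.linalg.norm(direction)
        hit = sphere_trace(camera, direction)
        image[y, x] = hit if hit is not None else 0.0
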