Workshop Theme
Research on various aspects of paralinguistic and extralinguistic
speech has gained considerable importance in recent years. On the one
hand, models have been proposed for describing and modifying voice
quality and prosody related to factors such as emotional states or
personality. Such models often start with high-intensity states (e.g.,
full-blown emotions) in clean lab speech, and are difficult to
generalise to everyday speech. On the other hand, systems have been
built to work with moderate states in real-world data, e.g. for the
recognition of speaker emotion, age, or gender. Such models often rely
on statistical methods, and are not necessarily based on any
theoretical models.
While both research traditions are obviously
valid and can be justified by their different aims, it seems worth
asking whether there is anything they can learn from each other. For
example: "Can models become more robust by incorporating methods used
for dealing with real-world data?"; "Can recognition rates be improved
by including ideas from theoretical models?"; "How would a database
need to be structured so that it can be used both for research on
model-based synthesis and for research on recognition?"
While the workshop will be open to any kind of research on paralinguistic
speech, the workshop structure will support the presentation and
creation of cross-links in several ways:
- papers with an explicit contribution to cross-linking issues will stand a higher chance of being accepted as oral papers;
- sessions and proceedings will include space for peer comments and answers from authors;
- poster sessions will be organised around cross-cutting issues rather than traditional research fields, where possible.
We therefore encourage prospective participants to place their research into a wider perspective. This can happen in many ways; as illustrations, we outline two possible approaches.
1. In application-oriented research, such as synthesis or recognition, a guiding principle could be the requirements of the "ideal" application: for example, the recognition of finely graded shades of emotions, for all speakers in all situations; or fully natural-sounding synthesis with freely specifiable expressivity; etc. This perspective is likely to highlight the hard problems of today's state of the art, and a cross-cutting perspective may lead to innovative approaches yielding concrete steps to reduce the distance towards the "ideal".
2. A second illustration of attaining a wider perspective would be to attempt to cross-link work in generative modelling (e.g., expressive speech synthesis) and analysis (e.g., recognition of expressivity from speech). Researchers on generation are invited to investigate the relevance of their work for analysis, and vice versa. What methodologies, corpora or descriptive inventories exist that could be shared between analysis and generation, or at least mapped onto each other? If certain parameters have proven to be relevant in one area, to what degree is it possible to transfer them to the other area? Issues of relevance in this area may include, among other things, personalisation, speaker dependence vs. independence, and links between voice conversion in synthesis and speaker calibration in (automatic) recognition or (human) perception.
Topics
Papers are invited in all areas related to paralinguistic speech, including, but not limited to, the following topics:
- prosody of paralinguistic speech
- voice quality and paralinguistic speech
- synthesis of paralinguistic speech (model-based, data-driven, ...)
- recognition/classification of paralinguistic properties of speech
- analysis of paralinguistic speech (acoustics, physiology, ...)
- assessment and perception of paralinguistic speech
- typology of paralinguistic speech (emotion, expression, attitude, physical states, ...)
While all papers must be related to paralinguistic speech, papers making the link with a related area, e.g. investigating the interaction of the speech signal with the meaning of the verbal content, are explicitly welcome.

