In the context of crowd simulation, this work studies the relation between the parameter values of simulation techniques and the quality of the obtained trajectories, through perceptual experiments and a comparison with real crowd trajectories. To this end, the authors introduce a quality metric called ??, built on cost functions over trajectory properties taken from previous work. Simulation results were then evaluated both with the ?? metric and through subjective user evaluations, which led to similar interpretations. Finally, the authors showed the usefulness of ?? for defining the parametrization of a crowd simulation in a custom application.
The collection of emotional speech data is a time-consuming and costly endeavour. Generative networks can be applied to augment the limited audio data artificially. However, it is challenging to evaluate generated audio for its similarity to source data, as current quantitative metrics are not necessarily suited to the audio domain. With this in mind, we explore the use of a prototypical network to evaluate four classes of generated emotional audio.
This paper describes the architecture of a trust module for the interaction between human users and virtual characters. The module is part of a larger project whose goal is to create truly realistic virtual characters capable of interacting with human users.
This paper introduces a novel framework to augment raw audio data for machine learning classification tasks.
A chapter for the Handbook of Socially Interactive Agents, written by Kathrin Janowski, Hannes Ritschel and Elisabeth Andre (Augsburg University). It is also to be published by ACM at the end of 2020 or the beginning of 2021.
Modeling adequate features of speech prosody is one key factor in good performance in affective speech classification. However, the distinction between the prosody that is induced by ‘how’ something is said (i.e., affective prosody) and the prosody that is induced by ‘what’ is being said (i.e., linguistic prosody) is neglected in state-of-the-art feature extraction systems. This results in high variability of the calculated feature values for different sentences that are spoken with the same affective intent, which might negatively impact the performance of the classification. While this distinction between different prosody types is mostly neglected in affective speech recognition, it is explicitly modeled in expressive speech synthesis to create controlled prosodic variation. In this work, we use the expressive Text-To-Speech model Global Style Token Tacotron to extract features for a speech analysis task. We show that the learned prosodic representations outperform state-of-the-art feature extraction systems in the exemplary use case of Escalation Level Classification.
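A minimal sketch of this idea, under the assumption of a pretrained GST Tacotron that exposes its reference encoder and style-token attention layer (the attribute names below are hypothetical, not the actual model interface): the style-token weights computed for an utterance serve as a fixed-length prosodic feature vector for a downstream classifier.

```python
import torch

def extract_prosodic_features(tacotron, mel_spectrogram):
    """Return GST style-token weights for one utterance as a prosody feature vector.

    `tacotron` is assumed to be a pretrained GST Tacotron exposing a reference
    encoder and a style-token attention layer (attribute names are illustrative).
    `mel_spectrogram` has shape (1, n_frames, n_mel_channels).
    """
    with torch.no_grad():
        # Encode the reference audio into a fixed-size prosody embedding.
        ref_embedding = tacotron.reference_encoder(mel_spectrogram)
        # Attention weights over the learned style tokens summarise the prosody.
        token_weights = tacotron.style_attention(ref_embedding)
    return token_weights.squeeze(0)  # shape: (n_style_tokens,)

# Downstream use: a simple classifier on top of the extracted features,
# e.g. for escalation level classification (sizes are placeholders).
classifier = torch.nn.Sequential(
    torch.nn.Linear(10, 32),  # assuming 10 style tokens
    torch.nn.ReLU(),
    torch.nn.Linear(32, 3),   # e.g. three escalation levels
)
```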
The ongoing rise of Generative Adversarial Networks is opening up the possibility of creating highly realistic, natural-looking images in various fields of application. One particular example is the generation of emotional human face images, which can be applied to diverse use cases such as automated avatar generation. However, most conditional approaches to creating such emotional faces address categorical emotional states, making smooth transitions between emotions difficult. In this work, we explore the possibilities of label interpolation in order to enhance a network that was trained on categorical emotions with the ability to generate face images showing emotions located in a continuous valence-arousal space.
The paper focuses on the behavioural changes occurring with or without haptic rendering during a navigation task in a dense crowd, as well as on potential after-effects introduced by the use of haptic rendering. The authors' objective is to provide recommendations for designing VR setups to study crowd navigation behaviour.
The authors investigate the applicability of transferring knowledge learned from large text and audio corpora to the task of automatic emotion recognition. The results show that the learned feature representations can be effectively applied for classifying emotions from spoken language.
Recent TTS systems are able to generate prosodically varied and realistic speech. However, it is unclear how this prosodic variation contributes to the perception of speakers’ emotional states. Here we use the recent psychological paradigm ‘Gibbs Sampling with People’ to search the prosodic latent space of a trained GST Tacotron model and explore prototypes of emotional prosody.
Framestore has been producing award-winning creature effects for over 20 years, with complex rigs and realistic animation being crucial elements of these creatures’ visual fidelity. The studio has a long history of building bespoke tools and technology. In this talk, we present FIRA, a machine-learning-based pipeline which extends a largely proprietary stack of simulation and rigging tools into the emerging domain of real-time workflows. FIRA allows fully simulated, render-resolution rigs to be used in previs and virtual production workflows and provides a portable, high-performance representation of a VFX deformation rig that can easily be used in different DCCs and applications.
In this paper, we design and implement FORT, a decentralized system that allows customers to prove their right to use specific services (either online or in person) without revealing sensitive information. To achieve decentralization, we propose a solution where all the data is handled by a blockchain. We describe and uniquely identify users’ rights using Non-Fungible Tokens (NFTs), and possession of these rights is demonstrated using Zero-Knowledge Proofs, cryptographic primitives that allow us to guarantee customers’ privacy.
In this work, the authors present UMANS (Unified Microscopic Agent Navigation Simulator), a freely available framework for simulating and comparing various agent-based algorithms for crowd simulation. UMANS reformulates each navigation technique as a cost function optimized in velocity space. This work serves as a technical basis for the creation and simulation of collective behaviours of virtual agents (here through interaction fields).
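As an illustration of this velocity-space formulation (a generic sketch, not the UMANS source code), an agent can choose its next velocity by sampling candidates and minimising a cost that trades off progress toward the goal against anticipated collisions; the cost terms, weights and parameters below are placeholders.

```python
import numpy as np

def preferred_velocity(position, goal, v_pref):
    # Velocity of magnitude v_pref pointing toward the goal.
    direction = goal - position
    dist = np.linalg.norm(direction)
    return v_pref * direction / dist if dist > 1e-6 else np.zeros(2)

def collision_penalty(position, velocity, neighbours, radius=0.4, horizon=5.0):
    # Penalise candidates whose closest approach to a neighbour falls below
    # the combined radius within the anticipation horizon.
    penalty = 0.0
    for n_pos, n_vel in neighbours:
        rel_p, rel_v = n_pos - position, n_vel - velocity
        t = np.clip(-rel_p.dot(rel_v) / (rel_v.dot(rel_v) + 1e-9), 0.0, horizon)
        dist = np.linalg.norm(rel_p + t * rel_v)
        if dist < 2 * radius:
            penalty += (2 * radius - dist) / (t + 1e-3)
    return penalty

def next_velocity(position, goal, v_pref, neighbours,
                  n_samples=200, w_goal=1.0, w_col=2.0):
    """Pick the sampled candidate velocity with minimal cost in velocity space.

    `neighbours` is a list of (position, velocity) pairs of nearby agents.
    The cost combines deviation from the preferred (goal-directed) velocity
    with a time-to-collision-based penalty; weights are illustrative.
    """
    v_goal = preferred_velocity(position, goal, v_pref)
    best_v, best_cost = v_goal, np.inf
    for _ in range(n_samples):
        cand = np.random.uniform(-v_pref, v_pref, size=2)
        cost = w_goal * np.linalg.norm(cand - v_goal)
        cost += w_col * collision_penalty(position, cand, neighbours)
        if cost < best_cost:
            best_v, best_cost = cand, cost
    return best_v
```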
The design and playback of natural and believable movements is a challenge for social robots, which face several limitations due to their physical embodiment and sometimes also their software. Taking the example of the expression of happiness, we present an approach for implementing parallel and independent movements for a social robot that does not have a full-fledged animation API. The technique can create more complex movement sequences than a typical sequential playback of poses and utterances and is thus better suited to the expression of affect and nonverbal behaviors.
This paper presents a framework for intuitively designing interaction fields. It allows users to draw control curves around virtual agents, from which interaction fields are obtained by interpolation. These agent-centered fields describe the velocity and orientation of surrounding agents as a function of their relative position. The authors designed this approach to facilitate the modeling of collective behaviours through interaction fields, which remains to be evaluated with users in future work.
Generative adversarial networks offer the possibility to generate deceptively real images that are almost indistinguishable from actual photographs. Such systems, however, rely on the presence of large datasets to realistically replicate the corresponding domain. This is especially a problem if not only random new images are to be generated, but specific (continuous) features are to be co-modeled. A particularly important use case in Human-Computer Interaction (HCI) research is the generation of emotional images of human faces, which can serve various purposes such as the automatic generation of avatars. The problem here lies in the availability of training data. Most suitable datasets for this task rely on categorical emotion models and therefore feature only discrete annotation labels. This greatly hinders the learning and modeling of smooth transitions between displayed affective states. To overcome this challenge, we explore the potential of label interpolation to enhance networks trained on categorical datasets with the ability to generate images conditioned on continuous features.
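A minimal sketch of the label interpolation idea, under stated assumptions: a conditional generator trained on one-hot categorical emotion labels is fed convex combinations of those labels at inference time to approximate intermediate affective states; the emotion set and the generator interface are hypothetical.

```python
import numpy as np

# Categorical emotions the generator is assumed to have been trained on.
EMOTIONS = ["neutral", "happy", "sad", "angry"]

def one_hot(emotion):
    vec = np.zeros(len(EMOTIONS), dtype=np.float32)
    vec[EMOTIONS.index(emotion)] = 1.0
    return vec

def interpolated_label(emotion_a, emotion_b, alpha):
    """Blend two categorical condition vectors; alpha in [0, 1]."""
    return (1.0 - alpha) * one_hot(emotion_a) + alpha * one_hot(emotion_b)

# Usage with a hypothetical conditional generator G(z, label):
# z = np.random.randn(latent_dim)
# for alpha in np.linspace(0.0, 1.0, 5):
#     label = interpolated_label("neutral", "happy", alpha)
#     image = G(z, label)  # images should transition smoothly between the two emotions
```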
This work presents a socially-aware robot which generates multimodal jokes for use in real-time human-robot dialogs, including appropriate prosody and non-verbal behaviors.
This work outlines a multimodal approach for augmenting generated text-based punning riddles with appropriate facial expression, gaze, prosody and laughter for a social robot.
In this paper, the authors present new software tools implemented to bring complex privacy technologies closer to developers and to facilitate the job of implementing privacy-enabled blockchain applications.
Virtual agents give humans a familiar way to interact with computers. An important component in the design of virtual agents is the voice with which they express themselves. The voice is not merely a medium for information transfer; it also carries non-verbal functions such as the transmission of emotions. Additionally, in the context of virtual agents, it is important that the user accepts the voice of the agent and considers it consistent. To make this possible, such voices need to be highly customisable and adaptable. Current systems for generating speech from text are conceptually limited by the fact that a large part of their task is to model the semantics of what is spoken. Systems in the field of voice conversion, however, are decoupled from this, as they only need to model non-verbal features. Such systems become particularly efficient when they are limited to the transformation of dedicated, single characteristics. This paper proposes that using such voice conversion systems, and furthermore exploiting the possibility to cascade them, can be an immense improvement for conventional Text-to-Speech systems for virtual agents.
This paper presents a robotic piano tutor which aims to support and motivate students with gamification, hints and feedback. It uses a screen for displaying the musical score, a MIDI keyboard for monitoring the user's play and a social robot for providing feedback.
This work introduces the concept of a visually driven approach to nonverbal communication. To this end, in an interaction between virtual agents, one agent’s body motions are analysed from the point of view of another agent. In a case study, the authors show how such an approach can improve the virtual agents’ reactivity.
Emotion and personality are interrelated. A social agent's perceived personality profile influences its affective behavior and vice versa. Having a clear idea and understanding of personality, both from a theoretical perspective and in the context of social agents, is essential for designing intelligent and affective agents. This also includes adaptation to the individual user's needs and preferences, which can be driven by explicit or implicit user feedback to create engaging interactions in the long run. This paper provides a literature overview on how to implement personality for an embodied agent. After presenting personality and personality attraction related theories, we show how personality is conveyed multimodally in current implementations of social agents. Furthermore, adaptation approaches are surveyed, which are used to shape the behavior according to the user's preferences.
This study focuses on proximity to virtual walkers whose gender could be recognised from motion only, since previous studies using point-light displays found that walking motion is rich in gender cues.
The authors propose a controllable process to assist developers and artists in placing cinematographic cameras and camera paths throughout complex virtual environments, a task that until now has often been performed manually.
This paper proposes an improvement to the inner product argument of Bootle et al. (EUROCRYPT’16). The new argument replaces the unstructured common reference string (the commitment key) with a structured one.
In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sampling paradigm to explore this speaker latent space. We show that users can create voices that fit well to photos of faces, art portraits, and cartoons. We recruit online participants to collectively manipulate the voice of a speaking face. We show that (1) a separate group of human raters confirms that the created voices match the faces, (2) speaker gender apparent from the face is well-recovered in the voice, and (3) people consistently move towards the real voice prototype for the given face. Our results demonstrate that this technology can be applied in a wide range of applications, including character voice development in audiobooks and games, personalized speech assistants, and individual voices for people with speech impairment.
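A schematic of such a human-in-the-loop search over the speaker embedding space (a sketch only; all function names are placeholders, not the actual SpeakerNet or TTS interfaces): at each iteration one embedding dimension is varied, a participant picks the synthesised sample that best matches the target face, and the chosen value is kept before moving on to the next dimension.

```python
import numpy as np

def sample_with_people(tts, embedding, utterance, ask_participant,
                       n_iterations=50, n_candidates=7, step=1.5):
    """Iteratively refine a speaker embedding using participant choices.

    `tts(utterance, embedding)` synthesises speech for the given embedding,
    and `ask_participant(audios)` returns the index of the preferred sample;
    both are placeholders standing in for the synthesis and crowdsourcing steps.
    """
    embedding = embedding.copy()
    dims = embedding.shape[0]
    for it in range(n_iterations):
        d = it % dims  # cycle through embedding dimensions
        # Propose candidate values along the current dimension.
        values = embedding[d] + np.linspace(-step, step, n_candidates)
        audios = []
        for v in values:
            cand = embedding.copy()
            cand[d] = v
            audios.append(tts(utterance, cand))
        # Keep the value the participant judges to best match the target face.
        embedding[d] = values[ask_participant(audios)]
    return embedding
```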