How AI Translates Emotions into Numbers
AI can perform an array of impressive tasks today, from understanding text and creating art to powering self-driving cars. But one of the most exciting fields (and our area of expertise) is detecting emotions from voice. It’s fascinating to see how something as complex and seemingly intangible as emotions can be identified using numbers, computing power and statistical analysis. With the latest developments, we can now do this with reasonably high confidence and reliability.
Let's dive into the step-by-step process of how AI can translate emotions to numbers:
Voice Sample Collection: The first step is recording high-quality and clear voice samples. The better the quality of the voice sample, the more accurate the emotion recognition can be in general.
Preprocessing: Technically speaking, an analog-to-digital converter on the device first transforms the recorded analog voice into a digital format. The digitized voice data is then preprocessed, which involves normalizing the data and eliminating background noise and other disturbances. State-of-the-art preprocessing algorithms help enhance the recording's quality, which makes it possible to use even the regular microphones on standard smartphones for the recordings.
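To make the preprocessing step more concrete, here is a minimal sketch in plain Python. It assumes the voice has already been digitized into a list of samples between -1 and 1. The function names (peak_normalize, noise_gate), the frame size and the threshold are our own illustrative choices, and a simple energy-based noise gate is only a stand-in for the far more capable noise-reduction algorithms used in practice.

```python
import math

def peak_normalize(samples, target_peak=0.9):
    """Scale the signal so its largest absolute value equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silent input: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]

def noise_gate(samples, frame_size=4, threshold=0.05):
    """Silence every frame whose RMS energy falls below the threshold."""
    gated = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        gated.extend(frame if rms >= threshold else [0.0] * len(frame))
    return gated

# Toy signal (made up for illustration): quiet background noise, then a louder burst.
signal = [0.01, -0.02, 0.01, -0.01, 0.3, -0.45, 0.4, -0.2]
cleaned = noise_gate(peak_normalize(signal))
```

In this toy example the first four low-energy samples are silenced, while the louder burst is kept and scaled to a consistent level.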
Feature Extraction: In this step, features that describe the characteristics of the voice sample are extracted. These broadly consist of prosodic features, spectral features, and voice quality features.
Prosodic Features: These pertain to pitch, intensity, and speech rate, all of which shift significantly when a person's emotional state changes. Someone expressing anger, for instance, typically speaks louder and faster.
Spectral Features: These represent the energy distribution across the frequency bands of the speech signal. They include formants and Mel-frequency cepstral coefficients (MFCCs), which give information about the vocal tract's shape and size. The shifts in voice quality that come with different emotions can be studied using these features.
Voice Quality Features: These are quality aspects of the voice, such as roughness and breathiness, that help distinguish between different emotional states.
The three categories above overlap to a certain extent, but they are a useful way to group different subsets of the overall characteristics of someone's voice.
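As a rough illustration of what "extracting features" means in code, the sketch below computes three very simple descriptors for one short frame of audio: RMS energy (a proxy for intensity), zero-crossing rate (a crude indicator of how high or noisy the sound is) and the spectral centroid (the "center of mass" of the frequency spectrum). These are deliberately simple stand-ins chosen for this example. They are not the formants or MFCCs mentioned above, which need more machinery, and the function names and the test tone are our own assumptions.

```python
import cmath
import math

def rms_energy(frame):
    """Root-mean-square amplitude, a simple proxy for loudness."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of neighboring sample pairs where the signal changes sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency, using a naive (slow) DFT."""
    n = len(frame)
    bins = range(n // 2 + 1)
    mags = [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in bins
    ]
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum((k * sample_rate / n) * m for k, m in zip(bins, mags)) / total

# Synthetic test frame: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
features = {
    "rms": rms_energy(frame),
    "zcr": zero_crossing_rate(frame),
    "centroid_hz": spectral_centroid(frame, sample_rate),
}
```

For a pure tone the spectral centroid lands at the tone's own frequency. Real speech spreads its energy over many frequencies, and it is the way this spread shifts that carries emotional information.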
Normalization and Classification: After the extraction process, these features are normalized to remove variations caused by different speakers and recording conditions. The normalized features are then used as input for a classification model, which assigns the sample to one or several of the specific emotions that the AI/ML model has been trained to identify.
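One common way to remove speaker-specific variation is to standardize each feature per speaker (a z-score), so that what remains is how a sample deviates from that speaker's own baseline. The sketch below assumes this approach purely for illustration. The function name zscore_per_speaker, the speaker IDs and the feature values are all made up.

```python
import statistics

def zscore_per_speaker(rows):
    """rows: list of (speaker_id, feature_vector). Returns the same layout, standardized per speaker."""
    by_speaker = {}
    for speaker, vec in rows:
        by_speaker.setdefault(speaker, []).append(vec)

    # Per-speaker mean and standard deviation for every feature column.
    stats = {}
    for speaker, vecs in by_speaker.items():
        columns = list(zip(*vecs))
        means = [statistics.fmean(c) for c in columns]
        stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard against zero spread
        stats[speaker] = (means, stds)

    normalized = []
    for speaker, vec in rows:
        means, stds = stats[speaker]
        normalized.append((speaker, [(x - m) / s for x, m, s in zip(vec, means, stds)]))
    return normalized

# Made-up features: (mean pitch in Hz, RMS energy). Speaker "b" has a higher, louder voice.
rows = [
    ("a", [100.0, 0.1]), ("a", [120.0, 0.2]), ("a", [140.0, 0.3]),
    ("b", [200.0, 0.2]), ("b", [240.0, 0.4]), ("b", [280.0, 0.6]),
]
normalized = zscore_per_speaker(rows)
```

After normalization the two speakers' samples line up with each other even though their raw pitch and loudness differ by a factor of two, which is the kind of variation the classifier should not have to learn around.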
Training the Model: These machine learning models are created by training them on large datasets (thousands of hours of voice data), where each voice sample is labeled with the emotion it expresses. From logistic regression to complex models like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or Transformers, these models learn the correlations between specific voice patterns and the respective emotion label.
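To show what "learning the correlations" looks like at the simplest end of that spectrum, here is a small sketch of logistic regression trained with gradient descent in plain Python. Everything in it is an assumption made for illustration: the data is synthetic (two random clusters standing in for "calm" and "angry" samples described by two normalized features), and the learning rate and number of epochs are arbitrary. Production models are trained on real labeled recordings and are far larger.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 ("angry") or 0 ("calm") for one feature vector."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Synthetic data: label 0 = "calm" (quieter, slower), label 1 = "angry" (louder, faster).
random.seed(0)
X, y = [], []
for _ in range(100):
    X.append([random.gauss(-1.0, 0.5), random.gauss(-1.0, 0.5)])
    y.append(0)
    X.append([random.gauss(1.0, 0.5), random.gauss(1.0, 0.5)])
    y.append(1)

w, b = train_logistic_regression(X, y)
train_accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
```

Note that train_accuracy here is measured on the same samples the model was trained on, so it says nothing yet about how well the model handles unseen voices.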
Prediction: After training, the models can recognize emotions even in new, unseen voice samples, provided the model generalizes successfully. If trained with a diverse dataset spanning different languages, these models can even become language-, age- and gender-independent. Getting an AI model to generalize beyond the datasets it has seen during the training phase is a difficult problem, and collectively we are only now making significant progress on it with the emergence of larger Transformer-based models.
Evaluation: The model's success is assessed using metrics like accuracy and precision in correctly recognizing emotions across different datasets. The key here is to use multiple datasets rather than a single clean one: good results across several datasets suggest more reliably that the trained AI model will generalize to the real-world data samples presented to it.
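The two metrics named above are simple to compute. The sketch below evaluates a set of predictions on two separate datasets and then averages the results. The dataset names, labels and predictions are invented for this example and are not real evaluation results.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted emotion matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, label):
    """Of all samples predicted as `label`, the share that really are `label`."""
    truths_for_predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not truths_for_predicted:
        return 0.0  # the model never predicted this label
    return sum(t == label for t in truths_for_predicted) / len(truths_for_predicted)

# Invented (true labels, predicted labels) pairs for two hypothetical datasets.
datasets = {
    "studio_recordings": (
        ["happy", "sad", "angry", "happy"],
        ["happy", "sad", "angry", "happy"],
    ),
    "phone_recordings": (
        ["happy", "sad", "angry", "sad"],
        ["happy", "angry", "angry", "sad"],
    ),
}

report = {
    name: {
        "accuracy": accuracy(y_true, y_pred),
        "precision_angry": precision(y_true, y_pred, "angry"),
    }
    for name, (y_true, y_pred) in datasets.items()
}
mean_accuracy = sum(r["accuracy"] for r in report.values()) / len(report)
```

In this made-up case the model is perfect on the clean studio set but drops to 75% accuracy on the phone recordings, which is exactly the kind of gap a single clean dataset would hide.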
Output: The ultimate output of this process is the recognized emotion, together with a score indicating the probability that the sample belongs to that emotion class or category.
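A common way to turn a classifier's raw scores into such probabilities is the softmax function. The sketch below assumes a model that produces one raw score per emotion. The label set and the raw scores are made up, and the output layout (to_output and its dictionary keys) is our own choice for illustration.

```python
import math

def softmax(raw_scores):
    """Convert raw model scores into probabilities that sum to 1."""
    highest = max(raw_scores)  # subtract the max for numerical stability
    exps = [math.exp(v - highest) for v in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]

def to_output(raw_scores, labels):
    """Pick the most probable emotion and report its probability as the score."""
    probs = softmax(raw_scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return {
        "emotion": labels[best],
        "score": probs[best],
        "all_scores": dict(zip(labels, probs)),
    }

# Hypothetical labels and raw scores for one voice sample.
labels = ["neutral", "happy", "sad", "angry"]
result = to_output([0.2, 1.5, -0.3, 2.1], labels)
```

Here result["emotion"] is "angry", with a score a little above one half. Keeping the scores for the other classes is useful too, since a close runner-up tells you the model is not very sure.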
It’s important to note that even the best models right now are not 100% accurate, and never will be, as emotions and their interpretation are subjective and depend on the person and the situation.
Interestingly though, people are not perfect at this either. When humans are tested the same way the AI models are (i.e. person A is asked to predict, from voice samples only, which of several well-defined emotional classes or states person B is in), their average accuracy has been found to be between 60% and 80%, typically around 70% for most people. This means that the average person is also far from 100% accurate in predicting other people’s emotional states.
Our latest state-of-the-art AI models at Maaind now have an average accuracy above that of the average human, matching that of very well-trained individuals who can assess other people’s emotions from the tone of voice alone. We are, therefore, on the verge of super-human accuracy in predicting emotional states from voice features alone.
Use cases today
Already today, our AI is enabling amazing experiences: when interfaces, products or experiences adjust to the user’s current emotional state, they become much more intuitive and engaging.
We are also taking the next step by building a neuroadaptive recommendation engine that can recommend content or interventions based on the emotional state the user wants to reach. By combining 1) measurement of the user's current emotional or physiological state with 2) the ability to nudge the user towards a target or desired state, we are helping to build the foundation for a new stage of human-technology interaction that is seamless, emotionally intelligent and supportive of our wellbeing.
We have pioneered bringing neuroadaptive AI from academic research to a wide range of real-world use cases. From adjusting the atmosphere in the car based on how you’re feeling, to supporting employees at the workplace with the right wellbeing content recommended at the right time, we are on a mission to enable our everyday products to be attuned to how we feel and how we want to feel.
If you want to join us on this mission to improve wellbeing, whether by collaborating, investing or even just having a chat, please reach out to anyone on the team or contact us at https://martindinov.typeform.com/to/Jtv3StjC?typeform-source=maaind.com.