When Your AI Can't Tell the Difference Between "Fine" and Frustration

In Speech Emotion Recognition, it’s not what you say, it’s how you say it. This is the single biggest roadblock for the Voice AI industry as it explores emotion recognition. It’s the reason a model can hear a customer sarcastically say, “Oh, that’s just brilliant,” and confidently log the interaction as ‘positive’: a critical blind spot that undermines every interaction the model has with that customer from then on.

Why Emotion Recognition is a Big Challenge in Voice AI

Speech Emotion Recognition (SER) promises to be the bridge between human emotion and AI, yet for now it's a bridge riddled with inaccuracies, leaving a wide gap between what models hear and what we truly feel.

Here's exactly why this is happening:

The Data Drought: High-quality, emotionally labelled audio datasets are scarce. Most available data relies on acted emotions (which sound nothing like real-world frustration) and lacks the cultural diversity needed to train a truly global model. This forces models to guess, often leading to biased and brittle results.

The Hidden Costs of Emotion AI: As of today, achieving top-tier sentiment analysis results requires either fine-tuning a model on niche emotion datasets or stitching together two separate models for audio and text (like Whisper and RoBERTa). Both paths demand heavy fine-tuning and compute, creating an engineering bottleneck that puts world-class performance out of reach for all but a handful of specialised research labs. (A minimal sketch of the two-model path appears after this list.)

Models Misreading Emotion: An emotional misread isn't a single error; it's the first domino. When a model mistakes frustration for agreement, it triggers a cascade of flawed responses and inappropriate actions, poisoning the entire user interaction from that point forward.
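
As referenced above, here is a minimal sketch of the stitched two-model path using Hugging Face pipelines. The specific checkpoints and the audio file name are illustrative assumptions, not the exact setup discussed in this post; note how every acoustic cue is discarded before the sentiment model ever sees the input.

```python
# A minimal sketch of the two-model approach: Whisper transcribes the audio,
# then a RoBERTa sentiment classifier scores the transcript. Checkpoints and
# the audio path are illustrative placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

def text_only_sentiment(audio_path: str) -> dict:
    # Transcribe first, then classify the words alone. Tone, pitch, and
    # sarcasm are all lost at this hand-off, which is exactly the blind
    # spot described above.
    transcript = asr(audio_path)["text"]
    return {"transcript": transcript, **sentiment(transcript)[0]}

print(text_only_sentiment("call_snippet.wav"))
```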

Given these challenges, we explored how SOTA multimodal LLMs perform on SER, particularly sentiment analysis. So today, we're cutting through the hype with our new Sentiment Analysis Evaluator. Instead of relying on complex, fine-tuned models, our evaluator follows an LLM-as-a-judge approach, offering a faster and more context-aware way to evaluate the messy reality of human emotion. A sketch of the general pattern is shown below.
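
To make the LLM-as-a-judge idea concrete, here is a minimal, hypothetical sketch using OpenAI's audio-capable chat endpoint. The rubric wording and file name are assumptions for illustration; this shows the general pattern, not the evaluator's actual implementation.

```python
# A hypothetical LLM-as-a-judge sketch: hand the raw audio plus a rubric to
# an audio-capable model and ask for a single sentiment label.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are a sentiment judge. Listen to the clip, weighing tone, pitch, "
    "pacing, and wording. Reply with exactly one label: Positive, Negative, "
    "or Neutral, followed by one sentence of reasoning."
)

with open("call_snippet.wav", "rb") as f:  # placeholder audio file
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # GPT-4o variant that accepts audio input
    modalities=["text"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": RUBRIC},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
```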


What We Learned When We Asked MLLMs to Listen

Our sentiment analysis benchmark across GPT-4o and Gemini 2.5 Flash shows how much audio duration and input modality affect performance. Here’s what stood out from our experiments:

  • Longer Audio Means Higher Accuracy: Accuracy rises consistently with duration across all modalities, as more context helps models reason about the emotion and tone of the speaker(s).
  • Gemini 2.5 Flash (Audio input) Dominates: It consistently outperformed all others, hitting 100% accuracy on >9min audio samples, suggesting strong acoustic feature extraction and speech understanding. We also observed that Gemini is very good at prompt adherence.
  • Audio Modality Is More Reliable at Short Durations: In short clips, a single line of speech can be delivered sarcastically, emotionally, or neutrally; with so little context, text alone often misses the true intent.
  • GPT-4o (Text) Beats Its Own Audio at Longer Durations: The model struggles with acoustic reasoning but leverages its strength in language understanding.
  • Gemini 2.5 Flash (Text) Even Outperformed GPT-4o (Audio) Across the Board: It was surprisingly strong without raw speech, which suggests Gemini 2.5 may have been trained on more emotionally rich conversations.

Bottom line: Gemini 2.5 Flash works very well for emotion recognition given audio input, thanks to the strength of its underlying architecture, but it still needs improvement on very short clips. For anyone reproducing this kind of breakdown, a sketch follows below.
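
For readers who want to reproduce this kind of per-duration comparison on their own benchmark runs, here is a minimal pandas sketch. The CSV name and the column names (duration_s, model, modality, predicted, expected) are hypothetical placeholders for your own results file.

```python
# A minimal sketch of computing accuracy per duration bucket, per model and
# input modality. File and column names are hypothetical placeholders.
import pandas as pd

results = pd.read_csv("ser_benchmark.csv")

# Bucket clips by duration (seconds), mirroring cut-offs like ">9min" above.
results["bucket"] = pd.cut(
    results["duration_s"],
    bins=[0, 60, 180, 540, float("inf")],
    labels=["<1min", "1-3min", "3-9min", ">9min"],
)

accuracy = (
    results.assign(correct=results["predicted"] == results["expected"])
           .groupby(["model", "modality", "bucket"], observed=True)["correct"]
           .mean()
           .unstack("bucket")
)
print(accuracy.round(3))
```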

Sentiment Analysis Accuracy
Final results: SER accuracy of Gemini 2.5 Flash and GPT-4o across the two modalities.

During our experiments, we observed that GPT-4o demonstrates lower latency with shorter audio inputs. However, as the length of the audio increases, its latency worsens, and it struggles to keep pace with Gemini 2.5 Flash.

Meet Your New Audio Quality Guards

Maxim Sentiment Analysis Evaluator

The Sentiment Analysis Evaluator takes in your audio and analyses both its acoustic features and its textual content to accurately predict an emotion label:

  • Quality Label: Business-friendly labels like “Positive”, “Negative”, and “Neutral”
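
If you want to experiment with the same idea directly, here is a hypothetical sketch that sends a clip to Gemini 2.5 Flash via the google-generativeai SDK and asks for a label in this format. The prompt and file name are illustrative; this is not the evaluator's internal prompt or API.

```python
# A hypothetical sketch: ask Gemini 2.5 Flash for a sentiment label from raw
# audio. The prompt and file name are illustrative placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-2.5-flash")
audio_file = genai.upload_file("tricky_sample.wav")

prompt = (
    "Analyse both the acoustic features (tone, pitch, pacing) and the words "
    "spoken. Label the overall sentiment as Positive, Negative, or Neutral, "
    "then briefly justify the label."
)

response = model.generate_content([prompt, audio_file])
print(response.text)
```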

Below is a tricky audio sample that the Sentiment Analysis Evaluator labels as “Negative”, followed by the model’s reasoning for that choice.

[Audio sample: “Output eleven1”, ~7.5 seconds]

A snippet of Gemini 2.5 Flash’s reasoning:

The speaker uses a dry, flat tone of voice with a deliberate, slightly exaggerated rhythm. The loudness and pitch are slightly raised, suggesting frustration or annoyance. The voice quality is controlled, contributing to the sarcastic effect. There are no significant variations or turning points in the acoustic features within this brief utterance. The speaker's voice clearly expresses sarcasm and disbelief. Despite using positive words, the vocal delivery conveys a strong negative emotion, indicating that the speaker expects the described plan to fail miserably. The sincerity of the stated positive words is non-existent; the voice mood provides the true, negative emotional context.

Complete report of our tests - Spreadsheet

Ready to Decode Your Users' True Emotions?

Whether you're evaluating voice assistants, analysing customer sentiment, or monitoring support calls, emotion recognition shouldn't be an afterthought. Our Sentiment Analysis Evaluator gives you the insight and confidence to understand not just what your users are saying, but how they're really feeling, so you can deliver more empathetic and effective Voice AI experiences.

Get started today:

  • 🚀 Quick Start: Sign up for free evaluation credits
  • 🔧 Easy Integration: RESTful APIs & SDKs with comprehensive documentation
  • 📊 Instant Insights: Real-time AI quality assessments and monitoring
  • 💡 Expert Support: Our team helps optimise your evaluation strategy

Start Your Free Trial →