Manav Singhal

VGBench: Evaluating Vision-Language Models in Real-Time Gaming Environments

Introduction Vision-Language Models (VLMs) have achieved remarkable success in tasks such as coding and mathematical reasoning, often surpassing human performance. However, their ability to perform tasks that require human-like perception, spatial navigation, and memory management remains underexplored. To address this gap, the paper titled "VideoGameBench: Can Vision-Language Models complete

Base vs. Aligned: Why Base LLMs Might be Better at Randomness and Creativity

Introduction As large language models (LLMs) continue to improve in tasks ranging from education to enterprise automation, alignment techniques like Reinforcement Learning from Human Feedback (RLHF) have become the standard. These methods make models safer, more helpful, and generally better at following instructions. However, recent findings challenge the assumption that

From Turn 1 to Turn 10: How LLMs Get Lost In Multi-Turn Conversations

Real-world interactions between humans and LLMs are rarely single‑shot. Rather, users start with vague requests, then iterate, clarify, and refine over multiple turns. Yet most LLM benchmarks assume a fully‑specified, single‑turn setting, which differs from how people actually chat. Prior analyses of conversation logs confirm that underspecification

SuperBPE: Rethinking Tokenization for Language Models

In the domain of language models, tokenization, i.e., the process of breaking down text into manageable units, plays a pivotal role. Traditionally, models rely on subword tokenization, where words are split into smaller units. However, this approach often overlooks the semantic significance of multi-word expressions and varies across languages.

Can We Trust What AI Models Say They're Thinking? A Deep Dive into Chain-of-Thought Faithfulness

Chain-of-Thought (CoT) based reasoning has exploded across the AI landscape. Modern large language models (LLMs) like Claude 3.7 Sonnet and DeepSeek R1 no longer just give answers but also generate natural language explanations that walk through their decision-making process. This transparency isn’t just about UX but it has

The Era of Experience: Vision for the Next Frontier in AI

In their recent paper, The Era of Experience, David Silver and Richard Sutton articulate a vision for artificial intelligence that shifts from reliance on static, human-generated data to dynamic, self-generated experiential learning. This paradigm aims to propel current models beyond their limitations, developing systems capable of continuous

Building Robust Evaluation Workflows for AI Agents

Through the first two blogs (Part 1 and Part 2) of the AI agent evaluation series, we explored AI agents and the key performance metrics for evaluating them. Now, we focus on building end-to-end evaluation workflows. A structured AI evaluation process encompassing both pre-release and post-release phases is crucial for

A Survey of Agent Evaluation Frameworks: Benchmarking the Benchmarks

In recent months, we've witnessed an explosion in the development of AI agents. Autonomous systems powered by large language models (LLMs) can perform complex tasks through reasoning, planning, and tool usage. However, as the field rapidly advances, a critical question emerges: how do we effectively measure and compare

OpenAI’s BrowseComp: Redefining How We Benchmark Web-Browsing Agents

As language models become increasingly agentic, browsing the internet, reasoning across sources, and acting on user instructions, it is imperative that our methods of evaluating their capabilities evolve too. OpenAI’s BrowseComp introduces a fresh benchmark for this paradigm, offering a challenging, realistic, and carefully curated evaluation framework

Agent Evaluation: Metrics for Evaluating Agentic Workflows

This is Part 2 of our Agent Evaluations series. Here are Part 1 and Part 3 in this series. As AI agents gain traction across industries, driving innovation in tasks ranging from customer support to automated booking, their real-world performance evaluation must go beyond

Agent Evaluation: Understanding Agentic Systems and their Quality

This is Part 1 of our Agent Evaluations series. Here are Part 2 and Part 3 in this series. In today’s rapidly advancing world of artificial intelligence (AI), agentic systems are becoming an integral part of numerous industries, powering everything from customer support to robotics. But what exactly are

BrowserGym: Technical deep dive into web agent automation

The field of web automation faces significant challenges in standardizing agent development and evaluation. BrowserGym, a Gym environment for web automation tasks by ServiceNow, addresses these challenges by providing a unified framework that standardizes the development, testing, and evaluation of web agents. Its authors also designed AgentLab, a complementary

Image generated using Meta AI

Ensuring responsible AI: An overview of DeepMind’s FACTS framework

Introduction DeepMind introduces FACTS Grounding, an online leaderboard and benchmark designed to evaluate the factual accuracy of language models' long-form responses based on the context given in user prompts. The benchmark requires models to generate text grounded in a provided document, which can be up to 32,000

Image credit: Meta

Innovative training of LLMs in continuous latent spaces, by Meta AI

LLMs have made significant progress in tasks based on language processing and understanding. However, their reasoning capabilities, particularly in complex scenarios, often fall short. Traditional approaches like Chain-of-Thought (CoT) reasoning, which asks the model to reason before answering by thinking step-by-step, face inherent limitations. A recent paper introduces a novel

Inside OpenAI’s o1: Part 2

As discussed in our last blog, the o1 family is trained to think before it gives an output, mimicking the way humans deliberate for some time before answering. Beyond the traditional evaluations, we shall go through a few vibe check evals done