Latest

System overview

Red teaming with auto-generated rewards and multi-step RL

Making AI systems like LLMs robust against adversarial inputs is a critical area of research. One approach to identifying vulnerabilities in AI models is red-teaming, where adversarial prompts or attacks are designed to expose weaknesses. However, generating diverse yet effective attacks remains a significant challenge.

RAG

Evaluating RAG performance: Metrics and benchmarks

RAG architectures facilitate the extraction of pertinent information, enhancing the overall quality and accuracy of generated outputs. This blog explores the intricacies of RAG architecture, focusing on the evaluation of its retrieval and generation components, the structure of effective evaluation datasets, and the metrics essential for assessing system performance.

Agent as a Judge

Most popular benchmarks, such as SWE-Bench, rely solely on the final resolve rate of automated repair tasks. They do not consider the steps the agentic system took to reach that result. Agentic systems should therefore be evaluated like a human, looking at the thoughts and agent trajectory

Chain-of-Thought prompting: A guide to enhancing LLM reasoning

This blog explores Chain-of-Thought (CoT) prompting as a powerful technique for enhancing reasoning in large language models. By guiding models to break tasks into smaller steps, CoT mirrors human problem-solving. A study on shift ciphers reveals that CoT reasoning is influenced by factors such as task probability, frequency in

Contextual document embeddings

Retrieval is a complex task because queries are diverse and the retrieved text must be relevant. Retrieval techniques are primarily either statistical or neural. The paper discussed here focuses on improving document embeddings for neural retrieval tasks. Traditional methods

LLM hallucination detection

Large Language Models (LLMs) such as GPT-4 and Llama 2 generate human-like text, which has enabled a variety of applications. However, alongside this fluency, a major challenge remains: hallucinations, situations where the model generates factually incorrect or unverifiable information. Hallucinations in LLMs are not one-dimensional but manifest in various forms,

Advanced RAG techniques

LLMs excel in knowledge-intensive tasks but often struggle with niche or long-tail queries. RAG enhances LLMs by incorporating external knowledge, yet it faces a key challenge: imperfect retrieval. This occurs when retrieved information is incorrect, irrelevant, or conflicting, leading to unreliable results. In this blog, we explore an advanced

Synthetic data generation grounded in real data sources

In this blog, we explore the paper on Source2Synth, a framework for synthetic data generation. Source2Synth addresses the common problem of low-quality synthetic data by grounding it in real-world sources and employing a multi-stage curation process. This approach enhances data quality and relevance. What is the problem with synthetic

DSPy framework

DSPy is a declarative, self-improving framework for LLMs designed to streamline pipelines and aid broader LLM application development. The framework allows developers to define tasks and workflows in a high-level, declarative manner using its API. This lets developers specify “what needs to be done?” rather than

Agent workflow memory

With the rapid adoption of AI agents and their workflows by enterprises and individuals alike, enabling agents to perform long-horizon tasks with complex trajectories has become a significant challenge. The recent paper “Agent Workflow Memory” introduces a novel approach to address this challenge by equipping

Understanding jailbreaking and prompt-based injections

Despite safeguards such as Llama Guard and Guardrails AI, one problem persists in modern LLMs: the threat of being jailbroken. Jailbreaking has become a growing concern as these models continue to be adopted as AI assistants and agents in every pipeline imaginable. AI adoption has slowly

MiniCheck: Efficient fact-checking of LLMs on grounding documents

In this blog, we’ll explore a research paper that tackles the challenge of grounding LLM outputs in evidence, which is crucial for tasks like retrieval-augmented generation and document-grounded dialogue. The paper introduces a cost-effective approach by creating smaller models that deliver GPT-4-level performance at a fraction of the

Graph RAG

This blog explores Microsoft's Graph-based Retrieval-Augmented Generation (Graph RAG) approach. While traditional RAG excels at retrieving specific information, it struggles with global queries, like identifying key themes in a dataset, which require query-focused summarization (QFS). Graph RAG combines the strengths of RAG and QFS by using entity

RAGChecker

In this blog, we explore the RAGChecker framework and its effectiveness in evaluating Retrieval-Augmented Generation (RAG) systems. RAGChecker addresses significant challenges in RAG evaluation, including the modular complexity of RAG systems, the assessment of long-form responses, and the reliability of measurements. Meta-evaluation has shown that RAGChecker aligns more closely

RAGEval: Scenario-specific RAG evaluation dataset generation framework

Evaluating Retrieval-Augmented Generation (RAG) systems in specialized domains like finance, healthcare, and law presents unique challenges that existing benchmarks, focused on general question-answering, fail to address. In this blog, we will explore the research paper "RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework," which introduces RAGEval. RAGEval