# BLEU

> Measures translation quality by comparing the n-gram precision of a candidate text to reference translations, penalizing overly short outputs.

### Input

* **`output`** (str): The generated text to be evaluated.
* **`expectedOutput`** (str): The reference or ground truth text.

### Output

* **`Result`** (float): A score between 0 and 1.

## Interpretation

* **Higher scores (closer to 1)**: Indicate greater n-gram overlap between the generated text and the ground truth, suggesting better output quality.
* **Lower scores (closer to 0)**: Indicate less n-gram overlap between the generated text and the ground truth, suggesting poorer output quality.

## Formula

The BLEU score is calculated as:

$$
\text{BLEU} = \text{BP} \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$

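Here $N$ is the maximum n-gram order, $p_n$ is the clipped n-gram precision, and $w_n$ are weights that sum to 1. For a simplified version with bigrams ($N = 2$) and uniform weights $w_1 = w_2 = 1/2$: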

$$
\text{BLEU} = \text{BP} \times (p_1 \times p_2)^{1/2}
$$

where:

* $p_1$ is the unigram precision, where "clipped" means each candidate word is counted at most as many times as it appears in the reference:

$$
p_1 = \frac{\text{number of clipped matching unigrams}}{\text{total candidate unigrams}}
$$

* $p_2$ is the bigram precision, computed the same way over pairs of adjacent words.
* BP is the Brevity Penalty:

$$
\text{BP} = \exp(1 - r/c) \text{ if } c < r \text{, otherwise } \text{BP} = 1
$$

* $r$ is the reference length, and $c$ is the candidate length.
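
The definitions above can be sketched in Python. This is a minimal illustration, not the evaluator's actual implementation: the function names (`ngrams`, `clipped_precision`, `brevity_penalty`, `bleu`) and the `max_n` parameter are hypothetical. It also assumes whitespace tokenization, case-sensitive matching, a single reference, uniform weights, and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision p_n of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Credit each candidate n-gram at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def brevity_penalty(c, r):
    """BP = exp(1 - r/c) if c < r, otherwise 1."""
    if c == 0:
        return 0.0
    return math.exp(1 - r / c) if c < r else 1.0


def bleu(output, expected_output, max_n=2):
    """Sentence-level BLEU with uniform weights w_n = 1/max_n (assumed)."""
    candidate = output.split()          # assumption: whitespace tokenization
    reference = expected_output.split()
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; the geometric mean collapses to 0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


print(bleu("the cat sat", "the cat sat"))  # identical texts score 1.0
```

Because there is no smoothing in this sketch, any n-gram order with zero matches drives the whole score to 0, which is common for short sentences at higher values of `max_n`.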

### Example Calculation

* Reference: "The cat sat on the mat"
* Candidate: "A cat is sitting on the mat"

1. Count unigrams:
   * Reference: 6 words
   * Candidate: 7 words
   * Matching (clipped): "cat", "on", "the", "mat" (4 words)
   * $p_1 = 4/7 \approx 0.571$

2. Calculate BP:
   * $r = 6$ (reference length)
   * $c = 7$ (candidate length)
   * Since $c > r$, BP = 1

3. Combine, using only unigram precision for simplicity ($N = 1$, $w_1 = 1$):
   * $\text{BLEU} = 1 \times 0.571 = 0.571$
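
The same example can be carried through the simplified bigram formula ($N = 2$) from the Formula section. As a sketch, assuming whitespace tokenization: the candidate contains 6 bigrams, of which "on the" and "the mat" also occur in the reference, so

$$
p_2 = \frac{2}{6} \approx 0.333, \qquad \text{BLEU} = 1 \times (0.571 \times 0.333)^{1/2} \approx 0.436
$$

The score drops relative to the unigram-only value because only a third of the candidate's word pairs appear in the reference.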

<Note>This is a **Similarity** Metric</Note>

## Use Cases

* Evaluating machine translation systems.
* Assessing the quality of text summarization.
* Measuring performance in dialogue generation.
