Input

  • output (str): The generated text to be evaluated.
  • expectedOutput (str): The reference or ground truth text.

Output

  • Result (float): A score between 0 and 1.

Interpretation

  • Higher scores (closer to 1): indicate a higher degree of n-gram overlap between the generated text and the ground truth, suggesting better output quality.
  • Lower scores (closer to 0): indicate a lower degree of n-gram overlap between the generated text and the ground truth, suggesting poorer output quality.

Formula

The BLEU score is calculated as:

$$\text{BLEU} = \text{BP} \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

For a simplified version with bigrams (N = 2):

$$\text{BLEU} = \text{BP} \times (p_1 \times p_2)^{1/2}$$

where:
  • $p_1$ is the unigram precision:
$$p_1 = \frac{\text{number of clipped matching unigrams}}{\text{total candidate unigrams}}$$
  • $p_2$ is the bigram precision (the same calculation applied to bigrams)
  • BP is the Brevity Penalty:
$$\text{BP} = \begin{cases} \exp(1 - r/c) & \text{if } c < r \\ 1 & \text{otherwise} \end{cases}$$
  • $r$ is the reference length and $c$ is the candidate length.
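The formula above can be sketched in a few lines of Python. This is a minimal, self-contained illustration for a single reference; the names `ngrams`, `clipped_precision`, and `bleu` are chosen for this sketch and are not part of any particular library:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    # Each candidate n-gram count is clipped at its count in the reference.
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matches / total if total else 0.0

def bleu(candidate, reference, max_n=2):
    # Geometric mean of n-gram precisions with uniform weights w_n = 1/N,
    # scaled by the brevity penalty.
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; score collapses to 0
    r, c = len(reference), len(candidate)
    bp = exp(1 - r / c) if c < r else 1.0
    return bp * exp(sum(log(p) / max_n for p in precisions))
```

On the example below, `bleu(candidate, reference, max_n=1)` reproduces the unigram-only score of 4/7 ≈ 0.571; with the default bigram setting the score drops, since only 2 of 6 candidate bigrams match.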

Example Calculation:

  • Reference: “The cat sat on the mat”
  • Candidate: “A cat is sitting on the mat”
  1. Count unigrams:
    • Reference: 6 words
    • Candidate: 7 words
    • Matching: “cat”, “on”, “the”, “mat” (4 words)
    • $p_1 = 4/7 = 0.571$
  2. Calculate BP:
    • $r = 6$ (reference length)
    • $c = 7$ (candidate length)
    • Since $c > r$, BP = 1
  3. For simplicity (using only unigram precision):
    • $\text{BLEU} = 1 \times 0.571 = 0.571$
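The three steps above can be checked directly (unigram-only, and the candidate is longer than the reference, so BP = 1):

```python
reference = "The cat sat on the mat".split()
candidate = "A cat is sitting on the mat".split()

matches = ["cat", "on", "the", "mat"]  # unigrams shared with the reference
p1 = len(matches) / len(candidate)     # 4 / 7
bp = 1.0                               # c = 7 > r = 6, so no brevity penalty
print(round(bp * p1, 3))               # prints 0.571
```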

This is a Similarity Metric

Use Cases

  • Evaluating machine translation systems.
  • Assessing the quality of text summarization.
  • Measuring performance in dialogue generation.