Vectara analyzed major language models, testing their accuracy on 1,000 short documents, and released the results. The benchmark measures how often an LLM introduces hallucinations when asked to summarize a document.
Another reason to hire a writer and fact-checker for your AI and avoid embarrassment like this, this, this, or this.
Updated 11/1/23
| Model | Accuracy | Hallucination Rate | Answer Rate |
|---|---|---|---|
| GPT 4 | 97.0 % | 3.0 % | 100.0 % |
| GPT 4 Turbo | 97.0 % | 3.0 % | 100.0 % |
| GPT 3.5 Turbo | 96.5 % | 3.5 % | 99.6 % |
| Llama 2 70B | 94.9 % | 5.1 % | 99.9 % |
| Llama 2 7B | 94.4 % | 5.6 % | 99.6 % |
| Llama 2 13B | 94.1 % | 5.9 % | 99.8 % |
| Cohere-Chat | 92.5 % | 7.5 % | 98.0 % |
| Cohere | 91.5 % | 8.5 % | 99.8 % |
| Anthropic Claude 2 | 91.5 % | 8.5 % | 99.3 % |
| Mistral 7B | 90.6 % | 9.4 % | 98.7 % |
| Google Palm | 87.9 % | 12.1 % | 92.4 % |
| Google Palm-Chat | 72.8 % | 27.2 % | 88.8 % |
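A quick way to read the table: Accuracy and Hallucination Rate are complements (they sum to 100 % for each model), while Answer Rate is the share of documents the model agreed to summarize at all. A minimal sketch in Python, using the figures above, confirms the relationship and sorts the models by hallucination rate (the model names and tuple layout here are just illustrative choices, not part of Vectara's release):

```python
# Each entry: model -> (accuracy %, hallucination rate %, answer rate %),
# transcribed from the table above.
results = {
    "GPT 4": (97.0, 3.0, 100.0),
    "GPT 4 Turbo": (97.0, 3.0, 100.0),
    "GPT 3.5 Turbo": (96.5, 3.5, 99.6),
    "Llama 2 70B": (94.9, 5.1, 99.9),
    "Llama 2 7B": (94.4, 5.6, 99.6),
    "Llama 2 13B": (94.1, 5.9, 99.8),
    "Cohere-Chat": (92.5, 7.5, 98.0),
    "Cohere": (91.5, 8.5, 99.8),
    "Anthropic Claude 2": (91.5, 8.5, 99.3),
    "Mistral 7B": (90.6, 9.4, 98.7),
    "Google Palm": (87.9, 12.1, 92.4),
    "Google Palm-Chat": (72.8, 27.2, 88.8),
}

# Accuracy and hallucination rate are complements for every model.
for model, (acc, halluc, _answer) in results.items():
    assert abs(acc + halluc - 100.0) < 1e-9, model

# Rank models from least to most hallucination-prone.
ranking = sorted(results, key=lambda m: results[m][1])
print(ranking[0])   # least hallucination-prone model in this table
print(ranking[-1])  # most hallucination-prone model in this table
```

Note that Answer Rate is independent of the other two columns: a model can refuse often (low Answer Rate) yet hallucinate rarely on the documents it does summarize.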
H/T The Rundown AI
