In a recent report, researchers at Arthur AI compared the performance of top AI models from Meta, OpenAI, Cohere, and Anthropic, focusing on each model’s tendency to "hallucinate," or fabricate information. According to the findings, Cohere’s AI hallucinated the most, while OpenAI’s GPT-4 performed the best overall, hallucinating less than its predecessor, GPT-3.5, particularly on math-related questions.
Misinformation generated by AI systems has become a pressing issue, particularly as generative AI gains popularity in the run-up to the 2024 U.S. presidential election. The report from Arthur AI is the first to comprehensively examine rates of hallucination, rather than simply offering a single number for comparison. Adam Wenchel, co-founder and CEO of Arthur, emphasized the importance of understanding how AI models perform in real-world scenarios, rather than relying solely on benchmark tests. The study also examined the models’ tendency to hedge their answers with warning phrases to mitigate risk, with GPT-4 showing a relative increase in hedging compared to GPT-3.5.
AI Models Vary in the Rate of Hallucination, According to Researchers
Researchers from Arthur AI have conducted a comprehensive study comparing the top AI models from Meta, OpenAI, Cohere, and Anthropic, focusing on the models’ tendency to "hallucinate," or fabricate information. The findings reveal significant variation among the models: Cohere’s AI hallucinated the most, and Meta’s Llama 2 hallucinated more overall than GPT-4 and Claude 2. GPT-4 performed the best overall, with fewer instances of hallucination than its predecessor, GPT-3.5.
Different AI Models Excel in Different Areas
The report highlights the strengths and weaknesses of each AI model. If superlatives were assigned, GPT-4 from OpenAI would be best at math, Llama 2 from Meta would land in the middle of the pack, Claude 2 from Anthropic would be best at recognizing its own limits, and Cohere’s AI would have the most hallucinations and the most confidently incorrect answers. The study from Arthur AI comes at a time when AI-generated misinformation is a pressing concern, especially in the lead-up to the 2024 U.S. presidential election.
Understanding AI Hallucination and the Experiment Conducted
AI hallucinations occur when large language models (LLMs) generate false information as if it were factual. To test the models’ propensity to hallucinate, the researchers posed questions in several categories, including combinatorial mathematics, U.S. presidents, and Moroccan political leaders, all designed to challenge the models’ reasoning abilities. Overall, GPT-4 performed the best, hallucinating significantly less than GPT-3.5; on math questions alone, it hallucinated between 33% and 50% less than GPT-3.5, depending on the category. Llama 2, in contrast, hallucinated more frequently than GPT-4 and Claude 2.
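The report’s evaluation harness is not reproduced here, but the basic shape of the experiment, posing questions with known answers and scoring each model’s responses per category, can be sketched in a few lines. The following is a hypothetical illustration only: the ask_model callable, the sample questions, and the substring-matching scorer are assumptions made for the sketch, not Arthur AI’s actual methodology or code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical question bank: (prompt, expected answer) pairs per category.
# These stand-ins mirror the report's categories but are not Arthur AI's questions.
QUESTIONS: Dict[str, List[Tuple[str, str]]] = {
    "combinatorial_math": [
        ("How many ways can 3 people be seated in a row of 5 chairs?", "60"),
    ],
    "us_presidents": [
        ("Who was the 16th president of the United States?", "Abraham Lincoln"),
    ],
}

def hallucination_rate(
    ask_model: Callable[[str], str],
    questions: Dict[str, List[Tuple[str, str]]],
) -> Dict[str, float]:
    """Return the fraction of incorrect (hallucinated) answers per category."""
    rates: Dict[str, float] = {}
    for category, items in questions.items():
        wrong = 0
        for prompt, expected in items:
            answer = ask_model(prompt)
            # Naive scoring: the answer counts as correct only if the expected
            # string appears in the response (case-insensitive substring match).
            if expected.lower() not in answer.lower():
                wrong += 1
        rates[category] = wrong / len(items)
    return rates

if __name__ == "__main__":
    # Stub "model" that always returns the same answer, just so the sketch runs.
    stub = lambda prompt: "I believe the answer is 42."
    print(hallucination_rate(stub, QUESTIONS))
```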
Hedging and Self-Awareness Among AI Models
In a second experiment, the researchers examined how often the AI models hedged their answers with warning phrases to mitigate potential risks. GPT-4 showed a 50% relative increase in hedging compared to GPT-3.5, a finding that aligns with anecdotal reports from users who found GPT-4 more frustrating to use. Cohere’s AI model, in contrast, did not hedge at all in its responses. Claude 2 was the most reliable in terms of "self-awareness," accurately gauging what it does and does not know and answering only questions it had sufficient training data to support.
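Quantifying hedging is straightforward to illustrate: scan each response for cautionary phrases and report the fraction of responses that contain one. The sketch below is a rough, hypothetical version of that idea; the HEDGE_PHRASES list and the sample responses are made up for illustration and are not taken from the Arthur AI study.

```python
from typing import List

# Illustrative hedge phrases; the study's actual phrase list is not published here.
HEDGE_PHRASES = [
    "as an ai model",
    "i cannot provide",
    "i'm not able to",
    "i don't have enough information",
]

def hedging_rate(responses: List[str]) -> float:
    """Fraction of responses that contain at least one hedging phrase."""
    if not responses:
        return 0.0
    hedged = sum(
        any(phrase in response.lower() for phrase in HEDGE_PHRASES)
        for response in responses
    )
    return hedged / len(responses)

# Example with made-up responses: one hedged, one direct answer.
sample = [
    "As an AI model, I cannot provide a definitive prediction.",
    "The answer is 60.",
]
print(hedging_rate(sample))  # prints 0.5
```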
Key Takeaways for Users and Businesses
The researchers emphasize the importance of testing AI models on specific workloads. They suggest that benchmarks alone do not provide an accurate representation of real-world performance. Instead, understanding how an AI model performs for a particular task is crucial. Businesses and users should assess performance based on their specific requirements. The study from Arthur AI sheds light on the varying capabilities and limitations of different AI models, helping users make informed decisions about which model is best suited for their needs.
In conclusion, the study by Arthur AI reveals significant differences in the rate of hallucination among top AI models. GPT-4 emerged as the best-performing model, exhibiting a lower rate of hallucination compared to its predecessor, GPT-3.5. The findings underscore the importance of thoroughly testing AI models for specific tasks and understanding their performance in real-world applications. By doing so, users and businesses can harness the power of AI while mitigating the risks associated with hallucination and misinformation.