AI Models Ranked By Hallucinations: ChatGPT is Best, Palm-Chat Needs to Sober Up
Vectara has launched a leaderboard evaluating top AI chatbots on their ability to minimize 'hallucinations,' providing vital insights into the behavior of major large language models.
In a notable development, Vectara has released a leaderboard that ranks leading AI chatbots by their propensity to 'hallucinate.' The ranking is both intriguing and important, shedding light on the hallucination tendencies of prominent large language models (LLMs). Understanding what the ranking measures, why it matters, and how it is scored is essential context for anyone following AI development.
AI chatbots have increasingly been scrutinized for their inclination to 'hallucinate' - that is, to fabricate information to bridge knowledge gaps. The issue came to the forefront in a striking incident involving the law firm Levidow, Levidow & Oberman, which faced controversy after submitting fictitious judicial opinions, complete with counterfeit quotes and citations, generated by the AI tool ChatGPT. One fabricated decision, Martinez v. Delta Air Lines, bore a surface resemblance to an authentic opinion but unraveled under meticulous examination, revealing nonsensical passages.
Understanding AI Hallucinations
AI hallucinations stem from the underlying design of generative AI models. Lacking genuine understanding, these models function as statistical systems, predicting the next word or other token based on patterns learned from vast datasets, usually scraped from the public web. This approach can yield outputs that are grammatically fluent yet nonsensical or factually wrong, because the models cannot distinguish accurate associations from inaccurate ones.
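The mechanism described above can be illustrated with a deliberately tiny sketch: a bigram model that only learns which word tends to follow which, with no notion of truth. The corpus, the greedy decoding, and the outputs are all hypothetical simplifications, not how any production LLM actually works, but they show how purely statistical generation can produce a fluent sentence that happens to be false.

```python
from collections import defaultdict

# Toy corpus (hypothetical): the model only learns which word tends to
# follow which, not whether the resulting statements are true.
corpus = (
    "the capital of france is paris . "
    "the capital of spain is madrid . "
    "the population of france is large . "
).split()

# Build bigram counts: next-word frequencies for each word.
bigrams = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(start, length):
    """Greedily pick the most frequent next word at each step."""
    words = [start]
    for _ in range(length):
        followers = bigrams.get(words[-1])
        if not followers:
            break
        words.append(max(followers, key=followers.get))
    return " ".join(words)

print("the " + generate("population", 5))
```

Starting from "population", the most statistically likely continuation in this toy corpus is "of france is paris ." - a grammatical sentence and a confident-sounding hallucination, since nothing in the model penalizes false statements, only improbable word sequences.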
Research indicates that while all AI models are prone to hallucinations, some perform significantly better than others. GPT-4, for instance, is the most accurate among the models tested, with higher reliability in providing factual answers compared to its predecessors and other models like Cohere's Command Model.
Strategies to Mitigate Hallucinations
Experts suggest various methods to reduce AI hallucinations. Vu Ha from the Allen Institute for Artificial Intelligence emphasizes the importance of training and deploying LLMs with high-quality knowledge bases to ensure accuracy. Another approach is reinforcement learning from human feedback (RLHF), a method OpenAI uses to train models like GPT-4, in which human annotators rank the LLM's outputs and those rankings are used to refine the model's responses.
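The ranking step of RLHF is often formalized as training a reward model on pairwise preferences: the reward of the preferred answer should exceed that of the rejected one. The sketch below is a heavily simplified, hypothetical illustration of that idea - a one-parameter linear reward model fitted with a logistic (Bradley-Terry-style) preference loss - not OpenAI's actual training setup.

```python
import math

# Hypothetical preference data: each pair holds a single feature
# (say, a "factuality score") for the preferred and rejected answer.
pairs = [(0.9, 0.2), (0.8, 0.3), (0.7, 0.5)]

w = 0.0   # reward-model weight: reward(x) = w * x
lr = 0.5  # learning rate

def loss(w):
    # Logistic preference loss: -log sigmoid(reward_win - reward_lose),
    # summed over all annotated pairs.
    return sum(math.log(1 + math.exp(-w * (a - b))) for a, b in pairs)

for _ in range(200):
    # Gradient of the loss with respect to w, derived analytically.
    grad = sum(-(a - b) / (1 + math.exp(w * (a - b))) for a, b in pairs)
    w -= lr * grad

# After fitting, preferred answers receive higher reward than rejected ones.
print(w > 0, loss(w) < loss(0.0))
```

In full-scale RLHF the reward model is itself a neural network, and the chatbot is then optimized against it with a policy-gradient method; the core signal, though, is exactly this kind of pairwise human ranking.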
Despite these efforts, the complete elimination of hallucinations remains a challenge. Sebastian Berns, a PhD researcher, argues that hallucinating models can have creative benefits, producing unexpected outputs that might lead to novel ideas. The behavior becomes problematic, however, when accuracy is paramount, such as in expert-level advice or factual reporting.
The Current Landscape and Future Prospects
The current state of AI models suggests a trade-off between utility and the risk of hallucinations. While all models, including popular ones like Bard, exhibit this tendency, the degree to which they do so varies. Cohere's model, for example, was found to be the worst performer in one study, hallucinating at a high frequency, whereas GPT-4 fared comparatively well.
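Comparisons like these typically come down to a simple metric: the fraction of a model's outputs that a judge (human or automated) flags as unfaithful to the source material. The sketch below uses entirely made-up model names and judgments to show how such a hallucination rate and ranking could be computed; it is not Vectara's actual evaluation pipeline.

```python
# Hypothetical judgments: True means the output was judged factually
# consistent with its source, False means it was flagged as a hallucination.
judgments = {
    "model_a": [True, True, True, False, True],
    "model_b": [True, False, False, True, False],
}

def hallucination_rate(results):
    """Fraction of outputs judged unfaithful to the source."""
    return sum(1 for ok in results if not ok) / len(results)

# Rank models from fewest to most hallucinations, leaderboard-style.
ranking = sorted(judgments, key=lambda m: hallucination_rate(judgments[m]))
for model in ranking:
    print(f"{model}: {hallucination_rate(judgments[model]):.0%}")
```

On this toy data, model_a hallucinates in 20% of outputs and model_b in 60%, so model_a tops the board. Real leaderboards apply the same arithmetic over thousands of judged outputs per model.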
In conclusion, while AI models like ChatGPT demonstrate impressive capabilities, their propensity to hallucinate poses significant challenges. The industry continues to develop strategies to mitigate these errors, aiming to balance the models' benefits against the risks of misinformation. As AI technology evolves, it is crucial for users and developers alike to remain aware of these limitations and approach AI outputs with a critical eye.