Supposed Expert Evaluations of Google Gemini Results Are, In Fact, From Non-Experts


Like other generative AI systems, [Google](https://mashable.com/category/google) [Gemini](https://mashable.com/article/google-announces-agentic-gemini-2-point-0-image-audio-support) occasionally produces [inaccurate](https://mashable.com/article/google-ai-overviews-benefit-of-doubt) responses. But sometimes the problem may stem from the testers themselves, who lack the expertise to adequately fact-check its outputs.

As reported by *TechCrunch*, the company responsible for improving Gemini's accuracy has now instructed testers to rate responses even when they lack the relevant "domain knowledge."


This raises questions about the rigor and standards Google claims to uphold in testing Gemini for accuracy. In the "Building responsibly" section of its [Gemini 2.0 announcement](https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/), Google said it is "working with trusted testers and external experts and performing extensive risk assessments and safety and assurance evaluations." While the company prioritizes evaluating responses for sensitive or harmful content, it appears to place less emphasis on inaccuracies that do not pose an inherent risk.

Google appears to sidestep the problem of hallucinations and errors by including a disclaimer: "Gemini can make mistakes, so double-check it." While this shifts responsibility to users, it overlooks the challenges faced by the human testers working behind the scenes.

Previously, GlobalLogic, a Hitachi subsidiary contracted to evaluate Gemini, directed its prompt engineers and analysts to skip responses they did not fully understand. Guidelines reviewed by *TechCrunch* stated: "If you do not have critical expertise (e.g., coding, math) to rate this prompt, please skip this task."

However, GlobalLogic revised its guidance last week, requiring testers to rate prompts even when they lack specialized domain knowledge. Testers are now instructed to "rate the parts of the prompt you understand" and to note their lack of expertise in their evaluation. In effect, expertise is no longer a prerequisite for the role.

According to *TechCrunch*, contractors may now skip prompts only if they are "completely missing information" or contain sensitive content that requires a consent form.