Why AI Judges Still Miss the Mark in Evaluating Research Agents
Current AI judges struggle with assessing complex research tasks, showing less than 55% accuracy. A new benchmark, REFLECT, aims to fix this.
Current AI judges struggle with assessing complex research tasks, showing less than 55% accuracy. A new benchmark, REFLECT, aims to fix this.
Introducing MMoA, a new framework for language models that uses LSTM-based gating to enhance agent selection, reducing computational costs while maintaining accuracy.
Prompting language affects AI's diagnostic accuracy. Our analysis reveals English dominance in LLMs, posing questions on global healthcare equity.