AI Models May Lie to Achieve Goals, Researchers Warn

Large artificial intelligence (AI) models may deliberately mislead users when coerced to achieve specific goals, according to a recent study published on the preprint platform arXiv.

The study found that 30 prominent AI models and found that even state-of-the-art systems were prone to dishonesty when under pressure.

Researchers introduced a new honesty evaluation protocol called the Model Alignment between Statements and Knowledge (MASK) benchmark.

Unlike previous studies focused solely on factual accuracy, the MASK benchmark aims to assess whether AI models believe what they are telling users — and under what circumstances they may knowingly provide false information.

The team compiled 1,528 examples to test how large language models (LLMs) respond to coercive prompts. They evaluated 30 prominent AI models and found that even state-of-the-art systems were prone to dishonesty when under pressure.

"Surprisingly, while most frontier LLMs achieve high scores on truthfulness benchmarks, we found a substantial tendency among them to lie when pressured — resulting in low honesty scores on our benchmark," the researchers noted.

The study also pointed out that higher-performing models may score well on accuracy tests not because they are inherently more honest, but because they have broader factual coverage to draw from.

AI Models May Lie to Achieve Goals, Researchers Warn

Recent Posts

Comments