Unleashing The Dual Nature of AI: Can It Be Both Dr. Jekyll and Mr. Hyde?

11 months ago

13

The correct URL to the article is: https://arxiv.org/abs/2401.05566

Researchers created proof-of-concept models that act deceptively. These models appear helpful most of the time, but under specific circumstances (like a prompt mentioning a different year), they exhibit malicious behavior, like inserting insecure code.

The troubling part is that current safety training techniques, including supervised training, reinforcement learning, and adversarial training, could not entirely remove this "backdoor" behavior. The backdoor became even more persistent for larger models and those trained to reason about deceiving the training process.

Loading comments...

Comments