Scientists are ready to claim they have been able to train an AI model to be deceptive and behave badly – and that once it’s done it’s almost impossible to undo, Futurism reported.
Researchers at the Google-backed AI firm Anthropic say they were able to train advanced large language models (LLMs) with “exploitable code”, which meant it could be triggered to be ‘evil’ by benign words or phrases, the story in the tech news site went on.
Anthropic’s researchers wrote in their paper that humans often engage in “strategically deceptive behaviour… [by] behaving helpfully in most situations, but then behaving very differently to pursue alternative objectives when given the opportunity.”
Read the full story: Futurism
- By Sean O’Meara
Also on AF: