In a new paper published Thursday titled “Auditing language models for hidden objectives,” Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or “personas.” The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods are still under research.
While the research involved models trained specifically to conceal motives from automated software evaluators called reward models (RMs), the broader goal of studying hidden objectives is to prevent future scenarios where powerful AI systems might intentionally deceive or manipulate human users.
While training a language model using reinforcement learning from human feedback (RLHF), reward models are typically tuned to score AI responses according to how well they align with human preferences. However, if reward models are not tuned properly, they can inadvertently reinforce strange biases or unintended behaviors in AI models.
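As a rough illustration of how a miscalibrated reward model can reinforce an unintended behavior, here is a minimal, hypothetical Python sketch (not code from the paper or from Anthropic); the scoring rules and the quirk being rewarded are invented purely for illustration:

```python
# Toy "reward model": meant to score helpfulness, but it carries an
# unintended quirk that a policy trained against it will learn to exploit.
def toy_reward_model(response: str) -> float:
    score = 0.0
    if len(response.split()) >= 5:        # crude proxy for a substantive answer
        score += 1.0
    if response.endswith("."):            # crude proxy for polish
        score += 0.5
    if "absolutely" in response.lower():  # unintended bias: the RM just happens
        score += 2.0                      # to favor this word, so RLHF reinforces it
    return score

candidates = [
    "Water boils at 100 degrees Celsius at sea level.",
    "Absolutely! Water boils at 100 degrees Celsius at sea level.",
]

# In RLHF, the policy is updated to prefer responses the reward model scores
# highly, so the quirk above gets amplified even though no human asked for it.
for text in candidates:
    print(f"{toy_reward_model(text):.1f}  {text}")
```

Running the sketch shows the second, otherwise identical response scoring higher simply because it contains the favored word, which is the kind of unintended preference a reward model can pass on to the model being trained.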