Separately, AIs can engage in something called reward hacking. Because AIs don’t solve problems in the same way people do, they will invariably stumble on solutions we humans might never have anticipated, and some will subvert the intent of the system. That’s because AIs don’t think in terms of the implications, context, norms, and values we humans share and take for granted.

This reward hacking involves achieving a goal, but in a way the AI’s designers neither wanted nor intended. Take a soccer simulation where an AI figured out that if it kicked the ball out of bounds, the goalie would have to throw the ball in and leave the goal undefended. Or another simulation, where an AI figured out that instead of running, it could make itself tall enough to cross a distant finish line by falling over it. Or the robot vacuum cleaner that, instead of learning not to bump into things, learned to drive backwards, where there were no sensors telling it that it was bumping into anything. If there are problems, inconsistencies, or loopholes in the rules, and if those properties lead to an acceptable solution as defined by the rules, then AIs will find these hacks.

We learned about this hacking problem as children, with the story of King Midas. When the god Dionysus grants him a wish, Midas asks that everything he touches turn to gold. He ends up starving and miserable when his food, drink, and daughter all turn to gold. It’s a specification problem: Midas programmed the wrong goal into the system.

Genies are very precise about the wording of wishes, and can be maliciously pedantic. We know this, but there’s still no way to outsmart the genie. Whatever you wish for, he will always be able to grant it in a way you wish he hadn’t. Goals and desires are always underspecified in human language and thought. We never describe all the options, or include all the applicable caveats, exceptions, and provisos. Any goal we specify will necessarily be incomplete. While humans most often implicitly understand context and usually act in good faith, we can’t completely specify goals to an AI. And AIs won’t be able to completely understand context.

In 2015, Volkswagen was caught cheating on emissions control tests. This wasn’t AI (human engineers programmed a regular computer to cheat), but it illustrates the problem. They programmed their engine to detect emissions control testing, and to behave differently. Their cheat remained undetected for years.

If I asked you to design a car’s engine control software to maximize performance while still passing emissions control tests, you wouldn’t design the software to cheat without understanding that you were cheating. An AI, though, might. It will think “out of the box” simply because it won’t have a conception of the box. It won’t understand that the Volkswagen solution harms others, undermines the intent of the emissions control tests, and is breaking the law. Unless the programmers specify the goal of not behaving differently when being tested, an AI might come up with the same hack. The programmers will be satisfied, the accountants ecstatic. And because of the explainability problem, no one will realize what the AI did. And yes, knowing the Volkswagen story, we can explicitly set the goal to avoid that particular hack. But the lesson of the genie is that there will always be unanticipated hacks.

How realistic is AI hacking in the real world? The feasibility of an AI inventing a new hack depends a lot on the specific system being modeled. For an AI to even start on optimizing a problem, let alone hacking a completely novel solution, all of the rules of the environment must be formalized in a way the computer can understand. Goals, known in AI as objective functions, need to be established. And the AI needs some sort of feedback on how well it’s doing so that it can improve.

In chess, the rules, objective, and feedback (did you win or lose?) are all precisely specified. And there’s no context to know outside of those things that would muddy the waters. This is why most of the current examples of goal and reward hacking come from simulated environments. These are artificial and constrained, with all of the rules specified to the AI. The inherent ambiguity in most other systems ends up being a near-term security defense against AI hacking.
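The Volkswagen cheat discussed above boils down to very ordinary conditional logic: detect a signature of the test environment, then switch behavior. The sketch below is purely illustrative, not Volkswagen’s actual algorithm; the sensor names, thresholds, and mode labels are all hypothetical.

```python
# Illustrative sketch of a "defeat device" as conditional logic.
# All names and thresholds are hypothetical, not the real VW implementation.

def looks_like_emissions_test(speed_kmh: float, steering_angle_deg: float) -> bool:
    """A dynamometer test spins the wheels while the steering wheel stays
    centered, a pattern rarely seen on a real road."""
    return speed_kmh > 0 and abs(steering_angle_deg) < 1.0

def engine_mode(speed_kmh: float, steering_angle_deg: float) -> str:
    # Behave differently when, and only when, the car appears to be under test.
    if looks_like_emissions_test(speed_kmh, steering_angle_deg):
        return "low-emissions mode"    # passes the test
    return "high-performance mode"     # what drivers actually experience

print(engine_mode(50.0, 0.2))    # test-bench-like conditions
print(engine_mode(50.0, 15.0))   # ordinary driving
```

The point of the sketch is how little machinery the hack requires: the hard part was never the code, but the decision to write it, which is exactly the understanding an AI optimizing a goal would lack.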
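Reward hacking of the kind in the simulation examples above can be reproduced with a deliberately mis-specified objective. In this toy sketch (strategies, numbers, and names are invented for illustration), the designer wants an agent that walks, but the reward measures only distance covered, so an optimizer picks the loophole.

```python
# Toy illustration of reward hacking: the designer wants walking, but the
# reward only measures horizontal distance reached. All values are invented.

STRATEGIES = {
    # strategy: (distance_reached, actually_walked)
    "stand_still":        (0.0, False),
    "take_small_steps":   (2.0, True),
    "grow_tall_and_fall": (5.0, False),  # a tall body covers distance by toppling
}

def reward(strategy: str) -> float:
    # Mis-specified proxy objective: distance only; walking is never checked.
    distance, _walked = STRATEGIES[strategy]
    return distance

best = max(STRATEGIES, key=reward)
print(best)  # the optimizer exploits the loophole rather than walking
```

Nothing here is malicious: the loophole strategy really is the optimum of the stated objective. The mismatch lives entirely in the gap between the reward we wrote down and the behavior we meant.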