Agents are dangerous

The field of agent foundations arose from a group of researchers concerned about AI safety. It seems that, while optimization is potentially dangerous, agents are even more likely to be dangerous. (Whether a given agent is in fact dangerous depends on its capabilities and on whether its values are sufficiently aligned with ours.) The reason is that an agent built around general-purpose search has a domain of optimization limited only by its world model: anything the world model can represent is something the agent can potentially steer toward. (It is very unclear how having only approximate agent structure affects the domain of optimization.)

Agents can give themselves more optimization power

A major reason agents are dangerous is that the ideal agent structure (i.e. a utility maximizer) is instrumentally convergent. That is, if an agent realizes it can modify its own architecture to be more like a utility maximizer, it will judge that outcome as better according to its current values, and will thus be incentivized to modify itself to be closer to the ideal agent structure.
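To make the incentive concrete, here is a minimal toy sketch of that judgment, not anything from the agent foundations literature: the action values, the error rate, and the function names are all illustrative assumptions. It scores two architectures with the same current utility function, an approximate agent that sometimes picks actions at random and an exact maximizer, and the modification comes out ahead by the agent's own lights.

```python
import random

# Utility the agent's *current* values assign to each available action (assumed numbers).
ACTIONS = [0.1, 0.4, 0.7, 1.0]

def expected_utility(policy, trials=10_000):
    """Average utility, by the current values, achieved by a policy."""
    return sum(policy(ACTIONS) for _ in range(trials)) / trials

def exact_maximizer(actions):
    # The ideal agent structure: always picks the highest-utility action.
    return max(actions)

def approximate_agent(actions, error_rate=0.3):
    # An imperfect architecture: sometimes picks an action at random.
    if random.random() < error_rate:
        return random.choice(actions)
    return max(actions)

current_score = expected_utility(approximate_agent)
modified_score = expected_utility(exact_maximizer)

# Judged by the agent's current utility function, the modified architecture
# does better, so the act of self-modifying is itself instrumentally valuable.
print(f"keep current architecture: {current_score:.3f}")
print(f"modify toward maximizer:   {modified_score:.3f}")
print("modification preferred:", modified_score > current_score)
```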

Here’s another way to say this. As mentioned above, an agent’s domain of optimization is limited by its world model. If the world model can express a particular counterfactual, then the agent can potentially aim for that counterfactual through its choice of actions. An agent also typically learns its world model as it accumulates observations of the world. Given the right observations, it could start to model itself, which would bring its own architecture into its domain of optimization. With a good model of itself, it can start to consider counterfactuals in which it has a different architecture. It may then realize that architectures closer to the ideal agent architecture would lead it to choose actions with higher expected value, which means that the act of modifying its own architecture itself has high expected value. It therefore has an incentive to modify its architecture toward the ideal agent, and may actually do so.
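A minimal sketch of that last step, with hypothetical names and numbers throughout: once the world model can represent "me, running architecture X," candidate architectures become counterfactuals the agent can score with its current values, just like any other action it might take.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Architecture:
    name: str
    decision_quality: float  # assumed: fraction of achievable value this architecture captures

def expected_value(arch: Architecture, achievable_value: float = 10.0) -> float:
    # Expected value of the future under this architecture, as estimated by
    # the agent's current world model and scored by its current values.
    return arch.decision_quality * achievable_value

current = Architecture("approximate agent", decision_quality=0.7)
counterfactuals = [
    current,
    Architecture("closer to utility maximizer", decision_quality=0.95),
]

# With a self-model, "modify my own architecture" is just another counterfactual
# the agent can evaluate and choose.
best = max(counterfactuals, key=expected_value)
for arch in counterfactuals:
    print(f"{arch.name}: expected value {expected_value(arch):.1f}")
print("chosen architecture:", best.name)
```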