Agent

Most of our work can be related back to a particular structural model of what agents are. This model has the following parts (a rough code sketch follows the list):

  • A boundary, or separation from the environment
  • Input/output channels
  • A world model
  • A representation of values
  • A mechanism (e.g. planning or search) to choose actions based on its values and world model
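
As one way of seeing how these parts fit together, here is a minimal sketch in code. It is illustrative only, and every name in it is ours rather than the source's: the boundary is the object's interface, the input/output channels are its percept and action types, and the remaining three parts are its fields.

```python
# A minimal sketch of the structural model; all names are illustrative.
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

Percept = TypeVar("Percept")  # what crosses the boundary inward
Action = TypeVar("Action")    # what crosses the boundary outward
State = TypeVar("State")      # the agent's internal world-model state

@dataclass
class Agent(Generic[Percept, Action, State]):
    # World model: an internal state plus an update rule over percepts.
    state: State
    update: Callable[[State, Percept], State]
    # Values: a scoring of internal states (higher = preferred).
    utility: Callable[[State], float]
    # Mechanism: chooses an action given the current state and the values
    # (e.g. by planning or search).
    plan: Callable[[State, Callable[[State], float]], Action]

    def step(self, percept: Percept) -> Action:
        """One pass across the boundary: perceive, update, choose, act."""
        self.state = self.update(self.state, percept)
        return self.plan(self.state, self.utility)
```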

Though this construction is somewhat complicated, we believe each of these parts is essential to capturing the unified phenomenon of agents. What unifies the parts? One useful gloss is: an agent is something that does something on purpose. Another is: agents do things for reasons.

Here is an intuitive justification. The phenomenon we want to characterize (because it is dangerous) is one that strongly optimizes the state of the world, and that would do so when counterfactually placed within any of a wide variety of environments. To do so, it would certainly need some way to gain information about which environment it’s in, and to optimize that environment it would also certainly need a way to affect it; these are the input and output channels. For a given sequence of actions that could cause the world state to go up the optimization ordering, there could be some environments in which those same actions would cause the world state to go down the ordering. Since the agent reliably makes the state go up the ordering, it must be using the information it has received about its environment to determine which environment it is in, and then choosing an appropriate sequence of actions. Thus it must have some kind of model of the environment.
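To make the step from “reliably goes up the ordering” to “must have a model” concrete, consider a hypothetical two-environment toy case (ours, not the source’s). No fixed action sequence succeeds in both environments, but a policy that conditions on observations does, and the conditioning function is exactly an implicit model of which environment the agent is in.

```python
# Toy illustration: two hypothetical environments with opposite optimal actions.
ENVS = {
    # environment name -> (observation it emits, action that moves the
    # world state *up* the optimization ordering in that environment)
    "left-world":  ("signal-L", "L"),
    "right-world": ("signal-R", "R"),
}

def score(env: str, action: str) -> int:
    """+1 if the world state goes up the ordering, -1 if it goes down."""
    _, good_action = ENVS[env]
    return 1 if action == good_action else -1

# An environment-blind, fixed action: goes up in one world, down in the other.
print([score(env, "L") for env in ENVS])                            # [1, -1]

# A policy that reads the input channel to infer the environment. The mapping
# from observation back to environment is a (minimal) world model.
def policy(observation: str) -> str:
    return "L" if observation == "signal-L" else "R"

print([score(env, policy(obs)) for env, (obs, _) in ENVS.items()])  # [1, 1]
```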

Motivation

Our goal is to understand what makes AI systems potentially dangerous. We focus on the concept of “agent” described on this page for two reasons: 1) we believe that these properties are a major source of the danger, and 2) it seems to naturally capture the relevant concept (in other words, it “carves nature at its joints”), as evidenced by the fact that it’s been converged on by several different lines of inquiry. If we find theorems that derive this model from other conditions, such as the agent structure problem, then that is a further significant reason to use this concept of agent.

Approximate vs ideal agent

This concept of agent is not intended to imply a crisp separation between the parts. For example, it may be that the boundary is fuzzy, that the structure of the inputs changes over time, or that its world model is mixed up with its planning strategy.

But each of these components should be definable in such a way that one can say when a system has a positive degree of agent structure; we want to form a theory of all agents, or the “space” of agents. Often, it’s easiest to first define a type of ideal agent, and then define an agent as anything within some distance of that ideal.
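
As a hedged sketch of one way this last move might be made precise (the metric d and the class of ideal agents here are our assumptions, not fixed by the text above):

```latex
% Sketch only: d is an assumed metric on systems, normalized to [0, 1]
% (e.g. a distance between input-output behaviors), and A* is an assumed
% class of ideal agents.
\[
  \operatorname{deg}(s) \;=\; 1 \;-\; \min_{a \in \mathcal{A}^{*}} d(s, a)
\]
% deg(s) = 1 when s is itself ideal, and "a positive degree of agent
% structure" becomes the condition deg(s) > 0.
```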