Behavior versus structure

In the agent structure problem, and selection theorems more generally, there is a key distinction between behavior and structure. This matters because it is agent structure that is dangerous (see also my short talk on why structure matters). Merely observing agent-like behavior on a certain subset of inputs does not necessarily imply that the behavior will generalize, whereas having agent-like structure does imply generalized agent-like behavior.

In the field of psychology, behaviorism refers to the (largely historical) subfield that attempts to understand people and animals by focusing on their externally-observable behavior and response to stimuli, rather than on things like introspection or internal mechanisms of cognition.

Analogously, in the context of our agent foundations research, an AI’s behavior refers to its actions as a function of its observations. Mathematically, a function is defined as a set of ordered pairs, each consisting of an input and its corresponding output. Two functions are equal if their sets of ordered pairs are equal. The definition carries no notion of “how” these pairs are produced.

In the real world, any particular behavior is produced by a physical mechanism. Two physical mechanisms may produce the same input-output behavior while being mechanically different on the inside. Any AI system that is built in reality will thus have an internal structure, and it is that structure that determines all the ways that the system could act. Therefore it also determines the input conditions under which the system will produce dangerous behavior.
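As a toy illustration of why observed behavior alone can mislead, here is a minimal sketch of two hypothetical policies (`policy_a`, `policy_b`, and the threshold of 100 are all invented for this example). They agree on every tested observation, yet their internal rules differ, so they diverge on inputs nobody tested.

```python
# Two hypothetical policies mapping an integer observation to an action.
# Both the names and the threshold are assumptions made for illustration.

def policy_a(obs: int) -> str:
    # Internal rule: never does anything but wait or halt.
    return "wait" if obs < 100 else "halt"

def policy_b(obs: int) -> str:
    # Internal rule: identical below the threshold, different above it.
    return "wait" if obs < 100 else "act"

# The subset of observations we happened to test.
tested = range(0, 50)
agree_on_tested = all(policy_a(o) == policy_b(o) for o in tested)

# Search a larger range for the first observation where they disagree.
first_divergence = next(
    o for o in range(0, 1000) if policy_a(o) != policy_b(o)
)

print(agree_on_tested, first_divergence)
```

Someone who only saw the outputs on `tested` could not tell the two apart; reading the internal rule is what reveals where they differ.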

Examples

A simple example of two things with the same behavior but different structure is $x + x$ versus $2x$. These may feel “the same”, but that’s because we’re used to dealing with them as functions in the context of mathematics, where only the input-output pairs matter.
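The distinction can be made concrete with a small sketch. Below, "behavior" is taken to be the set of input-output pairs on a finite test domain, and "structure" is approximated by the expression's syntax tree. Both choices, and the helper names `behavior` and `structure`, are assumptions made for illustration rather than formal definitions.

```python
import ast

# The two expressions from the text, written as Python source strings.
SRC_A = "x + x"
SRC_B = "2 * x"

def behavior(src, domain):
    """Set of (input, output) pairs of the expression over a finite domain."""
    code = compile(src, "<expr>", "eval")
    return {(x, eval(code, {"x": x})) for x in domain}

def structure(src):
    """Textual dump of the expression's syntax tree, a crude proxy for structure."""
    return ast.dump(ast.parse(src, mode="eval"))

domain = range(-5, 6)
same_behavior = behavior(SRC_A, domain) == behavior(SRC_B, domain)
same_structure = structure(SRC_A) == structure(SRC_B)

print(same_behavior, same_structure)
```

On this domain the two sets of pairs coincide, while the syntax trees do not: one is an addition of two variables, the other a multiplication of a constant by a variable.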

Another example is different sorting algorithms, such as quicksort and bubble sort. All correct sorting algorithms have the exact same input-output behavior, but there are a large number of known algorithms that all “feel” different in how they sort.
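A minimal sketch of this point, using simple textbook-style versions of the two algorithms (not tuned implementations). Each function returns the sorted list together with a count of element comparisons, which serves here as a rough stand-in for "how" the sorting was done.

```python
def bubble_sort(items):
    """Return (sorted copy, number of element comparisons)."""
    a = list(items)
    comparisons = 0
    n = len(a)
    for i in range(n):
        # After pass i, the last i elements are already in place.
        for j in range(n - 1 - i):
            comparisons += 1
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a, comparisons

def quick_sort(items):
    """Return (sorted copy, number of element comparisons)."""
    a = list(items)
    if len(a) <= 1:
        return a, 0
    pivot, rest = a[0], a[1:]
    smaller, larger = [], []
    comparisons = 0
    for x in rest:
        # One comparison against the pivot per remaining element.
        comparisons += 1
        (smaller if x < pivot else larger).append(x)
    left, c_left = quick_sort(smaller)
    right, c_right = quick_sort(larger)
    return left + [pivot] + right, comparisons + c_left + c_right

data = [5, 2, 9, 1, 5, 6, 3]
bubble_sorted, bubble_count = bubble_sort(data)
quick_sorted, quick_count = quick_sort(data)

print(bubble_sorted == quick_sorted, bubble_count, quick_count)
```

The returned lists are identical, so the two are indistinguishable as input-output behavior on this input, while the comparison counts expose the difference in internal procedure.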

A generalization of these examples is the distinction between computable functions and algorithms in computability theory. Algorithms correspond to the specific code that is written, and computable functions correspond to the input-output behavior of that code. Every computable function has infinitely many different algorithms that implement it. Algorithms have properties like a description length or a runtime. Computable functions can only be said to have these properties insofar as their set of equivalent algorithms has them, e.g. a computable function may have a minimum runtime, but no maximum runtime.
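One way to see why there are infinitely many algorithms per function, and why there is no maximum runtime, is to pad an algorithm with redundant work. The sketch below generates a family of source strings that all compute $2x$; the helper names (`make_padded_source`, `load`) and the step counter are hypothetical devices for this illustration. The counter is instrumentation standing in for runtime, not part of the output being compared.

```python
def make_padded_source(k):
    """Source code for a function computing 2 * x with k redundant steps."""
    lines = ["def f(x):", "    steps = 0"]
    lines += ["    steps += 1  # redundant work"] * k
    lines += ["    return 2 * x, steps"]
    return "\n".join(lines)

def load(source):
    """Execute the source and return the function `f` it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["f"]

sources = [make_padded_source(k) for k in range(4)]
functions = [load(s) for s in sources]

# Behavior: the first returned component, over a small test domain.
outputs = [[f(x)[0] for x in range(5)] for f in functions]

# Properties of the algorithms: description length and a runtime proxy.
lengths = [len(s) for s in sources]
step_counts = [f(3)[1] for f in functions]

print(all(o == outputs[0] for o in outputs), lengths, step_counts)
```

Every member of the family has the same input-output behavior, while description length and step count grow without bound as `k` increases. A shortest or fastest member may exist, but a longest or slowest one cannot.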