How theory can help AI safety

One of the main challenges that agent foundations faces as a field is the difficulty of communicating to other stakeholders in AI safety how exactly theory can help. While any given researcher may have their own personal theory of change which is their motivating reason for working in agent foundations, these reasons also need to be shared and explained for the sake of coordination.

Here we give a cursory attempt at communicating our own theory of change.

Theory can be helpful at different levels

People often struggle to see how theory can help because they are trying to draw a direct connection to the current state of affairs. It is very difficult indeed to figure out a way in which e.g. category theory papers will cause programmers at OpenAI to type different things into their keyboards!

While we do think it is crucial for agent foundations researchers to have at least some stories for how their work will have a through-line all the way to application, those stories are mostly not the mental models researchers use to guide their day-to-day work. Instead, there are other levels on which theory can help, and those can give more reliable and applicable heuristics. To give an especially coarse example, the heuristic that “understanding helps you take better actions” is a major motivator for many agent foundations researchers.

Yielding direct calculation

The gold standard for the applicability of a theory is for it to permit the numerical calculation of a solution, which can then be directly implemented. We do not really think this is the level at which agent foundations will be useful, but it is worth discussing, in part to fill out our understanding of the richness of what “application” could mean.

The development of calculus was so profound in part because it was immediately useful for calculating object-level things. In the Principia, Newton himself calculated things like ballistic trajectories, tidal predictions, and drag forces.

Applying a theory to real-life scenarios like these requires translating the scenario into the language of the theory. For the example of Newtonian mechanics, there are many different levels of accuracy with which this translation could happen:

  • Modelling bodies as point particles with fixed masses, positions, and velocities (often appropriate for celestial mechanics)
  • Modelling bodies as rigid objects which additionally have a particular shape, density, and rotation
  • Further modelling the bodies as having a Young’s modulus, that is, as flexing when forces are applied
  • Using finite element analysis to vary all these properties throughout the structure
  • Attaching sensors to the real bodies, to constantly feed real-time measurements of these properties into a computer; et cetera

For each of these modelling choices, a given real-life scenario can be immediately translated into a mathematical specification, and then the corpus of Newtonian theory provides various algorithms for calculating out the implications of the model setup.
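
As a toy illustration of the first modelling level, here is a minimal sketch (all parameter values are invented for illustration) of translating a ballistics scenario into a point-particle specification and then calculating its implications numerically:

```python
import math

# Point-particle model of a ballistic trajectory with quadratic air
# drag, integrated with a simple Euler step. All parameter values are
# illustrative, not drawn from any real scenario.

g = 9.81           # gravitational acceleration, m/s^2
drag_coeff = 0.05  # lumped drag constant, kg/m (illustrative)
mass = 1.0         # kg
dt = 0.001         # integration time step, s

# Initial conditions: launched at 100 m/s, 45 degrees above horizontal.
vx, vy = 100 * math.cos(math.pi / 4), 100 * math.sin(math.pi / 4)
x, y = 0.0, 0.0

while y >= 0.0:
    speed = math.hypot(vx, vy)
    # Newton's second law, a = F/m: gravity plus drag opposing motion.
    ax = -drag_coeff * speed * vx / mass
    ay = -g - drag_coeff * speed * vy / mass
    vx, vy = vx + ax * dt, vy + ay * dt
    x, y = x + vx * dt, y + vy * dt

print(f"Range with drag: {x:.1f} m")
```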

Providing proofs of concept

When special relativity was first published, it contained versions of the famous equation \(E = mc^2\). Despite being the epitome of theory, this discovery made it clear that it was at least in principle possible to extract an enormous amount of energy from matter, an idea with immense practical consequences. People had previously observed large amounts of energy coming from radioactivity, but did not understand its source. The equation strongly suggested that the energy came from a decrease in mass (a decrease too small to measure at the time, though consistent with experimental error). This is an example of one level at which theory can help: pointing at what types of things are even possible. That pointing gave a justification for further studying radioactivity, and once the structure of the nucleus and the strong force were discovered, the mechanism of energy release began to become clear.
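
To see why the equation was so suggestive, here is a back-of-the-envelope sketch (our own illustrative numbers, not figures from the original papers):

```python
# Back-of-the-envelope use of E = mc^2: even a mass decrease far too
# small to weigh in 1905 corresponds to an enormous energy.
# The numbers below are purely illustrative.

c = 2.998e8      # speed of light, m/s
delta_m = 1e-9   # a one-microgram decrease in mass, kg

energy = delta_m * c**2
print(f"Energy released: {energy:.2e} J")  # ~9e7 J

# For scale: comparable to the chemical energy in a couple of litres
# of gasoline (~34 MJ/L), from a mass change no balance of the era
# could have detected.
print(f"~{energy / 34.2e6:.1f} litres-of-gasoline equivalent")
```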

Turing’s work showed that a single physical computer can be built which, given the appropriate input, can perform any calculation that any other computer could. Similarly, Shannon’s work on information theory showed that any medium of communication can be reduced to an intrinsic “bit rate”, and thus we do not need to build separate communication lines for text, video, audio, et cetera.
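
Shannon’s reduction can be made concrete with the Shannon–Hartley formula \(C = B \log_2(1 + S/N)\); the sketch below plugs textbook-style illustrative numbers for a voice-grade telephone line into it:

```python
import math

# Shannon-Hartley capacity: any noisy analog channel reduces to an
# intrinsic maximum bit rate. Parameter values are illustrative.

def channel_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Maximum error-free bit rate of the channel, in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# A 3 kHz voice line with a 30 dB signal-to-noise ratio (SNR = 1000):
print(f"{channel_capacity(3000, 1000):,.0f} bits/s")  # ~30 kbit/s
```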

While none of these examples yielded direct numerical calculations for solutions to problems, they all gave strong reasons for reorienting engineers towards or away from entire categories of activity. We think that agent foundations can do the same. For example, if we find a suitable mathematical description of a corrigible architecture, engineers would have a strong justification for building systems to match that spec.

Providing disproofs of practicality

Chaos theory is, in some ways, extremely theoretical. While demonstrations of chaotic behavior take the form of swirling liquids or swinging pendulums, the mathematics involves studying things like “measure-preserving transformations on compact manifolds”. Importantly, the consequential results of chaos theory are worked out in this mathematical regime. It’s one thing to say “weather sure does seem all swirly and chaotic, maybe it’s not practical to predict.” It’s quite another to combine the theorems of chaos theory with empirical observations about what kind of dynamical system the weather actually is, calculate a prediction error rate as a function of time, and arrive at a very strong justification for not trying to predict the weather one month in advance. In this way, theory can provide disproofs of practicality. Chaos theory also led to bifurcation theory, which can tell you that if you have a chaotic system, changing one of the parameters of the dynamics (e.g. using a more viscous fluid) could yield a non-chaotic system, and can even help you calculate which range of parameters to change to.
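
As a toy stand-in for the weather (the real argument runs through the system’s measured Lyapunov exponents), the sketch below shows the same qualitative phenomenon in the logistic map: a tiny initial measurement error grows exponentially until prediction is useless.

```python
# Two logistic-map trajectories (a standard chaotic system) starting a
# hair apart diverge exponentially, so the usable prediction horizon
# grows only logarithmically with measurement precision.

r = 4.0                   # fully chaotic regime of the logistic map
x, y = 0.4, 0.4 + 1e-10   # nearly identical initial conditions

for step in range(1, 101):
    x, y = r * x * (1 - x), r * y * (1 - y)
    if abs(x - y) > 0.5:  # error has saturated: prediction is useless
        print(f"Predictions decorrelate after ~{step} steps")
        break
```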

We can imagine that agent foundations might provide similarly strong theoretical justifications. For example, if you have an AI agent whose architecture is approximately that of a utility maximizer, but whose utility function is imperfectly specified with respect to humanity’s values, what is the probability distribution over the value of the future, in the limit? This question is load-bearing for many researchers’ beliefs about AI risk, but theorem-level justifications are at this time entirely absent. We believe progress can be made here.

Providing qualitative descriptions of behavior

Somewhere in between proofs of concept and direct calculation is something that you might call theorems of qualitative behavior.

In computational complexity, there is a classic distinction between problems that can be solved by algorithms that run in polynomial time, and problems which can only be solved by algorithms that run in longer than polynomial time. Generally, if a problem cannot be solved with a polynomial-time algorithm, then people decide not to even try to use algorithms that solve it exactly! Instead, they use heuristic solutions, such as algorithms that get you the optimal answer on special sub-types of the problem, or approximation algorithms that guarantee a solution within some factor of the optimum. Similarly, software engineers will often try to rewrite their code to avoid algorithms that are \(O(n^3)\) in favor of ones that are \(O(n^2)\).
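
As a deliberately contrived example of such a rewrite, the sketch below computes the same quantity two ways, once in \(O(n^3)\) and once in \(O(n^2)\):

```python
# Summing every contiguous subarray of a list, two ways.

def total_subarray_sum_cubic(xs):
    # O(n^3): enumerate every window, then sum it from scratch.
    n = len(xs)
    return sum(sum(xs[i:j]) for i in range(n) for j in range(i + 1, n + 1))

def total_subarray_sum_quadratic(xs):
    # O(n^2): extend each window one element at a time, reusing the
    # running sum instead of recomputing it.
    total = 0
    for i in range(len(xs)):
        running = 0
        for x in xs[i:]:
            running += x
            total += running
    return total

xs = [3, 1, 4, 1, 5, 9, 2, 6]
assert total_subarray_sum_cubic(xs) == total_subarray_sum_quadratic(xs)
print(total_subarray_sum_quadratic(xs))  # 468
```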

What’s so interesting about these theoretical properties is that they don’t need to be cashed out into an exact time duration to be useful. Software engineers don’t generally thumb through their code to calculate an exact integer number of computation steps before deciding whether an \(O(n^3)\) algorithm is acceptable.

Statistics has some similar results. It does not prove that it is impossible to flip a fair coin repeatedly and get all heads; instead, it shows that the probability of this goes down exponentially in the number of coin flips. Exponential decay is generally considered extremely rapid, to the point where you don’t really need to calculate out the probabilities to decide whether a particular phenomenon is “mostly random”. Instead you can just do the equivalent of flipping the coin “a bunch” of times, and easily reach a justified conclusion before getting tired of flipping.
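
The coin example, made concrete (a one-line calculation rather than a theorem):

```python
# The probability of n fair flips all landing heads is (1/2)^n.
for n in (10, 30, 100):
    print(f"P(all heads in {n:>3} flips) = {0.5 ** n:.3g}")

# Already at n = 30 the probability is under one in a billion, so
# "flip it a bunch of times" settles the question without any careful
# calculation.
```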

During the acute period of the COVID-19 pandemic, pre-existing theoretical work on complex systems helped us understand that, since viruses spread through networks, the spread can be controlled much more effectively by limiting the maximum number of connections any one “node” has. Policy makers did not need to get this maximum number exactly right in order to decide on an effective policy.
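
A toy percolation sketch of this qualitative claim (a sanity-check toy, not an epidemiological model; every parameter below is invented):

```python
import random

# Identical transmission dynamics on two random contact networks; one
# caps how many contacts any "node" may have. Toy model only.

def random_graph(n, degree_cap):
    """Nodes request a heavy-tailed number of contacts, truncated at the cap."""
    adj = [set() for _ in range(n)]
    for u in range(n):
        want = min(int(random.paretovariate(1.5)), degree_cap)
        for _ in range(20 * degree_cap):  # bounded matching attempts
            if len(adj[u]) >= want:
                break
            v = random.randrange(n)
            if v != u and len(adj[v]) < degree_cap:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def outbreak_size(adj, p_transmit=0.3):
    """Seed one infection and let it percolate along the edges."""
    infected, frontier = {0}, [0]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in infected and random.random() < p_transmit:
                    infected.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(infected)

random.seed(0)
for cap in (4, 50):
    sizes = [outbreak_size(random_graph(2000, cap)) for _ in range(20)]
    print(f"degree cap {cap:>2}: mean outbreak {sum(sizes) / len(sizes):.0f} of 2000")
```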

Overall, we think our research in agent foundations is much more likely to yield this kind of result. We expect to be able to find reasonable modelling assumptions for things like the environment dynamics or the value representation, and then to derive conclusions telling us that certain types of outcomes have probabilities that decay exponentially toward zero, or that expected values diverge to infinity. Perhaps we find an RL training strategy which yields corrigible agent architectures with positive measure. Then further work could try to add assumptions that nail down lower bounds on that probability, and translate them into strong recommendations for how to write the training code.