One of the key arguments that AI is dangerous is that human values are fragile. That is, if you take an accurate description of our values, modify it slightly, and then strongly optimize the state of the world for those modified values, the resulting state will not be very valuable according to the original values.
The optimization may even push strongly against them. As a simple example, changing the value specification by putting a negative sign in front of it is essentially a one-bit change, and yet optimizing for the result produces the worst possible outcome according to the original values.
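A minimal sketch of this point, using a toy finite state space and randomly generated "true values" (both invented purely for illustration):

```python
import numpy as np

# Toy illustration: true values over a finite set of candidate world-states,
# represented as an arbitrary vector (the 1000 states are an assumption).
rng = np.random.default_rng(0)
true_values = rng.normal(size=1000)        # true value of each candidate state

# A "one-bit" corruption of the specification: flip the sign.
corrupted_values = -true_values

# Strong optimization = pick the state that maximizes the specification.
chosen = np.argmax(corrupted_values)

# The optimum of the corrupted spec is exactly the minimum of the true values.
print(true_values[chosen] == true_values.min())   # True
```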
However, this is clearly not true in all cases; usually, small changes have small effects. If the values are represented by something like a smooth, strongly convex loss function over a Euclidean space, then an epsilon change in the parameters only leads to a delta change in the location and value of the optimum.
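As a hedged toy example of that epsilon–delta behaviour (the quadratic loss and the specific numbers are assumptions chosen purely for illustration): nudging the parameters of a quadratic loss moves its optimum by exactly the size of the nudge, and degrades the optimum's true value only quadratically.

```python
import numpy as np

# Toy model: values are the negative of the strongly convex quadratic loss
# ||x - theta||^2, parameterized by theta (an illustrative assumption).
def optimum(theta):
    # For this loss the optimizer is simply theta itself.
    return theta

theta_true = np.array([1.0, -2.0, 0.5])
epsilon = 1e-3 * np.ones(3)                 # small perturbation of the parameters

x_true = optimum(theta_true)
x_perturbed = optimum(theta_true + epsilon)

# The optimum moves by exactly ||epsilon||, and its value under the *true*
# loss degrades by only ||epsilon||^2: a small change in, a small change out.
print(np.linalg.norm(x_perturbed - x_true))        # ~1.7e-3
print(np.sum((x_perturbed - theta_true) ** 2))     # ~3.0e-06
```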
We think the fragility argument is qualitatively valid, and we wish to understand exactly how concerning it is. That is to say, given humans' actual values, we believe that a non-trivial fraction of the approximate value representations that would actually end up encoded in an AI system would, if optimized for with a plausible amount of optimization, result in a catastrophic outcome. We also believe that a non-trivial fraction would lead not to a catastrophic outcome but to a good one.
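To make the question concrete, here is a hedged Monte Carlo sketch in the spirit of this paragraph; the finite state space, the corruption model (additive noise plus an occasional sign flip), and the 1% cutoffs defining "catastrophic" and "good" are all invented for illustration, not claims about actual human values.

```python
import numpy as np

# Toy experiment: sample corrupted copies of a "true" value function over a
# finite state space, optimize each copy hard, and estimate what fraction of
# the optimized outcomes are catastrophic (or good) under the true values.
# Every number, threshold, and the corruption model here are assumptions.
rng = np.random.default_rng(1)

n_states = 10_000       # size of the toy state space
n_proxies = 1_000       # number of corrupted value representations sampled
noise_scale = 1.0       # scale of additive misspecification noise
p_sign_flip = 0.05      # chance the corruption is the sign flip from above

true_values = rng.normal(size=n_states)
bad_cutoff, good_cutoff = np.quantile(true_values, [0.01, 0.99])

catastrophic = good = 0
for _ in range(n_proxies):
    if rng.random() < p_sign_flip:
        proxy = -true_values                                  # one-bit corruption
    else:
        proxy = true_values + noise_scale * rng.normal(size=n_states)
    outcome = true_values[np.argmax(proxy)]                   # optimize the proxy
    catastrophic += outcome <= bad_cutoff                     # bottom 1% of true values
    good += outcome >= good_cutoff                            # top 1% of true values

print("catastrophic fraction:", catastrophic / n_proxies)
print("good fraction:", good / n_proxies)
```

In this toy model, the noise scale and the sign-flip probability are the knobs that move the two fractions around.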
We think that significant progress can be made in understanding this, and that much of the work is purely mathematical.
See the Catastrophic Regressional Goodhart sequence by Thomas Kwa et al. for some work in this direction.