Convergence of abstractions

Any AI system will have some form of world model. These world models will be built out of abstractions. Abstractions have something to do with how we think about and make sense of the world: we “abstract away” inessential information and keep what is essential. But how exactly do abstractions work? How are they formed? Humans tend to share many abstractions, like “cat” and “pain”, but seem to have conflicting abstractions in some domains, such as whether a given color is blue or green, or how “freedom” applies to a given scenario. To what degree is our choice of abstractions arbitrary? When are our differing abstractions an error on someone’s part, and when are they simply the result of our having different goals?
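
As a toy illustration of “abstracting away” (a minimal sketch of our own, not a formal definition from any particular theory): the individual velocities of gas particles in a box are inessential detail for a macroscopic observer, while a single summary statistic, the temperature, carries everything needed for macroscopic predictions such as pressure.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def sample_microstate(temp):
    """Draw a full microstate: 3D velocities for N particles at a given 'temperature'."""
    return rng.normal(0.0, np.sqrt(temp), size=(N, 3))

def temperature(v):
    """The abstraction: one number summarizing 3N real values (mean squared velocity)."""
    return np.mean(v**2)

def pressure(v, n_density=1.0):
    """A macroscopic prediction that depends on the microstate only through the
    summary statistic (ideal gas, with physical constants set to 1 for illustration)."""
    return n_density * temperature(v)

a = sample_microstate(temp=2.0)
b = sample_microstate(temp=2.0)

# The two microstates disagree in every low-level detail...
print(np.allclose(a, b))              # False
# ...but the abstraction, and every prediction downstream of it, agrees (up to noise).
print(temperature(a), temperature(b))  # both ~2.0
print(pressure(a), pressure(b))        # both ~2.0
```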

Understanding what abstractions are and how they work is going to be a core part of safely navigating the creation of powerful AI systems, for two main reasons:

  1. Human values are expressed in terms of the abstractions that humans use. If an AI is going to optimize for our values, it will need a sufficiently precise encoding of them, which probably requires that it use (sufficiently precise versions of) the same underlying abstractions as us.
  2. More broadly, if the internal mechanisms of trained ML systems are to be understood (i.e., interpretability), then we will need to figure out how to translate those mechanisms into abstractions that we use. Doing so would also make these systems dramatically more predictable and amenable to further engineering.

If abstractions tend to be more “objective” in the sense of being determined by the environment, then that is evidence that these tasks will be easier. In particular, if we manage to build an ML system whose world model contains a sufficiently accurate representation of human values, then we could build a relatively simple optimization architecture around it which “points” to that representation of human values as its optimization target. This is far from a sufficient plan for building safe superintelligences, but we think that doing something like this could be a key stage in a more iterative and deliberative program of alignment.
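
To give a sense of the shape of such an architecture, here is a deliberately oversimplified sketch. Every component (the stand-in world model `W`, `values_rep`, `value_score`, `plan_search`) is a hypothetical placeholder we have invented for illustration; the hard part, whether an adequate value representation exists inside the model and can be located, is assumed away.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins only: this sketches the shape of the idea, not a real system.
STATE_DIM, PLAN_DIM, LATENT_DIM = 8, 4, 16

# A frozen "world model": maps (state, plan) to a latent encoding of the predicted
# outcome. Here it is just a fixed random linear map.
W = rng.standard_normal((LATENT_DIM, STATE_DIM + PLAN_DIM))

def predict_outcome(state, plan):
    return W @ np.concatenate([state, plan])

# The representation of human values assumed to already exist in the world model's
# latent space (here: an arbitrary fixed direction).
values_rep = rng.standard_normal(LATENT_DIM)

def value_score(outcome_latent):
    """The 'pointer': score predicted outcomes by similarity to the value representation."""
    return np.dot(outcome_latent, values_rep) / (
        np.linalg.norm(outcome_latent) * np.linalg.norm(values_rep) + 1e-9
    )

def plan_search(state, n_candidates=1000):
    """A deliberately simple outer optimizer: random search over candidate plans,
    keeping whichever plan the world model predicts scores best."""
    candidates = rng.standard_normal((n_candidates, PLAN_DIM))
    scores = [value_score(predict_outcome(state, p)) for p in candidates]
    return candidates[int(np.argmax(scores))]

state = rng.standard_normal(STATE_DIM)
best_plan = plan_search(state)
print(value_score(predict_outcome(state, best_plan)))
```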

An algorithmic theory of abstractions is one of our directions of research.

Wentworth’s “natural abstractions”

One agenda aiming to understand abstraction for this purpose is John Wentworth’s “natural abstractions”. For those unfamiliar, the post Natural Abstractions: Key Claims, Theorems, and Critiques provides a thorough overview of John’s agenda and how it relates to broader ideas.