The great majority of human intelligence resides not in our brains, but between them. Our capacity to efficiently store, retrieve, and share information across individuals is why our species has grown so dramatically in capability over the last hundred thousand years, despite little change in our brain structure. Correspondingly, it should not surprise us that modern security paradigms rely not only on the management of individuals, but even more on the management of information and communication between them.

Over the last decade, the field of deep learning has principally focused on developing larger, more performant, and more naturalistic models. This has prompted concerns about our ability (or inability) to align such models to our values. I argue that we need to pay equal attention to aligning how such models communicate with each other. Not only is this critical in its own right, but stepping out of the single-agent paradigm that has characterized most alignment research to date may itself improve alignment by opening up new avenues for auditing, interpreting, and constraining AI systems.

Alignment efforts typically start from the assumption that our objective is to align an AI system (singular). This is reasonable: it would make little sense to attempt to align multiple AIs if we don’t know how to align one. But while it is intuitive to first look for AI safety solutions in the single-agent paradigm, doing so misunderstands the true nature of the alignment problem. This is both because actors behave differently in networked systems than they do alone, and because the notion of “one” artificial intelligence is an anthropomorphization; we need not (and perhaps cannot) design our intelligent systems as unitary wholes.

If we instead start from a paradigm of a network of connected models, we can begin to develop a kind of conceptual inverse of mechanistic interpretability: rather than investigating the function of a particular neuron within a neural network, we can investigate the function of a neural network within a network of neural networks. In contrast to the black-box environment of a standalone model, a network of models offers explicit, interpretable communication and action bottlenecks that we can audit and constrain. From this vantage point, we can begin to design specialized models that audit and constrain every message and output in the system. This could be extended to a whole network of distributionally diverse filters, and it may be possible to develop a theoretical model of safety based on the probabilistic and game-theoretic dynamics of such networks, where the aim is to drive the probability of a specified set of unsafe outputs down to negligible levels.
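To make the intuition concrete, here is a toy calculation (my own illustration, not a result from any work cited here): if each of n independently trained filters misses a disallowed output with probability p, the chance that the output slips past all of them is p^n, which shrinks exponentially with n. The catch, and the reason distributional diversity matters, is the independence assumption; filters that share training data or blind spots tend to fail together.

```python
# Toy sketch: escape probability of a disallowed output through a stack of
# filters, assuming independent failures. All numbers are illustrative.
def escape_probability(p_miss: float, n_filters: int) -> float:
    """Probability that every filter misses the disallowed output."""
    return p_miss ** n_filters

for n in (1, 3, 5):
    print(f"{n} filters: {escape_probability(0.1, n):.0e}")
# 1 filters: 1e-01
# 3 filters: 1e-03
# 5 filters: 1e-05
```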

Indeed, such principles are already being leveraged in a simplified manner. OpenAI’s recently announced plan for superalignment proposes training specialized models that provide automatable and scalable evaluation of other models’ behavior and internals, building on its research into using GPT-4 to advance mechanistic interpretability efforts for GPT-2. DeepMind and Anthropic have also published research that uses language models to evaluate and align language models. And a number of fledgling AI security companies are launching products that use multiple models to detect and block sensitive content and adversarial prompting. A playful example is Gandalf, a game in which the player must convince an LLM to reveal a password. As the game progresses, the Gandalf chatbot ropes in increasingly sophisticated models to help supervise the conversation and prevent Gandalf from giving the user what they’re asking for.
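The core pattern behind these products is simple: a guard model screens traffic on both sides of the primary model. A minimal sketch, with hypothetical interfaces (not any vendor’s actual API), might look like this:

```python
# Minimal sketch of the supervisor pattern. `primary_model` and `guard_model`
# are hypothetical objects standing in for whatever LLM interface is in use.
def guarded_reply(prompt: str, primary_model, guard_model) -> str:
    if guard_model.flags(prompt):         # e.g. an adversarial or injection attempt
        return "Request refused."
    draft = primary_model.generate(prompt)
    if guard_model.flags(draft):          # e.g. a leaked secret or policy violation
        return "Response withheld."
    return draft
```

Extending this from a single guard to a set of distributionally diverse guards is exactly the network-of-filters picture sketched above.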

The other approach we might consider is to decompose existing monolithic models into networks of smaller, communicating models. In a 2021 paper from Mila, the authors produce a shared symbolic language among the components of transformer models by introducing discrete communication bottlenecks between them, leading to higher performance than baseline models. If such communication could additionally be trained with a semantic loss to provide some level of interpretability, we may be able to intervene in a model’s outputs while they are still being formed inside its architecture.
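For illustration, here is a heavily simplified sketch of one way a discrete bottleneck can work, using vector quantization; this is my own condensation of the general idea, not the paper’s exact method, and all names and sizes are placeholders:

```python
import torch

# Each continuous message one module sends to another is snapped to the
# nearest entry of a shared codebook, so inter-module traffic becomes a
# sequence of discrete, loggable symbols.
class DiscreteBottleneck(torch.nn.Module):
    def __init__(self, num_symbols: int = 64, dim: int = 128):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_symbols, dim)

    def forward(self, message: torch.Tensor):
        # message: (batch, dim) continuous vectors passed between modules
        distances = torch.cdist(message, self.codebook.weight)  # (batch, num_symbols)
        symbols = distances.argmin(dim=-1)                      # discrete codes
        quantized = self.codebook(symbols)
        # Straight-through estimator: forward pass uses the discrete code,
        # backward pass lets gradients flow through the continuous message.
        quantized = message + (quantized - message).detach()
        return quantized, symbols
```

Because every inter-module message is reduced to a symbol index, the traffic between modules can be logged, audited, and in principle filtered, which is precisely the kind of bottleneck a network-level safety scheme could constrain.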

The value of networks of smaller models lies not only in the auditability of their communication bottlenecks, but also in the fact that smaller models are better candidates for alignment. This has already been shown in the field of mechanistic interpretability, and a recent paper from Anthropic suggests that chain-of-thought reasoning is more transparently faithful to a model’s actual “thought process” in smaller models than in larger ones, displaying an inverse scaling effect: “for reasoning faithfulness, larger models may behave worse than smaller ones.”

The 2022 paper Language Model Cascades presents a formalization for reasoning about the probabilistic properties of networks of chained models. There would be significant value in using such formalizations to go beyond both the single-agent paradigm and the simple agent-supervisor paradigm, and instead build robust, probabilistically secure networks of models.
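The basic move in that framing is to treat each model as a conditional distribution and a chain of models as a joint distribution we can compute with. A toy example (my illustration of the framing, not the paper’s notation):

```python
import numpy as np

# Two-model cascade over tiny discrete spaces. P1 gives P(message | prompt),
# P2 gives P(output | message); the chained model is their composition:
#   P(output | prompt) = sum over messages of P2(output | message) * P1(message | prompt)
P1 = np.array([[0.7, 0.3],    # rows: prompts, cols: intermediate messages
               [0.2, 0.8]])
P2 = np.array([[0.9, 0.1],    # rows: intermediate messages, cols: outputs
               [0.4, 0.6]])
cascade = P1 @ P2             # P(output | prompt) for the whole chain

UNSAFE = 1                    # index of a hypothetical unsafe output
print(cascade[:, UNSAFE])     # probability of the unsafe output for each prompt
```

A safety argument in this style would aim to show that, for every prompt, the probability mass on the unsafe set stays below some threshold, with guard models in the chain acting to suppress it.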

Evolution has been unkind to systems that lack diversity. Monocultures are prone to catastrophe, driving our dark associations with incest, famine, and plague. The brain itself is a diverse ecosystem: not one brain but many, interfacing with each other, constraining each other, and negotiating access to a shared resource (the body) to maximize their internal economy of reward functions. Notably, a key piece of this system is the thalamus, which routes sensory inputs to various cortical regions. Deficiencies in the checks and balances of our neural ecosystems typically lead to painful outcomes; a prefrontal cortex without an amygdala is a recipe for psychopathy. More diverse and balanced ecosystems of models could similarly help us avoid catastrophic outcomes in the development of our AI systems.