Pavel Panchekha


Share under CC-BY-SA.

Moral Prisoners

In a previous post, I suggested a solution to the prisoners’ dilemma: each prisoner could, through independent reflection, rationally choose not to defect if the other cooperates; this then allows mutual cooperation to occur. In this post I want to extend the argument slightly.

Changing utility functions

The basic operation I'm proposing is the ability for the two prisoners to change their utility functions. Of course, the prisoners are perfectly rational, so they weigh options for a new utility function using their current utility function. Nonetheless, the game is not trivial.

Each prisoner reasons that adopting a distaste for being the sole defector helps them: they can still punish the other prisoner if they defect, but they also stabilize the equilibrium where both cooperate. Since both prisoners do the same, they ought to find themselves cooperating, despite an inability to repeat the game, make enforceable contracts, and so on.
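To make this concrete, here is a small sketch in Python. The payoff numbers and the size of the "distaste" penalty are my own assumptions for illustration; the post doesn't fix any particular values. The sketch checks which pure-strategy outcomes are Nash equilibria before and after both prisoners adopt a distaste for being the sole defector.

```python
# Payoff numbers are assumed for illustration; any standard prisoners'
# dilemma ordering (temptation > reward > punishment > sucker) works.
C, D = "C", "D"
ACTIONS = [C, D]

# Standard prisoners' dilemma payoffs for the row player:
# key is (my action, their action), value is my utility.
u = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

# The modified utility: a distaste penalty for being the sole defector.
PENALTY = 3  # assumed; any penalty greater than 2 works with these numbers
u2 = dict(u)
u2[(D, C)] -= PENALTY

def equilibria(util):
    """Pure-strategy Nash equilibria when both players use `util`."""
    eqs = []
    for a in ACTIONS:
        for b in ACTIONS:
            a_best = all(util[(a, b)] >= util[(x, b)] for x in ACTIONS)
            b_best = all(util[(b, a)] >= util[(y, a)] for y in ACTIONS)
            if a_best and b_best:
                eqs.append((a, b))
    return eqs

print(equilibria(u))   # [('D', 'D')]: only mutual defection survives
print(equilibria(u2))  # [('C', 'C'), ('D', 'D')]: cooperation is now stable
```

Note that the punishment option is preserved: under `u2`, defecting against a defector still pays better than cooperating with one, so the modified utility only adds the cooperative equilibrium rather than replacing the old one.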

Commitment to new values

At the end of that prior post I asked how credible the change in values has to be. That is, does each prisoner have to do something to prove that they've adopted a new utility function? Or is it enough to have a simple change of heart?

I'd have thought this would be an easy thing to answer, but it's actually subtle, since game theory normally assumes that games with multiple steps (extensive-form games) keep a fixed utility function.1 [1 If you try to consider only the utility-function-choosing game, it's hard to say what the eventual “outcomes” are without a much stronger notion of game-theoretic solution.] So to answer this question, I really had to go back to basics.

Knowledge and preference

The fundamentals of game theory are knowledge and preference: agents prefer some outcomes over others and know things about each other, including which outcomes other agents prefer. Knowledge and preference are linked by rationality: if I know that I prefer the results of doing an action \(A\) over the results of any other action, then I will do \(A\).

The standard game-theory set-up assumes that the preferences of all agents are common knowledge. In this context, we can show that if it is common knowledge that all agents are rational, it is also common knowledge that the outcome will be an equilibrium solution to the game. The proof usually proceeds by iterating through various levels of knowledge that the agents are rational, and is formalized with a fixed-point argument: if the outcome were not an equilibrium solution, then some agent would be acting against their preference; since it is common knowledge that they are rational, this is impossible. So it's common knowledge that an equilibrium solution is reached.

Now apply the argument to our contemplative prisoners. Even if rationality is common knowledge, we need common knowledge of preferences to get an equilibrium outcome. So if each agent changes their utility function, but not credibly (so that their new utility function isn't known), we may end up with a non-equilibrium outcome. In the prisoners’ dilemma, each prisoner might defect, secure in the belief that the other prisoner has not had a change of heart and so there is no benefit to cooperating.2 [2 This is an equilibrium outcome, but a dumb one.]
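A tiny sketch of this failure mode, with payoff numbers assumed for illustration: each prisoner has privately adopted the modified utility (a penalty for being the sole defector), but without a credible signal, each still expects the other to play the old equilibrium action.

```python
# Assumed payoffs: this prisoner privately adopted a distaste for sole
# defection (the (D, C) entry is 2 instead of 5), but the change isn't
# known to the other prisoner.
u2 = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 2, ("D", "D"): 1}

def best_response(util, belief):
    """The action maximizing `util`, given a belief about the other's action."""
    return max(["C", "D"], key=lambda a: util[(a, belief)])

# Without a credible signal, each prisoner still expects the old
# equilibrium play D, and defection remains the best response even
# under the new utility.
print(best_response(u2, "D"))  # 'D'

# Had the change of heart been credible, cooperation would be expected,
# and the new utility sustains it.
print(best_response(u2, "C"))  # 'C'
```

So the change of heart alone accomplishes nothing; it is the knowledge of the change that moves the outcome.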

Perfectly rational agents are restricted in their behavior: they must behave rationally, even when this works against their interests. But being perfectly rational does seem to be a pretty good idea, so it is reasonable to study a world where all agents have common knowledge that all agents are rational.

Where rationality comes from

Acting rationally is, by definition, better than acting any other way. But a world where everyone acts perfectly rationally can be a pretty dystopian place, as our two prisoners keep finding out.3 [3 The best case is maybe being rational while being known to be irrational, but that seems impossible to maintain.]

So in game theory we study the end-point of the slide toward rationality: a world where all agents have credibly committed to acting rationally at all times. A similar universality is, I think, useful for studying this utility-function-choosing behavior.

Rationality and morality

Credible commitment to the new utility function is necessary for the mechanism I sketched out to work. However, it strikes me that committing, once and for all, to always adopt the rationally-chosen utility function is reasonable.

We can call such an agent moral. Formally, we say an agent with utility function \(u\) is moral if, in any game, they are known (have credibly committed) to act in the game with utility function \(u'\), where \(u'\) is strategically chosen so that the outcome of the game is maximized according to utility function \(u\).

There are technical difficulties here. How do we compare outcomes of a game under different utility functions? A simplistic solution would be to say: if all equilibria of the game under one utility function are at least as good as any equilibrium under the other utility function, then the first utility function leads to not-worse outcomes. This is sufficient for the prisoners’ dilemma, but seems too restrictive in other cases.
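The simplistic criterion can be sketched directly. The payoff numbers below are assumed for illustration; the criterion compares the worst equilibrium under the new utility against the best equilibrium under the old one, measured by the original utility \(u\).

```python
from itertools import product

ACTIONS = ["C", "D"]

# Assumed payoffs: u_true is the original utility; u_moral adds a
# distaste penalty for being the sole defector.
u_true = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
u_moral = dict(u_true)
u_moral[("D", "C")] -= 3

def equilibria(util):
    """Pure-strategy Nash equilibria when both players use `util`."""
    return [(a, b) for a, b in product(ACTIONS, ACTIONS)
            if all(util[(a, b)] >= util[(x, b)] for x in ACTIONS)
            and all(util[(b, a)] >= util[(y, a)] for y in ACTIONS)]

def not_worse(u_new, u_old, u_eval):
    """The simplistic criterion: every equilibrium under u_new is at least
    as good, measured by u_eval, as every equilibrium under u_old.
    Assumes both games have at least one pure equilibrium."""
    worst_new = min(u_eval[e] for e in equilibria(u_new))
    best_old = max(u_eval[e] for e in equilibria(u_old))
    return worst_new >= best_old

print(not_worse(u_moral, u_true, u_true))  # True
```

For the prisoners' dilemma this passes: the modified game's equilibria are mutual cooperation and mutual defection, and even the worse of the two is no worse (by the original utility) than the original game's sole equilibrium of mutual defection. The restrictiveness shows up in games where the new utility adds one excellent equilibrium alongside one slightly worse one; the criterion then rejects the change outright.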

However, if the technical difficulties can be resolved, we can study a world composed entirely of perfectly-moral agents.4 [4 And where all agents know the world to be such.] In this world, each prisoner would know that the other has also changed their utility function, and could then choose to cooperate.

Moral moral agents

At the end of the previous post, I asked whether it ever makes sense for an agent to evaluate candidate utility functions using a different utility function. The framework of perfectly moral agents shows that it never does. Whatever utility function you end up choosing was available from the start. The goal is to choose the utility function that will cause the best outcomes according to the original utility function. Since the agent is perfectly moral, it can just go ahead and choose that utility function; reconsidering the utility function it uses for that choice might cause it to betray its original goals and can do it no good.

Open questions

The world of perfectly-moral agents seems like an interesting object of study. The world of perfectly-rational agents unveiled by game theory shows a beautiful internal logic, but also devastating, far-from-optimal outcomes in most games. If this notion of morality is to be a useful formalization of true morality, a world of perfectly-moral agents needs to actually be pretty good.

  • Can perfectly-moral agents always achieve the socially-optimal result? Social choice theory should still exert its influence here: there's no reason to think that perfectly-moral agents can correctly balance competing interests. But can they avoid the sorts of spirals of distrust that perfectly-rational agents sometimes find themselves in?
  • Can perfectly-moral agents preserve their morality? Naturally, a perfectly-moral agent will feel a temptation to cheat: to claim to change utility functions, but not actually change them. Each prisoner would love to convince the other to cooperate, and then defect themselves. Can anything stop them, or must the commitment to perfect morality be unnaturally total? If the answer to the previous question is positive, then perhaps one could argue that an immoral agent could rationally choose to become perfectly moral.

If you've got any thoughts on the above, please let me know.


