Some Skyrmsian Signalling Simulations: Reinforcement Learning
1. An Old Puzzle about Meaning, Rehashed
Sophistical Simon: Very mysterious! You said you had no idea how to get to Callias’s house, yet you’ve driven us here on your first try without a single missed turn.
Reasonable Reese: Are you joking? You gave me directions. I just turned "left" when you said "left" and right when you said "right".
Simon: But how did you know to go right when I said "right" and left when I said "left"?
Reese: That’s what those words mean.
Simon: Hm, yes I suppose they do. But why do those words mean that? Does the sound of "left" somehow resemble the leftward turn?
Reese: Of course not. Words don’t have to resemble what they mean. It’s just a matter of more or less arbitrary convention that we English speakers use "left" for left and "right" for right.
Simon: And how did English speakers manage to coordinate that "left" would mean left, "right" would mean right, and so on for all the other words?
Reese: Don’t ask me; I wasn’t there. I’m just using the language as I found it.
Simon: Well, it can’t be that they did it like we did with the driving, since we were relying on the conventions of English already being established.
Reese: I guess there was Middle English, and before that Old English, and—
Simon: Surely they didn’t say, in Old English, "Let’s use these Middle English conventions instead".
Reese: No, but that’s still where the meanings of "left" and "right" probably came from. The sounds just drifted a bit.
Simon: In any case, this won’t help us with the real mystery, since we can just ask the same question about all the preceding languages. How did the speakers of the first ones manage to coordinate on the meanings of those?
Reese: Maybe they started with a simpler communication system, a kind of proto-language with pointing and stuff, and used that to establish the first real languages.
Simon: Maybe. But that just pushes the question back again, right? How did the symbols (gestures, sounds, whatever) of the proto-language come to have their meanings? At some point we need to give a different kind of answer.
Reese: Okay, whatever, you’re right: very mysterious. Can we go inside now? I’ve paid 50 drachma for this course and don’t want to miss the start.
2. Skyrms’s Solution and Simulations
In his wonderful book, Signals: Evolution, Learning, and Information (2010), Brian Skyrms gives the beginning of a different kind of answer to this puzzle, building on ideas from David Lewis’s Convention (1969). Skyrms shows how, starting from scratch, very simple agents can spontaneously learn (or evolve) to use meaningful symbols. In fact, it turns out that it is "easy to learn to signal" (p. 114).
Skyrms appeals to various simulations to make his case. The simulations are described in a reasonable amount of detail and graphs of the most relevant data resulting from them are included. This may be convincing enough, but I think it can go by a bit too quickly for the lessons to really sink in.
I found myself wishing there were some simulations online somewhere I could easily run for myself (and which my students could run for themselves). I couldn’t find any, so I made some (using JavaScript and the p5.js library; if you want to learn p5.js, I highly recommend Daniel Shiffman’s Nature of Code 2.0 and other learning materials). The rest of this post walks through a handful of them.
So let’s build up to the puzzle about how meaning can originate, this time with some Skyrmsian simulations to go with it.
3. Sender-Receiver Games
The basic setup involves two very simple agents, a sender S and a receiver R, plus a target that appears either on the left or on the right. S can see where the target is, but cannot move. R can move, but cannot see where the target is. Each round, S and R both get rewarded if R makes it to the target and neither gets rewarded otherwise.
If there is no communication between them, R just has to guess where to go, which in this setup will get about a 50% success rate. (If it’s not there already, you can speed it up with the slider beneath the simulation. What’s graphed is the average success over the previous 100 rounds, or however many there have been if fewer than 100.)
But suppose S can send a message, either a '0' or a '1', and that R can receive this message. (What does it mean for S to send a message that R receives? In this case, just that S and R each have a one-digit "working memory", that the symbol in the message S sends is determined by S’s working memory, and that when the message reaches R, R’s working-memory digit changes. It changes to whichever digit the message displays, which in fact corresponds to the digit in S’s working memory. But things needn’t be that way: everything that follows would work just as well even if the digit of the message (and in S’s working memory) caused a different symbol to appear in R’s working memory. All that matters is that a '0' has one effect on R and a '1' has a different one.) Then our setup is what is called a sender-receiver game (or signalling game, or Lewis signalling game, or Lewis-Skyrms signalling game).
On its own this won’t do anything. If S sends messages randomly, they will do no better than before. And even if S’s messages are informative about where the goal is, that won’t help if R is still just guessing. For the messages to help, S needs to send them in a way that conveys information about where the goal is and R needs to base its decisions on what messages it receives.
In other words, for S’s messages to function as signals, both S and R need to pick reasonable strategies about how to act and stick to them.
For example, if S uses the [L0,R1] strategy, then it will send a '0' whenever the goal is on the left, and a '1' whenever the goal is on the right. If R is using the matching strategy, [0L,1R], going left on a '0' and right on a '1', they’ll succeed every time.
But there’s nothing about a '0' that makes it intrinsically well suited to mean left and nothing about '1' that makes it intrinsically well suited to mean right. S could just as well use the [L1,R0] strategy instead. And if R’s strategy matches, they will also succeed. (Notice that if you pick the anti-matched strategy, they will do much worse than 50%. This is a good reminder that S and R here are extremely simple and do no more than what we put in.) It is arbitrary which symbol is used for left and which is used for right.
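To make the setup concrete, here is a minimal sketch of one round of the game in plain JavaScript. This is not the code behind the embedded simulations; the names and functions are just illustrative, following the strategy notation used in the text.

```javascript
// Sender strategies map the observed goal location to a message.
const senderStrategies = {
  "L0,R1": (goal) => (goal === "left" ? "0" : "1"),
  "L1,R0": (goal) => (goal === "left" ? "1" : "0"),
};

// Receiver strategies map the received message to a movement.
const receiverStrategies = {
  "0L,1R": (msg) => (msg === "0" ? "left" : "right"),
  "0R,1L": (msg) => (msg === "0" ? "right" : "left"),
};

// Play one round with a fixed pair of strategies.
function playRound(senderStrategy, receiverStrategy) {
  const goal = Math.random() < 0.5 ? "left" : "right"; // only S "sees" this
  const message = senderStrategy(goal);                // S sends '0' or '1'
  const move = receiverStrategy(message);              // R acts only on the message
  return move === goal ? 1 : 0;                        // both get the reward on success
}

// A matched pair succeeds every round; an anti-matched pair never does.
console.log(playRound(senderStrategies["L0,R1"], receiverStrategies["0L,1R"])); // 1
```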
So in order to use messages to coordinate R’s movements with S’s observations, S and R will need to have coordinated on what those symbols mean. How can they manage to do this if they don’t already have some meaningful symbols to communicate with? This was the puzzle.
4. Simple Reinforcement Learning
Skyrms shows us how, if S and R are reinforcement learners, they can learn a signalling system together.
(The reinforcement learning that has been driving much of the impressive work in AI recently is much more sophisticated. There are loads of materials online for learning about it, many of them closely following the popular textbook by Sutton and Barto (2018). It also usually uses neural networks, so that less needs to be specified by hand. What we’ll be doing is simple enough that this would be more trouble than it’s worth, but perhaps in a later post I’ll add simulations that use neural networks.) At its most basic, reinforcement learning amounts to the following:
- Try something.
- If you got rewarded, be more disposed to do that same thing next time.
- Repeat.
Now we’ll introduce some very simple reinforcement learning into our simulations.
Let’s choose S’s strategy, like we did before, but then let R learn what to do.
Instead of acting randomly, or having just one strategy and sticking to it no matter what, R will act in an intermediate way. Each round it will pick a strategy, not totally randomly, but weighted randomly, with one tendency to pick one strategy and another to pick the other. Then, when R gets a reward from making it to the goal, it can adjust the weights so that it tends to pick that strategy more often in the future.
One way to picture this—which in fact corresponds closely to the naive way it is implemented here—is to think of R as having a bag of strategies (maybe written on little pieces of paper) that it randomly draws from, starting off with one copy of each of the two strategies in the bag. If a strategy doesn’t work, nothing happens: the strategy is returned to the bag, and we’re back where we started. If a strategy does work, though, R adds another copy of that strategy to the bag. And so when R randomly draws next time, it has a greater chance than before of drawing the strategy that had succeeded.
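Here is a minimal sketch of that bag-of-strategies learner in JavaScript. Again, this is only an illustration under the assumptions just described, not the simulations' actual code.

```javascript
// A "bag of strategies" learner: draw a strategy at random each round,
// and add another copy of it to the bag whenever it leads to a reward.
class UrnLearner {
  constructor(strategies) {
    this.bag = [...strategies]; // start with one copy of each strategy
  }

  // Weighted random choice: strategies with more copies get drawn more often.
  choose() {
    return this.bag[Math.floor(Math.random() * this.bag.length)];
  }

  // On success, add another copy of the strategy that was just used.
  reinforce(strategy) {
    this.bag.push(strategy);
  }
}
```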
Simple enough, but surprisingly powerful, especially in a setup as simple as ours:
As you can see, if S is sticking to one strategy for sending messages, R can quickly learn how to react to them appropriately. (What happens if you switch S’s strategy after a few hundred or 1000 rounds? Why is R’s learning different? How could we change the algorithm if we didn’t want this effect?)
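In terms of the sketches above, this first case, with S’s strategy fixed by hand and only R learning, might look something like this:

```javascript
// Fixed sender, learning receiver (reusing the earlier sketches).
const fixedSender = senderStrategies["L0,R1"];
const learningReceiver = new UrnLearner(Object.keys(receiverStrategies));

for (let round = 0; round < 1000; round++) {
  const rName = learningReceiver.choose();
  if (playRound(fixedSender, receiverStrategies[rName]) === 1) {
    learningReceiver.reinforce(rName); // success makes this strategy more likely next time
  }
}
```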
Similarly we can fix only R’s strategy and let S learn what signals to send to get R to go towards the goal. This is a little less intuitive—it feels odd that the receiver can determine what the sender’s messages will mean—but the way it works is exactly the same, since S also gets rewarded when R makes it to the goal.
This is progress. It shows that S and R don’t have to start out already coordinated and they don’t need us to choose both of their strategies by hand. If one of them is acting as if the messages already have certain meanings, the other can coordinate with them without doing anything sophisticated.
But so far it has still involved one or the other of them taking the meanings as given and having the other learn from them.
If an aspiring signaller knows that their partner will be following their lead, it might make sense to act as if the messages already have a certain meaning and wait for their partner to catch on. But what if they don’t know that? Or what if they are too simple a creature to reason about how others will be reacting to what they do?
Can we have both S and R start out picking randomly and learning a signalling system through the kind of simple reinforcement learning we’ve already seen, or do we need to add something else to the mix?
Give it a few tries to see for yourself:
Given how utterly simple an algorithm S and R are following, it’s a bit surprising that this works so quickly and consistently. How can it do so?
At the beginning S and R have a 50% chance of success per round. Soon enough, a pair of strategies will happen to work by luck. Maybe the goal was on the left, S picked [L0,R1], and R picked [0L,1R].
Through reinforcement, S and R will be more likely to use those strategies in the future. They might sometimes get lucky with the other strategy pair as well, evening the proportions back out. But even if so, eventually there will be a period where one successful strategy pair comes up enough times more often than the other to open up a significant gap between them. From there a positive feedback loop takes over, giving both S and R a strong tendency towards their strategy in that initially more successful strategy pair.
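The core of that feedback loop can be captured in a few lines, reusing the earlier sketches; again, this is only an illustrative approximation of what the embedded simulation does:

```javascript
// Both S and R start out as uncommitted learners and are reinforced together.
const sender = new UrnLearner(Object.keys(senderStrategies));
const receiver = new UrnLearner(Object.keys(receiverStrategies));

let successes = 0;
const rounds = 2000;
for (let round = 0; round < rounds; round++) {
  const sName = sender.choose();
  const rName = receiver.choose();
  const reward = playRound(senderStrategies[sName], receiverStrategies[rName]);
  if (reward === 1) {
    sender.reinforce(sName);   // both learners are reinforced on success,
    receiver.reinforce(rName); // which is what drives the feedback loop
    successes++;
  }
}
console.log(`success rate: ${(successes / rounds).toFixed(2)}`);
```

Run this a few times and the bags usually end up dominated by one matched strategy pair or the other; which pair wins depends on the early luck described above.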
We can see, then, how behaviors that start out with no meaning can be bootstrapped into meaningful signals using only a simple learning mechanism. Not only that, this learning mechanism is not specific to signalling, but, in some form or other, is useful in general and is pervasive among living things.
5. Less Help: Learning by Forgetting
This puts a big dent in the puzzle, I think, but we shouldn’t stop here. The setup for S and R is a very simple one, and some of the simplifications make it a lot easier to get a signalling system going. We don’t want to know only how meaning can emerge in these easiest of circumstances, so we should make things harder on S and R and see if this approach still works.
Consider, for example, the strategies that S and R are sampling from. We’ve limited them to two each, the two that are sensible for signalling. We haven’t, for example, included the strategy [L0,R0] for S or [0L,1L] for R. (Quick exercise: why would these strategies be bad ones for signalling?)
If you know you’re trying to signal, it makes sense not to consider these as options. But what if you don’t even know that signalling is a possibility or would be a good thing? (This was part of Rousseau’s concerns about language origins: "The first difficulty that arises is to imagine how languages could have become necessary; for, Men [in the state of nature] having no relations with one another and no need of any, one cannot conceive the necessity or the possibility of this invention if it was not indispensable", Second Discourse, I.25, trans. Gourevitch.) Why should R rule out, ahead of time, the possibility of doing the same thing no matter what S does?
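In terms of the earlier sketches, this just means enlarging the strategy tables. I’m assuming here that the constant ("pooling") strategies are added for both agents; the exact set used in the embedded simulation may differ:

```javascript
// Pooling strategies: S sends the same message no matter where the goal is,
// and R goes the same way no matter what it hears. (Illustrative assumption
// about which extra strategies are included.)
senderStrategies["L0,R0"] = () => "0";
senderStrategies["L1,R1"] = () => "1";
receiverStrategies["0L,1L"] = () => "left";
receiverStrategies["0R,1R"] = () => "right";
```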
Having seen the success of the simple reinforcement we’ve been using, it’s worth giving it a shot with these other strategies included to see what will happen:
Well, sometimes it works within a couple thousand rounds. But often it takes longer to coordinate, and in many cases there’s no coordination even after 50,000 rounds. (I would need to make the simulations more efficient to run them much past that.)
This is concerning. If a couple extra options make learning this much harder here, how will the reinforcement learning approach play out for more realistic creatures and environments, where there are way more than two ways to act, way more than two relevant states of the world, and way more than two agents interacting?
Rather than arguing abstractly about how big a problem this is, I think the right reaction to this and similar problems arising from adding different complexities is to try out different kinds of learning algorithms to see if they’ll do better. After all, animals (and bacteria, for that matter) learn in much more sophisticated ways than what we’ve allowed S and R. What changes to their learning algorithms would make them better at learning to signal?
Plenty of options are worth exploring, but I’ll just mention one tweak which Skyrms discusses and which happens to help: forgetting.
You might have noticed in the earlier simulations that even when S and R have learned to coordinate on one strategy pair, they each still have some small disposition to choose the other strategies, which means their coordination will never be total and every now and then they’ll fail. We could change the way their learning works so that these leftover strategies will eventually be eliminated. Doing that helps when these extra strategies are present, too.
In the next simulation, S and R don’t just add new tokens to their strategy bags, they also destroy some old ones. Once their total number of strategy tokens reaches a certain threshold, which I’ll call the Forgetting Point, they randomly remove one token from their strategy bag each round. (I’ve set the Forgetting Point here at 64. How can it be changed so that they’ll learn faster? Is there a trade-off here?)
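A sketch of this tweak, again building on the earlier illustrative learner rather than the simulation’s actual code:

```javascript
// A learner that also "forgets": once the bag grows to the Forgetting Point,
// one randomly chosen token is destroyed each round, so strategies that are
// rarely reinforced tend to disappear from the bag entirely.
class ForgetfulUrnLearner extends UrnLearner {
  constructor(strategies, forgettingPoint = 64) {
    super(strategies);
    this.forgettingPoint = forgettingPoint;
  }

  // Call this once per round, after any reinforcement.
  forget() {
    if (this.bag.length >= this.forgettingPoint) {
      const i = Math.floor(Math.random() * this.bag.length);
      this.bag.splice(i, 1);
    }
  }
}
```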
Significantly better, though it still takes more time than without the extra strategies.
6. Where to go from here?
"How do these results generalize? This is not so much a single question as an invitation to explore an emerging field" (Skyrms 2010, p. 19).
There is plenty left to explore.
You might be wondering whether signalling systems have to be learned, or whether they could also be evolved. Evolution by natural selection is remarkably similar to learning via reinforcement with forgetting. This is not lost on Skyrms, and much of the book is about similar results on the emergence of signalling through evolution. Indeed, it was evolutionary game-theoretic work that came first. (I plan to do another post at some point with an evolutionary simulation you can play around with.) One thing I myself would like to explore here is the combination: signalling systems as they emerge in groups of evolved reinforcement learners.
You probably also realize the signalling systems we’ve seen so far are very minimal. They share some important and philosophically interesting features with human language and other animal communication systems, but it doesn’t take much effort to think of many important and interesting features which they lack.
So you might also want to explore questions about how other aspects of human language (and other animal communication systems) could have possibly emerged.
If you want to get more seriously into those questions and want to read work by philosophers in this simulation-centric tradition, you should read Skyrms’s book and take a look at his other work.
But you should also check out the research of
And you might also want to see the closely related work by linguists and psychologists, like Hawkins et al. (2021) (and its references). And for an overview of recent work on emergent communication by machine learning researchers, see Lazaridou and Baroni (2020).
Finally, you might be worried that all of this must be on the wrong track, at least as far as understanding human linguistic meaning is concerned. You might think that while certain animal communication systems can be understood in this bottom-up kind of way, genuine human linguistic meaning cannot. Perhaps there is an unbridgeable gulf between the merely reactive animal (and machine) behavior and the creative, rational, intentional, and normatively significant (etc.) behavior of humans. And perhaps to understand anything interesting about human meaning we must appeal to these special human traits. If you have this kind of worry, you’re in good philosophical company. But for arguments against them, see the work of Ruth Millikan and Dorit Bar-On.