Some Skyrmsian Signalling Simulations: Reinforcement Learning

1. An Old Puzzle about Meaning, Rehashed

Sophistical Simon: Very mysterious! You said you had no idea how to get to Callias's house, yet you've driven us here on your first try without a single missed turn.

Reasonable Reese: Are you joking? You gave me directions. I just turned left when you said "left" and right when you said "right".

Simon: But how did you know to go right when I said "right" and left when I said "left"?

Reese: That's what those words mean.

Simon: Hm, yes I suppose they do. But why do those words mean that? Does the sound of "left" somehow resemble the leftward turn?

Reese: Of course not. Words don't have to resemble what they mean. It's just a matter of more or less arbitrary convention that we English speakers use "left" for left and "right" for right.

Simon: And how did English speakers manage to coordinate that "left" would mean left, "right" would mean right, and so on for all the other words?

Reese: Don't ask me; I wasn't there. I'm just using the language as I found it.

Simon: Well, it can't be that they did it like we did with the driving, since we were relying on the conventions of English already being established.

Reese: I guess there was Middle English, and before that Old English, and—

Simon: Surely they didn't say, in Old English, "Let's use these Middle English conventions instead".

Reese: No, but that's still where the meanings of "left" and "right" probably came from. The sounds just drifted a bit.

Simon: In any case, this won't help us with the real mystery, since we can just ask the same question about all the preceding languages. How did the speakers of the first ones manage to coordinate on the meanings of those?

Reese: Maybe they started with a simpler communication system, a kind of proto-language with pointing and stuff, and used that to establish the first real languages.

Simon: Maybe. But that just pushes the question back again, right? How did the symbols (gestures, sounds, whatever) of the proto-language come to have their meanings? At some point we need to give a different kind of answer.

Reese: Okay, whatever, you're right: very mysterious. Can we go inside now? I've paid 50 drachma for this course and don't want to miss the start.

2. Skyrms's Solution and Simulations

In his wonderful book, Signals: Evolution, Learning, and Information (2010), Brian Skyrms gives the beginning of a different kind of answer to this puzzle, building on ideas from David Lewis's Convention (1969). Skyrms shows how, starting from scratch, very simple agents can spontaneously learn (or evolve) to use meaningful symbols. In fact, it turns out that it is "easy to learn to signal" (p. 114).

Skyrms appeals to various simulations to make his case. The simulations are described in a reasonable amount of detail, and graphs of the most relevant data resulting from them are included. This may be convincing enough, but I think it can go by a bit too quickly for the lessons to really sink in.

I found myself wishing there were some simulations online somewhere I could easily run for myself (and which my students could run for themselves). I couldn't find any, so I made some. (Using JavaScript and the p5.js library. If you want to learn p5.js, I highly recommend Daniel Shiffman's Nature of Code 2.0 and other learning materials.) The rest of this post walks through a handful of them.

So let's build up to the puzzle about how meaning can originate, this time with some Skyrmsian simulations to go with it.

3. Sender-Receiver Games

Imagine two simple agents, a sender S and a receiver R, along with a target that can be on the left or on the right. S can see where the target is, but cannot move. R can move, but cannot see where the target is. Each round, S and R both get rewarded if R makes it to the target and neither gets rewarded otherwise.

If there is no communication between them, R just has to guess where to go, which in this setup will get about a 50% success rate. (If it's not there already, you can speed it up with the slider beneath the simulation. What's graphed is the average success over the previous 100 rounds, or however many there have been, if fewer than 100.)
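
If you'd like to see that baseline outside the interactive simulation, here is a minimal sketch in plain JavaScript (not the p5.js code behind the simulations; the variable names are mine) of R guessing with no message:

    // With no message, R just guesses, so success hovers around 50%.
    let wins = 0;
    const rounds = 10000;
    for (let i = 0; i < rounds; i++) {
      const goal = Math.random() < 0.5 ? 'left' : 'right';   // what S sees
      const guess = Math.random() < 0.5 ? 'left' : 'right';  // what R does
      if (guess === goal) wins++;
    }
    console.log(wins / rounds);  // roughly 0.5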

[Interactive simulation: R guesses with no communication. Control: speed slider.]

But suppose S can send a message, either a '0' or a '1', and that R can receive this message. Then our setup is what is called a sender-receiver game (or signalling game, or Lewis signalling game, or Lewis-Skyrms signalling game). (What does it mean for S to send a message that R receives? In this case, just that S and R each have a one-digit "working memory", that the symbol in the message S sends is determined by S's working memory, and that when that message reaches R, R's working memory digit will change. It in fact changes to whichever digit the message displays, which in fact corresponds to the digit in S's working memory. But things needn't be that way. Everything that will follow would work just as well even if the digit of the message (and in S's working memory) caused a different symbol to appear in R's working memory. All that matters is that a '0' has one effect on R and a '1' has a different one.)

On its own this won't do anything. If S sends messages randomly, they will do no better than before. And even if S's messages are informative about where the goal is, that won't help if R is still just guessing. For the messages to help, S needs to send them in a way that conveys information about where the goal is and R needs to base its decisions on what messages it receives.

In other words, for S's messages to function as signals, both S and R need to pick reasonable strategies about how to act and stick to them.

For example, if S uses the [L0,R1] strategy, then it will send a '0' whenever the goal is on the left, and a '1' whenever the goal is on the right. If R is using the strategy that matches, they'll succeed every time.

But there's nothing about a '0' that makes it intrinsically well suited to mean left and nothing about '1' that makes it intrinsically well suited to mean right. S could just as well use the [L1,R0] strategy instead. And if R's strategy matches, they will also succeed. (And notice that if you pick the anti-matched strategy, they will do much worse than 50%. This is a good reminder that S and R here are extremely simple and do no more than what we put in.) It is arbitrary which symbol is used for left and which is used for right.
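
To make the setup concrete, here is a minimal sketch of one round with hand-picked strategies (again plain JavaScript rather than the p5.js simulation code; the names S_L0R1, R_0L1R, and playRound are just my labels):

    // A sender strategy maps the goal's location to a signal;
    // a receiver strategy maps a signal to a direction to move.
    const S_L0R1 = { left: '0', right: '1' };     // the [L0,R1] strategy
    const S_L1R0 = { left: '1', right: '0' };     // the [L1,R0] strategy
    const R_0L1R = { '0': 'left', '1': 'right' }; // the [0L,1R] strategy
    const R_1L0R = { '0': 'right', '1': 'left' }; // the [1L,0R] strategy

    // One round: S observes the goal, sends a signal, R moves on the signal.
    // Both are rewarded exactly when R reaches the goal.
    function playRound(sStrategy, rStrategy) {
      const goal = Math.random() < 0.5 ? 'left' : 'right';
      const signal = sStrategy[goal];
      const move = rStrategy[signal];
      return move === goal;
    }

    console.log(playRound(S_L0R1, R_0L1R)); // matched pair: always true
    console.log(playRound(S_L1R0, R_0L1R)); // anti-matched pair: always false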

[Interactive simulation. Controls: speed slider; S strategy: Random, [L0,R1], or [L1,R0]; R strategy: Random, [0L,1R], or [1L,0R].]

So in order to use messages to coordinate R's movements with S's observations, S and R will need to have coordinated on what those symbols mean. How can they manage to do this if they don't already have some meaningful symbols to communicate with? This was the puzzle.

4. Simple Reinforcement Learning

Skyrms shows us how, if S and R are reinforcement learners, they can learn a signalling system together.

(The reinforcement learning that has been driving much of the impressive work in AI recently is much more sophisticated. There are loads of materials online for learning about this, many of them closely following the popular textbook, Sutton and Barto (2018). It also usually uses neural networks, so that less needs to be specified by hand. What we'll be doing is simple enough that this would be more trouble than it's worth, but perhaps in a later post I'll add simulations that use neural networks.) At its most basic, reinforcement learning amounts to the following:

  1. Try something.
  2. If you got rewarded, be more disposed to do that same thing next time.
  3. Repeat.

Now we'll introduce some very simple reinforcement learning into our simulations.

Let's choose S's strategy, like we did before, but then let R learn what to do.

Instead of acting randomly, or having just one strategy and sticking to it no matter what, R will act in an intermediate way. Each round it will pick a strategy, not totally randomly, but weighted randomly, with one tendency to pick one strategy and another to pick the other. Then, when R gets a reward from making it to the goal, it can adjust the weights so that it tends to pick that strategy more often in the future.

One way to picture this—which in fact corresponds closely to the naive way it is implemented here—is to think of R as having a bag of strategies (maybe written on little pieces of paper) that it randomly draws from, starting off with one copy of each of the two strategies in the bag. If a strategy doesn't work, nothing happens: the strategy is returned to the bag, and we're back where we started. If a strategy does work, though, R adds another copy of that strategy to the bag. And so when R randomly draws next time, it has a greater chance than before of drawing the strategy that had succeeded.
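
Here is a minimal sketch of that bag-of-strategies learner, implementing the try/reward/repeat recipe above; the BagLearner class and its method names are mine, not anything from Skyrms or the simulation code:

    // Urn-style reinforcement: draw a strategy token at random from a bag;
    // on success, add one more copy of that token, so it is drawn more often later.
    class BagLearner {
      constructor(strategies) { this.bag = [...strategies]; } // one token per strategy to start
      draw() { return this.bag[Math.floor(Math.random() * this.bag.length)]; }
      reinforce(strategy) { this.bag.push(strategy); }
    }

    // R learns to respond to a sender who sticks to [L0,R1].
    const receiver = new BagLearner([
      { '0': 'left',  '1': 'right' },  // [0L,1R]
      { '0': 'right', '1': 'left'  },  // [1L,0R]
    ]);
    for (let round = 0; round < 1000; round++) {
      const goal = Math.random() < 0.5 ? 'left' : 'right';
      const signal = { left: '0', right: '1' }[goal];  // S's fixed [L0,R1] strategy
      const rStrategy = receiver.draw();
      if (rStrategy[signal] === goal) receiver.reinforce(rStrategy); // reward only on success
    }
    console.log(receiver.bag.length); // mostly copies of [0L,1R] by now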

Simple enough, but surprisingly powerful, especially in a setup as simple as ours:

[Interactive simulation. Controls: speed slider; S strategy: Random, [L0,R1], or [L1,R0].]

As you can see, if S is sticking to one strategy for sending messages, R can quickly learn how to react to them appropriately. (What happens if you switch S's strategy after a few hundred or 1000 rounds? Why is R's learning different? How could we change the algorithm if we didn't want this effect?)

Similarly, we can fix only R's strategy and let S learn what signals to send to get R to go towards the goal. This is a little less intuitive—it feels odd that the receiver can determine what the sender's messages will mean—but the way it works is exactly the same, since S also gets rewarded when R makes it to the goal.

[Interactive simulation. Controls: speed slider; R strategy: Random, [0L,1R], or [1L,0R].]

This is progress. It shows that S and R don't have to start out already coordinated and they don't need us to choose both of their strategies by hand. If one of them is already acting as if the messages have certain meanings, the other can coordinate with them without doing anything sophisticated.

But so far it has still involved one of them taking the meanings as given while the other learns from them.

If an aspiring signaller knows that their partner will be following their lead, it might make sense to act as if the messages already have a certain meaning and wait for their partner to catch on. But what if they don't know that? Or what if they are too simple a creature to reason about how others will be reacting to what they do?

Can we have both S and R start out picking randomly and learning a signalling system through the kind of simple reinforcement learning we've already seen, or do we need to add something else to the mix?

Give it a few tries to see for yourself:

[Interactive simulation: both S and R learn from scratch. Control: speed slider.]

Given how utterly simple an algorithm S and R are following, it's a bit surprising that this works so quickly and consistently. How can it do so?

At the beginning, S and R have a 50% chance of success per round. Soon enough, a pair of strategies will happen to work by luck. Maybe the goal was on the left, S picked [L0,R1], and R picked [0L,1R].

Through reinforcement, S and R will be more likely to use those strategies in the future. They might sometimes get lucky with the other strategy pair as well, evening the proportions back out. But even if so, eventually there will be a period where one successful strategy pair comes up enough more often than the other to open up a significant gap between them. From there a positive feedback loop takes over, giving both S and R a strong tendency towards their half of that initially more successful strategy pair.
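
A minimal sketch of this two-sided version, with both bags reinforced on every shared success (again my own naming, and a plain loop rather than the animated p5.js version), looks like this:

    // Both S and R start with one token per strategy and reinforce whatever worked.
    const senderBag   = [{ left: '0', right: '1' }, { left: '1', right: '0' }];
    const receiverBag = [{ '0': 'left', '1': 'right' }, { '0': 'right', '1': 'left' }];
    const draw = bag => bag[Math.floor(Math.random() * bag.length)];

    const recent = [];  // rolling window of the last 100 outcomes, like the graphs above
    for (let round = 0; round < 2000; round++) {
      const goal = Math.random() < 0.5 ? 'left' : 'right';
      const sStrategy = draw(senderBag);
      const rStrategy = draw(receiverBag);
      const success = rStrategy[sStrategy[goal]] === goal;
      if (success) {                 // both get the reward, so both bags grow together
        senderBag.push(sStrategy);
        receiverBag.push(rStrategy);
      }
      recent.push(success ? 1 : 0);
      if (recent.length > 100) recent.shift();
    }
    console.log(recent.reduce((a, b) => a + b, 0) / recent.length); // usually close to 1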

We can see, then, how behaviors that start out with no meaning can be bootstrapped into meaningful signals using only a simple learning mechanism. Not only that, this learning mechanism is not specific to signalling, but, in some form or other, is useful in general and is pervasive among living things.

5. Less Help: Learning by Forgetting

This puts a big dent in the puzzle, I think, but we shouldn't stop here. The setup for S and R is a very simple one, and some of the simplifications help make it a lot easier to get a signalling system going. We don't want to know only how meaning can emerge in these easiest of circumstances, so we should make things harder on S and R and see if this approach still works.

Consider, for example, the strategies that S and R are sampling from. We've limited them to two each, the two that are sensible for signalling. We haven't, for example, included the strategy [L0,R0] for S or [0L,1L] for R. (Quick exercise: why would these strategies be bad ones for signalling?)

If you know you're trying to signal, it makes sense not to consider these as options. But what if you don't even know that signalling is a possibility or would be a good thing? Why should R rule out, ahead of time, the possibility of doing the same thing no matter what S does? (This was part of Rousseau's concerns about language origins: "The first difficulty that arises is to imagine how languages could have become necessary; for, Men [in the state of nature] having no relations with one another and no need of any, one cannot conceive the necessity or the possibility of this invention if it was not indispensable", Second Discourse, I.25, trans. Gourevitch.)

Having seen the success of the simple reinforcement we've been using, it's worth giving it a shot with these other strategies included to see what will happen:

[Interactive simulation: learning with the extra strategies included. Control: speed slider.]

Well, sometimes it works within a couple thousand rounds. But often it takes longer to coordinate, and in many cases there's no coordination even after 50,000 rounds. (I would need to make the simulations more efficient to run them much past that.)

This is concerning. If a couple extra options make learning this much harder here, how will the reinforcement learning approach play out for more realistic creatures and environments, where there are way more than two ways to act, way more than two relevant states of the world, and way more than two agents interacting?

Rather than arguing abstractly about how big a problem this is, I think the right reaction to this and similar problems arising from adding different complexities is to try out different kinds of learning algorithms to see if they'll do better. After all, animals (and bacteria, for that matter) learn in much more sophisticated ways than what we've allowed S and R. What changes to their learning algorithms would make them better at learning to signal?

Plenty of options are worth exploring, but I'll just mention one tweak which Skyrms discusses and which happens to help: forgetting.

You might have noticed in the earlier simulations that even when S and R have learned to coordinate on one strategy pair, they each still have some small disposition to choose the other strategies, which means their coordination will never be total and every now and then they'll fail. We could change the way their learning works so that these leftover strategies will eventually be eliminated. Doing that helps when these extra strategies are present, too.

In the next simulation, S and R don't just add new tokens to their strategy bags, they also destroy some old ones. Once their total number of strategy tokens reaches a certain threshold—I'll call it the Forgetting Point—they randomly remove one token from their strategy bag each round. (I've set the Forgetting Point here at 64. How can it be changed so that they'll learn faster? Is there a trade-off here?)
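
Here is one way the forgetting tweak could look in the same bag-of-strategies sketch, with a fuller strategy set included (the post explicitly names only [L0,R0] and [0L,1L]; including all four mappings per agent is my assumption). As before, the names and the exact bookkeeping are mine, so treat this as an illustration rather than the simulation's actual code:

    const FORGETTING_POINT = 64;  // matching the value mentioned above

    // Once a bag has reached the Forgetting Point, destroy one random token per round.
    function maybeForget(bag) {
      if (bag.length >= FORGETTING_POINT) {
        bag.splice(Math.floor(Math.random() * bag.length), 1);
      }
    }

    // Each bag now also contains the unhelpful strategies.
    const senderBag = [
      { left: '0', right: '1' }, { left: '1', right: '0' },
      { left: '0', right: '0' }, { left: '1', right: '1' },  // send the same signal regardless
    ];
    const receiverBag = [
      { '0': 'left', '1': 'right' }, { '0': 'right', '1': 'left' },
      { '0': 'left', '1': 'left' }, { '0': 'right', '1': 'right' },  // ignore the signal
    ];

    const draw = bag => bag[Math.floor(Math.random() * bag.length)];
    for (let round = 0; round < 20000; round++) {
      const goal = Math.random() < 0.5 ? 'left' : 'right';
      const sStrategy = draw(senderBag);
      const rStrategy = draw(receiverBag);
      if (rStrategy[sStrategy[goal]] === goal) {  // shared reward
        senderBag.push(sStrategy);
        receiverBag.push(rStrategy);
      }
      maybeForget(senderBag);    // unlucky strategies slowly get purged
      maybeForget(receiverBag);
    }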

[Interactive simulation. Controls: speed slider; Forgetting Point.]

Significantly better, though it still takes more time than without the extra strategies.

6. Where to go from here?

"How do these re­sults gen­er­al­ize? This is not so much a sin­gle ques­tion as an in­vi­ta­tion to ex­plore an emerg­ing field" (Skyrms 2010, p. 19).

There is plen­ty left to ex­plore.

You might be wondering whether signalling systems have to be learned, or whether they could also be evolved. Evolution by natural selection is remarkably similar to learning via reinforcement with forgetting. This is not lost on Skyrms, and much of the book is about similar results about the emergence of signalling through evolution. Indeed, it was evolutionary game-theoretic work that came first. (I plan to do another post at some point with an evolutionary simulation you can play around with. One thing I myself would like to explore here is the combination: signalling systems as they emerge in groups of evolved reinforcement learners.)

You probably also realize the signalling systems we've seen so far are very minimal. They share some important and philosophically interesting features with human language and other animal communication systems, but it doesn't take much effort to think of many important and interesting features which they lack.

So you might also want to explore questions about how other aspects of human language (and other animal communication systems) could have possibly emerged.

If you want to get more seriously into those questions and want to read work by philosophers in this simulation-centric tradition, you should read Skyrms's book and take a look at his other work.

But you should also check out the research of

And you might also want to see the closely related work by linguists and psychologists, like Hawkins et al. (2021) (and its references). And for an overview of recent work on emergent communication by machine learning researchers, see Lazaridou and Baroni (2020).

Finally, you might be worried that all of this must be on the wrong track, at least as far as understanding human linguistic meaning is concerned. You might think that while certain animal communication systems can be understood in this bottom-up kind of way, genuine human linguistic meaning cannot. Perhaps there is an unbridgeable gulf between the merely reactive animal (and machine) behavior and the creative, rational, intentional, and normatively significant (etc.) behavior of humans. And perhaps to understand anything interesting about human meaning we must appeal to these special human traits. If you have this kind of worry, you're in good philosophical company. But for arguments against it, see the work of Ruth Millikan and Dorit Bar-On.