Neurithmic Systems’ radical new way of thinking about memory and the program (Sparsey) that implements it

Professor Rod Rinkus of Neurithmic Systems came up with a neural net (he calls it Sparsey) that is about memory: storing memories and retrieving them. No matter how many memories are already stored, the time to store a new memory, or to retrieve an old one, stays the same. There are some very promising aspects of his idea, and I will explain the general idea below. If you want to delve further, his actual papers are at his website (sparsey.com).

Suppose you want to store a pattern, perhaps a number. You could store it as follows:

radios

Here we will assume that only one bit can be ON at a time. With this constraint, we could only present 5 different numbers to the neural net, and it might learn to associate them with different positions of the green dot.

This is called a ‘localist’ representation.   One disadvantage of associating patterns this way is that similarity is lost.   The number ‘1’ might be associated with a green dot at the second position, or it might be associated with a green dot at the fifth position.   Also, you can’t store many patterns this way unless you have many neurons.

A more compact way to store data is with a positional number system. For instance, in everyday math we use base 10, with the digits 0 through 9, where a new digit position is added to represent 10, then 100, then 1000, etc. If we only want to use ON and OFF, we can limit our digits to zero and one. This gives us a base 2 number system. In that case,

the number zero would be:

radioszero

The number one would be:

radiosone

The number two would require an additional digit just like 10 in base 10:

radiostwo

Three would be:

radiosthree

And four would be:

radiosfour

We have already represented 5 numbers, this time with only 3 places needed, and we can represent more. For instance, if all three dots are green (1,1,1), then we have the number ‘7’ in this binary system.

This system is compact, but still, similar numbers are not coded similarly.   Suppose we measured similarity by overlap: the number of dots in the same position that have the green color.   We would see that the number zero has no overlap with the number one, though it is close numerically.  We would see that four has no overlap with three, though three does have one item that overlaps with two.
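To make the overlap measure concrete, here is a small sketch in Python. The function names `to_bits` and `overlap` are my own, made up for illustration; they simply count the positions where both codes have a green (ON) dot.

```python
def to_bits(n, width=3):
    """Return n as a tuple of 0/1 digits, most significant digit first."""
    return tuple((n >> i) & 1 for i in reversed(range(width)))


def overlap(a, b):
    """Count the positions where both codes are ON (green)."""
    return sum(x & y for x, y in zip(a, b))


# Compare a few numerically close numbers.
for m, n in [(0, 1), (3, 4), (2, 3)]:
    print(m, n, to_bits(m), to_bits(n), "overlap:", overlap(to_bits(m), to_bits(n)))
```

Running it shows the same thing as the text: numerical closeness and overlap of ON positions have little to do with each other in a plain binary code.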

An ideal memory would code similar items similarly.   The more different the items were, the more different their representations should be.

In the brain, a sofa and a chair should be represented more similarly than a sofa and a mailbox, for instance.   A robin should have a representation closer to a parrot than to a giraffe.

If this can be accomplished, there are several advantages, which I will list after showing one of Gerard Rinkus’s storage units below.

mac

The above is a representation of what he calls a MAC.   Each MAC is made up of several “Competitive Modules”.   In the brain, the Competitive Modules (which he abbreviates as CM) would correspond to cortical Minicolumns, and the MAC would correspond to a cortical MacroColumn.   In this illustration, we are looking at one MAC with three CMs in it.   Each CM has internal competition where only one neuron (in this case each CM has 3 neurons) can win the competition and turn on.   The others lose and turn off.

So what is the advantage of this?   First of all, since each CM can have 3 separate patterns, there are in total 3 * 3 * 3 patterns that can be represented – or 27.   This is more compact than a totally localist method (where of the 9 neurons only one can be on at a time).   It is not as compact as the example we gave of base 2 numbers.
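The three capacities can be checked with a few lines of arithmetic. This is only a counting sketch; the variable names are mine, with Q standing for the number of CMs and K for the number of neurons per CM.

```python
Q = 3  # competitive modules (CMs) in the MAC
K = 3  # neurons in each CM
neurons = Q * K

localist_codes = neurons     # exactly one of the 9 neurons is ON
mac_codes = K ** Q           # one winner in each of the Q CMs
binary_codes = 2 ** neurons  # any ON/OFF combination of the 9 neurons

print(localist_codes, mac_codes, binary_codes)
```

So with the same 9 neurons, the MAC scheme sits between the localist scheme and the unconstrained base 2 scheme.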

A Sparse-Distributed Representation (SDR) is a representation where only a small fraction of neurons are on. In the above MAC, only a third of neurons are ON, so it qualifies. We can interpret the fraction of a feature’s SDR code that overlaps with the stored SDR of a memory (hypothesis) as the probability/likelihood of that feature (hypothesis).

Using Rinkus’s CMs and MACs, we can introduce similarity into our representations that reflects similarity in the world.

Take a look at these 2 MACs:

mac mac2

Notice that they do overlap in 2 out of 3 of the CMs.   We could say that their similarity is 2/3.   If they were identical, their similarity would be 3/3.   And so forth.
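Here is one way to write that similarity measure down, as a sketch. I am assuming a MAC code can be written as a tuple holding the index of the winning neuron in each CM; that encoding and the name `mac_similarity` are my own choices, not something taken from Sparsey.

```python
def mac_similarity(code_a, code_b):
    """Fraction of CMs in which both codes have the same winning neuron."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same number of CMs")
    matches = sum(a == b for a, b in zip(code_a, code_b))
    return matches / len(code_a)


left = (0, 1, 2)   # winner index in CM 0, CM 1, CM 2
right = (0, 1, 0)  # same winners in the first two CMs only

print(mac_similarity(left, right))  # 2 of 3 CMs agree
print(mac_similarity(left, left))   # identical codes
```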

What are the advantages of representations that reflect similarity?

Well if you have a net (such as Sparsey) that automatically represents similar inputs in a similar way, then you automatically learn hierarchies.   For example, by looking at the SDR for ‘cat’ and ‘dog’, versus the SDRs for ‘cat’ and ‘fish’, you would see that ‘cat’ and ‘dog’ are more similar.

Another interesting advantage is this.   Suppose the MAC on the left represents a cat, and the MAC  on the right represents a dog.    Now you are presented with a noisy input that doesn’t exactly match cat or dog, and gets a representation such as:

mac3

This MAC representation overlaps the MACs for dog and for cat equally.   We could say that the probability that what the net saw was a cat is equal to the probability that it saw a dog.

So the fact that similar inputs yield similar representations means that we can look at an SDR as a probability distribution over several possibilities. This MAC overlaps the representations for dog and for cat by two CMs each, but it might overlap a representation for ‘mouse’ by just one CM.
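The cat/dog/mouse example can be sketched as follows. The particular codes are invented so that the overlaps come out as described, and dividing the overlaps by their total to get numbers that sum to one is my own simplification, not necessarily how Sparsey reads out probabilities.

```python
def overlap_fraction(code, stored):
    """Fraction of CMs where the code and the stored code share a winner."""
    return sum(a == b for a, b in zip(code, stored)) / len(code)


# Hypothetical stored codes: one winner index per CM.
memories = {"cat": (0, 1, 2), "dog": (0, 1, 0), "mouse": (2, 2, 1)}
noisy = (0, 1, 1)  # the code produced by the ambiguous input

scores = {name: overlap_fraction(noisy, code) for name, code in memories.items()}
total = sum(scores.values())
probs = {name: score / total for name, score in scores.items()}

for name in memories:
    print(name, round(scores[name], 3), round(probs[name], 3))
```

Cat and dog come out equally likely, and mouse is possible but less likely, which is the reading suggested above.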

The next figure is from an article by Prof. Rinkus. Unlike my depictions, his MACs have a hexagonal shape and each CM is a cluster of little circles (neurons).

 

radical2

A single MAC can learn a sequence of patterns over time by having horizontal connections from every neuron in the MAC connect, with a time delay, to every neuron (including itself) in the same MAC. Once it has learned these, if an input pattern activates the first SDR, then in the next time step it leads to the SDR that represents the second pattern that it was taught, and that in turn leads to the next. Let us assume that the MAC has many CMs (maybe 70) and each CM has 20 neurons. 20 to the power of 70 is a large number (roughly 1.2 × 10^91), and with this large capacity you can store many sequences without ‘crosstalk’ (or interference). Also, when presented with the start of a sequence that is ambiguous, the net can keep multiple possible endings of that sequence in memory at a particular time. (This reminds me of quantum computing, where multiple possibilities are tried at the same time.)
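Below is a toy sketch of both points: the size of the code space, and a chain of SDRs linked by time-delayed horizontal connections. It is heavily simplified and the names (`learn_sequence`, `next_code`) are mine. Weights are kept as a set of (from-cell, to-cell) pairs, a cell is a (CM index, neuron index) pair, and recall just picks the cell with the largest horizontal input in each CM, with no randomness.

```python
Q_BIG, K_BIG = 70, 20
capacity = K_BIG ** Q_BIG  # one winner per CM, K choices in each of Q CMs
print(len(str(capacity)), "decimal digits")


def learn_sequence(codes, weights):
    """Connect every ON cell at step t to every ON cell at step t+1."""
    for prev, nxt in zip(codes, codes[1:]):
        for q_prev, k_prev in enumerate(prev):
            for q_next, k_next in enumerate(nxt):
                weights.add(((q_prev, k_prev), (q_next, k_next)))


def next_code(current, weights, K):
    """In each CM, pick the cell with the most input from the ON cells."""
    result = []
    for q_next in range(len(current)):
        sums = [
            sum(((q, k), (q_next, k_next)) in weights for q, k in enumerate(current))
            for k_next in range(K)
        ]
        result.append(sums.index(max(sums)))
    return tuple(result)


# A tiny MAC: 4 CMs with 5 neurons each, taught one 3-step sequence.
weights = set()
sequence = [(0, 1, 2, 3), (1, 1, 0, 2), (4, 0, 0, 1)]
learn_sequence(sequence, weights)

step2 = next_code(sequence[0], weights, K=5)
step3 = next_code(step2, weights, K=5)
print(step2, step3)
```

Presenting the first code is enough to replay the rest of the sequence, one time step after another.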

Suppose Sparsey is trained on single words such as:

  1. THEN
  2. THERE
  3. THAT
  4. SAILBOAT

And so forth.

If the first four words it was trained on are the words above, and we now want to retrieve a word and present ‘TH’ (the two letters are presented singly: ‘T’ at time t1, ‘H’ at time t2), then in the above example there are three possibilities for the stored sequence we are attempting to retrieve (THEN, THERE and THAT). If the next letter that comes in is ‘E’, then there are only two possibilities (THEN and THERE). If the next letter that comes in is ‘R’, then we are left with just the possibility of the sequence of letters in the word “THERE”. When we use the principle that similar inputs yield similar SDRs, and we also insist that when a pattern is learned every ON neuron in one MAC learns to increase weights on the same connections as every other neuron, then at any time, all learned sequences stored in memory that match so far are possible, until the ambiguity is resolved by the next input.
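The narrowing of possibilities can be illustrated with ordinary string matching. To be clear, this is not how Sparsey does it (Sparsey keeps the candidates alive inside overlapping SDRs); the sketch only shows which stored words are still consistent after each letter. The helper name `candidates_after` is made up.

```python
stored_words = ["THEN", "THERE", "THAT", "SAILBOAT"]


def candidates_after(prefix, stored):
    """Stored sequences that are still consistent with the letters seen so far."""
    return [word for word in stored if word.startswith(prefix)]


seen = ""
for letter in "THER":
    seen += letter
    print(seen, "->", candidates_after(seen, stored_words))
```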

Think about the letter ‘A’ in the above 4 words. We see that ‘A’ occurs in THAT (one time) and SAILBOAT (in two places). There are 3 instances of ‘A’ and they cannot be represented in the exact same way (if they were, then there would be no clue of what comes next). ‘A’ in the context of THAT does not have the same exact representation as the first ‘A’ in SAILBOAT, and neither has the same exact SDR as the second ‘A’ in SAILBOAT. Nonetheless, they will have representations that are more similar to each other than to the letter ‘B’, for instance.

Remember that any sequence is represented at a series of time steps, with the position varying but not the sequence. Think of your own brain. Your past and your future are all compressed in the present moment. The past can be retrieved, and the future can be predicted, but at the moment, all you have is the present: a snapshot of neural firings in your brain. The same is true in Sparsey of a sequence such as SAILBOAT. When you reach the first ‘A’ of SAILBOAT, its code has all the information needed to complete that word, assuming that only the above 4 words were learned. There is no ambiguity. But that is only true because the pattern for this ‘A’ is slightly different from the patterns for the other ‘A’s (such as the ‘A’ in THAT). They don’t overlap completely.

So how does Sparsey achieve the property of similar inputs giving rise to similar memories?

First we need to know that each neuron in each CM of a particular MAC has exactly the same inputs.  It may not have the same weights applied to those inputs, but it has the same inputs.   The inputs come from a lower level, which might be a picture in pixels, or if we have a multilevel net, might be from another abstract level.   Initially, all weights are zero.

radical1

Sparsey’s core algorithm is called the Code Selection Algorithm (CSA).   We’ll say that in every CM there are K neurons.   In each MAC (there can be several per level in a multilevel net) there are Q CMs.

CSA Step 1 computes the input sums for all Q×K cells comprising the coding field.  Specifically, for each cell, a separate sum is computed for each of its major afferent synaptic projections.

The cells also have horizontal inputs from cells within some maximum perimeter around the MAC and may have signals also coming down on connections from a layer above them.   But we’ll focus on just the inputs coming from below. The following is a very simplified version of what happens:

As in typical neural nets, each neuron in a MAC has an activation ‘V’ equal to the sum of the product of weights on  a connection times the signal coming over that connection.

Then these sums are normalized so that none exceed one and none are less than zero, but they retain their relative magnitudes or ‘V’ values for each neuron.

Now find the max V in each CM and tentatively pick the neuron with that value to be the ON neuron in that CM.

Finally, a measure called G is computed as the average max-V across the Q CMs.  In the remaining CSA steps, G is used, in each CM, to transform the V distribution over the K cells into a final probability distribution from which a winner is picked.  G’s influence on the distributions can be summarized as follows.

  a) When high global familiarity is detected (G is close to 1), those distributions are exaggerated to bias the choice in favor of cells that have high input summations.
  b) When low global familiarity is detected (G is close to 0), those distributions are flattened so as to reduce bias due to local familiarity.

G does this indirectly, by modifying a ‘sigmoid’ curve that is applied to each neuron’s output.
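The simplified steps above can be put together in a short sketch. Please treat it as my own toy version, not the actual CSA. The assumptions are: inputs and weights are binary, V is scaled by dividing by the number of active inputs, the sigmoid’s height is 1 + G × `max_gain`, and `max_gain` and `slope` are constants I made up. The real algorithm has more steps and different details (see the papers).

```python
import math
import random


def csa_select(x, W, max_gain=100.0, slope=20.0, rng=random):
    """Pick one winner per CM for the binary input x.

    W[q][k] is the list of binary weights from each input onto cell k of CM q.
    max_gain and slope are made-up constants for this sketch.
    """
    active = sum(x)
    # V: weighted input sum for each cell, scaled into the range [0, 1].
    V = [[sum(w * xi for w, xi in zip(cell, x)) / active if active else 0.0
          for cell in cm] for cm in W]
    # G: the average, over the Q CMs, of the largest V in each CM.
    G = sum(max(cm) for cm in V) / len(V)
    # The sigmoid's height grows with G; at G = 0 the curve is completely flat.
    height = 1.0 + G * max_gain
    code, dists = [], []
    for cm in V:
        psi = [1.0 + (height - 1.0) / (1.0 + math.exp(-slope * (v - 0.5)))
               for v in cm]
        total = sum(psi)
        probs = [p / total for p in psi]  # final distribution for this CM
        code.append(rng.choices(range(len(cm)), weights=probs)[0])
        dists.append(probs)
    return tuple(code), G, dists


x = [1, 1, 0, 0]
# Q = 2 CMs, K = 3 cells per CM, 4 inputs.
untrained = [[[0, 0, 0, 0] for _ in range(3)] for _ in range(2)]
trained = [[[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]] for _ in range(2)]

_, g_new, dist_new = csa_select(x, untrained)
_, g_old, dist_old = csa_select(x, trained)
print("unfamiliar input: G =", g_new, dist_new[0])
print("familiar input:   G =", g_old, dist_old[0])
```

For the unfamiliar input every cell in a CM is equally likely to win, so the new code is effectively random. For the familiar input almost all of the probability goes to the cell that won before, so the old code is very likely to be reactivated.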

The lower level in the next picture has a sigmoid curve (the red S-shaped curve to the right) that has a normal height. The upper level has a sigmoid curve that has been flattened. We can see that in the lower level’s sigmoid function, Y-axis values are farther apart (at least in the middle of the ‘S’) than in the upper level’s. The lower level here, we assume, had a larger G than the upper level did, so the CSA calculates a taller sigmoid to apply to the neurons in that level. If a sigmoid is flattened, the probability of the most likely neuron is made closer to the probabilities of the second and third most likely, so there is a greater chance that a neuron other than the one with the highest weighted input summation is the one that will fire and become part of the memory for this input. Since low G means low confidence (or low familiarity), we do want the new SDR to have some differences from whatever SDR the collection of V’s seems closest to. Having probabilities that are close together makes differences more likely.

Suppose you see a prototypical cat that is just like the pet cat owned by your neighbor. You already have a memory that matches very closely (your G is high). Now suppose you see an exotic breed of cat that you’ve never encountered. It matches all stored traces of cats less well, and therefore the memory that the CSA creates for it should be somewhat different. So even though the V’s may approximate a cat (or intersection of cats) that you’ve seen before, applying the flattened sigmoid and then tossing the dice on which neuron will win in each CM will lead to at least some CMs with different neurons firing than in the prototypical cat representation. The flatter the sigmoid, the more likely a CM is to end up with a neuron other than the favored one being ON.

The connections from the inputs in the receptive field of the MAC (in the lower level) will strengthen to those neurons finally chosen in the SDR in the level above it.   Synapses are basically binary, though their strengths can decay, and neuron activations are binary too.
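A minimal sketch of that learning step, under my own assumptions: weights are plain 0/1 values stored as `W[q][k][i]` (from input i onto cell k of CM q), learning sets the weight to 1 from every active input onto each winning cell, and decay is left out entirely. The function name `learn` is made up.

```python
def learn(x, code, W):
    """Set to 1 the weight from every active input onto each CM's winner."""
    for q, winner in enumerate(code):
        for i, xi in enumerate(x):
            if xi:
                W[q][winner][i] = 1


x = [1, 0, 1, 0]  # binary input pattern from the level below
code = (2, 0)     # chosen winner in CM 0 and in CM 1
W = [[[0] * len(x) for _ in range(3)] for _ in range(2)]  # all weights start at zero

learn(x, code, W)
print(W[0])  # only cell 2 of CM 0 has gained weights
print(W[1])  # only cell 0 of CM 1 has gained weights
```

After this step, presenting the same input again gives the chosen cells the largest input sums in their CMs, which is what lets the stored code be reactivated later.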

radical12

Any finite net that stores many memories can run into a problem of interference, or “cross-talk”. The problem is that there are so many learned links that you can have similar patterns that differ by very few neurons and can be confused with each other. You can also get patterns that are hybrids of others and were never actually encountered in real life. The CSA actually freezes the number of SDRs a MAC can learn after a critical period, to attempt to avoid this problem. In a multilevel net this is not necessarily a limitation.

I sent a few questions about human mental abilities and weaknesses to Professor Rinkus, and he had interesting replies.

I asked about memories that are false, or partly false, and he said this:

Let’s consider an episodic memory example, my 10th birthday, with different features, where it was, who was there, etc.  That episodic memory as a whole is spread out across many of my macro-columns (“macs”), across all sensory modalities.  But those macs have been involved in another 50 more years of other episodic memories as well.  In general, the rate at which new SDR codes, and thus the rate at which crosstalk accrues, may differ between them.  So, say one mac M1 where a visual image of one of my friends at the party, John, is stored has had many more images of John and other people stored over the years, and is quite full (specifically, ‘quite full’ means that so many SDRs have been stored that the average Hamming distance between all those stored codes has gotten low).  But suppose another mac, M2, where a memory trace of some other feature of the party, say, “number of presents I got”, say 10, was stored ended up having far fewer SDRs stored in it over the years, and so, much less crosstalk.  (After all, the number of instances where I saw a person is vastly greater than the number of instances where I got presents, so the hypothetical example has some plausibility).  So now, when I try to remember the party, which ideally would mean reactivating the entire original memory trace, across all the macs involved, as accurately as possible, including with their correct temporal orders of activation, the chance of activating the wrong SDR in M1 (e.g., remembering image of other friend, Bill, instead of John), is higher than activating the wrong trace in M2…so I remember (Bill, 10) instead of (John, 10).   The overall trace I remember is then a mix of things that actually happened in different instances, e.g., confabulation.

He also said this:

Whenever you recognize any new input as familiar, reactivation of the original trace must be happening.  So, the act of creating new memories involves reactivation of old memories. But reactivating old memory traces becomes increasingly subject to errors due to increasing crosstalk.  So, if my macs are already pretty full, then as I create brand new memory traces, they could include components that are confabulations…i.e., the memories are wrong from inception.

So Professor Rinkus is saying that a false memory can be wrong not only due to an oversupply of similar memories that affects the retrieval process, but can be wrong even at the time it was stored!

I would add that some memories are false because you don’t remember the source.   If you are told at one point that as a child, you were lost in a mall, even if that’s not true, years later you may have a memory that you were, and you may even fill in details of how it happened and how you felt.

Then I asked this question:

According to Wikipedia: “Eidetic memory (sometimes called photographic memory) is an ability to vividly recall images from memory after only a few instances of exposure, with high precision for a brief time after exposure, without using a mnemonic device.” In your theory it would seem that everyone should have this memory, since every experience leaves a trace. Why then, do only a few people have this ability?

I include a part of his answer below:

My general answer is that when we are all infants/young and we have not stored much information (in the form of SDRs) in the macs comprising our cortex, and so the amount of crosstalk interference between memories (SDR codes, chains of SDRs, hierarchies of chains of SDRs) is low, we all have very good episodic memory, perhaps approaching eidetic to varying degrees and in various circumstances.  But as we accumulate experience, storing ever more SDRs into our macs, the level of crosstalk increases, and increasing mistakes (confabulations) are made.  From another point of view, since these confabulations are generally semantically reasonable, we can say that as we age, our growing semantic memory, i.e., knowledge of the similarity structure of the world, gradually becomes more dominant in determining our responses/behavior (we accumulate wisdom)….  I think those who retain extreme eidetic ability into their later years, and perhaps autistics, may have a brain difference  that makes the sigmoid stay much flatter than for normals, i.e., the sigmoid’s dependence on G is somehow muted.

His speculation makes sense because if the sigmoid is very flat, then new SDRs that are stored for new patterns will be less likely to overlap much with existing SDRs. Every cat you encounter that is slightly different from an old cat will have its own representation.

If you are interested in more details of the model (I’ve left out many), take a look at Professor Rinkus’s website (sparsey.com).

Sources:
(you can obtain both from the publications tab of Sparsey.com):
A Radically New Theory of how the Brain Represents and Computes with Probabilities – (2017)
Sparsey™: event recognition via deep hierarchical sparse distributed codes – (2014)

 
