Neural Arithmetic Logic Units – getting backpropagation nets to extrapolate

Backpropagation nets have a problem doing math. You can get them to learn a multiplication table, but when you try to use the net on problems where the answers are higher or lower than the ones used in training, they fail. In theory, they should be able to extrapolate, but in practice, they memorize, instead of learning the principles behind addition, multiplication, division, etc.

A group at Google DeepMind in England solved this problem.
They did this by modifying the typical backprop neuron as follows:

  1. They removed the bias input
  2. They removed the nonlinear activation function
  3. Instead of just using one weight on each incoming connection to the neuron, they use two. Both weights are learned by gradient descent, but a sigmoid function is applied to one, a hyperbolic tangent (tanh) function is applied to the other, and then the two results are multiplied together. In standard nets, a sigmoid or tanh function is not used on weights at all; instead these functions are applied to activations. The opposite is true here.

Here is the equation for computing the weight matrix. W is the final weight, and the variables M and W with the hat symbols are values that are combined to create that final composite weight:

W = \tanh(\hat{W}) \odot \sigma(\hat{M})

So what is the rationale behind all this?

First let’s look at what a sigmoid function looks like:


And now a hyperbolic tangent function (also known as ‘tanh’):


We see that the sigmoid function ranges (on the Y axis) between 0 and 1. The hyperbolic tangent ranges from -1 to 1. Both functions have a high rate of change when their x-values are fairly close to zero, but that rate of change flattens out the farther they get from that point.

So if you multiply these two functions together, the most the product can be is 1, the least is -1, and there is a bias to the composite weight result: it’s less likely to be fractional, and more likely to be -1, 1, or zero.
Why the bias?
The reason is that near x = 0 the derivative is large, so gradient descent takes its biggest steps there, and the learned value tends to move away from that point and settle where the curve is flat. Thus, tanh is biased to learn its saturation points (-1 and 1) and sigmoid is biased to learn its saturation points (0 and 1). The elementwise product of the two thus has saturation points at -1, 0, and 1.
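To make the saturation argument concrete, here is a minimal Python sketch of the composite weight for a single connection. The function name `nac_weight` and the sample parameter values are illustrative assumptions, not code from the paper.

```python
import math

def sigmoid(x):
    """Logistic function, ranging over (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def nac_weight(w_hat, m_hat):
    """Composite weight tanh(w_hat) * sigmoid(m_hat); always lies in (-1, 1).

    w_hat and m_hat are the two raw parameters learned by gradient descent.
    """
    return math.tanh(w_hat) * sigmoid(m_hat)

# Raw parameters far from zero push the composite weight toward 1, -1, or 0.
# Only raw parameters near zero give a clearly fractional weight.
for w_hat, m_hat in [(6.0, 6.0), (-6.0, 6.0), (6.0, -6.0), (0.5, 0.5)]:
    print(w_hat, m_hat, round(nac_weight(w_hat, m_hat), 3))
```

The first three pairs land very close to 1, -1, and 0 respectively, while the last pair stays fractional.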

So why have a bias? As they explain:

Our first model is the neural accumulator (NAC), which is a special case of a linear (affine) layer whose transformation matrix W consists just of -1’s, 0’s, and 1’s; that is, its outputs are additions or subtractions (rather than arbitrary rescalings) of rows in the input vector. This prevents the layer from changing the scale of the representations of the numbers when mapping the input to the output, meaning that they are consistent throughout the model, no matter how many operations are chained together.

As an example, if you want the neuron to compute 5 - 7 from the inputs 5 and 7, you don’t want those numbers multiplied by fractions; in this case you want weights of 1 and -1. Likewise, the result of this neuron’s addition could be fed into another neuron, and again, you don’t want it multiplied by a fraction before it is combined with that neuron’s other inputs.

This isn’t always true, though: one of their experiments was learning to calculate the square root, which required a weight to train to the value 0.5.

On my first read of the paper, I wasn’t sure why the net worked, and so I asked one of the authors, Andrew Trask, who replied that it works because:


  1. because it encodes numbers as real values (instead of as distributed representations)
  2. because the functions it learns over numbers extrapolate inherently (aka… addition/multiplication/division/subtraction) – so learning an attention mechanism over these functions leads to neural nets which extrapolate


The first point is important because many models assume that any particular number is coded by many neurons, each with different weights. In this model, one neuron, without any nonlinear function applied to its result, does math such as addition and subtraction.

It is true that real neurons are limited in the values they can represent. In fact, neurons fire at a constant, fixed amplitude and it’s just the frequency of pulses that increases when they get a higher input.

But ignoring that point, the units they have can extrapolate, because they do simple addition and subtraction (point #2).

But wait a minute – what about multiplication and division?

For those operations they make use of a mathematical property of logarithms: the log of (X * Y) is equal to log(X) + log(Y). So if you take logarithms of values before you feed them into an addition neuron, and then apply the inverse of the log (the exponential) to the result, you have the equivalent of multiplication. Subtracting logs in the same way gives division.

The log is differentiable, so the net can still learn by gradient descent.
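The log trick can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's code: the name `log_space_unit` is made up, the weights are set by hand instead of being learned, and the small `eps` plus the absolute value are there only to keep the logarithm defined.

```python
import math

def log_space_unit(inputs, weights, eps=1e-7):
    """Weighted sum of log-magnitudes, followed by exponentiation.

    Weights of 1 multiply, weights of -1 divide, and a weight of 0.5
    takes a square root.
    """
    total = sum(w * math.log(abs(x) + eps) for x, w in zip(inputs, weights))
    return math.exp(total)

product = log_space_unit([3.0, 4.0], [1.0, 1.0])     # approximately 3 * 4
quotient = log_space_unit([12.0, 4.0], [1.0, -1.0])  # approximately 12 / 4
root = log_space_unit([9.0], [0.5])                  # approximately sqrt(9)
print(product, quotient, root)
```

Because the unit works on magnitudes, this sketch loses the sign of negative inputs.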

So they now need to combine the addition/subtraction neurons with the multiplication/division neurons, and this diagram shows their method:



This fairly simple but clever idea is a breakthrough:

Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.

Neural Arithmetic Logic Units – Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, Phil Blunsom – Google DeepMind


Neurithmic Systems’ radical new way of thinking about memory and their program (Sparsey) that implements it

Professor Rod Rinkus of Neurithmic Systems came up with a net (he calls it SPARSEY) that is about memory: storing memories and retrieving them. No matter how many memories are already stored, the time to store a new memory, or to retrieve an old one, stays the same. There are some very promising aspects of his idea, and I will explain the general idea below. If you want to delve further, his actual papers are at his website.

Suppose you want to store a pattern, perhaps a number. You could store it as follows:


Here we will assume that only one bit can be ON at a time. With this constraint, we could only present 5 different numbers to the neural net, and it might learn to associate them with different positions of the green dot.

This is called a ‘localist’ representation.   One disadvantage of associating patterns this way is that similarity is lost.   The number ‘1’ might be associated with a green dot at the second position, or it might be associated with a green dot at the fifth position.   Also, you can’t store many patterns this way unless you have many neurons.

A more compact way to store data is with a type of number system. For instance, in everyday math we use base 10, with the digits 0 through 9, where a new digit position is added to represent 10, then 100, then 1000, etc. If we only want to use ON and OFF, we can limit our digits to zero and one. This gives us a base 2 number system. In that case,

the number zero would be:


The number one would be:


The number two would require an additional digit just like 10 in base 10:


Three would be:


And four would be:


We have already represented 5 numbers, this time with only 3 places needed, and we can represent more. For instance, if all three dots are green (1,1,1) then we have the number ‘7’ in this binary system.

This system is compact, but still, similar numbers are not coded similarly.   Suppose we measured similarity by overlap: the number of dots in the same position that have the green color.   We would see that the number zero has no overlap with the number one, though it is close numerically.  We would see that four has no overlap with three, though three does have one item that overlaps with two.
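The overlap counts mentioned above are easy to check with a short Python sketch. The helper names `to_bits` and `overlap` are illustrative; a 1 stands for a green (ON) dot.

```python
def to_bits(n, width=3):
    """Binary digits of n, most significant first, padded to `width` places."""
    return [(n >> i) & 1 for i in reversed(range(width))]

def overlap(a, b):
    """Number of positions where both codes have an ON bit."""
    return sum(1 for x, y in zip(a, b) if x == 1 and y == 1)

for first, second in [(0, 1), (3, 4), (2, 3)]:
    bits_a, bits_b = to_bits(first), to_bits(second)
    print(first, bits_a, second, bits_b, "overlap:", overlap(bits_a, bits_b))
```

Zero and one share no ON bits, neither do three and four, while two and three share exactly one.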

An ideal memory would code similar items similarly.   The more different the items were, the more different their representations should be.

In the brain, a sofa and a chair should be represented more similarly than a sofa and a mailbox, for instance.   A robin should have a representation closer to a parrot than to a giraffe.

If this can be accomplished, there are several advantages, which I will list after showing one of Gerard Rinkus’s storage units below.


The above is a representation of what he calls a MAC.   Each MAC is made up of several “Competitive Modules”.   In the brain, the Competitive Modules (which he abbreviates as CM) would correspond to cortical Minicolumns, and the MAC would correspond to a cortical MacroColumn.   In this illustration, we are looking at one MAC with three CMs in it.   Each CM has internal competition where only one neuron (in this case each CM has 3 neurons) can win the competition and turn on.   The others lose and turn off.

So what is the advantage of this?   First of all, since each CM can have 3 separate patterns, there are in total 3 * 3 * 3 patterns that can be represented – or 27.   This is more compact than a totally localist method (where of the 9 neurons only one can be on at a time).   It is not as compact as the example we gave of base 2 numbers.
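The three capacities being compared can be worked out directly. This is a small arithmetic sketch for the 9-neuron example, with variable names chosen just for illustration.

```python
num_cms = 3          # competitive modules in the MAC
neurons_per_cm = 3   # neurons in each module
total_neurons = num_cms * neurons_per_cm

localist_codes = total_neurons           # exactly one of the 9 neurons is ON
mac_codes = neurons_per_cm ** num_cms    # one winner per CM: 3 * 3 * 3
binary_codes = 2 ** total_neurons        # every neuron independently ON or OFF

print(localist_codes, mac_codes, binary_codes)
```

So with the same 9 neurons, the MAC scheme sits between the localist code and the full binary code.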

A Sparse-Distributed Representation (SDR) is a representation where only a small fraction of neurons are on. In the above MAC, only a third of neurons are ON, so it qualifies. We can interpret the fraction of a feature’s SDR code that overlaps with the stored SDR of a memory (hypothesis) as the probability/likelihood of that feature (hypothesis).

Using Rinkus’s CMs and MACs, we can introduce similarity into our representations that reflects similarity in the world.

Take a look at these 2 MACs


Notice that they do overlap in 2 out of 3 of the CMs.   We could say that their similarity is 2/3.   If they were identical, their similarity would be 3/3.   And so forth.
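One simple way to compute this similarity is to write each MAC code as the index of the winning neuron in each CM and count the CMs that agree. The tuples below are made-up codes, not the ones in the pictures, and `mac_similarity` is a hypothetical helper.

```python
def mac_similarity(code_a, code_b):
    """Fraction of CMs in which the same neuron is the winner."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same number of CMs")
    matches = sum(1 for a, b in zip(code_a, code_b) if a == b)
    return matches / len(code_a)

mac_left = (0, 2, 1)    # winner index in CM 0, CM 1, CM 2
mac_right = (0, 2, 0)   # same winners in the first two CMs only

print(mac_similarity(mac_left, mac_right))  # two of three CMs agree
print(mac_similarity(mac_left, mac_left))   # identical codes
```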

What are the advantages of representations that reflect similarity?

Well if you have a net (such as Sparsey) that automatically represents similar inputs in a similar way, then you automatically learn hierarchies.   For example, by looking at the SDR for ‘cat’ and ‘dog’, versus the SDRs for ‘cat’ and ‘fish’, you would see that ‘cat’ and ‘dog’ are more similar.

Another interesting advantage is this.   Suppose the MAC on the left represents a cat, and the MAC  on the right represents a dog.    Now you are presented with a noisy input that doesn’t exactly match cat or dog, and gets a representation such as:


This MAC representation overlaps the MACs for dog and for cat equally.   We could say that the probability that what the net saw was a cat is equal to the probability that it saw a dog.

So the fact that similar inputs yield similar representations means that we can look at an SDR as a probability distribution over several possibilities. This MAC overlaps the representations for dog and for cat by two CMs each, but it might overlap a representation for ‘mouse’ by just one CM.
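Here is a minimal sketch of reading a noisy code as a distribution over stored memories. The codes are invented for illustration, and dividing by the total so the values sum to 1 is my own assumption about how to turn overlap fractions into probabilities.

```python
def overlap_fraction(code_a, code_b):
    """Fraction of CMs in which two codes have the same winning neuron."""
    matches = sum(1 for a, b in zip(code_a, code_b) if a == b)
    return matches / len(code_a)

# Hypothetical stored codes: one winner index per CM.
stored = {"cat": (0, 2, 1), "dog": (0, 2, 0), "mouse": (1, 0, 2)}
noisy = (0, 2, 2)  # matches cat and dog in two CMs, mouse in one

fractions = {name: overlap_fraction(noisy, code) for name, code in stored.items()}
total = sum(fractions.values())
likelihoods = {name: value / total for name, value in fractions.items()}

print(fractions)
print(likelihoods)
```

Cat and dog come out equally likely, and mouse gets half their share.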

The next figure is from an article by Prof. Rinkus. Unlike my depictions, his MACs have a hexagonal shape and each CM is a cluster of little circles (neurons).



A single MAC can learn a sequence of patterns over time by having horizontal connections from every neuron in the MAC connect, with a time delay, to every neuron (including itself) in the same MAC. Once it has learned a sequence, an input pattern that activates the first SDR leads in the next time step to the SDR that represents the second pattern it was taught, and that in turn leads to the next. Let us assume that the MAC has many CMs (maybe 70) and each CM has 20 neurons. 20 to the power of 70 is a large number, and with this large capacity you can store many sequences without ‘crosstalk’ (or interference). Also, when presented with the start of a sequence that is ambiguous, the net can keep multiple possible endings of that sequence active at a particular time. (This reminds me of quantum computing, where multiple possibilities are tried at the same time.)

Suppose Sparsey is trained on single words such as:

  1. THEN
  2. THERE
  3. THAT
  4. SAILBOAT

And so forth.

If the first four words it was trained on are the words above, and we now want to retrieve a word and present ‘TH’ (the two letters are presented singly: “T” at time t1, “H” at time t2), then there are three possibilities for the stored sequence we are attempting to retrieve (THEN, THERE and THAT). If the next letter that comes in is ‘E’, then there are only two possibilities (THEN and THERE). If the next letter that comes in is ‘R’, then we are left with just the sequence of letters in the word “THERE”. When we use the principle that similar inputs yield similar SDRs, and we also insist that when a pattern is learned every ON neuron in one MAC learns to increase weights on the same connections as every other neuron, then at any time, all learned sequences stored in memory that match so far remain possible, until the ambiguity is resolved by the next input.
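The narrowing of candidates can be imitated with plain string matching. To be clear, this is only an analogy: Sparsey holds the candidates implicitly in one SDR rather than in a list, and `matching_sequences` is a hypothetical helper.

```python
def matching_sequences(stored_words, prefix):
    """Stored sequences that are still consistent with the letters seen so far."""
    return [word for word in stored_words if word.startswith(prefix)]

stored_words = ["THEN", "THERE", "THAT", "SAILBOAT"]

# Present the letters one at a time and watch the candidate set shrink.
for prefix in ["T", "TH", "THE", "THER"]:
    print(prefix, matching_sequences(stored_words, prefix))
```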

Think about the letter ‘A’ in the above 4 words. We see that ‘A’ occurs in THAT (one time) and SAILBOAT (in two places). There are 3 instances of ‘A’ and they cannot be represented in the exact same way (if they were, then there would be no clue of what comes next). ‘A’ in the context of THAT does not have the same exact representation as the first ‘A’ in SAILBOAT, and neither has the same exact SDR as the second ‘A’ in SAILBOAT. Nonetheless, they will have representations that are more similar to each other than to the letter ‘B’, for instance.

Remember that any sequence is represented at a series of time steps, with the position varying but not the sequence. Think of your own brain. Your past and your future are all compressed in the present moment. The past can be retrieved, and the future can be predicted, but at the moment, all you have is the present: a snapshot of neural firings in your brain. The same is true in Sparsey of a sequence such as SAILBOAT. When you reach the first ‘A’ of SAILBOAT, the code has all the information needed to complete that word, assuming that only the above 4 words were learned. There is no ambiguity. But that is only true because the pattern for this ‘A’ is slightly different from the patterns for the other ‘A’s (such as the ‘A’ in THAT). They don’t overlap completely.

So how does Sparsey achieve the property of similar inputs giving rise to similar memories?

First we need to know that each neuron in each CM of a particular MAC has exactly the same inputs.  It may not have the same weights applied to those inputs, but it has the same inputs.   The inputs come from a lower level, which might be a picture in pixels, or if we have a multilevel net, might be from another abstract level.   Initially, all weights are zero.


Sparsey’s core algorithm is called the Code Selection Algorithm (CSA).   We’ll say that in every CM there are K neurons.   In each MAC (there can be several per level in a multilevel net) there are Q CMs.

CSA Step 1 computes the input sums for all Q×K cells comprising the coding field.  Specifically, for each cell, a separate sum is computed for each of its major afferent synaptic projections.

The cells also have horizontal inputs from cells within some maximum perimeter around the MAC and may have signals also coming down on connections from a layer above them.   But we’ll focus on just the inputs coming from below. The following is a very simplified version of what happens:

As in typical neural nets, each neuron in a MAC has an activation ‘V’ equal to the sum, over its incoming connections, of the weight on each connection times the signal coming over that connection.

Then these sums are normalized so that none exceed one and none are less than zero, but they retain their relative magnitudes or ‘V’ values for each neuron.

Now find the max V in each CM and tentatively pick the neuron with that value to be the ON neuron in that CM.

Finally, a measure called G is computed as the average max-V across the Q CMs.  In the remaining CSA steps, G is used, in each CM, to transform the V distribution over the K cells into a final probability distribution from which a winner is picked.  G’s influence on the distributions can be summarized as follows.

  1. When high global familiarity is detected (G is close to 1), those distributions are exaggerated to bias the choice in favor of cells that have high input summations.
  2. When low global familiarity is detected (G is close to 0), those distributions are flattened so as to reduce bias due to local familiarity.

G does this indirectly, by modifying a ‘sigmoid’ curve that is applied to each neuron’s output.

The lower level in the next picture has a sigmoid curve (the red S-shaped curve to the right) of normal height. The upper level has a sigmoid curve that has been flattened. In the lower level’s sigmoid, Y-axis values are farther apart (at least in the middle of the ‘S’) than in the upper level’s. The lower level here, we assume, had a larger G than the upper level did, so the CSA calculates a taller sigmoid to apply to the neurons in that level. If a sigmoid is flattened, the probability of the most likely neuron is brought closer to the probabilities of the second and third most likely, so there is a greater chance that a neuron other than the one with the highest weighted input summation will fire and become part of the memory for this input. Since low G means low confidence (or low familiarity), we do want the new SDR to have some differences from whatever SDR the collection of V’s seems closest to. Having probabilities that are close together makes differences more likely.

Suppose you see a prototypical cat that is just like the pet cat owned by your neighbor.   You already have a memory that matches very closely (your G is high).   Now suppose you see an exotic breed of cat that you’ve never encountered.   It matches all stored traces of cats less well, and therefore the memory that the CSA creates for it should be somewhat different.   So even though the V’s may approximate a cat (or intersection of cats) that you’ve seen before, applying the flattened sigmoid and then using a toss of the dice on which neuron will win in each CM, will lead to at least some CMs with different neurons firing than in the prototypical cat representation.  The flatter the sigmoid, the more likely a CM is to have finally selected a different neuron than the favored one to be On.
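The selection steps described above can be sketched roughly as follows. This is a toy version under loud assumptions: the function names, the `steepness` and `max_height` parameters, and the exact way G scales the sigmoid are all invented for illustration and are not Sparsey's actual equations. It assumes the V values have already been normalized to lie between 0 and 1.

```python
import math
import random

def global_familiarity(v_by_cm):
    """G: the average, over all CMs, of the largest V in each CM."""
    return sum(max(vs) for vs in v_by_cm) / len(v_by_cm)

def winner_probabilities(vs, g, steepness=10.0, max_height=100.0):
    """Turn one CM's V values into win probabilities.

    Assumed mapping: the sigmoid's height grows with G, so a high G
    exaggerates differences and G = 0 gives a flat (uniform) distribution.
    """
    height = max_height * g
    scores = [1.0 + height / (1.0 + math.exp(-steepness * (v - 0.5))) for v in vs]
    total = sum(scores)
    return [score / total for score in scores]

def select_code(v_by_cm, rng):
    """Pick one winner per CM by sampling from the G-shaped distributions."""
    g = global_familiarity(v_by_cm)
    return [
        rng.choices(range(len(vs)), weights=winner_probabilities(vs, g))[0]
        for vs in v_by_cm
    ]

familiar = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # strong match
novel = [[0.2, 0.1, 0.0], [0.1, 0.2, 0.0], [0.0, 0.1, 0.2]]     # weak match

for name, v_by_cm in [("familiar", familiar), ("novel", novel)]:
    g = global_familiarity(v_by_cm)
    print(name, "G =", round(g, 2), winner_probabilities(v_by_cm[0], g))
print(select_code(novel, random.Random(0)))
```

With the familiar input the favored neuron in each CM wins almost every time, while with the novel input the probabilities are much closer together, so the sampled code is more likely to differ from the tentative max-V choice.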

The connections from the inputs in the receptive field of the MAC (in the lower level) will strengthen to those neurons finally chosen in the SDR in the level above it.   Synapses are basically binary, though their strengths can decay, and neuron activations are binary too.


Any finite net that stores many memories can run into a problem of interference, or “cross-talk”. The problem is that there are so many learned links that you can have similar patterns that differ by very few neurons and can be confused with each other. You can also get patterns that are hybrids of others and never were actually encountered in real life. The CSA actually freezes the number of SDRs a MAC can learn after a critical period, to attempt to avoid this problem. In a multilevel net this is not necessarily a limitation.

I sent a few questions about human mental abilities and weaknesses to Professor Rinkus and he had interesting replies.

I asked about memories that are false, or partly false, and he said this:

Let’s consider an episodic memory example, my 10th birthday, with different features, where it was, who was there, etc.  That episodic memory as a whole is spread out across many of my macro-columns (“macs”), across all sensory modalities.  But those macs have been involved in another 50 more years of other episodic memories as well.  In general, the rate at which new SDR codes, and thus the rate at which crosstalk accrues, may differ between them.  So, say one mac M1 where a visual image of one of my friends at the party, John, is stored has had many more images of John and other people stored over the years, and is quite full (specifically, ‘quite full’ means that so many SDRs have been stored that the average Hamming distance between all those stored codes has gotten low).  But suppose another mac, M2, where a memory trace of some other feature of the party, say, “number of presents I got”, say 10, was stored ended up having far fewer SDRs stored in it over the years, and so, much less crosstalk.  (After all, the number of instances where I saw a person is vastly greater than the number of instances where I got presents, so the hypothetical example has some plausibility).  So now, when I try to remember the party, which ideally would mean reactivating the entire original memory trace, across all the macs involved, as accurately as possible, including with their correct temporal orders of activation, the chance of activating the wrong SDR in M1 (e.g., remembering image of other friend, Bill, instead of John), is higher than activating the wrong trace in M2…so I remember (Bill, 10) instead of (John, 10).   The overall trace I remember is then a mix of things that actually happened in different instances, e.g., confabulation.

He also said this:

Whenever you recognize any new input as familiar, reactivation of the original trace must be happening.  So, the act of creating new memories involves reactivation of old memories. But reactivating old memory traces becomes increasingly subject to errors due to increasing crosstalk.  So, if my macs are already pretty full, then as I create brand new memory traces, they could include components that are confabulations…i.e., the memories are wrong from inception.

So Professor Rinkus is saying that a false memory can be wrong not only due to an oversupply of similar memories that affects the retrieval process, but can be wrong even at the time it was stored!

I would add that some memories are false because you don’t remember the source.   If you are told at one point that as a child, you were lost in a mall, even if that’s not true, years later you may have a memory that you were, and you may even fill in details of how it happened and how you felt.

Then I asked this question:

According to Wikipedia: “Eidetic memory (sometimes called photographic memory) is an ability to vividly recall images from memory after only a few instances of exposure, with high precision for a brief time after exposure, without using a mnemonic device.” In your theory it would seem that everyone should have this memory, since every experience leaves a trace. Why then, do only a few people have this ability?

I include a part of his answer below:

My general answer is that when we are all infants/young and we have not stored much information (in the form of SDRs) in the macs comprising our cortex, and so the amount of crosstalk interference between memories (SDR codes, chains of SDRs, hierarchies of chains of SDRs) is low, we all have very good episodic memory, perhaps approaching eidetic to varying degrees and in various circumstances.  But as we accumulate experience, storing ever more SDRs into our macs, the level of crosstalk increases, and increasing mistakes (confabulations) are made.  From another point of view, since these confabulations are generally semantically reasonable, we can say that as we age, our growing semantic memory, i.e., knowledge of the similarity structure of the world, gradually becomes more dominant in determining our responses/behavior (we accumulate wisdom)….  I think those who retain extreme eidetic ability into their later years, and perhaps autistics, may have a brain difference  that makes the sigmoid stay much flatter than for normals, i.e., the sigmoid’s dependence on G is somehow muted.

His speculation makes sense because if the sigmoid is very flat, then new SDRs that are stored for new patterns will be less likely to overlap much with existing SDRs.   Every cat you encounter that is slightly different than an old cat, will have its own representation.

If you are interested in more details of the model (I’ve left out many), take a look at Professor Rinkus’s website.

Both of the following papers can be obtained from the publications tab of his website:
A Radically New Theory of how the Brain Represents and Computes with Probabilities – (2017)
Sparsey™: event recognition via deep hierarchical sparse distributed codes – (2014)


Making Neural Nets more decipherable and closer to Computers

In an article titled “Neural Turing Machines”, three researchers from Google DeepMind (Alex Graves, Greg Wayne, and Ivo Danihelka) describe a neural net that has a new feature, a memory bank. The system is similar in this respect to a Turing Machine, which was originally proposed by Alan Turing in 1936. His hypothetical machine had a read/write head that wrote on squares on a tape, and could move to other squares and read from them as well. So it had a memory. In theory, it could compute anything that modern computers could compute, given enough time.

One advantage of making a Neural Net that is also a Turing machine is that it can be trained with gradient descent algorithms.   That means it doesn’t just execute algorithms, it learns algorithms (though, if you want to be fanatical, you might note that since a Turing machine can simulate any recipe that a computer can execute, it could simulate a neural net that learns as well).

The authors say this:

Computer programs make use of three fundamental mechanisms: elementary operations (e.g., arithmetic operations), logical flow control (branching), and external memory, which can be written to and read from in the course of computation. Despite its wide-ranging success in modelling complicated data, modern machine learning has largely neglected the use of logical flow control and external memory.

Recurrent neural networks (RNNs) …are Turing-Complete and therefore have the capacity to simulate arbitrary procedures, if properly wired. Yet what is possible in principle is not always what is simple in practice. We therefore enrich the capabilities of standard recurrent networks to simplify the solution of algorithmic tasks. This enrichment is primarily via a large, addressable memory, so, by analogy to Turing’s enrichment of finite-state machines by an infinite memory tape, we dub our device a “Neural Turing Machine” (NTM). Unlike a Turing machine, an NTM is a differentiable computer that can be trained by gradient descent, yielding a practical mechanism for learning programs.

They add that in humans, the closest analog to a Turing Machine is ‘working memory’ where information can be stored and rules applied to that information.

…In computational terms, these rules are simple programs, and the stored information constitutes the arguments of these programs.

A Neural Turing Machine is designed

to solve tasks that require the application of approximate rules to “rapidly-created variables.” Rapidly-created variables are data that are quickly bound to memory slots, in the same way that the number 3 and the number 4 are put inside registers in a conventional computer and added to make 7.

… In [human] language, variable-binding is ubiquitous; for example, when one produces or interprets a sentence of the form, “Mary spoke to John,” one has assigned “Mary” the role of subject, “John” the role of object, and “spoke to” the role of the transitive verb.

A Neural Turing Machine (NTM) architecture contains two components: a neural network controller and a memory bank.

Like most neural networks, the controller interacts with the external world via input and output vectors. Unlike a standard network, it also interacts with a memory matrix…. By analogy to the Turing machine we refer to the network outputs that parametrize these operations as “heads.”
Crucially, every component of the architecture is differentiable, making it straightforward to train with gradient descent. We achieved this by defining ‘blurry’ read and write operations that interact to a greater or lesser degree with all the elements in memory (rather than addressing a single element, as in a normal Turing machine or digital computer).

In a regular computer, a number is retrieved by fetching it at a given address.

Their net has two differences in retrieval from a standard computer.   First of all, they retrieve  an entire vector of numbers from a particular address.   Think of a rectangular matrix, where each row number is an address, and the row itself is the vector that is retrieved.

Secondly, instead of retrieving at just one address, there is a vector of weights that controls the retrieval at multiple addresses.    The weights in that vector add up to ‘1’.   Think of a memory matrix consisting of 5 vectors.   There will be 5 corresponding weights.

If the weights were:

[0, 0, 1, 0, 0]

then only one vector will be retrieved, the vector at the third row of the matrix. This is similar to ordinary location-based addressing in computers or Turing machines. You can also shift that ‘1’ each cycle, so that it retrieves the vector adjacent to the one retrieved before.

Now think of the following vector of weights:

[0, 0.3, 0.7, 0, 0]

In this case two vectors are retrieved (one from the second row, and one from the third). The first one has all its elements multiplied by 0.3, the second has all its elements multiplied by 0.7, and then the two are added. This gives one resultant vector. They say this is a type of ‘blurry’ retrieval.
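A weighted read like this takes only a few lines of plain Python. The memory contents below are made up for illustration, and `read_memory` is a hypothetical helper name.

```python
def read_memory(memory, weights):
    """Weighted sum of the memory rows: one weight per row (address)."""
    if len(memory) != len(weights):
        raise ValueError("need exactly one weight per memory row")
    width = len(memory[0])
    return [
        sum(w * row[j] for w, row in zip(weights, memory))
        for j in range(width)
    ]

# Five made-up memory vectors, one per address.
memory = [
    [1.0, 0.0],
    [0.0, 10.0],
    [20.0, 0.0],
    [5.0, 5.0],
    [9.0, 9.0],
]

sharp = read_memory(memory, [0, 0, 1, 0, 0])       # just the third row
blurry = read_memory(memory, [0, 0.3, 0.7, 0, 0])  # 0.3 * row 2 + 0.7 * row 3
print(sharp, blurry)
```

The sharp read returns the third row unchanged, while the blurry read returns a blend of the second and third rows, roughly [14, 3] for this memory.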

They use the same idea when writing to memory – a vector is used to relatively weight the different values written to memory.

This vector-multiplication method of retrieval allows the entire mechanism to be trained by gradient descent. It also can be thought of as an ‘attentional mechanism’ where the focus is on the vectors with relatively high corresponding weights.

Some other nets do a probabilistic type of addressing, where there is a probability distribution over all the vectors, and at each cycle the net uses the most probable one (perhaps with a random component). But since neural Turing machines learn by gradient descent, the designers instead had to use the distribution to compute a weighted sum of the memory vectors, and that sum is what is retrieved. This was not a bug, but a feature!

They say:

The degree of blurriness is determined by an attentional “focus” mechanism that constrains each read and write operation to interact with a small portion of the memory, while ignoring the rest… Each weighting, one per read or write head, defines the degree to which the head reads or writes at each location. A head can thereby attend sharply to the memory at a single location or weakly to the memory at many locations.

Writing to memory is done in two steps:

we decompose each write into two parts: an erase followed by an add.
Given a weighting w_t emitted by a write head at time t, along with an erase vector e_t whose M elements all lie in the range (0,1), the memory vectors M_{t-1}(i) from the previous time-step are modified as follows:

\tilde{M}_t(i) = M_{t-1}(i) [\mathbf{1} - w_t(i) e_t]

where 1 is a row-vector of all 1’s, and the multiplication against the memory location acts point-wise. Therefore, the elements of a memory location are reset to zero only if both the weighting at the location and the erase element are one; if either the weighting or the erase is zero, the memory is left unchanged.
Each write head also produces a length M add vector a_t, which is added to the memory after the erase step has been performed:

M_t(i) = \tilde{M}_t(i) + w_t(i) a_t

The combined erase and add operations of all the write heads produces the final content of the memory at time t. Since both erase and add are differentiable, the composite write operation is differentiable too.
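The erase-then-add write described in the quote can be sketched for a single write head like this. The function name `write_memory` and the sample values are illustrative; in the real network the weighting, erase and add vectors come out of the controller.

```python
def write_memory(memory, weights, erase, add):
    """Erase-then-add write for one head.

    memory:  list of rows (one per address), each a list of numbers
    weights: one write weight per row
    erase:   erase vector, one element per column, each between 0 and 1
    add:     add vector, one element per column
    """
    new_memory = []
    for row, w in zip(memory, weights):
        # Erase step: each element shrinks by the factor (1 - w * e).
        erased = [m * (1.0 - w * e) for m, e in zip(row, erase)]
        # Add step: the add vector is scaled by the same write weight.
        new_memory.append([m + w * a for m, a in zip(erased, add)])
    return new_memory

memory = [[1.0, 2.0], [3.0, 4.0]]
updated = write_memory(memory, weights=[0.0, 1.0], erase=[1.0, 0.0], add=[5.0, 5.0])
print(updated)
```

The first row has weight 0 and is left untouched. In the second row, the first element is erased and replaced by 5, while the second element keeps its old value and gains 5.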

The network that outputs these vectors and that reads and writes to memory, as well as taking inputs and producing outputs, can be a recurrent neural network or a plain feedforward network. In either case, the vector retrieved from memory is then fed back, along with the inputs, into the net.

The authors trained their net on various problems, such as copying a sequence of numbers, or retrieving the next number in an arbitrary sequence given the one before it. It came up with algorithms such as this one, for copying sequences of numbers. (In the following, a ‘head’ can be either a read head or a write head, and has a vector of weights associated with it that weights the various memory vectors for combination and retrieval, or for writing.)

initialise: move head to start location
while input delimiter not seen do
    receive input vector
    write input to head location
    increment head location by 1
end while
return head to start location
while true do
    read output vector from head location
    emit output
    increment head location by 1
end while

This is essentially how a human programmer would perform the same task in a low-level programming language. In terms of data structures, we could say that NTM has learned how to create and iterate through arrays. Note that the algorithm combines both content-based addressing (to jump to start of the sequence) and location-based addressing (to move along the sequence).
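For comparison, here is roughly what that pseudocode looks like as ordinary Python. This is my own hand-written sketch, not anything the NTM produces: the head is a plain integer index rather than a soft weighting, the name `copy_sequence` and the "END" delimiter are invented, and the second loop stops after the stored items instead of running forever.

```python
def copy_sequence(inputs, delimiter="END", memory_size=16):
    """Copy inputs up to the delimiter, using an explicit memory and head."""
    memory = [None] * memory_size
    start = 0
    head = start                      # initialise: move head to start location

    for vector in inputs:             # while input delimiter not seen
        if vector == delimiter:
            break
        if head >= memory_size:
            raise ValueError("sequence is longer than the memory")
        memory[head] = vector         # write input to head location
        head += 1                     # increment head location by 1

    length = head - start
    head = start                      # return head to start location

    outputs = []
    for _ in range(length):           # bounded stand-in for 'while true'
        outputs.append(memory[head])  # read output vector and emit it
        head += 1
    return outputs

copied = copy_sequence([[1, 0], [0, 1], [1, 1], "END", [9, 9]])
```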

The way the NTM solves problems is easier to understand than trying to decipher a standard recurrent neural net, because you can look at how memory is being addressed, and what is being retrieved and written to memory at any point.
There is more to the NTM than I have explained above as you can see from the following diagram from their paper:


Take home lesson: in the authors’ experiments, the Neural Turing Machine outperformed existing architectures such as LSTMs (neural nets where each unit has a memory cell, plus trainable gates to decide what to forget and what to remember) on tasks like copying, and it generalized better to longer sequences as well. It is also easier to understand what the net is doing, especially if you use a ‘feedforward net’ as the ‘controller’. The net doesn’t just passively compute outputs, it decides what to write to memory, and what to retrieve from memory.

Neural Turing Machines by Alex Graves, Greg Wayne and Ivo Danihelka – Google DeepMind, London, UK

Making Recurrent neural net weights decipherable – new ideas.

One problem with neural nets is that after training, their inner workings are hard to interpret.
The problem is even worse with recurrent neural networks, where the hidden layer feeds its outputs back into itself, alongside the inputs, at the next time step.

Before I talk about how the problem has been tackled, I should mention an improvement to standard recurrent nets, which its authors (Sepp Hochreiter and Jürgen Schmidhuber) called LSTM (Long Short-Term Memory). The inventors of this net realized that backpropagation isn’t limited to training a relation between two patterns; it can also be used to train gates that control the learning by the other gates. One such gate is a ‘forget gate’. It applies a ‘sigmoid function’ to the weighted sum of its inputs. Sigmoid functions are shaped like a slanted letter ‘S’, and the bottom and top of the ‘S’ are at zero and 1 respectively. This means that if you multiply a signal by the output of a sigmoid function, at one extreme you are multiplying by zero, so the product is zero too and no signal gets through the gate. At the other extreme, you are multiplying by 1, so the entire signal gets through. Since sigmoid gates are differentiable, backpropagation can be used on them. In an LSTM, you have a cell state that holds a memory value, as well as one or more outputs. In addition to the standard training, you also train a gate to decide how much of the past ‘memory’ to forget on each time step as a sequence of inputs is presented to the net. Good explanations of LSTMs are available elsewhere, but the point to remember is that you can train gates to control the learning process of other gates.
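Here is a tiny sketch of the gating idea on its own. It is not a full LSTM: in a real one the gate’s pre-activation is a learned weighted sum of the current input and the previous output, whereas here I simply hand it in as a number. The function names and values are my own.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forget_step(cell_state, gate_preactivation):
    """Scale the stored memory by a gate value between 0 and 1."""
    return sigmoid(gate_preactivation) * cell_state

cell = np.array([2.0, -3.0, 0.5])

# A strongly positive pre-activation opens the gate: memory is kept.
kept = forget_step(cell, np.array([10.0, 10.0, 10.0]))

# A strongly negative pre-activation closes the gate: memory is wiped.
dropped = forget_step(cell, np.array([-10.0, -10.0, -10.0]))
```

Since the sigmoid is smooth, the error signal can flow back through the multiplication and adjust whatever weights produced the pre-activation.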

So back to making sense of the weights of recurrent nets. One approach is the IndRNN (Independently Recurrent Neural Network). If you recall, a recurrent net with 5 hidden nodes would not only feed forward 5 signals into each neuron of its output layer, but would also send 5 branches with the signals from the 5 hidden nodes as 5 extra ‘inputs’ to join the normal inputs in the next time step. If you had 8 inputs, then in total you would have 13 signals feeding into every hidden node. Once a net like this is trained, the intuitive meaning of the weights is hard to unravel, so the authors asked: why not just feed each hidden node into itself, thus keeping the hidden nodes independent of each other? Each node still gets all the normal signals from the inputs, but in the above example, instead of getting 5 signals from the hidden layer’s previous time step as well, it gets just one extra signal: its own output from the previous time step. This may seem to reduce the power of the net since there are fewer connections, but the authors report that it actually makes the net more powerful. One plus is that with this connectivity, many layers can be stacked and trained as a deep net. Another plus is that the neurons don’t have to use ‘S’ shaped functions; they can work with non-saturated activation functions such as ReLU (rectified linear unit, which is a diagonal line when the weighted sum of neural inputs is zero and above, and otherwise is a horizontal line with value zero).
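The difference from an ordinary recurrent layer shows up in a single line: the recurrent weight is a vector multiplied element-wise, not a full matrix. Below is a minimal sketch of one IndRNN time step using the 8-input, 5-hidden-node example from above. The random weights and the helper names are assumptions for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def indrnn_step(x, h_prev, W, u, b):
    """One time step: each hidden node sees all inputs but only its own past."""
    return relu(W @ x + u * h_prev + b)

rng = np.random.default_rng(0)
n_inputs, n_hidden = 8, 5
W = rng.normal(size=(n_hidden, n_inputs))  # input weights (full matrix)
u = rng.uniform(0.5, 1.0, size=n_hidden)   # one recurrent weight per node
b = np.zeros(n_hidden)

x = rng.normal(size=n_inputs)
h_prev = np.ones(n_hidden)
h = indrnn_step(x, h_prev, W, u, b)

# Nudge the previous state of node 2 only, and recompute.
h_prev_changed = h_prev.copy()
h_prev_changed[2] += 1.0
h_changed = indrnn_step(x, h_prev_changed, W, u, b)
```

Comparing `h` and `h_changed` shows the independence: only node 2 can differ, because no other node receives node 2’s previous output within the layer.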


It is easier to understand what a net like this is doing than a traditional recurrent net.

Another ingenious idea came from a paper titled Opening the Black Box: Low-dimensional dynamics in high-dimensional recurrent neural networks, by David Sussillo of Stanford and Omri Barak of the Technion.

A recurrent network is a non-linear dynamical system, in that at any time step, the output of a computation is used for the inputs of the next time step, where the same computation is made. Once the weights are learned, you can write the computation of the net as one large equation. In the equation below, the J matrix holds the weights from the context (hidden units feeding back), the B matrix holds the weights for the regular inputs, and h is a function such as hypertangent. The symbol x is the union of u and r, where u are the signals from the input neurons.


The systems described by these equations can have attractors, such as fixed points. You can think of a fixed point as being at the bottom of a basin in a landscape. If you roll a marble anywhere into the valley, it will roll to the bottom. In the space of patterns, all patterns that are in the basin will evolve over time to the pattern at the bottom. Attractors do not have to be fixed points; they can be lines, or they can be a repeating sequence of points (the sequence repeats as time goes by), or they can never repeat but still be confined in a finite space. Those last trajectories in pattern space are called ‘strange attractors’. A fixed point can be a point where all neighboring patterns eventually end up, or it can be a repeller, so that all patterns in its neighborhood evolve away from it. Another interesting type of fixed point is a saddle. Here patterns in some directions evolve toward the point, but patterns in other directions evolve away from it. Think of a saddle on a horse. You can fall off sideways (that would be the ‘repeller’ direction), but if you were jolted forward and upward in the saddle, you would slide back to the center (the ‘attractor’ direction).


So Sussillo and Barak looked for fixed points in recurrent networks. They also looked for ‘slow points’, that is, points that attract but from which the state eventually drifts away. I should mention here that, just like in a basin, the area around a fixed point is approximately linear (if you are looking at a small area). As patterns approach an attractor, they usually start off quickly, but their progress slows down the closer they get to it.

The authors write:

Finding stable fixed points is often as easy as running the system dynamics until it converges (ignoring limit cycles and strange attractors). Finding repellers is similarly done by running the dynamics backwards. Neither of these methods, however, will find saddles. The technique we introduce allows these saddle points to be found, along with both attractors and repellers. As we will demonstrate, saddle points that have mostly stable directions, with only a handful of unstable directions, appear to be of high significance when studying how RNNs accomplish their tasks.

Why is finding saddles valuable?

A saddle point with one unstable mode can funnel a large volume of phase space through its many stable modes, and then send them to two different attractors depending on which direction of the unstable mode is taken.

Consider a system of first-order differential equations

dx/dt = F(x)

where x is an N-dimensional state vector and F is a vector function that defines the update rules (equations of motion) of the system. We wish to find values around which the system is approximately linear. Using a Taylor series expansion, we expand F(x) around a candidate point in phase space:

(A Taylor expansion uses the idea that if you know the value of a function at a point x, you can approximate the value of the function at a nearby point (x + delta-x) using first-order derivatives, second-order derivatives, and so on up to n’th-order derivatives.)

The authors say that “Because we are interested in the linear regime, we want the first derivative term of the right hand side to dominate the other terms, so that


They say that this observation “motivated us to look for regions where the norm of the dynamics, |F(x)|, is either zero or small. To this end, we define an auxiliary scalar function.” In the caption of the equation, they explain that there is an intuitive correspondence to speed in the real physical world.
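As a rough illustration of the idea, the sketch below minimizes the auxiliary function, which in the paper is q(x) = ½|F(x)|², on a toy two-neuron system of my own choosing. Everything specific here is an assumption: the weight matrix J, the absence of inputs, the starting point, and plain gradient descent standing in for whatever optimizer the authors actually used.

```python
import numpy as np

# Toy dynamics dx/dt = F(x) = -x + J tanh(x), with a made-up weight matrix.
J = np.array([[1.5, 0.0],
              [0.0, 1.5]])

def F(x):
    return -x + J @ np.tanh(x)

def jacobian(x):
    # dF_i/dx_j = -delta_ij + J_ij * (1 - tanh(x_j)^2)
    return -np.eye(len(x)) + J * (1.0 - np.tanh(x) ** 2)

def q(x):
    """Half the squared speed of the state; zero exactly at fixed points."""
    return 0.5 * np.sum(F(x) ** 2)

def find_slow_point(x0, lr=0.5, steps=2000):
    """Gradient descent on q; the gradient of q is jacobian(x).T @ F(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * jacobian(x).T @ F(x)
    return x

point = find_slow_point([1.0, 0.0])
eigenvalues = np.sort(np.linalg.eigvals(jacobian(point)).real)
```

From this starting point the search lands on a fixed point whose linearization has one negative and one positive eigenvalue, which is to say a saddle. Simply running the dynamics forward from a generic starting point would slide past it toward one of the stable fixed points, which is the authors’ argument for searching on q instead.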

A picture that shows a saddle with attractors on either side follows:


The authors trained recurrent nets on several problems, and found saddles between attractors, which allowed them to understand how the net was solving problems and representing data. One of the more difficult problems they tried was to train a recurrent net to produce a sine wave given an input that represented the desired frequency. They would present an input whose amplitude represented the desired frequency (the higher the amplitude of the input signal, the higher the frequency they wanted the net’s output to oscillate at), and they trained the output neuron to oscillate at a frequency proportional to that input. When they analyzed the dynamics, they found that, even though fixed points were not reached,

For the sine wave generator the oscillations could be explained by the slightly unstable oscillatory linear dynamics around each input-dependent saddle point.

I’m not clear on what the above means, but it is known that you can have limit cycles around certain types of fixed points (unstable ones). In the sine wave example, the locations of attractors and saddle points differ depending on what input is presented to the network. In the other problems they trained the net on, the saddle points were at the same place no matter what inputs were presented, because the analysis was done in the absence of input, maybe because the input was transient (applied for a short time), whereas in the sine wave example it was always there. So in the sine wave example, if you change the input, you change the whole attractor landscape.

They also say that studying slow points, as opposed to just fixed points, was valuable, since

funneling network dynamics can be achieved by a slow point, and not a fixed point

(as shown in the next figure):


A mathematician who I’ve corresponded with told me his opinion of attractors.  He wrote:

I think that:
• a memory is an activated attractor.
• when a person gets distracted, the current attractor is destroyed and gets replaced with another.
• the thought process is the process of one attractor triggering another, then another.
• memories are plastic and can be altered through suggestion, hypnosis, etc.  Eye witness accounts can be easily changed, simply by asking the right sequence of questions.
• some memories, once thought to be long forgotten, can be resurrected by odors, or a musical song.

One can speculate that emotions are a type of attractor. When you are depressed, the types of thoughts you have are sad ones, and when you are angry at a friend, you dredge up the memories of the annoying things they did in the past.

In the next post, I’ll discuss a different approach to understanding a recurrent network. It’s called a “Neural Turing Machine”. I’ll explain a bit about it here.

Kurt Gödel found that in any consistent set of axioms rich enough to describe arithmetic, there are true statements that cannot be proved from those axioms.

There had been a half-century of attempts, before Gödel came along, to find a set of axioms sufficient for all mathematics, but that ended when he proved the “incompleteness theorem”.

In hindsight, the basic idea at the heart of the incompleteness theorem is rather simple. Gödel essentially constructed a formula that claims that it is unprovable in a given formal system. If it were provable, it would be false. Thus there will always be at least one true but unprovable statement. That is, for any computably enumerable set of axioms for arithmetic (that is, a set that can in principle be printed out by an idealized computer with unlimited resources), there is a formula that is true of arithmetic, but which is not provable in that system.

In a paper published in 1936 Alan Turing reformulated Kurt Gödel’s 1931 results on the limits of proof and computation, replacing Gödel’s universal arithmetic-based formal language with hypothetical devices that became known as Turing machines. These devices wrote on a tape and then moved the tape, but they could compute anything (in theory) that any modern computer can compute. They needed a list of rules to know what to write on the tape in different conditions, and when and where to move it.
So Alex Graves, Greg Wayne and Ivo Danihelka of Google DeepMind in London came up with the idea of making a recurrent neural net with a separate memory section that could be looked at as a Turing machine with its tape. Their paper is titled “Neural Turing Machines”. I’ve corresponded with one author, and hopefully can explain their project in my next post.

Opening the Black Box: Low-dimensional dynamics in high-dimensional recurrent neural networks by David Sussillo and Omri Barak
Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN – by Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, Yanbo Gao

When your character flaw is due to brain chemicals

Blue Dreams is a new book by Lauren Slater. It describes the drugs that have been invented to treat mental illness. Lauren is a science journalist who has needed some of those drugs. It’s a very well written book, with both fascinating history and science in it. But let’s concentrate here on the science.
The question we will ask is, can character flaws be created by a simple neurotransmitter imbalance?
I think the answer is yes, because of this paragraph in her book (warning: she includes some explicit terms):

Because Prozac dampens the sex drive, psychiatrists often use it to treat compulsive masturbators and others with heightened libido or sexual-addiction disorders that leave their lives in shreds. Martin Kafka, a psychiatrist at McLean Hospital in Belmont, Massachusetts,…, has an entire practice comprising men who are addicted to sex. These are largely married men who nevertheless seek out prostitutes and pornography, not once a day, not twice day, but twenty or thirty times in a twenty-four-hour period. Men haunted and ravaged by their own internal fires, men eaten alive by uncontrollable desire, men whose brains are likely damp with dopamine coursing down dendrites and being sucked up by axons in a never-ending obsessive circuit. Kafka is not the only psycho-pharmacologist who uses Prozac and its chemical cousins in this manner. The literature is rife with cases […], all trained and tamed by serotonin-boosting drugs. Kafka has seen these drugs turn men around, has seen his patients go from the far fringes of fantasy, pornography, and prostitution to surprisingly conventional existences, picket fence and all.
Sex addicts for whom Prozac allows a normal life, most of whom are men, generally tend to be grateful…

So here we have a drug giving people the willpower to lead decent lives. I think this is remarkable.

Conversely, the author says this about Zyprexa, a drug that she takes because the alternative is worse:

…the problem with Zyprexa was that it so intensified my appetite that it was beyond satiation, so that the mere mention of food caused my mouth to water….As the Zyprexa toyed first with my metabolism and then with my body, my weight went up, up, up, with the end result that I am now an overweight diabetic. High blood sugar is destroying my eyesight, so that without glasses everything looks fuzzy….My high sugar has also caused my kidneys to malfunction…

To sum up: we have a drug (Zyprexa) that reduces willpower, or drowns it out with a strong urge to eat. You could say it causes a character flaw.
Prozac increases the levels of serotonin at synapses. Zyprexa blocks or lessens the effects of dopamine and serotonin. It is not known if these are the only effects, or why those effects have curative properties.

We like to believe we are in charge – but if a chemical can make us a slave to our impulses – or conversely free us from impulses, what does this say for human responsibility, guilt, and shame?

Serotonin, a neurotransmitter affected by both Prozac and Zyprexa, also occurs in wasps, earthworms, and lobsters. The author of “12 Rules For Life”, Jordan Peterson, wrote about lobsters, noting that they have a social hierarchy, and that if you are a lucky, respected lobster, you have high serotonin.

Lauren Slater points out that serotonin interacts with many other neurotransmitter systems, and so why drugs such as Zyprexa or Prozac work at all is not known.

I’m not trying here to make excuses for every out-of-control rascal (“I was made that way, I can’t help it”), but it is interesting that one of our presidents, JFK, when asked why he pursued so many ‘affairs’, replied “I can’t help it”. Maybe that wasn’t just an excuse!

(note: Lauren Slater says there is no proof that these drugs address a simple ‘neurotransmitter imbalance’.  Nonetheless, whatever is going on is being alleviated by a single chemical taken in pill form.)

Unraveling The Mystery of How The Brain Makes The Mind – Michael Gazzaniga’s new book

Michael Gazzaniga, who directs the SAGE Center for the Study of the Mind at UC Santa Barbara, has come out with a book on how the brain gives rise to the mind. He believes that consciousness is not tied to a specific neural network. Various functions that take place in the brain each have intrinsic consciousness. So what is his evidence? One line of evidence is that when the cable connecting the left and right hemispheres is cut:

…the left hemisphere keeps on talking and thinking as if nothing had happened even though it no longer has access to half of the human cortex. More important, disconnecting the two half brains instantly creates a second, also independent conscious system. The right brain now purrs along carefree from the left, with its own capacities, desires, goals, insights and feelings. One network, split into two, becomes two conscious systems.

Patients with a lesion in a particular part of the right hemisphere will behave as if part or all of the left side of their world…does not exist!

This could include not eating off the left side of their plate, not shaving…on the left side of their face…not reading the left pages of a book, etc.

It seems that when you lose a capacity, you are not even cognizant of what you’ve lost. You don’t know what you don’t know.

In a section titled “What is it like to be a Right Hemisphere”, Gazzaniga says that you would not notice the loss of the left hemisphere. You would have trouble communicating, because you lost your speech center. You would also lose your ability to make inferences.

Though you would know that others had intentions, beliefs and desires, and you could attempt to guess what they might be, you would not be able to infer cause and effect. You would not be able to infer why someone is angry or believes as they do.


Not only do you not infer that your neighbor is angry because you left the gate open and her dog got out, you don’t infer that the dog got out because you left the gate open

On the positive side, without a left hemisphere:

You won’t be a hypocrite and rationalize your actions…You would have no understanding of metaphors…

This is interesting also if you look at my prior post in this blog on the causal diagrams of Judea Pearl. He says that the causal diagrams he draws must be similar to how we represent them internally, and yet, Gazzaniga is saying that only one hemisphere has the ability to think about causes.

In a prior post, I mentioned that: “V. S. Ramachandran’s studies of anosognosia reveal a tendency for the left hemisphere to deny discrepancies that do not fit its already generated schema of things. The right hemisphere, by contrast, is actively watching for discrepancies, more like a devil’s advocate. These approaches are both needed, but pull in opposite directions.”

So I suppose if you were a right hemisphere, your ways of thinking would be very different than if you were a left hemisphere.

In fact, Gazzaniga gives fascinating illustrations of such differences. Suppose you show a film of a ball coming toward another ball, not quite touching, but the second ball acting as if it was hit and taking off. The left hemisphere of a split-brain patient will fall for the illusion even if there is a delay in the second ball moving, or if the distance between the balls is increased. The right hemisphere eventually sees that it is an illusion. On the other hand, the left hemisphere will solve problems with logical inference that the right hemisphere can’t. It seems that the right hemisphere can PERCEIVE causality, but the left hemisphere can INFER causality.

Rebecca Saxe at MIT found that the right half of the brain has special ‘hardware’ to determine the intentions of other people. (This ability is also known as ‘theory of mind’.) Based on this finding, a former student of Gazzaniga’s, Michael Miller, and a philosopher, Walter Sinnott-Armstrong, had the idea of looking at split-brain patients to see how each hemisphere evaluated moral responsibility. They presented scenarios such as this one:

If a secretary wants to bump off her boss and intends to add poison to his coffee, but unknown to her, it actually is sugar, he drinks it, and he is fine, was that permissible?

The left hemisphere judges the secretary as blameless, despite the malicious intent, since no harm was done.   The right hemisphere would not agree – it would take ‘intent’ into account.  But the judgement of the right hemisphere is not available to the left hemisphere, because the communication cable between the hemispheres is cut.  There may be some type of communication though – the right hemisphere has a ‘bad feeling’ about the situation, and that feeling is available via older brain areas to both hemispheres.  So the left hemisphere may feel impelled to explain its decision.

Gazzaniga writes that even though your experience seems a coherent, flawlessly edited film, it is instead

a stream of single vignettes that surface like bubbles in a boiling pot of water, linked together by their occurrence in time.

This raises a question:

Do the bubbles burst willy nilly, or are they the product of a dynamic control system?  Is there a control layer giving some bubbles the nod and quashing others?

He gives several options:

  1. Competition. If you bite into bitter chocolate, no module that processes sweet sensations is activated. Bitter information is being processed. If you eat milk chocolate instead, then the ‘sweet module’ is up and running as well, and outcompetes bitterness by a landslide.
  2. Top-down expectation. You are searching for the one person in the crowd with a red flower in her hair.
  3. Arbitrary rules. For instance, if you have been told that low-fat diets are the road to health, then bubbles will come up as you shop for groceries, guiding you to stay clear of various products. If you then read that ‘fat’ is actually good for you, you will get different bubbles.

At one point he also suggests that the ‘bubbles’ are linked by the feelings they engender as well, and a possibility that occurs to me is that the ‘interpreter’ in our brain – an area with the function to make a story out of our thoughts and make sense of our own actions, might also link these experiences.

There is much more to the book.
One analogy he makes that I have not heard before is for the hierarchical sets of layers in the cortex. The analogy he makes is to the screen of a computer. When you interact with a computer, as far as your experience, you don’t interact with circuits and registers and bits, you interact with icons and pictures and words on a screen, using a keyboard and a mouse. The details are abstracted for you. The same goes for the hierarchies of layers in the brain – each layer just passes an abstracted version to the next.

After finishing the book I’m still left with a mystery.   Suppose a ‘bubble’ outcompetes all others, and is the focus of my mind.   And each bubble has intrinsic consciousness that goes with it.   How is that subjective consciousness felt and experienced?  Where is the qualia of a bright red sunset? Where is our feeling of 3-D space? What does it mean to feel we are exerting a force, such as a force of willpower not to have dessert? And where does the inner voice come from that says something is wrong with a story you’ve just heard, but you can’t pinpoint yet what it is? Is it a module trying to get into consciousness?

The Consciousness Instinct – Unraveling The Mystery of How The Brain Makes The Mind – Michael S. Gazzaniga (2018)

Judea Pearl’s causal revolution – and implications for A.I.

Judea Pearl recently wrote a book for a popular audience about his life work, called “The Book of Why”. Pearl is the inventor of “Bayesian Networks”, which are graphs whose links are probabilities. Such a graph might have some nodes that represented symptoms of a disease, and other nodes that represented various diseases.  The links between nodes, as well as the decision of which nodes link to which, can help diagnose which disease a person has, given his symptoms. The probabilities propagate using “Bayes Rule”. Despite the huge success of Bayesian Networks, Professor Pearl was not satisfied, so he invented causal networks and a causal language to go with them. This latter discovery has big implications for machine learning.

In what follows, I’ll assume you know Bayes Rule, and I’ll give a flavor of what he accomplished, based on the book.

The weakness of associations in probability is that they don’t tell you what caused what. The rooster may crow just before sunrise, but the association doesn’t tell you whether the approaching sunrise caused the rooster to crow or the rooster’s crowing caused the sunrise.

Let’s start with some interesting aspects of Bayesian Graphs.


The simplest Bayes net would be a junction between two nodes that is updated via Bayes Rule. But let’s look at junctions that involve 3 nodes (which could exist in a huge graph of hundreds of nodes). The following is a quote from the book:

There are three basic types of junctions, with the help of which we can characterize any pattern of arrows in the network.
1. A -> B -> C. This junction is the simplest example of a “chain,” or of mediation. In science, one often thinks of B as the mechanism, or “mediator,” that transmits the effect of A to C. A familiar example is
Fire -> Smoke -> Alarm.
Although we call them “fire alarms,” they are really smoke alarms. The fire by itself does not set off an alarm, so there is no direct arrow from Fire to Alarm. Nor does the fire set off the alarm through any other variable, such as heat. It works only by releasing smoke molecules in the air. If we disable that link in the chain, for instance by sucking all the smoke molecules away with a fume hood, then there will be no alarm. This observation leads to an important conceptual point about chains: the mediator B “screens off” information about A from C, and vice versa. Suppose we had a database of all the instances when there was fire, when there was smoke, or when the alarm went off. If we looked at only the rows where Smoke = 1 (i.e. TRUE), we would expect Alarm = 1 every time, regardless of whether Fire = 0 (FALSE) or Fire = 1 (TRUE).
2. A <- B -> C. This kind of junction is called a “fork,” and B is often called a common cause or confounder of A and C. A confounder will make A and C statistically correlated even though there is no direct causal link between them. A good example (due to David Freedman) is Shoe Size <- Age of Child -> Reading Ability.
Children with larger shoes tend to read at a higher level. But the relationship is not one of cause and effect. Giving a child larger shoes won’t make him read better! Instead, both variables are explained by a third, which is the child’s age. Older children have larger shoes, and they also are more advanced readers. We can eliminate this spurious correlation, as Karl Pearson and George Udny Yule called it, by conditioning on the child’s age. For instance, if we look only at seven-year-olds, we expect to see no relationship between shoe size and reading ability.
3. A -> B <- C. An example: Talent -> Celebrity <- Beauty. Here we are asserting that both talent and beauty contribute to an actor’s success, but beauty and talent are completely unrelated to one another in the general population. We will now see that this collider pattern works in exactly the opposite way from chains or forks when we condition on the variable in the middle. If A and C are independent to begin with, conditioning on B will make them dependent. For example, if we look only at famous actors (in other words, we observe the variable Celebrity = 1), we will see a negative correlation between talent and beauty: finding out that a celebrity is unattractive increases our belief that he or she is talented. This negative correlation is sometimes called collider bias or the “explain-away” effect. For simplicity, suppose that you don’t need both talent and beauty to be a celebrity; one is sufficient. Then if Celebrity A is a particularly good actor, that “explains away” his success, and he doesn’t need to be any more beautiful than the average person. On the other hand, if Celebrity B is a really bad actor, then the only way to explain his success is his good looks. So, given the outcome Celebrity = 1, talent and beauty are inversely related, even though they are not related in the population as a whole. Even in a more realistic situation, where success is a complicated function of beauty and talent, the explain-away effect will still be present.

The miracle of Bayesian networks lies in the fact that the three kinds of junctions we are now describing in isolation are sufficient for reading off all the independencies implied by a Bayesian network, regardless of how complicated.
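The collider effect is easy to see in a simulation. The sketch below is my own toy model, not Pearl’s: talent and beauty are drawn independently, and I arbitrarily declare someone a celebrity when their sum passes a threshold of 2. The variable names, the threshold, and the normal distributions are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Talent and beauty are generated independently of each other.
talent = rng.normal(size=n)
beauty = rng.normal(size=n)

# Collider: celebrity depends on both (an arbitrary threshold on the sum).
celebrity = (talent + beauty) > 2.0

# Correlation in the whole population, and among celebrities only.
corr_all = np.corrcoef(talent, beauty)[0, 1]
corr_celebs = np.corrcoef(talent[celebrity], beauty[celebrity])[0, 1]
```

In the full population the correlation is close to zero, but once you look only at the rows where `celebrity` is true it becomes clearly negative, even though nothing in the data-generating code links the two traits.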

Pearl then explains that Bayes Rule lets you update your belief in a hypothesis when new data is presented. It lets you calculate a backward probability, given a forward probability:

Suppose you take a medical test to see if you have a disease, and it comes back positive. How likely is it that you have the disease? For specificity, let’s say the disease is breast cancer, and the test is a mammogram. In this example the forward probability is the probability of a positive test, given that you have the disease: P(test | disease). This is what a doctor would call the “sensitivity” of the test, or its ability to correctly detect an illness. Generally it is the same for all types of patients, because it depends only on the technical capability of the testing instrument to detect the abnormalities associated with the disease. The inverse probability is the one you surely care more about: What is the probability that I have the disease, given that the test came out positive? This is P(disease | test), and it represents a flow of information in the non-causal direction, from the result of the test to the probability of disease. This probability is not necessarily the same for all types of patients; we would certainly view the positive test with more alarm in a patient with a family history of the disease than in one with no such history. Notice that we have started to talk about causal and non-causal directions.

For the next quote, you should understand that we can rewrite Bayes’s rule as follows: (Updated probability of Disease once the test results are in) = P(D | T) = (likelihood ratio) × (prior probability of D) where the new term “likelihood ratio” is given by P(T | D)/P(T).
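To make the likelihood-ratio form concrete, here is a minimal sketch. The numbers are hypothetical and chosen only to keep the arithmetic simple; they are not the figures from the book. The function name `posterior` and the `false_positive_rate` parameter are my own additions, needed to compute P(T) by summing over "disease" and "no disease".

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(D | T) via Bayes's rule written with a likelihood ratio."""
    # P(T) by the law of total probability over "disease" and "no disease".
    p_test = sensitivity * prior + false_positive_rate * (1 - prior)
    likelihood_ratio = sensitivity / p_test  # P(T | D) / P(T)
    return likelihood_ratio * prior

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
p = posterior(prior=0.01, sensitivity=0.9, false_positive_rate=0.1)
print(f"P(disease | positive test) = {p:.3f}")

# Same test, but a patient with a higher prior (e.g. family history).
p_high_prior = posterior(prior=0.05, sensitivity=0.9, false_positive_rate=0.1)
print(f"with a higher prior        = {p_high_prior:.3f}")
```

With these made-up inputs the posterior is 0.009 / 0.108, or about 8%, even though the test catches 90% of true cases. The same positive result is more alarming for the patient with the higher prior, which is the point Pearl makes about family history.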

Judea Pearl was reading about neural network models of the brain, and he combined those ideas with Bayes’s rule when he first planned how Bayesian networks would work.

I assumed that the network would be hierarchical, with arrows pointing from higher neurons to lower ones, or from “parent nodes” to “child nodes.” Each node would send a message to all its neighbors (both above and below in the hierarchy) about its current degree of belief about the variable it tracked (e.g., “I’m two-thirds certain that this letter is an R”). The recipient would process the message in two different ways, depending on its direction. If the message went from parent to child, the child would update its beliefs using conditional probabilities,… If the message went from child to parent, the parent would update its beliefs by multiplying them by a likelihood ratio, as in the mammogram example.

In other words, the ‘forward probability’ P(Test | Disease) would be passed down from parent to child, and the ‘backward probability’ P(Disease | Test) would be passed up from child to parent.
In image recognition of a word, the probability of the word being “Lion” might be increased by a message passed up to the parent saying there is more evidence that the first letter is “L”. In turn, the more evidence there is for “Lion”, the more probability is given to the message passed down that the first letter is “L”.
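The two message directions can be sketched for the smallest possible network: one parent (the word) and one child (its first letter). This is a toy illustration under my own assumptions, not Pearl's full belief propagation algorithm; the candidate words, the priors and the evidence scores are all invented.

```python
# Hypothetical two-node network: Word (parent) -> FirstLetter (child).
prior_word = {"LION": 0.3, "ZION": 0.7}

# Conditional probability table P(first letter | word).
p_letter_given_word = {
    "LION": {"L": 1.0, "Z": 0.0},
    "ZION": {"L": 0.0, "Z": 1.0},
}

# Evidence from the image: how well the pixels match each letter.
letter_evidence = {"L": 0.8, "Z": 0.2}

def normalize(dist):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}

# Downward message (parent -> child): the letter probabilities the parent
# predicts from its own prior, using the conditional probabilities.
down_message = {
    letter: sum(prior_word[w] * p_letter_given_word[w][letter]
                for w in prior_word)
    for letter in letter_evidence
}

# Upward message (child -> parent): likelihood of the evidence per word.
up_message = {
    word: sum(cpt[letter] * letter_evidence[letter] for letter in cpt)
    for word, cpt in p_letter_given_word.items()
}

# Each node combines what it hears with what it already knows.
belief_letter = normalize({l: down_message[l] * letter_evidence[l]
                           for l in letter_evidence})
belief_word = normalize({w: prior_word[w] * up_message[w]
                         for w in prior_word})

print("belief about first letter:", belief_letter)
print("belief about word:        ", belief_word)
```

Evidence for “L” raises the belief in “LION” above its prior. In this tiny network the two beliefs agree, because the first letter determines the word.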

So why didn’t Professor Pearl rest on his laurels? He states some limitations with Bayesian networks:

By design, in a Bayesian network, information flows in both directions, causal and diagnostic: smoke increases the likelihood of fire, and fire increases the likelihood of smoke. In fact, a Bayesian network can’t even tell what the “causal direction” is…

It is one thing to say, “Smoking causes cancer,” but another to say that my uncle Joe, who smoked a pack a day for thirty years, would have been alive had he not smoked. The difference is both obvious and profound: none of the people who, like Uncle Joe, smoked for thirty years and died can ever be observed in the alternate world where they did not smoke for thirty years. Responsibility and blame, regret and credit: these concepts are the currency of a causal mind. To make any sense of them, we must be able to compare what did happen with what would have happened under some alternative hypothesis.

A structural model, as he diagrams it, looks simple. Here is one:


This is based on a causal model that a doctor named John Snow in England used in 1854 when there was a cholera outbreak. Dr. Snow trudged around town and realized that people downstream from a water company were getting sick. The belief among ‘experts’ at the time was that some atmospheric ‘miasma’ caused cholera, and that idea is also incorporated in the diagram. ‘Poverty’ probably really did have an effect on both water purity and likelihood of cholera (as you can also see in the diagram).

So what’s so great about the diagram shown above?

First, structural causal models are a shortcut that works, and there aren’t any competitors around with that miraculous property. Second, they were modeled on Bayesian networks, which in turn were modeled on David Rumelhart’s description of message passing in the brain.

Professor Pearl has an interesting speculation at this point:

It is not too much of a stretch to think that 40,000 years ago, humans co-opted the machinery in their brain that already existed for pattern recognition and started to use it for causal reasoning…

[A.I. researchers] aimed to build robots that could communicate with humans about alternate scenarios, credit and blame, responsibility and regret. These are all counterfactual notions that AI researchers had to mechanize before they had the slightest chance of achieving what they call “strong AI”—humanlike intelligence.

Pearl and his students created a mathematical language of causality. For instance, they could represent “counterfactuals” in this language. A counterfactual is an alternative that was not taken. For instance, “if only I had not left my Facebook page open to that joke I made about my wife when she was still in the house”.

Pearl writes:

The case for causal models becomes even more compelling when we seek to answer counterfactual queries such as “What would have happened had we acted differently?” any query about the mechanism by which causes transmit their effects—the most prototypical “Why?” question—is actually a counterfactual question in disguise. Thus, if we ever want robots to answer “Why?” questions or even understand what they mean, we must equip them with a causal model and teach them how to answer counterfactual queries.

Belief propagation formally works in exactly the same way whether the arrows are non-causal or causal. So Bayes nets and causal diagrams have similarities. However, a causal model also encodes assumptions about which variable causes which, and that is where its extra power comes from.


The above “ladder of causation” illustration shows that without a causal model, statistics stay only on the bottom rung of the ladder.

Remember this diagram from above:


Pearl says of it:

John Snow’s painstaking detective work had showed two important things: (1) there is no arrow between Miasma and Water Company (the two are independent), and (2) there is an arrow between Water Company and Water Purity. Left unstated by Snow, but equally important, is a third assumption: (3) the absence of a direct arrow from Water Company to Cholera, which is fairly obvious to us today because we know the water companies were not delivering cholera to their customers by some alternate route. … Because there are no confounders of the relation between Water Company and Cholera, any observed association must be causal. Likewise, since the effect of Water Company on Cholera must go through Water Purity, we conclude (as did Snow) that the observed association between Water Purity and Cholera must also be causal. Snow stated his conclusion in no uncertain terms: if the Southwark and Vauxhall Company had moved its intake point upstream, more than 1,000 lives would have been saved. Few people took note of Snow’s conclusion at the time. He printed a pamphlet of the results at his own expense, and it sold a grand total of fifty-six copies.

I don’t have room to go into the causal language and elegant equations Pearl has in his book, but I’ll mention the ‘do’ operator – the idea of actively changing a cause, as opposed to just observing:

if we are interested in the effect of a drug (D) on lifespan (L), then our query might be written symbolically as: P(L | do(D)). In other words, what is the probability (P) that a typical patient would survive L years if made to take the drug? This question describes what epidemiologists would call an intervention or a treatment and corresponds to what we measure in a clinical trial. In many cases we may also wish to compare P(L | do(D)) with P(L | do(not-D)); the latter describes patients denied treatment, also called the “control” patients. The do-operator signifies that we are dealing with an intervention rather than a passive observation; classical statistics has nothing remotely similar to this operator. We must invoke an intervention operator do(D) to ensure that the observed change in Lifespan L is due to the drug itself and is not confounded with other factors that tend to shorten or lengthen life. If, instead of intervening, we let the patient himself decide whether to take the drug, those other factors might influence his decision, and lifespan differences between taking and not taking the drug would no longer be solely due to the drug. For example, suppose only those who were terminally ill took the drug. Such persons would surely differ from those who did not take the drug, and a comparison of the two groups would reflect differences in the severity of their disease rather than the effect of the drug.

… Note that P(L | D) may be totally different from P(L | do(D)). This difference between seeing and doing is fundamental …A world devoid of P(L | do(D)) and governed solely by P(L | D) would be a strange one indeed. For example, patients would avoid going to the doctor to reduce the probability of being seriously ill; cities would dismiss their firefighters to reduce the incidence of fire…

Pearl notes that we can say that X causes Y if P(Y | do(X)) > P(Y). So here we have a mathematical formula, in his new causal language, for causality!
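The gap between seeing and doing can be shown with a simulated version of the drug example. Everything in this sketch is an assumption of mine: a hidden "severity" variable that both pushes patients toward the drug and lowers survival, made-up survival probabilities, and a yes/no "survived" outcome standing in for lifespan L. Forcing the drug on or off inside the simulator plays the role of do(D) and do(not-D).

```python
import random

def simulate(n, rng, force_drug=None):
    """Return (drug, survived) pairs; force_drug sets do(D) or do(not-D)."""
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5  # hidden confounder: disease severity
        if force_drug is None:
            # Observation: sicker patients choose the drug far more often.
            drug = rng.random() < (0.9 if severe else 0.1)
        else:
            # Intervention: the choice no longer depends on severity.
            drug = force_drug
        # Hypothetical effect sizes: the drug adds 0.2 to survival odds.
        p_survive = (0.3 if severe else 0.7) + (0.2 if drug else 0.0)
        rows.append((drug, rng.random() < p_survive))
    return rows

def survival_rate(rows, drug=None):
    """Share of survivors, optionally restricted to one drug status."""
    kept = [s for d, s in rows if drug is None or d == drug]
    return sum(kept) / len(kept)

rng = random.Random(1)  # fixed seed so the run is repeatable
N = 50000
observed = simulate(N, rng)
p_l = survival_rate(observed)                          # P(L)
p_l_given_d = survival_rate(observed, drug=True)       # P(L | D)
p_l_given_not_d = survival_rate(observed, drug=False)  # P(L | not-D)
p_l_do_d = survival_rate(simulate(N, rng, force_drug=True))       # P(L | do(D))
p_l_do_not_d = survival_rate(simulate(N, rng, force_drug=False))  # P(L | do(not-D))

print(f"seeing: P(L | D) = {p_l_given_d:.2f}, P(L | not-D) = {p_l_given_not_d:.2f}")
print(f"doing:  P(L | do(D)) = {p_l_do_d:.2f}, P(L | do(not-D)) = {p_l_do_not_d:.2f}")
print(f"baseline P(L) = {p_l:.2f}")
```

In the observed data the drug takers do worse than the non-takers, because they are mostly the severe cases. Under intervention the drug clearly helps, and P(L | do(D)) exceeds P(L), which matches the criterion for "X causes Y" quoted above.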

We could think up other applications for Pearl’s causal diagrams and language. Perhaps indexing of pages by search engines on the internet would benefit by first parsing the causal structure of the story in the document. Perhaps arguments, political or otherwise, could be analyzed to show their causal assumptions.

Pearl says that current convolutional nets and deep-learning nets leave out causality. I would mention, though, that there are also “generative models” that do learn causes. You can test hypotheses about neuronal time series with a method called “dynamic causal modelling”, and in “free energy” based nets, which I mentioned in an earlier blog post, applying the free energy principle yields a generative model that generates consequences from causes.

But apart from that minor point, I would think the application of Pearl’s causal language and diagrams will give a boost to neural models of all sorts as well as reasoning and logic. His causal diagrams also are easy for a human to understand, unlike the internal weights of recurrent generative nets.


Pearl, Judea. The Book of Why: The New Science of Cause and Effect. Basic Books. Kindle Edition.