Making Neural Nets more decipherable and closer to Computers

In an article titled “Neural Turing Machines”, three researchers from Google DeepMind, Alex Graves, Greg Wayne, and Ivo Danihelka, describe a neural net with a new feature: a memory bank. The system is similar in this respect to a Turing Machine, which was originally proposed by Alan Turing in 1936. His hypothetical machine had a read/write head that wrote on squares on a tape, and could move to other squares and read from them as well. So it had a memory. In theory, it could compute anything that modern computers can compute, given enough time.

One advantage of making a Neural Net that is also a Turing machine is that it can be trained with gradient descent algorithms. That means it doesn’t just execute algorithms, it learns algorithms (though, if you want to be pedantic, you might note that since a Turing machine can simulate any recipe that a computer can execute, it could also simulate a neural net that learns).

The authors say this:

Computer programs make use of three fundamental mechanisms: elementary operations (e.g., arithmetic operations), logical flow control (branching), and external memory, which can be written to and read from in the course of computation. Despite its wide-ranging success in modelling complicated data, modern machine learning has largely neglected the use of logical flow control and external memory.

Recurrent neural networks (RNNs) …are Turing-Complete and therefore have the capacity to simulate arbitrary procedures, if properly wired. Yet what is possible in principle is not always what is simple in practice. We therefore enrich the capabilities of standard recurrent networks to simplify the solution of algorithmic tasks. This enrichment is primarily via a large, addressable memory, so, by analogy to Turing’s enrichment of finite-state machines by an infinite memory tape, we dub our device a “Neural Turing Machine” (NTM). Unlike a Turing machine, an NTM is a differentiable computer that can be trained by gradient descent, yielding a practical mechanism for learning programs.

They add that in humans, the closest analog to a Turing Machine is ‘working memory’, where information can be stored and rules applied to that information.

…In computational terms, these rules are simple programs, and the stored information constitutes the arguments of these programs.

A Neural Turing Machine is designed

to solve tasks that require the application of approximate rules to “rapidly-created variables.” Rapidly-created variables are data that are quickly bound to memory slots, in the same way that the number 3 and the number 4 are put inside registers in a conventional computer and added to make 7.

… In [human] language, variable-binding is ubiquitous; for example, when one produces or interprets a sentence of the form, “Mary spoke to John,” one has assigned “Mary” the role of subject, “John” the role of object, and “spoke to” the role of the transitive verb.

A Neural Turing Machine (NTM) architecture contains two components: a neural network controller and a memory bank.
[Figure from the paper: the NTM architecture, a neural network controller that reads from and writes to a memory bank via its heads]

Like most neural networks, the controller interacts with the external world via input and output vectors. Unlike a standard network, it also interacts with a memory matrix…. By analogy to the Turing machine we refer to the network outputs that parametrize these operations as “heads.”
Crucially, every component of the architecture is differentiable, making it straightforward to train with gradient descent. We achieved this by defining ‘blurry’ read and write operations that interact to a greater or lesser degree with all the elements in memory (rather than addressing a single element, as in a normal Turing machine or digital computer).

In a regular computer, a number is retrieved by fetching it from a given address.

Their net differs from a standard computer in two ways when it retrieves data. First, it retrieves an entire vector of numbers from a particular address. Think of a rectangular matrix, where each row number is an address, and the row itself is the vector that is retrieved.

Second, instead of retrieving from just one address, a vector of weights controls retrieval across multiple addresses at once. The weights in that vector add up to 1. For a memory matrix consisting of 5 row vectors, there will be 5 corresponding weights.

If the weights were:

0,0,1,0,0

then only one vector is retrieved: the vector at the third row of the matrix. This is similar to ordinary location-based addressing in computers or Turing machines. You can also shift that ‘1’ by one position each cycle, so that the head retrieves the row adjacent to the one it retrieved before.

Now think of the following vector of weights:

0,0.3,0.7,0,0

In this case two vectors are retrieved (one from the second row, and one from the third). The first has all its elements multiplied by 0.3, the second has all its elements multiplied by 0.7, and then the two are added. This gives one resultant vector. They say this is a type of ‘blurry’ retrieval.
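As a concrete sketch of this weighted retrieval (my own NumPy illustration, not code from the paper), here is a 5-row memory matrix read with a sharp weighting, a shifted weighting, and the blurry 0.3/0.7 weighting above:

import numpy as np

# A toy memory matrix: 5 addresses (rows), each storing a vector of 4 numbers.
M = np.array([[ 1.,  2.,  3.,  4.],
              [ 5.,  6.,  7.,  8.],
              [ 9., 10., 11., 12.],
              [13., 14., 15., 16.],
              [17., 18., 19., 20.]])

# Sharp (one-hot) weighting: behaves like ordinary addressing, returns row 3 exactly.
w_sharp = np.array([0., 0., 1., 0., 0.])
print(w_sharp @ M)       # [ 9. 10. 11. 12.]

# Shifting the '1' by one position reads the adjacent row on the next cycle.
w_shifted = np.roll(w_sharp, 1)
print(w_shifted @ M)     # [13. 14. 15. 16.]

# 'Blurry' weighting: 0.3 of the second row plus 0.7 of the third, summed into one vector.
w_blurry = np.array([0., 0.3, 0.7, 0., 0.])
print(w_blurry @ M)      # approximately [7.8, 8.8, 9.8, 10.8]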

They use the same idea when writing to memory: a weighting vector determines how strongly each memory location is modified by the values being written.

This vector-multiplication method of retrieval allows the entire mechanism to be trained by gradient descent. It can also be thought of as an ‘attentional mechanism’, where the focus is on the vectors with relatively high corresponding weights.

Some other nets do a probabilistic type of addressing, where there is a probability distribution over all the vectors, and at each cycle the net picks one address according to that distribution (perhaps the most probable, perhaps with a random component). But since Neural Turing Machines learn by gradient descent, the designers instead used the distribution as weights, so that a weighted sum of memory vectors is retrieved. This was not a bug, but a feature!
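To make the contrast concrete, here is a small sketch (my own illustration) of the two strategies: sampling one address, which involves a hard, non-differentiable choice, versus the NTM’s weighted sum, which is a smooth function of the weights:

import numpy as np

M = np.random.randn(5, 4)                # toy memory: 5 addresses, vectors of length 4
w = np.array([0., 0.3, 0.7, 0., 0.])     # distribution / weighting over the addresses

# Stochastic addressing: sample ONE address; the hard choice is not differentiable.
sampled_row = np.random.choice(len(M), p=w)
r_sampled = M[sampled_row]

# NTM-style addressing: take the weighted sum over all rows (the expectation under w);
# this is a smooth function of w, so gradients can flow through it during training.
r_blurry = w @ M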

They say:

The degree of blurriness is determined by an attentional “focus” mechanism that constrains each read and write operation to interact with a small portion of the memory, while ignoring the rest… Each weighting, one per read or write head, defines the degree to which the head reads or writes at each location. A head can thereby attend sharply to the memory at a single location or weakly to the memory at many locations.

Writing to memory is done in two steps:

we decompose each write into two parts: an erase followed by an add.
Given a weighting w_t emitted by a write head at time t, along with an erase vector e_t whose M elements all lie in the range (0,1), the memory vectors M_{t-1}(i) from the previous time-step are modified as follows:

\tilde{M}_t(i) = M_{t-1}(i)\,\left[\mathbf{1} - w_t(i)\,\mathbf{e}_t\right]

where 1 is a row-vector of all 1’s, and the multiplication against the memory location acts point-wise. Therefore, the elements of a memory location are reset to zero only if both the weighting at the location and the erase element are one; if either the weighting or the erase is zero, the memory is left unchanged.
Each write head also produces a length-M add vector a_t, which is added to the memory after the erase step has been performed:

M_t(i) = \tilde{M}_t(i) + w_t(i)\,\mathbf{a}_t

The combined erase and add operations of all the write heads produces the final content of the memory at time t. Since both erase and add are differentiable, the composite write operation is differentiable too.
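Here is a minimal NumPy sketch of that erase-then-add write (my own illustration of the two equations above, with made-up example values for the weighting, erase, and add vectors):

import numpy as np

N, width = 5, 4                          # 5 memory locations, vectors of length 4
M_prev = np.ones((N, width))             # memory from the previous time-step

w = np.array([0., 0.1, 0.9, 0., 0.])     # write weighting emitted by the write head
e = np.array([0.9, 0.9, 0.1, 0.1])       # erase vector, elements in (0, 1)
a = np.array([0.5, 0.5, 0.5, 0.5])       # add vector

# Erase: each location i is scaled element-wise by (1 - w[i] * e).
M_erased = M_prev * (1.0 - np.outer(w, e))

# Add: each location i then receives w[i] * a.
M_new = M_erased + np.outer(w, a)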

The controller network that emits these weightings and vectors, as well as taking inputs and producing outputs, can be a recurrent neural network or a plain feedforward network. In either case, the vector read from memory is fed back, along with the inputs, into the net on the next step.
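Schematically, one time-step of that loop might look like the sketch below; the controller function and the head-parameter names are my own placeholders, not the paper’s interface:

import numpy as np

def ntm_step(controller, x_t, r_prev, M, state):
    """One NTM time-step (schematic sketch, not the paper's implementation).

    The controller (recurrent or feedforward) sees the external input and the
    vector read from memory on the previous step, and emits the external
    output plus the parameters for the read and write heads.
    """
    out, w_read, w_write, erase, add, state = controller(x_t, r_prev, state)

    # Write: blurry erase followed by blurry add, as in the equations above.
    M = M * (1.0 - np.outer(w_write, erase)) + np.outer(w_write, add)

    # Read: weighted sum over memory rows; fed back to the controller next step.
    r_t = w_read @ M

    return out, r_t, M, state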

The authors trained their net on various problems, such as copying a sequence of numbers, or retrieving the next number in an arbitrary sequence given the one before it. It came up with algorithms such as this one, for copying sequences of numbers. (In the following, a ‘head’ can be either a read head or a write head, and has a vector of weights associated with it that weights the memory vectors for retrieval or for writing.)

initialise: move head to start location
while input delimiter not seen do
    receive input vector
    write input to head location
    increment head location by 1
end while
return head to start location
while true do
    read output vector from head location
    emit output
    increment head location by 1
end while

This is essentially how a human programmer would perform the same task in a low-level programming language. In terms of data structures, we could say that NTM has learned how to create and iterate through arrays. Note that the algorithm combines both content-based addressing (to jump to start of the sequence) and location-based addressing (to move along the sequence).
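A literal, non-neural transcription of that learned procedure into Python might look like the following sketch (the real NTM does all of this with blurry, differentiable reads and writes, and uses content-based addressing rather than a hard index to jump back to the start):

import numpy as np

def copy_sequence(inputs, mem_rows=128, width=8):
    """Copy a sequence the way the trained NTM does, with hard addressing."""
    memory = np.zeros((mem_rows, width))

    # Write phase: store each input vector at the head location, stepping forward.
    head = 0                              # "move head to start location"
    for x in inputs:                      # "while input delimiter not seen"
        memory[head] = x                  # "write input to head location"
        head += 1                         # "increment head location by 1"

    # Read phase: return to the start and emit the stored vectors in order.
    head = 0                              # "return head to start location"
    outputs = []
    for _ in range(len(inputs)):          # the paper's loop is "while true"; here we
        outputs.append(memory[head])      # just read back as many vectors as were written
        head += 1
    return outputs

# Example: copy three random binary vectors of width 8.
seq = [np.random.randint(0, 2, 8) for _ in range(3)]
assert all((a == b).all() for a, b in zip(seq, copy_sequence(seq)))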

The way the NTM solves problems is easier to understand than trying to decipher a standard recurrent neural net, because you can look at how memory is being addressed, and what is being retrieved and written to memory at any point.
There is more to the NTM than I have explained above, as you can see from the following diagram from their paper:

[Diagram from the paper showing the NTM in more detail]

Take-home lesson: on the tasks reported in the paper, the Neural Turing Machine outperforms existing architectures such as LSTMs (neural nets where each unit has a memory cell, plus trainable gates that decide what to forget and what to remember), and it generalizes better as well. It is also easier to understand what the net is doing, especially if you use a feedforward net as the controller. The net doesn’t just passively compute outputs; it decides what to write to memory, and what to retrieve from memory.

Sources:
Neural Turing Machines by Alex Graves, Greg Wayne and Ivo Danihelka – Google DeepMind, London, UK (https://arxiv.org/abs/1410.5401)
