Back in 1990 (28 years ago now), Jeffrey Elman wrote an interesting paper titled “Finding Structure in Time”. He took a standard backpropagation net, with one hidden layer, and fed that layer back into the inputs.
To review what a backpropagation net is: it’s just a way of learning associations between a group of inputs and an output. The inputs could be a set of pixel values for a picture, and the output could be a binary category like “is it a cat” – which is 1 if the picture is of a cat, and 0 otherwise. The associations are learned by adjusting weights on connections leading from the inputs to the output(s). Usually there is a hidden layer, which gives the net more power to solve problems. An example neural net that has 3 inputs, 2 outputs, and a hidden layer might look like this:
When you present a pattern to the input units, for instance a pattern of 1’s and 0’s, they send their activation over weighted connections to the hidden nodes. Each hidden node combines its incoming weighted signals, applies a function to that combined value, and outputs the result of the function on its outgoing connection (which in this case leads to the output layer). The output node(s) do the same operation. If the weights are adjusted properly, (by a training algorithm), eventually the net learns to predict the correct values. So in the cat example, when you present a picture of a cat to the input units (as a group of pixel values), the output unit for ‘cat’ might output a ‘1’, and the output unit for ‘dog’ might output a ‘0’ and if you had an output unit for ‘tiger’ it might output some intermediate value.
Suppose you had a net with two inputs, one output, and 2 hidden nodes in the hidden layer. You could use a net like this to learn to predict problems such as XOR (Exclusive OR), which outputs a ‘1’ if the sum of the inputs is odd, and a zero if the sum is even.
To recap, you have some patterns (in the case of XOR only 4 patterns) and each pattern helps the net learn a relationship.
(There are many papers explaining backpropagation on the internet (here is one) and my assumption when writing this blog is that you know what backpropagation is, and that you have a background in high-school math, including calculus.)
The improvement Elman made was to feed back the activations of the (in this case) 2 hidden nodes from the hidden layer back to the input layer. This means now (in this example) that the input layer has 2 extra nodes, for a total of 4, where the first 2 nodes are the standard inputs, and the second nodes are ‘context nodes’, receiving inputs from the two nodes (respectively) of the hidden layer. Doing this doesn’t make sense for a simple XOR problem, but it would make sense for a series – perhaps a series where every bit is the XOR of the 2 bits before it.
For an XOR series starting with 0 and 1 you would get
This series could go on forever, but since the net has 2 inputs, its presented in consecutive groups of 2 as:
0,1 –> 1
1,1 –> 0
1,0 –> 1
There is a time step involved. Initially, the hidden nodes are assumed to have an arbitrary value – such as an activation of 0.5. So instead of just presenting – say – 0, 1 to the input layer, we present 0, 1, 0.5, 0.5 to the 4 nodes in the input layer. The last two inputs are supposed to come from the activations of the hidden nodes, but at this point there haven’t been any activations there yet.
At time ‘t’ = 2, we look at what the activation of the hidden nodes were when the first pattern was presented (at time t=1) and feed them back. Suppose, after the inputs were presented and the hidden nodes combined them, that the hidden layer’s first node activation was 0.7, and the second was 0.6, then, from the above example we have
1,1,0.7,0.6 (as inputs).
It turned out that this type of recursive net could predict series of numbers successively presented over time. There was an actual memory in the net. The extra nodes in the input layer, which could Elman called ‘context nodes’, made information from a previous time available to the net, which meant that the hidden nodes became a function not only of the input/output relationship, but of themselves at a previous instant. And at the next instant, they became a function of themselves again, and since they were already a function of a previous instant, they inherited information from that instant as well. You could write it out like this
HidnodesTime2 = function of (HidNodesTime1, inputsTime1,outputsTime1)
HidNodesTime3 = function of (HidNodesTimes2,inputsTime2,outputsTime2)
Where ‘function’ is the function of the neural net – or at least the transformation between the input layer and the hidden layer.
Exclusive OR is a boring series, but Elman was just getting started.
He now made a sequences of letters. Each letter was 6 bits long, and each bit meant something. Here is a table from his paper:
He found that in letter sequences, it actually helped that there was this kind of meaningful representation of letters. If he had used random bit-patterns to represent letters, the predictions would have been less accurate. You might not be surprised at this, for instance usually a vowel follows a consonant so providing information about each letter, such as whether it is a vowel or a consonant, is helpful.
He then tried meaningful word combinations. And here, he ended up with a “family tree” of word meaning emerging from this simple neural net.
Here is what he says:
As a first step, a somewhat modest experiment was undertaken. A sentence generator program was used to construct a set of short (two- and three-word) utterances. Thirteen classes of nouns and verbs were chosen; these are listed in Table 3.
The generator program used these categories and the 15 sentence templates given in Table 4 to create 10,000 random two- and three-word sentence frames. Each sentence frame was then filled in by randomly selecting one of the possible words appropriate to each category. Each word was replaced by a randomly assigned 31-bit vector in which each word was represented by a different bit.
He is saying that since his dictionary (which was very limited) contained only 29 words, then to represent each word, he just used 29 bits. For a particular word, all bits would be off except for one, which would be flipped on. (He used 31 bits instead of 29, to keep 2 extra bits for other purposes).
For this simulation the input layer and output layers contained 31 nodes each, and the hidden and context layers contained 150 nodes each.
The task given to the network was to learn to predict the order of successive words. That’s why the output had to have 31 nodes, this allowed it to represent the one word that was being predicted. In a sentence such as “boy chases dog”, the inputs might be ‘boy’ (with all the context nodes in the input layer being set to 0.5, which is halfway between ‘off’ and ‘on’) and then at the next time instant, “chases” and whatever values the hidden nodes had obtained when ‘boy’ was activated, and at the same time ‘dog’ would be presented to the output layer, and any mismatch between prediction and “dog” used to adjust the weights to make the net predict “dog”.
A problem arises since at any point there are various possibilities. A boy doesn’t have to chase a dog, he could chase a cat. A boy doesn’t have to chase anything, the sentence could be “boy sings.” However, there are probabilities, a verb will usually follow a noun, and some of the actions a “boy” prefers to do varies from what a “girl” does.
In Elman’s words:
For any given sequence of words there are a limited number of possible successors. Under these circumstances, the network should learn the expected frequency of occurrence of each of the possible successor words; it should then activate the output nodes proportional to these expected frequencies.
So what did the net learn from the co-occurrence statistics of the words in the various sentences?
Elman did a statistical cluster analysis of the activation values of the hidden nodes (nodes of the hidden layer) after the net had learned all the patterns. In doing this analysis he would present a word, or the first words of a simple sentence, and then look at the values of the activations of the hidden nodes. He would record those values. The clusters of the Hidden-node patterns formed a hierarchy. Words that were closest in meaning created activation patterns in the hidden layer that were close to each other.
The clusters showed that the net learned that:
there are several major categories of words. One large category corresponds to verbs; another category corresponds to nouns. The verb category is broken down into groups that require a direct object, or are intransitive, or where a direct object is optional. The noun category is broken into two major groups: inanimates and animates. Animates are divided into human and nonhuman; the nonhuman are divided into large animals and small animals. Inanimates are broken into breakable, edibles, and nouns which appeared as subjects of agentless active verbs.
…The network is not able to predict the precise order of words, but it recognizes that (in this corpus) there is a class of inputs (namely, verbs) that typically follow other inputs (namely, nouns). This knowledge of class behavior is quite detailed; from the fact that there is a class of items which always precedes “chase, break,” “smash,” it infers that the large animals form a class.
Elman notes that the hierarchy is “soft” and implicit. While some categories may be qualitatively distinct (i.e., very far from each other in space), there may also be other categories that share properties and have less distinct boundaries.
He says that there are limits to how much memory the net can have since:
There is limited numeric precision in the machines on which these simulations are run; the activation function is repetitively applied to the memory and results in exponential decay.
Though this work was done a long time ago, the ideas are being used today, notably by Brain Corp, which has a net that has a hierarchy of mini-nets, each of which use their hidden layer as a source of context for other mini-nets. I will talk about their research (they sell robots partly based on it), in my next post.
You could speculate where else this basic idea can be used. Can it find organizational structure in other types of data than text? Can it find patterns in history?
Finding Structure in Time by Jeffrey Elman University of California, San Diego – COGNITIVE SCIENCE 14, 179-211 (1990)