Thoughts on the Forward Forward Algorithm
A breakdown of Geoffrey Hinton's Forward Forward algorithm
Geoffrey Hinton, notable for his work in deep learning, namely artificial neural networks, recently proposed a new method for training neural networks: the Forward Forward (FF) algorithm [1]. The traditional approach to training a neural network involves a forward pass through the network to compute some loss function, followed by a backward pass (backpropagation) that computes the gradients of the loss with respect to the parameters (weights) of the network; these gradients are then used to update the parameters via an optimization method such as stochastic gradient descent until we converge to a local (or global) optimum. FF essentially replaces the backward pass with another forward pass. Before we detail the algorithm, why propose an alternative to the widely successful backpropagation? Aside from the nature of scientific research (even work that is ultimately shelved has value), backpropagation has several issues, so there is value in exploring alternatives.
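For concreteness, here is a minimal sketch of that traditional loop, written as a generic PyTorch example of my own (the model, sizes, and data are placeholders, not anything from Hinton's paper): a forward pass computes the loss, a backward pass computes the gradients, and stochastic gradient descent updates the weights.

```python
import torch
import torch.nn as nn

# Toy network and data, purely for illustration.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 784)            # a batch of fake inputs
y = torch.randint(0, 10, (32,))     # fake labels

logits = model(x)                   # forward pass
loss = loss_fn(logits, y)           # scalar loss
loss.backward()                     # backward pass: gradients via the chain rule
optimizer.step()                    # stochastic gradient descent update
optimizer.zero_grad()               # clear gradients for the next step
```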
Issues with Backpropagation
If you are not familiar with backpropagation, I recommend reviewing this interlude. Broadly, neural networks use backpropagation, an application of the chain rule, to compute the gradient of the loss function with respect to each parameter of the network. One may then apply an update rule such as stochastic gradient descent to adjust each parameter using its computed gradient. Backpropagation is powerful and, as of this writing, the de facto technique for training neural networks. But, as Hinton highlights, the following are some issues with backprop and reasons why FF may be a promising alternative:
Does the brain implement backprop (more discussion on this towards the coda)? He states that there is no experimental evidence thus far that cortical connections compute error derivatives, and that human perceptual learning likely depends on a mechanism that learns on the fly. Personally, I don’t think this is a strong point. However, there is some value in considering how the brain may operate under the hood as a reference point.
Furthermore, backprop requires white-box knowledge of the computations performed in the forward pass. If the forward pass is a black box, it is difficult to formulate a differentiable model of it with which to perform the backward pass.
With FF, neural networks do not need to fall back to reinforcement learning to update the weights when the forward pass cannot be differentiated; reinforcement learning suffers from high variance as the number of parameters in the network increases.
I think points two and three are reasonable arguments for this proposal.
The Forward Forward Algorithm
The FF algorithm replaces the forward and backward passes with two forward passes that operate in exactly the same way but on different data and with opposite objectives: we want high goodness for positive data (one forward pass) and low goodness for negative data (the other forward pass). In the supervised setting, for example, positive data may be an image paired with the correct label, and negative data may be the same image paired with an incorrect label. Hinton uses the sum of the squared activities of a layer as its goodness (he notes that other measures are possible, such as the negative sum of the squared activities). An activity of a neuron is just the output of its activation function (such as the Rectified Linear Unit, or ReLU for short). Writing h_i^j for the output (activity) of neuron i in layer j, the goodness of layer j is sum_i (h_i^j)^2. In the positive forward pass, the weights are adjusted to increase the goodness of each layer so that it exceeds some threshold theta; in the negative forward pass, the weights are adjusted to decrease the goodness of each layer so that it falls below that same threshold.
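As a concrete illustration, here is how the goodness of a single fully connected ReLU layer could be computed. This is my own numpy sketch, with the layer shapes and threshold chosen arbitrarily, not code from the paper.

```python
import numpy as np

def layer_goodness(x, W, b):
    """One forward step through a fully connected ReLU layer.

    Returns the layer's goodness (sum of squared activities) and the
    activity vector h, whose entries are the h_i^j described above.
    """
    h = np.maximum(0.0, x @ W + b)   # ReLU activities of this layer
    return np.sum(h ** 2), h

# Arbitrary example values.
rng = np.random.default_rng(0)
x = rng.normal(size=784)                      # input vector (e.g., a flattened image)
W = rng.normal(scale=0.05, size=(784, 256))
b = np.zeros(256)
theta = 2.0                                   # the threshold theta

goodness, h = layer_goodness(x, W, b)
# Positive pass: adjust W, b so that goodness > theta.
# Negative pass: adjust W, b so that goodness < theta.
```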
For multi-layer networks, layer normalization is applied to a layer's output vector before it is passed as input to the subsequent layer, so that only the direction of the activity vector (and not its goodness) is forwarded. For inference, the probability that an input vector is positive data can be computed as the sigmoid of the layer's goodness minus that same threshold: p(positive) = sigmoid(sum_i (h_i^j)^2 - theta).
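Continuing the same hypothetical numpy sketch, the two pieces described in this paragraph might look as follows (again my own illustration, not the paper's code):

```python
import numpy as np

def normalize(h, eps=1e-8):
    """Length-normalize a layer's activity vector before the next layer,
    so only its direction (not its goodness) is passed forward."""
    return h / (np.linalg.norm(h) + eps)

def p_positive(goodness, theta):
    """Probability that the input is positive data: sigmoid(goodness - theta)."""
    return 1.0 / (1.0 + np.exp(-(goodness - theta)))
```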
To update the weights, we simply take the gradient of the log probability that a particular input vector is positive with respect to each weight, layer by layer, and add that gradient to the weight during the forward pass with positive data, or subtract it during the forward pass with negative data. This is because we want the probability to be high for positive data and low for negative data (bringing each layer's goodness closer to satisfying its constraint with respect to the threshold).
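Putting the pieces together, a toy per-layer update consistent with the rule just described could look like this. It is a sketch under my own simplifying assumptions (a single fully connected ReLU layer, one input vector at a time, hand-derived gradients), not Hinton's reference implementation.

```python
import numpy as np

def ff_layer_update(W, b, x, theta, lr, positive):
    """One local FF-style update for a single ReLU layer.

    Gradient *ascent* on log p(positive) for positive data,
    gradient *descent* on log p(positive) for negative data.
    """
    h = np.maximum(0.0, x @ W + b)                  # activities h_i^j
    goodness = np.sum(h ** 2)
    p = 1.0 / (1.0 + np.exp(-(goodness - theta)))   # p(positive)

    # d log p / d goodness = 1 - p, and d goodness / dW = 2 * outer(x, h)
    # (the ReLU derivative is absorbed because h_i = 0 where the unit is off).
    grad_W = (1.0 - p) * 2.0 * np.outer(x, h)
    grad_b = (1.0 - p) * 2.0 * h

    sign = 1.0 if positive else -1.0                # add for positive, subtract for negative
    W += sign * lr * grad_W
    b += sign * lr * grad_b

    # The length-normalized activities are what the next layer would receive.
    return h / (np.linalg.norm(h) + 1e-8)
```

Note that the update uses only quantities available during that layer's own forward pass, which is what lets each layer learn locally without a global backward pass.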
In the paper, Hinton details some preliminary experiments to demonstrate the promise of Forward Forward, comparing it to backpropagation on standard datasets such as CIFAR and MNIST. I refer the reader to the original paper for a complete breakdown of the experimental setup and results. To summarize, FF consistently returned an error rate hundreds of basis points (a few percentage points) higher than the backpropagation baseline. The inference speed of the algorithm is not notably faster either (though training time could potentially be reduced via parallelism, since FF does not require backpropagation's compute-intensive backward pass). It seems Hinton is more focused on how one does not need to rely on reinforcement learning and, more subtly, on how the FF algorithm may more accurately mimic the processes of the human brain. Regardless, incremental work is necessary to show the promise (if any) of this algorithm. Hinton poses several questions towards the end of the paper that may be of interest to the reader.
Personal Remarks
Hinton states that it is unlikely that the brain implements backpropagation because there is no evidence of such a mechanism (the first point in the “Issues with Backpropagation” section). I believe this is a logical fallacy: absence of evidence is not evidence of absence. A helpful analogy regarding deep learning is the following. Humans have successfully designed and built systems capable of aviation (planes). Did we try to replicate precisely how birds fly? Then why get unnecessarily bogged down in the details of neuroscience (aside from our species’s innate curiosity) to build intelligent systems? That is not to say there is no value in coupling the two domains (or that doing so is never necessary for building intelligent systems), but it should not be a hard constraint.