Deep learning has been immensely successful in recent years, spawning a lot of hope and generating a lot of hype, but no one has really understood why it works. The prevailing wisdom has been that deep learning is capable of discovering new representations of the data, rather than relying on hand-coded features like other learning algorithms do. But because deep networks are black boxes — what Allen School professor emeritus Pedro Domingos describes as “an opaque mess of connections and weights” — how that discovery actually happens is anyone’s guess.
Until now, that is. In a new paper posted on the preprint repository arXiv, Domingos gives us a peek inside that black box and reveals what is — and just as importantly, what isn’t — going on inside. Read on for a Q&A with Domingos on his latest findings, what they mean for our understanding of how deep learning actually works, and the implications for researchers’ quest for a “master algorithm” to unify all of machine learning.
You lifted the lid off the so-called black box of deep networks, and what did you find?
Pedro Domingos: In short, I found that deep networks are not as unintelligible as we thought, but neither are they as revolutionary as we thought. Deep networks are learned by the backpropagation algorithm, which is an efficient way of applying the general gradient descent algorithm to neural networks: it repeatedly tweaks the network’s weights so that its output for each training input better matches the true output. That process helps the model learn to label an image of a dog as a dog, and not as a cat or as a chair, for instance. This paper shows that all gradient descent does is memorize the training examples, and then make predictions about new examples by comparing them with the training ones. This is actually a very old and simple type of learning, called similarity-based learning, that goes back to the 1950s. It was a bit of a shock to discover that, more than half a century later, that’s all that is going on in deep learning!
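To give a feel for what “comparing them with the training ones” means, here is a minimal sketch of the simplest case, a linear model trained by gradient descent; it is an illustration of ours, not code from the paper. Because every weight update adds a multiple of each training example, the trained predictor can be rewritten exactly as a weighted sum of similarities (here, dot products) between a new input and the training inputs. The paper’s result extends this picture, with a more sophisticated similarity measure, to deep networks trained by gradient descent.

```python
# A minimal sketch, assuming a linear model f(x) = w.x and squared-error loss.
# Each gradient step adds a multiple of each training example to the weights,
# so the trained predictor equals a similarity-weighted sum over the training set.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))    # 20 training examples with 5 features each
y = rng.normal(size=20)         # their target values
lr, steps = 0.01, 500

w = np.zeros(5)                 # weights, starting from zero
a = np.zeros(20)                # per-example coefficients accumulated during training

for _ in range(steps):
    residual = y - X @ w        # errors on the training set
    w += lr * X.T @ residual    # standard gradient-descent update
    a += lr * residual          # the same update, bookkept example by example

x_new = rng.normal(size=5)
from_weights = x_new @ w                                                # ordinary prediction
from_similarities = sum(a[i] * (x_new @ X[i]) for i in range(len(X)))   # sum_i a_i * similarity(x_new, x_i)

print(from_weights, from_similarities)   # identical up to floating-point error
```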
Deep learning has been the subject of a lot of hype. How do you think your colleagues will respond to these findings?
PD: Critics of deep learning, of which there are many, may see these results as showing that deep learning has been greatly oversold. After all, what it does is, at heart, not very different from what 50-year-old algorithms do — and that’s hardly a recipe for solving AI! The whole idea that deep learning discovers new representations of the data, rather than relying on hand-coded features like previous methods, now looks somewhat questionable — even though it has been deep learning’s main selling point.
Conversely, some researchers and fans of deep learning may be reluctant to accept this result, or at least some of its consequences, because it goes against some of their deepest beliefs (no pun intended). But a theorem is a theorem. In any case, my goal was not to criticize deep learning, which I’ve been working in since before it became popular, but to understand it better. I think that, ultimately, this greater understanding will be very beneficial for both research and applications in this area. So my hope is that deep learning fans will embrace these results.
So it’s a good news/bad news scenario for the field?
PD: That’s right. In “The Master Algorithm,” I explain that when a new technology is as pervasive and game-changing as machine learning has become, it’s not wise to let it remain a black box. Whether you’re a consumer influenced by recommendation algorithms on Amazon, or a computer scientist building the latest machine learning model, you can’t control what you don’t understand. Knowing how deep networks learn gives us that greater measure of control.
So, the good news is that it is now going to be much easier for us to understand what a deep network is doing. Among other things, the fact that deep networks are just similarity-based algorithms finally helps to explain their brittleness, whereby changing an example just slightly can cause the network to make absurd predictions. Up until now, it has puzzled us why a minor tweak would, for example, lead a deep network to suddenly start labeling a car as an ostrich. If you’re training a model for a self-driving car, you probably don’t want to hit either, but for multiple reasons — not least, the predictability of what an oncoming car might do compared to an oncoming ostrich — I would like the vehicle I’m riding in to be able to tell the difference.
But these findings could be considered bad news in the sense that it’s clear there is not much representation learning going on inside these networks, and certainly not as much as we hoped or even assumed. How to do that remains a largely unsolved problem for our field.
If they are essentially doing 1950s-style learning, why would we continue to use deep networks?
PD: Compared to previous similarity-based algorithms such as kernel machines, which were the dominant approach prior to the emergence of deep learning, deep networks have a number of important advantages.
One is that they allow incorporating bits of knowledge of the target function into the similarity measure — the kernel — via the network architecture. This is advantageous because the more knowledge you incorporate, the faster and better you can learn. This is a consequence of what we call the “no free lunch” theorem in machine learning: if you have no a priori knowledge, you can’t learn anything from data besides memorizing it. For example, convolutional neural networks, which launched the deep learning revolution by achieving unprecedented accuracy on image recognition problems, differ from “plain vanilla” neural networks in that they incorporate the knowledge that objects are the same no matter where in the image they appear. This is how humans learn, by building on the knowledge they already have. If you know how to read, then you can learn about science much faster by reading textbooks than by rediscovering physics and biology from scratch.
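As a concrete, if toy, illustration of that built-in knowledge, here is a sketch of ours (using a one-dimensional “image” rather than a real photo): a convolutional layer slides the same small filter across every position, so a pattern triggers the same response wherever it appears, whereas a plain fully connected layer would have to learn separate weights for every position.

```python
# A toy sketch, assuming a 1-D signal stands in for an image row.
# One shared filter is applied at every position, which is the
# translation symmetry a convolutional layer builds in.
import numpy as np

def conv1d(signal, kernel):
    """Slide one shared kernel across the signal (what deep learning libraries call convolution)."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

pattern = np.array([1.0, -1.0, 1.0])        # the "object" we care about
kernel = np.array([1.0, -1.0, 1.0])         # a filter tuned to that pattern

early = np.zeros(10); early[1:4] = pattern  # pattern near the start of the signal
late = np.zeros(10);  late[6:9] = pattern   # the same pattern near the end

# The filter's peak response is identical in both cases; only its location shifts.
print(conv1d(early, kernel).max(), conv1d(late, kernel).max())   # 3.0 3.0
```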
Another advantage of deep networks is that they can bring distant examples together into the same region, which makes learning more complex functions easier. And through superposition, they’re much more efficient at storing and matching examples than other similarity-based approaches.
Can you describe superposition for those of us who are not machine learning experts?
PD: Yes, but we’ll have to do some math. The weights produced by backpropagation contain a superposition of the training examples. That is, the examples are mapped into the space of variations of the function being learned and then added up. As a simple analogy, if you want to compute 3 x 5 + 3 x 7 + 3 x 9, it would be more efficient to instead compute 3 x (5 + 7 + 9) = 3 x 21. The 5, 7 and 9 are now “superposed” in the 21, but the result is still the same as if you separately multiplied each by 3 and then added the results.
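The same arithmetic, checked directly, along with a vector version (again our own illustration, not the paper’s) that is a step closer to what happens with a network’s weights:

```python
import numpy as np

# The analogy from the answer, verified directly.
assert 3 * 5 + 3 * 7 + 3 * 9 == 3 * (5 + 7 + 9) == 63

# The same distributive trick with vectors: three per-example "contributions"
# are superposed into one weight vector, and a single dot product with a new
# input gives the same answer as comparing against each contribution separately.
x_new = np.array([1.0, 2.0])
contributions = [np.array([0.5, -1.0]), np.array([2.0, 0.0]), np.array([-0.5, 3.0])]

one_by_one = sum(x_new @ c for c in contributions)  # compare with each, then add
superposed = x_new @ sum(contributions)             # add first ("superpose"), compare once
assert np.isclose(one_by_one, superposed)
```

The savings come from doing the expensive comparison step once, against the sum, rather than once per stored example.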
The practical result is that deep networks are able to speed up learning and inference, making them more efficient, while reducing the amount of computer memory needed to store the examples. For instance, if you have a million images, each with a million pixels, you would need on the order of terabytes to store them. But with superposition, you only need an amount of storage on the order of the number of weights in the network, which is typically much smaller. And then, without superposition, if you wanted to predict what a new image contains, such as a cat, you would need to cycle through all of those training images and compare them with the new one. That can take a long time. With superposition, you just have to pass the image through the network once, which takes much less time. It’s the same with answering questions based on text: without superposition, you’d have to store the whole corpus and search through it, instead of working from a compact summary of it.
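To put rough numbers on that comparison, here is a back-of-the-envelope sketch; the network size is a hypothetical figure of ours, chosen only to make the orders of magnitude concrete:

```python
# A back-of-the-envelope sketch with made-up but plausible sizes.
n_examples, n_pixels = 1_000_000, 1_000_000   # a million images, a million pixels each
n_weights = 25_000_000                        # hypothetical network size
bytes_per_value = 4                           # 32-bit floats

raw_storage = n_examples * n_pixels * bytes_per_value   # keep every training image
net_storage = n_weights * bytes_per_value               # keep only the weights

ops_similarity_search = n_examples * n_pixels   # compare a new image against every stored one
ops_forward_pass = n_weights                    # roughly one multiply-add per weight

print(f"storage: ~{raw_storage / 1e12:.0f} TB of raw images vs ~{net_storage / 1e9:.1f} GB of weights")
print(f"work per prediction: {ops_similarity_search:.1e} ops vs {ops_forward_pass:.1e} ops")
```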
So your findings will help to improve deep learning models?
PD: That’s the idea. Now that we understand what is happening when the aforementioned car suddenly becomes an ostrich, we should be able to account for that brittleness in the models. If we think of a learned model as a piece of cheese and the failure regions as holes in that cheese, we now understand better where those holes are and what their shape and size are. Using this knowledge, we can actively figure out where we need new data or adjustments to the model to fix the holes. We should also get better at defending against attacks that tweak a few pixels in an image so that it falls into one of those holes and the network misclassifies it. An example would be attempts to fool self-driving cars into misrecognizing traffic signs.
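For readers curious what such an attack looks like mechanically, here is a toy sketch of ours on the simplest possible model, a linear classifier; it is not the paper’s analysis, just the standard “nudge every pixel slightly in the worst direction” idea:

```python
# A toy sketch, assuming a linear classifier stands in for the network.
# Tiny per-pixel nudges in the worst-case direction flip the decision,
# i.e., push the input into one of the "holes".
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=100)                     # weights of a toy linear classifier
x = rng.normal(size=100)
x = x - (x @ w / (w @ w)) * w                # remove the component of x along w ...
x = x + 0.1 * w / np.linalg.norm(w)          # ... and put back a weakly positive score

eps = 0.05                                   # a barely visible per-pixel change
x_adv = x - eps * np.sign(w)                 # fast-gradient-sign-style perturbation

print(x @ w > 0, x_adv @ w > 0)              # True False: the classification flips
```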
What are the implications of your latest results in the search for the master algorithm?
PD: These findings represent a big step forward in unifying the five major machine learning paradigms I described in my book, which is our best hope for arriving at that universal learner, what I call the “master algorithm.” We now know that all learning algorithms based on gradient descent — including but not limited to deep networks — are similarity-based learners. This fact serves to unify three of the five paradigms: neural, probabilistic, and similarity-based learning. Tantalizingly, it may also be extensible to the remaining two, symbolic and genetic learning.
Given your findings, what’s next for deep learning? Where does the field go from here?
PD: I think deep learning researchers have become too reliant on backpropagation as the near-universal learning algorithm. Now that we know how limited backprop is in terms of the representations it can discover, we need to look for better learning algorithms! I’ve done some work in this direction, using combinatorial optimization to learn deep networks. We can also take inspiration from other fields, such as neuroscience, psychology, and evolutionary biology. Or, if we decide that representation learning is not so important after all — which would be a 180-degree change — we can look for other algorithms that can form superpositions of the examples and that are compact and generalize well.
Now that we know better, we can do better.