Deep vs Wide

Deep learning publications currently dominate machine learning, far outnumbering work on wide networks. But if we set current trends aside and look only at precision (accuracy), the differences might not be as big as some might think.

Dataset | Deep Networks | Wide Networks
TIMIT & CIFAR-10 | Acoustic modeling using deep belief networks | Do Deep Nets Really Need to be Deep?
NORB & CIFAR-10 | Learning methods for generic object recognition with invariance to pose and lighting | An analysis of single-layer networks in unsupervised feature learning
MNIST & ADS | Stochastic pooling for regularization of deep convolutional neural networks | #1 Linear Regression on a Set of Selected Templates from a Pool of Randomly Generated Templates; #2 Optimal blending of multiple independent prediction models

Motivation: Suppose we have 9 features, each correct with the same probability p in (0.5, 1.0] and, for the purpose of the calculation, independent of one another. What is the probability that the output is correct when we combine all 9 features at once, so that the output is correct whenever the majority of features is correct (blue line), and when we combine them in a deep manner, where every 3 features create 1 new feature and the 3 new features are then combined (red line)?

Python script for the above graph. The blue line (wide networks) lies above the red line (deep networks) for every p below 1; at p = 1 both are exactly 1.
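The comparison can also be recomputed without the plot. The following is a minimal sketch, not the original script: it assumes the features are independent and that every combination step is a plain majority vote, so each probability follows from the binomial distribution. The function names `majority_prob`, `wide_prob` and `deep_prob` are illustrative choices.

```python
from math import comb


def majority_prob(p: float, n: int) -> float:
    """Probability that more than half of n independent features are correct.

    Assumes n is odd and every feature is correct with the same probability p.
    """
    need = n // 2 + 1  # smallest number of correct features that forms a majority
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(need, n + 1))


def wide_prob(p: float) -> float:
    """Wide combination: one majority vote over all 9 features."""
    return majority_prob(p, 9)


def deep_prob(p: float) -> float:
    """Deep combination: majority of 3 features, then majority of the 3 results."""
    q = majority_prob(p, 3)  # correctness of each intermediate feature
    return majority_prob(q, 3)


if __name__ == "__main__":
    for p in (0.55, 0.6, 0.7, 0.8, 0.9, 1.0):
        print(f"p={p:.2f}  wide={wide_prob(p):.4f}  deep={deep_prob(p):.4f}")
```

For p = 0.6, for example, the wide vote is correct with probability of about 0.733 and the deep one with about 0.716.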

Now what if we go to infinity?

As it turns out, if we can generate an infinite number of features, each with a probability of correctness in (0.5, 1.0], all we have to do is count which features are for and which are against.

Instead of a technical proof, here is a graph of how the sum of the first k+1 terms in the expansion of (p+(1-p))^(2k+1), that is, the probability that the majority of 2k+1 features is correct, evolves with increasing k:
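The plotted values can be recomputed directly. As a minimal sketch, assuming independent features and taking the terms of the expansion in order of decreasing power of p, the partial sum below adds the terms with at most k incorrect features. The name `partial_sum` and the chosen values of p and k are illustrative only.

```python
from math import comb


def partial_sum(p: float, k: int) -> float:
    """Sum of the first k+1 terms in the expansion of (p + (1-p))^(2k+1).

    Term i is C(2k+1, i) * p^(2k+1-i) * (1-p)^i, i.e. exactly i features are wrong.
    """
    n = 2 * k + 1
    return sum(comb(n, i) * p ** (n - i) * (1 - p) ** i for i in range(k + 1))


if __name__ == "__main__":
    # k is kept moderate: for very large k the binomial coefficients
    # no longer fit into a float when multiplied with the powers of p.
    for p in (0.55, 0.6, 0.7):
        row = "  ".join(f"k={k}: {partial_sum(p, k):.4f}" for k in (0, 1, 4, 10, 50, 200))
        print(f"p={p:.2f}  {row}")
```

With k = 0 the sum is just p, with k = 4 it is the 9-feature majority vote from the motivation, and for any fixed p above 0.5 the values climb toward 1 as k grows.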

If this limit does not convince you, and a rigorous proof of it might be quite challenging and technical, you might consider reading Optimal blending of multiple independent prediction models, which plays with the same idea from a statistical point of view (it has its own Python script as well).