Deep learning currently dominates machine learning publications, while wide networks get far less attention. Setting current trends aside and concentrating only on accuracy, however, the differences might not be as big as some might think.
Motivation: Let's take 9 independent features, each with the same probability of correctness p, which is in (0.5, 1.0]. What is the probability that the output is correct when we combine all 9 features together in one majority vote (blue line), and what is it when we do it in a deep manner, where each group of 3 features votes to create 1 new feature and we then combine the 3 new features together (red line)?
Python script for the graph above. The blue line (wide network) lies above the red line (deep network) for every p in (0.5, 1); the two meet only at p = 1.
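The comparison can be reproduced without plotting. The following is a minimal sketch, not the original script: the function names `majority_prob`, `wide` and `deep` are made up for illustration, and it assumes the features are independent and that every combination step is a simple majority vote.

```python
from math import comb

def majority_prob(n, p):
    """Probability that more than half of n independent votes are correct,
    when each vote is correct with probability p (n is assumed odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def wide(p):
    # One majority vote over all 9 features.
    return majority_prob(9, p)

def deep(p):
    # First layer: each group of 3 features votes, giving a new feature
    # that is correct with probability q.
    q = majority_prob(3, p)
    # Second layer: the 3 new features vote again.
    return majority_prob(3, q)

if __name__ == "__main__":
    for p in (0.55, 0.6, 0.7, 0.8, 0.9, 1.0):
        print(f"p={p:.2f}  wide={wide(p):.4f}  deep={deep(p):.4f}")
```

At p = 0.6, for example, the wide vote is correct with probability of about 0.733 and the deep one with about 0.716.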
Now what if we go to infinity?
As it turns out, if we can generate an infinite number of independent features, each with a probability of correctness from (0.5, 1.0], all we have to do is count which features are for and which are against.
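Instead of a technical proof, here is a graph showing how the sum of the first k+1 terms in the expansion of (p+(1-p))^(2k+1), which is the probability that at least k+1 of the 2k+1 features are correct, evolves with increasing k: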
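The same trend can be checked numerically. This is a rough sketch under stated assumptions: `partial_sum` is a hypothetical helper name, the terms of the expansion are taken in order of decreasing power of p, and k is kept small enough that the binomial coefficients still fit into a float.

```python
from math import comb

def partial_sum(k, p):
    """Sum of the first k+1 terms of (p + (1-p))^(2k+1), i.e. the terms
    with at most k factors of (1-p)."""
    n = 2 * k + 1
    return sum(comb(n, i) * p**(n - i) * (1 - p)**i for i in range(k + 1))

if __name__ == "__main__":
    p = 0.6  # any value from (0.5, 1.0] shows the same trend
    for k in (0, 1, 4, 10, 50, 200):
        print(f"k={k:3d}  features={2 * k + 1:3d}  sum={partial_sum(k, p):.6f}")
```

For p above 0.5 the printed sums grow toward 1 as k increases, while for p = 0.5 they stay at 0.5.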
Or, if this limit does not satisfy you as a proof (a rigorous one might be quite challenging and technical), you might consider reading Optimal blending of multiple independent prediction models, which plays around with the same idea from a statistical point of view (it has its own Python script as well).