
Review of Large Margin Deep Networks for Classification

Current deep learning research is pushing forward in several directions: new network architectures keep appearing, new optimization techniques enable faster distributed DNN training, and, last but not least, new loss functions are being studied, which is the topic of this blog post.
Under the motto 'You get what you optimize', the authors argue that the benefits of a large-margin loss function include better robustness to input perturbations. In section 4 of the paper they show on the MNIST dataset that with 80% noisy labels the l2-margin loss outperforms cross-entropy (96.4% vs. 93.9% accuracy). Unfortunately, even though the paper states that 'margin l1 and l2 models perform better than cross-entropy across the entire range of noise levels', the graph suggests that at 0% and 60% noisy labels the accuracy of the l2-margin loss and cross-entropy is about the same (figure 2, left).
For CIFAR-10 they perform the same analysis, and in this case the l2-margin loss clearly outperforms cross-entropy across the entire range of 0% to 80% noisy labels (figure 4, left).
Considering that the l2-margin loss is computationally more expensive than cross-entropy (20%-60% more expensive, section 4.1 Optimization of Parameters), and that for some datasets (MNIST) with 0% noisy labels the accuracy of the two is about the same, the l2-margin loss might not be beneficial for every problem. This is especially true given that the large-margin softmax loss (cited in the paper) already brings accuracy advantages on the MNIST, CIFAR-10, CIFAR-100 and LFW datasets.
The authors also mention other publications that use large margins in the context of deep networks; their claimed contribution is that they apply margins not only at the output layer of a deep neural network, but at multiple hidden layers as well. In section 3 they derive the corresponding formulas, and they credit the origin of the idea to SVMs.
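To make the idea more concrete, here is a minimal PyTorch-style sketch of the first-order margin approximation derived in section 3, applied only at the input layer and against a single competing class. The function name `l2_margin_loss` and the default values are my own illustration, not the authors' code; the paper additionally aggregates the penalty over several layers and several competitor classes.

```python
import torch
import torch.nn.functional as F

def l2_margin_loss(model, x, y, gamma=10.0, eps=1e-6):
    """Sketch of a first-order l2-margin loss (assumed simplification).

    For each sample we take the highest-scoring incorrect class j and
    approximate the distance to the decision boundary between the true
    class y and j as (f_y - f_j) / ||grad_x f_y - grad_x f_j||_2,
    then apply a hinge at margin gamma.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)                                   # (batch, classes)

    f_y = logits.gather(1, y.unsqueeze(1)).squeeze(1)   # true-class logit
    masked = logits.clone()
    masked.scatter_(1, y.unsqueeze(1), float('-inf'))
    j = masked.argmax(dim=1)                            # strongest competitor
    f_j = logits.gather(1, j.unsqueeze(1)).squeeze(1)

    # gradient of the logit difference w.r.t. the input
    diff = (f_y - f_j).sum()
    grad = torch.autograd.grad(diff, x, create_graph=True)[0]
    grad_norm = grad.flatten(1).norm(p=2, dim=1) + eps

    # signed first-order distance to the decision boundary, hinged at gamma
    dist = (f_y - f_j) / grad_norm
    return F.relu(gamma - dist).mean()
```

The same construction can in principle be repeated with the gradient taken with respect to a hidden activation instead of the input, which is how the paper extends the margin to multiple layers.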
The biggest contribution of the paper, as I see it, is the analysis of three cases:
  • noisy data
  • using only a fraction of the data
  • adversarial perturbations