Current deep learning research is pushing further in several directions. There are **new structures** appearing, and there are **new optimization techniques** for enabling faster distributed DNN training:

- Sparsified SGD with Memory
- 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs

And last, but not least, current research is also looking at **new loss functions**, which is the topic of this blog post.
Under the motto 'You get what you optimize', the authors of the paper argue that the benefits of using a large-margin loss function include better robustness to input perturbations. In section 4 of the paper, on the MNIST dataset, they show that with 80% noisy labels, l2-margin optimization performs better than cross-entropy (96.4% vs. 93.9% accuracy). Unfortunately, even though the paper states that 'margin l1 and l2 models perform better than cross-entropy across the entire range of noise levels', the graph (figure 2, left) suggests that for 0% and 60% noisy labels the accuracy of the l2-margin loss function and cross-entropy is about the same.

For CIFAR-10, they perform the same analysis, and in this case the l2-margin loss function performs clearly better than cross-entropy across the entire range of 0% to 80% noisy labels (figure 4, left).
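To make the noisy-label setting concrete, here is a minimal sketch of how a fraction of training labels could be corrupted before training. It assumes that "noisy" means replacing a label with a different class chosen uniformly at random; the paper's exact procedure may differ, and the function name `corrupt_labels` and its parameters are hypothetical.

```python
import numpy as np

def corrupt_labels(labels, noise_fraction, num_classes, seed=0):
    """Return a copy of `labels` in which a fraction is replaced by a different class.

    Assumption: noise is uniform over the other classes (not taken from the paper).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_noisy = int(round(noise_fraction * labels.size))
    # Choose which examples to corrupt, without replacement.
    idx = rng.choice(labels.size, size=n_noisy, replace=False)
    # An offset in [1, num_classes - 1] modulo num_classes always yields a different class.
    offsets = rng.integers(1, num_classes, size=n_noisy)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy

# Toy labels standing in for a 10-class dataset such as MNIST or CIFAR-10.
labels = np.arange(1000) % 10
noisy = corrupt_labels(labels, noise_fraction=0.8, num_classes=10)
print((noisy != labels).mean())  # fraction of labels that were changed
```

Training one model with cross-entropy and one with a margin loss on `noisy`, then evaluating both on clean test labels, would reproduce the shape of the comparison described above.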

Considering that the l2-margin optimization function is computationally more expensive than cross-entropy (20%-60% more expensive, per section 4.1, Optimization of Parameters), and that for some datasets (MNIST) with 0% noisy labels the accuracy of the two techniques is about the same, using the l2-margin optimization function might not be beneficial for all the problems out there. This is especially so if using large-margin softmax loss (which is cited in the paper) brings advantages in terms of accuracy for the MNIST, CIFAR10, CIFAR100 and LFW datasets.

The authors also mention other publications that use large margins in the context of deep networks; their claimed contribution is that they apply margins not only at the output layer of a deep neural network, but at multiple hidden layers as well. In section 3 the authors derive the formulas. They also mention the origin of the idea (SVM).
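The SVM intuition behind a margin loss can be sketched for a plain linear classifier, where the distance from an input to a decision boundary has a closed form. This is only an illustration of the general idea, not the formulas from section 3 of the paper: for non-linear networks the distance generally cannot be computed exactly, and the names `linear_margin_loss` and `gamma` are hypothetical.

```python
import numpy as np

def linear_margin_loss(W, b, x, y, gamma):
    """Hinge-style penalty on the L2 distance from x to the nearest decision boundary.

    W: (num_classes, dim) weights and b: (num_classes,) biases of a linear
    scorer f(x) = W x + b; y is the true class; gamma is the desired margin.
    """
    scores = W @ x + b
    worst = 0.0
    for i in range(W.shape[0]):
        if i == y:
            continue
        # Signed distance from x to the boundary {f_i = f_y};
        # positive when x lies on the correct side.
        dist = (scores[y] - scores[i]) / np.linalg.norm(W[y] - W[i])
        # Penalize points closer than gamma to the boundary, or misclassified.
        worst = max(worst, gamma - dist)
    return float(worst)

# Two classes separated by the line x1 = 0; the point sits 0.5 away from it.
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.zeros(2)
x = np.array([0.5, 0.0])
print(linear_margin_loss(W, b, x, y=0, gamma=1.0))   # margin of 1.0 is violated
print(linear_margin_loss(W, b, x, y=0, gamma=0.25))  # margin of 0.25 is satisfied
```

Unlike cross-entropy, this penalty drops to zero once every example is at least `gamma` away from all boundaries, which is the behaviour the SVM analogy refers to.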

The biggest contribution of the paper, as I see it, is the analysis of three cases:

- noisy data
- using only a fraction of the data
- adversarial perturbations
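
For the third case, a toy example shows why distance to the decision boundary matters for adversarial perturbations. The sketch below assumes a linear two-class scorer, for which the smallest L2 perturbation that changes the prediction points straight at the boundary; it is not the attack or the evaluation used in the paper, and `boundary_perturbation` is a hypothetical helper.

```python
import numpy as np

def boundary_perturbation(W, b, x, y, target, scale):
    """Move x toward the boundary between classes y and target.

    scale=1.0 lands exactly on the boundary; scale > 1.0 crosses it.
    """
    scores = W @ x + b
    normal = W[y] - W[target]  # direction in which f_y - f_target grows fastest
    norm = np.linalg.norm(normal)
    dist = (scores[y] - scores[target]) / norm  # L2 distance to the boundary
    return x - scale * dist * (normal / norm)

def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    return int(np.argmax(W @ x + b))

W = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.zeros(2)
x = np.array([0.5, 2.0])  # classified as class 0, at distance 0.5 from the boundary

x_safe = boundary_perturbation(W, b, x, y=0, target=1, scale=0.9)
x_adv = boundary_perturbation(W, b, x, y=0, target=1, scale=1.1)
print(predict(W, b, x), predict(W, b, x_safe), predict(W, b, x_adv))
print(np.linalg.norm(x_adv - x))  # size of the perturbation that flips the label
```

A perturbation shorter than the distance to the boundary leaves the prediction unchanged, so pushing training points further from the boundary raises the minimum perturbation an attacker needs in this linear setting.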