
Towards Principled Design of Deep Convolutional Networks: Introducing SimpNet - ghosthamlet
https://arxiv.org/abs/1802.06205
======
p1esk
I'm sure a lot of effort went into writing this paper, but I have to say I'm
underwhelmed by the reported results.

First of all, CIFAR results in 2018 are just not going to cut it. The
efficiency is only measured in terms of parameter count, and not even compared
to CondenseNet. "SAF Pooling" does not deserve a special name, as it's just
max pooling followed by dropout. And again, no significant performance benefit
is reported.
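
To be concrete, as far as I can tell the whole block amounts to something like
this in PyTorch (window size and drop rate are my guesses, not numbers from the
paper):

    import torch.nn as nn

    # "SAF pooling" as described: ordinary max pooling followed by dropout.
    # The 2x2 window and p=0.1 are placeholders I picked, not the paper's settings.
    saf_pool = nn.Sequential(
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Dropout(p=0.1),
    )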

Overall, the paper seems like a solid class project, but the contribution to
the field is minimal.

~~~
Mortal_Mind
Have you read the whole paper? Because the whole point of the paper was not to
introduce a new architecture or break any new state-of-the-art results, but to
identify and introduce a series of guidelines by which one can achieve better
performance in "any" architecture.

Their "demo" architecture which they call simpnet outperforms much deeper and
more complex counterparts without the use of any fancy techniques such as skip
connections, DenseBlock, etc. Not only their test network which was in fact a
demo showcasing the effectiveness of the mentioned guidelines (as they
indicate that in section 3 I guess) could outperform much heavier and more
complex counterparts, the memory usage and training speed is also different
and much better.

The "SAF Pooling" part was to show how a design choice concerning something as
trivial as a pooling operation can particularly affect the end result,
improving network generalization capability, and what the intuition behind
such design choice was. In the section they talk about SAF Pooling, they
clearly state they deliberately did not choose a complex mechanism such as the
ones they had covered, in order to keep everything simple ( a reminder
of/reference to what they first claimed in the introduction) and show how much
performance can be harnessed out of a simple design with a proper systematic
approach.

They also show that some previously known practices/ideas should not be
followed, and why. For example, concerning overlapped pooling, they show that
the idea of overlapped pooling being better than its counterpart, advocated by
many including Professor Hinton, does not apply all the time, and that,
interestingly, non-overlapped pooling would be a better choice. That's when
they dig into the pooling operation and later show how they arrive at the SAF
pooling idea and why it makes sense. The same goes for why architectures face
degradation: they explain what the PLS and PLD issues are and why networks face
such problems. They also explain why strided convolution is not a good
replacement for max-pooling, and more.
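
For anyone following along, the overlapped vs. non-overlapped distinction is
just the window/stride choice. Roughly, in PyTorch (the exact sizes here are
only illustrative):

    import torch.nn as nn

    # Overlapped pooling (the AlexNet-style choice): 3x3 windows with stride 2,
    # so neighbouring windows share a row/column of inputs.
    overlapped = nn.MaxPool2d(kernel_size=3, stride=2)

    # Non-overlapped pooling: 2x2 windows with stride 2, so the windows tile the
    # feature map exactly; this is the variant the paper found to work better.
    non_overlapped = nn.MaxPool2d(kernel_size=2, stride=2)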

~~~
p1esk
I've seen at least a dozen papers providing "guidelines" for good NN
architecture design (e.g. "Rethinking the Inception Architecture"). I don't
mind another one, and I'm not saying they give bad advice, but as a DL
practitioner I haven't really learned much from this paper in terms of new
tricks to use.

Anyone who has tried using dropout in convolutional layers would use it after
pooling/subsampling; it just does not make sense to do otherwise. Moreover, in
my experiments, I didn't see any improvement from using dropout there. If
overfitting is a problem, dropout in conv layers is not going to help much.
I've seen a number of powerful new regularization methods published recently;
I'd try those instead.

As far as max pooling vs. stride-2 convolutions, I compared them on multiple
architectures, and while I agree that max pooling usually provided a slight
boost, there were exceptions, and the difference was typically not significant
enough to care about. In fact, I remember max pooling having a significant
advantage only in very small networks (such as LeNet-5).

The same goes for most of their advice: either you already know it, or it's not
always true, or, when it is true, it does not result in significant accuracy
gains.

More importantly, and as I pointed out already, there are two serious
shortcomings in the paper:

1\. They have not tested their architecture on ImageNet. That alone
disqualifies them from providing any advice on how to design a modern image
classifier. CIFAR is a toy dataset only suitable as a proof of concept for new
ideas, just to show that whatever crazy thing you came up with works in
principle. In 2018 no one cares about results you got on CIFAR. ImageNet is
much more challenging, and that's why it requires all those "fancy
techniques".

2\. They have not compared their architecture to the latest state of the art
(e.g. CondenseNet, which has been out for several months now). From the
numbers I see in the paper, they haven't even bothered to measure the amount
of computation (FLOPs). Show me that SimpNet outperforms existing
state-of-the-art efficiency-oriented models on ImageNet, specifically with
regard to inference latency. Then I will take the advice seriously.

EDIT: I just read the section in the appendix which argues for using non-
overlapping pooling (vs. overlapping), and I fail to see any compelling
explanation. Yes, higher-layer feature maps show the presence of abstract
features. Why should non-overlapping pooling be better there?

~~~
Mortal_Mind
I agree on ImageNet, and I also think they should publish their results on
that dataset. I don't see why it shouldn't perform well. CIFAR-10 is small
compared to ImageNet; however, achieving highly competitive accuracy on it is
still very hard. Nearly all SOTA architectures have 20-50 million parameters
or high memory usage, so I still think their result is impressive, considering
their "small and simple" architecture with that number of parameters. If they
had used any new/fancy techniques to achieve the current accuracy, I would say
you are right, but considering everything that's been said, it's impressive
for its kind. It's as if a normal car were already on par with top sports cars
that used everything at hand to achieve their maximum performance. Think what
happens if an already well-performing car gets tuned up with those fancy
techniques its rivals are using. (That's the main idea of the paper, by the
way.)

Again, we need to pay attention here: models such as CondenseNet cannot be
compared with their architecture in that sense, because the nature of the work
is completely different. It also seems you missed their actual point
concerning strided convolution vs. max-pooling: strided convolution always
requires more parameters, so it imposes more overhead than the pooling
operation. If you had a limited budget in terms of FLOPs/#params, you would
clearly see what difference such a trivial yet important point makes.
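
To make the budget point concrete, here is a back-of-the-envelope comparison
for a single downsampling stage (the 256-channel width is just an example I
picked, not a number from the paper):

    import torch.nn as nn

    channels = 256  # example width, not from the paper

    # Downsampling with a strided 3x3 convolution: adds 3*3*256*256 weights
    # plus 256 biases (~590K parameters) and the corresponding multiply-adds.
    strided_conv = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)

    # Downsampling with non-overlapped max pooling: zero parameters.
    max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

    print(sum(p.numel() for p in strided_conv.parameters()))  # 590080
    print(sum(p.numel() for p in max_pool.parameters()))      # 0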

Concerning ImageNet, while I agree with you that they should also submit their
results on ImageNet, I do not think failing to do so disqualifies them
completely! The nature of the work implies you can improve "any" architecture
with the aforementioned principles (ResNet and SqueezeNet were two simple
examples in their paper, each with a single change!), so you would still
benefit from their work even if you are not using their demo architecture.
Aside from that, they are trying to provide a better understanding of why
certain things work better, so that with such intuition a better strategy or
method can be devised (the reasons for using dropout and its effect, the use
of max-pooling, the PLD and PLS issues, etc.). So unlike CondenseNet, they
were not focused on improving a specific technique; rather, they tried to
focus on the main building blocks that affect performance the most, and why.
And in choosing those building blocks, they aimed at the most primitive ones
that nearly all architectures can take advantage of. Simply put, you may start
improving the already great CondenseNet, or pretty much any architecture, by
following such principles.

I can think of a reason why they may not have included the CondenseNet
results. Looking at their comment on the arXiv page, it seems they submitted
their paper in December 2017, and looking at the date CondenseNet was
published (November 25, 2017), it's possible that by December they were
already in the process of submitting their work and never noticed CondenseNet.

They first dismissed the previous claim that advocated "use overlapped pooling
because of the image-like properties, and because non-overlapped pooling loses
a lot of information in that sense, which affects the performance adversely";
then they indicated that, considering their intuition and what actually
exists, their experiments showed non-overlapped pooling performing better.
Their intuition seems plausible, but at the same time it may not apply to all
situations.

~~~
p1esk
Ok, you have a good point regarding max pooling being more efficient than
strided convolutions. I didn't think of that.

Regarding overlapping pooling: ok, so they dismissed the previous intuition.
But I don't see any new intuition offered! Yes, those higher-layer feature
maps are more abstract, so what? Why exactly should non-overlapping pooling
work better for more abstract "images"? At least in Hinton's paper they tried
to explain it.

Ultimately, it all comes down to this:

1\. I often use a vanilla ResNet-50 with ImageNet as a baseline (e.g. [1]).
Can you tell me what suggestions from the paper I can apply to make it either
more accurate or more efficient?

2\. If I use CondenseNet [2], what suggestions from the paper would help me
improve accuracy/efficiency?

3\. Alternatively, if I use SimpNet as a starting point, what adjustments
should I make for it to be competitive on ImageNet?

I have a 4 GPU workstation sitting idle at the moment, so I can test any of
these scenarios right away.

[1]
[https://github.com/tensorflow/models/blob/master/official/re...](https://github.com/tensorflow/models/blob/master/official/resnet/imagenet_main.py)
[2]
[https://github.com/ShichenLiu/CondenseNet](https://github.com/ShichenLiu/CondenseNet)

~~~
Mortal_Mind
Forget about more abstract "images"; you are not dealing with "images" from a
certain point forward in a network. The concept and nature become very
different: you are dealing with feature maps. When that is the case, you can no
longer make assumptions based on dealing with images. If we are dealing with
features, it may make sense not to include duplicated features, which would
ultimately result in patterns that are not very distinctive or accurate. I
guess when you avoid the duplicated features, or, to rephrase it better,
prevent the overlapping, you are essentially paying attention to the unique
combination of features in the input, and as the network is trained further,
such patterns get more developed and the network has a much easier time
discriminating between classes because of better feature disentanglement.
That's what I think is happening.

1. There are more than 10 points in the paper. Simple things that I guess you
can readily do are (a rough starting point for a couple of them is sketched
below):

1) Increase the interlayer connectivity.

2) Use larger feature-maps.

3) Use max-pooling (or its dropout variant): change the upper layers that do
the downsampling to have a stride of 1 instead of 2, and then put a
max-pooling after them to do the downsampling.

4) Use global max-pooling instead of global average-pooling.

5) ResNet seems to use padding and a stride of 2 in its early layers. There
you could: set padding = 1 and stride = 2, remove the immediate pooling layer,
and instead apply the stride of 2 to the outgoing volumes (the side branch and
the main one), keeping a stride of 1 for the others; the early local
information is important, so the operation max-pooling carries out there would
not be interesting. Also experiment with removing the 1x1 kernels in the early
layers (as a starting point, change the first one or two resblocks, then add
more blocks in the next round of changes), and remember to use global pooling
at the end as well. Since removing the 1x1 kernels in the early layers
increases #params, first run the test against a counterpart ResNet with the
same increased #params, or balance out the neurons in the next round (as they
show in their paper, this balancing can be tricky and may need a well-planned
distribution).
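
As a rough starting point with torchvision's ResNet-50, something like this
covers item 4) plus a SAF-style stem pooling; the window size and dropout rate
are my own guesses, and the stride-1-plus-max-pooling rewrite of the
downsampling blocks in item 3) would need more surgery than shown here:

    import torch.nn as nn
    from torchvision.models import resnet50

    model = resnet50()

    # Item 4: global max-pooling instead of global average-pooling.
    model.avgpool = nn.AdaptiveMaxPool2d((1, 1))

    # A SAF-style tweak to the stem: non-overlapped max pooling followed by
    # dropout (window size and drop rate are placeholders I picked).
    model.maxpool = nn.Sequential(
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Dropout(p=0.1),
    )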

2. You can go about this like ResNet; they share similar basic building
blocks: multiple poolings, the use of stride 2, and average pooling.

3. SimpNet was a demo architecture, so I don't think that would make sense :-/
But anyway, if we were to enhance it with what's available, I would say
SimpNet would benefit much more from being deeper and having interlayer
connectivity. Since they point to papers such as DenseNet, stating that one of
the main reasons those perform very well is the better and richer information
pools made available by interlayer connectivity, we can use that as well.
Aside from that, they have not included their ImageNet-tuned network; I
couldn't find one in their repo as of today. The ImageNet model may differ, so
we can either run our tests on the CIFAR-10 model that we have or wait until
they update their repo with the ImageNet model.

One important thing to note is that after all the changes, the result should
be fine-tuned (hyperparameter tuning, multiple runs to check the variance and
account for random-initialization effects, etc.) so we can come to the right
conclusion. (By the way, whatever result you get, I think it would be a good
idea to email the authors and let them know about your experiments.)

Good luck and keep me posted

