
You Only Look Once: Unified, Real-Time Object Detection - yogrish
http://arxiv.org/abs/1506.02640
======
mike_hearn
This is very cool. From the videos, it looks like the next step for them is to
add some sort of temporal stability so that detected objects aren't
momentarily forgotten across frames and the bounding boxes expand and contract
smoothly. It's obvious that detection is being run frame-by-frame.

I also wonder to what extent merging the detection with underlying P-frame
information from the video codecs would help. Knowing that a segment of video
just moved to the left would mean the detected object could be moved to the
left by the same amount, even if it was passing behind another object.
Calculating the movement vectors independently seems silly if you can get that
data from the underlying video codec itself.
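
The idea above can be sketched in a few lines. This is purely illustrative:
`Box` and `shift_box` are hypothetical names, not part of any real decoder
API, and a real system would aggregate the codec's per-block motion vectors
over the detected region rather than take a single (dx, dy).

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

def shift_box(box, dx, dy, frame_w, frame_h):
    """Translate a detection by a P-frame motion vector (dx, dy),
    clamping so the box stays inside the frame. This keeps the box
    tracking the object between full detector runs."""
    nx = min(max(box.x + dx, 0.0), frame_w - box.w)
    ny = min(max(box.y + dy, 0.0), frame_h - box.h)
    return Box(nx, ny, box.w, box.h)
```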

------
dplarson
They named their method "YOLO"…

Edit: to add something more "helpful" to this comment, their paper links to a
YouTube channel [1] that shows demos of their method, which I think is great.

[1] [https://goo.gl/bEs6Cj](https://goo.gl/bEs6Cj)

~~~
nxnfufunezn
(Darknet)
[https://github.com/pjreddie/darknet](https://github.com/pjreddie/darknet)

------
clickok
This is really cool, even inspiring. Not just because it's one of the first
examples I've seen of accurate, real-time detection powered by neural nets,
but because they're getting these results via black magic, basically.

The objective function is defined heuristically, and involves about five
different sub-objectives (top of page four). Some of the parameters chosen
seem to be rough guesses, as does the decision to scale up the images to twice
the resolution when moving from classification (the pre-training task) to
detection.
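
For a sense of what those sub-objectives look like, here is a rough sketch of
the paper's sum-squared-error loss. The weights lambda_coord = 5 and
lambda_noobj = 0.5 are the values given in the paper; the per-cell
bookkeeping (predictor assignment, grid indexing) is simplified for
illustration.

```python
import math

LAMBDA_COORD = 5.0   # weight on localisation error (from the paper)
LAMBDA_NOOBJ = 0.5   # down-weight confidence loss in empty cells

def yolo_loss(cells):
    """cells: list of dicts with 'has_obj' (bool) and 'pred'/'true'
    dicts carrying x, y, w, h, conf, and a 'classes' vector."""
    loss = 0.0
    for c in cells:
        p, t = c["pred"], c["true"]
        if c["has_obj"]:
            # 1) centre-coordinate error
            loss += LAMBDA_COORD * ((p["x"] - t["x"]) ** 2
                                    + (p["y"] - t["y"]) ** 2)
            # 2) width/height error on square roots, so large boxes
            #    are penalised less for the same absolute error
            loss += LAMBDA_COORD * ((math.sqrt(p["w"]) - math.sqrt(t["w"])) ** 2
                                    + (math.sqrt(p["h"]) - math.sqrt(t["h"])) ** 2)
            # 3) confidence error where an object is present
            loss += (p["conf"] - t["conf"]) ** 2
            # 5) class-probability error
            loss += sum((pc - tc) ** 2
                        for pc, tc in zip(p["classes"], t["classes"]))
        else:
            # 4) down-weighted confidence error for empty cells
            loss += LAMBDA_NOOBJ * (p["conf"] - t["conf"]) ** 2
    return loss
```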

It seems miraculous that a process of estimating and refinement, guided by
experience, can work on tasks where you have no mathematical guarantee that a
good solution can be found. Maybe in time we'll build the theory that explains
just why deep learning works so well, but for now I'm just kinda awed and
impressed every time one of these stories comes out.

~~~
dwiel
I share your optimism; however, it isn't obvious from the paper how many
hyperparameters and variations of the loss function were tried to get this
result. It's still cool that something so heuristic can work so well.

------
bradneuberg
In the paper they use the abbreviation mAP without explaining what it is or
providing a reference, e.g. "Fast YOLO, processes an astounding 155 frames
per second while still achieving double the mAP of other real-time
detectors". Do folks know what mAP is?

~~~
GrantS
I'm assuming "mAP" in this context is "mean Average Precision":
[http://fastml.com/what-you-wanted-to-know-about-mean-average-precision/](http://fastml.com/what-you-wanted-to-know-about-mean-average-precision/)

The other possibility would be "maximum a posteriori" which doesn't fit their
usage here.
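
For the curious, a minimal sketch of the computation: rank all detections for
a class by confidence, mark each as a true or false positive, accumulate
precision at each recall point, and average; mAP is then the mean over
classes. (VOC-style interpolation and the IoU matching step are omitted here
for brevity.)

```python
def average_precision(ranked_hits, num_gt):
    """ranked_hits: booleans, confidence-ranked (is the k-th detection
    a true positive?); num_gt: ground-truth boxes for this class."""
    tp = 0
    ap = 0.0
    for k, hit in enumerate(ranked_hits, start=1):
        if hit:
            tp += 1
            ap += tp / k  # precision at this recall point
    return ap / num_gt if num_gt else 0.0

def mean_average_precision(per_class):
    """per_class: list of (ranked_hits, num_gt), one entry per class."""
    aps = [average_precision(hits, n) for hits, n in per_class]
    return sum(aps) / len(aps)
```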

------
jbott
One of the authors has additional information posted here:
[http://pjreddie.com/darknet/yolo/](http://pjreddie.com/darknet/yolo/)

