
YOLOv5: State-of-the-art object detection at 140 FPS - rocauc
https://blog.roboflow.ai/yolov5-is-here/
======
bArray
I'm just going to call this out as bullshit. This isn't YOLOv5. I doubt they
even did a proper comparison between their model and YOLOv4.

Someone asked for it not to be called YOLOv5 and their response was just awful
[1]. They also blew off a request to publish a blog/paper detailing the
network [2].

I filed a ticket to get to the bottom of this with the creators of YOLOv4:
[https://github.com/AlexeyAB/darknet/issues/5920](https://github.com/AlexeyAB/darknet/issues/5920)

[1]
[https://github.com/ultralytics/yolov5/issues/2](https://github.com/ultralytics/yolov5/issues/2)

[2]
[https://github.com/ultralytics/yolov5/issues/4](https://github.com/ultralytics/yolov5/issues/4)

~~~
rocauc
Hey all - OP here. We're not affiliated with Ultralytics or the other
researchers. We're a startup that enables developers to use computer vision
without being machine learning experts, and we support a wide array of open
source model architectures for teams to try on their data:
[https://models.roboflow.ai](https://models.roboflow.ai)

Beyond that, we're just fans. We're amazed by how quickly the field is moving
and we did some benchmarks that we thought other people might find as exciting
as we did. I don't want to take a side in the naming controversy. Our core
focus is helping developers get data into _any_ model, regardless of its name!

~~~
sillysaurusx
YOLOv5 seems to have one important advantage over v4, which your post helped
highlight:

 _Fourth, YOLOv5 is small. Specifically, a weights file for YOLOv5 is 27
megabytes. Our weights file for YOLOv4 (with Darknet architecture) is 244
megabytes. YOLOv5 is nearly 90 percent smaller than YOLOv4. This means YOLOv5
can be deployed to embedded devices much more easily._

Naming controversy aside, it's nice to have some model that can get close to
the same accuracy at 10% of the size.

Naming it v5 was certainly ... bold ... though. If it can't outperform v4 in
any scenario, is it really worthy of the name? (On the other hand, if v5 can
beat v4 in inference time or accuracy, that should be highlighted somewhere.)

FWIW I doubt anyone who looks into this will think roboflow had anything to do
with the current controversies. You just showed off what someone else made,
which is both legit and helpful. It's not like you were the ones that named it
v5.

On the other hand... visiting
[https://models.roboflow.ai/](https://models.roboflow.ai/) _does_ show YOLOv5
as "current SOTA", with some impressive-sounding results:

 _SIZE: YOLOv5 is about 88% smaller than YOLOv4 (27 MB vs 244 MB)

SPEED: YOLOv5 is about 180% faster than YOLOv4 (140 FPS vs 50 FPS)

ACCURACY: YOLOv5 is roughly as accurate as YOLOv4 on the same task (0.895 mAP
vs 0.892 mAP)_

Then it links to [https://blog.roboflow.ai/yolov5-is-
here/](https://blog.roboflow.ai/yolov5-is-here/) but there doesn't seem to be
any clear chart showing "here's v5 performance vs v4 performance under these
conditions: x, y, z"

Out of curiosity, where did the "180% faster" and 0.895 mAP vs 0.892 mAP
numbers come from? Is there some way to reproduce those measurements?

The benchmarks at
[https://github.com/WongKinYiu/CrossStagePartialNetworks/issu...](https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/32#issuecomment-640887979)
seem to show different results, with v4 coming out ahead in both accuracy and
speed at 736x736 res. I'm not sure if they're using a standard benchmarking
script though.

Thanks for gathering together what's currently known. The field does move
fast.

~~~
rocauc
Agreed!

Crucially, we're tracking "out of the box" performance, i.e., if a developer
grabbed model X and used it on a sample task, how could they expect it to
perform? Further research and evaluation is recommended!

For size, we measured the sizes of our saved weights files for Darknet YOLOv4
versus the PyTorch YOLOv5 implementation.

For inference speed, we checked "out of the box" speed using a Colab notebook
equipped with a Tesla P100. We used the same task[1] for both - see, e.g., the
YOLOv5 Colab notebook[2]. For Darknet YOLOv4 inference speed, we translated
the Darknet weights using the Ultralytics YOLOv3 repo (as we've seen many do
for deployments)[3]. (To achieve top YOLOv4 inference speed, one should
reconfigure Darknet carefully with OpenCV, CUDA, and cuDNN, and monitor batch
size.)

For accuracy, we evaluated mAP on the task above after quick training: the
smallest YOLOv5s model for 100 epochs against the full YOLOv4 model (using the
recommended 2000 × n iterations, where n is the number of classes - e.g. 6,000
iterations for a 3-class dataset). Our example is a small custom dataset; the
comparison should also be investigated on, e.g., 90-class COCO.

[1] [https://public.roboflow.ai/object-detection/bccd](https://public.roboflow.ai/object-detection/bccd)

[2] [https://colab.research.google.com/drive/1gDZ2xcTOgR39tGGs-EZ6i3RTs16wmzZQ](https://colab.research.google.com/drive/1gDZ2xcTOgR39tGGs-EZ6i3RTs16wmzZQ)

[3] [https://github.com/ultralytics/yolov3](https://github.com/ultralytics/yolov3)

------
nharada
I welcome forward progress in the field, but something about this doesn't sit
right with me. The authors have an unpublished/unreviewed set of results and
they're already co-opting the YOLO name (without the original author) for it
and all of this to promote a company? I guess this was inevitable when there's
so much money in ML but it definitely feels against the spirit of the academic
research community that they're building upon.

~~~
syntaxing
Totally agreed. It kinda seems dirty to call something "v5" when it is a
derivative work of the original.

~~~
oehtXRwMkIs
I think derivative is a bit generous. This is just a reimplementation of v4
with a different framework.

------
sillysaurusx
We made a site that lets you collaboratively tag a bunch of images, called
tagpls.com. For example, users decided to re-tag imagenet for fun:
[https://twitter.com/theshawwn/status/1262535747975868418](https://twitter.com/theshawwn/status/1262535747975868418)

And the tags ended up being hilarious:
[https://pbs.twimg.com/media/EYXRzDAUwAMjXIG?format=jpg&name=...](https://pbs.twimg.com/media/EYXRzDAUwAMjXIG?format=jpg&name=large)

(I'm particularly fond of
[https://i.imgur.com/ZMz2yUc.png](https://i.imgur.com/ZMz2yUc.png))

The data is freely available via API:
[https://www.tagpls.com/tags/imagenet2012validation.json](https://www.tagpls.com/tags/imagenet2012validation.json)

It exports the data in YOLO format (i.e. with coordinates normalised to YOLO's
[0..1] range), so it's straightforward to write it out to disk and start a
YOLO training run on it.
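
For anyone unfamiliar, YOLO labels are just one small text file per image, with one line per box: class index, then box centre and size, all normalised to [0..1]. A rough sketch of writing one out (Python; the input dict layout below is made up for illustration, not tagpls's actual schema):

    # Hypothetical boxes for one image; only the five normalised numbers per
    # line matter to YOLO-style training scripts.
    boxes = [
        {"cls": 0, "x_center": 0.50, "y_center": 0.40, "width": 0.20, "height": 0.30},
        {"cls": 2, "x_center": 0.75, "y_center": 0.60, "width": 0.10, "height": 0.15},
    ]

    # one label file per image: "<class> <x_center> <y_center> <width> <height>"
    with open("image_0001.txt", "w") as f:
        for b in boxes:
            f.write(f'{b["cls"]} {b["x_center"]} {b["y_center"]} '
                    f'{b["width"]} {b["height"]}\n')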

Gwern recently used tagpls to train an anime hand detector model:
[https://www.reddit.com/r/AnimeResearch/comments/gmcdkw/help_...](https://www.reddit.com/r/AnimeResearch/comments/gmcdkw/help_build_an_anime_hand_detector_by_tagging/)

People seem willing to tag things for free, mostly for the novelty of it.

The NSFW tags ended up being shockingly high quality, especially in certain
niches:
[https://twitter.com/theshawwn/status/1270624312769130498](https://twitter.com/theshawwn/status/1270624312769130498)

I don't think we could've paid human labelers to create tags that thorough or
accurate.

All the tags for all experiments can be grabbed via
[https://www.tagpls.com/tags.json](https://www.tagpls.com/tags.json), so over
time we hope the site will become more and more valuable to the ML community.

tagpls went from 50 users to 2,096 in the past three weeks. The database size
also went from 200KB a few weeks ago to 1MB a week ago and 2MB today. I don't
know why it's becoming popular, but it seems to be.

~~~
sillysaurusx
Well, that didn't take long – our API endpoint keeled over. Luckily, you can
fetch all the data directly from firebase:

    
    
      # fetch raw tag data
      $ curl -fsSL https://experiments-573d7.firebaseio.com/results/.json > tags.json
      $ du -hs tags.json
      14M tags.json
    
      # fetch tag metadata (colors, remapping label names, possibly other stuff in the future)
      $ curl -fsSL https://experiments-573d7.firebaseio.com/user_meta/.json > tags_meta.json
      $ du -hs tags_meta.json
      376K tags_meta.json
      $ jq . tags_meta.json
    

Note that's the raw unprocessed data (not in YOLO format). To get info about
all experiments, you can use this:

    
    
      curl -fsSL https://experiments-573d7.firebaseio.com/meta/.json | jq .
    

I'm a bit worried about the bill. It's up to $50 and rising:
[https://imgur.com/ZgmXsWU](https://imgur.com/ZgmXsWU) almost entirely egress
bandwidth. Be gentle with those `curl` statements. :)

(I _think_ that's due to a poor architectural decision on my part, which is
solvable, and not due to egress bandwidth via the API endpoint. But it's
always fun to see a J curve in your bill... It's about $1 a day right now.
[https://imgur.com/4gUTLO7](https://imgur.com/4gUTLO7))

~~~
foota
Can you set it up so that it's only available via cloud? I'm sure that would
bother people, but it's a better alternative than losing access or you going
broke :)

~~~
sillysaurusx
We're motivated to keep this as open as possible. I really like the idea of an
open dataset that continues to grow with time. If it keeps growing, then
within a couple years it should have a vast quantity of tags on a variety of
diverse datasets, which we hope might prove helpful.

If anyone wants to contribute, I started a patreon a few minutes ago:
[https://www.patreon.com/shawwn](https://www.patreon.com/shawwn)

------
jcims
Has anyone (beyond maybe self-driving software) tried using object tagging as
a way to start introducing physics into a scene? E.g. a human and a bicycle
sharing the same motion vector increases the likelihood that the human is
riding the bicycle. Bicycles and humans have size and weight ranges that could
be used to plot trajectories. Bicycles riding in a straight line and trees
both provide cues about the gravity vector in the scene. Etc. etc.

Seems like the camera motion is probably already solved with optical
flow/photogrammetry stuff, but you might be able to use that to help scale the
scene and start filtering your tagging based on geometric likelihood.

The idea of hierarchical reference frames (outlined a bit by Jeff Hawkins here
[https://www.youtube.com/watch?v=-EVqrDlAqYo&t=3025](https://www.youtube.com/watch?v=-EVqrDlAqYo&t=3025)
) seems pretty compelling to me for contextualizing scenes to gain
comprehension. Particularly if you build a graph from those reference frames
and situate models tuned to the type of object at the root of each frame
(vertex). You could use that to help each model learn, too. So if a bike model
projects a 'riding' edge towards the 'person' model, there wouldn't likely be
much learning. e.g. [Person]-(rides)->[Bike] would have likely been
encountered already.

However if the [Bike] projects the (rides) edge towards the [Capuchin] sitting
in the seat, the [Capuchin] model might learn that capuchins can (ride) and
furthermore they can (ride) a [Bike].

~~~
joshvm
RGB-D based semantic segmentation is certainly a thing. I'm sure it's been
done with video sequences as well.

~~~
jcims
Yeah, I wish the flagship phone manufacturers would put the hardware back into
phones to take 3D photos, even better if you could get point cloud data to go
with it. The applications right now are kind of cheesy, but they will get
better, and if the majority of photos taken pivot to including depth
information, I think it could really drive better capabilities from our phones.

Eyes are very hard to make and coordinate, yet there are almost no cyclops in
nature.

~~~
joshvm
In theory you could also do this with visual-inertial odometry, e.g. monocular
SLAM. It's definitely something we're looking at in my group (I do CV for
ecology), especially for object detection where geometry (absolute size) is a
good way to distinguish between two confusable classes. A good candidate here
is aerial imagery: if you've calibrated the camera and you know your altitude,
then you know your ground sample distance (m/px).
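
As a rough sketch of that calculation (the numbers are illustrative, not from any real survey): ground sample distance falls straight out of the sensor geometry and altitude, and an object's pixel extent then gives its absolute size.

    # Ground sample distance from camera intrinsics and altitude (nadir view).
    # All numbers are illustrative, not from a real survey.
    sensor_width_mm = 13.2    # sensor width
    focal_length_mm = 8.8     # lens focal length
    image_width_px = 5472     # image width in pixels
    altitude_m = 100.0        # height above ground

    # metres of ground covered by one pixel
    gsd_m_per_px = (sensor_width_mm * altitude_m) / (focal_length_mm * image_width_px)

    # absolute width of a detection that spans 80 pixels
    object_width_m = 80 * gsd_m_per_px
    print(f"GSD: {gsd_m_per_px:.3f} m/px, object width: {object_width_m:.2f} m")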

Most flagships can do this, though; any multi-camera phone can get some kind
of stereo. Google does it with the PDAF pixels for smart bokeh (they have some
nice blog posts about it). I don't know if there is a way to do that through
an API, though (or to obtain the depth map).

[https://ai.googleblog.com/2018/11/learning-to-predict-
depth-...](https://ai.googleblog.com/2018/11/learning-to-predict-depth-on-
pixel-3.html?m=1)

~~~
jcims
High resolution light field cameras would really help here as well. That seems
a ways off though.

Are you folks able to do any multi-spectral stuff? That seems interesting.

~~~
joshvm
I work mostly with RGB/Thermal, if that counts. My PhD was in stereo/lidar
fusion, so I've always been into mixing sensors :)

I've also done some work on satellite imaging, which is 13-band (Sentinel-2).
Lots of people in ecology use the Parrot Sequoia, which is four-band
multispectral. There really isn't much published ML work beyond RGB, which I
find interesting - yes, there's RGB-D and LIDAR, but it's mostly for driving
applications. Part of the reason I'm so familiar with the YOLO codebases is
that I've had to modify them a lot to work with non-standard data. There's
nothing that stops you from using n-channel images, but you will almost
certainly have to hack every off-the-shelf solution to make it work. RGB and
8-bit are almost always hard-coded, and augmentation also often fails with
non-RGB data (albumentations is good, though). A bigger issue is that there's
a massive lack of good labelled datasets for non-RGB imagery.
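
As one concrete illustration of the kind of hack involved (a PyTorch sketch, not taken from any particular repo): most off-the-shelf backbones hard-code a 3-channel first convolution, so adding an extra band means replacing that layer and deciding how to initialise the new channel's filters.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Stand-in backbone; a detector's first conv layer would be swapped the same way.
    model = models.resnet18()

    old_conv = model.conv1  # expects 3 input channels
    new_conv = nn.Conv2d(4, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size,
                         stride=old_conv.stride,
                         padding=old_conv.padding,
                         bias=old_conv.bias is not None)

    with torch.no_grad():
        # keep the existing RGB filters and initialise the extra channel
        # (e.g. NIR) with their mean, so RGB behaviour is roughly preserved
        new_conv.weight[:, :3] = old_conv.weight
        new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)

    model.conv1 = new_conv
    out = model(torch.randn(1, 4, 224, 224))  # now accepts 4-channel input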

On the plus side, in a landscape where everyone is fighting over COCO, there
is still a lot of low hanging fruit to pick I think.

I've not done any hyperspectral: (a) it's very hard to get labelled data
(there's AVIRIS and EO-1/Hyperion, maybe), (b) it's very hard to label, since
the images are enormous, and (c) the cameras are stupidly expensive.

By the way, even satellite imaging ML applications tend to overwhelmingly use
just the RGB channels and not the full extent of the data.

~~~
jcims
Whoa that's awesome! Love hearing contemporary technology used to
detect/diagnose/monitor the environment and our ecological impact. Boots on
ground will always be important but the horizontal scaling you can get out of
imaging I would imagine really helps prioritize where you turn your attention.
Thanks for the info and best of luck!

------
ely-s
There seems to be an unfair comparison between the various network
architectures. The reported speed and accuracy improvements should be taken
with a bit of scepticism for two reasons.

* This is the first YOLO implemented in PyTorch. PyTorch is a very fast ML framework, so some of YOLOv5's speed improvement may be attributable to the platform it was implemented on rather than to actual scientific advances. Previous YOLOs were implemented in Darknet, and EfficientDet is implemented in TensorFlow. A fair speed comparison would require training them all on the same platform.

* EfficientDet was trained on the 90-class COCO challenge [1], while YOLOv5 was trained on 80 classes [2].

[1]
[https://github.com/google/automl/blob/master/efficientdet/in...](https://github.com/google/automl/blob/master/efficientdet/inference.py#L42)

[2]
[https://github.com/ultralytics/yolov5/blob/master/data/coco....](https://github.com/ultralytics/yolov5/blob/master/data/coco.yaml)

~~~
rocauc
Great points - we're hoping Glenn releases a paper to accompany the
performance numbers. We are also planning more rigorous benchmarking in the
meantime.

re: PyTorch being a confounding factor for speed - we converted YOLOv4 to
PyTorch to achieve 50 FPS. An out-of-the-box Darknet build would likely top
out around 10 FPS on the same hardware.

EDIT: Alexey, author of YOLOv4, provided benchmarks of YOLOv4 hitting much
higher FPS here:
[https://github.com/AlexeyAB/darknet/issues/5920#issuecomment...](https://github.com/AlexeyAB/darknet/issues/5920#issuecomment-642213028)

------
rocauc
EfficientDet was open sourced March 18 [1], YOLOv4 came out April 23 [2], and
now YOLOv5 is out only 48 days later.

In our initial look, YOLOv5 is 180% faster than YOLOv4, 88% smaller, similarly
accurate, and easier to use (native PyTorch rather than Darknet).

[1] [https://venturebeat.com/2020/03/18/google-ai-open-sources-efficientdet-for-state-of-the-art-object-detection/](https://venturebeat.com/2020/03/18/google-ai-open-sources-efficientdet-for-state-of-the-art-object-detection/)

[2] [https://arxiv.org/abs/2004.10934](https://arxiv.org/abs/2004.10934)

~~~
eeZah7Ux
> open sourced

This is not a verb.

~~~
dang
It's behaving like one:

[https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...](https://hn.algolia.com/?dateRange=all&page=0&prefix=true&query=open%20sources&sort=byDate&type=story)

[https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...](https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=open%20sourced&sort=byDate&type=story&storyText=none)

------
ma2rten
_In February 2020, PJ Reddie noted he would discontinue research in computer
vision._

He actually stopped working on it because of ethical concerns. I'm inspired
that he made this principled choice despite being quite successful in this
field.

[https://syncedreview.com/2020/02/24/yolo-creator-says-he-
sto...](https://syncedreview.com/2020/02/24/yolo-creator-says-he-stopped-cv-
research-due-to-ethical-concerns/)

------
gok
Er so this "Ultralytics" consulting firm just borrowed the name YOLO for this
model and didn't actually publish their results yet?

~~~
yeldarb
Yeah, they made the most popular PyTorch implementation of YOLOv3 as well, so
they're not exactly coming out of nowhere.
[https://github.com/ultralytics/yolov3](https://github.com/ultralytics/yolov3)

The author of YOLOv3 quit working on Computer Vision due to ethical concerns.
YOLOv4, which built on his work in v3, was released by different authors last
month. I'd expect more YOLOvX's from different authors in the future.
[https://twitter.com/pjreddie/status/1230524770350817280](https://twitter.com/pjreddie/status/1230524770350817280)

------
david_draco
> In February 2020, PJ Reddie noted he would discontinue research in computer
> vision.

It would be fair to state also why he chose to discontinue developing YOLO, as
it is relevant.

------
rememberlenny
Two interesting links from the article:

1. How to train YOLOv5: [https://blog.roboflow.ai/how-to-train-yolov5-on-a-custom-dataset/](https://blog.roboflow.ai/how-to-train-yolov5-on-a-custom-dataset/)

2. Comparing various YOLO versions: [https://yolov5.com/](https://yolov5.com/)

------
boscon
Latency is measured at batch=32 and then divided by 32? That means a full
batch actually takes about 500 milliseconds to process, so a single image
doesn't come back in anything like the quoted per-image time. I have never
seen a more fake comparison.

------
bcatanzaro
Why benchmark using 32-bit FP on a V100? That means it’s not using tensor
cores, which is a shame since they were built for this purpose. There’s no
reason not to benchmark using FP16 here.

~~~
joshvm
Not sure about the benchmark, but the code includes the option for mixed
precision training via Apex/AMP.

~~~
bcatanzaro
If you click around enough you'll see they benchmarked in 32-bit FP. Glad they
have a mixed precision training option, but I really think it's a mistake in
2020 to do work related to efficient inference using 32-bit FP.

The problem is that your conclusions aren't independent of this choice. A
different network might be far better in terms of accuracy/speed tradeoffs
when evaluated at a lower precision. But there is no reason to use 32-bit
precision for inference, so this is just a big mistake.

------
hnarayanan
What does it take to now use this name?

~~~
newen
Yeah, it's pretty unethical. Looks like they just stole the name without any
care. There doesn't seem to be any relationship between these guys and the
original YOLO group.

------
darknet-rider
I really like the work done by AlexeyAB on Darknet YOLOv4 and by the original
author, Joseph Redmon, on YOLOv3. They deserve far more respect than the
authors of any other version of YOLO.

------
0xcoffee
Is it possible to run these models in the browser, with something similar to
TensorFlow.js?

~~~
m00dy
I would try converting it to an ONNX model and then running inference with
TensorFlow.js.
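
For what it's worth, the PyTorch side of that pipeline is just torch.onnx.export; here's a minimal sketch (the tiny stand-in model is a placeholder, not the actual YOLOv5 export path):

    import torch
    import torch.nn as nn

    # Stand-in for a loaded detection model in eval() mode.
    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()
    dummy = torch.randn(1, 3, 640, 640)  # one RGB image at the network's input size

    torch.onnx.export(model, dummy, "model.onnx",
                      opset_version=11,
                      input_names=["images"],
                      output_names=["output"])

Whether the resulting ONNX graph then converts cleanly to TensorFlow.js is a separate question; the detection post-processing (NMS and the like) tends to be the fiddly part.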

------
ebg13
It looks like this is YOLOv4 implemented in PyTorch, not actually a new YOLO?

~~~
bArray
YOLO is a neural network architecture; Darknet is the framework. Without
running both YOLOv4 and "YOLOv5" on the same framework, it's nearly impossible
to make any kind of meaningful comparison.

------
franciscop
I am very interested in loading YOLO onto a Raspberry Pi + Coral.ai - does
anyone know a good tutorial on how to get started? I tried before, and with
Darknet it was not easy at all, but now with PyTorch there seem to be ways of
getting a model onto the Coral. I am familiar with Raspberry Pi dev, but not
much with ML or TPUs, so I think what I need is mostly a tutorial on bridging
the different technologies.

(It might need to wait a couple of months since this was just released.)

------
kuzee
Just read this. Nice overview of the history of the "YOLO" family, and summary
of what YOLOv5 is/does.

------
DEDLINE
Does anyone know of an open-source equivalent to YOLOv5 in the sound
recognition / classification domain? Paid?

~~~
fattire
Like it would identify what you're hearing? "Trumpet!" "Wind whistling through
oak leaves!" "Male child!" etc?

~~~
DEDLINE
Ubicoustics [1] would be the closest example to what I am looking for in a
FOSS / Commercial offering. Is anyone working on this?

[1]
[https://github.com/FIGLAB/ubicoustics](https://github.com/FIGLAB/ubicoustics)

------
osipov
Just recently IBM announced with a loud PR move that the company is getting
out of the face recognition business. Guess what? Wall Street doesn't want to
keep subsidizing IBM's subpar face recognition technology when open source and
Google solutions are pushing the state of the art.

~~~
ahelwer
Not something to brag about. Facial recognition has very few applications
outside of total surveillance. We should not respect those who lend it their
time and effort.

~~~
jcims
>Facial recognition has very few applications outside of total surveillance.

That's not really for you to decide, is it? You're absolutely free to have
that opinion of course.

>We should not respect those who lend it their time and effort.

Also your choice of course. Facial recognition is essentially a light
integration of powerful underlying technologies. Should 'we' ostracize those
working on machine learning, computer vision, network and distributed
computing, etc?

~~~
TehCorwiz
You didn't really address the author's point, which was that there don't
appear to be compelling uses of facial recognition technology beyond mass
automated surveillance.

I can't think of other uses and I'd be interested if you can come up with
some.

~~~
homarp
Some compelling use:

1) assistance to recognizing people (because low vision, because memory fails,
because you have a lot of photos...)

2) ensuring candidate X is actually candidate X and not someone paid to take
the exam in candidate X's name

3) door access control (to replace/in addition to access card)

4) having your own X-Ray (like in Amazon Prime): identify an
actor/actress/model

5) having your personal robot addressing you by name

~~~
homarp
also, having models that can run locally on "cheap enough" devices is also
quite interesting that you more or less understand.

Compared to using API in the cloud or purchasing Hikvision cameras.

~~~
tropshop
Just getting into this. Do you recommend any particular "dumb" camera devices
with a quality stream?

~~~
homarp
what's your price range? indoor or outdoor? where do you want to do the
inference?

------
heisenburgzero
This is not the first time something has seemed fishy. Back in the early
stages of the repo, they were advertising on the front page that they achieved
similar mAP to the original Darknet version, only for it to turn out that they
hadn't trained and tested it on the COCO dataset.
------
qchris
If anyone's interested in the direct GitHub link to the repository:
[https://github.com/ultralytics/yolov5](https://github.com/ultralytics/yolov5)

~~~
travisporter
Hm, on this page there is something written in an East Asian language under
YOLO, [https://github.com/ultralytics](https://github.com/ultralytics) says
Madrid, Spain, but then they say "Ultralytics is a U.S.-based particle physics
and AI startup"

~~~
hikarudo
"You only look once" in Chinese.

------
heavyset_go
I like to think that the name is also a reference to the fact that this will
inevitably be used in some autonomous driving systems.

------
tapatio
Fewer weights, more accuracy. Magic :)

