YOLOv5: State-of-the-art object detection at 140 FPS (roboflow.ai)
391 points by rocauc on June 10, 2020 | 131 comments

I'm just going to call this out as bullshit. This isn't YOLOv5. I doubt they even did a proper comparison between their model and YOLOv4.

Someone asked that it not be called YOLOv5, and their response was just awful [1]. They also blew off a request to publish a blog post or paper detailing the network [2].

I filed a ticket to get to the bottom of this with the creators of YOLOv4: https://github.com/AlexeyAB/darknet/issues/5920

[1] https://github.com/ultralytics/yolov5/issues/2

[2] https://github.com/ultralytics/yolov5/issues/4

Hey all - OP here. We're not affiliated with Ultralytics or the other researchers. We're a startup that enables developers to use computer vision without being machine learning experts, and we support a wide array of open source model architectures for teams to try on their data: https://models.roboflow.ai

Beyond that, we're just fans. We're amazed by how quickly the field is moving and we did some benchmarks that we thought other people might find as exciting as we did. I don't want to take a side in the naming controversy. Our core focus is helping developers get data into any model, regardless of its name!

YOLOv5 seems to have one important advantage over v4, which your post helped highlight:

Fourth, YOLOv5 is small. Specifically, a weights file for YOLOv5 is 27 megabytes. Our weights file for YOLOv4 (with Darknet architecture) is 244 megabytes. YOLOv5 is nearly 90 percent smaller than YOLOv4. This means YOLOv5 can be deployed to embedded devices much more easily.

Naming controversy aside, it's nice to have some model that can get close to the same accuracy at 10% of the size.

Naming it v5 was certainly ... bold ... though. If it can't outperform v4 in any scenario, is it really worthy of the name? (On the other hand, if v5 can beat v4 in inference time or accuracy, that should be highlighted somewhere.)

FWIW I doubt anyone who looks into this will think roboflow had anything to do with the current controversies. You just showed off what someone else made, which is both legit and helpful. It's not like you were the ones that named it v5.

On the other hand... visiting https://models.roboflow.ai/ does show YOLOv5 as "current SOTA", with some impressive-sounding results:

SIZE: YOLOv5 is about 88% smaller than YOLOv4 (27 MB vs 244 MB)

SPEED: YOLOv5 is about 180% faster than YOLOv4 (140 FPS vs 50 FPS)

ACCURACY: YOLOv5 is roughly as accurate as YOLOv4 on the same task (0.895 mAP vs 0.892 mAP)

Then it links to https://blog.roboflow.ai/yolov5-is-here/ but there doesn't seem to be any clear chart showing "here's v5 performance vs v4 performance under these conditions: x, y, z"

Out of curiosity, where did the "180% faster" and 0.895 mAP vs 0.892 mAP numbers come from? Is there some way to reproduce those measurements?

The benchmarks at https://github.com/WongKinYiu/CrossStagePartialNetworks/issu... seem to show different results, with v4 coming out ahead in both accuracy and speed at 736x736 res. I'm not sure if they're using a standard benchmarking script though.

Thanks for gathering together what's currently known. The field does move fast.


Crucially, we're tracking "out of the box" performance, i.e., if a developer grabbed model X and used it on a sample task, how could they expect it to perform? Further research and evaluation is recommended!

For size, we measured the sizes of our saved weights files for Darknet YOLOv4 versus the PyTorch YOLOv5 implementation.

For inference speed, we checked "out of the box" speed using a Colab notebook equipped with a Tesla P100. We used the same task[1] for both - e.g. see the YOLOv5 Colab notebook[2]. For Darknet YOLOv4 inference speed, we translated the Darknet weights using the Ultralytics YOLOv3 repo (as we've seen many do for deployments)[3]. (To achieve top YOLOv4 inference speed, one should rebuild Darknet with OpenCV, CUDA, and cuDNN, and carefully monitor batch size.)
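For anyone wanting to reproduce this kind of number, here is a minimal sketch of how one might time raw forward-pass FPS in PyTorch (the helper name is mine; it omits pre/post-processing, so it will read optimistic compared to an end-to-end pipeline):

```python
import time
import torch

def measure_fps(model, input_size=640, n_warmup=10, n_iters=100, device="cuda"):
    """Rough 'out of the box' FPS: average single-image forward-pass latency."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, input_size, input_size, device=device)
    # CUDA kernels launch asynchronously; synchronize around the timed region.
    sync = torch.cuda.synchronize if device == "cuda" else (lambda: None)
    with torch.no_grad():
        for _ in range(n_warmup):  # warm up kernels/caches before timing
            model(x)
        sync()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        sync()
    return n_iters / (time.perf_counter() - start)
```

Batch size 1 mirrors the single-image deployment case; batching would inflate throughput numbers considerably.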

For accuracy, we evaluated the task above with mAP after quick training (100 epochs), pitting the smallest YOLOv5s model against the full YOLOv4 model (using the recommended 2000*n iterations, where n is the number of classes). Our example is a small custom dataset, so results should also be investigated on, e.g., the 90-class COCO benchmark.
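For context, the "2000*n" figure maps onto Darknet's `max_batches` setting in the `.cfg` file; a trivial sketch of the rule of thumb (the helper name is mine):

```python
def darknet_max_batches(num_classes: int) -> int:
    """The 'recommended 2000*n' rule of thumb: train for 2000
    Darknet iterations per class (set via max_batches in the .cfg)."""
    return 2000 * num_classes

# e.g. the 3-class blood cell (BCCD) dataset used above:
print(darknet_max_batches(3))  # 6000
```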

[1] https://public.roboflow.ai/object-detection/bccd [2] https://colab.research.google.com/drive/1gDZ2xcTOgR39tGGs-EZ... [3] https://github.com/ultralytics/yolov3

This is why I have so much doubt. To claim it's better in any meaningful way, you need to show it on the same framework, with varied datasets and varied input sizes, and users should be able to apply it to their own detection problems and see some benefit over the previous version.

> SIZE: YOLOv5 is about 88% smaller than YOLOv4 (27 MB vs 244 MB)

Is that a benefit of Darknet vs TF, YOLOv4 vs YOLOv5, or did you win the NN lottery [1]?

> SPEED: YOLOv5 is about 180% faster than YOLOv4 (140 FPS vs 50 FPS)

Again, where does this improvement come from?

> ACCURACY: YOLOv5 is roughly as accurate as YOLOv4 on the same task (0.895 mAP vs 0.892 mAP)

A 0.1% difference in accuracy can be huge; for example, closing the gap between 99.9% and 100% could require an insanely larger neural network. Even well below 99% accuracy, it seems clear to me that network size can still impose limits on accuracy.

For example, if you really don't care so much for accuracy, you can really squeeze the network down [2].

[1] https://ai.facebook.com/blog/understanding-the-generalizatio...

[2] https://arxiv.org/abs/1910.03159

It's about time for Roboflow to pull this article. It seems highly unlikely that a 90% smaller model would provide similar accuracy, and the result seems to come from a small custom dataset only. Please make a real COCO comparison instead.

The YoloV5 repo itself shows performance comparable to YoloV3: https://github.com/ultralytics/yolov5#pretrained-checkpoints

Another comparison suggests YoloV5 is slightly WORSE than YoloV4: https://github.com/WongKinYiu/CrossStagePartialNetworks/issu...

> It's about time for Roboflow to pull this article.

The article still adds value by suggesting how one would run the network and in general the site seems to be about collating different networks.

Perhaps a disclaimer could be good, reading something like: "the speed improvements mentioned in this article are currently being tested". As a publisher, when you print somebody else's words, unless quoted, they are said with your authority. The claims are very big and it doesn't feel like enough testing has been done yet to even verify that they hold true.

Very cool business model! How long have you been at it? I've been pushing for a while (unsuccessfully, so far) for the NIH to cultivate a team providing such a service to our many biomedical imaging labs. It seems pretty clear to me that this sort of AI hub model is going to win out in at least the medium term versus spending money on lots of small redundant AI teams each dedicated to a single project. What sort of application sectors have you found success with?

Appreciate it!

Nice, I really respect research coming out of NIH. (Happen to know Travis Hoppe?) Coincidentally, our notebook demo for YOLOv5 is on the blood cell count and detection dataset: https://public.roboflow.ai/object-detection/bccd

We've seen 1000+ different use cases. Some of the most popular are in agriculture (weeds vs crops), industrials / production (quality assurance), and OCR.

Send me an email? joseph at roboflow.ai

Do you know of any battery-powered drones that can pick out invasive plants? I've been looking for this to use on trails, but since the plant's sap is highly poisonous, drones seem to be the logical solution.

> We're not affiliated with Ultralytics or the other researchers.

Unfortunately I am now unable to edit to reflect this better.

I somewhat agree on the naming issue. I don't think yolov5 is semantically very informative. But by the way, if you read the issues from a while back you'll see that AlexeyAB's fork basically scooped them, hence the version bump. Ultralytics probably would have called this Yolov4 otherwise. This repo has been in the works for a while.

For history, Ultralytics originally forked the core code from some other Pytorch implementation which was inference-only. Their claim to fame is that they were the first to get training to work in Pytorch. This took a while, probably because there is actually very little documentation for Yolov3 and there was confusion over what the loss function actually ought to be. The darknet repo is totally uncommented C with lots of single letter variable names. AlexeyAB is a Saint.

That said, should it be a totally new name? The changes are indeed relatively minor in terms of architecture, it's still yolo underneath (in fact I think the classification/regression head is pretty much unchanged). The v4 release was also quite contentious. Actually their previous models used to be called yolov3-spp-ultralytics.

Probably I would have gone with efficient-yolo or something similar. That's no worse than fast/faster rcnn.

I disagree on your second point though. Demanding a paper when the author says "we will later" is hardly a blow off. Publishing and writing takes time. The code is open source, the implementation is there. How many times does it happen the other way around? And before we knock Glenn for this, as far as I know, he's running a business, not a research group.

Disclosure: I've contributed (in minor ways) to both this repository and Alexey's darknet fork. I use both regularly for work and I would say I'm familiar enough with both codebases. I mostly ignore the benchmarks because performance on coco is meaningless for performance on custom data. I'm not affiliated with either group, in case it's not clear.

> But by the way, if you read the issues from a while back you'll see that AlexeyAB's fork basically scooped them, hence the version bump.

Yeah that sucks, but it does mean they should have done some proper comparison with YOLOv4.

> This took a while, probably because there is actually very little documentation for Yolov3 and there was confusion over what the loss function actually ought to be. The darknet repo is totally uncommented C with lots of single letter variable names. AlexeyAB is a Saint.

Maybe I'm alone, but I found it quite readable. You can quite reasonably understand the source in a day.

> The v4 release was also quite contentious.

Kind of, I am personally still evaluating this network fully.

> I disagree on your second point though. Demanding a paper when the author says "we will later" is hardly a blow off.

Check out the translation of "you can you up, no can no bb" (see other comments).

> And before we knock Glenn for this, as far as I know, he's running a business, not a research group.

I understand, but it seems very unethical to take the name of an open source framework and network that publishes its improvements in some form, bump the version number, and then claim it's faster without actually doing an apples-to-apples test. It would have seemed appropriate to contact the person who carried the torch after pjreddie stepped down from the project.

On the whole I agree about darknet being readable, it seemed well written and I've found it useful to grok how training libraries are written. I think they've moved to other backends now for the main computation though.

But... it was still very much undocumented (and there were details missing from the paper). I think this almost certainly led to some slowdown in porting to other frameworks. And the fact it's written in C has probably limited how much people are willing to contribute to the project.

> Check out the translation of "you can you up, no can no bb" (see other comments).

That's from an 11 day old github account with no history, not Ultralytics as far as I know.

> Kind of, I am personally still evaluating this network fully.

Contention referring to the community response rather than the performance of the model itself.

> Check out the translation of "you can you up, no can no bb" (see other comments).

Who actually is "WDNMD0-0"? Looks like the account was created to make just that one comment.

Didn't AlexeyAB endorse YOLOv4 though? Did he also endorse YOLOv5?

AlexeyAB is the primary author on YOLOv4, and the darknet maintainer, so yes. This is pretty much the official word on the matter:


Despite that, there was still a lot of controversy over the decision to call it v4.

See that thread for the discussion on v5 and you can make your own judgement.

Ah, I misspoke. I meant pjreddie. pjreddie kind of endorsed YOLOv4. Did he endorse YOLOv5?

Although YOLOv4 isn't anything new architecture-wise, it tried all the tricks in the book on the existing YOLO architecture to increase its speed and performance, and its method and experimental results were published as a paper; it provided value to humanity.

YOLOv5 seems to have taken the YOLO name mainly to boost the startup's name value without giving much back (they did provide a YOLOv3 PyTorch implementation, but that was before taking the YOLOv5 name). I wonder what pjreddie would think of YOLOv5.

> Someone asked it to not be called YOLOv5 and their response was just awful [1]

I don't see any response by them at all. Do you mean the comment by WDNMD0-0? I can't see any reason to believe they're connected to the company, have I missed something?

There are some benchmarks here: https://github.com/WongKinYiu/CrossStagePartialNetworks/issu...

It's hard to interpret benchmarks in a fair way, but it's sort of sounding like YOLOv4 might be superior to YOLOv5, at least for certain resolutions.

Does YOLOv5 outperform YOLOv4 at all? Faster inference time or higher accuracy?

I love that the response to them is "you can you up,no can no bb"

Learned a new phrase today.

你行你上啊 不行别bb ("if you can do it, go do it; if you can't, don't bb" - bb = trash-talking/unfavorable comments)

This is literally trash talking Slang in Chinese, because this field is full of young bloated researchers who forget their last name

> who forget their last name

I've not heard that one before either. Is it a reference to the Dark Tower? ("[he] has forgotten the face of his father") or did Stephen King borrow it from somewhere else?

This is an old punchline in China for many years and I doubt it comes from English literature. I guess the meaning is similar (last name ~= name of the father)

Edit: obviously I should have googled Dark Tower first lol.

Also a slight edit, I wrote name initially. Of course in the books it's "face of his father", but it still sounds similar [1]. To admit to forgetting the face of one's father is to be deeply shameful, to accuse someone of it is insinuating they should be ashamed of themselves.

Can you write it in Chinese?

[1] https://www.goodreads.com/quotes/12991-i-do-not-aim-with-my-...


Can you explain it? I can't figure out what that means.

Apparently it is Chinese internet slang meaning:

"If you can do it, then you go and do it. If you can’t do it, then don’t criticise others."

via: http://www.chinesetimeschool.com/zh-cn/articles/chinese-inte...

Just found these.[1][2] That is pretty awful, if it's from a dev.

Edit: Although as yeldarb explains in a comment here[3], it's probably a bit more complicated than that.

1: https://www.urbandictionary.com/define.php?term=you%20can%20...

2: https://www.quora.com/Whats-the-meaning-of-you-can-you-up-no...

3: https://news.ycombinator.com/item?id=23478983

> Edit: Although as yeldarb explains in a comment here[3], it's probably a bit more complicated than that.

Legally speaking I'm not sure anything wrong was really done here.

Morally speaking, it seems quite unethical. AlexeyAB has really been carrying the torch of the Darknet framework and the YOLO neural network for quite some time (with pjreddie effectively handing it over to him).

AlexeyAB has been providing support on pjreddie's abandoned repository (e.g. [1]) and actively working on improvements in a fork [2]. If you look at the contributors graphs, he really has been keeping the project alive [3] (vs Darknet by pjreddie [4]).

Probably the worst part, in my opinion, is that they have also seemingly bypassed the open source nature of the project. This is quite damning.

[1] https://github.com/pjreddie/darknet/issues/1900

[2] https://github.com/AlexeyAB/darknet

[3] https://github.com/AlexeyAB/darknet/graphs/contributors

[4] https://github.com/pjreddie/darknet/graphs/contributors

So, the question I have is whether AlexeyAB got some sort of endorsement from pjreddie, or if they just took over the name by nature of being the most active fork? If it's the latter, ultralytics' actions don't seem quite as bad (although they still feel kind of off-putting, especially with how some of the responses to calls for a name change were formulated).

I guess given the info I have now, to me it boils down to whether there's precedent for the next version of the name to be taken by whoever is doing the work? If the original author never endorsed AlexeyAB (I don't know one way or another), then perhaps AlexeyAB should have changed the name but referenced or paid homage to YOLO in some way?

Eh, this is all starting to feel a bit too close to youtube drama for my liking.

AlexeyAB seems to have gotten endorsement from pjreddie: https://github.com/AlexeyAB/darknet/issues/5920#issuecomment...

Looks like ultralytics, not roboflow, is the one that named this model v5. Different people/companies.

Yep, I updated my GitHub comment with respect to what @josephofiowa said. I made an assumption when seeing the same PR images/language being used.

I welcome forward progress in the field, but something about this doesn't sit right with me. The authors have an unpublished/unreviewed set of results and they're already co-opting the YOLO name (without the original author) for it and all of this to promote a company? I guess this was inevitable when there's so much money in ML but it definitely feels against the spirit of the academic research community that they're building upon.

Well, very unlikely to get the original author. He doesn't do that kind of thing anymore


Totally agreed, it kinda seems dirty to call something "v5" when this is a derivative work of the original.

I think derivative is a bit generous. This is just a reimplementation of v4 with a different framework.

> there's so much money in ML

What do you mean? I thought the DL hypetrain was dying as companies failed to make returns on their investments.

We made a site that lets you collaboratively tag a bunch of images, called tagpls.com. For example, users decided to re-tag imagenet for fun: https://twitter.com/theshawwn/status/1262535747975868418

And the tags ended up being hilarious: https://pbs.twimg.com/media/EYXRzDAUwAMjXIG?format=jpg&name=...

(I'm particularly fond of https://i.imgur.com/ZMz2yUc.png)

The data is freely available via API: https://www.tagpls.com/tags/imagenet2012validation.json

It exports the data in yolo format (e.g. it has coordinates in yolo's [0..1] range), so it's straightforward to spit it out to disk and start a yolo training run on it.
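For anyone unfamiliar with the format: each YOLO label is `class x_center y_center width height` with coordinates normalized to [0..1]. A minimal sketch of converting one back to pixel coordinates (helper name is mine):

```python
def yolo_to_pixels(label, img_w, img_h):
    """Convert one YOLO label (class, x_center, y_center, w, h in [0..1])
    to an absolute (class, x_min, y_min, x_max, y_max) pixel box."""
    cls, xc, yc, w, h = label
    return (
        cls,
        int((xc - w / 2) * img_w),
        int((yc - h / 2) * img_h),
        int((xc + w / 2) * img_w),
        int((yc + h / 2) * img_h),
    )

# A centered box covering half the image, in a 640x480 frame:
print(yolo_to_pixels((0, 0.5, 0.5, 0.5, 0.5), 640, 480))
# -> (0, 160, 120, 480, 360)
```

The normalized representation is what makes the tags resolution-independent, so the same label file works at any training input size.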

Gwern recently used tagpls to train an anime hand detector model: https://www.reddit.com/r/AnimeResearch/comments/gmcdkw/help_...

People seem willing to tag things for free, mostly for the novelty of it.

The NSFW tags ended up being shockingly high quality, especially in certain niches: https://twitter.com/theshawwn/status/1270624312769130498

I don't think we could've paid human labelers to create tags that thorough or accurate.

All the tags for all experiments can be grabbed via https://www.tagpls.com/tags.json, so over time we hope the site will become more and more valuable to the ML community.

tagpls went from 50 users to 2,096 in the past three weeks. The database size also went from 200KB a few weeks ago to 1MB a week ago and 2MB today. I don't know why it's becoming popular, but it seems to be.

Well, that didn't take long – our API endpoint keeled over. Luckily, you can fetch all the data directly from firebase:

  # fetch raw tag data
  $ curl -fsSL https://experiments-573d7.firebaseio.com/results/.json > tags.json
  $ du -hs tags.json
  14M tags.json

  # fetch tag metadata (colors, remapping label names, possibly other stuff in the future)
  $ curl -fsSL https://experiments-573d7.firebaseio.com/user_meta/.json > tags_meta.json
  $ du -hs tags_meta.json
  376K tags_meta.json
  $ jq . tags_meta.json
Note that's the raw unprocessed data (no yolo). To get info about all experiments, you can use this:

  curl -fsSL https://experiments-573d7.firebaseio.com/meta/.json | jq
I'm a bit worried about the bill. It's up to $50 and rising: https://imgur.com/ZgmXsWU almost entirely egress bandwidth. Be gentle with those `curl` statements. :)

(I think that's due to a poor architectural decision on my part, which is solvable, and not due to egress bandwidth via the API endpoint. But it's always fun to see a J curve in your bill... It's about $1 a day right now. https://imgur.com/4gUTLO7)

Can you set it up so that it's only available via cloud? I'm sure that would bother people, but is a better alternative to losing access or you going broke :)

We're motivated to keep this as open as possible. I really like the idea of an open dataset that continues to grow with time. If it keeps growing, then within a couple years it should have a vast quantity of tags on a variety of diverse datasets, which we hope might prove helpful.

If anyone wants to contribute, I started a patreon a few minutes ago: https://www.patreon.com/shawwn

Heh, $1 per day? Try $1k per day. https://imgur.com/duugqHK

We've confirmed that this was someone running `while true; do curl ...; done`, resulting in a $3,700 bill. https://twitter.com/theshawwn/status/1271365062913961984

I guess "be gentle" means "please troll us."

Perhaps you can mirror it to an s3 bucket or GH repo for people to CURL more easily?

I remember following this as it came out (and learning windshield wipers should be called "swipey bois")

Surprised and happy to hear you're seeing high labeling quality.

We'll re-host with credit on https://public.roboflow.ai What license is this?

Thanks! We've decided to license the data as CC-0. We'll add that to the footer.

We don't host any images directly – we merely serve a list of URLs (e.g. https://battle.shawwn.com/tfdne.txt). But any data served via the API endpoints is CC-0.

I need a dataset and tags for hair, face, neck, arms, left breast, right breast, nipple, torso. Any tips? I'm training a GAN, but I need to specifically segment the parts, as I don't want nipples in the middle of a face. I don't want to have to manually annotate 1,000 images

At the moment, the only experiments with enough data to be useful are e621-portraits (5,407 tags https://www.tagpls.com/exp?n=e621-portraits) and danbooru-e (344 tags https://www.tagpls.com/exp?n=danbooru2019-e) both of which are NSFW.

Those are also drawings/anime, not photos. We have an /r/pics experiment (SFW, 99 tags https://www.tagpls.com/exp?n=r-pics) and /r/gonewild (NSFW, 57 tags https://www.tagpls.com/exp?n=r-gonewild) but currently I haven't gathered enough urls to be very useful -- it only scrapes about 100 or so images every half hour. So there is a lack of tags right now on human photos. We also have a pps experiment (NSFW, exactly what you think it is, 306 tags https://www.tagpls.com/exp?n=pps) but I assume that's not quite what you were looking for.

If you have an idea for a dataset, you can create a list of image URLs like https://battle.shawwn.com/r/pics.txt and we can add them to the site. You can request an addition by joining our ML discord (https://discordapp.com/invite/x52Xz3y) and posting in the #tagging channel.

Also, if anyone's curious, here's how I'm measuring the tag count:

  $ curl -fsSL https://experiments-573d7.firebaseio.com/results/danbooru2019-e/.json | jq '.' | grep points | wc -l
  $ curl -fsSL https://experiments-573d7.firebaseio.com/results/e621-portraits/.json | jq '.' | grep points | wc -l
  $ curl -fsSL https://experiments-573d7.firebaseio.com/results/r-gonewild/.json | jq '.' | grep points | wc -l
  $ curl -fsSL https://experiments-573d7.firebaseio.com/results/r-pics/.json | jq '.' | grep points | wc -l
  $ curl -fsSL https://experiments-573d7.firebaseio.com/results/pps/.json | jq '.' | grep points | wc -l

I love that it's porn (and specifically furry/hentai) which pushes the limits of image recognition and creativity within computer vision. Between this and the de-censoring tool "DeepCreamPy", I can't look most data scientists in the face anymore.

that's a great name, turning jagged edges back to smooth and applying reverse Gaussian blur /s

on a serious note, kind of interesting the authenticity/accuracy if it's just filled in... eg. turning black and white pictures back to color eg. was it actually green or blue

Yeah, I mean, the tagging is awesome, but I'm thinking I'll need more image segmentation than object recognition. With a segmentation map, I can make a great image->image translator.

This is really cool, thanks for sharing

> I don't want nipples in the middle of a face

There is a market somewhere

You should post this as a Show HN!

Okay! Thank you. We appreciate the encouragement.

It looks like an HN user on an EC2 server decided to fetch data from our firebase as quickly as possible, running up a $3,700 bill. Once (or if) that's sorted out, and once we verify tagpls can handle HN's load without charging thousands of dollars, we'll add an "about" page to tagpls and submit it.

How does this have anything to do with the linked article?

The idea with the site is that you can tag your own datasets, and then get the data suitable for yolo training. We've done that ourselves to train an anime hand detector, and other users have reported similar successes. I could've been a bit clearer about that.

Has anyone (beyond maybe self-driving software) tried using object tagging as a way to start introducing physics into a scene? E.g. human and bicycle have same motion vector, increases likelihood that human is riding bicycle. Bicycle and human have size and weight ranges that could be used to plot trajectory. Bicycles riding in a straight line and trees both provide some cues as to the gravity vector in the scene. Etc. etc.

Seems like the camera motion is probably already solved with optical flow/photogrammetry stuff, but you might be able to use that to help scale the scene and start filtering your tagging based on geometric likelihood.

The idea of hierarchical reference frames (outlined a bit by Jeff Hawkins here https://www.youtube.com/watch?v=-EVqrDlAqYo&t=3025 ) seems pretty compelling to me for contextualizing scenes to gain comprehension. Particularly if you build a graph from those reference frames and situate models tuned to the type of object at the root of each frame (vertex). You could use that to help each model learn, too. So if a bike model projects a 'riding' edge towards the 'person' model, there wouldn't likely be much learning, e.g. [Person]-(rides)->[Bike] would have likely been encountered already.

However if the [Bike] projects the (rides) edge towards the [Capuchin] sitting in the seat, the [Capuchin] model might learn that capuchins can (ride) and furthermore they can (ride) a [Bike].

I've been wondering these same thoughts for years. I don't do much work in the neural network subfield, but have done a lot with computer vision, and always found myself wanting more robust physical estimation techniques that didn't require external data.

RGB-D based semantic segmentation is certainly a thing. I'm sure it's also been done with video sequences as well.

Yeah, I wish the flagship phone manufacturers would put the hardware back into the phone to take 3D photos - even better if you can get point cloud data to go with it. The applications right now are kind of cheesy, but they will get better, and if the majority of photos taken pivot to including depth information, I think it could really drive better capabilities from our phones.

Eyes are very hard to make and coordinate, yet there are almost no cyclops in nature.

In theory you could also do this with visual-inertial odometry eg monocular SLAM. But this is definitely something we're looking at in my group (I do CV for ecology), especially for object detection where geometry (absolute size) is a good way to distinguish between two confusing classes. A good candidate here is aerial imagery. If you've calibrated the camera and you know your altitude, then you know your ground sample distance (m/px).
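The altitude-to-scale relation is just the standard pinhole approximation; a rough sketch (the numbers and helper name below are made up for illustration):

```python
def ground_sample_distance(altitude_m, sensor_width_mm, focal_length_mm, image_width_px):
    """Ground sample distance (meters of ground per pixel) for a
    nadir-pointing camera, via the pinhole camera approximation."""
    return (altitude_m * sensor_width_mm) / (focal_length_mm * image_width_px)

# e.g. flying at 100 m with a 13.2 mm-wide sensor, 8.8 mm lens, 5472 px image:
gsd = ground_sample_distance(100, 13.2, 8.8, 5472)
print(round(gsd * 100, 2), "cm/px")  # 2.74 cm/px
```

Once GSD is known, an object's pixel extent times GSD gives its absolute size, which is what lets you separate confusing classes by physical scale.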

Most flagships can do this though; any multi-camera phone can get some kind of stereo. Google do it with the PDAF pixels for smart bokeh (they have some nice blog posts about it). I don't know if there is a way to do that in an API though (or to obtain the depth map).


High resolution light field cameras would really help here as well. That seems a ways off though.

Are you folks able to do any multi-spectral stuff? That seems interesting.

I work mostly with RGB/Thermal, if that counts. My PhD was in stereo/lidar fusion, so I've always been into mixing sensors :)

I've also done some work on satellite imaging which is 13-band (Sentinel 2). Lots of people in ecology use the Parrot Sequoia which is four-band multispectral. There really isn't much published work in ML beyond RGB, which I find interesting - yes there's RGB-D and LIDAR but it's mostly for driving applications. Part of the reason I'm so familiar with the yolo codebases is that I've had to modify them a lot to work with non-standard data. There's nothing that stops you from using n-channel images, but you will almost certainly have to hack every off the shelf solution to make it work. RGB and 8-bit is almost always hard coded, augmentation also often fails with non RGB data (albumentations is good though). A bigger issue is there's a massive lack of good labelled datasets for non rgb imagery.
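To illustrate the n-channel point: in PyTorch the architectural change is usually just swapping the stem convolution (this is a generic sketch, not code from any of the repos discussed; the weight-averaging trick for reusing RGB pretraining is a common heuristic, not a guarantee):

```python
import torch
import torch.nn as nn

# A typical detector stem hard-codes 3 input channels:
stem = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)

# For, say, 13-band Sentinel-2 imagery, replace it with an n-channel version:
n_bands = 13
new_stem = nn.Conv2d(n_bands, 32, kernel_size=3, stride=1, padding=1)

# Optionally seed it from pretrained RGB weights by averaging across the
# colour channels and tiling the result over all bands:
with torch.no_grad():
    mean_rgb = stem.weight.mean(dim=1, keepdim=True)          # (32, 1, 3, 3)
    new_stem.weight.copy_(mean_rgb.repeat(1, n_bands, 1, 1))  # (32, 13, 3, 3)

x = torch.randn(1, n_bands, 64, 64)
print(new_stem(x).shape)  # torch.Size([1, 32, 64, 64])
```

The harder part, as noted above, is everything around the model: loaders, augmentation, and normalization pipelines that assume 8-bit RGB.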

On the plus side, in a landscape where everyone is fighting over COCO, there is still a lot of low hanging fruit to pick I think.

I've not done any hyperspectral, very hard to (a) get labelled data (there's AVIRIS and EO-1/Hyperion maybe) (b) it's very hard to label, the images are enormous and (c) the cameras are stupid expensive.

By the way, even satellite imaging ML applications tend to overwhelmingly use just the RGB channels and not the full extent of the data.

Whoa that's awesome! Love hearing contemporary technology used to detect/diagnose/monitor the environment and our ecological impact. Boots on ground will always be important but the horizontal scaling you can get out of imaging I would imagine really helps prioritize where you turn your attention. Thanks for the info and best of luck!

There seems to be an unfair comparison between the various network architectures. The reported speed and accuracy improvements should be taken with a bit of scepticism for two reasons.

* This is the first YOLO implemented in PyTorch. PyTorch is the fastest ML framework around, so some of YOLOv5's speed improvements may be attributable to the platform it was implemented on rather than to actual scientific advances. Previous YOLOs were implemented in Darknet, and EfficientDet is implemented in TensorFlow. It would be necessary to train them all on the same platform for a fair speed comparison.

* EfficientDet was trained on the 90-class COCO challenge [2], while YOLOv5 was trained on 80 classes [1].

[1] https://github.com/ultralytics/yolov5/blob/master/data/coco....

[2] https://github.com/google/automl/blob/master/efficientdet/in...

Great points; we're hoping Glenn releases a paper to back up the performance claims. We are also planning more rigorous benchmarking regardless.

re: PyTorch being a confounding factor for speed - we recompiled YOLOv4 to PyTorch to achieve 50 FPS. Darknet would likely top out around 10 FPS on the same hardware.

EDIT: Alexey, author of YOLOv4, provided benchmarks of YOLOv4 hitting much higher FPS here: https://github.com/AlexeyAB/darknet/issues/5920#issuecomment...

Side note: I like PyTorch, but eager PyTorch is not faster than jax.jit or tf.function code.

EfficientDet was open sourced March 18 [1], YOLOv4 came out April 23 [2], and now YOLOv5 is out only 48 days later.

In our initial look, YOLOv5 is 180% faster, 88% smaller, similarly accurate, and easier to use (native to PyTorch rather than Darknet) than YOLOv4.

[1] https://venturebeat.com/2020/03/18/google-ai-open-sources-ef... [2] https://arxiv.org/abs/2004.10934

Those numbers are quite impressive.

YOLOv4 -> YOLOv5

Inference time: 20ms -> 7ms (on P100)

Frames per second: 50 -> 140

Size: 244 MB -> 27 MB

f(x)=c, zero size, infinite fps. You should also take some accuracy metric into account ;)

> "Similarly accurate"

MS COCO accuracy even looks improved overall.

OT: the stats above should be part of the PyTorch marketing material, indeed impressive

> open sourced

This is not a verb.

Arguably it became one.

In February 2020, PJ Reddie noted he would discontinue research in computer vision.

He actually stopped working on it because of ethical concerns. I'm inspired that he made this principled choice despite being quite successful in this field.


Er so this "Ultralytics" consulting firm just borrowed the name YOLO for this model and didn't actually publish their results yet?

Yeah, they made the most popular PyTorch implementation of YOLOv3 as well, so they're not entering out of the blue. https://github.com/ultralytics/yolov3

The author of YOLOv3 quit working on Computer Vision due to ethical concerns. YOLOv4, which built on his work in v3, was released by different authors last month. I'd expect more YOLOvX's from different authors in the future. https://twitter.com/pjreddie/status/1230524770350817280

I'm a bit fascinated by this Ultralytics. It has a super nice website, but according to LinkedIn, I think it's just one guy who does consultancy.

The intriguing part is that he has also done research in particle physics (as Ultralytics) that has been published in Nature [1].

I had never seen anything like that.

[1] https://www.nature.com/articles/srep13945

> In February 2020, PJ Reddie noted he would discontinue research in computer vision.

It would be fair to state also why he chose to discontinue developing YOLO, as it is relevant.

Two interesting links from the article:

1. How to train YOLOv5: https://blog.roboflow.ai/how-to-train-yolov5-on-a-custom-dat...

2. Comparing various YOLO versions https://yolov5.com/

Latency is measured for batch=32 and divided by 32? This means that 1 batch will be processed in 500 milliseconds. I have never seen a more fake comparison.
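To make the objection concrete (the per-image figure below is assumed for illustration; it's whatever "latency" the chart reports): dividing a 32-image batch time by 32 hides the fact that a single request still waits for the whole batch to finish.

```python
# Reported "latency" = batch time / batch size, which is throughput in disguise.
batch_size = 32
reported_per_image_ms = 15.6  # hypothetical per-image number from the chart

# Actual wall-clock time before any single result is available:
batch_time_ms = reported_per_image_ms * batch_size
print(batch_time_ms)  # ~500 ms for one batch
```

For a real-time, batch-of-one use case (the one people actually care about at "140 FPS"), the honest number is the batch time, not the division.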

Why benchmark using 32-bit FP on a V100? That means it’s not using tensor cores, which is a shame since they were built for this purpose. There’s no reason not to benchmark using FP16 here.

Not sure about the benchmark, but the code includes the option for mixed precision training via Apex/AMP.

If you click around enough you'll see they benchmarked in 32-bit FP. Glad they have a mixed precision training option, but I really think it's a mistake in 2020 to do work related to efficient inference using 32-bit FP.

The problem is that your conclusions aren't independent of this choice. A different network might be far better in terms of accuracy/speed tradeoffs when evaluated at a lower precision. But there is no reason to use 32-bit precision for inference, so this is just a big mistake.
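To illustrate, here's a minimal PyTorch sketch of FP16 inference (a toy model, not YOLOv5; on a V100 the half-precision path is what engages the tensor cores):

```python
import torch
import torch.nn as nn

# Toy stand-in for a detector; .eval() disables dropout/batch-norm updates.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()
x = torch.rand(1, 3, 64, 64)

# Cast weights and inputs to FP16 only on GPU; CPU stays in float32.
if torch.cuda.is_available():
    model, x = model.half().cuda(), x.half().cuda()

with torch.no_grad():
    y = model(x)
print(y.shape)  # torch.Size([1, 16, 64, 64])
```

Benchmarking this way (or with torch.cuda.amp autocast) is a couple of lines, so leaving the comparison at FP32 really does skew the accuracy/speed picture for hardware with tensor cores.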

What does it take to now use this name?

Yeah, it's pretty unethical. Looks like they just stole the name without any care. There doesn't seem to be any relationship between these guys and the original YOLO group.

If it's not trademarked, perhaps not much? I think it's pretty misleading, but the fight for attention is on! Using an established brand in your title will get more clicks.

Just guts, I guess. You would also need to show some real performance improvement.

Is it possible to run these models in the browser, something similar to tensorflow.js?

I would try converting to an ONNX model and then try to infer with TensorFlow.js.

It looks like this is YOLOv4 implemented on PyTorch, not actually a new YOLO?

YOLO is a neural network; Darknet is the framework. Without both YOLOv4 and "YOLOv5" on the same framework, it's nearly impossible to make any kind of meaningful comparison.

I really like the work done by AlexeyAB on Darknet YOLOv4 and by the original author Joseph Redmon on YOLOv3. These guys deserve a lot more respect than any other version of YOLO.

I am very interested in loading YOLO onto a Raspberry Pi + Coral.ai; does anyone know a good tutorial on how to get started? I tried before, and with Darknet it was not easy at all, but now with PyTorch there seem to be ways of loading that onto Coral. I am familiar with Raspberry Pi dev, but not much with ML or TPUs, so I think it'd mostly be a tutorial on bridging the different technologies.

(might need to wait a couple of months since this was just released)

Just read this. Nice overview of the history of the "YOLO" family, and summary of what YOLOv5 is/does.

Does anyone know of an open-source equivalent to YOLOv5 in the sound recognition / classification domain? Paid?

Like it would identify what you're hearing? "Trumpet!" "Wind whistling through oak leaves!" "Male child!" etc?

Ubicoustics [1] would be the closest example to what I am looking for in a FOSS / Commercial offering. Is anyone working on this?

[1] https://github.com/FIGLAB/ubicoustics

This is not the first time something has been fishy. Back in the early stages of the repo, they were advertising on the front page that they were achieving similar mAP to the original C++ version, only for others to find out they hadn't trained it on the COCO dataset and tested it.

If anyone's interested in the direct GitHub link to the repository: https://github.com/ultralytics/yolov5

Hm on this page it has something written in an eastern language under YOLO, https://github.com/ultralytics says Madrid, Spain, but then they say "Ultralytics is a U.S.-based particle physics and AI startup"

"You only look once" in Chinese.

I like to think that the name is also a reference to the fact that this will inevitably be used in some autonomous driving systems.

Less weights, more accuracy. Magic :)

Just recently IBM announced with a loud PR move that the company is getting out of the face recognition business. Guess what? Wall Street doesn't want to keep subsidizing IBM's subpar face recognition technology when open source and Google solutions are pushing the state of the art.

Not something to brag about. Facial recognition has very few applications outside of total surveillance. We should not respect those who lend it their time and effort.

I thought the real focus of the bad actors at this point was on gait detection. It works in civil unrest situations where everyone covers their face.

Not that the difference matters that much.

It's not exclusive. Bad actors are working on whatever they are paid to build, by other bad actors with less technical acumen and more money.

Edit: I should add that most of the actual progress is being made by smart people who think it's an interesting problem and are unaware or uncaring of the clear outcome of such tech.

In this case I meant bad actors as in who is funding the research with the idea of increasing surveillance for the purpose of squashing dissent.

Being able to distinguish between people is pretty foundational to being able to personalize AI applications. If you wanted to make a smart home actually smart and not just full of inconvenient remote controlled appliances, this is pretty necessary.

There are obviously privacy concerns with this example, it’d ideally be fully on-prem.

>Facial recognition has very few applications outside of total surveillance.

That's not really for you to decide, is it? You're absolutely free to have that opinion of course.

>We should not respect those who lend it their time and effort.

Also your choice of course. Facial recognition is essentially a light integration of powerful underlying technologies. Should 'we' ostracize those working on machine learning, computer vision, network and distributed computing, etc?

The question is always the same: is all technical/scientific progress desirable? But it seems that this question isn't asked anymore. "Move fast and break things," am I right?

I'm much more worried about people using your arguments to try to shut down the discussion than about people trying to open the debate, because once the mass surveillance / face recognition Pandora's box is open, there won't be any way to go back.

When I see Predator drones and FBI Stingray planes above every major US city during protests, I already know we're not going in the "let's talk about this before reaching the point of no return" direction.

>>> When I see Predator drones and FBI Stingray planes above every major US city during protests

Can you provide evidence for this? Not that I doubt it, but if I want to tell other people that story, I must have evidence to be believed :-)

edit: ah, of course, 30 seconds of DuckDuckGo provided just the info I need: https://thehill.com/homenews/house/501445-democrats-press-dh...

Yes and plenty of other sources:

Once the tech is out there, it's simply a question of when it will be used for borderline illegal activities, especially in the US, where you have these different entities (FBI, CIA, NSA, DEA, &c.) basically acting in their own bubbles and doing whatever they want until it's leaked and/or gets outrageous enough to get the public's attention.

I mean, there were unidentified armed forces marching in US streets last week; if people don't see this as the biggest red flag in recent US history, I don't know what they need.

You didn't really address the author's point which was that there don't appear to be compelling uses of facial technology beyond mass automated surveillance.

I can't think of other uses and I'd be interested if you can come up with some.

Some compelling uses:

1) assistance in recognizing people (because of low vision, because memory fails, because you have a lot of photos...)

2) ensure candidate X is actually candidate X and not a paid person to take the exam in name of candidate X

3) door access control (to replace/in addition to access card)

4) having your own X-Ray (like in Amazon Prime): identify an actor/actress/model

5) having your personal robot addressing you by name

> 2) ensure candidate X is actually candidate X and not a paid person to take the exam in name of candidate X

Can you imagine the bureaucratic nightmare that would be unleashed upon yourself if "the system" decides you aren't who you say you are because of the way you aged, an injury, surgery or a few new freckles?

This already happens sometimes with birth certificates and identity theft, and it's awful for those who have to experience it. I'd hate to have a black box AI inflicting that upon others for inexplicable reasons.

Also, having models that you more or less understand and that can run locally on "cheap enough" devices is quite interesting.

Compared to using an API in the cloud or purchasing Hikvision cameras.

Just getting into this. Do you recommend any particular "dumb" camera devices with a quality stream?

what's your price range? indoor or outdoor? where do you want to do the inference?

Here's one: identifying child soldiers.

[1] https://www.pyimagesearch.com/2020/05/11/an-ethical-applicat...

Biometric authentication is one that comes to mind. Facial recognition running locally on my own photo library would also be useful for organizing photos. A cloud-free local-only home automation system that can tell the difference between owners/housemates/guests and customize behavior accordingly would also be nice.

I'm looking into YOLO for this, but it's more so to verify that your selfie == the image on the document, and we want to avoid sending highly sensitive information to third-party providers.

The current service we use, while accurate, costs 50 cents per verification...

Edit: reading through this thread, if the model isn't super massive, we could offer on-browser verification! 27MB is still a hefty download though.

Labeling photos with who's in them, either on social media or in private image albums (creating indexes of people in your photos.)

Arguably this is a front for mass surveillance, or can easily be misused for that, but the ostensible purpose is separate and benign.

> Facial recognition is essentially a light integration of powerful underlying technologies. Should 'we' ostracize those working on machine learning, computer vision, network and distributed computing, etc?

Couldn't you argue the same way against just about any kind of IED or booby trap? Yet people tend to ostracize those who make them more than they do people who make ball bearings and nails.

Who in Iraq was getting ostracized by their family and friends for making IEDs?

If they want to use it to enable surveillance, yes.
