Laion-5B: A new era of open large-scale multi-modal datasets (laion.ai)
163 points by tosh 11 months ago | 104 comments

If you've ever actually looked into the Laion datasets, you'll notice that they are hot garbage, in that the captions often don't even correlate to what the image is about, and the images are often low quality, badly cropped and so on.

There are other datasets being developed that use high quality images that are manually labeled by humans, such as by Unstable Diffusion who's having a Kickstarter right now [0]. They say they will be able to get a much more high quality model due to such high quality images and captioning, so we'll see. They also want to make the model and code entirely open source rather than the license that Stable Diffusion has which is not open source (it has many restrictions, enforceable or not, on the images made).

[0] https://www.kickstarter.com/projects/unstablediffusion/unsta...

Looks like you missed the whole point of this dataset.

The idea we proved is that you can get a dataset with decent captions and images (that do match, yes; you can see for yourself at https://rom1504.github.io/clip-retrieval/ ) that can be used to train well-performing models (e.g. openclip and Stable Diffusion) while using only automated filtering of a noisy source (Common Crawl).

We further proved that idea by using aesthetic prediction, nsfw and watermark tags to select the best pictures.
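A rough sketch of what this kind of automated filtering can look like: keep an image-text pair only if the cosine similarity between its image and text embeddings clears a threshold and its predicted NSFW and watermark scores are low enough. The toy embeddings, field names, and threshold values are illustrative assumptions, not LAION's actual pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keep_pair(pair, min_similarity=0.28, max_nsfw=0.5, max_watermark=0.5):
    """Return True if a candidate pair passes every filter.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    similarity = cosine_similarity(pair["image_embedding"], pair["text_embedding"])
    return (
        similarity >= min_similarity
        and pair["nsfw_score"] <= max_nsfw
        and pair["watermark_score"] <= max_watermark
    )

# Toy embeddings standing in for the output of a CLIP-like model.
candidates = [
    {"url": "https://example.com/cat.jpg",
     "image_embedding": [0.9, 0.1, 0.0], "text_embedding": [0.8, 0.2, 0.1],
     "nsfw_score": 0.01, "watermark_score": 0.05},
    {"url": "https://example.com/banner.jpg",
     "image_embedding": [0.0, 0.1, 0.9], "text_embedding": [0.9, 0.1, 0.0],
     "nsfw_score": 0.02, "watermark_score": 0.9},
]
filtered = [pair["url"] for pair in candidates if keep_pair(pair)]
print(filtered)
```

The second pair is dropped twice over: its caption embedding barely overlaps with the image embedding, and its watermark score is high.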

Is it possible to write captions manually? Sure, but that doesn't scale much and won't make it possible to train general models.

It's possible to do a much better job with automation too. Context-aware cropping, accurate aspect ratio, quality filtering by various metrics... all problems solved long ago, but absent from Laion-5B for some reason. Perhaps it would be a good idea to collaborate more closely with image experts for the next round.

> Is it possible to write captions manually? Sure, but that doesn't scale much and won't make it possible to train general models.

Maybe, I don't think so however based on the above comments by Unstable Diffusion. It seems like people are underestimating the power of high quality data and just throwing the kitchen sink at models. Perhaps a set of good quality data can indeed outperform Laion-style datasets.

It's like the YC saying about doing things that don't scale, perhaps with the high quality dataset, we can train better models than CLIP and in turn use those to caption the rest of the images, only now the caption model is much better than previous ones.

The new Unstable Diffusion model will be one of several SD finetuned models out there. These models usually have much higher quality (but less image diversity) because they take the coherency of SD and constrain the distribution to a small, high-quality portion. You can train a model from scratch on a smaller high-quality dataset, but you would not, for example, get the same level of coherency. That can only be obtained with an incredible amount of images, and they don't need to be "high quality": a man will almost always have 2 arms, 2 legs etc., regardless of the quality of the images. After the model has fit the entire distribution, you can finetune it with a small dataset to produce high-quality, coherent images. That's why Unstable Diffusion will finetune an SD checkpoint, and also why researchers use big datasets like LAION-400M/5B.

> and they don't need to be "high quality", a man will almost always have 2 arms, 2 legs etc...

At the next generation it feels like the training set will be inbreeding on the flood of stable diffusion images with 7 mangled fingers, heads coming out of legs, etc.

LAION-400M/5B will obviously not change (and there is enough data to train a really good model), if a future dataset has AI-generated images, these will be highly curated as the images were chosen by the person who was using the model and probably further shared by other users, it would work like a complicated Reinforcement Learning from Human Feedback (RLHF), plus AI-generated images will usually have keywords such as "AI," "Midjourney" in the caption so that the model can learn to distinguish them from the rest of the dataset (and CFG comes to the rescue when the dataset is noisy).
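For readers unfamiliar with CFG (classifier-free guidance): at sampling time the model's noise prediction is pushed away from the unconditional prediction and toward the prompt-conditioned one. A minimal sketch of that combination step, with plain lists standing in for tensors and made-up values for the predictions and the guidance scale:

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions.

    guidance_scale = 1.0 reproduces the conditional prediction;
    larger values extrapolate further in the direction of the prompt.
    """
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]

# Toy values; a real model would produce these at every denoising step.
noise_uncond = [0.0, 1.0, -1.0]
noise_cond = [1.0, 1.0, 0.0]
guided = apply_cfg(noise_uncond, noise_cond, guidance_scale=2.0)
print(guided)  # [2.0, 1.0, 1.0]
```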

I'd guess there is a bias-variance tradeoff. If you just want to make a certain kind of image, no doubt a manually labeled and curated dataset can be better. If you want a generic generative model that has learned a wide variety of stuff, scale wins.

I can see LAION playing a similar role to imagenet. The main application of imagenet isn't directly training image recognition models. It's pretraining on diverse data so that a "big" (big in 2016) model can be fine-tuned easily on a small dataset, after learning to be a good feature extractor. From that perspective, the label quality (and concerns about bias and whatnot) is almost irrelevant.

DALL-E, Stable Diffusion, GPT-3, Whisper, CLIP, etc are all trained on "hot garbage" and all of them are SOTA. Whisper is a great example, as it shows that this broader use of imperfect training data helps to make the models more robust and general than their "perfectly" trained counterparts. The trick behind all of these is to build mechanisms on smaller scale, human labelled data that can then be used to filter and label the broader dataset. Or use training methods that are more robust to imperfect data, like contrastive learning ala CLIP.

If you ever actually look into Unstable diffusion Kickstarter, you’ll notice that they’re not actually claiming they’ll manually label a dataset the size of Laion-5B - that’s a much bigger task than what you seem to think it is.

Even if a million people are labeling images, without any overlap, 5 billion images would mean each of them has to label 5000 images.
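The arithmetic is easy to check. The per-caption time below is purely an assumption, added only to give a feel for the effort per person:

```python
DATASET_SIZE = 5_000_000_000  # LAION-5B, rounded to 5 billion
LABELERS = 1_000_000

images_per_labeler = DATASET_SIZE // LABELERS
print(images_per_labeler)  # 5000

# Assumption for illustration: 30 seconds to write one caption.
SECONDS_PER_CAPTION = 30
hours_per_labeler = images_per_labeler * SECONDS_PER_CAPTION / 3600
print(round(hours_per_labeler, 1))  # 41.7
```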

What Unstable diffusion folks seem to be doing is that they’re using a few thousand labeled images to train a caption generation model and then use it to create a huge multimodal dataset with text and high quality images.

> If you ever actually look into Unstable diffusion Kickstarter, you’ll notice that they’re not actually claiming they’ll manually label a dataset the size of Laion-5B - that’s a much bigger task than what you seem to think it is.

I never claimed this either.

Creators of the data quality tool for computer vision, fastdup, continue to improve on their free release https://github.com/visual-layer/fastdup

Here's a short video of some recent results for LAION 400M https://www.youtube.com/watch?v=dlRCm29Upu4

Obviously there would be limits as to how much could be manually reviewed by hand (if 1000 people reviewed 1000 images each, only 0.02% of the images would be reviewed assuming no overlap was required), but I wonder if there would be any benefit to attempting to crowdsource captions for the dataset for the worst available images

It is not possible to manually label hundreds of millions of images to train a model on them, CFG exists to deal with this problem, also Unstable Diffusion will just finetune a Stable Diffusion model, so you cannot simply change the licence to what you want.

Boorus [0] contain millions of images, manually labeled to a pretty high quality. Notably, diffusion models trained on booru datasets have had good success.

This is not the only example of well curated image-tag pairs, especially in artistic circles. It's just that most of them are not CC.

[0]: https://en.wiktionary.org/wiki/booru

Booru use tags instead of captions, so a model trained on them is really limited; moreover, Danbooru has only 5 million images, while other booru such as gelbooru and sankaku have lower quality.

Tags are limited how exactly? Prompt crafting becomes a case of selecting the relevant tags, and the embedding space will still capture the dataset.

Danbooru is only one such example of well-curated tagging, and if we ignore copyright there are far more examples. These examples just serve as evidence that refining poor labeling is not outside the realm of possibility, as you suggested.

A tag-based system would completely lack any kind of contextual information and it would not be possible to create any relationship between words; natural language is much more powerful.

An example, an image is tagged: kanna_kamui, kimono and torhu_(maiddragon), who has the kimono? Kanna, Torhu or both? It cannot be known, but with natural language it is possible to describe who is wearing what.

I think this is mainly theoretical at this point. In my experience, current technology doesn’t seem to be utilizing the additional information that comes from natural language all that well. For example: prompt Dalle2 for “a dinner plate on a stack of pancakes” and you will get ordinary images of pancakes on plates, not the other way around.

Edit: an experiment comparing tags/BOW vs natural language sequences in image generation tasks would be interesting to see.
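A tiny illustration of why the pancake prompt is a good test case: once word order is thrown away, the prompt and its reverse are indistinguishable. This sketch uses a naive whitespace tokenizer, which is an assumption and not how any of these models actually tokenize text.

```python
from collections import Counter

def bag_of_words(prompt):
    """Lowercase, split on whitespace, and count words (order is discarded)."""
    return Counter(prompt.lower().split())

prompt_a = "a dinner plate on a stack of pancakes"
prompt_b = "a stack of pancakes on a dinner plate"

same_bag = bag_of_words(prompt_a) == bag_of_words(prompt_b)
same_sequence = prompt_a.split() == prompt_b.split()
print(same_bag, same_sequence)  # True False
```

A model that effectively treats the prompt as a bag of words has no way to tell which object should be on top.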

I think this does not work mainly because of the unusual situation you describe, such as "a horse riding a person"; most of the time Dalle 2 is really good at following the prompt.

>It is not possible to manually label hundreds of millions of images to train a model on them

Citation, please?

I think you mean "the developers of this technology do not want to pay to have hundreds of millions of images labeled".

It is not believable that someone would pay humans to label 400mln or 5bln images/samples to train a model on them, but I guess if your argument is "everything is possible" then gotcha.

If it's done in a reCAPTCHA like way, it can be done fairly efficiently and for cheap. In fact Scale AI does just this, they do manual labor operations such as captioning images, as an API. Here's their product for image labeling: https://scale.com/rapid.

Unstable Diffusion is also doing their captioning like how I mentioned, with groups of volunteers as well as hired individuals.

Scale seems to do, for example, image classification but not captioning, as it would be hard to compare the results with other people's to verify the quality (when you have a discrete number of classes it is really straightforward). Also, can you point to where you read about the Unstable Diffusion plan for manually labeling image datasets? I want to dig deeper.

> We are releasing Unstable PhotoReal v0.5 trained on thousands of tirelessly hand-captioned images

They seem to have created a much smaller dataset than LAION's, it would not work to train a generative model on such a small amount of images (obviously the images here do not have a single domain).

You seem to be confusing "possibility" with your personal opinion on what you think would be done by others.

As a human being I know human limitations, explicitly labeling 400mln/5bln images for a particular task seems absurd to me, but if you think it is realistically possible perhaps you can give an example.

The LAION dataset was designed for the broader community in the first place, so clearly the premise is that they don’t have millions to throw at the problem.

So the core of the dataset is image _URLs_ and text captions.

1. From a reproducibility perspective, isn't this kinda brittle in that even without malicious intent, some of those images will no longer be available when other researchers attempt to download them?

2. From a resilience perspective, if your site has some of the images in the dataset, could you swap in another image with the correct dimensions. Could you poison or skew the model in any interesting ways?
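One common mitigation for both concerns is to publish a content hash next to each URL, so that anyone re-downloading the data can at least detect images that were swapped or altered. A minimal sketch follows; the record layout is hypothetical, and nothing in this thread says whether the dataset ships such a hash.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest of the raw image bytes."""
    return hashlib.sha256(data).hexdigest()

def verify_download(record: dict, downloaded: bytes) -> bool:
    """True if the downloaded bytes match the hash stored at collection time."""
    return sha256_hex(downloaded) == record["sha256"]

# Hypothetical dataset row: URL, caption, and a hash taken when it was crawled.
original = b"original image bytes"
record = {
    "url": "https://example.com/image.jpg",
    "caption": "a runner on a track",
    "sha256": sha256_hex(original),
}

print(verify_download(record, original))                # True
print(verify_download(record, b"swapped image bytes"))  # False
```

This only detects a change; it cannot bring back an image that has disappeared, and a harmless re-encoding by the host would also fail the check.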

Imagenet (arguably the most used image dataset of the last 10 years) is the same, it's a list of URLs with full archives of the downloaded images available under some conditions.

Fair enough, but Imagenet is sort of a nightmare right now. I get it's a crowd funded and sourced effort, but hopefully at some point some brave soul(s) will step up to archive the data as-is in a very reproducible kind of way. :D :))))

isn't this a perfect use case for torrents? might be too expensive to host for 1 nonprofit company, but collectively.. there might be a few dozen people willing to host it.

don't know about the legality too.. does laion check all the licenses? or they skirt that by using urls?

The key is the scale of the dataset. Both the points you mention become irrelevant for a large dataset because

1) The chance that a significant percentage of the images become unavailable is low. Also, training on such a big dataset means your model generalizes well and is usually robust.

2) Again, you would need to inject adversarial/malicious images to a significant number of those links in the dataset for it to have actual impact on trained model. Again, unlikely.

For point 1 ... it depends on the timescale. In the fullness of time, surely a significant portion of images will be unavailable. From the perspective of allowing other researchers to work from the "same" baseline "today", this is likely good enough. In a generation from now, if someone wants to reproduce results from some landmark model trained against this dataset, we'd have problems. In other fields where people publish or share their datasets, would this be considered sufficient?

For point 2, I think it's possible that for some narrow topics, some domains have a significant share of images. I think these can affect the model, which is in part why they give special attention to watermarking. Suppose instead of merely watermarking images, for every image on my large collegiate track and field website I make sure someone is wearing a garment with a visible Nike swoosh. Can I skew the model towards associating Nike with the sport? I think this kind of thing may be achievable for niche areas.

Since artists already appear to believe LAION is “stolen content”, actually downloading everything wouldn’t help the case that it’s fine.

And from the storing perspective? The full image dataset weighs dozens of PB. How convenient is that to share?

This dataset is a massive failure when it comes to ethical research practices. LAION-5B openly indexed copyrighted data that it had no business collecting. They failed to go through an IRB when curating this data. The ethics review for this paper was a joke, where the ethics reviewer raises valid concerns and then discards their review because "if they don't publish it here, they'll publish it somewhere else anyways" [0].

LAION-5B has enabled some really cool technologies and a lot of promising startups. This work should have been carried out responsibly.

[0] https://openreview.net/forum?id=M3Y74vmsMcY

> LAION-5B openly indexed copyrighted data that it had no business collecting.

Seems like an open and shut fair use claim, web indexing (not even scraping, just indexing) is not uncommon...

Whether or not that data can be used ethically is an entirely separate conversation, one that IRBs have spent decades answering, and a process that LAION completely skirts on the basis of being a company.

I mean, sure, but that has nothing to do with the data being copyrighted or not.

> LAION-5B openly indexed copyrighted data that it had no business collecting.

This seems to be legal in many countries (from what I know, the UK, EU, Japan and Singapore) due to the TDM (Text and Data Mining) exception, especially for researchers.

It's the classic HN scraping butthurt. You can only do this if you're a billion (trillion?) dollar company and you do it behind closed doors.

All these concern trolls are bad actors. All of them.

Unless you literally steal someone's work and use it / sell it as your own, all the data mining is moral and should be legal if it isn't already.

> Unless you literally steal someone’s work and use it / sell it as your own

Artists creating work are not releasing it on the internet with ShareAlike licenses or any other license which openly allows derivative work or further distribution without a license. This is literally providing a means to stealing people’s work.

How is this any different from providing a listing of copyrighted movies and games and a means to download them, a la The Pirate Bay?

What specifically are you claiming required a review board?

Quick review of their site and the paper turns up nothing that commonly would be a topic that might merit such a review.

Related FAQs:

- https://laion.ai/faq/

LAION-5B includes images of humans without their explicit consent. Images of people generally involve IRB/HSR. Additionally, almost any IRB will mention that if you’re using data derived from humans, you must go through IRB.

LAION can say all they want that they’re not including images in their dataset. They include a script to download those URLs into images on disk. By being a company that’s not bound to decades of university ethics regulations, they are seemingly allowed to skirt what you learn on your first day as a researcher in academia. It may be legal, but it sure is not ethical.

Please provide link to another academic publication agreeing with your claim that linking to online content is unethical without the subject’s explicit approval.

It's one thing to link to online content. They also provide a download script to then turn the links into realizable images.

This defense, that they merely provide links and not images, is the thin layer of abstraction that their entire ethics case is built on top of. They give you everything needed to create massive datasets of human data without doing it for you.

Thanks a more specific claim that the OP didn't make.

This is just trolling and typical of people who just want to shoot down what others have done because they can't or haven't created anything themselves. How about showing a positive solution instead of the equivalent of finding reasons why we can't do anything? If everyone had this attitude we'd still all be hiding in trees somewhere.

The positive solution is to scrape Wikimedia Commons for everything in "Category: PD-Art-old-100" and train from scratch on that data. Wikimedia Commons is well-moderated, the image data is public domain[0], and the labels can be filtered down to CC-BY or CC-BY-SA subsets[1]. Your resulting model will be CC-BY-SA licensed and the output completely copyright-free.
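As a starting point, the file list for a Commons category can be paged through with the MediaWiki `categorymembers` query. The sketch below only builds the request URL and does not fetch anything; the category name is copied from the text above and may not match the real category title exactly, and the helper name is made up.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"

def category_files_url(category, continue_token=None, limit=500):
    """Build a MediaWiki API URL listing the files in a Commons category.

    Pass the `cmcontinue` value from the previous response as
    continue_token to request the next page of results.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",
        "cmlimit": limit,
        "format": "json",
    }
    if continue_token is not None:
        params["cmcontinue"] = continue_token
    return f"{API_ENDPOINT}?{urlencode(params)}"

url = category_files_url("PD-Art-old-100")
print(url)
```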

For the record, that's what I've been trying to do[2]; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask[3]. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.

If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating

[0] At least in the US. Other jurisdictions think that scanning an image recopyrights it, see https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...

[1] Watch out for anything tagged with https://commons.wikimedia.org/wiki/Template:Royal_Museums_Gr... as that will taint your model.

[2] https://github.com/kmeisthax/PD-Diffusion

[3] https://pooper.fantranslation.org/@kmeisthax/109486435508334...

You can check my Google Scholar [0]. I have created many, high-impact datasets that were 1) formative in their respective areas and 2) have seen downstream usage in disasters and wars around the world. Not once in creating those datasets did we take the “easy” route by compromising on the ethics of data collection.

The positive solution here was to not collect data if there was a reasonable ethical concern. This classic mindset of “anything goes as long as we create value” is highly toxic.

[0] https://scholar.google.com/citations?user=4Cdwp_MAAAAJ

For practical context, Stable Diffusion 2.X was trained on LAION-5B as opposed to LAION-400M for Stable Diffusion 1.X.

At the least, Stable Diffusion 2.X is better at pain points of image generation such as text legibility and hands, potentially due to having more data points.

Problem with hands is probability.

It's more probable a finger has a finger on both sides of it than not. So the model diffuses lots of adjacent fingers.

But that's the same for everything that has structure. A small section of an arm is much more likely to have another small section of an arm next to it than to have a hand, yet SD's arms are usually well-proportioned.

There's a lot of loooooong necks, though.

No the problem with fingers is that they resemble hotdogs and the AI really likes hotdogs so you get a lot of fingers.

I can make things up too!

SD 2 also removed quite a lot of images of humans due to their fear of people generating CSAM, so the quality for anything resembling humans has actually gotten worse than in SD 1.

2.0 removed too many of them due to a bug in the NSFW filter. 2.1+ should be better again.

But they’re harder to control without negative prompting.

This is incorrect. Stable Diffusion 1.x was trained on "laion-improved-aesthetics" (a subset of laion2B-en).

Double checked and both the initial comment and the correction are incorrect: the original v1.1 was trained on LAION-2B, then subsequent versions were finetuned on the aesthetics subset.

Either way, the main point is the same: more training data gives better results.


1.1 wasn’t public. Public releases were trained as I said.

LAION is arguably as important as imagenet was in the early 2010s

So.. the data set is licensed under Creative Commons, the source images all have their copyright (so how did they make it into the data set?) but what about images created with Stable Diffusion that uses this data set? Are those derived works? What license would they fall under?

The license on the dataset only covers the collection of images and labels; which is not copyrightable in the US but copyrightable in the EU. The reason for this is the same reason why you can't copyright a phone book in the US but can in the EU[0]. The CC-BY license on the dataset only means you can avoid getting sued by the people who collected LAION-5B by attributing them.

The copyright on the actual images and text labels is far more of a problem. Generally speaking, it is extremely infringing to collect a bunch of images or captions and redistribute them. Like, getting-to-the-heart-of-ownership, if-you-cant-sue-for-this-you-own-nothing kind of infringing. However, it's when you start talking about training ML systems on the dataset that things get interesting. In the EU, there's an explicit copyright exception for data mining that would apply to, say, training DALL-E or Stable Diffusion.

But what about the US? Well, it's legal to crawl the web; and we specifically have Authors Guild v. Google where the Authors Guild lost in court trying to keep Google from scanning large numbers of books. AI researchers have sort of just taken this to mean "training AI is fair use". This is not court-tested, but it at least jibes with some precedent, so I think it's OK to assume it's true.

However, it means absolutely nothing for the people actually using the AI, because fair use is not transitive. If I take every YouTube video review of a movie and edit them down to just the movie clips being used, and then assemble them back together... I haven't somehow made a "fair use copy" of a movie that you can just share around. I've just made the most inefficient form of copyright infringement you can do with a computer. Likewise, if I train an AI on a movie, that can be fair use, but asking it to spit the movie back out is not.

Now, keep in mind that some ML systems (such as Copilot) are very eager to reproduce their training set data. Sometimes in situations you wouldn't expect. These sorts of things are ticking time bombs for people who want to generate novel images, because the AI having trained on such a massive data set also gives you access to basically the whole data set. That's half of a US copyright infringement claim right there - the other half being substantial similarity, which basically is the "Corporate needs you to find the differences between these two pictures" meme.

The only way to keep AI from infringing copyright is to make sure it never sees anything that could potentially be under copyright.

[0] Strictly speaking, the EU has a separate concept of sui generis database ownership, but for this discussion we can treat it the same as copyright.

If you're wondering why phone books aren't copyrightable, the term of art to search for is "sweat of the brow".

how would commercialization (such as, selling things made with AI, like artwork, books, etc., monetizing AI models, like selling access to them) affect 'fair use'? so far, what I'm getting is that training a model on data may be 'fair use' (though, is it creating a derivative of that data? are training results a derivative? and would there be a difference if it's made for/used for commercial purposes, or not), but the output seems to be kind of in a 'copyright limbo'.

One of the fair-use factors is commerciality of the reuse; specifically non-commercial uses are more likely to be fair.

However, this factor is treated with little weight. Practically speaking it is very difficult to imagine a reuse of a copyrighted work that does not carry some commercial benefit to someone. At the very least, not having to pay for a license is a commercial benefit of its own. Generally speaking, assume all uses are commercial and you will understand a lot of modern fair use cases.

Legally speaking, "fair use" and "derivative work" are mutually exclusive. In fact, both terms were coined at the same time when SCOTUS created the entire derivative rights regime basically out of thin air in Folsom v. Marsh. They needed a legal tool to prevent people from stealing large sections of a work, but also didn't want to allow copyright owners to abolish the 1st Amendment. Hence, they set up a set of deliberately murky legal tests to determine if a use was "fair" or not.

If you want a quick standard to gut-check against, the question you'd ask is: "is this use something that other people would ordinarily pay for?" If so, then it's infringing. If not, then it might be fair use. So you can see why making an image generator might be fair use, but its output would be infringing if you could identify an original work the AI was cribbing from. It'd be difficult to even fathom how licensing on a training set would work, given that there's no clear chain of value from a particular entry in the set to a particular model weight or output. But we can clearly identify if an AI system is regurgitating training output, or has been told to copy someone else's work and change it a little - at least after-the-fact in a court of law.

Good to see open data and open models become a thing, I hope this trend will continue and open AI will triumph like open source software did.

While this is an impressive number of images today, I believe it will be an underwhelming amount compared to what models are trained on in the future.

This is an incomplete analogy, but in the first year after a baby is born, that baby will have seen 1,892,160,000 frames of data per eye, or 3,784,320,000 frames in total. That baby still knows practically nothing about the world.
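Those figures correspond to 60 frames per second sustained for a full 365-day year, with no time off for sleep. The 60 FPS rate is inferred from the numbers rather than stated, so treat it as an assumption:

```python
FPS = 60  # assumed rate, inferred from the quoted totals
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

frames_per_eye = FPS * SECONDS_PER_YEAR
frames_both_eyes = 2 * frames_per_eye
print(frames_per_eye)    # 1892160000
print(frames_both_eyes)  # 3784320000
```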

You are correct. Deepmind released a paper earlier this year showing that data is the primary constraint holding back these models, not their architecture size (ie a model with 5 billion parameters is not much better than one with 1 billion, but more data can make both much better) [0].

I will copy paste the main findings from the article here:

- Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are miniscule; indeed, most recent landmark models are wastefully big.

- If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models.

- If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible.

- The literature is extremely unclear on how much text data is actually available for training. We may be "running out" of general-domain data, but the literature is too vague to know one way or the other.

- The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available.

[0] https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinc...
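To make the "data, not size" point concrete, here is a sketch of the parametric loss fit from the DeepMind (Chinchilla) paper, L(N, D) = E + A/N^alpha + B/D^beta, where N is the parameter count and D the number of training tokens. The constants are quoted from memory of that paper, and the two example configurations (a 280B-parameter model on 300B tokens versus a 70B-parameter model on 1.4T tokens) are added here for illustration; neither appears in this thread.

```python
def chinchilla_loss(params, tokens):
    """Predicted language-modeling loss under the fitted scaling law.

    Constants are the published fit as recalled here; treat them as
    approximate rather than authoritative.
    """
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / params**alpha + B / tokens**beta

big_model_less_data = chinchilla_loss(params=280e9, tokens=300e9)
small_model_more_data = chinchilla_loss(params=70e9, tokens=1.4e12)
print(round(big_model_less_data, 3), round(small_model_more_data, 3))
```

Under this fit the smaller model trained on more tokens comes out ahead, because at these scales the data term B/D^beta is several times larger than the model-size term A/N^alpha.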

Wonder how that relates to your earlier comment in the thread and if the impact of dataset quality on performance has been studied.

I'm not an ML engineer (anymore) so I don't know the particulars, but I'd say that while the amount of data matters, it's still better to have high quality data than to not have it.

This post is about image generation, not language models.

I'd imagine the situation is the same for image generation models too.

Pretty sure this is a troll.

The assumption that human eyes can be measured in FPS is, in itself, very questionable. And if it were indeed the case, then it would surely be far in excess of 60fps…

Well, inhibitory alpha waves cycle across the visual field 10 times a second. People with faster alpha waves can detect two flashes that people with slower alpha waves see as one flash.

> The assumption that human eyes can be measured in FPS is, in itself, very questionable.

In the strictest sense, yes. But it seems quite reasonable to think that there is something like an "FPS equivalent" for the human eye. I mean, it's not magic, and physics comes into play at some level. There's a shortest unit of time / amount of change that the eye can resolve. From that you could work out something that is analogous to a frame-rate.

> And if it were indeed the case, then it would surely be far in excess of 60fps

Not necessarily. Quite a few people believe that the human eye "FPS equivalent" is somewhere between 30-60 FPS. That's by no means universally accepted and since it's just an analogy to begin with the whole thing is admittedly a little bit dodgy. But by the same token, it's not immediately obvious that the human "FPS equivalent" would be "far in excess of 60 FPS" either.

> There's a shortest unit of time / amount of change that the eye can resolve.

Sure. Otherwise movies and video wouldn't work at all.

Most of those frames are redundant.

There's value in redundancy and continuous stream of images where one follows the other.

It would be nice to have a dataset of a couple "raising" a Video recorder for 1 year as if they would a baby. A continuous stream of data.

Could train a model to predict the next frames based on what it's seen so far.

> It would be nice to have a dataset of a couple "raising" a Video recorder for 1 year as if they would a baby. A continuous stream of data.

The project I'm working on right now is to build a sort of "body" for a (non ambulatory, totally non anthropomorphic) "baby AI" that senses the world using cameras, microphones, accelerometer/magnetometer/gyroscope sensor, temperature sensors, gps, etc. The idea is exactly to carry it around with me and "raise" it for long periods of time (a year? Sure, absolutely, in principle. But see below) and explore some ideas about how learning works in that regime.

The biggest (well, one of the biggest) challenge(s) is going to be data storage. Once I start storing audio and video the storage space required is going to ramp up quickly, and since I'm paying for this out of my own pocket I'm going to be limited in terms of how much data I can keep around. Will I be able to keep a whole year? Don't know yet.

There's also some legal and ethical stuff to work out, around times when I take the thing out in public and am therefore recording audio and video of other people.

Glad to hear you are working on such a project. There will definitely be a lot of privacy concerns in any such project, so it may be difficult to open-source the data to the broad public.

But could still be useful to research institutes who follow privacy guidelines.

It might be best to do a short stint of 1 week to test the feasibility. That should give you a good estimate on future projections of how much data it will consume after a month, 3 months, and a year.

I imagine any intelligent system could work with reduced data quality/lossy data at least on the audio.

As long as it's consistent in the type/amount of compression. So instead of WAV/FLAC/RAW, you could encode it to something like Opus at 100 kbps, and that would give you 394.2 gigabytes of data for a single year of audio.

As for video... it would definitely require a lot of tricks to store on a hobbyist level.
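The audio figure above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the stream is recorded around the clock at a constant bitrate, 1 kbps means 1000 bits per second, a year has 365 days, and gigabytes are decimal (10^9 bytes). The function name `storage_gb` is made up for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def storage_gb(bitrate_kbps: float, seconds: int = SECONDS_PER_YEAR) -> float:
    """Storage in decimal gigabytes for a constant-bitrate stream."""
    bytes_total = bitrate_kbps * 1000 / 8 * seconds  # kbps -> bits/s -> bytes/s -> bytes
    return bytes_total / 1e9


audio_year_gb = storage_gb(100)  # Opus at 100 kbps, recorded continuously
print(f"{audio_year_gb:.1f} GB per year")
```

Plugging in a video bitrate instead gives the corresponding estimate for video, which is where the numbers get uncomfortable for a hobbyist.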

Yep. Your reply here encapsulates a lot of what I've been thinking about for the past few weeks. I'd love to open-source at least some of the data I collect, but the privacy/ethics issues have to be considered. And as far as that goes, there are legal/ethical issues around simply collecting data even if I don't share it, that come into play where other people are involved.

> It might be best to do a short stint of 1 week to test the feasibility. That should give you a good estimate on future projections of how much data it will consume after a month, 3 months, and a year.

Yep. That's basically the approach I took with "phase 1" where the only data being ingested was gps / accelerometer data. I just let it run for a couple of weeks and then extrapolated out what the storage requirements would be for the future. Obviously audio and video are going to change the equation a lot, but the same principle is what I am planning to employ.
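For what it's worth, the extrapolation is just a linear projection from the trial run. A minimal sketch, assuming the daily data rate stays constant after the trial; the function name `project_storage` and the 14 GB / 14 day trial figures are hypothetical placeholders, not measurements from the actual project:

```python
from typing import Dict


def project_storage(
    bytes_observed: float, days_observed: float, horizons_days: Dict[str, float]
) -> Dict[str, float]:
    """Linearly extrapolate storage in bytes, assuming a constant daily data rate."""
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    bytes_per_day = bytes_observed / days_observed
    return {label: bytes_per_day * days for label, days in horizons_days.items()}


HORIZONS = {"1 month": 30, "3 months": 90, "1 year": 365}

# Hypothetical trial: 14 GB written during a 14-day run.
projection = project_storage(14e9, 14, HORIZONS)
for label, size_bytes in projection.items():
    print(f"{label}: {size_bytes / 1e9:.0f} GB")
```

Once audio and video are switched on, the same function applies; only the trial numbers change.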

> I imagine any intelligent system could work with reduced data quality/lossy data at least on the audio.

Yep, that's another area I've been thinking a lot about. The "instinct" is to capture everything at the highest possible resolution / sampling rate / etc. and store it in a totally lossless format. But that is also the most expensive scenario, and if it's not strictly required, then why do it? We know human hearing at least can work with relatively crappy audio. Look at the POTS phone system, for example, with its 8 kHz sampling rate in digital telephony (roughly 300-3400 Hz of usable voice bandwidth). Does that analogy hold for video? Good question.
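To put rough numbers on that trade-off: uncompressed PCM audio needs sample rate × bit depth × channels bits per second. The sketch below compares telephone-grade audio (8 kHz sampling, 8-bit samples, mono, as in G.711) with CD-grade audio (44.1 kHz, 16-bit, stereo). The function name `pcm_kbps` is made up for illustration, and which quality level is "good enough" for learning is exactly the open question:

```python
def pcm_kbps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> float:
    """Raw (uncompressed) PCM bitrate in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


telephone_kbps = pcm_kbps(8_000, 8)            # telephone-grade: 8 kHz, 8-bit, mono
cd_kbps = pcm_kbps(44_100, 16, channels=2)     # CD-grade: 44.1 kHz, 16-bit, stereo

print(f"telephone: {telephone_kbps:.1f} kbps")
print(f"CD:        {cd_kbps:.1f} kbps")
print(f"ratio:     {cd_kbps / telephone_kbps:.2f}x")
```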

> As long as it's consistent in the type/amount of compression. So instead of WAV/FLAC/RAW, you could encode it to something like Opus at 100 kbps, and that would give you 394.2 gigabytes of data for a single year of audio.

> As for video... it would definitely require a lot of tricks to store on a hobbyist level.

Definitely. One thing that may help with costs in the short term is that I'm very explicitly not (for now anyway) using a cloud storage service. Data ingestion is to a server I own and physically have in my home. I can get away with this because, while the aggregate total amount of data may wind up fairly big over longer periods of time, the rate at which I need to ingest data isn't all that high (there's only one of these devices sending to the server). And I can just keep adding 5TB or 10TB drives as needed. When one fills up, I can unplug it, replace it with another, label and store it, and move on. The big risk here is that I don't really have any redundancy in that scenario, especially if my home burns down or something. But in that case I have bigger problems to worry about anyway!

There are other downsides to this approach, like dealing with the case of needing to access the entire year's worth of data "at once" for analysis or training, but I'm not sure that need will ever even arise.

There was an article on using latent embeddings for compression. Might be useful.


And unclassified. And of poor quality.

Babies have a much harder task. They have to construct a corpus of knowledge from absolutely nothing.

The upside is that babies get to interact with the environment they're training on. Image models can't move the camera a few cm to the right if they're interested in the perspective of a particular scene.

Not absolutely nothing: the neural net is initialized with some weights encoding basic things (breathing, sucking, crying, etc.). A newborn horse walks and follows its mother after its first 5-10 minutes.

How do we know they start from nothing?

In fact, we're pretty sure that they don't "start from nothing." E.g., https://en.wikipedia.org/wiki/The_Language_Instinct

We're not pretty sure of anything, e.g., https://en.wikipedia.org/wiki/Educating_Eve

On the surface, that sounds like a reasonable position to take. ("Cowley proposes an alternative: that language acquisition involves culturally determined language skills, apprehended by a biologically determined faculty that responds to them. In other words, he proposes that each extreme is right in what it affirms, but wrong in what it denies. Both cultural diversity of language, and a learning instinct, can be affirmed; neither need be denied.")

GPT's ability to fool intelligent people into thinking that it is "intelligent" itself seems like a powerful argument that language, more than anything else, is what makes humans capable of higher thought. Language is all GPT has. (Well, that and a huge-ass cultural database.)

Intelligence is one of those areas in which, once you fake it well enough, you've effectively made it. Another 10x will be enough to tie the game against an average human player.

There's a really easy, yet unconscionably horrible experiment we could perform to test the assumption that we're preprogrammed with any sort of knowledge.

Take a baby and stick it in a room. Let it grow up with absolutely no stimulation whatsoever. It is given food and that's about it. What do you think it can demonstrate knowledge of by the time it reaches 5? 10? 15?

All behavior is learned behavior. People talk about sucking and breathing and walking horses and what not, but babies do have to learn how to latch and how to feed. Now, they can work it out themselves. But quick acquisition of a skill does not mean the skill already existed.

Not to mention it's a far cry from sucking to language. Or knowing what a person is. Or who a person is.

yes indeed. Video is the clear next step.

What makes this multimodal, labels?

One mode is natural language, the other is imagery. It is a combination because the model will learn statistical associations between the modes, e.g. "text to image", "voice to text".

Within these respective modes are even more subgroups, e.g. language translation, audio diarization. For SD you can consider animation and photographs as separate modes the model has to learn. Although the language is fuzzy and I'm not being statistically rigorous, as it is a weak point of mine.

Terribly unethical using unlicensed images. They could/could've crowdsourced image gathering and labeling instead of stealing images.

This is like saying Google Image Search stole your image.

(In fact it’s exactly the same; it’s allowed under the same laws and it respects robots.txt.)

Does Google.com allow anybody to instantly mimic an artist's style? Obviously AI laws haven't been put in place yet - it doesn't mean it's not unethical.

It's always been possible to imitate an art style. Nevertheless, art styles have never gotten IP protection - they're more like trade secrets.

What's notable is "AI users are trying to copy an artist" != "AI has learned from an artist" != "AI has seen the artist's images in the first place". The most popular supposedly stolen-from artist, Greg Rutkowski, is not in Stable Diffusion's training images. Even though users are actively trying to copy him, it's a coincidence that it appears to work. Is that unethical?

Also, AI laws (text and data mining exemptions) /have/ been put in place - to make this explicitly legal!

Imitate, not duplicate. Hours of work/talent vs. stealing an artist's work (see: unlicensed) and feeding it into a software program to spit out a slightly different version (complete with watermarks and all)...

You don't have much faith in an artist if you think you can "duplicate their style" by looking at 4-5 512x512 images of their work.

> complete with watermarks and all

That's not the AI "copying their watermarks", it's the AI learning "sometimes images have watermarks" and giving them some of its own.

Does LAION do this? Regardless of whether "allow anybody to instantly mimic an artist's style" is bad, who are you complaining about?

Is mimicking an artist's style considered unethical?
