Do you actually need to simulate at that minuscule level of detail?
Or is it possible for a system to be built that can approximate biology similar to how LLMs approximate cognition without true understanding and reasoning?
Maybe it’s possible to build something analogous to LLMs, but it’s going to need to be based on completely different technology, due to:
1) the lack of abundant training data (nothing analogous to a trillion-word internet to scrape)
2) the unacceptability of hallucinations (can’t just dose people; need some other way to validate)
It is true that replay in the world frame will not handle initial position changes for the shirt. But if the commands are in the frame of the end-effector and the data is object-centric, replay will somewhat generalize. (Please also consider the fact that you are watching the videos that have survived the "should I upload this?" filter.)
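To make the object-centric point concrete, here's a minimal sketch (my own illustration, not anything from the paper) of replaying an end-effector trajectory that was recorded relative to the object's frame; the function names and numbers are placeholders:

```python
# Minimal sketch: replaying an end-effector trajectory recorded relative to
# an object frame, so a shifted object still gets the "same" motion.
import numpy as np

def pose_to_matrix(xyz, rot=np.eye(3)):
    """Build a 4x4 homogeneous transform from a translation and rotation."""
    T = np.eye(4)
    T[:3, :3] = rot
    T[:3, 3] = xyz
    return T

def replay_object_centric(demo_ee_in_object, T_world_object_now):
    """Map demo waypoints (end-effector poses in the object frame at
    recording time) into world-frame targets for the current object pose."""
    return [T_world_object_now @ T_obj_ee for T_obj_ee in demo_ee_in_object]

# A single recorded waypoint 10 cm above the object's origin...
demo = [pose_to_matrix([0.0, 0.0, 0.10])]
# ...still lands 10 cm above the object after the object has been moved.
T_world_object_now = pose_to_matrix([0.42, -0.15, 0.0])
world_targets = replay_object_centric(demo, T_world_object_now)
print(world_targets[0][:3, 3])  # -> [0.42, -0.15, 0.10]
```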
The second thing is that large-scale behavior cloning (which is the technique used here) is essentially replay with a little smoothing. Not bad inherently, but just a fact.
My point is that the academic contribution was made back when the first ALOHA paper came out and showed that doing BC on low-quality hardware could work; this is something like the fourth paper in a row of roughly the same stuff.
Since this is YC, I'll add: as an academic (physics) turned investor, I would like to see more focus on systems engineering and first-principles thinking, and less PR for the sake of PR. I love robotics and really want to see this stuff take off, but for the right reasons.
> large-scale behavior cloning (which is the technique used here), is essentially replay with a little smoothing
A definition of "replay" that involves extensive correction based on perception in the loop is really stretching it. But let me take your argument at face value. This is essentially the same argument that people use to dismiss GPT-4 as "just" a stochastic parrot. Two things about this:
One, like GPT-4, replay with generalization based on perception can be exceedingly useful by itself, far more so than strict replay, even if the generalization is limited.
Two, obviously this doesn't generalize as much as GPT-4. But the reason is that it doesn't have enough training data. With GPT-4 scale training data it would generalize amazingly well and be super useful. Collecting human demonstrations may not get us to GPT-4 scale, but it will be enough to bootstrap a robot useful enough to be deployed in the field. Once there is a commercially successful dexterous robot in the field we will be able to collect orders of magnitude more data, unsupervised data collection should start to work, and robotics will fall to the bitter lesson just as vision, ASR, TTS, translation, and NLP did before it.
"Limited generalisation" in the real world means you're dead in the water. Like the Greek philosopher Heraclitus pointed out 2000+ years go, the real world is never the same environment and any task you want to carry out is not the same task the second time you attempt it (I'm paraphrasing). The systems in the videos can't deal with that. They work very similar to industrial robots: everything has to be placed just so with only centimeters of tolerance in the initial placement of objects, and tiny variations in the initial setup throw the system out of whack. As the OP points out, you're only seeing the successful attempts in carefully selected videos.
That's not something you can solve by learning from data alone. A real-world autonomous system must be able to deal with situations that it has no experience with, it has to be able to deal with them as they unfold, and it has to learn from them general strategies that it can apply to more novel situations. That is a problem that, by definition, cannot be solved by any approach that must be trained offline on many examples of specific situations.
Thank you for your rebuttal. It is good to think about the "just a stochastic parrot" thing. In many ways this is true, but it might not be bad. I'm not against replay. I'm just pointing out that I would not start with an _affordable_ 20k robot with fairly undeveloped engineering fundamentals. It's kind of like trying to dig the foundation of your house with a plastic beach shovel. Could you do it? Maybe, if you tried hard enough. Is it the best bet for success? Doubtful.
The detail about the end-effector frame is pretty critical, as doing this BC with joint angles would not be tractable. You can tell there was a big shift from RL approaches aiming at broadly generalizing algorithms to more recent work heavily focused on these arms/manipulators, because end-effector control enables flashier results.
Another limiting factor is that data collection is a big problem: not only will you never be sure you've collected enough data, but what they're collecting is a human trying to do this work through a janky teleoperation rig. The behavior they're trying to clone is a human working poorly, which isn't a great source of data! Furthermore, limiting the data collection to (typically) 10 Hz means that the scene will always have to be quasi-static, and I'm not sure these huge models will speed up enough to actually understand velocity as a 'sufficient statistic' of the underlying dynamics.
Ultimately, it's been frustrating to see so much money dumped into the recent humanoid push using teleop / BC. It's going to hamper the folks actually pursuing first-principles thinking.
That's an extreme position to take that rests on the claim that sponsorship/advertising is objectively bad.
Media & journalism have been underpinned by advertising for over a century. Tons of educational and informative services are available to the public for free because of advertising. Sponsorship has built art galleries, hospital wings, research centers, etc.
In this case, there's a relatively innocuous logo on a robotic lander that is 230k miles away on a desolate rock. It's not like this is a billboard in a nature preserve.
Whether advertising is objectively bad isn't necessarily the debate, but at some point it can cross a line. That line might be different for everyone, but most people will have one. You yourself give an example of something you suggest might be unacceptable to some:
> billboard in a nature preserve
Where's the line? Why shouldn't we put billboards in nature preserves?
Can you elaborate on “properly tweaked”? When I use one of the Stable Diffusion and AUTOMATIC1111 templates on runpod.io, the results are absolutely worthless.
This is using some of the popular prompts you can find on sites like prompthero that show amazing examples.
It’s been serious expectation vs. reality disappointment for me and so I just pay the MidJourney or DALL-E fees.
1. Use a good checkpoint. Vanilla stable diffusion is relatively bad. There are plenty of good ones on civitai. Here's mine: https://civitai.com/models/94176
2. Use a good negative prompt with good textual inversions. (e.g. "ng_deepnegative_v1_75t", "verybadimagenegative_v1.3", etc.; you can download those from civitai too) Even if you have a good checkpoint this is essential to get good results.
3. Use a better sampling method instead of the default one. (e.g. I like to use "DPM++ SDE Karras")
There are more tricks to get even better output (e.g. controlnet is amazing), but these are the basics.
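If it helps, here is a rough script version of those three tips using the diffusers library instead of the A1111 web UI; the checkpoint and embedding file names are placeholders for whatever you download from civitai:

```python
# Rough equivalent of the tips above in diffusers (file names are placeholders).
import torch
from diffusers import StableDiffusionPipeline, DPMSolverSDEScheduler

# 1. Load a community checkpoint (e.g. downloaded from civitai) instead of vanilla SD.
pipe = StableDiffusionPipeline.from_single_file(
    "my_civitai_checkpoint.safetensors", torch_dtype=torch.float16
).to("cuda")

# 2. Load a textual-inversion embedding to reference in the negative prompt.
pipe.load_textual_inversion("ng_deepnegative_v1_75t.pt", token="ng_deepnegative_v1_75t")

# 3. Swap in a better sampler (DPM++ SDE Karras).
pipe.scheduler = DPMSolverSDEScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe(
    prompt="portrait photo of a woman, soft light, 85mm",
    negative_prompt="ng_deepnegative_v1_75t, lowres, blurry, bad anatomy",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("out.png")
```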
Thank you. I assume there's some community somewhere where people discuss this stuff. Do you know where that is? Or did you just learn this from disparate sources?
> I assume there's some community somewhere where people discuss this stuff. Do you know where that is? Or did you just learn this from disparate sources?
I learned this mostly by experimenting + browsing civitai and seeing what works + googling as I go + watching a few tutorials on YouTube (e.g. inpainting or controlnet can be tricky as there are a lot of options and it's not really obvious how/when to use them, so it's nice to actually watch someone else use them effectively).
I don't really have any particular place I could recommend to discuss this stuff, but I suppose /r/StableDiffusion/ on Reddit is decent.
Pretty good reddit community, lots of (N/SFW) models and content on CivitAI. Took me a weekend to get set up and generating images. I've been getting good results on my AMD 6750XT with A1111 (vladmandic's fork).
What kind of (and how much) data did you use to train your checkpoint?
I'd like to have a go at making one myself, targeted towards single objects (be it a car, spaceship, dinner plate, apple, octopus, etc.). Most checkpoints lean very heavily towards people and portraits.
Are you using txt2img with the vanilla model? SD's actual value is in the large array of higher-order input methods and tooling; as a tradeoff, it requires more knowledge. Similarly to 3D CGI, it's a highly technical area. You don't just enter the prompt with it.
You can finetune it on your own material, or choose one of the hundreds of public finetuned models. You can guide it in a precise manner with a sketch or by extracting a pose from a photo using controlnets or any other method. You can influence the colors. You can explicitly separate prompt parts so the tokens don't leak into each other. You can use it as a photobashing tool with a plugin to popular image editing software. Things like ComfyUI enable extremely complicated pipelines as well. etc etc etc
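As one concrete example of "guiding it with a pose": here's what a ControlNet pipeline looks like in diffusers. The model IDs are the commonly used public ones; treat the exact names as my assumptions rather than part of the comment above.

```python
# Sketch of pose-guided generation with ControlNet in diffusers.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The pose image (stick-figure skeleton) fixes the composition; the prompt
# only has to describe style and content.
pose = load_image("pose_reference.png")
image = pipe(
    prompt="an astronaut dancing on the moon, film grain",
    image=pose,
    num_inference_steps=30,
).images[0]
image.save("controlnet_out.png")
```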
Is there a coherent resource (not a scattered 'just google it' series of guides from all over the place) that encapsulates some of the concepts and workflows you're describing? What would be the best learning site/resource for arriving at understanding how to integrate and manipulate SD with precision like that? Thanks
I have found http://stable-diffusion-art.com to be an absolutely invaluable (and coherent) resource. It's highly ranked on Google for most "how to do X with stable diffusion" style searches, too.
I'm going to sound like an entitled whiny old guy shouting at clouds, but what the hell: with all the knowledge being either locked and churned on Discord, or released in the form of YouTube videos with no transcript and extremely low content density, how is anyone with a job supposed to keep up with this? Or is that a new form of gatekeeping: if you can't afford to burn a lot of time and attention, as if in some kind of Proof of Work scheme, you're not allowed to play with the newest toys?
I mean, Discord I can sort of get - chit-chatting and shitposting is easier than writing articles or maintaining wikis, and it kind of grows organically from there. But YouTube? Surely making a video takes 10-100x the effort and cost, compared to writing an article with some screenshots, while also being 10x more costly to consume (in terms of wasted time and strained attention). How does that even work?
I've been playing with SD for a few months now and have only watched 20-30m of YT videos about it. There's only a few worth spending any time watching, and they're on specific workflows or techniques.
Best just to dive in if you're interested IMO. Otherwise you'll get lost in all the new jargon and ideas. Great place to start is the A1111 repo, lot of community resources available and batteries included.
How does anyone keep up with anything? It's a visual thing. A lot of people are learning drawing, modeling, animation etc in the exact same way - by watching YouTube (a bit) and experimenting (a lot).
Picking images from generated sets is a visual thing. Tweaking ControlNet might be too (IDK, I've never got a chance to use it - partly because of what I'm whining about here). However, writing prompts, fine-tuning models, assembling pipelines, renting GPUs, figuring out which software to use for what, where to get the weights, etc. - none of this is visual. It's pretty much programming and devops.
I can't see how covering this on YouTube, instead of (vs. in addition to) writing text + some screenshots and diagrams, makes any kind of sense.
This is the level we're generally working at - first or second party to the authors of the research papers illustrating implementations of concepts, struggling with the Gradio interface, things going straight from commit to production.
It's way less frustrating to follow all of the authors in the citations of the projects you're interested in than wasting your attention sorting through blogspam, SEO, and YT trash just to find out they don't really understand anything, either.
Thank you. I was reluctant to chase after and track first-party research directly, or work directly derived from it, as my limited prior experience told me it's not the most efficient thing unless I want to go into that field of research myself. You're changing my mind about this; from now on, I'll try sticking close to the source.
There's a relatively thin layer between the papers and implementations, which is another way of saying this stuff is still for researchers and assumes a requisite level of background with them. It sounds like you'd benefit from seeking out the first party sources.
This is where video demonstrations come in handy. Since many concepts are novel, it's uncommon to find anyone who deeply understands them, but it's very easy to find people who have picked up on some tricks of the interfaces, which they're happy to click through. I think gradio/automatic1111 makes learning harder than it needs to be by hiding what it's doing behind its UI, while e.g. ComfyUI has a higher initial learning curve but provides a more representational view of the process and pipelines.
Take a moment and go scroll through the examples at civitai.com. Do most of them strike you as something made by people with jobs? Most of them are pretty juvenile, with pretty women and various anime girls.
An operative word here is "people": the set "people with jobs" contains a far higher fraction of folks who like attractive men than is represented here.
I think it would have been convenient for me as well if the AI tool that has access to YouTube videos had been able to answer queries. But it takes 5 minutes to reply, and I forgot its name. It was on the front page recently.
You're not going to get even close to Midjourney or even Bing quality on SD without finetuning. It's that simple. When you do finetune, it will be restricted to that aesthetic and you won't get the same prompt understanding or adherence.
For all the promise of control and customization SD boasts, Midjourney beats it hands down in sheer quality. There's a reason something like 99% of AI art comic creators stick to Midjourney despite the control handicap.
Yet you are posting this in a thread where GP provided actual examples of the opposite. Look for another comment above/below: there are MJ-generated samples which are comparable but also less coherent than the result from a much smaller SD model. And in the case of MJ, hallucinations cannot be fixed. MJ is good, but it isn't magic; it just provides quick results with little experience required. Prompt understanding is still poor, and will stay poor until it's paired with a good LLM.
Neither of the existing models gives actually passable production-quality results, be it MJ or SD or whatever else. It will be quite some time until they get out of the uncanny valley.
> There's a reason like 99% of ai art comic creators stick to Midjourney
They don't. MJ is mostly used by people without experience, think a journalist who needs a picture for an article. Which is great, and it's what makes them good money.
As a matter of fact (I work with artists), for all the surface-visible hate AI art gets in the artist community, many actual artists are using it more and more to automate certain mundane parts of their job to save time, and this is not MJ or Dall-E.
There's a distinction to be made here. Everything that makes SD a powerful tool is the result of being open source. The actual models are significantly worse than Midjourney. If an MJ level model had the tooling SD does it would produce far better results.
It only has the same look if it's not given any style keywords. I've been impressed with the output diversity once it's told what to do. It can handle a wide range of art styles.
>For all the promise of control and customization SD boasts, Midjourney beats it hands down in sheer quality.
The results are comparable, but MJ in this comment https://news.ycombinator.com/item?id=36409043 hallucinates more (look at the roofs in the second picture). And it cannot be fixed, maybe except for an upscale making it a bit more coherent. Until MJ obtains better tooling (which it might in the next iteration), it won't be as powerful. I'm not even starting on complex compositions, which it simply cannot do.
>OP posts results from a tuned model.
Yes, which is the first step you should do with SD, as it's a much smaller and less capable model.
I know what I'm talking about, lol. I tuned a custom SD model that's downloaded thousands of times a month. I'm speaking from experience more than anything.
Don't know why some SD users get so defensive.
First off, are you using a custom model or the default SD model? The default model is not the greatest. Have you tried ControlNet?
But yes SD can be a bit of a pain to use. Think of it like this. SD = Linux, Midjourney = Windows/MacOS. SD is more powerful and user controllable but that also means it has a steeper learning curve.
Damn. I must have exceptionally dark taste in media, because I've definitely got a few "well, that was good, but I kinda wish I hadn't read/watched it" things in my history, and The Road wasn't even close to qualifying. I'd say the film Threads hit me harder, to pick something with similar setting and circumstances, though that's still not quite in that category.
(It Comes at Night is probably my #1 in that category for film, and I guess Watts' Blindsight in books—both messed me up for days after, the former with some depressive nihilism, the latter with days of fairly intense derealization that weren't too fun—and I'm not normally especially prone to either of those, I don't think)
Blindsight is a masterpiece about intelligence without consciousness. De-linking those ideas can be seriously jarring for humans, because we usually consider them two sides of the same coin.
Plus one of the more normal characters investigating the phenomenon just happens to be a vampire who--surprisingly enough--is neither formulaic nor boring.
> Plus one of the more normal characters investigating the phenomenon just happens to be a vampire who--surprisingly enough--is neither formulaic nor boring.
The writing-guide-esque "OK, now write down ten wild elements or characters that certainly do not fit in your world... flip page ... and now add one of them, finding a way to connect it to some other element you've already established" was almost comically transparent, but also so damn effective that I've added it to the ol' toolbox.
(I mean, I don't know that Watts literally did that sort of exercise, exactly, and even doubt that he did, but in my head that's definitely how that part got in there)
I feel pretty certain that the vampires were included in quite the opposite way - especially if you read the sequel, which features their story more heavily. To me, the existence of vampires in the Blindsight setting seems essential.
I found Echopraxia a disappointing follow-up to Blindsight, because the thematic core was much harder to grok, and it felt diffuse and attenuated. It's very hard to write a novel that explores what it means for scientists to encounter the limits of scientific rationality as we understand it.
Interestingly, (and apropos) McCarthy's last, Stella Maris, I think, did a much more eloquent job of exploring the same themes. Stella Maris is astounding, it's cosmic horror without the 'supernatural.' Instead there's only Gödel and Metzinger.
My impression is that Watts is best at first books in a series. Both Blindsight and Starfish were (and are) absolutely incredible to me, but both of their sequels fall short of the original promise. However, from reading his blog, this is not unexpected. He doesn't write sequels to further explore the same ideas, but to move background ideas into the forefront and explore those. This would naturally lead to disappointment if the reader wants more of the original premise.
That said, I think Echopraxia is a better sequel than Maelstrom and a pretty good book on its own. He is working on a final book in the Blindopraxia trilogy and I expect it to move even further from the wonder of the original book, but now that my expectations are set I am looking forward to it all the same.
I know this won't apply to everyone, but Blindsight is one of the few pieces of media that noticeably changed my life. Here be profound ideas, although your mileage may vary, especially if you're neurotypical or already very well read in concepts of truly alien intelligence. (The space probe chapter was one of the most life-changing passages I've ever read, so that may tell the reader something about my autism.)
I read the entire thing in HTML on my phone on the author's web site over a week or so. After the first few days of catching bits during breaks, I found myself sitting at home on the porch just reading from my phone - not typical for me. (I did later buy a copy.)
It Comes at Night didn't have anything come at night. It was derivative of everything else in the genre to the point of being a snore. I was so disappointed. One of my least favorite A24 films.
I suppose I was expecting peak Shyamalan, but A24.
Yeah, it was mostly just a slow-burn misery-fest. Now that I think about it, I'm not sure I'd put it in the "good, but also horrible" category, exactly. Like, I don't think there's actually much to it other than the misery. Not like a Funny Games, say, that's thoroughly miserable but also doing some other, interesting things that both justify and require the misery.
I like that you can play through both The Last of Us games and think "What a grim existence, scraping together things to cobble a defence against relentless threats. Horrific."
And compare it to The Road, whose environment would have you begging for a holiday to The Last of Us. Not just the lack of food, but the lack of ability to grow new food, to rest, to warm, to heal. The constant, dogged, thinking, scheming threats. And your kid isn't pushing ladders down to you nearly enough.
Please back up the claim "climate models has a terrible track record" and qualify the word 'terrible'.
Casting blame is a common denier tactic [1] used despite the models being useful and accurate [2].
The "whataboutisms" you mention are another common tactic. [3] Blaming EV batteries is a red herring; they have much lower lifecycle emissions than gas-based engines. [4]
These are all generally highly reviewed and can lead to interesting discussions:
Gattaca - interesting take on genetics/DNA discrimination
Europa - hard science fiction, maybe a bit slow for a 12 year old
The Net - older movie but the concept of digital exile resonates today
City of Ember - post apocalyptic civilization that lives deep underground
First Man - dramatization of Neil Armstrong and the first moon landing (not really "fiction")
Arrival - first contact situation, tastefully done
Contact - another first contact situation that explores how politics, skepticism, and fanaticism react
The Martian - easily as fun as the book
The Prestige - competing magicians in an industrial age setting
The Hunger Games - extreme class divide in a future setting
Jurassic Park - the original not the sequels
Stargate - wormhole travel to another planet (be careful with the TV shows though, SG1 on streaming has full frontal nudity and "not ok" scenes which were obviously not on the broadcast version and a total shock when we watched it as a family)
District 9 - has some graphic gore and language, so it might be 14+, but is an interesting look at aliens as refugees
Contagion - a look at how a pandemic could play out. Premiered before COVID.
The Maze Runner - interesting setting and look at group dynamics
Honey, I Shrunk the Kids - fun setting
The Village - scary movie with a sci fi twist
Galaxy Quest - comedy
Short Circuit - old movie that deals with AI sentience
Innerspace - another old one, but has some fun concepts
Based on the GIF, I thought it created an animated video.
Even when the .PNG downloaded I thought for sure it’d be an animated PNG.
If I’m doing some content creation, I probably already have an image editor, in which case I can create this effect myself or would prefer an integrated plugin to do it.
Motion graphics is much harder, and there’s more demand there to add some sparkle to a static image. OP, have you considered that angle?
OP here. Sorry to disappoint you with the lack of animation. I have definitely entertained the idea of creating a video variant of this app. I fear it will remain a pipe dream due to the demands of my day job and family life.
I think they mean that the animated gif[0] on your site is misleading, as it shows an animated transition that doesn't actually occur in the product.
Fortunately, robotic capability like that basically becomes the equivalent of Nuclear MAD.
Unfortunately, the virus approach probably looks fantastic to extremist bad actors with visions of an afterlife.