Show HN: AI assisted image editing with audio instructions (github.com/shashekhar)
95 points by ShaShekhar 9 months ago | 30 comments
Excited to launch AAIELA, an AI-powered tool that understands your spoken commands and edits images accordingly. By leveraging open-source AI models for computer vision, speech-to-text, large language models (LLMs), and text-to-image inpainting, we have created a seamless editing experience that bridges the gap between spoken language and visual transformation.

Imagine the possibilities if Google Photos integrated voice assisted editing like AAIELA! Alongside Magic Eraser and other AI tools, editing with audio instruction could revolutionize how we interact with our photos.




Forgot to share this link as well. Not sure if you're aware of it, but it's a great write-up on fine-tuning small local models on specific APIs, and it seems like it would be a perfect fit for your project. https://bair.berkeley.edu/blog/2024/05/29/tiny-agent/


I integrated and tested Microsoft's Phi-3-mini and it works really well. Having the freedom to run locally without sharing private photos is my primary objective.
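
For illustration, here is a minimal sketch of what that local step can look like with the Hugging Face transformers library. The model id is the public Phi-3-mini instruct release, and the JSON schema is just a made-up example, not AAIELA's actual format:

    import json
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the public Phi-3-mini instruct checkpoint locally (no photo or text leaves the machine).
    model_id = "microsoft/Phi-3-mini-4k-instruct"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # Ask the model to turn a transcribed instruction into a structured edit plan.
    messages = [{
        "role": "user",
        "content": 'Turn this edit request into JSON like {"target": "...", "prompt": "..."} '
                   'and reply with JSON only: Replace the sky with a deep blue sky.',
    }]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=96, do_sample=False)
    reply = tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
    plan = json.loads(reply)  # e.g. {"target": "sky", "prompt": "a deep blue sky"}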


Example instructions:
1. Replace the sky with a deep blue sky, then replace the mountain with a Himalayan mountain covered in snow.
2. Stylize the car with a cyberpunk aesthetic, then change the background to a neon-lit cityscape at night.
3. Replace the person with a sculpture complementing the architecture.

Check out the Research section for more complex instructions.


We're so close to being able to create our own Tayne

(https://www.youtube.com/watch?v=a8K6QUPmv8Q)


Love it! Voice interaction is a great modality for UI. A lot of people have a bad taste left over from early attempts, but I expect to see a lot of progress made now that STT and natural language understanding are so much better.

The biggest reason we should be adding conversational UI to everything is the harm done by RSI and sedentary keyboard and mouse interfaces. We're crippling entire generations of people by sticking to outdated hardware. The good news is we can break free of this now that we have huge improvements in LLMs and AR hardware. We'll be back to healthy levels of activity in 5 to 10 years. Sorry Keeb builders, it's time to join the stamp collectors and typewriter enthusiasts. We'll be working in the park today.


I'd like to see a voice instruction layer that can work independently of the mouse/keyboard without stealing focus. Things like moving files, or preparing and positioning windows prior to switching.


One big problem would be that in open office environments there would be a lot of noise. I wonder if some sort of active noise cancellation could be introduced so the voices of your co-workers could be ~completely canceled out if you are wearing special headphones?


When I consider my own LLM workflow, the amount of time spent reading/listening/thinking outweighs the amount of time spent typing/speaking. If that's any indication of how a fully fledged conversational workflow would work, then I don't think open-plan offices would be much louder than they currently are.

Depending on how quickly agentic LLMs are developed, I'm not even sure we will be using offices the same way we do now. We might only need to meet or check in with our coworkers and our LLM agents every few hours, or once a day, or maybe even less often, in order to realign and check on results. Maybe we'll get occasional messages asking us to confirm something or provide clarification. I could honestly see most knowledge work evaporating, leaving behind only high-level coordination, research and ideation.

Before that, I'm certain we'll all be spending a lot more time reviewing work, trying out prototypes and tweaking prompts or specifications than we do typing or talking.


Have you tried sitting in a park for hours, talking out loud and seeing what happens?


Isn't that just like taking a phone call? I'm not sure what you're trying to imply.


I guess there are differences from country to country, but in some places you would not be left alone.


Ignoring the snark: this will change as the technology is adopted. Go back 40 years (or even less) and a person walking around staring at a little black rectangle would have been perceived as weird and anti-social. We used to make fun of people talking on the phone via Bluetooth headsets, and now everyone does it with AirPods or whatever.

If you've got the technology to seamlessly transition from working in your home, to working outside at a cafe, to working on a blanket under a tree in the park, to working wherever you feel like it, then there will be enough brave people who say "fuck what other people think" and just do it so they can enjoy being active and getting fresh air. Eventually more and more people will join them, and we'll reach the point where sitting inside at a desk for 8-12 hours is the weird thing.


Nice job. I actually experimented with a chat-driven InstructPix2Pix sort of interface that connected via API to a Stable Diffusion backend. The big problem is that it's difficult to know whether the inpainting job you've done is satisfactory to the user.

This is why, when you're doing this sort of traditional inpainting in automatic1111, you usually generate several iterations with various mask blurs, whole-picture vs. only-masked-section settings, and padding; and of course the optimal inpainting checkpoint model to use depends on whether the original image is photorealistic or illustrated, etc.
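
A batch like that can also be scripted outside the UI. Here is a rough diffusers sketch; the checkpoint name, blur radii and seeds are illustrative choices, not tuned values:

    import torch
    from PIL import Image, ImageFilter
    from diffusers import StableDiffusionInpaintPipeline

    # Load an SD 1.5-class inpainting checkpoint (any equivalent inpainting model works).
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("photo.png").convert("RGB").resize((512, 512))
    mask = Image.open("sky_mask.png").convert("L").resize((512, 512))  # e.g. from a segmentation model

    # Generate several candidates by varying mask blur and seed, then let the user pick.
    candidates = []
    for blur in (0, 4, 8):
        soft_mask = mask.filter(ImageFilter.GaussianBlur(blur))
        for seed in (0, 1):
            g = torch.Generator("cuda").manual_seed(seed)
            out = pipe(prompt="a deep blue sky", image=image,
                       mask_image=soft_mask, generator=g).images[0]
            candidates.append(out)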


Right now, the inpainting is done on a semantic mask (the output of a segmentation model). For more complex instructions, we also have to support contextual mask generation, which is an active area of research in the field of vision-language models. When it comes to performing several iterations, you can also do that at the semantic level or get a batch of outputs. The SD v1.5 inpainting model is quite weak, and we haven't seen a large-scale open-source inpainting model for a while.
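
To make that concrete, the semantic-mask step is roughly the sketch below; the segmentation model and the label matching are just an example, not the exact models AAIELA ships with:

    from transformers import pipeline

    # Off-the-shelf semantic segmentation; ADE20K labels include common classes like "sky".
    segmenter = pipeline("image-segmentation",
                         model="nvidia/segformer-b0-finetuned-ade-512-512")
    segments = segmenter("photo.png")  # list of {"label": ..., "mask": <PIL image>, ...}

    # Pick the mask for the object named in the parsed instruction, then hand it to inpainting.
    target = "sky"
    mask = next(s["mask"] for s in segments if s["label"] == target)
    mask.save("sky_mask.png")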


Super cool! We're building an API that makes it easy to build chained multi-model workflows like this that run with zero latency between tasks - https://www.substrate.run/


It didn't just replace the sky and background, it replaced the trees. That wasn't part of the instructions.


I love how in the demo video, even the audio instructions themselves are AI generated. No human in the loop, at all! :)


I did it intentionally. The video had my voice, but then I decided to replace it with an AI voice.


Very cool - which method do you use for editing the images? Is it SDEdit or InstructPix2Pix, or another one?


Thanks. Stable Diffusion inpainting v1.5. I'd played around with this model so much that I ended up using it. I've read both papers: SDEdit, where you need a mask for inpainting, and InstructPix2Pix, where you don't. I know I'm a year behind when it comes to using newer models like LEDITS++, LCM, SDXL inpainting, etc. There is so much work to do. VCs won't fund me as it's not a B2B spinoff.


InstructPix2Pix is fine-tuned from SD v1.5 and is conditioned directly on the input image (so it's aware of context and semantics); that's why it doesn't require a mask.
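
For comparison, the mask-free path looks something like this with diffusers; the timbrooks checkpoint is the public release, and the guidance values are just typical defaults:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInstructPix2PixPipeline

    # InstructPix2Pix edits from the instruction plus the original image, no mask required.
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("photo.png").convert("RGB")
    edited = pipe("replace the sky with a deep blue sky", image=image,
                  num_inference_steps=20, image_guidance_scale=1.5).images[0]
    edited.save("edited.png")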


Soon the movie trope of saying "enhance" repeatedly could be a real thing!


This pitches a lot but only seems to support a specific inpainting operation?


The tools are there; we just have to connect them (check out the TODO section). More complex instructions, like when you want to create the mask, require a lot of contextual reasoning, which I tried to point out in the Research section.


Wow! We're now just a hair's-width away from finally being able to say, "Computer, enhance image!" without sounding like we're in a bad sci-fi show.


Think the only thing the historical science fiction / Blade Runner photo inspection scene[0] didn't foresee was vocally having the AI assist with and analyze the photo to summarize a list of items/objects available to zoom to and view (vs. panning/zooming around). Although altavista glasses / hand gestures[3] would have been a future concept at the time, too.

----

[0] : https://scifiinterfaces.com/2020/04/29/deckards-photo-inspec...

[1] 'mirror reality' image / TERI[2] : https://www.hackster.io/news/blade-runner-s-image-enhancemen...

[2] : TERI, an almost-IRL Blade Runner movie image enhancement tool : https://news.ycombinator.com/edit?id=40844595 / https://github.com/iscilab2020/TERI-3DNLOS/tree/TERI

[3] : Gest : https://news.ycombinator.com/edit?id=40844704


Using Whisper as the voice interface, an LLM to understand the prompt and issue function-call commands, and an image upscaler, you could build this in a weekend. Would it be useful? Not especially by itself, but I think there is a lot of promise in voice interaction with LLM-operated software.
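
Something like the skeleton below, say; the routing table and the upscale() helper are hypothetical stand-ins for a real LLM function-calling step and a real upscaler:

    import whisper
    from PIL import Image

    def upscale(path, factor=2):
        # Placeholder upscaler; swap in ESRGAN or similar for real results.
        img = Image.open(path)
        img.resize((img.width * factor, img.height * factor), Image.LANCZOS).save("out.png")

    # Hypothetical tool table; in practice an LLM function-calling step would pick the tool.
    TOOLS = {"enhance": upscale, "upscale": upscale}

    model = whisper.load_model("base")
    text = model.transcribe("command.wav")["text"].lower()  # speech-to-text
    for keyword, fn in TOOLS.items():                        # naive keyword routing
        if keyword in text:
            fn("photo.png")
            break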


Make it so!


gMake it, you gAught it. (once there's enough bandwidth to go around[0])

[0] : Intel CPU with OCI Chiplet Demoed with 4Tbps of Bandwidth and 100M Reach : https://news.ycombinator.com/item?id=40844616


Zoom. Enhance!





