
Great article!

This is one of the most overlooked problems in generative AI. It seems so trivial, but in fact, it is quite difficult. The difficulty arises because of the non-linearity that is expected in any natural motion.

In fact, the author has highlighted all the possible difficulties of this problem much better than I could.

I started with a simple implementation, trying to move segments around the image using a segmentation mask + ROI. That strategy didn't work out, probably because of some mathematical bug or data insufficiency. I suspect the latter.

The whole idea was to draw a segmentation mask on the target image, then draw lines that represent motion and give options to insert keyframes for the lines.

Imagine you are drawing a curve from A to B. You divide the curve into A, A_1, A_2, ..., B.

Now, given the segmentation mask, the motion curve, and the whole image as input, we train some model to move only the ROI according to the motion curve and keyframes.

The problem with this approach is in sampling the keyframes and enforcing consistency (making sure the ROI represents the same object) across subsequent keyframes.

If we are able to solve some form of consistency, this method might be able to give enough constraints to generate viable results.
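
To make the idea concrete, here is a minimal sketch (my own toy code, not from the article) of the "divide the curve into A, A_1, A_2, ..., B" step: resampling a hand-drawn motion polyline into evenly spaced keyframe positions. The function name and the polyline representation are assumptions.

    # Toy sketch: turn a hand-drawn motion curve (a polyline of (x, y) points)
    # into n evenly spaced keyframe positions by arc length.
    import numpy as np

    def sample_keyframes(curve_xy, n_keyframes):
        deltas = np.diff(curve_xy, axis=0)
        seg_len = np.linalg.norm(deltas, axis=1)
        arc = np.concatenate([[0.0], np.cumsum(seg_len)])    # cumulative arc length
        targets = np.linspace(0.0, arc[-1], n_keyframes)     # where each keyframe sits
        xs = np.interp(targets, arc, curve_xy[:, 0])
        ys = np.interp(targets, arc, curve_xy[:, 1])
        return np.stack([xs, ys], axis=1)                    # A, A_1, ..., B

    # Example: a rough stroke divided into 5 keyframe positions.
    curve = np.array([[0, 0], [10, 2], [20, 8], [30, 20]], dtype=float)
    print(sample_keyframes(curve, 5))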


I currently have 3K more words shelved on why it's hard if you're targeting real animators. One point is that human inbetweeners get "spacing charts" showing how much each part should move, even though they understand motion very well, because the key animator wants to control the acting.


Location: Germany

Remote: Yes

Willing to relocate: Yes

Technologies: Python, C++, CUDA, Blender, Unity3D

Resume: https://shailesh-mishra.com/resume.pdf

Email: check bio in HN or in resume

More info:

I am a recent graduate specializing in computer graphics (character animation) and computer vision (inverse rendering). I am currently in Germany and would be happy to relocate as long as I get proper visa sponsorship.

I am comfortable with codebases of any kind. My personal website highlights the kinds of projects I have worked on in the past, along with my university.

Regarding LLMs, I am one of the co-authors of an indirect prompt injection paper [1]. I also have a good intuition for the math behind diffusion models, so I am open to GenAI-related roles as well.

Feel free to contact me about open positions.

[1]: https://dl.acm.org/doi/abs/10.1145/3605764.3623985


> "internal framework developer for MegaCo"

NGL, because this is actually my dream job.


A lot of people look for jobs that seem like the kind their friends will go "Wow, cool!" at. In practice those are mostly orthogonal to jobs that reliably pay fairly well.


This is what a truly revolutionary idea looks like. There are so many details in the paper. Also, we know that transformers can scale. Pretty sure this idea will be used by a lot of companies to train the general 3D asset creation pipeline. This is just too great.

"We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh."

This idea is simply beautiful and so obvious in hindsight.

"To define the tokens to generate, we consider a practical approach to represent a mesh M for autoregressive generation: a sequence of triangles."

More from the paper. Just so cool!
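
For anyone wondering what "a sequence of triangles" can look like in practice, here is a deliberately naive sketch of serializing a mesh into discrete tokens by quantizing vertex coordinates. This is not what the paper actually does (it learns latent quantized embeddings with graph convolutions rather than quantizing raw coordinates); it only illustrates the autoregressive framing.

    # Naive illustration: a mesh as a flat token sequence of triangles.
    # Each face becomes 9 integer tokens (3 vertices x 3 quantized coords).
    import numpy as np

    def mesh_to_token_sequence(vertices, faces, n_bins=128):
        # vertices: (V, 3) floats in [0, 1]; faces: (F, 3) vertex indices.
        quantized = np.clip((vertices * (n_bins - 1)).round().astype(int), 0, n_bins - 1)
        tokens = quantized[faces]        # (F, 3, 3)
        return tokens.reshape(-1)        # one long sequence for autoregressive generation

    # Example: a unit square split into two triangles -> 18 tokens.
    verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
    faces = np.array([[0, 1, 2], [0, 2, 3]])
    print(mesh_to_token_sequence(verts, faces))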


It's cool, but it's also par for the course in 3D reconstruction today. I wouldn't describe this paper as particularly innovative or exceptional.

What do I think is really compelling in this field (given that it's my profession)?

This has me star-struck lately: 3D meshing from a single image, a very large 3D reconstruction model trained on millions of 3D models of all kinds... https://yiconghong.me/LRM/


Another thing to note here is that this looks to be around seven total days of training on at most 4 A100s. Not all cutting-edge work requires a data-center-sized cluster.


Can someone explain quantized embeddings to me?


NNs are typically continuous/differentiable so you can do gradient-based learning on them. We often want to use some of the structure the NN has learned to represent data efficiently. E.g., we might take a pre-trained GPT-type model, and put a passage of text through it, and instead of getting the next-token prediction probability (which GPT was trained on), we just get a snapshot of some of the activations at some intermediate layer of the network. The idea is that these activations will encode semantically useful information about the input text. Then we might e.g. store a bunch of these activations and use them to do semantic search/lookup to find similar passages of text, or whatever.
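
A minimal sketch of that "snapshot of intermediate activations" idea, using GPT-2 through Hugging Face transformers. The layer choice and the mean pooling are arbitrary assumptions, just to show the mechanics:

    # Take hidden states from an intermediate layer as a passage embedding.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")
    model.eval()

    def embed(text, layer=6):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[layer]: (1, seq_len, hidden_dim); mean-pool over tokens.
        return out.hidden_states[layer].mean(dim=1).squeeze(0)

    a = embed("The cat sat on the mat.")
    b = embed("A kitten rested on the rug.")
    print(torch.cosine_similarity(a, b, dim=0))   # similar passages -> high similarity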

Quantized embeddings are just that, but you introduce some discrete structure into the NN, such that the representations there are not continuous. A typical way to do this these days is to learn a codebook VQ-VAE style. Basically, we take some intermediate continuous representation learned in the normal way, and replace it in the forward pass with the nearest "quantized" code from our codebook. It biases the learning since we can't differentiate through it, and we just pretend like we didn't take the quantization step, but it seems to work well. There's a lot more that can be said about why one might want to do this, the value of discrete vs continuous representations, efficiency, modularity, etc...
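
Here is a bare-bones sketch of that codebook lookup with the straight-through trick, in PyTorch. The codebook size and dimensions are made up, and the usual commitment/codebook losses are left out for brevity:

    # Nearest-code quantization with a straight-through gradient estimate.
    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes=512, dim=64):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z):
            # z: (batch, dim) continuous activations from the encoder.
            dists = torch.cdist(z, self.codebook.weight)   # (batch, num_codes)
            idx = dists.argmin(dim=1)                       # nearest code per input
            z_q = self.codebook(idx)                        # quantized embeddings
            # Forward pass uses z_q; backward pretends quantization never happened.
            return z + (z_q - z).detach()

    vq = VectorQuantizer()
    z = torch.randn(8, 64, requires_grad=True)
    vq(z).sum().backward()        # gradients flow to z as if there were no quantization
    print(z.grad.shape)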


If you’re willing, I’d love your insight on the “why one might want to do this”.

Conceptually I understand embedding quantization, and I have some hint of why it works for things like WAV2VEC - human phonemes are (somewhat) finite, so forcing the representation to be finite makes sense - but I feel like there's a level of detail I'm missing regarding what's really going on, and when quantisation helps or harms, that I haven't been able to glean from papers.


Quantization also works as regularization; it stops the neural network from being able to use arbitrarily complex internal rules.

But it's really only useful if you absolutely need a discrete embedding space for some sort of downstream usage. VQ-VAEs can be difficult to get to converge; they have problems stemming from the approximation of the gradient, like codebook collapse.


Maybe it helps to point out that the first version of Dall-E (of 'baby daikon radish in a tutu walking a dog' fame) used the same trick, but they quantized the image patches.


> Also, we know that transformers can scale

Do we have strong evidence that other models don't scale or have we just put more time into transformers?

Convolutional resnets look to scale on vision and language: (cv) https://arxiv.org/abs/2301.00808, (cv) https://arxiv.org/abs/2110.00476, (nlp) https://github.com/HazyResearch/safari

MLPs also seem to scale: (cv) https://arxiv.org/abs/2105.01601, (cv) https://arxiv.org/abs/2105.03404

I mean, I don't see a strong reason to turn away from attention either, but I also don't think anyone has thrown a billion-parameter MLP or conv model at a problem. We've put a lot of work into attention, transformers, and scaling them. Thousands of papers each year! We definitely don't see that for other architectures. The ResNet Strikes Back paper is great, one reason being that it should remind us all not to get lost in the hype and that our advancements are coupled: we have learned a lot of training techniques since the original ResNet days, and pushing those onto ResNets also makes them a lot better and really closes the gap, at least in vision (where I research). It is easy to get railroaded in research when we have publish-or-perish and hype-driven reviewing.


How does this differ from similar techniques previously applied to DNA and RNA sequences?


...Is graph convolution matrix factorization by another name?


No... a graph convolution is just a convolution (over a graph, like all convolutions).

The difference from a "normal" convolution is that you can consider arbitrary connectivity of the graph (rather than the usual connectivity induced by a regular Euclidean grid), but the underlying idea is the same: to calculate the result of the operation at any single place (i.e., node), you perform a linear operation over that node and its neighbourhood (i.e., connected nodes), the same way that, e.g., in a convolutional neural network you calculate the value of a pixel by considering its value and that of its neighbours when performing a convolution.
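
A toy example of that "linear operation over a node and its neighbourhood" idea, using the common Kipf & Welling style normalized adjacency (nothing here is specific to the paper being discussed):

    # One GCN-style graph convolution: aggregate each node with its neighbours,
    # then apply a shared linear projection.
    import numpy as np

    def graph_conv(X, A, W):
        # X: (N, F) node features, A: (N, N) adjacency, W: (F, F_out) weights.
        A_hat = A + np.eye(A.shape[0])              # self-loops so each node sees itself
        deg = A_hat.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))    # symmetric normalization
        return D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W

    # Example: 3-node path graph, 2 features per node, projected to 4.
    A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    X = np.random.randn(3, 2)
    W = np.random.randn(2, 4)
    print(graph_conv(X, A, W).shape)                # (3, 4)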


This field moves so fast. Blink an eye and there is another new paper. This is really cool, and the learning speed of us humans is insane! Really excited to use it for downstream tasks! I wonder how easy it would be to integrate animatediff with this model.

Also, can someone benchmark it on M3 devices? It would be cool to see whether they are worth getting for running these diffusion inferences and for development. If the M3 Pro allows finetuning, it would be amazing to use it on downstream tasks!


Been using a splitkb for a year now. My only regret is that I didn't buy it sooner.

Seriously, nothing compares to the typing experience of a split keyboard. Open shoulders are great, relaxed arms and wrists are a blessing, and then there are layers. Those are sooooo awesome when used in the correct way!

It has really been an amazing experience. I can't imagine going back to normal keyboards.


I switched to Colemak-DH on a split keyboard. My hands thank me so much. They are always rested!

And the shoulders! The best part of a split keyboard is the open shoulders. It feels good to work on a split keyboard. One of the best investments!

I am glad you created a layout for yourself!

Considering speed, I have fully recovered and even surpassed my QWERTY speed. I am average (~60 wpm) when it comes to typing speed, so it wasn't hard to catch up.

My switching experience was also similar to yours: get to ~35 wpm and start using it. It took me 14 hours to reach there (1 hour of deliberate practice every day for 2 weeks).


I did the same but just went cold turkey (from a TKL mechanical with a QWERTY layout to a split using Colemak-DH). The one I got is also programmable and I eliminated the need to do a lot of weird hand gymnastics. It's only been a couple of weeks and I'm not back up to my old speed, but my shoulders and wrists feel so good!

And I can do about 40 wpm now, so I know I'll get there eventually.


Do you have large hands? I was wondering if Colemak-DH has less utility for people who don't feel the stretch is significant when using Colemak. I've been using Colemak for 11 years and it's saved my wrists from RSI.


It's my understanding that Colemak-DH is in general an improvement (and I agree), but it's not so significant that you should feel the need to switch, especially if you're happy with what you have.

I'm also under the illusion that Colemak-DH is preferable if you have smaller hands, because lateral movement is more demanding the smaller your hands are.


Not at all! That is also the reason I use DH: I didn't like the stretch.

