So if I'm understanding correctly, you decompose kernels into their per_sm_workload, then you figure out per_sm_data_dependency, and then you can schedule sm_workloads from the next kernel to start running as soon as the data dependency is satisfied, without waiting for the other SMs from the previous kernel to finish.
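Something like this toy persistent-block sketch is how I'm picturing it (my own pseudocode, not your actual runtime; WorkItem, run_op, tile_done, etc. are made-up names):

    // Toy sketch of the scheduling described above. All names here
    // (WorkItem, run_op, tile_done, ...) are hypothetical, not the
    // authors' actual API.
    struct WorkItem {
        int dep_tile;   // tile from the previous kernel this workload reads
        int out_tile;   // tile this workload produces
    };

    __device__ unsigned int next_item = 0;
    __device__ unsigned int tile_done[4096];   // one flag per producer tile

    __device__ void run_op(const WorkItem &w) { /* fused per-SM workload body */ }

    __global__ void megakernel(const WorkItem *items, int n_items) {
        __shared__ unsigned int my_item;
        for (;;) {
            if (threadIdx.x == 0)
                my_item = atomicAdd(&next_item, 1u);   // each resident block grabs one per-SM workload
            __syncthreads();
            if (my_item >= (unsigned)n_items) return;

            WorkItem w = items[my_item];

            // Wait only for the specific tile this workload depends on,
            // instead of waiting at a grid-wide kernel boundary.
            if (threadIdx.x == 0)
                while (atomicAdd(&tile_done[w.dep_tile], 0u) == 0u) { }
            __syncthreads();

            run_op(w);                                 // start as soon as the dependency is satisfied

            __threadfence();                           // make results visible to other SMs
            if (threadIdx.x == 0)
                atomicExch(&tile_done[w.out_tile], 1u);
            __syncthreads();
        }
    }

(Presumably the real scheduler also has to keep producers and consumers co-resident so the spin-wait can't deadlock, but that's beyond this sketch.)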
In this case, are you strictly fusing pre-defined kernels or are you also optimizing them? Is this complementary to your earlier work on search-based compilers?
That's reasonably accurate; we're fusing both pre-defined operations and codegenned operations. Block-level operations live inside the search space, as do kernel-, warp-, and thread-level operations. Since it's a unified search space, we can look through tons of combinations of kernel, block, warp, and thread level ops. When we go to compile them to runnable code, thread ops get compiled to warp ops, warp ops get compiled to block ops, and block ops get compiled to kernel ops (megakernels live here!), so at the end of the day everything that gets run is a kernel (rough sketch below).
In other words, very complementary to our search-based approach.
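A heavily simplified host-side sketch of what that lowering looks like (illustrative only; Level, Op, and lower are not our actual IR):

    // Illustrative only: each op in the search space carries a level, and
    // compilation repeatedly wraps an op at one level into an op at the
    // next level up until everything is a kernel.
    #include <utility>
    #include <vector>

    enum class Level { Thread, Warp, Block, Kernel };

    struct Op {
        Level level;
        std::vector<Op> children;   // ops fused underneath this one
        // ... the actual compute description would live here
    };

    // Thread op -> body of a warp op -> body of a block op -> body of a
    // kernel op (a megakernel when many block ops end up fused into one).
    Op lower(Op op) {
        while (op.level != Level::Kernel) {
            Op parent;
            parent.level = static_cast<Level>(static_cast<int>(op.level) + 1);
            parent.children.push_back(std::move(op));
            op = std::move(parent);
        }
        return op;   // everything that runs is, in the end, a kernel
    }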
What they are doing is plainly cheating the system to get their 3 conference papers so they can land their $150k+ job at FAANG. It's plain cheating, with no value.
Confront the culprit and ask for their side; you'll just get some sob story about how busy they are and how they were only using the AI to check their grammar and they just don't know how the whole thing ended up fabricated... Waste of time. Just blacklist these people, they're no better than any other scammer.
Rookie numbers. After NeurIPS main conference, you’re dumb not to ask for 300K YOY. I watched IBM pay that amount prorated to an intern with a single first author NeurIPS publication.
A lot of research in AI/ML seems to me to be "fake it and never make it". Literally it's all about optics, posturing, connections, publicity. Lots of bullshit and little substance. This was true before AI slop, too.
But the fact that AI slop can make it past review really showcases how much a paper's acceptance hinges on things other than its substance and results.
I even know PIs who got fame and funding based on some research direction that is supposedly going to be revolutionary. Except all they had were preliminary results that, from one angle, if you squint, hint at some good result. But then the result never comes. That's why I say, "fake it, and never make it".
cuEquivariance is unfortunately closed-source (the actual .cu kernels), but OP's work targets a consumer GPU and a very small particle system, so it's hard to compare anyway.