That reads to me like: "I'm not playing honey, this is work!" I sure hope the team doing this had a lot of fun :)
The algorithm itself is OK but doesn't appear all that novel to me. It looks like they're using ShapeNet (or similar dataset) to train a 3D voxel autoencoder which can predict geometry from an image. And then they convert those voxels to LEGO instructions.
For the first part, I've seen equally good papers before, for example from the ShapeNet team. For the second part, there are working explicit algorithms. So their main contribution appears to be combining two working things into a (somewhat) useful whole.
Seriously, though, I have a real-world problem which needs a solution: given the roughly 3000 loose bricks in my son's Lego collection, tell me which ones go with which sets? Thank you - we can take it from here.
The difficulty is that some bricks look close enough to other bricks that the image recognition might not be accurate. But some margin for error is fine.
2017 - Sorting 2 Metric Tons of Lego
2019 - AI-powered Lego sorting machine built with Lego bricks
We mothballed it because it would not be possible to build it at a price point that made commercial sense.
Then perhaps the AI experts are on track for a successful prediction:
> The team also predicts that by the year 2023, a machine will be able to “physically assemble any Lego set given the pieces and instructions, using non-specialized robotics hardware.”
What are the benefits of generating a lego set from an image, as opposed to generating a lego set from a 3d model (and doing the “image to 3d model” conversion with a different tool)?
Or using your phone's LIDAR sensors
I only took it to the paper/whiteboard design stage. The problem I had was that voxels become wildly more memory-hungry once you introduce more than a binary there/not-there value per cell. I was looking at 10-15 megabytes in memory just to describe a simple 2x4 brick at the minimum resolution that still worked as a Lego brick. At the time I was lucky to have 96MB... The exe was pretty small (maybe 1-2MB total). The data, on the other hand, I could not find a nice way to compress enough, so a few dozen bricks would fill my memory space. Mostly I got bored with the project and moved on.
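A back-of-the-envelope sketch of where that blow-up comes from. The grid dimensions and bytes-per-cell here are my own illustrative assumptions, not the original project's numbers, but they land in the same 10-15MB ballpark:

```python
# Memory estimate for a dense voxel grid of a single 2x4 brick.
# Dimensions and cell sizes below are assumptions for illustration only.

def voxel_memory_bytes(dims, bytes_per_cell):
    """Memory for a dense 3D grid of `dims` cells at `bytes_per_cell` each."""
    x, y, z = dims
    return x * y * z * bytes_per_cell

# Binary occupancy at a coarse grid: tiny.
coarse = voxel_memory_bytes((16, 32, 12), 1)    # 6,144 bytes, ~6 KB

# Fine enough to capture studs, tubes, and wall thickness, with a
# multi-byte cell type (material, connection flags, ...): it explodes.
fine = voxel_memory_bytes((160, 320, 120), 2)   # 12,288,000 bytes, ~12 MB

print(coarse, fine)
```

A 10x increase in linear resolution is a 1000x increase in cells, which is why the minimum resolution that "still worked as a Lego" was so punishing.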
The 'voxel' style has one very nice quality over the ldraw CAD system: unintentional gaps are nearly impossible to create, and overlap is dead easy to find. Correct overlap lets you do interesting things, in that you can intersect parts correctly and it will just work on the snap grid. The ldraw system has to jump through a few hoops to make it look like it is working. The effect is the same, but it is more of a pain making sure your floating point is correct. ldraw has a nice quality in that rotating an object in free space is 'cheap': you only have to rotate a few points. Whereas in a voxel system like the one I designed, you in effect have to rotate every cell.
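The cost asymmetry can be sketched in a few lines. This is my own illustration under simplified assumptions (a point list for the ldraw-style case, a dense nested-list grid for the voxel case), not either system's actual code:

```python
import math

def rotate_points_z(points, degrees):
    """Rotate a small list of (x, y, z) vertices about Z: O(#points)."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    return [(x * c - y * s, x * s + y * c, z) for x, y, z in points]

def rotate_voxels_z90(grid):
    """Rotate a dense grid[x][y][z] by 90 degrees about Z.

    Every single cell is read and rewritten, so the cost is O(total cells)
    no matter how simple the part is.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    return [[[grid[nx - 1 - j][i][z] for z in range(nz)]
             for j in range(nx)]
            for i in range(ny)]
```

A brick's ldraw-style mesh might be a few dozen vertices; the same brick as a fine voxel grid is millions of cells, so the same rotation is orders of magnitude more work.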
At this time the ldraw system is the choice to use if you are doing this though. When I built my system there were maybe 200 different bricks total. Now there are thousands.
With ML detection you can get decently far; the hard part is 'hidden' features. Depending on orientation, some pieces will hide their distinguishing features. For example, a 2x4 flat looks the same as a 2x2 flat from many orientations (an extreme example, but it shows off the effect nicely). I have seen some people tumble the piece, or use a few cameras, to mitigate the issue.
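One way the multi-camera mitigation can work is simple score fusion across views, so a single view that hides the distinguishing feature can't decide the label on its own. This is a hypothetical sketch (the `fuse_views` function and the example scores are mine, not from any of the linked projects):

```python
from collections import defaultdict

def fuse_views(view_scores):
    """Sum per-label scores across camera views; highest total wins.

    `view_scores` is a list of dicts, one per view, mapping label -> score
    from some per-image classifier (a stand-in for any real model).
    """
    totals = defaultdict(float)
    for scores in view_scores:
        for label, p in scores.items():
            totals[label] += p
    return max(totals, key=totals.get)

# End-on, a 2x4 plate and a 2x2 plate look alike (a coin flip for the
# classifier); the top-down view breaks the tie.
end_on = {"2x4_plate": 0.5, "2x2_plate": 0.5}
top    = {"2x4_plate": 0.9, "2x2_plate": 0.1}
print(fuse_views([end_on, top]))   # -> 2x4_plate
```

Tumbling the piece achieves the same thing sequentially: each new orientation is another "view" folded into the running totals.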
My project was more along the lines of: how do you represent a single piece in memory without using a planar point system like ldraw's? I had toyed with the idea of converting between the two systems for memory reasons, but it became compute/IO expensive very quickly during collision detection with other pieces.
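For contrast, here is why collision detection is the one place the voxel representation shines: two parts collide exactly when their occupied cells coincide. A minimal sketch under my own assumptions (occupancy stored as sets of cell coordinates, which is not how the original project stored it):

```python
def collides(occ_a, occ_b, offset):
    """True if part B, shifted by `offset`, overlaps part A.

    occ_a, occ_b: sets of (x, y, z) occupied grid cells.
    offset: integer (dx, dy, dz) placement of part B on the snap grid.
    """
    dx, dy, dz = offset
    shifted = {(x + dx, y + dy, z + dz) for x, y, z in occ_b}
    return not occ_a.isdisjoint(shifted)

# Two 1x2 plates side by side: no collision shifted clear, collision if
# placed one cell over so they share a cell.
a = {(0, 0, 0), (1, 0, 0)}
b = {(0, 0, 0), (1, 0, 0)}
print(collides(a, b, (2, 0, 0)))   # -> False (clear)
print(collides(a, b, (1, 0, 0)))   # -> True  (shared cell at (1, 0, 0))
```

The catch, as noted above, is that converting a mesh-style part into this form (and back) on the fly eats the savings in compute and IO.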