WildGaussians: 3D Gaussian Splatting in the Wild (arxiv.org)
108 points by alphabetting 57 days ago | 19 comments



Not only is there code but they've actually pinned version numbers in their requirements.txt!

Truly an outlier ;)


Corridor Digital created a short film and accompanying making-of video showing the use of gaussian splats vs. photogrammetry to reconstruct an environment: https://youtu.be/GaGcLhhhbDs?si=rJVxF8yNwwbfLBw9&t=306


I love love love seeing how reflections are handled by making essentially a virtual geometry under the floor.

This implies some pretty obvious & severe limitations, since it's not modelling where the lights are or how light bounces, but it is really neat to see & works shockingly well for this use case of moving forward through a scene.


If I recall correctly, gaussian splats are direction-dependent, so they can learn reflections "correctly", but since there's no picture in the training set that forces it to do the reflection correctly, there's no reason for it to.


The ability to re-light scenes is a step in the right direction, but their examples use the classic easy scenario of an overcast sky with no strong lighting directionality, which can be faked fairly convincingly using ambient occlusion rather than fully simulating (or inferring) the light paths.



I really want them to create a mascot image for the WildGaussians project that is a T-Rex holding two NERF blasters.


Any insights on what it does better than the concurrent works (Gaussian in the Wild, SWAG, Wild-GS)?


Ok, I guess I'll have to bite myself. I only spent a couple of minutes with each paper, so my understanding will at best be superficial. Trying to understand something quickly is hard, trying to represent something you haven't understood completely is harder. So please consider every bit of information in this post a likely misrepresentation, overgeneralization, or just plain wrong (welcome to the internet!). But it might help you select which papers you'd want to look at, so I'm still posting it.

All these in-the-wild methods share a similar setup, in that they do two things: appearance modelling (for daytime and weather and season and exposure, usually with per-image and per-Gaussian embeddings), and modelling/masking of transient objects (for tourists and dogs and water and blimps).

WildGaussians (this post): For the appearance modelling, they have a learnable per-Gaussian and a per-image embedding. These are fed into an MLP to produce affine transformations applied to the spherical harmonics (SH) params. So if you want a fixed appearance, you can pre-compute the SH params and throw the scene into a standard 3DGS renderer.
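A minimal sketch of what that could look like (module structure, names, and dimensions are my guesses, not the authors' code):

    import torch
    import torch.nn as nn

    class AppearanceModel(nn.Module):
        """Sketch: per-image + per-Gaussian embeddings -> MLP -> affine transform of SH params."""
        def __init__(self, n_images, n_gaussians, embed_dim=32, sh_dim=48):
            super().__init__()
            self.image_embed = nn.Embedding(n_images, embed_dim)     # per-image appearance code
            self.gauss_embed = nn.Embedding(n_gaussians, embed_dim)  # per-Gaussian appearance code
            # MLP predicts a per-Gaussian scale and offset for the SH coefficients
            self.mlp = nn.Sequential(
                nn.Linear(2 * embed_dim, 128), nn.ReLU(),
                nn.Linear(128, 2 * sh_dim),
            )

        def forward(self, image_idx, gauss_idx, sh_params):
            # sh_params: (N, sh_dim) base spherical-harmonic coefficients
            img = self.image_embed(image_idx).expand(gauss_idx.shape[0], -1)
            gau = self.gauss_embed(gauss_idx)
            scale, offset = self.mlp(torch.cat([img, gau], dim=-1)).chunk(2, dim=-1)
            # affine transform of the SH params; bake this in once for a fixed appearance
            return sh_params * (1 + scale) + offset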

The affine transformations are inspired by "Urban radiance fields", which predicts affine params from the image embedding alone. WildGaussians also use a per-Gaussian appearance embedding for local changes.

For the transients, they have an "uncertainty modelling" module, which computes DINOv2 features of the rendered image, and of the GT image. They compare, upsample and binarize them into a per-pixel mask, which is thrown onto the DSSIM and L1 loss.
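Roughly, that masking step might look like this (a sketch; the feature-map interface of the DINOv2 wrapper, the cosine-similarity comparison, and the threshold are my assumptions, not necessarily how the paper implements its uncertainty module):

    import torch
    import torch.nn.functional as F

    def transient_mask(rendered, gt, dino, threshold=0.5):
        """Sketch: compare DINOv2 features of rendered vs. GT image, binarize into a per-pixel mask."""
        with torch.no_grad():
            f_render = dino(rendered)   # (1, C, h, w) patch features, assumed interface
            f_gt = dino(gt)
        sim = F.cosine_similarity(f_render, f_gt, dim=1)                            # (1, h, w)
        sim = F.interpolate(sim.unsqueeze(1), size=gt.shape[-2:], mode="bilinear")  # upsample to pixels
        return (sim > threshold).float()   # 1 = static (keep in loss), 0 = likely transient

    # masked photometric loss: pixels flagged as transient don't contribute
    # loss = (mask * (l1 + dssim)).mean()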

Paper reads well, probably interesting to dive into how the uncertainty modelling really works. Straightforward setup (with the split into appearance and uncertainty), can be followed along well.

SWAG: appearance modelling somewhat similar to WildGaussians. Has an image embedding, and looks up a per-Gaussian embedding from the Gaussian coords in a hash grid (hello Instant NGP). Feeds them together with the Gaussian color into an MLP, which produces an image-independent color and an image-dependent "opacity variation". So instead of masking out transient stuff in the image loss during training, they learn for each Gaussian (through the hash grid) whether it's visible in a particular image.
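A rough sketch of that idea (the split into two small MLPs and the hash-grid interface are my assumptions):

    import torch
    import torch.nn as nn

    class SWAGHead(nn.Module):
        """Sketch of the SWAG idea (my structure, not the authors' code): a per-Gaussian feature is
        looked up from the Gaussian's position via a hash grid; color is image-independent, while a
        second head predicts an image-dependent opacity variation."""
        def __init__(self, n_images, hash_grid, embed_dim=32, feat_dim=32):
            super().__init__()
            self.hash_grid = hash_grid   # hypothetical Instant-NGP-style encoder: (N, 3) -> (N, feat_dim)
            self.image_embed = nn.Embedding(n_images, embed_dim)
            self.color_mlp = nn.Sequential(nn.Linear(feat_dim + 3, 64), nn.ReLU(), nn.Linear(64, 3))
            self.opacity_mlp = nn.Sequential(nn.Linear(feat_dim + embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, xyz, base_rgb, image_idx):
            g = self.hash_grid(xyz)                                   # per-Gaussian embedding from 3D position
            rgb = self.color_mlp(torch.cat([g, base_rgb], dim=-1))    # image-independent color
            img = self.image_embed(image_idx).expand(xyz.shape[0], -1)
            d_alpha = torch.sigmoid(self.opacity_mlp(torch.cat([g, img], dim=-1)))  # image-dependent opacity
            # transient Gaussians can learn a low d_alpha in images where they're absent
            return rgb, d_alpha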

The authors note that they also considered using the "Urban Radiance Fields" affine transformations, but that affine colors cannot model all appearance changes... which is why WildGaussians have the per-Gaussian embedding, I think?

Interesting that they can reproduce per-image transient objects. But maybe the static stuff looks a bit worse because of that (look at the water in the Trevi fountain, it's... missing).

Bit hard to quickly follow along with the opacity variation stuff, this would take a bit of time to grok. But overall also quite straightforward setup, interesting read.

Wild-GS: Global appearance embedding (per image), per-Gaussian local reflectance, and material attributes per Gaussian. A fusion network decodes SH from these three components. For some reason it projects image pixels into 3D using the depth and looks that up in a triplane, instead of looking up the triplane from the Gaussian position? There's "2D Parsing" and "3D Wrapping", and it's quite convoluted; I'd need more time to understand what's going on here.

Gaussian in the Wild: extracts image features with a UNet, reshapes them into a bunch of feature maps (K feature maps + a projection feature map?). The Gaussians sample from these feature maps. Features are fused with an MLP into a color. Some adaptive sampling is apparently required.

Transient objects are handled by a 2D visibility map obtained from a UNet as well.

I think the main idea is to train networks that can extract information from the input image to model the appearance and transients. This is different from WildGaussians and SWAG, which train per-image and per-Gaussian (directly or through 3D lookup) embeddings, and only small decoder MLPs.

WE-GS (An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections): may be similar to Wild-GS. Too much stuff going on to understand from quick browsing. If you like the <2D input image feeds into lots of different networks> idea (like Wild-GS), you may want to read this one, too.

Robust Gaussian Splatting: tackles motion blur by modelling the camera poses as a Gaussian distribution, and defocus blur (from physical apertures) through an additional covariance on the Gaussians. It also has an RGB decoder with a per-image embedding for some appearance modelling (different exposures of the same scene). Interesting to read when you want to get rid of motion and defocus blur. For in-the-wild appearance modelling, choose one of the other methods; it's very simple here.
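The defocus part essentially boils down to inflating each splat's screen-space footprint with a learned per-image blur term, something like (my notation, not the paper's):

    import torch

    def apply_defocus(cov2d, beta_img):
        """Sketch: add an isotropic, per-image blur covariance to each projected 2D Gaussian,
        so defocus blur is absorbed by the splat footprint rather than baked into the scene."""
        # cov2d: (N, 2, 2) projected covariances; beta_img: learned scalar for this image
        eye = torch.eye(2, device=cov2d.device).expand_as(cov2d)
        return cov2d + beta_img * eye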

SpotlessSplats: Ignoring Distractors in 3D Gaussian Splatting: spatio-temporal semantic clustering of objects that are likely transient (moving dog, "transient distractor in casual capture"). Also offers a new densification/pruning scheme. Focus is on casual capture, no appearance / in the wild stuff. Results look great, probably an interesting read!


I get my NeRF/Gaussian etc. updates from https://x.com/janusch_patas This guy posts only research about these things. I was also going to submit WildGaussians.


Can someone eli5 splatting? How does it differ from normal photogrammetry techniques?


The core solve is similar, but instead of generating a mesh, you generate a series of ellipsoids that have color, transparency, and some directional (view-dependent) radiance information.

The idea is that by layering enough of these on top of each other, you can make a high quality visual representation not bounded by geometry.

Think of them like brush strokes in mid air versus a sculpture.


That's insane. Does Hollywood use them, or do they still use meshes?


Ideas around point-based rendering have been around for a long time, but they have not been widely used. Hollywood and games might use them for smoke, water, explosions, but rarely for solid objects.

There has been a big explosion of interest in the area since the release of https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/ which proposed a way to generate the point cloud data from photographs in a way that is procedurally a lot like photogrammetry. Hundreds of research projects have spawned off of that paper.


Gaussian splats themselves aren't very useful for Hollywood because studios don't have any issue with higher-cost representations that are more accurate.

That said, Gaussian splats are based on very old papers that do have use in Hollywood. The closest example is radiance caches for Renderman.

This would store points in space with very similar information to a splat, then interpolate between them to provide things like bounce lighting.

The first Pixar film to use this throughout was Up.


> higher cost representations that are more accurate

What are some examples of those representations?


Just standard geometry and materials that they’re willing to spend the time properly constructing and path tracing. Nothing fancy.

They just have a higher up front cost to make than Gaussian splats but are infinitely more directable.

Ultimately the ability to control everything at a micro and macro level is what film wants, not cheap rendering.


From my understanding, traditional photogrammetry typically generates 3d point clouds from image pixels by correlating visual features between images with known camera parameters, allowing the camera pose of each image to be estimated in a shared coordinate space. These point clouds are postprocessed to estimate closed surfaces which can then be converted into textured triangle meshes to be rendered using traditional 3d rasterization techniques.

Gaussian splatting represents a scene as a cloud of 3d gaussian ellipsoids, with direction-dependent color components (usually represented using spherical harmonics) to deal with effects like reflections. The "Gaussian" part is important, because gaussian distributions are easy to differentiate, making it possible (and fast) to optimize the positions, sizes, orientations, and colors of a collection of Gaussian splats to minimize the difference between the input photos and the rendered scene. This optimization is usually done by starting with the same 3d point clouds and camera poses estimated using the same or similar tools as traditional photogrammetry (e.g. COLMAP), and using this point cloud to place and color your initial Gaussian splats. One of the key insights in the original Gaussian splatting paper was the use of some heuristics to determine when to split a splat into smaller ones to provide higher detail over a given area, and when to combine splats into larger ones to cover uniform/low detail areas.
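In heavily simplified pseudocode, the optimization loop looks roughly like this; `render`, `ssim`, and `sample_view` are placeholders for the differentiable splat rasterizer (the part the original paper contributes CUDA kernels for), an SSIM implementation, and the data loader:

    import torch

    N, num_steps = 100_000, 30_000
    # splat parameters; in practice initialized from the SfM point cloud (random here for the sketch)
    params = {
        "xyz":       torch.nn.Parameter(torch.randn(N, 3)),   # ellipsoid centers
        "log_scale": torch.nn.Parameter(torch.zeros(N, 3)),   # per-axis sizes
        "rot":       torch.nn.Parameter(torch.randn(N, 4)),   # orientations (quaternions)
        "opacity":   torch.nn.Parameter(torch.zeros(N, 1)),
        "sh":        torch.nn.Parameter(torch.zeros(N, 48)),  # view-dependent color (spherical harmonics)
    }
    opt = torch.optim.Adam(params.values(), lr=1e-3)

    for step in range(num_steps):
        cam, gt = sample_view()            # placeholder: camera pose from SfM + its photo
        pred = render(params, cam)         # placeholder: differentiable splat rasterizer
        loss = (pred - gt).abs().mean() + 0.2 * (1 - ssim(pred, gt))   # L1 + D-SSIM, as in the 3DGS paper
        loss.backward()
        opt.step(); opt.zero_grad()
        # periodically: split splats with large gradients, prune near-transparent ones (densification heuristics)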

The nature of Gaussian splats being essentially fancy point clouds means that they can't currently be easily integrated into existing 3d scene manipulation pipelines, although this is rapidly changing as they gain popularity, and tools to convert them into textured meshes and estimate material properties like albedo, reflectance, and so on do exist.


Corridor Digital has a video where they showed some of the limitations of photomapping vs. splats: https://youtu.be/GaGcLhhhbDs?si=rJVxF8yNwwbfLBw9&t=306 and there's an earlier one where they broke it down a bit more https://www.youtube.com/watch?v=YX5AoaWrowY



