Holy sh*t, can you imagine a year from now if they start using something like this for concerts or basketball games? Like imagine rewatching a basketball game but being able to move the camera on the court???? Might not be possible yet, but this shows the tech's possible. Let alone being able to scale it to real time someday, maybe, lol
A thought experiment I like to employ when imagining the impact of a new piece of tech: reverse the timeline. If the tech were the status quo, how would the current status quo be marketed?
If moving the camera were the norm, the current status quo would probably be marketed along the lines of "No more micro-managing camera views, arguing over playback speed or fiddling with the timeline – this next leap in technology introduces pre-edited video, where each game is preprocessed by a team of highly-skilled, professional producers, selecting the best viewing angle and playback speed so you and your friends can just sit back, relax and enjoy the game."
If the fictional press release sounds good enough, the tech probably won't hit.
I remember, 30 years ago during the switch to digital TV broadcasting, that proponents of the tech tried to sell a future where viewers of sports events would be able to select which camera to watch. Then again, imagine watching a game with 10 friends and trying to agree over cameras...
Great point. Are you familiar with McLuhan's media tetrad?
The hypothetical status quo is a wonderful example of retrieval. Every new technology should be expected to have the effect of reemphasizing something previously obsolete.
I think for routine game-watching you're 100% right. But for replaying amazing/interesting/controversial plays, this tech would be an enormous improvement over being captive to the broadcast team. Everyone would love to be able to grab control and fly around and zoom in on that one power dunk, critical fumble, bad foul call, etc. on demand.
You want a killer app for VR/AR goggle-style things? You're right, this would be amazing.
Apple demoed some kind of volumetric video to the press with the Vision Pro. There was a short clip of a concert and an NBA game (Nuggets?) among other things. I heard a number of people say it was like being there.
This is a step past that. Apple’s was recorded from some kind of special camera rig (I assume), but I seriously doubt it was full volumetric video from a large number of angles. It sounded more like volumetric video if you were stuck in a (very good) seat in the venue.
I’d be curious to know just how much horsepower it takes to play these back.
I thought they recorded their special videos using the Vision Pro itself, which has enough sensors to build depth maps of the scene and provide novel views within a small range from the original position.
But I am half speculating and I don't really remember. That's just the impression I remember having.
That’s a feature of the headset, but I think some of the demo videos were recorded in some other manner. I seem to remember hearing a discussion on a podcast (The Talk Show? Dithering?) where it was mentioned you could see a camera rig somehow.
Perhaps that was something you could see on TV when they accidentally showed the rig that was at the same event?
Of course, it’s possible they were only using a camera rig so that no one would see the device before they were ready to unveil it. Which would be very Apple-like.
So I’m speculating a bit. But even if they were going to professionally record events I would think they would do better than just have someone sit there with a headset on.
I saw the image you’re referencing yesterday. The unit appeared to be flat and briefcase-sized. It had two fisheye lenses, about 10 cm in diameter each, separated by what I expect is an average human IPD.
So I think that unit was simply capturing 180 degree stereo video. Not enough to compute volume without most of it being some ML inference.
From the paper, it seems a 3060 is enough for 60 FPS on the DNA-Rendering dataset. On the full-screen datasets it manages 25 FPS. A 4090 might be needed to stay above 60 FPS.
Still pretty heavy I’d say but it certainly came a long way and shows us real volumetric video is doable.
That's real-time rendering though, right? Is there anything preventing it from being pre-rendered in non-real-time first? Or does it have to be rendered in real-time?
I'm not familiar with any of this at all, so I'm genuinely curious.
If the context is use in a VR/AR headset, it has to be real-time. And I guess that use case, and the related one where you interactively want to walk around a scene, are two of the main use cases.
I think the best way to consider it... is as a 3D cloud of individually addressable pixels. The size of the cloud sets the dimensions of the real-time rendering problem.
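Concretely, something like this (a minimal Python sketch of that mental model; the names and fields are mine, not the paper's actual data layout):

    import numpy as np

    # A minimal sketch: every "pixel" is a point in space with a color,
    # and a volumetric video is one such cloud per time step.
    # Illustrative only -- not the paper's actual representation.
    class PointCloudFrame:
        def __init__(self, num_points: int):
            self.positions = np.zeros((num_points, 3), dtype=np.float32)  # x, y, z
            self.colors = np.zeros((num_points, 3), dtype=np.uint8)       # r, g, b

    # The point count per frame is what drives memory use and render cost.
    video = [PointCloudFrame(num_points=100_000) for _ in range(30)]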
I started working on something like that a couple decades ago. I figured, with all of the camera feeds at a football game, there would be plenty of views to generate 3D models, even with relatively naive approaches. Then the NFL did it shortly after (2008) [1], and it didn't catch on.
This is what they use in the latest FIFA / FC games, they call it Volumetric Data Capture, basically using video footage from real sports events to capture model and animation data for the players, allowing their unique mannerisms and movements to translate to the game. In previous iterations they would have football players in motion capture suits, but they wouldn't get all the players, plus if they did it would be in stifled, studio conditions, not their natural environment.
Anyway, not quite the same as turning a match into 3D, but definitely related.
We did this (as a side effect) for the Premier League ~2009-2012 (liquidated JUST before VR appeared, where the content worked fantastically, and then ~2014 with the Moverio glasses, even better in AR).
We did live player tracking (~33 cameras) on-site at every game, and for fun rendered players FIFA-style with a free camera. We even did some renders (captures of the real-time engine) for Canal+ highlights as an experiment.
edit: my own GPGPU-only (frag shaders :) ), sub-100 ms, uncalibrated-cameras R&D a few years later (footage directly from Sky / Match of the Day) also works really well on a LookingGlass https://twitter.com/HoloSports/status/1327375694884646913
(I took this to Sky Sports but they said it was a bit too in-the-future)
Actually, a company has been working on this for a few years now, and I believe they are currently in production. Their focus is football/soccer, I believe. I was going to do a research internship there before I dropped it for a different one. Here it is:
https://www.beyondsports.nl/
Looking at it, they now heavily focus on tracking the movements of players to replay in AR.
This is what we were building at my previous start-up, though we had a focus on outdoor sports. We had built a 3D virtual world and used GPS tracks to follow athletes (ultra-marathons, paragliders), etc.
We theorized that 2 go-pro cameras on the athlete would let us completely re-create the entire scene from all angles, and inform an AI of how to re-paint our virtual world with real-world weather environments etc.
Unfortunately, 5 years ago, everyone said I was crazy to think any of this was possible.
There is a video capture of our 3D scenes from 2017 on our old website (we were a full 3d world, not video) https://ayvri.com - the tech was acquired just over a year ago.
They don't even stream the NBA games in 4k because TV networks only support 1080p. I doubt they'd buy into such an expensive technology for such a niche audience.
It will be very interesting to watch how tech like this affects mainstream society.
I imagine pornography will use it at some point soon. Maybe something like chaturbate where your interactions with the cam performer are more customized?
Could it be used with CCTV to reconstruct crime scenes or accidents?
Wedding videos might be a popular use, being able to watch from new angles could be a killer app.
Or a reworking of the first Avengers movie, view all the action from multiple viewpoints.
And all this will probably be built in to the pixel 18 pro or something.
There's an ST:TNG episode I remember too where they have an image and they get the computer to back-trace all the reflections in the image to produce what isn't easily seen.
"Light Field" photography has existed for a few years now, yet there is still no porn using it that I am aware of.
I tried a demo a while back that was very impressive, despite being relatively low-res stock footage. Simply being able to move your head a few inches in any direction without taking the world with you is a much better experience than contemporary VR video.
This seems unprecedented. Imagine if you have this but you can update the scene programmatically. Ask your AI to change the location or actors. Now you have a very convincing artificial scene with anything you can imagine in it.
Re-reading the paper, I totally missed the hardware that they did this on, which was consumer-level GPUs, so I think you may be right - 3 years is probably a good time frame for seeing this kind of tech in commercial games.
My reasoning for initially saying 10ish years:
GPU architecture release cadence is frequent, but not THAT frequent. NVidia released the Ampere (RTX 30X0) in 2020, 3 years ago. Ada Lovelace (RTX 40X0) was released a year ago, in Oct 2022. It's _possible_ to use 4090s to do medium neural-network things right now, but to be on the leading edge, you need multiple datacenter-level cards, each of which is > $5k. Even though it's possible to do crazy things with the most recent generation of GPUs, there don't yet exist games that really take advantage of it. The closest that I'm aware of is the Cyberpunk-level games that make native use of raytracing capabilities.
I think it'll probably be a year or two before we see games come out that really require the level of 40-series cards ( or 7900-series if you're of the AMD persuasion), which is a lag-time of ~3 years after the cards were released. I think the software development time and market saturation are the driving factors in the gap here.
I was under the mistaken impression that the video output was produced on the datacenter cards. They got reasonable performance on a 30-series NVidia card. In 3 years, it's totally reasonable to expect that AAA game players will have that level of GPU performance in their gaming machines, so yeah, I think you're right.
I imagine this would be helpful when making movies if you could basically play around with the scenes without having to refilm them several times to get the best one.
When it comes to perspective and the like, they already do this; multiple camera angles, CGI, and the odd reshoot. Like having Henry Cavill come back for a reshoot, then CGIing out the mustache he had for his next role.
Between this and LLMs, we're half-way to building a holodeck. What's missing at this point is just hard light - i.e. being able to feel the physical substance of simulated objects, and being able to experience it all without a wearable/personal device.
> Even though we have no idea how to even approach that.
We'll have to cheat somehow.
Sure, this is still effectively magic, but a few years ago I thought we were nowhere near having the software layer solved - specifically, the Star Trek style computer interaction and holodeck "storyteller" - the thing that would let you create a high-fidelity interactive world with believable characters and a story generated as you play, out of a command like "computer, give me a cafe in Paris, circa 1890". Now we suddenly have all the pieces for that - we can literally just do it, as long as we constrain the medium to just text, with perhaps some generated imagery for extra mood[0]. And I'm not even talking about GPT-4 - I had a convincing holodeck-style textual roleplay on AI Dungeon ~2-3 years ago, back when GPT-3 had just come out.
Note: I don't want to hijack this thread and make it into yet another LLM discussion - but I want to point out we have a bunch of parts converging into an entirely new kind of experience. And while we may not crack hard light any time soon, a wearable VR/AR holodeck experience now seems in reach in under a decade. Or perhaps closer to SAO[1] than a holodeck, but still - something that felt way beyond our capabilities just a couple years ago.
--
[0] - Though GPT-4V should be able to play a board game where you send an image of an updated board each turn, shouldn't it?
There is sort of a hardware arm to that industry. They’ve made it so you can send money to a model who is on stream and your “donation” will trigger a mechanical actuator that …does things…
But as for sensation for the viewer, no, there’s just what you’d find in a toy shop.
I suppose you could build up what would essentially be a 4D sprite sheet or animation set of a character and use that to support natural looking arbitrary movement. I'm not sure that isn't just a mo-cap character with extra steps, though.
Even the most skilled animators with years of budget still can’t escape the uncanny valley which is why CG animation has converged on a style of blob-humans as the current standard.
I have very little hope of AI driven animation looking ok in the next many decades. Don’t underestimate how hardwired your senses are at finding artifacts in movement. Static images are much easier to “fake”.
> we precompute the physical properties on the point clouds for real-time rendering. Although large in size (30 GiB for 0013_01), these precomputed caches only reside in the main memory and are not explicitly stored on disk
Does the cache size scale linearly with the length of the video? 0013_01 is only 150 frames. And how long does the cache take to generate?
Looks like it; I suspect the authors precomputed everything they could to reach the highest frame rate. Like predecoding all the frames of a movie into raw pixels?
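Back-of-the-envelope, assuming the cache really does scale linearly with frame count (the 30 GiB for the 150-frame 0013_01 sequence is from the paper; the rest is extrapolation):

    # Rough scaling estimate under a linear-growth assumption.
    cache_gib = 30.0   # reported cache size for the 150-frame 0013_01 sequence
    frames = 150
    per_frame_gib = cache_gib / frames            # ~0.2 GiB per frame
    one_minute_30fps = per_frame_gib * 30 * 60    # ~360 GiB for one minute at 30 FPS
    print(f"{per_frame_gib:.2f} GiB/frame, ~{one_minute_30fps:.0f} GiB per minute")

So if it really is linear, anything longer than a short clip gets expensive fast.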
I think volumetric video should be thought of as a regular video, where the decoding and playback happen at the same time. A few papers down the line this could be easily implemented.
How many cameras does this method require? As far as I can tell from the paper it still generates from multi-view source data. I can't say for sure but it seems like a large number from what I can parse as a layman.
So, something we could easily see done at, say... an NBA event or football field... hell, I imagine I can think of some... adult use cases that would probably make a bundle off of tech like this if it can be optimized down... as my favorite youtuber would say... WHAT A TIME TO BE ALIVE!
Very cool renderings, but ironically my browser is having a heck of a time rendering their website. The short videos keep jumping around, starting and stopping randomly... which i guess is very VR.
Add volumetric sound, integrate VR and you almost have recreated braindance from the Cyberpunk 2077 game. Doesn't seem that far off in the distance.
The missing component from complete braindance would be integrating physical senses. AFAIK we are pretty far away from having anything revolutionary in that domain. Would love to be proven wrong, however.
If I’m understanding the paper correctly then the four dimensions are the position, density, radius, and color of the spheres in their volumetric model. So for any given viewing position and point in time, their model produces a 4D scene that is then rasterized to 2D.
don't forget to add the 3 color dimensions. (this may seem pedantic, but when doing feature extraction, these extra dimensions really are significant)
So's a video game, and we call that "real-time 3D". Time is mentioned, but it isn't counted again as a dimension, perhaps because any given momentary view is a time slice, not a time range like it is an XYZ range.
I think the difference is that in a video game you are in one location only at any given moment and things travel only forward in time. We can view from any location at any time in volumetric video.
In a lot of racing simulators you can change the position of the "virtual camera". It can be in the cockpit, on the hood, behind the car, and in some games in an arbitrary position. Usually replays allow you to see from other competitors' cars and from where TV cameras would sit in the real world.
CS:GO, TF2, GTA5 and Trackmania (and likely many more) have replay systems where you can pause, play and rewind with a freefly camera. Lots of games have a rewind mechanic: Grid, Baba is You & Viewfinder come to mind. Others have a "Photo Mode" where you can pause with a freefly camera: Starfield & Witcher 3 come to mind.
Valid, yeah. It occurs to me though that the difference is we are making a representation of the real world that can be manipulated like such, as opposed to a simulation of a fabricated world.
It's not a 3D model that is animated using a skeleton and keyframes like traditional 3D. It's many consecutive 3D models that create the illusion of continuous motion (aka video).
4D is the name that has come to describe the jump from static 3D models (photogrammetry) to 3D "video" models.
Time is the fourth dimension. The input data is a video, so the model learns the colors and the positions of the elements (basically points). You can render the scene from any angle at any time once the model is trained.
Downvoted at the time I see it, but actually correct. It's based on K-planes https://arxiv.org/pdf/2301.10241.pdf which effectively splits each space-time relationship off from the spatial relationship. It's just mathematics, guys. The original NeRF paper talked about a 5D coordinate. You know like a k-dimensional vector?
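For anyone counting dimensions, here's roughly how the inputs break down in this family of methods (NeRF-style conventions; a hedged sketch, not necessarily this paper's exact parameterization):

    import numpy as np

    def query_dynamic_field(x, y, z, t, theta, phi):
        """Illustrative stub: a dynamic ("4D") scene is indexed by a spatial
        point (x, y, z) plus time t; the view direction (theta, phi) is what
        brings the original NeRF's static-scene input up to a 5D coordinate
        (x, y, z, theta, phi). A real model would evaluate a trained
        network / feature grid here; this just returns dummy values."""
        rgb = np.zeros(3, dtype=np.float32)
        density = 0.0
        return rgb, density

    rgb, density = query_dynamic_field(0.0, 0.0, 0.0, t=0.5, theta=0.0, phi=0.0)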
Yea it's probably to have a catchy name and get some attention. Although it's technically accurate to call it 4D since it includes time, I think 3D video recording would probably get the point across to more people in a less sensationalist way.
Is it technically accurate? Seems like it's actually 6DoF view angles + time. The paper mentions a 4D view, 4D point cloud, dynamic 3D scene and 4D feature grid.
Related: there was a small project that did similar stuff with the Kinect v2 ~7 years ago that was really impressive for the time. https://github.com/MarekKowalski/LiveScan3D
Now that the Kinect v2 can be found for next to nothing and is very easy to mod to use without an expensive adaptor, it's a bit of a shame the project was abandoned; from what I've seen, the bigger limitations of the project can be overcome (only one Kinect per PC, mainly).
Watched the video for where the idea of IMGUI came from and it was frankly terrifying. I mean, just the assumption that the frame rate is fast enough that mouse up and mouse down occur in two different passes.
The code page leads to a repository that just has a README.md saying the source code is "coming soon"
If it actually works, this is huge. I'd be using it tomorrow.
But that first demo gif strikes me as something being off.
The algorithm isn't picking up on the legs in the background painted on the wall... In the paper, I don't understand how what they've built could differentiate between a picture of someone painted on a wall and the part of the scene that should be rendered in 3D.
This is my question as well. What's the input required to generate these 3D scenes? Is RGB video enough or does it also require spatial data? Is panning around the same scene enough or does it require multiple cameras?
I think there has been some serious misinterpretation of what 'real time' means in the context of this paper; and, possibly, that the researchers have avoided overt clickbait claims because they knew the term 'real time' would do the work for them.
This is not some neural codec that can convert any novel or unseen object live, like a kind of 3D YOLO - the paper mentions that it requires up to 24 hours of training on a per-case basis.
Nothing can be edited, no textures or movements - all you can do is remove people or speed them up or slow them down, and that's been possible in NeRF for a few years now.
What's funny here is the use of the word "uncrop"! I had never heard that word used before DALL-E 2 was released. And I've been working in computer graphics for 30 years lol. Also I watched a lot of Red Dwarf.
The effect is cool, but I must be the only person on this website who doesn't see a future for it.
Seems very niche, with massive data size restrictions, making it difficult to broadcast or stream on existing infrastructure.
But even if you solved the infrastructure problem, it feels like a gimmick that would be uninteresting pretty quickly.
Sporting events maybe benefit a bit from being able to find the right angle for any shot, but honestly they will probably just find the best angle and post that video as a clip.
Movement is per-object, meaning camera movement can be encoded as a vector while the scene it's moving within remains largely static; that leaves a lot of redundant data. Streams could be frustum-culled based on the user/camera's perspective. There's potential for high compressibility.
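A minimal sketch of the frustum-culling part (illustrative only; a real streaming pipeline would operate on compressed chunks rather than raw points):

    import numpy as np

    def cull_points(points: np.ndarray, planes: np.ndarray) -> np.ndarray:
        """Keep only points inside a view frustum.

        points: (N, 3) xyz positions.
        planes: (6, 4) frustum planes (a, b, c, d) with inward-facing normals,
                so a point is inside when a*x + b*y + c*z + d >= 0 for all six.
        """
        homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
        inside = (homo @ planes.T >= 0).all(axis=1)
        return points[inside]

    # Only the surviving subset would need to be streamed for the current viewpoint.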
One of my favorite things in VR is Google Maps; I like "walking" around in cities without leaving my house. I am longing for the day that we can also do this.
I'm keen that it gets the credit it deserves - I'm terrified it will stop working one day soon and the world will have lost a true wonder.
(They did just open up the underlying APIs so it would be possible to build a replacement now - although it's free in preview and pricing hasn't been announced - so no idea if it's economically viable)
How would you stream the output of something like this, if you wanted to? So that people could continue to change the viewpoints.
You couldn't possibly stream the full list of voxels generated by capturing the entire image with all of the cameras, right? That would probably exceed PCI bandwidth capabilities.
You'd need the server-side to generate models, send those models, and then stream the vectors?
Does anybody else get the impression that holograms are inevitable? This type of tech seems like the medium; now all we need is a good way of displaying them.
This is where my mind goes with all of these advancements. Always and immediately to how they will contribute to making a photorealistic Holodeck via mixed reality a closer reality.
Is a lenticular lens in front of a display really considered a hologram? I thought you needed to actually capture the light wave pattern to be a hologram, whereas a display is just colour and intensity.
Well, I didn't know that it was only 1D lenticular (they keep that under their hat!) But let's pretend for a second that it is 2D lenticular.
In that case, yes it is absolutely the same as a hologram. Consider this thought experiment. Take a real hologram. Cover everywhere up except a tiny opening - a pixel effectively. Now what information does this pixel encode? It's just colour as a function of view angle.
Now do the same thing for a 2D lenticular display. You can reproduce exactly the same thing - a colour that varies as a function of angle. Therefore it is the same.
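The thought experiment in code form, if that helps (purely illustrative): under this argument, both a hologram pixel and a 2D-lenticular pixel reduce to the same thing, a color as a function of view angle.

    import numpy as np

    # Purely illustrative: one "pixel" that returns a color depending on
    # the (quantized) direction you look at it from. Under the thought
    # experiment above, a masked-down hologram and a 2D lenticular pixel
    # both carry exactly this information and nothing more.
    class AngularPixel:
        def __init__(self, n_theta: int = 16, n_phi: int = 16):
            self.table = np.zeros((n_theta, n_phi, 3), dtype=np.uint8)

        def color(self, theta_idx: int, phi_idx: int) -> np.ndarray:
            return self.table[theta_idx, phi_idx]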
I guess you could consider LookingGlass to be a hologram in one dimension rather than two. Or alternatively it is a hologram if you promise never to move up or down!
LookingGlass' small portrait display is more like (2D) lenticular. Their larger displays use a microlens system that directs pixels out in beams in different directions. So you also get parallax in the up & down direction.