
Behind the Tech with John Carmack: 5k Immersive Video
https://developer.oculus.com/blog/behind-the-tech-with-john-carmack-5k-immersive-video/
======
Animats
I dunno. Looking at canned content with VR goggles seems like a dead end. It's
3D TV on steroids. Remember 3D TV? Works fine. Too much trouble just to watch
TV. Market failure.

It's been five years since the Oculus Rift DK1 appeared, and there's still no
killer app. Game developers are pulling back from VR.[1][2] The VR virtual
worlds are a disaster. Sansar has about 50 (fifty) concurrent users. SineSpace
and High Fidelity have similar numbers.

VR goggles are cool for about an hour, and in the closet in a month.

The technology is coming along just fine, but that's not the problem.

[1] https://www.gq.com/story/is-vr-gaming-over-before-it-even-started
[2] https://mashable.com/2018/01/24/virtual-reality-gaming-loser-gdc-2018-survey/#WG.n6Sqwp5qm

~~~
samplatt
You can keep believing that, sure. I'll just be over here, continuing to earn
my living delivering interactive 360 video and VR demonstrations for our
clients :-)

~~~
hota_mazi
You making a living off this doesn't invalidate the fact that VR is having a
hard time catching on.

~~~
samplatt
I'm a biased observer, sure. But I feel very confident in saying that VR is
not going away, let alone already dead like OP was saying, and that it's not
even _remotely_ comparable to 3D TV.

It's not (as it stands) going to be adopted as the media platform of choice
for consuming mass amounts of Netflix and Marvel movies, but there are plenty
of niches where it's a viable go-to solution.

------
modeless
Do you feather the edges of the high resolution videos to blend with the low
resolution background? I found this necessary when I implemented something
similar because otherwise the borders are far too obvious, though it feels bad
to waste those pixels at the edge of the video.

How are the videos stored? Is each gop a separate file or do you seek around
in larger files? Is it feasible to stream over HTTP or is it only possible to
play local videos for now?

~~~
JohnCarmack
I was prepared to blend the edges, but it turned out not to be necessary. If
the compression ratio was increased enough that there were lots of artifacts
in the low res version it might be more important.

I was originally going to put it into an mp4 file with the base stored first,
so normal video players could at least play the low res version, but the
Android MediaExtractor fails when presented with more than 10 tracks, so I
just rolled my own trivial container file.

Peak bitrate for Henry is around 40 Mbps, so it wouldn't stream for most
people. With some rearrangement of the file so each strip has a full gop
stored contiguously, instead of time interleaving all 11, the bitrate would be
cut in half, but it would still be a lot of fairly small requests, so it would
call for pipelined HTTP2.

~~~
modeless
Ah, so all the frames for every strip are interleaved in your container, and
you just read sequentially and ignore the frames you don't need? That's
probably the right thing for local videos.

I had each gop in a separate file like HLS or DASH (except for the background
which was a single file also containing the audio track). It's unwieldy but
makes HTTP streaming a little simpler because you don't need range requests or
an index.

Also, instead of bitstream hacking to stitch three strips into one, I encoded
multiple strips into "pre-stitched" views. This means that every strip is
encoded redundantly in multiple views, bloating the on-disk video size. But
for streaming that only affects the server, not the client, and it's nice for
the client to only download/decode one view at a time (plus the background)
instead of three strips. Bitstream hacking to join the strips would definitely
be better if it can work, though.

------
vesche
Finally a John Carmack post that isn't on Facebook! Great read too, can't wait
to see more.

~~~
goldenSilence
I think the distinction is between personal experiences and activity on
projects, where the latter requires review to determine whether disclosing the
information might reveal trade secrets.

------
v8engine
It mentions "...5120 pixels across 360 degrees is a good approximation. Anyone
over this on current VR headsets is simply wasting resources and inviting
aliasing."

How does increasing the resolution cause aliasing? Shouldn't it be the
opposite?

~~~
cptskippy
Today's VR headsets are low resolution. If your video source is higher
resolution than the display then it must be downsampled. Downsampling will
introduce aliasing artifacts unless an anti-aliasing pass is performed prior
to downsampling.

> Image scaling can be interpreted as a form of image resampling or image
> reconstruction from the view of the Nyquist sampling theorem. According to
> the theorem, downsampling to a smaller image from a higher-resolution
> original can only be carried out after applying a suitable 2D anti-aliasing
> filter to prevent aliasing artifacts. The image is reduced to the
> information that can be carried by the smaller image.

- [https://en.wikipedia.org/wiki/Image_scaling](https://en.wikipedia.org/wiki/Image_scaling)

In an ideal situation, your source resolution would match your target
resolution. If it doesn't, then you must expend resources scaling. VR can't
leverage the 2D scalers found in most hardware since the source data is being
mapped into 3D space. If you have to downscale your source, then on top of the
processing resources being used to scale in 3D space, you're also wasting
bandwidth delivering those resources at a higher resolution.
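
For illustration (none of this is from the article), here's a tiny 1-D sketch of the aliasing concern: decimating a signal whose frequency is above the new Nyquist limit, with and without a crude low-pass (anti-aliasing) step first.

```python
import numpy as np

# 24 cycles across 64 samples, i.e. above the Nyquist limit of the
# half-size signal we are about to decimate to.
n = 64
x = np.cos(2 * np.pi * 24 * np.arange(n) / n)

naive = x[::2]                                                   # drop straight to 32 samples
prefiltered = np.convolve(x, np.ones(4) / 4, mode="same")[::2]   # crude box low-pass first

# The naive version aliases: 24 cycles masquerade as 32 - 24 = 8 cycles.
# The crude pre-filter only attenuates that component (~4x here); a proper
# anti-aliasing filter would suppress it much further.
print(np.abs(np.fft.rfft(naive))[8], np.abs(np.fft.rfft(prefiltered))[8])
```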

~~~
wonderlady
I really don't like the idea that it's good enough now just because the
headsets are the bottleneck. Technology innovation can come from every area.
If others make better 360 cameras or players that exceed the 5K experience,
they should not be considered "wasted" just because one kind of headset can't
display at the same level. I do like the Oculus Go for its nice price and
acceptable experience, but there are quite a few headsets that perform better,
capable of displaying 8K or even above: the Vive Focus, the Gear VR with an S8
or S9, and the upcoming Pimax. So please don't hold back others, and
ourselves, on technology innovation! Eventually, if we take human eyes as the
ideal benchmark, we need to achieve 16K!

~~~
cptskippy
I don't think he's advocating that we stick to 5k but rather that 5k is good
enough for today's devices and as they improve we can up the resolution.

Advocating for higher than necessary resolutions today based on future
prospects is a bad gamble because consumer adoption is still in its infancy.
What if you make a poor user experience today in anticipation of a better
experience tomorrow but instead kill the market so there is no tomorrow?

Video game developers and video content providers are both well versed in
dynamically scaling to match the capabilities of the consumer.

------
jokoon
He talked about equirectangular projection. I wonder how he is achieving that,
with what kind of shader, I asked the question once, and it seems you cannot
do that with a vertex shader, but maybe a geometry shader?

I already read articles about the pannini projection, which is quite cool, but
I guess occulus is facing the same issues and has similar solutions...

~~~
undershirt
These are the steps for transformation that may answer your questions:

1. Cubemap (source) - 360° videos are often sourced from six different images
stitched together.

2. Equirect (video) - the cubemap is transformed to "equirect" trivially with
a fragment shader, which maps every output pixel to some input pixel.

3. Perspective (output) - mapping equirect video to a perspective projection
is also done with a fragment shader, just using a different transformation
depending on the focal point of the user.

4. Pannini (alternative output) - you mentioned this projection, and it's an
alternative to the "perspective" projection that allows a wider output FOV
that minimizes distortion in the periphery.
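
As a rough CPU-side sketch of step 3 above (not the actual shader, and all names here are made up): compute the ray for each output pixel, rotate it by the view direction, convert it to longitude/latitude, and read the matching equirect texel.

```python
import numpy as np

def perspective_to_equirect_uv(width, height, fov_y_deg, yaw=0.0, pitch=0.0):
    """For each output pixel of a pinhole view, return the equirect (u, v)
    texture coordinate to sample, both in [0, 1]."""
    f = 0.5 * height / np.tan(np.radians(fov_y_deg) / 2)
    xs, ys = np.meshgrid(np.arange(width) - width / 2,
                         np.arange(height) - height / 2)
    # Per-pixel ray direction (camera looks down +z), normalized.
    d = np.stack([xs, -ys, np.full(xs.shape, f)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Rotate by pitch (about x) then yaw (about y) to follow the viewer.
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    x, y, z = d[..., 0], d[..., 1] * cp - d[..., 2] * sp, d[..., 1] * sp + d[..., 2] * cp
    x, z = x * cy + z * sy, -x * sy + z * cy
    lon = np.arctan2(x, z)               # [-pi, pi] around the sphere
    lat = np.arcsin(np.clip(y, -1, 1))   # [-pi/2, pi/2]
    u = lon / (2 * np.pi) + 0.5
    v = 0.5 - lat / np.pi
    return u, v

u, v = perspective_to_equirect_uv(640, 480, fov_y_deg=90)
```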

~~~
jokoon
How would you achieve a 170-degree Pannini projection in real time using
OpenGL? Do I really need a cube map?

~~~
undershirt
You can use maps containing fewer faces than a cube, or any partial mapping
that covers the 170 degree area you need.

I researched this a bit here:
[https://github.com/shaunlebron/blinky](https://github.com/shaunlebron/blinky)

------
Njkl
This might not be possible if the "system software" doesn't cooperate, but
it's possible to encode videos such that you can keep them "warm" without
decoding all frames, for example in an IBBPBBPBBPBB structure where the
B-frames are not referenced by any other frame (other arrangements are
possible). Forcing this structure has a cost, but it's much smaller than
having more I-frames. You can then alternate decoding 3 such streams (each one
offset by 1 frame, including the I-frames; this is not a problem, it just
means you won't be ready to output anything for 2 frames after a seek) for
the cost of 1. Switching to 60fps is then "instantaneous". Old iTunes used to
encode h.264 video like this (with a PBPBPB structure, so it could play at
half rate, which it did if the CPU couldn't keep up). Note that "unreferenced"
does not imply B-frame, nor the other way around.
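
As a toy sketch of the scheduling this buys you (assuming a repeating R B B pattern where R is a referenced I/P frame, B an unreferenced B-frame, and three 60 fps streams offset by one frame; purely hypothetical illustration):

```python
FPS = 60
PATTERN = ["R", "B", "B"]  # R = referenced I/P frame, B = unreferenced B-frame

def frame_type(frame, offset):
    return PATTERN[(frame - offset) % len(PATTERN)]

offsets = {0: 0, 1: 1, 2: 2}  # three streams, each shifted by one frame

# "Warm" mode: decode only the reference frames of every stream.
warm = sum(1 for frame in range(FPS)
             for offset in offsets.values()
             if frame_type(frame, offset) == "R")
print(warm)   # 60 decodes/sec total, the cost of one full 60 fps stream

# Switching stream 0 to full rate: its references are already decoded, so
# only its B-frames remain, and each depends only on frames we already have.
extra = sum(1 for frame in range(FPS) if frame_type(frame, offsets[0]) != "R")
print(extra)  # 40 more B-frame decodes/sec to bring stream 0 up to 60 fps
```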

Another (admittedly crazy) idea, for a setup with a lower-res version and a
higher-res overlay, is trying to store the difference only, affording a
(significant?) bitrate reduction for the high res "patches". This is very
tricky to do in practice, though (needs larger range or losing the lsb; the
codecs aren't really designed for this). I don't think it has ever been done.
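
In case it helps to picture the re-centering trick, here is a rough numpy illustration (purely hypothetical, not anything anyone ships): store the high-res patch as a residual against the upscaled low-res layer, giving up the LSB so it fits back into 8 bits for an ordinary codec.

```python
import numpy as np

def encode_residual(high, low_upscaled):
    # Residual is in [-255, 255]; halve it (dropping the LSB) and re-center.
    diff = high.astype(np.int16) - low_upscaled.astype(np.int16)
    return (diff // 2 + 128).astype(np.uint8)

def decode_residual(residual, low_upscaled):
    diff = (residual.astype(np.int16) - 128) * 2
    return np.clip(low_upscaled.astype(np.int16) + diff, 0, 255).astype(np.uint8)

high = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
low_up = np.clip(high.astype(np.int16) + np.random.randint(-10, 10, (8, 8)), 0, 255).astype(np.uint8)
rec = decode_residual(encode_residual(high, low_up), low_up)
print(np.abs(rec.astype(int) - high.astype(int)).max())  # at most 1, from the dropped LSB
```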

------
seanalltogether
I always wondered if HTM spheres are a worthwhile solution for smaller
textures. You start with 8 equilateral triangles that make up the sphere and
subdivide as needed. You could possibly use a square texture made of 8 right
triangles that each need minimal distortion (as opposed to an equirectangular
texture).

[https://arxiv.org/ftp/cs/papers/0701/0701164.pdf](https://arxiv.org/ftp/cs/papers/0701/0701164.pdf)
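
For what it's worth, the geometric side of that idea (repeatedly splitting the octahedron's 8 spherical triangles) is easy to sketch; this toy snippet is just my own illustration of the subdivision, not the paper's texture layout.

```python
import numpy as np

def subdivide(tri):
    # Split one spherical triangle into 4 by inserting normalized edge midpoints.
    a, b, c = tri
    ab = (a + b) / np.linalg.norm(a + b)
    bc = (b + c) / np.linalg.norm(b + c)
    ca = (c + a) / np.linalg.norm(c + a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]

# The octahedron: 6 vertices on the unit sphere, 8 triangular faces.
v = [np.array(p, dtype=float) for p in
     [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]]
faces = [(v[i], v[j], v[k]) for i, j, k in
         [(0, 2, 4), (2, 1, 4), (1, 3, 4), (3, 0, 4),
          (2, 0, 5), (1, 2, 5), (3, 1, 5), (0, 3, 5)]]

for _ in range(3):  # 3 levels of subdivision: 8 * 4**3 = 512 triangles
    faces = [t for f in faces for t in subdivide(f)]
print(len(faces))  # 512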

~~~
FreeFull
Alternatively, you could maybe use a cube map. You would still have a bit of
distortion, but that could be handled during the video generation anyway.

~~~
samplatt
Distortion visible in cube maps goes doubly so when viewed in VR. Cube maps
are easy but not optimal.

~~~
slavik81
What's the source of the distortion?

------
iamleppert
Another approach to space efficiency would be to do away with the need for
dual video streams entirely and just average the stereo images together to
form a single monocular image. Then, send a disparity map along with the
monocular video. Decode the mono video and use the disparity map to
interpolate the view of either eye. You’ll have all the information you need
for reconstruction and the disparity map can be efficiently compressed via
normalization and perhaps even by sending just the vectors of the contours.
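
A very crude single-row sketch of that reconstruction step (everything here is made-up illustration): shift each mono pixel by half its disparity toward either eye, which also exposes the holes that occlusion handling would have to fill.

```python
import numpy as np

def synthesize_eye(mono_row, disparity_row, sign):
    """Warp one row of the mono image toward one eye (sign = +1 or -1)."""
    out = np.zeros_like(mono_row)
    filled = np.zeros(mono_row.shape, dtype=bool)
    for x in range(mono_row.shape[0]):
        tx = x + sign * int(round(disparity_row[x] / 2))
        if 0 <= tx < mono_row.shape[0]:
            out[tx] = mono_row[x]
            filled[tx] = True
    return out, filled  # ~filled marks holes that need in-painting

mono = (np.arange(16) * 16).astype(np.uint8)
disp = np.full(16, 4.0)                       # constant disparity, for simplicity
left, left_holes = synthesize_eye(mono, disp, +1)
right, right_holes = synthesize_eye(mono, disp, -1)
```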

Another idea is to take advantage of the fact that head motions are really
just a translation of the camera. There's no need to re-send pixels that have
merely shifted position unless they have actually changed over time.

If I was designing such a system I’d try to take advantage of the fact there
isn’t a lot changing fundamentally in the scene when you move your head, and
maintain some sort of state and only request chunks of pixels that are
actually needed. You wouldn’t even have to use a traditional video codec as
the preservation of state would be far more efficient than thinking about
things in terms of flat pixels and video.

~~~
munificent
_> Decode the mono video and use the disparity map to interpolate the view of
either eye._

By "disparity map" are you thinking something like a heightmap applied to the
scene facing the viewer and then you use that to skew things for each eye?

If so, how would that handle parts of the scene that are occluded/revealed to
one eye but not the other?

~~~
iamleppert
True, occlusions would be a problem, but we're talking about fake
autostereoscopic 3D here, where most of the stereo rigs used for capture have
but a modest baseline. Almost all of the depth perception comes from
disparity; occluded regions would still in fact be very visible with the
averaging method I described, and would land at whatever depth plane the
occluder is on, which is probably a good guess anyway. It's not like your
other eye would receive a correspondence from an occlusion in the real world.

~~~
romwell
FYI, there's online software[1] to recreate 3D/stereoscopic 3D imagery from
the depth-enabled photos taken e.g. by Moto G5S (which has a dual-camera setup
that computes the depth map, but no API to extract/store the image taken by
the other camera).

My personal opinion is that true stereoscopic images feel better when there's
enough detail; those occlusions do matter. For some imagery it doesn't matter
as much though.

[1][http://depthy.me](http://depthy.me)

------
DanielBMarkham
(Much hand-waving here)

I wonder if the problem isn't a misaligned paradigm for just what a codec
does. Right now, codecs exist to take bytestreams and fill frames at a certain
rate. They're a bitstream-to-framerate device.

What if, instead of delivering to a frame during a certain time period, the
codec had to instantly deliver whatever it could, but only geared to the
actual acuity of the user's retina? Seems like there would be less information,
you would never have lag, and all optimization would be around the detail of
data delivered, not framerate or screen size.

I also do birthday parties and Bar Mitzvahs, folks. I'm here all week. (I
figure it couldn't hurt to throw crazy ideas out there. Every now and then a
crazy idea actually amounts to something. Coding is cool because it not only
lets you solve problems, sometimes it lets you change the universe the problem
lives in. Good luck, John!)

~~~
undershirt
only delivering the visible portion of a 360° video assumes that the video is
ready to be decoded at any angle, which is the entire problem.

~~~
DanielBMarkham
But the problem is to display something for a moving angle, not display
something for everything. As best I can tell, it is a different problem.

I want to restate what I'm hearing so you can correct me. You are saying
something like "Gee, if we could show any retinally-matched conic the user's
pointing their eyeball at? To do that we'd have to have the entire scene
rendered anyway"

I'm saying you can't do it now. It hurts when I do that. So stop doing that.
Instead of trying to render stereo scenes quicker, try to render a moving
conic real-time quicker. When the codec runs, it's optimizing areas of a frame
changing over time. If instead it optimized possible vision movement paths,
you'd end up solving the framerate and resolution problems. Then you could
concentrate on optimizing the codec in a different way than people are
currently trying to optimize it. It couldn't hurt, and there may be
opportunities for consolidation if you look at the problem as visual-movement-
path-rendering instead of frame rendering. I don't know. I just know it's a
problem now. Set your constraints differently and optimize along a different
line. Sometimes that works.

Does that explain the differences here? Or do you want me to start spec'ing
out what I mean by a retinal-path codec? At some point this will be a bit over
the top.

ADD: I'll say it a little differently. Our constraint now is "how fast can you
render this frame". Codecs are built for rendering frames on screens where
people watch movies and play games. But that's not the world we're trying to
solve a problem in. They may be configured the wrong way. Instead, require
that anything the eyeball is looking at should render consistently in, say
5ms.

This sounds like the same thing, until you realize that the eyeball can't look
at everything at the same time. Different parts of the image are temporally
separated. So if I update the image in back of your head once every 200ms?
You're not going to know. As far as you're concerned, it's all instant.

It becomes a different kind of problem.

~~~
undershirt
You want to predict focal paths and encode video for them ahead of time?

~~~
DanielBMarkham
I think we're close now.

I want to predict the temporal cost of focal path movement, then optimize the
stream based on those temporal "funnels" I guess you'd call them. This is in
opposition to looking at segments of the screen all equally and optimizing
across frame changes. I don't care about frame changes. All I care about is
how fast the eyeball can get from one spot to another -- and that's finite. If
I can create the image I want instantly wherever the eyeball is looking,
framerate and resolution issues are a non sequitur. (This would also probably
scale to denser displays more easily, but I'm just guessing.)

------
Daemon404
The slice approach should be pretty readily doable today with e.g. libx264
(which you can force into exactly N rectangular (full-row) slices using
i_slice_count), calculating N from the resolution of the eleven vertically and
horizontally stitched clips and their boundaries. (And with VP9 using some
crafty tile-column/tile-row setup, maybe...)

This is assuming the videos are stitched pre-encode, of course... From the
post, it almost sounds as if the idea would be to stitch independent H.264
streams into a new unified one using slice mangling + slices... which would be
pretty crazy stuff.

(As a side note, it's a shame flexible macroblock ordering is only in the
baseline and extended profiles... I still don't understand that decision at
all.)

EDIT: Dawned on me that the hard part is on the client/decode side. D'oh.

------
sigi45
If you have the keyframe always at the seek time, would that be enough?
Because you said the delay is too long?

If yes, would it be possible to render, for every second (half second or
whatever resolution you like), a snippet which only contains the seek time
until the next keyframe, and then tell the webserver 'if someone seeks to time
x, take this snippet and when done, jump to the original video'?

You could do that transparently with some FUSE filesystem.

------
masonicb00m
John Carmack's attention to quality and detail is incredibly inspiring.

------
holografix
Legitimate question: are there “untouched nature” type simulations for one of
the leading VR platforms? Sitting at the top of a mountain, different times of
the day and weather. Exploring a huge, very realistic forest. Being a pigeon
observing a busy street from above as you fly around and perch yourself on
different spots.

------
Maurice_Ribble
I find it interesting that you went in a different direction than the earlier
ideas discussed for this problem. I can see where this is simpler in many ways
and probably more importantly a better fit for a lot of content than some of
those early ideas.

You mentioned the overhead of handling many streams being a problem. Can you
go into more detail on the type of overhead you saw, and do you feel this is
something that can be addressed with better drivers or some application
changes? Or do you think the only solution is packing multiple stripes into a
single stream using the encoding bit manipulation you mentioned? It seems a
shame that hardware which can support more streams leaves them inaccessible to
real-world applications.

------
mattbierner
Fascinating. Amateur question though: are there other ways to encode the
images besides pixels that will ultimately be better suited to 360 content?
Holograms came to my mind because they degrade gracefully as you toss out bits
of the signal.

~~~
samplatt
Lytro's light-field camera technologies have _huge_ potential for 360 content
and the photogrammetry industry, but they effectively sat on the technology
doing very little with it (compared to their competitors), until Google hired
some of the employees as the company shut down earlier this year.

------
Retric
"5120 x 5120 at 60 fps" here is why that's overkill. If you start putting dots
on a sphere you can go in one direction turn 90 degrees, put 5120 another
axis. However, if you tile like that with 5120 x 5120 you get a lot of wasted
pixels at the poles.

If a 5120 pixel Radius is fine... then Radius of sphere = 2 pi * r, and
Surface of sphere is 4 pi * r^2. So 5120 / 2pi =r substitute for r > 4 *
(5120/2pi)^2 simplify > (5120)^2 /(pi) ~= we need ~1/3 aka 1/pi of 5280^2.

However, I suspect you actually want more than 5120 pixel radius.
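
Spelling out that arithmetic (my own quick check of the numbers above, nothing more):

```python
import math

circumference = 5120                     # pixels around a great circle
r = circumference / (2 * math.pi)
sphere_pixels = 4 * math.pi * r ** 2     # = 5120**2 / pi
print(sphere_pixels / circumference ** 2)  # ~0.318, i.e. roughly 1/pi of 5120^2
```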

~~~
JohnCarmack
In the center of the lenses, the circumference resolution is a bit over 5120,
but definitely less than 5760. Even at 5120, it is a bit overkill (and
potentially aliasing) at the edges:
[https://twitter.com/ID_AA_Carmack/status/975198157838499840](https://twitter.com/ID_AA_Carmack/status/975198157838499840)

You are off by a factor of 2 in your pixel calculation, because 5k x 5k is for
a stereo pair of spheres. Equirect projections waste a fair amount, but
compared to the 300% miss to get to 60 fps stereo, it isn't dominant.

~~~
rgbjoy
Would we ever see a dynamic resolution that tracks where you are looking and
just lowers resolution outside the point of interest? Wouldn't that save a
lot?

~~~
thefreeman
[https://en.wikipedia.org/wiki/Foveated_rendering](https://en.wikipedia.org/wiki/Foveated_rendering)

~~~
rgbjoy
I thought I was smart for just 10 seconds of my life... thanks for the link

------
nomercy400
You don't need 5120 (for 360 degrees on X-axis) x 5120 (for 360 degrees on
Y-axis) do you? If you cover 360 degrees in the X direction, then your Y-axis
only has to be 180 degrees, because you don't need 'behind you' on the Y-axis
as that is already covered by the X-axis' wide 360 degrees. Or am I making an
error here?

~~~
michaelt
It's stereo, so Carmack means 360°x180°x2 for two eyes.

One would assume there would be a lot in common between stereo cameras' views,
so presumably there are compression efficiencies to be found.

------
Ono-Sendai
> Directly texture mapping a video in VR gives you linear filtering, not the
> sRGB filtering you want.

Isn't this back-to-front? Generally filtering is considered to work better in
a linear colour space. Using an sRGB texture will convert each pixel to a
linear colour space before the reconstruction filtering is done, AFAIK.
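
For anyone wanting to see the difference being argued about, here's a quick sketch (my own illustration, using the standard sRGB transfer function) of averaging stored sRGB code values directly versus converting to linear light, averaging there, and re-encoding:

```python
def srgb_to_linear(c):
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c):
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

black, white = 0.0, 1.0
naive = (black + white) / 2                    # filter the stored code values: 0.5
correct = linear_to_srgb((srgb_to_linear(black) + srgb_to_linear(white)) / 2)
print(naive, round(correct, 3))                # 0.5 vs ~0.735: visibly different midpoints
```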

------
MikusR
Not mentioned in the article, but the Exynos versions of the S8/S9 don't need
these tricks.

~~~
JohnCarmack
No, Exynos has the same block limit as the snapdragon chips -- 4k60. The
difference is that Exynos doesn't have the same 4096 maximum dimension limit,
so it can do 5120x2560 (monoscopic) at 30 fps, while snapdragon can only
decode 4096x2560 at 30 fps. The view dependent player is about playing
5120x5120 (stereo) at 60 fps.

~~~
MikusR
It can play an HEVC file that is 6144x3072 at 60fps.

~~~
JohnCarmack
I just tried, and while it does decode a 6kx3k 60 fps video, which is very
admirable, it doesn't hold 60 fps while doing it. There are probably encoding
options to minimize work on the decoder that could let you push it a bit more.
MediaExtractor seems to be arbitrarily limited to a lower resolution, but that
can be bypassed.

~~~
MikusR
Can you try with Skybox VR? At least on my Note 8, Oculus Video really doesn't
like H.265 video that's more than 4k.

------
mrybczyn
The retina has 100 million photoreceptors per eye. We also need a higher FPS.
Maybe 120 is good enough?

~~~
anonytrary
Correct me if I'm wrong, but isn't the difference between 60Hz and 120Hz
refresh more or less imperceptible to the average human? I'm sure there's a
distribution, but I'd be hard pressed to find a person who could differentiate
between 100Hz and 120Hz refresh. It seems like a waste to push rendering
beyond the point at which we can even tell there's a difference.

Edit: Thanks for the feedback, I guess it is perceptible. Nevertheless, I
think my argument becomes valid at some N. Sure, N !== 60, but N = 144 or 120
may be more reasonable. I'm not too concerned with what N is, more so with the
fact that "doubling the refresh rate" _eventually_ becomes an act of futility.

~~~
seanmcdirmid
It is definitely perceptible. Compare a normal iPad (60 Hz) to an iPad Pro
(120 Hz), the fluidity of movement is very apparent in just playing around
with the home screens.

~~~
kbenson
It may be perceptible, but the number of factors that could affect the
different systems in that comparison means it's not exactly a good test. At
the simplest level, there's no guarantee it's actually even rendering updated
frames at the rates in question if it's limited by some other factor, and the
differing hardware may change at what point that limit is hit.

~~~
seanmcdirmid
You can drop frames at 60Hz as well.

But really, when you achieve 120Hz, it’s beautiful, it reminds me of when
retina displays came out. We are a bit closer to realistic rendering.

~~~
kbenson
> You can drop frames at 60Hz ss well.

Yes, what I was trying to get at is that just because the hardware is capable
of 60 frames in a second, that doesn't mean the software was delivering 60
frames a second. The iPad Pro has a different processor than the iPad (A10X
Fusion vs A10 Fusion), and in a lot of tests it's significantly faster.[1]

The iPad Pro does have more pixels to push around, but that doesn't exactly
negate the CPU difference, it just makes it more complicated to draw an actual
comparison. And that's before we even get to the actual graphics processor,
which itself could do a better job of offloading some processing to hardware
(better OpenGL/Metal/whatever support). For all we know, you were seeing an
average of 35 updated frames a second on the iPad, and you're now seeing an
average of 55 updated frames on the iPad Pro. In that case, the doubling of
the screen refresh might help a little (in reducing noticeable laggy frames a
bit, as it can update between what would be frames at 60Hz), but it wouldn't
be earth-shattering. I doubt it's that bad, but as an example, this should
show how a Hz rating on what a screen is capable of doesn't mean much.

The real benefit of higher screen refresh rates is to better support different
lower _native refresh rates_. Much video content is at 24 FPS. A 30Hz or 60Hz
screen can't represent that faithfully, and will need to double some frames. A
120Hz screen can perfectly represent 24 FPS content[2], and that's the real
reason screens (and TVs) ship with that refresh rate. Different media
(television, internet video, DVDs, Blu-Rays, video game systems, etc) all have
different refresh rates they want to deliver.

1: https://www.notebookcheck.net/A10-Fusion-vs-A10X-Fusion_8178_9162.247596.0.html

2: I'm ignoring that it's often actually 23.976 FPS or something.

------
pcnix
With John Carmack at the problem, this is definitely something I'll be keeping
a close eye on. For people that are unaware, he's one of the original creators
of Doom, and his innovations in computer graphics are legendary.

~~~
ubertakter
Not disparaging, but how likely is it that someone doesn't know who John
Carmack is and reads Hacker News? Surely the Venn diagram of those two sets
doesn't have much overlap...

~~~
sp332
You're probably underestimating how popular HN is.

And how old you are. (ducks) But seriously, it's 2018! It's the distant future
where Doom is run in emulators in browser windows.

------
rasz
I am somewhat shocked Oculus is wasting Carmack's time and potential on
something nobody wants (watching 360 panorama movies).

~~~
sturmen
I like 360 panorama movies. Done right, the feeling of presence is excellent.
VR is a niche to begin with, so I would hesitate to say that any one
application (ex. CAD, gaming, 360 panoramas, etc) is the "killer app" so far
since the audience is so small. As it grows (which I pray it will), we'll see
definitive trends emerge.

~~~
white-flame
The feeling of presence is so limited compared to realtime-rendered 3d spaces
where you can move your head around from the central camera point even just a
bit. Adding some artificial parallax shift to the movie frames might be enough
to give it that extra oomph to feel truly immersive.

~~~
sp332
The Oculus Go doesn't have 6 DoF, and he's been working on that and phone-
based VR for the last few years. Because they're cheaper, they move a lot more
units! And because they're more constrained, they need more optimization
attention from someone like Carmack.

