On a more serious note, it's software architecture done on paper. Any experienced programmer will tell you it's close to worthless. You don't have to be a graphics expert. Abstraction of passes and pipelines isn't anything new; it has even bled into modern APIs like Vulkan. This post is just some inexperienced programmer's vague description of how he or she would like things to work in theory. It's not worth a discussion.
For instance, in our engines (Nebula2 and 3) there are so-called "frameshader" files (XML; nowadays it would be JSON, of course) which describe a frame as a sequence of passes. The result of a pass is a valid render target texture or the final visible image, and a pass consists of per-material batches (or buckets).
Higher up are 'stages, views, and entities': a stage is a collection of graphics entities, a view is a 'view into a stage' (a view owns a camera and a frameshader), and entities can be models, lights, and cameras. Multiple views can be attached to the same stage, and views can depend on each other (although in the >10 years we've used this concept, dependent views were hardly ever needed).
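If it helps, here's a minimal C++ sketch of that split; the names are made up for illustration and aren't the actual Nebula types:

```cpp
#include <memory>
#include <vector>

struct Entity { virtual ~Entity() = default; };   // model, light, or camera
struct Camera : Entity { /* view + projection */ };

// A frameshader describes a frame as a sequence of passes
// (loaded from XML/JSON, as described above).
struct FrameShader {};

// A stage is just a collection of graphics entities.
struct Stage {
    std::vector<std::shared_ptr<Entity>> entities;
};

// A view owns a camera and a frameshader and looks into one stage;
// multiple views may be attached to the same stage.
struct View {
    std::shared_ptr<Camera>      camera;
    std::shared_ptr<FrameShader> frameShader;
    Stage*                       stage = nullptr;
};
```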
>> Passes can receive as inputs: [...] The previous Pass outputs in the form of textures in memory (Render Targets)
This feels a bit arbitrarily limited. I find rendering passes more 'graph-like' than a pure linear pipeline. For example, you might render some shadow maps - some of these might be bound to a single camera (e.g. traditional directional cascaded shadow maps fit to the camera frustum), but other shadow maps could easily be shared between multiple viewpoints (e.g. point-light dual-paraboloid shadow maps). And even your 'per-camera' shadow maps might be reused between multiple viewports, e.g. for two very similar eye cameras in VR, whereas other parts of your scene will need to be re-rendered wholesale per eye (unless you start implementing e.g. NVIDIA's single-pass multi-viewport rendering - how does that fit into this UML diagram?)
And then of course you might have multiple passes reading from the same gbuffer(s) or the scene depth buffer, or even the result of previous frames (temporal anti-aliasing and motion blur), and other things that aren't really per-camera (dynamic light probes for reflections, etc.)
I also have 'fond' memories of bugs from the magic configuration of passes, where inputs and outputs were inferred from their position in a list. I'd lean strongly towards being completely explicit about inputs and outputs. It's more dumb code, but it'll make things beautifully straightforward to understand, modify, and fix.
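Something like this, as a minimal sketch (all names hypothetical): every dependency is declared, nothing is inferred from list position, and the shared shadow map feeding both eye passes is visible at a glance.

```cpp
#include <string>
#include <vector>

struct TextureHandle { int id = -1; };

struct PassDesc {
    std::string                name;
    std::vector<TextureHandle> inputs;   // read by this pass
    std::vector<TextureHandle> outputs;  // written by this pass
};

const TextureHandle shadowMap{0}, gbufLeft{1}, gbufRight{2};

// Reordering this list can't silently rewire the frame.
const std::vector<PassDesc> frame = {
    { "pointLightShadow", {},          {shadowMap} },
    { "eyeLeft",          {shadowMap}, {gbufLeft}  },
    { "eyeRight",         {shadowMap}, {gbufRight} },
};
```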
>> The basic concept of boiling things down into passes with inputs and outputs that can be managed as data
This has been overkill for every single one of my hobby projects, just more cruft to wade through, a solution searching for a problem. At least one of the more graphically simple professional titles I've worked on didn't even bother with this, and I didn't miss it. That said, the professional rendering code I'm most familiar with (having ported it from D3D9-only to multiple other rendering APIs) did have something similar, and it seemed reasonable enough. If you want to go "crazy" exposing control of rendering passes directly to your artists to tweak with scripts or other UI, it's a reasonable thing to represent. It can also make adding some debug tooling easier (e.g. profiling information about different passes, adding debug views of the outputs of individual stages, etc.)
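As one example, a pass list makes per-pass timing almost free. A sketch, with a hypothetical GpuTimer standing in for API timestamp queries (vkCmdWriteTimestamp, D3D12 query heaps, etc.):

```cpp
#include <cstdio>
#include <vector>

struct GpuTimer {
    void begin() { /* issue start timestamp */ }
    double endMs() { /* issue end timestamp, resolve */ return 0.0; }
};

struct Pass {
    const char* name;
    void (*execute)();
};

// Wrap each pass in a timer and dump the results.
void runFrameWithProfiling(std::vector<Pass>& passes, GpuTimer& timer) {
    for (Pass& p : passes) {
        timer.begin();
        p.execute();
        std::printf("%-20s %6.2f ms\n", p.name, timer.endMs());
    }
}
```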
A nifty graphics study of DOOM: http://www.adriancourreges.com/blog/2016/09/09/doom-2016-gra...
Thanks for the countless hours debating random programming topics. #gamedev was a pretty wonderful incubator for a 14-year-old. Whenever I ran into a wall that couldn't be solved with googling, someone was usually able to give some sort of tip or an arcane D3D incantation to point me in the right direction. Hope you've been well!
Been well, hope the same for you :).
If all you're making are smartphone games, this might work. But for more complex scenarios, this sort of thing would be a non-starter. You ideally need to support N:1 cameras to pipelines - not just N=2 for VR, but N=3 for VR plus spectator mode, and N=4+ for projection mapping onto surfaces as well.
I say "might" because smartphones are getting more powerful and will probably be decent VR platforms very soon (as in 2 to 3 years, i.e. about when such a new project would be usable).
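For illustration, a minimal sketch of what N:1 looks like (types hypothetical): one pipeline consumed by however many cameras the frame needs.

```cpp
#include <vector>

struct Camera { /* pose + projection */ };

struct Pipeline {
    void renderSharedWork() { /* shadow maps, light culling: once per frame */ }
    void renderView(const Camera&) { /* per-camera pass sequence */ }
};

void renderFrame(Pipeline& pipeline, const std::vector<Camera>& cameras) {
    pipeline.renderSharedWork();       // view-independent work, done once
    for (const Camera& cam : cameras)  // N = 2 (VR), 3 (+spectator), 4+ ...
        pipeline.renderView(cam);
}
```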
The practical complication in rendering can be stated like this: combine assets A and B and parameter P to make asset C, then (subsequently) combine asset C with asset D and parameter J to make output O. Some of the assets/parameters change with each frame rendered, others are entirely static, and oversight over GPU time and memory usage is needed, so you have to consider which things are loaded when. Then, while in production, a design change necessitates that a new parameter be added somewhere, and it turns out that you have to reorder which things are processed first and add a custom codepath because one of your target GPUs doesn't support the feature you need without a hack.
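Written out as code, the chain looks deceptively simple (the types and the combine() helper here are hypothetical). What a generic system actually has to answer is which results are cacheable and when each input must be resident:

```cpp
struct Asset {};
struct Param {};

Asset combine(const Asset&, const Asset&, const Param&) { return {}; }

Asset buildOutput(const Asset& A, const Asset& B, const Param& P,
                  const Asset& D, const Param& J) {
    Asset C = combine(A, B, P);   // static inputs? cache C across frames
    return combine(C, D, J);      // per-frame parameter? recompute O every frame
}
```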
Memory is tight - GPU memory might be a few gigs, and your artists will fill the space. Time is tight - 16ms/frame max if you're targeting a smooth 60fps, and even higher framerates (and tighter time budgets) are recommended to e.g. reduce nausea for VR. You know exactly how long you need some resources, they are large (over 100MB for a single set of 4k gbuffers), and spilling active data out of the cache can be disastrously slow, or lose necessary information you cannot recreate.
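As a sanity check on that ">100MB" figure, assume four RGBA8 targets at 4K (one possible layout; wider formats like RGBA16F double it, and depth isn't counted here):

```cpp
#include <cstdio>

int main() {
    constexpr long long width = 3840, height = 2160; // 4K
    constexpr long long bytesPerPixel = 4;           // RGBA8
    constexpr long long targets = 4;                 // one gbuffer set
    constexpr long long total = width * height * bytesPerPixel * targets;
    std::printf("%.1f MiB\n", total / (1024.0 * 1024.0)); // ~126.6 MiB
}
```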
At some point you'll get rather hands-on and start micromanaging a lot of this stuff - it's much simpler to write, debug, and optimize a lot of this if you implement it 'by hand' than to design, write, debug, and optimize an algorithm trying to provide a perfectly generic solution to all these problems.