You can run prebuilt systems like the MVS 3.8 Turnkey System, or build MVS 3.8 from the distribution tapes yourself and learn much about these systems' internals. Also, you can run pretty much anything that runs on mainframes, like MTS and z/Linux.
I have a custom MVS 3.8 installation that I use for learning, and it's basically indistinguishable from a modern z/OS installation from an operations point of view. I learned to write HLASM, JCL, APL, and a lot of other ancient programming languages.
ps. There's a magical tumblr which masscans the internet every day and posts screenshots of mainframes which are directly connected to the internet (with no firewall/vpn/security) @ http://mainframesproject.tumblr.com/
The first paragraph in Wiki refers to a "DAT box". This translated virtual addresses to physical. IIRC it was 8 entries, fully associative.
The reason it was called a "box" is because it was huge, perhaps the size of a refrigerator. Yes, this single CPU function was about the size of a "rack" of today's computers (the dimensions aren't exactly comparable, the DAT box was wider and not as tall as a modern-day rack).
The reason it had 8 entries was because that was the minimum number required by the instruction set. You could have an "execute" instruction which crossed two pages (2 byte alignment, 4 byte instruction). It could target something like a "move character" (MVC) instruction (a memory-to-memory move of up to 256 bytes) that also might be in two pages. The MVC source and target could each be across two pages.
2 pages: execute instruction
2 pages: move character instruction
2 pages: source operand
2 pages: destination operand
8 entries: minimum TLB size needed by the architecture
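The worst-case count above can be sanity-checked with a quick sketch (Python rather than assembler; the 4 KB page size and the specific addresses are illustrative — each item is simply placed so it straddles a page boundary):

```python
def pages_touched(addr, length, page_size=4096):
    """Return the set of page numbers the byte range [addr, addr+length) touches."""
    return set(range(addr // page_size, (addr + length - 1) // page_size + 1))

PAGE = 4096
# Place each item so it straddles its own page boundary (the worst case):
ex_insn  = pages_touched(PAGE * 1 - 2, 4)      # 4-byte EX crossing pages 0/1
mvc_insn = pages_touched(PAGE * 3 - 2, 6)      # 6-byte MVC crossing pages 2/3
src      = pages_touched(PAGE * 5 - 128, 256)  # 256-byte source crossing pages 4/5
dst      = pages_touched(PAGE * 7 - 128, 256)  # 256-byte target crossing pages 6/7

all_pages = ex_insn | mvc_insn | src | dst
print(len(all_pages))  # → 8 distinct pages, hence 8 TLB entries
```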
Each bit of each of the 8 physical address registers went to a light bulb. When the insides of the DAT box were pulled out of the enclosure (this was possible while it was running) you could watch the lights blink. They did blink a lot, up until you hit a tight loop in FORTRAN. Then they'd freeze for a minute or more! That was one way to watch your program execute.
When I was in college, Assembler was taught in IBM 360 Assembly on an IBM 370 using CMS as the OS. It was quite a bit different from programming a 6502 in HS. I still have the banana book somewhere. These were very interesting machines, but you could really suck up the shared time with a tight loop.
To add: XEDIT was the primary editor, and it had some good points along with a lot of bad. Also, for the life of me I cannot remember using directories when displaying files on that machine. I spent quite a lot of time helping people get the JCL right for the statistics package they were using.
Directories only showed up in CMS many years later. In the early days (when the IBM 370 was still around), the only way of grouping files was by having multiple virtual disks and assigning letters to them: A-disk, B-disk, C-disk, etc.
The operating system had an instruction called "supervisor call", or SVC. When you needed the OS to do something for you (like read from a file) you did an SVC. Much like making a system call into Linux.
The 2nd byte of SVC was an operand. It meant you could have up to 256 different requests to make of the OS. But IBM didn't define all of them.
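As a toy model of the dispatch (Python rather than assembler; the handler names and slot numbers here are made up): the SVC instruction is two bytes, the opcode 0x0A followed by the request number, and the OS uses that second byte to index a 256-entry table:

```python
# Hypothetical handlers — real SVC numbers and services varied by OS.
def svc_read():
    return "read a record"

def svc_write():
    return "write a record"

handlers = [None] * 256     # IBM didn't define all 256 slots
handlers[0] = svc_read      # slot assignments here are illustrative
handlers[1] = svc_write

def dispatch(insn: bytes):
    """Decode a 2-byte SVC instruction and call the matching handler."""
    opcode, number = insn[0], insn[1]
    assert opcode == 0x0A, "not an SVC instruction"
    handler = handlers[number]
    if handler is None:
        raise RuntimeError(f"SVC {number} not defined")
    return handler()

print(dispatch(bytes([0x0A, 0x00])))  # → "read a record"
```

The undefined high-numbered slots are exactly where a site's systems programmers could hang their own custom services.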
So our school system programmers used some of the highest SVC numbers (255 on down) to do various custom things. But, these functions could also be dumped in hex and disassembled.
When you did that you found all sorts of magical things you could ask the OS to do for you. E.g. let you write anywhere in memory. From there it was a few instructions before you were "root", or what passed for the equivalent in those days.
Security thru obscurity. It didn't work in the early 1970s. It still doesn't work today.
The normal SVC (supervisor call) instruction was the API for programs to talk to CMS, the operating system that was most commonly run on top of the virtual machines. It was actually possible to run several different operating systems on the virtual machines, including CP (the hypervisor) itself. Systems programmers would do that for testing new versions of CP on a live system.
There were disk letters, used as the filemode in a CMS file ID:
TEST PASCAL A
A0 (files with mode number 0 were invisible when the disk was accessed read-only)
Those were the days. VM/CMS ... I spent many long nights as a teen hacking on series 370 mainframes.
XEDIT had a lot going for it. On full-screen terminals a lot of the interaction was local, so basically zero latency. The keyboards on these things were basically the best keyboards ever made. Prefix commands. The ultimate in programmability/customization via EXEC/EXEC2/Rexx. Email applications were written using XEDIT as a base. Being able to tab into the prefix area and operate on a bunch of lines (cc anyone?) made a lot of common operations very easy.
Very different than vi or emacs that were designed for character oriented devices. XEDIT was designed for those full screen terminals and had a unique feel to it. It was very hard for me to switch to other editors...
The funny thing I remember was how you typed commands in the ===== at the left side of the screen. It was quite nice for that.
Despite its reputation, IBM had some amazing people (e.g. the name Mike Cowlishaw comes to mind) who apparently had the freedom to do some cool stuff. There was source code for everything (you got source code for your OS) ages before anyone knew what "open source" was, and there was a vibrant community of sharing code between the various academic sites who ran IBM mainframes.
The main issue was job completion predictability - most things we do with computers are fundamentally batch, and almost all the really, really important ones like bank account daily settlement and reconciliation are totally batch. There's simply nothing to be done while you wait for the process to complete nor anything of higher priority that you'd want to preempt that task. So the question is, if the task is business-critical important or if it's critical to major institutions such as the global economy - like, say, the Depository Trust Corporation's nightly cross-trader settlement process which is, in fact, still a mainframe batch - why would you want the process to be anything other than a deterministic length of time for a fixed input? You'd be willing to commit a whole piece of hardware to getting the job done, right? As it turns out, that's the reason. There are an awful lot of things that are more important than economical full-utilization of a machine, and most of those tasks are still carried out on mainframes, and usually they're still done in batch.
There are a bunch of secondary reasons as well, though: a 3270 terminal ran in the thousands of dollars a unit in the 1980s; the network was really, really slow, and sharing the terminal server was worse than slow; if you were lucky(?) enough to have a token ring desktop and CM/2 on your machine so you didn't need a 5250 death-ray CRT next to you, you were unlucky enough to be on token ring and good luck with that; at 9am when the world woke up and logged in, the entire SYSPLEX ground to a halt waiting for all the interactive logins to complete, even though folks would then idle most of the day... on and on and on, and all of those were issues with time-sharing systems that, for most applications, worked just as well if you punched a record card (I know, right? Punch cards...), put it in a stack, and handed it off to the data processing department at 5pm.
If I still had $X billion in transactions to clear a day where X > a number that would get me jail time if I screwed up, I would probably still do it on a zSeries mainframe running CICS and IMS but running almost totally in batch. Because why chance it?
Personally, I think the better, modern take on it is a compute cluster where certain nodes can be brought up for dedicated batch runs while others run interactive functionality. The embedded safety and security scene has been trying to do this with partitioning MILS kernels that strictly separate and schedule workloads by priority, with fault isolation. Recent ones let partitions that have finished donate their resources, so waste is minimal. Finally, there are security benefits in that batch runs make it easy to eliminate covert storage and timing channels. Hell, you can even do what I did (and cloud is just now doing) and design a custom OS image per batch app, loaded on a minimal kernel. That reduces resource requirements and problems.
Can you elaborate? For instance, SWIFT does like 15M messages/day (according to Wikipedia). That's...really not that much in absolute terms for even a cheap server today.
Also note that a 'transaction' in this sense is not an HTTP request/response. The way it's used in mainframe systems, it's a business transaction, which can include hundreds of smaller 'transactions'.
Banks and other financial systems can't perform daily reconciliation until markets close and stock and fund prices are known. Hence these systems store everything up to process in a nightly window.
Minor nit...we found token ring degraded vastly better under load than ethernet. While 10Mb enet was faster bursting than 4Mb TR, the aggregate utilization for TR was better and more deterministic. Maybe your SNA folks were oversubscribing the ring. But, yeah, pretty much everything in the IBM ecosystem was 10x the cost of the emerging ethernet world and that pretty much doomed it.
A modern graphics card is ~100 GFLOPS. A System/360 is about 1 MFLOPS. So a frame which renders in 1/60 of a second is roughly a half-hour batch job on the IBM. (This is much better than I was expecting!)
(1mflop figure from https://www.clear.rice.edu/comp201/08-spring/lectures/lec02/... )
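The arithmetic checks out, taking the figures as quoted in the comment:

```python
# Rough check of the comparison above (both FLOPS figures from the comment).
gpu_flops  = 100e9   # ~100 GFLOPS for a modern graphics card
s360_flops = 1e6     # ~1 MFLOPS for a System/360

frame_work = gpu_flops / 60               # FLOPs spent on one 1/60 s frame
seconds_on_360 = frame_work / s360_flops  # time to do that work at 1 MFLOPS
print(seconds_on_360 / 60, "minutes")     # ≈ 28 minutes — roughly half an hour
```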
You will still find the moral equivalent of batch processing in modern realtime applications; the batches are shorter to be sure, but they are often still scheduled against deadlines and run to completion.
I don't have any resources on the discussion that happened at IBM, but "Hackers" by Steven Levy has a good account of the 1960s anti-time-sharing crowd at MIT.
(On the more fun side, I took the original "Tron" as an allegory for the time-sharing debate, one falling more on the anti-time-sharing side.)
Good for the historical and learning perspective of anyone into IBM mainframes. However, anyone wanting to see awesome mainframe architecture should look up the Burroughs architecture. I'd take many of its features in my PC today.
The thing that the System/360 architecture got right was a clean exception mechanism when an attempt was made to execute a "privileged instruction" when in user mode. Privileged instructions were all those that operating systems used to manage and run applications. By having a clean exception/trap mechanism, the hypervisor was made feasible, because it could run the virtual machine in user mode, let the S/360 hardware do all the normal user-mode instructions natively, and trap out to the hypervisor when a privileged instruction was run.
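The trap-and-emulate idea can be sketched abstractly — this is a toy model in Python, not real S/360 semantics, with an illustrative subset of privileged opcodes:

```python
# Toy trap-and-emulate: user-mode instructions run "natively";
# privileged ones raise a trap that the hypervisor catches and emulates.
class PrivilegedOpTrap(Exception):
    pass

PRIVILEGED = {"SIO", "LPSW"}   # illustrative subset of privileged opcodes

def cpu_execute(insn, mode):
    """The hardware: runs an instruction, trapping privileged ones in user mode."""
    if insn in PRIVILEGED and mode == "user":
        raise PrivilegedOpTrap(insn)
    return f"ran {insn} natively"

def hypervisor_run(insn):
    """The guest OS always runs in real user mode under the hypervisor."""
    try:
        return cpu_execute(insn, mode="user")
    except PrivilegedOpTrap as trap:
        return f"hypervisor emulated {trap.args[0]}"

print(hypervisor_run("AR"))    # ordinary instruction → runs natively
print(hypervisor_run("SIO"))   # privileged → trapped and emulated
```

The key property is exactly the one credited to the architecture: every privileged instruction reliably traps in user mode, so the hypervisor never has to scan or rewrite guest code.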
Kudos to Andris Padegs and his original System/360 architecture team for having the foresight in the early 60s to implement the instruction set in this way.
If you looked at the source code of VM/370 CP, which all came from CP/67, you would see code which intercepts exceptions caused by user mode applications attempting to execute supervisor instructions such as SIO (Start I/O) or LPSW (Load PSW), and emulates them like the real S/370 hardware would do. Therefore, a user-mode application could actually be an operating system that thought it was running on real hardware.
SIO is the fundamental way that an OS communicates with I/O devices, all of which were virtualized by CP. LPSW is how an OS dispatches one of its tasks, so CP virtualizes the hardware state and switches the virtual CPU from virtual supervisor to virtual user mode. Of course it's all much more complex than that.
In particular, CP could virtualize an operating system that itself did virtual memory and acted as a hypervisor. If the virtual OS was itself a hypervisor, it would run its second-level virtual OSes in (virtual) user mode. The first-level CP would get a privileged-instruction exception, would look at the virtual machine's virtual state, and would see that it was in virtual user mode. Thus, it would simulate a privileged-instruction exception in the virtual machine, which in turn would emulate that privileged instruction for the second-level virtual OS.
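That reflection decision can be sketched the same way (again a toy model, not CP's actual logic):

```python
# Toy sketch of nested trap handling: the first-level hypervisor receives
# every real privileged-instruction trap. If the guest's *virtual* CPU was
# in virtual supervisor mode, CP emulates the instruction on the guest's
# behalf; if it was in virtual user mode, the trap is reflected into the
# guest, whose own (guest-level) trap handler will do the emulating.
def first_level_trap(insn, guest_virtual_mode):
    if guest_virtual_mode == "supervisor":
        return f"CP emulates {insn} for the guest OS"
    else:
        return f"trap reflected into guest; guest hypervisor emulates {insn}"

print(first_level_trap("LPSW", "supervisor"))
print(first_level_trap("LPSW", "user"))
```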
The most difficult and compute-intensive work was in simulating the virtual memory hardware of a virtual machine that was itself using virtual memory for its user tasks or, in the case of a second-level hypervisor, for its virtual machines. We had to have code in CP that simulated the translation lookaside buffer for the virtual machine and also did direct lookups within the virtual page and segment tables maintained by the virtual OS.
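A minimal sketch of that composed translation, assuming a single-level guest page table and made-up mappings (real CP walked the guest's segment and page tables in their hardware formats):

```python
# Toy shadow translation: guest virtual → guest "real" via the guest OS's
# page table, then guest real → host physical via CP's own table. The
# software TLB caches the composed mapping. Page size is illustrative.
PAGE = 4096

guest_page_table = {0: 5, 1: 9}    # guest virtual page → guest real page
cp_page_table    = {5: 42, 9: 77}  # guest real page → host physical page
shadow_tlb = {}                    # composed cache maintained in software

def translate(guest_vaddr):
    """Translate a guest virtual address to a host physical address."""
    vpage, offset = divmod(guest_vaddr, PAGE)
    if vpage not in shadow_tlb:
        # TLB miss: walk both tables in software, as CP had to.
        greal = guest_page_table[vpage]   # the guest OS's own mapping
        hphys = cp_page_table[greal]      # CP's mapping of guest real memory
        shadow_tlb[vpage] = hphys
    return shadow_tlb[vpage] * PAGE + offset

print(hex(translate(0x0123)))  # an address in host physical page 42
```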
But it all worked, and we could happily run virtual OSes like OS/VS1, OS/VS2 (later called MVS), and CP itself, underneath CP as virtual machines.
However, as you can imagine, performance was not great for many workloads. So, the hardware engineers in Endicott and Poughkeepsie came up with "Virtual Machine Assist" microcode, which would step in and run the high-use hypervisor functions directly in microcode on the hardware, an order of magnitude faster than doing it in S/370 instructions. A good example is Load Real Address (LRA), which could be run very quickly in microcode.
I spent thousands of hours working on the CP source code, first as a user, then as a developer and design manager at IBM in the early days of VM/370 and VM/SP. I was too young to have been involved in the Cambridge Scientific Center's early work on the 360/44 and later the 360/67, but did get to meet and talk with some of the original people.