That common interview question about query autocomplete/sentence completion? Shannon solved it and demonstrated it in this paper, almost a decade before FORTRAN existed. New grads still struggle with that problem. PhDs still struggle with that problem.
Pretty much every machine learning classifier uses a loss function described in that paper.
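For instance, next-word prediction falls out of the word-frequency tables Shannon describes. A minimal sketch in Python (the corpus and function names are my own illustration, not from the paper):

    # Shannon-style n-gram completion: predict the next word from
    # counts of observed continuations in a corpus.
    from collections import Counter, defaultdict

    def build_bigram_model(words):
        """For each word, count how often each successor follows it."""
        model = defaultdict(Counter)
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
        return model

    def complete(model, word, length=5):
        """Greedily extend a prompt, always taking the most frequent successor."""
        out = [word]
        for _ in range(length):
            followers = model.get(out[-1])
            if not followers:
                break
            out.append(followers.most_common(1)[0][0])
        return " ".join(out)

    corpus = "the cat sat on the mat the cat ran on the grass".split()
    print(complete(build_bigram_model(corpus), "the"))  # greedy bigram completion

Shannon's point, and the reason this still works, is that natural language is statistically redundant; the paper quantifies exactly how much.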
I probably have 10 papers floating around as loose pages. Annoying. I print them when I want to read them, but of course rarely can read them immediately.
I loved this book as a teenager and recently got reacquainted with it after years, thanks to the almighty AvE on YouTube, when he took apart an old helicopter radio/telephone here: https://www.youtube.com/watch?v=6eoBj5W7Vdc
Most modern textbooks tend to approach linear algebra from a geometric perspective. Albert's text is one of the few that introduce the subject in a purely algebraic way. With a solid algebraic foundation, the author was able to produce elegant proofs of results that you don't often see in modern texts.
E.g. Albert's proof of the Cayley-Hamilton theorem is essentially a one-liner. Some modern textbooks (such as Jim Hefferon's Linear Algebra) try to reproduce the same proof, but without setting up the proper algebraic framework, their proofs become much longer and much harder to understand. Readers of these modern textbooks may not realize that the theorem is simply a direct consequence of the Factor Theorem for polynomials over non-commutative rings.
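For the curious, here is the skeleton of that argument as I understand it (my paraphrase, not Albert's exact wording):

    % Over the non-commutative ring of n x n matrices, the Factor Theorem
    % gives p(A) = 0 whenever (xI - A) is a right factor of p(x)I.
    % The adjugate identity supplies exactly that factor:
    \[
      p(x)\,I = \operatorname{adj}(xI - A)\,(xI - A),
      \qquad p(x) = \det(xI - A),
    \]
    % so right-evaluating at x = A yields p(A) = 0: Cayley-Hamilton.

The entire subtlety lives in setting up evaluation of matrix polynomials correctly, which is exactly the framework the longer modern reproductions skip.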
With only about 300 pages, the book's coverage is amazingly wide. When I first read the table of contents, I was surprised to see that it not only covers undergraduate topics such as groups, rings, fields, and Galois theory, but also advanced topics such as p-adic numbers. I haven't read the part on abstract algebra in detail. However, if you want to re-learn linear algebra, this book may be an excellent choice.
(As you no doubt know, different books have different audiences. Before I wrote my Linear book, when I looked at the available textbooks I thought that there were low-level computational books that suited people with weak backgrounds, and high-level beautiful books that show the power of big, exciting ideas. I had a room with students who were not ready for the high-level ones. I wrote the book hoping that it could form part of an undergraduate program that deliberately worked at bringing students along to where they would be ready for such things. Naturally, with that mindset I read your post as meaning that the audience for the book you described is just different. Anyway, thanks again for the pointer.)
Intro to CS: SICP (1979)
Algorithms/data structures: CLRS (1989)
Theory of computation: Sipser (1996)
Compilers: Dragon book (1986)
Calculus: Spivak (1967)
Linear Algebra: Shilov (Dover, 1971)
The given years are for first publication; some of these are still being updated, and I probably read a newer edition.
Reading the book is the most beautiful and simple way for a person to really understand what a computer is and come to the realization that it is not black magic.
There are a couple of follow-ups too, such as "Why Johnny Still Can't Encrypt" and "Why Johnny Still, Still Can't Encrypt".
"Zen in the Art of Archery" — Eugen Herrigel (1953)http://www.ideologic.org/files/Eugen_Herrigel_-_Zen_in_the_A...
"What is it like to be a bat?" — Thomas Nagel (1974) https://organizations.utep.edu/portals/1475/nagel_bat.pdf
"The Tragedy of the Commons" — Garrett Hardin (1968) https://www.hendrix.edu/uploadedFiles/Admission/GarrettHardi...
What Is Mathematics? by Richard Courant and Herbert Robbins, published in 1941. One of the most beginner-friendly yet rigorous books out there for a survey of many areas of mathematics.
You can find hundreds of gems from the erstwhile Soviet Union here - https://mirtitles.org/
It's full of beautiful renderings and diagrams, covers the core algorithms of 2D and 3D graphics, introduces the mathematics required, and touches on many other related subjects such as user interface design.
Apparently there is a 3rd edition from 2013 which looks at modern GPU-based rendering, though I don't own a copy.
"The Presentation of Self in Everyday Life" — Erving Goffman (1956) https://monoskop.org/images/1/19/Goffman_Erving_The_Presenta...
Korzybski's book used to be huge, recommended by all kinds of famous people. I spent a few hours reading it one day, to see for myself. (Plus I had heard a fair bit about it before.) Korzybski basically seems a huge crank, who thought himself and his baby General Semantics as important as Aristotle. The quote one always hears from it is "the map is not the territory", and well, that's about the only thing worth quoting from it. Plus he tried to get rid of "is" from the language, i.e. "A is B", seemingly because such sentences are deceptive - if you say "The car is red", well, it's many things besides red, so the sentence is a lie in many ways. It's a very strange objection, as if a sentence is bad because it doesn't say everything, just one thing. Aristotle he's not.
Also there's an interesting contraption featured in the book, made of metal with holes, strings, plugs, used to make maps of levels of concepts. I don't know if it's practically useful.
Apart from that, what makes it a big slab of a book is a host of chapters on different academic subjects serving as introductions to those subjects, e.g. one on maths (calculus, I think), supposedly illustrating general semantics applied there. These seem mostly intended to give the impression that Korzybski is a genius polymath. People who didn't know anything about a subject might learn something from them, and feel the book taught them something. But it's nothing to do with Korzybski's theories.
This was extended by a student of Korzybski's to E-Prime, a language without any form of the verb 'to be'. https://en.wikipedia.org/wiki/E-Prime
Just to go meta: whatever book you learned something from originally/in college you should keep. It might not always be the best, but keeping the context of your original understanding can really help and speed up recollection when needed. (This probably applies most to textbooks used for whole classes as opposed to minor topic references.)
E. F. Codd
I would add: An Introduction to Database Systems - Date
Also a few unmentioned so far:
Discrete Mathematical Structures - Kolman
Introduction To Systems Analysis And Design - Hawryszkiewycz
Modern Operating Systems - Tanenbaum
Worth reading anytime; it simply tells you how everything in a computer works, like the memory and the processor.
"The Mythical Man Month" (1975) - because human nature hasn't changed
"The History of Fortran I, II, and III" (1979) - because this historical piece by the author of the first high level language brings home the core principles of language design [https://archive.org/details/history-of-fortran]
"The Unix Programming Environment" (1984) - because the core basics of the command line haven't changed
"Reflections on Trusting Trust" (1984) - because the basic concepts of software security haven't changed
"The Rise of Worse is Better" (1991) - because many of the tradeoffs to be made when designing systems haven't changed
"The Art of Doing Science and Engineering: Learning to learn" (1996) - because the core principles that drive innovation haven't changed
"xv6" (an x86 version of Lion's Commentary, 1996) - because core OS concepts haven't changed
'The Prince' - Machiavelli (early 16th century)
and (if this counts as old) Berger and Wolpert, The Likelihood Principle, 1984
Sipser's _Intro to the theory of computation_ (1996; 3e in print)
Aho, Sethi, & Ullman's _Compilers: principles, techniques, and tools_ - 'the dragon book' - (1986; 2e in print)
Various authors' _Handbook of theoretical CS_ (2 volumes, 1990-1)
...and for papers, I like the curated list in Fermat's Library: https://fermatslibrary.com/journal_club
Kenneth Hoffman and Ray Kunze, Linear Algebra, 2nd Edition, Prentice-Hall, Englewood Cliffs, New Jersey, 1971.
Halmos, Finite Dimensional Vector Spaces
George E. Forsythe and Cleve B. Moler, Computer Solution of Linear Algebraic Systems
Paul R. Halmos, Naive Set Theory, Van Nostrand, Princeton, NJ, 1960.
More has been done since this book, but this book is a gorgeous introduction to axiomatic set theory. So, even people who want to dig into the latest work would do well to have this as the first book.
And for people wanting to read any of the more advanced material here, knowledge of this book will be from good to have to necessary.
Walter Rudin, Principles of Mathematical Analysis. The third edition is a lot better than the earlier ones.
H. L. Royden, Real Analysis: Second Edition. Beautifully written, elegant, but maybe don't work way too hard on the exercises about upper/lower semi-continuity, and there is a better summary than Littlewood's three principles.
Bernard R. Gelbaum and John M. H. Olmsted, Counterexamples in Analysis
John C. Oxtoby, Measure and Category: A Survey of the Analogies between Topological and Measure Spaces
Walter Rudin, Real and Complex Analysis
Walter Rudin, Functional Analysis
Leo Breiman, Probability
Kai Lai Chung, A Course in Probability Theory, Second Edition
Jacques Neveu, Mathematical Foundations of the Calculus of Probability
Erhan Cinlar, Introduction to Stochastic Processes
J. L. Doob, Stochastic Processes
I. I. Gihman and A. V. Skorohod, The Theory of Stochastic Processes I, II
Donald E. Knuth, The TeXbook
Donald E. Knuth, The Art of Computer Programming, Second Edition
Leo Breiman, "Statistical Modeling: The
Two Cultures," Statistical Science, Vol.
16, No. 3, 199–231, 2001.
Paul R. Halmos, "The Theory of Unbiased
Estimation", Annals of Mathematical
Statistics, Volume 17, Number 1, pages
Paul R. Halmos and L. J. Savage, "Application of the Radon-Nikodym Theorem to the Theory of Sufficient Statistics", The Annals of Mathematical Statistics, Volume 20, Number 2 (1949), 225-241.
I find well-typeset TeX a joy to read, whereas FDVS is a bit cramped and looks antique.
IIRC Hilbert space was a von Neumann idea: it is, first, just a definition -- a complete inner product space (inner product = dot product in much of physics and engineering). But the good stuff is (1) the importance of the examples and (2) the theorems that show the consequences, e.g., in Fourier theory.
Well, the vector spaces of most interest in linear algebra are actually (don't tell anyone) finite dimensional Hilbert spaces. So, one role of FDVS is to provide a text on linear algebra that is also an introduction to Hilbert space, that is, that tries to use ideas that work in any Hilbert space to get the basic results in linear algebra.
The treatment of self-adjoint transformations and spectral theory is likely the most influenced by this role.
This role is accomplished so well that sometimes physics students starting on quantum mechanics are advised to get at least the start they need on Hilbert space from FDVS.
Sure, a better start is the one chapter on Hilbert space in Rudin's Real and Complex Analysis. The chapter there on the Fourier transform is also good, short, all theorems nicely proved, the main, early results made clear.
Also, a good start toward the basic results on self-adjoint matrices is the inverse and implicit function theorems, given as nice exercises in the third edition of Rudin's Principles .... And spectral theory is in Rudin's Functional Analysis, where you also get the bonus of a nice treatment of distributions, that is, a replacement for the Dirac delta function usage in quantum mechanics.
How to get the eigenvalue and orthogonal eigenvector results for self-adjoint matrices from the inverse and implicit function theorems (these two go together like ice cream and cake) is in Fleming, Functions of Several Variables. Then you will be off and running on factor analysis, principal components, the polar decomposition, the singular value decomposition, and more.
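As a quick numerical illustration of where that road leads (my own sketch, not from Fleming), here is the spectral theorem for a real symmetric matrix checked in a few lines of Python:

    # A real symmetric (self-adjoint) matrix has real eigenvalues and an
    # orthonormal basis of eigenvectors: A = Q diag(w) Q^T, Q orthogonal.
    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    A = (B + B.T) / 2                    # symmetrize so A is self-adjoint

    w, Q = np.linalg.eigh(A)             # eigh: eigensolver for symmetric A
    print(np.allclose(Q @ np.diag(w) @ Q.T, A))    # reconstruction holds
    print(np.allclose(Q.T @ Q, np.eye(4)))         # eigenvectors orthonormal

The singular value decomposition mentioned above is this same decomposition applied to A^T A and A A^T, which is one way to see why it always exists.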
Books by Max Born (Atomic Physics and Theory of Relativity) are absolute awesomeness.
In other fields, the Standardized Barbers Manual (for barbers) and The Modern Tailor Outfitter And Clothier (tailoring) are still extremely relevant
Zen and the Art of Motorcycle Maintenance (Philosophy).
Gettier cases and the life and times of a bat are far less interesting to “civilians” ;)
From information theory:
A mathematical theory of communication (information theory, Claude E. Shannon) 
Three approaches to the quantitative definition of information (A. N. Kolmogorov) 
A formal theory of inductive inference (Ray Solomonoff, 1964) 
Language identification in the limit (Mark E. Gold, 1967) 
Inductive Inference of formal languages from positive data (Dana Angluin, 1980) 
A theory of the learnable (PAC learning, Leslie Valiant, 1984) 
Occam's Razor (Blumer et al, 1987) 
Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)
The second batch of papers starts with Solomonoff's inductive inference papers, kinda important if you want to learn things from other things. Mark Gold's paper proves that it is impossible to learn a non-finite automaton from examples. Dana Angluin's follow-up extends this with learnability results about various classes of CFG. Any time someone claims that their deep neural net has learned a CFG, point them to these two papers.

Valiant's paper is the theory of machine learning as we know it today. It basically relaxes the assumptions made in inductive inference and introduces the notion of error: if you can't learn some concept perfectly, what degree of error is likely from some set of training data? Blumer's paper discusses a further bound on that amount of error that follows an Occamist bias (simplest truths are better) and is a basis for understanding overfitting (error increases as the hypothesis space does).
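(Here is the standard finite-class form of that kind of bound, my paraphrase rather than Blumer et al.'s exact statement: with probability at least 1 - delta, a hypothesis from a finite class H that is consistent with m i.i.d. examples has true error at most epsilon once

    \[
      m \;\ge\; \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right),
    \]

so smaller, "simpler" hypothesis classes need fewer examples - the Occamist bias in quantitative form.)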
These two sets of papers probably look disconnected - but learning is compression. Compression with generalisation, I guess. Anyway, no, they're not unrelated.

The final paper is the one that introduced LSTMs and the, er, "constant error carousel" - the solution to vanishing gradients, which alone makes the paper worth reading.

These are papers that one must read, and carefully so, if they're interested in machine learning. They're not even "old" papers - more like essential ones. I'm totally omitting a whole bunch of others, obviously.
Online pdfs (not all free):
[3.1] https://www.sciencedirect.com/science/article/pii/S001999586... (Part 1)