
Algorithms, Data Structures, and Design Patterns for Self-Taught Developers - antjanus
http://antjanus.com/blog/web-development-tutorials/learn-the-unlearned-algorithms-data-structures-and-design-patterns/
======
philip1209
What I seek are better resources for engineers and scientists who are
self-taught developers. I, along with many of my peers, have technical
schooling and a deep understanding of mathematics and optimization, but
ventured little outside of Matlab in undergraduate courses. It has been
difficult to find learning materials that fall between the non-technical
level (e.g. "Programming Collective Intelligence") and the academic
computer-science level (e.g. classic algorithm textbooks).

~~~
rdouble
The best algorithms book for the self-taught developer is the latest edition
of Sedgewick.

<http://algs4.cs.princeton.edu/home/>

~~~
philip1209
Great, thanks. I'll read it this weekend.

The one book I found amazing for learning was Cracking the Coding Interview.
The brief summaries of data structures, followed by lots of examples and
solutions, helped me develop an intuitive sense for run times that made the
more academic texts approachable.

------
graycat
The author seems to prefer Quora, Reddit, etc. to the famous textbooks. Not
so good: it's tough to find better authors than Knuth and Sedgewick.

It happens that recently I posted an answer to his question in

    https://news.ycombinator.com/item?id=5631365

It appears that the author regards the famous texts as too difficult to read?
No, they are mostly easy enough to read, especially for the material the
author seems to want to learn.

The author seems to suggest that without a good university course, learning
that material is difficult. However, programmers long learned about, say, AVL
trees or heap sort from Knuth's TAOCP or just journal articles -- I first
learned quick sort from a journal article -- instead of university courses.
The material is not very difficult.

As I wrote in my post

    https://news.ycombinator.com/item?id=5631365

you can get an overview of the topics so that you know what is more versus
less important, and then use the famous books just as reference material
rather than trying to study all the contents cover to cover. For the
overview, a few good lectures would be sufficient, but lists of the more
popular topics are easy to find and mostly good enough.

~~~
antjanus
I think my article didn't do justice to my beliefs. I don't use Quora or
Reddit as a source of information per se; they are places for discussion, and
I happen to find information there once in a while. When that happens, I
always research it further outside of Quora.

I don't believe that University courses are the way to go. Otherwise, I
wouldn't have dropped out. But I do reference them because you can easily
skip around in the lectures and find what you need.

Learning from books is fine as well; however, I don't have exposure to that
material most of the time. If you have any ideas or references you suggest I
should add, let me know. I would only reference something I've used or
someone else recommended. I wouldn't mind reading this stuff either. Like I
said, I just don't have the exposure to these books.

~~~
graycat
> I don't believe that University courses are the way to go.

Well, there are pros and cons here. I'd say that the main issue is quality. If
you are getting material directly from a famous, tenured, chaired, full
professor at a top research university, say, Princeton or Stanford, then you
have a good shot at getting the best quality there is. That sense of quality
can be important in picking good future directions and avoiding poor ones. So,
the stuff you get from Knuth, Sedgewick, Ullman, etc. is going to be tough to
improve on. There are at times ways to improve a little, but basically,
start with Knuth, etc.

If you want to get the stuff from Knuth, etc., good, but you don't necessarily
have to get it in person; their books can also be good. In my education, I got
my sense of quality from some high end material in pure and applied
mathematics, and that sense of quality applies well enough to computing when
read from Knuth, etc. So, the courses I took in math were important.

I'd say that likely, somewhere, you do need some good courses from some good
people. If you can't get such courses, then pick very carefully what you
study. If you see something in both Knuth and Sedgewick, then likely it is an
important topic, so take it seriously. If you are going to buy a car, you
will investigate well before you buy; do the same for topics in computer
science. That is, there's a lot of junk out there, and you could waste a lot
of time on that junk, so be careful in what you pick to pursue.

But in the end, just for computer science, no, you don't actually need
courses. I've taught courses in computer science at two major universities,
but I really never took a course in computer science and, instead, learned
from Knuth, papers, and what I dragged over from pure and applied math. I've
published in computer science and artificial intelligence but, still, never
took a course in the stuff.

By "exposure to that material", I'm unsure just what you mean: Go to a
library, photocopy 20 pages or so from each of Knuth and Sedgewick, take notes
on the rest, and call it done.

In the end, first cut, what you want from Knuth and Sedgewick is mostly just
to use them as references. To this end, maybe get copies of the tables of
contents. You might be able to do well enough just from Amazon.

Uh, for 'Knuth', mostly I'm talking about just his volumes of 'The Art of
Computer Programming', TAOCP. His volume 'Sorting and Searching' is the most
relevant here. One of his other volumes has a fantastic collection of binomial
and combinatorial formulas. If you have trouble going to sleep, that volume
could do wonders.

Or, the main material is so simple we can cover it right here: You want to
know the basic material on algorithms and data structures. Great. So, the main
algorithms you want are just what Knuth calls "sorting and searching". For
sorting, learn at least merge sort and quick sort. For more learn heap sort
and radix sort. The heap data structure is darned clever; I recommend you
learn it. Right: radix sort is what the old punched-card machines used, and
in an important sense it is the fastest of the four I mentioned. That's an
hour each. Likely Wikipedia could be enough for these. Actually, Wikipedia
has a
super cute animated GIF on what heap sort does. For an exercise: You have a
disk file with 100 million numbers. You want to read the file once and end up
with the 100 largest of those numbers. Write some software using the heap data
structure to construct a 'priority queue' to make this operation fast. This is
real: I needed it in my work and coded it up. It may also be a standard Google
interview question.
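
Here's a minimal sketch of that exercise in Python, using the standard
library's heapq as the priority queue; the file name and the
one-number-per-line format are assumptions for illustration:

    import heapq

    def top_k(path, k=100):
        # Min-heap of size at most k; heap[0] is the smallest of the
        # current top k, so any larger number displaces it.
        heap = []
        with open(path) as f:
            for line in f:
                x = float(line)
                if len(heap) < k:
                    heapq.heappush(heap, x)
                elif x > heap[0]:
                    heapq.heapreplace(heap, x)
        return sorted(heap, reverse=True)

One pass over the file, O(k) memory, and about O(n log k) work in the worst
case, which is the point of using the heap here.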

For searching, that's first cut mostly just binary search, and that's mostly
just what anyone does when looking up a word in a dictionary. Read about
binary search for an hour and write a binary search routine. Done. Exercise:
Suppose you have 1 million key-value pairs in an array sorted in order on the
keys. Suppose you have 100 keys and for each want to look up the values. So,
you can just apply binary search 100 times. Right. But could you speed that up
a little? What would be the 'best' way to speed that up? For this last, there
may be a research paper in it. Maybe. This exercise is real: In my work I
needed to do this, thought of a solution, wrote the code, and it's in my
production software. The solution I found I'm sure is significantly faster
than just doing binary search 100 times, but I don't know the fastest solution
(and don't much care).
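
A sketch of both pieces in Python. The plain routine is the classic
algorithm; the batched version is just one plausible speedup (sort the
queries, then shrink the left edge of each successive search), not
necessarily the fastest approach or the solution mentioned above:

    import bisect

    def binary_search(a, key):
        # Classic binary search on a sorted array: O(log n) comparisons.
        lo, hi = 0, len(a) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if a[mid] == key:
                return mid
            elif a[mid] < key:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1

    def batched_search(a, queries):
        # Sort the queries; each found position bounds the next search
        # from the left, so later searches scan a shrinking suffix.
        results, lo = {}, 0
        for q in sorted(queries):
            i = bisect.bisect_left(a, q, lo)
            results[q] = i if i < len(a) and a[i] == q else -1
            lo = i
        return results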

Publish? The world is awash in computer science profs who would jump through
flaming hoops to publish anything they can get accepted. So if you publish,
then in a sense you have beaten a lot of them. So, net, no one can say you are
not a computer scientist, even if you never touched Rails. And for silly job
interview whiteboard exercises to 'see how you think', you can just say that your
best evidence on how you think is in your published papers, which actually is
a good answer. Besides, really it's nobody's business 'how you think' as long
as you do, and a published paper proves that you do think well. Maybe you
think only standing on your head in a shower. Fine.

For more on binary search, suppose for some positive integer n you have 100
million n-tuples of floating point numbers and want to put this data into a
data structure to permit, given one more such n-tuple, a fast way to find its
nearest neighbor in the 100 million. Yup, it's a real problem I had once. So,
likely a first step will be something like binary search except on the n
dimensions. Now you will have reinvented k-d trees, which are in Sedgewick. Now
for the rest, that's another exercise!
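
A toy k-d tree in Python to make the idea concrete: build by cycling the
split axis through the n coordinates, then search by descending toward the
target and backtracking into the far subtree only when the splitting plane is
closer than the best match so far. For 100 million points you'd want an
array-based, iterative version; this recursive sketch only shows the shape of
the algorithm:

    import math

    def build_kd(points, depth=0):
        # points: a list of equal-length tuples of floats.
        if not points:
            return None
        axis = depth % len(points[0])
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return {"point": points[mid], "axis": axis,
                "left": build_kd(points[:mid], depth + 1),
                "right": build_kd(points[mid + 1:], depth + 1)}

    def nearest(node, target, best=None):
        if node is None:
            return best
        p, axis = node["point"], node["axis"]
        if best is None or math.dist(p, target) < math.dist(best, target):
            best = p
        diff = target[axis] - p[axis]
        near, far = ((node["left"], node["right"]) if diff < 0
                     else (node["right"], node["left"]))
        best = nearest(near, target, best)
        if abs(diff) < math.dist(best, target):
            # The splitting plane is closer than the best match, so the
            # far side could still hold the true nearest neighbor.
            best = nearest(far, target, best)
        return best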

For data structures, of course you know arrays from just nearly any
programming language. With arrays you can construct queues, linked lists, etc.
These are too trivial even to describe.

Then, of course, there are trees. First cut, these are dirt simple. There are
some issues in 'storage management', but now you can mostly leave that to
the programming language, e.g., any of the Microsoft languages on the 'common
language runtime' (CLR).

Then with trees, you can have a 'binary' tree, which just means that each
parent has at most two children. Then you can quickly see how to have the
'leaves' of the tree hold your data. Data? Right, that is mostly just
key-value pairs, and you want to 'search' given a key and look up the value.
We're talking something about as sophisticated as working with file folders
or index cards.
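
In code, the idea is tiny. A sketch of a plain (unbalanced) binary search
tree in Python; this variant stores a key-value pair at every node rather
than only at the leaves, which is the more common presentation:

    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    def insert(root, key, value):
        # Smaller keys go left, larger keys go right.
        if root is None:
            return Node(key, value)
        if key < root.key:
            root.left = insert(root.left, key, value)
        elif key > root.key:
            root.right = insert(root.right, key, value)
        else:
            root.value = value  # existing key: replace the value
        return root

    def search(root, key):
        while root is not None:
            if key == root.key:
                return root.value
            root = root.left if key < root.key else root.right
        return None

Insert the keys in sorted order, though, and this degenerates into a linked
list, which is exactly the problem the next point addresses.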

Well, a binary tree can become 'unbalanced', so often we don't want that. So,
the first good solution was AVL trees in Knuth. Now there are red-black trees,
likely in Sedgewick. They are cute. They are also the basis of the collection
classes in object-oriented languages and of the lookup mechanism in several
interpreted scripting languages.

Such balanced binary trees are so good that they are nearly a 'do everything'
data structure and remove much of the interest in hash tables, extensible
hashing, etc.

But trees need not be just binary; each parent can have multiple children.
Then you have a multi-way branched tree. The main one of interest is the
B-tree, from Bayer and McCreight, at the time at Boeing, long ago. B-trees
are used much like AVL trees, that is, they are balanced, and mostly they are
for storing data in direct-access, fixed-length records on hard disk. Likely
B-trees are, or have been, the key bottom-level means of access for nearly
all database systems.
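
A search-only sketch of the idea in Python (the node layout and names here
are mine, and insertion with node splitting, the part that keeps the tree
balanced, is omitted). Each node holds many sorted keys and one more child
than keys, so on disk a node maps to one fixed-length block and the tree
stays very shallow:

    import bisect

    class BTreeNode:
        def __init__(self, keys, values, children=None):
            self.keys = keys            # sorted keys in this node
            self.values = values        # values paired with the keys
            self.children = children    # None for a leaf node

    def btree_search(node, key):
        while node is not None:
            # Binary-search within the node, then descend to the child
            # whose key range brackets the search key.
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return node.values[i]
            node = None if node.children is None else node.children[i]
        return None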

There's no end of more, but here we've outlined most of a first cut. It can
be covered well in two lectures, and I gave you an outline in a few minutes.
We're not talking a biggie here.

Then there is big-O notation. So, suppose you have a sorting algorithm, and
for a positive integer n, when sorting n items, the sorting time grows
proportionally to, say, n ln(n). Then we say that the algorithm runs in
'order' n ln(n) and write that as O(n ln(n)).

Look up in Knuth the exact definition of big-O. Personally, I don't use
big-O notation, so I don't remember the exact definition; I could guess. But
I once looked up the exact definition and tried to redo calculus with just
big-O notation instead of the usual limits, and I concluded that the goal was
either too clumsy or not achievable. So I'm not a fan of big-O.
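
For reference, the usual textbook definition is short:

    f(n) = O(g(n))  means  there exist constants C > 0 and n0 such that
    |f(n)| <= C * g(n)  for all  n >= n0.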

For computing, just saying that the execution time is proportional to, say,
n ln(n) is good enough (as accurate as we please for sufficiently large n,
even if not very accurate for small n), and big-O is only a little more
precise but still quite crude as a way to express running time. E.g., usually
in big-O people count only, say, comparisons or multiplications and ignore
everything else. In particular, they usually ignore virtual memory, locality
of reference, cache effects, concurrency, etc. And they omit the constant of
proportionality out front. Gads. So, big-O is usually quite crude, and
getting too serious about big-O notation is to try to polish something better
flushed.

Why care about n ln(n)? Because in sorting the usual alternative is n^2, and
in practice the difference in running time is easily very important: for n =
1,000,000, n^2 is 10^12 while n ln(n) is about 1.4 x 10^7, a factor of
roughly 70,000.

This stuff about running time, that is, 'computational time complexity', is a
big theme in high end computer science research. Indeed, the famous unsolved
problem P versus NP is just about running time. That is, for many important
problems it's easy enough to find an algorithm that solves instances of size
n, for positive integer n, in time proportional to 2^n, but what we would
like, first cut, is an algorithm guaranteed to solve them in time
proportional to n^k for some small positive integer k. Find one of those,
collect $1 million from the Clay Mathematics Institute, and get your choice
of chaired, full-professor slots at leading research universities. You will
also get a nice parking spot, a nice table at the faculty club, a nice office
with a good view of the quadrangle, get to meet the rich people the
university president hopes will give money, etc.
Commonly you will have your choice of undergraduate coeds as your secretary.
So, rush right away and find an algorithm that shows P = NP and f'get about
all the details of C++, TCP/IP IPv6, PowerShell, Python, Rails, IIS, SQL, LALR
parsing, ASP.NET, HTML5, etc.!

Or, with your algorithm, to heck with the professorship; instead, start a
company, Optimization as a Service, OaaS, and be worth, say, $10 billion. Am
I joking about the $10 billion? Not really. An algorithm that showed P = NP
would be one of the largest steps up in the 'ascent of man' ever.

