"Researchey" green-field development for data-science-like problems:
1. If it can be done manually first, do it manually. You'll gain an intuition for how you might approach it.
2. Collect examples. Start with a spreadsheet that highlights the data you have available.
3. Make it work for one case before you make it work for all cases.
4. Build debugging output into your algorithm itself. You should be able to dump the intermediate results of each step and inspect them manually with a text editor or web browser.
5. Don't bother with unit tests - they're useless until you can define what correct behavior is, and when you're doing this sort of programming, by definition you can't.
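A minimal Python sketch of what point 4 might look like in practice: every pipeline stage dumps its intermediate results to a file you can open in a text editor. All the names here (`dump_stage`, the `debug_out` directory, the toy records) are hypothetical illustrations, not from the comment above.

```python
import json
from pathlib import Path

DEBUG_DIR = Path("debug_out")
DEBUG_DIR.mkdir(exist_ok=True)

def dump_stage(name, records):
    """Write a stage's intermediate results as one JSON object per line,
    then pass the records through unchanged so calls can be chained."""
    with open(DEBUG_DIR / f"{name}.jsonl", "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    return records

raw = [{"text": " Hello "}, {"text": "world"}]
cleaned = dump_stage("01_cleaned", [{"text": r["text"].strip()} for r in raw])
scored = dump_stage("02_scored", [{**r, "len": len(r["text"])} for r in cleaned])
```

The numeric prefixes on the stage names keep the dump files sorted in pipeline order, so inspecting them with `less debug_out/*.jsonl` walks through the algorithm step by step.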
Maintenance programming for a large, unfamiliar codebase:
1. Take a look at file sizes. The biggest files usually contain the meat of the program, or at least a dispatcher that points to the meat of the program. main.cc is usually tiny and useless for finding your way around.
2. Single-step through the program with a debugger, starting at the main dispatch loop. You'll learn a lot about control flow.
3. Look for data structures, particularly ones that are passed into many functions as parameters. Most programs have a small set of key data structures; once you've found them, orienting yourself to the rest becomes much easier.
4. Write unit tests. They're the best way to confirm that your understanding of the code is actually how the code works.
5. Remove code and see what breaks. (Don't check it in though!)
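A sketch of point 4: a "characterization test" that pins down what an unfamiliar function actually does. `normalize_path` here is a hypothetical stand-in for whatever code you're trying to understand; each assertion records one belief about its behavior, and running the suite tells you immediately which beliefs are wrong.

```python
import unittest

def normalize_path(p):
    """Stand-in for the unfamiliar code under study."""
    parts = [s for s in p.split("/") if s and s != "."]
    return "/" + "/".join(parts)

class TestMyUnderstanding(unittest.TestCase):
    def test_collapses_duplicate_slashes(self):
        # belief: repeated slashes are squeezed into one
        self.assertEqual(normalize_path("//a//b"), "/a/b")

    def test_drops_single_dots(self):
        # belief: "." segments are removed
        self.assertEqual(normalize_path("/a/./b"), "/a/b")
```

Run with `python -m unittest <file>`. Tests like these double as a safety net for point 5: delete something, rerun, and see which beliefs break.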
Performance optimization:
0. Don't, unless you've built it and it's too slow for users. Have performance targets for how much you need to improve, and stop when you hit them.
1. Before all else (even profiling!), build a set of benchmarks representing typical real-world use. Don't let your performance regress unless you're very certain you're stuck at a local maximum and there's a better global solution just around the corner. (And if that's the case, tag your branch in the VCS so you can back out your changes if you're wrong.)
2. Many performance bottlenecks are at the intersection between systems. Collect timing stats in any RPC framework, and have some way of propagating & visualizing the time spent for a request to make its way through each server, as well as which parts of the request happen in parallel and where the critical path is.
4. Oftentimes you can get big initial wins by avoiding unnecessary work. Cache your biggest computations, and lazily evaluate things that are usually not needed.
5. Don't ignore constant factors. Sometimes an algorithm with asymptotically worse performance will perform better in practice because it has much better cache locality. You can identify opportunities for this in the functions that are called a lot.
6. When you've got a flat profile, there are often still very significant gains that can be obtained through changing your data structures. Pay attention to memory use; often shrinking memory requirements speeds up the system significantly through less cache pressure. Pay attention to locality, and put commonly-used data together. If your language allows it (shame on you, Java), eliminate pointer-chasing in favor of value containment.
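A minimal Python sketch of point 4 (avoiding unnecessary work): memoize the expensive computation and defer a rarely-needed one until first use. `expensive_score` and `Report` are made-up illustrations, not anything from the list above.

```python
from functools import lru_cache, cached_property

@lru_cache(maxsize=None)
def expensive_score(key):
    # stands in for your biggest computation; repeat calls hit the cache
    return sum(ord(c) for c in key) ** 2

class Report:
    def __init__(self, rows):
        self.rows = rows

    @cached_property
    def summary(self):
        # lazily evaluated: never computed unless someone asks for it,
        # then cached on the instance for subsequent accesses
        return {"n": len(self.rows), "total": sum(self.rows)}

r = Report([1, 2, 3])
expensive_score("abc")  # computed once
expensive_score("abc")  # served from the cache
```

`expensive_score.cache_info()` reports hits and misses, which is a cheap way to confirm the cache is actually earning its keep in your benchmarks.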
General code hygiene:
1. Don't build speculatively. Make sure there's a customer for every feature you put in.
2. Control your dependencies carefully. That library you pulled in for one utility function may have helped you save an hour implementing the utility function, but it adds many more places where things can break - deployment, versioning, security, logging, unexpected process deaths.
3. When developing for yourself or a small team, let problems accumulate and fix them all at once (or throw out the codebase and start anew). When developing for a large team, never let problems accumulate; the codebase should always be in a state where a new developer could look at it and say "I know what this does and how to change it." This is a consequence of the reader:writer ratio - startup code is written a lot more than it is read and so readability matters little, but mature code is read much more than it is written. (Switching to the latter culture when you need to develop like the former to get users & funding & stay alive is left as an exercise for the reader.)
This is my #1 pet peeve with GitHub. When I first look at an unfamiliar repo, I want to get a sense of what the code is about and what it looks like. The way I do that with a local project is by looking at the largest files first. But GitHub loves their clean uncluttered interface so much, they won't show me the file sizes!
I think this Chrome extension called "GitHub Repository Size" might be exactly what you are looking for.
Check out Octotree if you're using Chrome. No file sizes still, but I've found that when you just want to quickly explore some potential new source this beats having to clone the repo first.
This is true not just for data science but when trying to solve any numerical problem. Using a spreadsheet (or a R / Python notebook) to implement the algorithm and getting some results has helped me in the past to really understand the problem and avoid dead ends.
For example, when building a FX pricing system, I was able to use a spreadsheet to describe how the pricing algorithm would work and explain it to the traders (the end users). We could tweak the calculations and make sure things were clear to all before implementing and deploying the algorithm.
> Don't ignore constant factors. Sometimes an algorithm with asymptotically worse performance will perform better in practice because it has much better cache locality.
Forget the cache, sometimes they're just plain faster (edit in response to comment: I mean faster for your use case). I've e.g. found that convolutions can be much faster with the naive algorithm than with an FFT in a pretty decent set of cases. (Edit: To be specific, these cases necessarily only occur for "sufficiently small" vectors, but it turned out that was a larger size than I expected.) Caching doesn't necessarily explain it I think, it can just simply be extra computation that doesn't end up paying off.
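For concreteness, here's a pure-Python sketch of the "naive" O(n·m) convolution being described. The point is that for sufficiently small inputs it can beat an FFT-based O(n log n) approach despite the worse asymptotics: there's no transform setup cost and no extra bookkeeping, just a tight double loop.

```python
def naive_convolve(a, b):
    """Direct (full) discrete convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out
```

Which crossover size favors the FFT depends on your hardware and data, which is exactly why you benchmark rather than trust the big-O.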
> sometimes they're just plain faster
Not faster for sufficiently large N (by definition).
But your general point is correct.
I've best seen this expressed in Rob Pike's 5 Rules of Programming, Rule 3:
Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy.
True, but supposedly researchers keep publishing algorithms with lower complexity that will be faster only if N is, like, 10^30 or so.
Or so Sedgewick keeps telling us.
6. Use assertions to define your expectations at each stage of the algorithm - they will make debugging much easier
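A small sketch of that suggestion: assert invariants between stages so a bad intermediate result fails loudly at the step that produced it, instead of surfacing as a mystery three stages later. `normalize` is just an illustrative example.

```python
def normalize(weights):
    """Scale a list of weights so they sum to 1."""
    total = sum(weights)
    # precondition: normalizing makes no sense for a non-positive total
    assert total > 0, f"expected positive total, got {total}"
    out = [w / total for w in weights]
    # postcondition: the result really is a probability distribution
    assert abs(sum(out) - 1.0) < 1e-9, "normalized weights must sum to 1"
    return out
```

The assertion messages carry the offending value, so when one fires mid-pipeline you usually know what went wrong without reaching for the debugger.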
I work with a lot of traders. One antipattern I noticed is that when there's a problem with the data, they'll do all sorts of permutations and aggregations and then scratch their chins and ponder about it for hours.
Go to the fucking source and find an example of the problem! Read it line by line; usually it will be obvious what happened.
Corollary: Don't assume your data is correct; most outliers in a large data set are problems with the data itself. Build a few columns that serve as sanity checks. One good example is a column that shows the distance between this sequence number and the last; anything >1 is a dropped message.
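A minimal sketch of that sanity-check column in plain Python (the sample messages and field names are made up): compute the gap to the previous sequence number, then filter for gaps greater than 1.

```python
# toy message feed with a hole: seq 3 and 4 never arrived
msgs = [{"seq": 1}, {"seq": 2}, {"seq": 5}, {"seq": 6}]

prev = None
for m in msgs:
    # gap column: distance from the previous sequence number
    m["gap"] = m["seq"] - prev if prev is not None else 1
    prev = m["seq"]

# anything with gap > 1 marks a dropped message
dropped = [m for m in msgs if m["gap"] > 1]
```

The same idea works as a derived column in a spreadsheet or a pandas DataFrame; the point is that the check lives next to the data, so anomalies jump out before you start aggregating.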
On HN, I often check specific commenters' activity, or new comments in a thread that can be days old, because I often find hidden gems. A link to more elaboration would definitely count as such. Perhaps an audience of even just a dozen or so people like me might be worth it.
EDIT: Coincidentally (I swear), there's exactly a dozen comments that positively engage with your comment!
I reformatted this in a Google Doc, if anybody wants: https://docs.google.com/document/d/1Pix3-l3Qz1aLOuxoiiP1PTV6...
It's helped that it's really been 12 (well, 13+change) years of experience, rather than one year repeated 12 times. Each year has brought something new and different that's just out of my comfort zone.