
Meta-habit: learn to adopt different habits for different situations. With that in mind, some techniques I've found useful for various situations:

"Researchey" green-field development for data-science-like problems:

1. If it can be done manually first, do it manually. You'll gain an intuition for how you might approach it.

2. Collect examples. Start with a spreadsheet of data that highlights the data you have available.

3. Make it work for one case before you make it work for all cases.

4. Build debugging output into your algorithm itself. You should be able to dump the intermediate results of each step and inspect them manually with a text editor or web browser.

5. Don't bother with unit tests - they're useless until you can define what correct behavior is, and when you're doing this sort of programming, by definition you can't.
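To make point 4 concrete, here is a minimal sketch of a pipeline that dumps each intermediate result to its own JSON file. The step names, the `run_pipeline` helper, and the toy CSV-like input are all made up for illustration; the only idea taken from the comment is "every step's output should be inspectable in a text editor".

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(raw_rows, debug_dir=None):
    """Run each step in order, optionally dumping every intermediate result."""
    # Hypothetical steps: parse "name,score" lines, drop incomplete rows, convert scores.
    steps = [
        ("01_parsed", lambda rows: [r.strip().split(",") for r in rows]),
        ("02_filtered", lambda rows: [r for r in rows if len(r) == 2 and r[1]]),
        ("03_scored", lambda rows: [{"name": n, "score": float(s)} for n, s in rows]),
    ]
    data = raw_rows
    for name, step in steps:
        data = step(data)
        if debug_dir is not None:
            # One human-readable file per step, viewable in any text editor.
            path = Path(debug_dir) / f"{name}.json"
            path.write_text(json.dumps(data, indent=2))
    return data


debug_dir = tempfile.mkdtemp(prefix="pipeline_debug_")
result = run_pipeline(["alice,3.5", "bob,", "carol,4.0"], debug_dir=debug_dir)
print(result)
print(sorted(p.name for p in Path(debug_dir).iterdir()))
```

When a result looks wrong, opening the numbered files in order usually shows which step first produced the bad value.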

Maintenance programming for a large, unfamiliar codebase:

1. Take a look at filesizes. The biggest files usually contain the meat of the program, or at least a dispatcher that points to the meat of the program. main.cc is usually tiny and useless for finding your way around.

2. Single-step through the program with a debugger, starting at the main dispatch loop. You'll learn a lot about control flow.

3. Look for data structures, particularly ones that are passed into many functions as parameters. Most programs have a small set of key data structures; find them and orienting yourself to the rest becomes much easier.

4. Write unit tests. They're the best way to confirm that your understanding of the code is actually how the code works.

5. Remove code and see what breaks. (Don't check it in though!)
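For point 1, a rough sketch of "look at file sizes" as a script, using only the standard library. The function name `largest_files` and the default list of skipped directories are my own choices, not anything from a particular tool.

```python
import os


def largest_files(root, limit=10, skip_dirs=(".git", "node_modules")):
    """Return (size_in_bytes, path) pairs for the biggest files under root."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # broken symlink, or file removed mid-walk
    # Largest first; ties are broken by path.
    return sorted(sizes, reverse=True)[:limit]
```

Calling `largest_files(".")` from the root of a checkout and printing the pairs gives a quick shortlist of where to start reading. Byte size is only a proxy: generated code and vendored blobs can top the list, so extend `skip_dirs` as needed.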

Performance work:

0. Don't, unless you've built it and it's too slow for users. Have performance targets for how much you need to improve, and stop when you hit them.

1. Before all else (even profiling!), build a set of benchmarks representing typical real-world use. Don't let your performance regress unless you're very certain you're stuck at a local maximum and there's a better global solution just around the corner. (And if that's the case, tag your branch in the VCS so you can back out your changes if you're wrong.)

2. Many performance bottlenecks are at the intersection between systems. Collect timing stats in any RPC framework, and have some way of propagating & visualizing the time spent for a request to make its way through each server, as well as which parts of the request happen in parallel and where the critical path is.

3. Profile.

4. Oftentimes you can get big initial wins by avoiding unnecessary work. Cache your biggest computations, and lazily evaluate things that are usually not needed.

5. Don't ignore constant factors. Sometimes an algorithm with asymptotically worse performance will perform better in practice because it has much better cache locality. You can identify opportunities for this in the functions that are called a lot.

6. When you've got a flat profile, there are often still very significant gains that can be obtained through changing your data structures. Pay attention to memory use; often shrinking memory requirements speeds up the system significantly through less cache pressure. Pay attention to locality, and put commonly-used data together. If your language allows it (shame on you, Java), eliminate pointer-chasing in favor of value containment.
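As an illustration of point 4 (cache the big computations, lazily evaluate what is rarely needed), here is a small sketch using `functools.lru_cache` and `functools.cached_property` (Python 3.8+). The `render_thumbnail` function and `Report` class are hypothetical stand-ins for expensive work; the `calls` counter exists only to show how often the real work runs.

```python
import functools

calls = {"render": 0, "stats": 0}


@functools.lru_cache(maxsize=1024)
def render_thumbnail(image_id, width):
    """Stand-in for an expensive computation; cached per argument tuple."""
    calls["render"] += 1
    return f"thumb:{image_id}:{width}"


class Report:
    def __init__(self, rows):
        self.rows = rows

    @functools.cached_property
    def stats(self):
        """Computed only on first access, then stored on the instance."""
        calls["stats"] += 1
        return {"count": len(self.rows), "total": sum(self.rows)}


render_thumbnail(7, 128)
render_thumbnail(7, 128)      # second call is served from the cache
report = Report([1, 2, 3])    # constructing the report computes nothing yet
report.stats
report.stats                  # reuses the stored value
print(calls)
```

Both decorators trade memory for time, so they cut against the advice in point 6 if the cached values are large. Measure against your benchmarks before and after.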

General code hygiene:

1. Don't build speculatively. Make sure there's a customer for every feature you put in.

2. Control your dependencies carefully. That library you pulled in for one utility function may have helped you save an hour implementing the utility function, but it adds many more places where things can break - deployment, versioning, security, logging, unexpected process deaths.

3. When developing for yourself or a small team, let problems accumulate and fix them all at once (or throw out the codebase and start anew). When developing for a large team, never let problems accumulate; the codebase should always be in a state where a new developer could look at it and say "I know what this does and how to change it." This is a consequence of the reader:writer ratio - startup code is written a lot more than it is read and so readability matters little, but mature code is read much more than it is written. (Switching to the latter culture when you need to develop like the former to get users & funding & stay alive is left as an exercise for the reader.)

> Take a look at filesizes. The biggest files usually contain the meat of the program, or at least a dispatcher that points to the meat of the program. main.cc is usually tiny and useless for finding your way around.

This is my #1 pet peeve with GitHub. When I first look at an unfamiliar repo, I want to get a sense of what the code is about and what it looks like. The way I do that with a local project is by looking at the largest files first. But GitHub loves their clean uncluttered interface so much, they won't show me the file sizes!


I think this Chrome extension called "GitHub Repository Size" might be exactly what you are looking for

Nice, thanks!

Yes, absolutely! I really want a whole-repo tree view of files along with their sizes and file types.


Check out Octotree if you're using Chrome. No file sizes still, but I've found that when you just want to quickly explore some potential new source this beats having to clone the repo first.

That is great, thanks!

> 2. Collect examples. Start with a spreadsheet of data that highlights the data you have available.

This is true not just for data science but when trying to solve any numerical problem. Using a spreadsheet (or an R / Python notebook) to implement the algorithm and get some results has helped me in the past to really understand the problem and avoid dead ends.

For example, when building a FX pricing system, I was able to use a spreadsheet to describe how the pricing algorithm would work and explain it to the traders (the end users). We could tweak the calculations and make sure things were clear to all before implementing and deploying the algorithm.

Great advice!

Great advice. One nit to pick:

> Don't ignore constant factors. Sometimes an algorithm with asymptotically worse performance will perform better in practice because it has much better cache locality.

Forget the cache, sometimes they're just plain faster (edit in response to comment: I mean faster for your use case). I've e.g. found that convolutions can be much faster with the naive algorithm than with an FFT in a pretty decent set of cases. (Edit: To be specific, these cases necessarily only occur for "sufficiently small" vectors, but it turned out that was a larger size than I expected.) Caching doesn't necessarily explain it I think, it can just simply be extra computation that doesn't end up paying off.

Good correction, but a small second nit.

> sometimes they're just plain faster

Not faster for sufficiently large N (by definition).

But your general point is correct.

I've best seen this expressed in Rob Pike's 5 Rules of Programming [0], Rule 3:

Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy.

[0] http://users.ece.utexas.edu/~adnan/pike.html

>Not faster for sufficiently large N (by definition).

True, but supposedly researchers keep publishing algorithms with lower complexity that will be faster only if N is, like, 10^30 or so.

Or so Sedgewick keeps telling us.

Great comment! I have one point to add to '"Researchey" green-field development for data-science-like problems':

6. Use assertions to define your expectations at each stage of the algorithm. They will make debugging much easier.
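A minimal sketch of what that can look like, using a made-up `normalize` step. The specific checks are assumptions about what this hypothetical stage expects, not a general recipe.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    # Preconditions: what this stage assumes about its input.
    assert len(weights) > 0, "expected at least one weight"
    assert all(w >= 0 for w in weights), f"negative weight in {weights}"
    total = sum(weights)
    assert total > 0, "weights must not all be zero"

    result = [w / total for w in weights]

    # Postcondition: what later stages are allowed to rely on.
    assert abs(sum(result) - 1.0) < 1e-9, f"normalized sum is {sum(result)}"
    return result


print(normalize([2, 1, 1]))
```

A failed assertion points at the first stage where reality diverged from your expectation, instead of surfacing as a strange number three steps later. Note that Python skips `assert` statements when run with `-O`, so they are a development aid rather than input validation.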

To expand on your #2:

I work with a lot of traders. One antipattern I noticed is that when there's a problem with the data, they'll do all sorts of permutations and aggregations and then scratch their chins and ponder about it for hours.

Go to the fucking source and find an example of the problem! Read it line by line, usually it will be obvious what happened.

Corollary: Don't assume your data is correct; most outliers in a large data set are problems with the data itself. Build a few columns that serve as sanity checks. One good example is a column that shows the distance between this sequence number and the last: anything >1 is a dropped message.
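The sequence-number check is easy to sketch. This assumes rows arrive as `(sequence_number, payload)` pairs already in arrival order; the `add_gap_column` name and the sample messages are invented for the example.

```python
def add_gap_column(rows):
    """Attach the distance from the previous sequence number to each row."""
    out = []
    prev = None
    for seq, payload in rows:
        # The first row has no predecessor, so its gap is left as None.
        gap = None if prev is None else seq - prev
        out.append({"seq": seq, "payload": payload, "gap": gap})
        prev = seq
    return out


messages = [(101, "a"), (102, "b"), (105, "c"), (106, "d")]
checked = add_gap_column(messages)
dropped = [r for r in checked if r["gap"] is not None and r["gap"] > 1]
print(dropped)
```

Here the row with sequence number 105 is flagged with a gap of 3, meaning 103 and 104 never arrived. A gap of 0 or a negative gap would be worth flagging too (duplicates or reordering), depending on the feed.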

Great list. The only thing I'd add is to lean towards clever, and often simple, architecture when modelling a solution; it will often beat clever programming.

Addition to Performance/2.: Synchronization costs are typically the biggest deal in applications that involve I/O (e.g. hard drive or network). Try an average database transaction with and without synchronization. 1) On sqlite3, it's dozens vs hundreds of milliseconds. Bigger databases, probably not much difference. 2) Look up NFS sync issues. It's a huge speed/safety tradeoff. 3) On some file systems, a Debian installation may take 10 minutes or 90 minutes depending on whether you disabled sync (eatmydata command).
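If you want to try the sqlite3 comparison yourself, here is a rough sketch using Python's built-in `sqlite3` module and SQLite's `PRAGMA synchronous` setting. The table layout, row count, and the `time_inserts` helper are arbitrary choices of mine, and the numbers will vary a lot with disk and file system, so treat it as a way to measure rather than a claim about what you will see.

```python
import os
import sqlite3
import tempfile
import time


def time_inserts(synchronous, n=100):
    """Time n single-row transactions under the given PRAGMA synchronous mode."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    try:
        # Valid modes include OFF, NORMAL and FULL.
        conn.execute(f"PRAGMA synchronous = {synchronous}")
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
        conn.commit()
        start = time.perf_counter()
        for i in range(n):
            conn.execute("INSERT INTO t (v) VALUES (?)", (str(i),))
            conn.commit()  # one transaction per row, so every row pays the sync cost
        elapsed = time.perf_counter() - start
        count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    finally:
        conn.close()
        os.remove(path)
    return elapsed, count


for mode in ("FULL", "OFF"):
    elapsed, count = time_inserts(mode)
    print(f"synchronous={mode}: {count} rows in {elapsed * 1000:.1f} ms")
```

With `synchronous = OFF` a power loss or OS crash can corrupt the database, which is the speed/safety tradeoff in question. Batching many rows into one transaction is usually the safer way to get most of the speed back.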

This is a great comment! If you have a blog could you make it a blog post? It deserves to be read more widely.

I've been tempted to start a blog...I've gotten a few requests on HN...but I'm still in the "rather be a programmer than a web celebrity" phase. I'm afraid it'll be too much of a distraction from my projects. Plus, I usually write better in response to a prompt than coming up with content in a vacuum.

You could perhaps consider turning such a 'response to a prompt' into a full-blown post on whatever easy-to-use blogging system (Medium, wordpress.com, hell even pastebin or its ilk?), and linking to it alongside a comment.

On HN, I often check specific commenters activity, or new comments in a thread that can be days-old, because I often find hidden gems. A link to more elaboration would definitely count as such. Perhaps an audience of even just a dozen or so people like me might be worth it.

EDIT: Coincidentally (I swear), there are exactly a dozen comments that positively engage with your comment!

Thanks for these notes!

I reformatted this in a Google Doc, if anybody wants: https://docs.google.com/document/d/1Pix3-l3Qz1aLOuxoiiP1PTV6...

Could you put a link back to the original source in the doc? That way you capture the comments & discussion, which has a bunch of useful clarifications.

There is a link to your comment thread in the footer. I'll add a link to the entire OP as well.

Point #5 about unit tests is so true; I wish I'd known it before jumping on the bandwagon. I wasted so much time writing TDD code only to learn later that the specs had changed. This can save you an insane amount of pain and time.

I hate printers. I haven't used one in months. However, I just printed this comment and I'm hanging it up in my office. Perfect.

You're a genius! How long have you been in the development world, if I may ask?

12 years, plus college, a gap year, and a couple internships & projects in high school.

It's helped that it's really been 12 (well, 13+change) years of experience, rather than one year repeated 12 times. Each year has brought something new and different that's just out of my comfort zone.
