Hacker News
A Modern App Developer and an Old-Timer System Developer Walk into a Bar (zhen.org)
148 points by zhenjl on Feb 14, 2016 | 82 comments

I suppose I'm a modern app developer, so here's what I would actually do:


1) I'm not going to invent my own system. I am not the first person who wants to store addresses and boolean values. If I'm looking at this occasionally over many months, I don't want to keep relearning my elegant bit-structure.

2) SQL is great, Postgres is even better with built-in network operators[1] for the more specific queries (e.g. subnets.)

3) Bitfields are all fun and games until you want to change things. I'm already having flashbacks to rails `has_bitfield` columns. Also, fewer people can confidently twiddle bits than can understand boolean SQL columns.

4) The author mentions possible compression methods. I'd rather a tested project like Postgres think about everything to do with storing my data.

I understand the criticism, but with cloud storage so cheap it feels like optimization just to show off. I'd rather save time than bits. If there is a satire that involves me using Postgres when I shouldn't, that would be welcome.

[1] http://www.postgresql.org/docs/current/static/functions-net....

Exactly. Change "json" to "relational DB" and it's not nearly as ridiculous as the author implies. All the questions mentioned could be answered with a single SQL query, although it will take quite a while. And it is much more extensible than this "huge array" approach.

The only thing is that by saying "Postgres" you say pretty much nothing: the real question here is how you'd structure the db. 300M a month is pretty huge, so it's really worth thinking about whether we should just pile everything into 1 table (unique key ip+month, all ports are columns) or whether we can denormalize something.

How about using parent-child table partitioning ? I have used that and pretty much like the feature - https://wiki.postgresql.org/wiki/Month_based_partitioning

300M per month doesn't sound like all that much to me?

Well, it will be quite a table to do your selects against. Assuming that your hardware for such a project is fairly modest, after a couple of months the index won't really fit into memory, and even selecting a record by the key would take 30 jumps over that table just after 4 months. And most of the tasks in question require running through the whole thing. Not a tragedy, but well worth considering aggregating some data as it comes, especially as it costs pretty much nothing at this point (nmap takes longer anyway). Counting the number of IPs with port 22 open, for example, would just mean incrementing a number in a dedicated table by a simple rule in the code. Maybe it's worth inventing some more interesting denormalization in order to be able to run more complex queries easily.

Maybe I'm wrong: I'd have to try that out to see how fast it actually computes, but I remember having problems with much smaller tables: like these 300M in total, maybe. Although those were much "wider", with varchars and stuff.

Not straight Postgres, but this dataset would be ideal for a columnstore index with compression - huge runs of the same value are readily compressed with run-length encoding.

Partition the whole thing by the status of the up flag, perhaps additionally by month, and the SELECTs can be pretty well optimized.

I'm not really sure what the point of this article is. If you're clever, you can represent any data as a bit array. And once you're there, counting bits or XORing them together is easy. But is that system easy to understand? Is the code easy to read for the rest of the developers who work on the project? Or for future hires? What happens if the data set becomes too large to fit in memory on a single machine? Storage is cheap and getting cheaper every day. The whole data set described in this problem (done the "inefficient" way) fits on a flash drive at today's capacities.

There are always tradeoffs when it comes to architecture design. Speed and storage space are only two of the many factors that require consideration.

I don't understand this argument. Is it really likely that a spaghetti nest of Python, json, Hadoop, and Elasticsearch is easier to understand and to maintain than maybe 100-200 lines of self-contained Go code with minimal (no?) outside dependencies?

Or that new hires are going to think bit counting and XOR are ninja CS voodoo, while everyone already knows how to Elasticsearch?

How about cost? What's the monthly all-in cost for each approach?

This is a nice satire. It points out why modern software can be faddy, inefficient, fragile, hard to maintain, and expensive to run, with no obvious benefits (including "easy to understand", which seems to be code for "I already know how to use all those dependencies and APIs, so let's assume new hires will too.")

Not at all. I can appreciate an elegant, problem-specific solution as much as anyone else. The point is that there are always tradeoffs.

What's the query interface like for the bit array? Does it even have one? Seems like you'd have to sit down and add more code whenever you wanted to know something new about the data. Would you write your own query layer? Would you eventually need a dedicated team to maintain it? You may say that is an extreme conclusion, but I've seen this very story play out multiple times in large dev organizations.

As amorphic pointed out, indexing the data with ES makes it easy to access even for non-technical users. Each layer of abstraction comes with its own costs – of course – but also its own benefits. Tradeoffs.

> What's the query interface like for the bit array?

I don't see any request for a query interface in the list of requirements.

The point of the packed-bit array is that you don't need any kind of query interface other than a few accessor macros. It's a memory-mapped array of 64-bit integers; you just index into the array and use a macro to mask out the bits you're interested in.

The "old-timer" solution - which actually solves the stated problem instead of a hypothetical future problem with lots of extra requirements - can be written in a few pages of C. It shouldn't take more than an hour, perhaps two (plus some time for testing).

If and only if the requirements change to something significantly more complex would something like a query interface make sense (YAGNI). The task as stated just isn't very hard.

> But is that system easy to understand?

Yes, if you documented it properly.

> Is the code easy to read

I haven't used Go, but it's easy in C. Setting up macros for all the bit manipulation is a very common technique. Usage would be trivial - just call a couple of accessor macros. It's certainly easier than walking the parse tree of a JSON record. Using a simple bitmap would also skip the initial parsing step.

It's easier to use the bitmap. The "example usage" section below is very simple. Setting up ElasticSearch or fiddling with a JSON parser is more work (and a lot harder on CPU/RAM).

    #include <stdint.h>

    /* the "2nd array": one 64-bit word of packed port states per IP */
    uint64_t *port_states;
    #define IP_PORT_STATES(ipaddr) (port_states[(ipaddr)])

    #define PORT_22_INDEX 0
    #define PORT_80_INDEX 1
    // ...etc...

    /* some of the accessor macros; each port state is a 3-bit field,
       so the mask is 3 bits wide to match the shift */
    #define PORT_STATE_MASK (0x7)
    #define PORT_STATE(ipaddr, port_index) \
        (PORT_STATE_MASK & (IP_PORT_STATES(ipaddr) >> (3 * (port_index))))

    #define PORT_CLOSED 0
    #define PORT_OPEN   1
    // ...etc...

    /* example usage: count the IPs that have port 22 open */
    uint32_t open_count = 0;
    uint32_t ip = 1;
    do {
        if (PORT_STATE(ip, PORT_22_INDEX) == PORT_OPEN) {
            open_count++;
        }
        ip++;
    } while (ip < 0xffffffff);

> future hires

If they can't handle calling a couple macros over a uint64 array, I wouldn't recommend hiring them.

> too large to fit in memory

It's not necessarily in memory - the "old-timer" is using a memory mapped file.

edit: bugfix

This is 2016, why are you using #defines instead of "const int"s?

Because the last time I wrote a bitmap like that was about a decade ago using a proprietary compiler for a Z80 clone that implemented a "C89-like"[1] language? Call it a bad habit from too many years working with embedded micros.

If I was using a modern(-ish) C, an enum might be more appropriate.

[1] When you only have 16k of RAM and a 256 byte stack, having "int x;" default to "static" storage instead of the stack is a feature.

Assuming C code (not C++), then const int is not the same as a #define. An enum is only good if the values fit within the range of a native int.

static consts may not be optimized to literals, and constexpr is not fully supported.
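To make the #define / const int / enum point concrete, here is a minimal C sketch. The helper name `port_state_name` is hypothetical; the two constants mirror the macros in the code above. It relies on one fact about C before C23: enumerators are integer constant expressions, so they are valid `case` labels, while a `const int` object is not.

```c
/* Enumerators are integer constant expressions in C,
   unlike const int objects, so they work as case labels. */
enum port_state {
    PORT_CLOSED = 0,
    PORT_OPEN   = 1
};

/* Hypothetical helper: map a state value to a printable name. */
const char *port_state_name(unsigned state)
{
    switch (state) {
    case PORT_CLOSED: return "closed";
    case PORT_OPEN:   return "open";
    default:          return "other";
    }
}
```

The caveat raised above still applies: this only works while every value fits in a native int.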

"if the data set becomes too large to fit in memory on a single machine?"

You just use mmap. If it becomes bigger than the disk capacity of a single machine, you can still mount a distributed filesystem and use mmap.
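For anyone who hasn't done it, a rough sketch of the mmap approach on a POSIX system might look like the following. The function names `map_words` and `unmap_words` are made up for illustration, error handling is minimal, and the file is simply created or resized to the requested length.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map a file as an array of nwords 64-bit words, creating or resizing it.
   Returns NULL on failure. Hypothetical helper, not from the article. */
uint64_t *map_words(const char *path, size_t nwords)
{
    size_t bytes = nwords * sizeof(uint64_t);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)bytes) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (uint64_t *)p;
}

/* Release a mapping created by map_words. */
void unmap_words(uint64_t *words, size_t nwords)
{
    munmap(words, nwords * sizeof(uint64_t));
}
```

With `MAP_SHARED`, writes through the returned pointer end up in the file, and the kernel pages data in and out as needed, so the array does not have to fit in RAM.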

Slightly off topic: http://yourdatafitsinram.com/ and more seriously https://www.youtube.com/watch?v=SiSY1b0am5w which is a really good talk about the challenges of sysadmining for scientists. What do you do when 2TB of RAM isn't enough?

You add an index and smaller-block loader. The size and structure of a "smaller-block" depends on the project and how you are using it.

In more extreme situations, you get to learn all about materialized views, cache invalidation, etc.


I'll add that having your data in Elastic means that a (probably non-technical) user can still use it as part of their analysis without having to get dev involved.

Yes! ES is great for a lot of reasons, not the least of which is the one you point out.

"I can perform a simple AND operation on the 3 monthly bit arrays, and then count the number of “1” bits."

I think an old-timer would tell you that he'd be using an "OR" operation to count the number of hosts seen in the last three months.

As a long-time embedded systems engineer, you learn to guard every CPU cycle and memory location jealously. That behavior, along with avoiding needless abstractions that complicate the system AND cost CPU and memory, has served me well as I've switched to higher-level software.

As an aside, memory resident databases are now practical even for problems of this size. I've used the CERN Colt library to provide sparse arrays stored in memory in place of databases several times now. My current pet project is processing a mouse genome for my daughter's lab. With the amount of data being read, I'd always be I/O bound if I wasn't packing a lot of working data into memory. This is the programmer's equivalent of the CPU's cache, but one we can easily control. Give it a try!

It may be because I haven't had my coffee yet, but how would using OR help here? 1 OR 0 OR 0 returns 1, but the host was offline in 2 of the past 3 months, which means that it should be 0...

You might be right ... but it's entirely dependent upon what the author meant. The problem statement was "How Many Total Hosts Were Seen as “Up” in the Past 3 Months" which I read as "How Many Total Hosts Were Seen as “Up” in any of the Past 3 Months". You've correctly pointed out that the author might have meant "How Many Total Hosts Were Seen as “Up” in all of the Past 3 Months".

This sort of ambiguity is why BDUF is so hard to pull off. Only lawyers are trained to write completely unambiguous prose (and they fall back to having very precise meanings for words to help them succeed).

Given that the modern app developer did a de-dupe operation for his solution, and that the most plain reading of the English text is "in any of the scans", I think the OR is correct.
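The two readings are easy to compare side by side. A minimal sketch in C, assuming each month is a packed bit array with one bit per host; the function names and the portable popcount helper are mine, not the article's:

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 64-bit word (portable, no compiler builtins). */
static unsigned popcount64(uint64_t w)
{
    unsigned n = 0;
    while (w) {
        w &= w - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hosts seen "up" in ANY of the three months: OR the words, then count. */
uint64_t count_up_any(const uint64_t *m1, const uint64_t *m2,
                      const uint64_t *m3, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += popcount64(m1[i] | m2[i] | m3[i]);
    return total;
}

/* Hosts seen "up" in ALL three months: AND the words, then count. */
uint64_t count_up_all(const uint64_t *m1, const uint64_t *m2,
                      const uint64_t *m3, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += popcount64(m1[i] & m2[i] & m3[i]);
    return total;
}
```

Either way it is one pass over the arrays; only the operator changes with the interpretation of the question.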

Should the last operation by the old-timer system developer be "(this_month XOR last_month) AND this_month" instead of "(this_month XOR last_month) XOR this_month" ? Or am I missing something there?

Why wouldn't you want to use an SQL database with an index to store this data? Just have a column for the ports and a boolean flag for the status of the port?

Honestly, I'd probably use bash/nmap/ping and psql to insert data. Want to query? psql and grep.

These examples seem like a great way to re-invent the wheel with modern buzzwords.

I was thinking this as well. This is a problem with a well-defined schema. Why would you use a schemaless NoSQL database? Additionally, many of the bitwise operations used by the old school dev have analogous abstractions in SQL, which are probably easier to understand.

If there isn't that much data, awk and sed will do the job.

Taking the story to the extreme, the modern-day developer should take Hadoop (despite the fact it would be 235x slower than awk - http://aadrake.com/command-line-tools-can-be-235x-faster-tha...) and the old-timer should take some APL descendant like Q with kdb+.

Why not SQLite ?

I'd just take that as favorite tool for the job, SQLite's pretty interchangeable. Also, if you've already gotten a database engine running, it's probably best to just keep everything there. Otherwise, you get into the situation I was in at a previous job in which certain page loads from an app written before I started there required a sun jsp app for the main part of the page, which made requests to pull some data from an Oracle database, some data from an apache mod_perl app that queried from a mysql database, some data from an apache mod_jk app that pulled from a c++ app with its own database format, and some data from an apache mod_jk app that stored and retrieved from a postgresql instance. Everything was nice and well-documented, but mein gott was that a lot of moving parts.

In PostgreSQL you could use a native type to store your IP addresses and subnets (even for IPv6): http://www.postgresql.org/docs/9.5/static/datatype-net-types...

or redis

Business Guy shakes his head and gets a subscription on https://www.shodan.io/

I was afraid I was going to miss the joke until I clicked the link and got a real chuckle.

And that is so often the best kind of response.

Lost me at "Old-Timer Developer: I will use Go."

The old-timers in your vicinity must be different than the ones in mine.

I think we might come from similar vicinities. Here I am thinking Go was modern. I expected the old-timer to be using C, as that's probably what I would have done.

"What's this Go? C for people that are afraid of malloc() ?"

I think Go is an acceptable modern language for lovers of C. Especially those who are allergic to C++.

Go is an acceptable modern language for modern haters of C

Ah. Thank god the imaginary old school guy is better than the imaginary straw man modern guy! I'd better go learn Go immediately.

Personally I'd start off by looking into using Cython or Numpy and maybe pickle to disk for storage.

Does that make me the old guy or the new guy or is it a false dichotomy?

Funny quote from the article: "This is a big data problem." Hmm maybe on a raspberry pi, or some kind of retrocomputing challenge (Do this on a simulated IBM 1130!).

I don't know how an app developer thinks, but I know the article's characterization of the old developer is wrong: we'd chuck it into the existing relational DB, maybe spin up a new cloud instance and reserve some NAS space if necessary, but this is a pretty small data set by modern standards so probably nothing special is required. All the reporting devolves into a silly SQL one-liner competition. The first question is a COUNT(*) and GROUP BY. The second question is a ridiculously simple SELECT. The third question is another simple select if you stored your ip addr bytes in separate columns, even if you allow non /24 addrs. The fourth is another GROUP BY.

The article does fit the stereotype I've seen that new programmers prioritize ease of storing data over ease of reporting, and old timers vice versa. Like the difference between coming up with the fastest next move in checkers vs "solving" checkers in the game theory sense.

Maybe this is a stupid question, but why not just use a relational database?

Because then it would seem that modern app developers are actually competent, and that's not the author's intent.

Is this post Go biased? Why not C++ or Erlang? :) You can do the same with Python

Agreed, I think very few "Old-timers" would be using go.

Looking at the other posts in the blog, it's definitely Go biased. Not that it's necessarily a bad thing.

But it's making some seriously tenuous assumptions about "modern app developers". It's almost "Goofus & Gallant" levels of hyperbole.

A couple of errors near the end for "How Many Hosts Were “Up” Last Month But Now It’s “Down”" and vice versa. The error is that using xor on the same item twice is a no-op...

(A xor B) xor A == B.

The correct equations would be,

!this_month and last_month = "up" last month, down now.


this_month and !last_month = "down" last month, up now.
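A small C sketch of the corrected expressions, applied to one 64-host word at a time. The function names are illustrative only; `double_xor` is the expression as the article wrote it, included to show that it collapses to `last_month`.

```c
#include <stdint.h>

/* Hosts "up" last month but "down" now. */
uint64_t went_down(uint64_t this_month, uint64_t last_month)
{
    return ~this_month & last_month;
}

/* Hosts "down" last month but "up" now. */
uint64_t came_up(uint64_t this_month, uint64_t last_month)
{
    return this_month & ~last_month;
}

/* The article's double XOR: the two XORs with this_month cancel,
   so the result is simply last_month. */
uint64_t double_xor(uint64_t this_month, uint64_t last_month)
{
    return (this_month ^ last_month) ^ this_month;
}
```

Counting the set bits of each result over the whole array then gives the two totals.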

This just compares a decent programmer with a terrible one. Not a modern app developer with an old timer. This is especially evident where the systems programmer is aware enough to use Go (which is a debatable choice, but certainly displays good awareness), but the modern app dev doesn't know when (not) to use Big Data and when to try optimizing things (e.g. the JSON choice).

The only "mistake" I see a modern app developer making might be using Python, but hey, computers are fast (so python should still be fine). Most would probably come up with a similar bit-twiddly solution or a simple DB-based solution. No biggie.

There's no dearth of horrible systems code out there either. It might be less likely for a systems programmer (even a horrible one) to mess up this exercise, but you could probably choose an exercise which would have the reverse effect. In the end, it's a toy exercise, not one where you actually design a significant piece of software.

There is a point buried in all of this; which is that learning systems programming will probably make you a better modern app developer since you get the correct mindset to tackle problems like this (also, vice versa?). But there's too much hyperbole obscuring it.

There's also the other point about how systems programming is done vs modern app dev. It's a valid one, but there are benefits to both approaches, and it boils down to each being useful in its own domain.

As VCs often look for keywords, I suppose the trick is to build the old-timer system but describe it as the modern web developer.

Lol store the bits in a file called hadoop.txt. if anyone asks how you did a query say you looked in hadoop.

Couple of defensive, bruised egos here in the responses. Lighten up, it's funny! We all know that one guy for whom the answer to every question is HADOOP MAP REDUCE!! I thought the writing was fun.

Unfortunately our hero the array developer got a bit too clever, pun intended, and produced an incorrect program. The last two of the XOR tricks won't work the way he thought.

Point of this article is not the point of this article. The clever bit was:

> ...the IPv4 address will convert into a number...

Many fancy, post-modern app developers might insist, "You won't do math with an address; it's not a number." But some things are numbers, with a scheme and pattern you can exploit.
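As a sketch of that trick in C: parse the dotted quad into a 32-bit number, then use the number directly as a bit index. The function names are hypothetical, and the layout (64 hosts per word, lowest bit first) is an assumption rather than anything the article specifies.

```c
#include <stdint.h>
#include <stdio.h>

/* Parse "a.b.c.d" into a 32-bit number. Returns 0 on success, -1 on bad input. */
int ipv4_to_u32(const char *s, uint32_t *out)
{
    unsigned a, b, c, d;
    char extra;
    /* Exactly four fields must convert; trailing characters make it five. */
    if (sscanf(s, "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;
    *out = ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
    return 0;
}

/* Set the "up" bit for an address in a packed bit array (64 hosts per word). */
void mark_host_up(uint64_t *up_bits, uint32_t ip)
{
    up_bits[ip >> 6] |= UINT64_C(1) << (ip & 63u);
}

/* Test the "up" bit for an address. */
int host_is_up(const uint64_t *up_bits, uint32_t ip)
{
    return (int)((up_bits[ip >> 6] >> (ip & 63u)) & 1u);
}
```

A full IPv4 bit array in this layout is 2^32 bits, or 512 MiB, which is why the memory-mapped file comes up elsewhere in the thread.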

Old-time system developers use C.

Old-time system developers wrote Go.

Wouldn't you classify golang authors as old-time system developers?

And likely C89... certainly not Go.

There's no simple solution without a lot of context. http://mockingeye.com/a-classic-of-soviet-engineering/

It's fascinating how many people here are defending the "modern app developer" approach, mostly with arguments about flexibility, maintainability, ability to pass on the code to junior developers, and the like. If you think about what the "old-timer's" code here would actually be, these kinds of objections make no sense at all.

Assuming the code to run nmap itself would be equivalent either way and we're interested in the data storage and analysis functions here, the old-timer would write that entire part of the program in about one screen of any decent programming language in perhaps 10 minutes. The functions would be short and simple, needing only basic iteration and bitwise arithmetic. Any junior programmer who's going to get anywhere in this industry would be able to understand that code in moments with no specialist knowledge or additional training. Nothing about the code would get in the way of any reasonable commenting or testing policy either.

If in the future someone didn't find the compact data structure appropriate for some new application, they could easily convert the data to a more suitable alternative format, because the current format would be well-specified, simple, efficient, and without external dependencies.

Hypothetical arguments about scalability are silly. The problem is fundamentally built around IPv4 addresses, which have been 32 bits wide since they were devised and will still be 32 bits wide tomorrow and next year. Designing for something more scalable is some horrible combination of scope creep, YAGNI violation, and worst of all, not giving even cursory thought to what the requirements actually mean. (I await the seemingly inevitable unintentionally amusing response about IPv6...)

The only thing I really quibble with here is the characterisation of the two types of developer. I don't think this is really about modern vs. old-timer. It's just about a good programmer -- who looks at each problem on its merits, chooses suitable tools for the job, and leaves their options open -- and the bad programmer, who does not.

Well, that and the fact that no self-respecting old-timer would misuse the word "performant" so heinously, but I digress. :-)

There's absolutely a scalability problem lurking there regarding ports, and what happens when you want to scan more than 21 of them.

How so? The old-timer's approach would scale trivially to several times that many ports before running out of RAM on a modern PC, or many times more just by introducing a memory-mapped file. It would still be simpler and quicker to implement than the kinds of external dependencies or formal database structures being proposed by others here, and the alternatives have significant scaling concerns of their own by that point, not least if you're paying for resources like CPU, storage and network data transfer on a metered scheme.

In any case, you'd probably reach a point where sparsity dictated a different choice of data structure for efficiency long before the dense representation actually broke, at which point the old-timer would most likely suggest a different solution rather than stubbornly sticking with the one that is no longer a good fit, and maybe the modern app developer would too.

Realistically, there are only a few hundred major protocols whose recognised standard ports you'd want to scan assuming we're talking about TCP here, so this still looks to me like inventing hypothetical future scenarios and over-engineering to allow for unlikely future requirements.

That's actually a good comparison between different mindsets. Even if it reads a bit biased pro-go (or pro-old-time-system-developer) it doesn't tell what's the best way to implement this - and in my opinion it's neither one.

So what's actually the best way to implement this? What are the motivations for choosing one way?

"Modern App Developer Way": + adoptable, scalable, readable data formats - storage, computation

"Old-Time System Developer Way": + storage, computation - "locked-in" data formats

"Database Developer Way": + ease of implementation ? storage, computation

Any other options/ideas?

Use Redis?

> Old-Timer Developer:

> I will use [a language with mandatory garbage collection]

I lol'd

As a relatively junior developer, I have been told by some of my senior colleagues that I sometimes miss the forest for the trees, and so this may be another instance of that.

But it doesn't seem "trivial" to me how one gets the output of nmap into whatever database or data structure one chooses here. I know nmap can produce XML, presumably there is a csv format, but it would seem like the XML/CSV -> JSON conversion (following our intrepid Modern App Developer) would be an easier, more maintainable way to go, versus XML -> bit array (memory-mapped file). Also, is managing nmap or masscan and whatever other ancillary processes are required to execute this plan equally onerous in either paradigm? Finally, and this is likely controversial, this particular problem "feels" like it's stacked against the Modern App Developer, given that it isn't trying to solve a problem most Modern Apps try to solve (or try to solve as an end rather than a means to an end).

Or, if you want to use a binary format without having to deal with all of the complexities of inventing your own, you can also use Protocol Buffers. You can even use Protocol Buffers as a schema for JSON too as of 3.0.

Pentester: I'll use mass scan, someone else already solved this problem. https://github.com/robertdavidgraham/masscan

I'm betting on the modern app developer's system being much easier to maintain and understand (and pass off to junior programmers), at the cost of speed.

I don't know about that. I'm a front-end developer, so I know none of the above. I'd rather decipher some complex bit-counting scheme one person made (and documented) in a language I don't write, than to try to piece together a whole system of tools written by various people I can't talk to, and tied together by one person for this specific task.

Which one is going to require more reading? Which one is going to be easier to refactor? Although neither developer sounds like a teammate I'd hope to collaborate with, I'd much rather inherit Mr Old-Timer's code to maintain than Mr Modern Developer's.

I was going to say no way, basic boolean logic is doable by every programmer. But then I saw the "old-timer" had some boolean errors themself. So......Maybe you're right.

If anybody is curious, the zmap project from the University of Michigan does this and more.


One could differentiate the two proposed solutions by the level of involvement of third parties (e.g., cloud hosting providers, authors of garbage collected or scripting languages, etc.). Some of these third parties have been involved since "old times", others have not.

Based on my recent experiences, the modern app developer would write a slow, bloated web app that takes ages to produce a result but has a slick UI with awesome animations.

Did this all happen while they were in the bar? Seriously though, both stereotypes are off-base (Go? really?)

I think this is unfair because it's basically in the old-school systems domain of problems.

basically just an example of how choosing a bad data structure can make your life a headache

Am I the only one who thinks that there are far more efficient data structures than a bit array to represent this data ?

How about an RLE encoded list of hosts, for instance ? (since they're consecutive 32-bit integers). There'd be way less data than in a bitfield which will make most of these queries far faster than iterative bitfield lookups. Also, much more of the data would fit and stay in memory, which means that all queries that iterate over it will be 10x faster or maybe more.
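A rough sketch of what such a run-length encoding could look like in C, assuming the "up" hosts arrive as a sorted list of 32-bit addresses with no duplicates. The struct and function names are invented for illustration, and whether this beats the bit array depends entirely on how long the runs are in real scan data.

```c
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive "up" addresses. */
struct ip_run {
    uint32_t start;  /* first address in the run */
    uint32_t len;    /* number of consecutive addresses (assumed < 2^32) */
};

/* Encode a sorted, duplicate-free address list into runs.
   Writes at most max_runs entries and returns how many were written. */
size_t rle_encode(const uint32_t *ips, size_t n,
                  struct ip_run *runs, size_t max_runs)
{
    size_t r = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64-bit sum so start + len cannot wrap near 255.255.255.255 */
        if (r > 0 &&
            (uint64_t)runs[r - 1].start + runs[r - 1].len == ips[i]) {
            runs[r - 1].len++;        /* extends the current run */
        } else {
            if (r == max_runs)
                break;                /* output buffer is full */
            runs[r].start = ips[i];   /* starts a new run */
            runs[r].len = 1;
            r++;
        }
    }
    return r;
}

/* Total number of hosts covered by the runs. */
uint64_t rle_count(const struct ip_run *runs, size_t nruns)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nruns; i++)
        total += runs[i].len;
    return total;
}
```

Counting hosts becomes a sum over runs instead of a popcount over 2^32 bits, but set operations between months (AND, OR) turn into interval merges rather than single instructions.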

Of course experimenting with data structures is not something you can do very efficiently in Go, as it'll be painfully verbose code. C++ would be far more useful.

But this is basically the old argument for/against optimizing code. The real problem with the "old timer" programmer is that there are quick ways to break his program. When new queries present themselves, the "old timer" will quickly find that his data structures are not optimized for the queries, or that they require complex calculations, which means that for data analysis the modern app developer will probably "win".

When it comes to putting a product in production, it needs to be fast and cheap. Any company that doesn't hire an old-timer developer for that will quickly find their costs exploding. This may be acceptable for a few weeks when trying to find product/market fit but it won't last long.

My background is much closer to the "old timer" but I think I'm definitely on the side of modernity here, along with everybody shouting "sql".

Also I don't see why the old timer needs both a bit array and a uint64 port array, they can put the up/down bit in the high bit of each uint64.
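A sketch of that single-array layout in C, assuming 21 ports at 3 bits each (bits 0 to 62), which leaves bit 63 free for the up/down flag. All names here are hypothetical.

```c
#include <stdint.h>

#define PORT_BITS 3u                    /* assumed: 3 bits of state per port */
#define PORT_MASK UINT64_C(0x7)
#define MAX_PORTS 21u                   /* 21 * 3 = 63 bits, bit 63 is spare */
#define UP_FLAG   (UINT64_C(1) << 63)   /* host up/down lives in the high bit */

/* Is the host marked "up" in this record? */
static inline int host_up(uint64_t rec)
{
    return (rec & UP_FLAG) != 0;
}

/* Return the record with the up flag set or cleared. */
static inline uint64_t set_up(uint64_t rec, int up)
{
    return up ? (rec | UP_FLAG) : (rec & ~UP_FLAG);
}

/* Read the 3-bit state of port slot idx (0 .. MAX_PORTS - 1). */
static inline unsigned get_port(uint64_t rec, unsigned idx)
{
    return (unsigned)((rec >> (PORT_BITS * idx)) & PORT_MASK);
}

/* Return the record with port slot idx set to state (0 .. 7). */
static inline uint64_t set_port(uint64_t rec, unsigned idx, unsigned state)
{
    unsigned shift = PORT_BITS * idx;
    rec &= ~(PORT_MASK << shift);                   /* clear the old field */
    return rec | (((uint64_t)state & PORT_MASK) << shift);
}
```

The trade-off is that "how many hosts are up" now means testing one bit in each of 2^32 words instead of a popcount over a separate, much smaller bit array.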

Old timer spends rest of 2016 to implement his solution that saves $100 in computing resources.

Reality: Old timer spends about 10 minutes to implement his solution.

If the old timer's solution does save $100 in computing resources, then at that rate the old timer is saving around $1M per year compared to the "modern" developer.

Given both are experts, point-and-click will always be faster to implement.
