Hacker News
Hard Things in Computer Science (frankel.ch)
83 points by nfrankel 51 days ago | 45 comments

TTL certainly isn't the only or even best way to achieve cache invalidation - if the cache stores the result of time-intensive computations, then using a key that is a hash of all the inputs is sufficient (though you still need to deal with cleaning out old entries to reclaim space). From what I've observed often the "hard" problem with caching is knowing when it makes sense to use it at all - I've seen dramatic improvements in system behavior and reliability both from adding and removing caching.

Yeah I stopped reading at that point, cache invalidation is not 'also known as TTL'.

It's a simplification but it isn't wrong. Mostly.

Write-through caches don't need invalidation. You can also invalidate through cache shoot-downs. I'm sure there are many other techniques. Invalidation just means "what's your strategy for removing things from the cache?". Using an expiry age is only one technique.
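A toy sketch of the write-through idea (the dict-backed "store" is a stand-in for a real database):

```python
class WriteThroughCache:
    """Writes go to both the backing store and the cache in the same
    step, so cached entries never go stale and no separate
    invalidation pass is needed."""
    def __init__(self, store):
        self.store = store   # backing store, e.g. a dict standing in for a DB
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.store[key]   # fill on miss
        return self.cache[key]

    def put(self, key, value):
        self.store[key] = value   # write through to the store...
        self.cache[key] = value   # ...and update the cache at the same time

db = {"a": 1}
c = WriteThroughCache(db)
c.put("a", 2)   # both db and cache now hold 2
```

The trade-off, of course, is that every write pays the cost of hitting the backing store, and this alone doesn't help if other writers bypass the cache.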

'News aggregation, also known as HN, [...]'

Computing with limited numerical precision (aka floating point) is 1.000000000000003534 additional hard thing to deal with.

Modern computers can calculate with very high finite precision using decimal types. We have so much spare hardware today that if calculations of that sort are relevant, you can just throw more iron at it.

As someone who worked on indoor positioning at Apple, I'll disagree with this. We were doing some basic probability estimation of WiFi signals (ML techniques, but not AI). First, you still need the computation to finish within a reasonable time/power envelope. Second, you need it to be numerically stable (adding a large number of very large and very small floating-point values together). Doing this with arbitrary-precision libraries would be infeasible because, while fast, they aren't as fast as regular floating point. Additionally, using regular floating point meant that we could utilize BLAS and LAPACK, which had SIMD HW acceleration. In the end we managed to fit this within the power envelope of GPS with minimal CPU utilization (after many rounds of optimizing the CPU and WiFi scanning). Additionally, the runtime performance of the core numerical algorithm determined how long the training tools (building the maps was similar to locating yourself) and validation tools (checking if a commit changed runtime performance) ran.

Now sometimes you don't need this level of effort, but don't underestimate just how much CPU you might be leaving on the floor. Sometimes it can mean that what seems like an impractically slow algorithm actually runs fine. Even if you ran an arbitrary-precision version of the ML code on a modern desktop CPU, you might struggle to get the system to run in real time (we needed to run the simulation within 5-10 ms and then do a bunch of even more complicated math to compute the blue-dot location every 250 ms).
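For the "very large plus very small" summation problem mentioned above, Kahan (compensated) summation is the classic trick for staying in fast native floats while keeping the result stable. A sketch (the data values here are made up to show the effect):

```python
def kahan_sum(values):
    """Compensated summation: carries along the low-order bits that
    plain float addition discards when magnitudes differ wildly."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y  # what just got rounded away
        total = t
    return total

# One big value drowning out many tiny ones:
data = [1e8] + [1e-8] * 100_000
naive = sum(data)        # accumulates a rounding error on every add
stable = kahan_sum(data)  # close to the true 100000000.001
```

It costs a few extra flops per element, but stays vectorizable and is far cheaper than switching to an arbitrary-precision type.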

Nope! More precision sure helps but not in all cases. For example, having more precision will not help in boundary problems where you have to decide if something (for example an amount of money) is above or below a threshold.

Moreover, many problems related to money must compute with limited accuracy. For example, all intermediary results must be in 1/1000ths of euros. You can use floats (yeah, I know, everybody says it's a no-go) or fixed point, but in the end, you absolutely can't avoid thinking about rounding.
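A sketch of that rule with Python's `decimal` module; the "round intermediaries to 1/1000th of a euro" rule and the tax rate are just the example from above, not a real regulation:

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Hypothetical rule: intermediary results are kept in 1/1000ths
# of a euro, with an explicitly chosen rounding mode.
MILLI_EURO = Decimal("0.001")

def to_milli_euros(amount: Decimal) -> Decimal:
    return amount.quantize(MILLI_EURO, rounding=ROUND_HALF_EVEN)

price = Decimal("19.99")
rate = Decimal("0.0775")            # made-up tax rate
tax = to_milli_euros(price * rate)  # 1.549225 rounded at the 1/1000th
```

The point is that the rounding mode and the rounding step are explicit in the code; with raw floats or ad-hoc fixed point you still have to make the exact same decisions, just implicitly.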

See also catastrophic cancellation, [0] although I'm unsure of the impact of using very large floating point types.

[0] https://en.wikipedia.org/wiki/Catastrophic_cancellation
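A tiny demo of the effect: subtracting two nearly equal numbers wipes out almost all significant digits, and a mathematically equivalent rewrite avoids it entirely.

```python
import math

x = 1e-8
# Naive: cos(x) is so close to 1.0 that the subtraction cancels
# nearly every significant digit (it comes out as exactly 0.0
# in IEEE double precision).
naive = 1.0 - math.cos(x)
# Rewritten via the identity 1 - cos(x) = 2*sin(x/2)^2, which
# avoids subtracting nearly equal numbers:
stable = 2.0 * math.sin(x / 2.0) ** 2   # ~5e-17, i.e. about x**2 / 2
```

More bits delay the problem but don't remove it; pick a smaller `x` and the wider type cancels just the same.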

Some of us have to write SW that can run on a potato.

As another problem of applied CS, I'd also suggest adding "character encoding" to the list of strongly underestimated complex problems :)

Agreed, and also just text in general: internationalization, line breaks, horizontal vs vertical text, left-to-right vs right-to-left, etc, etc.

OTOH that starts getting quite "application oriented" and where do you draw the line? compsci is perhaps more about computation and how to achieve it, analyze it, etc, and in that case even dates and times should not be in the list, being an "application" domain concern.

  > where do you draw the line?
How many clients do you have, or want to have? How many bugs do you want them to suffer?

My language is written Right-to-Left. Jira couldn't care less about my market, so I don't use it, even though I could trivially add support myself in Firefox's User CSS file. But there are another 300 million people who speak RTL languages they are ignoring along with me. Is 300 million people a small market? "But everybody in tech speaks English" they argue.

Being _able_ to use their product in a language foreign to myself doesn't mean that I'll _choose_ to use their product given alternatives.

I meant where do you draw the line as to what is "computer science", and what is domain and application-specific problem-solving?

Of course your language has just as much right to be dealt with correctly in computers as every other, but that's probably a separate conversation about the right to equal access to technology whatever the economics (A cause I agree with), etc, etc.

  > I meant where do you draw the line as to what is "computer science", and what is domain and application-specific problem-solving?
Oh, I see.

I actually think that the line is clear. Solving a practical problem or bug? That's not computer science. Developing or improving a generally useful algorithm or technique, such as a sort function? That _is_ computer science.

Science is improving our understanding of how things work or can be made to work. Actually putting that knowledge to application is Engineering (or tinkering).


Text Rendering Hates You - https://news.ycombinator.com/item?id=21105625 - Sept 2019 (170 comments)

Here's a classic article on this topic: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

More than just encoding, the whole g11n (globalization) effort is definitely the hardest.

Most barely do l10n; some make it to proper i18n. Proper g11n? Maybe enough to count on the fingers of one hand.


I wholeheartedly agree. I was about to list it as a reference only, then I forgot.

Proving code correct is orders of magnitude harder than the rest IMHO. It is amazing that enough progress has been made that large scale projects like CompCert and seL4 can now be done successfully by small teams of people.
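For flavor, here is what a (toy) machine-checked proof looks like in Lean 4; CompCert's and seL4's proofs are vastly larger and live in Coq and Isabelle/HOL respectively:

```lean
-- A toy property, checked by the proof assistant rather than by testing:
-- reversing a list twice is the identity.
theorem rev_rev (xs : List Nat) : xs.reverse.reverse = xs :=
  List.reverse_reverse xs
```

The compiler rejects the file unless the proof term actually establishes the stated property, which is the whole game: the spec is a type, and the proof must inhabit it.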

Interesting... I didn't realize there were large open projects that were formally verified. Are there other examples besides CompCert and seL4?

I don’t know if you’d count it as large, but portions of Amazon FreeRTOS.

I'd say it is not hard but impossible. I have a hard enough time getting requirements specifications that are unambiguous. Proving I coded what I was asked for is not a rigorous scientific practice. It's a social contract.

I agree with "impossible". In fact, it's worse than you say. Specs can be unambiguous and still be wrong. How do you prove that the specification doesn't have a bug?

And then, you can formally prove that code does not have certain kinds of bugs. You cannot formally prove that code has no bugs, because 1) you don't even know all possible kinds of bugs, and 2) even for the kinds you do know, you don't have formal proofs for all of them.

Bugs in the specs. Yum.

That's a software engineering problem, not a computer science one.

No bugs have ever been found in CompCert and seL4. The NSA gave up trying to hack seL4, the first time that has ever happened. So yes, there might in theory still be bugs in the specs for CompCert and seL4. However, the results speak for themselves.

Non-proven correct software doesn’t even have formal specs. Having a formal spec is already light years ahead of other software because it forces you to nail down exactly what the code is supposed to do.

There are lots of domains where the domain itself is hard (compression and compiler design come to mind), but when it comes to pure computing I would say: lockless, high-contention multi-threading. It's the thing that has made everything hard and fun again for me.

I would add “synchronization of any kind”. Anything from versioning/code collaboration to syncing any kind of state between separate systems. I have worked with a lot of (mostly inexperienced) engineers who underestimated just how complex something like Git can be and the incredible amount of engineering behind it.

The hardest things in CS are also the simplest things:

* Strings (think Unicode, collations, string sizes, etc)

* Numbers (think currencies, precision, explaining floats to people, etc)

* Dates (as mentioned in the post)

And it's funny that we start teaching programming with these concepts.
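The strings point is easy to demonstrate: two strings that render identically can compare unequal until you normalize them. A minimal example with Python's `unicodedata`:

```python
import unicodedata

# Two visually identical strings: one uses the precomposed character
# U+00E9 ("é"), the other a plain "e" plus a combining acute accent
# (U+0301).
composed = "caf\u00e9"     # 4 code points
decomposed = "cafe\u0301"  # 5 code points

naive_equal = composed == decomposed   # False: code points differ
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))  # True
```

And that's before you get to collation, grapheme clusters, or "how long is this string", each of which has its own answer.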

Yeah, this nails it. In my distributed systems class, the first thing we learned was that communication over a network was not guaranteed. I remember thinking, well, how do you do anything? And then it dawned on me: that's literally why we have this as a field.

I think the same kind of thinking applies for all the simple things in each domain.

Not to diminish how hard these things can be. It just reads more like "hard things in (somewhat advanced) software engineering" than computer science, IMO. I expected things like "is graph isomorphism NP-complete?" to at least be included.

> The only reliable way to have bug-free code is to prove it. It requires solid mathematical foundations and a programming language that allows formal proofs.

I'm going to be the "actually" guy and say that, actually, you can formally verify some stuff about programs written in traditional/mainstream languages, like C. Matter of fact, this is a pretty lively research area, with some tools like CBMC [0] and Infer [1] also getting significant adoption in the industry.

[0]: https://github.com/diffblue/cbmc

[1]: https://fbinfer.com/

I stopped reading when he mentioned TTL and only TTL as a valid cache invalidation method. It may be in some contexts where stale data is not a deal-breaker, but certainly not in the general or even most cases. Fail.

Over the years I added my own: the hardest thing in computer science is making something useful usable.

Not explicitly listed, but obviously hard, by definition, in (theoretical) CS: NP-Hard and NP-Complete problems.


* knapsack (fit the most "boxes" into the minimum number of containers, sometimes with 4D+ "boxes")

* traveling salesman

I'm fascinated by problems that are NP-complete but still admit a ton of optimizations that can reduce the time needed by many orders of magnitude (millions or more), like SAT solving. Or approximate solutions / heuristics, like approximate traveling salesman. Seeing huge improvements on real-life problems, even when you know the problem is asymptotically exponential, feels like you've fought against impossible odds and won.
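Knapsack itself is a nice illustration of how "NP-complete" doesn't always mean "hopeless in practice": the textbook dynamic program is pseudo-polynomial, O(n × capacity), so it flies whenever the capacity is modest. A sketch (the item weights and values are made up):

```python
def knapsack(weights, values, capacity):
    """0/1 knapsack via dynamic programming. NP-complete in general,
    but pseudo-polynomial: O(n * capacity) time, O(capacity) space."""
    best = [0] * (capacity + 1)  # best[c] = max value using capacity c
    for w, v in zip(weights, values):
        # Iterate capacity downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

# Four items, capacity 10: the optimum picks the weight-4 and
# weight-3 items for a total value of 90.
result = knapsack([5, 4, 6, 3], [10, 40, 30, 50], 10)
```

The exponential worst case only bites when the capacity (or the numbers generally) are huge, which is exactly the "impossible odds" gap the comment above is describing.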

There's a whole subfield of CS, approximation algorithms, studying what you described. Interesting indeed.

Funny how many times in my career I've encountered the knapsack problem. But I can say off-by-one hit me many more times.

People say DNS is hard to get right but I don't see why. It's just cache invalidation and naming things, how hard can it be?

DNS isn't hard to get right. Using DNS practically is the thing that is hard. Is your cache coming from the NIC, the ISP, routers (and routers behind routers), operating systems (and operating systems behind operating systems), runtimes like containers (and runtimes inside runtimes), a Docker image that overrides /etc/resolv.conf, the programming language (and, if you're unlucky, a programming language that uses the stdlib of another language instead of the syscalls directly), third-party libraries (and fourth-party libraries used by third-party libraries), or maybe you yourself calling the wrong API? Or if you passed the domain name to a distributed system, the problem gets doubled because your distributed system may not share the same stack... and then you figure out it's actually because the guy who said they updated the DNS record actually updated the A record without updating the AAAA record.

I think the hard part of cache invalidation is talking about reliably invalidating caches in response to data changes. If you don't care about stale data then cache invalidation is not at all hard. You can just use TTL. TTL is not hard.

Anyway yeah there are definitely lots of hard things.
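For completeness, a minimal sketch of the TTL approach being discussed; the 50 ms TTL is just for demonstration:

```python
import time

class TTLCache:
    """Time-based expiry: an entry may be stale for up to `ttl` seconds
    after the underlying data changes, which is exactly the trade-off
    described above."""
    def __init__(self, ttl):
        self.ttl = ttl
        self._data = {}  # key -> (value, stored_at)

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]  # expired: drop the entry, report a miss
            return default
        return value

c = TTLCache(ttl=0.05)
c.put("k", 1)
fresh = c.get("k")   # 1, within the TTL
time.sleep(0.06)
stale = c.get("k")   # None once the TTL has elapsed
```

Simple as it is, it already shows the knob: a long TTL means more staleness, a short TTL means more misses, and neither tells you when the data actually changed.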

