Hacker News | x1f604's comments

From the book:

(Warning: Spoilers ahead)

> The next day I told Parry that I was flattered but would not make pentaborane. He was affable, showed no surprise, no disappointment, just produced a list of names, most of which had been crossed off; ours was close to the bottom. He crossed us off and drove off in his little auto leaving for Gittman's, or perhaps, another victim. Later I heard that he visited two more candidates who displayed equal lack of interest and the following Spring the Navy put up its own plant, which blew up with considerable loss of life. The story did not make the press.


When Gergel was writing this, I was working for one of those similar companies, a research chemicals company also fairly close to the bottom of that list.

We made lots of unique chemicals ourselves but distributed many more.

Quite a few came from Columbia Organics; I remember their isopropyl bromide well.


Hahaha. Fuck. The history of pentaborane is littered with human tragedy. What an appropriate compound for this troubled age.

Definitely don’t read about the history of acetylene then.

Same as it’s always been.


Hahah. Oh gosh. As an aside: Your username checks out. Azides are nothing to be sneezed at either, IIRC.

Hah, first time someone noted that connection!

On the original topic of the thread, check out Chemical Forces video on boranes - [https://youtu.be/8hrYlhTYl5U?si=4SDJq4MxAEu714iY]

I’m not a chemist, but I used to read my copy of 'Chemistry of Powders and Explosives' to get to sleep, and synthesized a few out of curiosity over the years. There are some real fun wiki holes in the topic too.

The azides do tend to be a bit unstable as well, same as the fulminates.

Most are still more stable than the organic peroxides, at least if they're uncontaminated.

Energetics chemists tend to be the Leeroy Jenkins of scientists.

Lead(II) azide has mostly been replaced in commercial use by lead styphnate or other compounds that are safer to synthesize: [https://en.m.wikipedia.org/wiki/Lead_styphnate]


In the analytical lab we had been using dinitrophenylhydrazine, in very low concentrations, for the determination of trace aldehydes. When the previous bottle was almost empty, I found out it could not be reordered from our established supplier. One chemist showed me how little there was left; he had been banging the bottle against the bench to get the last gram out. I was about in shock, apparently less so than the compound itself, and advised him not to do that again, because it's like a cross between TNT and rocket fuel.

Then I found out the DNPH was no longer available in dry form; it's now packed under water, under a different part number and with a revised SDS.


I don't think it's a register allocation failure; it's in fact necessitated by the ABI requirement (calling convention) that the first floating-point parameter be passed in xmm0 and the return value also be placed in xmm0.

So when you have an algorithm like clamp, which requires v to be "preserved" throughout the computation, you can't overwrite xmm0 with the first instruction; basically you need to "save" and "restore" it, which means an extra instruction.
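
A minimal sketch of that constraint (assuming the System V x86-64 calling convention, where the first float argument and the float return value both live in xmm0; clamp_ref below is just my illustrative stand-in, not the code from the article):

    // Reference-style clamp, matching the standard's wording:
    // return lo if v < lo, hi if hi < v, otherwise v itself.
    float clamp_ref(float v, float lo, float hi) {
        // The "otherwise" case must return v unchanged, so the value that
        // arrived in xmm0 has to stay live until both comparisons are
        // resolved; the compiler can't simply clobber xmm0 with the first
        // min/max instruction and typically spends an extra move preserving it.
        return v < lo ? lo : (hi < v ? hi : v);
    }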

I'm not sure why this causes the extra assembly to be generated in the "realistic" code example though. See https://godbolt.org/z/hd44KjMMn


Even with -march=x86-64-v4 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:

https://godbolt.org/z/hd44KjMMn


Even with -march=znver1 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:

https://godbolt.org/z/WMKbeq5TY


Yes, you are correct, the faster clamp is incorrect because it does not return v when v is equal to lo and hi.


I think the libstdc++ implementation does indeed have the comparisons ordered in the way that you describe. I stepped into the std::clamp() call in gdb and got this:

    ┌─/usr/include/c++/12/bits/stl_algo.h──────────────────────────────────────────────────────────────────────────────────────
    │     3617     *  @pre `_Tp` is LessThanComparable and `(__hi < __lo)` is false.
    │     3618     */
    │     3619    template<typename _Tp>
    │     3620      constexpr const _Tp&
    │     3621      clamp(const _Tp& __val, const _Tp& __lo, const _Tp& __hi)
    │     3622      {
    │     3623        __glibcxx_assert(!(__hi < __lo));
    │  >  3624        return std::min(std::max(__val, __lo), __hi);
    │     3625      }
    │     3626


Thanks for sharing. I don't know if the C++ standard mandates one behavior or the other; it really depends on how you want clamp to behave if the value is NaN. std::clamp returns NaN, while the reverse order returns the min value.


From §25.8.9 Bounded value [alg.clamp]:

> 2 Preconditions: `bool(comp(proj(hi), proj(lo)))` is false. For the first form, type `T` meets the Cpp17LessThanComparable requirements (Table 26).

> 3 Returns: `lo` if `bool(comp(proj(v), proj(lo)))` is true, `hi` if `bool(comp(proj(hi), proj(v)))` is true, otherwise `v`.

> 4 [Note: If NaN is avoided, `T` can be a floating-point type. — end note]

From Table 26:

> `<` is a strict weak ordering relation (25.8)


Does that mean NaN is undefined behavior for clamp?


My interpretation is that yes, passing NaN is undefined behavior. Strict weak ordering is defined in 25.8 Sorting and related operations [alg.sorting]:

> 4 The term strict refers to the requirement of an irreflexive relation (`!comp(x, x)` for all `x`), and the term weak to requirements that are not as strong as those for a total ordering, but stronger than those for a partial ordering. If we define `equiv(a, b)` as `!comp(a, b) && !comp(b, a)`, then the requirements are that `comp` and `equiv` both be transitive relations:

> 4.1 `comp(a, b) && comp(b, c)` implies `comp(a, c)`

> 4.2 `equiv(a, b) && equiv(b, c)` implies `equiv(a, c)`

NaN breaks these relations, because `equiv(42.0, NaN)` and `equiv(NaN, 3.14)` are both true, which would imply `equiv(42.0, 3.14)` is also true. But clearly that's not true, so floating point numbers do not satisfy the strict weak ordering requirement.
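
A quick way to see it concretely, using `operator<` as comp (a minimal sketch):

    #include <cmath>
    #include <iostream>

    // equiv(a, b) as defined above: neither compares less than the other.
    bool equiv(double a, double b) { return !(a < b) && !(b < a); }

    int main() {
        double nan = std::nan("");
        std::cout << equiv(42.0, nan)  << '\n'  // 1: comparisons with NaN are false
                  << equiv(nan, 3.14)  << '\n'  // 1
                  << equiv(42.0, 3.14) << '\n'; // 0: transitivity of equiv is broken
    }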

The standard doesn't explicitly say that NaN is undefined behavior. But it does not define the behavior when NaN is used with `std::clamp()`, which I think by definition means it's undefined behavior.


Based on my reading of cppreference, it is required to return negative zero when you do std::clamp(-0.0f, +0.0f, +0.0f), because when v compares equal to lo and hi the function is required to return v, which the official std::clamp does but my incorrect clamp doesn't.
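
A small check of that corner case (a sketch; std::signbit is used to tell -0.0f apart from +0.0f, since the two compare equal):

    #include <algorithm>
    #include <cmath>
    #include <iostream>

    int main() {
        float r = std::clamp(-0.0f, +0.0f, +0.0f);
        // v compares equal to both bounds, so std::clamp must return v itself:
        // the sign bit should still be set, i.e. r is -0.0f rather than +0.0f.
        std::cout << std::boolalpha << std::signbit(r) << '\n';  // true
    }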


Medium recommended this article to me. Reading it made me realize that I've wasted my youth on useless side projects that don't generate revenue, and that realization filled me with dread, despair, and self-hatred: I hate myself for not spending my youth more wisely. But I wanted to know what you guys think. Do you agree with it?


Passion projects are what really matters. I think making everything about revenue is what's wrong with the whole world, and I think feeling the way you feel (you wasted your time) is a direct symptom of living under such a parasitic economic system. Our right to access the basic necessities of life is dependent on our willingness to turn a profit for some already-rich person. You didn't waste your youth, you used it well. It would have been a waste to spend such a wonderful time working yourself to death.


No, I don't agree. There's more to life than money. In money terms, you may have wasted your time. Did you learn anything? Did you have fun? Not a total waste, then.

Look, you can make a side project into an obsession that eats your life. Don't do that. But for more reasonable levels of "side project", they have a place even if they never make any money.


> Consider a company that stores users’ emails in the cloud — that is, on a vast array of servers. You can think of the whole collection of emails as one long message. Now suppose one server crashes. With a Reed-Solomon code, you’d need to perform a massive computation involving all the encoded data to recover your emails from that one lost server. “You would have to look at everything,” said Zeev Dvir, a computer scientist at Princeton University. “That could be billions and billions of emails — it could take a really long time.”

I have to take issue with the above characterization. It seems to imply that a server crash means the user has to wait for the data to be reconstructed, or that it will necessarily take a long time for the data to be reconstructed. But I don't think either of these claims is true in the general case.

We can look at Backblaze for a real world example of how an actual file storage company uses Reed-Solomon for error correction:

> Every file uploaded to a Backblaze Vault is broken into pieces before being stored. Each of those pieces is called a “shard.” Parity shards are added to add redundancy so that a file can be fetched from a Backblaze Vault even if some of the pieces are not available.

> Each file is stored as 20 shards: 17 data shards and three parity shards. Because those shards are distributed across 20 storage pods in 20 cabinets, the Backblaze Vault is resilient to the failure of a storage pod, power loss to an entire cabinet, or even a cabinet-level networking outage.

> Files can be written to the Backblaze Vault when one pod is down, and still have two parity shards to protect the data. Even in the extreme and unlikely case where three storage pods in a Backblaze Vault are offline, the files in the vault are still available because they can be reconstructed from the 17 pieces that are available.

So Backblaze splits each file into 20 shards, with 3 of those being parity shards, so that only 17 out of 20 shards are necessary to reconstruct the original file.

Regardless of whether you store each email in a separate file or all your emails in one giant file, the point is that your emails will be divided into 20 pieces across 20 separate physical machines, so that the loss of any one machine (or even an entire cabinet) will not impact your access to your emails.

I would be extremely surprised if any real company that was actually in the business of storing user data (e.g. AWS, Azure, GCP, Backblaze etc) would store user data in such a way that the crash of a single server would require a "really long time" for the user data to be recovered. Rather, I think it's most likely that the loss of a single server should not have any noticeable impact on the time that it takes for a user to access the data that was stored on that server.

As for the second claim, I don't think it should take "a really long time" to recover even billions of emails. I know that (depending on the parameters) the Intel ISA-L Reed-Solomon implementation can achieve a throughput of multiple GB/s on a single core. So even if you were storing all your emails in a single, really huge file that was tens of gigabytes in size, it still shouldn't take more than a few minutes to recover it from the available shards and to regenerate the shard that was lost.
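
To make the locality point concrete, here is a toy sketch with a single XOR parity shard per file. It's a degenerate erasure code rather than real Reed-Solomon (Backblaze's 17+3 scheme is far more capable), but the locality argument is the same: regenerating one lost shard only touches the other shards of that same small group, not every byte the service has ever stored.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Split one file's bytes into k equal-size data shards (zero-padded),
    // plus one XOR parity shard appended at the end.
    std::vector<std::string> make_shards(const std::string& file, std::size_t k) {
        std::size_t shard_len = (file.size() + k - 1) / k;
        std::vector<std::string> shards(k + 1, std::string(shard_len, '\0'));
        for (std::size_t i = 0; i < file.size(); ++i)
            shards[i / shard_len][i % shard_len] = file[i];
        for (std::size_t s = 0; s < k; ++s)           // parity = XOR of data shards
            for (std::size_t j = 0; j < shard_len; ++j)
                shards[k][j] ^= shards[s][j];
        return shards;
    }

    // Rebuild a single lost shard by XOR-ing the surviving shards of this group.
    std::string rebuild(const std::vector<std::string>& shards, std::size_t lost) {
        std::string out(shards[0].size(), '\0');
        for (std::size_t s = 0; s < shards.size(); ++s)
            if (s != lost)
                for (std::size_t j = 0; j < out.size(); ++j)
                    out[j] ^= shards[s][j];
        return out;
    }

    int main() {
        auto shards = make_shards("a mailbox worth of emails", 4);
        auto recovered = rebuild(shards, 2);            // pretend shard 2's server died
        std::cout << (recovered == shards[2]) << '\n';  // 1: only this group was read
    }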


I agree that it's a strange example to give. It would be like saying: imagine you had a single Reed-Solomon code for your entire hard drive. Recovering data that way would indeed be very painful, but we don't have a single Reed-Solomon code for hard drives. You'd pick a block size that is suitable for your application.


There is theoretical inefficiency and practical inefficiency. Something might be O(n^3), but if your n is small (as in Backblaze's case, where you do it on a file-by-file basis rather than for your whole filesystem), it is still useful.

In other cases, your optimal algorithm might have a large constant cost (setup cost, etc.), which for small n might make it practically inefficient. n^2 + c1 vs. n^3 + c2 where c2 >>> c1 happens a lot.


The article offered that example as an extreme, impractical, but easy-to-imagine case to show the utility of using codes over smaller data segments. I read this article as a discussion about data entropy, data encoding, and information theory.

Nowhere did they suggest that concatenating zillions of emails could be a real world system, or that such a system would be good or practical, or that any actual real system used this approach.

What you describe with Backblaze is using redundant storage to sidestep the problem, so it's apples and oranges.


Sidestep what problem? Backblaze is a practical application of Reed-Solomon coding. And the article text is "With a Reed-Solomon code, you’d need to perform a massive computation involving all the encoded data to recover your emails from that one lost server." How is it apples and oranges?

Reed-Solomon coding is redundant, that's the whole point.


This is theoretical work. It was just an example trying to illustrate the difference.


This would be true if you were to optimize for the very extreme case of running an error correction code over all of your data at once. This would give you the absolute best tradeoff between redundancy and data storage, but would be completely intractable to actually compute, which is the point they are making. In practice, error correction is used over smaller fragments of data, which is tractable but also doesn't give you as good a tradeoff (i.e. you need to spend more extra space to get the same level of redundancy). From what I understand, one of the appeals of the codes mentioned in the article is that it might be tractable to use them in the manner described, in which case you might only need, say, 3 extra servers out of thousands in order to tolerate losing any three, as opposed to 3 extra out of 20. But it seems like that is not likely.

(In practice, I would say existing error correction codes already get you very close to the theoretical limit of this tradeoff. The fact that these 'magical' codes don't work is not so much of a loss in comparison. While they would perhaps be better, they would not be drastically better.)


Does it mean that when Backblaze needs to retrieve my file, it has to issue 20 parallel network requests, wait for at least 17 of them to complete, then combine the responses into the requested file, and only then can it start streaming it to me? That seems kinda bad for latency.


Yes, you pay a cost in latency, but you get a phenomenal amount of durability at a much lower stretch factor.

If they make sure that no two shards occupy the same hard disk, they could lose up to three hard disks holding your data and still be able to recreate it. Even if they lose just one, they can immediately reproduce that now-missing shard from what they already have. So really you'd be talking about losing 4 hard disks, each with a shard on it, nearly simultaneously.

So that's roughly the same durability as you'd get storing 4 copies of the same file. Except in this case it's storing just about 1.18x the size of the original file (20:17 ratio). So for every megabyte you store, you need about 1.18 megabytes of space instead of 4 megabytes.

The single biggest cost for storage services is not hardware; it's the per-rack operational costs, by a long, long stretch. Erasure encoding is the current best way to keep that stretch factor low, and costs under control.

If you think about the different types of storage needs there are, and access speed desires, it's even practical to use much higher ratios. You could, for example, choose 40:34 and get similar resilience to having 8 copies of the file, while staying at roughly the same 1.18x stretch factor. You just have the drawback of needing to fetch 34 shards at access time. If you want to keep that 4x resilience, that could be 40:36, which nets you a nice 1.11x stretch factor. If you had just 1 petabyte of storage, that ~0.07 savings would be roughly 65 terabytes, a good chunk of a single server.
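
For reference, the stretch factors above are just total shards divided by data shards (a trivial back-of-the-envelope check):

    #include <cstdio>

    int main() {
        // stretch factor = total shards / data shards
        struct Cfg { int total, data; } cfgs[] = {{20, 17}, {40, 34}, {40, 36}};
        for (const Cfg& c : cfgs)
            std::printf("%d:%d -> %.3fx\n", c.total, c.data,
                        static_cast<double>(c.total) / c.data);
        // prints: 20:17 -> 1.176x, 40:34 -> 1.176x, 40:36 -> 1.111x
    }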


No, you are confusing file retrieval with file recovery. The reconstruction only needs to happen if some form of corruption is detected (typically in the case of a bad/dead disk).


I don't know exactly how Backblaze does it, but in the normal case, reconstruction is not computationally expensive because the 17 data shards are just pieces of the original file that can be served directly to users.

It's only when a data shard is lost that computation is necessary to regenerate it using the parity shards.


This is actually better for latency, perhaps counterintuitively. Let's say that each server experiences some high-latency requests. Normally, if one server stored that file, you'd get high latency; this scheme, on the other hand, cuts down on overall latency.


The requests are parallel and therefore complete in the same(ish) amount of time as a single request, so the latency isn't increased.


...not only am I being downvoted for asking a simple factual question ("is this how this works, and do I understand the consequences correctly?"), I am also getting obviously incorrect answers.

So, let's consider two scenarios:

1) You make a file retrieval request for a 10 GiB file, the front server forwards the request to a single storage server, the storage server spends 100 ms to locate the file and starts streaming it to the front server, and it takes 10 minutes to transfer it completely; meanwhile, the front server relays the data to you. So you see a 100 ms delay before the file starts streaming, which takes another 10 minutes to complete.

2) You make a file retrieval request for a 10 GiB file, the front server chunks the request and sends the chunks to 10 storage servers, each storage server spends 100 ms to locate its chunk of the file, then they start streaming to the front server, and it takes 1 minute to transfer each chunk completely; meanwhile, the front server waits for all chunks to arrive, which takes 1 minute 100 ms, then sends their concatenation to you, which takes 10 minutes. So you see a 1 minute 100 ms delay before the file starts streaming, which then takes another 10 minutes to complete.

Obviously, the latency in the second scenario is worse. Or am I missing some important thing which is obvious to everyone else in the sibling comments?


> meanwhile, the front server waits for all chunks to arrive, which takes 1 minute 100 ms, then sends their concatenation to you, which takes 10 minutes

It doesn't have to wait for all chunks to arrive, but can start streaming the moment it has the first byte (for some protocols, it may even send you chunks out of order, and start streaming the moment it has _any_ byte).

Also, if throughput from the server to you is higher than that of a single disk, the second case can get the file to you faster. As an extreme example, if that throughput is over ten times that of a single disk, sending the complete file to you can take close to 1 minute.

Having said that, if it has to combine X out of Y > X streams to produce the actual data, getting that will be slower than retrieving a single stream because retrieval times will have some variability.


The front server doesn't need to wait for the entire first chunk to arrive (as in scenario 2), any more than it needs to wait for the entire file to arrive before starting (as in scenario 1). Unless a chunk needs repair - then of course it needs access to lots of the chunks to rebuild the missing chunk.


Do people understand code by line-by-line reverse-engineering what the code is doing, or do they understand it by relating it to what they've written before?

If the latter is the case, then you get better at reading code by writing code. Writing lots of code puts those code-patterns into your long term memory, and then when you see those code-patterns again you'll recognize them.

For system design too - if you've designed lots of systems yourself, then when you see a new system, you'll be able to relate it to the systems that you've designed before.

So maybe building greenfield projects also makes you better at maintaining existing projects?

It'd be great if someone could point me to some existing literature on this topic. I've looked around and can't find any.


> So maybe building greenfield projects also makes you better at maintaining existing projects?

I think it does, because it builds a higher-level sense of how something "could" or "should" be and familiarity with thinking at the system level.

I've had a lot of problems with people who (seemingly) only have experience maintaining projects. They seem to have a tendency to focus narrowly on "fixing" a bug close to its source, often lack understanding of a lot of fundamental technologies (because they only attend to things that have broken), and get stuck in a kind of streetlight fallacy where they fix things in more familiar areas. The end result is ugly kludges and whack-a-mole bugs, where a superficial "fix" in one area causes a new bug to pop up in another.


>> So maybe building greenfield projects also makes you better at maintaining existing projects?

Only to an extent, I think.

> I've had a lot of problems with people who (seemingly) only have experience maintaining projects.

And similarly I've had problems with people who (seemingly) only have experience with starting a new project and then simultaneously over-engineering and piling on tech debt.

I think "i've never bootstrapped a project before" is easier to cure than "I don't have a good sense of what's expensive vs what's cheap tech debt."

Some tech debt is basically 0% interest. Was this small hack bad? Yeah. Will it need to be fixed, eventually? Yes, for sure. But does it compound every day, or does it just sit there being a little yucky? Very easy to determine in hindsight. Very hard to predict as you write the hack. The end result is the simultaneity in my example: people will over-engineer things that won't end up being problems and under-engineer things that will turn out to require compounding hacks, where it would've been cheaper to just get it right the first time.


This. I've worked in both greenfield and brownfield areas. The chief failure mode of greenfield is that it works, but it is unscalable and shoddily done. The chief failure mode of brownfield is some kind of feature cardinality explosion that becomes impossible to maintain after some critical juncture. Reading between the lines, they're sort of the same failure mode happening at different times, but both driven by results-driven programming rather than architecture.


>> I've had a lot of problems with people who (seemingly) only have experience maintaining projects.

> And similarly I've had problems with people who (seemingly) only have experience with starting a new project and then simultaneously over-engineering and piling on tech debt.

The best thing is to build a greenfield project and then maintain it for several years, fix your mistakes, then do the same thing on a new project again (including the maintenance).

You really need the full spectrum of experience, and put yourself in the shoes of the future maintainer when you're building something new.


There are lots of greenfield-only programmers who build something (by gluing stuff together), then proceed to ship their barely working software to get that promotion/bonus, and run away to the next project.

They never have to maintain anything; some poor soul has to fix the broken software.

They also don't even know what mistakes they made, since they aren't there to fix them.


I think there's something to be said for taking a job on a late-stage project early in your career, participating in root cause analysis of all of the problems, and then picturing yourself in the room when those decisions were made and asking if you would have made the same decisions or better ones. Then work next on a greenfield project to test your theories out and adjust/grow.

I never felt like I had 1 year of experience 5 times because I moved between several kinds of projects at different lifecycle phases and with different constraints, and drew lots of parallels between them. At the end of five years I had project histories in my brain that spanned more than ten years. And got jobs that should have gone to someone with 8 years’ experience. I do not think this was a coincidence.


I haven't had to do it a lot, but in previous cases where I came into a codebase blind and had to fix bugs or figure out what's going on, I've always started by mapping the flow first and then digging into the details of what's going on. When looking at the flow, I go breadth-first through everything and then dig in from there.

I maintain two major maps: what the code is logically trying to do, and what exactly it's doing (i.e. "sanitize inputs, check error conditions, do work, return result" as the logical model, and then "ensure foo is < 0 and > -10" etc. as the specific checks).


>If the latter is the case, then you get better at reading code by writing code. Writing lots of code puts those code-patterns into your long term memory, and then when you see those code-patterns again you'll recognize them.

Let's hope what you write the first time is a good practice.


> The subject seems to be completely non linear

Is this not the nature of learning in general? Why is it supposed that learning things in the linear A-B-C-D fashion is even possible for most humans, rather than supposing that most people would need to revisit certain topics before learning new topics, e.g. A-B-A-D-C instead of A-B-C-D?


By linear I did not mean the order in which you study, but rather how your understanding builds up: how much time you put in versus how much you grow.

For example, if your native language is subject-verb-object (cows eat grass), it is quite linear to learn a subject-object-verb language (cows grass eat), e.g. Japanese: you put in time, and you make progress. There are other subjects where, when you get stuck, you can't move on, and the pedagogy and andragogy systems we have come up with for math, physics, etc. are getting better and better at understanding what people don't understand and how to move them further. Which on its own is quite problematic when you have a class of 30 kids and you are moving at the pace of the 'average' kid, who does not exist.

At the same time, programming is fairly new, and teaching it is still evolving; educators still disagree on what is important and in which order (it reminds me a bit of this Feynman video about 'Greek' versus 'Babylonian' mathematics: https://www.youtube.com/watch?v=YaUlqXRPMmY)

There was a great example someone used: the number of people who get confused by the equals sign, and how some don't come to understand references and values for many years:

    x=5 
    y=x
    x=6 
    print(y)
and

    a=[]
    b=a
    a.append(1)
    print(b)
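
The list example is really the same trap as pointer aliasing; a minimal illustrative sketch (my own example, just to draw the parallel, compiles as C or C++):

    #include <stdio.h>

    int main(void) {
        int x = 5;
        int y = x;       /* y copies the value: later writes to x don't touch it */
        int *p = &x;     /* p aliases x: writes through p are writes to x */
        x = 6;
        *p = 7;
        printf("%d %d %d\n", y, x, *p);  /* prints: 5 7 7 */
    }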

Now I am teaching my daughter, and I spent about 3 days per week just on pointers and strings (we even made a card game we play from time to time: https://punkx.org/c-pointer-game/). I can see when she is stuck and exactly what she is stuck on, but how can you do that with 30 kids, when the most subtle nuance in the questions they ask can give you the deepest hint about what they are missing?

