
How interesting. I would have thought such a common database operation (querying by ranges) would be a better-solved problem by now.

Also, how come the author of the blog post writes O(k) instead of O(1) for constant time? Is it because 1 is as arbitrary a constant as any or is there some difference that I am not aware of?

Link to the original paper [1]

[1] http://www.vldb.org/pvldb/vol6/p1714-kossmann.pdf


(I'm the author of the post)

You're right: in terms of big-O it's O(1), that is, O(k) and O(1) mean the same thing. I said O(k) because Bloom filters normally require multiple operations (one per hash function), but a constant number of them.
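To make the k concrete, here's a minimal Bloom filter sketch in Python. The class name, bit-array size, and hashing scheme are all illustrative choices, not taken from the paper; the point is just that every add and lookup does exactly k hash probes, which is where the O(k) comes from.

```python
import hashlib

class BloomFilter:
    """Minimal sketch: every operation does k hash probes --
    O(k) work, which is constant since k is fixed up front."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [False] * m_bits

    def _positions(self, item):
        # Derive k bit positions; k independent-ish hashes via salting.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("range-query")
print(bf.might_contain("range-query"))   # True
print(bf.might_contain("not-added"))
```

With m = 1024 bits and only one item inserted, the odds of a false positive on the second lookup are negligible, but not zero; that trade-off is the whole point of the structure.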


I must respectfully disagree that Haskell's memory footprint is simply 'low'. This is because the memory footprint of a given Haskell program is not at all transparent, and Haskell is notorious for leaking memory in a maddeningly opaque fashion [1, 2, 3, 4]. Space leaks might be relatively straightforward to diagnose and fix for a true domain expert, but I would not want to have to rely on someone having such abstruse knowledge in a production application. It goes without saying that a space leak in a production app is a really, really bad thing.

Although, I suppose one's choice of Haskell is a function of one's own risk/reward profile. Haskell and its failure modes are hard to understand. That induces extra risk that some people (myself included) might be uncomfortable with. That said, I am now enthusiastically following you guys and hope to see the proverbial averages get sorely beaten.

[1] http://neilmitchell.blogspot.com/2013/02/chasing-space-leak-...

[2] http://blog.ezyang.com/2011/05/calling-all-space-leaks/

[3] http://blog.ezyang.com/2011/05/space-leak-zoo/

[4] http://blog.ezyang.com/2011/05/anatomy-of-a-thunk-leak/


Since the original paper is behind a paywall (at least for me), can anyone explain the specifics of what the researchers did to produce this new alloy?

>> Dr Kim and his colleagues have, however, found that a fifth ingredient, nickel, overcomes this problem.

I'd imagine that it didn't take a world-class team of scientists to come up with the idea of alloying with nickel. There is no way materials scientists and metallurgists hadn't tried this by now, so what did they do differently?


Materials Science PhD student here, I'll do my best.

The morphology of the brittle B2-FeAl intermetallic compound is the key. In the conventional lightweight steel alloys, the B2 intermetallics make the alloy brittle (so they don't work harden very well), so in the past researchers optimized their alloys to avoid forming these intermetallics [0]. The nickel promotes the nucleation of the intermetallic particles during heat treatment [1], so that you get a more-or-less uniform distribution of many nanocrystalline B2 particles, instead of a smaller number of larger or more clustered B2 domains. The small B2 particles contribute to strain hardening by pinning dislocation motion, without reducing the ductility of the alloy.

From the Nature letter:

[0]: "One of the general concepts employed until now in the alloy design of Fe-Al-Mn-C-based, high-aluminium, low-density steel has been the suppression of ‘brittle’ intermetallic compound formation by stabilizing the ‘ductile’ austenite matrix."

[1]: "To expand the stability domain of B2 above the recrystallization temperature (normally, 800–900 °C) of deformed austenite, the alloying recipe of an austenitic low-density steel was modified by adding 5 weight per cent nickel (Ni), which is one of the most effective elements for forming B2 with aluminium."


Since this is mostly undecipherable for me, I'll just quote:

«A common method of uniformly distributing fine particles in a matrix is to make the best use of highly potent nucleation sites for inducing the precipitation of the particles. In this study, potential nucleation sites for B2 during annealing of wrought sheet steel include (1) grain boundaries or edges of recrystallized austenite crystals and (2) deformation shear bands, which are common in hot- or cold-worked low-density steel. To expand the stability domain of B2 above the recrystallization temperature (normally, 800–900 °C) of deformed austenite, the alloying recipe of an austenitic low-density steel was modified by adding 5 weight per cent nickel (Ni), which is one of the most effective elements for forming B2 with aluminium. The addition of Ni to low-density steel may appear to conflict with the collective wisdom of ferrous alloy design; Ni has been regarded merely as a well-known austenite stabilizer like Mn and C; and Ni has been little noticed in low-density steel design, mainly because it is not a critical determinant of the density in ferrous alloys.» (citations omitted)


For interest's sake, note that the blur kernel used here is an approximation of the Gaussian [1]. Also, the vImage documentation includes a brief discussion of where the values in these kernels came from [2].

[1] http://en.wikipedia.org/wiki/Gaussian_blur

[2] https://developer.apple.com/library/ios/documentation/Perfor...
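For anyone curious where such kernel values typically come from: one standard construction (illustrative here, not necessarily the one vImage uses) takes a row of Pascal's triangle and normalises it, since binomial coefficients converge to Gaussian weights.

```python
from math import comb

def binomial_kernel_1d(n=4):
    """Row n of Pascal's triangle, normalised to sum to 1.
    As n grows these weights approach a sampled Gaussian."""
    row = [comb(n, k) for k in range(n + 1)]
    total = sum(row)
    return [c / total for c in row]

def separable_kernel_2d(k1d):
    # Outer product: a Gaussian-like 2-D blur kernel is separable,
    # so a 2-D convolution can be done as two cheap 1-D passes.
    return [[a * b for b in k1d] for a in k1d]

k = binomial_kernel_1d(4)
print(k)                             # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(separable_kernel_2d(k)[2][2])  # 0.140625 (centre weight)
```

The 1-D weights [1, 4, 6, 4, 1] / 16 are exactly the kind of integer-friendly Gaussian approximation that shows up in image-processing libraries.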


Hey Patrick! I'm a huge fan of your work, and really enjoy reading your blog. I have one question though.

AR is HIPAA compliant, which implies that there is (medically) sensitive information hitting your servers. Why is it not an issue for you and your support agents to actually see that data yourselves (as you would when manually fixing CSV errors)? If your seeing this data doesn't violate the letter of HIPAA, surely the ethical impetus behind the act would prevent you from doing so?


I have spoken all the eldritch rituals which legally permit a doctor to share patient information with me personally as long as they have a contract with my name signed in blood on it.

Just kidding. It isn't actually that bad. Appointment Reminder is a "Business Associate" of Happy Teeth Dental. I'm its HIPAA compliance officer, attend a yearly training session, have been threatened with the most severe of sanctions if I misuse patient data, see only the data required for my job, and have my name and access rights recorded in a spreadsheet ready to be audited (along with my access logs). That's probably half of the list. Clearly HIPAA can't completely ban non-doctors from seeing medical data or the entire medical sector grinds to a halt, right?

With regards to support agents, some people at the company are approved for access and some are not. The system enforces access rights, naturally.


HIPAA does not forbid e.g. Patrick from viewing or working with medically sensitive data. If it did, it would effectively prevent any medical software or services from operating at all.

HIPAA does however have an awful lot to say about what can and cannot be done with this data, how it must be handled, who it can and cannot be divulged to, and so on. For example, when and where it must be encrypted, how its use must be audited, etc.

It is in some ways like PCI compliance. All parties handling sensitive medical/financial data on your behalf have to follow certain secure practices, or risk facing steep fines and legal action.


This question seems a bit odd if Patrick viewing the file doesn't actually violate HIPAA (and IANAL, but I believe it doesn't necessarily). What is "the ethical impetus behind the act" you're referring to here?


As someone who grew up writing code, and is now studying mathematics at a tertiary institution, I was quite surprised to read that parallels between mathematics and software engineering are 'surprising'.

On the contrary, mathematics has formed the basis (no pun intended) of so much software engineering. Take for example the very concept of a function/subroutine/method: this comes straight from the world of mathematics (albeit with minor modifications to make it convenient).

The algorithms that do all the heavy lifting in order to facilitate this web browsing experience are all grounded in mathematics - memory management in your kernel, database {everything}, even HTML layout management! The whole of complexity and asymptotic analysis is actually just mathematics.

Many of the pioneers in computer science originated as mathematicians. Alan Turing, John McCarthy and Donald Knuth for example.

It may not seem like it on a daily basis writing CRUD apps in an OO language, but software engineering is inextricably linked to mathematics. Such results are the furthest from surprising!


You are confusing software engineering with computer science. Most HN users certainly know that the fundamental theories of computation are rooted in rigorous mathematics. The point of the article is to share the insight that mathematical notation is itself a constructed system, much like complicated software implementations: one that, via design decisions, provides its users with powerful abstractions.

Did you even read the article before lecturing us about Turing and Knuth??


If that's true then why do mathematicians seem allergic to improving their own language, while software developers seem to constantly invent new ones?


The UCI machine learning repository [1] contains a wealth of data sets intended to be used for machine learning. Many of the data sets have had analyses performed on them that could be considered canonical. For example, the Abalone dataset's [2] associated problem is the prediction of a specimen's age, given its measurements. The problem has been analysed thoroughly; a cursory Google search for "Abalone data set" reveals that plenty of people have considered the problem.

Also, Amazon (via AWS) [3] has made it really easy to access public data sets.

I hope this is helpful.

[1] http://archive.ics.uci.edu/ml/

[2] http://archive.ics.uci.edu/ml/datasets/Abalone

[3] http://aws.amazon.com/public-data-sets/
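To give a feel for the Abalone task: each record is a sex code plus seven physical measurements and a ring count, and the dataset's documentation gives age as roughly rings + 1.5 years. Here's a small loader sketch; the field names are mine and the inline sample rows are only illustrative of the format.

```python
import csv
import io

# Two rows in the Abalone CSV layout (sex, 7 measurements, rings);
# the values are illustrative examples, not an excerpt to rely on.
sample = """M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
"""

FIELDS = ["sex", "length", "diameter", "height", "whole_weight",
          "shucked_weight", "viscera_weight", "shell_weight", "rings"]

def load_abalone(text):
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        row = dict(zip(FIELDS, rec))
        for f in FIELDS[1:-1]:          # the 7 numeric measurements
            row[f] = float(row[f])
        row["rings"] = int(row["rings"])
        row["age"] = row["rings"] + 1.5  # age rule from the dataset docs
        rows.append(row)
    return rows

for r in load_abalone(sample):
    print(r["sex"], r["age"])
```

The canonical problem is then regression (or classification) from the eight input fields to rings/age.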


This book is interesting because it forgoes the traditional approach of most mathematical statistics books. The preface states that it is done like this in order to avoid the "cookbook" approach taken by many statistics students. This is why it is ironic that "Bayes' Recipe" appears 15 times in this text, and on page 131 there is a five-step algorithm for parameter estimation, and my favourite, oft-repeated, never-explained recipe: "n > 30, you'll be fine". There is no mention of the CLT, MLE, method of moments estimation, biasedness of estimators, convergence in probability, how sampling distributions arise, or any of the theory of distributions that underpin all of the inferential procedures detailed in the book. I think that excluding these topics actually increases the cookbooky-ness of the text.

It is important that students understand the provenance of the inferential techniques they use, so that they don't end up doing bogus science (which hurts the world) through not knowing the failure modes of those techniques. Of course, not all students of statistics know the requisite mathematics to understand it all; at the very least, put the failure modes into cookbook form too.

For the sake of science please don't ever do any inferential statistics without knowing when the method you're using works and when it breaks, what it is robust to, and what assumptions it makes. Statistics is really easy to break when used naively. The mathematics of statistics is not easy, and often results are highly counter-intuitive.


"There is no mention of the CLT, MLE, method of moments estimation, biasedness of estimators, convergence in probability, how sampling distributions arise, or any of the theory of distributions that underpin all of the inferential procedures detailed in the book."

Lots of good criticisms in this thread, which I'll have to look at. This one, however, is not. :) How many intro stats books, of the traditional kind, mention MLE, method of moments, biased vs unbiased estimators, etc.? None that I've seen. So, you're right, it becomes more "cookbooky" as a result; however, I would argue that all Bayesian analysis follows the same recipe, whereas frequentist analysis typically follows many recipes, not obviously connected. It is that part that I criticize, not the fact that there is a recipe for doing things.


>> How many intro stats books, of the traditional kind, mention MLE, method of moments, biased vs unbiased estimators, etc.? None that I've seen

Oh - there are quite a few. Here's a small sample (no pun intended):

- Probability and Statistical Inference by Hogg & Tanis (we used this in my stats course)

- Modern Mathematical Statistics with Applications by Devore & Berk

- Probability and Statistics by DeGroot & Schervish


Ah, yes. I concede the point. What I find interesting in all this is that the term "Introduction" is used in so many ways. When looking, for instance, for an intro Bayes book you get things like Lee and Bolstad, which for some is intro. However, if you tried to teach med students or business students from that it would be a disaster.

Personally, I see MLE as just an approximation of MAP, which is superior. Biased vs unbiased also doesn't play into probability theory as logic, except as a consequence of those parameters that maximize the posterior.


n greater than 30:

A quantity that follows normal distribution has two things to estimate, the variability of the quantity (standard deviation), and the mean. Both of these are estimated with uncertainty from a series of observations of the quantity (the data). The t-distribution allows us to make predictions, taking into account both sources of uncertainty for a normally distributed thing.

However, as the number of observations increases towards thirty, the estimate of the standard deviation gets really, really good, so you can happily ignore the uncertainty for that. Then you just need the normal distribution.
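A quick stdlib simulation of that point (the sample sizes and repetition count here are arbitrary choices, just for illustration): the sample standard deviation is a much noisier estimate at n = 5 than at n = 30, which is why the t-distribution's extra tail weight matters less and less.

```python
import random
import statistics

random.seed(42)

def sd_estimate_spread(n, reps=2000):
    """How variable is the sample standard deviation for samples
    of size n drawn from Normal(0, 1), whose true sd is 1?"""
    ests = [statistics.stdev([random.gauss(0, 1) for _ in range(n)])
            for _ in range(reps)]
    return statistics.stdev(ests)

spread_small = sd_estimate_spread(5)
spread_large = sd_estimate_spread(30)
print(f"n=5:  spread of sd estimates = {spread_small:.3f}")
print(f"n=30: spread of sd estimates = {spread_large:.3f}")
```

At n = 30 the estimate of the standard deviation is tight enough that the normal distribution is a serviceable stand-in for the t, which is the usual (informal) justification for the rule of thumb.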


That is hugely misleading. It's only reasonable if the data are actually independent draws from a normal distribution. IRL, they're not.
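To see how the rule breaks when the independence assumption fails, here's a small stdlib simulation (AR(1) noise with coefficient 0.9 is just one illustrative choice of dependent data): textbook 95% confidence intervals hold their coverage on iid normal draws, but lose most of it on autocorrelated draws of the same size.

```python
import math
import random

random.seed(1)

def ci_covers_zero(xs):
    # Textbook interval: mean +/- 1.96 * s / sqrt(n); true mean is 0.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    half = 1.96 * math.sqrt(var / n)
    return -half <= mean <= half

def coverage(make_sample, reps=2000):
    return sum(ci_covers_zero(make_sample()) for _ in range(reps)) / reps

def iid_normal(n=30):
    return [random.gauss(0, 1) for _ in range(n)]

def ar1(n=30, rho=0.9):
    # Autocorrelated data with mean 0: each point drags its neighbour along,
    # so 30 observations carry far fewer than 30 points' worth of information.
    xs, x = [], 0.0
    for _ in range(n):
        x = rho * x + random.gauss(0, 1)
        xs.append(x)
    return xs

cov_iid = coverage(iid_normal)
cov_ar1 = coverage(ar1)
print(f"iid normal coverage: {cov_iid:.2f}")   # close to 0.95
print(f"AR(1) coverage:      {cov_ar1:.2f}")   # far below 0.95
```

Same n, same formula, wildly different reliability: "n > 30" says nothing if the draws aren't actually independent.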


What books do you recommend?


Eli Bendersky's blog. It gets reposted on HN quite often, and with very, very good reason: it's an absolute treasure trove of technicalia. I've spent many hours deep in his articles, on all sorts of cool technical topics: parsers, debuggers, cool language features, abstract math...

And I'm sure I'm not the only frequenter of HN that loves this stuff.


I suppose the overarching principle here is communication between programmers. If I were the programmer building some system that depends on an API with opaque behaviour, I'd get really frustrated: "Why does the connect method just not work sometimes and block??!!?"

It's just considerate to other human beings to let them know (using a suitable means) that an API call has failed (for whatever reason) and quickly. Opaque loop-and-retry-until-we-succeed makes problem diagnosis stupidly difficult. Anything that makes the audience programmer's feedback cycle slower and impedes problem diagnosis is both counter-productive and irritating. Simply communicating "I CAN FAIL AND HERE IS WHY..." is a Really Good Thing.

In my experience (read this with an 'anecdote' filter turned on) teams that communicate everything to the point of superfluity generally work better. This extends to your code, particularly APIs.
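As a sketch of the fail-fast alternative (the names and API here are hypothetical, not from any particular library): either return quickly, or raise a descriptive error immediately, and leave the retry policy to the caller.

```python
import socket

class ConnectError(Exception):
    """Raised immediately, with the reason, instead of silently retrying."""

def connect(host, port, timeout=2.0):
    # "I CAN FAIL AND HERE IS WHY": no hidden loop-and-retry, no
    # indefinite blocking -- the caller decides what failure means.
    try:
        return socket.create_connection((host, port), timeout=timeout)
    except socket.timeout as e:
        raise ConnectError(
            f"timed out after {timeout}s connecting to {host}:{port}") from e
    except OSError as e:
        raise ConnectError(f"cannot connect to {host}:{port}: {e}") from e

# 203.0.113.1 is a reserved TEST-NET address, so this fails quickly
# with a message that actually tells you what went wrong.
try:
    connect("203.0.113.1", 80, timeout=0.05)
except ConnectError as e:
    print("failed fast:", e)
```

The caller who wants loop-and-retry can still wrap this in a retry loop, but now the failure and its cause are visible at every attempt instead of being swallowed.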


