datarecipes's comments

datarecipes · on Nov 6, 2023

Just had a look (https://github.com/duckdb/duckdb/issues/9399). Yeah it's worrying that such a trivial query returned incorrect results - but credit to the Devs for getting it fixed quickly.

To my knowledge the only databases that can be described as "military-grade" in terms of testing are SQLite and Postgres.

nyanpasu64 · on Nov 6, 2023

Apparently DuckDB requires your real-life name to file an online bug report, bucking every norm of online handles for communication, as well as enabling doxxers and stalkers to find and trace people in real life.

silverlyra · on Nov 6, 2023

I looked at the bug report form and for context it links to a post “Dear anonymous internet user asking for help” https://berthub.eu/articles/posts/anonymous-help/

It seems to be about wanting to know who you're talking to when providing free support for an open-source project, and whether the person submitting an issue is using the project for personal use or within an organization.

> If I don’t know who you are, am I enabling you to build the new Turkish censorship infrastructure, or helping you implement [Russian internet blocking] more efficiently? These are two examples that actually happened by the way.

sega_sai · on Nov 6, 2023

Yes, that was a surprising requirement when submitting a bug report. (I understand patches may need to be like that due to copyright issues)

Cthulhu_ · on Nov 6, 2023

It's the same with a lot of open source contributions; those require a real name for legal and copyright reasons.

If you're afraid of doxxing and / or stalking though, at least you have the choice to not contribute. You can still post somewhere else and ask someone else to make the report for you if need be.

monsieurbanana · on Nov 6, 2023

I'm not aware of any other open source project that required a real name just to file a bug

Cthulhu_ · on Nov 6, 2023

What about Oracle and MSSQL? I'd imagine that especially the former would be "military-grade" (whatever that entails)

burgos_thrw · on Nov 6, 2023

Even Postgres has its share of bugs, e.g. a simple search shows https://www.postgresql.org/message-id/CAGckUK2GLF%3Dd9J5ErEW...

Oracle might be military grade because they have an entire web page on how to report wrong-results bugs: https://support.oracle.com/knowledge/Oracle%20Cloud/150895_1... https://support.oracle.com/knowledge/Oracle%20Database%20Pro... etc.

Query engines are (not) surprisingly complex software products. Add to that the constant (and aggressive, due to the competition in the field) evolution and addition of new features that can interact with every existing feature in any existing context, and you have a perfect environment for bugs to appear.

code_biologist · on Nov 6, 2023

I'm not sure if the fix is reassuring or not: https://github.com/duckdb/duckdb/pull/9411/files

sega_sai · on Nov 6, 2023

I certainly liked that they added the problematic query to the list of tests, which I think is a healthy sign.

datarecipes · on May 10, 2023

Artificial Institutions

Galacta7 · on May 10, 2023

I think they prefer the term Synthetic Persons.

datarecipes · on June 7, 2021

Both the Brier score and log loss are proper scoring rules (i.e. optimized when the predicted probabilities are the true outcome probabilities), and the choice between the two seems to have minimal impact on the conclusions that can be drawn (https://pubsonline.informs.org/doi/abs/10.1287/deca.2013.028...). I covered the Brier score in the post as I thought it would be easier to digest for a general audience.

As Frank Harrell wrote on his blog (https://www.fharrell.com/post/class-damage/), one advantage of the Brier score could be its interpretability and the ability to decompose it into discrimination and calibration components.
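For anyone who wants to see the "proper scoring rule" property concretely, here is a minimal sketch (not taken from the post or the linked paper). It assumes a single binary event with a made-up true probability of 0.3, and the helper names are purely illustrative. It searches a grid of forecasts for the one with the lowest expected score under each rule.

```python
import math


def expected_brier(p, q):
    """Expected Brier score of forecast q when the event occurs with probability p."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2


def expected_log_loss(p, q):
    """Expected log loss of forecast q when the event occurs with probability p."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))


def best_forecast(score, p, steps=999):
    """Return the forecast on a grid over (0, 1) with the lowest expected score."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]  # 0.001, 0.002, ..., 0.999
    return min(grid, key=lambda q: score(p, q))


true_p = 0.3  # hypothetical true outcome probability, chosen for illustration
brier_best = best_forecast(expected_brier, true_p)
log_best = best_forecast(expected_log_loss, true_p)
print(brier_best, log_best)
```

Under both rules the best forecast on the grid coincides with the true probability, which is what "proper" means here. Away from that optimum the two rules penalise errors differently: log loss grows without bound as a confident forecast turns out wrong, while the Brier score is capped at 1.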

srean · on June 7, 2021

Indeed. Note though that proper scoring rules form a large class and it can matter which one you choose.

For example, for logistic regression, things become a lot simpler if one chooses log loss (equivalently KL divergence) because one ends up with a convex minimization problem. Had one chosen the Brier score here, the problem would no longer be convex, and where one starts the training iteration would determine where the updates converge to. Sometimes this indeterminacy is a problem -- am I getting poor results because the data has changed, or because my initial seed has changed and the updates have converged to a worse solution?
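A rough way to check the convexity claim numerically is sketched below, assuming the simplest possible setting: one training example with label 1 and a single logit w (so the prediction is sigmoid(w)). The function names and the two test points are illustrative choices, not part of any library. A convex function must satisfy f((a+b)/2) <= (f(a)+f(b))/2 for every pair of points, so one violation is enough to rule convexity out.

```python
import math


def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))


def brier_loss(w, y=1.0):
    """Squared error between the predicted probability sigmoid(w) and the label y."""
    return (sigmoid(w) - y) ** 2


def log_loss(w, y=1.0):
    """Negative log-likelihood of label y under the predicted probability sigmoid(w)."""
    s = sigmoid(w)
    return -(y * math.log(s) + (1 - y) * math.log(1 - s))


def violates_midpoint_convexity(f, a, b):
    """True if f at the midpoint of a and b lies above the chord, which a convex f never does."""
    return f((a + b) / 2) > (f(a) + f(b)) / 2


a, b = -6.0, 0.0  # two logits picked for illustration; the first is a badly wrong prediction
brier_violates = violates_midpoint_convexity(brier_loss, a, b)
log_violates = violates_midpoint_convexity(log_loss, a, b)
print(brier_violates, log_violates)
```

The Brier loss flattens out when the prediction is confidently wrong (it approaches 1 as w goes to minus infinity), which is where the midpoint test fails. Log loss keeps growing roughly linearly there, so it passes for this pair. Passing at one pair does not prove convexity; that follows from its second derivative in w, sigmoid(w)(1 - sigmoid(w)), being positive everywhere.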

datarecipes · on June 7, 2021

This looks quite cool - would you be able to drop me a line at datarecipes@pm.me?

datarecipes · on June 7, 2021

Agreed. It would be great to hear your views on some of the key gaps in modern data science curricula that could be covered in the blog - would you be able to drop me a line at datarecipes@pm.me? Thanks!

datarecipes · on June 7, 2021

Thanks for your comment.

I agree that the post lacks depth, but it was intended to be a gentle article accessible to a general audience, so they can start applying it in practice in their day to day lives. I would, however, really love to hear your views on what might be a more rigorous treatment of similar topics that can be introduced in an accessible way - would you be able to drop me a line at datarecipes@pm.me? Thanks!