Hacker News | shadowmint's comments

If the business wanted to track the rate of failures and create predictive models about when things fail, or detect anomalous behaviour, that's what they would have set out with as the goal, and, perhaps, some ML model might have helped, but probably, it would've been too unreliable and any number of standard predictive models with well known characteristics would have been used instead.

That's not what they wanted.

What people are being sold is AI/ML as a magic bullet that will do something useful regardless of the situation, and it lets business people avoid making decisions about what they actually want, because AI/ML can be anything, so they just sign up for it and expect to get 20 things they didn't know they wanted handed to them on a plate.

Turns out, it's not enough to just collect a bunch of data and wave your magic wand at it. It wasn't with web analytics 10 years ago, and it still isn't.

What you actually need is someone who has a bunch of tricks up their sleeve, has done this before, and can suggest a bunch of Business Insights the business might need before they start building anything; people who actually decide what to do; and actions taken to investigate and solve those problems.

I mean, to some degree you're right; perhaps ML models could be useful for tracking hardware failures, but that's not what the parent post is talking about. The previous post was talking about just collecting the data and expecting the predictive failure models to just jump out magically.

That doesn't happen; it needs a person to have the insight that the data could be used for such a thing, and that needs to happen before you go and randomly collect all the wrong frigging metrics.

...but hiring experts is expensive, and making decisions is hard. So ML/AI is sold like snake-oil to managers who want to avoid both of those things. :)

Projects rarely end up doing what was planned when they started. As long as ML is solving real problems in practice, upper management will keep treating it as magic fairy dust to sprinkle around aimlessly.

It's all about how you package things. ML connected to an audio sensor could predict failure modes that are difficult to detect otherwise. Now that might not be what was asked for, but a win is a win.

Why does having a different opinion to you mean someone has no idea what they’re talking about, or has never used a thing “seriously”?

I used to love angular, then I got a job which was a “| async” dumpster fire and spent a year watching a team of smart c# developers wallow in a mire of disaster so bad it became a two week regression to change a text field on a form. So full of amazing functional statements that no one, even the original authors, could touch it without breaking something in the process.


Your mileage may vary. I no longer particularly like angular, personally, because I find it a chore to herd inexperienced FactoryInjectorConstructorFactoryPattern angular developers into not screwing things up.

...but a talented team can do well with it too, and I’ve seen people screw up react projects too.

It really is more about good practice and experience than framework; your personal preference is probably, like mine, basically irrelevant.

Lovely theory, but in practice it works more like this:

1) write everything in python because it's easy and quick to do so.

2) it's slow as.

3) abandon software and write it in something else, or, live on with slow ass software and blame python for being slow and rubbish forever more.

re-writing python in c is a hideously painful process, and it's proven to be very unsuccessful practically.

Writing new code in c/c++/whatever and exposing a python api is where successful projects like numpy and tensorflow live.

python is very good at what it is, but no one is ever going to go and rewrite your python code in c to make it faster; it's just going to be slow forever.

No, in practice it works something like this:

1) write everything in python

2) yeah, the performance here is good enough so ship it

3) there is no 3

There are very few situations where performance is going to be an issue for you where there is not an existing C module solution that will solve the problem for you. The tired old 'python is slow' trope is getting more and more irrelevant every day. There are other aspects of the language that may make it a mediocre solution to the problem at hand, but out in the real world most people are simply getting the job done with python.

I spent 4 years as a professional python developer.

We certainly shipped (using django) and it was certainly slow, and remains a painfully slow, very successful enterprise app.

I’m not arguing that the slowness is deal breaking, but it is slow, and it does, routinely, break the SLAs it’s supposed to meet.

So... unusably slow? no.

...but slow? yes, it really is.

imo. your mileage may vary. /shrug

Unless you are careful, the Django ORM will generate a lot more database accesses than needed. I'd almost bet that most of the time the user spends waiting for the app, the app is waiting for the database.
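To make the extra-database-access point concrete, here is a minimal sketch that uses the standard-library `sqlite3` module rather than Django itself. The `author` and `book` tables and the `QueryCounter` helper are hypothetical, made up for illustration. It contrasts a loop that issues one query per row (the kind of access pattern an ORM can produce when related objects are loaded lazily) with a single JOIN that returns the same data.

```python
import sqlite3


class QueryCounter:
    """Hypothetical helper: wraps a connection and counts executed statements."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def execute(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params)


def make_db():
    # Illustrative schema: 5 authors, 20 books.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(i, f"author-{i}") for i in range(1, 6)],
    )
    conn.executemany(
        "INSERT INTO book VALUES (?, ?, ?)",
        [(i, f"book-{i}", (i % 5) + 1) for i in range(1, 21)],
    )
    return conn


def titles_n_plus_one(db):
    # One query for the books, then one more query per book for its author.
    rows = db.execute("SELECT title, author_id FROM book ORDER BY id").fetchall()
    result = []
    for title, author_id in rows:
        name = db.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)
        ).fetchone()[0]
        result.append((title, name))
    return result


def titles_joined(db):
    # The same data fetched in a single round trip.
    return db.execute(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id ORDER BY book.id"
    ).fetchall()


slow = QueryCounter(make_db())
fast = QueryCounter(make_db())
assert titles_n_plus_one(slow) == titles_joined(fast)
print(slow.count, fast.count)  # 21 queries versus 1
```

In Django itself the usual remedies are `select_related` and `prefetch_related` on the queryset; the sketch above only shows why the query count matters.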

I'd emphasize that for each task that's likely to be a performance bottleneck there are, usually, existing high performance extensions: someone has had the same needs before you.

If there are performance issues, rewriting part of a Python application in C is much less likely than refactoring it, without using other languages, to use an existing high performance library.
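A small sketch of that kind of refactor, staying within Python: a hand-written counting loop is replaced by `collections.Counter`, whose counting loop runs in C on CPython. The function names and the sample sentence are made up for illustration, and in a real project the library would more likely be something like NumPy; how much faster the library version is depends on the workload, so no timings are claimed here.

```python
from collections import Counter


def word_counts_loop(words):
    # Pure-Python version: every iteration runs in the interpreter.
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


def word_counts_library(words):
    # Same result, but the counting is delegated to an existing library.
    return dict(Counter(words))


words = "the quick brown fox jumps over the lazy dog the end".split()
assert word_counts_loop(words) == word_counts_library(words)
print(word_counts_library(words)["the"])  # 3
```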

Application-specific Python extensions are usually intended to allow scripting of the application, with little concern for Python performance (which is the same as doing the same thing without scripting).

New foreign language Python extensions are usually found in new Python libraries, to make existing proven C or C++ libraries available to Python applications or to improve on existing Python libraries.

Reddit disagrees. And they weren't using 3, which is even slower.

By using Python they were able to ship, which is why you have heard of Reddit and they were able to grow enough to have a concurrency problem (something Python still sucks at); the number of sites of any significance that started by using Java for page delivery is probably somewhere around 0.

Oh cmon, LinkedIn started with Java.

And even after this long it doesn't work quite right.

It's a miracle they shipped it at all.

They would be able to ship in any language, that is what software engineering is all about.

True, but would they be able to ship within the window they could become relevant? Would they be able to add the required features?

You can ship your own clone of Reddit next week, with blazingly fast code, running on two tiny VMs and supporting more load than Reddit, but would it be successful?

Probably, depending on the sales/marketing teams.

Reddit was written in lisp, they switched to python only after having initial success.

Pretty sure Python 3 has had better performance since 3.5 or so?

I still see benchmarks from bilingual projects which show py2 being faster, such as http://falconframework.org/#sectionBenchmarks

YMMV. The benchmark you mentioned used 3.6.

https://hackernoon.com/which-is-the-fastest-version-of-pytho... has a benchmark that includes 3.7.

Nope: Cython + Numba always sufficed so far

Sure, but it sucks to be left with code you have to rewrite because you're donating your time to the cause to find problems and smooth the path for other people in the future.

Maybe some people are into that for fun, but for the majority of people, the message should be:

stick with stable folks.

Fortunately Rust is one of the closest languages to the "if it compiles it works" ideal, so big refactors aren't as bad as they otherwise would be.

> sql to write etl in will drastically reduce the amount of work needed to write etl.


My experience with writing an ETL in SQL is that it is almost never quick, easy, correct or easy to test, and also almost always denormalized, or unconstrained (dimensional keys which aren't 'real' foreign keys, just numbers so you can parallelize the data inserts and updates without constraint errors).

So... your mileage may vary with that.

It's most certainly not true that writing any kind of ETL that uses SQL saves time in all cases.

Well SQL would present the ETL declaratively for one ... whether the output is denormalised or unconstrained has nothing to do with SQL.
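For readers who want to see what "declaratively" means here, a minimal sketch using the standard-library `sqlite3` module. The `raw_orders` and `daily_revenue` tables and their columns are hypothetical. The whole transform step is a single `INSERT ... SELECT` statement that says what the output should contain, not how to loop over the input; whether the target table is normalised or constrained is decided by its `CREATE TABLE`, not by the transform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source and target tables.
conn.executescript(
    """
    CREATE TABLE raw_orders (
        order_id INTEGER, customer TEXT, day TEXT, amount_cents INTEGER
    );
    CREATE TABLE daily_revenue (
        day TEXT PRIMARY KEY,
        orders INTEGER NOT NULL,
        revenue_cents INTEGER NOT NULL
    );
    """
)

# "Extract": made-up rows standing in for the raw data.
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "a", "2019-04-01", 1000),
        (2, "b", "2019-04-01", 2500),
        (3, "a", "2019-04-02", 700),
    ],
)

# "Transform" and "load" as one declarative statement.
conn.execute(
    """
    INSERT INTO daily_revenue (day, orders, revenue_cents)
    SELECT day, COUNT(*), SUM(amount_cents)
    FROM raw_orders
    GROUP BY day
    """
)

rows = conn.execute(
    "SELECT day, orders, revenue_cents FROM daily_revenue ORDER BY day"
).fetchall()
print(rows)  # [('2019-04-01', 2, 3500), ('2019-04-02', 1, 700)]
```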

It sounds like you already have an idea of what you want to do, but I think you should pause and think more deeply about what you have, vs. what you want.

What I would want in your situation is:

    - All the data in one place.
    - An easy way to explore the data. 
    - A single source of truth for transformed data.
    - Metadata to explain the data model (ie. documentation).

What you're proposing does some of those things, but it also:

    - Adds yet another maintain-forever technology to your stack.
    - Adds yet another pipeline (or set of pipelines) that does the same thing.
    - Moves from an architecture that is clustered for scale (ie. spark) to one that only scales vertically (postgres). 
    - Potentially introduces *yet more* sources of truth for some data.

> I was thinking that in a first iteration, data scientists would explore their denormalized, aggregated data and create their own feature with code.

^ Moving data into postgres doesn't make this somehow trivial, it just enables people to use a different SQL dialect. The spark API is, for anyone competent to be writing code, not meaningfully less complicated than using the postgres API.

I appreciate the naive attractiveness of having a traditional "data warehouse" in a SQL database, but there is actually a reason why people are moving away from that model:

    - it doesn't scale
    - SQL is a terrible language to write transformations in (it's a *query* language, not an ETL pipeline)
    - it's only vaguely better when you have many denormalised tables, vs. s3 parquet blobs
    - you have to invent data for schema changes (ie. new table schema, old data in the table) (ie. migrations are hard)

More tangibly, I know people who have done exactly what you're talking about, and regretted it. Unless you can very clearly demonstrate that what you're making is meaningfully better, it won't be adopted by the other team members and you'll have to either live forever in your silo, or eventually abandon it and go back to the old system. :/

So... I don't recommend it.

The points you're making are all valid, and for a small scale like this, if you were doing it from scratch it would be a pretty compelling option... but migrating entirely will be prohibitively expensive, and migrating partially will be a disaster.

Could you perhaps find a better way to orchestrate your spark tasks, eg. with airflow or ADF or AWS Glue or whatever?

Personally I think that databricks offers a very attractive way to allow data exploration without a significant architecture change.

The architecture you're using isn't fundamentally bad, it just needs strong across the board data management... but that's something very difficult to drive from the bottom up.

You changed my perspective a little bit by asking the right questions.

> Moves from an architecture that is clustered for scale (ie. spark) to one that only scales vertically

I did a quick estimate of the volume, and we won't reach 1 TB for more than 5 years. We're not in a line of business where the number of clients can increase dramatically so it's fairly predictable. I don't want to design for imaginary scaling issues.

> Potentially introduces yet more sources of truth for some data.

It is more intended to replace the current mess.

> SQL is a terrible language to write transformations in (it's a query language, not an ETL pipeline)

Actually this is the point that concerns me the most. The need to transform the data in non-trivial ways. But surely people didn't wait for Spark to do this?

> Unless you can very clearly demonstrate that what you're making is meaningfully better

This is a very good point, and I think I should come up with a quick POC to demonstrate and get buy-in.

> Could you perhaps find a better way to orchestrate your spark tasks, eg. with airflow or ADF or AWS Glue or whatever?

I feel that it would just be solving the mess by adding more mess.

I disagree with the author of the parent comment with regard to using Spark instead of SQL. I actually first wrote my "SQL advocation" as a reply to this comment but decided to leave this view for what it is and write my own "rant" against complicating "big" data transformations with Spark or EMR (Hadoop Pig) or vendor-locked Spark-instrumentations like AWS Glue.

But I agreed with the parent comment's author about pretty much everything until the third bullet point of the second list. I'd like to get more reasoning behind his SQL hate.

> you cannot generalize from a non-random sample

So, honest question:

If any survey of any size can be ignored on the basis that the sample is not random, then how is any survey meaningful?

Isn’t this a self-defeating argument?

You can’t prove the sample is random, all you can do is show differences between samples and suggest it’s not consistent... but how do we go away and prove that some other survey we’re comparing it to is from a random sample?

ie. Isn’t this just a convenient excuse to deny that a survey is meaningful?

Statistically, how do you mathematically quantify the effect of selection bias?

...because, it seems to me, unless you can actually do that, you’re just doing some armchair hand-waving because you don’t like the results you’re seeing.

This has come up several times (eg. js survey about react vs angular), and no one has ever given me a meaningful and mathematical response.

It’s always just... “it must be sample bias”, regardless of the 90000 people they surveyed.

I don’t accept you can survey 90000 developers and cannot offer any generalisation from those results without quantitatively proving there is an overwhelming sample bias, and specifically quantifying the degree of that bias.

Am I missing something here? Everyone seems thoroughly convinced that this is perfectly normal.

(I’m not proud, I’ll take your down votes, but please answer and explain what I’m missing)

The key is in how you randomly select the sample from the population.

This was the author’s point. Just because you have 90k SO respondents doesn’t mean you can say anything about developers as a population. You can say lots of stuff about SO users. Or maybe developers who use SO. But just because you have lots of responses doesn’t mean you know what developers or jugglers or farmers or whatever population interests you.

The confusion rests with SO’s statement that their survey should be representative of developers in general (or CS graduates or whatever other than only SO visitors).

It's not even a random sample of SO visitors, as there is, at the very least, self-selection bias.

I agree, although I think it’s easier to correct for this bias to generalize to all of SO than to all programmers.

It IS sample bias. Read the linked article, which describes this as the worst kind of selection bias, when the sample is made of volunteers.


The way to deal with this is to try to construct a representative sample. Here is Gallup's method in 1936:

> But George Gallup knew that huge samples did not guarantee accuracy. The method he relied on was called quota sampling, a technique also used at the time by polling pioneers Archibald Crossley and Elmo Roper. The idea was to canvass groups of people who were representative of the electorate. Gallup sent out hundreds of interviewers across the country, each of whom was given quotas for different types of respondents; so many middle-class urban women, so many lower-class rural men, and so on. Gallup's team conducted some 3,000 interviews, but nowhere near the 10 million polled that year by the Digest.

Stack Overflow did not attempt to construct a representative sample of developers. Therefore they cannot claim that we can learn from their sample about the population.
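A toy sketch of the quota idea, assuming we already know the population's group shares. The group names, pool sizes, answer rates and the `quota_sample` helper are all made up for illustration, and this is not a reconstruction of Gallup's actual field procedure. A volunteer pool that is 90% "urban" is resampled so that each group contributes in proportion to an assumed 50/50 electorate.

```python
import random


def quota_sample(pool, quotas, seed=0):
    """Draw quotas[group] respondents from each group of pool.

    pool   -- list of (group, answer) pairs, where answer is 0 or 1
    quotas -- dict mapping group name to the number of respondents wanted
    """
    rng = random.Random(seed)
    by_group = {}
    for group, answer in pool:
        by_group.setdefault(group, []).append(answer)
    sample = []
    for group, wanted in quotas.items():
        picked = rng.sample(by_group[group], wanted)
        sample.extend((group, answer) for answer in picked)
    return sample


def yes_rate(rows):
    return sum(answer for _, answer in rows) / len(rows)


# Hypothetical volunteer pool: 900 urban respondents (90% "yes"),
# 100 rural respondents (20% "yes").
pool = (
    [("urban", 1)] * 810 + [("urban", 0)] * 90
    + [("rural", 1)] * 20 + [("rural", 0)] * 80
)

# Assumed electorate: half urban, half rural, so both groups get the same quota.
sample = quota_sample(pool, {"urban": 50, "rural": 50})

print(f"raw pool: {yes_rate(pool):.2f}, quota sample: {yes_rate(sample):.2f}")
```

Under these made-up numbers the raw pool says 0.83, while a 50/50 electorate with the same per-group rates would sit at 0.55; the 100-person quota sample lands near the latter despite being a tenth of the size. Quotas only correct for the groups you thought to define, so this is a sketch of the idea rather than a cure for selection bias.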

> If any survey of any size can be ignored on the basis that the sample is not random, then how is any survey meaningful?

One can take efforts to make the sample more random. This is part of the reason why the U.S. Census is legally compelled, for example - to try and reduce self-selection bias. Or the push for mandatory standardized tests in schools.

One can contextualize the results. Applying, say, English literacy rates from a U.S. Survey to China is obviously going to be totally wrong. Applying a developer salary survey at Google to Game Developers is going to be totally wrong. But within their context, they can be more accurate. Outside of their original context, the survey can be re-run.

> ie. Isn’t this just a convenient excuse to deny that a survey is meaningful?

While convenient, it's sometimes also inconveniently true that a survey isn't terribly meaningful, or isn't in the context it's being reapplied in. Statistical stuff is hard, a lot of surveys are bad, and while you can make some reasonable guesses and extrapolations, it's worth doing so with a giant grain of salt.

I’m with you. How does the saying go? “All models are flawed, some models are useful.”

This survey obviously does not tap directly into the brains of every developer on the planet and extract their unbiased answers to the questions. But it’s still a useful model for seeing trends in the software industry.

Further, from my personal perspective, I’m pretty ok with the self-selected sampling bias inherent in the survey. The kinds of developers who see the value of Stackoverflow and are willing to participate voluntarily in the survey are the kinds of developers whose opinions I generally care about :). That’s my own bias, which I acknowledge exists, and it doesn’t particularly bother me.

Edit: further, none of the results jump out at me as particularly surprising. If there were some extraordinary results here, I’d want someone to do a more rigorous follow-up to dig into that, but there isn’t so...

> I don’t accept you can survey 90000 developers and cannot offer any generalisation from those results without quantitatively proving there is an overwhelming sample bias, and specifically quantifying the degree of that bias.

Surely you have this backwards? If you want to argue that a survey offers any generalisation, then surely the onus is on you to prove you've accounted for sample bias (amongst others)?

That seems fair; but they have a whole methodology section.

If you want to argue with it, surely the onus is on you to do it concretely?

> Because of your methodology, we must assume a biased sample.

^ I find this quote problematic.

Why must we assume that? If you want to do distribution comparisons and point out their survey results are skewed by X compared to some other survey Y... ok.

...but that’s not what’s happening, right? It’s just a flat out arbitrary assumption.

I don’t like arbitrary assumptions when I’m doing maths.

It’s easy to say something is wrong, but if you can’t quantify how it’s wrong, I’m struggling to see why I should accept the assumption being raised here.

The js survey was very similar; it was arbitrarily asserted it went to more react developers... but no one actually proved that. They just... assumed it.

> Why must we assume that?

Because you should distrust flawed methodologies by default. The incorrect assumption is that the sample produced by a known flawed methodology is representative.

> It’s just a flat out arbitrary assumption.

It is not at all arbitrary. It is based on well known issues with this particular method of sampling.

> ie. Isn’t this just a convenient excuse to deny that a survey is meaningful?

Nope, not at all.

It is true that no sample will ever be perfectly representative of a larger population. However, some samples can clearly be more representative than others and the easiest way to tell is to look at sampling methodology. Sample size has absolutely no effect on removing bias.

Here's some info so you can learn more about sampling methodologies: https://blog.socialcops.com/academy/resources/6-sampling-tec...

Now, doing actually representative samples is HARD in many situations, so is knowing how representative your sample is. This is why we can't predict things like who is going to win elections.

> I don’t accept you can survey 90000 developers and cannot offer any generalization from those results without quantitatively proving there is an overwhelming sample bias

I didn’t see anyone point this out here yet specifically, but what you’re missing is that these 90k devs chose to respond to the survey, and the group is made of only SO participants, they were not developers selected at random. That’s the problem here.

There is an overwhelming bias, and it has been proved. Stack Overflow admits that openly and Julia talked about it in her answer to the OP’s commentary:

“Developers from underrepresented groups in tech participate on Stack Overflow at lower rates, so we undersample those groups, compared to their participation in the software developer workforce. We have data that confirms that”

> Am I missing something here? Everyone seems thoroughly convinced that this is perfectly normal.

What you're missing is that one of the intuitions you have is simply wrong. The intuition is that sample size can undo the ill effects of a non-random sample. As stated in the original article and elsewhere in the comments, it cannot:

> It is an error to use the sample size of a non-random sample to support the underlying comparison with the population of interest. Sample size can decrease random error, but not bias
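A quick simulation makes the quoted claim tangible. Everything in it is an assumption chosen for illustration: a population of 100,000 developers, 30% of whom use some tool X, and a self-selection mechanism in which users of X are three times as likely to respond. The `estimate_rates` function is a hypothetical helper, not anything from the survey.

```python
import random


def estimate_rates(sample_size, seed=0):
    """Return (true_rate, random_estimate, self_selected_estimate)."""
    rng = random.Random(seed)

    # Hypothetical population of 100,000 developers; 30% use some tool X.
    population = [1] * 30_000 + [0] * 70_000
    true_rate = sum(population) / len(population)

    # Simple random sample: every developer is equally likely to be picked.
    random_sample = rng.sample(population, sample_size)

    # Self-selected sample (assumption): users of X are 3x as likely to respond.
    # rng.choices draws with replacement, which is fine for a sketch.
    weights = [3 if uses_x else 1 for uses_x in population]
    volunteers = rng.choices(population, weights=weights, k=sample_size)

    return (
        true_rate,
        sum(random_sample) / sample_size,
        sum(volunteers) / sample_size,
    )


for n in (100, 1_000, 20_000):
    true_rate, random_est, volunteer_est = estimate_rates(n)
    print(
        f"n={n:>6}  true={true_rate:.3f}  "
        f"random={random_est:.3f}  volunteers={volunteer_est:.3f}"
    )
```

As n grows, the random-sample estimate tightens around 0.30, while the volunteer estimate tightens around 0.5625 (90,000 / 160,000 under the assumed weights). More respondents make the wrong answer more precise, not more correct. This also speaks to the "how do you quantify it" question: here the bias is computable only because the response mechanism was assumed up front, which is exactly the information a volunteer survey does not give you.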

Anybody can choose to ignore anything. But, if the selection process is demonstrably random (no accidental biases) and we can assume that the SO people are trustworthy (no intentional biases), we can then start making generalizations about the entire population. Everything ultimately comes down to trust, unless you run your own study. But then, do you trust yourself to conduct it properly?

no, the trivial frontend is built in python; the real code is usually c++ or c.

Ignore that reality if you want to, but it is a fact.

Big complicated python projects are seldom pure python, they are usually a friendly python frontend to a serious application written in something else.

It seems in no way remarkable that someone wanting to build a serious backend type piece of functionality would pick another language that was, just for example, multithreaded.

The CPython interpreter itself is a C program. Acting like extension modules “aren’t Python” is highly disingenuous.

Oh please, go read the source code for tensorflow and then come back and we can have a real conversation.

I don't understand this logic. The users are learning Python, not C++ when they're trying to learn data science or implement a machine learning model. Should I say TensorFlow isn't written in C++ but CUDA or OpenCL? Any self respecting researcher is training their models on GPU or FPGA, not CPU.

The point is that Python is the entry point for the large majority of data scientists currently and it's absolutely disingenuous to try to dispute that reality.

> The users are learning Python, not C++

The key word here is "users". Python is fantastic for users. It allows users to get things done without having to worry about types or memory or the underlying hardware in any way.

I’d say it’s very disingenuous to claim that Python users don’t have to care about types. Many types are directly related to semantics, so of course they have to care about types!

I think Python made some really good choices with their types from UX that lead to you having to care more about your logic than the machine. Only one integer type, for example! For most users it’s perfect UX.

However, I think (possibly due to the early time when Python was made) some of the decisions made practical but regrettable trade-offs. Duck typing is ultimately just type inference with strong performance penalties and really weak static tooling, for example. _Many_ static type systems are still very far from the user’s actual domain, but the core techniques don’t have to be.
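As one illustration of static tooling meeting duck typing halfway, Python's standard `typing.Protocol` (3.8+) lets a type checker verify "has a `quack` method" structurally, without inheritance. The `Quacker`, `Duck` and `Robot` names below are made up for the example; this is a sketch of the general technique, not a claim about what the commenter had in mind.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Quacker(Protocol):
    """Anything with a quack() method counts, no base class required."""

    def quack(self) -> str: ...


class Duck:
    def quack(self) -> str:
        return "Quack"


class Robot:
    def beep(self) -> str:
        return "Beep"


def make_noise(thing: Quacker) -> str:
    # A static checker such as mypy can reject make_noise(Robot())
    # before the program runs; at runtime this is ordinary duck typing.
    return thing.quack()


print(make_noise(Duck()))
# runtime_checkable also allows a structural isinstance check.
assert isinstance(Duck(), Quacker)
assert not isinstance(Robot(), Quacker)
```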

> “one person sources the data, another models it, a third implements it, a fourth measures it”

Call it whatever you like, it is what it is.

oh come on. 18000 responses.

This is the js survey results all over again: no. Unless you can statistically prove the results are biased, you don’t get to ignore the results because you don’t like them.

Finding data points with no methodology that contradict the survey result does not invalidate the survey results.

That’s. not. how statistics work.

A great deal of effort was put into this survey, and the stats you’re looking at are more likely biased than the ones in this survey.

The stats and the methodology here are clearly documented; if you want to argue with them, be specific and provide concrete statistical proof for your assertions.

Specifically, why do the stats you have prove anything, and what confidence do you have that they are representative?

If you're that strict then this survey is also useless as you cannot prove anything about the general population of Python developers.

I think the point here is people tend to reject survey data because they can only see some tiny minimal subset of the data and it doesn’t match in aggregate.

In this situation often the smaller dataset is wrong.

...not always. But often.

Human intuition based on limited data can seem compelling, but it’s always worth acknowledging you might be the outlier.

18000 respondents is a lot, especially when a specific effort has been made to sample from various sources.

The parent post didn’t even bother to check their own biases.

Another thing that I've noticed is that often what we, humans, feel makes intuitive sense (and therefore must be correct), when you actually look at the data, is often very wrong. Just because it makes sense to us doesn't mean it's correct. Basically, humans are naturally biased not only towards certain held beliefs, but also towards things that just "sound" correct, regardless of what the reality is. It makes sense though, if an explanation of something is understandable to us, then we can think about it logically (or otherwise) and come to conclusions, but unfortunately often things that are understandable are also based on flawed assumptions.

I don't know if that's the case here, but I do often think that people gravitate towards the first thing that they can understand that seems to, at a glance, check out, without investigating.
