That's not what they wanted.
What people are being sold is AI/ML as a magic bullet that will do something useful regardless of the situation. It lets business people avoid making decisions about what they actually want, because AI/ML can be anything, so they just sign up for it and expect to get 20 things they didn't know they wanted handed to them on a plate.
Turns out, it's not enough to just collect a bunch of data and wave your magic wand at it. It wasn't with web analytics 10 years ago, and it still isn't.
What you actually need is someone who has a bunch of tricks up their sleeve, who has done this before, and who can suggest the business insights the business might need before anything gets built: people who actually decide what to do, and what actions to take to investigate and solve those problems.
I mean, to some degree you're right; perhaps ML models could be useful for tracking hardware failures, but that's not what the parent post is talking about. The previous post was talking about just collecting the data and expecting the predictive failure models to just jump out magically.
That doesn't happen; it needs a person to have the insight that the data could be used for such a thing, and that needs to happen before you go and randomly collect all the wrong frigging metrics.
...but hiring experts is expensive, and making decisions is hard. So ML/AI is sold like snake-oil to managers who want to avoid both of those things. :)
It's all about how you package things. ML connected to an audio sensor could predict failure modes that are difficult to detect otherwise. Now, that might not be what was asked for, but a win is a win.
I used to love angular, then I got a job which was a “| async” dumpster fire and spent a year watching a team of smart C# developers wallow in a mire of disaster so bad it became a two-week regression to change a text field on a form. So full of amazing functional statements that no one, not even the original authors, could touch it without breaking something in the process.
Your mileage may vary. I no longer particularly like angular, personally, because I find it a chore to herd inexperienced FactoryInjectorConstructorFactoryPattern angular developers into not screwing things up.
...but a talented team can do well with it too, and I’ve seen people screw up react projects as well.
It really is more about good practice and experience than the framework; your personal preference is probably, like mine, basically irrelevant.
1) write everything in python because it's easy and quick to do so.
2) it's slow as.
3) abandon the software and rewrite it in something else, or live on with slow software and blame python for being slow and rubbish forever more.
Rewriting python in C is a hideously painful process, and in practice it's proven to be very unsuccessful.
Writing new code in C/C++/whatever and exposing a python API is where successful projects like numpy and tensorflow live.
python is very good at what it is, but no one is ever going to go and rewrite your python code in C to make it faster; it's just going to be slow forever.
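For what it's worth, the "C core, python API" pattern the comment describes can be sketched in miniature with the standard library's `ctypes`, calling into the C math library directly (numpy and tensorflow do the same thing at a vastly larger scale, via real extension modules rather than `ctypes`; the library-name fallback below is an assumption about a glibc system):

```python
# Minimal sketch of the "write it in C, expose a python API" pattern,
# using ctypes to call the C math library's sqrt() from python.
import ctypes
import ctypes.util

# Locate and load the C math library (name lookup is platform-dependent;
# "libm.so.6" is a glibc fallback assumption).
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature: double sqrt(double)
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

def c_sqrt(x: float) -> float:
    """python-friendly wrapper around the C implementation."""
    return libm.sqrt(x)

assert c_sqrt(9.0) == 3.0
```

The point being: the heavy lifting stays in compiled code, and python is only the friendly front door.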
1) write everything in python
2) yeah, the performance here is good enough so ship it
3) there is no 3
There are very few situations where performance is going to be an issue for you where there is not an existing C module solution that will solve the problem for you. The tired old 'python is slow' trope is getting more and more irrelevant every day. There are other aspects of the language that may make it a mediocre solution to the problem at hand, but out in the real world most people are simply getting the job done with python.
We certainly shipped (using django) and it was certainly slow, and it remains a painfully slow, very successful enterprise app.
I’m not arguing that the slowness is a deal breaker, but it is slow, and it does routinely break the SLAs it’s supposed to meet.
So... unusably slow? no.
...but slow? yes, it really is.
imo. your mileage may vary. /shrug
If there are performance issues, rewriting part of a Python application in C is much less likely than refactoring it, without using other languages, to use an existing high performance library.
Application-specific Python extensions are usually intended to allow scripting of the application, with little concern for Python performance (which is the same as doing the same thing without scripting).
New foreign language Python extensions are usually found in new Python libraries, to make existing proven C or C++ libraries available to Python applications or to improve on existing Python libraries.
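The "refactor to an existing high-performance library" option described above doesn't have to mean leaving the standard library; a toy example (names and data invented for illustration) is swapping a hand-rolled counting loop for `collections.Counter`, whose inner loop runs in C for common input types:

```python
# Same language, no rewrite: replace an interpreter-bound loop with a
# stdlib routine whose hot path is implemented in C.
from collections import Counter

def word_counts_slow(words):
    # Hand-rolled counting: every iteration goes through the interpreter.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def word_counts_fast(words):
    # Counter's counting loop runs largely in C.
    return dict(Counter(words))

words = ["a", "b", "a", "c", "a", "b"]
assert word_counts_slow(words) == word_counts_fast(words) == {"a": 3, "b": 2, "c": 1}
```

Libraries like numpy extend the same idea to whole-array arithmetic; the refactor stays pure python from the application's point of view.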
It's a miracle they shipped it at all.
You can ship your own clone of Reddit next week, with blazingly fast code, running on two tiny VMs and supporting more load than Reddit, but would it be successful?
https://hackernoon.com/which-is-the-fastest-version-of-pytho... has a benchmark that includes 3.7.
Maybe some people are into that for fun, but for the majority of people, the message should be:
stick with stable, folks.
My experience with writing an ETL in SQL is that it is almost never quick, easy, correct, or easy to test, and it is almost always denormalized or unconstrained (dimensional keys which aren't 'real' foreign keys, just numbers, so you can parallelize the data inserts and updates without constraint errors).
So... your mileage may vary with that.
It's most certainly not true that writing any kind of ETL that uses SQL saves time in all cases.
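On the testability point specifically: a transform written as a plain function is trivial to unit-test in isolation, in a way that logic buried inside a large SQL statement usually isn't. A hedged sketch (the schema and field names here are entirely made up for illustration):

```python
# A transform step as a pure python function: easy to call with a fixture
# row and assert on, with no database required.

def normalize_order(row: dict) -> dict:
    """Hypothetical transform: clean a text field and compute a derived total."""
    return {
        "order_id": row["order_id"],
        "customer": row["customer"].strip().lower(),
        "total": round(row["qty"] * row["unit_price"], 2),
    }

# The "unit test" is just a direct call on a made-up fixture:
row = {"order_id": 1, "customer": "  Alice ", "qty": 3, "unit_price": 2.5}
assert normalize_order(row) == {"order_id": 1, "customer": "alice", "total": 7.5}
```

Whether that outweighs SQL's strengths is situational, which is rather the point of the comment above.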
What I would want in your situation is:
- All the data in one place.
- An easy way to explore the data.
- A single source of truth for transformed data.
- Metadata to explain the data model (ie. documentation).
- Adds yet another maintain-forever technology to your stack.
- Adds yet another pipeline (or set of pipelines) that does the same thing.
- Moves from an architecture that is clustered for scale (ie. spark) to one that only scales vertically (postgres).
- Potentially introduces *yet more* sources of truth for some data.
^ Moving data into postgres doesn't make this somehow trivial, it just enables people to use a different SQL dialect. The spark API is, for anyone competent to be writing code, not meaningfully more complicated than the postgres API.
I appreciate the naive attractiveness of having a traditional "data warehouse" in a SQL database, but there is actually a reason why people are moving away from that model:
- it doesn't scale
- SQL is a terrible language to write transformations in (it's a *query* language, not an ETL pipeline)
- it's only vaguely better when you have many denormalised tables, vs. s3 parquet blobs
- you have to invent data for schema changes (ie. new table schema, old data in the table) (ie. migrations are hard)
So... I don't recommend it.
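The schema-change point in the list above is easy to demonstrate with the standard library's `sqlite3` (used here only as a convenient stand-in for any SQL store; the table and column names are invented): once the schema evolves, rows written under the old schema hold values that were never really observed, and you have to decide what to invent for them.

```python
# Illustration of "you have to invent data for schema changes": after adding
# a column, pre-existing rows carry NULL for it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.execute("INSERT INTO events VALUES (1, 'old-row')")

# Schema evolves after data already exists.
conn.execute("ALTER TABLE events ADD COLUMN region TEXT")
conn.execute("INSERT INTO events VALUES (2, 'new-row', 'eu')")

rows = conn.execute("SELECT id, region FROM events ORDER BY id").fetchall()
# The old row's region had to be invented; here the database invented NULL.
assert rows == [(1, None), (2, "eu")]
```

Backfilling those NULLs (or versioning the schema) is exactly the migration work the comment is calling hard.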
The points you're making are all valid, and for a small scale like this, if you were doing it from scratch it would be a pretty compelling option... but migrating entirely will be prohibitively expensive, and migrating partially will be a disaster.
Could you perhaps find a better way to orchestrate your spark tasks, eg. with airflow or ADF or AWS Glue or whatever?
Personally I think that databricks offers a very attractive way to allow data exploration without a significant architecture change.
The architecture you're using isn't fundamentally bad, it just needs strong, across-the-board data management... but that's something very difficult to drive from the bottom up.
> Moves from an architecture that is clustered for scale (ie. spark) to one that only scales vertically
I did a quick estimate of the volume, and we won't reach 1 TB for more than 5 years. We're not in a line of business where the number of clients can increase dramatically, so it's fairly predictable. I don't want to design for imaginary scaling issues.
> Potentially introduces yet more sources of truth for some data.
It is more intended to replace the current mess.
> SQL is terrible language to write transformations in (its a query language, not an ETL pipeline)
Actually, this is the point that concerns me the most: the need to transform the data in non-trivial ways. But surely people didn't wait for Spark to do this?
> Unless you can very clearly demonstrate that what you're making is meaningfully better
This is a very good point, and I think I should come up with a quick POC to demonstrate and get buy-in.
> Could you perhaps find a better way to orchestrate your spark tasks, eg. with airflow or ADF or AWS Glue or whatever?
I feel that it would just be solving the mess by adding more mess.
But I agreed with the parent comment's author on pretty much everything until the third bullet point of the second list. I'd like to hear more of the reasoning behind his SQL hate.
So, honest question:
If any survey of any size can be ignored on the basis that the sample is not random, then how is any survey meaningful?
Isn’t this a self-defeating argument?
You can’t prove the sample is random; all you can do is show differences between samples and suggest it’s not consistent... but how do we go away and prove that some other survey we’re comparing it to is from a random sample?
ie. Isn’t this just a convenient excuse to deny that a survey is meaningful?
Statistically, how do you mathematically quantify the effect of selection bias?
...because, it seems to me, unless you can actually do that, you’re just doing some armchair hand-waving because you don’t like the results you’re seeing.
This has come up several times (eg. js survey about react vs angular), and no one has ever given me a meaningful and mathematical response.
It’s always just... “it must be sample bias”, regardless of the 90,000 people they surveyed.
I don’t accept that you can survey 90,000 developers and cannot offer any generalisation from those results, without quantitatively proving there is an overwhelming sample bias, and specifically quantifying the degree of that bias.
Am I missing something here? Everyone seems thoroughly convinced that this is perfectly normal.
(I’m not proud, I’ll take your down votes, but please answer and explain what I’m missing)
This was the author’s point. Just because you have 90k SO respondents doesn’t mean you can say anything about developers as a population. You can say lots of stuff about SO users. Or maybe developers who use SO. But just because you have lots of responses doesn’t mean you know what developers or jugglers or farmers or whatever population interests you.
The confusion rests with SO’s statement that their survey should be representative of developers in general (or CS graduates or whatever other than only SO visitors).
The way to deal with this is to try to construct a representative sample. Here is Gallup's method in 1936:
> But George Gallup knew that huge samples did not guarantee accuracy. The method he relied on was called quota sampling, a technique also used at the time by polling pioneers Archibald Crossley and Elmo Roper. The idea was to canvass groups of people who were representative of the electorate. Gallup sent out hundreds of interviewers across the country, each of whom was given quotas for different types of respondents; so many middle-class urban women, so many lower-class rural men, and so on. Gallup's team conducted some 3,000 interviews, but nowhere near the 10 million polled that year by the Digest.
Stack Overflow did not attempt to construct a representative sample of developers. Therefore they cannot claim that we can learn from their sample about the population.
One can take steps to make the sample more representative. This is part of the reason why responding to the U.S. Census is legally compelled, for example: to try to reduce self-selection bias. Or the push for mandatory standardized tests in schools.
One can contextualize the results. Applying, say, English literacy rates from a U.S. survey to China is obviously going to be totally wrong. Applying a developer salary survey at Google to game developers is going to be totally wrong. But within their context, surveys can be more accurate. Outside of their original context, the survey can be re-run.
> ie. Isn’t this just a convenient excuse to deny that a survey is meaningful?
While convenient, it's sometimes also inconveniently true that a survey isn't terribly meaningful, or isn't in the context it's being reapplied in. Statistical stuff is hard, a lot of surveys are bad, and while you can make some reasonable guesses and extrapolations, it's worth doing so with a giant grain of salt.
This survey obviously does not tap directly into the brains of every developer on the planet and extract their unbiased answers to the questions. But it’s still a useful model for seeing trends in the software industry.
Further, from my personal perspective, I’m pretty ok with the self-selected sampling bias inherent in the survey. The kinds of developers who see the value of Stackoverflow and are willing to participate voluntarily in the survey are the kinds of developers whose opinions I generally care about :). That’s my own bias, which I acknowledge exists, and it doesn’t particularly bother me.
Edit: further, none of the results jump out at me as particularly surprising. If there were some extraordinary results here, I’d want someone to do a more rigorous follow-up to dig into that, but there isn’t so...
Surely you have this backwards? If you want to argue that a survey offers any generalisation, then surely the onus is on you to prove you've accounted for sample bias (amongst others)?
If you want to argue with it, surely the onus is on you to do it concretely?
> Because of your methodology, we must assume a biased sample.
^ I find this quote problematic.
Why must we assume that? If you want to do distribution comparisons and point out that the survey results are skewed by X compared to some other survey Y... ok.
...but that’s not what’s happening, right? It’s just a flat-out arbitrary assumption.
I don’t like arbitrary assumptions when I’m doing maths.
It’s easy to say something is wrong, but if you can’t quantify how it’s wrong, I’m struggling to see why I should accept the assumption being raised here.
The js survey was very similar; it was arbitrarily asserted that it went to more react developers... but no one actually proved that. They just... assumed it.
Because you should distrust flawed methodologies by default. The incorrect assumption is that the sample produced by a known flawed methodology is representative.
> Its just a flat out arbitrary assumption.
It is not at all arbitrary. It is based on well known issues with this particular method of sampling.
Nope, not at all.
It is true that no sample will ever be perfectly representative of a larger population. However, some samples can clearly be more representative than others and the easiest way to tell is to look at sampling methodology. Sample size has absolutely no effect on removing bias.
Here's some info so you can learn more about sampling methodologies:
Now, doing actually representative samples is HARD in many situations, as is knowing how representative your sample is. This is why we can't predict things like who is going to win elections.
I didn’t see anyone point this out here yet specifically, but what you’re missing is that these 90k devs chose to respond to the survey, and the group is made up of only SO participants; they were not developers selected at random. That’s the problem here.
There is an overwhelming bias, and it has been proven. Stack Overflow admits that openly, and Julia talked about it in her answer to the OP’s commentary:
“Developers from underrepresented groups in tech participate on Stack Overflow at lower rates, so we undersample those groups, compared to their participation in the software developer workforce. We have data that confirms that”
What you're missing is that one of the intuitions you have is simply wrong. The intuition is that sample size can undo the ill effects of a non-random sample. As stated in the original article and elsewhere in the comments, it cannot:
> It is an error to use the sample size of a non-random sample to support the underlying comparison with the population of interest. Sample size can decrease random error, but not bias
Ignore that reality if you want to, but it is a fact.
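The quoted point is easy to see in a simulation (a deliberately artificial setup: a population whose true mean is 0, sampled through a frame that can only ever reach a subgroup whose mean is 1):

```python
# Simulation: growing the sample shrinks random error but leaves bias intact.
# True population mean is 0; the biased frame only reaches a subgroup
# with mean 1, so estimates converge to 1 no matter how large n gets.
import random

random.seed(0)

def biased_sample_mean(n):
    # Every draw comes from the reachable subgroup (mean 1.0, sd 1.0);
    # the unreachable subgroup is never sampled.
    return sum(random.gauss(1.0, 1.0) for _ in range(n)) / n

small = biased_sample_mean(100)
large = biased_sample_mean(100_000)

# More data -> a tighter estimate... of the wrong quantity.
assert abs(large - 1.0) < 0.05   # converges tightly around the *biased* mean
assert abs(large - 0.0) > 0.9    # and stays far from the true population mean
```

Sample size controls the noise around the estimate; it says nothing about whether the estimate is aimed at the right target.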
Big complicated python projects are seldom pure python; they are usually a friendly python frontend to a serious application written in something else.
It seems in no way remarkable that someone wanting to build a serious backend type piece of functionality would pick another language that was, just for example, multithreaded.
The point is that Python is currently the entry point for the large majority of data scientists, and it's absolutely disingenuous to try to dispute that reality.
The key word here is "users". Python is fantastic for users. It allows users to get things done without having to worry about types or memory or the underlying hardware in any way.
I think Python made some really good choices with its types, from a UX perspective, that lead to you caring more about your logic than the machine. Only one integer type, for example! For most users it’s perfect UX.
However, I think (possibly due to the early era when Python was made) some of the decisions made practical but regrettable trade-offs. Duck typing is ultimately just type inference with strong performance penalties and really weak static tooling, for example. _Many_ static type systems are still very far from the user’s actual domain, but the core techniques don’t have to be.
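The "one integer type" point is concrete: Python ints are arbitrary-precision, so users never think about word sizes or overflow, at the cost of arithmetic that can't compile down to a single machine instruction:

```python
# Python's single int type never overflows; there is no separate
# int32/int64/bigint distinction for the user to manage.
big = 2 ** 100
assert big + 1 - 1 == big
assert isinstance(big, int)  # same type as small ints; no silent wraparound

# In C, this product would overflow a 64-bit integer; here it just works.
assert (2 ** 64) * (2 ** 64) == 2 ** 128
```

That trade (uniform, safe semantics over raw speed) is a fair summary of the UX-first choices being described.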
Call it whatever you like, it is what it is.
This is the js survey results all over again: no. Unless you can statistically prove the results are biased, you don’t get to ignore the results because you don’t like them.
Finding data points with no methodology that contradict the survey results does not invalidate the survey results.
That’s. not. how statistics work.
A great deal of effort was put into this survey, and the stats you’re looking at are more likely biased than the ones in this survey.
The stats and the methodology here are clearly documented; if you want to argue with them, be specific and provide concrete statistical proof for your assertions.
Specifically, why do the stats you have prove anything, and what confidence do you have that they are representative?
In this situation often the smaller dataset is wrong.
...not always. But often.
Human intuition based on limited data can seem compelling, but it’s always worth acknowledging you might be the outlier.
18000 respondents is a lot, especially when a specific effort has been made to sample from various sources.
The parent post didn’t even bother to check their own biases.
I don't know if that's the case here, but I do often think that people gravitate towards the first thing they can understand that seems, at a glance, to check out, without investigating further.