
I might be under-informed but "data science" seems to involve a lot of vague BS. Computing has always centered on data.

"Data practitioners", "data practice", "data management" just seem like a weird rebranding of things businesses have been doing since the 1960s, which is partly what the article seems to be getting at regarding SQL.

What is "data science" besides computing, data storage of some sort, and analyzing the data? Because it's "BIG" now? Because now we "extract knowledge and insights" from data, since apparently giant databases were amassed for no reason in the past? Because now we "combine multidisciplinary fields like statistics", since apparently nobody had thought to run statistics on their data before? Because "AI"?






Perhaps a more useful characterization would be that it's what happens when you take work that would traditionally belong to statisticians and actuaries with only limited programming and IT skills, and give it to someone who's more of a jack-of-all-trades programmer/statistician.

I don't think that this leads to changes in the traditional ways of handling IT because of any fundamental change in a data analyst's needs. It's more that, historically, they were outside the formal tech sphere, and therefore lusers who had to make do with whatever resources they could squeeze out of IT. Now, armed with better technical chops, they're able to make a stronger case for their needs, which means that more is being done to meet them.

Which is not to say that your skepticism is unwarranted. "Data science" is still a catch-all term for a new and ill-defined way of working, and things are still pretty wild west. Anyone who tells you they're sure they know what the best practices are is probably trying to sell you their product, which happens to be the best practice. Not many people have the benefit of 20 years of hard-won experience from which to draw a set of really strong guidelines.


The big difference is that historically you could run pretty much all analysis on a single computer (if not in Excel - I’ve seen and built many actuarial models in Excel myself) but consumer internet scale data (both in terms of the number of rows and the number of features available) is too big to model without distributed computing and more programming work on the part of the analyst.

Moreover it’s an iterative process where you start with hypotheses on what features will be useful then build a model and might need to pull more data. For example you could build a recommendation engine just using the set of products purchased by customers but then might explore adding user agent features like devices or might add data about the products like category or price information.

An analyst (“data scientist”) who can pull and interact with that data is much more effective than one who needs to wait for someone else to pull the data for them (most actuaries I know are weak programmers outside of R and wouldn’t know where to begin using distributed computing or getting a TensorFlow gridsearch running on a cluster of cloud GPUs :)
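The "just the set of products purchased" starting point mentioned above can be sketched as a toy co-occurrence recommender; a minimal sketch, with all item names made up:

```python
# Bare-bones co-occurrence recommender built only from sets of products
# purchased together; item names are made up for illustration.
from collections import Counter
from itertools import combinations

baskets = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"mouse", "keyboard"},
]

cooccur = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(owned, k=1):
    # Score candidates by how often they co-occur with items already owned.
    scores = Counter()
    for item in owned:
        for (a, b), n in cooccur.items():
            if a == item and b not in owned:
                scores[b] += n
    return [item for item, _ in scores.most_common(k)]
```

The iterative part described above would then be adding more features (device, category, price) and refitting something richer.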


I actually think that analytics clusters are one of those wild west things. The statistician in me is suspicious of a lot of the Big Data movement, for reasons that could largely be summarized as, "I'm pretty sure these ways of working are motivated less by any true analytics need and more by the need of cloud services providers to sell cloud services."

I like some of these technologies for taking care of the dragnet data collection and engineering, but, when I'm actually doing a data analysis, it's rare I want to run it on a Spark cluster; I'd much rather run a Spark job to collect the information I need and sample or digest it down to a size that I can hack on locally in R or Pandas. Yeah, there will be some sampling error, but the potential dollar cost associated with that sampling error is much lower than the cost to eliminate that sampling error. And it's basically zero compared to the elephant in the room: the sampling bias I'm taking on by using big data in the first place. "Big data" is, from a stats perspective, just the word for "census data" that people like to use in California.
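The sample-then-analyze workflow described here can be illustrated with a toy stdlib example, where the generated "population" stands in for the big dataset and the sample for what you'd pull down locally:

```python
# Toy illustration of "digest it down and analyze locally": how far off is a
# sample mean from the full-data mean?
import random

random.seed(0)
population = [random.gauss(100.0, 15.0) for _ in range(200_000)]  # the "big data"
sample = random.sample(population, 5_000)                         # pulled locally

full_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
sampling_error = abs(full_mean - sample_mean)  # on the order of sd/sqrt(n), ~0.2 here
```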


I totally get your skepticism, but the bottom line is that throwing computing power at a problem generally leads to a much better solution, even if only because you can do a much more extensive grid search. This doesn't have to mean some complex cluster: you could just have n copies of the same VM that each run experiments with parameters pulled from a Google sheet until all the experiments are done.
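A minimal sketch of that pull-parameters-until-done pattern, with threads standing in for the n VMs and an in-process grid standing in for the Google sheet (the scoring function is a made-up stand-in for a real training run):

```python
# Toy version of "n identical workers chew through a parameter grid until done".
import itertools
from concurrent.futures import ThreadPoolExecutor

def run_experiment(params):
    lr, depth = params
    # Made-up stand-in for a real training run: score is just a function of params.
    return params, 1.0 / (1.0 + lr * depth)

grid = list(itertools.product([0.01, 0.1, 1.0], [2, 4, 8]))  # 9 experiments
with ThreadPoolExecutor(max_workers=3) as pool:              # three "VMs"
    results = list(pool.map(run_experiment, grid))

best_params, best_score = max(results, key=lambda r: r[1])
```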

Sure, the cloud providers sell computing power but it’s a race to the bottom and makes much more sense than buying hardware for these kinds of bursty analytics workloads.

I don’t think “Big data” from a stats perspective is analogous to census data - in some cases, yes, but for applications like recommendation engines you lose a lot of valuable signal by sampling.


Given that GCP will happily supply VMs that can be resized to dozens of CPUs and hundreds of GB of RAM and back, billed by the minute, sampling bias isn't even a necessity for quite a lot of things a laptop wouldn't be able to handle. I used to write Spark jobs for those slightly-too-big data problems, but Pandas and Dask are quite sufficient for lots of things, without all the headache that distributed computing entails. Plus, data people have no need to store any potentially sensitive data on their personal machines, which is one less headache. It's not going to work well for petabyte-scale stuff, though. I guess for those kinds of things and for periodic bespoke ETL/ELT jobs, Spark is still useful.

The ability to re-size GCP VMs totally blew me away when I first discovered it. Just power off the machine, drag the RAM slider way up and turn it back on. SOOO much easier than re-creating the VM on AWS.

I also way prefer to just crank up the RAM on a single instance and use Pandas/Dask instead of dealing with distributed computing headaches :)
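For the "slightly too big" case, even plain pandas can stream through a file in chunks before you reach for Dask; a toy sketch, where the in-memory CSV stands in for a large file on disk:

```python
# Streaming a CSV through pandas in chunks: one single-machine option for data
# that's too big to hold in memory at once.
import io
import pandas as pd

# In-memory stand-in for a large CSV on disk: 1000 rows of user/revenue pairs.
csv_file = io.StringIO("user,revenue\n" + "\n".join(f"u{i},{i % 10}" for i in range(1000)))

total = 0.0
for chunk in pd.read_csv(csv_file, chunksize=100):  # never loads the whole file
    total += chunk["revenue"].sum()
```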


I'm pretty suspicious of the idea that most companies have data that is too big for a non-distributed environment.

I do think that most companies log lots of data in different places without much thought or structure, and as a result need a "data scientist", who spends most of their time simply aggregating and cleaning data that can then easily be processed on a single machine, i.e., they spend 10% of their time doing statistics.

Maybe I'm wrong, but the above is how all the data scientists that I know describe their work.


In my experience it's a bit of both.

Most of a data scientist's time is definitely spent getting data from various places, possibly enriching it, cleaning it, and converting it into a dataframe that can then be modeled (and surprisingly, a lot of those tasks are often beyond pure actuaries / statisticians, who often get nervous writing SQL queries with joins, never mind Spark jobs)...

But also once you've got that dataframe, it's often too big to be processed on a single machine in the office so you need to spin up a VM and move the data around which also requires a lot more computer science / hacking skills than you typically learn in a stats university degree so I think the term data scientist for someone who can do the computer sciency stuff in addition to the stats has its place...
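The kind of join-and-enrich query mentioned above is easy to sketch with stdlib sqlite3; table and column names here are made up for illustration:

```python
# Enriching purchase records with product attributes via a SQL join,
# using an in-memory SQLite database. All names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE purchases (user_id INTEGER, product_id INTEGER);
    CREATE TABLE products  (product_id INTEGER, category TEXT, price REAL);
    INSERT INTO purchases VALUES (1, 10), (1, 11), (2, 10);
    INSERT INTO products  VALUES (10, 'books', 9.99), (11, 'games', 59.99);
""")

# Each purchase row picks up the category and price of its product.
rows = con.execute("""
    SELECT pu.user_id, pr.category, pr.price
    FROM purchases pu
    JOIN products pr ON pr.product_id = pu.product_id
    ORDER BY pu.user_id, pr.product_id
""").fetchall()
```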


Data science, unfortunately, is often considered a buzzword because it varies so significantly from org to org, and often there's a lot of No True Scotsman arguments resulting from that which makes everyone unhappy. But at minimum, every data scientist needs to know SQL; there's no way around it.

I wrote a full blog post ranting about the disparity between expectation vs. reality for data scientists a year ago: https://minimaxir.com/2018/10/data-science-protips/


> But at minimum, every data scientist needs to know SQL; there's no way around it.

Is it no longer the case that every programmer has to know SQL? I have always and still do consider knowledge of SQL a minimum requisite for any programming job...


It is not the case that every programmer has to know SQL (if it ever was).

Programming is now such a big field that plenty of people will go their whole careers without knowledge that's foundational for other programmers - whether it be SQL, linear algebra, memory management, distributed systems, ...

In a way, it's very cool. Instead of a single-dimensional "how good a programmer are you", we've got a multidimensional space. That, naturally, yields a lot more interesting corners, like, say, what could be invented by someone who knows a lot about databases and GPUs.


There are plenty of programming jobs that have no need for SQL; it's just easy to forget how broad the field is. There was never a need for SQL when I was programming microcontrollers or doing DSP work in Python/Matlab.

Granted, SQL is an extremely useful skill for programmers, but there are use cases (e.g. front-end dev) where it's not required.

Just because "computing has always centered on data" doesn't mean that every aspect of that was done well (not implying any recent changes brought by "Data Science" here).

For a lot of systems, a lot of engineers/programmers often care very little about the "data", and they very often don't need to. E.g. on a high abstraction level your webserver doesn't care if it's sending HTML, JSON or whatever, and here we are just speaking about formats and not even higher semantics of the data.

I think there can be a lot of value in data-focused roles in companies, because there are many (hard) questions around data that are often badly answered as afterthoughts:

- What are the exact semantics of this data?

- How do we make sure the data is clean?

- How do we evolve the data over time?


A good data scientist is a jack of all trades - a mediocre programmer, mediocre ML modeler, mediocre ETL architect, mediocre statistician, and mediocre analyst. You hire this person because you don't need a specialist in each of these fields, but you do need someone who can do all of these things. Also, you don't want to pay a metric ton of money, so you can't expect some superstar.

Good data scientists also need to be great at a couple of things: they must be excellent presenters and excellent diplomats. Getting the data is the hardest part. Data silos are created all over the place, and getting access to that data requires speaking to a lot of higher-ups.

Can confirm. Am data scientist. Am mediocre at everything :)

> Also you don't want to pay a metric ton of money

Doesn't the job title of "data scientist" still correlate with high pay, or has the hype started to wane?


A data scientist doesn't command the same premium as a machine learning engineer or a kubernetes expert, I'm afraid

Depends where. I'm at Amazon, and they are really serious about not letting DS become just business analysts who only do SQL/Python. But elsewhere I do get the impression there is some dilution. Although honestly, that's natural. Just like how in the SDE world there are template web devs and hardcore infrastructure guys who get paid massively different amounts but have the same job title, that's happening to DS.

Depends on the location and company. In a smaller city working for a smaller company an entry level data scientist might be happy to pull in 60k. In a major city working for a large company you can at the very least double that number.

If the person can do all of these in a way that makes them a profitable investment, it doesn't really seem so high

These people often have PhD's, I doubt they are mediocre statisticians or mediocre analysts.

Hmmmm. I wish your implication was correct, but I fear it's not.

>Computing has always centered on data

Sure, well-thought-out data structures were an integral part of good clean code, but the bulk of data was an artifact of software in the form of logs and largely unused database records, not the "center" of anything. Today, there are many people who take these logs and database records and run very specialized code on them. I entered industry in the mid 2000s, and it was very difficult to find jobs that concentrated on this component back then. We're often called data scientists today, and there's orders of magnitude more attention that goes into it. This is a real shift that has brought data into the "center" of many organizations.


I used to work for a big consulting firm. A pretty fancy one, not the fanciest, but fancier than most that claim to be fancy.

10 years ago they had a big initiative around “Analytics.” I’ve programmed and worked in data for a few decades so I tried to figure out why these biz people were interested in Analytics.

When I asked they brought up all the terms you said. I showed examples from various projects how they were just describing basic data programming. But they disagreed because this was new.

I figured it out a few years later when they spent $5B on the initiative and it was making money like gangbusters. This was now a product line. It was new that they were selling it.

So now there’s people who want more money and need to differentiate and thus the titles and whatnot. I wish there was an easier way to view work output of people because now the only way I can differentiate between “data engineers” and “data scientists” and “AI wizards” is by looking at their GitHub profiles or talking to them in depth and profiles are almost never complete or accurate and talking takes a lot of time.


I think the term “data science” has largely been co-opted by data analysts and SQL analysts. At a big software company that you’ve definitely heard of, “data scientist” consultants were people who created daily reports for upper management that they created by clicking through graphical UIs charting usage data (ie we gained a lot of users in this region, because of this customer, or lost some because of a holiday). Data scientist employees were SQL analysts and data engineers.

In my experience most of the people doing what I consider data science have the titles “applied scientist” “machine learning engineer” “data engineer”.


I'd say that data science adds some domain knowledge to computing. Usually, the domain knowledge is in business analytics, but it could also be medicine or social sciences. Where traditionally trained computer scientists fail is not so much at implementing the solution, but being able to gain beneficial insights either from the needs of the clients or from the results of computation.

After a new round of hiring, we came to the conclusion that "pure" computer scientists will not make the cut, even those with a PhD, simply because we are quite convinced that they won't be able to understand the problem at hand.


Data science as a practice is not anything new. There have always been practitioners of statistics, math, prediction, and so on. What is new is the merging of technologies that harness the large amounts/kinds/disparities of data being stored and process it into useful information to drive decisions for an organization. Thus a term was coined: "Data Science".

The term is nearly 20 years old by some accounts, but is probably older.

This is a good summary of its meaning and history: https://en.wikibooks.org/wiki/Data_Science:_An_Introduction/...


Isn't this sort of fancier-job-title marketing a theme across the entire industry?

Programmer -> Developer -> Software Engineer

Sysadmin -> Operations Engineer -> Site Reliability Engineer

QA Engineer -> Automation Engineer -> SDET


Sounds pretty dismissive. AI and Big Data are buzzwords that shouldn't be used very often. Data Science is about creating solutions that are right for the size of your data, and the questions being asked of it.

A C++ program with a couple of megs on the heap is appropriate for some problems, just as an ETL pipeline is appropriate for gigabytes to terabytes.


The joke is that a data scientist is a statistician who works in San Francisco.


