Everyone wants to do the model work, not the data work [pdf] (storage.googleapis.com)
358 points by Anon84 on March 29, 2021 | 112 comments

So, I don't have time to read the entire paper, but I thought it would be interesting to share my experience from the other side. I lead the engineering for a company that does the "data work": we gather terabytes of data, mostly photography and lidar, every day.

So far it's been a crappy industry to be in. It seems the only way to make money is to provide the entire value chain, and this is the business model we're pivoting to.

Just a bunch of photography and lidar scans isn't worth much to anyone, and it's very costly to establish a quality data stream. We had enormous capex on sensor gear, training personnel, running the operations. All this to save a couple bucks on sending an inspector physically to the site.

If you do the entire value chain, you gather the data, you process it into information, and then you turn that information into actionable reports. So that means that besides running our operations, we're also doing the modeling work, and we're providing the industry specific expertise for the actual inspection. So our customers pay for knowing what is broken, where it's broken, and what the impact of that defect is.

In the end, I believe we'll be very successful, but it will be because our investors were willing to take us all the way, where many other companies have simply given up on the concept.

That's an insightful experience, thank you for sharing. To give a counter-counter point, in my experience working with LIDAR (terrestrial in my case, not airborne) there's a wealth of tools that claim to process large amounts of unstructured pointcloud data and make all kinds of fancy visualizations and 3d models (Pix4D comes to mind). The "catch" was often to find high quality data to feed these tools (other than the sample data provided by them). I'm convinced that having high-quality proprietary data is the moat the investors are looking for, not the algorithms. After all, universities are cranking out improved algorithms all the time for free.

Oh for sure, I suppose that's also why the investors are still interested in us. But high-quality proprietary data is not what the customers are interested in. They just want to see a report that says how many m² of brickwork need to be repointed in the next 5 years. And they don't care if it's a machine learning model that deduced that information from high quality drone imagery, or if it was a dude on a ladder.

We're also working with terrestrial lidar btw. When we started out we thought we were going to do everything using photogrammetry, but right now we're deriving most of our geometry from the lidar.

I work in legal tech. The money is in identifying, gathering, processing and letting people search through terabytes of emails, spreadsheets, slack messages, etc. Corporations and law firms pay handsomely for eDiscovery. Ingesting and making sense of millions of emails, for example (think reply, reply to reply), is an interesting problem.

We’ve found something very similar in a totally different industry. We started with data collection and people were willing to pay for the data, but not enough to create any kind of long-term sustainable business. We’ve slowly moved up the chain to provide operations support and now “actionable intelligence” alongside industry expertise and people are finally excited and willing to pay us real money for a full solution.

Now the challenge is keeping the team from falling back into the easy habit of just selling the data or trying to design products that are nothing more than glorified dashboards.

Does that mean you develop more individual solutions for every customer instead of a one size fits all product?

Inspections are generally standardised over entire industries. This is nice for the companies because it means inspectors can be trained anywhere, and directly hired to perform their industry standard inspection methods.

What I meant was that we had to actually hire or partner with industry inspectors, so we could sell a package deal to the customer, we had to go outside of our comfort zone and sell inspections, when all we really wanted was to sell data.

Real story. I worked on nudity detection for a mobile realtime video chat app with an on-device deep learning model. (It was successful.) Should be easy: label some data, start from a good network pretrained on ImageNet (we used SqueezeNet), fine tune. The problem was that ImageNet is photos, not video frames, and the distribution of video frames from mobile video is somewhat different. An incredibly large proportion of frames are either blank (camera is blocked) or ceiling (camera is misdirected). We ended up adding lots of data for blank and ceiling, explicitly labeled as blank and ceiling. It became an excellent ceiling detector, detecting ceilings it had never seen before.

Is there a mobile video frames dataset for labeling ceilings? Why not? I can't believe I am the only one who experienced this. Why isn't a ceiling dataset worthy of, say, CVPR? It would improve mobile video on-device deep learning more than most CVPR papers. This is a serious problem.

Edit: I understand why niche datasets remain in industry and not academia. Public datasets are better if they are of general interest. But a ceiling dataset is of general interest to anyone who wants to process video frames originating from a blocked or misdirected camera, aka smartphones, and it's hard to imagine topics of more general interest once you have obvious things like faces.

Little crafty trick: grab the flickr dataset (yfcc100m) and snag photos tagged as whatever you want. It's like a janky google image search. I've put together datasets of airplanes, bicycles, dogs, etc this way.

It's not entirely accurate, but it's good enough. Within a few hours you can have a pretty large dataset of whatever you want, really. (Yay for massive dataset plus user tags.)

You can snag a copy from my server, if you want. (Warning: It's a direct link to a 54GB json file.) https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json

I have a script called "janky-image-search" that returns random results from searching this dataset. Here are a few random hits for `janky-image-search ceiling`:






The quality is hit or miss, but it seems better than 70%.

EDIT: Here's a gallery of the first 100 hits for "ceiling": https://cdn.gather.town/storage.googleapis.com/gather-town.a...

But as you can see, it's not effortless. Most of those ceilings are old. So it depends what you want. It's why labeled data is worth billions of dollars (scale.ai et al).

Super neat!

(Small detail: the first 100 hits for ceilings are all _interesting_ ceilings? Nothing like the kind of ceilings usually encountered on a video chat app?)

Why would anyone post something boring on a photography site? There's a bar for quality on flickr; it's not twitter.

Yeah, flickr might not be a good data set to train for boring images. Maybe we need a high quality repository of bad photos.

Of course, the point is that it may not be good data for training a model.

I was clarifying, since the above poster was using question marks.

I found that people don't upload misdirected ceiling photos and videos to YouTube and Flickr. How unfortunate. Yes, it's better than nothing.

Now you've made me want to create a random-ceilings dataset. I spent about an hour trying to think of some way to get you a bunch of un-curated ceilings, to no avail.

If you ended up using the ceilings uploaded to your app directly from your users, it might be possible that there wasn't any better solution than the one you came up with.

Perhaps if there's a database of those 360 photo globes you could crop out the upward-pointing part?

That dataset will work well for detecting nudity in Zoom calls from the Sistine Chapel :)

Before I download 54gb from your server, can you tell me the structure of the json?

What is a good tool to search it without 64gb of free memory?

My janky-image-search script looks like this:

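  tag="$1"   # first argument is the tag to search for
  shift 1    # remaining arguments are passed through to grep
  curl -fsSL https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json | jq -c '{user_tags, machine_tags, description, title, item_download_url, item_url}' | grep -Ei "\"($tag)\"" "$@"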
Basically, it trades bandwidth for memory. And hetzner has free bandwidth.

Here's `curl -fsSL https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json | head -n 1`, which should give you all the info about the structure. Or most of it.

  {"date_taken": "2013-08-10 11:05:54.0", "item_farm_identifier": 3, "item_url": "http://www.flickr.com/photos/9315487@N04/9497940823/", "user_tags": [], "title": "IMG_7216", "item_extension_original": "jpg", "latitude": null, "user_nickname": "meowelk", "date_uploaded": 1376370010, "accuracy": null, "machine_tags": [], "item_server_identifier": 2856, "description": null, "item_secret_original": "15f4257122", "user_nsid": "9315487@N04", "license_name": "Attribution-NonCommercial-ShareAlike License", "capture_device": "Canon EOS 7D", "item_marker": false, "item_id": 9497940823, "longitude": null, "item_download_url": "http://farm3.staticflickr.com/2856/9497940823_0c0d854111.jpg", "license_url": "http://creativecommons.org/licenses/by-nc-sa/2.0/", "item_secret": "0c0d854111"}
I just dump those to disk and then use `jq` to extract the item_download_url.
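If you'd rather not shell out to `jq`, the same streaming idea can be sketched in Python. This is only a sketch: it assumes the file is one JSON object per line (as the `head -n 1` output suggests) and that `user_tags` is a list of strings. The function name `iter_tag_matches` and the two sample records are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_tag_matches(lines: Iterable[str], tag: str) -> Iterator[str]:
    """Yield item_download_url for records whose user_tags contain `tag`.

    Assumes one JSON object per line and that user_tags is a list of
    strings; both are assumptions about the file, not verified facts.
    """
    wanted = tag.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        tags = [str(t).lower() for t in (record.get("user_tags") or [])]
        if wanted in tags and record.get("item_download_url"):
            yield record["item_download_url"]


# Made-up sample records standing in for the real 54GB file.
sample_lines = [
    '{"user_tags": ["ceiling", "metro"], "item_download_url": "http://example.com/a.jpg"}',
    '{"user_tags": [], "item_download_url": "http://example.com/b.jpg"}',
]
urls = list(iter_tag_matches(sample_lines, "ceiling"))
```

Because it only ever holds one line at a time, you can pass it an open file handle or a streamed HTTP response instead of `sample_lines` and memory use stays flat.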

FWIW, the full ceiling search finished with 9,325 results. Here's the first 1k.


Thanks for the information. So the 54 gigabytes is just plain text JSON? How big is the full image dataset?

Good question. It's 100m images, but I haven't tried to find how large all of it is.

We downloaded a subset of 3 million images, which is apparently 379.77 GiB. So a linear extrapolation would be (100/3*379.77) = ~12,659 GiB for the full 100m images.
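For anyone who wants to redo that back-of-the-envelope estimate with different numbers, here it is as a few lines of Python. The only inputs are the figures quoted above; the big assumption is that the 3 million image subset has the same average file size as the full set.

```python
SUBSET_IMAGES = 3_000_000      # images actually downloaded
SUBSET_GIB = 379.77            # size of that subset on disk
TOTAL_IMAGES = 100_000_000     # images in the full dataset


def extrapolate_gib(subset_gib: float, subset_images: int, total_images: int) -> float:
    """Linear extrapolation: assumes the subset's average image size holds overall."""
    return subset_gib * total_images / subset_images


full_gib = extrapolate_gib(SUBSET_GIB, SUBSET_IMAGES, TOTAL_IMAGES)
full_tib = full_gib / 1024  # GiB -> TiB
print(f"{full_gib:,.0f} GiB, about {full_tib:.1f} TiB")
```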

12TB really isn't too bad. It's massive, yes, but imagenet 21k is 1.2TB.

Wow I immediately recognized that first one as the ceiling of a Washington DC Metro station.

I don’t need it, but that’s really awesome of you!

Because fewer PhDs are minted by "Improved on SotA ceiling detection rates by 50%."

Maybe not a PhD, but as a research topic, I think doing this well is an interesting task. I learned that ceilings are not featureless. (I know, it's kind of ridiculous to say I "learned" this, since I could have learned this from, oh well, looking at my ceiling just now, but it really was a surprise when I encountered this in a machine learning data context.)

Ceilings often have lighting! Lighting (either on or off) varies between geographies and cultures! (It does! Extreme variations!) Lighting, if on, will cause flares! We just sampled from the actual distribution (the privacy policy stated that we can monitor connections for anti-abuse and that we can use data for research and development of anti-abuse), but a dataset collected with care would be valuable. I am sure actually doing this will reveal other considerations, and even just with the points I mentioned, it should be good enough for a paper, if not for a PhD.

That sounds like the problem in a nutshell. "X is simple, so just gin me up a data set of it." (Four months later) "X is not simple, in all these ways..." Which is another way of saying that discovery and communication with SMEs is an important side skill in project success.

What would be the novel scientific contribution over any other image classifier? What's the storyline? How do I get excited? You collected a dataset for some boring class and trained a standard sota model on it. This is an undergrad project report, not a research paper.

The main question people ask when considering reading a paper is "So what??"

Not an ML person, but I generally understand the concept. Why do you care about ceilings here? I get that the camera is pointed weird ways, but if you're trying to detect nudity presumably if trained on a bunch of images of that subject, it would just call ceilings False.

No, because the training set does not include ceilings.

I think I understand, so you basically need in your training set anything that the production model would presumably see for it to work well? You can't just say here are a bunch of positives and a bunch of negatives, the negatives have to be actual things the model will see.

If you think about it, this can't be otherwise. Positive/negative is misleading, since that distinction is the entire thing we are trying to teach. Let's call two labels A and B. You tell the model:

This long penis is A. This short penis is A. This cat is B. This dog is B. Now, what is this ceiling?

The model, looking at the ceiling, discovers a long fluorescent tube. It is long! Neither cat nor dog is long (The model is yet to discover longcat), and while penis comes in long and short variety, all penises seem long-ish. The ceiling is A.

In the typical ML task formulation, closed-set classification, the model is forced to make a choice between the provided categories, even if the output is not really either. And neural networks tend to perform erratically outside of the training data distribution.
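To make the "forced to choose" part concrete, here is a tiny sketch of what a closed-set classifier head does. The logits are invented numbers standing in for whatever a network might output for a ceiling; the point is only that softmax over two labels always sums to 1, so there is no way for it to say "neither".

```python
import math

LABELS = ["A", "B"]


def softmax(logits):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Closed-set decision: always returns one of LABELS, never 'neither'."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits for an input unlike anything in training (a ceiling).
label, confidence = classify([2.0, -1.0])
print(label, round(confidence, 3))
```

The model reports a confident "A" even though the honest answer is "I have never seen anything like this".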

Adding common inputs to the training (or at least validation and test) sets is a good solution. It's hard data work, but will pay off. There are some techniques outside of closed-set classification that can help reduce the problems, or make the process of improving it more effective:

- Couple the classifier with an out-of-distribution (novelty/anomaly) detector. Samples that score high are considered "Unknown" and can be flagged for review.

- Learn a distance metric for "nudity" instead of a classifier, potentially with unsupervised or self-supervised learning (no labels needed). This has a higher chance of doing well on novel examples, but it still needs to be validated/monitored.

- Use a one-class classifier, trained only on positive samples of nudity. This has the disadvantage that novel nudity is very likely to be classified as "not nudity", which could be an issue.
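A minimal sketch of the first option, assuming you already have feature vectors (embeddings) from some network. It uses distance to the nearest class centroid as the novelty score, which is a deliberately simple stand-in and not a recommendation of a specific OOD method. `NoveltyGate`, the 2-D features and the threshold are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class NoveltyGate:
    """Nearest-centroid classifier that answers 'unknown' for far-away inputs."""

    def __init__(self, features_by_label, threshold):
        self.centroids = {
            label: centroid(vectors) for label, vectors in features_by_label.items()
        }
        self.threshold = threshold  # would be tuned on validation data

    def predict(self, feature):
        label, dist = min(
            ((lab, distance(feature, c)) for lab, c in self.centroids.items()),
            key=lambda pair: pair[1],
        )
        return label if dist <= self.threshold else "unknown"


# Toy 2-D "embeddings"; real ones would come from the network.
train = {
    "nudity": [[1.0, 0.9], [0.9, 1.1]],
    "not_nudity": [[-1.0, -1.0], [-1.1, -0.9]],
}
gate = NoveltyGate(train, threshold=1.0)
print(gate.predict([1.0, 1.0]), gate.predict([8.0, -7.0]))
```

Anything that comes back as "unknown" goes to a review queue, which is also a cheap way to find out that your users point their phones at ceilings.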

Yes, with emphasis on "well". It does work without ceilings. But your training set should include everything you will frequently encounter in the test set. Ceilings are too frequent to miss.

If your model supports out-of-distribution detection, you are very correct.

Why didn't you train on random images from the web? If you carefully select your sources, most of them will be non-nudes.

I'm sure there's ways to extract loads of e.g. tiktok videos, especially the non trending / highly produced ones (I'm making assumptions about the use case here).

If you "carefully select your sources" then you aren't training on "random images from the web".

If you are using random images from the web, there is a decent chance that you will get a result like this.


The Guardian labels it sexist. I assumed that it was because the majority of young attractive women on the internet are posting bikini pictures.

> The Guardian labels it sexist. I assumed that it was because the majority of young attractive women on the internet are posting bikini pictures.

They're both true at the same time, which is exactly the point. The people in the article in the paper weren't trying to teach their neural net that men usually wear suits and women usually wear bikinis, but they did. They only noticed because the failure mode was really obvious. In some other situation, the exact same kind of sexist bias might exist in the data -- and important decisions, affecting people's lives, may be made -- without anybody noticing first.

I'm wondering if you could just take some 3d scenes of say flat interiors and then render them in a good quality with camera looking in random directions to use that as training data. You get actually correct labels because you can calculate it when rendering and label accordingly.
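The labeling half of that idea is easy to sketch; the rendering itself is out of scope here. The snippet below samples camera directions uniformly on the sphere and derives a label from the pitch angle alone. The 60 degree cutoff and the three label names are assumptions for illustration; whether a frame really shows only ceiling also depends on the field of view and the scene.

```python
import math
import random


def random_camera_direction(rng):
    """Sample a unit view vector uniformly on the sphere (z axis points up)."""
    z = rng.uniform(-1.0, 1.0)
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(azimuth), r * math.sin(azimuth), z)


def label_for_direction(direction, ceiling_pitch_deg=60.0):
    """Label a view by its pitch; the cutoff is a made-up threshold."""
    z = max(-1.0, min(1.0, direction[2]))
    pitch = math.degrees(math.asin(z))
    if pitch >= ceiling_pitch_deg:
        return "ceiling"
    if pitch <= -ceiling_pitch_deg:
        return "floor"
    return "room"


rng = random.Random(0)  # fixed seed so the sampled poses are reproducible
samples = []
for _ in range(1000):
    direction = random_camera_direction(rng)
    samples.append((direction, label_for_direction(direction)))
```

Each `(direction, label)` pair would then be handed to whatever renderer you use, so the label is known exactly before a single pixel is drawn.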

Yes, this is extremely common and extremely frustrating.

Companies, governments, individuals come to me all the time asking how to "Do AI" without knowing anything about their product, data provenance, testing plans, state observability etc...

Sure, you can do a toy problem with some interesting vision or NLP framework and some off the shelf data that isn't ready for production or anything serious. Who cares? That doesn't do anything novel, it doesn't actually change your product function.

Chomsky said the most important thing to do as a young person is to follow the phrase "Know Thyself."

People who want to do AI need to "Know Thy-data and product architecture" first.

I did a talk last year on exactly this: https://www.youtube.com/watch?v=jK-SBu1iiHo

My 2 cents is that ML oriented job titles are too fine grained, with everyone in theory serving a mythic modeler who just needs to be fed data. Everyone finds the 1% of the time they spend building models the rewarding part. However, many of the job titles, like data engineer or ML engineer, do not include modeling in the job description. There are companies which differentiate scientists into multiple categories based on whether they do ML, can code, or just do statistics.

Now the problem is that any of these folks can get into trouble for doing the others' jobs, and the pay scales are often radically different. This relegates core responsibilities such as data cleaning/collection to support roles.

I'd bet that flattening the title structure in these groups would yield better results, and allow team members to more seamlessly move across Backend, Data, ML, and science oriented tasks.

Some companies don't separate ML roles by title, but that still leads to problems.

Without management-driven separation, too many people try to work on just modeling. You wind up with nobody labeling and cleaning the data. Everyone complains that it's hard to get their models to work on the limited data available, but no one wants to be relegated to the data cleaning role.

I have yet to see a good company layout that makes everyone happy. Too many people want to work on modeling, and the worst part is that individuals are right to fight for it. The more practice you get, the more you improve at the work. Plus, companies tend to value good modelers, so your salary goes up too. It's a very frustrating feedback loop.

Separating modeling work by title or by team is one of the few ways companies can keep from having a run on the modeling projects.

There are a few other ways, but they are all pretty bad too. You can have a "keep what you kill" structure, where everyone gets first dibs on doing modeling on the data they've collected and cleaned. I've also seen a company that makes the modeling job as miserable as possible with slow tooling and high expectations, so lots of people wash out and only a few modelers are left.

I wish I knew a good solution to this. My only idea is to play up the importance (and salary!) of the support roles, but I'm not satisfied yet.

As the more senior data analyst/ML person I've found myself taking on that role more and more. I've been 'relegating' the hands on modeling to the more junior members of the team, taking only a more supervisory/mentoring role when it comes to modelling. Quite frankly the modeling job is a lot easier and is more suitable for people fresh out of school, as it aligns more with what they've been studying.

Data collection/cleaning/labeling etc. and generating good data sets that solve the problem we're facing requires a lot more domain knowledge and experience, and quite frankly I'm a lot better and faster at it compared to a freshly minted Masters/PhD student.

I've seen far more ML projects fail because they couldn't get the right data set, and very few fail because they couldn't find the right model. So I'm getting more and more convinced that getting the data set right should be the primary responsibility of the most senior member(s) of the project team.

There is the bank solution. Working collections is the poster child for a job that will make you hate your life, but it must be done. So the rule is that all jobs must be posted to internal transfers for two weeks, and if someone qualified applies you can't look for someone outside the company. As a result it is nearly impossible to get a job without first working a collections job for 9 months. Collections hires basically anyone and helps them move on to a better position after they have done their time. Everyone else knows you don't hire anyone who hasn't done their time in collections, as it is the only way that needed job can be done.

Data is not nearly as evil - hate your whole life - as collections, so I hope that we don't have to stoop to this level. It is an option though.

what is collections in this case?

Calling people who aren't paying their mortgage and threatening to foreclose if they don't make their payments. Or similarly, calling those who don't pay their credit cards. Generally people who are in this situation are having a lot of trouble in life.

This is a common management dilemma when there are specific tasks that are highly valuable in future engineering careers but have limited application within the current organization. Or when key "promotion worthy" projects come around and individuals are strongly incentivised to have the "core" contribution.

The best approach I've found is to reward based on impact to multiple individuals, and to some extent modulate it by rewarding individuals who "run towards fires". A straight scientist who makes the best datasets in a scalable fashion will thus have high impact on a painful part of the business that no one else will work on.

Like, this blows my mind. I can't even imagine a world where one would model without data cleaning.

In the olden times, your model would normally error out if you hadn't done any data cleaning (length 1 factors, constant columns etc), but I suppose NNs just ignore stuff like that because of the regularisation.
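Since a neural net won't complain about it for you, a check like this is cheap to run yourself. It's a plain-Python sketch over a list of row dicts; `constant_columns` and the sample rows are made up, and it assumes the values are hashable.

```python
def constant_columns(rows):
    """Return the names of columns that take a single distinct value.

    `rows` is a list of dicts mapping column name to a hashable value.
    """
    if not rows:
        return []
    seen = {}
    for row in rows:
        for name, value in row.items():
            seen.setdefault(name, set()).add(value)
    return sorted(name for name, values in seen.items() if len(values) <= 1)


# Made-up rows: "country" never varies, so it carries no information.
rows = [
    {"age": 31, "country": "NL", "plan": "free"},
    {"age": 45, "country": "NL", "plan": "paid"},
    {"age": 27, "country": "NL", "plan": "free"},
]
flagged = constant_columns(rows)
print(flagged)
```

A constant column is the tabular cousin of a length 1 factor: classical model fitting would refuse it, while a deep model will quietly carry it along.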

Every good DS I know regards the data as much more important than the models, normally from bitter, bitter experience. Models are "sexy" though, so people love them. Additionally you can do the XKCD compiling thing x 10 with models, as they take forever to fit on any non-trivial datasets.

I actually think that the cause of this is poor management. ML/DS/statistics are tools; they do not supply any business value by themselves. If people focus too much on the tools, then their manager should refocus them on the actual problem.

Mind you, if you don't have the background in this area as a manager, you're less likely to be able to spot these pathologies.

Personally, as an experienced DS, I know that my time is best spent understanding the data and the business problem, as that will drive better results. It's sad that people don't realise this (including some of my current management chain, unfortunately).

This is interesting because one of the criticisms of data science has been that it's a term so broad that two people can have the title data scientist and do completely different jobs. Hence the differentiation over time into research scientists, ML engineers etc.

Personally I might prefer the generalist hat but I see things moving in the opposite direction.

There is also the problem of hiring - if 1-10% of the job requires a lot of statistical theory/research and the rest doesn't, do you hire PhDs and ask them to mostly do the engineering tasks that they aren't necessarily prepared to do? What about the people who are perfectly capable of doing 99% of the job and picking up the extra 1% of theory - do you exclude them because they don't have the academic credentials?

I would take the approach that there is a spectrum of knowledge and hire the so-called "T-shaped" individual. Strong in at least one area, but broad enough to contribute anywhere.

People change faster than orgs/titles, it would be better to keep this part flexible to maximize their growth. This can be further extended to the whole engineering culture.

It's not uncommon that a frontend engineer shares a title with a backend software engineer - these skills are different enough that the individuals are no longer fungible for the most part. Why not give DS's engineering titles and treat it as a management objective to ensure you have the right people on any given project?

If scientist is the preferred title, why not give anyone the science title?

I think it's a problem of visibility up the management ladder: (A) whether management understands ML enough to launch a feasible project, (B) whether management understands that models are a component of ML, (C) whether management understands that models are created from data, (D) whether management understands that production models must continually be fed data.

Typically, understanding seems to stop somewhere between A & B.

Maybe they could try actually hiring those of us who have PhDs in other fields who are actually enthusiastic about generating large quantities of reproducible and transferable data.

I hear people say all the time how this is what they want, yet for some reason if you aren't already published in machine learning no one wants to take a risk. Instead it's easier to complain about poorly formatted datasets.

Don't go for a research job. Your actual data skills will be super valuable to a company that's trying to increase revenue/decrease costs. Bonus if you did an individual PhD, as you'll likely be able to organise your workload better.

My skills are strongly in the direction of "design, build, and operate a test lab that will be used to automate collection of high quality datasets relevant to the problem of interest".

Whenever I've tried to actually use existing data to attack the problems I'm most interested in the fundamental problem is that data that is comparable between different labs doesn't exist.

And to avoid any confusion, I'm not saying it doesn't replicate, just that it's not easily comparable, like if you want to know what platinum catalysts will do over an alkene but the data you can find was done in a batch reactor with 22nm platinum and you're using a continuous flow reactor with 10nm platinum. And it's usually worse than that once you include how it's supported, what flow rates were, heating profiles, etc.

It just goes on and on if you don't do it yourself.

> My skills are strongly in the direction of "design, build, and operate a test lab that will be used to automate collection of high quality datasets relevant to the problem of interest".

Those sound like incredibly useful skills in any business which deals with data (which is most of them). I'm partial to consumer tech (psychology background), but there are opportunities for people with these skill sets.

what's an individual PhD?

One where you don't get handed a topic and a lab by your advisor.

> Yes, this is extremely common and extremely frustrating.

Particularly when you can raise billions by being confidently wrong.

Probably stems from the foundational misnomer of “artificial intelligence” (which glamorizes the algorithm) compared with, say “data-driven algorithms” which emphasizes both aspects suitably.

AI might have been a passable term in the era of rule-based expert systems, but both “AI” and “ML” are particularly misleading terms (when applied to modern data-heavy methods), leading to consequences like that pointed out in the article.

Further, glossing over the importance of data to emphasize the algorithms is not accidental, given the subject’s roots in CS, and the rhetoric used to promote the field. The sub fields where it plays out differently are those which have a healthy respect for the prices of data gathering.

I've also heard of "matrix networks" as an alternative to "deep learning". Immediately becomes more clear and less hyped.

What doesn’t help is that these techniques are increasingly taught in undergrad programs and short term data science experiences.

More content, more tools, no time for understanding of the full scope of analysis.

Statistical education has called for the use of real data for a long time and it is still woefully absent in undergrad statistics and probability courses. Students are given clean/cleanish data so they can be evaluated on the correct application of tools, not on a full stack process of data sense making. When I took over our stats course I took out the coding (instructors were arguing over R vs Matlab) and put in Excel. Excel opens their eyes because it can’t make math invisible like R can...they can’t obfuscate their mistakes in abstractions.

It’s hard because doing that is at the edge developmentally of young adults...but it’s like giving them a Ferrari as a present for their drivers licenses when we teach this stuff. If experts can’t do it right, think about the implications of teaching them to 19 year olds.

But will taking away abstractions address what the paper is talking about?

If we can abstract away the models and training, I see an opportunity to focus more on the dataset. If you look at something like PyTorch Lightning (and I think fast.ai but am not familiar) you can deploy very powerful, and proven, models without having to get into the details. What might still be missing are more tools to work with the dataset - easily varying size, composition, using active learning, etc. But it's all supported by more abstraction.

(I'm a huge believer in the importance of understanding things from first principles, so I agree with making sure students develop their skills that way. But when it comes to practical work with datasets, abstraction is very important)

At a professional level not directly, but I believe (opinion statement) starting with the abstractions and the abstractions being the focus of all the coursework you typically take contributes to the problem over the long haul.

I'm a firm believer Blockpad (https://blockpad.net) can be a powerful Excel-like tool in these environments, especially with engineering data.

Wow, never seen this before, it looks awesome but the pricing is a bit steep when I compare it to the greater power (but less purpose driven focus) that I have with CoCalc (https://cocalc.com/) giving me nearly instant, cloud hosted Jupyter notebooks wherever I want them.

I'm sure they're open to feedback!

Woah...was not familiar with this...

I can tell you Tulane uses their stuff on an education license.

Well, why do they want to? Is it their fault? People follow incentives.

Infrastructure work is low status, janitor-adjacent or like the guy who fixes broken toilets. Academic, abstract, scientific, work is less immediate, more rhetoric-focused, more open ended, similar to lawyers and politicians and is about selling ideas. Shoveling data is like shoveling dirt, while academic abstract modeling is like an architect drawing sketches of a palace.

Like it or not, infrastructure work is thankless. Complaining is useless, either accept it or switch jobs.

This is some of the reason, but I've noticed the same behavior in data scientists who are "vertically integrated" and who are only measured on the quality of the final model.

In this case, their incentives are aligned with good data work, but they still suck at it and neglect it.

I believe there's a few psychological reasons:

- People aren't trained to do it in college, so it's undervalued.

- It requires domain knowledge, which is hard.

- The work is quite gruelling and annoying. It's very detail oriented.

- You don't feel like you're making progress while you're doing data work. Modelling work feels closer to the final output.

- Self-delusion during modelling is easy since you can overfit on the holdout data which gives you those dopamine hits.

- Modelling satisfies curiosity. During data work, the data scientist can't wait to hurry up to the modelling stage and see what the results look like.

- Many people are just bad with data work, they don't have the attention to detail required and produce data with glaring errors.
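The holdout point is easy to demonstrate with a toy simulation. Everything here is invented: the labels are coin flips, and each "model" is a random lookup table that has learned nothing. If you try 200 of them and keep the one that scores best on a small holdout set, the winner typically looks clearly better than chance on that holdout while staying near 50% on fresh data.

```python
import random

N_FEATURE_VALUES = 1000


def make_dataset(rng, n):
    """Features are integers; labels are coin flips unrelated to the feature."""
    return [(rng.randrange(N_FEATURE_VALUES), rng.randrange(2)) for _ in range(n)]


def make_random_model(seed):
    """A 'model' that is just a random lookup table: it has learned nothing."""
    rng = random.Random(seed)
    table = [rng.randrange(2) for _ in range(N_FEATURE_VALUES)]
    return lambda x: table[x]


def accuracy(model, dataset):
    """Fraction of examples where the model's output equals the label."""
    return sum(model(x) == y for x, y in dataset) / len(dataset)


data_rng = random.Random(12345)
holdout = make_dataset(data_rng, 100)   # small set we keep peeking at
fresh = make_dataset(data_rng, 1000)    # data the selection never saw

# "Model selection": try many models and keep the best holdout score.
models = [make_random_model(seed) for seed in range(200)]
best = max(models, key=lambda m: accuracy(m, holdout))

best_holdout_acc = accuracy(best, holdout)
fresh_acc = accuracy(best, fresh)
print(f"holdout: {best_holdout_acc:.2f}, fresh: {fresh_acc:.2f}")
```

The gap between the two numbers is pure selection effect, which is exactly the dopamine hit described above.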

Again, this boggles my mind. Data cleaning is 90% of the job. If you don't like this, then you should not be working in ML/DS.

I actually enjoy data cleaning, sure it's annoying sometimes but the insight you gain actually allows you to build a much better model.

And, to counterpoint one of your points, I would note that part of my (psychology) degree was collecting data from our friends for various surveys and experiments. We'd then add everyone's data together and run analyses on this.

It's really crazy to me that stats/ML/everyone else didn't do any of this, as it certainly led to me being a much better data scientist.

You have the right practical mindset, and it might have something to do with your degree choice. People that pick psychology aren't going to have their heads in the clouds and aren't doing DS for prestige reasons - DS is more a means to an end instead of an end itself. You're also trained only in fairly simple, mostly linear models, which helps.

There's an even more extreme type to the one I outlined. This type of DS dislikes data engineering and dislikes applied modelling.

They prefer to sit in an armchair and think about modelling, writing up a LaTeX document with abstract and overcomplicated math formulas.

I met one DS like this who was from a theoretical physics background and his years of operating in this mode couldn't be shaken off.

Just as an FYI, DS didn't exist when I did my undergrad, and Facebook were the only people with DS during my PhD.

I got into DS because I really enjoy analysing data and running experiments, and as a backup to academia.

Incentives. Do you have skin in the game? Are you building your cv and career and brand or are you building the product? Do you personally benefit more from 10% improvement due to data cleaning or 1% improvement from some fancy tweak of the model which you can hold talks on and write papers about, with math formulas etc?

No, but I like making/saving money, doing analysis, writing code and talking about how to do all of those things better.

Especially the making money part, as that's ultimately why my employer/clients pay me.

Not to mention it pays far less. I've interviewed for data jockey positions on AI/ML teams that were paying about $85k. The people building the models were making double or triple that.

This is the real issue. It's considered less skilled, so by supply and demand more people can do it and it pays less.

I think this is probably only because it's harder to assess skill in it, though. It's easy to test for knowing the vagaries of the ADAM solver or statistics or whatever. What is the equivalent for being a good data cleaner, though?

That’s a long paper to write given it’s all based on a survey of 50 people.

Nevertheless the point is true, and importantly it’s probably caused by “AI engineers” who are actually pretty mediocre and were just attracted to this job for the wrong incentives. If interviews for DS and AI focus on recruiting people for not just theoretical middling knowledge but also the drive to solve “the real problem” that will let them tackle the project in a truly fundamental way, the majority of issues pointed in this paper will probably go away.

This is a qualitative, not quantitative study. The goal of qualitative research (in cases like this) is, essentially, finding categories, not the relative distribution of the population in those categories. I've seen well-executed qualitative studies with fewer than 10 participants; 53 is actually on the higher end for this approach.

You usually try to select for a pretty diverse cohort (which the authors did in this case), not a representative one.

A diverse cohort means you'll probably have all categories that matter. A representative cohort means you'll have a distribution over these categories that is pretty similar to the real world.

The result here is essentially displayed in Figure 1. One could now follow up with a quantitative design to e.g. discern which particular step is the most common obstacle.

I wanted to point out the same thing, I'm counting six authors on this piece of science complaining there aren't any doers around. At this point we have reached peak papers, there is more truth to be found in a HN thread and that's saying a lot.

Agree, it lost some credibility for me as well. I immediately search for methodology before reading anything nowadays.

> We presented a qualitative study of data practices and challenges among 53 AI practitioners in India, East and West African countries, and the US, working on cutting-edge, high-stakes...

I'd rather read a well-written opinion article and hear more about what those 53 people have to say.

Are you saying that quantitative studies > qualitative studies?

Or are you saying that their qualitative methodology is lacking? If so, what would be a better way of approaching a topic like this?

I'm actually going to take the counter-side of this argument. It's actually very important for people to do the model work first. Yes, you need some amount of decent data, but the first step of a project should always be building a prototype to answer the question "Can we fundamentally do this". It's perfectly reasonable in that case to do a small amount of work picking and cleaning up some data and then really focusing on building the model. Once you've got your model then you can answer the best case scenario of whether the problem is tractable.

Once you've got the model then you need to work on operationalising it, and that's where work on your data pipeline becomes important. It's always going to be the case though that building your initial prototype is more fun/interesting, and probably 1000x easier, than getting your prototype into production. And it's important to remember that operationalizing your model is probably going to involve working on your model, retraining it and refining it to work with your new production environment.

The authors have tried to dress their waterfall up as "cascades" but the truth is that they've defined the development process as a waterfall and are trying to fix that rather than adopting modern development practices.

I agree we need to start from an MVP, and so with a data model, but does this model really need to be AI or ML? My counter point to your counter point is yes, we need a model, but probably the initial one can be done with simple BI/SQL type aggregation methods. Most of the AI/ML I see is optimizing an already existing model, not creating something completely new.

A few years back the team that do "Business analysis" came to our team asking for help setting up some machine learning. What for, we asked. Well, we'd like to predict hardware expenditures over the next n years.

Ok. Well that sure seems like a thing the business might like an estimate of! Sure. Next question: Have you run a basic regression over the data you've got? Maybe a simple model would be close enough to throw some error bars up and get going.

Of course not.

You see "we'd like to predict hardware expenditures over the next n years" for me is a red flag. When I hear those types of requests I hear "we'd like to shift responsibility and ownership of our financial estimates to someone else".
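For what "a basic regression with some error bars" could look like in practice, here is a minimal Python sketch. The yearly spend figures are made up purely for illustration, the function names are hypothetical, and the band uses a rough multiplier of 2 standard errors rather than an exact t-quantile, so treat it as a first sanity check, not a forecast anyone should sign off on.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # n - 2 degrees of freedom because two parameters were estimated
    resid_std = math.sqrt(sse / (n - 2))
    return intercept, slope, resid_std, mean_x, sxx


def forecast(xs, ys, x_new, z=2.0):
    """Point forecast plus a rough +/- z standard-error band for a new observation."""
    intercept, slope, resid_std, mean_x, sxx = fit_line(xs, ys)
    n = len(xs)
    se = resid_std * math.sqrt(1 + 1 / n + (x_new - mean_x) ** 2 / sxx)
    point = intercept + slope * x_new
    return point - z * se, point, point + z * se


# Made-up yearly hardware spend in $k, for illustration only
years = [2016, 2017, 2018, 2019, 2020]
spend = [100.0, 120.0, 135.0, 160.0, 175.0]

low, point, high = forecast(years, spend, 2021)
print(f"2021 estimate: {point:.1f} (roughly {low:.1f} to {high:.1f})")
```

If a band like this is already tight enough for budgeting purposes, there may be no need for machine learning at all; if it is hopelessly wide, that is useful to know before anyone builds a pipeline.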

I'd suggest using a lasso, tbh. It's pretty fast, relatively performant and one can interpret the coefficients.

That being said, I'd do your simple SQL aggregations first, to ensure that one can actually learn something (if we have 0.0001% positives, a simple approach probably isn't going to work well).
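As a rough illustration of the "simple SQL aggregations first" step, here is a Python sketch using the standard-library sqlite3 module. The `events` table, its `segment` and `label` columns, the toy rows, and the `MIN_POSITIVES` threshold are all hypothetical; the point is only to count positives per group before deciding whether any model has enough signal to learn from.

```python
import sqlite3

# Hypothetical threshold: below this many positives, don't bother modelling yet
MIN_POSITIVES = 5

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (segment TEXT, label INTEGER)")

# Toy data: segment "a" has 5 positives out of 100, segment "b" has none
rows = [("a", 0)] * 95 + [("a", 1)] * 5 + [("b", 0)] * 100
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)


def positive_rates(connection):
    """Row count, positive count and positive rate per segment."""
    query = """
        SELECT segment,
               COUNT(*)   AS n,
               SUM(label) AS positives,
               AVG(label) AS rate
        FROM events
        GROUP BY segment
        ORDER BY segment
    """
    return connection.execute(query).fetchall()


rates = positive_rates(conn)
for segment, n, positives, rate in rates:
    verdict = "ok to try a simple model" if positives >= MIN_POSITIVES else "too few positives"
    print(f"{segment}: n={n}, positives={positives}, rate={rate:.4f} -> {verdict}")
```

Only once a check like this shows enough positives would it make sense to move on to something like the lasso suggested above.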

I don't know. How can you gauge the effectiveness of your model without good data? A model can be wrong, but it can also be changed/redone/rebuilt. Without data you can neither develop nor verify imho.

It gives you a working MVP and a lower bound on what's possible. Then you can iterate either on model or data or both - data can also be further cleaned, changed and expanded. How can you gauge ROI of your data cleaning/collection efforts without it? Maybe an extra day you've put into it made a difference, maybe it didn't and you'd have better spent it on modelling or other tasks.

I work in an ML Engineering team, and it has almost become a meme for me to demystify ML to prospective candidates.

First of all I have to convey the idea that we are just engineers, so the vast majority of our job is "boring" engineering stuff. Write, maintain and upgrade code, fix bugs, scale our systems... other than the domain, you won't see much difference.

What people usually see as the "cool" part of the job is often done by scientists. There is an interface between teams where we get the result of science and need to get it "production ready", but that might be around 10% of the job.

It reminds me of this [0]:

"Before March 2020, the country had no shortage of pandemic-preparation plans. Many stressed the importance of data-driven decision making. Yet these plans largely assumed that detailed and reliable data would simply . . . exist. They were less concerned with how those data would actually be made."

[0] https://statmodeling.stat.columbia.edu/2021/03/21/whassup-wi...

> Under-valuing of data work is common to all of AI development.

I don’t think that’s true in bioinformatics. For example, people highly respect the laboratory work done to collect data on protein function.

Exactly, in fact it’s the opposite, if anything. I’ve seen plenty of life sciences companies where the bioinformatics team was added on long after founding the company, almost as an afterthought.

I barely skimmed this so criticism is unfair, but... it could be a criticism of many "working papers."

There's a type of euphemization that's been irking me in recent years: The vague everyone. No one ever talks about X.

"Data largely determines performance, fairness, robustness, safety, and scalability of AI systems [44, 81]. Paradoxically, for AI researchers and developers, data is often the least incentivized aspect, viewed as ‘operational’ relative to the lionized work of building novel models and algorithms"

Do they mean that projects (at Google? in society?) under-resource "data work"? Do people get paid less? Do they mean incentives in academia? Do they mean that companies specialising in these areas are less successful? That universities have priorities wrong?

Just for the meander... at the top level, the economy does seem to be valuing data work highly. Adwords & Amazon marketplace antitrust cases suggest that "data work" is at the core of these mega businesses. TSLA's current valuation is partially based on Tesla's exclusive dataset. In at least some very substantial examples, data work is very highly valued while models are seen as a temporary advantage... i.e. it looks like the market (in some industries) expects models/algorithms to be commoditized while the datasets are expected to be valuable intellectual property.

The authors do really have some insights here. A lot of "AI project" post mortems do find "data work" was underestimated. I guess that means it's underappreciated by definition. But... I think it would have gone someplace more concrete if they had started someplace more concrete.

> at the top level, the economy does seem to be valuing data work highly. Adwords & Amazon marketplace antitrust cases suggest that "data work" is at the core of these mega businesses

What is valuable about adwords isn't the "data work", right? It's their pricing power in the advertising space...

I cleaned data and made the models for MD and PhD researchers. Their work lies more in the thought sphere, they analyze all the work I do, then mold and shape it week to week until it's publishable. They obviously make the basic ideas behind what I should pull and what kind of models to use, and get the grants and funding for my salary so, they are pivotal even if they're not doing the hands on data work.

I suppose skillful AI/ML people like to think that great AI/ML is all about their skills and the models they produce.

Someone else might think that the true value is in the data, while AI/ML skills can be bought for 120$ per hour :)

Perhaps one day we'll have tools that will figure out the right AI model for you given data, but we won't have tools that can as easily come up with the data.

Suppose a start-up stage company that is just getting started on its AI/ML journey has budget for 3 FTEs. The company already has traditional ETL/BI expertise and a "DWH". Who would you hire (data scientist, ML engineer, data engineer) and how would you allocate the division of responsibilities?

Is it because data work doesn't need a degree and the degree holders don't want to risk their career by doing something that is off topic to their degree?

It's the other way around. Data work does require understanding of the domain, and degree holders don't want to risk their career for doing something that is off topic to their degree, that is, learning the domain.

I am just pointing out that your statement has another side, too. It's kinda like with programming languages: people think that a particular programming language, rather than domain details, is the transferable skill for a SW developer. In data science, it's the statistical modelling theory that is considered to be the transferable skill, rather than the business domain.

But you always need both, and there is no particular reason why, in hiring, one would prefer knowledge of the tooling to knowledge of the domain. It just happens to be so, as a kind of spontaneous symmetry breaking.

For a strong engineer today, the most valuable/transferable skills are ability to learn new things (in both tech and domain) and effective problem-solving (in different settings). Such an engineer can learn the most important aspects of statistical modeling needed for their problem at hand.

The kind of researchers who are averse to "getting their hands dirty" with data tasks will have a hard time staying relevant in any production setting. Unless the company has a dedicated research department which is a million miles away from product teams, they are going to have a tough time being impactful.

It is similar to an article I read a while back saying we need "data engineers" not "data scientists". I think it's generally true, which is why my company, while it can do data science consulting, is choosing to focus on data engineering consulting first.

I've always been annoyed by the title "data scientist" because I don't consider their methods worthy of the name science. It seems like the winds are turning and more people are questioning data "science."

But talking about changing the branding to deliver the product that was advertised all along feels like putting the cart before the horse. Sure, maybe it will work. Or maybe in a few years we'll be talking about how we need more "data sleuths" to replace all those haphazard "data engineers" that just didn't have a job title that emphasized solving the problems rigorously enough.

Someone needs to do the work of making that article readable on mobile...

What is modelling in this case? Choosing a sequence of layers and fine tuning a bunch of hyper parameters? Isn't that also kinda boring?

So true. Deep learning is kind of lazy from the point of view of the model builder. Pretty much any other technique - ML models that require feature engineering, graphical models, differential equations, optimization problems, even running simple statistical tests - each of those is more rewarding to work with than DL.

The more interesting aspects of deep learning are often around engineering the whole pipeline and doing it efficiently at large scale.

Which is why being in the business of doing the data work is the supreme choice /data engineer consultant

Appreciation and remuneration as a rule tend to be higher the closer one is to the final customer.

Isn't it typical for data engineers to be paid less (way less?) than the modelers?

Check levels.fyi, for pure SWE/Data/Backend vs. ML, the former often pays more.

Companies don’t pay financial rewards for doing the data plumbing work, yet it’s often much harder, prone to unusual failure domains mixed with high scaling, and carries more stringent on-call and incident triage responsibilities.

Given that it’s (a) more difficult, (b) more business critical, (c) more stressful and requiring an incident alerting on-call rotation, then “data work” should be much better paid and offer job security and career growth.

Yet no company I know of pays expert modelers & researchers less than expert data platform engineers.

So either the companies know something you don’t (e.g. that data platform work is more commodity and easier to replace than rarer modeling talent) or there’s a free lunch you can get by exploiting the arbitrage opportunity to pay data platform experts more and consume correspondingly higher business value that other orgs are missing out on by putting modeler / researcher higher on the status hierarchy than data platform engineer.

My perspective after many years of experience managing machine learning teams (both platform/infra and research/modeling) is that data platforming is just a worse job. It’s unpleasant and stressful and business stakeholders who are removed from backend engineering complexity and just want the report or just want the model couldn’t care less about organizational structures and workflows that support healthier lives for intermediate data platform teams. Because of this, the pay and bonuses for data platform roles should be much higher, but politically speaking it’s impossible to advocate for that, so it becomes a turnover mill where everyone burns out to keep the existing shitty system running, with comparatively low pay and low autonomy, and so nobody ends up wanting to join that team or do that work.
