No worries, fair question. It's worth noting that my job is not data analysis, though I do use data analysis to evaluate our metrics and model performance.
Really, none of it is automatable. I'm working on developing NLP features for our product (question answering, search, neural machine translation, dialog, etc.). Our customer data is diverse and comes in different formats, and their use cases are all distinct. So most of my work is novel applied research and development.
I do assume that the data formats are different (although I also assume that they are all some sort of text file with known fields and types).
But after you set up the dataset definition and define the schema, can the rest be based on neural architecture search?
Moreover, isn't there a state-of-the-art architecture for each task, e.g. Seq2Seq for machine translation? Can't you just use that as a baseline and let the NAS engine search hyperparameters, etc.?
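To be concrete about what I'm imagining, here is a rough sketch of that kind of search loop, using Optuna as a stand-in for the tuning/NAS engine. The objective is a placeholder, not a real training run; in practice it would build the seq2seq model from the sampled hyperparameters, train it, and return a validation score.

```python
import optuna

def train_and_evaluate(params):
    # Toy stand-in for "train a seq2seq model and return validation BLEU".
    # Here it just rewards a mid-sized hidden layer and low dropout.
    return -abs(params["hidden_size"] - 512) / 512 - params["dropout"]

def objective(trial):
    params = {
        "hidden_size": trial.suggest_categorical("hidden_size", [256, 512, 1024]),
        "num_layers": trial.suggest_int("num_layers", 1, 4),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
    }
    return train_and_evaluate(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```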
Happy to talk more offline; my email is in my profile. The short answer is no, because there are more complexities involved, both related to our specific use cases and to natural language in general. If that were the case, NLP would be solved and any company that could be built on it would already exist. From my experience, I'm not sure where the line is between choosing the right model and having the right data, i.e. which of the two solves most of the problem. There have been novel architecture developments like RNNs and LSTMs that have been shown to work well in certain domains, and new architectures appear each year; the space moves very quickly. On the flip side, having petabytes of data (like BERT or OpenAI's GPT) and simpler architectures is also powerful, but prohibitive for everyone who is not Google or a state government. The real answer is probably somewhere in between, and while it's unsolved, there is work for me to do. That being said, our strategic philosophy is to make our AI a commodity so that we can differentiate ourselves on other features.
Most of our problems don't cleanly map to existing NLP tasks, and state of the art often isn't as high as you'd think on many tasks. For example, take machine translation in relation to a beta feature we're building that lets you ask questions of arbitrary single tables (kind of like WikiTables), where we don't know the schemas in advance or the questions the user may ask. Beyond the issue of having quality annotated data (which we often don't, the cold start problem), we need to do more than simple model tuning; it requires building custom architectures.
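To make the table feature concrete: even the very first step, turning an arbitrary table into something a model can read, has to handle schemas we only learn at query time. A toy sketch of that linearization (nothing like our production code):

```python
def linearize_table(table, question):
    """Flatten a table with an unknown schema into plain text a QA model could consume.

    `table` is a list of dicts; the column names are only discovered at query time.
    """
    columns = list(table[0].keys()) if table else []
    header = " | ".join(columns)
    rows = [" | ".join(str(row.get(col, "")) for col in columns) for row in table]
    return f"question: {question}\ncolumns: {header}\nrows:\n" + "\n".join(rows)

# A table we have never seen before, with whatever schema the customer uploaded.
table = [
    {"city": "Lyon", "population": 513275, "country": "France"},
    {"city": "Porto", "population": 231800, "country": "Portugal"},
]
print(linearize_table(table, "Which city has the larger population?"))
```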
But even when you consider known tasks, state-of-the-art models often do not reproduce those results on real-world data. Putting aside data quality issues (which are another huge challenge for us), in the context of question answering the training data rarely captures the distribution of natural language in the wild: people ask questions differently and use language that doesn't match the content in our knowledge base.
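One rough way to see that mismatch is to check how much of the vocabulary in real user questions is even covered by the training questions; a toy sketch with made-up examples:

```python
def vocab(texts):
    # Crude whitespace tokenization is enough for a rough coverage check.
    return {token.lower() for text in texts for token in text.split()}

training_questions = [
    "What is the capital of France?",
    "Who wrote Hamlet?",
]
real_user_questions = [
    "cant find where to change my billing address help",
    "pto carryover rules??",
]

train_vocab = vocab(training_questions)
user_vocab = vocab(real_user_questions)
overlap = len(train_vocab & user_vocab) / len(user_vocab)
print(f"Share of user-question vocabulary seen in training data: {overlap:.0%}")
```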
I could go on, but the short answer is that it's not as straightforward as you think. Even at Google scale, machine learning is not solved. For everyone else, with less data and domain-specific use cases, it's even harder.
Thanks for the answer. I am happy to discuss offline. I did my master's in computational linguistics, which is related to your field, and I am currently creating a new AutoML platform, so I would appreciate your feedback. My goal is to automate the straightforward parts.
As you mentioned, some tasks in NLP, like full conversation, are not solved and will likely never be solved by deep learning alone (at the level of the whole conversation). There should be some sort of symbolic AI or taxonomies/knowledge graphs (like RDF) in combination with deep models.
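To be clear about what I mean by combining them, here is a toy illustration: answer from an RDF graph when the fact is there, and only fall back to a neural model otherwise. The routing rule and the fallback are stubs, not a real system.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
kg = Graph()
kg.add((EX.Paris, EX.capitalOf, EX.France))  # one hand-curated fact

def neural_answer(question):
    # Stub standing in for a deep model; in practice this would be a trained QA system.
    return "I am not sure."

def answer(question):
    # Naive intent check: route this one question shape to the knowledge graph first.
    if "capital of france" in question.lower():
        rows = kg.query(
            "SELECT ?c WHERE { ?c <http://example.org/capitalOf> <http://example.org/France> }"
        )
        for (city,) in rows:
            return city.split("/")[-1]
    return neural_answer(question)

print(answer("What is the capital of France?"))  # answered from the graph: Paris
print(answer("Summarise this contract for me"))  # falls back to the stub model
```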
> But after you set up the dataset definition and define the schema, can the rest be based on neural architecture search?
Sure, but hyperparameter tuning and architecture selection take such an insignificant amount of any competent ML practitioner's time that they're pretty much irrelevant.
At least for me, my time is mostly spent:
1. Understanding (or designing) the process that generated the data.
2. Organizing the training schema.
3. Understanding the customer's business problem so that an appropriate ML system can be designed.
4. Doing an initial design of the ML system based on that understanding and then iteratively designing new components for said system based on customer feedback.
5. Developing or researching how to measure model performance (see the sketch below this list for a toy example).
6. Searching for alternative data sources.
7. Answering customer and stakeholder questions about the ML system.
8. Implementing the ML system in code.
None of these can be automated with current technology, and there's a reason for that: if it were possible to automate a task, our team already would have.
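As a toy example of point 5: even a "standard" measure like exact match plus token-level F1 for question answering has to be implemented and sanity-checked against our own data before the numbers mean anything. This is the generic textbook version, not our internal code:

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and extra whitespace before comparing answers."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Eiffel Tower!", "eiffel tower"))                   # 1.0
print(round(token_f1("in the Eiffel Tower", "Eiffel Tower"), 2))      # 0.67
```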
Sure, provided you have enough data to feed a neural net, the problem is well suited to it, and you don't mind giving up huge chunks of explainability.
I recently replaced a classifier at work that was using a neural net with a decision tree and some hand-chosen features. It performs a bit better, it takes way less time to train, and it's significantly more explainable: my teammates asked why it sometimes misclassifies a certain edge case, and because the features and model properties were so easy to understand, fixing the issue was a couple of hours' work rather than a case of "who knows".
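To illustrate the explainability point, with toy data standing in for the real features, this is roughly the shape of it in scikit-learn; the entire learned model can be printed and read:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy hand-chosen features: [message_length, contains_digits, num_uppercase_words]
X = [
    [12, 0, 0],
    [240, 1, 3],
    [35, 0, 1],
    [310, 1, 5],
    [18, 0, 0],
    [275, 1, 4],
]
y = [0, 1, 0, 1, 0, 1]  # 0 = normal, 1 = flagged

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# The whole model fits on a screen, so "why was this case classified that way?"
# has a direct, inspectable answer.
print(export_text(clf, feature_names=["message_length", "contains_digits", "num_uppercase_words"]))
```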
One of the difficulties is that the broader the scope your optimiser has to push towards a solution, the better your measurements need to be, and having an accurate measure of which thing is "better" can be prohibitively expensive.
The cost of errors varies drastically across domains and use cases, so an important part of the job is understanding how and why different models typically fail and making tradeoffs there.
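For example, one cheap way to make that tradeoff explicit is to score a model against a domain-specific cost matrix instead of raw accuracy (toy numbers below):

```python
import numpy as np

# Rows = true class, columns = predicted class (0 = benign, 1 = needs review).
confusion = np.array([
    [90, 10],   # benign cases: 10 false alarms
    [5, 95],    # cases needing review: 5 missed
])

# Domain-specific costs: missing a case that needed review hurts far more than a false alarm.
cost = np.array([
    [0.0, 1.0],
    [20.0, 0.0],
])

total_cost = float((confusion * cost).sum())
accuracy = np.trace(confusion) / confusion.sum()
print(f"accuracy={accuracy:.2%}, domain cost={total_cost}")
```

Two models with the same accuracy can have very different domain costs, which is exactly the kind of judgment an off-the-shelf optimiser doesn't make for you.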
I have a serious question (not trying to bash):
Can you please describe what part of your job CANNOT be automated?