Take the recent project I'm doing for the data blog I write (https://bigishdata.com), where I'm trying to classify country music songs by topic: scraping lyrics, removing duplicate / incorrect songs, and then doing manual classification for training data is taking far longer than running the ML algorithms will once I've gone through that process.
I've been looking for jobs recently, and I've seen only one job posting that mentions data cleaning as a necessity, whereas the rest only talk about data science and algorithm knowledge, or overall ETL design on the data engineering side. Seems like data set knowledge should be emphasized more.
Edit: The basis of successful implementation of these tools is to have the data in a digestible format, and I feel that transforming the data to that business-usable format is where the big job is.
In my opinion well done ETL and DW are not going anywhere, even though in some circles they are said to be things of yesterday. Then there's a huge difference between an OK ETL/DW and a Brilliant ETL/DW. Designing a good ETL process is as much about business and context knowledge as it is an application of data engineering skills. For example, it requires business knowledge AND data engineering knowledge to determine what kind of granular-level advanced metrics could or should be calculated during ETL. Service level metrics and service level categorization for different kinds of customers/claims/orders/... would be a perfect, simple-to-understand example problem - there could be attributes and value ranges behind multiple relations that probably need to be taken into account and understood.
Edit 2: I've been involved in both sales and execution of so-called data discovery sprints, which are 4-6 week periods where we bring a data engineer, a subject matter expert and client key personnel together and let them go "fishing". The key thing is that this provides a low-cost way for the clients to possibly gain insight on the potential their data could provide. On the other hand, many prospective clients just have such messy data that this data discovery job can't be recommended, which leads to other possible opportunities (MDM, ETL, DW).
It's interesting to talk to different people about data quality and what they think it means, or how they choose to deal with it, and it's all over the place. Some people just mean open and consistent formats, some people have stylistic preferences for data shape, some people talk about accuracy of values, etc etc.
In some ways it's an extension of the thought that the world is inherently noisy, and we've been thinking about that one already, it's just that it turns out you don't need sensor data a la robotics to get noisy data - it's already in the datasets we know and love, and you accumulate more of it, the more sources you pull into your analysis.
Data in a regulatory regime can be excruciatingly difficult, and lend itself to "gut instinct" being used because fear of risk and regulation lock things down too tight to be useful.
I agree with this. It's fine to teach machine learning using the iris dataset, but there is rarely, if ever, a section dedicated to "real" problems. It was a shock to me just how high a percentage of time is spent cleaning data. It is a fundamental skill that is not only underestimated but "undertaught".
Actual data cleaning, usually in an automated sense, is more 'data engineering' than 'data science' or applied statistics. Feature engineering and 'massaging' training data is more related to DS but it's understood that this data being consumed by the DS is already in decent shape.
I think perhaps the problem here is the term science covers a lot of disciplines.
I propose harder stats be data theoretical physics, with data biology and similar referring to cases with harder messy real world complications. I'm sure we can come up with a full spectrum.
I'm a programmer by trade but I use R because the people who actually work with data use it, and they write good tools for it... I think there is some confusion in the programming world about this. Programmers work with data, but they don't do it nearly as much as "professionals".
Tidy data is a good intro if you're not familiar with it:
And I would recommend going through other publications by Wickham, all on his site -- they are quite readable.
Like I recently dealt with finding duplicate song lyrics in my set of 5000 lyrics, and to do that, I just had to google around for StackOverflow answers or random blog posts before I found something that I could adapt and change for what I had.
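For anyone facing the same duplicate-lyrics problem, here is a minimal sketch of one common approach (word shingles compared with Jaccard similarity). It is not what I actually ended up using; the function names, the shingle size of 5 and the 0.8 threshold are all assumptions for illustration.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles of a lowercased, punctuation-free text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(lyrics, threshold=0.8, k=5):
    """lyrics: dict of song id -> text. Returns a list of (id1, id2, score)."""
    sets = {song_id: shingles(text, k) for song_id, text in lyrics.items()}
    ids = sorted(sets)
    pairs = []
    # Naive all-pairs comparison: fine for a few thousand songs, slow beyond that.
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = jaccard(sets[first], sets[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs
```

The pairs it returns are candidates to eyeball, not a verdict: covers, live versions and songs with long shared choruses will need a human decision either way.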
Basically we're trying to get them to pre-emptively do data cleaning, so their logs will actually be useful for potential future data projects.
Multiple legacy systems with no consistent cross reference to unambiguously identify the same customer. Assured that systems have been gone through and all the names made consistent. Consistency for a human is not consistency for a computer. "Commers Ltd" is not the same as "Commers Ltd." And, isn't it lovely when a salesperson decides to add a location to a customer name. Now we have "Commers Ltd Dallas" as a unique customer. Business process discipline is often lacking and will mess you up.
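To make the "consistent for a human is not consistent for a computer" point concrete, here is a rough sketch of a normalized matching key for customer names. The suffix table and function names are hypothetical, and a real system would need a much longer list plus manual review.

```python
import re
from collections import defaultdict

# Hypothetical table of legal suffixes that should be treated as equivalent.
SUFFIXES = {"ltd": "ltd", "limited": "ltd", "inc": "inc", "incorporated": "inc"}


def name_key(name):
    """Build a comparison key: casefold, drop punctuation, unify legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.casefold())
    return " ".join(SUFFIXES.get(token, token) for token in tokens)


def group_by_key(names):
    """Group the raw spellings that collapse to the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)
```

This collapses "Commers Ltd" and "Commers Ltd." into one key, but "Commers Ltd Dallas" still comes out as its own customer. No amount of string cleanup knows that Dallas is a location and not part of the name, which is why this ends up needing a maintained cross-reference and not just a regex.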
Subscription data sources that change their schema with no notification to paying customers. And, when you are scraping data from websites you need to constantly be checking that your scrapers are still working properly. Source websites change regularly.
Crazy processes like entering a negative invoice to indicate a refund to customers but forgetting to zero out the cost of goods related to the invoice. We may have refunded the money but we didn't do the work twice. Arggh! Errors abound.
Good luck modeling data to not need this. I have caught myself actively telling people to do this more than once, and I'm really thinking about replicating some data so those changes are less disrupting.
That "Commers Ltd Dallas" probably has differing billing and delivery addresses, points of contact, invoice formats, customer representatives and preferred sellers, product selection, and probably everything else you have in your DB.
The real problem is when there is chaotic and organic mixing, matching and re-purposing. I've seen it many times with "non-technical" individuals. They don't know what their software can do. E.g. Redmine. So the support individuals just log everything under the same "IssueType". They then "categorize" it using a Category custom field, instead of the standard category which has enumerations. And then they use that Category field to drive reports/process. Instead of using a different IssueType or Tracker, which is what it was designed for, and has tools that help you leverage/manage the complexity of different standardized processes.
Then, they decide to add "Sub-categories" into the category field, instead of using a project-hierarchy or something. Then they want to do billing reports from the time logged per X and of types A,B,C, and at that point it's a giant mess and I stop caring. If they want to not use the software as intended, then do "fixing" by filtering and fiddling with Redmine CSV exports in Excel afterwards, that's their problem. Oh, and they ask that everyone has permissions to everything, allowing all users to change the status of each IssueType as they please, without any process.
I just feel sorry for the poor individual that gets a raw extract of that data and has to use it for something.
I ended up building another table and logic just to do roll-ups and account for name variations. But I told the client that they really need to invest in a serious cross-referencing middleware that tracks identities across all these systems and uses ID numbers to coordinate all the legacy systems.
But yes trying to develop models or even simple aggregations that rely on this kind of data can be quite frustrating.
(The chance for which isn't that big when done "properly".)
Just expecting an effect is a bias towards outliers.
Though, that's what has always appealed to me about truly large data sets, low risk of turning up jack-squat.
For example, some people think data cleaning is "Convert 12-FEB-2012 to 2012-02-12" type problems, and can't believe that such a task would be 80 to 90% of the difficulty in data work (compared to say, learning enough ggplot2 to make a nice chart).
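To be fair, the date-conversion kind of cleaning really is only a few lines. A minimal sketch, assuming the set of input formats is known up front (the format list is my assumption, and month-name parsing depends on the locale):

```python
from datetime import datetime

# Assumed input formats, tried in order. Anything else is reported as unparseable.
KNOWN_FORMATS = ("%d-%b-%Y", "%Y-%m-%d", "%m/%d/%Y")


def to_iso(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Even here the hard part is not the code: something like 02/03/2012 parses fine and still tells you nothing about whether the source meant February or March.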
On the other side of the equation, you have people who want to do a JOIN-GROUP-BY aggregate so they can calculate how much "evil" Wall Street money goes to each political candidate, a la OpenSecrets's calculation, only to find that the FEC does not classify campaign contributions by industry type or company, nor is the "employer" field filled with normalized entries such as "Evil Wall Street Company" that would lend itself to easy GROUP BY calls. For fuck's sake, I've found that executive-level/professor folks can't even spell "Goldman Sachs" and "Berkeley" correctly (even on a typed form).
And that doesn't even scratch the surface of how little this person knows about the data question they purport to answer, or about how the FEC, the American political system, and real life work. Among the data cleaning problems they will have to mitigate are also the 2 hardest problems in computer science (how things are named/classified, and how up-to-date the data is).
I don't have any better ideas at the moment for how to break apart the category of "data cleaning" in a way that reveals the many facets of the problem but also still preserves the interrelatedness of the facets. But it's possible to be very good at some of the parts of data cleaning without knowing the rest.
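As a rough illustration of why the employer field defeats a plain GROUP BY, here is a sketch that maps free-text employers onto a small canonical list with fuzzy matching before summing. The canonical list, the 0.85 cutoff and the row layout are all hypothetical; this is not how OpenSecrets or the FEC do it.

```python
import difflib
from collections import defaultdict

# Hypothetical canonical employer names, already lowercased.
CANONICAL = ["goldman sachs", "uc berkeley"]


def canonical_employer(raw, cutoff=0.85):
    """Map a free-text employer to a canonical name, or return it cleaned if none is close."""
    cleaned = " ".join(raw.casefold().replace(",", " ").replace(".", " ").split())
    match = difflib.get_close_matches(cleaned, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else cleaned


def total_by_employer(rows):
    """rows: iterable of (employer, amount). Returns dict of canonical employer -> total."""
    totals = defaultdict(float)
    for employer, amount in rows:
        totals[canonical_employer(employer)] += amount
    return dict(totals)
```

This catches the misspellings, but it can also merge two genuinely different employers with similar names, and it does nothing about classification by industry. Someone still has to build and maintain the canonical list, which is the naming problem all over again.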
* Implies that the data is deficient/falls short of expectations.
* Implies that the shortcoming currently makes it ineligible to graduate to the next level.
* Implies that with hard work and additional time likely it can be made sufficient though still not ideal.
* Implies that someone failed to help the data to meet expectations.
* Implies that you need special outside expertise, namely someone with the knowledge needed to assess the shortfall, possibly help you clarify your standards, design steps that when followed should result in "good enough" data, and who is able to articulate the remaining weakness(es) which need to be accounted for when assessing future suitability of that dataset for a given purpose.
* Implies that your data will be stuck in school all summer while their friends are out having so much fun.
They live and breath numerical linear algebra and are comfortable reading advanced theoretical books or papers.
It's easy for them to pick up the basics needed to pass interviews and find a data science job. How would they go about adding some rigor to their understanding of ML and statistics?
and this for statistical ML:
One more interesting thing I have observed in data projects failing: the organization's culture around data and the gap between the data science team and engineers. Say you have 2 top-notch data scientists who know enough (stats, Markov chains, algorithms and so on). But let us say an average engineer in the organization doesn't know even a bit about A/B testing or the difference between building a machine learning model and obtaining predictions from an already built model. Then no matter how good your so-called data scientists are, the end result in terms of product or solution delivery is always sub-optimal. If the engineers and data science teams can't speak a common language, the result is always disastrous. Note that the gap is specifically about understanding data analysis as a domain.
The efforts to narrow down this gap must be driven by the lead data science member or CTO. Something like 'data bootcamp' mandatory for every new joinee can help. I had read about Facebook having such a bootcamp mandatory.
In summary, it is important to iterate quickly and to validate your results. With complex models, like gradient boosted decision trees, you can often iterate much more quickly than with simple models because you don't have to do extensive data preparation. Many analysts are stuck in the mode of using linear or logistic regression for every problem, when there are better tools out there.
Also you misspelled "breathe". As in "live and breathe".
This is me. I work for a non-profit that is stuck in the stone age--not for lack of money, mind you, but because the IT Director is an incompetent megalomaniac who views "security" as a reasonable justification to refuse any and all requests, and treats everyone like an enemy.
I haven't been allowed to use Python or R. In fact, the only programming language I have access to is VBA (for applications, not the stand-alone variant). Of course that's a huge mess because the IT director disables macros once a month, generally right after another crypto attack makes the news. Thankfully, he didn't even realize that it was possible to use VBA from inside any office application until after I had already used it to create several Access applications which made the jobs of the most important people in the organization easier. So when he breaks VBA every director in the organization yells at him and the functionality is restored nearly instantly.
Of course he could restrict the applications to run only signed macros, but he won't give me permission to sign things because he is (literally) afraid I might hack something.
On top of that, my computer is a Core 2 Duo from 2007 or so with 4 gb of ram. He bought over 100 of them used from a computer recycler about 2 years ago. For the first three months at this job I had a Pentium D, which literally couldn't run Excel and Firefox at the same time. I'm not allowed to get a better computer, because the employee handbook states that every computer needs to be the same for "security" reasons. If my director used our budget to purchase a computer I wouldn't be granted access to any of the databases containing our data because of "HIPAA compliance." (For the record, we don't have any medical data whatsoever. We only have names, addresses, and donation amounts. We don't even know the birthdays of our constituents.)
The worst part is that we randomly started losing data after all of our network drives were moved offsite at one point to provide "redundancy." I created several tickets about this issue, and each time I was told that it couldn't have possibly happened, and there was no record of the file ever existing. I created a script that created a log file each hour with a list of files and their attributes from each directory to try to record proof of this happening. After I recorded about a week of files disappearing randomly overnight, he reported me to HR for hacking.
Once I proved nothing I did was wrong he amended the "IT security" section of the employee handbook. Several of these measures were impossible to follow because of restrictions he had placed on the computers/network. I brought this up with HR, and they removed these measures from the handbook. Once this happened, he sent an email to me cc'ing my boss and HR accusing me of trying to frame him by deleting files. I don't know how that accusation even made sense, because the files would still have to show up in transaction logs.
Despite all this, I KNOW my director and HR aren't going to believe me when I tell them I'm quitting because our IT director is an incompetent tyrant. From their perspective, IT issues are something that can be solved by compromise, just like everything else. So IT has to let me use VBA, and that should be enough.
Anyways, long story short, anybody hiring in Chicago?
I can sympathize with having to deal with VBA. I'm working in a lab that deals with lots of questionnaire data and uses Access as the main tool for gathering said data because that's the way things have been done in the past, despite the fact that nobody in the lab really knows how it works.
I can also sympathize with having your network drives go down and render everything inoperable. Everything we use in the lab is stored on an offsite network drive, probably because of HIPAA compliance, and said network drive has dropped out twice in the past couple weeks. Once for almost an entire day, and once for an hour or two.
Best of luck with the job search.
I don't want to hate too much on Access. It's really an amazing program. You should see the processes some of my coworkers managed to create despite not knowing an iota of SQL.
Thanks for the well-wishes. I'm being very methodical because I really want to make the move count. I'm okay with waiting for the right opportunity because I really enjoy my job outside of the horrible IT situation: my boss is awesome, the people I work with are awesome, there is a lot of variety in the role, and I get to make a lot of decisions. Regardless, the IT situation is limiting my growth, so I'm on the lookout for the next thing.
The average age of our board is somewhere north of 80 (not kidding), and they don't understand what IT is. True story--our parent organization didn't have a website until 2003, because the IT director thought websites were a fad. He was forced to buy the domain after someone else bought it and used it to post stuff the board found unappealing.
The only way to describe the entire situation is Kafkaesque.
That being said, the amount of VBA I've seen that doesn't work with Option Explicit is a little staggering, so maybe I shouldn't be so self-conscious.
> my computer is a Core 2 Duo from 2007 or so with 4 gb of ram
That's a super freaking powerful machine, you have to be efficient in your programs and good on your algorithms. My machine was a Pentium III 800MB RAM for the longest time. There is a lot you can do on that. Use algorithms that need to load data in chunks, exploit memory mapping and generate native code if you can. They go a long way, likely much further than some may think.
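As a small sketch of what "load data in chunks" can look like, here is a streaming column mean over a CSV that never holds more than one chunk in memory. It uses only the standard library; the chunk size and the column name passed in are assumptions, and memory mapping or native code would be separate techniques not shown here.

```python
import csv
from itertools import islice


def iter_chunks(path, chunk_size=10_000):
    """Yield lists of at most chunk_size rows (as dicts) from a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def column_mean(path, column, chunk_size=10_000):
    """Mean of a numeric column, computed one chunk at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(path, chunk_size):
        for row in chunk:
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")
```

The trade-off is that anything needing the whole dataset at once (sorting, joins, most model fitting) has to be restructured to work this way, and that restructuring is where the time goes.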
The question is whether the better computer would save more than 8 hours over the course of a year. The answer is yes. The amount of time it takes to load stuff in from the disk slows down programs immensely. If my computer had 16 gb of memory I'd be able to store and manipulate all my data in memory.
That's not even counting how much time is wasted optimizing code that I run rarely. That time is much better spent doing other things.
always looking to meet smart people!
Moving huge amounts of data into some cloud and then back out every day might be too slow or too expensive.
There needs to be strong mgmt support for getting outcomes, because the production team are going to have to change to support it, and they usually like doing things their own way. Typically, without analytics, sensible logging formats or a clue as to why the outside world behaves the way it does.
I've seen companies that treat data projects as if they were some great unknown, where the developers could get away with using bad or no patterns and not follow the patterns that other applications in the company use.
Technologies like Spark have made it more common and easier to develop big data applications and implement design patterns that regular engineers can understand and follow.
Couple a great data engineer with a great data scientist using tools like Spark, R, H2O, Alluxio, Parquet, etc. and companies can truly exploit their large sets of data effectively.
The problem is DevOps and bridging the gap between a scientist's environment and a production environment and keeping both as flexible and testable as possible.
We started a company to bootstrap companies into this culture by providing DevOps services and UIs which simplify the deployment of Kubernetes, Spark, Druid, H2O, etc. clusters. We also provide tools and services for simplifying and automating ETL pipelines with which models can be trained.
If you are interested in finding out more about these services contact us at: firstname.lastname@example.org.
Err...Oil needs to be transformed and refined before it can be called a product (like gasoline, plastics). So the analogy is good and even supports #1!
Commodities need to be transformed into a product before it's valuable.
Another time I was working with an engineer who built a neural net to predict something. Turned out it was a really poor choice as interpretability was important for the problem and the neural net's predictive power was actually worse than more traditional models.