The First Rule of Machine Learning: Start Without Machine Learning (eugeneyan.com)
765 points by 7d7n on Sept 22, 2021 | 172 comments

Furthermore, follow https://twitter.com/_brohrer_/status/1425770502321283073

"When you have a problem, build two solutions - a deep Bayesian transformer running on multicloud Kubernetes and a SQL query built on a stack of egregiously oversimplifying assumptions. Put one on your resume, the other in production. Everyone goes home happy."

This reminds me of an experience I had watching a company trying to replace a system with ML.

First they marketed it heavily before even thinking. During the test cycle they fed the entire data corpus in, ran some of the original test cases, and found some business-destroying results pop out. The entire system ended up a verbatim port of the VB6 crap which was a verbatim port of the original AS400 crap that actually worked.

The marketing to this day says it’s ML based and everyone buys into the hype. It’s not. It was a complete failure. But the original system has 30 years of human experience codified in it.

The AI taxonomy includes the term "Expert Systems" for these kinds of things. On the one hand it's definitely not of the new wave of ML AI so hyping those things as innovative is off. On the other hand we should definitely give more attention to that kind of setup and understand how to build/maintain/test it properly. Otherwise it often ends up being run by a few hundred Excel sheets and a few severely underpaid people and that's a disaster waiting to happen. The AS400->VB6->NewShiny path actually sounds like a success case given the messes that are out there.

Rule engine is the term I believe.

Often the biggest benefit is that the ML version is good at catching when the experts hadn't had their cup of coffee as well.

Most experts are like family doctors, they get the correct diagnosis 70% of the time. And even if you juice them up real good, they will ALWAYS lose 5% to human error.

The ML also hits the 70% mark, but it's a different 70%, so it'll fix 70% of the errors. Then you're batting at 0.91 instead of 0.70.
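The arithmetic behind that 0.91 can be checked in a few lines. This is only a sketch: it assumes the model's hits are independent of the expert's and that a correct model answer always overrides a wrong expert one, which is the optimistic reading of "a different 70%". The function name `combined_accuracy` is made up for illustration.

```python
def combined_accuracy(expert_acc: float, model_acc: float) -> float:
    """Accuracy when the model gets a second shot at the expert's misses.

    Assumes the model's errors are independent of the expert's
    (an assumption, not something the comment establishes).
    """
    expert_misses = 1.0 - expert_acc
    return expert_acc + expert_misses * model_acc


print(round(combined_accuracy(0.70, 0.70), 2))  # 0.91
```

If the two sets of errors overlap, as they often do on the genuinely hard cases, the gain is smaller than this.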

If I had a nickel for every time I've seen "business rules engine" turned into "AI" in the last few years...

But I guess if we complain that half of our colleagues and the media don't understand ML, why should we expect management to?

When the command from C-level is "We need some AI projects to tell our shareholders about," we shouldn't be surprised when middle management suddenly has successful AI projects in their slide decks.

If you have an existing rules-based decision-tree system, and you compare its performance with a bunch of other decision trees, and it does better, you are implementing a random forest that happens to be identical to your original system.

Artificial Intelligence.

If you talk enough about all the models and hyperparameters you compared and experimented with, you can probably impress people so much with the enormous deep learning model you spent several months developing that they won't even remember you mentioning the two-clause Boolean expression that you actually put into production. And of course it's AI. You used k-fold cross validation to select it.

In the same way as the term “blockchain” being used for “digital signing” or “cloud” for a server...

Well really-existing AI is just either "taking a mean()" or "programming a rule". So all of programming actually counts as (symbolic) AI.

Isn't your organization itself a machine that learned these rules over time? Maybe the marketing checks out.

We did the same thing when I worked for a resume search/sort/share site. Built a big ML tool that could look at job listings and resumes and pick who was best for each job. Our training data set was millions of resumes, hundreds of thousands of jobs, and in most of those jobs, we could say which resumes got shortlisted and which resumes got hired.

In the end, it gave basically the same results as keyword searching. But we marketed the shit out of it.

If it worked, why was it crap?

There is a certain value in understanding why something works and how you can either continuously improve it or adjust a few dials when there is an exceptional situation.

Part of the fascination with ML is the (dangerous) myth that you don't have to wrap your head around a complicated problem anymore, instead the solution will just magically fall out on the other side of the blackbox if you just feed it enough data.

Understanding the intricacies of the problems you are dealing with, however, is a value in itself.

Yes, but (please allow me):

«Part of the fascination with ML is» solving the mystery behind the ability to automatically build functions and behind those functions.

Surely, both in practice and axiologically, understanding and deterministically solving have a great value. Also because of that, the fact that systems exist that can adapt into solutions, but contain a transparency problem ("yes, but why"), contains an immensely fascinating theoretical challenge, in the learning that may come from the attempt to understand the "grown, spawned" (as if a natural phenomenon) system.

The laziness is not necessary: there is a great deal of fascination in unveiling the mysteries in the blackbox.

Then of course, when you have a practical problem to solve (instead of that intellectual challenge and promise), pick your best solution. And surely it is sensible to call it dangerous to rely on something not properly understood, which may hide the potential faults ("yes, we found out it fails here, and it may be that we kind of assumed it "saw" shapes, while really it "sees" textures..."). In professional practice those "active" fascinations (understanding the spawned) may be luxury.

While the methods are very interesting, I often wonder about assumptions about what can be modeled. We already know that it's not possible to correctly predict an arbitrary nonlinear system numerically (since that system could be chaotic).

It's one of the reasons why for specific problems, heuristics or statistics are way better than any attempt at nonlinear modeling / ML prediction (e.g. highly accurate climate models vs. struggling weather models).

I even hear people pitching ML for applications where determinism and explainability aren’t optional, like regulatory and financial reporting for a financial institution.

Indeed! Regulatory reporting is clearly defined, i.e., what needs to be reported and how. Yet I see groups trying to use ML to determine what to report, which makes me think they don't fully understand the topic they are working on.

To be fair, explainability is still a hot topic of research, as well as discriminatory bias tradeoffs.

People don’t know when something is done, finished and complete. They have to go and fuck around with it.

Look at Windows for example. Imagine how good that would be if they didn’t keep trying to fuck around with it and actually finished something.

How do you get promoted and what will you put on your CV if you don't change stuff and just keep the lights on? This is a dilemma all the way from top management to developers. How does a project manager build a career if there are no projects?

Change is needed because people want to have jobs and they will make work for themselves if none exists.

>Change is needed because people want to have jobs and they will make work for themselves if none exists.

Traditional jobs must have solved this problem somehow. You don't usually see e.g. windowmakers or installers coming up with windows in the shapes of superellipses because square windows are already solved, or stoves coming with integrated fridges because "just an oven and a top" is already solved.

Yes, it is called “fashion.” People periodically replace clothing or reconstruct buildings or alter cooking or food presentation. It doesn’t change much but it does maintain a great deal of economic activity, keeps the motors running as it were. By and large operating systems and websites and mature software systems are similar. It is a good thing because it soaks up the attention of people who would delay efforts to evolve, for example Microsoft contemplating its OS navel as Netscape came about. It is also good because a small amount of the large economic value of fashion is still more than sufficient for the development of something new. As the “startup” has become fashionable the cycle repeats, where the relatively inanimate bones of effort that doesn’t create new value are used as a framework for the rare things that do.

> You don't usually see e.g. windowmakers or installers coming up with windows in the shapes of superellipses because square windows are already solved, or stoves coming with integrated fridges because "just an oven and a top" is already solved.

Well, actually you do. You'd be surprised. Let's ignore for now that there are different window makes and models. There has been a transition from custom-made windows to ready-made windows, whose production cost is higher but the total cost ends up being low because you can have a crew drop by in a construction site and get it done in a few minutes.

Then there's the current progress in energy efficiency, and also fancy gizmos like actuators and sensors and all kinds of domotics.

Then there's security systems embedded into windows, which further adds sensors and networking and actuators.

You're now far beyond your mom and pop's windows.

And progress ain't done yet. We're starting to see self-dimming windows and also self-cleaning windows.

Most of the window making jobs are just production. In software you don't have the same friction of production. You don't have to type in the source code each time you install the program. But you have to manufacture a new window and physically install it each time which is labor intensive. Also, the window designer job is not hyped as much as IT jobs are.

Furthermore, you do see household appliances getting fitted with useless feature bloat and shoddy software and wireless and touchscreens on microwaves etc. It happens. IoT, subscription based software updates for power drills etc... Tractors that can't be repaired and contain a jumble of proprietary software as a service etc.

A couple of years ago I worked for a bank replacing an in-house library that basically moved and transformed data from one database to another with a highly contrived Spring Batch solution.

There was absolutely nothing wrong with the "ugly" framework code, it was quite beautiful, well structured, configurable and fast. Somebody didn't like that you didn't write java code and the properties file based DSL was indeed odd, but nothing wrong with it after you bothered to read the library code.

The Spring Batch code was more explicit, but much uglier, overall.

People have to have list items for their yearly review cycle and their CV. "Replaced a legacy system with a more modern solution" can be presented in a light that earns you cookies. But it may be seen as useless by the higher ups if all they care about is new features. You have to know what impresses your boss and your boss' boss or the interviewer at your next job.

> There was absolutely nothing wrong with the "ugly" framework code, it was quite beautiful, well structured, configurable and fast.

I'm sure there is far more to the story than the new guys doing a poor job writing a replacement service.

For example, old code does add maintenance costs just by the fact that it's tied to old frameworks or even OSes, either of which might not be maintained anymore. Also, I don't feel it's an honest argument to criticize java for the sake of being java. If there was ever a production-minded programming language and tech stack, that would undoubtedly be java.

I'm sure the guys who designed the replacement service would have a few choice words regarding the old service.

> How do you get promoted and what will you put on your CV if you don't change stuff and just keep the lights on?

Find a new need. Every good product (and many bad products) is an answer to some need. And the world's full of all kinds of needs that we can work on.

However, sometimes we start projects without proving they actually answer a need, or sometimes the internal corporate needs don't match the user's needs (I'm looking at you, integrated advertising in Windows 11).

Finding a new need is risky and difficult. Tweaking and rewriting parts of an existing product with proven market adoption to fit the new fads delivers more predictable flashy results and successes for your CV and career and visibility within the organization.

You have to constantly reinvent yourself, or someone else will and take all your customers.

I'm not saying you are wrong, but you aren't right. There is a balance. You can't stand still, but quality that comes from improving the current thing is important as well.

Presumably what’s meant is it hasn’t improved in 40 years, and even then it was probably just “barely good enough”. This might be considered MVP but that depends on whether you have to actually use it or not.

One thing that I could imagine: it bases its decision only on a few (or the 'wrong') features while you (or marketing) want to consider more.

We have had a project where we were asked if our model would consider X. So we added X to the model but this didn't increase performance. Now the sane, simple answer would be to just ignore X. But then people come and ask why, doubt that it doesn't improve results, and point out that the competition without ML considers X.

That doesn't happen (or is hidden) in a non-ML situation where some decisions aren't questioned by a benchmark.

VB6 is looked down upon.

For very good reason, by an overwhelming majority of developers. The fact that a few developers thought VB.NET was even worse than VB6 doesn't lessen VB6's dreadfulness, it just highlights VB.NET's dreadfulness.


>The final release was version 6 in 1998. On April 8, 2008, Microsoft stopped supporting Visual Basic 6.0 IDE. The Microsoft Visual Basic team still maintains compatibility for Visual Basic 6.0 applications through its "It Just Works" program on supported Windows operating systems.

>In 2014, some software developers still preferred Visual Basic 6.0 over its successor, Visual Basic .NET. Visual Basic 6.0 was selected as the most dreaded programming language by respondents of Stack Overflow's annual developer survey in 2016, 2017, and 2018.

Stack Overflow Developer Survey 2016: Most Dreaded: Visual Basic: 79.5%


Stack Overflow Developer Survey 2017: Most Dreaded: Visual Basic 6: 88.3%


Stack Overflow Developer Survey 2018: Most Dreaded: Visual Basic 6: 89.9%


I think VB6’s bad reputation is that it is stuck in the 90s. I don’t know if any version of a language from that time would be popular now (let’s leave FORTRAN and COBOL aside).

I disagree that VB.net was dreadful. It broke backward compatibility, but I think for good reasons: arguments being byref by default in VB6, collections being inconsistently 0 based or 1 based, the SET keyword that wasn’t really serving any purpose and was inconsistently applied, having to provide parameters within brackets or between spaces depending on whether the return value is assigned to a variable or not, etc...

I have a lot of sympathy for the frustration of someone who has to maintain a huge code base when backward compatibility is broken, but I think the changes VB.net introduced were necessary.

Oh, the memories, I loved VB6.

I remember when my father had to use a medical services billing program supplied by a national health insurance company, and he had some problems with it.

Luckily, I was a student in Bucharest and I went to their headquarters to play middleman between my father and their "informatician".

This "informatician" was the sole architect, UX designer, developer, tester, release manager for this program -- VB6+ access.

I sort of helped him debug the code, he built me a special version and handed it to me on a CD.

The program was ok UX wise and blisteringly fast. Years later, they hired this corrupt company that built software for the State and produced a horrendous program that ran terribly; the startup alone took 15 minutes (parsing humongous XML and inserting it line by line into a local sql database, as far as I remember reading the logs).

The contract ran into HUNDREDS of millions of euros. Granted, the scope of the program was a bit wider.

At least VB.net kept my favorite legacy error handling strategy: https://docs.microsoft.com/en-us/dotnet/visual-basic/languag...

Plenty of people still write C99.

It's not the language itself, but the platform. VB6 was primarily used to write RAD GUI applications. The GUI elements VB6 provide are very outdated by today's standards.

I only wish there was a modern tool as simple as VB for crud application.

What's wrong with VB.NET Winforms with the Visual Studio WYSIWYG? I still have yet to find a better GUI building experience (alternately the same thing in C#)

You can still buy PowerBuilder, which was always superior to VB6 for relational applications.

It's been unsupported for 12+ years? If you have code relying on it and haven't migrated to something supported it means your code is not maintained (don't care what your excuse is, using VB6 in 2020 means you're not actively maintaining the project), written 2 decades ago with the coding standards of the era, the original developer team is gone and probably retired and since nobody is actively maintaining it nobody has much knowledge about how it works.

So yeah anything that's still running on VB6 is very likely crap.

> If you have code relying on it and haven't migrated to something supported it means your code is not maintained

No, it does not. It means that Microsoft no longer provides support for the IDE. That does not prevent the developer from maintaining their own VB6 code. With some extra steps, the official IDE and compiler for VB6 can still be installed on Windows 10. Running programs built from VB6 is still supported.

> written 2 decades ago with the coding standards of the era, the original developer team is gone and probably retired

This applies regardless of the programming language to any codebase that has been around for long enough.

>No, it does not. It means that Microsoft no longer provides support for the IDE. That does not prevent the developer from maintaining their own VB6 code. With some extra steps, the official IDE and compiler for VB6 can still be installed on Windows 10. Running programs built from VB6 is still supported.

If you're comfortable with this then I don't think you're actively investing in your software.

>This applies regardless of the programming language to any codebase that has been around for long enough.

No, if you have a team actively maintaining the project you have the knowledge transfer in-house which is the second part of that sentence.

> If you're comfortable with this then I don't think you're actively investing in your software.

What exactly does 'actively investing' mean in this context and why is it needed? If the software is actively maintained so that it continues to meet business requirements, is that not enough?

> No, if you have a team actively maintaining the project you have the knowledge transfer in-house which is the second part of that sentence.

That is orthogonal to what programming language is being used. When the project is actively maintained, knowledge can be transferred regardless of the programming language.

>What exactly does 'actively investing' mean in this context and why is it needed? If the software is actively maintained so that it continues to meet business requirements, is that not enough?

If you're actually investing in maintaining something that's running on a deprecated platform that's a decade over EOL and nobody wants to touch with a 10 foot pole - that sounds like a crap project by definition.

Anything that's sufficiently funded to be actively developed would have figured out a migration plan by now, the only scenarios where it wouldn't sound like terrible projects to work on.

>That is orthogonal to what programming language is being used. When the project is actively maintained, knowledge can be transferred regardless of the programming language.

No it's not when the language is deprecated by the owners for over 12 years at this point. It's like having software that only works on windows xp and maintaining it because you can still boot a VM to run it. Good luck working on that POS.

> If you're actually investing in maintaining something that's running on a deprecated platform that's a decade over EOL and nobody wants to touch with a 10 foot pole - that sounds like a crap project by definition.

The platform it runs on is Windows 10, which is not deprecated. Microsoft provides an 'It Just Works' guarantee on Windows 10 for VB6 applications. It does not matter whether someone wants to maintain code. The company pays people to do it. Just like how there are many people who do not want to work on proprietary software but do it anyway because their employer pays them to do it.

> Anything that's sufficiently funded to be actively developed would have figured out a migration plan by now, the only scenarios where it wouldn't sound like terrible projects to work on.

Actively developed means that bugs are fixed and features are added as needed by the business. It does not mean jumping on the latest tech trends when there is no business justification. And I am pretty sure that the users are happy that they can use a fast, responsive application instead of a lumbering, bloated Electron app.

> No it's not when the language is deprecated by the owners for over 12 years at this point. It's like having software that only works on windows xp and maintaining it because you can still boot a VM to run it. Good luck working on that POS.

That is a strawman argument because the VB6 IDE and programs compiled with it run on Windows 10 natively, without a VM. And running VB6 programs on Windows 10 is officially supported.

>The platform it runs on is Windows 10, which is not deprecated. Microsoft provides an 'It Just Works' guarantee on Windows 10 for VB6 applications. It does not matter whether someone wants to maintain code. The company pays people to do it. Just like how there are many people who do not want to work on proprietary software but do it anyway because their employer pays them to do it.

My point is that every time I've seen scenarios like this, with products stuck on unsupported platforms, the product is used but there's no money in actively maintaining it (or else it would have made migration plans in the last 12 years). This means you are likely getting shit money working on it, working on a legacy stack you won't use anywhere else, codebase is almost always shit, and the work you do is unrewarding. So crap projects by definition, and every testimonial I've seen so far confirms it.

I've worked on projects being stuck on tech close to EOL - they always had migration plans to upgrade to supported tech.

I think none of those assumptions apply in this case, or even in general. If there were no money in it, then the products would not have been actively maintained. VB6 was ranked 20th in TIOBE last year, so it is not just used in my company, either. In fact, if it were so unused, Microsoft would have dropped support for it already, but they currently support it until Windows 10 EOL, and possibly will on Windows 11 as well. I have not looked at the codebase behind the VB6 products in question, but there is no reason for me to assume that it looks any different to any other long-established codebase, or that the developers who maintain it are paid less. I would assume they are paid more because they are harder to replace, which in turn makes their work more rewarding.

I have worked on products with old tech stacks as well, e.g. a C++98 codebase for the core product of a multi-billion-dollar company, with no plans to migrate. New features were being frequently added, and the company's own standard library replacement that bridged the gap was itself actively developed.

That’s kind of life though.

We have a COM component written in VB6 running in IIS on windows containers on Amazon in EKS.

It works but it’s crap!

At my last job there was a team of 4 or so people who had originally written some vb6 code that they were still maintaining. This was as recent as 2020 and since there were no plans to stop I assume it's still ongoing with, at best, some plans to move off being made what with how slow things moved.

Even they agreed it was shit though.

Welp, what if I told you a Sixth Form (Y12, age 16-17) in a UK school has a computing course that just started and is reportedly using VB6 ... I'm really not sure what to say?

yeah it wasn't said when it was moved from though? The reference to Machine learning implies more modern, but not necessarily so.

At any rate, even when it was maintained it was still looked down on, I guess a Dijkstra-based side effect.

There is RO Mercury?

The "the right tool for the right job" applies for ML topics too.

If the job involves "looking smart and innovative" for whatever reasons, people tend to err on the side of overly complex solutions.

On the other hand if the advice "let's just go with an SQL query built on a stack of egregiously oversimplifying assumptions" comes from someone, who doesn't know how SQL and linear regression / logistic regression with binning/bucketing / simple decision trees work, I would ask for a second opinion. Because a huge part of the retail banking, non-life insurance and marketing business is running on this simple stack. Obviously profitable.

If the same advice comes from someone, who knows when to use deep learning instead of XGBoost and why, I would go with his/her advice. And I would try to keep him happy and on my team.

Furthermore in the article, yes.

This isn't ironic, I've actually done that multiple times in a large company. No one noticed, everyone went home happy.

Well, the article does conclude with that exact tweet...

My workplace has got all kinds of attention for building a blockchain based data collection system that encompasses an entire sector of the economy. It's "almost done", so we are right now starting a simple set of REST services that write into a badly normalized transactional database just in case it stays "almost done" for too long.

That quote is seriously brilliant! Thanks for sharing.

Just don't build one solution to your problem with regular expressions: then you have two problems.

TFA ends with that quote.

I recall attending a technical talk given by a team of senior ML scientists from a prestigious SV firm (that I shall not name here). The talk was given to an audience of scientists at a leading university.

The problem was estimating an incoming train speed from an embedded microphone sensor near the train station. The ML scientists used the latest techniques in deep learning to process the acoustic time series. The talk session was two hours long. This project was their showcase.

I guess no one in the prestigious ML team knew about the Doppler shift and its closed form expression. Typically taught in a highschool physics class. A simple formula that you can calculate by hand: no need for a GPU cluster.
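For reference, the closed-form expression fits in a short sketch. It assumes a stationary microphone on the line of travel, a source emitting a steady tone, no wind, and a speed of sound of 343 m/s; the function names are made up for illustration, and nothing here says a real station recording contains a tone clean enough to use.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C (assumed)


def speed_from_known_tone(f_source: float, f_observed: float,
                          c: float = SPEED_OF_SOUND) -> float:
    """Speed of a source approaching a stationary listener.

    From f_observed = f_source * c / (c - v), solved for v.
    """
    return c * (1.0 - f_source / f_observed)


def speed_from_pass_by(f_approach: float, f_recede: float,
                       c: float = SPEED_OF_SOUND) -> float:
    """Speed from the tone heard before and after the source passes.

    f_approach / f_recede = (c + v) / (c - v), so the source frequency
    cancels out and does not need to be known.
    """
    return c * (f_approach - f_recede) / (f_approach + f_recede)


# Synthetic check: a 440 Hz horn on a train doing 30 m/s (108 km/h).
f_in = 440.0 * SPEED_OF_SOUND / (SPEED_OF_SOUND - 30.0)
f_out = 440.0 * SPEED_OF_SOUND / (SPEED_OF_SOUND + 30.0)
print(speed_from_known_tone(440.0, f_in))
print(speed_from_pass_by(f_in, f_out))
```

Both calls recover the 30 m/s that was used to generate the two frequencies, up to floating-point rounding.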

So did you check if the simple solution outperformed it?

In the real world there's often more noise and variance and additionally, part of the benefit of using those techniques is that you can arrive at solutions that are about as good without being an expert in every single thing.

I'm sympathetic as this is a showcase and if their general method performed as well it does show it can learn the data well for other comparable problems without easy solutions. I know I often test my models on verifiable problems as a sanity check.

Bingo. The kind of problems OP talks about are fairly frequent, but this problem of train speed estimation is not it.

Also worth discussing is what happens when instead of one you put 30 sensors to improve your estimate of the speed. Good luck figuring out the closed form Doppler expression in that case (you technically _can_ use Kalman filtering but you are assuming each sensor is independent - they would not be, they would be correlated based on their spatial location and closeness to train).

With deep learning, all you need to extend your 1 microphone solution to 30 is a lil bit of pytorch code to add more neurons and some plumbing to pass in 30 audio streams but that's it.

Not to mention extensions to more complicated scenarios - people talking nearby, cars nearby etc. With deep learning you probably won't even need to modify any code, just throw training data at it (assuming your original model architecture is well designed).

But you might need a lot of training data, which in some cases you might not have.

Well the main sound you hear when a train arrives in a station is the sound of brakes. Its frequency and volume change as the train slows down. You'll need to analyze the physics of that before extracting doppler shift from it.

Also, depending on the track used, there may be trains passing by without braking, so you will need at least a classifier to sort these two cases.

I'd argue that using ML to build such a classifier is almost always a time saver.

And if you have the ML pipeline there, why not try to train it to recognize the speed while we are at it? It will likely find out about doppler shift but also do things that would take ages to code manually:

- Use volume levels and volume level differences
- Use the clicks at rails junctions to evaluate the speed
- Recognize the intensity of the braking/engine running
- Use cues like rails vibration at certain speed
- Adjust for air pressure difference when it hears the rain

All of that for free. Nowadays, going ML first is becoming a pretty good idea actually.

I remember the first 15 years of my life getting woken up by trains and it certainly wasn't the brakes that I heard first.

They brake only when they stop at the station. If you were sleeping next to the tracks but not next to a station, you probably did not hear them much.

If the train had a loudspeaker on the front emitting a pure sinewave of known frequency, louder than anything else in the environment, you could probably just use a frequency counter and the Doppler formula.

Given just some microphones picking up whatever sound the train makes on its own, it's not obvious to me that there's a simple solution.

A doppler shift thingamajig might work in a lab, but not in the real world.

I guess, such a calculation could have been one of the inputs to the system.

I do get your point that an ML system for such a thing is an overkill. I guess there are more reliable and rugged methods to get the speed of the incoming train (sensors that need not be mounted on the train)

Doppler shift wouldn't help much in this case.

The clues required are in how the thousands of waveforms are affected by the environment, how they change as the train passes different features, and how their volumes change over time, and other features we can't know in advance. Probably the clicks as the wheels pass joints between tracks are the most telling clues about speed.

The microphone doesn't give a sine wave.

> The clues required are in the [random bits of physics we can't know in advance]

If we can't know in advance, how can you expect a glorified Markov Chain to magically figure it out? If it could - and it can't, but if it could - how would you know it did it correctly?

Fortunately, we know enough about physics to be able to deal with it without a divination server.

I get it. The train operator wants a solution, but realizes figuring this out is too hard, so it's better to pay someone else to do it. That's normal. It used to be that this someone else would do the actual work necessary. But thinking is hard and electricity is cheap, so some figure it's better to just light up a GPU farm and wait until a solution forms in the primordial soup of repurposed vertex shaders. That too, perhaps, would be OK in principle - if the technology was there. But it's not there yet. We're still better off doing the actual thinking.

> The microphone doesn't give a sine wave.

No, it gives an infinite number of sine waves added up together. Which become a finite number of sine waves after passing through ADC, and then a finite sequence of sine waves after a Fourier transform.
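
To make the "sum of sine waves" point concrete, here is a rough sketch of a naive discrete Fourier transform that picks the strongest frequency out of a sampled signal. It is O(n^2) and meant for illustration only; real code would use an FFT library. The function names and the synthetic two-tone signal are assumptions, not anything from the project discussed in this thread.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin (naive O(n^2))."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency(samples, sample_rate_hz):
    """Frequency (Hz) of the strongest non-DC bin."""
    mags = dft_magnitudes(samples)
    peak_bin = max(range(1, len(mags)), key=mags.__getitem__)
    return peak_bin * sample_rate_hz / len(samples)


# Synthetic signal: a strong 100 Hz tone plus a weaker 220 Hz tone,
# sampled at 800 Hz for half a second (400 samples, 2 Hz per bin).
signal = [
    math.sin(2 * math.pi * 100 * t / 800) + 0.3 * math.sin(2 * math.pi * 220 * t / 800)
    for t in range(400)
]
print(dominant_frequency(signal, 800))  # 100.0
```

A real train recording would have energy smeared across many bins rather than one clean peak, which is the objection raised elsewhere in the thread.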

I had an internship project which was a simpler cousin of this, where I needed to determine the approximate location of a WiFi-enabled device, based off of received signal strengths from several access points. Normally this would be trivial, but this demo was meant to simulate an environment highly reflective to 2.4GHz RF. So while it took only a day or 2 to demonstrate relatively poor performance using simple triangulation (actually, trilateration is the better word here), I spent several weeks collecting data and putting it through a support vector machine. With a simple moving average filter on top of that SVM, around 98-99% accuracy was pretty easily achievable in classification (I believe my prediction classes were 2 x # of rooms, so quite coarse but good enough for the task).

The main advantage over a physics-based modeling approach - which with enough information, could surely have reached practically 100% accuracy - is that the SVM didn't rely on knowing anything about the location of the access points, or the geometry of the space. The signal strength training data was to be available for free as a byproduct of another device, so this solution had very low cost in the form of manual effort/precise measurement, both of which would have dwarfed a few weeks of intern time.
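
The general shape of that approach (classify each signal-strength reading from labelled examples, then smooth the stream of predictions) can be sketched with the standard library alone. Note the swaps: this uses a nearest-centroid classifier as a stand-in for the SVM and a sliding majority vote as a stand-in for the moving average filter. The room names and RSSI numbers are invented for illustration.

```python
import math
from collections import Counter, deque

# Hypothetical training data: room -> RSSI readings (dBm) from three access points.
TRAINING = {
    "kitchen": [(-40, -70, -80), (-42, -68, -79)],
    "lab": [(-75, -45, -60), (-77, -47, -62)],
}


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(column) / len(column) for column in zip(*vectors))


CENTROIDS = {room: centroid(vectors) for room, vectors in TRAINING.items()}


def classify(rssi):
    """Room whose centroid is closest to this reading (no AP positions needed)."""
    return min(CENTROIDS, key=lambda room: math.dist(rssi, CENTROIDS[room]))


def smoothed(readings, window=3):
    """Majority vote over the last `window` raw predictions."""
    recent = deque(maxlen=window)
    result = []
    for reading in readings:
        recent.append(classify(reading))
        result.append(Counter(recent).most_common(1)[0][0])
    return result


# The third reading is a reflection-style outlier that looks like the lab.
stream = [(-41, -69, -80), (-43, -71, -78), (-74, -46, -61), (-40, -70, -81)]
print(smoothed(stream))  # ['kitchen', 'kitchen', 'kitchen', 'kitchen']
```

As in the comment, nothing here needs the access point locations or the room geometry, only labelled readings.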

> If we can't know in advance, how can you expect a glorified Markov Chain to magically figure it out? If it could - and it can't, but if it could - how would you know it did it correctly?

We might not know anything about them in advance, but the patterns are there and could maybe be extracted from some training data. If only you had a statistical model that was flexible enough to find them…

Validation is then as easy as running the model on some examples outside the training set.

> No, it gives an infinite number of sine waves added up together.

Yeah, and after Doppler shift it is still an infinite number of sine waves - no immediate information gained.

Of course, if there are characteristics in the original noise and its frequency distribution, you could try to find those in the doppler-shifted signal. How would you determine the characteristics? From a dataset of examples, I guess. So now the problem is: recognize a pattern from examples and try to find it in new instances. Sounds like the kind of problem ML has found success in. (If you're now thinking "we don't need ML, just some advanced statistics"… Well ML is often basically a statistical model with lots and lots of parameters.)

> Validation is then as easy as running the model on some examples outside the training set.

Only if you can trust the data gathered from that validation to be representative. You can do that easily when you understand the statistics your model is doing - which is the case with an "old-school" ML solution, but not so with DNNs.

This gets worse the more complex your problem is. I can expect a DNN to pick up the correct frequency patterns in audio time series quickly, as it stands out in the solution space - but with more variables, more criteria, we know it takes ludicrous amounts of data for the model to start returning good results, and it still often fixates on dubious variables.

And then you have to ask yourself - what are your error bars? With a classical approach to estimating train velocity from sound, your results will be reasonably bounded, and won't surprise you. With a DNN, all bets are off.

> How would you determine the characteristics? From a dataset of examples, I guess.

And physics. In this case, a human can apply their understanding of physics to determine what characteristics to expect, verify they exist in the dataset, and encode that knowledge in the solution. A DNN will have to figure this out on its own, and we have no good way to verify it did it correctly (and isn't just overfit on something that's strongly but incidentally correlated).

I agree there are plenty of problems where we don't have a good "first principles" solution - where we're just looking for correlations. DNNs automate this nicely. But such models belong to the category of untrusted ones - they might seem to work now, but because of their opaqueness, we can't treat past performance as a strong indicator of reliability.

> Well ML is often basically a statistical model with lots and lots of parameters.

Yes. But I think it matters if people know what those parameters do.

Ha! A friend sent me this comment when he recognized this project. Unless there happens to be another firm who did the exact same thing we did, I was a part of this project (see this blog post https://www.svds.com/introduction-to-trainspotting/).

You misunderstood the point of the presentation. The company was a consulting firm that specialized in data science and engineering. Our clients wanted to kick the tires and see what our technical chops were before hiring us but they didn't want to let us use their proprietary and confidential data for our own tech demos.

We didn't want to just use the same open source datasets everyone else did, so we got to thinking about novel datasets we could create that might have applications for industries we sold our services to. From this, the Trainspotting project was born.

Many of us commuted via the Caltrain, which was right next to our office, and we were frequently frustrated with the unreliability (this was in ~2016 or so when car and pedestrian strikes were happening seemingly every week), so we made an app that tried to provide more accurate scheduling.

We used the official API for station:train arrival times, but we found that it was unreliable, so we wanted some ground truth data on whether a train was passing. Since our office was right next to the Castro MTV station, I had the idea to use a microphone (attached to a raspberry pi) to just listen for when the train went by. In addition to ground-truth data for validating arrival times, this gave us a chance to show off some IoT applications. It actually worked pretty well, but it had false positives (e.g. the garbage truck would set it off). So we added a camera.

We pointed it at the tracks and started streaming data off of it. At first we used very simple techniques, processing the raw stream on-device with classic computer vision algos (e.g. Haar cascades) in OpenCV. We discovered that the VTA, which had a track parallel to the Caltrain and was "behind" the Caltrain in our camera's shot, could cause false positives. Gradually we used more and more complex techniques like deep learning, but the raspberry pi couldn't handle it (IIRC it could only process a single frame in like 6 seconds). So we used a two-stage validation whereby the simpler, faster detectors that could run on the raw stream in real time detected a positive and then we'd send a single frame to run deep learning.

TL;DR: The whole point was to be a tech demo, not to gauge the speed. The trains were either stopping or pulling out of the station, so speed would have been useless.

Really enjoyed this post and explanation, thank you! I work in ML and used to live on Alma St in Palo Alto so it really hit home for me :).

I also acutely enjoy the notion that a pithy critique of people who refused to simplify the problem they were solving is in itself grossly oversimplified!

Sorry, but this seems too strange to be true. Are you sure you didn't miss anything?

Particularly strange since a moving train (i.e. vehicle) is about the most common way the Doppler effect is explained in textbooks - it's not like you need any big "eureka" moment to get to this solution either.

Analyzing the doppler shift to calculate speed only works if you know what the unshifted audio spectrum should be. Trains generate a ton of noise at a wide range of frequencies and that noise probably varies significantly based on a bunch of factors.

If you put the microphone directly against the track, I would bet the friction and movement of the wheels against the track generates vibration that is fairly consistent for a given speed. Maybe a sensor that better detects slight vibration would be better than a microphone for this use case.

Additionally, train engines run as generators to actually power the wheels, which means they're likely running at consistent RPMs or a consistent range of set RPMs. This could be listened for.

A sufficiently large RLCDNN would reinvent the Doppler effect from data, right?

You could also let Tom the traindriver sit there and have him guesstimate the speed.

Or just have two switches on the train tracks

Radar exists too.

The need might be for a sensor local to the platform as a back up to give warning for a train that's traveling too fast? In which case a sensor that mimics the old Cowboy film favourite of putting one's ear to the track seems like a reasonable thing to try.

Ah, we're talking about practical solutions here? Should've warned me. A chain of laser reflective sensors might be even better, because there is less mechanical wear + you can use them to know where the train currently is and where it isn't.

But this is very likely a very well researched area and there are definitely train people who can point out a flaw in this idea (dirt?).

Or some other type of sensor and minimal gear added to each locomotive...

When all you have is a hammer...

Or maybe in this case: When you have a shiny new hammer, and not enough fitting nails.

But what about noise? Is it really accurate in a real world environment?

ok, a compromise. let's do ML on the power spectral density of the train audio. or just use lidar.

I was very keen on machine learning for some time- I started working with ML in the mid 90s. The work I did definitely could have been replaced with a far less mathematically principled approach, but I wanted to learn ML because it was sexy and I assumed that at some point in the future we'd have a technological singularity due to ML research.

I didn't really understand the technology (gradient descent) underlying the training, so I went to grad school and spent 7 years learning gradient descent and other optimization techniques. Didn't get any chances to work in ML after that because... well, ML had a terrible rep in all the structural biology fields and even the best models were at most 70% accurate. Not enough data, not enough training methods, not enough CPU time.

Eventually I landed at Google in Ads and learned about their ML system, Smartass. I had to go back and learn a whole different approach to ML (Smartass is a weird system) and then wait years for Google to discover GPU-based machine learning (they have Vincent Vanhoucke to thank- he sat near Jeff Dean and stuffed 8 GPUs into a workstation to prove that he could do training faster than thousands of CPUs in prod) and deep neural networks.

Fast forward a few years, and I'm an expert in ML, and the only suggestion I have is that everybody should read and internalize: https://research.google/pubs/pub43146/ So little of success in ML comes from the sexy algorithms and so much just comes from ensuring a bunch of boring details get properly saved in the right place.

Most of the article is about the first of Google’s 43 rules about ML: “Don’t be afraid to launch a product without machine learning.”

and this is the first part of the description:

“Machine learning is cool, but it requires data. Theoretically, you can take data from a different problem and then tweak the model for a new product, but this will likely underperform basic heuristics. If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.”



That makes it sound like the problem is lack of data, which isn't true.

The problem is that the kind of ML that involves downloading a framework from github and tweaking features until the percent goes up is actually built on certain statistical models under the hood that people don't understand and that don't fit the process they're trying to model.

When the statistical model is correct you don't need loads of data. E.g., you don't need more than a thousand respondents to make valid inferences about millions of people in a sociological survey.
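
The survey claim can be checked with the standard margin-of-error formula for a proportion, z * sqrt(p * (1 - p) / n), optionally multiplied by a finite population correction. A small sketch, assuming simple random sampling, a 95% confidence level (z = 1.96), and the worst case p = 0.5; the function name and the 5 million population figure are illustrative:

```python
import math


def margin_of_error(n, p=0.5, z=1.96, population=None):
    """Half-width of the confidence interval for a sampled proportion.

    Assumes simple random sampling. If `population` is given, applies the
    finite population correction sqrt((N - n) / (N - 1)).
    """
    standard_error = math.sqrt(p * (1.0 - p) / n)
    if population is not None:
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error


print(round(margin_of_error(1000), 3))                         # 0.031
print(round(margin_of_error(1000, population=5_000_000), 3))   # 0.031
```

With 1,000 respondents the margin is about 3 percentage points, and the population size barely enters the formula once it is much larger than the sample. That only holds when the sampling assumption matches reality, which is the commenter's point about the model being correct.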

For real.

People are downloading ready-made models from repositories to try and solve minor problems.

Guess what, your problem might be a simple linear regression. Yes you can solve it with a DNN (one level, one neuron - but hey, don't keep that from putting it into your CV) but you don't need to.
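
For reference, a simple linear regression has a closed-form least-squares solution that needs no framework and no training loop. A minimal sketch for one input variable (the function name and sample numbers are illustrative):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```

This is the same function a single linear neuron would converge toward under squared-error loss, just computed directly.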

At university, I generated Markov chains of the solution space from a single neuron that was being used as a binary classifier. You take n samples, average them out and look at the decision boundary. The decision boundary itself is linear but the margin of error is not.

It was really cool. Attempting to implement Hamiltonian MCMC on a single neuron really forced you to learn what a gradient is in regards to NN.

The point of that statement is that you need data to train a valid model for your usecase. And you may need a lot of it if you are trying to train a neural network.

Using a pre-trained model only works if the usecase it was trained for matches yours closely enough.

But the problem really is lack of data in many cases. Not necessarily lack of it in quantity, but lack of it in quantity in any usable/trustworthy form.

Huge amounts of data were necessary for the early models and still are when you want to win a DL competition.

I fine tuned YOLOv5 with a few dozen hand-labelled images to make an object detector in a semi-controlled environment.

The idea that you need a million images to train a detector or a classifier is now totally wrong. Fine-tuning can be done on a very small dataset.

Yes, this is the third paragraph of the article.

Meta: I feel like a lot of people (including me) just come to HN for the comments, which are often (subjectively) better than the article itself.

Basically the heading becomes the random discussion topic that gets thrown in the room.

Maybe there is an experimental social platform in that:

(Re-)create a HN or reddit look-alike, but instead of user submitted links just pick random headings from news sites. Every ten minutes, post a new one without any context or link to be discussed and voted by the audience.

No idea where this would take us.

Isn’t that what «Ask HN:» is for? You can also just post a title without any link.

So you are basically asking for a subset of HN? To avoid an echo chamber I think the links are a good thing.

After months of learning about machine learning for time series forecasting, and several chapters into a book on deep learning techniques for time series analysis and forecasting, the author kindly pointed out that there are no papers published up to that point that prove deep learning (neural networks) can perform better than classical statistics.

From the scikit-learn faqs:

> Will you add GPU support?

> No, or at least not in the near future. The main reason is that GPU support will introduce many software dependencies and introduce platform specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms. Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed can often be achieved by a careful choice of algorithms.

Of course, there are libraries that can support GPU acceleration for numpy calculations using matrix transformations now. Nonetheless, they are not often necessary.

> the author kindly pointed out that there are no papers published up to that point that prove deep learning (neural networks) can perform better than classical statistics.

Early in my career I moved to Silicon Valley to work for a large company. The project was a machine learning project. I was taking models defined in XML, grabbing data from a few different databases, and running it through a machine learning engine written in-house.

After a year and a half, it came out that our machine-learning-based system couldn't beat the current system that used normal statistics.

What rubbed me the wrong way was that the managers brought someone else in to run the data, manually, through the machine learning algorithm. More specifically, what bothered me was that we didn't attempt this kind of experiment early in the project. It felt like I was hired to work on a "solution in search of a problem."

Career lesson: Ask a lot of questions early in a project's life. If you're working on something that uses machine learning, ask what system it's replacing, and make sure that someone (or you) runs it manually before spending the time to automate.

So true, especially about RegEx. I love RegEx. You can do so many things with simple RegEx rules.

For example, have you ever tried to autodetect a published datetime of a news article published online? In many cases, it will be in metadata, or in the time/datetime tag.

However, there are still many websites where the published time is just written somewhere with no logic at all.

Writing a RegEx script by hand can resolve a problem. But every time I speak about it with our clients/prospects, they ask about ML that we use to parse news content.

Product: https://newscatcherapi.com/news-api
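
As an illustration of the hand-written RegEx approach described in this comment, here is a rough sketch that pulls a publication date out of free text. The two patterns (ISO style and "Month D, YYYY") and the function name are assumptions chosen for the example, not the commenter's actual rules; a real parser would need many more formats.

```python
import re
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

PATTERNS = [
    # ISO style, e.g. 2021-09-22
    re.compile(r"\b(?P<y>\d{4})-(?P<m>\d{1,2})-(?P<d>\d{1,2})\b"),
    # Long style, e.g. September 22, 2021
    re.compile(
        r"\b(?P<m>" + "|".join(MONTHS) + r")\s+(?P<d>\d{1,2}),?\s+(?P<y>\d{4})\b",
        re.IGNORECASE,
    ),
]


def find_published_date(text):
    """Return the first valid date found in `text`, or None."""
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            month = match["m"]
            month_number = int(month) if month.isdigit() else MONTHS.index(month.lower()) + 1
            try:
                return date(int(match["y"]), month_number, int(match["d"]))
            except ValueError:
                continue  # looked like a date (e.g. 2021-13-45) but is not one
    return None


print(find_published_date("Posted on September 22, 2021 by staff"))  # 2021-09-22
```

The patterns are tried in list order, so more reliable formats can be placed first.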

To offhand dismiss ML is also a cardinal sin. Control/treatment groups can show unambiguously when ML outperforms expert hand-crafted rules, pure random decisions or a simple model. The point is to measure, and not go 100% all in with one approach, but try many things and measure. I've done some process optimization with black-box methods, simple models, and SQL using domain expertise. In business you typically have budget and time constraints, so you go for the simplest and quickest solution first, show unambiguously that it works better, and then ask for more time and budget to build something more fancy. I ask myself "if this was my business, and my money, would I spend it doing this", if the answer is no, then you probably shouldn't.

The thing is, it's almost always clear when ML will outperform and when it won't. It isn't magic. ML systems are just compressed aggregations of their input datasets.

The question is then, (1) do we have datasets that are highly representative of the solutions to our problems? and (2) are our current systems sensitive to the relevant variations in these datasets?

If (1) is NO, then ML is impossible. If (2) is YES, then it's unlikely to provide a big ROI.

When ML replaces human decisions or very strict old software, there's also a more fundamental problem: are we enabling new mistakes that weren't possible before? How catastrophic?

For example, processing images according to some trained model instead of fixed rules and formulas introduces the risk of mismatched models (e.g. landscape photographs treated as line art from anime). Cases like self-driving cars not seeing obstacles are more obvious and more tragic.

>ML systems are just compressed aggregations of their input datasets

I like to think of them as forgiving sieves of patterns in data.

Overfitting a sieve will exclude a large number of almost positive cases, loose fitting will include a large number of mostly negative cases.

And there is always a danger of falling into a local minimum and not being able to come out of it.

I would add (3) can we leverage existing models.

ML can help reduce technical debt at logic layer, but it increases the technical debt at the infrastructure layer. It's a challenge for any company to deploy, manage and monitor models in production. If you can get away with a simple rule, that's a bigger win for the product (I'm not talking about research here).

In the community, there is a trend that "complicated == better". imho, more is less in industrial ML. You need to deal with model management, worry about inference & latency when the model gets bigger. The author has another article where he argues that data scientists need to be full stack ninja. While I don't fully agree with that statement, I think it benefits the company in many many ways. Data scientists need to meet engineers in the middle, and all these challenges need to be considered from day 1. Another trend I see is that some data scientists are not driven by the question "Can we solve this problem for the company?", but rather "Can we solve this problem using ML/DL?". This will lead data scientists to use the shiny and trendy models, even if it is not suitable for the job. I would blame management here, in some environments, data scientists are evaluated based on "fancy" models they build, not solutions that they provide. Solutions can be simple (but not simpler) rules.

I can see both sides of the argument. On one side, using ML feels like huge overkill when a simple trick exists. Plus AI can freak out in some circumstances. On the other side, it may find other, less obvious cues giving something more robust.

Rich Sutton's "bitter lesson" says the weight will move in time in favor of ML. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

From my limited experience, ML is good at massively multi-factor problems. If a human can understand the input, normal code will usually suffice.

This is why ML is pretty much the only option for autonomous driving, but not for calculating credit scores.

Credit scores aren't a great point of comparison because they have specific explainability requirements. If your goal is to predict defaults - for instance, if you work for a bank or a hedge fund dealing in bonds - then more sophisticated ML techniques might be appropriate. But credit scores are optimized for consistency, not accuracy.

I know that was probably an offhand example, but it's illustrative of the kinds of non-functional requirements that can make ML solutions more or less viable as soon as the technology has contact with human society.

Was going to say the same. In any decision where the outcome affects a human being, "because the algorithm said so" is usually not a satisfactory answer either to the human being affected or to any regulators who have an interest.

I haven't read the article, but I really like the construction of the phrase. I would also propose:

First rule of optimization: Don't optimize first. First rule of automation: Don't automate first.

“This is just an agile PoC. It’s not meant to be readable, performant or documented”.

The first rule of everything should be “it depends”.

I propose instead the first rule of rules: Don't assume that general advice will be applicable to all circumstances. :P

Until a few weeks ago, I worked for a team trying to build AI driven products. A surprisingly challenging thing has been finding problems that aren't better solved without ML (as an ML company, we are supposed to be using it so those concepts get eliminated).

Thanks for that! Some people I work with are constantly asking for ML; they invoke it like it's magic and will figure shit out by itself. Then when I push back asking how they would make the decisions themselves, their answers tend to be along the lines of "it's ML, it should figure it out by itself", and when I ask about the data to be used, "it should adapt itself and find the data". Getting to have a heuristic in the first place is so hard.

Reminds me of the book "Everything is obvious", where they experimented a few times and showed that in complex systems, advanced prediction systems made on many available and seemingly relevant variables are only marginally better (2 to 4% in the experiments) than the simplest heuristics you can use. They interpreted that as a limit of predictability, because systems with sufficient complexity behave with a seemingly irreducible random part.

To me that is just an iteration on "first you make it run, then you make it right". And to make it run you start with the simplest approach. And building your own model is generally not the simplest. However, it can be. There are some areas where you should start with ML, most importantly vision and some NLP, whenever a pretrained model for your task exists.

This is correct. But I guess the article argues about problems that are not already solved. NLP is in most cases so powerful and easy to implement that I would argue it can be viewed as a more complex version of a heuristic. My thought comes from the idea that you need to act upon the data NLP algorithms bring to you.

Yea I get that, but I have experienced people working with all sorts of insanely complicated heuristics to get something like a NER system running when they could have much more easily used a Hugging Face model. But I totally agree that the article holds true if you have to train your own model.

If you have not seen James Mickens (Harvard CS) USENIX Security keynote presentation from 2018, I highly recommend it. It's hilarious while clearly showing how reckless and dangerous ML is:


(for consulting in ML)

The Second Rule of Machine Learning - Start Machine Learning with simple shallow models.

50% of the problems are solved with a good choice of data + some generalized linear model.

30% remaining problems solved with shallow models or old school ML models. Anything from support-vector machines, decision trees, nearest neighbors, very shallow neural networks.

Remaining 20% require more work.

I wonder if this also applies to computer vision? There are certainly problem spaces where heuristics are well established, but many approaches around object detection/segmentation seem much easier/robust to implement with machine learning.

Was going to answer this. I very much agree with the article, but deep learning is absolutely a game changer for computer vision.

I myself tried several times to “not use ML” for some easy computer vision tasks where traditional CV methods are supposed to work. Well I always end up in situations where they don’t work well without fine parameter tuning, and tuning the parameter for a situation breaks the model in other situations, so you start adding layers of complexity to automatically tune the parameters, but the parameter tuning system also has its own parameters... While a simple neural net is trained easily and is much more robust, saving a lot of time and complexity.

Another proof of that is that CV products only started meaningfully entering the market after ML became applicable to CV (after 2015 for complex tasks, or earlier for simpler stuff like MNIST).

The aerospace industry has been using CV successfully for decades, well before ML appeared. So I would temper your last statement a bit.

That being said, you are right to say that ML changed the game entirely for CV in industry at large.

Same with industrial inspection.

To be fair. Doing that fine parameter tuning and complex layering of heuristics is to some extent creating a “ml model” from hand.

Exactly. This is why usually when you reach that point, a red light turns on in your brain saying “you are just reinventing ML at this point, stop”

The success of Deep Learning in Computer Vision is fascinating for me.

It revolutionized the field in a little over a decade and brought forth new frontiers.

I do believe that CV is an area ML will excel well into the next century.

Perhaps, we will find a way to chain together ML systems dynamically, overseen by a procedural system that makes real time decisions in understanding its input.

There is some stuff that is more robust, but classification is hard with classical methods or even small models. Though we're getting better at small models. There are different biases in the models too. But I wouldn't expect classical methods to do well on ImageNet. Though ImageNet has a lot of issues...

I fully agree with the article. One thing not mentioned, however probably assumed to be given: domain knowledge. A domain expert using simple methods will probably beat any decent ML model because they are able to define strong features.

That can happen indeed. Compensating for lack of system or domain understanding with ML can result in mediocre results. I've seen this repeatedly with ML teams struggling to get their models adjusted to what was fundamentally not so great data that needed a simple cleanup. Failing to understand the data was dirty, which was easy to address, led to a wild goose chase extracting this and that feature in attempts to make the magic work better.

Once you have deep understanding of your domain and system, finding the places where ML truly adds value is a lot easier. Also, you'll have a basic understanding of how things are without it and you'll know whether it is working better or not and whether that's worth the trouble.

But the point of ML to begin with is likely often not to deliver a better product but to appeal to investors or managers. If you create a better product but it doesn’t have “AI” in it, then it has failed in that respect. What’s needed is a set of things that can be sold as AI or ML but aren’t.

Spend two weeks adding some hidden worthless token "feature" no one will ever need or use that relies on AI. Then you can say your product is powered by AI. Boom, done.

My logger uses AI to generate a catchy 'message-of-the-day' to the console on first run.

My project is powered by AI.

MG, a Chinese carmaker, took out a full front-page ad in Indian papers to say their new car has AI, mostly meaning voice recognition commands and ADAS.

> Solve the problem manually, or with heuristics. This way, it will force you to become intimately familiar with the problem and the data, which is the most important first step.

Back then, when I did social research at university, I found it helpful to just look at the raw data. This is immensely helpful for familiarizing yourself with the data and discerning patterns that high-level analysis won't reveal easily. (In this case, you may want to start with a subset for evident reasons.)

This is generally correct about ALL technologies. You should NEVER start with a solution and look for a problem except in a very general sense (e.g. looking for potential markets in the abstract). Taking a solution to market without having the problem well defined and identified is absolutely Epic Fail.

Once you think you have a market, you should see how it can be done FIRST without your "fancy miracle technology" because NOBODY buys a technology because it's sexy or trendy: they buy because it added value in terms of more capability or lower costs.

And ALL problems have current solutions that almost certainly DO NOT use anything as complex as your technology solution so you have to tread very carefully and deliberately in a rational sense: what value are we REALLY adding? That starts with knowing your competition and the current solution to solving the problem first and then finding every reason why your technology won't work or will be problematic.

You ONLY have market potential once you've exhausted those faults or have objective arguments for your value proposition that have been validated by actual customers. The actual proof is made when they are willing to write a PO to you for the solution. Until then, everything you are doing is unproven.

well said, thx for your comment

What people call "ML" is actually several bundled phenomena. Unbundling them is a profitable exercise that can help prevent a lot of heartburn

* 1 -> the discovery of specific families of non-linear classification algorithms (with image and language patterns being examples of successful new domains). the domain where these approaches are productive might be significantly smaller than what all the hyperventilation and obfuscation suggests.

* 2 -> the ability to deploy algorithms "at scale". this cannot be overemphasized. Statistics used to be a dark art practiced by scienty types in white lab coats locked in ivory towers. With open source libraries, linux, etc to a large degree ML means "statistics as understood and practiced by recently graduated computer scientists"

* 3 -> business models and regulatory environments that enabled the collection of massive amounts of personal data and the application of algorithms in "live" human contexts without much regard for consent, implications, risks etc. Compare that wild west with the hoops that medical, insurance or banking algorithms are supposed to pass

Conclusion, ML is here to stay in some shape or form, but ML hype has an expiration date

Google's Rule #2:

> First, design and implement metrics.

> Before formalizing what your machine learning system will do, track as much as possible in your current system. Do this for the following reasons:

> * It is easier to gain permission from the system’s users earlier on.

> * If you think that something might be a concern in the future, it is better to get historical data now.


You always start by looking at the data, not by busting out advanced statistical methods. Those methods are obscure and could easily hide how ugly and unclean your dataset is. You really do need to look at types, missingness, the data structure and ensuring the row ID is what you want it to be, eliminating duplicates, joining on other datasets; it's a massive list of steps.
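A minimal sketch of that first look, assuming the dataset is a list of dict rows with a hypothetical `id` key. The column names and sample rows are made up for illustration, and only the Python standard library is used:

```python
from collections import Counter


def profile_rows(rows, id_key="id"):
    """Summarise missing values, value types and duplicate IDs for a list of dict rows."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing (an assumption; adjust per dataset).
        present = [v for v in values if v is not None and v != ""]
        types = Counter(type(v).__name__ for v in present)
        report[col] = {"missing": len(values) - len(present), "types": dict(types)}
    # An ID that appears more than once means the row ID is not what you expect.
    id_counts = Counter(row.get(id_key) for row in rows)
    duplicates = [k for k, n in id_counts.items() if n > 1]
    return report, duplicates


# Toy rows (hypothetical data): one missing age, one age stored as text, one repeated id.
rows = [
    {"id": 1, "age": 34, "city": "Lyon"},
    {"id": 2, "age": None, "city": "Oslo"},
    {"id": 2, "age": "41", "city": "Oslo"},
]
report, duplicates = profile_rows(rows)
```

Even on three rows this surfaces a mixed-type column and a non-unique ID, which is the kind of ugliness a model would quietly absorb.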

Even with a clean dataset, most clients will want basic arithmetic calculations: averages, counts, percentages, standard deviation, etc. Occasionally they'll want some basic logistic models, something slightly more causal. If they go straight to machine learning without these steps, do they actually understand their problem and what they want? Or are they reaching for the shiniest thing they've heard of?

I come from a core engineering background. In my experience, ML, especially DNNs these days, is a way for people to avoid doing critical thinking. The improvement, even when it works, is extremely marginal, making the ROI useless. Further, unlike in social media, a failure of an ML model here will result in a loss of limb or life.

Unfortunately, most decision-making C-suite executives are not engineers; they fall for the marketing hype and burn through time and capital without tangible outcomes.

I went into ML when I realized that this piece of advice is now wrong, at least in computer vision.

It was a few years ago. I had to classify pictures of closed and opened hands. I thought surely I don't need ML for simple stuff like that: a hue filter, a blob detector, a perimeter/area ratio should give me a first prototype faster and given the little amount of data I had (about a hundred images of each), not worth the headache. I quickly had a simple detector with 80% success rate.

Then as I was learning a new ML framework, I tried it too, thinking that would surely be overengineering for a poor result. I took the VGG16 cat-or-dog sample, replaced the training set with my poorly scaled, non-normalized one, ran training for a few hours and, yes, outperformed the simple detector that took me much longer to write.

Now in computer vision, I think it makes sense to try ML first, and if you are doing common tasks like classification or localization of objects, setting up a prototype with pre-trained models has become ridiculously easy. Try that first, and then try to outperform that simple baseline. In most cases, it will be hard, and it will instead be worth improving the ML way.

I think the author was focusing more on general applications (given his research & industrial background). In computer vision & NLP, the field is a bit advanced and it's harder to come up with rules. The promise of Auto-ML is bigger in these two fields.

Lately I've been thinking a lot about data cubes and how their use cases and methodologies for making them applicable are very similar to most machine learning algorithms. I don't mean how the output is generated or how things are programmed. What I mean is that they both tend to produce far more output than is practically useful. Additionally, it can be very easy to look at any small part of the output and draw incorrect conclusions.

To clarify, when I talk about ML I'm primarily referring to classifier algorithms and approaches (including nlp). In the large part the ML is being used to generate classifier rules which generalize patterns, and data cubes are often used to look for aggregations and data sequences which generalize patterns. The problem is that random patterns happen all the time, and may even persist for a long time despite a lack of real correlation. Semantic analysis of data cube output is really important in order to find meaningful patterns.

What I'm getting at is I often wonder why most ML projects try to treat it like it's magic. Human assisted learning has been shown repeatedly to be the system which actually works in practical application. The classifier output needs to be pruned to remove rules that only held true in the sample data, or were merely coincidental, or simply have no practical value.

Approaches like this are not cheap to set up and may in the end still only produce the same results as the existing entirely non-ML based system. The first question I ask myself before working on anything is: what is the likely scale of the work compared to the benefit? If I don't have objective data to answer that, I have to do some research to find out. Never try to build a massive or complicated system you don't have objective reasons to expect will be worth the effort. That's precisely what people have been doing with ML constantly. It's little wonder most developers have such low opinions of ML projects.

I have seen first hand at small and large companies how problems have been tackled with ML without trying a simple rule or heuristic first. And then, further down the line, the system has been compared to a few business rules put together, only to find that the difference in performance did not justify deploying an ML system in the first place.

It's true that if your rules grow in complexity, this might make it harder to maintain, but the good thing about rules is that they tend to be fully explainable, and they can be encoded by domain experts. So the maintenance of such a system does not need to be done exclusively by an ML engineer anymore.
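As a rough sketch of what such a baseline can look like, here is a keyword rule scored against labelled examples. The ticket-triage task, the keywords and the four sample tickets are all hypothetical, chosen only to show the shape of the comparison you would later run against an ML model:

```python
# Hypothetical rules a domain expert might write to flag support tickets as urgent.
URGENT_KEYWORDS = ("outage", "down", "data loss")


def is_urgent(text):
    """Rule-based classifier: urgent if any keyword appears in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in URGENT_KEYWORDS)


def accuracy(predict, labelled):
    """Fraction of (text, label) pairs the predictor gets right."""
    correct = sum(1 for text, label in labelled if predict(text) == label)
    return correct / len(labelled)


# Toy labelled data (made up for illustration).
labelled_tickets = [
    ("Production is down since 9am", True),
    ("Possible data loss after upgrade", True),
    ("How do I change my avatar?", False),
    # "down" matches inside "Download", so the rule misfires here.
    ("Download link in the docs is broken", False),
]
baseline_accuracy = accuracy(is_urgent, labelled_tickets)
```

The one miss is easy to explain (a substring match on "down"), and a domain expert can fix it by editing the rule. Any ML replacement would be scored with the same `accuracy` function.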

Here is where I insert my plug: I have developed a tool to create rules to solve NLP problems: https://github.com/dataqa/dataqa

In the business and corporate world this is so underrated.

In the past I attended several meetings with customers where I was actively discouraged from asking questions which would help us deliver a good, meaningful solution, as long as the customer would be happy "investing in a ML solution". And they were...

I disagree. If you have the data, try throwing ML at it. It's probably less work than trying to "understand it" and building a heuristic. If you don't have the data, how are you going to validate your heuristic anyway?

That's not how it works. People build ML solutions not because they went through rigorous analysis and figured out their problem needs an ML solution.

They just want to do ML and are looking for a problem that can be solved with it. Then they will likely ignore you when you say this problem also has a neat traditional solution.

This is further exacerbated by corporate actions like competitions for the best AI (or Blockchain, etc.) project, which you typically can't participate in if you have a traditional solution, even if it is way better.

The fallacy of ML/AI companies.

Example: https://beta.openai.com/examples/default-translate

They even use flawed results in their marketing materials that they didn't validate with domain experts. ("Où est les toilettes ?" is not French).

I thought this was going to be about data preprocessing or domain transformation. The article does touch upon it. For instance, you can boost your image classifier by normalizing your images with simple statistics. Ironically, since neural networks are very good at finding basic (but non-trivial) feature correlations, the reverse is also true: for instance, you can boost your SVM classifier by adding to it the feature responses of a CNN pre-trained on ImageNet.

Yes - and after many years, I'm yet to get past this first rule, and actually use ML. One day I hope to have a use case that's worth testing it out on!

I think a good rough guide is that if you consider it ML, if you're going to 'do ML', then.. it might be appropriate, but you're jumping to the solution and trying to make it fit (pun intended) the problem.

If on the other hand you start from having some statistics to do on the data you have, then you might at some point find yourself doing the sexy subset of it that we call 'ML', and fine.

There are domains where the use of ML is not only valid but the best viable option (e.g. recommendation systems, computer vision, etc.)

A few thoughts on how to maximize your chances of winning in this case:


"You bought a BBQ grill, you must be interested in more BBQ grills". This is how Amazon ML engine seems to work for me

So, I am a total ML noob. The thing I haven't found a straight answer to is, what is a model. I mean when it is in production? Is it just some random blob that you pipe data into and get data out?

Depends on how it's put into production, but you can deploy a model as a RESTful API that has a defined interface and a defined output. What it does underneath is less important to you I guess. So for all intents and purposes, yes, a model in production is something you feed a predefined set of data points and it gives you a predefined format of output.

So, you know how a straight line is defined as mx + b, where you just have two parameters: slope and intercept ?

Your input value is X, you multiply it by your slope and add your intercept to get the output (the Y value on the line).

The 'training' of an ML algo is really just finding the line-of-best-fit so that you can make predictions. So your line-of-best-fit is encoded in these two parameters, allowing you to make predictions about what the output would be for arbitrary input.

The problems people are throwing at ML have many more parameters and dimensions, but the training is a matter of finding those parameters that come closest to predicting the outcome. The 'model' is this set of parameters that allows the function to make predictions.
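To make the line-of-best-fit picture concrete, here is a minimal sketch using ordinary least squares in plain Python. The function names and the toy data are made up for illustration; real ML models have far more parameters, but the split between "find the parameters" and "apply the parameters" is the same:

```python
def fit_line(xs, ys):
    """'Training': ordinary least squares for y = m*x + b. Returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx           # slope
    b = mean_y - m * mean_x  # intercept
    return m, b


def predict(model, x):
    """'Inference': apply the stored parameters to a new input."""
    m, b = model
    return m * x + b


# Toy training data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
model = fit_line(xs, ys)          # the "model" is just the tuple (m, b)
prediction = predict(model, 10.0)
```

Here the whole model is two numbers, so what gets shipped to production is that pair plus the small `predict` function that knows how to use it.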

(disclaimer: also an ML noob, correct me if I'm wrong)


Starting with Machine Learning gets you funded, though.

I always have great success doing anomaly detection with basic standard deviation in some SQL queries...
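The comment does this in SQL; the same idea sketched in Python, flagging values that sit more than a chosen number of standard deviations from the mean. The threshold of 2 and the sample numbers are assumptions for illustration:

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return values whose z-score (distance from the mean in std devs) exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Toy daily order counts (hypothetical data) with one obvious spike.
daily_orders = [100, 102, 98, 101, 99, 103, 97, 100, 250]
anomalies = find_anomalies(daily_orders)
```

One caveat with small samples: a single extreme point inflates the standard deviation itself, so a threshold of 3 could never fire on nine values. That is why the sketch uses 2.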

Seems like antirez (from Redis fame) doesn't agree with this:


Don’t bash the tools… just bash the fools!

ML really needs specification tests.

A second point on that is, start with the simplest and most trivial models first, then add complexity as needed.

I don't know what this is, I'm Spanish
