Show HN: Igel – A CLI tool to run machine learning without writing code (github.com/nidhaloff)
405 points by nidhaloff on Oct 3, 2020 | hide | past | favorite | 102 comments


Keep pursuing this and ignore critics. What you're doing is important b/c ML is just out of reach of a big percentage of developers and technical lay people. It will take time to get your approach right, but it will make a difference.

As a suggestion - provide more real-world examples (e.g. business, sports, etc.) so that users can tinker with your samples as a pathway toward learning.

Please don't give up on this. Great job.

Hi, thanks a lot. I've received positive interactions on GitHub from the community; however, your comment is the first encouraging feedback I've gotten here :D so I appreciate it.

I will take your suggestion into consideration. You are right, there should be more real-world examples that will help users get started and see how this can be useful.

The thing is, I started the project two weeks ago, so it's still relatively new. I've been coding day and night because the idea got me excited. I published the first stable release this week, and there are new features that will be implemented in the next releases.

If the project is only 2 weeks old, all the more reason to ignore any critics. Particularly here where people are likely to criticize a baby in the crib for not working on coding projects outside of naptime.

Well, I mean that baby doesn't have a functional colon yet so putting semicolons everywhere just makes perfect sense.

I feel like if there is one thing that works on a baby, it is the colon...

What you're doing is creating a declarative syntax for applying machine learning tasks directly to data. This makes it learnable by machines, effectively teaching them how to do their own machine learning experiments. I think this project is greater than the sum of its parts.

In case it isn't on your radar, there is also https://github.com/uber/ludwig which seems to have similar goals.

Someone posted this tool earlier in the comments too. I was surprised since I had never heard of it, and I find it great!

However, I think it is only for building deep learning models and does not have any general ML support, or am I missing something? If so, that makes it very different from igel as a tool.

Agreed here. I am not involved in the ML space but have briefly toyed with PyTorch/TF/Sklearn.

I see the value in having a CSV data dump, going "I wonder what happens if I run it through X", and then having a CLI command to find out.

Would be neat if there was an adapter for SQLite too IMO.

Combining it with bash and psql + csvkit + xsv will give you a powerful combination for data ingestion, wrangling, and training, all on the command line. This would seem to have clear benefits for fast development and prototyping.
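As a toy sketch of that idea using only standard tools (csvkit/xsv offer much richer CSV handling), with the final igel invocation shown purely as a hypothetical placeholder:

```shell
# Toy wrangling step; file names and the igel invocation are illustrative.
printf 'sepal,petal,label\n5.1,1.4,0\n4.9,1.4,0\n6.3,6.0,1\n' > raw.csv

# Keep the header plus rows with petal < 5, then drop the label column.
awk -F, 'NR==1 || $2 < 5' raw.csv | cut -d, -f1,2 > clean.csv
cat clean.csv

# clean.csv could then be handed to the ML CLI, e.g. (hypothetical flags):
#   igel fit --data_path clean.csv --yaml_file igel.yaml
```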

Hi, can you please explain more? Or you can open an issue and describe the expected functionality there.

Hi, can you explain how you imagine the functionality with SQLite, and why SQLite specifically?

Totally agree. I sense ML in general can benefit tremendously from buttery-smooth UX, something it has typically lagged behind on.

Keep it up Nidhal, you’re doing a tremendous service. Don’t let the snobs get to you.

Thank you. I appreciate it

I agree, I think this is great, and something I will try to use shortly.

Thanks. Stay tuned, a lot is coming soon

I find it difficult to believe anyone who can use the models listed on the repository effectively would have any difficulty using scikit themselves.

Abstracting scikit out into a configuration file only very slightly simplifies the actual code involved, but I can see this being useful for some non-technical users who don't care about the code and just know the ML terms.

It's not that someone will have difficulty using sklearn. It's more about how clean the approach is when you have all your configs in a YAML file and can change things quickly and rerun an experiment. I work with data & ML models every day, and it becomes overwhelming when my codebase is large and I want to change small things and re-run an experiment. It would also be great not to lose much time writing that code in the first place (although it's easy to do) if you want a quick and dirty draft. The thing is, it is much cleaner to have your preprocessing methods and model definition in one file. There are also other features that will be integrated soon, like a simple GUI built in Python.
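To make that concrete, here is a generic sketch of the config-driven pattern: the model choice and hyperparameters live in a YAML document (normally a file), so rerunning an experiment is just editing the config. The schema below is invented for illustration, not igel's actual one.

```python
import yaml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

CONFIG = """
model:
  type: RandomForest        # swap to LogisticRegression and rerun
  params:
    n_estimators: 50
    random_state: 0
"""

MODELS = {
    "RandomForest": RandomForestClassifier,
    "LogisticRegression": LogisticRegression,
}

def run_experiment(config_text: str) -> float:
    """Parse the config, build the named model, return its test accuracy."""
    cfg = yaml.safe_load(config_text)["model"]
    model = MODELS[cfg["type"]](**cfg.get("params", {}))
    X, y = load_iris(return_X_y=True)  # stand-in for the user's CSV
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

print(run_experiment(CONFIG))
```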

This is a good point, something that I've been struggling with in my own personal projects is keeping track of parameters as I tweak and play with hyper-parameters and model structures.

A few parameters are fine, you can pull them out into constants, but you quickly end up with a lot of variables to keep track of.

This is so cool!

I know the answer is to just write what I'm describing myself, but does anyone know of an existing way to find the best scikit-learn algorithm for a particular problem? Like, if I want to find the regression fit, is there a way to just pass in the data and have it trained and tested on all of the regression algorithms in sklearn? My current workflow is to pick a handful of algorithms that sound like they should be good for the problem at hand and try each one of them manually. Igel seems like a step towards making this sort of thing possible, if another tool doesn't exist already.

Hi, we should be careful with the feature you're describing. The results from fitting every machine learning algorithm can be very misleading, and some models will probably overfit the data.

So if you throw in some data, fit all the machine learning models on it, and then compare their performance, you will probably get misleading numbers, since different models require different tuning approaches. It's not as easy as you make it sound: you can't just feed data (it also depends on the data) to the models and expect to get the best model as output.

One approach I can think of here is to integrate cross-validation and hyperparameter tuning with your suggestion. However, I imagine this can be computationally expensive. I will take it into consideration as an enhancement for the tool. Thanks for your feedback.
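With scikit-learn, that combination of cross-validation and hyperparameter tuning could be sketched with `GridSearchCV` (the grid and dataset here are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every (C, kernel) combination is scored with 5-fold cross-validation,
# so the "best model" comparison is at least not based on a single split.
search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```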

Thank you for explaining this more in depth. I should have been more specific with my original comment; I did intend cross-validation and hyperparameter tuning to be included in the automatic feature I was describing.

These operations certainly are computationally expensive; a recent hyperparameter tuning run locked up my laptop for 3 days, but this seems to be the case for any similar operation. The only approaches I've come across so far to overcome it are things like converting the data to smaller sizes (which seems outside the scope of this tool) and some way to batch the data so that it can be "paused" and resumed as needed. Thank you again for creating Igel.

Hey, I really appreciate your answer to this question. As I was reading the question, red flags started popping up in my mind about the risk of overfitting when using the ensemble approach, and I think your response was spot on for how an ML researcher would go about it! Most ML professionals I've talked to have been really against making a user friendly ML suite because of how easy it is to misuse these algorithms.

I think you’re looking for something like AutoML by H2O[0]. There are a few similar offerings out there if you search around for ‘automl’.

[0] https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

thank you, this is the keyword I think I was missing in my searches and now I feel silly for not thinking of it.

Triage is built for this: training and evaluating giant grids of models & hyperparameters using cross-validation. Similar to igel, it abstracts ML to config files and a CLI.

It's designed for use in a public policy context, so it works best with:

- binary classification problems (e.g. the evaluation module is designed for binary classification metrics)

- problems that have a temporal component (the cross-validation system makes some assumptions about this)


Thank you for sharing this.

I think it can be a useful tool for automation of very standardized ML tasks. However:

It's a command line tool that is also intended for non-technical folks. I sense a contradiction.

That doesn't even speak to the requirement of understanding all these ML algorithms so they can be specified in the config file, or understanding the YAML format, or data curation. At this point it would be easier to write the Python code, especially with scikit-learn, which is a very well-documented library.

Hi, I want to clear up some points. First, it is not intended for non-technical folks; this was never claimed! However, even if it were, we are currently working on a GUI that (non-technical) users can run by typing a simple command in the terminal.

Second, I'm a technical user; in fact, this is my daily work, and we built this tool for the reasons mentioned in the docs/readme, so you can check those out.

Third, you mentioned understanding the YAML format. Really? YAML is the most understandable format there is; I can't imagine a person not being able to learn YAML in 30 minutes at most.

Finally, yes, sklearn is great and well documented, but have you checked how many libraries are out there that are basically wrappers to make writing sklearn code easier and more abstract? You'll be surprised.

As discussed in the official repo & docs, it is a much cleaner approach to gather your preprocessing & model definition parameters/configs in one human-readable place, where you can manipulate them easily: re-run experiments, generate drafts, and build proofs of concept as fast as possible, rather than writing code. At the end of the day, we all have different opinions; you can still write code, of course. The tools are there to help.

The README says "The goal of the project is to provide machine learning for everyone, both technical and non technical users"; that definitely sounds as though it's intended for non-technical users.

Well, "both technical and non technical users" right?

Yes, so? You're saying it's not intended for non-technical users, which still contradicts what is said in the README. Yes, the README implies that it's not _exclusively_ intended for non-technical users. But it does imply that the tool is intended for non-technical users.

I am only going off the README, as the other user pointed out, which addresses technical and non-technical people.

So yes, this tool can have great utility. It adds an abstraction layer and removes busywork for repetitive programming tasks. However, the utility will be for users acquainted with the command line: users who know what a config file is, and the data types, lists, and key-value relationships assumed by the YAML spec. Users will also have to know the different algorithms so they can populate the config. All of these things require technical knowledge.

All of the above are things we technical users take for granted, so a claim to cater to non-technical users must be evaluated from their perspective.

I am not belittling your work - this is a good project, but it's currently targeting too broad an audience.

Usually the hardest part of a learning pipeline is data gathering and cleaning; once the data is in a suitable format (such that it is easy to create a structured CSV file), the training part is probably the easiest: just a few lines of Python code.
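For reference, those "few lines of Python" for the training step might look like this (using a built-in dataset as a stand-in for a cleaned CSV):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Once the hard cleaning work is done, training really is this short.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # held-out accuracy
```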

All parts of a learning pipeline are hard if you want to do it right. Gathering, weeding, and binning your data is meticulous and hard work, and while "a single run" is trivial, rerunning it over and over with new parameters, or even a completely different model because the outcome made no sense whatsoever, is not.

If updating a YAML file and hitting "run" makes that other "hardest part of learning" easier: hurray!

I agree. That's why some commonly used pre-processing methods were implemented in the stable release, and more is yet to come.

And, arguably, data cleaning is the most overlooked part.

From a purely UX perspective, there’s a huge difference between “no lines of code” and “a few lines of code”.

This is the giant's shoulders I like to stand on!

Need to see more projects abstracting away the hard stuff (I'm looking at you, GUI libraries!)

The great thing about this is that it is directly usable in a gui . Someone will build a gui and make it even more accessible.

I love it and plan to use it on my data as it is.

We are already working on a gui ;) stay tuned

Thanks for your feedback. Stay tuned, we are working on an integrated gui tool written in python too.

I remember how I first got interested in ML and DL. I did not know a lick of ML programming in Python or whatever other language was out there. I simply began by using Matlab's Neural Network and Machine Learning toolboxes and playing around with them. That turned into a real coding interest in Matlab, which carried over to Python, and so on and so forth. In a sense, I rediscovered programming because of those toolboxes.

What you're doing is great stuff and I hope it encourages a lot of folks to play around just as I had, just to get started.

Awesome! How do you compare this to https://github.com/uber/ludwig, which also has a YAML-based cli for ML?

Wow, this is great! I didn't know such a tool existed, thanks for posting it here. The Python & AI community is moving really fast, it's crazy!

However, it looks like Ludwig is about deep learning rather than general ML, or am I wrong? It looks like there is no support for classical ML models, or am I missing something?

I haven't tried it yet, I just read the getting started section, but it looks really great for training deep neural networks.

> A machine learning tool that allows you to train/fit, test and use models without writing code

I recently had a discussion about the requirements that a text file format (like YAML) has to fulfill to be considered "code". :)

Hi, and what was the result/conclusion of the discussion? I'm interested in your findings: is it considered code or not :D

Well, we came to the conclusion that there is no hard border, and therefore no good answer to that question.

But we also agreed that whether it's code, a graphical user interface, or a command line interface is not the most important factor in making a tool usable for a layperson. What's more important is that the entry point is easy, and that the complexity and flexibility are abstracted away in layers that don't have to be fully understood from the beginning, so the learning curve is not too steep.

Of course, my first post was not meant to be criticism of the project, just some pseudo philosophical thoughts that crossed my mind when reading that sentence. Sorry for being too off topic with that. :)

Nice work! This is def gonna be useful for a lot of people. Companies can benefit from using something like this too. It helps when all your people use the same tool to build and run models and only need to share YAML/JSON files. The alternative is each group having its own scripts, which makes sharing harder.

The simplicity of such tools has a tradeoff, though: it takes away some of the flexibility. Also, actually writing code helps people learn about ML, so there's benefit in writing the hard code when building ML models. But that's not the target audience of this project, so that's OK.

Good luck!

I really love seeing projects like this.

My humble two cents is that in past projects I've often found the process to create the CSV that goes into a model is often much more time consuming and error prone than actually training a model. I really love the dataset operations that are provided.

Do you have any thoughts about what a gold standard library for data preprocessing would look like? If you have any plans to move further in that direction? Any projects that you find compelling in that space?

Hi, thanks for your feedback. Hmm, any machine learning project has to start with a dataset. Most of the time you will have to construct it (or take an existing one and update it) manually. Sure, there are tools that generate a dummy dataset for you, but that would just be for playing around and certainly not for real-world use/production.

We are actually working on adding support for text, Excel and JSON formats in igel.

We already implemented some of the famous preprocessing methods in igel, which you can use by providing them in the yaml file.

Now about preprocessing libraries, I personally use numpy, pandas and some of sklearn functionality to preprocess data. Furthermore, I use matplotlib and seaborn for some visualisation & further analysis.

"non-technical" and "cli tool" sound like an oxymoron. But if you hide the yaml config behind a UI i guess it can pass for "non technical".

We're already working on a GUI that users can launch with a command. You can check the issues list.

Make it (double) clicking an icon on a desktop/app list, and you have a winner. The moment a terminal is needed, you've lost the non-technical crowd (and some of the technical crowd, even)

Great idea - I’ve been waiting for a good CLI tool for machine learning. It saves the hassle of having to learn Python, and it can also be used with other existing shell tools.

Great idea! We had a similar idea back in 2015 with SKLL[1]. We are still actively maintaining it and it’s definitely been helpful to many folks, including many outside our organization, over the years! Wishing you the best!

[1] https://github.com/EducationalTestingService/skll

Thank you for sharing!

I was thinking about starting something like this, and had reached out to datasette [1] / Simon Willison for advice on starting and maintaining a project.

[1]: https://simonwillison.net/2017/Nov/13/datasette/

I'm more familiar with a different "i-gel" (which does a similar thing for emergency airway management as this does for ML, allowing less trained users to still achieve "advanced" results)


Glad I wasn't the only one whose first thought went to protecting someone's airway when something else fails!

To be really automatic, the only thing I should need to do is feed in a CSV, correct the suggested data types, and then run all algorithms on the data, with information at the end about which has been most effective and which I should optimize further.

Sure, but shotgunning every statistical test and machine learning algorithm completely undermines your results, because the power and significance levels are not adjusted for many experiments. In the statistical setting this leads to spurious correlations, and in the machine learning setting it leads to overfitting. In either case the results have a high risk of not generalizing beyond the initial sample being analyzed.

I'm not saying you're endorsing this, but it's basically antithetical to sound experimental design. I don't think the author should pursue automatic anything when it comes to statistics, unless it's just a thin quality-of-life wrapper around other statistical primitives and libraries.

There are, for example, multiple decision tree learners and rule learners. Each one has different semantics and works differently on the data. Just running every one and seeing which performs best is a completely normal approach.

And with k-fold cross-validation it's very hard to overfit.

So over-fitting can be solved by:

1) Using cross-validation / a validation set.
2) Regularisation.
3) Finding statistically significant features (e.g. chi-square).

Why can't this be done automatically? I.e., what is the human advantage?
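For what it's worth, points (1) and (3) can be sketched in a few lines of scikit-learn (dataset and model are illustrative; chi-square scoring assumes non-negative features):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Chi-square feature selection (3) feeding a classifier, all scored with
# 5-fold cross-validation (1) so the estimate isn't based on one split.
model = make_pipeline(SelectKBest(chi2, k=2), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(round(scores.mean(), 3))
```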

It's not that you intrinsically need a human, it's that doing this without human oversight requires being very careful not to make tricky mistakes.

The nature of statistical significance (which underpins everything you've said), is that repeating many experiments reduces the confidence you should have in your results. Supposing each algorithm is an experiment and each experiment is independent, if you target a significance level of p = 0.05, you can expect to find 1 correlated feature out of every 20 you test just by chance.
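That expectation is easy to demonstrate with a quick simulation: correlate pure-noise features against a random target at the 0.05 threshold, and roughly 5% of them come out "significant" by chance.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
y = rng.normal(size=200)          # random "target", unrelated to anything
n_features = 1000

# Count pure-noise features whose correlation with y has p < 0.05.
false_hits = sum(
    pearsonr(rng.normal(size=200), y)[1] < 0.05 for _ in range(n_features)
)
print(false_hits, "spurious 'significant' features out of", n_features)
```

With 1000 independent noise features, the count lands near 50, i.e. about 1 in 20.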

Can you automatically correct for this? Sure. But this is just one possible footgun. Are you confident you're avoiding them all? In theory automation could do an even better job than a human of avoiding the myriad statistical mistakes you could make, but in practice that requires significant upfront effort and expertise during the development process.

At a certain point doing this automatically becomes analogous to rolling your own crypto. It's not quite an adversarial problem, but it's quite easy to screw up.

I agree that cross validating would work; that's what I was gesturing to when I was talking about making an assessment of the data and partitioning it. Either the provided sample should be partitioned for cross validation, or it should prompt the user for a second set.

Correct. My point is that both humans and machines face the same issues. At least with a machine you get consistent errors (that do not cost you time), which you can decrease over time.

With humans, you must make sure that the same human with the same skill set, who knows stats at a master's level, will always be there for your specific data and will actually have the time to do the experiments.

Also, I think that 95% of the potential users/consumers of machine learning are non-consumers, i.e. they do not have ANY access to machine learning tech, and thus have to resort to guessing.

So the ethical thing to do is actually give them some tool even if it might not be optimal.

this is a great feature! Thanks for the feedback

No, it's emphatically not a great feature, and it's not clear to me the commenter was recommending that so much as making a nit. Please don't automate the process of choosing and running algorithms on a single sample of data, it's unsound experimental design that undermines your results. If you insist on doing it anyway, at minimum you will need to automate an initial assessment of the sample data to determine if it has a suitable size and distribution to allow you to adjust the significance of results for the number of tests you're running, and partition the data into smaller subsamples.

Hi, thanks for your comment. I actually understood that he meant something like a hyperparameter search/tuning using cross-validation (at least, that's what came to my mind).

Parameter tuning and algorithm selection! I just don't want to manually start 5 different runs of algorithms I believe could work well on the data and manually compare the results. And maybe I was too lazy to run the 6th algorithm, which now performs much better.

But to be sure, every test should be done with k-fold cross-validation. Whether to split the training set should not be left to the user; it's crucial that this is mandatory!

Cross validation would be good! I think if you build this in you could automatically run a few heuristics to see if the data can be partitioned, or maybe just prompt the user for another sample of the data with the same distribution.

This is going too far; it looks like nobody at Hacker News understands machine learning.

Well, it's probably almost the truth.

Off topic, but I recently found myself using the phrase "pretty much exactly". I realized it's nonsense, because the first part contradicts the meaning of "exact". I vowed to never use that phrase again.

I feel the same about "probably almost the truth" (not a criticism, just a thought) - unless truth is a range (100% true to 100% false) rather than a binary (either true or false).

Truth is a range though, for the vast majority of things.

What does “IGEL” stand for? I couldn’t find it in the documentation.

It's a German word and means hedgehog.

It's funny; we were discussing a name for the project and wanted to make an abbreviation from words that make sense, so we started throwing out ideas spontaneously. In the end we wanted to make an abbreviation of these words: "Init, Generate, Evaluate Machine Learning".

IGEL made sense for us then, since it's a German word too. Easy to say, type and remember ;)

Not sure if there is any deeper meaning hidden behind it, but it is the German word for hedgehog (pronounced like "Eagle").

Interesting, igel is Swedish for leech (Egel in German). The Swedish word for hedgehog is instead igelkott. In short, Egel = igel and Igel != igel... TIL

Indeed interesting :D

Interesting. That makes sense since the logo is a hedgehog. What’s the connection though I wonder.

Not sure if you're aware but https://igel.com (thin client manufacturer) is a thing.

OMG this is funny :D I wasn't aware of this, thanks. Looks like it's a thing, and the company could even skyrocket in the near future! I only read the About Us section there.

I also noticed they pronounce the name differently. Igel is a German word pronounced like "Eagle", but they pronounce it "I-jeel".

thank you for the link!

I think long term this is the future of ML. It's like a database: every engineer needs to know when to use one and how, but not every engineer needs to be able to write a database.

Coincidentally, poor understanding of your tools (especially databases) seems like a huge source of frustration and pain for anyone involved in software development.

you can have a poor understanding of your db and still build a billion dollar company

You _can_ win the lottery too, that statement asserts exactly nothing.

This is already how most ML in production works, though. No one writes their own NN, optimizer, or even linear regression, and for good reason.

I feel like the places where a non-technical user would be building non-trivial models also have the money to pay for one of those commercial GUI drag boxes around tools.

This is really cool!

Keep going with this, I think you are onto something.

So, a lot like Weka then. https://www.cs.waikato.ac.nz/ml/weka/

Who else thought this was something that would turn AI loose on your bash commands, and automate everything in your CLI?

If you are disappointed, there's mcfly, which does things with your shell history and ML: https://github.com/cantino/mcfly :-)

"Automate everything" is a disingenuous claim. You're simply replacing a couple of lines of scikit-learn with a couple of lines of your CLI tool. There is pretty much no benefit to using this.

There's most definitely a benefit - turning "writing the code yourself" into "updating a config file" is huge: I can now write code that creates config files, which is stupidly easy, instead of having to write code that writes code, which is stupidly hard.

The title is a complete misnomer, but the project itself is perfectly useful. As a programmer, any programming I _don't_ have to do is time and money saved.
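As a sketch of that point (with a made-up schema, not igel's actual one): generating config files programmatically for a parameter sweep is trivial, whereas generating code is not.

```python
import yaml

# One config per hyperparameter value; in practice each would be written
# to its own file (e.g. sweep_10.yaml) and handed to the CLI tool.
configs = []
for n in (10, 50, 100):
    cfg = {"model": {"type": "RandomForest", "params": {"n_estimators": n}}}
    configs.append(yaml.safe_dump(cfg))

print(configs[0])
```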

Well, the user writes a description in a human-readable format. Then the tool takes that description and runs the pipeline, from data reading and preprocessing through creating and evaluating the model. If this isn't automation, please define automation for me.

Also there are new features that I'm working on. The stable release was done this week.

From what I can tell, it's a declarative framework (while most other common ones are imperative). And as generally with the tradeoffs of a declarative approach: if the data/model is easy, the baked-in assumptions require less input, while if it's more complex, your config files will be equally verbose and/or you will run into a wall. I don't see much automation there, just abstraction.

Interesting opinion. Well, I must disagree on some points. First, yes, the tool uses a declarative paradigm, which is the goal of the project. If you want to use ML without writing code, then you certainly won't want an imperative framework.

Second, I must disagree that most other common frameworks are imperative. I would say they're a mix of declarative & imperative, but certainly not purely imperative.

Finally, it's interesting that you see this as just an abstraction tool. I find other ML frameworks are more about abstraction, since you focus on building your model while all the details are hidden from you by the framework. Sure, igel is also about abstraction, but to say it's JUST abstraction? Hmm, I find that not quite right; instead, it's more about automating the stuff you would otherwise write yourself using other frameworks.

At the end of the day, we all have different opinions and feedback is important ;)
