Keep pursuing this and ignore the critics. What you're doing is important because ML is just out of reach of a big percentage of developers and technically minded lay people. It will take time to get your approach right, but it will make a difference.
As a suggestion: provide more real-world examples (e.g. business, sports, etc.) so that users can tinker with your samples as a pathway toward learning.
Please don't give up on this. Great job.
I will take your suggestion into consideration. You are right: more real-world examples would help users get started and see how this can be useful.
The thing is, I started the project two weeks ago, so it's still relatively new. I've been coding day and night because the idea got me excited. I published the first stable release this week; however, there are new features that will be implemented in the next releases.
However, I think it is only for building deep learning models and does not have any general ML support, or am I missing something? If so, that fact makes it very different from igel as a tool.
I see the value in having a CSV data dump and going "I wonder what happens if I run it through X," then running a CLI command to find out.
It would be neat if there were an adapter for SQLite too, IMO.
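A minimal adapter is easy to sketch in Python, assuming the tool consumes CSV files; the database, table, and file names here are invented for illustration:

```python
# Minimal SQLite -> CSV adapter sketch; names are invented for
# illustration, and this assumes the tool consumes CSV files.
import sqlite3
import pandas as pd

con = sqlite3.connect("mydata.db")
df = pd.read_sql_query("SELECT * FROM measurements", con)
con.close()

df.to_csv("measurements.csv", index=False)  # ready to hand to the CLI
```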
Keep it up Nidhal, you’re doing a tremendous service. Don’t let the snobs get to you.
Abstracting scikit out into a configuration file only very slightly simplifies the actual code involved, but I can see this being useful for some non-technical users who don't care about the code and just know the ML terms.
A few parameters are fine; you can pull them out into constants, but you quickly end up with a lot of variables to keep track of.
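For readers who haven't seen it, the kind of config being discussed looks roughly like this; the keys are illustrative rather than the exact igel schema:

```yaml
# Illustrative sketch of a YAML-driven model definition; the key
# names are assumptions, not necessarily the exact igel schema.
model:
  type: classification
  algorithm: RandomForest
  arguments:
    n_estimators: 100
    max_depth: 10
target:
  - sick
```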
I know the answer is to just write what I'm describing myself, but does anyone know of an existing way to find the best scikit-learn algorithm for a particular problem? For example, if I want to find a regression fit, is there a way to just pass in the data and have it trained and tested on all of the regression algorithms in sklearn? My current workflow is to pick a handful of algorithms that sound like they should be good for the problem at hand and try each of them manually. Igel seems like a step towards making this sort of thing possible, if another tool doesn't exist already.
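A rough sketch of that loop takes only a few lines of sklearn; with default hyperparameters the scores are a first pass rather than a fair comparison:

```python
# Rough sketch: score every sklearn regressor on one dataset with
# default hyperparameters. This is a first pass, not a fair
# comparison, since nothing is tuned.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.utils import all_estimators

X, y = load_diabetes(return_X_y=True)

results = {}
for name, Estimator in all_estimators(type_filter="regressor"):
    try:
        scores = cross_val_score(Estimator(), X, y, cv=5, scoring="r2")
        results[name] = scores.mean()
    except Exception:
        continue  # skip estimators that need extra arguments or special input

for name, score in sorted(results.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name}: {score:.3f}")
```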
So, if you throw some data at every machine learning model, fit them all, and then compare the performance, you will probably get misleading values, since different models require different tuning approaches. It's not as easy as you make it sound; you can't just feed data (it also depends on the data) to the models and expect to get the best model at the output.
One approach I can think of here is to integrate cross validation and hyperparameter tuning with your suggestion. However, I can imagine that this can be computationally expensive. I will take it into consideration as an enhancement for the tool. Thanks for your feedback.
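In sklearn terms, the combination being described is roughly a grid search; the parameter grid below is a small example, and realistic grids are where the cost explodes:

```python
# Sketch of cross validation + hyperparameter tuning in one step.
# The grid is deliberately tiny; realistic grids grow combinatorially,
# which is where the computational expense comes from.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```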
These operations certainly are computationally expensive; a recent hyperparameter tuning run locked up my laptop for three days, but this seems to be the case for any similar operation. The only approaches I've come across so far to mitigate it are things like reducing the data to smaller sizes (which seems outside the scope of this tool) and some way to batch the data so the job can be "paused" and resumed as needed. Thank you again for creating Igel.
It's designed for use in a public policy context, so it works best with:
- binary classification problems (e.g. the evaluation module is designed around binary classification metrics)
- problems that have a temporal component (the cross validation system makes some assumptions about this)
It's a command line tool that is also intended for non-technical folks. I sense a contradiction.
That doesn't even speak to the requirement of understanding all these ML algorithms so I can specify them in the config file, or understanding the YAML format, or data curation. At this point it would be easier to write the Python code, especially with scikit-learn, which is a very well-documented library.
Second, I'm a technical user; in fact, this is my daily work, and we built this tool for the reasons mentioned in the docs/readme, so you can check those out.
Third, you mentioned understanding the YAML format. Really? YAML is about the most understandable format there is. I can't imagine a person needing more than 30 minutes to learn it.
Finally, yes, sklearn is great and well documented, but did you check how many libraries out there are basically wrappers to make writing sklearn code easier and more abstract? You'll be surprised.
As discussed in the official repo & docs, it is a much cleaner approach to gather your preprocessing & model-definition parameters/configs in one human-readable place, where you can manipulate them easily, re-run experiments, generate drafts, and build proofs of concept as fast as possible, than to write code. At the end of the day, we all have different opinions; you can of course still write code. The tools are there to help.
So yes, this tool can have great utility. It adds an abstraction layer and removes busywork for repetitive programming tasks. However, the utility will be for users acquainted with the command line: users who know what a config file is, and who understand the data types, lists, and key-value relationships assumed by the YAML spec. Users will also have to know the different algorithms so they can populate the config. All of these things require technical knowledge.
All of the above are things that we technical users take for granted, so a claim to cater to non-technical users must be evaluated from their perspective.
I am not belittling your work; this is a good project, but it is currently targeting too broad an audience.
If updating a YAML file and hitting "run" makes that other "hardest part of learning" easier: hurray!
Need to see more projects abstracting away the hard stuff (I'm looking at you, GUI libraries!)
I love it and plan to use it on my data as it is.
What you're doing is great stuff, and I hope it encourages a lot of folks to play around just as I have, just to get started.
However, it looks like the Ludwig tool is about deep learning and not general ML, or am I wrong? It looks like there is no support for classical ML models, or am I missing something?
I haven't tried it yet; I just read the getting-started section, but it looks really great for training deep neural networks.
I recently had a discussion about the requirements that a text file format (like YAML) has to fulfill to be considered "code". :)
But we also agreed that whether it's code, a graphical user interface, or a command line interface is not the most important factor in making a tool usable for a lay person.
What's more important is that the entry point is easy, and that the complexity and flexibility are abstracted away in layers that do not have to be fully understood from the beginning, so that the learning curve is not too steep.
Of course, my first post was not meant to be criticism of the project, just some pseudo philosophical thoughts that crossed my mind when reading that sentence. Sorry for being too off topic with that. :)
The simplicity of such tools has a tradeoff, though: it takes away some of the flexibility. Also, actually writing code helps people learn about ML, so there's a benefit to doing the hard coding when building ML models. But that's not the target audience of this project, so that's OK.
My humble two cents: in past projects I've often found that the process of creating the CSV that goes into a model is much more time-consuming and error-prone than actually training the model. I really love the dataset operations that are provided.
Do you have any thoughts about what a gold-standard library for data preprocessing would look like? Do you have any plans to move further in that direction? Any projects you find compelling in that space?
We are actually working on adding support for text, Excel, and JSON formats in igel.
We have already implemented some of the well-known preprocessing methods in igel, which you can use by specifying them in the YAML file.
As for preprocessing libraries, I personally use numpy, pandas, and some sklearn functionality to preprocess data. Furthermore, I use matplotlib and seaborn for visualisation & further analysis.
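For what it's worth, that workflow usually looks something like the following; the file and column names are invented for the example:

```python
# Typical numpy/pandas/sklearn preprocessing, as described above.
# File and column names are invented for the example.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
df = pd.get_dummies(df, columns=["city"])         # one-hot encode categoricals

numeric_cols = ["age", "income"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])  # standardise
```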
I was thinking about starting something like this, and had reached out to datasette / Simon Willison for advice on starting and maintaining a project.
I'm not saying you're endorsing this, but it's basically antithetical to sound experimental design. I don't think the author should pursue automatic anything when it comes to statistics, unless it's just a thin quality-of-life wrapper around other statistical primitives and libraries.
And with k-fold cross validation it's very hard to overfit.
1) Using cross validation/validation set.
2) Regularisation.
3) Finding statistically significant features (e.g. chi square).
Why can't this be done automatically? I.e., what is the human advantage?
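For what it's worth, the three steps listed above are mechanical enough to script; here is a rough sklearn sketch combining them:

```python
# Sketch automating the three steps above: chi-square feature
# selection, a regularised model, and k-fold cross validation.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Features in this dataset are non-negative, which chi2 requires.
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=10)),                # statistically driven feature selection
    ("clf", LogisticRegression(C=1.0, max_iter=5000)),  # L2 regularisation, strength set by C
])
print(cross_val_score(pipe, X, y, cv=5).mean())         # 5-fold cross validation
```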
The nature of statistical significance (which underpins everything you've said), is that repeating many experiments reduces the confidence you should have in your results. Supposing each algorithm is an experiment and each experiment is independent, if you target a significance level of p = 0.05, you can expect to find 1 correlated feature out of every 20 you test just by chance.
Can you automatically correct for this? Sure. But this is just one possible footgun. Are you confident you're avoiding them all? In theory automation could do an even better job than a human of avoiding the myriad statistical mistakes you could make, but in practice that requires significant upfront effort and expertise during the development process.
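The arithmetic behind that, and the simplest automatic correction (Bonferroni), fit in a few lines:

```python
# Multiple-comparisons arithmetic behind the "1 in 20" point above.
alpha, n_tests = 0.05, 20

expected_false_positives = alpha * n_tests    # = 1.0: one spurious hit per 20 tests
p_at_least_one = 1 - (1 - alpha) ** n_tests   # ~0.64: chance of at least one false hit
bonferroni_alpha = alpha / n_tests            # = 0.0025: a simple automatic correction

print(expected_false_positives, round(p_at_least_one, 2), bonferroni_alpha)
```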
At a certain point doing this automatically becomes analogous to rolling your own crypto. It's not quite an adversarial problem, but it's quite easy to screw up.
I agree that cross validating would work; that's what I was gesturing to when I was talking about making an assessment of the data and partitioning it. Either the provided sample should be partitioned for cross validation, or it should prompt the user for a second set.
With humans, you must make sure that the same human with the same skill set, who knows stats at a master's level, will always be there for your specific data and will actually have the time to do the experiments.
Also, I think that 95% of the potential users/consumers of machine learning are non-consumers, i.e. they do not have ANY access to machine learning tech and thus have to resort to guessing.
So the ethical thing to do is to actually give them some tool, even if it might not be optimal.
But to be sure, every test should be done with k-fold cross validation. Whether to split the training set should not be left up to the user; it's crucial that this is mandatory!
I feel the same about "probably almost the truth" (not a criticism, just a thought) - unless truth is a range (100% true to 100% false) rather than a binary (either true or false).
It's funny: we were discussing a name for the project and wanted to make an abbreviation from some words that make sense, so we started throwing out ideas spontaneously. In the end we settled on an abbreviation of these words: "Init, Generate, Evaluate Machine Learning".
IGEL made sense for us then, since it's a German word too. Easy to say, type, and remember ;)
I also noticed they pronounce the name differently. Actually, igel is a German word and is pronounced like "eagle", but they pronounce it as "I-jeel".
The title is a complete misnomer, but the project itself is perfectly useful. As a programmer, any programming I _don't_ have to do is time and money saved.
Also, there are new features that I'm working on; the stable release was done this week.
Second, I must disagree that most other common frameworks are imperative. I would say it's a mix of declarative & imperative, but certainly not purely imperative.
Finally, it's interesting that you see this as just an abstraction tool. I find other ML frameworks are more about abstraction, since you focus on building your model while all the details are hidden from you by the framework. Sure, igel is also about abstraction, but to say it's JUST abstraction? Hmm, I find that not quite right; it's more about automating the stuff you would otherwise write yourself using other frameworks.
At the end of the day, we all have different opinions and feedback is important ;)