Open up the ~50 different individual datasets linked in separate tabs, and then quickly flip through all of them trying to get a sense of what each one is.
That experience demonstrates one of the main challenges we're aiming to solve by making Kaggle Datasets your default place to publish data online (https://www.kaggle.com/datasets).
Some friends and I created Quilt to bring versioning and packaging to data: https://quiltdata.com/. The interface is the familiar Python lifecycle of install and import.
We also encourage users to share the analysis code and visualizations they create on the data back with the community. For example, see all the visualizations and insights built on StackOverflow's developer survey data, linked from https://www.kaggle.com/stackoverflow/stack-overflow-2018-dev...
There's some supervised ML use of those, and a lot more open-ended exploration, visualization, cleaning, clustering, language modeling, etc.
Unsupervised techniques work really well for language modelling.
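As a toy illustration of that point (everything below is made up for the example, not taken from any dataset in the thread): a character-bigram language model can be fit on raw, unlabeled text with no labels at all, and then score how "in-domain" a new string looks:

```python
from collections import Counter
import math

def fit_bigram_model(corpus):
    """Fit character-bigram counts on unlabeled text -- no labels needed."""
    pairs = Counter()
    unigrams = Counter()
    for text in corpus:
        padded = "^" + text + "$"  # start/end markers
        for a, b in zip(padded, padded[1:]):
            pairs[(a, b)] += 1
            unigrams[a] += 1
    return pairs, unigrams

def log_prob(text, pairs, unigrams, vocab_size=100):
    """Add-one-smoothed log-probability of a string under the model."""
    padded = "^" + text + "$"
    lp = 0.0
    for a, b in zip(padded, padded[1:]):
        lp += math.log((pairs[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return lp

corpus = ["the cat sat", "the dog sat", "the cat ran"]
pairs, unigrams = fit_bigram_model(corpus)
# In-domain text scores higher than gibberish of the same length:
assert log_prob("the cat sat", pairs, unigrams) > log_prob("zqx vbn kjh", pairs, unigrams)
```

Real language models are vastly bigger, but the training signal is the same: the text itself, no annotation required.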
There are also weak supervision and distant supervision, where the labels are "noisy" or not exactly what you want.
You're right that strong supervision, where you basically trust your class labels, works really well; it's probably the easiest case.
Combining unsupervised (e.g. pre-trained language models) with a very small set of strongly labeled data, or a larger set of weakly labeled data, seems to work pretty well too.
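A minimal sketch of the weak-supervision half of that recipe (the labeling rules, class names, and texts here are all hypothetical, just to make the pattern concrete): heuristic labeling functions produce noisy labels on unlabeled text, a simple word-count model is trained on those, and a tiny strongly labeled set is held out just to sanity-check it:

```python
from collections import Counter

def lf_keywords(text):
    """Hypothetical labeling function: a noisy heuristic standing in for
    distant supervision. It can abstain by returning None."""
    if "refund" in text or "broken" in text:
        return "complaint"
    if "thanks" in text or "great" in text:
        return "praise"
    return None

def train_from_weak_labels(unlabeled, labeling_fn):
    """Count word/label co-occurrences using only heuristic labels."""
    counts = {}
    for text in unlabeled:
        label = labeling_fn(text)
        if label is None:
            continue  # the rule abstained on this example
        for word in text.split():
            counts.setdefault(word, Counter())[label] += 1
    return counts

def predict(text, counts):
    """Vote each word's label counts; return the majority label."""
    votes = Counter()
    for word in text.split():
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None

unlabeled = [
    "please send a refund",
    "the screen arrived broken",
    "thanks for the great service",
    "great product thanks",
]
model = train_from_weak_labels(unlabeled, lf_keywords)

# A very small strongly labeled set, used only to sanity-check the weak model:
gold = [("my package arrived broken", "complaint"), ("great support", "praise")]
accuracy = sum(predict(t, model) == y for t, y in gold) / len(gold)
```

The point of the toy: the model generalizes past the keyword rules themselves (it gets "my package arrived broken" right via "arrived", which no rule mentions), which is exactly what you hope the cheap noisy labels buy you.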
(Because if you know a priori what it is that you want to measure, it's supervised.)
One of the features was a subjective rating of how much I liked some of the women, and scikit-learn then suggested to me other women in the clusters that had my best ratings. It turns out that I like vegetarians, redheads, and left-wingers. Which happens to be true, even though I eat meat and do not identify as left-wing. But those traits correlate with _other_ traits that are more difficult to measure objectively, such as caring about children, liking to hike, and preferring an evening of sex to an evening of television.
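For anyone curious what that workflow looks like, here is a toy sketch (the commenter used scikit-learn; this uses a minimal pure-Python k-means and made-up feature names so the snippet stands alone): cluster profiles on a few binary traits plus a subjective rating, then inspect the traits of the highest-rated cluster:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]  # keep old centroid if cluster emptied
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical profile features: (vegetarian, redhead, left_wing, my_rating)
profiles = [
    (1, 1, 1, 9), (1, 0, 1, 8), (1, 1, 1, 10),
    (0, 0, 0, 2), (0, 1, 0, 3), (0, 0, 0, 1),
]
centroids, clusters = kmeans(profiles, k=2)

# The centroid with the highest mean rating shows which traits
# co-occur with high ratings:
best = max(centroids, key=lambda c: c[-1])
```

On this made-up data, the high-rating centroid has vegetarian and left_wing at 1.0: the clustering surfaces the traits that correlate with the ratings, which is the effect the commenter describes.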
It is a very expansive collection of datasets, some well-prepped for ML and most not (which is part of the fun of it, anyway).
opendatanetwork.com: this is effectively a Google for public Socrata data portals, and for me, the best way to discover datasets across different municipalities. For example, when I was interested in trying to replicate the NYT's "Do ‘Fast and Furious’ Movies Cause a Rise in Speeding?" article, it was pretty easy to find a bunch of other traffic/motor vehicle violation datasets with opendatanetwork's search.
Enigma Public (https://public.enigma.com): a huge collection of scraped public datasets, including flattened versions of data that originally comes in annoying-to-parse formats, such as U.S. lobbying disclosures.
The collection is good, though; it's sad that it looks like it is stealing from the sources.