
Ask HN: What dataset do you think is useful to you but not readily available? - abhikandoi2000
We are a bootstrapped team trying to build tools for data extraction. We are currently focusing on tools for semi-structured data, which can be extracted without deep-learning-based software. So if there is some data that you need (and are willing to pay for) but that is not readily available, we might be able to help you. We are looking for different types of datasets that are actually useful to people, so that we can work towards a general-purpose data extraction tool. If you have such a dataset in mind, do let us know. Also, if you could share a website where we could find the semi-structured version of the dataset you need, it'd be really helpful.
======
ggm
Customer counts for ISPs worldwide by ASN. There are approximations for some
economies, drawn from filings for federal regulations and stock exchange
notices. There are yearly numbers for broadband, which are pretty hazy. I went
to a meeting where China declared that 160m extra online users had been found
that year.

A huge amount of internet modelling and sampling would improve at scale if we
knew this. I've discussed this with researchers in the field. Akamai and
Google and Facebook have private information which is their secret sauce.

~~~
sampkola
Can you share a link to one of the filings or notices you are referring to?
It would also be helpful if you could share what regulations are typically
enforced on ISPs (in your region) and whether they publish customer count
details to the public domain.

------
steerpike
A clean dataset of YouTube music videos associated with MusicBrainz tags.

~~~
sampkola
Hi! I am part of the team trying to build these tools. Have you checked out
[https://last.fm](https://last.fm)? I think it's possible to associate
MusicBrainz data with last.fm data. Can you share more details on how you plan
to use this data?

~~~
steerpike
Sure. Couple of points:

* Last.fm has an API, but it's not great, and an API is different from a dataset. I want to have the data available in my own DB, not have to make requests for whatever I might need.

* You can't get YouTube links from the Last.fm API; you have to crawl the track pages.

* The tags on Last.fm are a folksonomy, not a fixed taxonomy like MusicBrainz.

I have used [music-map](https://www.music-map.com/) and last.fm together to
make a [playlist generator](http://playlist.hallofbrightcarvings.com.au),
but I'd like to be able to do the same kind of thing without having to crawl
third-party sites. I'd also like to know that there was a database of
listenable music that was kept up to date.

