1). Don't use hard-coded values for types
> GET /apiv2/entries?cat=6 -- List entries that are datasets
> GET /apiv2/entries?cat=5 -- List entries that are papers
These could be written as:
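(The exact names below are only a guess at what would fit this API.)

GET /apiv2/datasets -- List entries that are datasets
GET /apiv2/papers -- List entries that are papers

or, keeping a single endpoint with a readable filter:

GET /apiv2/entries?type=dataset -- List entries that are datasets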
2). You may not need path elements like entries, entry, collection, and collection name. For example, further simplification would leave even shorter paths.
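Guessing at the resources (these names are purely illustrative), that might look like:

GET /apiv2/datasets/{id} -- fetch a single dataset entry
GET /apiv2/collections/{name} -- fetch a collection by name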
4). Use HTTP verbs in a standard, semantic way. For example, this:
POST /apiv2/collection -- create a collection
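Following the same pattern, the other verbs map naturally (paths here are illustrative, not taken from the actual API):

GET /apiv2/collection/{id} -- fetch a collection
PUT /apiv2/collection/{id} -- update a collection
DELETE /apiv2/collection/{id} -- delete a collection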
And while these suggestions may seem like nitpicking, they are not.
While not everything around REST APIs is entirely standard, there's a great deal of agreement about proper resource naming and use of HTTP verbs. If you follow these widely used standards, not only will developers find it much easier to interact with your API, but there is also a lot of tooling and client libraries out there built specifically to work with APIs that use HTTP verbs and URIs in these standardized ways.
I also found the creators here and here... maybe you can help shoot them an e-mail so that they can acknowledge a lot of the praise on this thread and the flood of helpful suggestions. Additional contributors can also be found here, as well as related publications/presentations, including a massively interesting Reddit discussion.
- Joseph Paul Cohen
- Henry Z Lo
- Academic Torrents: A Community-Maintained Distributed Repository (http://dl.acm.org.sci-hub.cc/citation.cfm?id=2616528&dl=ACM&...)
- Academic Torrents - Simple Pitch (https://docs.google.com/presentation/d/1JC2d1g9U6HaenGSn_Xvk...)
- Academic Torrents: Scalable Data Distribution (http://arxiv.org/pdf/1603.04395v1.pdf)
- I have every publicly available Reddit comment for research. ~ 1.7 billion comments @ 250 GB compressed. Any interest in this? (https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_eve...)
What would happen if I wanted to get a list of datasets and papers? (Maybe in this case it's nonsensical, but it's a problem with some other APIs I've used, and I've never figured out a good way to work around it.)
GET /apiv2/entries?cat=5&cat=6 vs 2 separate requests and client logic to combine results?
2) I don't like duplicate query params; their handling isn't well defined: http://stackoverflow.com/questions/1746507/authoritative-pos...
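To illustrate how poorly defined it is, even a single standard library gives you either "keep every value" or "last value silently wins" depending on which parser you reach for (Python here, just as an example):

    from urllib.parse import parse_qs, parse_qsl

    query = "cat=5&cat=6"

    # Keeps every value for the repeated key
    print(parse_qs(query))         # {'cat': ['5', '6']}

    # Collapsing the pairs into a dict drops all but the last value
    print(dict(parse_qsl(query)))  # {'cat': '6'}

Frameworks and servers differ in exactly the same way, which is why relying on repeated params is risky.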
For returning multiple types at once there are a few common strategies.
For related types some APIs offer an "include" or "embed" approach like this: http://www.vinaysahni.com/best-practices-for-a-pragmatic-res...
Another approach is to support a query syntax for items, where items is a container record for multiple possible types.
If a certain multi-type scenario is super common, you may want to build the concept into the API itself as basic functionality.
Again, your final choice should be as simple as possible but no simpler, taking into account ease of use, performance, etc.
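For concreteness, a rough sketch of those two approaches (the parameter and resource names are invented):

GET /apiv2/datasets/42?include=papers -- embed the related papers in the dataset response
GET /apiv2/items?type=dataset,paper -- one container endpoint; each returned item carries a "type" field telling the client what it is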
I do agree that you should have a fallback, but for reference, I've been consuming semantically-defined REST APIs for 4-5 years now, and I've never run into this problem. I agree with you that it can happen because of firewalls, etc, but it must be incredibly rare. Maybe in some corporate environments, for the usual BS reasons.
Also, if you use SSL/TLS, you won't run into this problem unless you're being MITM'd, like in some corporate environments (because no one in between server and client can tell what HTTP method is used). Use of SSL/TLS is probably why I haven't run into this in recent years.
A tip is to use a tool like axel (https://github.com/eribertomota/axel) to do direct downloads. This will usually speed it up.
Uh oh. Thanks to the repo owner for updating the README, but that's not a good situation.
I'd also like to suggest aria2c for this purpose: https://aria2.github.io/
For example, until fairly recently, if I mentioned a "torrent" to my non-technical mom, she would assume I meant ThePirateBay or something like that. Nowadays, she knows it as just another means to download files.
So very legitimate uses date back quite some time (as you would expect).
AFAIK it pioneered the concept of "web seeds": using HTTP GETs with a Range: header to fetch from a CDN any chunks that were not healthy/available in the swarm.
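In rough terms, a web-seed fetch is just an ordinary ranged HTTP request for the byte span of a missing piece (host, path, and offsets below are made up):

    GET /pub/dataset.tar HTTP/1.1
    Host: cdn.example.org
    Range: bytes=4194304-4456447

The client then checks the downloaded piece against the hash in the torrent metadata, exactly as it would for data received from a peer.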
This could take off if only a big player like Ubuntu pushed it. I don't see why we depend on a set of centralized servers for a bunch of files that a huge number of people download on a very regular basis.
And yet, the idea has been stagnant for years.
Edit: By which I mean, it works, but not enough people use it.
See this alternative that uses a similar idea but not real BitTorrent; they worked around the first problem: http://www.camrdale.org/apt-p2p/
I still don't see why Ubuntu/Debian/et al don't take this (or something like it) up in a more official manner. I can see why it's not a default of course, but it could be made a question during installation for example.
An additional benefit would be that you'd be able to source packages from machines on your local network, with fallback to the internet, and it would all be pretty much automatic and configuration-free.
For the local network part at least, it's really not that complicated to implement: all you have to do is listen for announces on the network and ask those peers before asking remotely. There is a standard example for Arch Linux in pacserve (http://xyne.archlinux.ca/projects/pacserve/), along with my own very crude reimplementation (https://github.com/rakoo/paclan).
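A very rough sketch of that idea in Python (this is not pacserve's or paclan's actual protocol; the port, announce format, and names are invented): peers broadcast an announce on the LAN, and the client tries any announcing peer before falling back to the regular mirror.

    import socket
    import urllib.request

    ANNOUNCE_PORT = 14712                          # invented LAN announce port
    MIRROR = "https://mirror.example.org/pool/"    # hypothetical remote fallback

    def find_local_peer(timeout=2.0):
        """Wait briefly for any peer broadcasting that it serves packages on the LAN."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", ANNOUNCE_PORT))
        sock.settimeout(timeout)
        try:
            data, (host, _port) = sock.recvfrom(1024)   # e.g. b"pkgserve 8330"
            return "http://%s:%d/" % (host, int(data.split()[-1]))
        except (socket.timeout, ValueError):
            return None
        finally:
            sock.close()

    def fetch(package):
        """Try a LAN peer first, then fall back to the remote mirror."""
        for base in filter(None, [find_local_peer(), MIRROR]):
            try:
                return urllib.request.urlopen(base + package, timeout=10).read()
            except OSError:
                continue    # peer doesn't have it (or went away); try the next source
        raise RuntimeError("package not found locally or remotely")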
I'm aware that you can swap out the PPAs as needed, but I would really like something distributed and decentralized.
1. Consumers can get higher speeds within their ISP's network.
2. ISPs get lower external bandwidth usage.
3. Lower resource usage for the distributor.
Relevant HN discussion - https://news.ycombinator.com/item?id=12380797
So you can seed those torrents directly in the browser with something like instant.io.
- When WebTorrent runs on the standard bittorrent network from node, that doesn't change anything: it's still not available from the browser.
- When WebTorrent runs on the WebRTC network through instant.io or anything else, it will only work if somebody else is also seeding the same torrent in their browser. Which they can only have in the browser if they first got it somewhere else. Oh, and I'm willing to bet that none of the nodes who currently have the content (i.e. on the bittorrent network) also share it on the WebTorrent network.
I don't expect classic bittorrent peers to ever implement the mess that is WebRTC just to accommodate browsers, unfortunately.
If WebTorrent were to do that itself, it would have to become a "plugin" rather than just an extension.
So start asking Mozilla/Google to implement the bittorrent protocol in the browser (or even better, implement IPFS directly, as that's a more wholesome technology specifically made for the browser).
Browser vendors shouldn't have to implement it. They should expose POSIX-like APIs (BSD sockets, file I/O) or process management + IPC via plain pipes (to talk to a native BitTorrent client) so it could be provided through an extension.
The problem with browsers is that they create a backwards-incompatible API stack. This is understandable for web content. Not so for extensions.
> In node.js, the webtorrent package only connects to normal TCP/UDP peers, not WebRTC peers.
That's why there's the webtorrent-hybrid client, which runs a hidden Electron process to communicate with both WebRTC peers and normal TCP/UDP peers. According to the readme, there's (understandably; it's running Chromium in the background) a lot of overhead with this method, so they're working toward a non-Electron version of WebRTC in Node.
Another random datapoint: When EdX/Harvard released a dataset showing how students performed/dropped out, I uploaded a copy to my S3 to mirror and linked to it from HN. I got a polite email the next day asking for it to be taken down. Academics are (rightfully, IMO) protective of their data and its distribution (particularly its attribution).
One thing I would love to see on here is stuff from ICPSR, such as its mirror of the FBI's National Incident-Based Reporting System. As far as I can tell, it's free for anyone to download after you fill out a form. It should also be free to distribute in the public domain, but for all I know, ICPSR has an agreement with the FBI to only distribute that data under an academic license.
(The FBI website has the data in aggregate form, but not the gigabytes that ICPSR does)
That project maintains a number of archival datasets, including arXiv:
Seems like an opportunity to combine efforts.
I feel like there is a danger, however, that using torrents would facilitate the thousands of nonstandard (often redundant) formats bioinformaticians seem to create.
BTSync and SyncThing are also tools to do this, and I'm sure there are FUSE things to work with BT and blockchains ("bittorrent fuse" Google results look promising).
The P2P nature of the network then helps decentralize the sources, populating several clones of the dataset.
"Dat is a decentralized data tool for distributing data small and large."
It looks like one of those logo-design-competition-sites, but for big data. Anyone compete in one of these?
How much data do you have? How much storage do you project is needed? I'm wondering how practical it would have been to use centralized storage, which has its own advantages.
- this is a mechanism for sharing files and directories (e.g., zipped csv files), whereas noms defines its own structured data model that is much more granular
- noms has versioning built in, so you can track the history of a particular dataset
- this is firmly based on bittorrent. You could maybe run noms on top of bittorrent, but it's more intended to be run like git, where you talk directly to a server that you want to collaborate with
Also, academic datasets aren't free of copyright concerns. Consider the famous ImageNet dataset for image classification. It's made of a million images pulled from Google Images. Did they get each photograph's creator's permission for such unlimited redistribution? Of course not. But there's no way the 'implied license' of posting a photo online extends that far... Like so much of the Internet, it's only possible in the absence of enforcement of copyright law.
* which is particularly frustrating because academic publishers make such enormous profits, and hosting large datasets is exactly the sort of thing they should be doing if they were remotely interested in supporting science rather than making more money
Thinking about http://opendap.org/node/305 for context.
[What is OPeNDAP: http://opendap.org/about]
Here is one of the plots from 2014: https://i.imgur.com/Ecr44AZ.png
That may have been an autocorrect "correction" of the word "subdomain".