Organizing this data would also be a hell of an effort because the maps use different projections, are from a huge variety of times, and are often inconsistent (overlapping areas with different mineral deposit analyses).
I suppose I can dream, though.
More than that, even if the map can be used more freely in the country (by multiple parties), it may be illegal for the map to cross the border in any form. This also covers electronic signals over the internet if they cross the border.
I guess the reason some of them are getting out in the open is because they are no longer considered "critical".
Disclaimer: My work intersected the oil and gas domain at some point.
Also, this is why some believe the war in Afghanistan was even waged: to get "boots on the ground*" on top of the $1T in mineral deposits.
While we kvetch constantly about stupid political distractions, there are, in fact, many government-corporate entities that are playing the long game.
Look at even the completely obvious:
and even as openly published ten years ago:
The point is that we are already in the long war for resource control. And sadly, I would say that the US does not have the upper hand at this time, aside from building up the military...
> I don't have time to write the essay on this, but we definitely DON'T wage war to get "boots on the ground". That's absurd, and if true, why go to that effort when there is the rest of the world to put your boots on without war? That's one of the biggest problems. The Afghan Geo Survey was at PDAC pleading with people to come work in their country, but I for one won't be going AND NO ONE is going to put money into that huge risk. Why would they, with the rest of the world available? To make a real map you have to have boots on the ground (like I've done for 40 years), not a computer with .tab and .shp files. That is a derivative product that comes after collection of real data that costs money and is proprietary intellectual capital. Nobody sees my map without a non-disclosure agreement or confidentiality agreement. And that happens across the world for any private expenditure. Exxon, Total, Shell, etc. possibly have the most vast earth science databases created by mankind. They represent $trillions of investment, and those companies are not about to make them publicly available.
Publicly available earth science information is created by, curated by, and made available through governments, i.e. tax $. Thus, it is meager, everywhere.
When I said "get boots on the ground" I was also saying that the point of having a military presence is a long-term play, ensuring that we secure some avenue of access to minerals at some point in the future, and this aligns with the corporate goals as well.
Maybe I worded it poorly, but for the most part I feel I am saying the same thing as your friend, though I am more cynical about the motives of our government/military and more suspicious that they are much more cunning than they would want the general populace to believe.
To search the catalog from GA go to:
There is an emphasis on open formats and using open source software.
GA data is also extensively used by the Australian National Map at:
GA data is also available from data.gov.au
Discoverability for openly released scientific datasets is a huge problem in general. While some enterprising folks have worked on adding parsers for scientific data formats such as NetCDF and HDF5 to Apache Tika (which can then be indexed by Solr/Elasticsearch/whatever), the vast majority of scientific file formats don't have parsers available. Even worse, in the climate of publish-or-perish, most scientists are unaware of metadata extraction / indexing tools or unlikely to prioritize incorporating them, even though these would make their data more readily searchable based on relevant metadata (such as equipment settings, etc).
I have some personal experience in this area: when I was working as a research assistant, I basically did helpdesk support for an open access dataset, answering questions from researchers at other institutions. I'd estimate that of the questions I received in my inbox, close to 70% could have been resolved with a good implementation of faceted search. A related issue I encountered is that rather than relevant metadata existing alongside a dataset, sometimes I'd have to dive into an article's methods section to find it, often in a weird place that wasn't obvious at first glance due to the obtuse writing style that is encouraged for scientific publications.
The bigger problem, however, is that the culture of science in academia right now puts way too much emphasis on flashiness over sustainability and admittedly non-sexy tasks like properly versioning and packaging scientific software, documenting analyses, and producing well-characterized datasets.
The platform understands a lot of things already and is able to organize/chart/map/visualize automatically, and it is extensible where it does not.
> Faults of the world that affect mankind are mostly known and well mapped at the surface and below the surface through assorted geophysical means. We can't see there otherwise. Google Earth has made a tremendous contribution to mankind and there are many efforts to make Google geo-earth at least in small chunks. Universities are the best homes for this type of effort.
Quality of data is a huge factor. The location of the San Andreas Fault is very well understood. Yet 20 years ago, there were NO geologic maps of even rudimentary quality for the San Francisco Bay area. Maps, yes. Useful, printed or digital? No. I doubt that has changed. Reason? Politics, funding, right people to do the job, etc. BART and bridge routes are well studied for engineering purposes and should be public information, but try to find a geology map derived from that suitable for public use.
When I work in foreign countries and even in the US, there are regional maps, say 1:1,000,000. But that offers very little value. Now I am writing the book on the geology where I am working. I am the first one there to figure out what I am looking at and the potential value for any mineral commodity. You can't do what I do from a computer or satellite. You HAVE to be boots on the ground.
The way forward for advancing geologic understanding is quality mappers on the ground knowing what they are looking at, translating that to useful information that conveys both facts and interpretations and then that to digital.
We are a looooong loooong way from having geology Google Earth, but if we did, it would be great for mankind. Raise taxes?
As for data being "locked away" in formats, well, no one's really come up with the One True Geospatial format, so you've got TAB and SHP. I prefer TAB, because I hate column name limitations. Use QGIS [1] or GDAL [2] to get the data into a format you want, or for really heavy lifting, apply for a home use licence for FME Desktop (which uses GDAL among other things).
TL;DR: Mineral maps are worth cash money, and that's why you can't get at them.
1: https://qgis.org/en/site/ - can a benevolent billionaire please throw a couple of million dollars at this thing?
2: https://www.gdal.org/ - GDAL makes the world go round (or ellipsoid, depending on projection)
I imagine the relevant metric would be "importance / current number of seeders".
I wonder if you can set up a distributed-hash-table sort of thing that lets you reliably query for less-popular torrents. Like a magnet-link system that supports top-k queries.
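A rough sketch of that ranking, assuming each torrent carries a hand-assigned importance score (hypothetical; nothing here defines how importance would be measured) and a current seeder count. A torrent with zero seeders is treated as maximally urgent. This is a local top-k over a list, not a DHT query, and the field names are made up.

```python
import heapq


def priority(importance, seeders):
    """importance / current number of seeders; unseeded torrents rank first."""
    if seeders == 0:
        return float("inf")
    return importance / seeders


def top_k(torrents, k):
    """Return the k torrents most in need of additional seeders."""
    return heapq.nlargest(
        k, torrents, key=lambda t: priority(t["importance"], t["seeders"])
    )


# Hypothetical catalogue entries, for illustration only.
catalogue = [
    {"name": "popular-dataset", "importance": 10, "seeders": 100},
    {"name": "niche-dataset", "importance": 5, "seeders": 1},
    {"name": "orphaned-dataset", "importance": 1, "seeders": 0},
]
most_urgent = top_k(catalogue, 2)
```

Doing this across a DHT would additionally need some way to aggregate seeder counts from nodes you don't trust, which this sketch does not attempt.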
What steps are in place to ensure this over reasonable timescales (20-50 years)?
We also run the project ShortScience.org! Check it out!
This is the key thing for me, so there are no guarantees of the data being available? Or is all the data backed by an owner hosted box (and backups)?
I picture a system where a person adds themselves as a node to a certain group of files, like BitTorrent, but instead of downloading everything, they choose how much space and bandwidth they're offering and the system grabs the "best" pieces. Best in this case is about spreading the data out to improve resiliency, like any data storage service would. It's not a human choice but an optimization problem.
If 250 people each offered 100MB of space, a 5GB file could be maintained with quadruple redundancy. Nodes come and go, and the system would minimize data transfers while aiming to maximize redundancy. Try to spread the files far and wide geographically, to put popular pieces of the data onto nodes offering more bandwidth.
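As a toy version of that optimization, here is a greedy sketch. It assumes 100MB chunks, so the 5GB file (taken as 5000MB) is 50 chunks and each 100MB volunteer holds exactly one. All names are made up, and it ignores bandwidth, geography, and nodes coming and going.

```python
def place_chunks(n_chunks, node_slots):
    """Greedily give each node the least-replicated chunks.

    node_slots maps a node id to the number of chunk-sized slots it offers.
    Returns (placement, replicas): which chunks each node stores, and how
    many nodes hold each chunk.
    """
    replicas = [0] * n_chunks
    placement = {node: [] for node in node_slots}
    for node, slots in node_slots.items():
        # Sort chunk indices by current replica count; ties go to the lower index.
        neediest = sorted(range(n_chunks), key=lambda c: (replicas[c], c))
        for chunk in neediest[:slots]:
            replicas[chunk] += 1
            placement[node].append(chunk)
    return placement, replicas


# 250 volunteers offering one 100MB slot each, 50 chunks of 100MB.
volunteers = {f"node{i}": 1 for i in range(250)}
placement, replicas = place_chunks(50, volunteers)
# Every chunk ends up on the same number of nodes: 250 slots / 50 chunks = 5.
```

Under these assumptions each chunk lands on five nodes, i.e. one copy plus four redundant ones. A real system would have to re-run something like this incrementally as nodes join and leave, while keeping data transfers low.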
Hmm. This sounds like a fun project to try...
If you split a 5GB file into (20) 250MB chunks and then mirror each of them four times, there is a nontrivial probability that one of the chunks loses all four mirrors. Especially when you're using notoriously unreliable volunteer hosting.
If you split the same 5GB file into (20) 250MB data chunks and (60) 250MB erasure chunks, you consume the same amount of total storage but have to lose 61 chunks (>75% of all the hosts) before you lose any data.
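To put rough numbers on the comparison, here is a back-of-the-envelope sketch. It assumes every host holds one 250MB piece and fails independently with the same probability (a simplification), and that the erasure code is ideal, meaning any 20 of the 80 chunks are enough to rebuild the file.

```python
from math import comb


def p_loss_mirrored(n_chunks, copies, p_fail):
    """File is lost if ANY chunk loses all of its copies."""
    p_chunk_gone = p_fail ** copies
    return 1 - (1 - p_chunk_gone) ** n_chunks


def p_loss_erasure(n_total, n_needed, p_fail):
    """File is lost if fewer than n_needed of the n_total chunks survive."""
    max_tolerated = n_total - n_needed
    return sum(
        comb(n_total, k) * p_fail ** k * (1 - p_fail) ** (n_total - k)
        for k in range(max_tolerated + 1, n_total + 1)
    )


# Same 20GB of total storage in both layouts; assume half the hosts vanish.
mirrored = p_loss_mirrored(n_chunks=20, copies=4, p_fail=0.5)
erasure = p_loss_erasure(n_total=80, n_needed=20, p_fail=0.5)
```

With half the hosts gone, the mirrored layout loses data roughly 70% of the time under this model, while the erasure-coded layout stays well under one in ten thousand. Real host failures are correlated, so treat the exact figures as illustrative.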
Edit: https://git-annex.branchable.com/design/iabackup/ has more details.
This is not that different from how current millennia-scale data storage has worked for... well, millennia. The primary challenge is in finding a universal key (if somebody comes along a million years later, will they share enough context to decode the archive?)
It doesn't really scale, but then neither does the Clock.
Realize it wouldn’t scale, would cost money etc, but could be interesting
In the past, I have used a certain wide-open, read-only genomics database (not going to name it so it doesn't get hammered by HN).
Other posters are right about services such as BigQuery but I think there's a place for an open source project here that interfaces SQL to databases through a layer that adds caching, throttling and more services on top of that. That's how you make it scale.
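A minimal sketch of what such a layer could look like, using SQLite as a stand-in backend. The class name, the single global rate limit, and the unbounded in-memory cache are all assumptions for illustration; it only makes sense for read-only data, since cached results are never invalidated.

```python
import sqlite3
import time


class CachedThrottledSQL:
    """Read-only SQL front end that caches results and rate-limits the backend."""

    def __init__(self, conn, min_interval=0.0, clock=time.monotonic):
        self.conn = conn
        self.min_interval = min_interval  # seconds required between backend hits
        self.clock = clock
        self.cache = {}
        self.last_hit = None
        self.backend_calls = 0

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        if key in self.cache:
            return self.cache[key]  # served without touching the database
        now = self.clock()
        if self.last_hit is not None and now - self.last_hit < self.min_interval:
            raise RuntimeError("rate limit exceeded, retry later")
        self.last_hit = now
        self.backend_calls += 1
        rows = self.conn.execute(sql, params).fetchall()
        self.cache[key] = rows
        return rows


# Hypothetical table standing in for a public dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (name TEXT)")
conn.execute("INSERT INTO genes VALUES ('BRCA1')")
layer = CachedThrottledSQL(conn)
first = layer.query("SELECT name FROM genes")
second = layer.query("SELECT name FROM genes")  # cache hit
```

A production version would want per-client limits, cache eviction, and a query allow-list, none of which are shown here.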
The Dremio project (open source by the backers of Apache Arrow) has a SQL REST API that converts a standard SQL dialect/datatypes to the underlying systems. I think that's a good start and Dremio has a ton of other awesome functionality like Apache Arrow caching.
Simple model is expose an expression language (even could be not SQL, like jsoniq, or other expression languages), mapper from that to SQL, web service API on top with a pluggable connector model.
I say that I'm going to start an open source project around this all the time but haven't gotten the momentum to do it. Argh!
Thanks to the recent big-data NoSQL craze this should be easy to find off the shelf. As an example, I know MongoDB has all this built in; not sure it is designed with untrusted nodes in mind, but otherwise it fits all requirements.
There’s one feature of FaunaDB that stood out: the ability to create a seemingly unlimited number of nested databases, where the db assigned to you appeared to be the root/full database, with full query/write privileges.
Imagine a public dataset that you could query but also update with your own findings and metadata, and choose to have it shared upstream or keep private. The engine could diff your work to avoid duplicating a copy of the data. You could sync it locally for speed if needed.
Historically, as empires collapse, they burn their libraries, but they usually don’t get to the copies that are kept on the outskirts.
On one hand, they seem to converge towards a consensus, with the most seeded and downloaded files and popularity as a trust factor. On the other, they also promote the dissemination of ideas the knowledge of which poses a threat to the status quo, that is, the state into which a society was coerced.
On one hand, torrents are about rejecting the Publisher and Big Media status, but on the other they are about arriving at a democratic status about which films/books/... are the best or most useful.
And don't even get me started about the constant ethical dilemmas associated with sharing and who should control or own the data.
To link all those threads into a broader topic, we could associate the torrent subculture with the Dionysian archetype which Nietzsche wrote about.
Actually, one of the biggest uses of torrents is to disseminate pop culture materials that fall right in the middle of US culture. Probably dwarfing "radical" stuff by many orders of magnitude.
Only when it was unpopular did it do the former.
Popularity was brought about by its being templated, packaged, and sold, thus bringing it into the status quo's portfolio.
In one of my private trackers there is a person with a seedbox that downloads every single torrent as soon as it is uploaded, and they have been doing so for quite a few years now.
This ensures that while some things will, indeed, be seeded more, nothing quite vanishes.
Then again, the form of media on that specific tracker is fairly small, so it is not prohibitively expensive to archive everything. One raw Blu-ray ISO of a movie elsewhere could take as much space as thousands upon thousands of torrents on that specific tracker.
Maybe something of utility would be a distributed torrent system that is a bit more closely tied to the tracker, where membership would require you to integrate into the swarm by automatically downloading a percentage of the entire corpus, ensuring the health of the tracker.
So a new peer would be bearing part of the load of having everything be accessible.
I think this would require decently heavy curation, but I could see how it could be useful for something like the OP specifically, where having scientific papers lost for good would be a shame.
Many had the same hopes for the Internet and social media, for example. But when these things became valuable - influential - powerful interests acted to control and manipulate them, to obtain money, political power and social outcomes. It's hard to claim that the results are that people are choosing information that is "the best or most useful".
I think politics and social outcomes, such as status quos, are unavoidable results of human interaction. Eliminating rules eliminates the protection against arbitrary power and returns us to the world of despots. The politics is unavoidable; the question is, how do we want to manage it?
EDIT: Some major edits; sorry if you read an earlier version.
Modern democracy, with its focus on leaders and representation, still gives us extreme power and economic imbalances and is arguably a barely disguised oligarchy.
I think we agree that getting rid of all the rules is a bad thing, though.
So as soon as someone distributes some data via a torrent, everybody starts asking if it is legal to use that data. When the data is offered via a download link on some website, most people assume that they got the data through a legal channel.
There are still legitimate reasons to use torrents apart from piracy. I particularly like the physibles section of TPB https://thepiratebay.org/browse/605 which are not inherently illegal (well apart from the 3D printable guns which are a bit iffy and in a legal gray area).
I've considered adding a torrent download, because it includes built-in verification of the download. A common problem is users reporting that their download over HTTP is corrupt, but I'm not sure if they'd be able or willing to use BitTorrent.
(Also, for many users the download is probably fine, but they can't open it in Excel. BitTorrent won't help with that.)
I'd expect your users to be willing to set up a torrent client. It's not even difficult.
The few times I've published files over bittorrent I've had to reassure people that torrenting itself isn't any more illegal than other download methods.
It's also not clear ahead of time how difficult it's going to be.
If torrenting is the only way, some people won't bother.
Bandwidth is no problem for us, we have a faster connection than all the users. Users in Africa and Latin America would probably benefit most, but I'd need to research whether they'd be prepared to use Bittorrent before implementing this.
I really wish we could get basic security concepts added to the default curriculum for grade schoolers. You shouldn't need a PhD in computer security to know this stuff. These site creators have PhDs in other fields, but obviously no concept of security. This stuff should be basic literacy for everyone.
Arguably, one compromised PKI x.509 CA jeopardizes all SSL/TLS channel sec if there's no certificate pinning and an alternate channel for distributing signed cert fingerprints (cryptographically signed hashes).
We could teach blockchain and cryptocurrency principles: private/secret key, public key, hash verification; there's money on the table.
GPG presumes secure key distribution (`gpg --verify .asc`).
TUF is designed to survive certain role key compromises.
> We would like to avoid the blind mirroring of all data.
Found at http://academictorrents.com/about.php#mirroring
I mean Academia has destroyed the scientific method, turning it into:
Who needs a PhD and what does your Professor want to prove true?
I've started to ONLY trust industry.
As long as you're not downloading copyrighted data there should be no issue with using the BT protocol on a company or academic network, provided there is no outright ban on the protocol in your network usage policy. The BT protocol itself actually lends itself quite well to large datasets such as what is hosted here due to its inbuilt error checking (so no more spending hours downloading a huge dataset only to find your connection did something silly for a second and corrupted the whole file). It can also provide much faster download speeds on popular files due to the number of peers available, whereas a normal hosting arrangement would likely provide slower speeds on popular files due to network congestion and file access speeds.
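The error checking works because a .torrent file lists a SHA-1 hash for every fixed-size piece of the data, so a client can tell exactly which piece arrived damaged and re-fetch only that one. A minimal sketch of the idea, with an arbitrarily chosen piece length and made-up function names:

```python
import hashlib

PIECE_LEN = 256 * 1024  # illustrative; the publisher picks this per torrent


def piece_hashes(data, piece_len=PIECE_LEN):
    """SHA-1 digest of each consecutive piece of the data."""
    return [
        hashlib.sha1(data[i:i + piece_len]).digest()
        for i in range(0, len(data), piece_len)
    ]


def corrupt_pieces(received, expected_hashes, piece_len=PIECE_LEN):
    """Indices of pieces whose hash does not match what was published."""
    actual = piece_hashes(received, piece_len)
    return [i for i, (a, e) in enumerate(zip(actual, expected_hashes)) if a != e]


# Simulate a 1 MiB dataset where a single byte gets flipped in transit.
original = bytes(range(256)) * 4096
published = piece_hashes(original)
damaged = bytearray(original)
damaged[2 * PIECE_LEN + 10] ^= 0xFF
bad = corrupt_pieces(bytes(damaged), published)  # only piece 2 needs re-downloading
```

A plain HTTP download with a single whole-file checksum can detect the same corruption, but it cannot say where it is, so the usual fix is to download everything again.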
P2P will consume any amount of upload bandwidth available. It's horrendous to have inside your network, as a university or research center.
Torrents are the most effective, reliable and convenient way to distribute large files; their adoption shouldn't be blocked by bad configuration and policies.
It benefits everyone because the automatic mirror selection in Linux distributions picks the lowest latency mirror automatically. That means all OSes running in the academic network will pick the public academic mirror, since it's the closest.
There's no Janet policy (UK academic network) against P2P, although there are institutional policies in place (typically for student residential networks).
To be honest it would be difficult to have a policy that wouldn't impinge on some of the more unusual protocols used in research.
We work with academic institutions to ensure they allow this service. Please report universities which block the service using the feedback button shown in the lower right of the webpage.
We also encourage HTTP seeds to be specified (aka url-lists) by the uploader to offer a backup URL which can be contacted automatically if BT is blocked. We also offer a Python API designed for clusters and university computers, written in pure Python, which supports HTTP seeds: https://github.com/AcademicTorrents/python-r-api
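For uploaders wondering what that looks like inside the file: HTTP seeds are a top-level `url-list` key in the torrent's bencoded metainfo, next to `announce` and `info`. The sketch below hand-rolls a tiny bencoder to show the structure. The tracker and mirror URLs are placeholders, the helper names are made up, and in practice you would let your torrent client or a maintained library build the file.

```python
import hashlib


def bencode(value):
    """Minimal bencoder for ints, strings/bytes, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys must be byte strings, written in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(data, name, tracker, mirrors, piece_len=256 * 1024):
    """Single-file metainfo with HTTP seeds listed under 'url-list'."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_len]).digest()
        for i in range(0, len(data), piece_len)
    )
    return {
        "announce": tracker,
        "url-list": mirrors,  # backup HTTP locations for the same file
        "info": {
            "name": name,
            "length": len(data),
            "piece length": piece_len,
            "pieces": pieces,
        },
    }


# Placeholder URLs, not real endpoints.
meta = build_metainfo(
    b"example dataset contents",
    name="dataset.csv",
    tracker="http://tracker.example.org/announce",
    mirrors=["http://mirror.example.org/dataset.csv"],
)
torrent_bytes = bencode(meta)
```

Clients that support web seeding can then fall back to the listed HTTP URL when no peers are reachable.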
(Zenodo is limited to 50GB though)
1. It appears to be sponsored by seedbox hosting companies -plus- a google ad. This is misleading (no, it is -not- directly sponsored nor endorsed by Salesforce, which is the Google Ad I see).
2. Many higher education institutions will block BitTorrent on their firewalls to prevent/reduce copyright infringement.
3. How legitimate is the data? Is there any vetting of the content to ensure that it doesn't violate copyright or that the data was legally obtained (e.g., the site scrapes)? A DMCA takedown is too late if we've already accidentally seeded infringing information, and it could harm our reputation.
4. The site claims to be "used by" a group of very big names (Stanford, MIT, UT Austin etc). Did they ask/give permission to be cited? Do they endorse the use of this service?
5. HTTPS. Please?
It's a great idea but it needs a bit more polish before I could even suggest this to my management.
All these sheltered academics were heralding it as the best thing ever, without even the basic user experience expected this decade.
Yes, I know about the chrome extensions.
It would be cool if sci-hub came with a catalogue, but it's very usable without it. And it's not just sheltered academics (or even just academics) that are using it.
For example, if someone uploads a bundle of 50 N64 ROMs, and someone tries to DMCA that link so that this service no longer provides an index to it, is there censorship resistance?
This is almost exactly what I need: A distributed index of massive datasets. I've been building a web browser that can source from such content. The idea is that you write a website which refers to some resources by SHA256, and anyone else running the browser will transmit the resource to you if they have it.
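The core of that is plain content addressing: the SHA256 in the page is both the name of the resource and the check that whatever a stranger sends you is the right thing. A toy sketch, with in-memory objects standing in for other people's browsers (class and function names are made up, and there is no networking here):

```python
import hashlib


class PeerStore:
    """Stand-in for another user's cache of resources, keyed by SHA256 hex."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key):
        return self._blobs.get(key)


def fetch(key, peers):
    """Return the first copy whose hash matches the requested key."""
    for peer in peers:
        data = peer.get(key)
        if data is not None and hashlib.sha256(data).hexdigest() == key:
            return data
    return None  # nobody reachable has an authentic copy


honest, liar = PeerStore(), PeerStore()
resource_id = honest.put(b"some game asset")
liar._blobs[resource_id] = b"something else entirely"  # simulated bad peer
result = fetch(resource_id, [liar, honest])
```

Because the hash is checked on receipt, a peer that sends the wrong bytes is simply skipped; the hard parts (finding peers, privacy, and who is liable for what gets served) are outside this sketch.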
That would let you build an emulator which can play any ROM in history, without having to explicitly download the ROMs. It's equivalent to clicking on a link to "Super mario brothers" and then seeing it play immediately in your browser. No explicit downloading.
From a long term standpoint, the vision is that you can build whatever games you want, using whatever assets you want, and nobody can tell you that you can't.
So it was funny to see this service pop up, because it's nearly an identical use case: "I have some data (ROMs), I want to make it available to everyone else (people running the emulator), and it's decentralized so no one can say no (bittorrent)." But that raises the question of scope, or whether such use cases would be welcome.
Maybe it's possible to bundle tor with the executable though, and have the network connections automatically route through tor.