Hacker News

I have about the same amount of data in a Postgres database as part of the TLS Observatory project [1].

    observatory=> select count(distinct(sha256_fingerprint)) from certificates;

    observatory=> select count(distinct(target)) from scans;
The scanner evaluates both certificates and ciphersuites and stores the results in the DB, so we can run complex analyses [2,3]. There is also a public client [4].

I don't have a good way to provide direct access to the database yet, but if you're a researcher, ping me directly and we can figure something out.

[1] https://github.com/mozilla/tls-observatory

[2] https://twitter.com/jvehent/status/684127067005390848

[3] https://twitter.com/jvehent/status/686938805413232640

[4] https://twitter.com/jvehent/status/687429007680376833

I have something similar, running on a Linux desktop, to analyze SSL certs. The front end is in Go, which picks certs of interest (I wanted all certs with more than one domain) and loads them into a MariaDB database. Here's the code.[1] It's amazing how much work you can do on a modern computer when you actually use it for computing.

[1] https://github.com/John-Nagle/certscan
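The multi-domain filter is easy to express with the stdlib crypto/x509 package. A rough sketch (the helper function and the throwaway self-signed test cert below are mine, not taken from certscan):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// multiDomain reports whether a certificate covers more than one
// distinct domain, counting the subject CN plus any SAN DNS entries.
func multiDomain(cn string, dnsNames []string) bool {
	domains := map[string]bool{}
	if cn != "" {
		domains[cn] = true
	}
	for _, d := range dnsNames {
		domains[d] = true
	}
	return len(domains) > 1
}

func main() {
	// Build a throwaway self-signed cert with two SAN entries to
	// exercise the filter (a stand-in for certs pulled from a scan).
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example.com"},
		DNSNames:     []string{"example.com", "www.example.com"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, _ := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	cert, _ := x509.ParseCertificate(der)
	fmt.Println(multiDomain(cert.Subject.CommonName, cert.DNSNames))
}
```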

I see you've reused the zscan DB schema. It's a good choice; it inspired our schema too. We went a bit beyond it because we also store ciphersuites (from cipherscan) and chains of trust, so our schema had to be more relational than what zscan uses.

That's awesome. If you are not already doing so, you can download my set from the torrent and include it in your database.


For exporting, pg_dump -F c compresses the data well, so cost-wise you might be able to put it on S3 and publish it as a torrent.

Exporting is one possibility, but eventually I'd like to provide read-only SQL access to the database we host. We have a few ideas on how to do this [1], but it's not implemented yet.

[1] https://github.com/mozilla/tls-observatory/issues/92

Perhaps something like a modified PostgREST could work?


The problem isn't so much exposing the data as a REST API as it is allowing complex queries that may contain table joins, subqueries, or recursive conditions. I only skimmed the PostgREST documentation, but it doesn't mention joining tables, which is a deal-breaker for our use case.

Idea from someone just starting to learn about databases (very green :P):

- People request access and get an API key associated with a given load threshold, or don't use an API key and default to some low threshold

- Anything that SQL EXPLAIN says is over the threshold returns an error

- Successful requests' load costs and execution time (and possibly CPU, if that can be determined) count toward a usage rate limit

- An SQL parser implements the subset of SQL you deem safe and acceptable and forms a last-resort firewall

Obviously this is a complex solution; I'm curious whether people think it would be simpler or more difficult overall in the long run.
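A rough Go sketch of the EXPLAIN-gating step above, assuming the server first runs EXPLAIN (FORMAT JSON) on the submitted query and only executes it if the planner's estimated cost fits the caller's budget. The JSON field names match Postgres's actual explain output; the function names and sample numbers are made up:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// totalCost pulls the planner's estimated total cost out of the JSON
// produced by: EXPLAIN (FORMAT JSON) <query>
func totalCost(explainJSON []byte) (float64, error) {
	var out []struct {
		Plan struct {
			TotalCost float64 `json:"Total Cost"`
		} `json:"Plan"`
	}
	if err := json.Unmarshal(explainJSON, &out); err != nil {
		return 0, err
	}
	if len(out) == 0 {
		return 0, fmt.Errorf("empty EXPLAIN output")
	}
	return out[0].Plan.TotalCost, nil
}

// allowed decides whether a query may run under the caller's cost
// threshold (the per-API-key limit from the proposal above).
func allowed(explainJSON []byte, threshold float64) bool {
	cost, err := totalCost(explainJSON)
	return err == nil && cost <= threshold
}

func main() {
	// Trimmed-down sample of EXPLAIN (FORMAT JSON) output.
	sample := []byte(`[{"Plan": {"Node Type": "Seq Scan", "Total Cost": 155.00}}]`)
	fmt.Println(allowed(sample, 1000)) // cheap query, under budget
	fmt.Println(allowed(sample, 100))  // over budget, rejected
}
```

Note that planner costs are estimates in arbitrary units, not milliseconds, so the threshold would need tuning against observed runtimes.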

> I have about the same amount of data in a Postgres database ...

I'm curious, how fast can one load data into Postgres? Is it possible to import data directly from CSV files?

> I'm curious, how fast can one load data into Postgres?

Hard to answer given the number of variables involved. pg_bulkload[0] quotes 18 MB/s for parallel loading on DBT-2 (221 s to load 4 GB), and 12 MB/s for the built-in COPY (with post-indexing, i.e. first import all the data, then build the indexes).

> Is it possible to import data directly from CSV files?

Yes, the COPY command[1] can probably be configured to support whatever your *SV format is. There's also pg_bulkload (which should be faster but works offline).

[0] http://ossc-db.github.io/pg_bulkload/index.html

[1] http://www.postgresql.org/docs/current/interactive/sql-copy....
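One practical wrinkle: COPY aborts the whole load on a malformed row, so it can pay to scrub the CSV before feeding it in. A minimal Go sketch using only encoding/csv (the column layout here is invented for illustration):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// normalize reads raw CSV and keeps only rows with exactly `width`
// fields, so a later COPY ... WITH (FORMAT csv) doesn't abort mid-load.
// It returns the cleaned CSV and the number of rows dropped.
func normalize(raw string, width int) (string, int) {
	r := csv.NewReader(strings.NewReader(raw))
	r.FieldsPerRecord = -1 // accept ragged rows; we filter them ourselves
	var buf strings.Builder
	w := csv.NewWriter(&buf)
	dropped := 0
	for {
		rec, err := r.Read()
		if err != nil {
			break // EOF or unrecoverable parse error
		}
		if len(rec) != width {
			dropped++
			continue
		}
		w.Write(rec)
	}
	w.Flush()
	return buf.String(), dropped
}

func main() {
	raw := "fingerprint,domain\nabc123,example.com\nbadrow\n"
	clean, dropped := normalize(raw, 2)
	fmt.Printf("%q (dropped %d)\n", clean, dropped)
}
```

The cleaned output can then be piped into psql's \copy ... FROM STDIN WITH (FORMAT csv).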

18 MB/s sounds rather low. It obviously depends on the source and format of the data (e.g. lots of floating-point columns load slower than large text fields) and on whether parallelism is used. But you can relatively easily get around 300 MB/s into an unindexed table, provided you have a decent storage system.

>Is it possible to import data directly from CSV files?

Yup! http://www.postgresql.org/docs/current/static/sql-copy.html

Our dataset isn't loaded from an external source; it's generated by scanners.

But to answer your question: yes, postgres can load data from csv files: http://stackoverflow.com/questions/2987433/how-to-import-csv...
