Hacker News
My favourite API is a zipfile on the European Central Bank's website (csvbase.com)
1024 points by qsantos 10 months ago | 265 comments



Ah, I remember this specific file from my time at the ECB 15-ish years ago.

IIRC it was by far the most downloaded file on the ECB website. Tons of people, including many financial institutions, downloaded it daily, and used it to update their own systems.

IIRC #2: in the minutes immediately after the daily scheduled time for publishing this file, there was a massive traffic spike.

It was a conscious decision to make it a simple CSV file (once unzipped): that made it possible to serve the file reliably, fast, and with few resources.

The small team responsible at the time for the ECB's public website was inordinately proud of the technical decisions made to serve this data in a single static file. And rightly so.


This is something that powers a lot of data interchange, but lots of people don't realize it because they never work with older systems like that. It's not shiny. No frameworks.

I used to work for a large, old company whose products you have all bought and buy (and which shall remain nameless) about 15 years ago. I worked on data interchange between the systems used for record keeping of said products and various other downstream or parallel systems (same purpose, separate system because left over from merger/acquisition).

So basically bulk data import and export. The product was 15 years old at that point. Data interchange was via various fixed width or delimited (CSV but the C might be some other character or character sequence) files transferred to and from various SFTP servers.

It's been a while but there were probably like 20 or 30 different such data sources or exports going in and out. Worked like a charm. I bet they're still used without much change. The frontend was being rewritten at the time (old one was in Smalltalk).


I'm using this type of simple approach to build a SaaS right now. We need to spin up many VPS and provision them, and the fastest way to do that is with rsync and ssh.

But we didn't stop there: this SaaS for our open source browser product is entirely built like this: behind the scenes it's a collection of bash scripts that implement and execute the business operation of the SaaS.

So basically, it's a command-line interface to the SaaS. Think of it this way, say I didn't have a website, with login, and "click a button to open a browser", but instead people would write me letters, send me cheques, or call me on the phone. Then I can serve their requests manually, at the command line.

The reason I made it like this was:

- clear separation between thin web front-end and actual business logic

- nice command-line interface (options, usage, help, clear error messages) to business logic for maintenance and support to jump on and fix things

- inheritance of operating system permissions and user process isolation

- highly testable implementation

Maybe this is dumb, but I really like it. To me it's an architecture and approach that makes sense.

I'm sure this is not new, and I think a lot of good-quality operations must be built this way. I strongly align with the author's stance that composing a few simple command-line tools gets the job done.

Perhaps we can call this "unix driven development", or "unix-philosophy backend engineering"

browser product: https://github.com/dosyago/BrowserBoxPro (saas coming soonish)


I feel like you went way overboard with this one.. I think the first paragraph and an example of what you're doing that's similar to OP would've been plenty


I enjoyed their comment, thanks...

You're being a jerk, someone shared something they were proud of that's relevant to the discussion. What's the problem?


A lot of developers talk smack about CSV files (or tab delimited) but they really are the workhorse of our industry. They're simple, textual, compress really well, and as long as everyone on both sides has an agreement about the format, they're almost a standard.

Sure, a lot of the criticisms of CSV are true, but given the above constraints it's really hard to beat them.


"everything is a file" still going strong


And they can be opened in Excel, for better and for worse.


I know this file too, I was one of them. Of all the data sources we used, it was the best to work with.


Wow. I’m just thinking of the hoops I’d have to go through to host changing data as a static CSV/ZIP at my employer.

Architects would complain ZIP isn't a format that conforms to their specs for this purpose. Compliance would complain there needs to be a check that no private information is leaked. Risk would complain that I should prevent bad actors from downloading the file. Web people would complain that I need an approved change to add stuff to the site.


Yeah, but modern, approved practices don't really address any of that - they just hide it under a pile of extra complexity that's too big for most devs and managers' attention span, effectively disappearing it.


Do you know why they decided to host a zip file instead of just hosting the CSV and relying on HTTP compression?


HTTP compression is optional, so they would either have to compress on the fly (wasting CPU), provide multiple versions (complicating setup and deployment), or break some HTTP clients.

A single zip file is really the easiest solution when the file absolutely must be compressed.


Many webservers allow you to serve a compressed file (stored on disk) and _decompress_ when a client specifically can't support the compressed encoding. Since most clients should support compression, this means you only use the CPU for the rarer case where the uncompressed data is required.

E.g., http://nginx.org/en/docs/http/ngx_http_gunzip_module.html
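A minimal nginx sketch of that setup (the location and filename are illustrative, not the ECB's actual config): keep only the pre-compressed file on disk, serve it as-is to gzip-capable clients, and decompress on the fly for the rest.

```nginx
location /stats/eurofxref/ {
    gzip_static on;   # serve eurofxref.csv.gz directly when the client accepts gzip
    gunzip on;        # otherwise decompress on the fly (ngx_http_gunzip_module)
}
```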


Or you can just zip it explicitly, removing complexity and future issues if they have to move to another web server platform.


But what if an analyst needs to access this data and run their regressions on a potato? Surely that use case is worth adding a few libraries to handle.


I would think running the regression should be more demanding on the potato than decompressing a ZIP file.


hmm good point, can you draft up an architecture plan using multiple microservices and redundancy via a kubernetes cluster and have it on my desk by Monday please.


What machine could anyone be running an analysis on where unzip is a limiting factor?


you heard them, a literal potato from the ground.

even a big old russet only has, what, like 32 bites?


I mean, there are some potatoes that take more than four bytes to eat, but they are rare.

For some, if you slice them, you might even end up with only 16 bits.


Whoosh


Nginx or Apache would both cache these versions transparently. I think they just wanted to distribute as a zip


The business logic here is >15 years old. HTTP compression was only in its early stages then, and you can guarantee that many client-side scripts and libraries would not have supported it. Zip was well known. Compress and place.

If I'm not missing the point here, this was, and still is, about offering the simplest, most reliable solution over a long period. This is a near perfect example of how to do exactly that. No changing formats, no moving requirements, no big swings in frameworks, apis, or even standards. And most importantly, no breaking your customers business workflows.


No edge case dependencies on the WWW server's configuration, and no sudden "why did we just saturate our external connection?"

No emergency change requests from the outage team that has to be impacted by other areas and fit into the infrastructure team's maintenance windows and their capacity to address it.

No rebalancing of workloads because Jane had to implement that change (or schedule the task and monitor it), Joe had to check and verify that the external availability tests passed, and Annick had to sign off on the change as complete, and now everyone isn't available for another OT window for the week.

Or something.


> "The business logic here is >15 years old. Http compression was only in early stages then"

At least one of us is confused about history here; are you really saying that circa 2008, HTTP compression would have been considered immature or unstable?


Maybe in part because it encourages you to work with a local copy, rather than just hitting the hosted .csv repeatedly?

Just guessing. I have no idea :-)


It is exactly what most API optimization does. For example, batch queries merge several API calls into one to reduce the call count. And in this case, it is perfectly optimized: you have literally one zip file, and nothing more.


I could see 2 reasons:

1 - save on cpu usage, compress once, serve many

2 - with zip you can have some rudimentary data integrity checks (unzip -t)


Another option would be to save it as a gzip file and serve it raw with implied gzip compression.

Nginx does this: https://docs.nginx.com/nginx/admin-guide/web-server/compress...

Last entry: the directive is `gzip_static on;`


By distributing the data in a compressed file, you retain the benefits of the smaller file size at all stages (transfer from source, local storage, local transmission) until the data needs to be used in uncompressed format, all without any additional processing.


The CSV was available at https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv but isn't up to date.


It's a bit scary to imagine the consequences if this file was somehow corrupted (wrong column headers for instance)!


Just wait until you inherit a service that, sans documentation, pulls in a web resource file that suddenly is no longer available :(


Scale8 recently archived their GitHub repo without explanation, and right after that the URL for a GDPR opt-in list they were using changed. The product is now dead because the code doesn't handle the situation where that file might not be available.

(Scale8 is a web analytics and tag management software like google analytics)


I forgot to mention this error check even breaks the admin dashboard meaning you can't even see historical data anymore.

Just in case anybody else is in the same boat you can change the url in the code to the new one yourself.


> IIRC #2 in the minutes immediately after the daily scheduled time for publishing this file, there was a massive traffic spike.

Perfect use case for the BitTorrent protocol imho. Funny thing is you can serve files over BitTorrent from Amazon S3 without any additional complexity or charges, I think.


Do you remember why they decided to zip it instead of just relying on the compression in http?


Does HTTP compression perform well if you're streaming a file? Wouldn't it need to have access to the entire file to have the same result as zipping it?


Yes it performs well. You don't need access to the entire file.


Simple file downloads and CSV files are fantastic. I wish more folks would publish data in simple formats like this. Every time I have to fill a "shopping cart" for a US government data download I die a little.

There's a bunch of wrapper tools to make this particular pipeline easier. Also something like Datasette is great if you want a web view and some fancier features.


One nice thing about CSV files being zipped and served via the web is they can be streamed directly into the database incredibly fast without having to persist them anywhere (aside from the db).

You can load the zip file as a stream, read the CSV line by line, transform it, and then load it to the db using COPY FROM stdin (assuming Postgres).


Definitely, it is much easier to stream CSV than say JSON or XML (even if JSONL/Sax parsers exist etc).


That doesn't sound like an amazingly safe idea


It isn't. But that's easily mitigated with temp tables, an ephemeral database, COPY, etc.

Upstream can easily f-up and (accidentally) delete production data if you do this on a live db. Which is why PostgreSQL and nearly all other DBMSs have a myriad of tools to solve this by not doing it directly on a production database.


Maybe I'm missing something but I don't see how it's possible for a COPY statement alone to remove existing data.


If in the regular scenario you load 10000 rows of new data and delete the old then it’s fine.

What if someone screws up the zip and instead of 10000 today, it’s only 10?


I had this last week, but instead it was a 3rd party api and their service started returning null instead of true for the has_more property beyond the second page of results.

In either case, the solution is probably to check rough counts and error out if they're not reasonable.
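A rough-count guard like that can be tiny. This is a sketch; the 0.5 threshold is an arbitrary illustration and would be tuned per feed.

```python
def check_row_count(new_count, old_count, min_ratio=0.5):
    """Guard against a silently truncated feed: refuse the import when the
    fresh dataset is suspiciously smaller than the previous one."""
    if old_count and new_count < old_count * min_ratio:
        raise ValueError(f"refusing import: {new_count} rows, expected ~{old_count}")

check_row_count(9_800, 10_000)      # normal daily variation: passes
try:
    check_row_count(10, 10_000)     # upstream broke the file: blocked
except ValueError as err:
    failure = str(err)
```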


I think generally you don't replace the prod db until the new one passes tests.


What specific risks do you foresee with this approach?


Seems totally fine to me. As long as you can roll back if the download is truncated or the CRC checksum doesn't match.


> or the crc checksum doesn’t match.

which wouldn't exist if the api is simply just a single CSV file?

at least with a zip, the CRC exists (an incomplete zip file is detectable, an incomplete, but syntactically correct CSV file is not)
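The asymmetry is easy to demonstrate, e.g. with Python's stdlib: cut a zip in half and opening it fails loudly, cut the CSV inside it in half and it still parses as valid CSV.

```python
import csv
import io
import zipfile

payload = b"Date,USD\n2024-01-02,1.0956\n2024-01-03,1.0919\n"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("rates.csv", payload)
data = buf.getvalue()

# A truncated zip fails loudly: the central directory at the end is gone.
try:
    zipfile.ZipFile(io.BytesIO(data[: len(data) // 2]))
    zip_ok = True
except zipfile.BadZipFile:
    zip_ok = False

# A truncated CSV fails silently: the surviving lines still parse cleanly.
truncated = payload[: len(payload) // 2].decode()
rows = list(csv.reader(io.StringIO(truncated)))
```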


DROP DATABASE blah;


That’s not how COPY FROM works in postgres. You give it a csv and a table matching the structure and it hammers the data into the table faster than anything else can.


If you feel risky, try a Foreign Data Wrapper ;)


It blows my mind that you can use SQLite with CSV as input and then query it. It sounds so logical and useful, yet I never came across it.

We have lots of reporting in CSV; I can't wait to start using it to run queries quickly.
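In the sqlite3 CLI that's just `.mode csv` followed by `.import report.csv report`. From Python it's a few lines; this sketch uses a hypothetical report with made-up columns, and casts before summing since the CSV values arrive as text.

```python
import csv
import io
import sqlite3

# Stand-in for one of those CSV reports (hypothetical columns).
report = "name,amount\nwidgets,120\ngadgets,80\nwidgets,40\n"
rows = list(csv.reader(io.StringIO(report)))
header, data = rows[0], rows[1:]

db = sqlite3.connect(":memory:")
db.execute(f"CREATE TABLE report ({', '.join(header)})")
db.executemany("INSERT INTO report VALUES (?, ?)", data)

# Ordinary SQL over what started life as a CSV report.
totals = db.execute(
    "SELECT name, SUM(CAST(amount AS INTEGER)) AS total"
    " FROM report GROUP BY name ORDER BY total DESC"
).fetchall()
```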


You should checkout powershell; it supports converting CSV into in-memory structured data and then you can run regular powershell queries on that data:

    $csvData = @"
    Name,Department,Salary
    John Doe,IT,60000
    Jane Smith,Finance,75000
    Alice Johnson,HR,65000
    Bob Anderson,IT,71000
    "@

    $csvData |
        ConvertFrom-Csv |
        Select-Object Name, Salary |
        Sort-Object Salary -Descending

    Name          Salary
    ----          ------
    Jane Smith    75000
    Bob Anderson  71000
    Alice Johnson 65000
    John Doe      60000
You can also then convert the results back into CSV by piping into ConvertTo-Csv

    $csvData |
        ConvertFrom-Csv |
        Select-Object Name, Salary |
        Sort-Object Salary -Descending |
        ConvertTo-Csv -NoTypeInformation

    "Name","Salary"
    "Jane Smith","75000"
    "Bob Anderson","71000"
    "Alice Johnson","65000"
    "John Doe","60000"


nushell does too:

    /tmp/> "Name,Department,Salary
    ::: John Doe,IT,60000
    ::: Jane Smith,Finance,75000
    ::: Alice Johnson,HR,65000
    ::: Bob Anderson,IT,71000" |
    :::     from csv | 
    :::     select Name Salary | 
    :::     sort-by -r Salary
    ╭───┬───────────────┬────────╮
    │ # │     Name      │ Salary │
    ├───┼───────────────┼────────┤
    │ 0 │ Jane Smith    │  75000 │
    │ 1 │ Bob Anderson  │  71000 │
    │ 2 │ Alice Johnson │  65000 │
    │ 3 │ John Doe      │  60000 │
    ╰───┴───────────────┴────────╯


for something a bit more robust, check out DuckDB. It's a library you can embed, use it to run SQL on local files as well as connect to databases, do joins, analytics, etc.


Agreed. The article mentioned DuckDB and I'm here to thumbs-up the use of DuckDB wholeheartedly. If you like the world of public CSV files as data sources that you can query or cross-query, DuckDB is the tool for you. Just follow the demo on the DuckDB website and you'll be wow'd for sure.


I use both, and I have found it helpful to have nushell around when munging csv and parquet files, even when working with duckdb - I find it quicker to ask for the first few rows of a thing or do simple stuff with it, then deeper analysis with duckdb.

They're a powerful pairing


Crazy how similar the commands are.


qsv (https://github.com/jqnatividad/qsv) also has a sqlp command which lets you run Polars SQL queries on CSV(s).

Here I'll:

  - Send the csv data from stdin (using echo and referred to in the command by -)
  - Refer to the data in the query by stdin. You may also use the _t_N syntax (first table is _t_1, then _t_2, etc.), or the file name itself before the .csv extension if we were using files.
  - Pipe the output to the table command for formatting.
  - Also, the shape of the result is printed to stderr (the (4, 2) below).

  $ echo 'Name,Department,Salary
    John Doe,IT,60000
    Jane Smith,Finance,75000
    Alice Johnson,HR,65000
    Bob Anderson,IT,71000' |
    qsv sqlp - 'SELECT Name, Salary FROM stdin ORDER BY Salary DESC' |
    qsv table
    
    (4, 2)
    Name           Salary
    Jane Smith     75000
    Bob Anderson   71000
    Alice Johnson  65000
    John Doe       60000


You do realize that this also happened in the article? I.e., in memory and with auto-inference.


You might find a lot of interesting tools in the Datasette ecosystem. Data dashboarding for SQLite with all sorts of import and export and visualization plugins.

https://datasette.io/


What’s really interesting about it is that Awk is now, finally, getting support for CSV. But I bet a large amount of Awk+CSV use cases can be met with SQLite and SQL+CSV.


AWK's new CSV and UTF-8 support is great, but when querying data I think in terms of SQL.


The clickhouse-local tool is also really great for querying a bunch of systems, and it has connectors and converters for a whole bunch of other systems and formats.


Have you used duckdb? It's great for that.


Look at DuckDB. Queries against CSV, Parquet, JSON, locally or via HTTP. It's like SQLite, but faster and better.


Or the vnlog tools can do this. There are many ways to do data processing on the commandline now.


There's a chance this HN post is going to more than halve my awk-usage....


Perl with the right modules makes data-janitor situations much simpler.


CSV would be great if there were one CSV, and if you absolutely guarantee that nobody has "improved" the CSV by editing it with a spreadsheet program (including opening it with a spreadsheet program) or their own deft little fingers in a text editor.

For example:

    "Look, this contains \"quotes\"!",012345
Or:

    "Look, this contains ""quotes""!",012345
Or, for some degenerate examples:

    "Look, this contains "quotes"!",012345
Or:

    Look, this contains "quotes"!,012345
Or the spoor of a spreadsheet:

    "Look, this contains ""quotes""!",12345
Theoretically, JSON isn't immune to being hand-hacked into a semi-coherent mess. In practice, people don't seem to do that to JSON files, at least not that I've seen. Ditto number problems, in that in JSON, serial numbers and such tend to be strings instead of integers a "helpful" application can lop a few zeroes off of.
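For what it's worth, Python's csv module can be coaxed into reading the first two variants; the degenerate ones with bare inner quotes have no unambiguous parse and need hand-repair. A sketch:

```python
import csv
import io

# Doubled quotes (RFC 4180 style): the default dialect handles this.
rfc = '"Look, this contains ""quotes""!",012345\n'
rfc_row = next(csv.reader(io.StringIO(rfc)))

# Backslash-escaped quotes: needs an explicit dialect tweak.
esc = '"Look, this contains \\"quotes\\"!",012345\n'
esc_row = next(csv.reader(io.StringIO(esc), doublequote=False, escapechar="\\"))

# Both parse to the same fields -- and the leading zero survives,
# because everything stays a string until you say otherwise.
```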


> CSV would be great if there were one CSV, and if you absolutely guarantee that nobody has "improved" the CSV by editing it with a spreadsheet program (including opening it with a spreadsheet program)

Practically no formats actually pass those rules. Even plain text is bound to be "improved" by text editors frequently (unification of line endings, removal of data not in a known encoding, UTF BOMs, UTF normalization, etc.)

Just don't do that.


Just don't do what, edit your data? The problem is all JSON implementations agree on what a properly encoded JSON file looks like, and that's not true for CSV.

Arbitrary CSV is not, in general, round-trippable.


> all JSON implementations agree on what a properly encoded JSON file looks like

You clearly haven't had to deal with tools that cannot read and write a JSON file without messing up and truncating large numbers.


JSON is JavaScript Object Notation. JavaScript itself doesn't support large numbers like 2^65, or exact numbers like 2^55+75. The JSON spec doesn't limit numbers to 2^53 or 2^64, but it also doesn't guarantee that they will be supported.


I mean in practice you open up the CSV file, figure out the escaping rules, change one line of your code, and get on with your life.


You're lucky.

In practice, you open the file a disappointed customer sends you because your software doesn't parse it. You resist yelling at whomever decided to ever do CSV support in your software because you know what it's like. But the customer is important, so then either write a one-off converter for them to a more proper CSV format, or you add - not change - several lines of code in mainline to be able to also parse customer's format. God forbid maybe also write. Repeat that a couple of times. See the mess increasing. Decide to use a so-called battle-tested CSV parser after all, only to find out it doesn't pass all tests. Weep. Especially because at the sametime you also support a binary format which is 100 times faster to read and write and has no issues whatsoever.

I'm not blind to the benefits of CSV, but oh boy would it be nice if it were, for starters, actually comma-separated.


Stick to files conforming to RFC 4180, then.


JSONL should replace CSV. It is standardized and the escapes are mostly well specified. It is effectively CSV with "[" and "]" surrounding lines.

Regular JSON would work fine for a static file, and would make schemas and links (JSON-LD) possible. But then the file could be any structure. JSONL works better for systems that assume line-based records, and is more likely to have consistent, simple records.
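The conversion is mechanical. The parent describes JSONL lines as bracket-wrapped arrays; this sketch emits objects instead so every line carries its own field names, which is the more common JSONL shape.

```python
import csv
import io
import json

csv_text = "Name,Salary\nJane Smith,75000\nJohn Doe,60000\n"

# Each CSV record becomes one self-describing JSON object per line,
# so the quoting rules are JSON's, not one of CSV's many dialects.
jsonl = [json.dumps(rec) for rec in csv.DictReader(io.StringIO(csv_text))]

# Reading it back needs no dialect guessing: every line is plain JSON.
parsed = [json.loads(line) for line in jsonl]
```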


I've never stopped to question the absurdity of the gov data shopping cart thing. WHY!?!? Is there some justification for that?


I can almost assure you it was an explicit requirement in an RFP that was copied from some master template. So not a good justification - but a justification in the eyes of the bureaucracy.

The book Recoding America has a lot of anecdotes to this effect; most of these situations reduce to a Congressional mandate that got misinterpreted along the way. My favorite was for an update to Social Security. The department in charge of the implementation swore that Congress was forcing them to build a facebook for doctors (literally where doctors could friend other doctors and message them). Congress had no such intention; it was actually 3rd party lobbying that wanted the requirement so they could build their own solution outside of government. Really crazy stuff.


> Congress had no such intention; it was actually 3rd party lobbying

Right, but 3rd party lobbying can't force anyone to do anything, whereas Congress can (and did) give this mandate the force of law. The fact that lobbyists got Congress to do something that they had "no such intention" to do is its own problem, but let's not lose sight of who is responsible for laws.


That's the interesting part of this story; Congress didn't think this requirement existed, neither did the lobbyists. But the language that congress adopted (with the consultation of this lobbying group) made the agencies _think_ it was what congress wanted. So the agency was left holding the bag for something no one actually wanted in the first place. Like a big game of telephone.

I agree with your broader point however. Congress needs to do a better job of owning outcomes.


I for one was tickled the first time I paid an NYC parking ticket online and had to add it to my cart, as if they might upsell me on some other more serious violation. Act now before it's too late!


When you're paying for things, and can pay for several things at the same time, it makes sense. I helped my mother pay her property taxes this year, and for two properties we had to make four payments. Without the shopping cart (or a functional equivalent) I would have had to enter my payment information four times instead of once.


And if you don't like the service you have free returns for 14 days no questions asked


I assume it's a vestige from the old days when you ordered actual tapes or printouts from government agencies. The notion of an order that had to be prepared is baked into how they think about the service and product.

All sorts of strange things happen with accessing US government data. But most agencies have a lot of excellent data available for free and motivated data scientists who want to make it available to you.


Some records are only released for a fee, so I always assumed implementing a standard interface to handle commercial transactions and making select items free was easier than maintaining separate commercial/noncommercial sites.


Which in itself is fairly antidemocratic.


Probably because it’s the out-of-the-box functionality on ServiceNow or whatever tool they’re using.


My guess would be to gather information about who is using the data - and present it to the stakeholders.

Sometimes they send questionnaires to data consumers.


Probably made by the same people who made https://pacer.uscourts.gov/pacer-pricing-how-fees-work ...

(As a side note, I can understand why in years past it would cost multiple cents per page to physically photocopy a federal document - but it is absolutely absurd that already-digitized documents, documents which are fundamentally part of the precedent that decides whether our behavior does or doesn't cause civil or criminal liability, are behind a paywall for a digital download!)


Pacer physically hurts to use. They should have to pay us for having to use it.


I’d go further and say that government websites and services in general are weird and often feel outdated, especially the more local you get.

Some of them look like they’re straight from the early 2000s.

Another thing is that privacy is dead and almost everything is deemed public information. Your address, mugshot, voter rolls, you name it, it’s all deemed public information.

But once you actually want to access information that is useful to society as whole, more often than not it’s behind a paywall. Despite this, they still call it “public” information because theoretically anyone can pay for it and get access.

It’s one of the first things I noticed when I moved to the US.

Another thing that I’ve noticed is that, if possible, there always needs to be a middle man inserted, a private corporation that can make a profit out of it.

You want to identify yourself to the government? Well you’re gonna need an ID.me account or complete a quiz provided to you by LexisNexis or one of the credit reporting agencies.

Why? How is it that the government of all entities, isn’t capable of verifying me themselves?

Zooming out even further you’ll start to recognize even more ancient processes you interact with in daily life.

The whole banking situation and the backbone that’s running it is a great example. The concept of pending transactions, checks and expensive and slow transfers is baffling to me.

It’s so weird, like ooh, aah this country is the pinnacle of technological innovation, yet in daily life there’s so much that hinges on ancient and suboptimal processes that you don’t see in, say, Western Europe.

My best guess is that this is because of a mix of lack of funding and politicians wanting to placate profit-seeking corporations.

Ironically, and I have no hard evidence for this because I'm too lazy to look into it, I suspect that in the long term it costs more to outsource it than it does to do it themselves.

/rant


I feel like the USA was probably ahead of the curve on things like the banking system in the 90s but removing all of that cruft and replacing with a v2 is not an easy thing to do. Especially when the population are generally skeptical of government and outsourcing to a private corporation is seen as the cheaper option. Short term it's potentially cheaper/quicker to outsource but long term maintenance contracts and lobbying to keep things as they are is much more expensive.

In the USA you find these whole industries that exist due to the inadequacies of the old systems. E.g. Venmo doesn't need to exist in Western Europe (and probably the rest of the world?) because person-to-person bank transfers are free and easy.


Bureaucracy = $$$


Who is making money with shopping cart-style downloads?


The web development companies that are subcontracted by the government agencies to repurpose their silly shopping cart software.

I will decline to share my personal anecdotes about these companies because I am like 10+ years out of date, but I can tell you that most of these companies seemed to have certain very specific non-technical things in common.


> Simple file downloads and CSV files are fantastic. I wish more folks would publish data in simple formats like this.

The document format doesn't seem to have much to do with the problem. I mean, if the CSV is replaced by a zipped JSON doc then the benefits are the same.

> Every time I have to fill a "shopping cart" for a US government data download I die a little.

Now that seems to be the real problem: too many hurdles in the way of a simple download of a statically-served file.


> if the CSV is replaced by a zipped JSON doc then the benefits are the same.

Being able to use things like jq and gron might make simple use cases extremely straightforward. I'm not aware of anything similarly nimble for CSV.


csvtool is probably what you're looking for, though I think the use case for JSON vs CSV is different as one is hierarchy-oriented and the other is flat.


For CSV? DuckDB, Datasette, awk, pawk, and tons of others.


> For CSV? DuckDB, Datasette, awk, pawk, and tons of others.

Those tools are either line- or table-oriented, and thus don't handle CSVs as a structured format. For instance, is there any easy way to use AWK to handle a doc where one line has N columns but some lines have N+1, with a random column inserted at an undetermined point?


The CSV Spec (1) specifies that “Each line should contain the same number of fields throughout the file”, so what you describe would be a noncompliant file. Nevertheless, I would expect both awk and pawk to be able to work with it just fine.

Both pawk and the recent version of awk can parse the CSV format correctly (not just splitting on commas, but also handling the quoting rules for example).

1: https://www.rfc-editor.org/rfc/rfc4180
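For comparison, Python's csv module behaves the same way on such a file: it makes no fixed-column assumption and simply yields whatever fields each line has, leaving the ragged-row policy to the caller.

```python
import csv
import io

# One line has an extra field -- noncompliant per RFC 4180, but the
# reader just yields what is there rather than erroring out.
ragged = "a,b,c\n1,2,3\n4,5,6,7\n"
widths = [len(row) for row in csv.reader(io.StringIO(ragged))]
```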


You don't need gron with CSV; normal grep will work wonders.


gron gives you both the complete key and value in a single line (which you then grep). Directly grep-ing a specific column in a CSV file isn't very straightforward.


Can you explain what the shopping cart is?

I mean... do they make you select one or more files, then navigate to another page to download your selected files?


Sounds like Germany’s company register[1].

You can search for companies, select the documents you’d like to see (like shareholder lists), then you go through a checkout process and pay 0 EUR (used to be like a few euros years ago), and then you can finally download your file. Still a super tedious process, but at least for free nowadays.

[1] https://www.unternehmensregister.de/ureg/


Quite literally yes. There are often multiple click throughs. Every Department, agency, sub-agency, all the way down to federal dog catcher has decided the most important thing is to invent a new way of getting data from them.


Solutions vary as the budgets and talent assigned to the projects do.

Federal, State, local and hyper local solutions cannot be the same unless the financier is also the same.


Pretty much. USGS’s EarthExplorer requires a login just to download free files https://earthexplorer.usgs.gov/

There are other ways to access the data on here, but they’re fragmented. It’s nicely organized here so it’s a bummer they make it hard to programmatically retrieve files once you find what you’re looking for.


Agreed! People are surprised but CSV files (while I would not use them for everything) work great in low-tech environment, for instance, or when files must be produced by many different actors with different technical levels.

They also work nicely as an interop between different stacks (Cobol <-> Ruby).

One concrete example is the French standard describing Electric Vehicle charging points, which is made of 2 parts (static = position & overall description; dynamic = current state, occupancy, etc.). Both "files" are just CSV files:

https://github.com/etalab/schema-irve

Both sub-formats are specified via TableSchema (https://frictionlessdata.io/schemas/table-schema.json).

Files are produced directly by electrical charge points operators, which can have widely different sizes & technicality, so CSV works nicely in that case.


I worked for a company that used TSV for huge datasets. It was extremely efficient, far more so than any database.


TSV is criminally underrated compared to CSV.


Lol. Tell me you never had to parse CSV files without telling me.

CSV files can be a nightmare to work with depending where they come from and various liberties that were taken when generating the file or reading the file.

Use a goddamn battle-tested library, people, and don't reinvent the wheel. /oldman rant over


Yes, you eventually realize the hard way that "CSV" is actually a blanket of various similar formats, with different rules and conventions. The way one program outputs CSV's may be completely different from another.


Actually, I worked on OpenAddresses, a project to parse thousands of CSV files containing address data from many different county, state, and national systems around the world. It really wasn't that hard, even Python's basic csv parser was sufficient to the task (and there are plenty of better options).

CSV is remarkably robust in practice.
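For what it's worth, the quoting cases that trip up hand-rolled parsers are exactly what the stdlib module handles; a small sketch with made-up data:

```python
import csv
import io

# A quoted comma and an escaped quote -- the two cases that
# naive text.split(",") gets wrong.
data = 'name,notes\n"Smith, Jane","said ""hi"""\n'
rows = list(csv.reader(io.StringIO(data)))
# rows[1] == ['Smith, Jane', 'said "hi"']
```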


Hey, CSV is hard, guys.

I've found template injection in a CSV upload before because they didn't anticipate a doublequote being syntactically relevant or something.

It was my job to find these things and I still felt betrayed by a file format I didn't realize wasn't just comma separated values only.


Someone can find all these issues on 200 lines of code :) see sibling comment


This isn't my code, but gives you an idea of the level of complexity involved. But don't reimplement what you don't need to.

https://github.com/mafintosh/csv-parser/blob/master/index.js


To be clear my comment was meant as a joke.

Looking at the parser I see a few problems with it just by skimming the code. I'm not saying it wouldn't work or that it's not good enough for certain purposes.


Oh yeah? Such as? What purposes do you think it wouldn't be good for? The author will probably be interested in your feedback. Apparently it's getting over a million downloads a week on npm.


I have not used it, so this is mostly speculation, but I would be curious about character set handling, mixed line ending handling, large file handling, and invalid or almost-valid file handling.

You can pick on some of the corner case issues here: https://github.com/mafintosh/csv-parser/issues Also look at ones that were solved. https://github.com/mafintosh/csv-parser/pulls

Some interesting ones: https://github.com/mafintosh/csv-parser/pull/121 https://github.com/mafintosh/csv-parser/pull/151 https://github.com/mafintosh/csv-parser/issues/218

The author of the library probably has learned, the hard way many many lessons (and probably also decided to prioritize some of the requested issues / feature requests along the way).

The above is not meant as a ding on the project itself and I am sure it is used successfully by many people. The point here is that your claim that you can easily write a csv parser in 200 lines of code does not hold water. It's anything but easy and you should use a battle tested library and not reinvent the wheel.


If you had read my original comment, you would see I didn't claim it's easy to do, only that it can be done in around 200 lines. That's clearly the case.

Character set handling isn't really an issue for JavaScript, as strings are always UTF-16. When a file is read into a string, the runtime handles the needed conversion.

As for handling large files, I've used this with 50 MB CSVs, which would need a 32-bit integer to index. Is that large enough? It's not like Windows Notepad, which can only read 64 KB files.


Windows notepad can read multiple megabyte files. It can read files that are hundreds of megabytes. It's not pleasant, loading is incredibly slow, and resizing the window when reflow is enabled makes it take that much longer, but it's definitely possible.


My point was that it's not trivial and it's hard to get it right. The way I read your comment was that it's not hard and can easily be done in 200 lines. It's possible I misread it.

I think the original point I was making still stands.


It's not as bad as all that. There are some gotchas, sure, but you can cover them all with about 200 lines of code.

However, I would recommend using a tested library to do the parsing, sqlite for example, rather than rolling your own. Unless you have to of course.
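To give a feel for what those ~200 lines involve, here is a deliberately minimal sketch of just the quoting state machine (RFC 4180-ish: quoted fields, "" escapes, newlines inside quotes; no encodings, streaming, or dialect options; an illustration, not a library):

```python
def parse_csv(text):
    """Tiny RFC 4180-style parser sketch: quoted fields, "" escapes,
    and newlines inside quotes. Not a substitute for a real library."""
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if in_quotes:
            if c == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # "" is an escaped quote
                    i += 1
                else:
                    in_quotes = False
            else:
                field.append(c)  # commas and newlines are literal here
        elif c == '"':
            in_quotes = True
        elif c == ',':
            row.append(''.join(field))
            field = []
        elif c == '\n':
            row.append(''.join(field))
            rows.append(row)
            row, field = [], []
        elif c != '\r':  # tolerate CRLF line endings
            field.append(c)
        i += 1
    if field or row:  # final record without a trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows
```

Every branch above corresponds to a gotcha someone has hit in the wild, which is the argument for a battle-tested library.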


> Every time I have to fill a "shopping cart" for a US government data download I die a little.

I've worked in gov/civic tech for 5+ years and, as you're probably aware, there is now a highly lucrative business in centralizing and consequently selling easy access to this data to lobbyists/fortune 500s/nonprofits.

USDS/18f are sort of addressing this, but haven't really made a dent in the market afaik since they're focusing on modernizing specific agencies rather than centralizing everything.


I don't know if this is the cart you're talking about but the USGS does this for LiDAR data and yeah I'm with you I die a little every time I use that site. I love that the data is available but why in the world do we have to cart everything? Just give me a one click download


The main thing for me is that the data isn't broken down in a way that leaves you having to make a thousand requests and collate them. If the client is expected to be able to pull the whole set of data, provide the whole set. There are other ways to rate limit.


clinicaltrials.gov lets you save a search term, then download the results as a zip. But there's an XML file for each search result for the trial.

One of the first things I played around with was using Python to get that file, unzip it, then iterate through the xml files grabbing the info I wanted and putting it into ElasticSearch to make it searchable then putting an angular front end on it.

I used to have it published somewhere but I think I let it all die. :(


Pretty much all data eventually ends up in 2 dimensions (rows & columns), so all these complicated data models mostly just complicate things.


Someone posted me a pdf in the last year! Online even! (But posted!)


I once had the unfortunate experience of building an API for a government org where the data changed once a year, or when amendments were made, which happens very infrequently.

The whole data set could have been zipped into a <1MB file, but instead a “solution architect” got their hands on the requirements. We ended up with a slow API, because they wouldn’t let us cache results in case the data had changed just as it was requested. And an overly complex webhook system for notifying subscribers of changes to the data.

A zip file probably was too simple, but not far off what was actually required.


I think for <1MB of data, with changes once (or twice) a year, the correct API is a static webserver with working ETag/If-Modified-Since support.

If you want to get really fancy, offer an additional webhook which triggers when the file changes - so clients know when to redownload it and don't have to poll once a day.

...or make a script that sends a predefined e-mail to a mailing list when there is a change.


> working ETag/If-Modified-Since support

I completely agree and csvbase already implements this (so does curl btw), try:

    curl --etag-compare stock-exchanges-etag.txt --etag-save stock-exchanges-etag.txt https://csvbase.com/meripaterson/stock-exchanges.parquet -O


> ETag/If-Modified-Since

See above. Also you can just publish the version in DNS with a long enough TTL


A zip file on a web server that supports etags, that's polled every time access is required. When nothing has changed since last time, you get an empty HTTP 304 response and if it has changed then you simply download the <1MB Zip file again with the updated etag. What am I missing?
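That flow can be sketched in a few lines of stdlib Python (the function name is mine):

```python
import urllib.error
import urllib.request

def fetch_if_changed(url, etag=None):
    """Conditional GET: returns (body, etag); body is None on a 304."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:  # unchanged since our cached copy
            return None, etag
        raise
```

Call it once a day with the ETag you saved last time; you only pay for the download when the file actually changed.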


You forgot to get yourself paid.


Probably nothing

My concern was "what if the file is updated while it's mid-download", but Linux would probably keep the old version of the file until the download finishes (i.e. as long as the file is still held open by the webserver process). Probably. It's better to test.


Is it updated in place (open/write), or replaced (rename)?

If it's updated in place, did the web server read the whole thing into a buffer or is it doing read/send in a loop?


If data changes only once a year, or rarely, that would imply usage of the API is a rare event for a user of the data, so speed isn't a huge concern. Caching would introduce more complexity and the risk of needing to manually revalidate the cache. The solution architect was probably right.


Why do rare writes imply rare usage? It's possible the file is read often and by different systems even if changes are infrequent.

If the API was used rarely, that would be even more of an argument for a simple implementation and not a complex system involving webhooks.


> Caching would introduce more complexities

Apache/nginx do it just fine...


Can't cache so you need to read it whenever you use the data, not just when it changes.


  cat /api/version.txt
  2023.01.01

  ls /api
  version.txt data.zip


Or maybe encode the version into the filename? It would overwrite if nothing changed, and the previous versions would remain available.

    2023.01.01-data.zip


That requires preprocessing on the client, and there are some people who have weird assumptions about how dates should be written.

The version file can be queried in at least two ways:

the ETag/If-Modified-Since way (metadata only)

the content itself

The best part of the last one: you don't need semver shenanigans. Just compare it with the latest downloaded copy; if version != downloaded => do_the_thing


No, it's a terrible API if retrieving the result '2000-10-26' requires downloading a 565 KB file.

I don't want to seem overly negative -- zipped CSV files are fantastic when you want to import lots of data that you then re-serve to users. I would vastly prefer it over e.g. protobufs that are currently used for mass transit system's live train times, but have terrible support in many languages.

But it's incredibly wasteful to treat it like an API to retrieve a single value. I hope nobody would ever write something like that into an app...

(So the article is cool, it's just the headline that's too much of a "hot take" for me.)


It's historical data. There is zero reason to request it more than once a day, and the users of such data will be interested in wildly different filters or aggregates over that data.

If this were being used to get the current rates, then yes, it would be a terrible design choice. But there are other services for that. This one fits its typical use case.


> I hope nobody would ever write something like that into an app...

I have got some bad news for you...

Not directly API related, but I remember supporting some land management application, and a new version of their software came out. Before that point it was working fine on our slow satellite offices that may have been on something like ISDN at the time. New version didn't work at all. The company said to run it on an RDP server.

I thought their answer was bullshit and investigated what they were doing. One particular call, for no apparent reason, was doing a 'SELECT * FROM sometable'. There were many other calls that were using proper SQL select clauses in the same execution.

I brought this up with the vendor and at first they were confused as hell how we could even figure this out. Eventually they pushed out a new version with a fixed call that was usable over slow speed lines, but for hells sake, how could they not figure that out in their own testing and instead pushed customers to expensive solutions?


> how could they not figure that out in their own testing

This one is easy. Testing with little data on fast network (likely localhost).


Also, if there are any ORMs involved it could be that it’s not immediately obvious from their code itself that this would happen.

I’ve seen code that was using an ORM, where they needed to find some data that matched certain criteria. With plain SQL it would have resulted in just a few rows of data. But instead, with their use of the ORM, they ended up selecting all rows in the table from the database, and then looping over the resulting rows in code.

The result of that was that the code was really slow to run, for something that would’ve been super fast if it wasn’t for the way they used the ORM in that code.


I've seen this so many times from novice developers, whose work style is often "trial and error, and as soon as it works, stop and change nothing." And it's so easy to slip in there, because it works perfectly locally, and often even in non-production envs unless you've seeded them with production-like amounts of data, and with the ORM it looks fine. Like in Rails:

  Product.where(category_id: xx).filter {|p| p.name.include?("computer") }
vs

  Product.where(category_id: xx).where("name like ?", "%computer%")
PS: I know there's an Arel way of doing the above without putting any SQL into the app -- use your imagination and pretend that's what I did, because I still don't have that memorized :)


I recently saw an example of that: a REST call returning a few KB of data that would fetch a few million rows from the database and use 10+ GB of memory (unfortunately some people think that you should always use join fetches with JPA...).


Kinda feels like 20 years ago called and wants its argument back. Have you seen any JavaScript ever?

565 KB plus the logic to get the big one is minuscule today by any reasonable measure.


True, though ironic that also OP is implicitly making the argument for GraphQL-like interfaces ("just specify the data that you want returned" rather than "get all the data, then process it"), which are themselves the New Hotness in some areas.


At risk of sounding pedantic, REST also allows for "just specify the data that you want returned" style APIs.

Something like:

    curl 'https://www.ecb.europa.eu/stats/eurofxref/api?fields=Date&order=USD&limit=1'


A very fair point well-made!


I think a lot of people in this thread are glossing over a difference in definitions. Some people see "API" as "how to get data, even if that means all data with no filtering." Personally, I regard downloading an entire table as downloading a data model without any logic operating on the model, and an API as logic that returns only part of the model filtered in some way that interests me.


Why do you assume that 565KB even matters?

I've been building loads of financial software, both back and front ends. In the frontend world it is, sadly, quite common to send that amount of "data" over the wire before even getting to actual data.

And in backends it's really a design decision. There's nothing faster than a cron job parsing exchange rates nightly and writing them to a purpose-designed todays-rates.json, served as a static file to your mobile, web or microservice apps.

Nothing implies your mobile app has to consume this zip-csv-over-HTTP.
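A sketch of that cron approach: a nightly job that turns the ECB daily CSV into a static JSON file (the crontab line, output path, and field cleanup are illustrative; the URL is the ECB's daily rates file):

```shell
# Example crontab entry (timing is hypothetical; the ECB publishes
# reference rates around 16:00 CET):
#   15 16 * * 1-5  /usr/local/bin/update-rates.sh

# update-rates.sh: fetch, unzip, convert the single CSV row to JSON, publish.
curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref.zip \
  | funzip \
  | python3 -c 'import sys, csv, json
row = next(csv.DictReader(sys.stdin))
print(json.dumps({k.strip(): v.strip() for k, v in row.items() if k and k.strip()}))' \
  > todays-rates.json
```

Everything downstream then just GETs a static file from the webserver.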


They should ship Parquet: it supports predicate pushdown and is dramatically more compact. While you can't easily pull row-level data, it's great for analytical queries.


Totally agree! So much that I wrote a whole article about it a while back!

"Why parquet files are my preferred API for bulk open data" https://www.robinlinacre.com/parquet_api/


Parquet is a level better than csv but difficult to get customers to adopt and transmit in that format.


csvbase does ship parquet!

Just add ".parquet" - eg https://csvbase.com/meripaterson/stock-exchanges.parquet


Nice!


IMO this is the best simple option right now.

For sorted data you only need the relevant row groups which can be tunable to sensible sizes for your data and access pattern.


I'm an old fart; seeing the curl in every example is like nails on a chalkboard.

565 KB: that's about a 3-minute download on the 28.8 kbps modem I started getting online with...


When you got that modem were you concerned with how long it would have taken to type/write the information sent over it or were you just happy it didn't matter anymore?

I often wonder what I'll think of technology in, say, another 20 years, but I can never tell if it's all just some general shift in perspective. Or whether, as you look farther back, certain people were always about a certain perspective (e.g. doing the most given the constraints), and technology simply changes enough that people with different perspectives (e.g. getting the most done easily for the same unit of human time) become the most common users of the newer stuff; those people will in turn have a different perspective from the ones another 20 years down the line from them, and so on.

For example, maybe you think it's crazy to ask for the exact piece of data you need; I think it's crazy to do all the work to not just grab the whole half MB and extract what you need quickly on the client side, as often as you want; and someone equidistant from me will think it's nuts to not just feed all the data into their AI tool and ask it "what is the data for ${thing}", instead of caring about how the data gets delivered at all. Or maybe that's just something I hope for, because I don't want to end up as the guy who says everything is the same, just done slower on faster computers, since that seems... depressing in comparison :).


How much data do I need to download before I can do a protobuf request?


Mass transit files are often a single protobuf file that is like a megabyte or something.

I'm talking about zipped CSV files as easier to use than protobuf files. Neither is a request in this comparison.


Are you counting the client size? TLS handshake? Because otherwise the answer is none.


But what if I want the average over time? The query depends on every value, should everything be computed on the server?


Here's an API built on top of this data that allows for more fine tuned queries.

https://exchangeratesapi.io/

https://github.com/exchangeratesapi/exchangeratesapi


How much bandwidth does the average API documentation page use?


Does your client query the API documentation every time it's querying an API?


Well of course it does; how else would it know how to query the API? /s


> (So the article is cool, it's just the headline that's too much of a "hot take" for me.)

It's one of those headlines that we'd never allow if it were an editorialization by the submitter (even if the submitter were the author, in this case), but since it's the article's own subtitle, it's ok. A bit baity but more in a whimsical than a hardened-shameless way.

(I'm sure you probably noticed this but I thought the gloss might be interesting)


A very simple optimization for those complaining about having to fetch a large file every time you need a little datapoint: if they promised the file was append-only, and used HTTP gzip/brotli/whatever compression (as opposed to shipping a zip file), you could use range requests to only get the new data after your last refresh. Throw in an extra checksum header for peace of mind, and you have a pretty efficient, yet extremely simple incremental API.

(Yes, this assumes you keep the state, and you have to pay the price of the first download + state-keeping. Yes, it's also inefficient if you just need to get the EUR/JPY rate from 2007-08-22 a single time.)
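The client side of that incremental scheme might look like this (assuming the server honours Range requests on the append-only file; the function name is mine):

```python
import urllib.error
import urllib.request

def fetch_new_bytes(url, offset):
    """Fetch only the bytes appended after `offset`; b"" if nothing is new."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()  # 206 Partial Content: just the tail
    except urllib.error.HTTPError as e:
        if e.code == 416:  # Range Not Satisfiable: nothing new yet
            return b""
        raise
```

Keep the running offset (and optionally a checksum of what you already have) in local state, and each poll costs only the new rows.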


Also, on the topic of range requests: when a server allows range requests for zip files, the zip files are huge, and one needs just a few files from them, one can actually download just the "central directory" and the compressed data of the needed files, without downloading the whole zip file:

https://github.com/gtsystem/python-remotezip


Absolutely! I have a plan for a client lib that uses ETags (+ other tricks) to do just that.

Very WIP but check out my current "research quality" code here: https://pypi.org/project/csvbase-client/


Or, just serve a bunch of diff files. Just having a single daily patch can drastically reduce the bandwidth required to keep the file up to date on your side.

That's if downloading a few hundred kB more per day matters to you. It probably doesn't.


    >  curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | gunzip | sqlite3 ':memory:' '.import /dev/stdin stdin'   "select Date from stdin order by USD asc limit 1;"
    Error: in prepare, no such column: Date (1)

There is a typo in the example (that is not in the screenshot): you need to add a -csv argument to sqlite.


This is odd, I did originally have that argument but I removed it because it didn't seem to matter - it works without ("On My Machine"(tm))

Erk - readding it and busting the cache. After I put my kids to bed I will figure out what is wrong

EDIT: the reason it works for me is because I've got this config in ~/.sqliterc:

    .separator ','
Apparently at some point in the past I realised that I mostly insert csv files into it and set that default.


Genuinely curious - any particular reason you chose sqlite and gnuplot for this task rather than, say, Python?


I do also use Python (pandas) partway down the page :) I'm just trying to show different tools to give people ideas about how much is possible with stuff that is already in /usr/bin/

If you were to ask my own tool preferences which I use for $DAYJOB: pandas for data cleaning and small data because I think that dataframes are genuinely a good API. I use SQL for larger datasets and I am not keen on matplotlib but still use it for graphs I have to show to other people.


Shouting out that clickhouse-local can do all of those in one step, including downloading, parsing, SQL query and charting.


Going on a tangent:

> Why was the Euro so weak back in 2000? It was launched, without coins or notes, in January 1999. The Euro was, initially, a sort of in-game currency for the European Union. It existed only inside banks - so there were no notes or coins for it. That all came later. So did belief - early on it didn't look like the little Euro was going to make it: so the rate against the Dollar was 0.8252. That means that in October 2000, a Dollar would buy you 1.21 Euros (to reverse exchange rates, do 1/rate). Nowadays the Euro is much stronger: a Dollar would buy you less than 1 Euro.

Even if the Euro initially only existed electronically, it still had fixed exchange rates with the old currencies of the EU zone members. Most importantly with the established and well-trusted Deutsche Mark of Germany.

So any explanation of 'why was the Euro initially weak' also has to explain why the DEM was weak at the time. The explanation given in that paragraph doesn't sound like it's passing that test.


For little problems where you can just download the entire database every time and do read-only work, never underestimate the value of keeping it simple! I really like SQLite for this because it has the portability of a .json or .csv file, but it’s more ready to interact with as a database.


Just use clickhouse-local, and you can interact with any old CSV file as if it was a database.


Key point:

> Some things we didn't have to do in this case: negotiate access (for example by paying money or talking to a salesman); deposit our email address/company name/job title into someone's database of qualified leads, observe any quota; authenticate (often a substantial side-quest of its own), read any API docs at all or deal with any issues more serious than basic formatting and shape.


I am skeptical about the "observe any quota" part. Bandwidth is not free.


There probably is a quota of some sort, but since there's no reason to download the file very often, it's probably not a quota you need to worry about.


  curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
  | gunzip \
  | sqlite3 ':memory:' '.import /dev/stdin stdin' \
   "select Date from stdin order by USD asc limit 1;"
SQLite can read and write zip files.

https://sqlite.org/zipfile.html

Is it possible to use sqlite3 instead of gunzip for decompression?


Very good point!

If you don't mind saving the file to disk, you can do:

  sqlite3 -newline '' ':memory:' "SELECT data FROM zipfile('eurofxref-hist.zip')" \
  | sqlite3 -csv ':memory:' '.import /dev/stdin stdin' \
    "select ...;"
Doing this without a temporary file is tricky; for example `readfile('/dev/stdin')` doesn't work because SQLite tries to use seek(). Here is a super ugly method (using xxd to convert the zipfile to hex, to include it as a string literal in the SQL query):

  curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
  | { printf "SELECT data FROM zipfile(x'"; xxd -p | tr -d '\n'; printf "')"; } \
  | sqlite3 -newline '' \
  | sqlite3 -csv ':memory:' '.import /dev/stdin stdin' \
    "select ...;"


The example is failing at decompression for me, and I suspect it's because of this:

> That data comes as a zipfile, which gunzip will decompress.

Doesn't gunzip expect gzip files, as opposed to ZIP (i.e. .zip) files?

On Linux I get further (apparently Linux gunzip is more tolerant of the format error than the macOS default one), but there, I then run into:

> Error: no such column: Date


Here are the gzipped docs you can read to find out:

https://www.gnu.org/software/gzip/manual/gzip.txt.gz


This is slightly abusive of gunzip, but it does work, though only because the zipfile in question contains a single file and you are piping it in via shell. If you had tried to download the file and then run gunzip on it, the program would have refused.

As for the second error, I think you might be trying to import an empty file or maybe an error message? Probably related to the first problem.


> but it does work

Only on some gzip implementations, e.g. the one shipping with many Linux distributions. It doesn't work on macOS, for example.


the second error is because you need to add a `-csv` argument to sqlite


GNU gzip has limited support for ZIP:

> Files created by zip can be uncompressed by gzip only if they have a single member compressed with the “deflation” method. This feature is only intended to help conversion of tar.zip files to the tar.gz format.

-- https://www.gnu.org/software/gzip/manual/html_node/Overview....


The example works on my MacBook if I replace gunzip with funzip. Also, I like to see the value. So:

    curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | funzip | sqlite3 -csv ':memory:' '.import /dev/stdin stdin' "select Date,USD from stdin order by USD asc limit 1;"


Thx. I have extended a bit as a script:

    #!/bin/bash
    ARG1=${1:-'select Date from stdin order by USD asc limit 1;'}
    curl -o - -Ls https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
      | funzip | sqlite3 -csv ':memory:' '.import /dev/stdin stdin' "${ARG1}"

    # Usage:
    $ ./run.sh 'select * from stdin limit 1;'


At least in Ubuntu it doesn't do anything at all:

    # gunzip -d master.zip
    gunzip: master.zip: unknown suffix -- ignored


It's a ZIP file, so you'll want to use unzip, not gunzip.

Ubuntu's version of g(un)zip actually does tolerate ZIP inputs though; for that usage, just omit the '-d'.


Per the docs (I linked them in another comment), the correct command is `gunzip <foo.zip` or `gunzip -S .zip foo.zip`.


There are several reasons a compressed archive isn't a good data format outside trusted backups.

1. Inconsistent binary formats, float-precision edge conditions, and unknown endianness. Thus, assuming the marshalling of some document is reliable is risky/fragile, so pick some standard your partners also support... try XML/XSLT, BSON, JSON, SOAP, AMQP, or even EDIFACT.

2. "dump and load" is usually inefficient, with an exception when the entire dataset is going to change every time (NOAA weather maps etc.)

3. Anyone wise to the 42TiB bzip 18kiB file joke is acutely aware of what compressed files can do to server scripts.

4. Tuning firewall traffic-shaping for a web-server is different from a server designed to handle large files. Too tolerant rules causes persistent DDoS exposure issues, and too strict causes connections to become unreliable when busy/slow.

5. Anyone that has to deal with CSV files knows how many proprietary interpretations of a simple document format emerge.

Best of luck, and remember to have fun =)


About your Point 5, let me tell you about the ESV file format: Eggplant Separated Values! Who needs escaping if you have the eggplant emoji!


My favorite was console control-characters accidentally embedded in plain text by cross-platform document code-page encoding errors.

One learns over the years that every program stage/interface must sanitize input, sanity check formats, and isolate data on a per account log.

Did the user intend "P=0/O" as a joke... we'll never know. =)


A ZIP bomb such as 42.zip is a compressed archive whose size is 42 kilobytes but which expands to 4.5 petabytes of uncompressed data. https://blog.fileformat.com/compression/zip-bombs-exploding-...
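When you have to unpack archives you don't control, the usual defence is to cap sizes and re-check while streaming, since the declared size can lie. A sketch (cap value and function name are made up):

```python
import zipfile

MAX_MEMBER = 10 * 1024 * 1024  # arbitrary 10 MiB cap for illustration

def safe_read(src, name):
    """Read one archive member, refusing anything over MAX_MEMBER bytes."""
    with zipfile.ZipFile(src) as z:
        info = z.getinfo(name)
        if info.file_size > MAX_MEMBER:  # declared size (may be a lie)
            raise ValueError("declared size exceeds cap")
        chunks, total = [], 0
        with z.open(name) as f:
            while chunk := f.read(64 * 1024):
                total += len(chunk)
                if total > MAX_MEMBER:  # actual size, enforced as we stream
                    raise ValueError("actual size exceeds cap")
                chunks.append(chunk)
        return b"".join(chunks)
```

The streaming re-check matters because the central directory's size field is attacker-controlled; only the inflated byte count can be trusted.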


Yeah having a local snapshot of an entire dataset is obviously nice when it's feasible. But there's no need to conflate that with "application programming interface" just to market csvbase.


There's an issue with openly & freely available financial data. It tends to be limited (i.e. you need many sources to answer the questions you want and not just the questions a particular source is able to answer). And if you're bringing in different sources of data, each source requires custom fetching and parsing, along with being just annoyingly slightly different from other sources. Even if the value is given on a per-day basis (which is a huge assumption, lots of things like interest rates are published much slower), you have things like "is the value at the start of day, end of day, market close, median? What time zone? Which exchange? What about days when the exchange was closed? What about odd exchange hours or halts?", "what's the date and time format, what's the naming convention?"


This is why Bloomberg and friends make money. They're a cartel, but at least they make all this data work together, and they also transcribe stuff that is written in English into structured data, etc. (hence their interest in AI).


Target rates aren't published very frequently but interest rates themselves are typically daily or better.


Interesting.

As someone with a lot of data experience, and in particular including financial data (trading team in a hedge fund) I definitely prefer the column format where each currency is its own column.

That way it's very easy to filter for only the currencies I care about, and common data processing software (e.g. Pandas for Python) natively support columns, so you can get e.g. USDGBP rate simply by dividing two columns.

The biggest drawback of `eurofxref-hist` IMO is that it uses EUR as the numeraire (i.e. EUR is always 1), whereas most of the finance world uses USD. (Yeah I know it's in the name, it's just that I'd rather use a file with USD-denominated FX rates, if one was available.)
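For illustration, the column-division trick mentioned above (sample numbers are made up; the EUR-based columns are units of currency per 1 EUR, so dividing USD by GBP gives USD per GBP):

```python
import io

import pandas as pd

# Wide EUR-based rates, in the same shape as eurofxref-hist.csv
df = pd.read_csv(io.StringIO("Date,USD,GBP\n2000-10-26,0.8252,0.5882\n"))

# (USD per EUR) / (GBP per EUR) = USD per GBP, i.e. the GBPUSD cross rate
df["GBPUSD"] = df["USD"] / df["GBP"]
```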


Are you really asking for the european central bank to not use its own currency?


Not sure what you mean by "most of the finance world", but in forex trading the EUR/USD rate is denominated exactly as in this file - 1.066 for today, and not the inverse 0.94.


right, I mean quant trading - i.e. where people actually use data :D you want things relatively standardised, which means using a common unit of account - for most funds AFAIK that's USD


This is my experience as well. Everything is expressed as 'number to multiply by to get USD' so USDJPY would always be used in-house inverted. British firm as well. Nobody likes the GBP.


But why would the ECB publish its Euro fixings in some other currency?


Lots of exchanges and banks still use FTP and friends to publish data to both the public and counterparties


I would say almost all do, especially the larger ones. Looking at you, Wells Fargo, with your tabbed and nested-record CSVs.


Indeed, I think ACH (Automated Clearing House) specifically still relies on FTP.


> Switching from wide to long is a simple operation, commonly called a "melt". Unfortunately, it's not available in SQL.

That's not completely true. In most SQL databases you can query metadata about the table, and this is true for SQLite too. You cannot use this information to write a single query with a "dynamic" number of columns, but you can always generate the query from the metadata and then execute the generated query. It is not exactly fun to do as a one-liner, but it works:

> curl 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv' | sqlite3 -csv ':memory:' '.import /dev/stdin stdin' '.mode list' '.output query.sql' "select 'select Date, ''' || name || ''', ' || name || ' from stdin ;' from pragma_table_info('stdin') where cid > 0 and cid < 42;" '.output stdout' '.mode csv' '.read query.sql'

(yes, you're also left with some temporary file on the computer).


> Although pulling out a simple max is easy, the data shape is not ideal. It's in "wide" format - a Date column, and then an extra column for every currency. Here's the csv header for that file:

> Date,USD,JPY,BGN,CYP,CZK,DKK,EEK,GBP,HUF,LTL,LVL,MTL,[and on, and on]

> When doing filters and aggregations, life is easier if the data is in "long" format, like this:

> Date,Currency,Rate

> Switching from wide to long is a simple operation, commonly called a "melt". Unfortunately, it's not available in SQL.

> No matter, you can melt with pandas:

    curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | \
      gunzip | \
      python3 -c 'import sys, pandas as pd
    pd.read_csv(sys.stdin).melt("Date").to_csv(sys.stdout, index=False)'

You sure this can't be done in SQL? I myself don't know enough SQL, but I believe you can decompose a single row into multiple rows by combining multiple queries with UNION ALL:

https://stackoverflow.com/questions/46217564/converting-sing...

https://stackoverflow.com/questions/52279384/how-do-i-melt-a...

Now the question is, how to make this less verbose (so that the query doesn't need to mention each of the 42 columns)
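One way to make it less verbose is to generate the UNION ALL from the table's own metadata; a sketch with Python's built-in sqlite3 module (table name and values invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rates (Date TEXT, USD REAL, JPY REAL, GBP REAL)")
con.execute("INSERT INTO rates VALUES ('2024-01-03', 1.0919, 155.77, 0.8650)")

# Pull the column names out of the table metadata instead of typing all
# 40-odd currencies, then emit one SELECT per currency and UNION them.
cols = [row[1] for row in con.execute("PRAGMA table_info(rates)")][1:]
melt_sql = " UNION ALL ".join(
    f"SELECT Date, '{c}' AS Currency, {c} AS Rate FROM rates" for c in cols
)

melted = list(con.execute(melt_sql + " ORDER BY Currency"))
for row in melted:
    print(row)  # one (Date, Currency, Rate) row per original column
```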


That sounds like PIVOT and UNPIVOT. But of course being SQL you have to list all the columns explicitly, which can get kind of annoying.


Csv is surprisingly common and I prefer it despite parsing, quoting, encoding and escaping woes.

Amazon S3 lets you query CSV files already loaded in buckets, which is interesting but I haven't used it yet.

One company I worked at a long time ago used free dropbox accounts as a ftp like drop that they would consume. Was hilarious and it worked well and was easy to stay under the free limits.
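Most of the quoting/escaping woes disappear when writer and reader agree on the rules; for instance, Python's stdlib csv module round-trips fields containing delimiters, quotes, and even embedded newlines:

```python
import csv
import io

# Fields containing the delimiter, quote characters, and embedded
# newlines survive a round trip under the default quoting rules.
rows = [["id", "note"],
        ["1", 'says "hi", then leaves'],
        ["2", "line one\nline two"]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)

back = list(csv.reader(io.StringIO(buf.getvalue())))
assert back == rows
```

(When reading from a real file, open it with newline="" so embedded newlines aren't mangled.)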


The site apparently serves .csv directly as well (replace .zip with .csv). I used this to convert the expression to powershell (using only built-in powershell cmdlets and commands):

    $> (invoke-webrequest "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv").Content
         | ConvertFrom-Csv
         | sort "usd"
         | select "date" -first 1

       Date
       ----
       2000-10-26
Doing it with a zip file would be a little more verbose since there is no built-in "gunzip"-type command which operates on streams, but you can write one which does basically that out of built-in .NET functions:

    function ConvertFrom-Zip {
        param(
            [Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$true)]
            [byte[]]$Data
        )
        
        process {
            $memoryStream = [System.IO.MemoryStream]::new($Data)
            $zipArchive = [System.IO.Compression.ZipArchive]::new($memoryStream)
        
            $outputStreams = @()
            
            foreach ($entry in $zipArchive.Entries) {
                $reader = [System.IO.StreamReader]::new($entry.Open())
                $outputStreams += $reader.ReadToEnd()
                $reader.Close()
            }

            $zipArchive.Dispose()
            $memoryStream.Dispose()

            return $outputStreams
        }
    }
and call it like:

    # unary "," operator required to have powershell
    # pipe the byte[] as a single argument rather
    # than piping each byte individually to 
    # ConvertFrom-Zip

    $> ,(invoke-webrequest "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip").Content
         | ConvertFrom-Zip
         | ConvertFrom-Csv
         | sort "usd"
         | select "date" -first 1

       Date
       ----
       2000-10-26
I love powershell


Here it is in nushell:

    /tmp> # be kind to the server and only download the file if it's updated
    /tmp> curl -s -o /tmp/euro.zip -z /tmp/euro.zip https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip
    /tmp> unzip -p /tmp/euro.zip | from csv | select Date USD | sort-by USD | first
    ╭──────┬────────────╮
    │ Date │ 2000-10-26 │
    │ USD  │ 0.83       │
    ╰──────┴────────────╯

(I removed the pipe to gunzip because 1. gunzip doesn't work like that on macOS and 2. it's not something you should expect to work anyway: a zip file is a container, and it can't normally be unzipped as a stream)


> be kind to the server and only download the file if it's updated

I wonder if their webserver supports the If-modified-since http header


You can check with `curl -v`.

tl;dr: the server doesn't support that header, but since the response does include a Last-Modified header, curl helpfully aborts the transfer if the Last-Modified date is the same as the mtime of the previously downloaded file.
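For the record, the conditional request itself is easy to construct on the client side; a sketch with Python's stdlib that mirrors curl's -z flag (whether a given server honors it is up to the server - one that ignores the header just replies 200 with the full body):

```python
import email.utils
import urllib.request

def conditional_get(url, last_mtime=None):
    # Ask the server to send a body only if the resource changed after
    # last_mtime (a Unix timestamp), i.e. what curl's -z flag does.
    req = urllib.request.Request(url)
    if last_mtime is not None:
        req.add_header("If-Modified-Since",
                       email.utils.formatdate(last_mtime, usegmt=True))
    return req

req = conditional_get(
    "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip", 0)
print(req.get_header("If-modified-since"))
# Thu, 01 Jan 1970 00:00:00 GMT
```

A server that supports it replies 304 Not Modified with no body. (Nothing is actually sent here; urlopen(req) would perform the request.)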


This is not a secure use of /tmp.


I'll throw out that the GTFS transit standard involves publishing zipped CSVs representing your DB tables regularly. There are nice things about it and downsides IMO. This is how Google Maps' transit directions function -- they rely on agencies to publish their schedules regularly in zipped CSVs I think.

One downside is that dev experience is pretty bad in my opinion. It took me years of fiddling in my free time to realize that if you happen to try to use the CSVs near a schedule change, you don't know what data you're missing, or that you're missing data, until you go to use it. My local agency doesn't publish historical CSVs so if you just missed that old CSV and need it, you're relying on the community to have a copy somewhere. Similarly, if a schedule change just happened but you failed to download the CSV, you don't know until you go to match up IDs in the APIs.


The EDGAR API bulk data is similar in nature, albeit in JSON instead.[1]

[1]: https://github.com/andrewmcwattersandco/programming-language...


If this is his favorite API, he should check out patent office gazettes


I used to help run the Best Buy product catalog API. We did nightly CSVs instead of making people poll/paginate the entire catalog. It was super widely used.


What's the use case here? For a quick look up you're not going to write that monster command. And for a more ongoing repeated use, you're not going to download and uncompress the file every time you want to query it...

I get the satisfaction as a curiosity, but other than that, to me that wouldn't be enough to make it my "favorite" or even "good".


I viewed it as an example of how simple it could be (essentially a one-liner), but not meant for actual repeated use as your points suggest.


I worked at the Chicago Mercantile Exchange 15 or so years ago and we had dozens of similar files that most major financial institutions downloaded daily. If they weren’t published on schedule we would start getting calls from executives at Bloomberg, Goldman, etc. I wrote a small system to manage and monitor their publication.


Too many different tools with weird syntax... once you are used to Nushell, you keep using it :)

http get https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | gunzip | from csv | sort-by USD | first | get Date


I believe that curl does not have client-side caching, so every time you run the command it downloads the CSV. Downloading the CSV once and then analyzing it locally would no longer make it a magic one-liner you can send to people, but it would save bandwidth and reduce traffic on the API.

Unless there is caching going on here? Perhaps a CDN cache on the server side?


> I believe that curl does not have client side caching

See:

https://news.ycombinator.com/item?id=37528558

by llimllib

Specifically:

   curl  -o /tmp/euro.zip -z /tmp/euro.zip
Option o is output, option z says

"Request a file that has been modified later than the given time and date, or one that has been modified before that time."

And:

https://news.ycombinator.com/item?id=37529690

by sltkr:

"the server doesn't support that header, but since the response does include a Last-Modified header, curl helpfully aborts the transfer if the Last-Modified date is the same as the mtime of the previously downloaded file."


You should add these files to the website in sync whenever they get published. Most statistics institutions publish their data in CSV format on a webpage. You could bookmark them once and create a database of all the data so the app is used more


If your data is somewhat sparse, somewhat hierarchical, or has an evolving/amorphous schema, gzipped JSONL is better. JSONL is supported by ClickHouse and jq, and is a bit more consistent around escaping/delimiting.
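A minimal sketch of the gzipped-JSONL shape with the Python stdlib (field names invented), showing why an evolving schema is painless - each line is its own self-describing object:

```python
import gzip
import io
import json

# Records need not share a schema: one JSON object per line,
# the whole stream gzipped.
records = [{"date": "2024-01-03", "USD": 1.0919},
           {"date": "2024-01-04", "USD": 1.0953, "note": "new field"}]

buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back line by line; gzip.open accepts a file object directly.
with gzip.open(io.BytesIO(buf.getvalue()), "rt", encoding="utf-8") as f:
    back = [json.loads(line) for line in f]
assert back == records
```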


Just thinking out loud: If we make the zip file a parquet file, wouldn't we be able to do partial downloads (like getting the first 1000 lines) instead of getting the whole file every single time?


Great read. I didn't know it's so easy to plot something in the terminal. Also really shows the power of 2D data structures. A good long format and you can do anything.


If you’re going to go all bash, use cut (made for it) or sed to remove the extra comma.

And you could use awk to do the melt if you don’t have python and pandas.


In my case it is not Bash, but POSIX shell:

    #!/bin/sh
    
    set -eu
    
    u=https://www.ecb.europa.eu/stats/eurofxref/eurofxref.zip
    ftp -Vo - $u | zcat | sed 's/, $//' | awk -F ', ' '
    NR == 1 {
        for(i = 1; i <= NF; ++i) {
            col[$i] = i
            c       = ($i != "GBP") ? $i : "EUR"
            printf "%s%s", c, (i < NF) ? "\t" : "\n"
        }
    }
    NR >= 2 {
        printf "%s\t", $1
        for(i = 2; i <= NF; ++i) {
            c = (i != col["GBP"]) ? $i/$(col["GBP"]) : $i
            printf "%s%s", c, (i < NF) ? "\t" : "\n"
        }
    }
    '
Fetches the latest rates and converts the base currency to GBP and format to TSV. Then I run a cron(8) job 15~30 minutes after the release each work day and dump the output so that I can retrieve it from httpd(8) over HTTPS to be injected into my window manager status bar. A pretty elegant solution that spares the ECB from my desktops hammering their server. All with tools from the OpenBSD base install.


I find awk to be more reliable even for simple string manipulation. Something about sed's regexes throw me for a loop!

