Ah, I remember this specific file from my time at the ECB 15-ish years ago.
IIRC it was by far the most downloaded file on the ECB website. Tons of people, including many financial institutions, downloaded it daily, and used it to update their own systems.
IIRC #2 in the minutes immediately after the daily scheduled time for publishing this file, there was a massive traffic spike.
It was a conscious decision to make it a simple CSV file (once unzipped): it made it possible to serve the file reliably, fast, and with little resources needed.
The small team responsible at the time for the ECB’s public website was inordinately proud of the technical decisions made to serve this data in a single static file. And rightly so.
This is something that powers a lot of data interchange, but lots of people don't realize it because they never work with older systems like that. It's not shiny. No frameworks.
I used to work for a large, old company whose products you have all bought and buy (and which shall remain nameless) about 15 years ago. I worked on data interchange between the systems used for record keeping of said products and various other downstream or parallel systems (same purpose, separate system because left over from merger/acquisition).
So basically bulk data import and export. The product was 15 years old at that point. Data interchange was via various fixed width or delimited (CSV but the C might be some other character or character sequence) files transferred to and from various SFTP servers.
It's been a while but there were probably like 20 or 30 different such data sources or exports going in and out. Worked like a charm. I bet they're still used without much change. The frontend was being rewritten at the time (old one was in Smalltalk).
I'm using this type of simple approach to build a SaaS right now. We need to spin up many VPS and provision them, and the fastest way to do that is with rsync and ssh.
But we didn't stop there: this SaaS for our open source browser product is entirely built like this: behind the scenes it's a collection of bash scripts that implement and execute the business operation of the SaaS.
So basically, it's a command-line interface to the SaaS. Think of it this way: say I didn't have a website, with login, and "click a button to open a browser", but instead people would write me letters, send me cheques, or call me on the phone. Then I could serve their requests manually, at the command line.
The reason I made it like this was:
- clear separation between thin web front-end and actual business logic
- nice command-line interface (options, usage, help, clear error messages) to business logic for maintenance and support to jump on and fix things (see the sketch after this list)
- inheritance of operating system permissions and user process isolation
- highly testable implementation
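A hypothetical sketch of what one of those business-logic scripts might look like (the script name, options, and provisioning steps are all invented for illustration):

#!/usr/bin/env bash
# create-instance: provision a browser instance for a customer (illustrative only)
set -euo pipefail

usage() {
  echo "Usage: create-instance --customer <id> [--region <region>]" >&2
  exit 2
}

customer="" region="eu-1"
while [ $# -gt 0 ]; do
  case "$1" in
    --customer) customer="$2"; shift 2 ;;
    --region)   region="$2"; shift 2 ;;
    -h|--help)  usage ;;
    *) echo "unknown option: $1" >&2; usage ;;
  esac
done
[ -n "$customer" ] || usage

# The actual work: copy the provisioning scripts to the new VPS over rsync/ssh and run them.
rsync -az ./provision/ "root@${region}.example.net:/opt/provision/"
ssh "root@${region}.example.net" "/opt/provision/setup.sh '${customer}'"

The thin web front-end then just shells out to scripts like this, and maintenance and support staff can run them by hand.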
Maybe this is dumb, but I really like it. To me it's an architecture and approach that makes sense.
I'm sure this is not new, and I think a lot of good-quality operations must be built this way. I strongly align with the author's stance of composing a few simple command-line tools to get the job done.
Perhaps we can call this "unix driven development", or "unix-philosophy backend engineering"
I feel like you went way overboard with this one.. I think the first paragraph and an example of what you're doing that's similar to OP would've been plenty
A lot of developers talk smack about CSV files (or tab delimited) but they really are the workhorse of our industry. They're simple, textual, compress really well, and as long as everyone on both sides has an agreement about the format, they're almost a standard.
sure, a lot of the criticisms of CSV are true, but given the above constraints it's really hard to beat them.
Wow. I’m just thinking of the hoops I’d have to go through to host changing data as a static CSV/ZIP at my employer.
Architects would complain ZIP isn't a format that conforms to their specs for this purpose. Compliance would complain there needs to be a check that no private information is leaked. Risk, that I should prevent bad actors from downloading the file. Web people, that I need an approved change to add stuff to the site.
Yeah, but modern, approved practices don't really address any of that - they just hide it under a pile of extra complexity that's too big for most devs and managers' attention span, effectively disappearing it.
HTTP compression is optional, so they either have to compress on the fly (wasting cpu) or provide multiple versions (complicating setup and deployment) or make some HTTP clients not work.
A single zip file is really the easiest solution for cases when the file must absolutely be compressed
Many webservers allow you to serve a compressed file (stored on disk) and _decompress_ when a client specifically can't support the compressed encoding. Since most clients should support compression, this means you only use the CPU for the rarer case where the uncompressed data is required.
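A quick way to check what a given server actually does for each kind of client (the URL is illustrative; the headers you see depend entirely on the server's configuration):

# With gzip accepted: expect "Content-Encoding: gzip" if the server has a precompressed copy or compresses on the fly.
curl -sI -H 'Accept-Encoding: gzip' https://example.com/rates.csv | grep -i '^content-encoding'

# Without it: a server configured as described above decompresses on demand and sends plain data.
curl -sI https://example.com/rates.csv | grep -i '^content-encoding'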
But what if an analyst needs to access this data and run their regressions on a potato? Surely that use case is worth adding a few libraries to handle.
hmm good point, can you draft up an architecture plan using multiple microservices and redundancy via a kubernetes cluster and have it on my desk by Monday please.
The business logic here is >15 years old. Http compression was only in early stages then, and you can guarantee that many client-side scripts and libraries would not have supported it. Zip was well known. Compress and place.
If I'm not missing the point here, this was, and still is, about offering the simplest, most reliable solution over a long period. This is a near perfect example of how to do exactly that. No changing formats, no moving requirements, no big swings in frameworks, apis, or even standards. And most importantly, no breaking your customers business workflows.
No edge case dependencies on the WWW server's configuration, and no sudden "why did we just saturate our external connection?"
No emergency change requests from the outage team that have to be impacted by other areas and fit into the infrastructure team's maintenance windows and their capacity to address that.
No rebalancing of workloads because Jane had to implement (or schedule the task and monitor it) that change, Joe had to check and verify that the external availability tests passed, and Annick had to sign off on the change complete, and now everyone isn't available for another OT window for the week.
> "The business logic here is >15 years old. Http compression was only in early stages then"
At least one of us is confused about history here; are you really saying that circa 2008, HTTP compression would have been considered immature or unstable?
It is exactly what most API optimization does. For example, batch queries merge several API calls into one to reduce the call count. And in this case, it is perfectly optimized: you have literally one zip file, and nothing more.
By distributing the data in a compressed file, you retain the benefits of the smaller file size at all stages (transfer from source, local storage, local transmission) until the data needs to be used in uncompressed format, all without any additional processing.
Scale8 recently archived their GitHub repo without explanation, and right after that the URL for a GDPR opt-in list they were using changed. The product is now dead because the code doesn't handle the situation where that file might not be available.
(Scale8 is a web analytics and tag management software, like Google Analytics)
> IIRC #2 in the minutes immediately after the daily scheduled time for publishing this file, there was a massive traffic spike.
Perfect usecase for bittorrent protocol imho. Funny thing is you can serve files with bittorrent protocol from Amazon S3 without any additional complexity or charges I think.
Does HTTP compression perform well if you're streaming a file? Wouldn't it need to have access to the entire file to have the same result as zipping it?
Simple file downloads and CSV files are fantastic. I wish more folks would publish data in simple formats like this. Every time I have to fill a "shopping cart" for a US government data download I die a little.
There's a bunch of wrapper tools to make this particular pipeline easier. Also something like Datasette is great if you want a web view and some fancier features.
One nice thing about CSV files being zipped and served via the web is they can be streamed directly into the database incredibly fast without having to persist them anywhere (aside from the db).
You can load the zip file as a stream, read the CSV line by line, transform it, and then load it to the db using COPY FROM stdin (assuming Postgres).
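A minimal sketch of that pipeline (skipping the transform step), assuming a Postgres table named rates whose columns match the CSV and a DATABASE_URL connection string; funzip extracts the zip's first member to stdout:

# Stream the zipped CSV straight into Postgres without persisting it anywhere else.
curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
  | funzip \
  | psql "$DATABASE_URL" -c "\copy rates from stdin with (format csv, header)"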
It isn't. But that's easily mitigated with temp tables, ephemeral database and COPY etc.
Upstream can easily f-up and (accidentally) delete production data if you do this on a live db. Which is why PostgreSQL and nearly all other DBs have a myriad of tools to solve this by not doing it directly on a production database
I had this last week, but instead it was a 3rd party api and their service started returning null instead of true for the has_more property beyond the second page of results.
In either case, the solution is probably to check rough counts and error out if they're not reasonable.
That’s not how COPY FROM works in postgres. You give it a csv and a table matching the structure and it hammers the data into the table faster than anything else can.
You should check out PowerShell; it supports converting CSV into in-memory structured data and then you can run regular PowerShell queries on that data:
$> csvData = @"
Name,Department,Salary
John Doe,IT,60000
Jane Smith,Finance,75000
Alice Johnson,HR,65000
Bob Anderson,IT,71000
"@;
$> $csvData
| ConvertFrom-Csv
| Select Name, Salary
| Sort Salary -Descending
Name Salary
---- ------
Jane Smith 75000
Bob Anderson 71000
Alice Johnson 65000
John Doe 60000
You can also then convert the results back into CSV by piping into ConvertTo-Csv
/tmp/> "Name,Department,Salary
::: John Doe,IT,60000
::: Jane Smith,Finance,75000
::: Alice Johnson,HR,65000
::: Bob Anderson,IT,71000" |
::: from csv |
::: select Name Salary |
::: sort-by -r Salary
╭───┬───────────────┬────────╮
│ # │ Name │ Salary │
├───┼───────────────┼────────┤
│ 0 │ Jane Smith │ 75000 │
│ 1 │ Bob Anderson │ 71000 │
│ 2 │ Alice Johnson │ 65000 │
│ 3 │ John Doe │ 60000 │
╰───┴───────────────┴────────╯
for something a bit more robust, check out DuckDB. It's a library you can embed, use it to run SQL on local files as well as connect to databases, do joins, analytics, etc.
Agreed. The article mentioned DuckDB and I'm here to thumbs-up the use of DuckDB wholeheartedly. If you like the world of public CSV files as data sources that you can query or cross-query, DuckDB is the tool for you. Just follow the demo on the DuckDB website and you'll be wow'd for sure.
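For example, something like this against the history CSV from the article (a sketch using DuckDB's CLI and CSV reader):

# Find the date the euro was weakest against the dollar, querying the CSV in place.
duckdb -c "SELECT Date, USD FROM read_csv_auto('eurofxref-hist.csv') ORDER BY USD ASC LIMIT 1;"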
I use both, and I have found it helpful to have nushell around when munging csv and parquet files, even when working with duckdb - I find it quicker to ask for the first few rows of a thing or do simple stuff with it, then deeper analysis with duckdb.
- Send the csv data from stdin (using echo and referred to in the command by -)
- Refer to the data in the query by stdin. You may also use the _t_N syntax (first table is _t_1, then _t_2, etc.), or the file name itself before the .csv extension if we were using files.
- Pipe the output to the table command for formatting.
- Also, the shape of the result is printed to stderr (the (4, 2) below).
$ echo 'Name,Department,Salary
John Doe,IT,60000
Jane Smith,Finance,75000
Alice Johnson,HR,65000
Bob Anderson,IT,71000' |
qsv sqlp - 'SELECT Name, Salary FROM stdin ORDER BY Salary DESC' |
qsv table
(4, 2)
Name Salary
Jane Smith 75000
Bob Anderson 71000
Alice Johnson 65000
John Doe 60000
You might find a lot of interesting tools in the Datasette ecosystem. Data dashboarding for SQLite with all sorts of import and export and visualization plugins.
What’s really interesting about it is that Awk is now, finally, getting support for CSV. But I bet a large amount of Awk+CSV use cases can be met with SQLite and SQL+CSV.
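For instance, something along these lines, using the same sqlite3 CLI tricks as elsewhere in this thread (file and table names are illustrative):

# Import the CSV into an in-memory table and run plain SQL over it.
sqlite3 -csv ':memory:' '.import eurofxref-hist.csv rates' \
  'SELECT min(Date), max(Date), count(*) FROM rates;'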
The clickhouse-local tool is also really great for querying a bunch of systems, and it has connectors and converters for a whole bunch of other systems and formats.
CSV would be great if there were one CSV, and if you absolutely guarantee that nobody has "improved" the CSV by editing it with a spreadsheet program (including opening it with a spreadsheet program) or their own deft little fingers in a text editor.
For example:
"Look, this contains \"quotes\"!",012345
Or:
"Look, this contains ""quotes""!",012345
Or, for some degenerate examples:
"Look, this contains "quotes"!",012345
Or:
Look, this contains "quotes"!,012345
Or the spoor of a spreadsheet:
"Look, this contains ""quotes""!",12345
Theoretically, JSON isn't immune to being hand-hacked into a semi-coherent mess. In practice, people don't seem to do that to JSON files, at least not that I've seen. Ditto number problems, in that in JSON, serial numbers and such tend to be strings instead of integers a "helpful" application can lop a few zeroes off of.
> CSV would be great if there were one CSV, and if you absolutely guarantee that nobody has "improved" the CSV by editing it with a spreadsheet program (including opening it with a spreadsheet program)
Practically no formats actually pass those rules. Even plain text is bound to be "improved" by text editors frequently (normalization of line endings, removal of data not in a known encoding, UTF BOM, UTF normalization, etc.)
Just don't do what, edit your data? The problem is that all JSON implementations agree on what a properly encoded JSON file looks like, and that's not true for CSV.
Arbitrary CSV is not, in general, round-trippable.
JSON is JavaScript Object Notation. JavaScript itself can't represent large numbers like 2^65, or numbers like 2^55+75 exactly. The JSON spec doesn't limit numbers to 2^53 or 2^64, but it also doesn't guarantee that they will be supported.
In practice, you open the file a disappointed customer sends you because your software doesn't parse it. You resist yelling at whoever decided to ever do CSV support in your software because you know what it's like. But the customer is important, so you either write a one-off converter for them to a more proper CSV format, or you add - not change - several lines of code in mainline to be able to also parse the customer's format. God forbid maybe also write. Repeat that a couple of times. See the mess increasing. Decide to use a so-called battle-tested CSV parser after all, only to find out it doesn't pass all tests. Weep. Especially because at the same time you also support a binary format which is 100 times faster to read and write and has no issues whatsoever.
I'm not blind to the benefits of CSV, but oh boy it would be nice if it were, for starters, actually comma-separated.
JSONL should replace CSV. It is standardized and its escapes are mostly well specified. It is effectively CSV with "[" and "]" surrounding each line.
Regular JSON would work fine for a static file, and would make schemas and links (JSON-LD) possible. But then the file could be any structure. JSONL works better for systems that assume line-based records, and its records are more likely to be consistent and simple.
I can almost assure you it was an explicit requirement in an RFP that was copied from some master template. So not a good justification - but a justification in the eyes of the bureaucracy.
The book Recoding America has a lot of anecdotes to this effect; most of these situations reduce to a Congressional mandate that got misinterpreted along the way. My favorite was for an update to Social Security. The department in charge of the implementation swore that Congress was forcing them to build a facebook for doctors (literally where doctors could friend other doctors and message them). Congress had no such intention; it was actually 3rd party lobbying that wanted the requirement so they could build their own solution outside of government. Really crazy stuff.
> Congress had no such intention; it was actually 3rd party lobbying
Right, but 3rd party lobbying can't force anyone to do anything, whereas Congress can (and did) give this mandate the force of law. The fact that lobbyists got Congress to do something that they had "no such intention" to do is its own problem, but let's not lose sight of who is responsible for laws.
That's the interesting part of this story; Congress didn't think this requirement existed, neither did the lobbyists. But the language that congress adopted (with the consultation of this lobbying group) made the agencies _think_ it was what congress wanted. So the agency was left holding the bag for something no one actually wanted in the first place. Like a big game of telephone.
I agree with your broader point however. Congress needs to do a better job of owning outcomes.
I for one was tickled the first time I paid an NYC parking ticket online and had to add it to my cart, as if they might upsell me on some other more serious violation. Act now before it's too late!
When you're paying for things, and can pay for several things at the same time, it makes sense. I helped my mother pay her property taxes this year, and for two properties we had to make four payments. Without the shopping cart (or a functional equivalent) I would have had to enter my payment information four times instead of once.
I assume it's a vestige from the old days when you ordered actual tapes or printouts from government agencies. The notion of an order that had to be prepared is baked into how they think about the service and product.
All sorts of strange things happen with accessing US government data. But most agencies have a lot of excellent data available for free and motivated data scientists who want to make it available to you.
Some records are only released for a fee, so I always assumed implementing a standard interface to handle commercial transactions and making select items free was easier than maintaining separate commercial/noncommercial sites.
(As a side note, I can understand why in years past it would cost multiple cents per page to physically photocopy a federal document - but it is absolutely absurd that already-digitized documents, documents which are fundamentally part of the precedent that decides whether our behavior does or doesn't cause civil or criminal liability, are behind a paywall for a digital download!)
I’d go further and say that government websites and services in general are weird and often feel outdated, especially the more local you get.
Some of them look like they’re straight from the early 2000s.
Another thing is that privacy is dead and almost everything is deemed public information.
Your address, mugshot, voter rolls, you name it, it’s all deemed public information.
But once you actually want to access information that is useful to society as a whole, more often than not it’s behind a paywall.
Despite this, they still call it “public” information because theoretically anyone can pay for it and get access.
It’s one of the first things I noticed when I moved to the US.
Another thing that I’ve noticed is that, if possible, there always needs to be a middle man inserted, a private corporation that can make a profit out of it.
You want to identify yourself to the government?
Well you’re gonna need an ID.me account or complete a quiz provided to you by LexisNexis or one of the credit reporting agencies.
Why? How is it that the government, of all entities, isn’t capable of verifying me themselves?
Zooming out even further you’ll start to recognize even more ancient processes you interact with in daily life.
The whole banking situation and the backbone that’s running it is a great example.
The concept of pending transactions, checks and expensive and slow transfers is baffling to me.
It’s so weird, like ooh, aah this country is the pinnacle of technological innovation, yet in daily life there’s so much that hinges on ancient and suboptimal processes that you don’t see in, say, Western Europe.
My best guess is that this is because of a mix of lack of funding and politicians wanting to placate profit-seeking corporations.
Ironically, and I have no hard evidence for this because I’m too lazy to look into it, I suspect that in the long term it costs more to outsource it than it does to do it themselves.
I feel like the USA was probably ahead of the curve on things like the banking system in the 90s but removing all of that cruft and replacing with a v2 is not an easy thing to do. Especially when the population are generally skeptical of government and outsourcing to a private corporation is seen as the cheaper option. Short term it's potentially cheaper/quicker to outsource but long term maintenance contracts and lobbying to keep things as they are is much more expensive.
In the USA you find these whole industries that exist due to the inadequacies of the old systems. E.g. Venmo doesn't need to exist in Western Europe (and probably the rest of the world?) because person-to-person bank transactions are free and easy.
The web development companies that are subcontracted by the government agencies to repurpose their silly shopping cart software.
I will decline to share my personal anecdotes about these companies because I am like 10+ years out of date, but I can tell you that most of these companies seemed to have certain very specific non-technical things in common.
> Simple file downloads and CSV files are fantastic. I wish more folks would publish data in simple formats like this.
The document format doesn't seem to have much to do with the problem. I mean, if the CSV is replaced by a zipped JSON doc then the benefits are the same.
> Every time I have to fill a "shopping cart" for a US government data download I die a little.
Now that seems to be the real problem: too many hurdles in the way of a simple download of a statically-served file.
csvtool is probably what you're looking for, though I think the use case for JSON vs CSV is different as one is hierarchy-oriented and the other is flat.
> For CSV? DuckDB, Datasette, awk, pawk, and tons of others.
Those tools are either line- or table-oriented, and thus don't handle CSVs as a structured format. For instance, is there any easy way to use AWK to handle a doc where one line has N columns but some lines have N+1, with a random column entered in an undetermined point?
The CSV Spec (1) specifies that “Each line should contain the same number of fields throughout the file”, so what you describe would be a noncompliant file. Nevertheless, I would expect both awk and pawk to be able to work with it just fine.
Both pawk and the recent version of awk can parse the CSV format correctly (not just splitting on commas, but also handling the quoting rules for example).
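For example, GNU Awk 5.3 added a --csv mode that respects quoting rather than naively splitting on commas; a sketch (assuming gawk 5.3+ and an illustrative file name) that flags rows whose field count differs from the header's:

gawk --csv 'NR == 1 { n = NF }
            NF != n { printf "line %d has %d fields, expected %d\n", NR, NF, n }' data.csv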
gron gives you both the complete key and value in a single line (which you then grep). Directly grep-ing a specific column in a CSV file isn't very straightforward.
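i.e. something like this (the URL and key path are made up for illustration):

# gron flattens JSON to one assignment per line, so grep sees the full key path and the value together.
curl -s https://example.com/rates.json | gron | grep 'rates.USD'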
You can search for companies, select the documents you’d like to see (like shareholder lists), then you go through a checkout process and pay 0 EUR (used to be like a few euros years ago), and then you can finally download your file. Still a super tedious process, but at least for free nowadays.
Quite literally yes. There are often multiple click throughs. Every Department, agency, sub-agency, all the way down to federal dog catcher has decided the most important thing is to invent a new way of getting data from them.
There are other ways to access the data on here, but they’re fragmented. It’s nicely organized here so it’s a bummer they make it hard to programmatically retrieve files once you find what you’re looking for.
Agreed! People are surprised, but CSV files (while I would not use them for everything) work great in low-tech environments, for instance, or when files must be produced by many different actors with different technical levels.
They also work nicely as an interop between different stacks (Cobol <-> Ruby).
One concrete example is the French standard for describing Electric Vehicle Charge Points, which is made of 2 parts (static = position & overall description, dynamic = current state, occupancy etc). Both "files" are just CSV files:
Files are produced directly by electrical charge points operators, which can have widely different sizes & technicality, so CSV works nicely in that case.
Lol. Tell me you never had to parse CSV files without telling me.
CSV files can be a nightmare to work with depending where they come from and various liberties that were taken when generating the file or reading the file.
Use a goddamn battle-tested library, people, and don't reinvent the wheel. /oldman rant over
Yes, you eventually realize the hard way that "CSV" is actually a blanket of various similar formats, with different rules and conventions. The way one program outputs CSV's may be completely different from another.
Actually, I worked on OpenAddresses, a project to parse thousands of CSV files containing address data from many different county, state, and national systems around the world. It really wasn't that hard, even Python's basic csv parser was sufficient to the task (and there are plenty of better options).
Looking at the parser I see a few problems with it just by skimming the code. I'm not saying it wouldn't work or that it's not good enough for certain purposes.
Oh yeah? Such as? What purposes do you think it wouldn't be good for? The author will probably be interested in your feedback. Apparently it's getting over a million downloads a week on npm.
I have not used it, so this is mostly speculation, but I would be curious about character set handling, mixed line ending handling, large file handling, and invalid or almost-valid file handling.
The author of the library probably has learned, the hard way many many lessons (and probably also decided to prioritize some of the requested issues / feature requests along the way).
The above is not meant as a ding on the project itself and I am sure it is used successfully by many people. The point here is that your claim that you can easily write a csv parser in 200 lines of code does not hold water. It's anything but easy and you should use a battle tested library and not reinvent the wheel.
If you had read my original comment, you would see I didn't claim it's easy to do, only that it can be done in around 200 lines. That's clearly the case.
Character set handling isn't really an issue for JavaScript as strings are always utf-16.
When a file is read into a string the runtime handles the needed conversion.
As for handling large files, I've used this with 50mb CSVs, which would need a 32bit integer to index. Is that large enough? It's not like windows notepad which can only read 64kb files.
Windows notepad can read multiple megabyte files. It can read files that are hundreds of megabytes. It's not pleasant, loading is incredibly slow, and resizing the window when reflow is enabled makes it take that much longer, but it's definitely possible.
My point was that it's not trivial and it's hard to get it right. The way I read your comment was that it's not hard and can easily be done in 200 lines. It's possible I misread it.
I think the original point I was making still stands.
> Every time I have to fill a "shopping cart" for a US government data download I die a little.
I've worked in gov/civic tech for 5+ years and, as you're probably aware, there is now a highly lucrative business in centralizing and consequently selling easy access to this data to lobbyists/fortune 500s/nonprofits.
USDS/18f are sort of addressing this, but haven't really made a dent in the market afaik since they're focusing on modernizing specific agencies rather than centralizing everything.
I don't know if this is the cart you're talking about but the USGS does this for LiDAR data and yeah I'm with you I die a little every time I use that site. I love that the data is available but why in the world do we have to cart everything? Just give me a one click download
The main thing for me is that the data isn't broken down in a way that ends up with you having to make a thousand requests and collate them. If the client is expected to be able to pull the whole set of data, provide the whole set. There are other ways to rate limit.
clinicaltrials.gov lets you save a search term then download the results as a zip. But there's an XML file for each trial in the search results.
One of the first things I played around with was using Python to get that file, unzip it, then iterate through the xml files grabbing the info I wanted and putting it into ElasticSearch to make it searchable then putting an angular front end on it.
I used to have it published somewhere but I think I let it all die. :(
I once had the unfortunate experience of building an API for a government org where the data changed once a year, or when amendments were made, which happens very infrequently.
The whole data set could have been zipped into a <1MB file, but instead a “solution architect” got their hands on the requirements. We ended up with a slow API because they wouldn’t let us cache results in case the data had changed just as it was requested. And an overly complex webhook system for notifying subscribers of changes to the data.
A zip file probably was too simple, but not far off what was actually required.
I think for <1MB of data, with changes once (or twice) a year, the correct API is a static webserver with working ETag/If-Modified-Since support.
If you want to get really fancy, offer an additional webhook which triggers when the file changes - so clients know when to redownload it and don't have to poll once a day.
...or make a script that sends a predefined e-mail to a mailing list when there is a change.
A zip file on a web server that supports etags, that's polled every time access is required. When nothing has changed since last time, you get an empty HTTP 304 response and if it has changed then you simply download the <1MB Zip file again with the updated etag. What am I missing?
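A sketch of that polling loop done by hand (URL and file names are illustrative):

# Compare the server's current ETag with the one saved last time; only download when it changed.
new_etag=$(curl -sI https://example.com/data.zip | tr -d '\r' | awk 'tolower($1) == "etag:" { print $2 }')
if [ "$new_etag" != "$(cat etag.txt 2>/dev/null)" ]; then
  curl -s -o data.zip https://example.com/data.zip
  printf '%s\n' "$new_etag" > etag.txt
fi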
My concern was "what if file is updated while it's mid-download" but Linux would probably keep the old version of the file until the download finishes (== until file is still open by webserver process). Probably. It's better to test
If data changes only once a year or rarely that would imply usage of the api is a rare event for a user of the data so speed isn't a huge concern. Caching would introduce more complexities and the risk of needing to manually revalidate the cache. The solution architect was probably right.
That requires preprocessing on the client, and there are some people who have... weird assumptions about how dates should be written.
The version file can be queried in at least two ways:
the ETag/If-Modified-Since way (metadata only)
content itself
The best part with the last one - you don't need semver shenanigans. Just compare it with the latest downloaded copy; if version != downloaded => do_the_thing
No, it's a terrible API if retrieving the result '2000-10-26' requires downloading a 565 KB file.
I don't want to seem overly negative -- zipped CSV files are fantastic when you want to import lots of data that you then re-serve to users. I would vastly prefer it over e.g. protobufs that are currently used for mass transit system's live train times, but have terrible support in many languages.
But it's incredibly wasteful to treat it like an API to retrieve a single value. I hope nobody would ever write something like that into an app...
(So the article is cool, it's just the headline that's too much of a "hot take" for me.)
It's historical data. There is zero reason to request it more than once a day, and the users of such data will be interested in wildly different filters or aggregates over that data.
If this were being used to get the current rates, then yes, it would be a terrible design choice. But there are other services for that. This one fits its typical use case.
> I hope nobody would ever write something like that into an app...
I have got some bad news for you...
Not directly API related, but I remember supporting some land management application, and a new version of their software came out. Before that point it was working fine on our slow satellite offices that may have been on something like ISDN at the time. New version didn't work at all. The company said to run it on an RDP server.
I thought their answer was bullshit and investigated what they were doing. One particular call, for no particular reason, was doing a 'SELECT * FROM sometable'. There were many other calls that were using proper SQL select clauses in the same execution.
I brought this up with the vendor and at first they were confused as hell as to how we could even figure this out. Eventually they pushed out a new version with a fixed call that was usable over slow-speed lines, but for hell's sake, how could they not figure that out in their own testing, instead of pushing customers to expensive solutions?
Also, if there are any ORMs involved it could be that it’s not immediately obvious from their code itself that this would happen.
I’ve seen code that was using an ORM, where they needed to find some data that matched certain criteria. With plain SQL it would have resulted in just a few rows of data. But instead, with their use of the ORM, they ended up selecting all rows in the table from the database and then looping over the resulting rows in code.
The result of that was that the code was really slow to run, for something that would’ve been super fast if it wasn’t for the way they used the ORM in that code.
Seen this so many times by novice developers, whose work style is often "Trial and error, and as soon as it works, stop and change nothing." And it's so easy to slip in there because it works perfectly locally and often even in non-production envs, unless you've seeded them with production-like amounts of data, and with the ORM it looks fine. Like in Rails:
Product.where(category_id: xx).where("name LIKE ?", "%computer%")
PS: I know there's an Arel way of doing the above without putting any SQL into the app -- use your imagination and pretend that's what I did, because I still don't have that memorized :)
I just recently saw an example of that: a REST call returning a few KB of data that would fetch a few million rows from the database and use 10+ GB of memory (unfortunately some people think that you should always use join fetches with JPA...).
True, though ironic that also OP is implicitly making the argument for GraphQL-like interfaces ("just specify the data that you want returned" rather than "get all the data, then process it"), which are themselves the New Hotness in some areas.
I think a lot of people in this thread are glossing over a difference in definitions. Some people see "API" as "how to get data, even if that means all data with no filtering." Personally, I regard downloading an entire table as downloading a data model without any logic operating on the model, and an API as logic that returns only part of the model filtered in some way that interests me.
I've been building loads of financial software, both back and front ends. In the frontend world it is, sadly, quite common to send that amount of "data" over the wire before even getting to the actual data.
And in backends it's really a design decision. There's nothing faster than a cron job parsing exchange rates nightly and writing them to a purpose-designed todays-rates.json served as a static file to your mobile, web or microservice apps.
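A hypothetical sketch of such a cron job, reusing the daily eurofxref.zip mentioned elsewhere in this thread (the output path and the CSV-to-JSON step are assumptions):

#!/bin/sh
# Nightly: fetch today's ECB reference rates and publish them as a static JSON file for clients to read.
set -eu
curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref.zip \
  | funzip \
  | python3 -c 'import csv, json, sys
rows = list(csv.reader(sys.stdin))
print(json.dumps(dict(zip((h.strip() for h in rows[0]), (v.strip() for v in rows[1])))))' \
  > /var/www/todays-rates.json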
Nothing implies your mobile app has to consume this zip-csv_over_http
They should ship Parquet: it supports predicate pushdown and is dramatically more compact. While you can't get row-level data, it's great for analytical queries.
When you got that modem were you concerned with how long it would have taken to type/write the information sent over it or were you just happy it didn't matter anymore?
I often wonder what I'll think of technology in say another 20 years but I can never tell if it's all just some general shift in perspective or, as you look farther back, if certain people were always just about a certain perspective (e.g. doing the most given the constraints) and technology changes enough that different perspectives (e.g. getting the most done easily for the same unit of human time) become the most common users for the newer stuff and that these people will also have a different perspective than the ones another 20 years down the line from them and so on.
For example, maybe you think it's crazy to ask for the exact piece of data you need, I think it's crazy to do all the work to not just grab the whole half MB and just extract what you need quickly on the client side as often as you want, and someone equidistant from me will think it's nuts to not just feed all the data into their AI tool and ask it "what is the data for ${thing}" instead of caring about how the data gets delivered at all. Or maybe that's just something I hope for because I don't want to end up a guy who says everything is the same just done slower on faster computers since that seems... depressing in comparison :).
> (So the article is cool, it's just the headline that's too much of a "hot take" for me.)
It's one of those headlines that we'd never allow if it were an editorialization by the submitter (even if the submitter were the author, in this case), but since it's the article's own subtitle, it's ok. A bit baity but more in a whimsical than a hardened-shameless way.
(I'm sure you probably noticed this but I thought the gloss might be interesting)
A very simple optimization for those complaining about having to fetch a large file every time you need a little datapoint: if they promised the file was append-only, and used HTTP gzip/brotli/whatever compression (as opposed to shipping a zip file), you could use range requests to only get the new data after your last refresh. Throw in an extra checksum header for peace of mind, and you have a pretty efficient, yet extremely simple incremental API.
(Yes, this assumes you keep the state, and you have to pay the price of the first download + state-keeping. Yes, it's also inefficient if you just need to get the EUR/JPY rate from 2007-08-22 a single time.)
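A sketch of the incremental fetch itself (shown against a plain append-only CSV for simplicity; the URL is illustrative and the server must honour Range requests):

# -C - makes curl use the local file's size as the range offset, so only the newly appended bytes are transferred.
curl -s -C - -o rates.csv https://example.com/rates.csv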
Also, on the topic of range requests: when a server allows range requests for zip files, and the zip files are huge and one needs just a few files from them, one can actually download just the "central directory" and the compressed data of the needed files, without downloading the whole zip file:
Or, just serve a bunch of diff files. Just having a single daily patch can drastically reduce the bandwidth required to keep the file up to date on your side.
That's if downloading a few hundred kB more per day matters to you. It probably doesn't.
> curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | gunzip | sqlite3 ':memory:' '.import /dev/stdin stdin' "select Date from stdin order by USD asc limit 1;"
Error: in prepare, no such column: Date (1)
There is a typo in the example (that is not in the screenshot): you need to add a -csv argument to sqlite.
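With that fix applied, the one-liner becomes something like:

curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
  | gunzip \
  | sqlite3 -csv ':memory:' '.import /dev/stdin stdin' \
      'select Date from stdin order by USD asc limit 1;'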
I do also use Python (pandas) partway down the page :) I'm just trying to show different tools to give people ideas about how much is possible with stuff that is already in /usr/bin/
If you were to ask my own tool preferences which I use for $DAYJOB: pandas for data cleaning and small data because I think that dataframes are genuinely a good API. I use SQL for larger datasets and I am not keen on matplotlib but still use it for graphs I have to show to other people.
> Why was the Euro so weak back in 2000? It was launched, without coins or notes, in January 1999. The Euro was, initially, a sort of in-game currency for the European Union. It existed only inside banks - so there were no notes or coins for it. That all came later. So did belief - early on it didn't look like the little Euro was going to make it: so the rate against the Dollar was 0.8252. That means that in October 2000, a Dollar would buy you 1.21 Euros (to reverse exchange rates, do 1/rate). Nowadays the Euro is much stronger: a Dollar would buy you less than 1 Euro.
Even if the Euro initially only existed electronically, it still had fixed exchange rates with the old currencies of the EU zone members. Most importantly with the established and well-trusted Deutsche Mark of Germany.
So any explanation of 'why was the Euro initially weak' also has to explain why the DEM was weak at the time. The explanation given in that paragraph doesn't sound like it's passing that test.
For little problems where you can just download the entire database every time and do read-only work, never underestimate the value of keeping it simple! I really like SQLite for this because it has the portability of a .Json or .csv file but it’s more ready to interact with as a database.
> Some things we didn't have to do in this case: negotiate access (for example by paying money or talking to a salesman); deposit our email address/company name/job title into someone's database of qualified leads, observe any quota; authenticate (often a substantial side-quest of its own), read any API docs at all or deal with any issues more serious than basic formatting and shape.
There probably is a quota of some sort, but since there's no reason to download the file very often, it's probably not a quota you need to worry about.
If you don't mind saving the file to disk, you can do:
sqlite3 -newline '' ':memory:' "SELECT data FROM zipfile('eurofxref-hist.zip')" \
| sqlite3 -csv ':memory:' '.import /dev/stdin stdin' \
"select ...;"
Doing this without a temporary file is tricky; for example `readfile('/dev/stdin')` doesn't work because SQLite tries to use seek(). Here is a super ugly method (using xxd to convert the zipfile to hex, to include it as a string literal in the SQL query):
This is slightly abusive of gunzip, and it does work, but only because the zipfile in question contains a single file and you are piping it in via shell. If you had tried to download the file and then run gunzip on it, the program would have refused.
As for the second error, I think you might be trying to import an empty file or maybe an error message? Probably related to the first problem.
> Files created by zip can be uncompressed by gzip only if they have a single member compressed with the “deflation” method. This feature is only intended to help conversion of tar.zip files to the tar.gz format.
There are several reasons a compressed archive isn't a good data format outside trusted backups.
1. inconsistent binary formats, float precision edge conditions, and unknown Endianness. Thus, assuming the Marshalling of some document is reliable is risky/fragile, so pick some standard your partners also support... try XML/XSLT, BSON, JSON, SOAP, AMQP+why, or even EDIFACT.
2. "dump and load" is usually inefficient, with an exception when the entire dataset is going to change every time (NOAA weather maps etc.)
3. Anyone wise to the 42TiB bzip 18kiB file joke is acutely aware of what compressed files can do to server scripts.
4. Tuning firewall traffic-shaping for a web-server is different from a server designed to handle large files. Too tolerant rules causes persistent DDoS exposure issues, and too strict causes connections to become unreliable when busy/slow.
5. Anyone that has to deal with CSV files knows how many proprietary interpretations of a simple document format emerge.
Yeah having a local snapshot of an entire dataset is obviously nice when it's feasible. But there's no need to conflate that with "application programming interface" just to market csvbase.
There's an issue with openly & freely available financial data. It tends to be limited (i.e. you need many sources to answer the questions you want and not just the questions a particular source is able to answer). And if you're bringing in different sources of data, each source requires custom fetching and parsing, along with being just annoyingly slightly different from other sources. Even if the value is given on a per-day basis (which is a huge assumption, lots of things like interest rates are published much slower), you have things like "is the value at the start of day, end of day, market close, median? What time zone? Which exchange? What about days when the exchange was closed? What about odd exchange hours or halts?", "what's the date and time format, what's the naming convention?"
This is why Bloomberg and friends make money. They're a cartel, but they at least do both: making all this data work together, and transcribing stuff that is written in English into structured data, etc. (hence their interest in AI)
As someone with a lot of data experience, and in particular including financial data (trading team in a hedge fund) I definitely prefer the column format where each currency is its own column.
That way it's very easy to filter for only the currencies I care about, and common data processing software (e.g. Pandas for Python) natively support columns, so you can get e.g. USDGBP rate simply by dividing two columns.
The biggest drawback of `eurofxref-hist` IMO is that it uses EUR as the numeraire (i.e. EUR is always 1), whereas most of the finance world uses USD. (Yeah I know it's in the name, it's just that I'd rather use a file with USD-denominated FX rates, if one was available.)
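For instance (shown here with DuckDB rather than pandas, to keep it a shell one-liner): both columns are "units per EUR", so dividing one by the other cancels the euro out:

duckdb -c "SELECT Date, USD / GBP AS USDGBP
           FROM read_csv_auto('eurofxref-hist.csv')
           ORDER BY Date DESC LIMIT 5;"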
Not sure what you mean by "most of the finance world", but in forex trading the EUR/USD rate is denominated exactly as in this file - 1.066 for today and not the inverse 0.94.
right, I mean quant trading - i.e. where people actually use data :D you want things relatively standardised, which means using a common unit of account - for most funds AFAIK that's USD
This is my experience as well. Everything is expressed as 'number to multiply by to get USD' so USDJPY would always be used in-house inverted. British firm as well. Nobody likes the GBP.
> Switching from wide to long is a simple operation, commonly called a "melt". Unfortunately, it's not available in SQL.
That's not completely true. In most SQL databases you can query information about the table, and this is true for SQLite too. You cannot use this information to have a query with a "dynamic" number of columns, but you can always generate the query from the metadata and then execute the generated queries. It is not exactly fun to do as a one-liner, but it works:
> curl 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv' | sqlite3 -csv ':memory:' '.import /dev/stdin stdin' '.mode list' '.output query.sql' "select 'select Date, ''' || name || ''', ' || name || ' from stdin ;' from pragma_table_info('stdin') where cid > 0 and cid < 42;" '.output stdout' '.mode csv' '.read query.sql'
(yes, you're also left with some temporary file on the computer).
> Although pulling out a simple max is easy, the data shape is not ideal. It's in "wide" format - a Date column, and then an extra column for every currency. Here's the csv header for that file:
> Date,USD,JPY,BGN,CYP,CZK,DKK,EEK,GBP,HUF,LTL,LVL,MTL,[and on, and on]
> When doing filters and aggregations, life is easier if the data is in "long" format, like this:
> Date,Currency,Rate
> Switching from wide to long is a simple operation, commonly called a "melt". Unfortunately, it's not available in SQL.
You sure this can't be done in SQL? I myself don't know enough SQL, but I believe that you can decompose a single row into multiple rows in SQL by joining multiple queries using UNION ALL:
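Indeed, a hand-rolled melt looks something like this (only two currency columns shown; in practice you would generate one SELECT per column, as the sibling comment's pragma_table_info approach does):

sqlite3 -csv ':memory:' '.import eurofxref-hist.csv rates' "
  SELECT Date, 'USD' AS Currency, USD AS Rate FROM rates
  UNION ALL
  SELECT Date, 'JPY', JPY FROM rates
  ORDER BY Date;"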
Csv is surprisingly common and I prefer it despite parsing, quoting, encoding and escaping woes.
Amazon S3 lets you query CSV files already loaded in buckets, which is interesting but I haven't used it yet.
One company I worked at a long time ago used free Dropbox accounts as an FTP-like drop that they would consume. It was hilarious, and it worked well and was easy to stay under the free limits.
The site apparently serves .csv directly as well (replace .zip with .csv). I used this to convert the expression to powershell (using only built-in powershell cmdlets and commands):
Doing it with a zip file would be a little more verbose since there is no built in "gunzip" type command which operates on streams, but you can write one which does basically that out of built in .Net functions:
# unary "," operator required to have powershell
# pipe the byte[] as a single argument rather
# than piping each byte individually to
# ConvertFrom-Zip
$> ,(invoke-webrequest "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip").Content
| ConvertFrom-Zip
| ConvertFrom-Csv
| sort "usd"
| select "date" -first 1
Date
----
2000-10-26
/tmp> # be kind to the server and only download the file if it's updated
/tmp> curl -s -o /tmp/euro.zip -z /tmp/euro.zip https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip
/tmp> unzip -p /tmp/euro.zip | from csv | select Date USD | sort-by USD | first
╭──────┬────────────╮
│ Date │ 2000-10-26 │
│ USD │ 0.83 │
╰──────┴────────────╯
(I removed the pipe to gunzip because 1. gunzip doesn't work like that on mac and 2. it's not something you should expect to work anyway, zip files often won't work like that, they're a container and their unzip can't normally be streamed)
tl;dr: the server doesn't support that header, but since the response does include a Last-Modified header, curl helpfully aborts the transfer if the Last-Modified date is the same as the mtime of the previously downloaded file.
I'll throw out that the GTFS transit standard involves publishing zipped CSVs representing your DB tables regularly. There are nice things about it and downsides IMO. This is how Google Maps' transit directions function -- they rely on agencies to publish their schedules regularly in zipped CSVs I think.
One downside is that dev experience is pretty bad in my opinion. It took me years of fiddling in my free time to realize that if you happen to try to use the CSVs near a schedule change, you don't know what data you're missing, or that you're missing data, until you go to use it. My local agency doesn't publish historical CSVs so if you just missed that old CSV and need it, you're relying on the community to have a copy somewhere. Similarly, if a schedule change just happened but you failed to download the CSV, you don't know until you go to match up IDs in the APIs.
I used to help run the Best Buy product catalog API. We did nightly CSVs instead of making people poll/paginate the entire catalog. It was super widely used.
What's the use case here? For a quick look up you're not going to write that monster command. And for a more ongoing repeated use, you're not going to download and uncompress the file every time you want to query it...
I get the satisfaction as a curiosity, but other than that, to me that wouldn't be enough to make it my "favorite" or even "good".
I worked at the Chicago Mercantile Exchange 15 or so years ago and we had dozens of similar files that most major financial institutions downloaded daily. If they weren’t published on schedule we would start getting calls from executives at Bloomberg, Goldman, etc. I wrote a small system to manage and monitor their publication.
I believe that curl does not have client-side caching, so every time you run the command it downloads the CSV. While downloading the CSV once and then analyzing it would no longer make it a magic one-liner you can send to people, it would save bandwidth and reduce traffic on the API.
Unless there is caching going on here? Perhaps a CDN cache on the server side?
"the server doesn't support that header, but since the response does include a Last-Modified header, curl helpfully aborts the transfer if the Last-Modified date is the same as the mtime of the previously downloaded file."
You should add files to the website in sync whenever they get published. Most statistics institutions publish their data in CSV format on a webpage. You could bookmark them once and create a database of all the data so the app is used more.
If your data is somewhat sparse, somewhat hierarchical, or has an evolving/amorphous schema, gzipped JSONL is better. JSONL is supported by ClickHouse and jq and is a bit more consistent around escaping/delimiting.
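e.g. (file and field names are made up for illustration):

# Stream a gzipped JSONL file and pull one field out of each matching record.
zcat rates.jsonl.gz | jq -r 'select(.currency == "USD") | .rate'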
Just thinking out loud: If we make the zip file a parquet file, wouldn't we be able to do partial downloads (like getting the first 1000 lines) instead of getting the whole file every single time?
Great read. I didn't know it's so easy to plot something in the terminal. Also really shows the power of 2D data structures. A good long format and you can do anything.
#!/bin/sh
set -eu
u=https://www.ecb.europa.eu/stats/eurofxref/eurofxref.zip
ftp -Vo - $u | zcat | sed 's/, $//' | awk -F ', ' '
NR == 1 {
for(i = 1; i <= NF; ++i) {
col[$i] = i
c = ($i != "GBP") ? $i : "EUR"
printf "%s%s", c, (i < NF) ? "\t" : "\n"
}
}
NR >= 2 {
printf "%s\t", $1
for(i = 2; i <= NF; ++i) {
c = (i != col["GBP"]) ? $i/$(col["GBP"]) : $i
printf "%s%s", c, (i < NF) ? "\t" : "\n"
}
}
'
Fetches the latest rates and converts the base currency to GBP and format to TSV. Then I run a cron(8) job 15~30 minutes after the release each work day and dump the output so that I can retrieve it from httpd(8) over HTTPS to be injected into my window manager status bar. A pretty elegant solution that spares the ECB from my desktops hammering their server. All with tools from the OpenBSD base install.