I worked for an education technology company that made curriculum for K-8. There are long sales cycles in this space, and different departments of education have different rules. Think "vote every 4 years because our books are out of date or just old". The technology wave came fast, and most of the curriculum from incumbent providers was formatted to fit in a book, with maybe the most cutting-edge shops having a large InDesign file as the output.
The edtech company I worked for was "web first", meaning students consumed the content from a laptop or tablet instead of reading a book. It made sense because the science curriculum, for example, came with 40+ simulations that helped explain the material. A large metropolitan city was voting on new curriculum and we were in the running to be selected, but their one gripe was that they needed N many books in a classroom. Say for a class of 30, they wanted 5 books as backup, just in case, and for the teachers who always like a hard copy and don't want to read from a device.
The application was all Angular 1.x, reading content from a CMS, so we could update it in real time whenever edits needed to be made. So we set off to find a solution to make some books. The design team started from scratch, going page by page to see how long it would take to make a whole book in InDesign, but the concept of multiple people editing at once doesn't really exist in that software. Meanwhile, my team was brainstorming a code pipeline solution to auto-generate the book directly from the code that was already written for the web app.
We made a route in the Angular app for the entire "book" that was a stupid simple for loop: fetch each chapter and each lesson in that chapter, and render it all out on a stupidly long page. That part was more or less straightforward, but then came the hard part of trying to style that content for print. We came across Prince XML, which, fun fact, was created by one of the inventors of CSS. We snagged a license and added some print-target custom CSS that did things like "insert a blank page for padding because we want each new chapter to start on the left side of the open book". But then came the devops portion that really messed with my head.
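In spirit, the "book" route boiled down to something like this. A minimal sketch only: the real thing was an Angular 1.x controller, and the CMS endpoints, response shapes, and sentinel string here are all invented for illustration.

```typescript
// Hypothetical CMS API: /chapters lists chapters, /chapters/:id/lessons
// returns each lesson's pre-rendered HTML. All names are made up.
interface Chapter { id: string; title: string }
interface Lesson { id: string; html: string }

async function renderBook(cmsBase: string): Promise<string> {
  const chapters: Chapter[] =
    await (await fetch(`${cmsBase}/chapters`)).json();
  const parts: string[] = [];
  for (const chapter of chapters) {
    parts.push(`<h1 class="chapter">${chapter.title}</h1>`);
    const lessons: Lesson[] =
      await (await fetch(`${cmsBase}/chapters/${chapter.id}/lessons`)).json();
    for (const lesson of lessons) {
      parts.push(lesson.html); // append every lesson onto one stupidly long page
    }
  }
  console.log('BOOK_RENDER_DONE'); // sentinel the pipeline waits for (next sketch)
  return parts.join('\n');
}
```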
We needed a headless browser to render all of this out, and then we needed the source, with all the images, etc., downloaded into a folder and passed to Prince XML for rendering. Luckily we had an ECS pipeline, so I tried to get it working in a container. I came up with a hack: when the rendering for loop finished the chapters/lessons, it printed something to the console, and that was the "hook" for saving the page content to the folder. But then came the mother of all "scratching my head" moments when ChromeDriver started randomly failing for no reason. It worked when we did a lesson. It worked when we did a chapter. But it started throwing up a nondescript error when I did the whole book. Selenium uses ChromeDriver, and ChromeDriver comes straight from Google and the Chromium repo. This meant diving into that C++ code to trace it down until I finally found the stack trace. Well, yeehaw, I found an overflow error in the transport protocol that Chrome DevTools uses to talk to the "tab/window" it's reading from. I didn't have time to get to the bottom of the true bug, so I just cranked the buffer up to like 2 GB and recompiled Chromium with the help of my favorite coworker, and BOOM, it worked.
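The console "hook" worked roughly like this. A sketch using the selenium-webdriver JS bindings rather than our actual Java worker; the URL, sentinel string, and timeout are invented.

```typescript
import { Builder, logging, WebDriver } from 'selenium-webdriver';
import * as chrome from 'selenium-webdriver/chrome';

// Poll the browser console until the page's render loop prints its sentinel.
async function waitForSentinel(driver: WebDriver, sentinel: string, timeoutMs: number) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // Each get() drains log entries accumulated since the last call.
    const entries = await driver.manage().logs().get(logging.Type.BROWSER);
    if (entries.some((e) => e.message.includes(sentinel))) return;
    await new Promise((r) => setTimeout(r, 1000));
  }
  throw new Error(`Timed out waiting for console sentinel "${sentinel}"`);
}

async function main() {
  const prefs = new logging.Preferences();
  prefs.setLevel(logging.Type.BROWSER, logging.Level.ALL);
  const driver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(new chrome.Options().addArguments('--headless'))
    .setLoggingPrefs(prefs)
    .build();
  try {
    await driver.get('https://app.example.com/book'); // hypothetical book route
    await waitForSentinel(driver, 'BOOK_RENDER_DONE', 30 * 60 * 1000);
    const html = await driver.getPageSource();
    // ...write html plus downloaded assets to the folder for Prince XML...
  } finally {
    await driver.quit();
  }
}
```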
But scaling this thing up was now a nightmare. We had a Java Dropwizard application reading an SQS queue that kicked off the Selenium headless browser (with the patched ChromeDriver code), which downloaded the page, but now the server needed a whopping 2 GB per book. That made the Dropwizard application a nightmare to memory manage, and I had to do some suuuuper basic multiplication for the memory so that I could parallelize the pipeline.
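The "suuuuper basic multiplication" was essentially this. The 2 GB per book is the real number from the story; the container size and headroom here are illustrative guesses.

```typescript
// How many books can render in parallel on one box, given the cranked-up
// ~2 GB DevTools buffer per book plus JVM/browser headroom.
const containerMemMb = 16 * 1024; // e.g. one 16 GB ECS task (illustrative)
const perBookMb = 2 * 1024;       // the patched DevTools buffer, per book
const headroomMb = 2 * 1024;      // Dropwizard heap + Chrome itself (a guess)

const maxParallelBooks = Math.floor((containerMemMb - headroomMb) / perBookMb);
console.log(maxParallelBooks); // => 7 books at a time on this box
```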
I was the sole engineer for this entire rendering application, and the rest of the team assisted on the CSS, styling, and content edits for each and every "book". At the end of the day, I calculated that I saved roughly 82,000 hours of work: the current pace at which they could make a single chapter in a book, multiplied by all the chapters and lessons, multiplied by all the different states, because Florida is fucked and didn't want to include certain lines about evolution, etc., so a single book for a single grade becomes N many state-specific "editions".
82,000 hours of work is 3,416.6667 days of monotonous, grueling, manual, repetitive design labor. Shit was nasty but it was so fucking awesome.
Shoutout to John Chen <zhanliang@google.com> for upstreaming the proper fix.
Cool startup! I did some work for a client in the trucking space a while back, and I came away with a new appreciation for what a fascinating vertical it is.
If you don't mind me asking, how'd you end up in the space? It feels like the vertical has so much esoteric/specific knowledge. We ran into a few companies that were founded by truckers (or by people who had truckers in the family), and it seemed like already having all this knowledge gave them a big leg up.
If you're using 16 PCIe 4.0 lanes you max out at ~32 GB/s, although commercial drives tend to have much lower throughput than that maximum (~7.5 GB/s for a good NVMe drive). Cat6a Ethernet tops out at 10 gigabits per second, which is only 1.25 GB/s, and plenty of earlier versions have lower caps, e.g. 1 gigabit. My guess is you'll most likely be limited by either disk or network hardware before needing CPU parallelism, if all you're doing is copying bytes from one to the other.
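For concreteness, the back-of-the-envelope conversions (same numbers as above, just with the units lined up):

```typescript
// Where does the bottleneck land when copying disk -> NIC?
const pcie4LaneGBps = 2;              // ~2 GB/s per PCIe 4.0 lane
const pcieGBps = 16 * pcie4LaneGBps;  // 16 lanes -> ~32 GB/s ceiling
const nvmeGBps = 7.5;                 // a good NVMe drive in practice
const nicGBps = 10 / 8;               // 10 Gbit/s Cat6a -> 1.25 GB/s

console.log(pcieGBps, nvmeGBps, nicGBps); // 32, 7.5, 1.25
// The NIC is ~6x slower than the disk, so the network saturates first.
```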
Oh, sorry — by "copying bytes from one to the other," I meant copying bytes from the disk to the network interface controller on the same physical computer. It's true that beyond that it'll depend on the network topology connecting you to where you want the data to be, and how fast the machines in between and on the other end are!
I don't know enough about custom fiber to say whether that will help stretch past being network-bottlenecked (most NICs max out at 10 gigabits/second, though faster ones exist). Eventually you might be able to make yourself disk-limited... Either way, backing up one file is probably easier than backing up a zillion files scattered around the filesystem.
Object stores typically do optimizations where they store small files using a different strategy than large ones, maybe directly in the metadata database.
Facebook published a paper on their photo storage system, Haystack, which IIRC uses something like slab allocation.
S3 is similar, in the sense that its usage is completely different from a file system's (no hierarchy, direct key access, no need for efficient listing, etc.), so I'm pretty sure they use something similar.
Even if it is bypassing the file system, S3 is itself essentially a file system. It has all the usual features of paths, permissions, and so on. I assume it can't completely escape the same issues.
S3 is a key-value store where object keys might contain slashes, but the implied directories don't really exist. This is a problem for Spark and Hadoop jobs that expect to rename a large temp dir to signal that a stage's output has been committed: HDFS can do that rename atomically, but S3 has no rename at all, so you end up copying and deleting objects one by one. IAM security policies also apply to keys or prefixes (renaming an object might change someone's access level), and policy changes are cached for tens of minutes.
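To make the rename problem concrete, here's roughly what "renaming" a prefix costs on S3, sketched with the AWS SDK v3 for JavaScript. The bucket and prefix names are made up, and a production committer would also need retries and error handling.

```typescript
import {
  S3Client,
  ListObjectsV2Command,
  CopyObjectCommand,
  DeleteObjectCommand,
} from '@aws-sdk/client-s3';

// S3 has no rename: "moving" _temporary/ to committed/ is a copy + delete
// per object, and the job sits in a half-renamed state if it dies midway.
async function renamePrefix(s3: S3Client, bucket: string, from: string, to: string) {
  let token: string | undefined;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: bucket, Prefix: from, ContinuationToken: token,
    }));
    for (const obj of page.Contents ?? []) {
      const newKey = obj.Key!.replace(from, to);
      await s3.send(new CopyObjectCommand({
        // Keys with special characters need URL-encoding in CopySource.
        Bucket: bucket, CopySource: `${bucket}/${obj.Key}`, Key: newKey,
      }));
      await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: obj.Key! }));
    }
    token = page.NextContinuationToken;
  } while (token);
}
```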
Some people have been crazy enough to store tables of padded data in the keys of a lot of zero-length objects (which AWS does charge for) and use ListObjects for paginated prefix queries. It doesn't much matter whether keys have slashes or commas or whatever.
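For the curious, that hack looks something like this. A sketch only: the key layout, "table" name, and column widths are invented, and a real version would paginate the listing.

```typescript
import { S3Client, PutObjectCommand, ListObjectsV2Command } from '@aws-sdk/client-s3';

// "Insert" a row by encoding fixed-width columns into the key of an empty
// object; "query" by listing a prefix. Zero bytes stored, but each object
// still counts as a billable PUT/LIST.
async function insertRow(s3: S3Client, bucket: string, userId: number, score: number) {
  const key = `scores/${String(userId).padStart(10, '0')}/${String(score).padStart(6, '0')}`;
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: '' }));
}

async function scoresForUser(s3: S3Client, bucket: string, userId: number) {
  const prefix = `scores/${String(userId).padStart(10, '0')}/`;
  const page = await s3.send(new ListObjectsV2Command({ Bucket: bucket, Prefix: prefix }));
  return (page.Contents ?? []).map((o) => Number(o.Key!.slice(prefix.length)));
}
```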