This brings me back to when I worked on Excel. The formatting structs (and most cell-related ones) are so tightly packed, and the bytecode VM for the formulas is pretty neat.
I wish I could have met duanec, that guy wrote probably 10% of the code for Excel. Sadly he had retired 2 years before I started working on Excel.
Fun fact: when they were about to ship Excel 1.0, they didn’t know what to name it then so they had an internal poll. The name that garnered the most votes was “Mr Spreadsheet”. Needless to say, marketing vetoed it.
I also worked on Excel and in the calc code. XOpers, LBLs… it’s all coming back. We never figured out what the Hungarian prefixes kml and ptg stood for. There was a rumor kml might be the initials of duanec’s girlfriend
TM1 is, for all its virtues, slow and memory-hungry, with unfortunate implementation requirements that make optimizing it via rule files complex to impossible. My experience is with IBM TM1.
I work as a consultant and spend five out of eight hours a day in Excel, and I want the features in this app so bad.
Sadly, the only way I’ll be able to use these features is if Microsoft buys them out and integrates them into Excel. It’s ubiquitous at almost every workplace you go to, but the only innovations of the past 20 years seem to be Power Query (which is awesome) and XLOOKUP.
We're mainly working with high growth startups but people seem to be quite open to try out new tools. There will always be spreadsheets but once your company reaches a certain size they move their financial budgeting to Anaplan/Adaptive/TM1/Causal :)
I’ve been in this type of role and the problem is your clients send you XLSX and expect it in return, as well as other 3rd parties exchanging data as part of the engagement/transaction. Changing one company is hard but changing an industry is entirely different.
I seriously don't recommend this as a solution for production, but Access is quite capable of sucking data out of Excel and squeezing it back into a database. It is horrible, but I am still used to businesses doing electronic data interchange through a text file on FTP servers, so my standards are low.
Excel added lambdas recently which is nice for defining custom functions without having to expect anything special on the client side (well, other than a newer copy of Excel at the moment).
Conceptually, I agree that it's cool, but oh man are there going to be some indecipherable formulas showing up in spreadsheets once the power users start using it.
They should really add the concept of arrays. Right now you have to type "B2:B11" to refer to a range of cells, but why not just "B2", and let the ":B11" part be implicit from the structure of the sheet?
The obvious thing you are missing is that you'd have to define it in some other way, e.g. by selecting the cell and clicking "Define array" or something along these lines.
You either make a table and refer to a column, or define the array in the cell and then refer to it with a hash (i.e. B2# refers to the array of cells stored/referenced in B2).
* Set up an Excel sheet with "Name" and "Age" in A1 and B1, then put in some sample data.
* Highlight the data, go to Insert -> Table, and hit 'OK'
* Where it says Table Name: put in "Users"
If you want to refer to the array of names in that table, you can now write =Users[Name] in a formula.
As another example, if we wanted to find the average age, we could simply write =AVERAGE(Users[Age]), passing in the ages as an array.
If we add another two columns - MathsScore and EnglishScore - we can either automatically create a column adding them up ([@MathsScore]+[@EnglishScore]), or these ranges can then be used in spill formulas.
I hadn't seen this before, but when I went to try it I couldn't get it to work in Google Sheets - am I using the wrong syntax? Typing "=SUM(A1)" just gave me the sum of A1, not A1 through the rest of the data in the column, while "=SUM(A1:)" resulted in an error. "=SUM(A1:A)" gives me the whole column as expected, not stopping where the data breaks after A11.
This is the correct behaviour: =SUM(A1) should give you A1 in both Excel and Google Sheets. You are passing a single cell into SUM, so you're only asking it to add up that cell (not the cells below).
You must refer to a range of cells for it to work - in Excel this would be for instance =SUM(A:A).
In Google Sheets: generally you write a formula and hit Ctrl+Shift+Enter which will turn it into an array formula.
In Excel there are many more options in how to handle this, one of which is to define a table in Excel (Insert -> Table) and then refer to the table and column (=TableName[Column]).
Excel is ubiquitous in almost every workplace, so I can use the same workflows and templates, and don’t have to raise a request to install some new app, or get permission to load client data into a cloud service operated somewhere.
LibreOffice is so, so far behind the curve already, and I’d be laughed out of the building.
Most Apple-centric enterprise workplaces use Office. With the exception of some features tied to specific O365 licenses, it has feature parity with the Windows version.
LibreOffice does not do Power Query, and while it ‘opens Excel files well enough’, it probably has 70% feature parity at most.
They themselves claim that it has ‘tens of millions of users’…
I’m not saying it’s a bad - or new - product. It’s just yet another request to install something. Very, very few (exactly zero) workplaces I have visited in Oceania and South East Asia are using it.
This is definitely not true - there's a huge depth of complexity in these multi-dimensional databases. It's relatively easy to make a proof of concept - it's "just" an array of arrays, after all - but building something production-quality is a multi-year endeavour.
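To make the "array of arrays" point concrete, here's a toy Go sketch (hypothetical, not any real engine's code) of the proof of concept - note that everything hard (sparse dimensions, rule-defined cells, incremental recalc, concurrency) is exactly what it leaves out:

```go
package main

import "fmt"

// A toy OLAP-style cube: values indexed by [product][region][month].
// This is the "just an array of arrays" proof of concept.
type Cube [][][]float64

func NewCube(products, regions, months int) Cube {
	c := make(Cube, products)
	for p := range c {
		c[p] = make([][]float64, regions)
		for r := range c[p] {
			c[p][r] = make([]float64, months)
		}
	}
	return c
}

// Sum aggregates over all dimensions. Trivial here, but a production
// engine needs this to be fast for arbitrary slices of arbitrary cubes.
func (c Cube) Sum() float64 {
	var total float64
	for _, regions := range c {
		for _, months := range regions {
			for _, v := range months {
				total += v
			}
		}
	}
	return total
}

func main() {
	c := NewCube(1000, 10, 100)
	c[0][0][0] = 42
	fmt.Println(c.Sum()) // 42
}
```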
Interesting. A few q's as an Excel performance nerd who's fooled around in Pandas:
* Why the design choice to tie the calc engine directly to a DB vs computing dependencies and returning values post-facto?
* How does vectorization scale when using "volatiles" like offset-style formulas that may dynamically change the calc graph?
Every time I tried to recreate Excel models in Pandas/NumPy, 1:1 multiplication wasn't the issue; the weird dependencies that prevented using their vectorized capabilities were.
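To illustrate the distinction, here's a rough Go sketch (illustrative only): a column with static dependencies collapses to one tight, vectorizable loop, while an OFFSET-style volatile, whose read range is itself data-dependent, forces per-cell evaluation:

```go
package main

import "fmt"

// Static dependencies: every output cell depends on the same position
// in two input columns, so the whole column is one tight loop.
func mulColumns(a, b []float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = a[i] * b[i]
	}
	return out
}

// OFFSET-style "volatile": the range each cell reads depends on other
// data (widths[i]), so the dependency graph can change on every recalc
// and each cell has to be evaluated individually.
func offsetSums(data []float64, widths []int) []float64 {
	out := make([]float64, len(widths))
	for i, w := range widths {
		end := i + w
		if end > len(data) {
			end = len(data)
		}
		var s float64
		for _, v := range data[i:end] {
			s += v
		}
		out[i] = s
	}
	return out
}

func main() {
	a := []float64{1, 2, 3}
	fmt.Println(mulColumns(a, a))              // [1 4 9]
	fmt.Println(offsetSums(a, []int{2, 1, 3})) // [3 2 3]
}
```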
Curious why they opted for Go and not C++ for the core engine, given there are lots of batched mathematical calculations.
If you want to eventually leverage SIMD and GPGPU computation C++ is almost mandatory, languages like Rust/Nim/Zig are getting there but still behind in many areas.
Great question. The biggest reason is that Go was what the team was most familiar with when the engine moved from JavaScript. The reason today is that the engine is 40K LOC of Go, so a full-blown rewrite would be a big deal. As the post discusses in depth, we have a lot of rope left before we need to get to SIMD/GPGPU. Once we hit those computational bottlenecks in Go, we're very open to changing our mind!
Now that you have decent data structures, why don't you use a library for arithmetic calculations on vectors? I would use MKL for this, at least if you're running on Intel machines, the vdMul function for your particular example. Even on AMD machines the MKL implementation will probably be faster than hand-written code.
Yes, I didn't mean to imply we wouldn't go far to squeeze Go further! At some point the downside of the complexity of dropping down from Go to intrinsics, bypassing the GC, etc. might overwhelm the benefits of staying with Go. We've spent a while writing simulations like the ones in the post to understand the behaviour of Go's GC under lots of pressure, and it won't be a bottleneck for a long time. I also think that doing all the prerequisite work to make SIMD remotely possible will yield _far_ more performance than SIMD itself.
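For concreteness, this is roughly the shape of the hot kernel in question (a sketch, not our actual engine code):

```go
package main

import "fmt"

// An elementwise multiply over two float64 columns - what MKL's vdMul
// does. Go's gc compiler emits a scalar loop here (it does very little
// auto-vectorization), which is why the escape hatch is assembly,
// intrinsics, or cgo into a BLAS-style library, with all the GC and
// memory-pinning complexity that implies.
func vmul(dst, a, b []float64) {
	b = b[:len(a)]     // help the compiler eliminate bounds checks
	dst = dst[:len(a)] // by proving all three slices share a length
	for i, x := range a {
		dst[i] = x * b[i]
	}
}

func main() {
	a := []float64{1, 2, 3}
	b := []float64{4, 5, 6}
	dst := make([]float64, 3)
	vmul(dst, a, b)
	fmt.Println(dst) // [4 10 18]
}
```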
GC pressure is the cause of many of your competitors' problems. I'd be planning a move to an alternative sooner rather than later, otherwise you'll run into the exact same problems that they have - after a certain point in the company's life it becomes practically impossible to fix.
> If you want to eventually leverage SIMD and GPGPU computation C++ is almost mandatory, languages like Rust/Nim/Zig are getting there but still behind in many areas.
Is this mostly about GPGPU? I think you can just import and use SIMD intrinsics in Rust. Zig lets you do that and additionally features native vector types whose operations generate the appropriate SIMD instructions.
There's nothing like Highway yet, but for many use cases it seems like a lot more trouble to use Highway than to write several implementations that use intrinsics directly.
Yeah, mainly GPGPU. C++ still has an advantage in that there are quite a few convenient SIMD libraries available (like Agner Fog’s vectorclass, and Google’s Highway, which you mentioned), but you can still write raw intrinsics in other languages (and Zig actually has vector types baked in, so you can use those instead).
Though on second thought: for this particular company’s use case, since they only need this for the server backend, it would be best to stick to a certain hardware configuration and optimize the hell out of it. For example, they could just stick to Xeon servers and use raw AVX-512 intrinsics, which can be particularly fast for number crunching. And since GPUs are incredibly expensive to host (and also have latency costs associated with them), it might be best to just stick with CPU SIMD.
The JVM is quite good at emitting SIMD instructions if you write straightforward loops with primitives. The forthcoming Vector API will open up more possibilities.
I am less excited by auto vectorization than I once was after being repeatedly disappointed by LLVM's output, especially for string parsers. I hope it works well though.
Can you elaborate some of the use cases for spreadsheets having billions of rows? Databases of phone calls and credit card transactions reach this size, but I would never imagine someone wanting to dump any significant amount of transaction data into a spreadsheet without quite a bit of SQL/R/Py modification ahead of time.
I run a VA business with my brother so we deal with a lot of spreadsheets. We have had a few real estate firms as our customers. They put EVERYTHING in Google Sheets and Excel.
Everything from regulatory documents to CRM data to checklists. Having billions of rows might sound problematic, but when a Google Sheet gets capped or slow they will either create a new sheet in the same file or a new file. An RE firm over its 10 years of operations ends up with hundreds and hundreds of spreadsheets. They even have spreadsheets to locate spreadsheets.....
Having everything in one spot is a slightly better way to do it. You can't pry spreadsheets away from small businesses no matter what justification you have. From a practical standpoint, upping the limits is kind of a good thing.
We usually clean everything up manually and create a proper database that they can query easily. Or they can just tell us what they need through email.
Fascinating. We’re seeing the same thing. Seeing a few companies opt for Coda as a way to convert some sheets to processes. Here’s what we’ve done with it: https://supsync.com/squared-away/
E-commerce companies have 1000s of products, modelled over 100s of weeks, broken down by 10s of locations, ...
SaaS companies running cohort models over multiple products, geographies, ...
You pretty quickly end up with very large numbers.
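Back-of-the-envelope: 1,000 products × 100 weeks × 10 locations is already 1,000,000 cells per metric, so a model with a few hundred metrics (or a handful of scenarios) is in the hundreds of millions, and billions aren't far behind.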
What is your strategy for storage? It seems that because you're offering this as SaaS, your customers both set a high bar for durability of their data and your COGS would be very high if you just kept a sheet this big in memory all the time.
Do you have any plans for models which exceed the size that can be reasonably processed on one machine?
Right now we keep our paid customers' models in memory. Models in free accounts are just evaluated every time (these models are usually much smaller).
For now we want to avoid the complexity of distributed systems and just run this on very powerful machines with lots of RAM.
For reference, one of our enterprise competitors has a calculation engine that can do 40B cells on a single host (using Java!).
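(For scale, assuming 8 bytes per float64 cell: 40B cells is ~320 GB of raw values alone, before any formula or dependency overhead - big, but plausible for a single large-memory box.)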
Really interesting write up of the performance improvements. I love reading about such huge improvements by paying attention to algorithms and data structures.
You have inspired me to try replacing one of the hash tables in an algorithm I'm working on with an array to see what happens. It won't be exactly the same algorithm afterwards, but possibly the benefits will outweigh the disadvantages.
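For anyone curious about the same swap, here's the toy Go comparison I have in mind (illustrative only; real measurements belong in testing.B benchmarks):

```go
package main

import (
	"fmt"
	"time"
)

// When keys are dense integers, a slice lookup is a single indexed
// load, while a map lookup hashes the key and chases buckets.
func main() {
	const n = 1_000_000
	m := make(map[int]float64, n)
	s := make([]float64, n)
	for i := 0; i < n; i++ {
		m[i] = float64(i)
		s[i] = float64(i)
	}

	start := time.Now()
	var sum1 float64
	for i := 0; i < n; i++ {
		sum1 += m[i]
	}
	fmt.Println("map:  ", time.Since(start), sum1)

	start = time.Now()
	var sum2 float64
	for i := 0; i < n; i++ {
		sum2 += s[i]
	}
	fmt.Println("slice:", time.Since(start), sum2)
}
```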
> However, we can create serious bottlenecks when it comes to moving memory back and forth between the CPU and GPU. We’d have to learn what kinds of models would take advantage of this, we have little experience with GPUs as a team—but we may be able to utilize lots of existing ND-array implementations used for training neural nets.
Indeed, moving memory back and forth from CPUs to GPUs has an overhead. There are ways to mitigate this though! I vaguely remember that one of the patterns in reducing this movement was to keep the data in the GPU as much as possible. I haven't kept up with the latest tech in GPUs off late. When I first played around with CUDA, ArrayFire (arrayfire.com) (no affiliation) was a promising library, and might be a good fit for your GPU prototypes?
Wait, where is the actual spreadsheet? Everything is hidden behind demo booking.
It'd be nice to actually see billions of cells in action. Otherwise it's just marketing for their product disguised as a technical blog post. For all we know it doesn't actually work that well in practice.
Man, this is exactly what I did at a previous job. We had a massive query engine as part of our BI product. I recoded the entire sorting algorithm to use variadic templates to get a sorter for any set of column types in one go, rather than dereferencing a pointer for each column. We observed massive improvements in speed (more like 50% than the 100x mentioned here - but then, that codebase had been constantly optimized over the previous 25 years).
OMG, does anyone else recall a series of comedy videos about a sysadmin dealing with employees? My favorite was a call from someone who "ran out of rows" in Excel. The sysadmin is losing his mind trying to explain why that wasn't possible.
We'll probably offer an on-premise solution soon but most companies are quite open to using SaaS these days (banks and financial institutions might be an exception).
I’m curious how incremental updates fit into this. I guess there are relatively structured updates, like adding a row, but one can imagine unstructured ones: updating some parameter that many formulas depend on, or (not sure how much Causal allows this) much more varied structures/dependencies, such that one update may require recalculating some small but hard-to-predict subset of cells. Maybe there’s no need if you can recalculate everything fast.
That's what I wonder about too. I imagine that a slow calculation would be fine if done once when the data is first loaded into the system. Once the results are in memory, any new data should only affect smallish subsets of cells.
This is the approach our product Grist (https://www.getgrist.com) takes for its calculation engine.
Keeping track of dependencies slows it down for the initial load, but beyond that, it only needs to calculate what's needed, so most updates are fast. Even though the engine literally evaluates user-defined Python code for every cell.
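In sketch form (toy Go, not Grist's actual engine, which runs Python), the dirty-marking idea looks something like this:

```go
package main

import "fmt"

// Track which cells depend on which; on an edit, recompute only the
// transitive dependents instead of the whole sheet.
type Cell struct {
	value      float64
	formula    func(get func(string) float64) float64 // nil for inputs
	dependents []string
}

type Sheet map[string]*Cell

func (s Sheet) get(id string) float64 { return s[id].value }

// Set updates an input cell and recomputes everything downstream.
func (s Sheet) Set(id string, v float64) {
	s[id].value = v
	s.recalc(id)
}

func (s Sheet) recalc(id string) {
	for _, dep := range s[id].dependents {
		c := s[dep]
		c.value = c.formula(s.get)
		s.recalc(dep) // assumes an acyclic graph; real engines topo-sort
	}
}

func main() {
	s := Sheet{
		"A1": {value: 2, dependents: []string{"B1"}},
		"B1": {formula: func(get func(string) float64) float64 {
			return get("A1") * 10
		}},
	}
	s.Set("A1", 3)
	fmt.Println(s.get("B1")) // 30
}
```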
Cap the number of cells and essentially partition. Using 2 machines could conceivably double performance after scaling up stops working. So a 1bn-row sum could conceivably ask for 5 × 200m sums, then sum those (see the sketch at the end of this comment).
Maintain a float32 and a float64 version. Compute the float32 result first and return it to the user, then update with the float64 result.
You could conceivably run some kind of ML model, or even regular logic, to predict the API call your formula engine will need to make. E.g. if the customer starts typing a SUM over a column range, you could optimistically fetch that part as soon as you see that element.
Optimistic caching is obviously high-variance and resource-draining.
The 10,000-IQ version is coordination-free horizontal scaling, ScyllaDB-style.
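A toy Go sketch of the partitioning idea, with goroutines standing in for machines:

```go
package main

import (
	"fmt"
	"sync"
)

// Split a big column into chunks, sum each in parallel (or on another
// machine), then sum the partials - MapReduce over one column.
func partitionedSum(xs []float64, parts int) float64 {
	partials := make([]float64, parts)
	var wg sync.WaitGroup
	chunk := (len(xs) + parts - 1) / parts
	for p := 0; p < parts; p++ {
		lo, hi := p*chunk, (p+1)*chunk
		if lo > len(xs) {
			lo = len(xs)
		}
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(p, lo, hi int) {
			defer wg.Done()
			var s float64
			for _, v := range xs[lo:hi] {
				s += v
			}
			partials[p] = s
		}(p, lo, hi)
	}
	wg.Wait()
	var total float64
	for _, s := range partials {
		total += s
	}
	return total
}

func main() {
	xs := make([]float64, 1_000_000)
	for i := range xs {
		xs[i] = 1
	}
	fmt.Println(partitionedSum(xs, 5)) // 1e+06
}
```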
That's exactly what we're doing :) The issue is that most of the time people look at aggregates and to compute those you have to compute all the leaf cells.
Can I ask — why do you need all the leaf cells? If we want to know total gross profit for the year — why can't we compute that as "total revenue - total COGS", rather than aggregating every net profit split by product / region / month?
Are you suggesting that this is a two-step aggregation - first by groups, then aggregating those groups - in which case the first step was unnecessary?
In any case, if you sum a list of numbers you have to look at each value. There’s no way around it. You can parallelize, like MapReduce, but the number of items is unchanged. You can preaggregate, but at some point you need partial sums.
This looks really cool! I've been reading quite a bit about spreadsheet/calculation engines as I've been building my own as a hobby project recently, and the whole space is pretty interesting.
Have you seen any C or C++ spreadsheet engines that look reasonable? I need to embed a spreadsheet engine in a C++ system, and have looked at using sc-im and also an obscure project called teapot. I need spreadsheet-style runtime formulas and cells, not a calculation engine for compile time.
I've built my own cheapo one in js to support my financial modeling, but it's not quite general purpose so I get away with a lot. Would be curious how to take it to the next level without reinventing the wheel.
How do you manage memory on the backend? If you have single spreadsheets taking up GBs it can add up pretty fast. What are your thoughts on scaling this as the business grows?
We spent a fair amount of time trying to get this executing fast on SQL. However, with spreadsheets you recursively compute a lot of values. You can do this in SQL with recursive CTEs, but it’s slow, they’re not optimized for this. They also of course won’t do smart caching and dependency management for the cache. Fundamentally, it’s possible, but then we’d need to start hacking on a DB engine to make it work as a sheet. We concluded it’s better to continue on our existing engine.
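To sketch the contrast (illustrative Go, not our actual engine): pull-based evaluation recurses into dependencies and memoizes, so each cell is computed exactly once per recalc - exactly the pattern that's awkward to coax out of a recursive CTE:

```go
package main

import "fmt"

// Evaluate cells on demand, recursing into dependencies and caching
// each result so shared dependencies are computed only once.
type Engine struct {
	formulas map[string]func(eval func(string) float64) float64
	cache    map[string]float64
}

func (e *Engine) Eval(id string) float64 {
	if v, ok := e.cache[id]; ok {
		return v
	}
	v := e.formulas[id](e.Eval)
	e.cache[id] = v
	return v
}

func main() {
	e := &Engine{
		cache: map[string]float64{},
		formulas: map[string]func(func(string) float64) float64{
			"A1": func(eval func(string) float64) float64 { return 100 },
			"B1": func(eval func(string) float64) float64 { return eval("A1") * 2 },
			"C1": func(eval func(string) float64) float64 { return eval("A1") + eval("B1") },
		},
	}
	fmt.Println(e.Eval("C1")) // 300
}
```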