
In discussions like this, I often wonder how there can be enough data to require a "big data" pipeline with things like Spark and Presto.

Stitch Fix seems to be one of those online services that send you sets of clothes they think fit your style. That seems like a really narrow, low-data kind of industry. How much data can they possibly have, and why such a big backend? In January 2018 they had 100 engineers, and presumably they're even larger now. All that for a service that sends out clothing.

Maybe I'm lacking in imagination or insight into what it takes to run a company like this. On the other hand, a single PostgreSQL instance can run complex ad-hoc queries, with CTEs and everything, on millions, even billions, of rows.
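To make the "complex ad-hoc queries with CTEs" point concrete, here is a minimal sketch using SQLite as a stand-in for PostgreSQL (the same SQL runs on both); the orders table and its columns are made up for illustration.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; the hypothetical orders table
# holds per-customer purchase amounts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 50.0), (1, 70.0), (2, 20.0);
""")

# An ad-hoc query using a CTE: total spend per customer, then filter.
rows = conn.execute("""
    WITH totals AS (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT customer_id, total FROM totals WHERE total > 40
""").fetchall()
print(rows)  # [(1, 120.0)]
```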

Their service is high-touch. Stylists speak with clients constantly and record that information.

They also have mobile apps, they run product experiments, they source and sell clothing and manage inventory, they build and iterate on algorithmic approaches to recommend and design clothing (many of which help stylists and never reach the screen of an external client).

You can skim their Algorithms blog for more detail. I find it impressive how they scale the impact of relatively few stylists to about 3M users.


Spark is not a competitor to PostgreSQL.

It is a distributed compute engine with a lot of capabilities: it is rock solid, lets you blend SQL with Python/R/Scala, and can support ML use cases as your needs grow. You can easily keep all of your data in PostgreSQL and run Spark on top.

At what point does it become “big data”? Tens of billions of rows? Hundreds of billions? I would love to get a single PostgreSQL server to index all of my data in a timely manner. Heck, I’d be happy with a sharded scheme if it worked. But I don’t have time to learn the low-level details of an RDBMS and my system's I/O throughput to make that work. Spark works, it’s easy, and it’s relatively cheap. So why make it more complicated?
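A toy sketch of why a hand-rolled sharded scheme is the kind of thing you end up owning: even the simplest version needs a stable routing function, and every query path has to go through it. The shard count and key names below are made up.

```python
import hashlib

# Hypothetical hash-based shard router: the simplest piece of a
# hand-rolled sharding scheme.
NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # A stable hash, so the same key always routes to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every read and write now has to route through this, and changing
# NUM_SHARDS later means physically moving data between servers --
# the low-level details the parent comment would rather not own.
buckets = {}
for key in ("user:1", "user:2", "user:3", "user:42"):
    buckets.setdefault(shard_for(key), []).append(key)
print(buckets)
```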

A transactional schema usually isn't ideal for running analytics on, so you create a data warehouse that transforms your data into a schema more amenable to reporting. They probably also integrate a ton of marketing data into the warehouse for analysis.
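A small sketch of that transform, again with SQLite standing in: a normalized transactional schema is flattened into a single denormalized fact table at load time, so reporting queries avoid joins against the OLTP tables. All table and column names are hypothetical.

```python
import sqlite3

# Hypothetical transactional schema (customers + orders) flattened into
# a reporting-friendly fact table, the kind of step a warehouse load
# job performs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL, ordered_on TEXT);
    INSERT INTO customers VALUES (1, 'west'), (2, 'east');
    INSERT INTO orders VALUES (10, 1, 30.0, '2021-09-01'),
                              (11, 2, 45.0, '2021-09-01');

    -- Pay for the join once at load time; analysts then query the
    -- flat table directly.
    CREATE TABLE fact_orders AS
    SELECT o.id, o.amount, o.ordered_on, c.region
    FROM orders o JOIN customers c ON c.id = o.customer_id;
""")
report = conn.execute("""
    SELECT region, SUM(amount) FROM fact_orders
    GROUP BY region ORDER BY region
""").fetchall()
print(report)  # [('east', 45.0), ('west', 30.0)]
```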

I used to work for SF and managed one of the biggest tables in the DW. I won't say what it's used for, but to give you a clue how big it is: we added a few billion rows to that table each day. I don't think a single PostgreSQL instance would have any chance of surviving that.
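A back-of-envelope calculation of what "a few billion rows a day" implies; the row size here is an assumption for illustration, not a Stitch Fix figure.

```python
# Assumed numbers: 3 billion rows/day, 100 bytes per row on average.
# Neither figure comes from the comment above.
rows_per_day = 3_000_000_000
bytes_per_row = 100

gb_per_day = rows_per_day * bytes_per_row / 1e9
print(gb_per_day)  # 300.0 GB/day, before indexes or replication
```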

