Show HN: Ingest data from your customers (Prequel YC W21)
40 points by ctc24 on March 15, 2023 | 13 comments
Hey HN! Charles here from Prequel (https://prequel.co). We just launched the ability for companies to import data from their customers’ data warehouses or databases, and we wanted to share a bit more about it with the community.

If you just want to see how it works, here’s a demo of the product that Conor recorded: https://www.loom.com/share/4724fb62583e41a9ba1a636fc8ea92f1.

Quick background on us: we help companies integrate with their customers’ data warehouses and databases. We’ve been busy helping companies export data to their customers – we’re currently syncing over 40bn rows per month on their behalf. But folks kept asking us if we could help them import data from their customers too. They wanted the ability to offer a 1st-party reverse ETL to their customers, similar to the 1st-party ETL capability we already helped them offer. So we built that product, and here we are.

Why would people want to import data? There are actually plenty of use cases here. Imagine a usage-based billing company that needs a daily pull from its customers of all the billing events that happened, so that it can generate accurate invoices. Or a fraud detection company that needs the latest transaction data from its customers so it can flag the fraudulent ones.

There’s currently no great way to import customer data. People typically solve this in one of two ways. The first is importing data via CSV. This works well enough, but it requires ongoing work on the part of the customer: they need to put a CSV together and upload it to the right place on a daily/weekly/monthly basis. That’s painful and time-consuming, especially for data that needs to be continuously imported. The second is making the customer write custom code to feed data into the company’s API. That requires the customer to do a bunch of solutions-engineering work just to get started using the product – a suboptimal onboarding experience.

So instead, we let the customer connect their database or data warehouse and we pull data directly from there, on an ongoing basis. They select which tables to import (and potentially map some columns to required fields), and that’s it. The setup only takes 5 minutes, and requires no ongoing work. We feel like that’s the kind of experience every company should provide when onboarding a new customer.
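
To make the flow concrete, here's a rough sketch (in Go, with invented names – not our actual API) of the kind of configuration a customer ends up producing: which tables to pull, and how their columns map onto the required fields:

    package main

    import "fmt"

    // ColumnMapping and TableImport are illustrative stand-ins for an import
    // configuration: a source column in the customer's warehouse mapped to a
    // field the receiving product expects.
    type ColumnMapping struct {
        SourceColumn string // column name in the customer's warehouse
        TargetField  string // field required by the receiving product
    }

    type TableImport struct {
        SourceTable string
        Mappings    []ColumnMapping
    }

    func main() {
        cfg := TableImport{
            SourceTable: "billing_events",
            Mappings: []ColumnMapping{
                {SourceColumn: "event_ts", TargetField: "occurred_at"},
                {SourceColumn: "amount_cents", TargetField: "amount"},
            },
        }
        fmt.Printf("import %s with %d mapped columns\n", cfg.SourceTable, len(cfg.Mappings))
    }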

Importing all this data continuously is non-trivial, but thankfully we can reuse about 95% of the infrastructure we built for data exports. It turns out our core transfer logic stays pretty much exactly the same; all we had to do was ship new CRUD endpoints in our API layer to let users configure their source/destination. As a brief reminder about our stack, we run a Go backend and a TypeScript/React frontend on k8s.
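
For a flavor of what one of those configuration endpoints could look like, here's a minimal, hypothetical sketch using Go's net/http – the route, fields, and in-memory store are illustrative only, not our actual implementation:

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
        "sync"
    )

    // SourceConfig is a hypothetical payload for registering a customer's
    // database as an import source. Field names are illustrative only.
    type SourceConfig struct {
        ID     string `json:"id"`
        Vendor string `json:"vendor"` // e.g. "postgres", "snowflake"
        Host   string `json:"host"`
        DBName string `json:"db_name"`
    }

    var (
        mu      sync.Mutex
        sources = map[string]SourceConfig{}
    )

    // createSource sketches a POST /sources handler: decode the config,
    // store it, and echo it back to the caller.
    func createSource(w http.ResponseWriter, r *http.Request) {
        if r.Method != http.MethodPost {
            http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
            return
        }
        var cfg SourceConfig
        if err := json.NewDecoder(r.Body).Decode(&cfg); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        mu.Lock()
        sources[cfg.ID] = cfg
        mu.Unlock()
        w.Header().Set("Content-Type", "application/json")
        _ = json.NewEncoder(w).Encode(cfg)
    }

    func main() {
        http.HandleFunc("/sources", createSource)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }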

In terms of technical design, the most challenging decisions we had to make are around getting databases’ type systems to play nicely with each other (kind of an evergreen problem, really). For imports, we let the data recipient specify whether they want to receive the data as a JSON blob or as a nicely typed table. If they choose the latter, they specify exactly which columns they’re expecting, as well as what type guarantees those should uphold. We’re also working on the ability to feed that data directly into an API endpoint, and on adding post-ingestion validation logic.
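
As a rough illustration of the typed-table option (invented names and types, not our code): the recipient declares the columns and types they expect, and incoming values get checked against that declaration before loading:

    package main

    import (
        "fmt"
        "time"
    )

    // ColumnSpec is an invented stand-in for a declared destination column:
    // the recipient says what the column is called and what type it must be.
    type ColumnSpec struct {
        Name string
        Type string // "string", "int64", or "timestamp" in this sketch
    }

    // validate checks a single incoming value against the declared type.
    func validate(spec ColumnSpec, v interface{}) error {
        switch spec.Type {
        case "string":
            if _, ok := v.(string); ok {
                return nil
            }
        case "int64":
            if _, ok := v.(int64); ok {
                return nil
            }
        case "timestamp":
            if _, ok := v.(time.Time); ok {
                return nil
            }
        }
        return fmt.Errorf("column %q: expected %s, got %T", spec.Name, spec.Type, v)
    }

    func main() {
        spec := ColumnSpec{Name: "occurred_at", Type: "timestamp"}
        fmt.Println(validate(spec, time.Now()))   // <nil>
        fmt.Println(validate(spec, "2023-03-15")) // type mismatch error
    }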

We’ve mentioned this before, but it bears repeating: we know that security and privacy are paramount here. We’re SOC 2 Type II certified, and we go through annual white-box pentests to make sure all our code is up to snuff. We never store any of the data on our servers. Finally, we offer on-prem deployments, so data never even has to touch our servers if our customers don’t want it to.

We’re really stoked to be sharing this with the community. We’ll be hanging out here for most of the day, but you can also reach us at hn (at) prequel.co if you have any questions!




So is this like Fivetran, except between clients as opposed to vendor-client?

If so, any idea why most data integration tools have not done this (or have they)? What is so tricky that they could not extend their tools to cover a customer's Postgres database?


Not sure if I'm understanding the analogy. The way I usually describe it is that it's like Census / Hightouch, but it's offered by the vendor as a first-party feature.

Let's take Salesforce as an example. Let's say they want to pull in data from their customers' databases -- maybe so that sales reps can keep track of how much volume the customer did in the last month -- instead of requiring the customer to instrument their code with Salesforce API calls. Salesforce could use this tool to connect directly to all of their customers' databases / data warehouses, regardless of whether they're Postgres, Snowflake, ClickHouse, etc.

As far as why it's non-trivial: you have to support a lot of different databases / data warehouses, which all have slightly different query languages, type systems, and optimizations. Then you've got to move the data reliably, dealing with things like eventual consistency etc. We feel like that's the reason this hasn't been built yet.
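
One common way to tame that sprawl – a sketch with invented names, not necessarily how we've structured it internally – is a per-vendor dialect interface that owns quoting and type mapping:

    package main

    import "fmt"

    // Dialect is an invented interface: one implementation per warehouse,
    // each owning its quoting rules and type mapping.
    type Dialect interface {
        QuoteIdent(name string) string
        MapType(sourceType string) string
    }

    type postgresDialect struct{}

    func (postgresDialect) QuoteIdent(name string) string { return `"` + name + `"` }

    func (postgresDialect) MapType(t string) string {
        switch t {
        case "timestamp":
            return "timestamptz"
        case "int64":
            return "bigint"
        default:
            return "text"
        }
    }

    func main() {
        var d Dialect = postgresDialect{}
        fmt.Println(d.QuoteIdent("last_updated_at"), d.MapType("timestamp"))
    }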


Is it a full refresh from the source each time, or is it incremental, and if incremental, what assumptions do you make or not make about keys, dupes, etc?


It depends -- mostly on whether the vendor (the company receiving the data) is comfortable requiring the source to map some fields.

For low volume cases, we can operate with zero mapping of fields. In those cases, we run every transfer as a full refresh.

If the volumes are higher, then we'll typically ask the source to expose a primary key and last_updated_at timestamp field. In those cases, we run incremental transfers. We use the last_updated_at to figure out what data to transfer, and the primary key to merge it into the destination table without creating dupes.
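
As a rough sketch of the shape of those incremental transfers (illustrative SQL and column names, not what we actually generate):

    package main

    import (
        "fmt"
        "time"
    )

    // incrementalSelect pulls only rows updated since the previous watermark.
    // The last_updated_at column name is an example.
    func incrementalSelect(table string, watermark time.Time) string {
        return fmt.Sprintf(
            "SELECT * FROM %s WHERE last_updated_at > '%s' ORDER BY last_updated_at",
            table, watermark.UTC().Format(time.RFC3339),
        )
    }

    // mergeStatement upserts the staged batch into the destination on the
    // primary key, so re-running a transfer never creates duplicate rows.
    // Columns (id, amount, last_updated_at) are invented for the example.
    func mergeStatement(dest, staging, pk string) string {
        return fmt.Sprintf(
            "MERGE INTO %s d USING %s s ON d.%s = s.%s "+
                "WHEN MATCHED THEN UPDATE SET d.amount = s.amount, d.last_updated_at = s.last_updated_at "+
                "WHEN NOT MATCHED THEN INSERT (id, amount, last_updated_at) VALUES (s.id, s.amount, s.last_updated_at)",
            dest, staging, pk, pk,
        )
    }

    func main() {
        lastSync := time.Date(2023, 3, 14, 0, 0, 0, 0, time.UTC)
        fmt.Println(incrementalSelect("billing_events", lastSync))
        fmt.Println(mergeStatement("billing_events", "billing_events_staging", "id"))
    }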


Thanks! Can you detect deleted rows as part of that?

Do you have support for the target table maintaining history via record effective and termination dates with a current record indicator, or do you just support maintaining current state at the target?

Can the target be a cloud filestore or old school SFTP site?


We can detect deleted rows for incremental transfers (and propagate those) if they're soft-deleted in the source, whether through a deleted_at column or an is_deleted column.
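
Roughly, one way to think of it (illustrative column names, not our actual generated SQL) is that the soft-delete pass is just another watermark query whose results become deletes at the destination:

    package main

    import "fmt"

    // softDeleteSelect finds rows flagged as deleted since the last watermark;
    // their ids then become deletes at the destination. Column names are examples.
    func softDeleteSelect(table, watermark string) string {
        return fmt.Sprintf(
            "SELECT id FROM %s WHERE is_deleted = true AND last_updated_at > '%s'",
            table, watermark,
        )
    }

    func main() {
        fmt.Println(softDeleteSelect("transactions", "2023-03-14T00:00:00Z"))
    }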

For now, we only support maintaining current state in the target.

Yup! We support all common cloud file storage as destinations (S3, R2, GCS, Azure Blob Storage) as well as vanilla SFTP servers.


So is the value-add that the customers of Company-A (who is your customer) entrust you with credentials to their databases versus entrusting Company-A with them directly?


That can be part of the value-add, though for on-prem deployments, we never touch the credentials ourselves.

Not to sound like a consultant, but there are three value-adds I'd call out:

1. Handling the dialect, types, and connection modalities of many different databases. This takes a lot of time to build and there's a lot of nuance that's non-trivial to work through.

2. Replicating data and guaranteeing data integrity + reliability. There's again a lot of nuance here, especially once you start considering that data is eventually consistent in most sources, that you want to transfer it as efficiently as possible, etc.

3. Providing a UX that end customers can use out of the box, so that the onboarding experience is clean and intuitive. We spend a lot of time thinking about how it makes sense for people to connect their data, so that our customers don't have to.

edit: fmt


If one side of your business is ingesting data, is the other side excreting it?


Pretty much! We also offer data exports.


The setup only takes 5 minutes,

Nothing ever takes 5 minutes. Remember these are engineers you're talking to here.


Ha, fair enough! We did our best to make the setup flow as yak-shaving-proof as possible, but there's no such thing as a guarantee.


Hey HN -- Conor here, aka the guy from the demo. Happy to answer any questions you have!



