
Launch HN: Castodia (YC W20) – Populate Google Sheets from databases - jimmyechan
Hey HN, we’re Jimmy and Ayazhan, founders of Castodia (https://www.castodia.com). We help pull data from databases directly into Google Sheets. It’s basically a live database connector for Google Sheets that allows for data refreshes.

Before we started working on this, we were working on an idea around data pipelines. It didn’t take off, so we had to pivot mid-batch. We had less than a month to do user interviews, build our product, get it approved, and launch it. We knew we wanted to stay in the data space, so we spent the first week talking to over 30 people at different companies. In those conversations, we noticed that sales, marketing, and operations teams constantly ask developers to export data from the database for them (often by filing Jira tickets), and frequently have to submit follow-on requests because they forgot a dimension needed for their analysis. This is an inefficient yet surprisingly common process at companies today. Perhaps even more surprising is that data scientists at big companies like Facebook, which have invested heavily in data infrastructure and analytics tools, often use spreadsheets as a step in their analysis workflow. While they can pull data themselves, they too export it to CSV and then open it in a spreadsheet. This means their data doesn’t update automatically and the process has to be repeated each time.

At first we considered building our own BI tool to solve this problem. However, during our conversations we noticed that people feel new-tool fatigue, especially when their companies rotate through different tools used for the same purpose. For instance, we’ve heard of companies going from Tableau to Looker and back to Tableau, or from Kibana to Sisense to Looker. Sometimes it feels like companies pay a lot of money for tools people aren’t really using, just because it’s the thing everybody “needs to have”. Each one comes with its own structure, data modeling, and steep learning curve. It's overwhelming. At some point people realize that whatever tool they learn will likely change once a new VP gets hired and wants to do things differently. In anticipation of that, they default to the one tool they know how to use and that likely won’t be replaced soon. Spreadsheets also happen to be an elegant solution for simple calculations, quick pivots, and high-level data exploration - elegant in a way that no $50k-a-year enterprise visualization tool can be, not because those tools can’t do these things, but because people don’t know how to do them quickly. It’s also hard to change workflows for people who live in spreadsheets. Spreadsheets are pretty good for ad-hoc analysis and for summaries used in presentations and reports.

So instead of another BI tool, we built a Google Sheets add-on that connects to a database and lets users search through tables, filter/order the data, and load it into Sheets directly. In the future, we’ll let them schedule data refreshes for any saved query, so all their calculations and pivot tables update with the latest data from the database. We’ve built our first integration with Postgres and are testing MySQL, MongoDB, and public dataset connectors. We plan to add integrations with more data sources, including data warehouses such as Snowflake and BigQuery.

We implemented it to request only read access to the database, and we don’t store connection credentials or replicate data in our own database. We’re focused on data analysis at the moment, but a few customers have requested write access so they can push data back into the database. We’re not yet sure how to do this safely from spreadsheets, or whether we should at all. We’ve heard suggestions such as database write roll-backs, or an intermediate queue of writes that table owners can approve.

Other challenges include: (1) building data connectors, because many data sources spit out data differently, which means we have to parse each one differently to expose it through a uniform interface. (2) We make database connections and run all queries from AWS Lambda functions (thank you, serverless!). Some queries take minutes to execute, but AWS drops the connection after 29 seconds. (3) It is hard and, as HN often points out, risky to build a product that depends on Google. We're also running into Google Sheets' row limits and are still figuring out a workaround for large databases. We’ve been bouncing around ideas for a back-end that performs heavy computations while displaying only a sampled subset of the data in the spreadsheet. We think it's worth plunging ahead despite all these issues, though, because a spreadsheet interface to their data is very much what our users want. We plan to charge users a monthly fee, and we'll have an HN discount on our basic plan.

We’d love to hear about your data analysis workflows, the tools you use, any problems you’ve had getting data for analysis, and of course your thoughts and experiences on this use of spreadsheets! Questions and specific integration requests are also welcome, and if you’d like to be beta users, email us at founders@castodia.com.
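
To make challenge (1) concrete, the uniform interface we're aiming for looks roughly like this - a simplified Python sketch with hypothetical names, not our production code, and a toy in-memory backend standing in for a real database:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class Connector(ABC):
    """Read-only access to one data source; every backend (Postgres,
    MySQL, MongoDB, ...) parses its native output into the same shape."""

    @abstractmethod
    def list_tables(self) -> list[str]:
        ...

    @abstractmethod
    def fetch(self, table: str, columns: Optional[list[str]] = None,
              limit: int = 1000) -> tuple[list[str], list[list[Any]]]:
        """Return (column_names, rows) in a source-agnostic shape."""
        ...

class InMemoryConnector(Connector):
    """Toy backend used only for illustration."""

    def __init__(self, tables: dict[str, tuple[list[str], list[list[Any]]]]):
        self._tables = tables

    def list_tables(self) -> list[str]:
        return sorted(self._tables)

    def fetch(self, table, columns=None, limit=1000):
        cols, rows = self._tables[table]
        if columns:  # project only the requested columns
            idx = [cols.index(c) for c in columns]
            cols, rows = columns, [[r[i] for i in idx] for r in rows]
        return cols, rows[:limit]

conn = InMemoryConnector({"users": (["id", "email"],
                                    [[1, "a@x.com"], [2, "b@x.com"]])})
print(conn.list_tables())                       # ['users']
print(conn.fetch("users", ["email"], limit=1))  # (['email'], [['a@x.com']])
```

The Sheets UI only ever talks to the `Connector` interface, so filter/order/load works the same way regardless of which database is behind it.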
======
davedx
So I work at an ecommerce company where sales are taking off and more and more
people from the business are asking for dumps of various kinds from our
production database. Sometimes people ask for dumps with PII in them
(typically customer contact info). I’m trying to educate people about the
risks of dumping our database into loads of excel files with no access
controls that get emailed around.

What kind of data governance does your service support? Can you automatically
filter/redact PII? This would be a great feature.

~~~
jimmyechan
We've had a couple of people ask for that feature. We don't have a solution
yet. We could try to control access to specific tables as a first step, but
solving it properly would require access restrictions down to the column
level.

~~~
david_shaw
It might make sense to support (or replicate) Google's DLP functionality. I
think that feature is just a blacklist of certain regular expressions.
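
If that's right, the regex-blacklist approach could be sketched like this (a toy Python illustration with made-up patterns - not Google DLP's actual implementation, and real PII detection needs far more care):

```python
import re

# Illustrative blacklist only; real-world detection needs many more
# patterns plus validation (checksums, context, etc.).
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US-SSN-like numbers
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),       # card-number-like digit runs
]

def redact(value: str, mask: str = "[REDACTED]") -> str:
    """Replace any blacklisted pattern in a cell value with a mask."""
    for pattern in PII_PATTERNS:
        value = pattern.sub(mask, value)
    return value

print(redact("reach me at jane@example.com or 123-45-6789"))
# → reach me at [REDACTED] or [REDACTED]
```

Applied per cell at export time, this would keep raw PII from ever landing in the spreadsheet.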

Cool product, by the way!

~~~
jimmyechan
I just took a look - it's really cool. We'll check out Google's DLP in more
detail! Thanks!

------
xrendan
We looked at building a two-way sync tool on top of Google Sheets to test an
MVP instead of building a custom tool, and it was a shitshow from a data
synchronization standpoint - figuring out how to keep the state of each side
in sync is very hard. We abandoned the idea pretty quickly.

For problem (2), I'm not sure if you're returning data directly from Lambda,
or using Lambda to call the Sheets API to update the sheet after the data
comes back.

If it's the first, there's no way around it, since Lambda functions called
through API Gateway are subject to the ~30-second timeout you're running
into. If it's the second, you should be able to use two Lambdas: one that
ingests requests and adds them to a queue, and a second one (which can have
up to a 15-minute timeout) that consumes the queue.
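
That two-Lambda shape might look roughly like this (a Python sketch; an in-process queue stands in for SQS, and the handler names are made up):

```python
import json
import queue

# In-memory stand-in for SQS. In production the producer would call
# sqs.send_message and the consumer would be triggered by the queue.
job_queue = queue.Queue()
results = {}

def enqueue_handler(event):
    """Fast, API-Gateway-facing Lambda: accept the query and return
    immediately, well under the ~30s gateway timeout."""
    job_queue.put({"job_id": event["job_id"], "sql": event["sql"]})
    return {"statusCode": 202, "body": json.dumps({"job_id": event["job_id"]})}

def worker_handler():
    """Slow consumer Lambda (up to 15 min): run the query, then push the
    result into the sheet via the Sheets API (simulated here)."""
    job = job_queue.get()
    rows = [["id", "total"], [1, 42]]  # pretend query result
    results[job["job_id"]] = rows      # stand-in for a Sheets API write
    return rows

resp = enqueue_handler({"job_id": "q1", "sql": "SELECT 1"})
print(resp["statusCode"])  # 202
worker_handler()
print(results["q1"][1])    # [1, 42]
```

The key point is that the HTTP response carries only a job id; the sheet is filled in out-of-band once the slow query finishes.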

~~~
neeleshs
Two-way sync without maintaining state is indeed a complex problem. It is
loosely a multi-master replication problem.

Shameless plug: We built a multi-directional sync engine and a company called
Syncari around it to solve exactly this problem. And added data pipelines
around it. Mostly meant for sales operations/marketing operations folks, self-
service style.

------
jacamera
I'm unfamiliar with the Google Apps add-on API and how add-ons are
installed/hosted/executed. What machine is connecting to the database - the
user's, your server, or Google's? I'm asking because if the database has to
be publicly accessible over the internet, that seems like a deal breaker.

~~~
jimmyechan
Our servers. We don't require databases to be open to the world, but we do
ask that our IP be whitelisted. We use Lambda functions to run the queries.

~~~
jacamera
If you're connecting over the internet without a VPN or SSH tunnel, then you
are requiring that databases be open to the world, regardless of any
firewalls that sit in front of them. Having worked on the infrastructure side
of things, I can't imagine any serious company being amenable to such an
arrangement.

------
reilly3000
I think read-only is what most businesses would want, that seems to be how
most BI works.

Other challenges:

1\. Building uniform data connectors sucks. GLHF.

2\. Long queries: can you return a promise, then call back to an Apps Script
endpoint when the data is ready?

3\. Row limits: I don't see aggregation in your screenshots; that would
certainly help. Spreadsheet users know about row limits - just give them good
UX and control. Sheets gets sluggish pretty quickly on large data sets, so
you may want a lower default limit so the UX stays snappy, plus a warning
for large queries.

~~~
ayazhan
Thanks!

1\. Yes, we learned that pretty quickly, especially while working on Postgres
and MongoDB. And this is the value we’re trying to deliver: as a user, you
don’t have to know the technical differences between databases - you query
them all the same way.

2\. Interesting idea, we’ll give it a try, thanks!

3\. Yes, we don’t have an aggregate feature just yet - working on it. For
large datasets, we're thinking about loading a sampled version of the dataset
instead for quick analysis (histograms and so on). Lowering the default limit
is interesting, but we're worried it might compromise the experience (if you
can only load 10,000 out of 1,000,000 records, you might not be able to do
all the analysis you want). Still exploring solutions.
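
For the sampling idea, Postgres at least has TABLESAMPLE built in (since 9.5); a sketch of a hypothetical helper that builds such a query:

```python
def sampled_query(table: str, pct: float) -> str:
    """Build a Postgres query that scans roughly pct% of a table's pages
    via TABLESAMPLE SYSTEM. A fixed REPEATABLE seed keeps the sample
    stable across refreshes. (Hypothetical helper, for illustration;
    `table` must come from a trusted list, not raw user input.)"""
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    return f"SELECT * FROM {table} TABLESAMPLE SYSTEM ({pct}) REPEATABLE (42)"

print(sampled_query("events", 1))
# → SELECT * FROM events TABLESAMPLE SYSTEM (1) REPEATABLE (42)
```

SYSTEM sampling is fast but page-based (clustered); BERNOULLI gives a more uniform per-row sample at the cost of a full scan.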

------
jujodi
The pricing is confusing and possibly too expensive. For this price you could
just buy Office 365 and use the built-in Data Connection to any SQL server,
and companies like MongoDB and Elasticsearch already offer ODBC/DSNs for
Excel.

How are you handling MySQL differently than Postgres such that it would
warrant a higher price for the consumer?

It's also not super clear what you get with your 100 queries; you can hook up
DataGrip or another tool to any of these data sources and get unlimited
queries for free.

------
mbil
Singer taps and targets can do this:

    pip install tap-postgresql target-google-sheets
    tap-postgresql | target-google-sheets

[https://www.singer.io/](https://www.singer.io/)

disclaimer: I used to work at Stitch Data

~~~
ayazhan
This is cool! Our vision is to simplify this experience with intuitive visual
tools and advanced features, such as versioning, data governance, saved
queries, and more.

------
knackfuss
Interesting product, but $30 per user per month just seems like too much. I
don't have that pain though - is it justifiable? Is it worth it for a data
science team of 5+ people to pay over 150 dollars a month for this? I'm
asking out of curiosity.

~~~
jimmyechan
Thanks. We think the time saved over manual exports justifies it, along with
keeping Sheets connected to the database - that way data can be refreshed
just by re-running the query. We looked at similar tools and found our
pricing to be fairly competitive.

~~~
reilly3000
I think it's competitive. Many businesses run off of spreadsheets, and armies
of folks enter data into them manually or with great integration pain. Self-
serve business data in sheet form is pretty much the ideal UX for something on
the order of millions of business knowledge workers. $30/mo is a pittance for
that superpower.

It's certainly more affordable than Tableau and other traditional BI tools.

The userbase is probably not well represented on HN...

------
toomuchtodo
You may want to explore a partnership or some other relationship with no-code
tool providers like Tray.io, Zapier, and the like. In my experience, users
look for this sort of functionality in those products, but they can't support
this use case, since they're designed as event-driven transformation tools
rather than bulk ETL tools.

~~~
jimmyechan
I like this idea. I'll play around with those tools and see if it's a viable
option - we just want to make sure the UI/UX stays easy for people accessing
their data in spreadsheets.

------
georgek
Great work guys! This could be very useful for us. Some feedback for our use
case:

1\. We’re in healthcare, so data privacy / security and HIPAA compliance are a
must. One way to address this might be to architect it so that our data never
touches your servers.

2\. Support for ActiveRecord-like abstraction layers would be great. Our
underlying schema is fairly gnarly after multiple rounds of regulation-driven
changes to our data model, so directly querying the Postgres DB isn't really
feasible for most of the team doing basic analysis.

Pricing is fine/cheap for what it could make possible for business
intelligence/analytics.

~~~
jimmyechan
Thank you! Those are good insights and feedback for the healthcare use case.
We'll look into it!

------
corry
Cool idea, guys - congrats on the launch, especially given it was mid-batch.

I'm not in the market for something like this, but it resonates with me: a
few times now we've had to jury-rig scripts to pull data from a MySQL DB into
a Google Sheet, which then does some simple calc/re-packaging for display on
a dashboard or other simple BI use case. But to do that you end up
maintaining what is basically a small "app", even if it's just a cron job
hitting a Python script hourly.

I also like the idea of connecting the world's most powerful / most adopted BI
tool (spreadsheets) and the more proper data stores one finds in DBs.

~~~
jimmyechan
Thanks! Would love to understand more about your use case. What do you use to
display dashboards - charts inside Sheets, or a separate tool reading data
from the spreadsheet and displaying it? We're working on a MySQL connection
and scheduled refreshes, so we might be able to solve your problem soon.

------
LaserToy
There is a thing called Presto. It has a Google Sheets connector and many
other connectors. Why not use it?

~~~
ayazhan
Great advice! We'll give it a try, thank you!

------
corcov20
Can you go into the decision to collect the following:

"Other Data While using our Service, we may also collect the following
information: sex, age, date of birth, place of birth, passport details,
citizenship, registration at place of residence and actual address, telephone
number (work, mobile), details of documents on education, qualification,
professional training, employment agreements, non-disclosure agreements,
information on bonuses and compensation, information on marital status, family
members, social security (or other taxpayer identification) number, office
location and other data. " [https://www.castodia.com/privacy-
policy](https://www.castodia.com/privacy-policy)

~~~
jimmyechan
Hey, thanks! You actually went through our privacy policy. We don't really
need that data - we were working from a template. We'll take it down.

------
amolo
About the issue with AWS Lambda: have you taken a look at other options such
as Cloud Run or Fargate? I think they would better serve your use case.

Edit.

To be specific, the limit on container runtime is longer on the services I
mentioned above.

~~~
ayazhan
Thank you for recommending these! Our problem isn't related to the Lambda
functions, though - they can run for minutes. It's the API Gateway that
closes the connection after ~30 seconds. We might just need to build our own
web server.

~~~
dyu
Also worth looking into using an ALB instead of API Gateway, and having the
ALB invoke the Lambda (I think this is possible).

------
alexbanks
I might be interested in using this, but I'm seeing the Snowflake connection
is in the "Enterprise" package which requires an annual contract and is coming
soon. Any idea on timing? Is there a beta?

~~~
jimmyechan
Would you want to be our early beta user for the Snowflake integration? We'd
appreciate your insight and the chance to learn from your use case.
Timeline-wise, we're prioritizing MySQL, MongoDB, and improving our back-end.
I'd say we're another month out before we start on Snowflake, but we'd love
to give you early access when we do.

~~~
alexbanks
Lemme check with my higher ups, but I think so? We would use this
functionality a lot.

~~~
jimmyechan
Sure. Just reach out to us!

------
akanthi
It would be interesting if you could extend this to a two-way connector
between something like Airtable and Postgres. Airtable could represent the
last n records at any given point in time. I'd pay for that product. Not so
much into Google Sheets.

------
saadatq
Y’all should check out the Google Sheets <> BigQuery connector for design and
usability ideas

~~~
ayazhan
Will do, thanks for the advice!

------
enisberk
Congrats on the launch - an important problem to solve. On the other hand, I
found your previous idea about data pipelines really interesting as well. Do
you mind sharing why it didn't work out?

------
kevc
This looks great, looking forward to trying it out. I was a heavy Essbase user
at a previous analyst role so working with data as a spreadsheet plugin feels
very natural.

~~~
jimmyechan
Thanks! Let us know if you have any feedback or suggestions for improving the
product!

------
eliblock
I'm curious, what was the original data pipeline idea?

------
bluntcoder
How is your product different from
[https://seekwell.io/](https://seekwell.io/)?

~~~
jimmyechan
Right now we're focused on providing access directly from Sheets, so you
don't need to switch back and forth between multiple apps or pages. We also
allow UI-based filtering and sorting for those who prefer not to use SQL.

------
thibautg
How is it different from an Excel sheet pulling external data through an ODBC
connector?

------
swampthinker
How does this compare to SuperMetrics?

~~~
jimmyechan
We focus a lot on our visual query editor - making it easy for people to
search tables, select the columns they want, filter/order, and then load into
Sheets through the UI. We also allow direct SQL queries for people who
already have pre-written queries they want to use.

