
Launch HN: Synth (YC S20) – Realistic, synthetic test data for your app - openquery
Hey!

Christos, Damien and Nodar here and we're the co-founders of Synth (https://getsynth.com) - Synth is an API which lets you quickly and easily provision test databases filled with realistic data for testing your application.

We started our company about a year ago, after working at a quantitative hedge fund in London where we built models to trade US equities. Strangely, instead of spending time developing models or building the trading system, a large portion of our time was spent just sourcing and on-boarding datasets to train and feed our models. The process of testing and on-boarding datasets was archaic: one data provider served us XML files over FTP, which we then had to spend weeks transforming for our models to ingest. A different provider asked us to spin up our own database and then sent us a binary which was used to load the data. We had to whitelist their API's IP address and set up a cron job to make sure the dataset was never out of date. The binary took interactive input, so it couldn't be scripted - or rather it could be, but only by mocking the interactive prompts. All this took a junior developer on the team a good 3-4 days to figure out and set up. After our trial expired we decided we didn't actually need the dataset, so those 3-4 days were essentially wasted. Our frustration with the status quo in data distribution is what drove us to start our company.

We spent the first 6 months building a privacy-aware query engine (think Presto but with built-in privacy primitives), but software developers we talked to would frequently divert the topic to the lack of high-quality, sanitised testing data in the software development lifecycle. It was strange - most developers and data scientists constantly use some sort of testing data, for different reasons. Maybe you want a local development environment which is representative of production but clean of customer data. Or a staging environment with a much smaller, representative database so that tests run faster. You might want the dataset to be much bigger, to test how your application scales. Maybe you want to share your database with third-party contractors you don't necessarily trust. Whichever way you put it, it's strange that for a problem most of us face every day we have no idiomatic solution. We write bespoke scripts and pipelines which often break, are time-consuming to write and maintain, and need manual updates every time the schema changes. Or we get lazy and copy/paste production.

We finally listened to all this feedback, dropped the previous product, and built Synth instead. Synth is a platform for provisioning databases with completely synthetic data.

The way Synth works breaks down into 3 main steps. First you download our CLI tool (a bunch of Python wrapped up in a container) and point it at your database to create a model (we host the models on the Synth platform). This model encodes your schema and foreign-key relationships, as well as a semantic representation of your types. We currently use simple regular expressions to classify the semantic types (for example an address or a license plate). The whole model is represented as a JSON object, so if the classifier gets something wrong you can easily change the semantic type. Once the model has been created, the next step is to train it. Under the hood we use a combination of copulas and deep-learning models to capture the distributions and correlations in your dataset (the intuition being that realistic data is much more useful to developers than samples from a random number generator). The final step is to use the trained model to generate synthetic data. You can either sample directly from the model, or we can spin up a database for you and fill it with as much data as you need. The generation step samples from the trained model to create realistic data, using bespoke generators for sensitive fields (credit card numbers, names, addresses, etc.).

You can run the entire lifecycle in a single command: you point the CLI tool at your database (currently Postgres, MySQL and MsSQL) and in ~1 minute you get an IP address and credentials for your new database with completely synthetic data.

We're long-time fans of HN and eagerly look forward to feedback from the community (especially criticism). We've made a free version available this week so you can try it with no strings attached. We hope some of you will find Synth useful. If you have any questions we'll be around throughout the day. Also feel free to get in touch via the site.

Thanks!
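The copula intuition above can be sketched in miniature. This is a toy illustration, not Synth's actual code: fit each column's marginal distribution plus the cross-column correlation, then sample synthetic rows that preserve both. The column names and data here are invented for the example.

```python
# Toy sketch of the copula idea (not Synth's implementation): keep each
# column's marginal and the correlation between columns, then sample
# synthetic rows that preserve both.
import math
import random
from statistics import NormalDist

random.seed(0)
std_normal = NormalDist()

# Pretend "production" data: age correlated with account balance.
ages = [random.gauss(40, 10) for _ in range(1000)]
balances = [a * 120 + random.gauss(0, 500) for a in ages]

def empirical_inverse_cdf(column, u):
    """Map a uniform [0,1] value back onto the observed marginal."""
    ordered = sorted(column)
    return ordered[min(int(u * len(ordered)), len(ordered) - 1)]

def sample_row(rho):
    """Draw one synthetic (age, balance) row via a Gaussian copula."""
    z1 = random.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho * rho) * random.gauss(0, 1)
    u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
    return (empirical_inverse_cdf(ages, u1),
            empirical_inverse_cdf(balances, u2))

synthetic = [sample_row(rho=0.95) for _ in range(1000)]
```

The synthetic rows look like the originals (same per-column distributions) and keep the age-balance correlation, without replaying any real record.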
~ Christos, Damien &amp; Nodar
======
nartz
Hey guys - here's some critical feedback from a fellow dev, my n-of-1
perspective - of course this could look very different for e.g. large
enterprise companies struggling with this.

Feedback:

It seems overly complicated. You lost me when you said I have to train
models. Are you assuming that software developers want to train machine
learning models to do something as simple as creating some test data? In
reality, I reach for tools that make things easier for me, which means not
having to read a ton of documentation or download new external tools - things
that 'just work'.

It is 100% easier for me to export a little production data to test on (and
maybe sanitize), or to write a small script to generate a few users and those
things I need to test. Plus - then I know exactly what I'm going to get. A lot
of times, after I've done this once, it will work for a good while as well -
if I do change the schema, I can add some additional data for that column, and
go from there, or otherwise.

For those companies who have 'messy' fixture data - is the _tool_ the issue?
My take is that the difficulty with maintaining the data could contribute to
this issue, but is also more an issue of simply bad housekeeping - e.g.
rushing and not tending the garden. While your system might handle this, your
system also seems to require a different skillset (e.g. specific
training/knowledge) than the standard QA developer might have.

If I did use it, I'd prefer it to be much easier to use - if I could include
a ruby gem and incorporate it into the testing process, e.g. an 'after' hook
after migrating the db, that would be ideal. Then I don't really need to know
much. However, I would still be concerned about whether it creates data
deterministically or at random.

Good luck!

~~~
treis
>things I need to test

I think this is the biggest problem. I don't need a lot of random data in my
database. I need a lot of specific scenarios set up. And a way to get those
scenarios back after I test something.

I've definitely been in a lot of situations where test data is a problem. A
particularly egregious one that comes to mind is the poor developer that had
to develop the fraud functionality. Marking an account as fraud nuked it in
the back end. Lots of angry testers/developers when their favorite test
account got marked as fraud.

~~~
openquery
> I need a lot of specific scenarios set up

Yes, we've seen this quite a lot in the wild. The truth is this is not very
well defined - how you get your data to tell a story depends on the story you
are trying to tell.

We are trying to come up with a more rigorous framework for abstract
representations of 'scenarios'. It's on our roadmap, so keep an eye out for
this :)

------
brosky117
Congrats on shipping Christos, Damien, and Nodar! I really like this idea. I
have this problem at my company.

Two questions:

First, we’re using Postgres and some of our tables use JSON. Would Synth be
able to generate realistic JSON? Sometimes this is configuration (which would
need to be straight copied) and other times it would be data (which would need
to keep the same keys but have generated values). Is this use case supported?

Second, I’m concerned about giving Synth access to my data as much of it is
sensitive. I understand that you need access to production data to offer the
service. What can you tell me about your data security to help me feel more
comfortable? (i.e. What kind of data would you have stored on your end? How
does the CLI work? etc)

Congrats again and good luck!

~~~
openquery
Thanks and great questions!

> First, we’re using Postgres and some of our tables use JSON...

We've seen this before with a company we were considering for a pilot
pre-launch - it's on our roadmap. Currently the JSON text is treated as a
string, i.e. it is classified as a categorical type or as text.

What we would want is for the classifier to traverse the JSON object instead
of treating it like text. This feature is going to be implemented when we
extend to NoSQL databases.

> Second, I’m concerned about giving Synth access to my data as much of it is
> sensitive.

Absolutely. This has been one of the guiding principles in building Synth.
We've built it so that our servers _never_ have to see any sensitive
information. (That's why you use Synth via a CLI tool rather than an API.)

Also:

1) The CLI is soon to be OSS giving full visibility into exactly what's
happening when you use it. (Really it's OSS now since you can just take a look
at the source code running in the container, we just haven't had the time to
make our repo public)

2) The models are designed to be transparent. You can inspect them by running
`synth model inspect <model-id>`. This gives you visibility into exactly what
the model looks like. (Looking at the data which has been sampled is still a
WIP)

3) If something goes wrong and sensitive information is uploaded to the Synth
platform, you can easily purge all traces of it using `synth model rm <model-
id>`

~~~
sbecker
> We've built it so that our servers never have to see any sensitive
> information.

If true, this is a key selling point and should probably be somewhere near the
top of the homepage. I didn't get that point from reading any of the copy.

~~~
openquery
Thanks for the feedback. I'll make sure this is clear.

Why is this important for you?

~~~
imInGoodCompany
(not OP, but) from a European perspective, it means one less GDPR headache. At
the company I work for I know having PII going through a 3rd party server for
this kind of purpose would be a no-go.

------
cowb0yl0gic
This is almost identical to a project idea I've had banging around
for...um...6 years now. :) Glad to see someone is running with it, and also
that you have data privacy as a 1st-class citizen. One idea for the data
model: domain-specific descriptors (ex., not just a date, but a human
birthdate with specific parameters (think healthcare applications: pediatrics
vs general inpatient); this could be derived from sample/production data, but
when designing a new application, one might need to have finer control over
things like distribution (normal vs. skewed), min/max, etc.). If someone is
designing a new report for an existing application, but wants synthetic data
to use for dev/testing and UAT, the report "target data profile" may diverge
from historical production data in very specific ways (ex., introducing new
types/classes of products).

~~~
openquery
Thanks for your comment :)

These are all very good points. We are in the process of figuring out a
natural way to express user-specified semantic types. We have some ideas but
more on this coming soon!

------
lukeqsee
As billed, this is good stuff.

I have a client who has millions of rows of data in production—and we have to
run our test suite against production because they have no curated staging
data set. This would allow us to save multiple minutes every dev pipeline and
local test run (which are typically too slow to even run locally).

Looking forward to see you growing!

~~~
lukeqsee
This same client is a bit of a penny-pincher.

Are there any plans to open this up so we could host the infrastructure
ourselves and then pull a SQL dump or something along those lines after
running the CLI part? That would reduce your ongoing costs, and hopefully our
monthly fee? ($130 would be a very tough sell, even though I think the
business value is there.)

~~~
openquery
Hey!

So we are soon introducing the Firehose API. Basically this allows you to
point at an arbitrary database and fill it up with as much data as you need
from the model.

The Firehose should work for your use-case and be much more cost effective.

A hackier solution for right now: you can spin up a database and run a
`select * ...` against it.

~~~
lukeqsee
That's perfect! I'll keep an eye out for that.

~~~
openquery
If you can't wait, you can always run `synth model sample <model-id>
--output <some-directory> --sample-size <number-of-rows>`, which will
generate synthetic data directly into your directory as CSV files. You can
then ETL that into your database.

Hope this helps :)

------
joshAg
For testing i care a lot about repeatability.

Specifically, i'm interested in testing a web dashboard/app. So if I use synth
to populate my db, how would I know whether the backend's endpoints are giving
me good data? Is there a way to guarantee a specific set of test data each
time (so i can precompute what the values should be), or will i need to start
a test run by querying the data base a bunch to see what's in it to figure out
what i should expect the test results to be?

Also, is there a way to prepare data for import into an existing db? Right now
for some of our testing we have a single staging instance and we deconflict
multiple tests by including a randomized 8 character string in all the
relevant IDs for precomputed data we insert as part of the testing
initialization. For this testing it's not as important that the data is
repeatable, but the testers have a few different scenarios they want to test,
so I'd need a way to make a low-data, medium-data, and high-data test set
where the backing data fit within some ranges.

~~~
openquery
Hey!

> Is there a way to guarantee a specific set of test data each time

Absolutely. You can seed the model so that the data you get each time is
completely reproducible.
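In miniature, seeded generation looks like the following. This is a toy illustration of the idea, not Synth's API; `generate_users` and its fields are invented for the example.

```python
# Toy illustration of seeded generation (hypothetical, not Synth's API):
# the same seed always yields the same synthetic rows, so expected values
# precomputed for tests stay valid across runs.
import random

def generate_users(seed, n):
    rng = random.Random(seed)  # seed the generator once, up front
    return [{"id": i, "balance": round(rng.uniform(0, 1000), 2)}
            for i in range(n)]

run_a = generate_users(seed=42, n=5)
run_b = generate_users(seed=42, n=5)
assert run_a == run_b  # identical data on every run
```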

> For this testing it's not as important that the data is repeatable, but the
> testers have a few different scenarios they want to test, so I'd need a way
> to make a low-data, medium-data, and high-data test set where the backing
> data fit within some ranges.

This is a great use-case for Synth. With the upcoming Firehose API you can
point it at an existing database and specify how much synthetic data you want
to generate and pump into your db.

For now you can either create a database and write the ETL yourself, or run
`synth model sample <model-id> --output <some-directory> --sample-size
<number-of-rows>` to sample directly from the model into a directory of CSV
files and use that to load your database.
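The ETL step is a few lines in most languages. A minimal sketch (our example, with an in-memory sqlite database and invented data standing in for a real staging database and a real sampled CSV file):

```python
# Minimal sketch of loading a sampled CSV into a database (sqlite here,
# standing in for whatever your staging database is).
import csv
import io
import sqlite3

# Stand-in for a file produced by `synth model sample`.
csv_data = "id,name,balance\n1,Alice,120.50\n2,Bob,99.99\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, balance REAL)")

reader = csv.DictReader(io.StringIO(csv_data))
conn.executemany(
    "INSERT INTO users (id, name, balance) VALUES (?, ?, ?)",
    ((row["id"], row["name"], row["balance"]) for row in reader),
)
conn.commit()
```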

Feel free to get in touch if you would like to learn more :)

------
Tarrosion
This looks really cool. One question I have is about how much the synthetic
data can protect privacy. For example, my company has geospatial event data
from our customers. We're very protective of customer identities, and wouldn't
want to expose which cities our customers are in. If a model trained on our
database notices that the "longitude" column marginal distribution has a spike
around (just as an example) -71 degrees (longitude of Boston, where we're
located), then presumably the synthetic data would also include a bunch of
longitudes near -71 degrees? But there aren't that many cities at longitude
-71 degrees, so even the marginal distribution of the synthetic longitudes
would reveal something private about our data.

Second question is whether y'all support geospatial data? Both in the sense of
"the topology of latitudes and longitudes is not a plane" and "can the model
be trained on databases which encode geometries as a single column?"

~~~
openquery
That's a great question. I had to defer to my co-founder Damien, who is
spearheading the research side of the company.

The gist of it is that if the original data has a spike around -71, you will
indeed see a spike in the synthetic copy as well. What it boils down to under
the hood is a decision on the value of a continuous degree of freedom between
two pieces of information:

- the information that you have a significant number of users located in
Boston, and

- the information that any particular user is located in Boston.

At a high level, we are taking the view that for your synthetic data to be
realistic, it should spike around Boston if and only if most of your users
are in Boston. This also means you are not leaking information about any
given individual user - only the behaviour of the crowd, which is OK. Put
more simply, if you have a single user located in Boston and all the others
in, say, San Francisco, then your synthetic data should not end up having any
users in Boston at all.
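One simple way to realize that trade-off (our illustration of the idea, not necessarily Synth's mechanism) is to suppress categories below a frequency threshold before fitting the marginal:

```python
# Toy illustration of the trade-off described above (not Synth's actual
# mechanism): drop categories that are too rare before sampling, so a
# lone user's city never shows up in the synthetic copy.
from collections import Counter

def fit_marginal(values, min_count=5):
    """Keep only categories seen at least min_count times."""
    counts = Counter(values)
    kept = {v: c for v, c in counts.items() if c >= min_count}
    total = sum(kept.values())
    return {v: c / total for v, c in kept.items()}

cities = ["San Francisco"] * 999 + ["Boston"]  # one outlier user
marginal = fit_marginal(cities)

assert "Boston" not in marginal          # the lone user is suppressed
assert marginal["San Francisco"] == 1.0  # crowd behaviour is preserved
```

Choosing `min_count` is exactly the "continuous degree of freedom" mentioned above: higher thresholds leak less about individuals but also flatten real structure in the crowd.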

Currently we do not have any bespoke support for lat/lon data, beyond
treating it like any other float. It is planned for the next release though,
so check back in a couple of weeks and it'll be there.

------
silverlake
I implemented a similar system a while ago, including differential privacy.
The data at my firm was so messy the models failed miserably. You really need
an analysis phase that can tell a customer whether their data will work or
not. I.e. weird distributions, crazy foreign keys, difficult data types.

~~~
openquery
Yes - you're absolutely right in that data is a messy business.

Even in these early days we've seen crazy data types and constraints that
make our job of completely automating the process hard. But every instance of
this makes the product better, and that transfers to the next customer.

> You really need an analysis phase that can tell a customer whether their
> data will work or not

This is part of the roadmap; it's a non-trivial piece of engineering. In the
meantime you can try it for free and see if it works for you :)

------
iforiq
One use case I've seen for this is compliance. For SOC2 and other compliance
standards, I _think_ you aren't allowed to use production data for dev/staging
environments. An automated way to generate a database with synthetic data
would make life much better in such cases.

~~~
openquery
Absolutely! We spent a bunch of time in the data privacy space before pivoting
to Synth. Synth has utility as a dev tool but really does address exactly this
issue.

This also ties into GDPR and CCPA compliance - we think that as regulations
tighten (which seems almost inevitable) this sort of tooling will empower
developers to go quicker and focus on their applications instead of
compliance.

------
sqs
Anyone know how this compares to
[https://www.tonic.ai/](https://www.tonic.ai/)? Tonic lets you generate data
for safe local dev/testing, and they're also open source and have some big
customers.

------
tekkertje
Congrats, looks great and quite useful for the cases you've mentioned.

Only thing is that I initially thought the pricing was a bit high, but that
was because I thought there was no trial option. On the second visit found it
at the bottom of the page. Maybe an idea for an A/B test in the future to put
the trial option right below the pricing?

------
graerg
> Under the hood we use a combination of copulas and deep-learning models to
> model the distributions and correlations in your dataset (the intuition here
> is that it's much more useful for developers to have realistic data than
> just sample from a random number generator)

This is neat, but do users have the option of just doing vanilla RNG if they
want?

~~~
openquery
Hey - good question.

Not right now, but it shouldn't be hard to implement. Is there a specific
use-case this would address?

~~~
graerg
> it shouldn't be hard to implement

Yeah it seems like it's just a flat/un-informed probability distribution and
I'd guess your models are general enough to accommodate that.

A couple use cases come to mind:

1. If I have no data but want to test out various/arbitrary schemas with just
a bunch of dummy data. Of course, I could generate it myself (either with ad
hoc scripts or building a more general CLI that does this for me), but if
Synth just makes it a one-liner in the command line, that's appealing.

2. If it's too burdensome to convince others in my org that you've "built it
so that our servers never have to see any sensitive information". Even if I
trust you, I then have to make arguments for others to also trust you, when
really if all I need is some random data for an empty schema, then that's a
whole can of worms I don't need to open.

------
carlps
I'm curious how the model handles text data. Does it use the actual input text
from the source db to generate new synthetic data? If I have a column of a
bunch of sensitive text that I need sanitized, how will that appear in the
output? What is the risk of leaking something sensitive?

~~~
openquery
Thanks for the question!

For now text data will be marked as `categorical` or `text`. When you have
sensitive data you want to use `text` which will provide a lorem-ipsum type
generator.

If the model has classified that column with the semantic type `text`, no
information from the column should be leaked :)

------
sammyd56
Very interesting concept. A couple of initial observations:

* Creating models from a file borks on anything non-UTF8 (i.e. most legacy system outputs)

* `synth model inspect` output does not match the docs - how do I see the JSON?

~~~
openquery
> Creating models from a file borks on anything non-UTF8 (i.e. most legacy
> system outputs)

Yes - this is a WIP. Thanks for pointing it out

> `synth model inspect` output does not match the docs - how do I see the
> JSON?

Ah yes this is a typo in the docs. We'll fix it up. What you're looking for
is: `synth --format json model inspect <model-id> | jq`

Thanks for the feedback!

------
withinboredom
Does this work with unstructured data (such as cosmosdb?)

~~~
openquery
Not yet - but it's on our roadmap. Feel free to get in touch if you would like
this to be accelerated and we can find out more about your use case :)

------
AznHisoka
When you were working at the hedge fund, what type of datasets were you
typically testing? Can you give me some broad examples?

~~~
openquery
Unfortunately I can't go into specifics here.

What I can say is that these were alternative[0] datasets.

[0]
[https://en.wikipedia.org/wiki/Alternative_data_(finance)](https://en.wikipedia.org/wiki/Alternative_data_\(finance\))

------
cmdkeen
Looks very interesting and would be a huge win if we were able to use it - any
chance Oracle support is on your roadmap?

~~~
openquery
We haven't looked into the logistics of supporting Oracle yet. The fact that
Oracle is closed source makes everything a little bit harder. This was our
experience when adding MsSQL Server support.

Feel free to get in touch if you would like to discuss more about your use
case :)

------
hans_castorp
Can this be installed on premise? Especially in the light of GDPR it might not
be possible to do something like this with data stored "on the outside" (even
if it's only a "model").

I know for sure, our customers wouldn't allow this.

~~~
openquery
Hey - great question!

We've been careful to design Synth such that the model doesn't contain any
sensitive information. That being said I completely understand where you're
coming from.

We do offer the enterprise version for on-prem deployments. Basically, if you
have a Kubernetes cluster you can run Synth on-prem :)

------
trulala
How does it compare to Delphix?

~~~
openquery
It's hard to say. Delphix is quite opaque about what exactly they do, and
finding out requires booking a demo.

From what we have seen, Delphix is very much focused exclusively on large
enterprise, and by extension does not look like a tool which is focused on the
developer experience (could be wrong here).

We are much more focused on addressing the engineers in businesses - at the
end of the day it's developers who will be using this tooling.

------
vosper
Offtopic: Was 2020 YC's Year of Developer Tooling or something? Seems like
there have been lots of launches for YC-backed dev-tool startups in the past
few weeks.

~~~
dang
There have been, but YC has always funded lots of those. I suspect it's a
random cluster in the startup stream. More are coming, too. This is Launch HN
season because Demo Day is next week.

------
svsaraf
For those of you who feel this solution is a bit too complex for your
workflow, there are a couple of lightweight alternatives, including Sudopoint
([https://www.sudopoint.com](https://www.sudopoint.com)) which lets you
specify what you need and download a CSV, in and out in a few seconds.

To the Synth team, awesome product! Great to see that more tools are getting
built to help testing / QA workflows. I think this is a huge area for the
future. Welcome to the competition. :)

[Disclaimer] I'm the (solo, bootstrapped) founder of Sudopoint

------
sleepygardener
Sorry to say this, but the name "synth" is terribly misleading and generic.
The word "synth" is widely used for the electronic musical instrument, the
synthesizer.

~~~
openquery
I wouldn't say it's misleading but I see where you're coming from. I play the
piano so this was what inspired the name.

It turns out that picking a name for a startup/product which is representative
of what you do is hard!

