
Launch HN: Syndetic (YC W20) – Software for explaining datasets - stevepike
Hi HN,

We're Allison and Steve of Syndetic (https://www.getsyndetic.com). Syndetic is a web app that data providers use to explain their datasets to their customers. Think ReadMe but for datasets instead of APIs.

Every exchange of data ultimately comes down to a person at one company explaining their data to a person at another. Data buyers need to understand what's in the dataset (what are the fields and what do they mean) as well as how valuable it can be to them (how complete is it? how relevant?). Data providers solve this today with a "data dictionary": a meta-spreadsheet explaining a dataset, shared alongside some sample data over email. These artifacts constantly go stale as the underlying data changes.

Syndetic replaces this with software connected directly to the data being exchanged. We scan the data and automatically summarize it through statistics (e.g., cardinality), coverage rates, frequency counts, and sample sets. We do this continuously to monitor data quality over time. If a field gets removed from the file, or goes from 1% null to 20% null, we automatically alert the provider so they can take a look. For an example of what we produce, but on an open dataset, check out the results of the NYC 2015 Tree Census at https://www.getsyndetic.com/publish/datasets/f1691c5d-56a9-47d4-8df7-a373a2894c8b.

We met at SevenFifty, a tech startup connecting the three tiers of the beverage alcohol trade in the United States. SevenFifty integrates with the backend systems of 1,000+ beverage wholesalers to produce a complete dataset of what a restaurant can buy wholesale, at what price, in any zip code in America. While the core business is a marketplace between buyers and sellers of alcohol, we built a side product providing data feeds back to beverage wholesalers about their own data. Syndetic grew out of the problems we experienced doing that. Allison kept a spreadsheet of our data schema in Dropbox, which was very difficult to maintain, especially across a distributed team of data engineers and account managers. We pulled sample sets ad hoc, and ran stats over the samples to make sure the quality was good. We spent hours on the phone with our customers putting it all together to convey the meaning and the value of our data. We wondered why there was no software out there built specifically for data-as-a-service.

We also have backgrounds in quantitative finance (D. E. Shaw, Tower Research, BlackRock), at large purchasers of external data, where we've seen the other side of this problem. Data purchasers spend a lot of time up front evaluating the quality of a dataset, but they often don't monitor how the quality changes over time. They also have a hard time assessing the intersection of external datasets with data they already have. We're focusing on data providers first but expect to expand to purchasers down the road.

Our tech stack is one monolithic repo split into the frontend web app and backend data scanning. The frontend is a Rails app and the data scanning is written in Rust (we forked the amazing library xsv). One quirk is that we want to run the scanning in the same region as our customers' data to keep bandwidth costs and transfer time down, so we're actually running across both GCP and AWS.

If you're interested in this field you might enjoy the paper "Datasheets for Datasets" (https://arxiv.org/pdf/1803.09010.pdf), which proposes a standardized method for documenting datasets, modeled after the spec sheets that come with electronics. The authors write that "for dataset creators, the primary objective is to encourage careful reflection on the process of creating, distributing, and maintaining a dataset, including any underlying assumptions, potential risks or harms, and implications of use." We agree with them that as more and more data is sold, the chance of misunderstanding what's in the data increases. We think we can help here by building qualitative questions into Syndetic alongside the automation.

We have lots of ideas for where we could go with this, like fancier type detection (e.g., is this a phone number?), validations, visualizations, anomaly detection, stability scores, configurable sampling, and benchmarking. We'd love feedback, and to hear about your challenges working with datasets!
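To make the scan concrete, here's a toy sketch of the kind of per-field summary and drift alert described above (purely illustrative Python, not our implementation; the field names and the alert threshold are made up, and the real scanner is a Rust fork of xsv):

```python
from collections import Counter

def scan_columns(rows):
    """Summarize dict rows: per-field null rate, cardinality, top values."""
    total = len(rows)
    stats = {}
    for field in rows[0]:
        values = [r.get(field) for r in rows]
        non_null = [v for v in values if v not in ("", None)]
        stats[field] = {
            "null_rate": 1 - len(non_null) / total,
            "cardinality": len(set(non_null)),
            "top_values": Counter(non_null).most_common(3),
        }
    return stats

def drift_alerts(old, new, null_jump=0.05):
    """Flag removed fields and null-rate jumps between two scans."""
    alerts = []
    for field, prev in old.items():
        if field not in new:
            alerts.append(f"{field}: field removed")
        elif new[field]["null_rate"] - prev["null_rate"] > null_jump:
            alerts.append(f"{field}: null rate jumped")
    return alerts
```

Running two scans over successive deliveries and diffing them is what turns a static data dictionary into a monitor.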
======
dodata
Neat! Congrats on the launch - the demo is very helpful to understand the
product. Having consumed long, painful PDF data dictionaries in the past, this
is a big breath of fresh air. Excited to see where Syndetic goes!

For me, the most painful part of working with 3rd party data was actually
figuring out the "match rate" to internal data. For example, you might be a
consumer-facing company who hopes to add more context to your internal data by
pulling in 3rd party information for existing clients. To match your internal
data to a 3rd party dataset, you usually match on some hashed email (or
similar identifier) to see what percentage of your consumer records will be
available in the 3rd party dataset. Have you thought about something like that
with your tool? Maybe you can upload a sample of hashed emails and see how
different match rates pan out.

~~~
stevepike
Yes! This has come up across multiple industries and is probably the feature
on our roadmap I'm most excited about. The implementation is tricky but
customers definitely care about the intersection of a provider's data with
their own. Some more sophisticated providers have internal tools for
generating things like sample sets customized to a prospect.

We're going to be adding a feature where we can flag fields as identifying
keys and index them. We'll start with a simple intersection count ("upload 100
stock tickers, see how many records match"). Then we'll add an interactive
feature to let a prospective customer generate all of the stats in the
dictionary scoped down to the subset of data they care about. It's important
to be able to answer questions like "for the 100 tickers I care about, how
many NULLs are there for this other column?".

Maybe someday we'll even get into the more general record linkage problem when
there's no reliable matching key.
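As a toy sketch of that intersection count and the scoped stats (hypothetical names; not the actual implementation, which would index the keys rather than scan):

```python
def match_rate(provider_keys, prospect_keys):
    """Fraction of the prospect's identifiers present in the provider's data."""
    hits = set(prospect_keys) & set(provider_keys)
    return len(hits) / len(set(prospect_keys))

def scoped_null_count(records, key_field, wanted_keys, column):
    """Within records matching the prospect's keys, count NULLs in `column`."""
    subset = [r for r in records if r[key_field] in wanted_keys]
    return sum(1 for r in subset if r.get(column) in ("", None))
```

The second function is the "for the 100 tickers I care about, how many NULLs are there in this other column?" question.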

~~~
carterehsmith
That sounds very useful.

I am also super impressed that you managed to present your product without
mentioning "big data" or "machine learning" or AI, given that anyone who
does anything these days crams those big words in.

That's good. Good luck.

~~~
stevepike
Thanks, I'm with you on the big data buzzwords and trying to avoid
overengineering things (one of my favorite HN posts ever is
[https://news.ycombinator.com/item?id=8908462](https://news.ycombinator.com/item?id=8908462)).

Right now the data scanning is just a fork of
[https://github.com/BurntSushi/xsv/](https://github.com/BurntSushi/xsv/)
running in a container with plenty of ram, and we've handled files in the
~20GB range with no problem. I think we could actually scale up to ~100GB
files with xsv, which seems to cover 99%+ of the data providers we're running
into. Providers might be processing massive amounts of data, but the eventual
deliverable they share with their customers is rarely too big for one machine.

That said, we will probably move away from our super simple stack towards
running a Spark cluster in the medium term. Not for "big data" (actually I
expect Spark to have higher latency and possibly to be slower for a moderate
sized dataset than the rust solution) but because we want to be able to run
multiple parallel scans over the datasets for upcoming features. Some
of that will involve a DAG of dependencies (e.g., do type detection first to
figure out fields that are categorical, then generate visualizations where the
plots are grouped by whatever the values are of the categorical field). There
are also a bunch of nice libraries in the spark world for more comprehensive
stats.
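A toy version of that two-pass dependency (illustrative only; a real scan runs over files, not in-memory rows, and the categorical threshold here is made up):

```python
from collections import Counter

def detect_categorical(rows, max_distinct=10):
    """Pass 1: treat fields with few distinct values as categorical."""
    return [f for f in rows[0]
            if len({r[f] for r in rows}) <= max_distinct]

def grouped_counts(rows, group_field, value_field):
    """Pass 2, which depends on pass 1's output: per-category frequencies,
    the kind of summary a grouped plot would be drawn from."""
    out = {}
    for r in rows:
        out.setdefault(r[group_field], Counter())[r[value_field]] += 1
    return out
```

The DAG part is just that pass 2 can't be scheduled until pass 1 has decided which fields are worth grouping by.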

------
loganfrederick
This (or something like it) makes a lot of sense to me. I've been at multiple
organizations where there have been efforts to create these "data
dictionaries" explaining the meaning of the data, especially when the schemas
or APIs are not well designed.

But then manually writing documentation is obviously tedious and can typically
only be written by the data team that knows the underlying data well, which is
not always the best use of their time.

I'll definitely be following Syndetic and hope they can help crack this
problem.

~~~
aswihart
(This is Allison) - thank you! It's interesting for us to see how the problem
is handled at different types of organizations, because as you point out the
data team knows the underlying data best but is not often customer-facing.

------
gvv
Great concept, although worried about uploading datasets with sensitive data
to your cloud.

I usually use Pandas Profiling to accomplish EDA tasks:
[https://github.com/pandas-profiling/pandas-profiling](https://github.com/pandas-profiling/pandas-profiling)

------
knes
Super great idea. I've been talking to data scientist/eng in B2B SaaS space on
how we should bring best practices like that to the Sales/Marketing/Business
ops world too.

What would you say are the differences between Syndetic and qri.io? (Not
affiliated in any way.)

~~~
stevepike
Thanks for the kind words! I hadn't seen qri.io before, but from a read of
their website I think it's broadly similar to Dolt
([https://www.liquidata.co/](https://www.liquidata.co/)) which is git for
datasets. Kaggle has a similar data hub at
[https://www.kaggle.com/datasets](https://www.kaggle.com/datasets) that's not
open source but is in the same space.

Our approach is to scan the datasets wherever they currently live in
production rather than being a new way to store the data. The industry seems
to have settled on FTP and S3 for now, and we think it's important that we
connect to the same exact thing a customer would access. That lets us keep the
dictionary up to date automatically without the data providers needing to
change their storage infrastructure.

------
SamuelAdams
I wonder if you could combine your service with freely available datasets like
Google Dataset Search [1] to demo what a large amount of various datasets
would look like under your service.

[1]:
[https://datasetsearch.research.google.com/search?query=puppi...](https://datasetsearch.research.google.com/search?query=puppies&docid=OhxQw53Vy%2FmwMc68AAAAAA%3D%3D&filters=WyJbXCJ1cGRhdGVkX2RhdGVcIixbXCIzeVwiXV0iXQ%3D%3D&property=dXBkYXRlZF9kYXRl)

~~~
aswihart
We would love to do that at some point. There is tons of open data out there,
but not a lot of it has useful descriptions at the field level (only the
dataset level), so it would take some time to put a collection together that
is robust. Also, the demo on our splash page only shows the artifact we
create (i.e., the published dictionary). The other side of the web app is
a management layer to bundle datasets into collections, annotate fields,
configure sample sets, and share the artifacts. We'll work on fleshing out our
demo to show what the system looks like when there are hundreds or thousands
of datasets.

------
bitprincess
I have been doing this for my company. When there is a sector you have data
on, there are also external factors, like how much of the market your db
covers for prices, etc., and NLP on different items and metrics for different
types of data in your offerings. What do you do when the actual field names
are very general (i.e., item, metric, region, unit)?

~~~
aswihart
We actually built in the concept of "display names" for exactly this purpose.
You want to keep the actual field name in the schema so that ingestion works
properly, but you also want to describe the field as helpfully as possible. In
another comment, Steve mentioned that we are trying to tackle the first
problem you mention (how much overlap is there between my dataset and other
external factors) with the concept of creating dataset intersection or
relevance scores.

------
mason55
Any plans on a self-hosted version? Either regular on-prem or something that
can be privately hosted on e.g. AWS, like Databricks or Snowflake?

I love the idea but we could never expose most of our data to a public SaaS.
There are all kinds of restrictions we have on things like data privacy and
data needing to stay in specific regions.

~~~
stevepike
Yeah, on-premise or at least private cloud has come up a few times. Beyond the
data privacy and licensing requirements it'd also just be plain faster in some
cases. We haven't offered it yet just because we're a small company and are
rapidly adding features. Our backend is mostly running in k8s so I don't think
it'll be a _huge_ technical rewrite to get it running in private cloud.
Frankly I just don't have experience supporting software running outside my
control and want to make sure we take the time to do it correctly.

~~~
mason55
Cool. We're an enterprise B2B SaaS company that does a ton of data interchange
with our customers. Each project burns a bunch of engineering hours and
calendar time because customers send us data that doesn't match the spec or we
just don't have anyone outside the engineering org who can confirm that the
data looks how it's supposed to.

Something which simplified the process of analyzing sample data and provided a
view of it to a non-technical user would be very valuable. But as I mentioned
in my previous comment we could never expose any of our data outside of our
private infrastructure, so we can't use this until there are other options for
hosting.

------
aripickar
This looks really cool! I've worked with large data sets before and one of the
most annoying things was when they were split up into multiple files. Do you
currently support statistics across multiple datasets?

Also, how did you come up with the pricing? $500 and "call us" seems like a
lot per month.

~~~
stevepike
Thanks! We can definitely combine multiple files into one dataset so long as
they share the same fields. We've got one customer that keeps their data in an
S3 bucket as one JSON file per record, so for them we're scanning ~480K files
to construct the stats. If you've got multiple different datasets we've got a
concept of "collections" for organization.
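The thing that makes scanning ~480K files tractable is that the per-file stats are mergeable, roughly like this (illustrative Python sketch, not our Rust code; counts add and distinct sets union, so files can be scanned in any order or in parallel):

```python
def scan_file(rows):
    """Partial stats for one file: row count, nulls, distinct values per field."""
    stats = {}
    for f in rows[0]:
        vals = [r.get(f) for r in rows]
        stats[f] = {"rows": len(vals),
                    "nulls": sum(v in ("", None) for v in vals),
                    "distinct": {v for v in vals if v not in ("", None)}}
    return stats

def merge(a, b):
    """Combine two partial scans of files sharing the same fields."""
    return {f: {"rows": a[f]["rows"] + b[f]["rows"],
                "nulls": a[f]["nulls"] + b[f]["nulls"],
                "distinct": a[f]["distinct"] | b[f]["distinct"]}
            for f in a}
```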

We came up with the pricing strategy based on conversations with early
customers. We want to be able to say yes to integrations with whatever system
they're using to store their data so we need flexibility at the early stage.

------
tzm
This reminds me of readme generator for Frictionless 'Data Package'
[https://frictionlessdata.io/](https://frictionlessdata.io/)

~~~
aswihart
Interesting, I just came across Frictionless recently as part of the FISD
group that's trying to implement a set of standards across the financial
industry for documenting datasets.

------
pplonski86
What data types do you plan to support? Have you considered supporting
datasets with images+labels used in computer vision? How would you like to
handle them?

Are you going to support data labeling tasks?

~~~
stevepike
We've talked with some image data providers who are creating datasets for use
in machine learning. We're not running any of our own models on image data
right now, so I think the place we can be most useful is in summarizing
metadata about the images in cases where the dataset isn't just an image file.
For example, if the dataset is images of intersections plus bounding box
coordinates of street signs we can tell a prospective consumer what % of
images have a street sign in them. If you have a little more metadata (e.g.,
what time of day the photo is) the stats get much more useful out of the box.

I don't understand the data labeling question. How would you imagine us
getting involved there?

------
wefarrell
Makes a ton of sense to me, I really wish the datasets that I work with had
this or some kind of equivalent.

Would be really great if you could generate a fake dataset of similar size for
testing purposes. It would take some thinking but it would be really useful
for building a consumer before getting the full dataset.

~~~
aswihart
I missed this comment earlier - we can pilot with sample/fake data to give the
user a sense of how the system works and what workflows within it work best
for their company. The versioning/diffs wouldn't make much sense, but that's
probably ok.

------
coolsank
Very cool product! I've worked on much smaller datasets with pandas, and
their inbuilt profiling report method can slow things down to a crawl! Hoping
to see more from you guys :)

~~~
stevepike
A very reductionist description of our company, one that Allison hates when I
use it, is "csvstat but on the internet" :-). I think the problem of
auto-summarizing datasets has hit a kind of local maximum in what pandas
dataframe summaries (csvstat is a similar Python tool) can do on one machine.
We will be able to
add much fancier things like sophisticated type classification (e.g., is this
field a stock ticker) without burning your CPU.

~~~
coolsank
Hah! But this is a very interesting area. You're right about auto-summarizing
becoming a problem these days with the usage of larger datasets. Data
versioning is also starting to become a larger problem, and I saw that you
guys have already addressed it in your enterprise product. Hoping to see some
sort of API-like version for comparing data troves from different timelines
in the future.

------
shostack
What industries do you see this being most useful to?

~~~
aswihart
We think data-as-a-service is a new and growing category, and this includes
startups we would consider "pure" data companies (e.g. a company that sells
data on airfares across the web) and companies that have been around for
decades selling csvs that they deliver over FTP (e.g. a giant company like ADP
that sells payroll processing data). Based on our backgrounds we have a fair
amount of experience in the alternative data space, which is basically any
data that might have some signal to a hedge fund that isn't market data. I'm
finding that the providers in that space are interested in expanding their
customer base to corporates (e.g. Walmart, McDonalds) whereas the providers
currently selling to corporates are interested in expanding into alternative
data.

------
adampgreen
Awesome! Congrats on the launch! Excited to check it out.

------
gkoberger
Hey! I’m Greg of ReadMe... think Syndetic but for APIs rather than datasets!

Congrats on the launch :) I’ll find you at Alumni Demo Day, or feel free to
reach out if I can help with anything! And welcome to the war on PDFs :)

~~~
aswihart
This made our day : ) Thank you!

------
nocitrek
Does this support more complex data structures - for example parquet files?

~~~
stevepike
The manual data upload is restricted to well-formed CSV files w/ headers on
the first row right now. For the "contact us" higher tier we'll handle any
file format that we can extract columns from, so parquet would be fine.

