
Show HN: Singer – Simple, Composable Open Source ETL - jakestein
https://www.singer.io/
======
jakestein
Hi, I'm the CEO of Stitch, the company behind Singer. Here's our blog post
with more information:
[https://blog.stitchdata.com/introducing-singer-simple-composable-open-source-etl-a4a6da7eac19](https://blog.stitchdata.com/introducing-singer-simple-composable-open-source-etl-a4a6da7eac19)

Singer is an open-source standard for writing scripts that move data between
databases, web APIs, files, queues, and just about anything else you can think
of. Lots of companies build ETL scripts to move their data, and there's a huge
amount of rework that happens from company to company. We believe that
developers should spend less time moving their data and more time using it.

We're open sourcing 12 of our integrations (with more to come) so that they
can be used in other applications, and we're excited to see what the community
builds. Let me know if I can answer any questions.
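
For anyone curious what the wire format looks like: a tap is just a program that writes JSON messages to stdout, one per line, and a target reads them from stdin, so the two compose with an ordinary Unix pipe (`tap | target`). Here's a minimal sketch in Python; the `users` stream and its fields are made up for illustration, but SCHEMA, RECORD, and STATE are the message types from the spec:

```python
import json

def tap_messages():
    """Yield Singer messages for a tiny hypothetical 'users' stream."""
    # SCHEMA describes the shape of the records that will follow.
    yield {"type": "SCHEMA", "stream": "users", "key_properties": ["id"],
           "schema": {"type": "object",
                      "properties": {"id": {"type": "integer"},
                                     "name": {"type": "string"}}}}
    # RECORD carries the data itself.
    yield {"type": "RECORD", "stream": "users",
           "record": {"id": 1, "name": "Ada"}}
    # STATE records how far the tap got, so the next run can resume.
    yield {"type": "STATE", "value": {"users": 1}}

# One JSON message per line on stdout is the whole protocol.
for message in tap_messages():
    print(json.dumps(message))
```

Because the interface is just newline-delimited JSON over a pipe, any tap can be pointed at any target without either knowing about the other.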

~~~
_ar7
What's the motivation for the schemas? Are they just verifying that the APIs
haven't changed and are still returning what you'd expect them to?

I read through [https://github.com/singer-io/getting-started/blob/master/SPEC.md#schema](https://github.com/singer-io/getting-started/blob/master/SPEC.md#schema),
but I'm trying to better understand why they're necessary.

~~~
cm
There are a couple reasons why we included schemas in the spec:

\- JSON doesn't have a robust set of data types, and specifically lacks a
datetime/timestamp type. With a schema, Taps can, for example, denote fields
in the JSON that contain datetimes represented as strings, and targets can
then convert those to proper datetimes and handle them accordingly.

\- Dealing with unstructured or flexibly-structured data is _hard_. Requiring
a schema forces a Tap author to think about the structure of the data up
front. By validating each data point against a schema, the Tap author can
more quickly identify nuances in the data set, like missing fields, nullable
fields, and mixed-type fields, and either decide to clean them out of the
data (if appropriate) or provide the right schema to inform downstream
applications about them. Identifying and handling these problems requires an
understanding of the source data set, so it is best done as close to the data
source as possible.
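
To make the datetime point concrete, here's a rough sketch of what a target could do with a field annotated as a date-time in the schema. The schema, field names, and coercion helper below are hypothetical illustrations, not code from the Singer libraries:

```python
import json
from datetime import datetime, timezone

# Hypothetical schema a tap might emit: JSON has no datetime type, so the
# field is a string annotated with JSON Schema's "date-time" format.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "updated_at": {"type": "string", "format": "date-time"},
    },
}

def coerce_record(record, schema):
    """Convert any string field the schema marks as "date-time"
    into a real datetime object (a sketch of target-side handling)."""
    out = dict(record)
    for field, spec in schema["properties"].items():
        if spec.get("format") == "date-time" and isinstance(out.get(field), str):
            # Normalize a trailing "Z" so fromisoformat accepts it.
            out[field] = datetime.fromisoformat(out[field].replace("Z", "+00:00"))
    return out

row = coerce_record({"id": 7, "updated_at": "2017-03-21T18:30:00Z"}, schema)
```

Without the schema, the target would just see an opaque string and load it as text; with it, the value can land in a proper timestamp column.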

------
jswny
Looks very cool, and it definitely addresses a problem that not only many
organizations but also many individuals have. I know I have personally
struggled with moving my data out of various services many times before.

The question I have for you is this: if the API (or however a tap extracts
information) for a given service changes, who is responsible for updating the
tap, and is there any way to verify that a tap will extract the data
correctly and that said tap isn't assuming an old version of the API?

(Sorry if I'm misunderstanding the way that Singer or any of its associated
services work)

~~~
jakestein
That's a great question. For any Singer integrations that we include in the
Stitch product, we (Stitch, Inc.) will ensure they stay up to date with the
most current version of the API.

That may not be the case with 100% of the community-built and maintained
integrations, but our goal with the Singer Slack community and mailing list
is to connect people using these taps so that they can validate which are
working well and which need additional work.

------
nwellinghoff
This has been attempted before by many companies and many standards. No
standard is robust enough, and no company strong enough, to get everyone to
use it. I really, really hope it gains adoption though!

~~~
jakestein
Thanks nwellinghoff! I'm sure that there will be cases where Singer is not the
best tool for the job, but hopefully it will still make a lot of engineers'
jobs easier.

I'd love to know which previous standards you're referring to, could you
elaborate?

------
venkasub
How is it different from logstash for syncing between 2 data sources? Or is
Singer primarily meant for data extraction from SaaS sources?

------
fudged71
Are there similar products on the market? How is ETL typically done today?

~~~
woqe
We are using Apache NiFi[1] to handle a lot of our ETL use cases.

We have HTTP endpoints set up to receive data from our ERP's accounting system
to send data to Concur and to update customers' Lawson punchout ordering
systems with shipment information. The 'E' is an HTTP post with an XML
payload. The 'T' consists of using the payload to query other databases to
build the 'L' payload, and the 'L' is an HTTP post to the consumer's
endpoints.

Further, we have NiFi handling HL7 messages inside hospitals. The 'T' is the
real winner here. NiFi has a built-in transformer for HL7 messages, which
makes them a breeze to work with.

EDIT: I wanted to add that we've also used the EventHub processor to connect
NiFi to Azure's services, and it has been rock solid for us.

NiFi's data provenance, flowfile/attribute system, back-pressure settings,
auto-queueing, and retrying capabilities all make it very reliable and
robust.

1\. [https://nifi.apache.org/](https://nifi.apache.org/)

------
akalitenya
Do you plan to open-source all of Stitch's integrations?

~~~
jakestein
The short answer is yes.

The longer answer is that it may take us a while to get to 100% open source,
but that's the direction we're moving. All of our new integration development
will be open source and be part of the Singer project. Our original
integrations were written in a different framework and couldn't be run
independently of Stitch, and it's a nontrivial amount of work to convert them
to the Singer format.

We included several of our existing integrations as part of this launch, and
we'll definitely be adding more of them as well as new integrations.

