
Apache NiFi - awjr
https://nifi.apache.org/
======
wpietri
Dear Apache NiFi people: almost every technology featured on HN could be
described as "an easy to use, powerful, and reliable system to process and
distribute data." Please consider using a tagline that will tell people
something about your project. E.g., intended audience, chosen problem space,
desired benefit.

You'll note that a lot of the discussion here is, "What is it? Is it like X?
Is it good for X?" Those are great questions to answer on the home page.

~~~
norswap
> almost every technology featured on HN could be described as "an easy to
> use, powerful, and reliable system to process and distribute data."

Really? I feel like it's more: "usable, powerful and reliable: choose one ...
if you're lucky".

~~~
ne0n
The point is that it doesn't say anything about what it does. A programming
language could be described as an "easy to use, powerful, and reliable system
to process and distribute data." It processes and distributes data? That just
sounds like a computer.

------
drivers99
I was thinking, "wow, reading in data, processing it, and outputting data.
Reminds me of Interface Engines in healthcare." Then I saw "HL7" on the
diagram.

That's what I did for a long time. HL7 carries information such as
"Admissions, Discharges, and Transfers (ADT)" which is sent from the Hospital
Information System (HIS) to other departments' systems (radiology, pharmacy,
medical records, and possibly dozens more) and vice versa (lab results back to
the HIS, for example).

HL7 interface engines unpack the HL7 data into separate segments, fields, and
subfields; identify the type/subtype of the message; route it to various
destinations based on the type and any other fields; map the data and reformat
as necessary for the destination; encode it back into HL7; and send it. They
also need to queue messages as needed, and re-transmit, set aside, or stop
sending a message depending on what they get back in the form of response
codes. You can see some of those steps in the diagram. They also handle
input/output in the form of TCP/IP ports, reading/writing files (for batch
processing), or pretty much any other method you can use to send and receive
data (serial ports, in some old cases).
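
For the curious, the parse/route step looks roughly like this. A toy Python
sketch, not how any particular engine does it (plain string handling rather
than a real HL7 library; the message, routing table, and field positions are
made up for illustration):

    # Toy HL7 v2 router: split pipe-delimited segments, read the message
    # type out of MSH-9, and fan out to per-destination handlers.
    # Real engines also handle escaping, repetitions, ACK/NAK codes,
    # queueing, and per-destination field mapping/reformatting.

    RAW = ("MSH|^~\\&|HIS|HOSP|RAD|HOSP|20150101120000||ADT^A01|MSG0001|P|2.3\r"
           "PID|1||123456||DOE^JOHN||19700101|M\r")

    def parse(raw):
        """Split a message into {segment_name: [fields]} (toy: assumes unique segments)."""
        return {line.split("|")[0]: line.split("|")
                for line in raw.strip("\r").split("\r")}

    # Routing table: (type, trigger) -> destination systems.
    ROUTES = {("ADT", "A01"): ["radiology", "pharmacy", "medical_records"]}

    def route(raw):
        msg = parse(raw)
        # MSH-9 is index 8 here because MSH-1 is the separator character itself.
        mtype = tuple(msg["MSH"][8].split("^"))  # e.g. ("ADT", "A01")
        for dest in ROUTES.get(mtype, []):
            # A real engine would map/reformat fields per destination here.
            print(f"-> {dest}: {'^'.join(mtype)}, patient={msg['PID'][5]}")

    route(RAW)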

I'm kind of curious how (and how easily) you define which fields in the
input for a given event go to the output (and apply certain transformations to
those fields if needed).

~~~
ubertaco
Yep, this reminded me of a not-crappy version of Mirth Connect.

~~~
eclipxe
Hey, what's wrong with Mirth? I've long since moved on, but I was one of the
original engineers building Mirth. I wrote a lot of the HL7 parsing logic and
the TCP transports.

~~~
wpietri
User interview protip: if you actually want to know what's wrong with your
product, never tell people that it's your product.

At my last startup, we did user tests every week, and my cofounder was very
careful never to let on which test items were made by us. We didn't have our
name on the door or on the buzzer. Nothing with our logo was visible between
the entryway and the user interview room. And when he started showing
products, he'd always start with somebody else's.

It was great. We got some incredibly honest feedback (sometimes brutally so)
on what we were building. It helped us kill a lot of bad directions early.
Most people just don't want to tell you that your baby is ugly, but they'll
happily dish if they think it's somebody else's.

~~~
mlonkibjuyhv
Any tips for when the user definitely knows it's your product? I can't think
of anything besides starting by butchering some aspect of it. Then again, that
might trigger an empathetic response and end with even less actionable
feedback.

~~~
wpietri
I think people can't un-know something, so I'd try very hard to get test
subjects in a way where they don't think the people they are talking to are
the ones who make the product.

For example, you could set up a fake market research org, and have them say
they are "conducting a study on [your market] and are looking for users of
products like [competitor 1], [competitor 2], and [your product]". Then for
the user testing sessions you could rent a conference room from somebody like
Regus, or even rent a user testing lab. Then during the tests, make sure to
start with general questions and test (or show screens from) the other
products before you get to testing yours.

I'm not sure if those specifics will work in your environment, but I hope you
see what I mean.

------
nanocyber
Just in case you don't know the interesting history of NiFi:

[http://www.zdnet.com/article/nsa-partners-with-apache-to-
rel...](http://www.zdnet.com/article/nsa-partners-with-apache-to-release-open-
source-data-traffic-program/)

Thanks, NSA!

~~~
kitd
Thank you. I was looking all over the project's site to see who was behind it.
Not a peep, but I guess they're good at that!

~~~
rectang
When people participate in Apache projects, it is emphasized that they do so
as _individuals_. Affiliations are downplayed. If a contributor takes a new
job at a different company, their participation within the project is
completely unaffected.

Furthermore, as a 501(c)(3), the ASF is limited in what it can do with
donations and very, very rarely accepts targeted donations aimed at a specific
project -- it's just not worth the hassle or risk. So while outside entities
might contribute by sponsoring people to work on the project, no project at
the ASF has someone "behind" it in the sense of direct funding.

This is all part of maintaining project independence.

Hope this helps to explain why it is not always easy to discover "who is
behind" an Apache project. :)

More info: [https://www.apache.org/foundation/how-it-
works.html#hats](https://www.apache.org/foundation/how-it-works.html#hats)

~~~
kitd
Sure, I understand that. And you're right, it is important for project
independence.

But it is usually possible to deduce at least something about contributors
from email addresses, GitHub accounts, and other contact info. In this case
there was _absolutely nothing_. It was almost as if they were experts at
covering their tracks! I just thought it was quite funny.

------
mystique
We were wowed by NiFi when we looked at it originally. Once we put it in a
local env to build test flows, we found that the most complex tasks for data
flows were fairly simple to set up, while the simplest tasks ended up
requiring complex workarounds because the system was trying to be extra smart
about what it was doing. In the end, we decided not to use it in production
due to the 80/20 split of simple/complex tasks we had.

Hopefully it's better than it was in the Jan/Feb timeframe.

~~~
mpayne
Interesting. Any details on the things that you found particularly easy or
particularly difficult to do with it?

~~~
mystique
Simple use cases that were more complicated:

\- We wanted to collect files from various locations and push them into HDFS.
NiFi seemed like a good way to build a self-service setup. But once we set up
the source and sink, if we read + processed + removed a file from the
destination, NiFi copied it again. We had no way to guarantee that source
files were removed as soon as they were copied to the destination. The
components we used were GetFile, PutFile, and some options in the Conflict
Resolution settings for those components.

\- Inspect the file name, run a script to generate new subdirectories for
partitions, and place the file in the appropriate partition. Attaching a
script was easy; changing the destination path on the fly was not (see the
sketch at the end of this comment).

Some complex cases -- there are other ways to do these, but setting them up in
NiFi was a breeze:

\- Set up file collection from FTP, SFTP, and file copies from 20+ locations.
This was painless, a few minutes per source.

\- Add REST interaction within a data flow.

\- Read CSV files and convert them to Avro/SequenceFiles.

\- Read files and route parts of the data to different processors.

We also ran into some strange bugs where NiFi got stuck in some kind of loop
and kept copying data over and over again.

We were able to do all this testing in 2 weeks. Give it a shot, it might work
out for your use case.
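
For the partition case above, a rough sketch of one approach: an ExecuteScript
processor (Jython body below) that computes a path attribute, which a
downstream PutFile/PutHDFS can then reference via the expression language. The
filename pattern and attribute name are invented, and whether ExecuteScript
was available back then I'm not sure -- treat it as a sketch:

    # Jython body for NiFi's ExecuteScript processor. ExecuteScript binds
    # `session`, REL_SUCCESS, and REL_FAILURE for you. Assumes filenames
    # like "events_20150630.csv"; adjust the slicing for real inputs.

    flowFile = session.get()
    if flowFile is not None:
        name = flowFile.getAttribute('filename')   # e.g. events_20150630.csv
        datepart = name.split('_')[1][:8]          # "20150630"
        partition = 'year=%s/month=%s/day=%s' % (
            datepart[0:4], datepart[4:6], datepart[6:8])
        # Stash the computed path in an attribute; a downstream PutFile or
        # PutHDFS can then use ${partition.path} in its Directory property.
        flowFile = session.putAttribute(flowFile, 'partition.path', partition)
        session.transfer(flowFile, REL_SUCCESS)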

------
impostervt
Been using this for a few months at work. I was originally a skeptic, as my
experience with "drag-and-drop" coding hasn't been positive, but I've come
around after using it for a while.

The guys developing it are incredibly responsive. I submitted a bug one
morning via the mailing list at around 9am, and by 11am they had a patch
slated for their next release.

------
SwellJoe
It's rare that I step into a conversation in software fields where I am
completely at a loss for what is being discussed. This is one of those
conversations. I have no idea what the description of the project means, and
after reading the conversation here at HN, including people quoting references
and such that explain what it does, I _still_ don't really know what it does.

I'm not complaining about not understanding, I just had one of those moments
of "wow, the world of software is really big and there are vast, heavily
funded, corners of it that I've never even heard of".

~~~
luckydata
The short version: any complex data processing done at scale will have a lot
of steps and fail a lot. To avoid having to constantly monitor everything, you
have systems called "workflow managers" that help you make sense of those
multi-step monsters and manage their dependencies, their failures, etc.

If you've never had to process that kind of data (we're talking about Google,
Facebook, and Twitter kinds of data), then you probably have no exposure to
this kind of system.
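
In toy form, the bookkeeping such a system does for you looks something like
this (a Python sketch with invented step names; real workflow managers add
scheduling, alerting, backoff, and distribution across machines):

    import time

    STEPS = {
        "extract":   {"deps": [],            "fn": lambda: print("pull raw data")},
        "transform": {"deps": ["extract"],   "fn": lambda: print("clean + reshape")},
        "load":      {"deps": ["transform"], "fn": lambda: print("write results")},
    }

    def run(step, done=None, retries=3):
        """Run a step's dependencies first, then the step itself, with retries."""
        done = done if done is not None else set()
        if step in done:
            return
        for dep in STEPS[step]["deps"]:
            run(dep, done, retries)
        for attempt in range(1, retries + 1):
            try:
                STEPS[step]["fn"]()
                done.add(step)
                return
            except Exception as exc:
                print(f"{step} failed ({exc}), attempt {attempt}/{retries}")
                time.sleep(1)
        raise RuntimeError(f"{step} exhausted its retries")

    run("load")   # runs extract, then transform, then load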

------
samuell
There's a really informative discussion about NiFi on the Flow-based
Programming mailing list [1].

One thing that is discussed is how NiFi, in contrast to "proper FBP", has only
one inport, to which all incoming connections connect, so incoming information
packets are merged, and subsequently need to be "routed".
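
To make the contrast concrete, a small Python sketch (component and packet
names invented): in classical FBP the port itself carries the meaning, while
with a single merged inport each packet has to self-describe and be routed
downstream:

    # Style 1: named inports -- the wiring says what each packet means.
    def classical_component(orders_inport, refunds_inport):
        for order in orders_inport:
            print("order:", order)
        for refund in refunds_inport:
            print("refund:", refund)

    # Style 2: one inport -- each packet carries its meaning, then gets routed.
    def merged_component(inport):
        for packet in inport:
            if packet["type"] == "order":
                print("order:", packet["body"])
            elif packet["type"] == "refund":
                print("refund:", packet["body"])

    classical_component(["o1"], ["r1"])
    merged_component([{"type": "order", "body": "o1"},
                      {"type": "refund", "body": "r1"}])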

[1] [https://groups.google.com/forum/#!searchin/flow-based-
progra...](https://groups.google.com/forum/#!searchin/flow-based-
programming/nifi/flow-based-programming/_lQhsJR_Ihg/krSGxKgbNqcJ)

------
chrisarnesen
Graphical dataflow programming is super powerful. It's the bread and butter of
Ab Initio Software, which powers the data infrastructure of many of the
world's largest corporations. I'm glad to see an open-source project entering
that market too.

~~~
nycthbris
What is your opinion on LabVIEW then? I find their graphical interface to
programming to be incredibly limited.

~~~
jcadam
I remember using LabVIEW around 2000 to read/display data from a high-altitude
balloon coming in over packet radio. In that case it seemed at least somewhat
useful. It provided a quick and dirty way to grab data from various sources
and throw it into some pretty graphs.

Years later, I had to maintain a software application built by a bunch of
idiots who thought it would be a great idea to use LabVIEW as their GUI layer
(because apparently that was easier than learning/using a proper GUI toolkit).
This monstrosity communicated with a back-end running on Solaris via LabVIEW's
freaking horrendous TCP control. The whole thing made absolutely _no sense_.
Though LabVIEW _did_ provide a rather nifty visualization of spaghetti code.

I suspect this was a resume-padding exercise for the original authors. Rumor
had it that one of the engineers responsible for the decision to use LabVIEW
had been hired away by National Instruments. And we all cursed him.

So anyway, after that experience I'm pretty well prejudiced against graphical
'programming'. I took one look at this NiFi thing and said 'Ha! Nope.'

------
nl
Dataflow programming[1] (which is what NiFi is built to do) is _NOT_ the same
as your typical extract/transform/load (ETL) tool with a nice user interface!

Wikipedia says: _Dataflow programming languages share some features of
functional languages, and were generally developed in order to bring some
functional concepts to a language more suitable for numeric processing._

The closest thing to dataflow in common use is the concept of DAG operations
in Spark, but dataflow usually makes time windowing a first-class concept.
Spark Streaming is moving towards this type of thing.

It is true that there is overlap with ETL tools, but that undersells what
dataflow is.
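
To illustrate the windowing point, a plain-Python stand-in: the system, not
your code, assigns each event to a window and aggregates per window. (Real
dataflow systems also handle out-of-order and late data, which this ignores.)

    from collections import defaultdict

    events = [  # (timestamp_seconds, key)
        (3, "a"), (12, "b"), (14, "a"), (61, "a"), (65, "c"),
    ]

    def fixed_windows(events, width):
        """Assign each event to a [k*width, (k+1)*width) window; count per key."""
        windows = defaultdict(lambda: defaultdict(int))
        for ts, key in events:
            windows[ts // width][key] += 1
        return windows

    for win, counts in sorted(fixed_windows(events, 60).items()):
        print(f"window [{win * 60}, {(win + 1) * 60}): {dict(counts)}")
    # window [0, 60): {'a': 2, 'b': 1}
    # window [60, 120): {'a': 1, 'c': 1}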

[1]
[https://en.wikipedia.org/wiki/Dataflow_programming](https://en.wikipedia.org/wiki/Dataflow_programming)

~~~
abrookewood
Can you expand on how they are different? The first thing I thought of after
reading about this was that it was primarily an ETL engine.

~~~
nl
So ETL is mostly about getting data _into_ some kind of processing system.
That means there are lots of functions for dealing with things like CSV data,
polling directories, transforming JSON data into flat structures, and so on.

Dataflow is a _programming model_ for performing actions on data. A system
that implements dataflow programming will probably have functions to load
external data into the structure the system needs, but it isn't primarily
about moving data to another system.

For example, Google Dataflow[1] has functions for reading files etc., but it
doesn't have the huge number of tools for cleaning and processing data that a
real ETL system has. Instead, you load the data into the system and then
process it for a specific task.

[1] [https://cloud.google.com/dataflow/what-is-google-cloud-
dataf...](https://cloud.google.com/dataflow/what-is-google-cloud-
dataflow#Sdks)

------
amai
"Apache Nifi is a new incubator project and was originally developed at the
NSA. In short, it is a data flow management system similar to Apache Camel and
Flume. It's mostly intended for getting data from a source to a sync. It can
do light weight processing such as enrichment and conversion, but not heavy
duty ETL. One unique feature of Nifi is its built-in UI, which makes the
management and the monitoring of the data flow convenient. The whole data flow
pipeline can be drawn on a panel. The UI shows statistics such as in/out byte
rates, failures and latency in each of the edges in the flow. One can pause
and resume the flow in real time in the UI. Nifi's Architecture is also a bit
different from Camel and Flume. There is a master node and many slave nodes.
The slaves are running the actual data flow and the master is for monitoring
the slaves. Each slave has a web server, a flow controller (thread pool)
layer, and a storage layer. All events are persisted to a local content
repository. It also stores the lineage information in a separate governance
repository, which allows it to trace at the event level. Currently, the fault
tolerance story in Nifi is a bit weak. The master is a single point of
failure. There is also no redundancy across the slaves. So, if a slave dies,
the flow stops until the slave is brought up again." from
[http://www.confluent.io/blog/apachecon-2015/](http://www.confluent.io/blog/apachecon-2015/)

------
noja
Is this like Yahoo Pipes?

~~~
splitbrain
That was my first reaction, too. It seems to at least have borrowed the design
patterns from it.

~~~
stonemetal
I doubt it borrowed from Y! Pipes in particular. There is a category of server
software called the "integration engine" that has been around for a while. Its
purpose is to integrate multiple third-party vendor systems so you aren't
locked in to one vendor for everything. You see it a lot in industries that
use software but aren't about software, like health care or finance.

------
Sami_Lehtinen
That seems to be pretty much what my friend built at one company as a
proprietary system for a customer. It basically looks exactly the same. Those
boxes are just code modules/microservices with custom code. It's also
important that everything can be configured, modified, and routed in real time
by adding new boxes, etc. I really loved that design. OK, technically the same
results can be reached using multiple different architectures, but this one
suits the microservices concept very well. Still, it could lead to high
latency depending on several different factors and how the modules are
technically connected.

------
awjr
I was pointed at this as something that can be used very effectively for IoT.

Still investigating.

~~~
awjr
[https://www.youtube.com/watch?v=sQCgtCoZyFQ](https://www.youtube.com/watch?v=sQCgtCoZyFQ)

------
checksim
So is this like Pentaho but more general purpose?
[http://www.pentaho.com/product/data-
integration](http://www.pentaho.com/product/data-integration)

~~~
cbsmith
...and a bit more focused on "flows".

------
mrbgty
In general I'm curious how people handle managing diffs in the workflow over
time. Working with Microsoft SSIS, I've found that I end up preferring
something in code, where changes are obvious.

~~~
dragonwriter
I'm not sure what the preferred approach is, but since it seems to have a REST
API with full query and update ability on the flow graph, you should be able
to build a workflow where you make an update on a dev server (updates are
immediately live), run a process that captures the flow via the REST API,
serializes it, and pushes the serialization to version control, and then have
another process that takes a serialized version and pushes it to other servers
(prod, etc.) via the REST API.
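
A sketch of the dev-to-version-control half. Hedged: the endpoint path below
is my assumption from recent NiFi docs (check the REST API reference for your
version), and the `dev-nifi:8080` host is made up:

    import json
    import subprocess
    import requests

    DEV = "http://dev-nifi:8080"

    # 1. Pull the current flow definition for the root process group.
    resp = requests.get(f"{DEV}/nifi-api/flow/process-groups/root")
    resp.raise_for_status()
    flow = resp.json()

    # 2. Serialize deterministically so diffs stay readable.
    with open("flow.json", "w") as f:
        json.dump(flow, f, indent=2, sort_keys=True)

    # 3. Commit; deploying is the mirror image (read file, PUT/POST to prod).
    subprocess.run(["git", "add", "flow.json"], check=True)
    subprocess.run(["git", "commit", "-m", "Capture NiFi flow from dev"],
                   check=True)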

 _Why_ visual programming tools don't (at least in documentation) address
version control is an interesting question, given how important that is in
programming generally, and the fact that those tools are often focused on
enterprise audiences.

~~~
marktangotango
I worked on a product once that provided users an interactive graph. The users
could apply layout algorithms AND modify the graph by moving nodes and edges
around. From time to time, nodes and edges would be added and removed by the
system.

So there's a distinction there between the graph and the layout of the graph,
i.e. the x,y coordinates of nodes and the routing of edges. What would version
control track? This is a non-trivial problem, IMO.

Edit: specifically, storing a graph in some canonical form so that trivial
changes would not create massive 'false positives' in the diff -- for example,
a changed edge indicating that an entire subgraph was added or removed. The
problem is similar to XML tree diffing.

This is why I think versioning in these tools is pretty much nonexistent, IMO
of course.
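
FWIW, the canonical-form idea can be sketched quickly: keep topology and
layout separate, and serialize the topology sorted and coordinate-free so that
dragging a node around doesn't produce a diff. A Python sketch with invented
field names, ignoring the harder subgraph-move cases mentioned above:

    import json

    def canonical_topology(graph):
        """Topology only: sorted nodes and edges, no x/y coordinates."""
        return json.dumps({
            "nodes": sorted(n["id"] for n in graph["nodes"]),
            "edges": sorted((e["src"], e["dst"]) for e in graph["edges"]),
        }, sort_keys=True, indent=2)

    g1 = {"nodes": [{"id": "parse", "x": 10, "y": 20},
                    {"id": "route", "x": 90, "y": 20}],
          "edges": [{"src": "parse", "dst": "route"}]}

    # Same graph after the user drags a node: layout changed, topology didn't.
    g2 = {"nodes": [{"id": "route", "x": 300, "y": 150},
                    {"id": "parse", "x": 10, "y": 20}],
          "edges": [{"src": "parse", "dst": "route"}]}

    assert canonical_topology(g1) == canonical_topology(g2)  # empty diff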

------
ibejoeb
I'm interested. This looks like the proper evolution of a cohesive
Camel+ActiveMQ+Felix system. Classloader isolation is so important, and it's
where most of the big application servers failed.

------
nigel182
It looks like DTS from MS SQL Server. Is it a competing product?

~~~
polskibus
For the last couple of releases it's been called Integration Services, but
yeah, NiFi looks very similar to it.

------
frugalmail
Big Data ELT (Extract, Load, Transform) similar to Informatica (bleh), Talend,
Kettle, etc., but designed for big/fast data warehousing and data mining
systems.

I haven't tried this, but it looks f'n awesome for what we do. We've been
using LinkedIn's Azkaban and Confluent as two separate paths while waiting for
Cloudera's dataflow.

------
amai
FYI: Hortonworks buys Onyara, the company behind Apache Nifi:
[http://hortonworks.com/press-releases/hortonworks-to-
acquire...](http://hortonworks.com/press-releases/hortonworks-to-acquire-
onyara-to-turn-internet-of-anything-data-into-actionable-insights/)

------
anc84
Is there a list of examples available?

~~~
kevinbowman
This playlist, linked from the YouTube video above, looks useful:
[https://www.youtube.com/watch?v=LGXRAVUzL4U&list=PLHre9pIBAg...](https://www.youtube.com/watch?v=LGXRAVUzL4U&list=PLHre9pIBAgc4e-tiq9OIXkWJX8bVXuqlG)

------
BeefySwain
Could someone explain what one might use this for? I have read both the
website and a bit of the documentation, but cannot think of a use case.
Obviously there is one, but I am very ignorant as to what it might be.

~~~
abrookewood
At my last job, we worked with a lot of financial transaction input files
(e.g. bank transactions, share transactions, etc.) which we called data feeds.
They came from a lot of different sources (SFTP, FTP, WebDAV, HTTPS, SSH),
and as a result we ended up with a bunch of different scripts, many of which
we would occasionally forget about. This would appear to be perfect for that
situation: gather the data; extract the files; check the data; put it
somewhere; and optionally modify it.

------
protomyth
Is this Apache's equivalent of Microsoft's BizTalk?

~~~
ljani
Apache ServiceMix is the more direct equivalent:
[http://servicemix.apache.org/](http://servicemix.apache.org/)

------
dajonker
Could be very useful if you can script jobs in code and the GUI just writes
code as well, so you can use version control and diffs to see what actually
changed.

------
sheraz
How does this compare to MuleSoft's CloudHub or WSO2 ESB?

------
darkus
How does this compare to Spring Integration?

------
fargo
How does this compare with Airflow or Luigi?

~~~
samuell
At least Luigi, and I think also Airflow, are batch workflow systems. This is
more of a streaming system as far as I know, and it also allows more control
over data routing: Airflow and Luigi mainly define dependencies between whole
tasks, while in NiFi you can route each output of every task separately.
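
The distinction in miniature, using Luigi's API for the batch side (task names
and files invented; the NiFi side is per-FlowFile routing, e.g.
RouteOnAttribute, rather than code, so there's nothing directly comparable to
show):

    import luigi

    class Extract(luigi.Task):
        def output(self):
            return luigi.LocalTarget("raw.csv")
        def run(self):
            with self.output().open("w") as f:
                f.write("a,1\nb,2\n")

    class Load(luigi.Task):
        # Dependency on the *whole* upstream task: Load runs only after
        # Extract has finished completely -- no per-record routing.
        def requires(self):
            return Extract()
        def output(self):
            return luigi.LocalTarget("loaded.txt")
        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read())

    if __name__ == "__main__":
        luigi.build([Load()], local_scheduler=True)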

------
pweissbrod
I wonder how this contrasts with Spring XD.

~~~
mrcsparker
It is a lot like Spring XD. They both use a data flow model:
[https://en.m.wikipedia.org/wiki/Dataflow_programming](https://en.m.wikipedia.org/wiki/Dataflow_programming)

I have been playing with both, and I prefer Spring XD's DSL. It looks a lot
like Unix pipes, which I could easily grok (and it has the full suite of
Spring monitoring and logging tools built in). That being said, they are both
excellent projects.

