Apache NiFi (apache.org)
279 points by boredgamer2 60 days ago | 135 comments

I've used it a fair bit, though not for a couple of years. Few points, some of which may be out of date:

* I've seen customers fall into the trap of thinking they don't need expensive developers because you can drag and drop, just people who can use a mouse can crack on with NiFi.

* It persisted its config to an XML file, including the positions of boxes on the UI. Trying to keep this config in source control with multiple devs working on it was impossible.

* Some people take the view that you should use 'native' NiFi processors and not custom code. This results in huge graphs of processors with 1000s of boxes with lines between you have to follow. Made both better and worse by being able to descend and ascend levels in the graph. The complexity that way quickly becomes insane.

* You're essentially programming with it. I've no doubt you could use it to write, say, an XMPP server if so inclined. Which means you can do a great many things of huge complexity. Programming tools have developed models for inheritance and composition, abstraction, static analysis, etc. which NiFi just didn't have. The amount of repeated logic I've seen its configuration accumulate is beyond anything I've seen from any novice programmer.
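To illustrate the source-control point above, here is a minimal sketch of why mixing canvas layout into the same XML as the logic makes diffs noisy. The element and attribute names are invented for illustration and are not NiFi's real schema; the idea is only that two files can differ on disk while describing the same flow.

```python
import xml.etree.ElementTree as ET

# Hypothetical flow snippet: tag and attribute names are made up for this
# example and do not reflect NiFi's actual file format.
FLOW_A = """<flow>
  <processor id="p1" class="FetchFile"><position x="10" y="20"/></processor>
  <processor id="p2" class="PutDb"><position x="300" y="20"/></processor>
</flow>"""

# Same logic, but someone dragged one box on the canvas.
FLOW_B = FLOW_A.replace('x="300"', 'x="412"')


def logical_view(xml_text: str) -> str:
    """Serialize the flow with layout-only <position> elements removed."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "position":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


raw_differs = FLOW_A != FLOW_B
logic_differs = logical_view(FLOW_A) != logical_view(FLOW_B)
print(raw_differs, logic_differs)
```

A filter like this only hides layout noise in a diff; it does nothing for merge conflicts when two people edit the same file.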

I ended up feeling like it could be an OK choice in a very small number of places, but I never got to work on one of those. The NSA linking together multiple systems with a light touch is possibly one such use case. For most everyone else, I couldn't recommend it.

> I've seen customers fall into the trap of thinking they don't need expensive developers because you can drag and drop, just people who can use a mouse can crack on with NiFi.

That is sort of the key problem I see with NiFi (and equivalents). The heavy emphasis on graphical UI and visual paradigm sort of implies that it's oriented towards non-developers, but the problem is that it doesn't make non-developers suddenly expert system architects or developers even if they manage to click through the UI. And many developers probably prefer just defining stuff in code instead of having fancy UIs. So it sort of falls between these two categories.

Of course there is a huge spectrum of skill in people, and there are probably plenty of "semi-technical" people for whom this is a perfect match, especially if supported by some more techy people.

Doesn't Microsoft Access have a similar interface? It's been a long time since I've used it.

Based on a quick look at NiFi, SQL Server Integration Services (SSIS) seems similar. The configuration management tooling for SSIS is pretty good and it's amenable to version control better than, it sounds like, NiFi is. SSIS still suffers from the potential "gotchas" the top-level poster mentions.

I haven't used SSIS since about SQL2008. Back then it was terrible to use with version control: not only was it a huge blob of XML, it had more XML escaped and shoved into attributes of the main document! What's more, it seemed to re-allocate the GUIDs of the elements every time you opened a diagram, so there were always changes...

Sounds like it's improved a bit since then.

To me, honestly, this seems more similar to Azure Data Factory.

It is genuinely odd to see this prejudice against graphical tools persist in 2020. Text-file dot-conf and graphical tools are different, but is it controversial that a graphical tool can emit performant code?

I have no experience with NiFi itself.. just that graphical tools are not inherently "non-technical"

I don't think it's about whether a graphical tool is performant. Many programming languages aren't performant either.

It's more that graphical tools don't generally serve their stated purpose of making it possible for non-programmers to get things done because it appears that the main skill behind programming is actually problem decomposition and modelling rather than syntax.

Additionally, graphical tools tend to have a bunch of problems inherent to them, such as being unable to write comments and harder to store in git and work on collaboratively and only having one single editor, which is usually much buggier than a text editor and compiler.

So: graphical "languages" don't make most things much easier, and make other things much harder.

> it appears that the main skill behind programming is actually problem decomposition and modelling rather than syntax

Totally. This is where declarative and intention-oriented systems shine. Take something like SQL, where, in the majority of cases, the end-user needs to know next to nothing about algorithmic complexity and can still achieve excellent performance and correct results.

It'd be neat to see a system that was actually designed toward problem decomposition. What's the state of the art around this kind of stuff?

We have very obviously worked on different types of end-user SQL. One that springs to mind had 30 or 40 tables, evolved over the years as “special cases” popped up, and by the time I was asked to look at it, it was maybe too far gone. I tried. Maybe if the schema had been tended like a garden, the queries wouldn’t have been so crazy, but...

Mind, though, that this system had made a ton of money for 15 years, so in a way it was a huge win for them! They just eventually painted themselves into a corner and performance started to get worse faster than Moore’s law could save them. Last I heard, the great untangling is still in progress, and I stepped away from that project a few years ago.

They likely have still come out on top when you take the sum of revenue they made from the system, but they incurred a huge unanticipated cost and got backed into a pretty bad corner. Sales people were still selling features that didn’t exist, and the team was desperately trying to build new things while also fixing the ever degrading performance.

I think GUI tools come a cropper when they try to do too much, but can work very well in limited domains.

We have recently released a GUI tool to do data transformation/ETL ( https://www.easydatatransform.com ). It is aimed more at power Excel users, than programmers or data scientists. We have tried to overcome some of the issues mentioned here:

* It is only intended for the limited domain of data transformation, which limits the complexity.

* No need to remember any syntax/commands for the vast majority of use cases.

* The transformation template is stored in a single XML file, so it can be versioned.

* You can use regexp, Javascript and command line arguments to get extra power if you need it.

* It is written in C++ and is fast.

Obviously it can't do everything you can do with a general purpose programming language, but it hopefully does enough for many use cases and is much easier to learn.

I have been working with graphical tools and node-based structures for years. Think about Blender's shader nodes or modular synthesizers. These systems are good for playing around, trying things out, and quickly changing how things are connected.

Above a certain complexity they become very hard to organize. If the complexity of your problem is below this level and the user has any self control, all is good. If the complexity of your problem is higher, you'd need a programmer who knows how to handle this anyway.

I don't think it's a prejudice. I've seen pentaho introduced as a way to enable business analysts to do work otherwise expensive developers would have to do. It simply didn't work, business analysts were not able to use it, devs had to take over and when they did they did not want to use pentaho. It's an easy trap for upper management to fall into thinking the selling point of these tools is the ability to allow the non tech staff to do the programming.

+1 to this. I've seen similar stories play out with Microsoft's Logic Apps, Power Apps, and to a lesser extent Dynamics.

> is it controversial that a graphical tool can emit performant code ?

No. But it's controversial to claim that a graphical tool can do maintainable code.

That seems to be the thrust of the comment you’re replying to: the graph is essentially a programming language, with all the same complexity but multiplied because the same level of complexity is being managed without the same level of sophistication of tools like version control and even basic code re-use constructs typical in text-based languages.

How do you do what has become basic best practices in a gui?

Everything under source control, defined as code.

* This gives you the ability to trace who did what

* Good code-commit hygiene with PRs

* The ability to revert changes

* Ability to have an exact replication and promote across environments

Programmatically test changes, unit test your flows

* Tie all the above up with a deployment pipeline

NiFi is being used in the critical parts of business data pipelines and all the devops rules for reliable development and deployment go out the window.

If NiFi is used by one person for a minor role the GUI is fine. But if it’s the core data pipeline which is a critical service used by multiple teams it’s not the smartest way to do things.

There's a general level of elitism in the programming world. GUI tools are seen by some programmers as a "worse" tool, but without any real evidence, and it's likely due to the fact that GUI tools enable non-programmers to reach some level of output that used to be the sole domain of programmers.

It's a similar train of thought that leads a lot of game developers to look down on games made with GUI tools like GameMaker or such.

Can we claim that it's elitism when it's born from experience?

Programmers have seen these sorts of UIs fail to deliver countless times as described in the root of this thread. Where is the counter example?

Unreal, Unity, Substance, Octane Render, and Scratch.

GUI tools have a very long history of being about to revolutionize the industry but ending up as dead ends. If you want evidence, just look to the past.

I'm honestly all for tools that enable non-programmers to get things done. E.g., Excel, Google Forms, Zapier, and a lot of Salesforce tooling all deliver to some extent on that promise. But there's only so much they can do.

You're right that there's also an elitism problem. But if we're going to get past that, we have to start by understanding what GUI tools are good for. And being honest about their limits.

Not all game developers: AAA game engines have lots of visual languages, especially for what the game designers and artists end up using.

It's indeed drag and drop, but as soon as you want to do something that's slightly more complex than the regular processors in place, you need to use regex filters, branching, Avro converters, and non-tech users will be lost very quickly.

I see it as very useful for automating certain operations (watch S3 storage, take action as soon as an object comes in and store it in a DB), as for such use cases it's pretty much drag and drop.

Another NiFi article was in Hacker News the other week with a fair few positive comments.

I left a comment with my thoughts, and we pretty much agree on the exact same issues. Good to see I’m not the only one, as I'm seeing some organisations use it more and more, to the extent of insisting that any data ingested into their systems needs to go via NiFi. Granted, some of these are extremely large and dysfunctional companies.

I have not found that argument persuasive to the managers who believe that coding is inherently wasteful if what you’re trying to do is technically possible in a workflow builder GUI.

That's true but, personally, I just avoid working for managers like that.

I can avoid this at the line manager level, but Directors love these things.

No, we don’t.

NiFi is also very useful for visually chaining together scripts (e.g. foo | bar | baz). In NiFi you can use ExecuteProcess and ExecuteStreamCommand to basically do what pipes (|) do. Further, the intermediate data between each process is backed by a WAL. You can also mix and match your pipeline: for example, you can mix in some of your custom scripts, and periodically use built-in processors that are provided to do additional stuff (e.g. PutKafka).
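For comparison, here is roughly what the plain-pipe version of foo | bar | baz looks like when wired up by hand. This is only a sketch: FOO, BAR and BAZ are hypothetical stand-in scripts (tiny Python one-liners so the example runs anywhere), and nothing here persists the intermediate data, which is the part the WAL gives you in NiFi.

```python
import subprocess
import sys


def stage(code: str) -> list[str]:
    """Build a command that runs a Python one-liner as a stand-in script."""
    return [sys.executable, "-c", code]


# Hypothetical stand-ins for foo, bar and baz: emit lines, sort them, dedupe.
FOO = stage("print('b'); print('a'); print('b')")
BAR = stage("import sys; sys.stdout.writelines(sorted(sys.stdin))")
BAZ = stage("import sys; sys.stdout.writelines(dict.fromkeys(sys.stdin))")


def run_pipeline(commands: list[list[str]]) -> str:
    """Run commands like `foo | bar | baz`, feeding each stdout to the next stdin."""
    procs = []
    upstream = None
    for cmd in commands:
        proc = subprocess.Popen(
            cmd, stdin=upstream, stdout=subprocess.PIPE, text=True
        )
        if upstream is not None:
            # The child holds its own copy of the pipe; drop the parent's.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)
    output = procs[-1].communicate()[0]
    for proc in procs[:-1]:
        proc.wait()
    return output


result = run_pipeline([FOO, BAR, BAZ])
print(result, end="")
```

If any stage dies halfway through, the data in flight is gone, and there is no way to inspect what sat between two stages.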

Compare coding to managing and suddenly they're on board. In practical terms: a checklist for work can achieve the same thing as a manager verbalising and visualising that checklist. Every argument the manager has against that is the same as 'gui workflow' vs code.

The idea that some sufficiently smart compiler/language/tool will replace expensive developers is a siren song that’s as old as COBOL. Most of these tools have historically failed to deliver much (if any) value. In the few cases where they genuinely made programming easier, the bar for what computers were expected to accomplish was raised to the point where specialized labor was once again required.

Curious if anyone from the NiFi team cares to comment / has thoughts on how to work around these issues (e.g. source control)

They already have a solution for that https://nifi.apache.org/registry.html

We used NiFi...one of the worst experiences.

It installs like an appliance and feels like you are grappling with a legacy tool weighed down by a classic view on architecture and maintenance.

We had built a data pipeline and it was for very high-scale data. The theory of it was very much like a TIBCO type approach around data-pipelines.

Sadly the reality was also like a TIBCO type approach around data-pipelines.

One person's experience and opinion, and I am super jaded by it due to some vendor cramming it down one of our directors' throats, who subsequently crammed it down ours when we warned how it would turn out. It ended up being a very leaky and obtuse abstraction that didn't belong in our data-pipeline when you planned how it was maintained longer-term.

I ultimately left that company. It had to do with as much of their leadership and tooling dictation as anything else, NiFi was one of many pains. I am sure there are places that are using NiFi who will never outgrow the tool so take it with a grain of salt.

Said company ultimately struggled for the very reasons those of us who left were predicting (the tooling pipeline was a mess and was thrashing on trying to get it right, constantly breaking by forcing this solution, along with others, into the flow. Lots of finger-pointing).

Sucks to have that: "I told you so..." moment when you never wanted that outcome for them....I just couldn't be a part of their spiral anymore.

Nifi is a very powerful tool, but also a very specific one, and a self described 'necessary evil'. It does one heck of a job at getting data from A to B though.

That was exactly how folks talked about TIBCO. I didn't find it very powerful. I found it very cumbersome. Having to think about Day 2 is where everything completely collapsed.

We were able to pass data around in incredibly lightweight ways leveraging Spring, sometimes even just leveraging RestRepositories and transforming the object to our data representation by hand. It was never more than 100 lines of code for the entire thing. You could spit one out in an hour...the time was really in composing them and ensuring the architecture reflected the world and was still sensible/manageable.

We ultimately faced issues with running microservices and the licensing cost of that. Our enterprise was sadly too big, they didn't realize they needed to price their internal infrastructure competitive to legacy vendors.

You could get a WAS box for 50k and cram so much on that server until it was bursting...price didn't change. On the other hand each microservice brought a cost which added up.

The economics didn't make sense and it was a new political battle to fight with someone who had zero understanding of marketing what they've built. It just wasn't worth it. Lambdas would have been an option or something more ephemeral/serverless...but the options just weren't there for us at the time.

Enter NiFi and this "new data pipeline" and the circus began.

> It installs like an appliance and feels like you are grappling with a legacy tool weighed down by a classic view on architecture and maintenance.

This is actually a fair and well-articulated point of view. NiFi is currently an "appliance" like you said. Worse, it's a Pet and not a Cattle.

I believe there is active work in the community to address some of that pain. For example, there was a recent addition to NiFi called "stateless NiFi" which enables NiFi to better run in Kubernetes and other "cloud" architectures.

It's not there yet; it's still what would be described as a "fat" application. But I believe that eventually NiFi will evolve into more of a command-and-control tool for the cloud and less of something you have to install directly on your hardware. Hopefully we'll see the day when "NiFi-as-a-Service" exists, which would really be an improvement over the current model.

Would be a huge step.

It feels like the answer will end up being something totally different. The reality is that enterprises do well with appliances.

Selling them cattle is hard because the maintenance piece expects a certain level of hygiene, proactivity, discipline.

An appliance sits there and when the thing breaks, you call in someone to fix the box. That relationship between a customer and vendor surprisingly makes for a good selling environment/symbiosis.

It's the Cathedral and the Bazaar in another spectrum...

Can you elaborate on what you mean by a TIBCO-like approach? I haven't used their tools, but would like to know more about the issues you ran into. What were examples of the leaky abstraction?

I'd like to second this request. I have encountered event buses and ETL in a number of places over my career, and I don't understand what the heck TIBCO does beyond something simple like RabbitMQ/ZeroMQ. How is this different from Pub-Sub (and its variants)? Any pointers to books or blog posts would be really appreciated.

TIBCO is very much providing queuing/caching to shuttle data from one point to another.

The goal is even more-so to be the interconnect for all systems across a varied enterprise at a higher level. It's all pub-sub underneath the hood. Think cheap butts in seats doing the same work for a "negligible hit on performance".

In the same way you can plug random devices into outlets around the house all served by some powerplant you don't know (or even need to care about), TIBCO attempts to provide that same interface.

Data does need some restructuring, whether these are aggregations, transformations, etc. So they provide steps in the process where you can perform these operations through a drag and drop UI.

There is an input defined and an output defined in XML that you don't have to code, but is managed and can be seen. The engine beneath provides the lower layers of routing, bytecode, implementation letting you just drag blocks around on a screen "connecting things".

The goal is very pure: I have many people in my organization that know how data flows, not all of them are developers. How can I enable them to connect my organization without everyone needing to be a developer.

In theory and in practice are always the interesting observations. What I had seen happen (as was mentioned somewhere else) is that very strong developers became weak by relying on this tool (or merely left for adequately challenging work). When the world moved on to something else, so much had changed it was almost a career change to get back into development.

They went from understanding Java 3/4, JEE to Java 11, Spring, DI Frameworks....I saw a lot give up or move over to product management roles. This only made the tension between on-premise infrastructure teams and public cloud teams more divisive and toxic. I don't think it's anything uncommon in other areas, just feel like we've reached a full revolution in this particular space (and not the first revolution either).

Thanks for a clear explanation without dismissing the product as garbage. (it's in that space where techies hate it, but it must provide value since it's so expensive!)

Why do non-technical people need to understand the data flow? It seems like documentation (data dictionaries) would be preferred. Or, are they useful for very non-technical people, while TIBCO data flow understanding is useful for people who are data savvy but not tech savvy?

It is a butts in seats equation.

If you can have less expensive operators driving and mapping the world and place all the smarts in the pipes, you can drive down opex and divert cash to capex for competitive advantage.

Linux and much of the streaming software world is smart people, dumb pipes.

If you invert that you have more automation, predictability, control at lower cost. The risk is a lot of eggs in one basket and when the market turns, if the company you are buying from mismanages tech, if they can't keep pace with change...you go along for their ride. Every company big and small falls into this technical debt. I have many opinions on why, as I am sure many do.

There is a lot baked into that comment, but it's the constant tug-of-war every CIO is trying to wrap their head around....how do we do more with less and gain an advantage.

TIBCO is garbage. They had a halo for a long time from Rendezvous/EMS. But their money maker was this integration suite called BusinessWorks. It was this horrifyingly complex application that forced you into these ruts so that it could compile Java code. I kid you not, the developer environment for complex code was notepad.exe.

They spent a bunch of money on M&A and eventually had to go private and buy out the founder.

IMO, there is no “TIBCO-like approach” to application integration c. 2020 any more than there is say an Oracle or AWS or Google approach to databases. It’s multi-paradigm, multi-usecase and polyglot. TIBCO as a vendor supports approaches and patterns ranging from event-driven functions to data streams and stateful orchestration to stateless mediation to choreography. The “runtimes” are built on anything from Golang, Python & Node to Java, Scala & .NET.

What you‘re referring to sounds like the legacy version of BusinessWorks 5.x that was launched back in 2001. The current generation of BusinessWorks 6.x provides Eclipse-based tooling just like closest alternatives like Talend, Mule, Fuse, etc. and deploys to 18+ PaaSes (k8s, swarm, GKE, AKS, etc.) or its own managed iPaaS aka TIBCO Cloud Integration. It’s aimed at Enterprise Integration specialists at a Global 2000 or F500.

If you‘re an app developer at a large bank/telco/retailer/airline building integration logic or stream processing or event-driven data pipelines, you‘re likely to use Project Flogo (flogo.io) It’s 3-clause BSD FLOSS and has commercial support and optionally commercial extensions available. Oh and you’re likely going to use Flogo apps with Apache Pulsar or Apache Kafka messaging. Both Pulsar and Kafka are available as commercially supported software from TIBCO (Rendezvous or EMS are our traditional proprietary messaging products). Flogo apps can deploy to TIBCO Cloud, dozen+ flavors of k8s, AWS Lambda, Google Cloud Run or as a binary on an edge device.

(Disclaimer: Product at TIBCO. Used to work on BW 6.0 back when the only PaaS was good ol’ Heroku)

Curious, did you have a preferred alternative?

I get the feeling you described, Nifi has a.. heavy and highly structured feel to it, but lighter alternatives are not as integrated, say... Airflow, Streamsets, AWS Glue, Kafka (different beast) etc.

That said, Nifi is incredibly powerful and complete considering it's open source and free.

> Sadly the reality was also like a TIBCO type approach around data-pipelines.

That's exactly what it looks like, thanks for confirming. Will avoid.

NiFi's biggest strength is that it is a 2-way system - it is not Storm, it is not Flink, it is not Kafka, it is not SQS+Lambda.

I like to think of it like Scribe from FB, but with an extremely dynamic configuration protocol.

The places where it really shines is where you can't get away with those 3 and the problem is actually something that needs a system which can back-pressure and modify flows all the way to the source - it is a spiderweb data collection tool.

So someone trying to do Complex Event Processing workflows or time-range join operations with it will probably succeed at the small scale, but start pulling their hair out at the 5-10GB/s rate.

So its real utility is in that it deploys outside your DC, not inside it.

This is the Site-to-Site functionality, and MiNiFi is the smallest chunk of it, which can be shrunk into a simple C++ something that you can deploy in every physical location (say warehouse or grocery store).

The actually useful part of that is the SDLC for NiFi, which lets you push updates to a flow. So you might want to start with a low granularity parsing of your payment logs on the remote side, but you can switch your attention over to it and remove sampling on the fly if you want.

If you're an airline flying over the arctic, you might have an airline-rated MiNiFi box on board which is sending low traffic until a central controller pushes a "give me more info on fuel rates" request.

Or a cold chain warehouse which is monitoring temperature on average, until you notice spikes and ask for granular data to compare to power fluctuations.

It is a data extraction & collection tool, not a processing and reporting tool (though it can do that, it is still a tool for bringing data after extraction/sampling, not enrichment).

Incredible piece of software. I've used it in production at my last two jobs. You can build almost anything in NiFi once you get into the mindset of how it works.

A good way to get started with NiFi is to use it as a highly available quartz-cron scheduler. For example, running "some process" every 5 seconds.

Disclaimer: I'm an Apache NiFi committer.

An article you might find interesting about its ability to scale.


Disclaimer v2: I used to work at Cloudera

This page is classic Apache project in that I read it and have no idea what it does. Can you high level explain what this thing is really for?

Agreed. So here's an attempt to describe NiFi at a high level.

Fundamentally NiFi is a "dataflow engine", a system that can be used to automate data transfer from different and varying types of sources and sinks. It has a fairly usable UI that enables a "dataflow manager" (end user) to perform transformation, routing and delivery of data using a "drag-n-drop" configuration approach.

Getting data into or out of your application/system, or performing simple schema transformations, is a common (maybe tedious) task that most developers face. NiFi helps connect the dots, so to speak, and decouples the receipt/delivery of data away from your application. NiFi comes with a set of "batteries included" connectors for almost every transport protocol you would generally need. And it's modular so you can create your own processing components as well.

NiFi is fundamentally modeled after what's called "Flow-Based Programming"[1], which is a style of programming that facilitates composition of black-box processing units. It can run at an enterprise or IoT level, depending on where that decomposition best fits into your architecture.

[1] https://en.wikipedia.org/wiki/Flow-based_programming
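As a rough illustration of the flow-based idea (not of NiFi's own API), here is a sketch where each component is a black box that only sees a stream in and a stream out, and the "flow" is just the wiring between them. The component names and the connect helper are made up for this example.

```python
from typing import Callable, Iterable, Iterator

# A component knows nothing about its neighbours: stream in, stream out.
Component = Callable[[Iterable[str]], Iterator[str]]


def strip_blank(lines: Iterable[str]) -> Iterator[str]:
    """Drop empty or whitespace-only records."""
    for line in lines:
        if line.strip():
            yield line


def to_upper(lines: Iterable[str]) -> Iterator[str]:
    """Transform each record independently."""
    for line in lines:
        yield line.upper()


def connect(*components: Component) -> Component:
    """Wire components in sequence; the wiring is the program."""
    def network(source: Iterable[str]) -> Iterator[str]:
        stream: Iterable[str] = source
        for component in components:
            stream = component(stream)
        return iter(stream)
    return network


flow = connect(strip_blank, to_upper)
result = list(flow(["a", "", "b"]))
print(result)
```

Reordering or swapping components changes the behaviour without touching any component's code, which is roughly what dragging boxes around a canvas amounts to.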

(disclaimer: I'm affiliated with the NiFi project)

Is this a layer on Apache Camel [1] or something completely different?

[1]: https://camel.apache.org/

It was built from scratch at the NSA and open sourced a few years ago.

I'm curious how you would compare it to Apache Camel - when would you use Camel and when would you use this?

We use camel with DSLs to make programmatic workflows that connect data flows together. However Camel itself doesn't typically carry the data. Sometimes it SFTPs files around etc., but mostly, it is just a messaging layer.

Is that the main difference here?

Is this similar to NodeRed for IoT? I was trying to bring up something similar for IoT with CEP rules and source and sink connections on the Edge.

So is this comparable to something like the old Microsoft Biztalk?

I find that explaining NiFi tends to be difficult, especially if you're an advanced user. This is because NiFi is extremely general purpose. I've used it for many different types of projects. That being said, here are a few use cases that are relatively easy to get started with.

1. Move data from A to B

2. Move data from A to B, but perform some intermediate processing in-between (ETL)

3. Grab data from A, write it to [INSERT SYSTEM HERE] (Mongo, Kafka, many others)

4. Periodically run an arbitrary script or program (like cron but you can schedule in seconds)

Those are just some use cases that people tend to start off doing. It's important to note...the above doesn't sound very impressive until you think about the following features:

1. Build these pipelines completely from a UI. You can leverage custom code, but the number of processors that come out of the box is astounding. If you can think of it, it most likely already has something built-in.

2. Safety nets by default. Data is backed by a high performance WAL that allows for recovering in the case of a failure.

3. Ability to pause and resume specific areas of a pipeline at any time. For example, you may have a processor that's receiving data and perhaps the database you write the data to is offline (or has moved). Data is automatically buffered on disk so when the DB comes back online, you can backfill the data that you've been consuming. Further, none of the data was dropped in this process.

4. Tunable and verifiable back pressuring out of the box (_very_ hard to get right when writing custom code)

5. Easy to prioritize certain events over others (e.g. higher priority, latency sensitive data)

Lastly, when you start getting comfortable using NiFi, you will realize how general purpose the engine is. Data is represented as a "FlowFile". A "FlowFile" is essentially just an Array<Byte>. This means you can operate on virtually anything. Further, you can operate on large data since it's backed by a WAL and NiFi facilitates the ability to stream data from disk when you want to process it, as long as you don't read the entire thing into memory (of course).
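The "don't read the entire thing into memory" part is the same discipline you would apply in ordinary code. A minimal sketch in plain Python, assuming the content is just opaque bytes on disk (this does not use any NiFi API; the helper names are hypothetical):

```python
import hashlib
import os
import tempfile


def stream_chunks(path: str, chunk_size: int = 64 * 1024):
    """Yield the file's bytes in fixed-size chunks instead of loading it whole."""
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


def digest_and_size(path: str) -> tuple[str, int]:
    """Process arbitrarily large content with memory bounded by chunk_size."""
    hasher = hashlib.sha256()
    size = 0
    for chunk in stream_chunks(path):
        hasher.update(chunk)
        size += len(chunk)
    return hasher.hexdigest(), size


# Demo with a throwaway file standing in for some opaque payload.
payload = os.urandom(200_000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
digest, size = digest_and_size(tmp.name)
os.remove(tmp.name)
print(size)
```

Because the content is treated as bytes, the same loop works whether the payload is CSV, Avro or a video file.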

Easiest way to get started is to just download the binaries and run it. Then go to http://localhost:8080/nifi and just mess around. There are plenty of tutorials online (blogs, YouTube, etc.). Once you get comfortable using NiFi, people will be blown away by how fast you can get something up and running. Things that would normally take days or weeks to get into production, you can routinely do in hours or minutes.

Hope this helps!

edit: You can download from https://nifi.apache.org/download.html. Just unzip/untar the package and run "./bin/nifi.sh run"

Isn't most of this just Kubernetes running PrestoDB/PrestoSQL?

Not sure what you mean. PrestoDB is an analytics data store. The only correlation between NiFi and Presto would be if you wrote data to Presto using NiFi.

Thanks for your honesty, I thought I was the only one.

Quick question, what does the role of Data Engineer at Epic Games entail and what technologies are you working with?

NiFi at first glance sometimes just looks like a glorified GUI for building out a data-delivery application. But NiFi doesn't just compile an application to be deployed on your network. Instead, the "power" of NiFi is that it allows an operations staff to perform the regular day-in-day-out task of monitoring, regulating and if needed modifying the delivery of data to an enterprise.

NiFi gives insight to your enterprise data streams in a way that allows "active" dataflow management. If a system is down, NiFi allows dataflow operations to make changes and deal with problems directly, right at tier 1 support.

It's often the case that an enterprise software developer has an ongoing role of ensuring the healthy state of the applications from their team. They don't just develop, they are frequently on call and must ensure that data is flowing properly. NiFi helps decouple those roles, so that the operations of dataflow can be actively managed by a dedicated support team that is more tightly integrated with the "mission" of their dataflow.

NiFi additionally offers some features that most programmers skip to help with the resiliency of the application. For example:

- the concept of "back pressure" is baked into NiFi. This helps ensure that downstream systems don't get overrun by data, allowing NiFi to send upstream signals to slow or buffer the stream.

- data provenance, the ability to see where every piece of data in the system originated and was delivered (the pedigree of the data). Includes the ability to "replay" data as needed.

- dynamic routing, allowing a dataflow operator to actively manage a stream, splicing it, or stopping delivery to one source and delivering to another. Sources and Sinks can be temporarily stopped and queued data placed into another route. Representational forms can be changed (csv -> xml -> json, avro), and even schemas can be changed based on stream.
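
The back pressure idea in the list above can be sketched in a few lines of Python. This is only an illustration of the general technique using a bounded queue from the standard library, not how NiFi implements it; `run_pipeline` and its parameters are hypothetical names.

```python
import queue
import threading
import time


def run_pipeline(n_items: int, max_queued: int) -> tuple[list[int], int]:
    """Push n_items through a bounded queue; returns (delivered items, peak queue depth)."""
    buffer: queue.Queue = queue.Queue(maxsize=max_queued)
    delivered: list[int] = []
    peak = 0

    def consumer() -> None:
        while True:
            item = buffer.get()
            if item is None:  # sentinel: producer is done
                return
            time.sleep(0.001)  # simulate a slow downstream system
            delivered.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(n_items):
        # put() blocks while the queue is full, so the fast producer is
        # slowed to the consumer's pace. That blocking is the back pressure.
        buffer.put(i)
        peak = max(peak, buffer.qsize())
    buffer.put(None)
    worker.join()
    return delivered, peak


if __name__ == "__main__":
    items, depth = run_pipeline(n_items=50, max_queued=5)
    print(len(items), "delivered; queue never held more than", depth)
```

The queue depth never exceeds `max_queued`, so a slow consumer cannot be overrun no matter how fast the producer is.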

Anyone can write a shell script that uses curl to connect with a data source, piping to grep/sed/awk and sending to a database. NiFi is more about visualizing that dataflow, seeing it in real-time, and making adjustments to it as needed. It also helps answer the "what happens when things go wrong" question, the ability to back-off if under contention, or replay in case of failure.

(disclaimer: affiliated with NiFi)

NiFi is very good at reliably moving data at very high volumes, low latency, with a large number of mature integrations, in a way that allows for fine grained tuning, and I've seen first hand that it is very scalable. Its internal architecture is very principled: https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.ht...

Out of the box it is incredibly powerful and easy to use; in particular its data provenance, monitoring, queueing, and back pressure capabilities are hard to match; a custom solution would take extensive dev to even come close to the features.

It is not code, and that means it is resistant to code based tooling. For years its critical weakness was related to migrating flows between environments, but this has been mostly resolved. If you are in a place with dev teams and separate ops teams, and lots of process required to make prod changes, then this was problematic.

However, the GUI flow programming is insanely powerful and is ideal when you need to do rapid prototyping, or quickly adapt existing pipelines; this same power and flexibility means that you can shoot yourself in the foot. As others have said, this is not a tool for non technical people; you need to understand systems, resource management, and the principles of scaling high volume distributed workloads.

This flow-based visual approach makes understanding what is happening easier for someone coming later. I've seen a solution that required a dozen containers of redis, multiple programming languages, zookeeper, a custom gui, and mediocre operational visibility, be migrated to a simple nifi flow that was 10 connected squares in a row. The complexity of the custom solution, even though it was very stable and had nice code quality, meant that that solution became a legacy debt quickly after it was deployed. Now that same data flow is much easier to understand, and has great operational monitoring.

Some suggestions:

- limit NiFi's scope to data routing and movement, and avoid data transformations or ETL in the flow. This ensures you can scale to your network limits, and aren't cpu/memory bound by the transformation of content.

- constrain the scope of each instance of nifi, and don't deploy 100s of flows onto a single cluster.

- you can do a lot with a single node; only go to a cluster for HA and when you know you need the scale.

Phew! Happy to have read the comments here. They say a lot. I will go with Apache Airflow for all my workflow needs from now on. I wasn't entirely sure if this was the best bet, but after seeing all of this I am now.

I know a massive installation [0] which is about to be open sourced, where Apache NiFi is used in the middle of the stack as a key component. No dismissal of the capabilities this package offers intended.

[0] https://sikkerhetsfestivalen.no/bidrag2019/138

slides [slide #32]: https://static1.squarespace.com/static/5c2f61585b409bfa28a47...

For the love of god, don't use NiFi to trigger an Airflow DAG.

Can you expand? We just set this workflow up and it seems to be working fine.

NiFi is meant for stream processing and Airflow for batch processing, if your NiFi triggers an Airflow DAG that means that your entire process is batch processing and you shouldn't use NiFi in the first place. If you still want to do stream processing then use Airflow sensors to "trigger" it.

If you're considering Apache NiFi, you should also look at Apache Airflow and Uber Cadence to decide what model would work best for you.

They do have totally different use cases, so it should be fairly quick to decide which one is for you.

Can you explain what the different use cases are? In particular, what is Airflow good at that Nifi is not?

I have used Nifi a little bit and Airflow not at all. Reading the home pages of the two products, it's hard for me to know when it would be more appropriate to use Airflow than Nifi.

They both schedule jobs and move data according to control flow topologies that you build in a GUI, right?

From watching the presentation on YouTube (https://youtu.be/sQCgtCoZyFQ), it seems NiFi is geared towards acquisition of data, gluing/batching/massaging the flow between systems and providing the necessary interfaces to downstream systems.

- Would love to see the ability to develop custom NiFi processors in Go/Rust/Elixir etc.

- XML is a big pain in the rear.

- Being container-aware is big win. Stateless is even better.

I see a good opportunity there for users like me to explore NiFi's capability in the future.

> They both schedule jobs and move data according to control flow topologies that you build in a GUI, right?

Airflow, on the other hand, is designed to run scheduled jobs (whether batched or otherwise). The 'job' can really be anything: build / data processing pipelines, system configuration management pipelines and so on. In Airflow parlance, one can create connected DAGs as pipelines that massage the data in the way you intend.

They share some commonalities, but I'd say their use cases are subtly different, the important difference being the one highlighted above.
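
For readers unfamiliar with the DAG idea, here is a rough Python sketch using only the standard library's `graphlib`. It is not Airflow's API, and the task names in `PIPELINE` are made up for illustration; it only shows how declaring dependencies between jobs determines a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE: dict[str, set[str]] = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order where every dependency runs first.

    Raises graphlib.CycleError if the dependencies are not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for task in run_order(PIPELINE):
        print("running", task)
```

A scheduler built on this idea runs each task on a timetable and only after its upstream tasks have finished, which is a different model from a continuously running dataflow.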

Can someone explain what this is? I can't find anything on the website that explains it

Sure, I've worked with it. It lets you visually build data pipelines. It's extremely useful for getting work done quickly--you just drag and drop prebuilt connectors to things like elasticsearch, s3, or twitter and you have a data pipeline, including automatic backoff and the ability to inspect the data at each step. It's visual so it's easy to tell what's going on. Biggest downside is it's not automatically distributed. You can set it up to be distributed, but you have to do the plumbing yourself on the nifi graph by dropping nodes for routing tasks between nifi servers. Overall, perfect tool for quickly building a pipeline that can be easily shown to the business and in which you can visually see all data and errors at each step.

Why would one consider distributing it in the first place? IMHO that would require distributed transactions, which would make it hugely complex.

Been that way for at least 5 years https://news.ycombinator.com/item?id=10190846

There are a few articles in the old comments that explain its use case a little

Disclaimer: I haven't worked with NiFi specifically, but with another similar product

As far as I know, it at least partially fulfills the role of the "enterprise application integration" pattern; the idea is sort of a proto-servicemesh where you have a bunch of weird enterprise applications all around and you need to get them talking to each other, so you plop a fancy EAI middleware thingy in the middle which then will talk to each of the applications individually, converting and transforming data from one format and protocol to another. But it's not just a dumb hub: in addition to doing arbitrarily complex transformations, it can also make "routing" decisions based on the data itself. And this pattern being a product of its time, instead of defining the network in code in some nice DSL everything is just "configured" and lots of things are achieved through clicketyclicking through a GUI. I suspect this is at least partially due to "configuring a turn-key product" sounding less scary to managers than "developing a system on a framework".
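
A toy Python sketch of the two things described above, format conversion and content-based routing. It is not NiFi code; the `temperature` field, the threshold, and the destination names `alerts` and `archive` are all invented for illustration.

```python
import csv
import io
import json


def csv_to_records(text: str) -> list[dict[str, str]]:
    """Parse CSV text (with a header row) into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def route(record: dict[str, str]) -> str:
    """Pick a destination from the record's own content (hypothetical rule)."""
    return "alerts" if float(record["temperature"]) > 30.0 else "archive"


def dispatch(text: str) -> dict[str, list[str]]:
    """Convert CSV rows to JSON strings and group them by destination."""
    outputs: dict[str, list[str]] = {"alerts": [], "archive": []}
    for record in csv_to_records(text):
        outputs[route(record)].append(json.dumps(record, sort_keys=True))
    return outputs


if __name__ == "__main__":
    sample = "sensor,temperature\na,21.5\nb,35.0\n"
    for destination, messages in dispatch(sample).items():
        print(destination, messages)
```

In a tool like NiFi each of these functions would roughly correspond to a configured box on the canvas rather than to code, which is the "configured, not programmed" trade-off the comment describes.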

If you look at the docs page of NiFi[1] and scroll quickly through the long list in the left sidebar with all the "processors", you already can get an idea what sort of things it does.

While this sort of solution can be very useful in some situations, I wouldn't necessarily start designing a greenfield architecture around NiFi. But if you end up running it, then you might find that piece by piece it will accumulate all sorts of bits and bobs because it's "convenient" to throw them in there, while the logic flows become more and more eldritch in nature.

[1] http://nifi.apache.org/docs.html

I quite like the description they give in the docs[1], a brief summary:

> Over the years dataflow has been one of those necessary evils in an architecture. [...] NiFi is built to help tackle these modern dataflow challenges.

[1]: http://nifi.apache.org/docs.html

Think TIBCO data pipelines.

You create sources of data and can have non-technical (or maybe non-developer) roles wire together these data pipelines with transformations, aggregations etc.


TIBCO is an "Enterprise Message Bus." Think of RabbitMQ or 0MQ but supported by a giant enterprise grade vendor.

My last experience with TIBCO was them telling us the product didn't support anything but running on bare metal because "virtualization does not guarantee the order of writes to disk" (their words not mine)

True.. if your last interaction with TIBCO was c. 2005. Back then consistency issues were a real challenge with VMs, “ESB” was a real software category with transactional messaging as a pattern (I make it a practice to not judge customer preference on how they use our tools)

Today you will find even our traditional brokered messaging products like EMS lifted and shifted on AWS/Azure infra.

OTOH, if you are building new apps - you‘re likely to deploy TIBCO’s integration or stream apps / pipelines using Flogo/Pulsar/Kafka on k8s-lambda-gcr or on a fully managed service on cloud.tibco.com.

(Source / Disclaimer: I do product stuff over at TIBCO Software. Happy to chat if there are further qns: rkozhikk<remove-this-at>tibco<dot>com)

We are using NiFi as our dataflow engine for real time data ingest. We are using a current version, 1.11.4, and have several instances running including a development instance. The interface gives our team the ability to do quick iterative development and testing. An example of one of our use cases: we have 2 dataflows that ingest data from 2 different vehicle location/status systems and pump them into SQL Server. At the same time another dataflow merges the data from SQL Server and sends the data to Azure Event Hub. These dataflows were easy to set up, test and extend. This replaced a process that was written in Go.

Nifi is a good (not great) tool, mostly because of all of the functionality you get out of the box. It comes with almost any kind of connector you would need for moving data. There's a pretty steep learning curve, but once you push through that, creating a new data flow from scratch is quick and easy. It sucks that other people in this thread have had bad experiences with Nifi, and I can't say that I haven't. However, it has generally been a positive addition to my team's stack.

Had some second hand opinions on running NiFi in prod and all of them were rather negative, some saying it was a mistake. That was around one year ago. I wonder if things have changed since then.

I have never heard of this before, and I'm sad that profit-driven, marketing speak has taken over even non-profit product pages.

> An easy to use, powerful, and reliable system.

This is the title. That's the most important sentence, and it's absolutely meaningless.

It's bad enough that everything has to "sell" - just describe to me what your product does and I'll decide if I need it or not. Don't try to convince me.

If you have to sell, do it by differentiating yourself from your competitors. No one is calling themselves "Difficult to use, weak, and unreliable", so saying the opposite is not differentiation.

When did we accept that marketing-speak was the default communication? Can't we have some landing pages that are essays? Or even a few paragraphs instead of trying-to-be-catchy bullet point phrases in large font?

> This is the title. That's the most important sentence, and it's absolutely meaningless.

Well, yeah, it's meaningless if you cut off the second half...

> ...to process and distribute data.

That's what it does. The adjectives before it aren't the meat of the sentence.

I have experience from multiple projects with NiFi and it was the main reason for me and others quitting the company. Somehow management was convinced by some salesmen that this would be the silver bullet; however, all of their deliveries were delayed. We experienced issues debugging flows with performance problems, and even basic version control was problematic due to ids being replaced every time.

NiFi is a fantastic tool for a certain set of organizational constraints.

* It doesn't need much in the way of dependencies to run. If you can get Java onto a machine, you can probably get NiFi to run on that machine. - That is HUGE if you are operating in an environment where getting any new dependencies installed on a machine is an operational nightmare.

* It doesn't require a lot of overhead. Specifically, no database.

* You can write components for it that don't require a whole lot of tweaking for small changes to the incoming data. So, if I have a machine processing a JSON file that looks like XXYX and another machine processing a nearly identical JSON file that looks like XYXX, the tweaks can be made pretty easily.

So, if you're looking for a lightweight, low overhead, easily configurable tool that may be running in an environment where you've got to run lots of little instances that are mostly similar but not quite, NiFi is great.

If you are running a centralized data pipeline where you have a dedicated team of data engineers to keep the data flowing, there are better options out there.

No more XML. Check out NiFi 1.11.4, it does everything you need for easy ingest. If you are reading some files putting them into Kafka or S3 or a database or MongoDB or Hbase or Hive or Impala or Oracle or Kudu or ..., it's genius.


As far as I can see, your link has nothing to do with the 1.11.4 release. Maybe you wanted to link to a specific page?

Having used NiFi in production, my biggest issue with it is handling source control and multiple environments. As the "IDE" is effectively also the runtime, the lines between "local", "stage", and "prod" are easy to blur.

They have a built-in source control product called "NiFi Registry", which can even be backed by git. The workflow for promoting flows between environments feels clunky though, especially as so much environment-specific configuration is required once your number of components gets high enough.

Moving our Java, Ruby or Go code between environments or handling versioning and releases was a piece of cake, in comparison.

Do I understand correctly what this is: a general-purpose SSIS-type data integration, pipeline, and workflow tool?

If so, how does it compare to SSIS, dbt, and other projects (please name!)?

Otherwise, what is an analogous toolset?

I have been working on a new product which competes with NiFi, providing streaming data transformations.

Think, if order value > 100 and the customer has ordered 3 times in the last hour and the product will be in stock tomorrow.
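
To pin down what a rule like that involves, here is a small Python sketch of one possible reading. It is an assumption-laden illustration, not the product being described: `Order`, `OrderRule`, and the constants are hypothetical names, events are assumed to arrive in timestamp order, and the current order counts toward the three.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

WINDOW_SECONDS = 3600      # "in the last hour"
MIN_ORDERS_IN_WINDOW = 3   # "has ordered 3 times" (assumed to include this order)
MIN_ORDER_VALUE = 100.0    # "order value > 100"


@dataclass
class Order:
    customer: str
    value: float
    timestamp: float          # seconds since some epoch
    in_stock_tomorrow: bool   # assumed to be looked up before the rule runs


class OrderRule:
    """Keeps a per-customer sliding window of recent order timestamps."""

    def __init__(self) -> None:
        self._recent: dict[str, deque[float]] = defaultdict(deque)

    def matches(self, order: Order) -> bool:
        times = self._recent[order.customer]
        times.append(order.timestamp)
        # Drop timestamps that have fallen out of the one-hour window.
        while times and times[0] <= order.timestamp - WINDOW_SECONDS:
            times.popleft()
        return (
            order.value > MIN_ORDER_VALUE
            and len(times) >= MIN_ORDERS_IN_WINDOW
            and order.in_stock_tomorrow
        )
```

The stateful part (the per-customer window) is what separates this from simple per-record routing, and is the piece that engines like Kafka Streams or Flink manage for you at scale.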

Kafka streams, Flink and Dataflow are super powerful and I think there is room for a GUI tool.

Would be great to hear experiences of NiFi in this domain or discuss the space with any experienced users. Will add contact details in my profile.

Have you heard of BPEL? :D

From wikipedia:

> the open BPMS movement led by JBoss and Intalio Inc., IBM and Microsoft decided to combine these languages into a new language, BPEL4WS. In April 2003, BEA Systems, IBM, Microsoft, SAP, and Siebel Systems submitted BPEL4WS 1.1 to OASIS for standardization via the Web Services BPEL Technical Committee

if that doesn't scare you, then oh boy...

Just look at the list of organizations mentioned there, I don't know if you could make it more enterprisy.

For those who were around during the mid-2000s, is this basically another revival of SOA (Service Oriented Architecture)?

I watched one of the explanation videos and it brought back memories.

My dislike of the phase back then, which I hope they've addressed now, is that while everything looked fine and dandy while designing things on a UI, when something broke it was a whole heap of generated XML no one could read.

So Apache has at least a handful of software packages that do about the same thing, but with different interfaces and connectivity?

This has sorta been my experience with a lot of Apache projects recently. The differences between them are becoming quite nuanced.

I'm trying to piece together the main reason why someone would pick this over Camel, or vice-versa. I know they're different - but not night and day.

So, the comments here have mostly ranged between neutral to negative regarding their experience.

I have a problem where I want to stream data to an ML layer and then stream that to a web app (e.g. Laravel or Django).

Reading the docs here, this seems like this would solve this problem, but was wondering if people had alternatives given that people seem to think poorly of this application.

Nifi is fantastically good at one thing, which is dataflow. Where you've got data coming in at point A, but you need it at point B, and for some reason can't convince either A or B to connect directly.

It's not a message bus, nor is it a data processing framework, nor a scheduler, nor an ETL tool. If you try to use it for one of those you're in for a bad time.

What you're describing sounds like you might need a message bus (think ZeroMQ, Kafka, etc). Assuming you're writing the software yourself and want to connect it together.

could you elaborate on the difference between what op is asking for and what nifi/airflow does? to me the use case of moving data through a couple of different services could be solved by a message bus OR nifi.

Nifi is designed to handle the problems that crop up when two systems can't talk directly to each other. It puts a buffer between them to allow one process to keep sending data when the other isn't quite fast enough, it can do some basic transformation when the data isn't quite in the right format, etc.

However if you just need something to send messages then you're better off using a tool that does just that, you don't need the overhead of a system that can connect arbitrary applications that talk in incompatible protocols you just need a single protocol that allows your applications to send each other data.

In their docs Nifi calls itself a dataflow tool and calls dataflow a necessary evil. It's the band-aid you need when you've got a mismatch between the way data is generated and the way it is consumed. It would be insane to deliberately create such a mismatch just to use Nifi.

ok that makes sense. but in that case wouldn't it be easier to write your own adapters for each data source?

Well no, Nifi has a lot of adapters that you can use out of the box, making them yourself isn't easier.

> nor an ETL tool

Can you disambiguate that further? I think you mean that an ETL tool would handle the external interface work whereas Nifi is dealing with data that is somewhat inside your control. Is that an OK description?

Often in ETL the transformation part becomes a goal in itself. In Nifi there are tools to transform data, but the Extract and Load parts work best; it has interfaces for quite a lot of different systems.

And maybe it's just me but everything I've seen being called ETL handled data in batches, in which case you really want support for scheduling, error handling and retries, which Nifi does not really have (there is a 'retry' function somewhere, but I found it confusing and it only seems to work for a single data item, not a good idea when you've got thousands of them). I much prefer Airflow for those scenarios.
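
As a rough illustration of the batch-level retry handling the comment above is asking for, here is a plain-Python sketch. It is neither Airflow's nor NiFi's mechanism; `process_batch` and its parameters are hypothetical. Each item is retried with exponential backoff, and items that still fail are collected instead of aborting the whole batch.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def process_batch(
    items: Iterable[T],
    handler: Callable[[T], None],
    max_attempts: int = 3,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[list[T], list[T]]:
    """Run handler on every item; returns (succeeded, failed) lists."""
    succeeded: list[T] = []
    failed: list[T] = []
    for item in items:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(item)
                succeeded.append(item)
                break
            except Exception:
                if attempt == max_attempts:
                    failed.append(item)  # give up on this item, keep the batch going
                else:
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return succeeded, failed
```

The `sleep` parameter is injectable only so the backoff can be observed or skipped in tests; a real scheduler would also persist the failed list so the run can be resumed.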

Thanks so much for that, very helpful!

Besides the workflow projects described in this thread I've been poking around with https://siddhi.io/ a bit. It's interesting when paired up with debezium and kafka (webhooks -> kafka, db WAL -> kafka) or if your needs are light enough to avoid kafka altogether.

I like how it buffers messages. You can basically stop a process and it will continue where it left off. It is easy to create a distributed cluster. It has hundreds of different connectors for external sources. On the other hand, it is very bulky. Making it work with HTTPS was horrible. You cannot just put it behind a reverse proxy.

We only use NiFi on the edge of our data lake. It's very good at bulk loading, pulling log files/sensor data from hundreds of systems into our systems.

However, it does not handle small records well, and deploying custom processors is a pain, so don't use it to replace your stream processing framework.

Our team found this adapter to integrate ML with NiFi pretty handy: https://dev.to/tspannhw/easy-deep-learning-in-apache-nifi-wi...

Thanks! I wrote a few more of those. https://github.com/tspannhw/ExecuteClouderaML

It reminds me of LONI Pipeline[1], which was created for the need of a neuroscience lab to process images of brain scans.

[1] http://pipeline.loni.usc.edu/

I am new to this site, why is there just a link to Apache NiFi on the front page? Is this somehow news? Sorry, not trying to be rude, it just confuses me a bit since NiFi has been around for some time.

Someone found it interesting, I guess. :) Personally, I'm finding the discussion very interesting, since I haven't seen it here before.

Sometimes Wikipedia articles hit the front page. It's fine. Usually.

Is this an open source self hosted equivalent to https://aws.amazon.com/datapipeline/?

Yes, basically. AWS Data Pipeline includes spinning up and tearing down VMs to run the pipelines.

Nifi has a much more extensible architecture though, you can pretty much do anything you want in a data pipeline in Nifi, and open source of course.

Reminds me of the early 2000s when we were all into BPM, graphical or otherwise. The drawbacks are pretty obvious. I bet the engineers who built it had fun, tho.

As a Java programmer back then, watching with horror as consultants brought in their flavor of the month (MQ/Spring/Camel), I could only pad my resume, and pretend to say "never look a gift horse in the mouth".

For someone not involved in the web stack, reading through that page tells me nothing. Can someone tell me what use cases this is meant to solve?

I tried to give an answer here. Does this help?


Does NiFi still use XML output for its source code? It makes it very hard to put under source control. Overall it’s a nice tool and very fast.

Has anyone replaced Mirth with NiFi for HL7 slinging?

Our devs hated it due to ease of accidental flow fsck ups. And good luck scaling it or trying to put it behind an LB!

> An easy to use, powerful, and reliable system to process and distribute data.

But what is it?

Brought to you by the NSA ;-)

I mean, yes. Paid for by $US Tax Payers. I for one am glad to see at least some of my investment returned, regardless of your opinions about the NSA.

inb4 this shit gets bloated to hell like apache httpd

We have a team using this at work. They had built a process and needed it to be put on a VM and run periodically. They said the requirements were a dual core machine and 8gb of ram. The “binary” was like 1.8gb. I’m sure it included a jre and a full nifi runtime, but god damn that is ridiculous. Had this process been built using go or crystal or something like that it probably would have been less than a megabyte and able to run with 512mb of ram.

From the spec (8gb) it appears that NiFi was not needed. Were you co-locating zookeeper on the same nodes?

Also you say it needed to be run periodically? It’s supposed to be a long running service. If you need something that you can spin up and shutdown in a container or VM then it is probably not the solution.

Use it if you have high volumes of data you need to transport from A -> B. We run on a cluster with 256gb ram, 128 core per node.

The reason the binary is huge is that it contains all the jar/nar files for the connectors (https://www.nifi.rocks/apache-nifi-processors/). If you only require a subset of them you can use the minifi version, which is quite small (49 MB for java and 3.2 MB for the c++ version).

NiFi is intended as a serious workhorse; we run it on a cluster with 256GB of RAM and over 20 cores for each server (on-prem), so the 2GB of disk storage is negligible at any sort of scale (just like the dual core/8gb machine).

Interesting - can you expand on this: "Had this process been built using go or crystal or something like that it probably would have been less than a megabyte and able to run with 512mb of ram"?

Well, when one considers "no payload/data" flows through this system and it is merely managing metadata & workflow status, I wonder if it is web-scale enough :P your team will now ask for a cluster of these monstrosities.
