* I've seen customers fall into the trap of thinking that, because it's drag and drop, they don't need expensive developers; anyone who can use a mouse can crack on with NiFi.
* It persisted its config to an XML file, including the positions of boxes on the UI. Trying to keep this config in source control with multiple devs working on it was impossible.
* Some people take the view that you should use 'native' NiFi processors and not custom code. This results in huge graphs with thousands of processors and lines between them that you have to follow. It's made both better and worse by being able to descend and ascend levels in the graph. The complexity quickly becomes insane that way.
* You're essentially programming with it. I've no doubt you could use it to write, say, an XMPP server if so inclined. Which means you can do a great many things of huge complexity. Programming tools have developed models for inheritance and composition, abstraction, static analysis, etc., which NiFi just didn't have. The amount of repeated logic I've seen its configuration accumulate is beyond anything I've seen from any novice programmer.
I ended up feeling like it could be an OK choice in a very small number of places, but I never got to work on one of those. The NSA linking together multiple systems with a light touch is possibly one such use case. For most everyone else, I couldn't recommend it.
That is sort of the key problem I see with NiFi (and equivalents). The heavy emphasis on a graphical UI and visual paradigm sort of implies that it's oriented towards non-developers, but the problem is that it doesn't make non-developers suddenly expert system architects or developers, even if they manage to click through the UI. And many developers probably prefer just defining stuff in code instead of having fancy UIs. So it sort of falls between these two categories.
Of course there is a huge spectrum of skill in people, and there are probably plenty of "semi-technical" persons to whom this is a perfect match, especially if supported by some more techy people.
Sounds like it's improved a bit since then.
I have no experience with NiFi itself... just that graphical tools are not inherently "non-technical".
It's more that graphical tools don't generally serve their stated purpose of making it possible for non-programmers to get things done because it appears that the main skill behind programming is actually problem decomposition and modelling rather than syntax.
Additionally, graphical tools tend to have a bunch of problems inherent to them, such as being unable to write comments, being harder to store in git and work on collaboratively, and having only a single editor, which is usually much buggier than a text editor and compiler.
So: graphical "languages" don't make most things much easier, and make other things much harder.
Totally. This is where declarative and intention-oriented systems shine. Take something like SQL, where, in the majority of cases, the end-user needs to know next to nothing about algorithmic complexity and can still achieve excellent performance and correct results.
It'd be neat to see a system that was actually designed toward problem decomposition. What's the state of the art around this kind of stuff?
Mind, though, that this system had made a ton of money for 15 years, so in a way it was a huge win for them! They just eventually painted themselves into a corner and performance started to get worse faster than Moore’s law could save them. Last I heard, the great untangling is still in progress, and I stepped away from that project a few years ago.
They likely have still come out on top when you take the sum of revenue they made from the system, but they incurred a huge unanticipated cost and got backed into a pretty bad corner. Sales people were still selling features that didn’t exist, and the team was desperately trying to build new things while also fixing the ever degrading performance.
We have recently released a GUI tool to do data transformation/ETL ( https://www.easydatatransform.com ). It is aimed more at power Excel users than at programmers or data scientists. We have tried to overcome some of the issues mentioned here:
* It is only intended for the limited domain of data transformation, which limits the complexity.
* No need to remember any syntax/commands for the vast majority of use cases.
* The transformation template is stored in a single XML file, so it can be versioned.
* It is written in C++ and is fast.
Obviously it can't do everything you can do with a general purpose programming language, but it can hopefully do enough for many use cases and is much easier to learn.
Above a certain complexity they become very hard to organize. If the complexity of your problem is below this level and the user has any self control, all is good. If the complexity of your problem is higher, you'd need a programmer who knows how to handle this anyway.
No. But it's controversial to claim that a graphical tool can produce maintainable code.
Everything under source control, defined as code.
* This gives you the ability to trace who did what
* Good code-commit hygiene with PRs
* The ability to revert changes
* Ability to have an exact replication and promote across environments
* Pragmatically test changes; unit test your flows (see the sketch after this list)
* Tie all the above up with a deployment pipeline
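NiFi ships a mock framework (nifi-mock) for exactly this kind of processor/flow unit testing. A minimal sketch, assuming a hypothetical custom processor MyProcessor that upper-cases content and exposes a SUCCESS relationship:

```java
import java.nio.charset.StandardCharsets;

import org.apache.nifi.util.MockFlowFile;
import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.jupiter.api.Test;

class MyProcessorTest {

    @Test
    void upperCasesContent() {
        // Runs the processor in-memory; no live NiFi instance needed.
        TestRunner runner = TestRunners.newTestRunner(new MyProcessor());

        // Queue one FlowFile and execute a single scheduling pass.
        runner.enqueue("hello".getBytes(StandardCharsets.UTF_8));
        runner.run();

        // Assert routing and content.
        runner.assertAllFlowFilesTransferred(MyProcessor.SUCCESS, 1);
        MockFlowFile out =
                runner.getFlowFilesForRelationship(MyProcessor.SUCCESS).get(0);
        out.assertContentEquals("HELLO");
    }
}
```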
NiFi is being used in the critical parts of business data pipelines, and all the DevOps rules for reliable development and deployment go out the window.
If NiFi is used by one person for a minor role the GUI is fine. But if it’s the core data pipeline which is a critical service used by multiple teams it’s not the smartest way to do things.
It's a similar train of thought that leads a lot of game developers to look down on games made with GUI tools like GameMaker and such.
Programmers have seen these sorts of UIs fail to deliver countless times as described in the root of this thread. Where is the counter example?
I'm honestly all for tools that enable non-programmers to get things done. E.g., Excel, Google Forms, Zapier, and a lot of Salesforce tooling all deliver to some extent on that promise. But there's only so much they can do.
You're right that there's also an elitism problem. But if we're going to get past that, we have to start by understanding what GUI tools are good for. And being honest about their limits.
I see it as very useful to automate certain operations (watch S3 storage, take action as soon as an object comes in, and store it in a DB), as for such use cases it's pretty much drag and drop.
I left a comment with my thoughts and we pretty much agree on the exact same issues. Good to see I'm not the only one, as I'm seeing some organisations use it more and more, to the extent of insisting that any data ingested into their systems needs to go via NiFi. Granted, some of these are extremely large and dysfunctional companies.
It installs like an appliance and feels like you are grappling with a legacy tool weighed down by a classic view on architecture and maintenance.
We had built a data pipeline and it was for very high-scale data. The theory of it was very much like a TIBCO type approach around data-pipelines.
Sadly the reality was also like a TIBCO type approach around data-pipelines.
One person's experience and opinion, and I am super jaded by it due to a vendor cramming it down one of our directors' throats, who subsequently crammed it down ours even after we warned how it would turn out. It ended up being a very leaky and obtuse abstraction that didn't belong in our data pipeline once you considered how it would be maintained longer-term.
I ultimately left that company. That had as much to do with their leadership and tooling dictation as anything else; NiFi was one of many pains. I am sure there are places using NiFi that will never outgrow the tool, so take it with a grain of salt.
Said company ultimately struggled for the very reasons those of us who left were predicting (the tooling pipeline was a mess, the team was thrashing on trying to get it right, and things were constantly breaking because this solution, among others, was forced into the flow. Lots of finger-pointing).
Sucks to have that "I told you so" moment when you never wanted that outcome for them... I just couldn't be a part of their spiral anymore.
We were able to pass data around in incredibly lightweight ways leveraging Spring, sometimes even just RestRepositories, transforming the object to our data representation by hand; it was never more than 100 lines of code for the entire thing. You could spit one out in an hour... the time was really in composing them and ensuring the architecture reflected the world and was still sensible/manageable.
We ultimately faced issues with running microservices and the licensing cost of that. Our enterprise was sadly too big; they didn't realize they needed to price their internal infrastructure competitively with legacy vendors.
You could get a WAS box for 50k and cram so much onto that server that it was bursting... the price didn't change. On the other hand, each microservice brought a cost, which added up.
The economics didn't make sense, and it was a new political battle to fight with someone who had zero understanding of marketing what they'd built. It just wasn't worth it. Lambdas would have been an option, or something more ephemeral/serverless... but the options just weren't there for us at the time.
Enter NiFi and this "new data pipeline" and the circus began.
This is actually a fair and well-articulated point of view. NiFi is currently an "appliance" like you said. Worse, it's a Pet and not a Cattle.
I believe there is active work in the community to address some of that pain. For example, there was a recent addition to NiFi called "stateless NiFi" which enables NiFi to better run in Kubernetes and other "cloud" architectures.
It's not there yet; it's still what would be described as a "fat" application. But I believe that eventually NiFi will evolve into more of a command-and-control tool for the cloud and less of something you have to install directly on your hardware. Hopefully we'll see the day when "NiFi-as-a-Service" exists, which would really be an improvement over the current model.
It feels like the answer will end up being something totally different. The reality is that enterprises do well with appliances.
Selling them cattle is hard because the maintenance piece expects a certain level of hygiene, proactivity, discipline.
An appliance sits there and when the thing breaks, you call in someone to fix the box. That relationship between a customer and vendor surprisingly makes for a good selling environment/symbiosis.
It's the Cathedral and the Bazaar in another spectrum...
The goal is even more so to be the interconnect for all systems across a varied enterprise, at a higher level. It's all pub-sub underneath the hood. Think cheap butts in seats doing the same work for a "negligible hit on performance".
In the same way you can plug random devices into outlets around the house all served by some powerplant you don't know (or even need to care about), TIBCO attempts to provide that same interface.
Data does need some restructuring, whether these are aggregations, transformations, etc. So they provide steps in the process where you can perform these operations through a drag and drop UI.
There is an input defined and an output defined in XML that you don't have to code, but that is managed and can be seen. The engine beneath provides the lower layers (routing, bytecode, implementation), letting you just drag blocks around on a screen, "connecting things".
The goal is very pure: I have many people in my organization who know how data flows, and not all of them are developers. How can I enable them to connect my organization without everyone needing to be a developer?
Theory versus practice is always the interesting observation. What I saw happen (as was mentioned somewhere else) is that very strong developers became weak by relying on this tool (or simply left for adequately challenging work). When the world moved on to something else, so much had changed that it was almost a career change to get back into development.
They went from understanding Java 3/4 and JEE to a world of Java 11, Spring, and DI frameworks... I saw a lot of them give up or move over to product management roles. This only made the tension between on-premise infrastructure teams and public cloud teams more divisive and toxic. I don't think it's anything uncommon in other areas; it just feels like we've reached a full revolution in this particular space (and not the first revolution either).
Why do non-technical people need to understand the data flow? It seems like documentation (data dictionaries) would be preferred. Or, are they useful for very non-technical people, while TIBCO data flow understanding is useful for people who are data savvy but not tech savvy?
If you can have less expensive operators driving and mapping the world and place all the smarts in the pipes, you can drive down opex and divert cash to capex for competitive advantage.
Linux and much of the streaming software world is smart people, dumb pipes.
If you invert that you have more automation, predictability, and control at lower cost. The risk is a lot of eggs in one basket: when the market turns, if the company you are buying from mismanages tech, or if they can't keep pace with change... you go along for their ride. Every company big and small falls into this technical debt. I have many opinions on why, as I am sure many do.
There is a lot baked into that comment, but it's the constant tug-of-war every CIO is trying to wrap their head around: how do we do more with less and gain an advantage?
They spent a bunch of money on M&A and eventually had to go private and buy out the founder.
What you're referring to sounds like the legacy version, BusinessWorks 5.x, launched back in 2001. The current generation, BusinessWorks 6.x, provides Eclipse-based tooling just like its closest alternatives (Talend, Mule, Fuse, etc.) and deploys to 18+ PaaSes (k8s, Swarm, GKE, AKS, etc.) or its own managed iPaaS aka TIBCO Cloud Integration. It's aimed at Enterprise Integration specialists at a Global 2000 or F500.
If you're an app developer at a large bank/telco/retailer/airline building integration logic or stream processing or event-driven data pipelines, you're likely to use Project Flogo (flogo.io). It's 3-clause BSD FLOSS and has commercial support and, optionally, commercial extensions available. Oh, and you're likely going to use Flogo apps with Apache Pulsar or Apache Kafka messaging. Both Pulsar and Kafka are available as commercially supported software from TIBCO (Rendezvous and EMS are our traditional proprietary messaging products). Flogo apps can deploy to TIBCO Cloud, a dozen-plus flavors of k8s, AWS Lambda, Google Cloud Run, or as a binary on an edge device.
(Disclaimer: Product at TIBCO. Used to work on BW 6.0 back when the only PaaS was good ol’ Heroku)
I get the feeling you described: NiFi has a... heavy and highly structured feel to it, while lighter alternatives (say Airflow, StreamSets, AWS Glue, or Kafka, though that's a different beast) are not as integrated.
That said, Nifi is incredibly powerful and complete considering it's open source and free.
That's exactly how it looks, thanks for confirming. Will avoid.
I like to think of it like Scribe from FB, but with an extremely dynamic configuration protocol.
The places where it really shines is where you can't get away with those 3 and the problem is actually something that needs a system which can back-pressure and modify flows all the way to the source - it is a spiderweb data collection tool.
So someone trying to do Complex Event Processing workflows or time-range join operations with it will probably succeed at the small scale, but start pulling their hair out at the 5-10 GB/s rate.
So its real utility is in that it deploys outside your DC, not inside it.
This is the Site-to-Site functionality, and MiNiFi is the smallest chunk of it, which can be shrunk into a simple C++ agent you can deploy in every physical location (say, a warehouse or grocery store).
The actually useful part of that is the SDLC cycle for NiFi, which lets you push updates to a flow. So you might start with low-granularity parsing of your payment logs on the remote side, but you can switch your attention over to it and remove sampling on the fly if you want.
If you're an airline flying over the arctic, you might have an airline-rated MiNiFi box on board which is sending low traffic until a central controller pushes a "give me more info on fuel rates".
Or a cold chain warehouse which is monitoring temperature on average, until you notice spikes and ask for granular data to compare to power fluctuations.
It is a data extraction & collection tool, not a processing and reporting tool (though it can do that, it is still a tool for bringing data after extraction/sampling, not enrichment).
A good way to get started with NiFi is to use it as a highly available quartz-cron scheduler. For example, running "some process" every 5 seconds.
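(For reference: NiFi's CRON-driven scheduling strategy uses Quartz syntax, where the seconds field comes first, so an expression like `0/5 * * * * ?` fires every 5 seconds.)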
Disclaimer: I'm an Apache NiFi committer.
An article you might find interesting about its ability to scale.
Disclaimer v2: I used to work at Cloudera
Fundamentally NiFi is a "dataflow engine", a system that can be used to automate data transfer from different and varying types of sources and sinks. It has a fairly usable UI that enables a "dataflow manager" (end user) to perform transformation, routing and delivery of data using a "drag-n-drop" configuration approach.
Getting data into or out of your application/system, or performing simple schema transformations, is a common (maybe tedious) task that most developers face. NiFi helps connect the dots, so to speak, and decouples the receipt/delivery of data away from your application. NiFi comes with a set of "batteries included" connectors for almost every transport protocol you would generally need. And it's modular so you can create your own processing components as well.
NiFi is fundamentally modeled after what's called "Flow-Based Programming", which is a style of programming that facilitates composition of black-box processing units. It can run at an enterprise or IoT level, depending on where that decomposition best fits into your architecture.
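To make those "black-box processing units" concrete, here is a minimal custom processor sketched against the public org.apache.nifi.processor API; the class name and the trivial upper-casing logic are purely illustrative:

```java
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// An illustrative "black box": reads a FlowFile's content, upper-cases
// ASCII bytes, and routes the result to a single "success" relationship.
public class UppercaseProcessor extends AbstractProcessor {

    public static final Relationship SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Upper-cased FlowFiles")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return; // nothing queued on the incoming connection
        }
        flowFile = session.write(flowFile, (in, out) -> {
            // Content is streamed from NiFi's content repository in chunks,
            // so large payloads never need to fit in memory at once.
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    buffer[i] = (byte) Character.toUpperCase((char) buffer[i]);
                }
                out.write(buffer, 0, read);
            }
        });
        session.transfer(flowFile, SUCCESS);
    }
}
```

Packaged as a NAR, something like this shows up in the UI palette alongside the built-in processors.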
(disclaimer: I'm affiliated with the NiFi project)
We use camel with DSLs to make programmatic workflows that connect data flows together. However Camel itself doesn't typically carry the data. Sometimes it SFTPs files around etc., but mostly, it is just a messaging layer.
Is that the main difference here?
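For reference, a typical Camel Java DSL route of the kind described; the endpoint URIs are made up, and the CSV data format assumes the camel-csv module:

```java
import org.apache.camel.builder.RouteBuilder;

// Camel mediates and transforms messages between endpoints; the heavy
// data mostly lives in whatever transport or broker the endpoints use.
public class OrdersRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("sftp://example.com/inbound?username=demo") // poll for files
            .unmarshal().csv()                           // parse CSV rows
            .to("kafka:orders?brokers=localhost:9092");  // publish downstream
    }
}
```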
1. Move data from A to B
2. Move data from A to B, but perform some intermediate processing in-between (ETL)
3. Grab data from A, write it to [INSERT SYSTEM HERE] (Mongo, Kafka, many others)
4. Periodically run an arbitrary script or program (like cron but you can schedule in seconds)
Those are just some use cases that people tend to start off doing. It's important to note...the above doesn't sound very impressive until you think about the following features:
1. Build these pipelines entirely from a UI. You can leverage custom code, but the number of processors that come out of the box is astounding. If you can think of it, it most likely already has something built-in.
2. Safety nets by default. Data is backed by a high-performance WAL that allows for recovery in the case of a failure.
3. Ability to pause and resume specific areas of a pipeline at any time. For example, you may have a processor that's receiving data and perhaps the database you write the data to is offline (or has moved). Data is automatically buffered on disk so when the DB comes back online, you can backfill the data that you've been consuming. Further, none of the data was dropped in this process.
4. Tunable and verifiable back pressuring out of the box (_very_ hard to get right when writing custom code; see the note after this list)
5. Easy to prioritize certain events over others (e.g. higher priority, latency sensitive data)
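(For context on point 4: back pressure in NiFi is configured per connection; each queue has an object-count threshold and a data-size threshold, 10,000 FlowFiles and 1 GB by default, and upstream processors stop being scheduled once either is exceeded.)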
Lastly, when you start getting comfortable using NiFi, you will realize how general purpose the engine is. Data is represented as a "FlowFile". A "FlowFile" is essentially just an Array<Byte>. This means you can operate on virtually anything. Further, you can operate on large data since it's backed by a WAL and NiFi facilitates the ability to stream data from disk when you want to process it. As long as you don't read the entire thing into memory (of course).
Easiest way to get started is to just download the binaries and run it. Then go to http://localhost:8080/nifi and just mess around. There are plenty of tutorials online (blogs, YouTube, etc.). Once you get comfortable using NiFi, people will be blown away by how fast you can get something up and running. Things that would normally take days or weeks to get into production, you can routinely do in hours or minutes.
Hope this helps!
edit: You can download from https://nifi.apache.org/download.html. Just unzip/untar the package and run "./bin/nifi.sh run"
NiFi gives insight to your enterprise data streams in a way that allows "active" dataflow management. If a system is down, NiFi allows dataflow operations to make changes and deal with problems directly, right at tier 1 support.
It's often the case that an enterprise software developer has an ongoing role of ensuring the healthy state of the applications from their team. They don't just develop, they are frequently on call and must ensure that data is flowing properly. NiFi helps decouple those roles, so that the operations of dataflow can be actively managed by a dedicated support team that is more tightly integrated with the "mission" of their dataflow.
NiFi additionally offers some features that most programmers skip to help with the resiliency of the application. For example:
- the concept of "back pressure" is baked into NiFi. This helps ensure that downstream systems don't get overrun by data, allowing NiFi to send upstream signals to slow or buffer the stream.
- data provenance, the ability to see where every piece of data in the system originated and was delivered (the pedigree of the data). Includes the ability to "replay" data as needed.
- dynamic routing, allowing a dataflow operator to actively manage a stream, splicing it, or stopping delivery to one source and delivering to another. Sources and Sinks can be temporarily stopped and queued data placed into another route. Representational forms can be changed (csv -> xml -> json, avro), and even schemas can be changed based on stream.
Anyone can write a shell script that uses curl to connect with a data source, piping to grep/sed/awk and sending to a database. NiFi is more about visualizing that dataflow, seeing it in real-time, and making adjustments to it as needed. It also helps answer the "what happens when things go wrong" question, the ability to back-off if under contention, or replay in case of failure.
(disclaimer: affiliated with NiFi)
Out of the box it is incredibly powerful and easy to use; in particular, its data provenance, monitoring, queueing, and back pressure capabilities are hard to match; a custom solution would take extensive dev work to even come close to these features.
It is not code, and that means it is resistant to code-based tooling. For years its critical weakness was migrating flows between environments, but this has been mostly resolved. If you are in a place with dev teams and separate ops teams, and lots of process required to make prod changes, then this was problematic.
However, the GUI flow programming is insanely powerful and is ideal when you need to do rapid prototyping, or quickly adapt existing pipelines; this same power and flexibility means that you can shoot yourself in the foot. As others have said, this is not a tool for non technical people; you need to understand systems, resource management, and the principles of scaling high volume distributed workloads.
This flow-based visual approach makes understanding what is happening easier for someone coming in later. I've seen a solution that required a dozen containers of Redis, two programming languages, ZooKeeper, a custom GUI, and mediocre operational visibility get migrated to a simple NiFi flow that was 10 connected squares in a row. The complexity of the custom solution, even though it was very stable and had nice code quality, meant that it became legacy debt quickly after it was deployed. Now that same data flow is much easier to understand, and has great operational monitoring.
- limit NiFi's scope to data routing and movement, and avoid data transformations or ETL in the flow. This ensures you can scale to your network limits, and aren't cpu/memory bound by the transformation of content.
- constrain the scope of each instance of nifi, and not deploy 100s of flows onto a single cluster.
- you can do a lot with a single node; only go to a cluster for HA and when you know you need the scale.
I know of a massive installation, which is about to be open sourced, where Apache NiFi is used in the middle of the stack as a key component. No dismissal of the capabilities this package offers is intended.
slides [slide #32]: https://static1.squarespace.com/static/5c2f61585b409bfa28a47...
I have used Nifi a little bit and Airflow not at all. Reading the home pages of the two products, it's hard for me to know when it would be more appropriate to use Airflow than Nifi.
They both schedule jobs and move data according to control flow topologies that you build in a GUI, right?
- Would love to see the ability to develop custom NiFi processors in Go/Rust/Elixir etc.
- XML is a big pain in the rear.
- Being container-aware is big win. Stateless is even better.
I see a good opportunity there for users like me to explore NiFi's capabilities in the future.
> They both schedule jobs and move data according to control flow topologies that you build in a GUI, right?
Airflow, on the other hand, is designed to run scheduled jobs (whether batched or otherwise). The 'job' can really be anything: build / data processing pipelines, system configuration management pipelines, and so on. In Airflow parlance, one creates connected DAGs as pipelines that massage the data the way you intend.
They both share some commonalities but I do gravitate towards their use cases being subtly different and an important one highlighted above.
There are a few articles in the old comments that explain its use case a little.
As far as I know, it at least partially fulfills the role of the "enterprise application integration" pattern; the idea is a sort of proto-servicemesh where you have a bunch of weird enterprise applications all around and you need to get them talking to each other, so you plop a fancy EAI middleware thingy in the middle which then talks to each of the applications individually, converting and transforming data from one format and protocol to another. But it's not just a dumb hub: in addition to doing arbitrarily complex transformations, it can also make "routing" decisions based on the data itself. And, this pattern being a product of its time, instead of defining the network in code in some nice DSL, everything is just "configured", and lots of things are achieved by clickety-clicking through a GUI. I suspect this is at least partially because "configuring a turn-key product" sounds less scary to managers than "developing a system on a framework".
If you look at the docs page of NiFi and scroll quickly through the long list in the left sidebar with all the "processors", you already can get an idea what sort of things it does.
While this sort of solution can be very useful in some situations, I wouldn't necessarily start designing a greenfield architecture around NiFi. But if you end up running it, you might find that piece by piece it accumulates all sorts of bits and bobs because it's "convenient" to throw them in there, while the logic flows become more and more eldritch in nature.
> Over the years dataflow has been one of those necessary evils in an architecture. [...] NiFi is built to help tackle these modern dataflow challenges.
You create sources of data and can have non-technical (or maybe non-developer) roles wire together these data pipelines with transformations, aggregations etc.
My last experience with TIBCO was them telling us the product didn't support anything but running on bare metal because "virtualization does not guarantee the order of writes to disk" (their words not mine)
Today you will find even our traditional brokered messaging products like EMS lifted and shifted on AWS/Azure infra.
OTOH, if you are building new apps - you're likely to deploy TIBCO's integration or stream apps / pipelines using Flogo/Pulsar/Kafka on k8s-lambda-gcr or on a fully managed service on cloud.tibco.com.
(Source / Disclaimer: I do product stuff over at TIBCO Software. Happy to chat if there are further qns: rkozhikk<remove-this-at>tibco<dot>com)
> An easy to use, powerful, and reliable system.
This is the title. That's the most important sentence, and it's absolutely meaningless.
It's bad enough that everything has to "sell" - just describe to me what your product does and I'll decide if I need it or not. Don't try to convince me.
If you have to sell, do it by differentiating yourself from your competitors. No one is calling themselves "Difficult to use, weak, and unreliable", so saying the opposite is not differentiation.
When did we accept marketing-speak as the default mode of communication? Can't we have some landing pages that are essays? Or even a few paragraphs instead of trying-to-be-catchy bullet-point phrases in a large font?
Well, yeah, it's meaningless if you cut off the second half...
> ...to process and distribute data.
That's what it does. The adjectives before it aren't the meat of the sentence.
* It doesn't need much in the way of dependencies to run. If you can get Java onto a machine, you can probably get NiFi to run on that machine.
- That is HUGE if you are operating in an environment where getting any new dependencies installed on a machine is an operational nightmare.
* It doesn't require a lot of overhead. Specifically, no database.
* You can write components for it that don't require a whole lot of tweaking for small changes to the incoming data. So, if I have a machine processing a JSON file that looks like XXYX and another machine processing a nearly identical JSON file that looks like XYXX, the tweaks can be made pretty easily.
So, if you're looking for a lightweight, low overhead, easily configurable tool that may be running in an environment where you've got to run lots of little instances that are mostly similar but not quite, NiFi is great.
If you are running a centralized data pipeline where you have a dedicated team of data engineers to keep the data flowing, there are better options out there.
They have a built-in source control product called "NiFi Registry", which can even be backed by git. The workflow for promoting flows between environments feels clunky though, especially as so much environment-specific configuration is required once your number of components gets high enough.
Moving our Java, Ruby or Go code between environments or handling versioning and releases was a piece of cake, in comparison.
If so, how does it compare to SSIS, dbt, and other projects (please name!)?
Otherwise, what is an analogous toolset?
Think: if order value > 100 and the customer has ordered 3 times in the last hour and the product will be in stock tomorrow.
Kafka Streams, Flink and Dataflow are super powerful, and I think there is room for a GUI tool.
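For a sense of what that kind of rule looks like in code today, a rough Kafka Streams sketch; topic names and value types are made up, and the "in stock tomorrow" clause is deliberately left out since it would need a join against an inventory table:

```java
import java.time.Duration;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class OrderRules {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Key = customer id, value = order total (serdes omitted for brevity).
        KStream<String, Double> orders = builder.stream("orders");

        // "Order value > 100": a stateless filter.
        KStream<String, Double> bigOrders =
                orders.filter((customerId, total) -> total > 100);

        // "Ordered 3 times in the last hour": a windowed count per customer,
        // which would then be joined back against bigOrders to fire the rule.
        KTable<Windowed<String>, Long> ordersLastHour = orders
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
                .count();
    }
}
```

Powerful, but you can see why a GUI that assembles this kind of thing for analysts is tempting.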
Would be great to hear experiences of NiFi in this domain or discuss the space with any experienced users. Will add contact details in my profile.
> the open BPMS movement led by JBoss and Intalio Inc., IBM and Microsoft decided to combine these languages into a new language, BPEL4WS. In April 2003, BEA Systems, IBM, Microsoft, SAP, and Siebel Systems submitted BPEL4WS 1.1 to OASIS for standardization via the Web Services BPEL Technical Committee
if that doesn't scare you, then oh boy...
Just look at the list of organizations mentioned there, I don't know if you could make it more enterprisy.
I watched one of the explanation videos and it brought back memories.
My dislike back then, which I hope they've addressed now, was that while everything looked fine and dandy when designing things in a UI, when something broke it was a whole heap of generated XML that no one could read.
I'm trying to piece together the main reason why someone would pick this over Camel, or vice-versa. I know they're different - but not night and day.
I have a problem where I want to stream data to an ML layer and then stream that to a web app (e.g. Laravel or Django).
Reading the docs here, this seems like this would solve this problem, but was wondering if people had alternatives given that people seem to think poorly of this application.
It's not a message bus, nor is it a data processing framework, nor a scheduler, nor an ETL tool. If you try to use it for one of those you're in for a bad time.
What you're describing sounds like you might need a message bus (think ZeroMQ, Kafka, etc). Assuming you're writing the software yourself and want to connect it together.
However, if you just need something to send messages, then you're better off using a tool that does just that: you don't need the overhead of a system that can connect arbitrary applications speaking incompatible protocols; you just need a single protocol that allows your applications to send each other data.
In their docs Nifi calls itself a dataflow tool and calls dataflow a necessary evil. It's the band-aid you need when you've got a mismatch between the way data is generated and the way it is consumed. It would be insane to deliberately create such a mismatch just to use Nifi.
Can you disambiguate that further? I think you mean that an ETL tool would handle the external interface work whereas Nifi is dealing with data that is somewhat inside your control. Is that an OK description?
And maybe it's just me but everything I've seen being called ETL handled data in batches, in which case you really want support for scheduling, error handling and retries, which Nifi does not really have (there is a 'retry' function somewhere, but I found it confusing and it only seems to work for a single data item, not a good idea when you've got thousands of them). I much prefer Airflow for those scenarios.
However, it does not handle small records well, and deploying custom processors is a pain, so don't use it to replace your stream processing framework.
Sometimes Wikipedia articles hit the front page. It's fine. Usually.
Nifi has a much more extensible architecture, though; you can pretty much do anything you want in a data pipeline in Nifi, and it's open source, of course.
But what is it?
Also, you say it needed to be run periodically? It's supposed to be a long-running service. If you need something that you can spin up and shut down in a container or VM then it is probably not the solution.
Use it if you have high volumes of data you need to transport from A -> B. We run on a cluster with 256 GB RAM and 128 cores per node.