
Need help deciding which tools to use for the problem below:

The system is a bunch of batch jobs that are scheduled to run at different intervals. Each job can be modelled as a directed acyclic graph (DAG) of steps. The jobs basically download files from vendors and map the rows inside them into a generic format (for generating reports). There are a lot of vendors, and each vendor can have a different file format containing different fields, so custom business logic is needed to populate (map) the corresponding generic file (aggregating fields, fetching values from the DB, etc.). The vendors' files also sometimes contain errors or are dropped late for download, so failures can happen, and it should be possible to rerun the failed job instances.
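To make the "graph of steps" idea concrete, here is a minimal sketch in plain Java of deriving a run order from step dependencies with Kahn's algorithm. The class `JobGraph`, its methods, and the step names are all made up for illustration; this is not how Spring Batch or any particular scheduler does it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a job is a DAG of named steps.
class JobGraph {
    // step name -> names of the steps that must finish before it
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    JobGraph addStep(String name, String... dependsOn) {
        dependencies.put(name, List.of(dependsOn));
        return this;
    }

    // Kahn's algorithm: repeatedly emit steps whose dependencies are all done.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>();
        Map<String, List<String>> dependents = new LinkedHashMap<>();
        for (String step : dependencies.keySet()) {
            remaining.put(step, 0);
            dependents.put(step, new ArrayList<>());
        }
        for (Map.Entry<String, List<String>> entry : dependencies.entrySet()) {
            for (String dep : entry.getValue()) {
                if (!dependencies.containsKey(dep)) {
                    throw new IllegalArgumentException("Unknown step: " + dep);
                }
                remaining.merge(entry.getKey(), 1, Integer::sum);
                dependents.get(dep).add(entry.getKey());
            }
        }
        // Steps with no unmet dependencies can run right away.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.poll();
            order.add(step);
            for (String next : dependents.get(step)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        // Anything left unscheduled means the graph was not acyclic.
        if (order.size() != dependencies.size()) {
            throw new IllegalStateException("Cycle detected in job steps");
        }
        return order;
    }
}
```

For example, `new JobGraph().addStep("download").addStep("map", "download").addStep("report", "map").executionOrder()` puts `download` before `map` and `map` before `report`. A rerun could walk the same order and skip the steps that already succeeded.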

The existing system is built with Spring Batch and Spring Integration. Its problems are:

1. There are more than 200 jobs, and most of them have their own custom mapping logic, so they cannot be made generic.

2. A lot of manual work is needed to onboard new vendors.

3. Jobs are synchronous and run on only one node, often for many hours.

4. Rerunning jobs is a nightmare.

Dream state for this system:

1. Add jobs dynamically at runtime from reusable generic components, maybe through an API or UI

2. Preferably, the records of a single file are processed in parallel across distributed nodes and merged into a single generic output file

3. Rerunning should be easier
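A rough sketch of the first two points in plain Java, under heavy assumptions: `VendorJobs`, `registerMapper`, `registerJob`, and `run` are hypothetical names, rows are treated as plain strings, and a parallel stream on one JVM stands in for real distribution across nodes. The idea is that mapping components are registered once, a job is just data (a vendor name plus an ordered list of component names) that can be added while the system is running, and the rows of one file are mapped in parallel but collected in their original order.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: jobs are data that refer to reusable mapping components.
class VendorJobs {
    // Reusable row-mapping components, keyed by name.
    private final Map<String, Function<String, String>> mappers = new ConcurrentHashMap<>();
    // Job definitions: vendor -> ordered list of mapper names.
    private final Map<String, List<String>> jobs = new ConcurrentHashMap<>();

    void registerMapper(String name, Function<String, String> mapper) {
        mappers.put(name, mapper);
    }

    // Can be called at any time, e.g. from an API or UI handler.
    void registerJob(String vendor, List<String> mapperNames) {
        for (String name : mapperNames) {
            if (!mappers.containsKey(name)) {
                throw new IllegalArgumentException("Unknown mapper: " + name);
            }
        }
        jobs.put(vendor, List.copyOf(mapperNames));
    }

    // Maps every row of one vendor file; rows are processed in parallel,
    // and the output keeps the input order so it forms a single file.
    List<String> run(String vendor, List<String> rows) {
        List<String> mapperNames = jobs.get(vendor);
        if (mapperNames == null) {
            throw new IllegalArgumentException("Unknown vendor: " + vendor);
        }
        Function<String, String> pipeline = Function.identity();
        for (String name : mapperNames) {
            pipeline = pipeline.andThen(mappers.get(name));
        }
        return rows.parallelStream().map(pipeline).collect(Collectors.toList());
    }
}
```

Because `run` depends only on the job definition and the input rows, rerunning a failed instance here is just calling `run` again with the same file. Vendor-specific logic would still have to be written as its own mapper, so this only moves the custom code into components rather than removing it.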

I am a noob to CS. I have done a good bit of research over the past month and found a few data-science tools in Python, which is a no-no for a production system. I also know that the steps can only be made generic up to a point, since custom mapping logic is required for almost every vendor, but I am asking to see what is possible. Any pointers to promising tools and technologies for solving the above would be much appreciated.

Thanks




Use Airflow maybe?


Looks very promising. Can I add new jobs (DAGs in Airflow's jargon) at runtime, reusing my custom steps (operators in Airflow's jargon)? Also, is there something similar in Java, Go, etc.?



