
It's not that difficult when you work with the right tools.

Most of these problems boil down to a couple operations.

  map
  reduce
  filter
Most of the data comes from a couple places:

  Some kind of SQL system
  Text files
  Webservices (of the JSON/XML/HTML variety)
Most of the data goes back to the same places that it came from:

  Some kind of SQL system
  Text files
  Webservices (of the JSON/XML/HTML variety)
All you need to make are four types:

  inputs: () -> Seq<T>
  outputs: Seq<T> -> ()
  pipes: Seq<T> -> Seq<Q>
  tees: (Seq<T> -> ()) -> Seq<T> -> Seq<Q> (once partially applied, the tee becomes a pipe)
Using the usual sequence operators, pretty much anything is possible with those inputs, outputs, and transforms (pipes), and you avoid a lot of the overhead of building data pipelines.
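
To make that concrete, here is a minimal F# sketch of the four shapes wired into a tiny pipeline; the Order record, the CSV file names, and the "large orders" transform are all hypothetical, purely for illustration:

  // input: () -> seq<T>, pipe: seq<T> -> seq<Q>, output: seq<T> -> (),
  // tee: (seq<T> -> ()) -> seq<T> -> seq<T>. All names below are made up.
  open System.IO

  type Order = { Id : int; Total : decimal }

  // input: read records from a (hypothetical) text file
  let readOrders () : seq<Order> =
      File.ReadLines "orders.csv"
      |> Seq.skip 1                       // skip the header row
      |> Seq.map (fun line ->
          let cols = line.Split(',')
          { Id = int cols.[0]; Total = decimal cols.[1] })

  // pipe: transform one sequence into another
  let largeOrders (orders : seq<Order>) : seq<Order> =
      orders |> Seq.filter (fun o -> o.Total > 100m)

  // output: write the sequence somewhere and return unit
  let writeReport (orders : seq<Order>) : unit =
      orders
      |> Seq.map (fun o -> sprintf "%d,%M" o.Id o.Total)
      |> fun lines -> File.WriteAllLines ("report.csv", lines)

  // tee: partially applying the output turns it into a pipe
  let tee (output : seq<'T> -> unit) (xs : seq<'T>) : seq<'T> =
      let cached = Seq.cache xs           // avoid re-running the source
      output cached
      cached

  // wiring it together
  readOrders ()
  |> largeOrders
  |> tee writeReport
  |> Seq.iter (printfn "%A")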


I don't think creating the pipeline is the difficult part at all. When doing this kind of work I think more about things like:

- Designing usable APIs, which is to say identifying which aspects of System A are most important to System B, how the concepts should be transformed, and how the data should be aggregated.

- Catching errors thrown by one system when data is requested from another system and relaying, translating, or suppressing these errors. Also developing policies for dealing with unreliable services.

- Performance issues.


These are largely trees-for-the-forest type problems.

The first problem is the same as the second (i.e. deciding which data goes where; errors ARE data).

Make all your services unreliable and you solve 90% of your issues, because then unreliable services aren't special.

Performance issues in data work generally stem from two places: under-spec'd hardware and latency. The first is usually best solved by buying better hardware (more drives, not more CPU); the second by buying a network nightmare box and putting 400ms of latency in the network. People stop designing chatty APIs pretty quickly at 400ms of latency.


I don't understand what you're saying.

The first problem is similar to the second, yes. They both still need to be solved, so what is your point?

Unreliable services are everywhere. So what's your universal solution for dealing with them? I'll give a concrete example: suppose I have a JSON service being consumed by a Javascript web UI, and my service needs to hit some kind of authentication backend. In the event that the authentication backend server is [pick one: down, giving 500 errors, being slow], what kind of response does my service give to the Javascript app, and what kind of message or visual cue does the app give to the user? If you think the answer is something other than "it depends on exactly what the application does", then I disagree.

If you think latency is the problem, why are you talking about building in latency? It seems like you actually think chatty APIs are the problem. And, yes, chatty APIs can cause slowness. But chatty APIs often exist because they are the simplest possible design. Once you realize that there is too much back-and-forth you may have to sacrifice API usability and simplicity by adding caching and eager-loading code. Again, you think this is just something that solves itself?


That particular problem is a ternary response (true, false, error), or in general a datatype of:

  type Response<'d,'e> =
  | Data of 'd
  | Error of 'e
which would be handled by a statement like:

  match response with
  | Data d -> doSomething(d)
  | Error e -> doSomethingElse(e)
Or perhaps

  match response with
  | Data d -> Some(d)
  | Error _ -> None
Any errors coalesce to an error on the client, and the client responds: "We're sorry, this doesn't work; we've been notified and are investigating; here's your ticket #"

Each system in the chain should have a reference identifier to make tracking the requests across the system easy.

Chatty APIs often exist because a lot of programmers think that everything happens at the same speed, and they think that having a getter and setter for every field makes their code "object oriented". Putting in 400ms of latency gets programmers to stop thinking that; it gets them thinking about "How can I issue a bunch of requests simultaneously, go do something else (like issuing more requests for someone else), and then respond to the client when I have all the data I need?" It gets them writing async code, or using MARS. Maybe 400ms is really excessive, but 100ms should still let your code run on systems with reasonable geographic separation.
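
In F#, that "fan out the requests, then join" style is just a few lines of async. A minimal sketch, where fetchUser and fetchOrders are hypothetical service calls stubbed out with Async.Sleep:

  // Issue both requests concurrently, then wait for both:
  // roughly one round trip of latency instead of two.
  let fetchUser (id : int) = async {
      do! Async.Sleep 400                 // pretend network latency
      return sprintf "user-%d" id }

  let fetchOrders (id : int) = async {
      do! Async.Sleep 400
      return [ sprintf "order-%d-1" id; sprintf "order-%d-2" id ] }

  let buildResponse id = async {
      let! userWork   = Async.StartChild (fetchUser id)    // start now
      let! ordersWork = Async.StartChild (fetchOrders id)  // start now
      let! user   = userWork                               // join
      let! orders = ordersWork
      return user, orders }

  buildResponse 42 |> Async.RunSynchronously |> printfn "%A"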

Chatty APIs aren't simple; they're generally really annoying, because for decades the predominant idea in programming has been to put as thin a veneer as possible on top of the implementation and call that an interface. It makes for a simple implementation at the expense of a horrible interface. APIs are about the interface.


  > Using the usual sequence operators pretty much anything is possible
I would absolutely love it if this were true. However, I don't know of anyone who has constructed a convincing argument that it is. Even if it were true, it is probably impractical to require all the systems you interface with to be composed in these terms. Is it cheaper to work with a ball of mud (which poisons everything it touches), or is it cheaper to rewrite the ball of mud in the style you describe?

"People build BIG BALLS OF MUD because they work. In many domains, they are the only things that have been shown to work. Indeed, they work where loftier approaches have yet to demonstrate that they can compete.

"It is not our purpose to condemn BIG BALLS OF MUD. Casual architecture is natural during the early stages of a system’s evolution. The reader must surely suspect, however, that our hope is that we can aspire to do better. By recognizing the forces and pressures that lead to architectural malaise, and how and when they might be confronted, we hope to set the stage for the emergence of truly durable artifacts that can put architects in dominant positions for years to come. The key is to ensure that the system, its programmers, and, indeed the entire organization, learn about the domain, and the architectural opportunities looming within it, as the system grows and matures."[1]

Maybe you are correct. Probably. We all hope. But practically, in the 'bigger' problem domains, you can put a lot of smart and experienced people on a project and it still comes out a ball of mud. Maybe there just aren't yet enough experts to go around.

[1] http://www.laputan.org/mud/ (conclusion)

BTW, if you know of any references that articulate your assertion without hand-waving, I would love to read them. I'm currently devouring everything I can about FP, and there's a lot of concrete stuff, even more stuff with hand-waving, and then some alarming articles by people who did their startup in Haskell and wouldn't do it again, even if for out-of-band factors.


Well, I think it's only cheaper at first, and even if people know better they are forced to use a stupid Java framework (that abstracts away the network), even if they know it will end up badly.

Another problem is that people often get taught that you need <insert some bad framework>, that you can't do distributed computing if you don't use some kind of framework. At least at the places I know, they would never tell you something like "just send JSON from one node to the other if that's all you need".

At Clojure/conj there were some talks about this, but the videos are not out yet. See this presentation on Concurrent Stream Processing (https://github.com/relevance/clojure-conj/blob/master/2011-s...) or this one on Logs as Data (https://github.com/relevance/clojure-conj/blob/master/2011-s...).

For another example that works in quite similar ways, look at Storm (in use at Twitter). It's all sequential abstractions. See this video by Nathan Marz (watch all the videos of his you can find): http://www.infoq.com/presentations/Storm

For a more philosophical perspective, look at the videos by Rich Hickey: http://www.infoq.com/presentations/Simple-Made-Easy


"durable artifacts that can put architects in dominant positions for years to come" that statement is why balls of mud work, because the primary alternative to balls of mud are space shuttles designed by architecture astronauts. Space shuttles are generally problems in search of a solution, which is why the ball of mud is seen as the only thing that works. Writing in an imperative language poking and prodding certain bits at certain addresses merely ensures a ball of mud will result.

I would never assert that every type of problem should be solved in this manner, but it's a pretty good framework for taking data from a bunch of different sources and outputting it to a bunch of other destinations. Piping and transforming data is not fundamentally a problem of hierarchical types (a problem somewhat solved by C++/Java/C#) but of type transformation and streams.

Keep your business logic and GUIs built in Java, but use something like this for moving data around the organization and importing/exporting as needed by clients.


Enterprise programming would have been a LOT more fun if I'd been allowed to use filter/map/reduce directly, instead of writing thousands of lines of C++.

In the end, I just resigned to go and do iPhone software for a while. I suspect any one person only has a certain amount of programming in them, and you don't want to waste it.


Haha, I hear ya on that. I worked on a .NET team, and apparently that meant C# only. When F# came out I started writing F# until someone found out ("What's this Fsharp.exe that's crashing the build!?!?!?"). Luckily, I had written so much (in terms of C# code) that I was able to keep on writing in F#. Technically, I wasn't allowed to create new F# code, but I was allowed to maintain old code; oddly enough, most new features made more sense as maintenance on the existing F# code.


Picture several thousand lines of C++ being reduced to 27 lines of Clojure and you get an idea.

That's not to say that you couldn't do something similar in C++, but I've met like 4 really really good C++ programmers in about 15 years of programming, so no, it's unlikely to happen.


What sort of tasks would make such a drastic reduction possible? Genuinely curious, having never written C++.


Higher-order functions. It's like the difference between calculus and arithmetic. If you're adding numbers together you're not going to see much difference. If you try to send a man to the moon... it's going to be a lot less code in a higher level language.


Imagine merging thousands or millions of records into disparate timelines by various attributes, merging similar records, or overwriting, changing lengths...

This is the sort of stuff that filter/map/reduce is really really good at, but also, it can be important code that people will pay you a lot of money to write over a longer period of time....
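
As a toy F# sketch of that kind of merge (the Reading record and the keep-the-last-duplicate rule are made up for illustration):

  open System

  type Reading = { Sensor : string; At : DateTime; Value : float }

  // merge several sources into one timeline, dropping bad rows and
  // collapsing duplicate (sensor, timestamp) readings to the last one seen
  let mergeTimelines (sources : seq<seq<Reading>>) : seq<Reading> =
      sources
      |> Seq.concat
      |> Seq.filter (fun r -> not (Double.IsNaN r.Value))
      |> Seq.groupBy (fun r -> r.Sensor, r.At)
      |> Seq.map (fun (_, dups) -> Seq.last dups)
      |> Seq.sortBy (fun r -> r.At)

Swap the groupBy key or the way duplicates are collapsed and you get the overwrite/merge-by-attribute variants.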


It's true that connecting components is just a matter of mapping one representation to another.

As you say, this mapping can be specified with a function, composed of other functions. But if the mapping is complex, with different levels interacting, writing this function can be difficult. That is, isomorphic mappings are straightforward; but non-isomorphisms (the non-homomorphic aspect) can be tricky.

How well does this approach handle the tricky cases? (an example would be great, if possible)



