> If you don't know any math ... learning a bit Haskell or ML is probably a good idea
This is very much counter to the point I was making. I didn't find that learning Haskell or ML was anything like learning math, especially the kind of math that I've seen people get use out of in day-to-day engineering, and I don't understand why this idea is so persistent.
If you don't know any math, it might be worth learning some of whatever kind of math is applicable in your field. It's probably not Haskell.
> It does help to be more familiar with regexes and regular languages.
Regexes are certainly useful, but I don't think that learning a lot about automata theory is a particularly efficient way to get better at using regexes in practice, compared to, e.g., doing a bunch of drills on https://www.executeprogram.com/courses/regexes
Hm, reading this section again, I don't doubt that you had that experience, but I doubt it generalizes.
I think the "Latin" claim is pretty close to true -- getting exposure to different languages helps your general programming ability (though I have no idea whether it's true for Latin itself :) ). You get to see how other people solve problems and play with different abstractions, and both of those are valuable things.
For my particular experience, I've written about 80% Python code, 15% C/C++, and 5% other languages for 10+ years. So thoroughly mainstream, and I agree that's the way to get work done. But I did do SICP as a freshman in college and I think that permanently changed my thinking about programming.
I also spent a few weeks with Real World OCaml in 2014, and hacked on some open source OCaml code. That experience directly helped me with implementing a language (Oil). That might not generalize either, but I'd still stand by the claim that people who don't know any math could benefit from learning a functional language.
It's a long argument that's been made many times elsewhere, but exhaustive case analysis, recursion, and various types of composition are second nature to some programmers but not others. I also think it helps with the "basic logic, sets, functions" areas you point out. There is A LOT of production code in the world that can use help with these things.
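To make that a bit more concrete, here's a toy sketch of my own (in OCaml, just because it came up above): exhaustive case analysis means the compiler tells you when you've forgotten a case, and recursion/composition fall out of folding small functions over data.

```ocaml
(* A minimal sketch of what "exhaustive case analysis" buys you: the compiler
   warns if a constructor of [shape] is not handled in [area]. *)
type shape =
  | Circle of float            (* radius *)
  | Rectangle of float * float (* width, height *)
  | Triangle of float * float  (* base, height *)

let area (s : shape) : float =
  match s with
  | Circle r -> Float.pi *. r *. r
  | Rectangle (w, h) -> w *. h
  | Triangle (b, h) -> 0.5 *. b *. h
  (* Delete any branch above and the compiler warns
     "this pattern-matching is not exhaustive" instead of the bug
     surfacing at runtime in production. *)

(* Recursion/composition: total area is just a fold over [area]. *)
let total_area shapes =
  List.fold_left (fun acc s -> acc +. area s) 0.0 shapes
```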
Like another commenter, I do think it's possible to swing too far in the other direction :) You probably internalized a lot of stuff a long time ago from either math or Haskell that helps in programming, but then you have "the curse of knowledge" -- it's hard to imagine what it's like not to know those things.
----
I also disagree about regexes -- the main problem is that the syntax varies so much in practice, and people hack or copy and paste until it works (the "regex as lingua franca" paper in my blog post above talks about this). I think learning the simple mathematical concepts of alternation, repetition, etc. first is actually easier than doing drills for JavaScript regex syntax, and you can use that knowledge in practical contexts like the shell (where none of the common tools use JavaScript syntax).
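As a sketch of what I mean by the concepts being separate from any one engine's syntax, here's roughly how an identifier pattern looks when built directly out of alternation, concatenation, and repetition (this assumes the ocaml-re library's combinators; exact names may differ by version):

```ocaml
(* "letter or underscore, then zero or more of letter/digit/underscore",
   anchored to the whole string. The structure is the regex; any concrete
   syntax (POSIX, PCRE, JavaScript) is just a way of writing this down. *)
let identifier =
  Re.(compile
        (seq [ bos
             ; alt [ rg 'a' 'z'; rg 'A' 'Z'; char '_' ]
             ; rep (alt [ rg 'a' 'z'; rg 'A' 'Z'; rg '0' '9'; char '_' ])
             ; eos ]))

let () =
  assert (Re.execp identifier "_foo42");
  assert (not (Re.execp identifier "42foo"))
```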
If none of that convinced you, I now remember that John Carmack (who you quoted!) has advocated functional programming as a way to improve the design of your programs (and he also used Racket for a VR project, as far as I remember):
> Probably everyone reading this has heard "functional programming" put forth as something that is supposed to bring benefits to software development, or even heard it touted as a silver bullet ...
> It isn't immediately clear what that has to do with writing better software. My pragmatic summary: A large fraction of the flaws in software development are due to programmers not fully understanding all the possible states their code may execute in. In a multithreaded environment, the lack of understanding and the resulting problems are greatly amplified, almost to the point of panic if you are paying attention. Programming in a functional style makes the state presented to your code explicit, which makes it much easier to reason about, and, in a completely pure system, makes thread race conditions impossible.
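A toy illustration of "makes the state presented to your code explicit" (my example, not Carmack's): a pure update can only see the state in its arguments and can only change it by returning a new value.

```ocaml
type player = { health : int; position : int }

(* Pure update: the only state this function can see or affect is what's
   in its arguments and its return value. *)
let take_damage (p : player) (amount : int) : player =
  { p with health = max 0 (p.health - amount) }

let () =
  let before = { health = 100; position = 3 } in
  let after = take_damage before 30 in
  Printf.printf "before=%d after=%d\n" before.health after.health
```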
It did take me a while to write this (~20 hours, I think) because I kept worrying about that exact problem, and I kept catching myself writing generic advice rather than concrete experiences.
I tried as much as possible to only write things where I could think of multiple concrete examples in my own experience.
Most of the tasks in the article can now be solved by tools like Airtable or Notion with minimal training. Certainly much less training than would be required to do the same thing using traditional programming tools. So not only is the problem reasonable, we are actually making progress.
A shared database with permissions and forms covers most of the task in the article. Airtable also now has an event/workflow system and a ton of integrations and plugins for doing stuff like sending email.
It's getting to the point where it can handle most of the internal software needs of a small-to-medium company and also has a ton of JavaScript hooks for covering that last 20%.
Hi. I'm the author. Such nostalgia. With tools like Airtable and Notion around today, we're actually making substantial progress towards making simple tasks easy.
How did you choose Zig for Dida, and would that be your language of choice if you were considering a rewrite? You're super clear [0] that Zig was better for you than Rust in this instance.
Oh cool, I enjoyed the posts on differential dataflow and incremental/streaming systems on your new site! If you don't mind my asking, how did you get into independent research?
It's getting there. I started 3.5 months ago and I'm currently making ~75% of minimum wage (which to be fair is pretty high in BC).
I suspect it's also going to be fragile. The next time we hit a recession or, god forbid, another pandemic, sponsorships will probably be the first thing to go.
But it does help fill gaps. There are a lot of things I've wanted to code, study or write about that I think people will get value from but that I could not get any employer to pay for. So I really appreciate being able to work on things just because they're valuable, rather than because there is a way to capture the value and make a profit.
This is why the high temporal locality part of the map is all EC (eventually consistent) - when everything is windowed you can just wait for the window to close.
On the low temporal locality side, the only consistent system I've seen so far is differential dataflow (and Materialize), which does internally track which results are consistent and gives you the option to wait for consistent results in the output.
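To make the windowed case concrete, here's a rough sketch (made-up names, not any particular engine's API) of "wait for the window to close": accumulate per-window state, and only emit a window's result once the watermark has passed its end, at which point the result is final and the state can be dropped.

```ocaml
type event = { event_time : int; value : int }

let window_size = 60
let window_of t = t / window_size

(* window id -> running sum for that window *)
let windows : (int, int) Hashtbl.t = Hashtbl.create 16

let add_event e =
  let w = window_of e.event_time in
  let sum = Option.value (Hashtbl.find_opt windows w) ~default:0 in
  Hashtbl.replace windows w (sum + e.value)

(* Once the watermark has passed the end of a window its sum can never
   change again, so it's safe to emit the result and drop the state. *)
let close_windows watermark emit =
  Hashtbl.filter_map_inplace
    (fun w sum ->
      if (w + 1) * window_size <= watermark then (emit w sum; None)
      else Some sum)
    windows
```

The catch is that in this style the same watermark that makes it safe to emit is also the thing that decides when state gets dropped, which is the tradeoff discussed further down.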
* As of Spark 2.4, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.
* As of Spark 2.4, you cannot use other non-map-like operations before joins. Here are a few examples of what cannot be used.
  * Cannot use streaming aggregations before joins.
  * Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.
* There are a few DataFrame/Dataset operations that are not supported with streaming DataFrames/Datasets. Some of them are as follows.
  * Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DF) are not yet supported on streaming Datasets.
  * Limit and take the first N rows are not supported on streaming Datasets.
  * Distinct operations on streaming Datasets are not supported.
  * Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode.
  * Few types of outer joins on streaming Datasets are not supported.
I haven't looked into the implementation, but I'm guessing they just don't have good support for retractions in aggregates/joins. I also think it's likely to at least fall afoul of early emission and of confusing changes with corrections.
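For reference, this is roughly all that "support for retractions" has to mean for something like a count: inputs carry a diff (+1 for an insert, -1 for a retraction) and the aggregate is just a running sum of diffs. A toy sketch of the idea, not Spark's or differential dataflow's actual implementation:

```ocaml
(* key -> current count; inputs arrive as (key, diff) pairs *)
let counts : (string, int) Hashtbl.t = Hashtbl.create 16

let update key diff =
  let current = Option.value (Hashtbl.find_opt counts key) ~default:0 in
  match current + diff with
  | 0 -> Hashtbl.remove counts key      (* count dropped to zero *)
  | n -> Hashtbl.replace counts key n

let () =
  update "a" 1;      (* insert "a" *)
  update "a" 1;      (* insert "a" again: count = 2 *)
  update "a" (-1);   (* retract one "a": count = 1 *)
  assert (Hashtbl.find counts "a" = 1)
```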
The rest of Spark doesn't really fit the post at all afaict - there is no streaming or incremental update, just batch stuff.
Yeah, I touch on this near the bottom (ctrl-f 'bitemporal'). I think having multiple watermarks would be a neat solution. Kinda like the way Flink sends markers through to get consistent snapshots for fault tolerance.
As I understand it, it unblocks a dataflow to provide answers with respect to a user-time processed as-of a system-time, which is certainly an improvement.
But you'd also have unbounded growth of tracked timestamps unless you're _also_ able to extract promises from the user that no further records at (or below) a user-time will be submitted... a promise they may or may not be able to make.
Yes, I think so. If you want to be able to handle out-of-order data, I don't think there is a way to garbage collect old time periods and still produce correct results unless you stop accepting new data for those time periods.
With the differential dataflow code I wrote for this post, that cutoff point for gc is also coupled to when you can first emit results, creating a sharp tradeoff between latency of downstream results and how out-of-order upstream data is allowed to be. With bitemporal timestamps or multiple watermarks those are decoupled, so you can emit provisional results and correct them later without giving up internal consistency, and also make a totally separate decision about when to garbage collect. That doesn't remove the problem entirely, but it means that you can make decisions about latency and decisions about gc separately instead of them being controlled by one variable.
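A rough sketch of that decoupling (purely illustrative, not the actual differential dataflow or Flink API): keep two frontiers and consult them independently.

```ocaml
type frontiers = {
  emit_watermark : int;  (* drives latency of (possibly provisional) output *)
  gc_frontier : int;     (* drives how out-of-order the input is allowed to be *)
}

let should_emit f window_end = window_end <= f.emit_watermark
let can_gc f window_end = window_end <= f.gc_frontier

let () =
  let f = { emit_watermark = 120; gc_frontier = 60 } in
  assert (should_emit f 100);    (* emit a provisional result now... *)
  assert (not (can_gc f 100))    (* ...but keep state around to apply corrections *)
```

With a single watermark those two predicates are forced to be the same function, which is the sharp tradeoff described above.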
With Flink's DataStream API there are a lot of different levers to pull. The code that Vasia contributed recently waits for watermarks, like differential dataflow. There is also a notion of triggers that emit early results from windowed operators, but these are not coordinated across different operators, so they can result in internal inconsistency.
The Flink Table API, as far as I can tell, mostly just ignores event-time outside of windowed aggregates, so it doesn't have to confront these tradeoffs.
> the broader context is building a view over transaction decisions made by another system(s).
Most of the time this is fine. If upstream is:
* a transactional database, we get ordered inputs
* another system that has a notion of watermarks or liveness (e.g. Spanner), we can use their guarantees to decide when to gc
* a bunch of phones or a sensor network (e.g. mobile analytics), then we pick a cutoff that balances resource usage with data quality, and it's fine that we're not fully consistent with upstream because we're never compared against it
The hard case would be when your upstream system doesn't have any timing guarantees AND you need to be fully consistent with it AND you need to handle data in event-time order. I can't think of any examples like that off the top of my head. I think the answer would probably have to be either to buy a lot of RAM or to soft-gc old state - write it out to cheap slow storage and hope that really old inputs don't come in often enough to hurt performance.
---
This has been an interesting conversation and I'll probably try to write this up to clarify my thinking. I'm not on hn often though, so if you want to continue feel free to email jamie@scattered-thoughts.net
Bunch more in the pipeline that I'm excited about.