GitHub repos of mine are seeing an uptick in strange PRs that may be attacks. But the article's PR doesn't seem innocent at all; it reads as a huge, dangerous red flag.
If any GitHub teammates are reading here, open source repo maintainers (including me) really need better/stronger tools for potentially risky PRs and contributors.
In order of importance IMHO:
1. Throttle PRs for new participants. For example, why is a new account able to send the same kinds of PRs to so many repos, and all at the same time?
2. Help a repo owner confirm that a new PR author is human and legit. For example, when a PR author submits their first PR to a repo, can the repo automatically do some kind of challenge such as a captcha prompt, or email confirmation, or multi-factor authentication, etc.?
3. Create across-repo across-organization flagging for risky PRs. For example, when a repo owner sees a PR that's questionable, currently the repo owner can report it to GitHub staff but that takes quite a while; instead, what if a repo owner can flag a PR as questionable, which in turn can propagate cautionary flags on similar PRs or similar author activity?
About the binary encoding... It's a bit easy to armchair these things, and it's too late for WebAsm now... but if you're on the V8 team, you have access to Google's PrefixVarint implementation (originally by Doug Rhode, IIRC from my time as a Google engineer). A 128-bit prefix varint is exactly as big as an LEB128 int in all cases, but is dramatically faster to decode and encode. It's closely related to the encoding used by UTF-8. Doug benchmarked PrefixVarints and found that both Protocol Buffer encoding and decoding would have been significantly faster had they used a UTF-8-like encoding.
LEB128 requires a mask operation and a branch on every single byte (you can maybe skip the mask on the final byte), so decoding a full 128-bit value takes up to 19 bytes' worth of masks and branches. Using 32-bit or 64-bit native loads gets tricky, and I suspect all of the bit twiddling necessary makes it slower than the naive byte-at-a-time mask-and-branch.
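For reference, here's a byte-at-a-time unsigned LEB128 codec in Python, just to make the per-byte mask-and-branch concrete:

```python
def leb128_encode(n: int) -> bytes:
    """Unsigned LEB128: 7 data bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def leb128_decode(buf: bytes) -> tuple[int, int]:
    """Decode one unsigned LEB128 integer; returns (value, bytes used).
    Note the mask and the branch on every single byte."""
    result = shift = 0
    for i, byte in enumerate(buf):
        result |= (byte & 0x7F) << shift   # mask
        if not (byte & 0x80):              # branch
            return result, i + 1
        shift += 7
    raise ValueError("truncated LEB128")
```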
Prefix varints just shift that unary length encoding to the front, so you have at most 2 single-byte switch statements, for less branch misprediction, and for larger sizes it's trivial to make use of the processor's native 32-bit and 64-bit load instructions (assuming a processor that supports unaligned loads).
There's literally no advantage to LEB128, other than that more people have heard of it. A PrefixVarint is always exactly the same number of bytes as the LEB128 encoding; it just puts the length-encoding bits all together so you can more easily branch on them, and keeps them out of the way of native loads for your data bits.
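Here's a sketch of the idea in Python (my own bit layout, not necessarily the exact one in Google's internal PrefixVarint): the length is encoded in unary in the low bits of the first byte, so the decoder branches once on the first byte rather than once per byte.

```python
def prefix_varint_encode(n: int) -> bytes:
    """A value needing L bytes stores L-1 one-bits then a zero in the
    low bits of the first byte; the data bits sit above, little-endian.
    Handles up to 64-bit values in 1..9 bytes (9-byte case: 0xFF marker
    followed by the raw 8-byte value)."""
    for length in range(1, 9):
        if n < 1 << (7 * length):
            encoded = (n << length) | ((1 << (length - 1)) - 1)
            return encoded.to_bytes(length, "little")
    return b"\xff" + n.to_bytes(8, "little")

def prefix_varint_decode(buf: bytes) -> tuple[int, int]:
    """Returns (value, bytes used). One branch on the first byte."""
    first = buf[0]
    if first == 0xFF:
        return int.from_bytes(buf[1:9], "little"), 9
    # length = trailing one-bits of first byte, plus one
    length = ((first + 1) & -(first + 1)).bit_length()
    return int.from_bytes(buf[:length], "little") >> length, length
```

The `int.from_bytes` call is standing in for the native unaligned load: once the first byte tells you the length, the rest is a single load and shift.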
Also, zigzag encoding and decoding is faster than sign extension for variable-length integers. Protocol Buffers got that part right.
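ZigZag, as Protocol Buffers defines it for 64-bit values, interleaves negatives and positives so small magnitudes map to small unsigned values:

```python
# ZigZag (64-bit): 0, -1, 1, -2, 2, ...  ->  0, 1, 2, 3, 4, ...
# so small-magnitude negatives still get short varint encodings.

def zigzag_encode(n: int) -> int:
    # Arithmetic shift by 63 smears the sign bit across the word; the
    # XOR flips the magnitude bits for negatives.  (Python ints are
    # unbounded, so mask down to 64 bits at the end.)
    return ((n << 1) ^ (n >> 63)) & (2**64 - 1)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)
```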
Note that, for security reasons, if there are no non-canonical representations, there can't be security bugs from developers forgetting to check for non-canonical representations. For this reason, you may want to use a bijective base 256[0] encoding, so that there aren't multiple encodings for a single integer. In the UTF-8 world there have been several security issues due to UTF-8 decoders not properly rejecting non-canonical encodings while programmers did slightly silly checks against constant byte arrays. A bijective base 256 saves you less than half a percent in space, and the cost is only one subtraction at encoding time and one addition at decoding time.
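A sketch of the bias trick (toy framing with the length carried out-of-band, just to show where the single subtraction and addition go; a real varint would fold the length into the prefix bits): a value encoded in L bytes stores n minus the count of all values with shorter encodings, so every integer has exactly one encoding and no byte string is wasted.

```python
# offset(L) = how many values have encodings shorter than L bytes:
# offset(1)=0, offset(2)=256, offset(3)=256+65536, ...
def offsets(max_len: int) -> list[int]:
    offs, total = [], 0
    for length in range(1, max_len + 1):
        offs.append(total)
        total += 1 << (8 * length)
    return offs

OFFS = offsets(8)

def bij_encode(n: int) -> bytes:
    for length, off in enumerate(OFFS, start=1):
        if n - off < 1 << (8 * length):
            return (n - off).to_bytes(length, "little")  # one subtraction
    raise OverflowError("value too large for this sketch")

def bij_decode(buf: bytes) -> int:
    return int.from_bytes(buf, "little") + OFFS[len(buf) - 1]  # one addition
```

With the bias, two-byte encodings cover 256..65791 instead of 0..65535, so there is no overlap with the one-byte range and hence no "overlong" form to forget to reject.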
Talk about timely, I recently took the opportunity to unmothball the old propagators idea and have been running a bunch of ideas past Sussman's former student Alexey Radul on a fairly constant basis for the last 3 weeks.
I started a project at https://github.com/ekmett/propagators
which at least gets the basic execution of them right, and have been working on a larger project.
Now, Sussman and Radul manage propagation and track provenance using an assumption-based truth maintenance system (ATMS). This unfortunately results in a 2^n blowup in space, but if you change the scheme somewhat you can treat it more like enumerating solutions in SAT, where you still get a blowup, but in time rather than space -- and as usual with SAT, "it usually isn't that bad".
In any event, a lot of work out there overlaps with the propagators story:
Lindsay Kuper's work on LVars drops idempotence and doesn't deal with 'triggering on change' but gets a parallel computation with forking and writes to "lattice variables" and uses threshold reads to manage communication.
Sussman and Radul's work really needs the notion of the propagators themselves being monotone functions between the semilattices that they are working with. Unfortunately, in neither the paper nor Radul's thesis do they ever actually spell that out.
Once you do spell it out you get something like Christopher Meiklejohn's work on LASP. Both he and Lindsay Kuper have been tackling the issue of composing CRDTs lately. Her notion of threshold reads works well here when it can be applied, because such functions are automatically monotone; there is nothing to check.
On the other hand, we can view things like datalog as a propagator network. You have tables which accumulate information monotonically. You have IDB statements that execute joins between those lattice variables. Now this starts to show some of the differences. There we don't bother sending the entire lattice. In seminaive evaluation of datalog we send deltas. So it becomes interesting to walk back the notion of a lattice variable and to think instead of a commutative (possibly idempotent) monoid acting on an ordered set. Now your "update language" is the monoid, and we can think of this in terms of firing updates at a set and tracking what makes it through, and propagating _that_. I've been playing with various ways to get a closer analogue to datalog processing and to effectively track these partial triggers.
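Transitive closure is the classic illustration of sending deltas rather than the whole lattice: seminaive evaluation joins only the newly derived facts against the base relation each round. A minimal Python sketch:

```python
# Semi-naive evaluation of transitive closure: rather than re-joining
# the whole `path` table each round (the entire lattice), join only the
# newly derived delta against the base edges.

def transitive_closure(edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    path = set(edges)
    delta = set(edges)            # the "update language": new facts only
    while delta:
        # path(x, z) :- delta(x, y), edge(y, z).
        derived = {(x, z) for (x, y) in delta
                          for (y2, z) in edges if y == y2}
        delta = derived - path    # what "makes it through" the join
        path |= delta
    return path
```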
Another issue is that propagator networks as they are currently designed typically do too much work. Too many things are woken up and fired. We can mitigate that somewhat. Consider a propagator that adds two lifted numbers. We can also add a propagator backwards that does the subtraction, etc. This yields a (local) propagator for which, if you tell me any two of the three arguments, I can tell you the third.
A more efficient scheme would be to adopt a "two watched literal" scheme like a SAT solver's. Pick two lattice variables that are still at _|_. When one of those is written to, check to see if you're down to 1 variable still at _|_. If so, we have to "wake up the propagator" and have it start listening to all of its inputs. If not, disconnect this input and start watching a different input that's still at _|_. For 3 variables this isn't a big deal. For a propagator where you can use 499 variables to determine the 500th? You can get some nice zChaff-like speedups!
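A toy Python sketch of that scheme (the `Cell` and `SumPropagator` names are made up for illustration): the propagator watches two unknown inputs, migrates a watch when a watched input is written, and only does real work once a single unknown remains.

```python
class Cell:
    """A write-once cell; None stands in for bottom (_|_)."""
    def __init__(self, name):
        self.name, self.value, self.watchers = name, None, []
    def write(self, value):
        self.value = value
        for p in list(self.watchers):
            p.notify(self)

class SumPropagator:
    """Maintains cells[0] + ... + cells[-2] == cells[-1], waking up
    only when it can actually deduce the last unknown."""
    def __init__(self, cells):
        self.cells, self.wakeups = cells, 0
        for c in cells[:2]:          # watch two (initially unknown) inputs
            c.watchers.append(self)
    def notify(self, cell):
        cell.watchers.remove(self)
        # try to migrate the watch to another unknown, unwatched input
        for c in self.cells:
            if c.value is None and self not in c.watchers:
                c.watchers.append(self)
                return
        unknown = [c for c in self.cells if c.value is None]
        if len(unknown) == 1:        # down to one unknown: wake up
            self.wakeups += 1
            target = unknown[0]
            known = sum(c.value for c in self.cells[:-1]
                        if c.value is not None)
            if target is self.cells[-1]:
                target.write(known)
            else:
                target.write(self.cells[-1].value - known)
```

Note the weakness hinted at below in the CmRDT discussion: writes to unwatched inputs are invisible until a watched input fires, which is fine for write-once lattice values but needs care if updates can be missed.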
We can also use a 2 watched literal scheme to kill a propagator and garbage collect it. If we use a covering relation to talk about 'immediate descendants of a node' in our lattice, we can look for things covered by our contradiction node. These are 'maximal' entries in our lattice. If your input is maximal, you'll never get another update from that input. So if we take our (+) node as an example, once two of the three inputs are maximal this propagator has nothing more to say and can be removed from the network.
We can indicate that we don't want to say anything "new" to a lattice by adjoining a "frozen" flag. You can think of it as saying we'll tell you nothing new from here on out. This winds up being the moral equivalent of Lindsay's quiescence detection.
Now we can look at a propagator network itself as a lattice in one sense, with adding propagators to the network as increasing information. Then the ability to tell the network that it is frozen gives us opportunities to topologically sort the network, in the same fashion as stratified datalog. Now we can be more intelligent about firing our propagators:
I can run the network in bottom up topological order, queuing updates from SCCs in parallel. It gets a little tricky if we want the 'warm start' scheme from 2 watched literals -- you need to make sure you don't miss updates if you want to be able to do some more CmRDT-like cases rather than CvRDTs.
Finally, there are a lot of problems where we can view a propagator network itself as something useful to implement a propagator.
Consider constraint propagation. We can view the classic AC-3 algorithm for enforcing arc consistency as a propagator network. Once it is done, we have arc-consistent domains to enumerate with our finite domain solver, which we can drive in the fashion I mentioned above to avoid the ATMS overhead.
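A compact AC-3 in this propagator-ish style (a textbook sketch, nothing from the repo above): each arc is a little propagator, and shrinking a variable's domain re-queues the arcs that point at it.

```python
from collections import deque

def ac3(domains, constraints):
    """domains: {var: set of values}; constraints: {(x, y): relation},
    where relation(a, b) is True iff x=a, y=b is allowed.  Mutates and
    returns `domains`, monotonically shrinking each one."""
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        rel = constraints[(x, y)]
        # drop values of x with no supporting value left in y's domain
        removed = {a for a in domains[x]
                   if not any(rel(a, b) for b in domains[y])}
        if removed:
            domains[x] -= removed
            # wake every arc (z, x): its support may just have vanished
            queue.extend(arc for arc in constraints if arc[1] == x)
    return domains
```

For example, with x, y both in {1, 2, 3} and the constraint x < y (stated as two arcs), AC-3 shrinks x to {1, 2} and y to {2, 3}.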
On the other hand, a linear programming or mixed integer linear programming solver can also give lattice like answers as you add cutting planes to the model and monotonically increase the information present. In the MILP case we typically loop over these considering relaxations which inform us of new cuts we can make, etc.
Both of these are computations that _use_ lattice variables with propagators between them to build a better propagator themselves.
They are also nicely both fairly 'convex' theories. You could run propagators for constraint programming and MILP on the same variables and they can mutually reinforce with nicer guarantees than the dumb (+) propagator that I mentioned.
That one runs into issues when you go to do something as simple as y = x + x and try to run it backwards to get x, because locally it looks to the propagator like 2 unknowns. We could of course try to globally rewrite the propagator network by adding propagators to work around that; in this case it works if you transform it into y = 2 * x, and now the 2 is known, as is y, so you can get to x. Here it is probably better to just borrow tools from the clpqr crowd, but that requires an unfortunate amount of introspection on the propagator network.
I have a couple of dozen other sub-domains that fit into this model.
I mostly find it interesting how all of these subtly different domains use almost exactly the same tools with minor differences, and how powerful it is to adapt those differences to other problems in nearby domains.