In order to be in the same G-node, they'd need to have the same rank and be close in value (such that they were not "broken up" by a value in the next highest rank), right?
Seems like a brute-force search for adjacent values with the same rank is possible, but guaranteeing that intermediate higher-rank values don't also exist may not be feasible (for an attacker). Maybe one mitigation against this sort of attack is to search for higher-rank extra values to insert to break up large G-nodes?
This also assumes the attacker knows the hash function (if rank is chosen by a cryptographically-secure hash); maybe salting values before hashing could also thwart these sorts of attacks?
Generating arbitrarily many values of the minimum rank is very easy for an attacker. Since the rank is geometrically distributed with parameter p = 1 - 1/k and k ≥ 2, randomly sampling a value will give you one of minimum rank with probability p ≥ 1/2 and it only gets easier for larger k.
If you want to break that up with dummy elements, you now have the problem of choosing those dummies in a history-independent manner efficiently.
But I think their recursive construction with G-trees of G-trees of ... might work if nodes with too many elements are stored as G-trees with a different, independent rank function (e.g. using a hash function with different salt). Producing many nodes with the same ranks should then get exponentially harder as the number of independent rank functions increases.
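Roughly the kind of thing I have in mind, as a sketch of my own (not the paper's construction): ranks taken as leading zero bits of a keyed BLAKE2b hash (so k = 2), with made-up per-depth salts.

```python
import hashlib

def rank(value: bytes, salt: bytes) -> int:
    """Geometric rank: number of leading zero bits of a keyed (salted) hash.
    P(rank = r) = 2^-(r+1), i.e. geometric with p = 1/2 (the k = 2 case).
    Without knowing the salt, an attacker cannot precompute ranks."""
    digest = hashlib.blake2b(value, key=salt, digest_size=32).digest()
    bits = bin(int.from_bytes(digest, "big"))[2:].zfill(8 * len(digest))
    return len(bits) - len(bits.lstrip("0"))

# Nested construction: if a node collects too many same-rank values, store it
# as its own G-tree ranked with the next salt, so ranks at different depths
# are independent.
DEPTH_SALTS = [b"depth-0", b"depth-1", b"depth-2"]  # hypothetical salts

print([rank(b"some item", salt) for salt in DEPTH_SALTS])
```

Forcing many values into one node then means matching ranks under several independent keyed hashes at once, which is why I'd expect it to get exponentially harder with depth.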
Yes, this exactly. Another really simple way to do this is to use alternating leading and trailing zero counts in the hash in your nested G-trees. Simple, and pretty effective.
Hmmm... if you need to go deeper (because 1/4 of all hashes have zero leading zeros and zero trailing zeros), you can generalize this by converting the hash into its run-length encoding to get a sequence of rank functions where finding values with the same rank for all rank functions is equivalent to finding hash collisions. Very nice.
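Concretely, something like this (my own sketch of the idea, not anything from the paper): use the i-th run length of the hash bits as the rank function at nesting depth i. The full run-length sequence, together with the value of the first bit, determines the hash, so producing two values that agree on every rank really is as hard as finding a collision.

```python
import hashlib
from itertools import groupby
from typing import List

def rank_sequence(value: bytes) -> List[int]:
    """Run-length encoding of the hash bits: element i is the rank used at
    nesting depth i. The first and last runs play the role of the leading and
    trailing zero/one counts from the alternating scheme above."""
    digest = hashlib.sha256(value).digest()
    bits = bin(int.from_bytes(digest, "big"))[2:].zfill(8 * len(digest))
    return [len(list(run)) for _, run in groupby(bits)]

print(rank_sequence(b"some item")[:5])  # ranks for the first five nesting depths
```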
Whoah, I totally hadn't taken the thought experiment this far. This is fantastic! I'd like to explore this further, interested in a quick research chat/sync some time? My email is linked from the paper.
Exactly. But my concern is that this is not any stronger/better than what Prolly trees already offer, which is why I'm disappointed that they are mentioned under "related work", but not discussed/compared in more detail.
You're right, we should go into a more detailed comparison with Prolly trees. We actually have a lot of experience with Prolly trees, and have found, in practice, that you need to do a lot of the things that folks like Dolt have had to do to make them work nicely. Whereas with G-trees, the basic implementation turns out to be quite nice (and extremely easy to reason about).
One of the biggest benefits of G-trees, in my mind, is their ease of implementation. Additionally, we did a lot of work to explore their statistical properties, which hasn't been done for Prolly trees (though in hindsight, we have done this ourselves, so we should probably write it up formally).
As the name implies, the sizes of nodes of Prolly trees and geometric search trees are geometrically distributed. My question is: is this really the right distribution to use? The larger nodes get, the larger the probability that they get mutated. This means that in a content-addressed storage system, there will be more large objects than small ones. My gut feeling tells me that the distribution should be uniform, with the spread between min/max sizes bounded by a small constant factor (2x? 4x?).
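A toy way to poke at this (made-up parameters, just to make the size bias concrete): sample geometric node sizes, then sample nodes proportionally to their size, the way uniformly random single-entry edits would hit them. For a geometric distribution, the mean size of the node an edit lands in works out to roughly twice the plain mean.

```python
import math
import random

MEAN = 64        # hypothetical mean node size (entries per node)
N = 100_000      # number of simulated nodes

def geometric_size(mean: float) -> int:
    """Inverse-CDF sample of a geometric node size with the given mean."""
    p = 1.0 / mean
    return 1 + int(math.log(1.0 - random.random()) / math.log(1.0 - p))

sizes = [geometric_size(MEAN) for _ in range(N)]

# A uniformly random single-entry edit hits a node with probability
# proportional to its size, so rewritten nodes are size-biased.
edited = random.choices(sizes, weights=sizes, k=N)

print("mean node size:          ", sum(sizes) / len(sizes))
print("mean size of edited node:", sum(edited) / len(edited))
```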
Some time ago I experimented with this, where I implemented a content defined chunking algorithm that chunks inputs at locations where the value of a rolling hash is maximal, as opposed to finding offsets at which the first/last n bits are zeros/ones. My observation was that this led to a 2-3% reduction in storage space usage. The source code for this can be found here:
Would it also be possible to model trees around this approach as well? If so, would this lead to better deduplication rates than Prolly/geometric search trees?
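(Not the code linked above, just a rough, unoptimized sketch of the cut-at-local-maxima idea with made-up window sizes; a real implementation would use an O(1)-per-byte rolling hash rather than rehashing each window.)

```python
import hashlib
import random
from typing import List

WINDOW = 16    # bytes hashed at each position (stand-in for a real rolling hash)
HORIZON = 64   # a position is a cut if its hash is the unique max within +/- HORIZON positions

def position_hash(data: bytes, i: int) -> int:
    """Hash of the WINDOW bytes ending at position i."""
    start = max(0, i - WINDOW + 1)
    return int.from_bytes(hashlib.blake2b(data[start:i + 1], digest_size=8).digest(), "big")

def cut_points(data: bytes) -> List[int]:
    """Offsets to cut at: positions whose hash is strictly maximal within a
    +/- HORIZON neighbourhood. Expected chunk size is roughly 2 * HORIZON, and
    a boundary depends only on nearby bytes, so edits only move local cuts."""
    hashes = [position_hash(data, i) for i in range(len(data))]
    cuts = []
    for i, h in enumerate(hashes):
        neighbourhood = hashes[max(0, i - HORIZON):i + HORIZON + 1]
        if h == max(neighbourhood) and neighbourhood.count(h) == 1:
            cuts.append(i + 1)  # cut just after the maximal position
    return cuts

random.seed(0)
data = bytes(random.getrandbits(8) for _ in range(8192))
print(cut_points(data)[:10])
```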
> it seemed natural to have a set of generations for items that have only been seen once, and then another generation for things that have been more active
Have you looked at ARC? It sounds similar - it is a cache split between LRU and MFU areas, where the split point changes dynamically depending on the workload. https://www.youtube.com/watch?v=F8sZRBdmqc0 is a fun watch on the topic.
Does anyone happen to have expertise/pointers on how ZFS' ARC interacts with Linux disk caching currently when using ZFS-on-Linux? It seems like the ARC space shows up as "used" despite being in a similar category of "made available if needed" - is that correct?
Is data in the ARC double-cached by Linux's disk caching mentioned in the post? If so, is it possible to disable this double-caching somehow?
The ZFS ARC unfortunately does not integrate with the kernel's page cache, so they step on each other a lot. ZFS does watch available system RAM and tries to dynamically reduce its usage as memory pressure increases, but I've found its responsiveness here to be far too slow. That, combined with how the ARC appears to just be an opaque block of RAM that cannot be reclaimed, means I usually just set a hard limit on how big the ARC is allowed to get via the module load arguments and am done with it (at least for systems that are doing more than just storage).
At least on FreeBSD, there is a kmem_cache_reap() that is called from the core kernel VM system's low memory handlers.
Looking at the Linux code in OpenZFS, it looks like there is an "spl_kmem_cache_reap_now()" function. Maybe the problem is the kernel devs' anti-ZFS stance, and it can't be hooked into the right place (e.g., the kernel's VM low-memory handling code)?
If I remember correctly, they charge for the BigQuery table you use for the detailed export. In other words: there's a charge for seeing detailed charges!
All other clouds provide a detailed cost breakdown report for free.
proto3 message fields allow for detecting set vs. not set; other field types (repeated, map, int32, string, bool, enum, etc.) have this "default value if not set" issue. The canonical way of handling this is to use wrapper messages (because one can detect if the wrapper is not set), and there are "well-known" canned messages/protos one can import and use without writing their own: https://protobuf.dev/reference/protobuf/google.protobuf/
Whether the codegen/libraries for a particular language provide a more idiomatic binding for these well-known wrappers is up to the implementation - for example, the golang libraries have conveniences added for the well-known types: https://pkg.go.dev/google.golang.org/protobuf/types/known. Rust libraries may have the same; I'm not as familiar with the ecosystem there.
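For illustration, a field using one of those well-known wrappers looks something like this (the message and field names here are made up):

```protobuf
syntax = "proto3";

import "google/protobuf/wrappers.proto";

message JobConfig {
  // Plain scalar: an unset field and an explicit 0 are indistinguishable.
  int32 priority = 1;

  // Wrapper message: the submessage has presence, so readers can tell
  // "not set" apart from an explicit 0 (e.g. HasField("retry_limit") in
  // Python, has_retry_limit() in C++).
  google.protobuf.Int32Value retry_limit = 2;
}
```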
I can definitely sympathize here - in every context, just straight JSON/YAML configuration never seems expressive enough, but the tooling created in response always seems to come with sharp edges.
Here are some of the things I appreciate about Jsonnet:
- It evals to JSON, so even though the semantics of the language are confusing, it is reasonably easy to eval and iterate on some Jsonnet until it emits what one is expecting - and after that, it's easy to create some validation tests so that regressions don't occur.
- It takes advantage of the fact that JSON is a lowest-common-denominator for many data serialization formats. YAML is technically a superset of JSON, so valid JSON is also valid YAML. Proto3 messages have a canonical JSON representation, so JSON can also adhere to protobuf schemas. This covers most "serialized data structure" use-cases I typically encounter (TOML and HCL are outliers, but many tools that accept those also accept equivalent JSON). This means that with a little bit of build-tool duct-taping, Jsonnet can be used to generate configurations for a wide variety of tooling.
- Jsonnet is itself a superset of JSON - so those more willing to write verbose JSON than learn Jsonnet can still write JSON that someone else can import/use elsewhere. Using Jsonnet does not preclude falling back to JSON.
- The tooling works well - installing the Jsonnet VSCode plugin brings in a code formatter that does an excellent job, and rules_jsonnet[0] provides good bazel integration, if that's your thing.
I'm excited about Jsonnet because now, as long as other tool authors decide to consume JSON, I can more easily abstract away their verbosity without writing a purpose-built tool (looking at you, Kubernetes) and without resorting to text templating (ahem, Helm). Jsonnet might just be my "one JSON-generation language to rule them all"!
---
Though if Starlark is your thing, do check out skycfg[1]
> shifting default workflows away from the standard client is one of those changes that doesn't serve users
There is no "default workflow" for git - Git can be used in a variety of ways; everyone has their own personal preference, and some of these workflows are shaped by interacting with systems like Github/Gerrit/etc.
Every place I've ever worked at has either:
- a home-grown wrapper around git to make common operations easy
- a recommended set of git aliases around common operations
The fact that Github has created their own to interact with their own system is a net win for individuals/organizations who would otherwise need to spend time scripting their own.
Of all the capabilities of the Github CLI, `gh pr checkout` is the only one I use, because it makes it easy to fetch a PR locally by ID without configuring a remote per fork. I'm pretty glad I didn't have to write this myself.
Question about this part of the hypermedia vs data APIs essay: "This new end point is driven entirely by hypermedia needs, not data model needs. This end point can go away if the hypermedia needs of the application change"
Presumably these hypermedia endpoints are all equally accessible - couldn't the removal of one break another application if the latter app decided to depend on a particular hypermedia snippet from the first?
If there are multiple teams in an org that own HTMX endpoints, aren't the set of endpoints effectively a sort of interface boundary? Or is there a good way of declaring certain endpoints "private" and ensuring they're only fetched from particular applications? (Maybe separate domains for every app?)
in the sense I'm using, an application is a single functional app: you wouldn't have two applications using a shared URL because that would break the uniform interface
you might have two separate apps that maintain their own hypermedia APIs and that, on the server side, then share a data API between each other
the idea is to take advantage of the uniform interface of a hypermedia API:
> but if we changed the culture to become more emotions-aware, people would know how to help each other and themselves far better.
> Fixing psychological and emotional problems through cultural change just doesn't solve anything.
By my reading, the two quotes are in direct opposition to one another. It sounds like you are advocating for cultural change - just a different type/approach than is currently used?
Cultural change that addresses behaviors or focuses on other external factors can't work; that's outside-in cultural change.
But if you want to fix issues on a systemic, national, or worldwide level (which is the goal here), you do need some kind of cultural change.
Cultural change that focuses on inside-out psychology (not the movie) is what I call "psychological change"; the cultural element is just the "trojan horse" through which any systemic change must happen.
Example:
- banning magazines that show excessively thin models, because girls "become insecure when they look at them", is the thinking I denounce. If they had high self-esteem they couldn't be affected by a magazine cover. If they have low self-esteem, they will be. But the magazine is blamed, and legislation and activist momentum focuses on that. The self-esteem problem doesn't get solved, just displaced. Exactly the same with the "Instagram harms teen girls' mental health" viewpoints. That's only a problem because 80% of the population has low self-esteem and is unaware of it. Shouldn't that be fixed?
Implement a culture where people, for example:
- know the signs of low self-esteem
- know exactly how to address it in themselves, without pop-psych quackery or needing to pay for therapy sessions
- know how to address it in others
- know what assertiveness looks like, and how to do it
- know how to problem-solve personal or relationship problems
- know how to effectively respond to defensiveness, depression, argumentativeness, egotism, insincerity, power struggles, irresponsibility, prejudice, whatever, without getting upset at the other person; how to assert boundaries when faced with people like that, and how to help them
- how to evaluate others' emotional health, so people can make more informed choices in mates and spouses
- know how to evaluate if their relationships are healthy, and what to do about it if they're not
- how to grieve effectively
- could go on, and on.
So many of society's problems come from psychological illiteracy. Again, people's obsession with pop-psychology proves that people see a big need in learning more. Sadly, the pop-psychology craze mostly focuses on superficial things like self-talk, or on trying to, for example, "spot" signs of Narcissism or psychopathy in other people, to try to "protect oneself" from these people; it's not deep enough and doesn't get people to actually understand themselves and each other better; just to project various medical labels onto others and themselves (self-diagnose). What I'm proposing is outside that framework of "mentally ill vs normal" and focuses on empathically learning more about oneself and others and ultimately being able to help each other.
The cultural change is just meant to address that; it's a change in awareness and knowledge, not in behavior.