> What's harder to deal with from a measurement perspective is semantic equivalence: calling some kinds of errors zero-cost, but not having a great way to categorize what is lost, exactly. But it's kinda what you want for really extreme compression: the content is equivalent at a high level, but may be a very different byte stream.
Is basically saying
> What's harder is defining a reconstruction process in terms of a "semantic group", i.e. an output encoding and associated group actions under which the loss is invariant, and having the group actions express the concept of two non-identical outputs being "equivalent for the purposes of downstream processing".
Taco Cohen is one of the pioneers of this line of research, and invariant, equivariant and approximately invariant/equivariant architectures are a big thing in scientific and small-data ML.
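To pin down the phrase "group actions under which the loss is invariant" (my formalisation, not from the quoted comments): for a group G acting on the space of outputs, the requirement is

    % sketch: invariance of the reconstruction loss under the group action
    \ell(x,\, g \cdot \hat{x}) = \ell(x,\, \hat{x})
    \qquad \text{for all } g \in G

so every reconstruction in one G-orbit costs the same, which is the formal version of "equivalent for the purposes of downstream processing".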
"an output encoding and associated group actions under which the loss is invariant"
What loss is that, exactly?
One of the difficulties in speech processing is that we generally don't have a great model for speech quality or equivalence. The human hearing system is a finicky and particular beast, and furthermore varies from beast to beast, depending both on physiology and culture. Good measures (e.g. ViSQOL, a learned speech quality metric) tend to be helpful for measuring progress when iterating on a single system, but can give strange results when comparing different systems.
So it's easy to imagine (say) pushing the generated speech into some representation space and measuring nearness in that space (either absolute or modulo a group action), but that raises the question of whether nearness in that space really represents semantic equivalence, and how to go about constructing it in the first place.
Let alone why one would bother allowing some group symmetries into the representation when we plan to define a loss invariant under those symmetries anyway... As someone who has worked a lot with both group theory and speech compression, throwing group theory at speech representations feels to me like a solution in search of a problem.
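For what it's worth, the "nearness modulo a group action" idea sketches out in a few lines. Everything below is hypothetical (the embed() stand-in, the discretised gain group); it only shows the shape of the computation, not a real metric:

    # Sketch: distance in a representation space, quotiented by a gain group.
    # embed() is a placeholder for any learned speech encoder (hypothetical).
    import numpy as np

    def embed(audio: np.ndarray) -> np.ndarray:
        # Placeholder: a real system would run a trained encoder here.
        return np.fft.rfft(audio).real

    def quotient_distance(x: np.ndarray, y: np.ndarray,
                          gains=np.linspace(0.25, 4.0, 64)) -> float:
        # Distance modulo global gain: minimise the embedding distance
        # over a discretised set of group elements g acting on y.
        # Assumes x and y are equal-length signals.
        ex = embed(x)
        return min(np.linalg.norm(ex - embed(g * y)) for g in gains)

Whether small values of such a distance actually track semantic equivalence is exactly the open question above.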
Group theory is just a way to think about what properties a function needs to have under specific actions on the data. If you want to train a speech network that should be peak-amplitude invariant (to pick a random example) you can normalise the amplitudes, modify your loss to be invariant to it, or modify the network output to be invariant to it. These might have different numerical tradeoffs (e.g. one of the reasons people use equivariant architectures with a final invariant aggregator is that it allows each layer to use and propagate more information, and one of the reasons graph neural networks are a thing is that we don't always have a canonical labeling).
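A minimal sketch of those three options for the peak-amplitude example (plain NumPy, toy loss; my illustration, not from the comment itself):

    # Three ways to make training peak-amplitude invariant (toy example).
    import numpy as np

    def peak_normalise(x: np.ndarray) -> np.ndarray:
        # Option 1: canonicalise the input so the symmetry is gone.
        return x / (np.max(np.abs(x)) + 1e-12)

    def invariant_loss(pred: np.ndarray, target: np.ndarray) -> float:
        # Option 2: make the loss itself invariant by comparing
        # peak-normalised signals.
        return float(np.mean((peak_normalise(pred) - peak_normalise(target)) ** 2))

    def invariant_output(raw_output: np.ndarray) -> np.ndarray:
        # Option 3: constrain the network head so every output already
        # lies in the canonical (unit-peak) slice of the orbit.
        return peak_normalise(raw_output)

All three quotient out the same group (positive rescalings); they differ in where the quotient happens, which is where the numerical tradeoffs come from.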
All the stuff you mentioned is true, but thinking about it in an abstract sense, that means there's a set of universal symmetries and a set of highly context-dependent symmetries, and group theory is afaik our best method of thinking about them rigorously - as I say in the child comment, not the end point, but our current best starting point (in my opinion).
Thanks, I need to read more. I get how a semantic group could arise, but it doesn't seem obvious to me that groups and group actions are necessarily the correct abstraction for representing semantic similarity.
It's not the right one on its own, but it's the closest we have right now for talking about things like "high level equivalence" rigorously. Like, one way to talk about logic is that it's invariant under e.g. double negation - so any compression algorithm that does "semantically lossless compression" can exploit that property. Deep learning then becomes an algorithm for finding features in the data which remain invariant under irrelevant transformations, and representations useful for semantic similarity could then be formulated as a very big and complex composite group, potentially conditional... there's probably a new theory of hierarchical similarity and irreducible collisions/errors required, but as of now, group theory, approximate equivariance and category theory are the bleeding edge of thinking about this rigorously while still having SOTA benchmarks/real-world applicability, to my knowledge.
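A toy illustration of that double-negation point (my own example, not the commenter's): canonicalise formulas modulo the invariance before compressing, so the decoder reproduces a semantically equivalent but not byte-identical string:

    # Toy "semantically lossless" compression: quotient out double negation
    # before entropy coding. Reconstruction is equivalent, not identical.
    import zlib

    def canonicalise(formula: str) -> str:
        # Repeatedly eliminate double negations, e.g. "~~p" -> "p".
        while "~~" in formula:
            formula = formula.replace("~~", "")
        return formula

    def compress(formula: str) -> bytes:
        return zlib.compress(canonicalise(formula).encode())

    def decompress(blob: bytes) -> str:
        return zlib.decompress(blob).decode()

    original = "~~(p & ~~q)"
    restored = decompress(compress(original))
    assert restored == "(p & q)"  # different bytes from the original...
    # ...but logically equivalent to it under double negation.

The loss here is zero by construction for any input on the same orbit of the "remove double negations" action, which is the group-quotient view of "zero-cost errors" from the top of the thread.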