Hypertokens define a spreadsheet-esque, label-based referential coding environment. HTs can be used in any LLM context to create arbitrary operations, as will be shown in the paper. The current expected release is later this month.
I believe he's describing Orthogonal Matching Pursuit. It's a dictionary learning algorithm that can be used to recover sparse dictionaries using L1 regularization.
Not quite, though very related and I believe both should end up with essentially the same result.
Matching pursuit is essentially a greedy algorithm, if I recall correctly (please do correct me if I am wrong): at each iteration you find the component that explains the most of the data, remove its contribution, and then repeat the process on the residual. Pardon if that isn't quite the right explanation, but it's what my intuition is recalling right now…
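For concreteness, here's a minimal sketch of that greedy loop (plain matching pursuit over a fixed dictionary; the function name, the dictionary D, and the iteration budget are all made up for illustration, not anything from the OP):

```python
import numpy as np

def matching_pursuit(y, D, n_iters=10):
    """Greedy matching pursuit: approximate y ~ D @ x with sparse x.
    D is an (n_samples, n_atoms) dictionary with unit-norm columns."""
    residual = y.copy()
    x = np.zeros(D.shape[1])
    for _ in range(n_iters):
        # Pick the atom that best explains the current residual.
        correlations = D.T @ residual
        best = np.argmax(np.abs(correlations))
        # Record its contribution and subtract it from the residual.
        x[best] += correlations[best]
        residual -= correlations[best] * D[:, best]
    return x
```

Orthogonal Matching Pursuit differs in that, after each selection, it re-fits all selected coefficients by least squares against the original signal rather than just peeling off one atom at a time.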
What I was describing was a simpler algorithm that can be done with gradient descent or any other vanilla optimizer.
Your model parameters are the coefficients a_i over all basis functions in the frequency-domain representation. Run them through the synthesis function to get a time-domain signal, and select the values of the time-domain signal where your target data is known. Compute the squared error at each of those locations and take the mean. This is your reconstruction error, and it should be trivially differentiable with respect to the coefficients a_i.

Compute an additional error term which is a standard L1 regularization, i.e. sum(|a_i|), which can be added to the reconstruction error term with some weight λ (λ=1 is even fine here, at least for simple problems), and which is also trivially differentiable (provided you haven't initialized any of the coefficients to 0). As with any L1 regularization term, the resulting solution should be sparse in the L1-regularized parameters (look up visualizations of problems with only 2 model parameters to see how this emerges from the L1 contour lines of equal loss forming "diamonds" with their points on the axes).
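A minimal sketch of that, assuming an orthonormal DCT as the synthesis function and plain (sub)gradient descent (the function name, step size, and step count are arbitrary; a real implementation would more likely use an off-the-shelf optimizer or a proximal/soft-thresholding update for the L1 term):

```python
import numpy as np
from scipy.fft import dct, idct  # orthonormal DCT pair: analysis / synthesis

def fit_sparse_dct(known_idx, known_vals, n, lam=1.0, lr=0.01, n_steps=5000, seed=0):
    """Fit DCT coefficients a so that idct(a) matches the target at known_idx,
    with an L1 penalty lam * sum(|a|) pushing the coefficients toward sparsity."""
    rng = np.random.default_rng(seed)
    a = rng.normal(scale=0.01, size=n)  # don't initialize exactly at 0 (|a| is non-smooth there)
    for _ in range(n_steps):
        x = idct(a, norm="ortho")            # synthesis: coefficients -> time-domain signal
        err = x[known_idx] - known_vals      # error only where the target data is known
        # Gradient of the mean squared reconstruction error w.r.t. a:
        g_x = np.zeros(n)
        g_x[known_idx] = 2.0 * err / len(known_idx)
        grad = dct(g_x, norm="ortho")        # adjoint of the orthonormal idct is the dct
        grad += lam * np.sign(a)             # subgradient of the L1 term
        a -= lr * grad
    return a
```

One caveat: with plain subgradient steps the coefficients tend to hover near zero rather than hitting it exactly, which is why proximal (soft-thresholding) updates are usually preferred when a strictly sparse solution is wanted.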
The diamond construct also feels evocative of dimer and/or branched-manifold / lattice methods, be that Viterbi or otherwise. Section 2.2 in the OP is reminiscent of that, e.g., if we view the DCT reconstruction as an implicit matched filter.
Yes, in theory it should converge on a similar result. It may quickly turn into alternating conic optimization, though, especially depending on how the signal pair is constructed, e.g., if one signal is an ECC and/or if the L1 regularization is operating as an alternating error-squashing op.
also, the Microsoft Word vs. genome size chart, <https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto...>