(1) Treat your data as a stream of tokens. Use your machine learning gadget to give you the predicted probability of each possible next token. Then use those probabilities in arithmetic coding to specify which token actually came next.
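A minimal sketch of approach (1). The `model` function here is a toy stand-in for a real ML predictor (its bias toward repeating the previous token is an invented assumption), and exact rational arithmetic replaces the fixed-precision renormalization a practical arithmetic coder would use:

```python
from fractions import Fraction

ALPHABET = ["a", "b", "c"]

def model(prefix):
    """Toy stand-in for an ML predictor: assigns extra probability
    to repeating the previous token, uniform mass otherwise."""
    probs = {t: Fraction(1, 6) for t in ALPHABET}
    last = prefix[-1] if prefix else ALPHABET[0]
    probs[last] += Fraction(1, 2)  # probabilities sum to 1
    return probs

def encode(tokens):
    """Narrow the interval [0, 1) once per token, giving each token
    a subinterval whose width equals its predicted probability."""
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        probs = model(tokens[:i])
        width = high - low
        cum = Fraction(0)
        for t in ALPHABET:
            if t == tok:
                high = low + (cum + probs[t]) * width
                low = low + cum * width
                break
            cum += probs[t]
    # Any number in [low, high) identifies the whole sequence.
    return (low + high) / 2

def decode(code, n):
    """Replay the same predictions and see which subinterval
    the code number falls into at each step."""
    tokens = []
    low, high = Fraction(0), Fraction(1)
    for _ in range(n):
        probs = model(tokens)
        width = high - low
        cum = Fraction(0)
        for t in ALPHABET:
            t_low = low + cum * width
            t_high = t_low + probs[t] * width
            if t_low <= code < t_high:
                tokens.append(t)
                low, high = t_low, t_high
                break
            cum += probs[t]
    return tokens
```

The key property: the better the model's predictions, the wider the chosen subintervals, and the fewer bits are needed to pin down a number inside the final interval.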
(2) Take your data D. Apply lossy compression to it and store the result L := lossy(D). Also compute the residue R := D - uncompress(L). If your lossy compression is good, R will be mostly zeroes (with only a few actual differences), so it will compress well with a lossless compression algorithm.
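A toy sketch of approach (2), with crude byte quantization standing in for a real lossy codec (the `step` quantizer is an illustrative assumption; a real scheme would also store L in its compact encoded form rather than as raw bytes):

```python
import zlib

def lossy(data, step=8):
    """Crude lossy 'compressor': round each byte down to a multiple of step."""
    return bytes((b // step) * step for b in data)

def compress(data, step=8):
    l = lossy(data, step)
    # Residue R := D - uncompress(L), byte-wise mod 256.
    # Every residue byte lands in [0, step), so zlib handles it well.
    residue = bytes((d - u) % 256 for d, u in zip(data, l))
    return l, zlib.compress(residue)

def decompress(l, packed_residue):
    residue = zlib.decompress(packed_residue)
    # Adding the residue back reconstructs D exactly.
    return bytes((u + r) % 256 for u, r in zip(l, residue))
```

The overall scheme stays lossless: whatever detail the lossy stage throws away is recorded in the residue, and the bet is that L plus a compressed R is smaller than losslessly compressing D directly.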
Approach (1) is a more sophisticated version of (2). None of this is anything I came up with; both approaches are well known.