
Parsing Series of Integers with SIMD (2018) - lelf
http://0x80.pl/articles/simd-parsing-int-sequences.html
======
plastic_teeth
I haven't read this properly, and I'm not sure I totally understand the stuff
being parsed, but here's my initial thought. Parsing a 4-digit span can be
done with PEXT-sub-multiply, and range checking requires a few extra
instructions.

initial: 0 0 0 0 3 4 1 2 (characters)

pext: 0 3 0 4 0 1 0 2 (characters)

sub: 0 3 0 4 0 1 0 2

multiply: 3412 412 12 2

Values below '0' will set high bits in the "sub" phase. Do another "sub" to
check for values above '9', and branch.

Use of PEXT means potentially very fast handling of separators. Everything can
be branchless.

Might not be as fast as SIMD but it should be better than a typical scalar
approach.
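
For what it's worth, here is a minimal portable C sketch of the sub / range-check / multiply steps described above. It is an illustration under assumptions, not the commenter's code: the name `parse4` is made up, the bytes are spread into 16-bit lanes with plain shifts instead of a BMI2 instruction so that it runs on any CPU, and the first character goes into the lowest lane, so after the multiply the lanes hold prefixes (3, 34, 341, 3412) rather than the layout drawn above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parse exactly four ASCII digits at s into *out.
   Returns false if any of the four bytes is outside '0'..'9';
   in that case *out is meaningless. */
static bool parse4(const char *s, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;

    /* Spread the four bytes into four 16-bit lanes, first character lowest. */
    uint64_t x = (uint64_t)p[0]
               | (uint64_t)p[1] << 16
               | (uint64_t)p[2] << 32
               | (uint64_t)p[3] << 48;

    /* "sub": subtract '0' in every lane. A byte below '0' wraps around
       and sets bit 15 of its lane. */
    uint64_t d = x - 0x0030003000300030ULL;

    /* Second "sub": 9 - digit wraps in any lane whose value is above 9. */
    uint64_t over = 0x0009000900090009ULL - d;

    /* Either failure leaves bit 15 set in at least one lane. */
    bool ok = ((d | over) & 0x8000800080008000ULL) == 0;

    /* "multiply": the constant holds 1, 10, 100, 1000 in its lanes, so the
       top lane of the product is 1000*d0 + 100*d1 + 10*d2 + d3. Each lane
       stays below 10000, so nothing carries between lanes. */
    uint64_t prod = d * 0x03E80064000A0001ULL;

    *out = (uint32_t)(prod >> 48);
    return ok;
}
```

Calling `parse4("3412", &v)` returns true and stores 3412 in `v`; if any of the four bytes is not a digit (a separator, for example), it returns false. The only branch left is whatever the caller does with the returned flag.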

------
nsajko
A bit off-topic question: why do people use "parsing" instead of "lexing" or
"tokenization" for denoting the analysis of strings from regular languages;
considering lexing is just a trivial sub-case of parsing anything that
includes the regular languages?

Perhaps more on topic: I think this can be even faster if one is willing to
offload the heavy lifting to an FPGA, here's a result from 2016:
[https://www.miraclelinux.com/labs/pdf/fpga-en](https://www.miraclelinux.com/labs/pdf/fpga-en)

~~~
gameswithgo
Human language is flexible and imprecise. There is no compiler. English in
particular is often interpreted in a descriptivist manner (meaning is defined
based on how the public uses the words) rather than prescriptivist (meaning is
defined by central authority).

Parsing is sometimes used to describe the translation of tokens to abstract
syntax tree, sometimes it is used to describe both tokenization and tree
formation together, sometimes it is used to describe any process of taking in
text and producing something more structured (Turbo Pascal for instance went
straight from text -> machine code, was that parsing?)

Tokenization is taking in text and producing something more structured, so it
may be appropriate to call that a form of parsing too.

Many programmers would like for human language to be as strict and precise as
programming, but absent some plan to cause that to happen it is better to go
with the flow, and simply be as clear in your own communication as you can
rather than quibbling about definitions of others all the time. You won't ever
win.

------
htfy96
Curious whether the Intel BCD ops
([https://en.wikipedia.org/wiki/Intel_BCD_opcode](https://en.wikipedia.org/wiki/Intel_BCD_opcode))
are still a thing in today's SIMD world. Converting chars to BCD and then
converting BCD to u32 sounds a little bit easier than this.

------
29athrowaway
There's substantial difference between AVX and SSE. You can get a performance
boost just by compiling with AVX2 or AVX512 support.

~~~
rurban
AVX-512 not really, as it punishes you with downclocking. It's only available
on the slow, insecure legacy CPUs from Intel, not on any modern CPU.

~~~
qubex
> _It’s only available on the slow, insecure legacy CPUs from Intel, not on
> any modern CPU._

Am I to take it that it is your opinion that any x86 CPU not manufactured
by AMD is essentially dangerous and worthless?

~~~
rurban
A high security risk especially. Why bother with AVX-512? It's not portable,
and has questionable advantages.

Only the AVX-512 downclocking is significant; with 256-bit and 128-bit
instructions it is not significant at all.

