
Data-Parallel Rank-Select Bit-String construction - KirinDave
https://haskell-works.github.io/posts/2018-08-08-data-parallel-rank-select-bit-string-construction.html
======
jonaslyk
I could see this ending up in my data ingestion pipeline... using SIMD
instructions for finding line boundaries right after decompression would give
a major performance boost.

Decompressing LZ4 cannot be done in parallel, but right after that, what I
want is to split the workload across CORE_COUNT threads at correct line
boundaries.
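
The SIMD part of that split is cheap even without the article's rank-select
machinery. A minimal sketch of what I have in mind, assuming SSE2 and
GCC/Clang builtins (function name is mine): collect every newline offset,
then cut the buffer at the newline nearest each len/CORE_COUNT boundary.

    // Sketch only: find all '\n' offsets, 16 bytes at a time, with SSE2.
    #include <emmintrin.h>
    #include <cstddef>
    #include <vector>

    std::vector<size_t> newline_positions(const char* buf, size_t len) {
        std::vector<size_t> pos;
        const __m128i nl = _mm_set1_epi8('\n');
        size_t i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i*)(buf + i));
            // one mask bit per byte that equals '\n'
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));
            while (mask) {
                pos.push_back(i + __builtin_ctz(mask));
                mask &= mask - 1;   // clear lowest set bit
            }
        }
        for (; i < len; ++i)        // scalar tail
            if (buf[i] == '\n') pos.push_back(i);
        return pos;
    }

Splitting then amounts to handing thread k the range between the newlines
closest to k*len/CORE_COUNT and (k+1)*len/CORE_COUNT.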

Some primitive way to control pipe flow would solve a problem I have not found
any good solutions for. Example: I want to ingest CSV/JSON where each row
contains a domain name field.

That field will be my primary key, and my table is partitioned on the domain
name's TLD.

In Postgres, COPY FROM is the only really fast way to ingest data, but it is
not possible to COPY FROM into a partitioned table unless all entries end up
in the same partition.

If I could do something like this:

curl -q http://something.com/x.lz4 | lz4 -d - |
tool -pipeFlowCmd=postgreIngest.sh -pipeFlowCmdArg=split($domainname,\\.)[-1]

The tool should spawn a postgreIngest.sh for each TLD it sees, with the TLD as
an argument, and keep each process open, piping matching lines into it.
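
The core of that tool is just a demultiplexer. A rough POSIX sketch of how I
imagine it (hypothetical tool, no input sanitisation, and it assumes the
domain is the first comma-separated field):

    #include <stdio.h>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, FILE*> children;   // one child per TLD
        char line[1 << 16];
        while (fgets(line, sizeof line, stdin)) {
            std::string l(line);
            if (!l.empty() && l.back() == '\n') l.pop_back();
            std::string domain = l.substr(0, l.find(','));  // first CSV field
            std::string tld = domain.substr(domain.rfind('.') + 1);
            FILE*& child = children[tld];
            if (!child)   // spawn postgreIngest.sh <tld> on first sight
                child = popen(("./postgreIngest.sh " + tld).c_str(), "w");
            fputs(line, child);   // original line, newline included
        }
        for (auto& kv : children) pclose(kv.second);
    }

The SIMD win would be in the field/TLD extraction and the line scanning; the
routing itself is trivial.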

postgreIngest.sh would be something like this (with awk feeding psql, not the
other way around, and the TLD passed in via -v, since $1 inside single quotes
would be awk's first field rather than the shell argument):

    awk -v tld="$1" 'NR % 500 == 1 { if (NR > 1) print "\\."; printf "COPY domains.%s FROM stdin;\n", tld }
                     { print } END { print "\\." }' | psql

so psql first gets "COPY domains.com FROM stdin;", then 500 lines are piped
into it (all with a TLD of .com), then the batch is closed with \. and the
pattern repeats.

Having a tool capable of using SIMD instructions for ultra-fast pipe-flow
parallelisation and control would be awesome, because then the bottleneck
would not be the distribution of the decompressed data to X processes.

If I can find the time I will make it myself, using his ideas for inspiration
(I will write it in C++, POSIX subset only).

Anyway, great work, thanks for sharing; you inspire me :)

~~~
jonaslyk
Just having a generic tool using SIMD instructions with controllable masking
arguments in the pipeline would be awesome.

Another example:

When ingesting data into Postgres with COPY FROM, exceptions are bad: all
entries in the COPY FROM batch are discarded because of one faulting line. So
either the input has to be flawless, or the data also needs to be stored on
disk until the commit is successful.

Datasets with invalid UTF-8 are a pain because of that. One solution is to
pipe all input through iconv -f UTF-8 -t UTF-8 -c, which drops invalid
characters, but it eats a lot of CPU because every character is parsed one by
one.

With a pipe-processing stage using custom SIMD masks to control the pipe flow,
I could select only the lines containing bytes that indicate a multi-byte
UTF-8 sequence (the only ones that can possibly fail, and easy to spot since
every such byte has its high bit set) and pipe just those to iconv; on
mostly-ASCII data that would reduce the overhead by something like 99%.
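
The classifier itself is one SSE2 instruction per 16 bytes. A minimal sketch
(my assumption of how it could look, name is mine): lines failing this test
get routed through iconv, everything else bypasses it.

    #include <emmintrin.h>
    #include <cstddef>

    // True iff no byte has its high bit set, i.e. the line is pure ASCII
    // and cannot contain an invalid UTF-8 sequence.
    bool is_pure_ascii(const char* s, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i*)(s + i));
            if (_mm_movemask_epi8(v))   // collects bit 7 of each byte
                return false;
        }
        for (; i < n; ++i)              // scalar tail
            if ((unsigned char)s[i] & 0x80) return false;
        return true;
    }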

------
valenciarose
The input constraints might have been realistic 10+ years ago, but there
aren't a lot of performance-sensitive parsing problems that can safely do
without either escaping or Unicode support.

~~~
jonaslyk
True, but irrelevant: his experiments apply perfectly to finding escape
characters and bit sequences that indicate the following x bits should be
parsed by ICU. He is testing ideas and concepts; I look forward to an update.
I would like a way to subscribe.

