Hacker News

Almost! I can't quite replicate these tests because the article doesn't give the type (integer or float, and width) of the database columns. There are many possibilities[0], so I tested a few. Fortunately the testing CPU (Intel Xeon E5-2650 v4 @ 2.20 GHz) is comparable to mine (Intel Core i5-6200U @ 2.30 GHz). Turbo (one-core) speed is 3.00 GHz versus 2.80 GHz, advantage to the Xeon, but my laptop CPU is a year newer and on a smaller process, so it could have better IPC. Both have AVX2, and the server CPU has a much larger cache, but I don't think Dyalog is bandwidth-limited.

I adapted the vectorized Dőtsch solution, which gave the fastest times and is easy to write, to Dyalog APL. I used pre-release 18.0 from when I worked there, because I can't be bothered to do a real install, but I doubt the arithmetic code has changed since 17.1. Here are results from a Dyalog session giving times in seconds for 8-byte floats, 4-byte integers, and 1-byte integers. From the article, Q takes 13E¯3 seconds.

          )copy dfns cmpx
          (sA sB pA pB)←?{1e6⍴  0}¨⍳4 ⋄ cmpx '(sA×pA≥pB) + sB×pA≤pB'
          (sA sB pA pB)←?{1e6⍴1e9}¨⍳4 ⋄ cmpx '(sA×pA≥pB) + sB×pA≤pB'
          (sA sB pA pB)←?{1e6⍴120}¨⍳4 ⋄ cmpx '(sA×pA≥pB) + sB×pA≤pB'
If the table consists of floats, then Dyalog appears substantially faster, although this could plausibly be due to better IPC on my CPU. It could also be a real improvement: Dyalog uses bit booleans for the comparison results, which lets it make smaller reads and writes, though the code for packing results to bit booleans and for multiplying floats by booleans does have to be written with vector intrinsics ([1] indicates Q doesn't pack booleans). If the benchmark uses 8-byte ints, then those should be comparable to the floats; if it uses smaller ints, then Dyalog is clearly much faster.
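To make the memory-traffic point concrete, here's a NumPy sketch (illustrative only, not Dyalog's actual code): NumPy, like Q, stores one boolean per byte, while a bit-packed representation touches an eighth of the memory.

```python
import numpy as np

n = 1_000_000
pA = np.random.randint(0, 5, n)
pB = np.random.randint(0, 5, n)

byte_bools = pA >= pB             # one byte per element in NumPy
packed = np.packbits(byte_bools)  # one bit per element

print(byte_bools.nbytes)  # 1000000
print(packed.nbytes)      # 125000
```

With bit booleans, the comparison result for a million rows is 125 KB instead of 1 MB, which fits comfortably in cache.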

I'll see if I can get NumPy benchmarks running.

[0] https://code.kx.com/q/basics/datatypes/

[1] https://code.kx.com/q4m3/3_Lists/#323-simple-binary-lists

I ran the following NumPy program, which resulted in a time of 0.01116 s: faster than the article's benchmark but slightly slower than my Dyalog timing of 0.0093 s. Assuming relative times are consistent across CPUs, this would put Dyalog on 8-byte floats just ahead of Q on 8-byte ints. Floating-point vector instructions take longer than integer ones, and while overflow checking could turn things the other way, K and Q famously don't do it. The article's figure is not consistent with being the sum of 100 runs, as geocar suggests.

    import numpy as np
    import time

    # One million rows; randint's upper bound is exclusive,
    # matching Q's N?5 (values 0..4) and N?100 (values 0..99)
    N = 1000 * 1000
    pA = np.random.randint(0, 5, N)
    pB = np.random.randint(0, 5, N)
    sA = np.random.randint(0, 100, N)
    sB = np.random.randint(0, 100, N)

    start = time.perf_counter()
    runs = 100
    for i in range(runs):
        bS = sA * (pA >= pB) + sB * (pA <= pB)
    end = time.perf_counter()
    print((end - start) / runs)  # mean time per run, in seconds

They're 8-byte ints, and the reported times are the sum of 100 runs. This is what my 2019 1.6 GHz i5 MacBook Air does:

    q)`sA`sB`pA`pB set'4 0N#1000000?100
    q)\t:100 (sA*pA>=pB)+sB*pA<=pB

Where do you get this information? Do you have some connection to the article?

It appears this benchmark uses vectors of a quarter-million elements, not a million? My understanding is that 4 0N#a redistributes the elements of a into four vectors without repeating them, which is what ngn/k does. The article says a "sample table of size one million", refers to "one million rows" elsewhere, and later gives timings that scale linearly with the number of columns, so I don't think that's what is meant.
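A NumPy analogy for that reading of 4 0N# (a sketch of the redistribution, not of Q's # primitive itself): reshape with an inferred dimension partitions the million elements into four non-overlapping rows of 250,000.

```python
import numpy as np

# Splitting one million elements into four consecutive,
# non-repeating vectors, like 4 0N# under this interpretation
a = np.arange(1_000_000)
rows = a.reshape(4, -1)  # -1 infers 250,000 columns

print(rows.shape)  # (4, 250000)
```

So each of sA, sB, pA, pB would get a quarter of the data, not an independent million-element vector.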

EDIT: Oh, there's generation code near the top, in an image for some reason, so the Q version is transcribed below. Definitely a million rows. It seems both NumPy's random.randint and Q default to 8-byte ints? Dyalog would use 1-byte integers when the data fits, as it does in this example.

    N: 1000 * 1000

    t: ([] pA: N?5; pB: N?5;
           sA: N?100; sB: N?100)
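One way to check the NumPy side of the dtype question (assuming a 64-bit Linux/macOS build, where the legacy randint defaults to the platform's C long, i.e. 8 bytes; on Windows it is 4):

```python
import numpy as np

# Legacy np.random.randint defaults to the platform C long:
# 8 bytes on 64-bit Linux/macOS, 4 bytes on Windows
x = np.random.randint(0, 5, 10)
print(x.dtype, x.dtype.itemsize)

# Requesting a narrower width explicitly, closer to what
# Dyalog would pick for values that fit in one byte
y = np.random.randint(0, 5, 10, dtype=np.int8)
print(y.dtype.itemsize)  # 1
```

So the NumPy benchmark above is indeed doing 8-byte integer arithmetic unless a dtype is passed explicitly.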
