
People tend to severely underestimate how fast modern machines are and overestimate how much you need to spend on hardware.

Back in my last startup, I was building a crypto market intelligence website that subscribed to full trade & order book feeds from the top 10 exchanges. It handled about 3K incoming messages/second (~260M per day), including all of the message parsing, order book updates, processing, streaming to websocket connections on any connected client, and archival to Postgres for historical processing. Total hardware required was 1 m4.large + 1 r5.large AWS instance, for a bit under $200/month, and the boxes would regularly run at about 50% CPU.

What Andy giveth, Bill taketh away.[0]

I'm more than a little annoyed that so much data engineering is still done in Scala Spark or PySpark. Both suffer from pretty high memory overhead, which leads to suboptimal resource utilization. I've worked with a few different systems that compile their queries into C/C++ (which is transparent to the developer). Those tend to be significantly faster or can use fewer nodes to process.

I get that quick & dirty scripts for exploration don't need to be super optimized, and that throwing more hardware at the problem _can_ be cheaper than engineering time, but in my experience that mindset -- just write some code and allocate a ton of resources to make it work in a reasonable amount of time -- ends up costing my org tens of millions of dollars annually.

I'm hopeful that Ballista[1], for example, will see uptake and improve this.

[0] https://en.wikipedia.org/wiki/Andy_and_Bill%27s_law

[1] https://github.com/apache/arrow-datafusion/tree/master/balli...

I get a kick out of stuff like this - I’m mostly an exec these days, but I recently prototyped a small database system to feed a business process in SQLite on my laptop.

To my amusement, my little SQLite prototype smoked the “enterprise” database. Turns out that a MacBook Pro SSD performs better than the SAN, and the query planner needs more TLC. We ended up running the queries off my laptop for a few days while the DBAs did their thing.

Right. Local storage is much more performant and cost-effective than network storage. I tried to run an IOPS-sensitive workload in the cloud; it turned out I'd need to pay several thousand dollars per month for the performance I can get from a $100 NVMe SSD.

I'm working on a web app right now that does a lot of heavy/live/realtime processing with workers. The original thought was to run those workers on the node servers and stream the results over a socket, charging by the CPU-minute. But to my surprise, performance looks a lot better, up to about 16 workers, if the user just turns over a few cores to local web workers and runs it on their own box. As long as they don't need 64 cores or something, it's faster to run locally, even on an older machine. Thread transport is slow, but sockets are slower at a distance; the bottlenecks are in the main thread anyway. So I've been making sure parts are interchangeable between web workers and "remote socket web workers" assigned to each instance of the app remotely.

Who the fuck thinks it's a good idea to run database on networked storage?

You might have heard of this little company in Seattle called “Amazon”.

Like the river? They're gonna need to fight hard for that SEO.

Isn't that what's happening if you use any managed database product? They have probably colocated everything as much as possible and used various proprietary tricks to cut latency, but still.

The same people who will run a clustered database on VMs.

What reminded me of this the other day is how macOS will grow your cursor if you “shake” it, to help you find it on a big screen.

I was thinking about how they must have a routine that’s constantly taking mouse input, buffering history, and running some algorithm to determine when user input is a mouse “shake”.

And how many features like this add up to eat up a nontrivial amount of resources.

That particular example seems like something that's probably a lot cheaper than you'd initially think. The OS has to constantly take mouse input anyway to move the pointer and dispatch events to userspace. It also needs to record the current and new position of the mouse pointer to dispatch the events. Detecting whether the mouse is being "shaken" can be done with a ring buffer of mouse velocities over the last second or two of ticks. At 60 fps, that's about 120 ints = 480 bytes. Since you don't need to be precise, you can take the Manhattan distance (|x| + |y|) rather than the Euclidean distance (sqrt(x^2 + y^2)), which is a basically negligible computation. Add up the running total of the ring buffer -- and you don't even need to visit each element: just keep a running total in a variable, add the new velocity, subtract the velocity that's about to be overwritten -- and if this passes a threshold of, say, 1-2 screen widths, the mouse is being "shaken" and the pointer should enlarge. In total you're looking at < 500 bytes and a few dozen CPU cycles per tick for this feature.
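That scheme is small enough to sketch in a few lines. A toy Python version of the ring-buffer-with-running-total idea (the window size and threshold are made-up numbers, and this is of course not Apple's actual code):

```python
class ShakeDetector:
    """Running total over a ring buffer of per-tick mouse travel."""

    def __init__(self, window_ticks=120, threshold=4000):
        self.buf = [0] * window_ticks   # per-tick Manhattan distances
        self.i = 0                      # slot about to be overwritten
        self.total = 0                  # running sum of the window
        self.threshold = threshold      # ~1-2 screen widths, in pixels

    def tick(self, dx, dy):
        m = abs(dx) + abs(dy)                # Manhattan, no sqrt needed
        self.total += m - self.buf[self.i]   # add new, drop oldest
        self.buf[self.i] = m
        self.i = (self.i + 1) % len(self.buf)
        return self.total >= self.threshold  # True -> grow the pointer
```

Each tick is a handful of integer operations and the whole state fits in the ring buffer plus three ints, which matches the back-of-envelope estimate above.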

Or, alternatively, the engineer that worked on this at Apple has just read the above as another way of solving the problem and is throwing this on their backlog for tomorrow..

Thanks for the thoughtful analysis and napkin math. You may very well be right. I wonder if this is true in practice or if they suffer from any interface abstractions and whatnot.

On every modern (past few decades) platform, the mouse cursor is a hardware sprite with a dedicated, optimized, *fast* path thru every layer of the stack, just to shave off a few ms of user-perceived latency. Grab a window and shake it violently, you'll notice it lags a few pixels behind the cursor - that's the magic in action.

In some places there's no room left for unnecessary abstractions, I can imagine most of the code touching mouse / cursor handling is in that category.

I wish this were true with Ubuntu, but every time I boot up it's a dice roll whether my mouse will be sluggish and low-FPS or not.

If you shake the cursor up and down, it doesn't grow.

Looks like it's measuring something like number of direction changes in relation to distance traveled; ignoring the y axis completely.
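If that guess is right, the heuristic could be as simple as counting horizontal direction reversals over enough travel. A toy sketch (pure speculation about the actual algorithm; the thresholds are invented, and a real version would decay or reset its state over time):

```python
class DirectionChangeDetector:
    """Guess at the shake heuristic: frequent horizontal direction
    reversals over substantial travel, ignoring the y axis entirely."""

    def __init__(self, min_reversals=4, min_travel=800):
        self.last_sign = 0      # sign of the previous nonzero dx
        self.reversals = 0      # direction changes seen so far
        self.travel = 0         # total horizontal distance, pixels
        self.min_reversals = min_reversals
        self.min_travel = min_travel

    def tick(self, dx):
        self.travel += abs(dx)
        sign = (dx > 0) - (dx < 0)
        if sign and self.last_sign and sign != self.last_sign:
            self.reversals += 1  # left<->right flip counts as a shake
        if sign:
            self.last_sign = sign
        return (self.reversals >= self.min_reversals
                and self.travel >= self.min_travel)
```

Purely vertical motion never produces a nonzero `dx`, so it never registers a reversal, which matches the observed behavior.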

That's some actual engineering there, understanding the problem you're trying to solve, instead of solving problems you don't even have. Nice.

Why not use an exponential filter that only uses a single variable?

The running-sum-difference approach suggested above is a box filter, which has the best possible noise suppression for a given step-function delay, although in the frequency domain it looks appalling. It uses more RAM, but not that much. The single-pole RC filter you're suggesting is much nicer in the frequency domain, but in the time domain it's far worse.

Is it „far worse in the time domain“ due to its infinite impulse response?

Not really? Sort of? I don't really have a good answer here. It depends on what you mean by "due to". It's certainly due to the impulse response, since in the sense I meant "far worse" the impulse response is the only thing that matters.

Truncating the impulse response after five time constants wouldn't really change its output noticeably, and even if you truncated it after two or three time constants it would still be inferior to the box filter for this application, though less bad. So in that sense the problem isn't that it's infinite.

Likewise, you could certainly design a direct-form IIR filter that did a perfectly adequate job of approximating a box filter for this sort of application, and that might actually be a reasonable thing to do if you wanted to do something like this with a bunch of op-amps or microwave passives instead of code.

So the fact that the impulse response is infinite is neither necessary nor sufficient for the problem.

The problem with the simple single-pole filter is that by putting so much weight on very recent samples, you sort of throw away some information about samples that aren't quite so recent and become more vulnerable to false triggering from a single rapid mouse movement, so you have to set the threshold higher to compensate.
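That tradeoff is easy to see numerically. Normalizing each filter's peak response by its response to sustained shaking, a single violent flick reaches a far larger fraction of "full scale" on the single-pole filter than on the box filter. A quick sketch (alpha = 1/32, a 120-tick box, and made-up amplitudes):

```python
from collections import deque

def box_peak(samples, n=120):
    """Peak output of an n-tick running sum (the box filter)."""
    buf, total, peak = deque([0.0] * n), 0.0, 0.0
    for m in samples:
        total += m - buf.popleft()   # add new sample, drop oldest
        buf.append(m)
        peak = max(peak, total)
    return peak

def rc_peak(samples, alpha=1 / 32):
    """Peak output of the single-pole filter c = c*(1-alpha) + m."""
    c, peak = 0.0, 0.0
    for m in samples:
        c = c * (1 - alpha) + m
        peak = max(peak, c)
    return peak

spike = [0.0] * 50 + [100.0] + [0.0] * 200  # one violent flick
shake = [1.0] * 1000                         # sustained small shaking

# Fraction of the full "sustained shake" response one flick reaches:
box_frac = box_peak(spike) / box_peak(shake)  # ~0.83 of full scale
rc_frac = rc_peak(spike) / rc_peak(shake)     # ~3.1x full scale
```

So with these numbers the lone flick blows past the RC filter's sustained-shake level by 3x but stays under the box filter's, which is exactly the false-triggering vulnerability described above.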

Reading all of you sounding super smart and saying stuff I don’t recognize (but perhaps utilize without knowing the terms) used to make me feel anxious about being an impostor. Now it makes me excited that there’s so many more secrets to discover in my discipline.

Yeah, DSP has a lot of really awesome stuff in it! I recommend https://www.dspguide.com/.

It turns out that pretty much any time you have code that interacts with the world outside computers, you end up doing DSP. Graphics processing algorithms are DSP; software-defined radio is DSP; music synthesis is DSP; Kalman filters for position estimation is DSP; PID controllers for thermostats or motor control are DSP; converting sonar echoes into images is DSP; electrocardiogram analysis is DSP; high-frequency trading is DSP (though most of the linear theory is not useful there). So if you're interested in programming and also interested in graphics, sound, communication, or other things outside of computers, you will appreciate having studied DSP.

Don't worry, this is a domain of magic matlab functions and excel data analysis and multiply-named (separately invented about four times on average in different fields) terms for the same thing, with incomprehensible jargon and no simple explanation untainted by very specific industry application.

Agreed, a single-pole IIR filter would be cheaper and have a better frequency domain.

Please elaborate.

For both alternatives we begin by computing how far the mouse has gone:

    int m = abs(dx) + abs(dy);   // Manhattan distance
For the single-pole RC exponential filter as WanderPanda suggested:

    c -= c >> 5;                 // exponential decay without a multiply (not actually faster on most modern CPUs)
    c += m;
For the box filter with the running-sum table as nostrademons suggested:

    s += m;                      // update running sum
    size_t j = (i + 1) % n;      // calculate index in prefix sum table to overwrite
    int d = s - t[j];            // calculate sum of last n mouse movement Manhattan distances
    t[j] = s;
    i = j;
Here c, i, s, and t are all presumed to persist from one event to the next, so maybe they're part of some context struct, while in old-fashioned C they'd be static variables. If n is a compile-time constant, this will be more efficient, especially if it's a power of 2. You don't really need a separate persistent s; that's an optimization nostrademons suggested, but you could instead use a local s at the cost of an extra array-indexing operation:

    int s = t[i] + m;
Depending on context this might not actually cost any extra time.

Once you've computed your smoothed mouse velocity in c or d, you compare it against some kind of predetermined threshold, or maybe apply a smoothstep to it to get the mouse pointer size.

Roughly I think WanderPanda's approach is about 12 RISCish CPU instructions, and nostrademons's approach is about 18 but works a lot better. Either way you're probably looking at about 4-8 clock cycles on one core per mouse movement, considerably less than actually drawing the mouse pointer (if you're doing it on the CPU, anyway).

Does that help?

> they must have a routine that’s constantly taking mouse input

Possible but unlikely. Well-written desktop software is never constantly taking input; it's sleeping on OS kernel primitives like poll/epoll/IOCP/etc., waiting for those inputs.

Operating systems don't generate mouse events at 1kHz unless you actually move the mouse.

“Constantly taking” is not the same thing as “constantly polling”. The ring buffer approach works identically in the event-driven approach, you just need to calculate the number of “skipped” ticks and zero them out in the ring buffer.
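A sketch of that event-driven variant, in Python for brevity (tick length, window, and threshold are all illustrative): each event advances the ring buffer by the number of ticks elapsed since the previous event, zeroing the skipped slots, so no work happens at all while the mouse is idle.

```python
class EventDrivenShake:
    """Ring buffer of per-tick movement, updated only on mouse events."""

    TICK_MS = 16  # ~60 Hz tick length

    def __init__(self, window_ticks=120, threshold=4000):
        self.buf = [0] * window_ticks
        self.i = 0
        self.total = 0
        self.threshold = threshold
        self.last_ms = 0

    def _advance(self, slots):
        # Zero out the slots for ticks in which no event arrived.
        for _ in range(min(slots, len(self.buf))):
            self.i = (self.i + 1) % len(self.buf)
            self.total -= self.buf[self.i]
            self.buf[self.i] = 0

    def on_event(self, t_ms, dx, dy):
        self._advance((t_ms - self.last_ms) // self.TICK_MS)
        self.last_ms = t_ms
        m = abs(dx) + abs(dy)
        self.total += m
        self.buf[self.i] += m            # accumulate within one tick
        return self.total >= self.threshold
```

Since the advance is capped at the window length, even an hours-long idle gap costs at most one pass over the (tiny) buffer.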

Simply moving my USB mouse consumes 10% of my CPU. Computers these days...

Back in the day, moving your mouse made things look like they were processing faster.

Not just _look like_ — on Windows 95, moving the mouse _actually_ made things process faster, for fairly bizarre reasons.

Source: https://retrocomputing.stackexchange.com/questions/11533/why...

What's behind the mouse cursor while you're doing it? Could it be the UX/UI layer keeping state up-to-date?

Other possibility, do you have a gaming mouse with 1000Hz polling rate configured?

Yeah, it's a high-quality mouse. But the only excuse for this is that it's slightly cheaper to make everything USB. PS/2 worked much better: it was limited to 200Hz but needed no polling. Motherboards just stopped providing the port.

If the computer has to do anything at all, it adds complexity and keeps it from doing other things. One could do something a bit like blue-screening and add the mouse pointer to the video signal in the monitor. For basic functionality the computer only needs to know the x and y of clicks. (It could, for laughs, also report the colors in the area.) Hover and other effects could be activated when [really] needed. As a bonus, one could replace the hideous monitor configuration menus with a point-and-click interface.

This polling is not done by the CPU, this is a common misconception. In a typical modern system the only polling that happens with USB is done by the USB host controller and only when there is actual data the host controller generates interrupts for the CPU to process it. Obviously, when you configure the mouse at higher frequency you will get more interrupts and hence higher CPU usage but that has nothing to do with the polling.

Gaming or off the shelf prosumer mobos still have PS/2 on them occasionally. Although, it's probably just using a converter to USB anyway.

Your report rate may be too high.

What does that mean?

And yet macOS doesn't let you change the cursor color. On my Windows 10 desktop I set the cursor to a slightly larger size and a yellow color. So much easier to work with.

That seems like an incredibly cheap feature. The mouse pointer png probably costs a lot more than the shake detector.

Abstractions almost always end up leaky. Spark SQL, for example, does whole-stage codegen which collapses multiple project/filter stages into a single compiled stage, but your underlying data format still needs to be memory friendly (i.e. linear accesses, low branching, etc.). The codegen is very naive and the JVM JIT can only do so much.

What I've seen is that you need people who deeply understand the system (e.g. Spark) to be able to tune for these edge cases (e.g. see [1] for examples of some of the tradeoffs between different processing schemes). Those people are expensive (think $500k+ annual salaries) and are really only cost effective when your compute spend is in the tens of millions or higher annually. Everyone else is using open source and throwing more compute at the problem or relying on their data scientists/data engineers to figure out what magic knob to turn.

[1]: https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf

What's wrong with relying on data engineers for data engineering?

Spark is very very odd to tune. Like, it seems (from my limited experience) to have the problems common to distributed data processing (skew, it's almost always skew) but because it's lazy, people end up really confused as to what actually drives the performance problems.

That being said, Spark is literally the only (relatively) easy way to run distributed ML that's open source. The competitors are GPUs (if you have a GPU-friendly problem) and running multiple Python processes across the network.

(I'm really hoping that people will now school me, and I'll discover a much better way in the comments).

Data engineers should be building pipelines and delivering business value, not fidgeting with some JVM or Spark parameter that saves them runtime on a join (or for that matter, from what I've seen at a certain bigco, building their own custom join algorithms). That's why I said it's only economical for big companies to run efficient abstractions and everyone else just throws more compute at the issue.

I work in the analytics space and have been mostly on Java, and I am so glad that other people feel the same. At this point, people have become afraid of suggesting something other than Spark. I see something written in Rust being much better at problems like this. I love the JVM, but it works well with transactional workloads and starts showing its age when dealing with analytical loads. The worst part is when people start using weak references and weird off-heap processing -- usually a senior engineer -- which really defeats the purpose of the JVM.

I guess your company is running on Java and running something else would cost a lot in training, recruiting, understanding, etc. But down the line, defeating the JVM will be understood only by the guy who did it... Then that guy will leave... Then the newcomers will rewrite the thing in Spark 'cos it will feel safer. Rinse-repeat...

(I'm totally speculating, but your story seems so true that it inspired me :-)

Some of it is what you mentioned about training and hiring costs, but mostly it's the creation of a narrative that it will scale someday in the future. This is usually done by those engineers, and they are good at selling, so a different opinion is ignored or frowned upon by the leadership.

I have now seen this anti-pattern in multiple places.

> I love the JVM but it works well with transactional workloads and starts showing its age when its dealing with analytical loads.

This is interesting. Can you elaborate a bit?

Analytical loads deal with very large datasets, on the order of terabytes even after you compress them. These workloads don't change much, so keeping them in the heap eventually results in long processing pauses because the JVM tries to recover this memory. However, in most cases this data is not meant to be garbage collected. For transactions, once you have persisted the data it should be garbage collected, so the pattern works. There are other aspects I can think of, but the above one is the most important in my mind.

Yes, the whole idea of sending “agents” to do processing is poor performing and things like snowflake and Trino, where queries go to already deployed code, run rings around it.

Furthermore, pyspark is by far the most popular and used spark, and it’s also got the absolute world-worst atrocious mechanical sympathy. Why?

Developer velocity trumps compute velocity any day?

(I want the niceness of python and the performance of eg firebolt. Why must I pick?)

(There is a general thing to get spark “off heap” and use generic query compute on the spark sql space, but it is miles behind those who start off there)

Could you elaborate on other systems besides Ballista? (which looks great btw, thank you for sharing)

U-SQL on Azure Data Lake Analytics[0] and ECL from HPCC Systems[1] are two I have fairly extensive experience with. There may be other such platforms, but those and Ballista are the three on my radar.

[0] https://github.com/MicrosoftDocs/azure-docs/blob/master/arti...

[1] https://wiki.hpccsystems.com/pages/viewpage.action?pageId=28...

ECL is the devil. There's so little documentation on it, that you basically need to work for LNRS to actually get any experience with the damn thing.

If it had an SQL layer, I'd spend time evangelising it, but it's not worth learning another query language for.

There exists a world where it got open-sourced before Hadoop was built, and in that world it's probably everywhere.

A lot of that is due to absolutely lousy code.

We had a system management backend at my last company. Loading the users list was unbearably slow; 10+ seconds on a warm cache. Not too terrible, except that most user management tasks required a page reload, so it was just wildly infuriating.

Eventually I took a look at the code for the page, which queried LDAP for user data and the database for permissions data. It did:

    get list of users
    foreach user:
        get list of all permissions
        filter down to the ones assigned directly to the user
    foreach user:
        get list of all groups
        foreach group:
            get list of all permissions
            filter down to the ones assigned to the group
        filter down to the ones the user has
I'm no algorithm genius, but I'm pretty sure O(n^2+n^3) is not an efficient one.

I replaced it with

    get list of all users
    get list of all groups
    get list of all permissions

    <filter accordingly>
Suffice to say, it was a lot more responsive.
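Concretely, the batched version boils down to bulk fetches followed by dictionary joins in memory. A sketch with in-memory dicts standing in for the LDAP/database calls (the field names are invented for illustration):

```python
from collections import defaultdict

def resolve_permissions(users, permissions):
    """Resolve effective permissions with one pass over each dataset.

    users:       [{'id': ..., 'group_ids': [...]}]
    permissions: [{'id': ..., 'user_id': ... or None,
                   'group_id': ... or None}]
    """
    perms_by_user = defaultdict(set)
    perms_by_group = defaultdict(set)
    for p in permissions:                  # O(permissions), one pass
        if p['user_id'] is not None:
            perms_by_user[p['user_id']].add(p['id'])
        if p['group_id'] is not None:
            perms_by_group[p['group_id']].add(p['id'])

    effective = {}
    for u in users:                        # O(users * groups per user)
        perms = set(perms_by_user[u['id']])
        for gid in u['group_ids']:
            perms |= perms_by_group[gid]   # union in group grants
        effective[u['id']] = perms
    return effective
```

Three bulk fetches plus constant-time dict lookups, instead of re-querying the full permission list inside every user and group loop.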

Also worth noting: fetching the user list required shelling out to a command (a Python script) which shelled out to another command (ldapsearch), and the whole system was a nightmare. There were also dozens of pages where almost no processing was done in the view, but a bunch of objects with lazy-loaded properties were passed into the template and always used. So when benchmarking you'd get 0.01 seconds for the entire function and then 233 seconds for "return render(...)", because for every single row in the database (dozens or hundreds) the template would access a property that would trigger another SQL call to the backend, rather than just doing one giant "SELECT ALL THE THINGS" and hammering it out that way.

Note that we also weren't using Django's foreign keys support, so we couldn't even tell Django to "fetch everything non-lazily" because it had no idea.

If that app were written right it could have run on a Raspberry Pi 2, but instead there was no amount of cores that could have sped it up.

Yeah, I see this a lot. I think it's especially easy to introduce this kind of "accidentally quadratic" behaviour using magical ORMs like Django's, where an innocent-looking attribute access like user.groups can trigger a database query ... access user.groups inside a loop and things get bad quickly.

In the case of groups and permissions there's probably only a few of each, so fetching all of them is probably fine. But depending on your data -- say you're fetching comments written by a subset of users, you can tweak the above to use IN filtering, something like this Python-ish code:

  users = select('SELECT id, name FROM users WHERE id IN $1', user_ids)
  comments = select('SELECT user_id, text FROM comments WHERE user_id IN $1', user_ids)
  comments_by_user_id = defaultdict(list)
  for c in comments:
    comments_by_user_id[c.user_id].append(c)
  for u in users:
    u.comments = comments_by_user_id[u.id]
Only two queries, and O(users + comments).

For development, we had a ?queries=1 query parameter you could add to the URL to show the number of SQL queries and their total time at the bottom of the page. Very helpful when trying to optimize this stuff. "Why is this page doing 350 queries totalling 5 seconds? Oops, I must have an N+1 query issue!"

Django's prefetch_related mechanism essentially implements the pattern you describe here for you: https://docs.djangoproject.com/en/3.2/ref/models/querysets/#...

Thanks. Yeah, I think I used that years ago when I first ran into this problem, and it worked well. Whether one uses an ORM or not, one needs to know how to use one's tools. My problem (not just with Django, but with ORMs in general) is how they make bad code look good. Like the following (I don't know Django well anymore, but something like this):

  users = User.objects.all()
  for u in users:
    print(u.name, len(u.comments.all()))
To someone who doesn't know the ORM (or Python) well, u.comments looks cheap ("good"), but it's actually doing a db query under the hood each time around the loop ("bad"). Not to mention it's fetching all the comments when we're only using their count. Whereas if you did that in a more direct-SQL way:

  users = select('SELECT id, name FROM users WHERE id IN $1', user_ids)
  for u in users:
    num_comments = get('SELECT COUNT(*) FROM comments WHERE user_id = $1', u.id)
    print(u.name, num_comments)
This makes the bad pattern look bad. "Oops, I'm doing an SQL query every loop!"

The other thing I don't like about (most) ORMs is they fetch all the columns by default, even if the code only uses one or two of them. I know most ORMs provide a way to explicitly specify the columns you want, but the easy/simple default is to fetch them all.

I get the value ORMs provide: save a lot of boilerplate and give you nice classes with methods for your tables. I wonder if there's a middle ground where you couldn't do obviously bad things with the ORM without explicitly opting into them. Or even just a heuristic mode for development where it yelled loudly if it detected what looked like an N+1 issue or other query inside a loop.

I agree that this stuff can definitely be handled better.

https://github.com/django-query-profiler/django-query-profil... has a neat option for detecting likely N+1 queries. I usually use the Django Debug Toolbar for this.

Django's ".only()" method lets you specify just the columns you want to retrieve - with the downside that any additional property access can trigger another SQL query. I thought I'd seen code somewhere that can turn those into errors but I'm failing to dig it up again now.

I've used the assertNumQueries() assertion in tests to guard against future changes that accidentally increase the number of queries being made without me intending that.

The points you raise are valid, but there are various levels of mitigations for them. Always room for improvement though!

Thanks for the thoughtful responses and the link -- Django Query Profiler looks nice!

    users = User.objects.all()
    for u in users:
        print(u.name, len(u.comments.all()))
This is fine if you are working with a small data set. It is inefficient, but if it's quick enough, readability trumps efficiency IMHO.

Django ORM has a succinct way of doing the "SELECT COUNT(*)" pattern:

    users = User.objects.all()
    for u in users:
        print(u.name, u.comments.count())
And you can use query annotations to get rid of the N+1 query issue altogether:

    users = User.objects.annotate(n_comments=Count("comments"))
    for u in users:
        print(u.name, u.n_comments)

> This is fine if you are working with a small data set. It is inefficient, but if it's quick enough, readability trumps efficiency IMHO.

And this is how you end up with the problems the parent is describing. During testing, and when you first set up the system, you always have a small dataset, so it appears to work fine. But when it hits real workloads, the system collapses.

A good habit when writing unit tests is to use assertNumQueries, especially with views. It's very easy even for an experienced developer to inadvertently add an extra query per row: for example, if you have a query using only() to restrict the columns you return, and a template refers to another column not covered by only(), Django will do another query to fetch that column's value (not 100% sure that's still the case, but that was the behavior last time I looked). The developer might be just fixing what they think is a template issue and won't know they've just stepped on a performance landmine.

Sounds like a good habit, yeah. But the fact that an attribute access in a template (which might be quite far removed from the view code) can kick off a db query is an indication there's something a bit too magical about the design. Often front-end developers without much knowledge of the data model or the SQL schema are writing templates, so this is a bit of a foot-gun.

I've wanted a Django mechanism in the past that says "any SQL query triggered from a template is a hard error" - that way you could at least prevent accidental template edits from adding more SQL load.

My SQL is a bit rusty, but, to reduce the queries even further, I think you can do:

    SELECT user_id, COUNT(*) FROM comments WHERE user_id IN $1 GROUP BY user_id
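With SQLite standing in as the database, that aggregate version looks like this (illustrative schema; note that users with no comments simply don't appear in the GROUP BY result, so you default them to 0 on the client side):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER,
                           text TEXT);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob'), (3, 'cyd');
    INSERT INTO comments (user_id, text) VALUES
        (1, 'hi'), (1, 'again'), (2, 'hello');
""")

# One aggregate query instead of one COUNT(*) query per user:
counts = dict(conn.execute("""
    SELECT user_id, COUNT(*) FROM comments
    WHERE user_id IN (1, 2, 3)
    GROUP BY user_id
"""))

for uid, name in conn.execute("SELECT id, name FROM users ORDER BY id"):
    print(name, counts.get(uid, 0))   # ada 2, bob 1, cyd 0
```

Two queries total, regardless of how many users there are.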

> Or even just a heuristic mode for development where it yelled loudly if it detected what looked like an N+1 issue or other query inside a loop.

Fwiw, ActiveRecord with the Bullet gem does exactly that. I'd guess there's an equivalent for Django.

select_related and prefetch_related are absolutely essential when coding Django.

I’m coming around to considering django-seal to be mandatory for non-trivial applications. It will “seal” a queryset so that you can’t make accidental DB lookups after the initial fetch. That way you can be more confident that you are doing the right joins in the initial query, and you are safe from the dreaded N+1 ORM issue.

SqlAlchemy has this as part of the ORM, it should really be part of Django IMO.

I've never heard of django-seal before, thanks!

When I was doing a lot of DBIx::Class work, I contemplated writing a plugin to let me lock and unlock automatic relation fetches and die (throw) if one was attempted while locked. I would carefully prefetch my queries (auto-filling relations from a larger joined query), and if something changed, or there was an edge case I missed that caused fetching deep in a loop, it might kill performance but be hard to track down, with no better hint as to the cause than "this is slow".

It's one of those things that in the long run would have been much more time effective to write, but the debugging never quite took long enough each time to make me take the time.

> For development, we had a ?queries=1 query parameter you could add to the URL to show the number of SQL queries and their total time at the bottom of the page. Very helpful when trying to optimize this stuff. "Why is this page doing 350 queries totalling 5 seconds? Oops, I must have an N+1 query issue!"

I'm so glad Symfony (PHP framework) has a built-in profiler and analysis tooling there...

This is an example of the N+1 problem [0]. It should be a FizzBuzz for anyone building CRUD apps.

[0]: https://stackoverflow.com/questions/97197/what-is-the-n1-sel...

Your pattern is quite powerful: get data from several sources and do the rearranging on the client (which might be a web server), instead of multiple interactions for each data item.

For SQL you can also do a stored procedure. Sometimes that works well if you are good at your DBMS's procedure language and the schema is good.

Though stored procs will always execute on the server that is probably already the system bottleneck.

Sending multiple queries from your web server will likely put more load on the database server than using a stored procedure.

With either technique you are still pulling all the data you need from the DB, but with multiple queries instead of a stored procedure you are usually pulling more data than you need with each query and then dropping any rows or fields you’re not interested in. Together with multiple calls over the network to the DB server and (often) multiple SQL connection setups, this is much worse for performance on both the web and database servers.

>Sending multiple queries from your web server will likely put more load on the database server than using a stored procedure.

Lots of people seem to not realize that db roundtrips are expensive, and should be avoided whenever possible.

One of the best illustrations of this I've found is in the book Transaction Processing by Jim Gray and Andreas Reuter, where they illustrate the relative cost of getting data from the CPU vs. CPU cache vs. RAM vs. a cross-host query.

Aside from opening and writing to files, it's about the heaviest operation a backend server performs.

I had to fix a similar thing in our internal password reset email sender last year. The code was doing something like:

    for each user in (get_freeipa_users | grep_attribute uid):
        email = (get_freeipa_users | client_side_find user | grep_attribute email)
        last_change = (get_freeipa_users | client_side_find user | grep_attribute krblastpwdchange)
        expiration = (get_freeipa_users | client_side_find user | grep_attribute krbpasswordexpiration)

        # Some slightly incorrect date math...

I changed it to a single LDAP query that returns every user with only the needed attributes. It cut that Jenkins job's runtime from 45 minutes to 0.2 seconds.
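A hypothetical sketch of that refactor (the function and attribute names are placeholders, not the real FreeIPA tooling): fetch the user list once, index it by uid, and read each attribute from the dict. One O(N) pass instead of re-scanning the full list per attribute per user.

```python
def fetch_all_users():
    # stand-in for one LDAP query requesting only the needed attributes
    return [
        {"uid": "alice", "mail": "alice@example.com",
         "krbLastPwdChange": "20240101", "krbPasswordExpiration": "20240401"},
        {"uid": "bob", "mail": "bob@example.com",
         "krbLastPwdChange": "20240201", "krbPasswordExpiration": "20240501"},
    ]

# single pass: build an index instead of re-querying per attribute
users_by_uid = {u["uid"]: u for u in fetch_all_users()}

for uid, user in users_by_uid.items():
    email = user["mail"]
    last_change = user["krbLastPwdChange"]
    expiration = user["krbPasswordExpiration"]
    # ...date math and email sending would go here...
```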

Filter in app is rarely the right solution, your data should be organized in a way that you can get what you need in a single query. Reasons:

- it's memory efficient

- it's atomic

- it's faster

Also, doesn't LDAP support filtering in the query itself?

Given the context, it probably took the poster very little time to implement that fix, without digging into ldapsearch. With massive speedup, for their likely smallish ldap install. Seems like not a bad call at all.

I did some work to improve performance on a dashboard several years ago. The way the statistics were queried was generally terrible, so I spent some time setting up aggregations and cleaning that up, but then... the performance was still terrible.

It turned out that the dashboard had been built on top of Wordpress. The way that it checked if the user had permission to access the dashboard was to query all users, join the meta table which held the permission as a serialized object, run a full text search to check which users had permission to access this page, and return the list of all users with permission to access the page. Then, it checked if the current user was in that list.

I switched it to only check permissions for the current user, and the page loaded instantaneously.

If I look at traces of all the service calls in my company's microservices environment, the "meat" of each service -- the part that's actually fetching data from a database or running an intense calculation -- is a fraction of the latency. Oftentimes it's between 20 and 40 ms.

Everything else is network hops and what I call "distributed clutter": authorizing via a third party like Auth0 multiple times for machine-to-machine tokens (because "zero trust"!), multiple parameter store calls, hitting a dcache, cold starts if interacting with a serverless function, API gateway latency, etc.

So for the meat of a 20-40 ms call, we get about a 400ms-2s backend response time.

Then if you are loading a front end SPA with javascript... fuhgeddaboudit.

But DevOps will say "but my managed services and infinite scalability!"

Exactly like that.

Not sure what the exact use case was (i.e. the output of the filtering) but—from reading the first algo—seems to be something to do with determining group membership and permissions for a user.

In that case, was there a reason joins couldn't be used? It still seems pretty wasteful (and less performant) to load all of this data into memory and post-process, whereas a well-indexed database could probably do it faster and with less memory usage.

In defense of whoever wrote the original code: it probably would have been reasonably fast if it had been a database query with proper indexes. The filters would have whittled the selection down to only the relevant data, whereas returning basically three entire tables of data to then throw away most of it would have been extremely inefficient.

The mistake of course was not thinking about why this approach is faster in a database query and that it doesn't work that way when you already need to get all the data out of LDAP to do anything with it.

Yeah - you likely want to do this in a single simple query, which you can optimize if necessary. An N+1 query is bad. An O(N^2) query is something I have rarely seen. Congrats!

I'd also add that groups and permissions are probably constant and can be cached with a long timeout.

Is there a reason you shelled out with a subprocess versus using a library like ldap3? Just curious

I believe the parent's point was that code tends to be faster than what people expect, not slower.

I think the child's point is that people expect code to be slower than it is because they have seen code be slow far more than necessary.

Crypto markets are very small :-)

I'm working at a company which processes "real" exchanges, like NASDAQ, LSE, and especially the OPRA feed.

We've added 20+ crypto exchanges in our portfolio this year, and all of them are processed on one old server which is unable to process NASDAQ Total View in real-time anymore.

On the other hand, the whole OPRA feed (more than 5 Gbit/s, or 65B messages/day - yes, billions - in a very optimized binary protocol, not this crappy JSON) is processed by our code on one modern server. Nothing special: two sockets of Intel Xeons (not even Platinums).

I've read your few posts a few times and I'm still not sure why you made your post. You're telling the person that you handle more data than them and thus need more resources than them. Was your goal to smugly belittle them? It's not like they said any problem can be solved on their specific resources.

Nope, I want to say that even much more data can be processed on very limited hardware, and that this is additional confirmation that current hardware is immensely powerful and very underestimated.

And that go-to solutions / golden hammers like REST / JSON are very much suboptimal.

I mean, I'm still going to use it for client/server communication and the like, because my performance constraints aren't serious enough to warrant something that would be more difficult to develop for, but still.

I feel like he’s saying two things.

One, that he’s surprised by how small crypto markets are.

Two, that this one-server (or very-few-servers) processing approach scales quite well, up to billions of messages a day.

I didn’t find any element of smugness here, but maybe I misread the tone.

They are just pointing out the size of crypto market data vs normal market data. I found it interesting, not their fault you didn’t.

I'd initially responded, but I found that my response had an element of a pissing match over whose data is bigger, so I deleted it.

The thing is - when two engineers get smug, oftentimes lots of fairly interesting technical details get exchanged, so such discussions aren't really useless to bystanders.

It depends on what your server does with each request though; '65B a day' means little. If all it does is write them to a log, then I'm surprised you're not using an RPi.

Could you share some more about that very optimized binary protocol? I know there are ways to be more efficient than JSON, but since you call it crappy, your solution must be much, much better. Honestly interested in reading up more.

It is not "our" protocol; it is a protocol designed by the exchange, and we need to support it, as we cannot change it :). Simple binary messages, with binary-encoded numbers, etc. No string parsing, no syntax, nothing like that - only bytes and offsets. Think of a TCP header, for example.

JSON is very inefficient both in bytes (a 32-bit price is 4 bytes in binary and could be 7+ bytes as a string - think "1299.99") and in CPU: to parse "1299.99" you need to burn a lot of cycles, whereas if it is a number of cents stored as a native 4-byte integer you need at most 3 shifts and 4 binary ORs if you need to change endianness, and in most cases it is a simple memory copy of 4 bytes - 1-2 CPU cycles.

When you have a binary protocol, you can skip fields you are not interested in with a simple "offset = offset + <field-size>" (where <field-size> is a compile-time constant!), while with JSON you need to parse the whole thing anyway.

The difference between converting a binary packet to an internal data structure and parsing JSON with the same data into the same structure can easily be ten-fold. And you need to be very creative to parse JSON without additional memory allocations (it is possible, but the code becomes very dirty and fragile), and memory allocation/deallocation costs a lot, both in GC languages and in languages with manual memory management.
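A small sketch of the contrast, using a made-up fixed-layout message (not any real exchange protocol): price in cents and quantity as big-endian 4-byte integers, followed by an 8-byte timestamp. Reading is pure offset arithmetic; skipping the timestamp means simply not reading those bytes.

```python
import struct, json

# fixed layout: 4-byte price (cents), 4-byte quantity, 8-byte timestamp
MSG = struct.pack(">IIQ", 129999, 500, 1_620_000_000_000)

def parse_binary(buf):
    # fields live at known offsets; no scanning, no string parsing
    price_cents, qty = struct.unpack_from(">II", buf, 0)
    # the 8-byte timestamp is "skipped" just by not reading it
    return price_cents, qty

assert parse_binary(MSG) == (129999, 500)

# The JSON equivalent has to scan and parse the entire string:
text = json.dumps({"price": "1299.99", "qty": 500, "ts": 1_620_000_000_000})
obj = json.loads(text)
price_cents = round(float(obj["price"]) * 100)
assert price_cents == 129999
```

The binary message is also 16 bytes versus ~50 for the JSON text here, before even considering parsing cost.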

Curious if the binary protocol uses floating point or a fixed point representation? Or is floating point with its rounding issues sufficient for the protocol's needs?

No GP, but familiar with these protocols. They use fixed point extensively; I can't even think of an exchange protocol which would use floating point, since the rounding issues would cause unnecessary and expensive problems.

This is typical (from NASDAQ http://www.nasdaqtrader.com/content/technicalsupport/specifi... ):

Prices are integer fields. When converted to a decimal format, prices are in fixed point format with 6 whole number places followed by 4 decimal digits. The maximum price in OUCH 4.2 is $199,999.9900 (decimal, 7735939C hex). When entering market orders for a cross, use the special price of $214,748.3647 (decimal, 7FFFFFFF hex).
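Decoding that fixed-point format is a one-liner: a 32-bit integer with four implied decimal digits. A sketch using Decimal to avoid binary floating-point rounding:

```python
from decimal import Decimal

def decode_price(raw: int) -> Decimal:
    # OUCH 4.2-style fixed point: integer with 4 implied decimal digits
    return Decimal(raw) / Decimal(10_000)

assert decode_price(0x7735939C) == Decimal("199999.99")    # stated max price
assert decode_price(0x7FFFFFFF) == Decimal("214748.3647")  # market-cross sentinel
```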

> The maximum price in OUCH 4.2 is $199,999.9900

For NASDAQ it seems to have been something around 430k / share... Buffett's BRK shares threatened to hit that limit a couple months ago: https://news.ycombinator.com/item?id=27044044

Many exchanges use floating point, even in the Nasdaq technology stack.

X-Stream feeds do for example

Most of them use decimal fixed point. Sometimes exponent (decimal, not binary one!) is fixed per-protocol, sometimes per-instrument and sometimes per-message, it depends on exchange.

Thanks for the writeup!

Some Googling turned up this protocol descriptor:


If you're optimizing for latency JSON is pretty terrible, but most people who use it are optimizing for interoperability and ease of development. It works just fine for that, and you can recover decent bandwidth just by compressing it.

There are many binary encoding protocols. A popular one is protobufs[1], which is used by gRPC.

[1]: https://developers.google.com/protocol-buffers

"Old skool" exchanges use either FIX (old and really verbose), FAST (a binary encoding for FIX), or custom fixed-layout protocols.

Most big USA exchanges use custom fixed-layout protocols, where each message is described in the documentation, but not in a machine-readable way. European ones still use FAST.

I haven't seen FIX in the wild for data feeds, but it is used by brokers to submit orders to exchanges (our company didn't do this part; we only consume feeds).

I don't know why, but all crypto exchanges use JSON, not protobufs or anything like that, and don't publish any formal schemas.

Fun fact: one crypto exchange puts GZIP'ed and base64'ed JSON data inside JSON pushed over a websocket, to save bandwidth. IMHO, it is the peak of bad design.

there's still ascii FIX floating around on the market data side for a few esoteric venues

FAST is not particularly common in Europe

the large European venues use fixed-width binary encoding (LSE group, Euronext, CBOE Europe)

Eurex uses FAST for sure, but I could be wrong about "common".

Eurex/XETRA EOBI no longer uses FAST. It is about 8 years old at this point, but IIRC some products are still on the older protocol.

Euronext uses SBE specifically.

And msgpack if you want an order of magnitude faster serialization/deserialization and can put up with worse compression (I think mainly due to schema overhead, since protobuf files don't store the schema?).


Good protobuf vs msgpack comparison: https://medium.com/@hugovs/the-need-for-speed-experimenting-...

Have a look at NASDAQ ITCH, OUCH, and RASH (these are the real names, the story I heard is the original author of them didn't like the usual corporate style brand names and wanted certain people to squirm when talking about them).

In 2001 we started a GSM operator on a Compaq server (this was before they were bought by HP) with a whole 1 GB(!) of RAM and 2x10 GB SCSI disks.

It served up to 70K subscribers, a call center with 30-40 employees, payment systems integration - everything.

Next was an 8-socket Intel server. We were never able to saturate its CPUs - the 300 MHz (or was it 400?) bus was the limiting factor. It served 350-400K subscribers.

And next: we changed the architecture and used 2 servers with 2-socket Intel CPUs again, but that was when GHz frequencies appeared on the market. We dreamed about a 4x AMD server. We reached ~1 million active subscribers.

Nowadays, every phone has more power than those servers had. A typical React application consumes more resources than the billing system did. A gigabyte here, a gigabyte there - nobody counts them.

/grumpy oldster mode

People may underestimate how fast modern machines are, but that is probably in part because, at least in my fairly relevant experience, I have literally never seen a CPU bottleneck under normal circumstances. Memory pressure is nearly always the driving issue.

The CPU is rarely used up to 100% because most code fails to utilize several cores efficiently.

OTOH a service loading a single core with its main thread is a frequent sight :( Interpreted languages like Python can easily spend 30% of their time just on deserialization overhead, converting data from the DB into a result set and then into ORM instances.

Yeah. Now that CPUs are insanely powerful and you have NVMe SSDs etc the bottleneck is always memory.

It's also amazing how much you can fit in RAM if you're careful. I remember ~2007 people were aghast at Facebook's 4T memcached deployment that stored basically everyone's social network posts; now you can get single servers for ~$4K with 4T of RAM.

The trick is basically that you have to eschew the last 15 years of "productivity" enhancements. Pretty much any dynamic language is out; if you must use the JVM or .NET, store as much as possible in flat buffers of primitive types. I ended up converting order books from the obvious representation (hashtable mapping prices to a list of Order structs) to a pair of SortedMaps from fastutil, which provide an unboxed float representation with no pointers. That change ended up reducing memory usage by about 4x.

You can fit a lot of ints and floats in today's 100G+ machines, way more than needed to represent the entire cryptocurrency market. You just can't do that when you're chasing 3 pointers, each with their associated object headers, to store 4 bytes.
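A rough Python analogue of the "no boxed objects" layout: keep each side of the book as parallel primitive arrays sorted by price, instead of a dict of Order objects. (The comment above used fastutil's sorted maps on the JVM; this just illustrates the idea of unboxed, pointer-free storage, not that API.)

```python
import bisect
from array import array

class BookSide:
    def __init__(self):
        self.prices = array("d")  # unboxed doubles, contiguous in memory
        self.sizes = array("d")

    def update(self, price, size):
        # binary search for the price level; size 0 removes the level
        i = bisect.bisect_left(self.prices, price)
        if i < len(self.prices) and self.prices[i] == price:
            if size == 0:
                del self.prices[i]
                del self.sizes[i]
            else:
                self.sizes[i] = size
        elif size != 0:
            self.prices.insert(i, price)
            self.sizes.insert(i, size)

bids = BookSide()
bids.update(100.5, 3.0)
bids.update(100.0, 1.0)
bids.update(100.5, 0)  # level removed; only 100.0 remains
```

Per level this stores 16 bytes flat, versus a boxed object per order with headers and pointer chases.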

> now you can get single servers for ~$4K with 4T of RAM

Does the $4K include the cost of the RAM? Where can I find these servers? Thanks!

The more I read comments on subjects I am intimately familiar with, the more I realize most people who comment on HN don't really know what they're talking about and mostly make things up.

To answer your question, you can't find these servers because they don't exist. A server with 4T of RAM will cost you at a minimum $20,000 and that will be for some really crappy low-grade RAM. Realistically for an actual server that one would use in an actual semi-production setting, you're looking at a minimum of $35,000 for 4TB of RAM and that's just for the RAM alone, although to be fair that 35k ends up dominating the cost of the entire system.

For $4k you could get an (old) refurbished R820 with 1.5TB of (old) DDR3 RAM.

Which is a far cry from the claimed 4TB, but still, damn.

Thanks for your response. Your numbers are better than what I came up with, but I thought I'd check just in case I was missing a great deal somewhere.

Unless he was referring to $4k/mo for renting a 4TB server?

Maybe they downloaded more RAM...

> the more I realize most people who comment on HN don't really know what they're talking about and mostly make things up.

I don't think they're typically making things up. It's what I prefer to call Reddit knowledge. They saw someone else claim it somewhere, they believed it, and so they're repeating it so they can be part of the conversation (people like to belong, like to be part of, like to join). It's an extremely common process on Reddit and most forums, and HN isn't immune to it. Most people don't read much and don't acquire most of their knowledge from high quality sources, their (thought to be correct) wider knowledge - on diverse topics they have no specialization on - is frequently acquired from what other people say and that they believe. So they flip around on Reddit or Twitter for a bit, digest a few nuggets of questionable 'knowledge' and then regurgitate it at some later point, in a process of wanting to participate and belong socially. It's how political talking points function for example, passed down to mimic distributors that spread the gospel to other mimic followers (usually without questioning). It's how religion functions. And it's how most teachers / teaching functions, the teachers are distribution mimics (mimics with a bullhorn, granted authority by other mimics to keep the system going, to clone).

It's because some very high percentage of all of humans are mimics. It's not something Reddit caused of course, it's biology, it's a behavior that has always been part of humanity. It's an increased odds of success method of optimizing for survival of the species, successful outcomes, meets the Internet age. It's why most people are inherent followers, and can never be (nor desire to be) leaders. It's why few people create anything original or even attempt to across a lifetime. It's why such a small fraction of the population are very artistic, particularly drawn to that level of creative expression. If you're a mimic biologically it's very difficult to be the opposite. This seems to be viewed by most people as an insult (understandably, as mimics are the vast majority of the population and control the vote), however it's not, it's simply how most living things function, system wise, by mimicry (or even more direct forms of copying). Humans aren't that special, we're not entirely distinct from all the other systems of animal behavior.

That saying, safety in numbers? That's what that is all about. Mimicry. Don't stand out.

The reason most Wall Street money managers can't beat the S&P 500? It's because they're particularly aggressive mimics, they intentionally copy eachother toward safe, very prosperous, gentle mediocrity. They play a game of follow, with popular trends (each decade or era on Wall Street has popular trends/fads). Don't drift too far below the other mimics and it's a golden ticket.

Nobody got fired for buying IBM? Same thing. Mimic what has worked well for many others is biologically typically a high success outcome pattern (although amusingly not always, it can also in rare occasions lead off a cliff).

The Taliban? The Soviet Union? Nazism? Genocide? Multi generational patterns of mistake repetition passed down from parental units? That's how you get that. People mimic (perhaps especially parental units; biology very much in action), even in cases where it's an unsuccessful/negative pattern. All bad (and good) ideologies have mimic distributors and mimic followers, the followers do what they're told and implement as they're told. And usually there are only a very small number of originators, which is where the mimic distributors get their material.

The concept of positive role models? It's about mimicry toward successful outcomes.

Being a mimic and being a leader/creator are not exclusive; you can do both in various fields. One can even mimic and eventually start creating their own things on the back of that: “fake it until you make it” is a real thing.

> The more I read comments on subjects I am intimately familiar with, the more I realize most people who comment on HN don't really know what they're talking about and mostly make things up.

HN doesn't look exactly like SlashDot, but it's absolutely just like SlashDot.

It's mostly the good bits of old /. for me.

Intelligent discourse by knowledgeable persons.

The omission of GNAA and "frosty" posts are a massive boon.

I just Googled [servers with 4T of RAM], but apparently no, it does not include the RAM itself. Came up with this (about $4K base, another $44K to max it out with RAM):


Yeah, I've been eyeing the EPYC servers, but RAM pricing seems to very roughly be in the ballpark of $1,000 for 128GB, so $4k for 4T sounded very attractive and I wanted to make sure that I wasn't missing anything. As cvwright pointed out, you can get old DDR3 RAM for less, but I haven't found servers that can fit that many DIMMs. Thanks for your response.

I'm fairly sure I know where I can buy used/refurb Dell R820 at good prices with 512 or 1024GB of RAM, but I don't think I could accomplish the same for 4TB of RAM and $4000. Certainly not in one machine. And not with two $2000 machines each with 2048GB. We're close, but it's not that cheap yet.

Looking on ebay I can find some pretty decent R820 with 512GB each for right around $1500 a piece. Not counting any storage, even if they come used with some spinning hard drives, would end up replacing with SSDs. So more like three servers, 1.5TB of RAM, for $4500.

Yeah I run mirrors for certain sites in different regions of the world and a trick I often do is scrape the target site and then hold the entire site in memory. Even the top 500 posts of HN and all of the associated comments can fit in < 16 MB of RAM. If you want to serve that up over a socket, it's really fast to grab a pointer to the memory and just write it to the socket. You can get response times which are dwarfed by network latency that way.
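A toy version of that "hold the rendered site in RAM" trick, sketched with the stdlib http.server for brevity (the real setup described above is a scraper plus mirror; everything here is illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# pre-rendered content held entirely in memory
CACHED_PAGE = b"<html><body>pre-rendered mirror content</body></html>"

class MirrorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # every request is just a memory-to-socket write:
        # no disk, no DB, no template rendering
        self.send_response(200)
        self.send_header("Content-Length", str(len(CACHED_PAGE)))
        self.end_headers()
        self.wfile.write(CACHED_PAGE)

    def log_message(self, *args):
        pass  # keep the example quiet
```

Serving from a bytes object like this leaves network latency as essentially the only cost per request.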

~$40K? or ~$4k/mo?

In my experience disk i/o is the biggest bottleneck. It used to be sync()ing writes to disk for strict consistency but that's been pushed down to the DB now. I just looked at my DB systems and CPU is low but disk is nearly pegged.

My data sets are far too big to fit into memory/cache. Disk pressure can be alleviated by optimizing queries but it's a game of whack-a-mole.

I have exhausted EBS i/o and been forced to resort to dirty tricks. With RDS you can just pay more but that only scales to a point – normally the budget.

Does anyone know if any of the current crop of standard databases make use of things like Direct IO (where available on Linux), I’ve seen some write-ups that indicate you can get eye-wateringly fast performance when combined with an nvme drive.

Scylla is built around direct I/O.

Specifically, aio (async io).

Sure, but as far as software is concerned, optimizing for memory bandwidth (the typical bottleneck in modern systems) is not so different from optimizing for CPU.

In my experience it is almost always the database holding things up. If your app does not use a database or it makes very simple use of it, then I'm not surprised it is blazing fast. As soon as you need to start joining tables and applying permissions to queries, it all gets slow.

I've seen exactly the opposite although you certainly can't ignore memory speed.

I'm running a crypto trading platform I'm developing on $30 on DigitalOcean. I coded exclusively in Rust and recently added a dynamic interface to Python. Today during the BTC crash it spiked at 20k events/s, and that's only incoming data.

> I coded exclusively in Rust

This reminds me of back in 2003, a friend of mine worked for an online casino vendor; basically, if you wanted to run an online casino, you'd buy the software from a company and customize it to fit your theme.

They were often written in Java, ASP.NET, and so on. They were extremely heavyweight. They'd need 8-10 servers for 10k users. They hogged huge amounts of RAM.

My friend wrote the one this company was selling in C. Not even C++, mind you, just C. The game modules were chosen at compile time, so unwanted games didn't exist. The entire binary (as in, 100% of the code) compiled to just over 3 MB when stripped. He could handle 10k concurrent users on one single-core server.

I'm never gonna stop writing things in Python, but it still amazes me what can happen when you get down close to the metal.

OpenResty [1] is a good mix of these concepts. It serves requests through nginx (which is at its core just a lightweight event loop) and then serves pages through LuaJIT. If you need more speed you could always write an nginx module in C (or in some other language via the C ABI).

[1]: https://openresty.org/en/

Yeah, currently for me there's no tool like Rust for performance; the borrow checker isn't that complicated. However, last month I started adding a Python wrapper around the public API, so I'm slowly going your way ;)

Probably, Erlang would be a good fit for your task.

In fact you're quite perspicacious: the entire system uses actors in Rust.

And you are totally right

is that $30 for a single droplet or is it spread out between a few different services? I'm kind of curious since I use DO for small projects myself.

I’m also curious on this. My stack currently is a SQLite file -> ORM -> .net core on Linux on a single box

I use Terraform and Ansible. One droplet has monitoring (Prometheus and Grafana), 2 run services without any other infrastructure. The rest of the cost is 250 GB of block storage on DO.

It's not a single droplet, it's 3 + data

I'm running two different projects on a single instance of a cheap VM too. Both of them run on a not-so-memory-efficient programming runtime, yet the VM handles the load just fine.

Of course, a lot of it depends on what your app does for each request but most apps are simple enough and can live with being a monolith / single fat binary running on a single instance.

The problem with today's DevOps culture is that it presents K8s as the answer for everything, instead of defining a clear line on when to use it and when not to.

Would you mind describing your stack in more detail? Did you use gRPC with Go?

Sure, startup is defunct now and I think arbitrage & data on centralized exchanges is a dead market now. Wall Street HFTs got into the arbitrage game, and the data sites laypeople actually visit are the ones started in 2014.

Codebase was pure server-side Kotlin running on the JVM. Jackson for JSON parsing, when the exchange didn't provide their own client library (I used the native client libraries when they did). Think I used Undertow for exchange websockets, and Jetty for webserving & client websockets. Postgres for DB.

The threading model was actually the biggest bottleneck, and took a few tries to get right. I did JSON parsing and conversion to a common representation on the incoming IO thread. Then everything would get dumped into a big producer/consumer queue, and picked up by a per-CPU threadpool. Main thread handled price normalization (many crypto assets don't trade in USD, so you have to convert through BTC/ETH/USDT to get dollar prices), order book update, volume computations, opportunity detection, and other business logic. It also compared timestamps on incoming messages, and each new second, it'd aggregate the messages for that second (I only cared about historical data on a 1s basis) and hand them off to a separate DB thread. DB would do a big bulk insert every second; this is how I kept database writes below Postgres's QPS limit. Client websocket connections were handled internally within Jetty, which I think uses a threadpool and NIO.

Key architectural principles were 1) do everything in RAM - the RDS machine was the only one that touched disk, and writes to it were strictly throttled 2) throw away data as soon as you're done with it - I had a bunch of OOM issues by trying to put unparsed messages in the main producer/consumer queue rather than parsing and discarding them 3) aggregate & compute early - keep final requirements in mind and don't save raw data you don't need 4) separate blocking and non-blocking activities on different threads, preferring non-blocking whenever possible and 5) limit threads to only those activities that are actively doing work.
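The per-second aggregation step (principle 3, and the trick used above to keep DB writes below Postgres's QPS limit) can be sketched roughly like this; names and the OHLC-style row shape are illustrative, not from the original codebase:

```python
from collections import defaultdict

def aggregate_by_second(messages):
    """messages: iterable of (timestamp_ms, symbol, price, qty).
    Returns one 1-second summary row per (second, symbol), ready for
    a single bulk INSERT instead of a write per message."""
    buckets = defaultdict(list)
    for ts_ms, symbol, price, qty in messages:
        buckets[(ts_ms // 1000, symbol)].append((price, qty))

    rows = []
    for (sec, symbol), trades in sorted(buckets.items()):
        prices = [p for p, _ in trades]
        rows.append({
            "second": sec, "symbol": symbol,
            "open": prices[0], "high": max(prices),
            "low": min(prices), "close": prices[-1],
            "volume": sum(q for _, q in trades),
        })
    return rows

rows = aggregate_by_second([
    (1000, "BTC", 10.0, 1.0),
    (1500, "BTC", 12.0, 2.0),
    (2100, "BTC", 11.0, 1.0),
])
# two rows: seconds 1 and 2, with second 1 aggregating two trades
```

Handing one such batch per second to a dedicated DB thread turns thousands of messages into one bulk insert.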

Would you use Kotlin again for the back end? Having not yet used it for that purpose, it seems like you’d get the benefit of the JVM ecosystem along with a nice language (but perhaps too many power-features).

Yes. Kotlin is great for both front and backend.

For that particular use-case (or related financial ones) I'd consider Rust, which was a bit too immature when I was working on this but would give you some extra speed. HFT is winner-take-all and the bar has risen significantly even in the last couple years, so if I were putting my own money at risk now I'd want the absolute fastest processing possible.

I tried to do something similar recently. Your architecture sounds a lot like mine. I did all the thread pool management and JSON parsing using C++ libraries/frameworks.

The real-time data was visualized on a website. Here is an example. https://algot.io/

True. I can completely relate to this. I developed an open source crypto exchange connector [0] and created a fun Twitter bot [1] on top of it. Currently the bot processes all the USDT market trades from Binance (around 260 markets, averaging 30,000 trades per minute) and calculates OHLC metrics every 15 minutes using InfluxDB. All of this runs on a free-tier 1 vCPU / 1 GB RAM AWS server (always under 10% CPU and under 40% RAM usage).

[0]: https://github.com/milkywaybrain/cryptogalaxy

[1]: https://twitter.com/moon_or_earth

They really do, and because of that they reach for over-engineered infrastructure solutions. I mean I get that you'd like to have some redundancy for your webserver and database, maybe some DDOS mitigation, off-site backup, etc, but you don't need an overblown microservices architecture for a low-traffic CRUD application. That just creates problems, instead of solving problems you WISH you had, and it's slower than starting simple.

I totally agree! I run a data science department at a corporation and it's amazing how much of our job is done on our laptops. I have a Dell Precision. When I need more power (a very rare situation), I can spin up a GPU cloud server and complete my big analysis for under $5.

Maybe OT, but how do you subscribe to these trade feeds? Is there a unified service, or do you need to do it individually for each source, and how much does it cost approximately?

I'm guessing if you put all this data into Kinesis or message queues it would end up costing quite a bit more.

There are probably unified services that let you do it - I was kinda competing in this area but didn't want to deal with enterprise sales, and it's a bit of a hard sell anyway.

If you do it individually, there are public developer docs for each exchange that explain how their API works. It's generally free as long as you're not making a large number of active trades.

Never heard of a crypto exchange that charges for data feeds; the norm is free and fast. One of the positives of the industry compared to old-school finance.

They're rent seeking in other ways though, no worries.

How do I get in touch with you? Definitely using more resources than this to process fewer integrations. I’m curious what trade offs you made to enable this.

If you’re anywhere in the US, let me know.

What were the most important aspects of the technology stack which enabled that?

How much bandwidth did this use daily/monthly?

I don't remember offhand. I think bandwidth costs averaged about $40-50 out of that ~$200/month.
