Matthias247's comments

I've been working in that domain for a long time - including being one of the main architects for HTTP/3+QUIC in a public CDN offering. And I'll agree with everyone that this is a very niche question, and a great answer seems out of reach for most "senior engineers".

Translating a UDP packet into an HTTP request and back is reasonably easy. Yes, maybe one can do that in a coding interview with some pseudo code. But scaling it and making it reliable is yet another dimension.
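
To illustrate the "easy" part, here is a minimal sketch of that translation, assuming a tokio runtime, the reqwest HTTP client, and a made-up upstream URL; real code would need error handling, timeouts and concurrency:

    use tokio::net::UdpSocket;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // One UDP socket; every incoming datagram becomes one HTTP request.
        let socket = UdpSocket::bind("0.0.0.0:9000").await?;
        let client = reqwest::Client::new();
        let mut buf = vec![0u8; 1500];

        loop {
            let (len, peer) = socket.recv_from(&mut buf).await?;
            // "https://upstream.example/ingest" is a placeholder upstream endpoint.
            let resp = client
                .post("https://upstream.example/ingest")
                .body(buf[..len].to_vec())
                .send()
                .await?;
            let body = resp.bytes().await?;
            // Assumes the response fits into a single datagram.
            socket.send_to(&body, peer).await?;
        }
    }

Note that this serial loop on a single socket is exactly the bottleneck discussed next.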

Any candidate would need to understand that a single UDP socket would probably already be a bottleneck just for running this on a single machine, and figure out how to increase concurrency there. That's not easy - the number of engineers who know how SO_REUSEPORT works, and when it doesn't work, is low.
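
As a rough sketch of the SO_REUSEPORT approach (using the socket2 crate; how the kernel steers packets across the sockets, and when that breaks down, is exactly the hard part this leaves out), each worker thread binds its own socket to the same port:

    use socket2::{Domain, Protocol, Socket, Type};
    use std::net::{SocketAddr, UdpSocket};

    // Each worker gets its own socket bound to the same address;
    // the kernel hashes incoming packets across them.
    fn reuseport_socket(addr: SocketAddr) -> std::io::Result<UdpSocket> {
        let socket = Socket::new(Domain::IPV4, Type::DGRAM, Some(Protocol::UDP))?;
        socket.set_reuse_port(true)?; // SO_REUSEPORT - Linux/BSD only
        socket.bind(&addr.into())?;
        Ok(socket.into())
    }

    fn main() -> std::io::Result<()> {
        let addr: SocketAddr = "0.0.0.0:9000".parse().unwrap();
        let mut workers = Vec::new();
        for _ in 0..4 {
            let sock = reuseport_socket(addr)?;
            workers.push(std::thread::spawn(move || {
                let mut buf = [0u8; 1500];
                while let Ok((len, peer)) = sock.recv_from(&mut buf) {
                    // handle/forward the datagram...
                    let _ = sock.send_to(&buf[..len], peer);
                }
            }));
        }
        for w in workers {
            let _ = w.join();
        }
        Ok(())
    }

The caveat alluded to above: the kernel distributes datagrams by hashing the source address and port, so anything that changes that tuple mid-connection (NAT rebinding, QUIC connection migration) can move a connection to a different socket.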

After that you start to dig into how you can actually spread load between hosts. Would an answer like "I hope my cloud provider solves this for me" be good enough? Probably not. And if it actually is, do candidates have to understand both the cloud provider's native APIs and Terraform (mentioned in the blog post)? That seems pretty unnecessary; Terraform is just one tool out of the myriad of tools that can be used to configure cloud services, and not everyone will have used it before. Or would candidates even be expected to write a long discussion of the pros and cons of client-side load balancing?

Are applicants required to talk about upstream connection pooling? Describe and implement a CDN-like multi-tier architecture?

Last but far from least, the requested architecture is very easy to misuse for denial-of-service and amplification attacks. Just being able to describe how to monitor and mitigate that is already a very, very hard task that only a few specialists have worked on so far.
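
To make the amplification concern concrete: one common mitigation, borrowed from QUIC's address-validation rule (RFC 9000), is to cap the bytes sent to a peer that hasn't proven it can receive them at roughly 3x the bytes received from it. A hedged sketch, with the bookkeeping types invented for illustration:

    use std::collections::HashMap;
    use std::net::SocketAddr;

    // Illustrative per-peer accounting; a real proxy would also expire entries.
    #[derive(Default)]
    struct Budget {
        received: u64,
        sent: u64,
        validated: bool,
    }

    #[derive(Default)]
    struct AmplificationGuard {
        peers: HashMap<SocketAddr, Budget>,
    }

    impl AmplificationGuard {
        fn on_datagram(&mut self, peer: SocketAddr, len: u64) {
            self.peers.entry(peer).or_default().received += len;
        }

        // Only reply if the peer is validated, or if we stay under 3x of
        // what it has sent us (QUIC's anti-amplification factor).
        fn may_send(&mut self, peer: SocketAddr, len: u64) -> bool {
            let b = self.peers.entry(peer).or_default();
            if b.validated || b.sent + len <= 3 * b.received {
                b.sent += len;
                true
            } else {
                false
            }
        }
    }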

It's very fuzzy what would be good enough if this is a "homework task". At least in a synchronous interview the interviewer could give feedback that they are satisfied. So I think in a synchronous interview the question might be ok - but there will probably only be time to talk about either the coding or the system architecture.


This is clearly a simple proxy that can scale horizontally. That should get the task completed. The SO_REUSEPORT/epoll/io_uring stuff is definitely a point of research (and for TCP too, not just UDP), but it's doable (here it helps if the senior eng. applicant can read Linux kernel code and is resourceful). If you're going to exceed a 10Gb/s NIC's bandwidth you'll have to talk about using multiple IPs, DNS tricks, client smarts, etc. And all of this assumes that the HTTP backend can go at least as fast as the UDP proxy, which... is a big assumption, because it's much harder to get an application to perform as fast as a proxy, and TFA is already asking a lot of a proxy.


Generally I've provided this as a homework task, with the ability to email me and ask specific questions to help guide the candidate, over whatever period of time the candidate wants.

There are definitely degrees of correctness and completeness, and depending on the candidate's experience and level, certain solutions are acceptable. For example, a totally naive implementation in golang that doesn't quite hit the scalability requirements would be a good conversation starter and would pass a mid-level or junior candidate.

A senior or above "badass" candidate would be expected to hit the scalability requirements.

An incredible candidate would teach us something new about this problem that we don't already know.


Why? If the pod is defined to spawn multiple containers, and each container runs the same application, then this seems true to me? Unless you would add an additional filter on the container name.


Well yes, obviously you have to filter on container if you want a single container (just like kubectl logs -l <...>). The parent comment was phrased as a limitation of Loki; of course, if you request all logs for an application you'll get all containers, and if you request all logs for an application or a namespace you will get exactly that.

Not being able to filter between multiple processes or multiple restarts of a container was a genuine issue; not being able to filter between pods of a deployment is not.


I actually didn't understand it being phrased as a limitation. It could also be a feature - maybe one would prefer to look at logs for multiple services within a single query?

Anyhow, the nice thing about the system is that one can get anything that is preferable as long as the logs are annotated correctly (with pod and container id).


No, once again, the trouble is that you can't get the logs for a specific execution. If a container in your pod restarts, that is invisible to Loki, you have to look for whatever the container writes on startup and cut there manually. If you want a specific process in your container, it's mixed with the rest.


yes it can, if you tag your log stream correctly - either by having the stream externally tagged via attributes, or internally by following certain conventions in the log line.

You can also do something like

select client_ip from requests where elapsed_ms > 10000

which is incredibly powerful


It's not easy to cancel any future. It's easy to *pretend* to cancel any future. E.g. if you cancel (drop) anything that uses spawn_blocking, it will just continue to run in the background without you being aware of it. If you cancel any async fs operation that is implemented in terms of a threadpool, it will also continue to run.

This all can lead to very hard-to-understand bugs - e.g. "why does my service fail because a file is still in use, while I'm sure nothing uses the file anymore?"
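
A minimal sketch of the pitfall with tokio (the sleep stands in for a long blocking operation holding a resource): the timeout "cancels" the future, but the blocking closure keeps running on the blocking pool:

    use std::time::Duration;

    #[tokio::main]
    async fn main() {
        let work = tokio::task::spawn_blocking(|| {
            // Stand-in for a long blocking operation, e.g. one that keeps
            // a file handle open the whole time.
            std::thread::sleep(Duration::from_secs(30));
        });

        // After 100ms this drops the JoinHandle. Dropping it only detaches
        // the task; the closure above is not interrupted and still runs
        // for the full 30 seconds in the background.
        let _ = tokio::time::timeout(Duration::from_millis(100), work).await;

        println!("'cancelled', but the blocking work is still running");
    }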


Yes, if you have a blocking thread running then you have to use the classic threaded methods for cancelling it, like periodically checking a boolean. This can compose nicely with Futures if they flip the boolean on Drop.

I’ve also used custom executors that can tolerate long-blocking code in async, and then an occasional yield.await can cancel compute-bound code.
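
A sketch of that composition (all names are illustrative): a guard flips a shared AtomicBool when the owning future is dropped, and the blocking loop checks it periodically:

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::time::Duration;

    // Flips the flag when the owning future (or task) is dropped.
    struct CancelOnDrop(Arc<AtomicBool>);

    impl Drop for CancelOnDrop {
        fn drop(&mut self) {
            self.0.store(true, Ordering::Relaxed);
        }
    }

    fn spawn_cancellable_work() -> (CancelOnDrop, std::thread::JoinHandle<()>) {
        let flag = Arc::new(AtomicBool::new(false));
        let guard = CancelOnDrop(flag.clone());
        let handle = std::thread::spawn(move || {
            // Classic threaded cancellation: check the flag between chunks of work.
            while !flag.load(Ordering::Relaxed) {
                // ...do one bounded chunk of blocking work...
                std::thread::sleep(Duration::from_millis(50));
            }
        });
        (guard, handle)
    }

Holding the guard inside the future means that dropping the future actually asks the thread to stop, instead of just forgetting about it.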


Actually the benchmarks just measure the first part (CPU efficiency), since it's a localhost benchmark. The gap is most likely due to missing GSO, if that's not implemented. It's such a huge difference, and pretty much the only thing that can prevent QUIC from being totally inefficient.


We need to distinguish between performance (throughput over a congested/lossy connection) and efficiency (CPU and memory usage). QUIC can achieve higher performance, but will always be less efficient. The linked benchmark actually just measures efficiency, since it's about sending data over loopback on the same host.


What makes QUIC less efficient in CPU and memory usage?


QUIC throws away roughly 40 years of performance optimizations that operating systems and network card vendors have done for TCP. For example (from the server side):

- sendfile() cannot be done with QUIC, since the QUIC stack runs in userspace. That means that data must be read into kernel memory, copied to the webserver's memory, then copied back into the kernel, then sent down to the NIC. Worse, if crypto is not offloaded, userspace also needs to encrypt the data.

- LSO/LRO are (mostly) not implemented in hardware for QUIC, meaning that the NIC is sent 1500b packets, rather than being sent a 64K packet that it segments down to 1500b.

- The crypto is designed to prevent MiTM attacks, which also makes doing NIC crypto offload a lot harder. I'm not currently aware of any mainstream NIC (eg, not an FPGA from a startup) that can do inline TLS offload for QUIC.

There is work ongoing by a lot of folks to make this better. But at least for now, on the server side, QUIC is roughly an order of magnitude less efficient than TCP.

I did some experiments last year for a talk I gave which approximated losing the optimizations above: https://people.freebsd.org/~gallatin/talks/euro2022.pdf For a video-CDN-type workload with static content, we'd go from being able to serve ~400Gb/s per single-socket AMD "rome" based EPYC (with plenty of CPU idle) to less than 100Gb/s per server with the CPU maxed out.

For workloads where the content is not static and already has to be touched in userspace, things won't be comparatively so bad.


> The crypto is designed to prevent MiTM attacks, which also makes doing NIC crypto offload a lot harder.

Huh? Surely what you're doing in the accelerated path is just AES encryption/ decryption with a parameterised key which can't be much different from TLS?


Among others: having to hand 1200-1500 byte packets to the kernel individually, which it will then route and filter (iptables, nftables, eBPF) individually, instead of acting on much bigger data chunks as it can for TCP. With GSO it gets a bit better, but it's still far off from what can be done for TCP.
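
For reference, a hedged sketch of the GSO path mentioned above on Linux (4.18+): setting UDP_SEGMENT on the socket lets userspace hand the kernel one large buffer per send call, which the kernel then splits into MTU-sized datagrams. Feature detection and error handling are omitted:

    use std::net::UdpSocket;
    use std::os::unix::io::AsRawFd;

    // UDP_SEGMENT tells the kernel to split each send() buffer into
    // datagrams of `segment_size` bytes (UDP GSO, Linux only).
    fn enable_gso(socket: &UdpSocket, segment_size: u16) -> std::io::Result<()> {
        let val: libc::c_int = segment_size as libc::c_int;
        let rc = unsafe {
            libc::setsockopt(
                socket.as_raw_fd(),
                libc::SOL_UDP,
                libc::UDP_SEGMENT,
                &val as *const _ as *const libc::c_void,
                std::mem::size_of::<libc::c_int>() as libc::socklen_t,
            )
        };
        if rc != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        let socket = UdpSocket::bind("0.0.0.0:0")?;
        socket.connect("192.0.2.1:4433")?; // placeholder peer
        enable_gso(&socket, 1200)?;
        // One syscall; the kernel turns this into 32 packets of 1200 bytes.
        let batch = vec![0u8; 1200 * 32];
        socket.send(&batch)?;
        Ok(())
    }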

Then there's the userspace work of assembling and encrypting all these tiny packets individually, and looking up the right data structures (connections, streams).

And there are challenges in load balancing the work of multiple QUIC connections or streams across CPU cores. If only one core dequeues UDP datagrams for all connections on an endpoint, then those will be bottlenecked by that core - whereas for TCP the kernel and drivers can already do more work with multiple receive queues and threads. And while one can run multiple sockets and threads with port reuse, that poses other challenges if a packet for a certain connection gets routed to the wrong thread due to connection migration. There are also solutions for that - e.g. in the form of sophisticated eBPF programs. But they require a lot of work and are hard to apply for regular users who just want to use QUIC as a library.


NICs assume stuff for TCP (segmentation offload) that they can’t do for UDP, or can only do in a very limited fashion (GSO).

TLS offloads are very niche. There's barely anyone using them in production, and the benchmarks were very likely run without them.


And Renesas continues to produce the existing chips/architectures (like SuperH), but also adopted ARM due to customer demand.


libmill (https://github.com/sustrik/libmill) and libdill (https://github.com/sustrik/libdill) should be similar and probably mentioned.

As far as I understand, the difference between CspChan and libmill might be that libmill also implements lightweight tasks (coroutines) and everything that goes with them (IO multiplexing, async timers, etc.), while CspChan uses OS threads?


Was going to ask if the current implementation could be used together with libmill, but maybe they overlap a lot in functionality?


I don't think you can. The library would block the libmill worker thread, which would also block all other libmill coroutines from running. The libmill channels need to yield in a way where they don't block the underlying threads.


HTTP Streams (and Server Sent Events - which is a predefined body format that is understood among web browsers) are supported just fine by most infrastructure - probably even better than websockets.

The advantage of them is that "they are just HTTP requests", so as long as your proxy actually can forward request and response bodies in a streaming fashion it will work. There's no need to understand the contents. It will not work if the proxy is implemented by waiting for the whole HTTP response to complete before forwarding the body. But that wouldn't work very well for a whole lot of other use-cases like file transfers, where buffering the whole body is not practical.
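
To make "forward in a streaming fashion" concrete, here's a hedged sketch of the consuming side using reqwest's streaming body API (requires its "stream" feature; the URL is a placeholder): each chunk is passed on as soon as it arrives instead of buffering the whole response:

    use futures_util::StreamExt;

    // `forward_chunk` stands in for whatever writes to the downstream client.
    async fn stream_upstream(
        mut forward_chunk: impl FnMut(&[u8]),
    ) -> Result<(), reqwest::Error> {
        let resp = reqwest::get("https://upstream.example/stream").await?;
        let mut body = resp.bytes_stream();
        while let Some(chunk) = body.next().await {
            // Forward each chunk immediately; never wait for the full body.
            forward_chunk(&chunk?);
        }
        Ok(())
    }

The buffering variant would be resp.bytes().await - waiting for the whole body - which is exactly what breaks SSE, long downloads, and anything else open-ended.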

A caveat is that most proxies won't allow indefinite body streaming, since they want to avoid tying up all their TCP connections, so it's likely that the streaming would be interrupted at some point. In that case a reconnect logic would be necessary. But the same applies to websockets.


He is not using SSE, he is writing to the stream of an open/incomplete http response. The caveat you name is exactly why I said this solution is one I'd avoid. You're also not taking into account several other pieces of middleware that expect a whole response before forwarding it on.


Anything attempting to proxy public Internet traffic and buffer the entire body at all times is in big trouble.

Either it has to have a very low body buffer limit and reject anything more (making it impossible to download any files through the proxy, or do anything similar) or it's trivially vulnerable to DoS where any client or server involved can crash the whole caboodle.


These are often devices or services used in the enterprise that do a lot of things that would surprise you, and just break applications willy nilly. We often have to request that exceptions be added for specific endpoints that use some of the aforementioned methods and are why I would shy away from them.

It's always fun to Response.Flush() in your server-side application only to find the client receives nothing. They typically have limits on the amount they buffer, yes, usually pretty low. However, when all you're trying to send is something as simple as "<script>Report.Progress(30)</script>" "<script>Report.Progress(35)</script>" that's part of some legacy code, you often see a sudden completion, or jumps in the completion, that are not representative of what's happened on the server side.

The best ones are the ones that try to handle mobile connections by holding open connections in weird ways, I'm talking about you NetMotion....


> In that case a reconnect logic would be necessary. But same applies to websockets.

Arguably you’re even better off in the SSE case, because it specifies a reconnect mechanism that allows clients to reconnect without receiving duplicate data. If you’re working with a client library that understands this (which includes any to-spec browser-based client), you just need to handle the reconnect header on the server and you get reconnects for “free”.
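
A hedged sketch of what that looks like (event ids and payloads are illustrative): each SSE event carries an id, a reconnecting to-spec client sends the last one it saw back in the Last-Event-ID request header, and the server replays only what came after it:

    // Formats one Server-Sent Event. A to-spec client remembers the last
    // `id:` it received and sends it as the `Last-Event-ID` header when it
    // reconnects.
    fn format_sse_event(id: u64, data: &str) -> String {
        format!("id: {id}\ndata: {data}\n\n")
    }

    // Illustrative resume logic; `events` is whatever backlog the server keeps.
    fn events_after<'a>(
        events: &'a [(u64, String)],
        last_event_id: Option<u64>,
    ) -> impl Iterator<Item = String> + 'a {
        let cutoff = last_event_id.unwrap_or(0);
        events
            .iter()
            .filter(move |(id, _)| *id > cutoff)
            .map(|(id, data)| format_sse_event(*id, data))
    }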

