More

0xQSL · 2025-06-12T20:43:38 1749761018

My employer just started doing daily reauth for all microsoft logins (teams, ...). The worst thing is that it's just 24h not start of day, so it may just be five seconds before you want to join a meeting.

They haven't found the setting for mobile yet, so I might just stop using desktop teams.

eddythompson80 · 2025-06-12T20:59:42 1749761982

Had that on the WiFi system at a facility I used to work from for a while.

When you connect to their WiFi, you go to a guest portal to connect to the internet. The guest portal grants your MAC address 24 hours of access. Meaning one day you get to work at 9, the next day you get in at 8:55, you’ll have 5 minutes more of WiFi before things just stop working and your system takes a minute to realize you need to reauth with the captive portal

lxgr · 2025-06-12T21:05:31 1749762331

This is why 24 hours is a particularly bad timespan for reauthentication. With e.g. 16 hours, you’d at least get a predictable prompt on each new workday.

steadicat · 2025-06-13T01:20:35 1749777635

One time I led a project and ran daily standups by screen-sharing our Asana board so the team could review in-progress tasks. Every day, right in the middle of the meeting, Asana logged me out. I’d rush to log back in to finish the review, thus ensuring we’d repeat the cycle exactly 24 hours later. This silly dance lasted the whole project.

nofunsir · 2025-06-13T04:12:52 1749787972

Ironically, daily standups:project_management::frequent_reauth:security

arccy · 2025-06-13T09:14:21 1749806061

Didn't you take weekends off?

paxys · 2025-06-12T21:02:07 1749762127

And successfully opening a wifi captive portal is the most difficult thing to achieve in all of tech for some reason.

marcosdumay · 2025-06-12T23:16:47 1749770207

It's even harder than moving a file from a desktop into a telephone on the same LAN with an USB cable plugging both.

Computer security has a problem.

6LLvveMx2koXfwn · 2025-06-12T21:17:23 1749763043

man, I thought that was just me.

Marsymars · 2025-06-12T22:09:30 1749766170

I’ve been complaining about this exact thing at my company for years. The worst is, they actually had it at 12h but rolled it up to 24h after some exec complained he had to sign on twice in one day.

gs17 · 2025-06-12T22:56:16 1749768976

This also just happened to me too, except we only use Outlook. Web Outlook handles this state really poorly for some reason. It doesn't kick me out, it just pops up a little banner.

0xQSL · 2025-06-12T20:40:34 1749760834

There's "disable key expiry" per machine in the tailscale admin panel

0xQSL · on Jan 7, 2025

less than 5 seconds to respond to d2d4 on apple a18 pro

0xQSL · on Nov 6, 2024

I'm not so sure about memory actually being the bottleneck for these 8 core parts. If memory bandwidth is the bottleneck this should show up in benchmarks with higher dram clocks. I can't find any good application benchmarks, but computerbase.de did it for gaming with 7800MHz vs 6000MHz and didn't find much of a difference [1]

The apple chips are APUs and need a lot of their memory bandwidth for the gpu. Are there any good resources on how much of this bandwidth is actually used in common cpu workloads? Can the CPU even max out half of the 512bit bus?

[1] https://www.computerbase.de/artikel/prozessoren/amd-ryzen-7-...

sliken · on Nov 6, 2024

Well there's much more to memory performance than bandwidth. Generally applications are relatively cache friendly, thus the X3D helps a fair bit, especially with more intensive games (ones that barely hit 60 fps, not the silly game benchmarks that hit 500 fps).

Generally CPUs have relatively small reorder windows, so a cache miss hurts bad, 80ns latency @ 5 GHz is 400 clock cycles, and something north of 1600 instructions that could have been executed. If one in 20 operations is a cache miss that's a serious impediment to getting any decent fraction of peak performance. The pain of those cache misses is part of why the X3D does so well, even a few less cache misses can increase performance a fair bit.

With 8c/16 threads having only 2 (DDR4) or 4 (DDR5) cache misses pending with a 128 bit wide system means that in any given 80-100ns window only 2 or 4 cores can continue resume after a cache miss. DDR-6000 vs DDR-7800 doesn't change that much, you still wait the 80-100ns, you just get the cache line in 8 (16 for ddr5) cycles @ 7800MT/sec instead of 8 (16 for DDR5) cycles @ 6000MT/sec. So the faster DDR5 means more bandwidth (good for GPUs), but not more cache transactions in flight (good for CPUs).

With better memory systems (like the Apple m3 max) you could have 32 cache misses per 80-100ns. I believe about half of those are reserved for the GPU, but even 16 would mean that all of the 9800X3Ds 16 threads could resolve a cache miss per 80-100ns instead of just 2 or 4.

That's part of why a M4 max does so well on multithreaded code. M4 max does better on geekbench 6 multithread than not only the 9800x3d (with 16 threads) but also a 9950x (with 16c/32 threads). Pretty impressive for a low TDP chip that fits in thin/light laptop with great battery life and competes well against Zen 5 chips with a 170 watt TDP that often use water cooling.

Dylan16807 · on Nov 6, 2024

> only 2 (DDR4) or 4 (DDR5) cache misses pending with a 128 bit wide system

Isn't that the purpose of banks and bank groups, letting a bunch of independent requests work in parallel on the same channel?

sliken · on Nov 6, 2024

Dimms are dumb. Not sure, but maybe rambus helped improve this. Dimms are synchronous and each memory channel can have a single request pending. So upon a cache miss on the last level cache (usually L3) you send a row, column, wait 60ns or so, then get a cache line back. Each memory channel can only have a single memory transaction (read or write) in flight. The memory controller (usually sitting between the L3 and ram) can have numerous cache misses pending, each waiting for the right memory channel to free.

There are minor tweaks, I believe you can send a row, column, then on future accesses send only the column. There's also slight differences in memory pages (a dimm page != kernel page) that decrease latency with locality. But the differences are minor and don't really move the needle on main memory latency of 60 ns (not including the L1/l2/l3 latency which have to miss before getting to the memory controller).

There are of course smarter connections, like AMD's hypertransport or more recently infinity fabric (IF) that are async and can have many memory transactions in flight. But sadly the dimms are not connected to HT/IF. IBM's OMI is similar, fast async serial interface, with an OMI connection to each ram stick.

wmf · on Nov 6, 2024

For AMD I think Infinity Fabric is the bottleneck so increasing memory clock without increasing IF clock does nothing. And it's also possible that 8 cores with massive cache simply don't need more bandwidth.

sliken · on Nov 6, 2024

My understanding is the single CCD chips (like the 9800x3d) have 2 IF links, while the dual CCD chips (like the 9950x) have 1. Keep in mind these CCDs are shared with turin (12 channel), threadripper pro (8 channel), siena (6 channel), threadripper (4 channel).

The higher CCD configurations have 1 IF link per chip, the lower have 2 IF links per chip. Presumably AMD would bother with the 2 IF link chips unless it helped.

CobaltFire · on Nov 7, 2024

This was only true for Epyc, and only true for a small number of low CCD SKUs.

Consumer platforms do NOT do this; this has actually been discussed in depth in the Threadripper Pro space. The low CCD parts were hamstrung by the shortage of IF links, meaning that they got a far smaller bump from more than 4 channels of populated RAM than they could have.

sliken · on Nov 7, 2024

Ah, interesting and disappointing. I've been looking for more memory bandwidth. The M4 max is tempting, even if only half the bandwidth is available to the CPUs. I was also looking at the low end epyc, like the Epyc Turin 9115 (12 channel) or Siena 8124P (6 channel). Both in the $650-$750 range, but it's frustratingly hard to figure out what they are actually capable of.

I do look forward to the AMD Strix Halo (256 bit x 8533 MHz).

Dylan16807 · on Nov 6, 2024

I can't find anything to back that up.

That said, each link gives a CCD 64GB/s of read speed and 32GB/s of write speed. 8000MHz memory at 128 bits would get up to 128GB/s. So being stuck with one link would bottleneck badly enough to hide the effects of memory speed.

sliken · on Nov 6, 2024

I've been paying close attention, found various hints at anandtech (RIP), chips and cheese, and STH.

It doesn't make much difference to most apps, but I believe the single CCD (like the 9700x) has better bandwidth to IOD then their dual CCD chips, like the 9900x and 9950x

Similarly on the server chips you can get 2,4,8, or 16 CCDs. To get 16 cores you can use 2 CCDs or 16 CCDs! But the sweet spot (max bandwidth per CCD) is at 8 CCDs where you get a decent number of cores and twice the bandwidth per CCD. Keep in mind the genoa/turin EPYC chips have 24 channels (32 bit x 24) for a 768 bit wide memory interface. Not nearly as constrained as their desktops.

Wish I could paste in a diagram, but check out:

https://www.amd.com/content/dam/amd/en/documents/epyc-techni...

Page 7 has a diagram of 96 core with one GMI (IF) port per CCD and a 32 core chip two GMI ports per CCD.

That's a gen old I believe, the max CCDs is now 16, not 12 with turin.

Dylan16807 · on Nov 6, 2024

So "GMI3-wide" and similar terms are the important things to search for.

some diagrams: https://www.servethehome.com/amd-epyc-genoa-gaps-intel-xeon-...

From another page: "The most noteworthy aspect is that there is a new GMI3-Wide format. With Client Zen 4 and previous generations of Zen chiplets, there was 1 GMI link between the IOD and CCD. With Genoa, in the lower core count, lower CCD SKUs, multiple GMI links can be connected to the CCD."

And it seems like all the chiplets have two links, but everything I can find says they just don't hook up both on consumer parts.

sliken · on Nov 6, 2024

Didn't find anything clearly stating one way or another, but the CCD is the same between ryzen and epyc, so there's certainly the possibility.

I dug around a bit, and it seems Ryzen doesn't get it. I guess that makes sense, if the IOD on ryzen gets 2 GMI links. On the single CCD parts there's no other CCD to talk to. On the dual CCD parts there's not enough GMI links to have both with GMI-wide.

Maybe this will be different on the pending Zen 5 part (Strix Halo) that will have 256 bits wide (16 x 32 bit) @ 8533 MHz = 266 GB/sec since there will be 2 CCDs and a significant bump to memory bandwidth.

wmf · on Nov 6, 2024

I'm pretty sure that memory bandwidth is only for the GPU just like on Apple silicon.

sliken · on Nov 6, 2024

Apple silicon manages around 50% (giver or take) for the CPUs.

Dylan16807 · on Nov 6, 2024

Yeah, the most relevant diagram I can find shows 32 bytes wide per core cluster and 128 bytes to the GPU.

0xQSL · on March 25, 2024

Primary energy consumption isn't a good indicator for this, as it also includes lost energy. If you replace a 50% efficiency coal power plant with wind or solar you'd reduce the primary energy by half but could still get the same amount of used electricity.

0xQSL · on April 13, 2023

Nice improvements! I'd be interested to see how much overhead tailscales magicsock adds and what a flamegraph after the change looks like. Mostly crypto or still a lot of networking syscall time?

raggi · on April 13, 2023

magicsock definitely does a bunch more work, and we do look at both profiles. The magicsock profile is harder to read as a consequence of being a more complex path, adding packet filters, the indirection for DERP and other NAT busting details, etc. Jordan did do some optimizations in the magicsock path alongside this wireguard-go work to get us over the 10gbps line.

Overall the summary of time spent is still a similar story at the coarse scale - our recent optimizations mean that we're getting ever closer to the point where we need to start working on the next layer, such as optimizing the queues (visible here in the chanrecv and scheduler times - Go runtime stuff), and once we get that out of the way things like crypto and copying will become targets. The work goes on, we have lots of plans and ideas!

ignoramous · on April 13, 2023

Super neat.

Have these optimizations (TCP GRO/GSO) been applied to non-root tailscale? I imagine, the changes needed are wildly different as the TUN device itself is gvisor/netstack. I believe, the UDP GRO/GSO part (discussed in today's blog post) may work as-is.

raggi · on April 13, 2023

Good question, it's bits and pieces. I know there's more we can do with the userspace stack - netstack has some support for GRO/GSO, but unless I'm forgetting a detail we haven't fully plumbed that yet. It would definitely be interesting to do so - avoiding TUN turnaround while still utilizing mmsg and so on should provide excellent performance for something like a tsnet/libtailscale based server. We did recently improve performance in that configuration by enabling SACK, which is very significant.

0xQSL · on Dec 13, 2022

Would it be an option to use io-uring to further recuce syscall overhead? Perhaps there's also a way to do zerocopy?

bradfitz · on Dec 13, 2022

That was previously explored in https://github.com/tailscale/tailscale/issues/2303 and will probably still happen.

When Josh et al tried it, they hit some fun kernel bugs on certain kernel versions and that soured them on it for a bit, knowing it wouldn't be as widely usable as we'd hoped based on what kernels were in common use at the time. It's almost certainly better nowadays.

Hopefully the Go runtime starts doing it instead: https://github.com/golang/go/issues/31908

Matthias247 · on Dec 13, 2022

It will likely not help a lot, because syscall overhead is not the bottleneck - but the implementation of the actual syscalls is - which is very visible in the flamegraphs. Both sendmmsg/recvmmsg and io_uring only help with the syscall overhead, and therefore are not as helpful in improving efficiency as the the offloads which make the actual network stack more efficient.

Besides kernel supplied offloads, the thing which helps further is actually bypassing the kernel with AF_XDP or DPDK. But those techniques will have other challenges and limitations.

0xQSL · on Nov 10, 2022

I was planning to build something similar with esp32 until i found the emporia.

It's a bit of a hassle to install (don't think anything can be done about that) and I didn't really like the official ui, but it works great with home assistant / esphome. [0]

[0] https://gist.github.com/flaviut/93a1212c7b165c7674693a45ad52...

0xQSL · on Dec 20, 2021

One might be able to calibrate this by using cars in the area close to the shot [1]. Assuming most cars are driving at the speed limit

[1] https://www.google.com/maps/place/38%C2%B059'37.1%22N+93%C2%...

myself248 · on Dec 20, 2021

The cars aren't flying at altitude though. They're farther from the satellite.

contravariant · on Dec 20, 2021

How low are these satellites? I wouldn't expect the altitude of a plane to be any significant proportion of the altitude of a satellite.

natosaichek · on Dec 20, 2021

The satellites are probably greater than 400 km altitude, or so, while the plane is probably 10 km. The plane is in the same rough order of magnitude as really tall mountains, which the satellite is presumably designed to compensate for, so I agree with your general assessment.

tech_timc · on Dec 20, 2021

The Google Maps image copyright is for Maxar, which means the collector is likely Worldview-3, which is at nominally 614 km above the Earth.

https://earth.esa.int/eogateway/missions/worldview-3

willis936 · on Dec 21, 2021

The shadow is 750 m away and the sun is about 15 degrees off noon, so the B2 altitude is roughly 3 km. Cousin comment says the satellite altitude is 614 km, so I wouldn't expect parallax effects to be dominant. This is also evidenced by the fact that both the plane and ground are in focus and the plane is measured to be 70 feet long in google maps, which is what it is specified to be in reality.

exabrial · on Dec 20, 2021

That bomber is also likely quite low, since it's based out of an air force base in that area.

myself248 · on Dec 20, 2021

Sure, but when the word "calibrate" comes up, you want to account for even the not-very-significant factors.

0xQSL · on Nov 13, 2021

There are small bulges on the bottles which get worn away and allow making a rough estimate of uses. The german wikipedia link above has a a little bit more detailed explanation