I don't use credit cards for the credit; in fact mine are paid in full every statement. I use them for the customer protections and other "free" benefits. If something scummy or outright scammy is charged to my Amex, I know Amex will be on my side. If my card is stolen, Amex will refund any fraudulent charges and overnight me a new card; my bank probably won't overnight a replacement debit card, though it will probably refund the fraud. Then there are credit card points, which are essentially a benefit paid for by the processing fees charged to businesses. Many cards also offer access to "private" airport lounges, plus other benefits I'm forgetting off the top of my head.
Additionally, having high credit limits, low usage, and older accounts improves credit scores for loans/etc.
No interest is charged if there is no balance carried statement-to-statement, so why bother with silly debit pins and such.
That's how it becomes the default way of payment; it's not really "credit".
Less coverage area means fewer sources of interference for others (and the same holds in the other direction). So attenuation shrinks the signal's footprint, and stronger attenuation lets the transmitter run "strong" inside the house without the downsides in congested areas.
> Why is cooperation unlikely? AFAIK it’s not too hard to make a compiler support a function attribute that says “do not optimize this function at all”
Compilers like Clang actually generate terrible code; it's expected that a sufficiently smart optimizer (of which LLVM is a member) will clean it up anyway, so Clang makes no attempt to generate good code. Rust is similar. For example, a simple for-loop's induction variable is stored/loaded to an alloca (ie the stack) on every use instead of being kept as an SSA variable. So one of the first things in the optimization pipeline is to promote those to SSA registers/variables. Disabling that would cost a ton of perf right there, nevermind the impact on instruction combining/value tracking/scalar evolution, and crypto code is about as perf-sensitive as it gets, second only to security.
BTW, Clang/LLVM already has such a function-level attribute, `optnone`, which was actually added to support LTO. But it's all-or-nothing; LLVM IR/Clang doesn't carry the information needed to know which instructions are timing-sensitive.
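As a minimal sketch of what that attribute looks like in source (the attribute is Clang-specific, so it's guarded here; the function name is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* optnone is Clang-specific; GCC doesn't know it, so guard it.
 * It disables ALL optimization for the function - including the
 * alloca-to-SSA promotion mentioned above - which is why it can't
 * serve as a fine-grained "this instruction is timing-sensitive"
 * marker. */
#if defined(__clang__)
#define NO_OPT __attribute__((optnone))
#else
#define NO_OPT
#endif

NO_OPT static uint32_t sum_bytes(const uint8_t *p, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)  /* induction var stays alloca-backed */
        s += p[i];
    return s;
}
```

Compiled with Clang, this one function keeps its naive -O0-style code even at -O2, while everything around it optimizes normally.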
AMD's string store is not like Intel's. Generally you don't want to use it until the copy size is past the CPU's L2 size (L3 is a victim cache), which makes ~2k WAY too small. Past that point string store is profitable and should run at "DRAM speed". But it has a high startup cost, so 256-bit vector loads/stores should be used until that threshold is reached.
Isn't the high startup cost what FSRM is intended to solve?
> With the new Zen3 CPUs, Fast Short REP MOV (FSRM) is finally added to AMD’s CPU functions analog to Intel’s X86_FEATURE_FSRM. Intel had already introduced this in 2017 with the Ice Lake Client microarchitecture. But now AMD is obviously using this feature to increase the performance of REP MOVSB for short and very short operations. This improvement applies to Intel for string lengths between 1 and 128 bytes and one can assume that AMD’s implementation will look the same for compatibility reasons.
Fast is relative here. These are microcoded instructions, which are generally terrible for latency: they don't get branch-prediction benefits or OoO benefits (they lock the FE/scheduler while running). Small memcpys/moves are always latency-bound, so even if the HW supports "fast" rep store, you're better off not using it. L2 is wicked fast, and these copies are linear, so prediction will be good.
Note that for rep store to win, it must overcome its initial-latency cost and then catch up to the 32-byte vector copies, which, yes, generally don't match DRAM speed, but aren't that bad either. So for small copies, just don't use string store.
All this isn't even considering non-temporal loads/stores; many larger copies would see better perf by not trashing the L2 cache, since the destination or source often isn't inspected right after. String stores don't have a non-temporal option, so that has to be done with vectors.
I'm not sure that your comment is responsive to the original post.
FSRM is fast on Intel, even with single byte strings. AMD claims to support FSRM with recent CPUs but performs poorly on small strings, so code which Just Works on Intel has a performance regression when running on AMD.
Now here you're saying `REP MOVSB` shouldn't be used on AMD with small strings. In that case, AMD CPUs shouldn't advertise FSRM. As long as they're advertising it, it shouldn't perform worse than the alternative.
Or you leave it as is, forcing AMD to fix their shit. "Fast string mode" was strongly hinted as _the_ optimal way over 30 years ago with the Pentium Pro, reinforced over 10 years ago with ERMSB and again 4 years ago with FSRM. AMD, get with the program.
rep movsb may have been fast at one point, but it definitely was not for a few decades in the middle, when vector stores were the fastest way to implement memcpy. Intel decided they should probably make it fast again and have slowly made it competitive via the extensions you've mentioned. But on processors that don't support those, rep movsb is going to be slow and probably not something you'd want to pick unless you have weird constraints (binary size?).
I have been very happy with my Minisforum Venus UM790, though I use it as a mobile computer since I can just throw it into my backpack. It's been great to have access to AVX512 on the go.
> It is not a language flaw. C++ requires types to be complete when defining them because it needs to have access to their internal structure and layout to be in a position to apply all the optimizations that C++ is renowned for. Knowing this, at most it's a design tradeoff, and one where C++ came out winning.
This statement is incorrect. "Definition resolution" (my made-up term for FE Stuff(TM) (not what I work on)) happens during the frontend compilation phase. Optimization is a backend phase, and we don't use source-level info on type layout there. The FE does all that layout work and hands the BE an IR that uses explicit offsets.
C++ doesn't allow two-phase lookup (at least originally); that's why definitions must precede uses.
The power of the optimizations available to C++ is what makes it so fast (see how slow debug mode is vs -O2/etc), and what allows C++ to be fast in the face of common, easy-to-understand, but technically perf-hostile patterns: bit-counting loops vs popcnt, auto-vectorization, DCE, RCE, CSE, CFG simplification, LTCG/LTO, and so on. These let you write "high level" code/algos (to a point - there are ways to do "high level" paradigms that absolutely eviscerate the compiler's ability to optimize) and still get great hardware-level performance. That matters so much more overall than the time it takes to compile your program, even more so once you consider that such programs are often shipped once and then enter maintenance mode.
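For a concrete sketch of the bit-counting case: a naive loop like the one below can be recognized by Clang/GCC at -O2 and, with a popcnt-capable -march, collapsed into a single popcnt instruction, no intrinsic needed (the function name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Kernighan's bit-counting loop: one iteration per set bit.
 * At -O2 with an appropriate -march, the optimizer can replace the
 * whole loop with a popcnt instruction; at -O0 it stays a loop. */
static int popcount_naive(uint64_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

Same source, wildly different machine code depending on optimization level; exactly the gap between debug and -O2 builds mentioned above.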
It doesn't really have anything to do with compatibility (well, not entirely; the biggest fixable obstacles to good optimization quality are things that need a system-level rethinking of how hardware exceptions happen). It just isn't reasonable to expect developers to know how to optimize by hand, and doing so doesn't scale.
In many contexts, one should rarely pass -O2/-O3. A project that is built thousands of times during development may, by comparison, only be run on intensive workloads (where -O2 performance is actually a necessity) a handful of times. A dev build can usually be -O0, which can dramatically improve compilation time.
It depends. -O0 turns off a few trimming optimizations and can cause more information (code or DWARF) to be included in the objects, which may end up slowing the build down. In our large code base, we found that -O1 works best in terms of compilation speed.
CISC vs RISC doesn't matter. An ISA should ideally be a healthy mixture of both (citation needed). Arm64 allows memory operands "just" like x86, yet it still has code-size issues. Memory operands (ie having a bit of address calculation fused into the load's use) are very useful for reducing register pressure, which is an issue every call ABI must contend with. This is something strict RISC totally misses (and ARM64... isn't really RISC).
The issue with this "debate" is that it misses the forest for the trees. Instead we should be talking about binary encoding (ie how much "variability" is required), and you're right on that point: memory isn't the constraint it once was.