Apple’s M1 processor and the full 128-bit integer product (lemire.me)
230 points by tgymnich on March 18, 2021 | 175 comments



Did anyone actually look at the machine code generated here? 0.30ns per value? That is basically 1 cycle. Of course, there is no way that a processor can compute so many dependent instructions in one cycle, simply because they generate so many dependent micro-ops, and every micro-op is at least one cycle to go through an execution unit. So this must mean that either the compiler is unrolling the (benchmarking) loop, or the processor is speculating many loop iterations into the future, so that the latencies can be overlapped and it works out to 1 cycle on average. 1 cycle on average for any kind of loop is just flat out suspicious.

This requires a lot more digging to understand.

Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.


This is the benchmarking loop:

  for (size_t i = 0; i < N; i++) {
    out[i++] = g();
  }
N is 20000 and the time measured is divided by N. [1] However, that loop has two increments and only computes 10000 numbers.

This is also visible in the assembly

  add     x8, x8, #2
So if I see this correctly the results are off by a factor of 2.
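For clarity, the intended loop is presumably this (a sketch: a single increment, so all N entries are written and time/N really is per value):

  for (size_t i = 0; i < N; i++) {
    out[i] = g();
  }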

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...


Yes, the i++ seems an oversight.

The relative speed between the two hashes is still the same, but it is no longer one iteration per cycle.


> Update: The numbers were updated since they were off by a factor of two due to a typographical error in the code.

The article got updated by now :)


A C for statement is a "benchmarking loop" in the same sense that "slice a sponge cake into two layers and place custard in the middle" is an actionable dessert recipe.


Failing to post disassembly for a micro benchmark is annoying.

It is of course speculating all the way through the loop; a short backwards conditional branch will be speculated as "taken" by even very simple predictors.

Op fusion is very likely, as is register renaming: I suspect that "mul" always computes the full product, and the upper half is left in a register which isn't visible to the programmer until they use "mulh" with the same arguments, at which point it's just renamed into the target register.


The dependency chain is state += 0x60bee2bee120fc15ull or (state += UINT64_C(0x9E3779B97F4A7C15)); the rest of the calculations are independent per iteration.

Anyway, the more important fact is that a 64x64 -> 128-bit mul might be one instruction on x86, but it's broken into 2 µops, because modern CPUs generally aren't designed around µops that can write two registers in the same register file.
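For reference, a sketch (not the article's exact code) of how the full product is typically written in C with the compiler's __uint128_t extension; on AArch64 this lowers to the mul/umulh pair shown elsewhere in the thread, and on x86-64 to a single widening multiply that still issues as two µops:

    #include <stdint.h>

    // Full 64x64 -> 128-bit product, folded down to 64 bits.
    // AArch64: mul + umulh.  x86-64: one mul/mulx writing two registers.
    static inline uint64_t mix128(uint64_t a, uint64_t b) {
      __uint128_t p = (__uint128_t)a * b;
      return (uint64_t)p ^ (uint64_t)(p >> 64);
    }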


It's a shame we can't see the rest of the code. What is happening to the result value? Is it being compared to something? Put into an array, or what? All of that code probably totally outweighs what you pointed out here. Or, at least it should. I have a bad feeling it might be being dead-code eliminated, since compilers are super aggressive about that nowadays, but I hope he's somehow controlled for that.


The blog post links to the benchmark... It's repeatedly populating a 20k entry array with the results.

godbolt clang compiles it to:

    .LBB5_2:                                // =>This Inner Loop Header: Depth=1
        mul     x13, x11, x10
        umulh   x14, x11, x10
        eor     x13, x14, x13
        mul     x14, x13, x12
        umulh   x13, x13, x12
        eor     x13, x13, x14
        str     x13, [x0, x8, lsl #3]
        add     x8, x8, #2                      // =2
        cmp     x8, x1
        add     x11, x11, x9
        b.lo    .LBB5_2
[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...


Thanks for the link.

Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck.

edit: I was looking at similar code generated by clang on my machine. Again, holy fuck.

I don't think the story here is that 64x64=128 multiply is fast, honestly. The real story is the insane level of speculation and huge ROB that is necessary to make that non-unrolled loop go so fast. Everything has to go pretty much perfect to achieve that throughput.


Based on the information from [1] we have something like this for both loops:

    .LBB0_2:
            eor     x13, x9, x9, lsr #30     # 2 * p1-6
            mul     x13, x13, x11            # 1 * p5-6
            eor     x13, x13, x13, lsr #27   # 2 * p1-6
            mul     x13, x13, x12            # 1 * p5-6
            eor     x13, x13, x13, lsr #31   # 2 * p1-6
            str     x13, [x0, x10, lsl #3]   # 1 * p7-8
            add     x13, x10, #2             # 1 * p1-6
            add     x9, x9, x8               # 1 * p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB0_2                  # Fused into 1 * p1-3
                                             # Total: 11 uops

    .LBB1_2:
            mul     x13, x9, x11             # 1 * p5-6
            umulh   x14, x9, x11             # 1 * p5-6
            eor     x13, x14, x13            # 1 * p1-6
            mul     x14, x13, x12            # 1 * p5-6
            umulh   x13, x13, x12            # 1 * p5-6
            eor     x13, x13, x14            # 1 * p1-6
            str     x13, [x0, x10, lsl #3]   # 1 * p7-8
            add     x13, x10, #2             # 1 * p1-6
            add     x9, x9, x8               # 1 * p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB1_2                  # Fused into 1 * p1-3
                                             # Total: 10 uops
Purely based on the number of uops, there's a slight win for wyhash, all other things being equal. However, I doubt that you're really getting one iteration per cycle here; there are 6 integer units, and even if you perfectly exploited instruction-level parallelism you're limited to 6 ALU instructions per cycle, which is fewer than either loop contains. It would be possible if the mul-umulh pairs are getting fused, which would bring it down to 8 uops per iteration.

Taking into account the port distribution, each iteration of wyhash involves 4 uops being dispatched to ports 5 and 6, which means you should be getting at least 2 cycles/iteration purely for the multiplications. If it's much lower than that, the whole multiplication being fused into a single port 5-6 uop might be right.

However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.

[1] https://dougallj.github.io/applecpu/firestorm.html


I think you are right that mul/umulh are fused. I think the M1 has 128-bit ALUs in the vector unit, so it would be a good way to make use of them. The M1 is far from the first iteration of the architecture, and Apple has likely picked most if not all of the low-hanging fruit. It also helps x86 emulation, I guess.

edit: but see the comment elsewhere in the thread about the loop iteration time being off by a factor of 2.


Oh yeah, I thought the add r,r,2 was odd but didn't investigate. This brings things back to ~2+ cycles per iteration, which strictly speaking does not require fusion.

It would be easier to test this explicitly instead of inside some unrelated RNG.


Perhaps there was a misunderstanding when executives talked to the silicon engineers about "the unbelievable speculation for our first in-house desktop-class CPU"?


I don't think a large ROB is a significant contributor here.

The very wide execution is though.


I would be wary of using gettimeofday to measure such short periods. As https://pubs.opengroup.org/onlinepubs/009604599/functions/ge... says, the resolution of the system clock is unspecified, and 20,000 × ≈10 ≈ 200k instructions easily run in under a millisecond on modern hardware.

The benchmark probably gets rid of that by doing it 40,000 times in quick succession, but why not measure the time of all 40,000 iterations in one go, and decrease the risk (and the overhead of calling gettimeofday 39,999 extra times)?


The resolution of the clock is unspecified by the POSIX standard but that does not mean it's unspecified by the platform this is actually running on. If this was trying to be portable code you'd have a big issue there but it's not. It is still limited by gettimeofday maxing out at microsecond precision, though, which is quite poor. And seems to be using a realtime clock, which introduces errors from network time sync & such. That's unlikely to crop up here, but it's still a risk. clock_gettime_nsec_np(CLOCK_MONOTONIC) is what should be used here.
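Something along these lines (a sketch; run_kernel is a hypothetical stand-in for the loop under test, and clock_gettime_nsec_np is the macOS API mentioned above):

    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    void run_kernel(uint64_t *out, size_t n);  // hypothetical: the loop being measured

    // Time one whole pass with a monotonic nanosecond clock instead of
    // gettimeofday, then report nanoseconds per value.
    double ns_per_value(uint64_t *out, size_t n) {
      uint64_t t0 = clock_gettime_nsec_np(CLOCK_MONOTONIC);
      run_kernel(out, n);
      uint64_t t1 = clock_gettime_nsec_np(CLOCK_MONOTONIC);
      return (double)(t1 - t0) / (double)n;
    }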


Agree, you probably want to just read some CPU time stamp counter before and after, and make sure you are pinning your threads, locking the clocks, etc. so that you can get a reliable time from there, or... just use cycles as your unit of measure.
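For example (a sketch, and an assumption on my part that user-space access is good enough here): on Apple Silicon you can read the AArch64 generic timer directly. Note it ticks at a fixed frequency (CNTFRQ_EL0, reportedly 24 MHz on the M1), so it's a cheap timestamp source rather than a true cycle counter:

    #include <stdint.h>

    // Read the AArch64 generic timer (CNTVCT_EL0) from user space. The isb
    // keeps the read from being reordered/speculated past surrounding work.
    static inline uint64_t read_timer(void) {
      uint64_t v;
      __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(v) :: "memory");
      return v;
    }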


On Linux and MacOS, gettimeofday is accurate to microseconds. Not only is the returned struct expressed in microseconds, but I have personally observed that it is accurate to microseconds.


The conclusion seems based on the relative execution times for the two benchmarks. Since the benchmarks are measured in the same way, their error bars should be basically the same as well. This analysis is not an analysis of the absolute execution time of these algorithms, but the difference between them.

I don't think the conclusion is hasty. Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".


> Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".

But that doesn't follow either. Only by inspecting the machine code do we get to see what's really going on in a loop, and the ultimate result is dependent on a lot of factors: if the compiler unrolled the loop (here: no), whether there were any spills in the loop (here: no), what the length of the longest dependency chain in the loop is, how many micro-ops for the loop, how many execution ports there are in the processor, and what type, the frontend decode bandwidth (M1: seems up to 5 ins/cycle), whether there is a loop stream buffer (M1: seems no, but most intel processors, yes), the latency of L1 cache, how many loads/stores can be in-flight, etc, etc. These are the things you gotta look at to know the real answer.


At that throughput the CPU is speculating and exploiting the access pattern.

It's also worth saying that if Apple were dead set on throughput in this area they could've implemented some non-trivial fusion to improve performance. I don't have an M1 so I can't find out for you (and Apple are steadfast on not documenting anything about the microarchitecture...)


Totally agree. I was thinking he'd get there, and then the post abruptly ended.


RISC-V does this too: https://five-embeddev.com/riscv-isa-manual/latest/m.html

"If both the high and low bits of the same product are required, then the recommended code sequence is [...]. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies."


For the interested, LLVM-MCA says this

    Iterations:        10000
    Instructions:      100000
    Total Cycles:      25011
    Total uOps:        100000

    Dispatch Width:    4
    uOps Per Cycle:    4.00
    IPC:               4.00
    Block RThroughput: 2.5

    No resource or data dependency bottlenecks discovered.
To me that reads as about 2.5 cycles per iteration (on Zen 3). Tiger Lake is a bit worse, at about 3 cycles per iteration, due to running more uOps per iteration, by the looks of it.

For the following loop core (extracted from `clang -O3 -march=znver3`, using trunk (5a8d5a2859d9bb056083b343588a2d87622e76a2)):

    .LBB5_2:                                # =>This Inner Loop Header: Depth=1
    mov     rdx, r11
    add     r11, r8
    mulx    rdx, rax, r9
    xor     rdx, rax
    mulx    rdx, rax, r10
    xor     rdx, rax
    mov     qword ptr [rdi + 8*rcx], rdx
    add     rcx, 2
    cmp     rcx, rsi
    jb      .LBB5_2


The multiplier in the M1 can be pipelined or replicated (or both), so issuing two instructions can be as fast as issuing one.

Instruction recognition logic (DAG analysis, BTW) is harder to implement than a pipelined multiplier. The former is a research project, while the latter was done at the dawn of computing.


I wonder if order matters? That is, would mul followed by mulh be the same speed as mulh followed by mul?

How about if there is an instruction between them that does not do arithmetic? (What I'm wondering here is whether the processor recognizes the specific two-instruction sequence, or whether it is something more general, like mul internally producing the full 128 bits, returning the lower 64, and caching the upper 64 bits somewhere, so that if a mulh comes along before something overwrites that cache it can use it.)


It seems like something that would be arbitrary depending on how the optimization was implemented. There wouldn't be an inherent need for that amount of generalization. Apple can tightly control their compiler to follow the rules, and there seemingly wouldn't be any compelling reason not to stick those two instructions back to back in a consistent order, since the second instruction is effectively free.

It would be fun to experiment with, for someone that has the hardware. My guess is that swapping the order will make it slower, but adding an independent instruction or two between them probably won't have a measurable effect. It would be fun to try to consistently interrupt the CPU between the two instructions as well somehow, to see if that short-circuits the optimization.
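A rough way to poke at it, for anyone with the hardware (a sketch with AArch64 inline asm; untested, and loop overhead would need to be subtracted): build a dependency chain of pairs in one order, time it, then swap the two instructions and compare.

    #include <stdint.h>

    // Each iteration's xor feeds the next multiply, so time per iteration
    // approximates the latency of the mul+umulh pair in this order.
    uint64_t chain_mul_then_umulh(uint64_t x, uint64_t k, long iters) {
      for (long i = 0; i < iters; i++) {
        uint64_t lo, hi;
        __asm__("mul %0, %2, %3\n\t"
                "umulh %1, %2, %3"
                : "=&r"(lo), "=&r"(hi)
                : "r"(x), "r"(k));
        x = lo ^ hi;
      }
      return x;
    }
    // A second copy with the umulh emitted first would be the comparison case.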


I love my M1, but does anyone else have horrific performance when resuming from wake? It’s like it swaps everything to disk and takes a full minute to come back to life.


Yes. This is actually a known issue, provided you have an external monitor attached; lots of people complaining about it. The Mac actually wakes up instantly if you lift the screen, but it usually takes 5-10 seconds before it will wake up the external monitor.

Worse, for some of us when it does finally wake up the monitor, sometimes it wakes it up with all the wrong colors, and rebooting is the only reliable fix. (and before anyone asks, yes, I tried a different HDMI cable)


For me it's maybe 5-10 seconds for it to wake my Thunderbolt monitor (LG 4K) from "cold".

It's much faster if the monitor has been used recently, though, so I always figured it was the monitor that was causing the delay by going into some deep sleep state?


It seems a little faster, but not dramatically so for me, when I'm hot switching between different inputs on the monitor. Definitely I think some of it is just the monitor not being really fast about switching, but I use the same monitor with a brand new 16 inch MBP and it's much faster at triggering the monitor to wake up.


Same monitor. The workaround is to put the monitor on a power strip you can trigger when you walk up to it. Hard power off/on of this monitor and it instantly displays for me now.


Found the same


I don't know why I got downvoted on that, it's strange advice but it really works


Do you have issues with the sound on your LG monitor? I have an LG too and the sound being transmitted by the M1 is very bad.


The sound quality is indeed pretty bad on the LG 4K. Noticeably worse than the MacBook Air's built in speakers.

It's just "cheap speakers" bad, though, not anything that would suggest an issue with the sound output from the M1. I've used the LG with a few different Macs, and sound quality is the same from any of them.


I think they are still working through some external monitor driver issues. The colors on my LG 4k were initially way off until I did an advanced calibration. Occasionally waking up from sleep it will revert, but opening the Display preference and swapping between the calibrations fixes it. My connection is through usbc.

I don't have any performance issues waking up though.


I think there are a couple issues going on with the colors. For some people it is calibration. But what I experience is almost like a color inversion (but it's not a full inversion, it looks like maybe one or two channels got inverted). Makes it difficult to even find the mouse pointer so I can get to the menu and reboot the machine. Then it comes up fine.


Interesting. I did submit a bug to Apple about my issue. Mainly because I never did a calibration with my 2017 mbp and the monitor looked great. With the M1 MBA I had to do the advanced calibration just to make the same monitor usable. Otherwise the colors were completely washed out. It almost seemed like the setting to auto-dim the laptop monitor was also being oddly applied to the external.


This happens to me too! Exactly as described. I have a 16GB Mac mini connected to an external monitor with an HDMI cable. A bit irritating, but not a huge deal; I don't do any color work on it, so the occasional color issues are not a problem for me.


I've got an Intel Mini, and it's much faster to wake when I'm using a TB3->HDMI cable than an HDMI->HDMI cable. (Going to a dell 3219Q)


Anecdotal, but I haven’t seen any issues while using DisplayPort rather than HDMI. EDIT: Other than the delay in waking the monitor.


Yeah, I would like to use DP, for sure. A peculiarity of my situation and the devices I need to connect means I can't make that work with just two ports on the laptop. So I use the Apple USB-C PD/HDMI/USB adapter so I can get the ports I need and still charge.


Does this happen with the Mac Mini?


M1 instantly wakes my XDR display and never has a problem.


>and before anyone asks, yes, I tried a different HDMI cable

Have you tried a USB-C/thunderbolt cable/controller tho?


I'm kinda limited in my options, because this MBP only has two USB ports. So I have one port which goes to a USB hub, and the other port goes to an Apple A/V adapter with USB-C power delivery pass-thru, HDMI, and a USB port. If not for my need to have one USB port be switchable, and the rest not, I'd use a USB-C -> DP cable like I use on my 16" MBP.

This may prompt me to upgrade prematurely, if/when the next M1 MBP comes out with more than two ports.


Get a thunderbolt dock and you don’t have to worry about it. The dock also charges my MBA.


Which dock are you using? Caldigit? Those seem popular, but I see there are a number of options at every price point.


I have the OWC TB dock that was just released (and has been backordered). It works great other than the headphone jack has a slight hiss that I don't get plugging directly into the MBA.


CalDigit TS3+ works great for me with monitor hooked up on DisplayPort. No delay on resume. This is the dock Apple sells.


Instant wake for me. However any time I come across a password field in a website the computer freezes for a painfully long 10 seconds or so while it presumably decrypts my password vault.

Sometimes this will happen multiple times per page load if I deselect and reselect the password field.


Oh goodness I thought this was just me after searching online and finding very little, if any, discussion of the bug. The issue I had wasn't on password login prompts, but on account creation prompts (i.e. password+confirm password). I had assumed it was lastpass at first, but the freeze persisted even after removing that -- on a one-day old computer.


Feel free to report it in Feedback Assistant.


I gave up on reporting bugs to Apple a while ago. The experience, repeated several times over, of spending all the time to file a radar, gather all the logs and files that Apple insists on (probably reasonably), then submitting the thing and getting exactly zero feedback for the next year when you are then told to repeat all the same work on a new macos version or the radar will be automatically closed, is demoralizing. Mind you, the issues I reported were fixed/went away a few years after that (so clearly not as a result of the then-long-closed ticket), but the process just feels like a waste of time to interact with the black box of apple.


It helps more than nothing. I would recommend just becoming more casual about it.

My experience maintaining projects is actually not that people don't provide enough info in their bugs (certainly true for random forum rants though), but when they try they try too hard and end up spending a long time writing a bunch of stuff I don't even read, because the log speaks for itself.

In this case you're not necessarily reaching an engineer, it could go to someone who combines reports together, or the fix is because of your report but it doesn't get communicated back properly, but it's still letting someone know it's a problem.


This happens to me too, I’m using 1Password. I have a suspicion the plug-in is involved with this but I’ve not had time to collect evidence yet. Seems to be rare though, I’ve not spotted anybody else with this issue.


I am seeing lots of spinning balls with 1Password on my 2015 Macbook Pro. Even just switching between fields with the keyboard brings up a spinner for two seconds or so. It's a recent thing I think. No other app has any issues.


Is it possibly due to the x86 emulation? I could see something like that being the cause


Apparently the latest version 7.8 runs natively on Apple Silicon.


I see the same thing on native x86 1Password.


I just updated to version 7.8 from version 7.7 and it looks fine so far. I haven't seen any freezes in the five minutes that I have been using it.

This makes me realise how much of a pain it has been to use in the past weeks or so. Now that it is back to normal snappy it is such a pleasure to work with again.


Anecdotal, but Keeper and the built-in Keychain are instant-ish, at least not slow enough to notice.


I don't use any external password manager, only Keychain and Safari, and I have seen this 10-15 second delay every time I add a new password to Keychain since day 1.

It is my only issue with M1.


Huh, no problem here. I use lastpass and safari's built in password thing.


Did you somehow accumulate a bajillion passwords?


Crypto optimizations are no joke. My Pinebook Pro takes several seconds longer than my T430 to decrypt my keepassxc database.


Maybe the time is hashing your password. This is designed to take as long as it can while still being reasonable. 1 second on a fast machine isn't unheard of.


Why would it need to decrypt a whole database though, and not just the required password? To avoid leaking the key name?


Yes I think it is an all or nothing deal.


Perhaps it could be a two-level thing then, where you first decrypt e.g. the list of keys and then read the one you want and get a "start/end" offset to read/decrypt just the value for that key from another file. So keys and values are still decrypted, but you don't have to decrypt the whole bundle in one step to get the value.


I have a non-M1 with Big Sur and it takes more than a full minute from sleep to usable.

I suspect it’s because I have five monitors and 20 million pixels (actually more as that’s the post-retina resolution).


How does the number of pixels (or even monitors, for that matter), affect that much sleeping time?

Rendering an FPS game at 1080p is 2 million pixels per frame. At 60fps, that's rendering 120 million pixels per second.

What am I missing?


Detecting the monitors, negotiating the correct resolutions, setting scaling factors and window positions after coming out of sleep will take some time. Maybe macOS does monitor setup sequentially? (no idea, just got 1 big external screen that also takes a few seconds to light up - handshake speed seems to vary between monitor brands)


This is almost certainly it - the screens show all sorts of weird graphical artifacts (some clearly a Retina display in "native" resolution) as it starts and loads. I assume it's having to fire up all the GPU memory, etc.


This is anecdotal and I don't have anything to prove it, but I really feel like my old spinning-hard-drive 2010 MacBook Pro woke faster from sleep running Snow Leopard, than the Retina models ever did (or the old ones did after a few software updates).

Of course for general tasks it was slower, but I really remember that thing waking up instantly when I raised the lid, every time.


My nostalgia agrees with you, but I think it’s probably wrong. The wake from sleep was what finally convinced me to get my first Mac. Even now they have the best hibernation and wake from sleep.


Yeah. To me it looks like macOS goes so deep into sleep it disconnects the external display. On wake, the system rediscovers the external and resizes the desktop across both displays. With a bunch of apps/windows open, half your apps simultaneously resizing all their windows can peg all CPU cores for a number of seconds.

(It's still way faster than the same set of apps on an Intel Mac laptop, where it could sometimes take on the order of 30 seconds to get to a usable desktop after a long sleep. On Intel Macs it seemed more obvious that the GPU was the bottleneck)


No, it's perfect, instant wake for me. I have never seen anything like this.

I have buggy apps (like Facebook Messenger) locking up, but I guess that's normal, I just uninstall them.


Which M1 do you have?


16 GB RAM pro 1TB SSD.

I guess the only drawback compared to MB Air is that it's a bit heavier.


Haven’t seen this on my M1 Mac mini; my wake times are faster than my monitors can wake. No performance issues immediately after sleep.

Maybe desktop platforms sleep differently than laptops?


Most likely. Same here with my Mini


I am using the LG Ultrafine 5K (so it’s a TB monitor) and it takes maybe 1.5s to 5s longer than the built in display (which wakes instantly) to turn on.

I do occasionally have an issue where the brightness on the built in display is borked and won’t adjust back to the correct level for anywhere between 30s to a few minutes.

And then I don’t know if it’s my monitor or the M1, but sometimes there will be a messed up run of consecutive pixel columns about 1/10th of the screen wide starting about 30% from the left of the display. The entire screen in that region is shifted a few pixels upwards. Sometimes it’s hard to notice it but once you do it can’t be unseen. Replugging the monitor into the M1 resolves the issue.


It's way faster than my old MacBook Pro. But one thing is that the external monitor won't come back on upon resuming. I have to unplug/replug the cable to re-activate the external monitor. It seems the HDMI handshake failed somehow.


I haven't noticed poor wake times, but my laptop does kernel panic and reboot a fair amount. Maybe 4 times in the past week. My hunch is that it's Spotify's fault but I haven't dug into the logs.


A user-land program cannot be at fault for a kernel panic; that is the kernel's fault, always.


xnu will panic if it doesn't receive periodic check-ins from userspace. For example if WindowServer hangs, then the kernel may deliberately panic so that the system reboots. See man watchdogd for (a tiny bit) more.


True, but a third-party program shouldn’t really be able to do this, at least not easily.


I have this problem but only if I've been plugged into a monitor, and then unplugged and gone onto battery. Rebooting after unplugging stops it, but it's annoying.


I've had mine panic and reboot twice, both times happened shortly after disconnecting other Macs that were connected via a Thunderbolt cable (target disk mode).


How does spotify panic the kernel itself?


Definitely no issue with waking my M1 8GB MacBook Air. Takes a fraction of a second, every time. In fact, this is specifically something that Apple were bragging about when they launched the M1 Macs!


Very happy with my Air 16gb on resume - much faster than my 2018 air


Do you have it plugged into a monitor upon wake? How many programs do you have when it is resuming? I noticed this really infrequently.


I tried reinstalling OSX, now it seems fine! ¯\_(ツ)_/¯


Nope, my M1 Air is connected to an external monitor most of the day and I don’t have issues.


It’s inappropriate of you to post an offtopic end user technical support question on this post about CPU microarchitecture performance.


It's inappropriate of you to scold another commenter with a content-less comment.


Why? There isn't really anything interesting in this post, beyond that Apple obviously wanted this feature for whatever reason


So this is why the integer multiply-high instruction, "mullah", only delivers the most significant bits? Ironic if you aren't religious about these things.


I believe that the ARMv8 NEON crypto extensions have a special instruction for a 64-bit multiply with a 128-bit product, which is useful for Monero mining, for example.


The amount of bugs in the M1 and MacOS posted on HN in a week could keep developers working for months at Apple.


A common misconception about RISC processors.


It’s not a RISC thing - CISC implementations do exactly the same kind of fusion for similar pairs of operations.


It has a bit less gain on a RISC due to the code density (or lack thereof), since it requires more fetch bandwidth. Apple works around this by using a very wide front-end: https://news.ycombinator.com/item?id=25257932


x86-64 code density is more than 4 bytes / instruction.


That looks like inverse density.


Isn’t that the point?


Anyone want to take a guess at how long it will be until Apple has their own fab in the US making M1 chips?


Not in the next 10 years. Why would they? Fabs are extremely capital-intensive and take years to get up and running, even when someone (like Taiwan Semi) knows how to do it. Intel has shown how hard it can be to do this right. Let TSM work on production (and hopefully get more/larger fabs in the USA up and running) and on getting better at packing in the transistors, and let Apple improve the design (and software).


> Why would they?

Because Apple has a lot of capital and they wouldn’t need to compete as hard for their share of tsmc production capacity.


I wouldn't consider Apple to even be competing.

They are by far the biggest customer and have a multi-faceted relationship e.g. OLEDs, Modems.


they are addicted to cheap labor.


How much labor is involved in semiconductor manufacturing? It's mostly automated, right?


To a man carrying an axe, everything is a tree.


Labour isn't particularly cheap in Taiwan


They’ve vertically integrated everything else, and they’ve had great success along the way. TSMC has other customers that compete with Apple for production capacity. And there’s geopolitical risk in the region where TSMC (currently) operates.


> They’ve vertically integrated everything else

What do you mean by this? Apple does very little manufacturing, they're famous for it.


They’ve vertically integrated basically everything except manufacturing. Even compared to other tech giants who are highly integrated, they do a ton.


>They’ve vertically integrated everything else

Only in that they design it, not in that they build it.

In that area they have "vertically integrated" nothing.


Apple hasn't been vertically integrated in any kind of manufacturing since the 90s. It's all built in China.


> They’ve vertically integrated everything else, and they’ve had great success along the way.

Why would they want to get into the low-margin, high-risk part of their supply chain, the bit where you can sink billions of dollars and have the value wiped out by a poor choice?


Especially when semiconductor manufacturing has become a major geopolitical flashpoint between China, EU, US etc.

Apple would be insane to get into the middle of it.


Or, that's exactly the reason to get involved in it. You don't want to rely on suppliers that are at high risk to implement trade barriers or worse. Apple has been moving production away from China over the past couple of years, no doubt in part due to risk (an Apple spokesperson has specifically cited risk as a reason). They've recently moved some production to suppliers in Vietnam, India, and even some has moved to the US.


Why would they make their own M and A processors? They could continue to buy Intel chips and revert back to using PortalPlayer and Samsung chips. Chip designing is capital intensive and takes years to get up and running, when Intel and Samsung know how to do it.


They did buy PA Semi to get off the ground. TSMC is probably way too expensive though.


They also bought Passif Semiconductor, parts of Dialog Semiconductor, Intel's modem team, Xnor etc.


Designing chips isn’t capital intensive, making them is.


Apple designs their own CPUs, GPUs, SOCs, Modems etc.

And they dominate the competition in performance/power.


Because of the non-zero chance of Southeast Asian fabs becoming suddenly unavailable.


This. It's not a money-making play; it's a catastrophic insurance policy.


But Apple has a lot of capital, and could win massive political brownie points for doing so, especially if they promised that some percentage of fab capacity would be sold to other American firms.


TSMC is based on the soil of one of America's allies.


An ally that China has been dreaming of re-assimilating (or taking over, depending on which side of the view you take) since its inception.

China is engaging in ever more aggressive saber rattling, and the total lack of any measurable reaction to their takeover of Hong Kong has only emboldened them. Who can guarantee Taiwan won't meet the same fate?


I am very aware of the situation, being a resident of Hong Kong, and it's completely different. Hong Kong is indisputably a part of the People's Republic of China, and the Basic Law (our Constitution) is part of the constitution of the PRC. The government of Hong Kong has always emphasised that Hong Kong is part of One Country.

Taiwan is completely self-governed at the moment and sees itself as an independent nation.


Besides the nine-figure capital costs and the multi-year start-up, if you want a cutting-edge fab technology, the pool of talent you have to recruit from is pretty small, and most of it is outside of the United States.


By buying up the newest nodes, Apple already gets quasi-monopoly status at TSMC. They would need to buy TSMC outright, but Taiwan may not want to sell it, as it's their most important company by far.


TSMC already announced a new fab in Arizona that will start delivering chips ~2024. Apple does not need to own a fab, they just need to diversify their chip supply chain.


Apple outsources all manufacturing. Why would that change?


Apple still runs an iMac factory in Ireland.


Without a major geopolitical change, like escalation of US-China tensions, sanctions on SE Asia or straight up war - likely never. Despite every incentive in place (tax breaks, consumer goodwill, better accessibility/control), what is the last domestic electronics manufacturing success story?


Seems likely Taiwan situation will escalate in the next 10 years. Apple may want to get started on the Fab now.


Manufacturing capacity will shift to South Korea, Japan, Malaysia, Singapore, India etc. before the US if that happens. The labor and expertise needed to set up such an operation is just not available here anymore.


My M1 Mac Mini says ‘Made in Malaysia’ on it


In the past, haven’t they made exclusive deals with manufacturers by helping them spin up new factories in exchange for exclusive access for X number of years?


IIRC, Apple basically made "retina displays" happen - when nobody else would make that expensive step - by telling a screen manufacturer "here is a very large check; build a factory to make these at commodity scale, within one year."


With Apple holding a lot of cash offshore awaiting a favourable way to onshore it, combined with the political eagerness to bring chip production onshore, those two aspects may well pan out to a situation in which the accountants see it as a win-win.

Even then, does Apple use enough chips to justify running a fab, let alone one that would be locked into the node of the time? I really don't see it happening, for many reasons; the only reason they would is some tax-break incentive to onshore some of the money they hold offshore, such that it pays for itself, win or fail.


They would be better off paying the taxes. Making chips isn’t a tax dodge, it’s a hugely expensive many year commitment.


> With Apple holding a lot of cash offshore awaiting for a favourable way to onshore it

I don't think that's relevant anymore. My understanding is that the 2017 TCJA required prior unrepatriated earnings to be recognized and taxed over eight years (so still ongoing) and made future foreign earnings not subject to US tax (except if the foreign tax is below the corporate alternative minimum tax rate). As a result of those changes, there's no need to hold cash offshore.


I can absolutely see both parties wrapping a flag around an investment tax credit for building a fab in the US. To the person saying taxes don't matter because it's expensive -> The more capital cost the better if they can use that to offset taxes.


Why would they want to do that?


That's great if your App is compute bound. "May all your Processes be compute bound." Back in the real world, most of the time your Process will be IO bound. I think that's the real innovation of the M1 chip.


Exactly because of the "real world" argument: it turns out that a lot of actual real-world loads are CPU bound because they are so wastefully implemented. IO of all kinds has extremely high bandwidth these days, and OoO helps hide the latency.


Explain please. What does the M1 do to IO loads?


Nothing. Compute speed isn't that important if you're waiting on IO is GP's point.


In that case it's a confusing point, given GGP calls this nothingness "the real innovation of the M1". GP is asking what the innovation is.


It’s clearer if for M1 you read “the new architecture promoted by Apple around their M1 processor”.


What new architecture? Other than the addition of a neural unit, it's an identical architecture to every other APU from the last decade?


It would be more clear if someone answered the question.


On die memory and storage. No bottlenecks, very little latency.


Important to clarify this every time it comes up: there is no on-die memory on the M1. It is normal, everyday, DDR4 memory which is located near to the processor. It's actually quite high latency at ~100ns.


Indeed and I was quite surprised by this as it's actually higher latency than you'd get on AMD or Intel's chips.


128-bit muls really help speed up finite field implementations, which speed up elliptic curve crypto. That's one crucial place where faster code helps.
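For the curious, the core pattern looks roughly like this (a generic illustration, not any particular library's code): limb-by-limb products where both halves of every 64x64 multiply are needed, i.e. the same mul/umulh pair again.

    #include <stdint.h>

    // Multiply two 64-bit limbs, add the incoming carry, return the low half
    // and pass the high half along as the next carry -- the building block of
    // bignum / finite-field multiplication.
    static inline uint64_t mul_limb(uint64_t a, uint64_t b, uint64_t *carry) {
      __uint128_t p = (__uint128_t)a * b + *carry;
      *carry = (uint64_t)(p >> 64);
      return (uint64_t)p;
    }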


You mean to tell me that a $2000 Macbook is almost as performant as a $1000 PC? Tell me more!


Based on U.S. prices, it's more like $999 vs $609 for similar specs (but no doubt a nicer machine and much better screen/touchpad with the Air.)

https://www.apple.com/shop/buy-mac/macbook-air

https://www.amazon.com/Lenovo-IdeaPad-Laptop-Newest-Display/...


The <10 second comparison I did was between an ASUS A15 and a 16" MacBook. $1000 vs $2500.


There is no 16" MacBook.

The 16" MacBook Pro is not an M1.

The Air starts at $999, the mini at $699 (official list from Apple itself), $899/$679 education.


Both Minis and Airs start at under $1000, and they're all the same speed.


Air starts at $1500 if you're not in the USA


This isn't entirely accurate. The Air starts at 1129 EUR in Germany, presumably it's the same in the rest of Europe. That's 1344 USD right now, but that's not an entirely fair comparison since this price is after tax whereas the USD prices are generally before tax. Before taxes, the Air is actually 1129 USD in Germany (current exchange rate and taxes cancel each other out). More than a 1000, but significantly less than 1500.


1498 EUR in Croatia. I'm glad it's cheaper for you.

edit: I love how I keep getting downvoted on HN if I dare say anything about the M1. Even if it's the truth, like the price of the machine.


I'm sorry it's more expensive for you, I just wanted to point out that there are locations which are not the US, such as parts of mainland Europe, where this claim does also not hold.

I just checked and it appears that notebooksbilliger.de sells the Air M1 starting at 1057 EUR and is willing to ship to Croatia for 30 EUR. If you were interested, maybe that's a better alternative.

Edit: Amazon.de charges 1079 EUR and seems happy enough to ship to consumers in Croatia as well for around 14 EUR. I haven't tried completing an order, obviously, but there are no relevant restrictions listed.


$1345 in my country, $1265 with edu discount.



