The biggest factor, though, is support. Raspberry Pi has a lot of software support, so you aren't running into weird bugs here and there with nobody around to help. The Jetson community isn't nearly as big, but Nvidia's track record on software support is generally quite good. In this case, they have an extra interest given their push for commercial applications and the fact that the X1 sees use in the Nintendo Switch and Shield TV (among other things).
That being said, I missed the PCIe on the specs last time I was comparing SOCs and I had forgotten about hackerboards, thanks for the reminder!
Listed under Storage in the specs.
E.g., not a useful PCIe slot for anything other than plugging in an SSD
PCIe support also allows M.2 cards to take advantage of the nonvolatile memory express (NVMe) protocol, which brings a large performance advantage over other types of interfaces due to reduced latency, increased IOPS and lower power consumption.
There's a whole parade of micro-computers that don't differentiate themselves from one another.
Would the open source drivers that are part of the kernel work out of the box on ARM?
The VideoBIOS expects to run and expects a well-behaved Intel CPU to do the power-up. That said, X can sometimes emulate these quite well. On ARM we'd also run into alignment issues and likely other quirks - but in principle...
Still… X86EmulatorPkg allows running an amd64 VBIOS in UEFI on an aarch64 machine :)
AFAIK the bigger problem on embedded boards is half assed Synopsys Designware host controllers. I have a Radeon running on my Marvell MACCHIATObin, on FreeBSD even. But from what I've heard the Rockchip RK3399 has a worse version of the controller, and people trying GPUs on the ROCKPro64 saw errors related to not large enough BAR space or something.
UPD: yeah, someone in the thread mentioned BAR space issues wrt NXP i.MX SoCs, that's probably what's happening on Rockchip. Would be amazing if the Broadcom chip in the Pi turns out to be the one with enough BAR space! :D
(See for yourself in your machine: run "lspci -v", the lines starting with "Memory at ..." or "I/O ports at ..." are the BARs.)
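As an illustration (made-up device and addresses, just to show the shape of the output), a discrete GPU typically shows something like:

    01:00.0 VGA compatible controller: [hypothetical discrete GPU]
            Flags: bus master, fast devsel, latency 0, IRQ 55
            Memory at e0000000 (64-bit, prefetchable) [size=256M]
            Memory at f0000000 (64-bit, prefetchable) [size=2M]
            I/O ports at e000 [size=256]
            Memory at f7e00000 (32-bit, non-prefetchable) [size=256K]

The large prefetchable region (the VRAM aperture) is exactly the kind of BAR that small embedded host controllers sometimes can't fit into their address windows.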
Does the video BIOS even need to be executed, though? I always assumed it was only necessary for the primary card to be able to display output during boot on PCs. (Otherwise, wouldn't two different cards trample over each other's implementation of int 10h?)
Sorry for the probably-obvious questions, it's sometimes tricky to find good sources for weird information like this that are also up to date.
My AMD TAHITIs for instance need VideoBIOS to start some form of thermal management loop - otherwise they just run full-throttle on the fan.
Then whichever card prevails (BIOS has the ability to select the initialization order) becomes the boot display device.
X11 has some (generally working, for well behaved GPUs) emulation of this environment, so that the GPU can initialize late, and even reset under X control. This is how sane cards can work under headless ARM etc.
Now, some manufacturers assume you get something like SSE or MMX - the VideoBIOS spec technically mandates the 386 instruction set only.
That crap gets badly emulated.
On top of this, drivers can sometimes reinit anyways, from native kernel code. If that happens, the VideoBIOS concerns are moot.
Pretty fascinating that X11 of all things is dealing with Video BIOS initialization, though. The sheer amount of functionality that was overloaded into X...
The amdgpu kernel driver POSTs the GPUs when loaded just fine! I run an RX 480 on a MACCHIATObin with FreeBSD and Wayland only :)
Plus, there are two ways to get the GPU in UEFI already – X86EmulatorPkg and native aarch64 GOP driver builds provided by AMD.
OpenBSD even has (or had, build breaks often) packages for chromium browser on arm64 that were tested on this.
Why? I recently enabled amdgpu on FreeBSD arm64 — as soon as KMS happened, 3D just worked. There was some buffer corruption (https://user-images.githubusercontent.com/208340/60443774-97...) for which I've had to cherry-pick this tiny Linux patch: https://patchwork.kernel.org/patch/10778815/
> packages for chromium browser on arm64
And we have Firefox, which doesn't break, has a working JS JIT, works on Wayland, renders with WebRender :)
> memory-cacheability-attributes issues
Recently I've added aarch64 support to FreeBSD's port of the DRM/KMS drivers :) Took a couple hours to realize that our implementations of Linux's mapping functions used normal uncacheable memory instead of device memory – fixing that stopped the hangs on driver load and allowed everything to work.
Then there was some corruption on the screen – our drm-kms is from Linux 5.0 for now, and I've had to cherry-pick a fix that only landed in 5.1 I think: https://patchwork.kernel.org/patch/10778815/
Just don't use ARMv5 or older. ARMv6+ supports unaligned access.
By the way, you can also opt to crash on x86. You just need to enable the AM (Alignment Mask) bit (bit 18) in CR0 (kernel stuff) and AC (Alignment Check, also bit 18) in EFLAGS.
After that point, unaligned accesses trap. I use this in my CPU JIT/dynamic compiler to trap target unaligned accesses.
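For the curious, here's a minimal sketch of the trick (assumes x86-64 Linux; the kernel's default CR0 state already includes AM, so flipping EFLAGS.AC from user space is enough to make a misaligned load fault):

    /* Minimal sketch: trap unaligned accesses on x86-64 by setting EFLAGS.AC.
       Assumes the kernel has already set CR0.AM (Linux's default CR0 state
       includes it); without AM in CR0, AC has no effect in user mode. */
    #include <stdint.h>
    #include <stdio.h>

    static void set_alignment_check(int on)
    {
        unsigned long flags;
        __asm__ volatile ("pushfq; popq %0" : "=r"(flags));
        if (on)
            flags |= (1UL << 18);    /* EFLAGS.AC, bit 18 */
        else
            flags &= ~(1UL << 18);
        __asm__ volatile ("pushq %0; popfq" : : "r"(flags) : "cc");
    }

    int main(void)
    {
        char buf[16] __attribute__((aligned(8))) = {0};
        set_alignment_check(1);
        /* Misaligned 4-byte load: with AC (and CR0.AM) set, this raises
           SIGBUS instead of silently succeeding. */
        volatile uint32_t v = *(volatile uint32_t *)(buf + 1);
        (void)v;
        printf("no trap - CR0.AM is probably not set\n");
        return 0;
    }

In a JIT this lets the host CPU catch the guest's unaligned accesses for free instead of checking each one in software.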
The x86 world had the advantage of user upgradable GPUs, which necessitates standardization and common firmware.
On top of which, the accelerated graphics of cell phones is a horrible kludge of various standards.
Linux 32-bit ARM drivers
Taking a low-cost Mini-ITX board and pairing it with a decent mini-sized GTX 750 Ti SC video card (I prefer the EVGA one) should yield a system that is somewhat on par with some of Nvidia's embedded ML offerings, with a tradeoff in size and efficiency - but at a significantly reduced cost.
Add a very short PCIe riser ribbon (or, if you can find one tall enough, a right-angle riser), and you could lay the card over "flat" above the CPU (using a 1U cooler/heatsink); you'd have to make a custom mounting frame of course, but I think you could make the whole package relatively compact.
From my limited experience using Tensorflow with the 750 - it is a very capable card in that capacity and relatively inexpensive today. If you were willing to spend a bit more, there are mini-sized NVidia GPUs available in more recent models; of course if space is not an issue, then full-length GPUs can be substituted (my "goal" was to build such a system as close to within the footprint of a Mini-ITX board as possible).
As far as the Mini-ITX motherboard is concerned, again, size to your budget. That said, if you decided to build this system using cheap pre-owned components - maybe an i5 with 8GB of RAM - you could probably get it going for under $250 USD, maybe less.
But is it possible to bring the project to the next level? Is it possible to make a daughterboard with a QFN footprint? If so, one could make a pin-compatible daughterboard with an extension connector. To use it, just desolder the USB chip, solder the new daughterboard in its place, and you're ready to go. It would be one of the coolest Raspberry Pi projects!
That said, PCIe PHYs are extremely robust - they do most of the impedance matching and delay-mismatch training. And if you don't ruin the onboard caps, this could be jumpered straight across.
I was thinking about using a SMT ribbon cable connector because of the limited available space, but apparently it won't be an issue if you raise the board high enough?
Anyway, if this project ever goes to batch-production, make sure to update your blog when the funding campaign starts!
It looks like an unreliable modification.
USB 2.0 high speed runs at 480 Mbit/s, and motherboards have used 0.1" headers for it since the beginning.
I envy people like OP for their tenacity. I barely have time to follow what's happening in IT, much less get ahead of the pack doing cool hacks like this.
I definitely like this sort of hack, but hacks where documentation is already available (and that then go beyond what's documented, basically) are certainly preferable.
E: Apparently it's the PCI bus that is shared, not PCI Express lanes. Ty.
The chip has to be removed.
It's not clear if the lanes themselves can be multiplexed with packets from many devices, but the number of assigned lanes can be changed after initialization, so a clever chipset could probably allocate lanes dynamically as they're used.
these are generally basic passive devices operating at the analog signal level; no higher-layer activity is required. however, some exist which operate as "retimers", and those do participate in the lowest layer of the PCIe electrical protocols (generally to extend reach). these are unlikely to be used for a typical x16 <-> 2x8 sort of motherboard feature though.
the example i picked here is 4 lanes, and you would need 4 such chips to do a x16 <-> 2x8. (spoiler: you mux lanes 8-15 from slot X to lanes 0-7 of slot Y, and there are both TX and RX pairs which need muxing.)
there do exist devices called "pcie switches" which operate at all layers of the pcie protocols, and allow for all sorts of sharing of the point-to-point links. examples at microsemi ... for example, a 48-lane switch could be used to connect two 16-lane GPUs to a 16-lane slot. this would allow either of the GPUs to burst to the full 16 lanes, or, if both GPUs are communicating with the host, each would see on average 8 lanes of bandwidth. there's a picture of such a dual-GPU card in this article; you can see the PCIe switch ASIC centered between the two GPUs, above and to the right of the edge connector.
They can be, this is what the chipsets do on most platforms. AMD's X570 splits out 4x gen4 PCI-E lanes into 8x gen4 PCI-E lanes + a bunch of other stuff: https://i.imgur.com/8Aug02l.png
Intel's been doing this better, and it's what their marketing calls "platform lanes" - the Z390, for example, provides 24 PCIe gen3 lanes from what is essentially a single x4 gen3 uplink to the CPU: https://images.anandtech.com/doci/12750/z390-chipset-product... (DMI 3.0 is essentially PCIe x4 gen3 in all but name)
Thunderbolt 1/2 requires a PCIe gen2 x4 connection to have enough bandwidth. The SoC in the Pi 4, the Broadcom BCM2711, has just a single gen2 PCIe lane: 1/4th the required bandwidth for Thunderbolt 1/2, and a mere 1/8th the requirement for Thunderbolt 3.
To get a full 8x Thunderbolt 3 connectors you need a staggering 32 PCIe gen3 lanes off of the CPU. This is out of reach of all but the HEDT & enterprise platforms, to say nothing of the $5 ARM SoC chips for SBCs. Well, in theory you could also use something like a Ryzen 3000, split out the 24 PCIe gen4 lanes into 48 gen3 lanes, and then you could have your 8x Thunderbolt 3 connectors, too. But that's expensive, of course.
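Rough back-of-the-envelope numbers behind those fractions (ignoring everything but line coding): a gen2 lane is 5 GT/s with 8b/10b encoding, so about 4 Gbit/s of payload, and gen2 x4 is roughly 16 Gbit/s - the uplink Thunderbolt 1/2 controllers are fed from. A gen3 lane is 8 GT/s with 128b/130b, so gen3 x4 is roughly 31.5 Gbit/s, which is what a Thunderbolt 3 controller wants. One gen2 lane is therefore about 1/4 of the former and about 1/8 of the latter.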
Even though it would be limited, a Thunderbolt 3 port would expand the connectivity of the Pi, and very few, if any, devices require the maximum bandwidth to operate at all.
> I don't see any reason why the 1x gen2 lane of the pi 4 couldn't host a Thunderbolt 3 port; it would just severely bottleneck the bandwidth of tunnelled PCIe links.
But that's kind of literally the reason? An entire ecosystem of products assumes a reasonably high amount of bandwidth from the connector. That's its singular reason to exist. If you take away the bandwidth from Thunderbolt 3 it just becomes USB, and at that point why not just offer USB connectors which have even broader support and not as many cabling restrictions?
> If you take away the bandwidth from Thunderbolt 3 it just becomes USB
It becomes low bandwidth Thunderbolt / PCIe. You could still use it to attach PCIe devices which don't need a lot of bandwidth. GPUs can be attached for high performance compute where CPU-GPU bandwidth isn't critical. PCIe has non-bandwidth benefits over USB such as DMA and interrupts.
> why not just offer USB connectors which have even broader support and not as many cabling restrictions?
You can't attach PCIe devices via USB, but you can attach USB and PCIe devices via Thunderbolt.
Additionally, this would require adding more PCIe lanes to the SoC, as there isn't the bandwidth to provide the two 4K HDMI outputs, and the other connectivity would be severely bottlenecked.
Being able to attach an Intel 660P and get 2 TB of fast SSD storage on a Raspberry Pi would be sweet.
I'd like to see benchmarks, but my guess is single-thread random 4K read performance will more than double over PCI Express compared with USB.
What really motivated me to do this hack is the relative abundance of stuff I can now plug into an FPGA :)
But as with everything high performance, it depends entirely on the use case.
With a secondary IP layer on 802.11, it might actually work reasonably well.
Ungar, D., & Adams, S. S. (2009). Hosting an object heap on manycore hardware. https://sci-hub.tw/10.1145/1640134.1640149
RoarVM demo https://www.youtube.com/watch?v=GeDFcC6ps88
I now plan a crowdfunding campaign for our own FPGA in an ASIC that would bring $0.25 with a hundred links. This HN page shows me there will be enough buyers for a $300, 364-core, RPi4-compatible cluster (100 BCM2711 chips connected to 1 FPGA, plus 100GB of DDR4) but without the RPi4 boards.
Instead of attaching RPi4 or BCM2711, you could have 100 SATA, SSD, 30 HDMI, 10G or a rack full of PCIe servers connected to this interconnect FPGA.
You are welcome to help realise the project or the crowdfunding.
AR, VR.
Image processing, neural nets.
But not in programming languages like C, Java or Linux.
Only in massively parallel message passing languages.
I suggest a Smalltalk message-passing scalable cluster VM like the RoarVM [8].
Shadama 3D. Yoshiki Oshima. https://www.youtube.com/watch?v=rka3w90Odo8
OpenCobalt Alpha. https://www.youtube.com/watch?v=1s9ldlqhVkM&t=13s
OpenCroquet. Alan Kay, Turing lecture. https://www.youtube.com/watch?v=aXC19T5sJ1U?t=3581
OpenCroquet. Alan Kay et al. https://www.youtube.com/watch?v=XZO7av2ZFB8
RoarVM paper. David Ungar et al. https://stefan-marr.de/renaissance/ungar-adams-2009-hosting-...
RoarVM demo. https://www.youtube.com/watch?v=8pmVAtk612M
RoarVM demo. https://www.youtube.com/watch?v=GeDFcC6ps88
I might be showing my age but...
Imagine a Beowulf cluster of these!
(For those who don't know the reference, this used to be a common saying in slashdot back in the day.)
To everyone who feels left out: old slashdot.org "humor".
What, above 25?
This says there's an OTG controller intended to be used in "peripheral device only" mode.
I haven't found a datasheet for the BCM2711, so it's hard to tell.
I wonder why? The silicon required is far, far simpler for host mode (since you only need a single memory buffer, and you fully dictate the timing, so you can pass all the complex stuff up to software).
the form factor of pcie devices doesn't really play well with the rpi, but there's definitely a need for faster, more stable persistent storage. i have heard of a lot of issues with microsd cards caused by wear leveling and such. it would be really nice if rpi could develop something like an m.2 interconnect where i could install an nvme ssd within the form factor of an rpi - that would make for a truly incredible little machine.
I would absolutely love it if the Raspberry Pi foundation made a version with PCIe instead of USB.
-  https://www.newegg.com/syba-si-pex40064-sata-iii/p/N82E16816...
16-core A72 CPU, up to 64GB RAM, PCIe x8, 10GbE SFP+ ports and 1GbE RJ45, SATA.
It's early access hardware, so it may have some pain points. Normal units will be available from the end of the year for $750.
If I had the time and money right now, that's what I would buy.
I hope the PCIe works fine. And I hope the firmware will be FOSS like on their MACCHIATObin.
One thing they revealed is that the chip is overclockable (including memory), which is awesome. IIRC they got 2.5ish GHz core clock working. Would be amazing if it does like 3GHz with a voltage boost. (I don't expect software voltage control… but there's always hard mods :D)
I develop remotely on VPSes because I like to have an always-on box reachable from any client. I am wondering if an RPi 4 offers a similar experience at lower cost.
Does anyone use an RPi for this?
I also use webpack and Docker a lot: I'm wondering if they would get the Pi stuttering when compiling/building, and whether vim would still be smooth then (which isn't the case on my $20 VPS).
Let me clarify a bit for the public here why the comment is relevant: in the crypto mining community, some groups are looking into what minimal single board computer can provide a PCIe signal to connect a single GPU. Idea is to be cheap and reliable. If you have a many-PCIe board failure (http://bitcoin.zorinaq.com/many_pcie/) you have 10-20 GPUs going down at once. Not good. By isolating each GPU on its own motherboard, you can isolate failures, thus increase mining profits. When I saw the OP mention cryptocurrency in the blog post, I thought hey maybe that's what he is looking to do...