
A Peek Inside a 400G Cisco Network Chip - protomyth
https://www.nextplatform.com/2017/09/14/rare-peek-inside-400g-cisco-network-chip/
======
rnxrx
As impressive as this is, it's likely 3-4 generations back from what's
currently shipping. It's a switch-on-chip (SoC) set up for 12.5 Gb/s SerDes.
As pointed out elsewhere, it might have been deployed with 16 10G front-panel
ports, another 160G to a backplane ASIC, and some of the remaining ports for
control-plane use.
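As a rough sanity check on that guessed port breakdown (all numbers here are the comment's speculation, not confirmed specs), the lane budget might look like:

```python
# Hypothetical SerDes lane budget for the deployment guessed at above.
# Assumes each 10G port rides one 12.5 Gb/s SerDes lane; 10GBASE-R's
# ~10.3125 Gb/s line rate fits comfortably on a 12.5G-capable lane.

front_panel_lanes = 16        # 16 x 10G front-panel ports
backplane_lanes = 160 // 10   # 160G to a backplane ASIC = 16 more lanes

total_lanes = front_panel_lanes + backplane_lanes
total_bw_gbps = total_lanes * 10

print(f"{total_lanes} lanes in use, {total_bw_gbps}G usable")  # 32 lanes, 320G
# whatever SerDes remain would be free for the control-plane links mentioned
```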

Reading between the lines a bit it might have been used in one of the Nexus
5K's - which would put it at around 8 years old, depending on how we're
counting.

~~~
MattSteelblade
Wouldn't the fact that it's using a 22nm process mean that 8 years is at
least a couple of years too far?

~~~
rnxrx
Possibly, but it's in the right neighborhood. It depends on whether we're
counting from the time of design/initial fabrication or actual mass production
and general sale.

The point is that 12.5 Gb/s SerDes pretty much means it would be practically
limited to 10/40G. At least in the DC networking world, that puts it several
generations back.

~~~
petermonsson
NPUs are not used in data centers. Data centers don't need the expensive but
totally flexible network-processing capabilities. Think carrier and enterprise
access networks, which have lower Ethernet bandwidth requirements but higher
processing requirements per packet.

------
userbinator
I'm surprised that there's even a processor (as in a Turing-complete machine)
in those ASICs at all. At those speeds, I would've expected something more
along the lines of lots of hardcoded switching circuitry, with any
programmability restricted to lookup tables and configuration registers:
closer to an FPGA than a CPU in concept. Certainly, a lot of slower switch
ASICs are designed as hardcoded switches; here's one example:

[https://www.intel.com/content/dam/www/public/us/en/documents...](https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/ethernet-switch-fm2112-datasheet.pdf)

~~~
kelleyk
That's not because it's slower; that's because it's the low-end (non-
programmable) member of its family. It was designed by a company called
Fulcrum, which produced the FocalPoint FM2000/FM3000/FM4000 (1/10G) and
FM5000/FM6000 (10G/40G) switching ASICs before being acquired by Intel.

I think that a comparison to the FM4000, which is the programmable series of
parts in the "Monaco" family, would be more fair. Here's their datasheet:
[https://www.intel.com/content/dam/www/public/us/en/documents...](https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/ethernet-switch-fm4000-datasheet.pdf)

The FocalPoint ASICs were, as far as I know, some of the first to support (a
demo-quality implementation of) OpenFlow in hardware. When Intel bought them,
they released the datasheets, which is neat.

As a real-world example, these ASICs were used in Arista's 7100 series (c.
2008) switches. Arista published a two-part "Technical Evaluation Guide" for
those switches which is, among other things, an interesting look at how
switches are constructed out of ASICs. Part 1
([https://local.com.ua/forum/index.php?app=core&module=attach&...](https://local.com.ua/forum/index.php?app=core&module=attach&section=attach&attach_id=6518))
shows the topology of each switch (starting on page 13).

The 7124 is a single 24-port FM4224 with all 24 ports connected to front-panel
ports. The 7148S has three FM4224 ASICs; each is connected to 16 front-panel
ports and uses 4 ports (40 Gb/s) to connect to each of the other two ASICs in
a ring. This means the inter-ASIC links can become a bottleneck: if, say, all
16 ports connected to the first ASIC try to send 160 Gb/s of traffic to the 16
ports connected to the second ASIC, they'll saturate the 40 Gb/s of
connectivity between those ASICs. That's why Arista also offered the 7148SX,
which is non-blocking but needs six (!) FM4224s to make it happen!
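The worst-case oversubscription is easy to work out; here's a back-of-the-envelope sketch using only the numbers from the comment above:

```python
# Back-of-the-envelope check of the 7148S ring bottleneck described above.
# Three ASICs in a ring; each has 16 front-panel 10G ports and 4 x 10G
# lanes to each neighboring ASIC (numbers taken from the comment).

front_panel_ports = 16
port_speed_gbps = 10
inter_asic_lanes = 4

ingress_gbps = front_panel_ports * port_speed_gbps  # 160 Gb/s worst case
link_gbps = inter_asic_lanes * port_speed_gbps      # 40 Gb/s to one neighbor

oversubscription = ingress_gbps / link_gbps
print(f"worst-case oversubscription toward one neighbor: {oversubscription:.0f}:1")
# 4:1 when all 16 ports target hosts behind a single neighboring ASIC
```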

~~~
userbinator
The FM4000 doesn't appear to have an actual programmable CPU either; just the
same "lots of lookup tables and configuration registers" (but more of them).

~~~
myrandomcomment
The most interesting thing about the FMxxxx chips was that the pipeline was
asynchronous.

------
aomix
That number seems stupendous, but I don't know enough to judge. My very
casual understanding was that 100G was state of the art.

~~~
pavs
100G has been around for a while but isn't that common in use outside of
extreme cases; 10G and 40G are still the most widely used ports. Considering
that most switches and routers have a minimum of 10-20 ports (10G-40G), that's
already a stupendous amount of bandwidth, and on routers you can always add
more cards. 100G routers/switches are very expensive.

Unless you are pushing Google/Facebook/Comcast levels of traffic, there are
very few use cases. Apparently, Google and Facebook use their own network
hardware.

~~~
rnxrx
This isn't really true in the market today. 100G switches have emerged at
roughly the same cost per port as 40G from a couple of years ago and, indeed,
can often accept both 40 and 100 gig optics. Even at list price the cost per
port for 100G switching has been under $1K for more than a year.

As a result it's actually fairly common to find new DC fabrics (read: inter-
switch connections, not end hosts) being built with 100G because there's no
significant economic disadvantage to doing so. That said, the pricing for
inter-site 100G is still high enough that it hasn't commonly made its way to
smaller organizations.

------
VectorLock
> The processor complex on this unnamed NPU has 672 processors, each with four
> threads per core.

Very cool.

~~~
QAPereo
My favorite bit:

 _L2 instruction cache that also has an on-chip interconnect that links the
clusters and caches to each other as well as packet storage, accelerators,
on-chip memories, and DRAM controllers together. This interconnect runs at 1
GHz and has more than 9 Tb/sec of aggregate bandwidth_

I keep seeing elephant guns and experimental German artillery when I read
that!
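For a sense of scale, those two quoted figures pin down how much data the interconnect moves per clock; this is just arithmetic on the article's numbers:

```python
# 9 Tb/s of aggregate bandwidth at a 1 GHz clock: bits moved per cycle.
aggregate_bps = 9e12   # 9 Tb/s, from the article quote
clock_hz = 1e9         # 1 GHz, from the article quote

bits_per_cycle = aggregate_bps / clock_hz
kib_per_cycle = bits_per_cycle / 8 / 1024

print(f"{bits_per_cycle:.0f} bits (~{kib_per_cycle:.1f} KiB) per cycle")
# roughly 9000 bits, i.e. a bit over 1 KiB in aggregate every nanosecond
```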

------
lsjdfkljdfwkwdf
Why is Cisco still using TCAM instead of HBM/HMC like Juniper?

~~~
signa11
(T)CAM is specialized hardware for doing content-based lookups. Specifically,
it's used for parallel table searches: a search key is compared against all
stored comparator values at once...
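A toy software model of that parallel ternary match might make it concrete (the entry patterns and actions here are made up for illustration; a real TCAM compares every entry simultaneously in hardware, in a single clock cycle):

```python
def tcam_lookup(key, entries):
    """Return the action of the highest-priority entry matching key.

    Each pattern is a bit string where 'x' is a don't-care bit (the
    "ternary" part); list order stands in for the priority encoder
    that a real TCAM uses to pick among simultaneous matches.
    """
    for pattern, action in entries:
        if all(p in ('x', k) for p, k in zip(pattern, key)):
            return action
    return None

# Longest-prefix-style entries, most specific first
entries = [
    ("1100xxxx", "port2"),    # matches keys starting 1100
    ("11xxxxxx", "port1"),    # matches keys starting 11
    ("xxxxxxxx", "default"),  # catch-all
]

print(tcam_lookup("11001010", entries))  # -> port2
print(tcam_lookup("11110000", entries))  # -> port1
print(tcam_lookup("00000000", entries))  # -> default
```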

------
zackmorris
Ok this is uncanny, I just commented on the need for something similar
yesterday:

[https://news.ycombinator.com/item?id=15244655](https://news.ycombinator.com/item?id=15244655)

I wish I could set up a notification of some kind to know when someone gets
Erlang/Elixir running on this chip. It would be a great platform for stress-
testing Go concurrency as well.

Beyond that, I would really like to see Octave running on it because it's the
only approachable vector programming language that I know of. The holy grail
for me is to be able to use something like the MATLAB libraries at 1000 times
their current speed to simulate the interesting stuff.

~~~
imtringued
You should instead look out for the 1024-core Epiphany. It's a streaming
processor, so it may not perform as well on random memory access, and there is
no cache hierarchy, but it's very close to your previous comment about
"Something on the order of 256 or more 1 GHz MIPS, ARM, or even early PowerPC
processors."

~~~
zackmorris
That's really cool, thank you! I would actually prefer a cacheless
architecture like that, because I don't think a cache really has a place in
streaming or message-passing paradigms like Erlang or Go (it can still be
relevant within the local address space of each process, but I don't feel the
gain is worth it in most cases). Plus, the problem space is still large, so it
might be better to let people discover alternative approaches to data locality
like map-reduce/sharding, copy-on-write, and content-addressable memory.

I spent my teens writing blitters for shareware games and found that even
then, the cache mostly got in the way. Processors like the PowerPC 603e had a
pretty substantial cache-miss penalty, on the order of 5-20% for me depending
on the situation. It was difficult to come up with appropriate cache
hints for even relatively minor random access. I tried disabling the cache,
but that made it even slower than a 601. So that's where my head is at, and
the Epiphany sounds perfect. Here's a quick link for anyone curious:

[https://www.parallella.org/2016/10/05/epiphany-
v-a-1024-core...](https://www.parallella.org/2016/10/05/epiphany-v-a-1024-core-64-bit-risc-processor/)

