Been following this for a while b/c Jim Keller, but every time I look at the arc...

Pet_Ant · on March 10, 2024

Sounds like a manycore architecture. If you have played TIS-100, it is that exact same idea. If you haven’t, but have played Factorio, think of it this way: instead of having a central area where all the work happens, you build a series of interconnected stations, each doing its own part of the calculation before passing it on to the next.

Upside is that each core has its own code, is fully Turing complete, and is independent of the others. You can handle conditionals much better. And you lose the latency of having network hops for workers.

Downside is you need to break down your process to map onto specific nodes and flows.

(Assuming it is in fact manycore - which is not the same as multicore)

pavlov · on March 10, 2024

Sounds kind of like the IBM/Sony/Toshiba Cell? It made an appearance in the PlayStation 3, but was supposed to be a more general high-performance architecture. At some point IBM sold blade servers with Cell processors.

amelius · on March 10, 2024

Is that a dataflow architecture?

https://en.wikipedia.org/wiki/Dataflow_architecture

yvdriess · on March 10, 2024

No, the Cell is a many-core architecture with a Power 'general' core and 6-8 'special purpose' vector processing units.

marty1885 · on March 11, 2024

Yes. "Coarse-grain dataflow processor". It's like an FPGA. But LUTs are replaced with RISC-V cores.

bschne · on March 10, 2024

> You can handle conditionals much better

Because you can have a core set up for each branch and just pass to that vs. “context-switching” your core to execute the branch that ends up being taken?

Pet_Ant · on March 11, 2024

Both

> Because you can have a core set up for each branch

and

> because each core has its own instruction counter

You can have each core be a tighter loop because you can limit the types of cases it handles.

But because each core also has its own program counter, it can take all sorts of branches in a way that you can't with SIMD: as the name implies, SIMD gives you only a Single Instruction to deal with Multiple Data. So if you need to change the behaviour depending on the exact value of the data, you are better off using separate cores than wider vectors.
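
A toy sketch of that difference, assuming a made-up cost model (the branch costs, lane count, and function names are all hypothetical, not measurements of any real chip). Lock-step lanes that disagree on a branch pay for both sides; independent cores each pay only for the side they take:

```python
# Toy model: count "instruction slots" for a data-dependent branch.
# The costs below are made-up illustrative numbers, not measurements.
COST_EVEN = 3   # hypothetical cost of the "even" branch
COST_ODD = 5    # hypothetical cost of the "odd" branch


def simd_slots(values, lanes):
    """Lock-step lanes: a group that diverges runs both branches (masked)."""
    total = 0
    for i in range(0, len(values), lanes):
        group = values[i:i + lanes]
        if any(v % 2 == 0 for v in group):
            total += COST_EVEN
        if any(v % 2 == 1 for v in group):
            total += COST_ODD
    return total


def independent_core_slots(values, cores):
    """Each core has its own program counter and runs only the branch it needs."""
    per_core = [0] * cores
    for i, v in enumerate(values):
        per_core[i % cores] += COST_EVEN if v % 2 == 0 else COST_ODD
    return max(per_core)  # elapsed time is set by the busiest core


data = [1, 2, 3, 4, 5, 6, 7, 8]
print("SIMD, 4 lanes:       ", simd_slots(data, lanes=4))
print("4 independent cores: ", independent_core_slots(data, cores=4))
```

With uniform data (all even, say) the SIMD version pays for one branch only, so the gap shows up only when the data actually diverges.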

vrighter · on March 10, 2024

because each core has its own instruction counter

BirAdam · on March 10, 2024

This reminds me very much of transputers. The idea here is that each cpu can context switch extremely quickly with minimal latency to any resource and you have a topology that is great for matrix maths as a result.

bschne · on March 10, 2024

What makes the resulting topology great for matrix math, vs. non-matrix math workloads? Naively, if you know you're "only" going to multiply matrices, what do you need the flexibility and fast context-switching for? Is the end-game here that you can lay out the workload s.t. you have a series of closely colocated cores carrying out the operations of some linalg expression one after the other and the memory for intermediate results right in between, or something like that?

imtringued · on March 10, 2024

Is this some kind of trick question?

Your core needs to be fully programmable so you can do things like kernel fusion. The simplest form is to load quantized weights and dequantize them to bfloat16 as you go. Llama.cpp and its GGUF files support various types of quantization, and most of them require programmability to be supported efficiently.
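
A rough sketch of what "dequantize as you go" means, under loud assumptions: the block size, the int8-plus-scale scheme, and the function names are invented for illustration and are not the actual GGUF layout, and plain Python floats stand in for bfloat16. The point is only that the dequantize step is fused into the dot product, so no full-precision copy of the weights is ever built:

```python
# Simplified block quantization: each block of BLOCK weights shares one scale.
# This is NOT the real GGUF format, just the general idea.
BLOCK = 4


def quantize(weights):
    """Return (int8-range values, per-block scales) for a flat list of floats."""
    q, scales = [], []
    for i in range(0, len(weights), BLOCK):
        block = weights[i:i + BLOCK]
        scale = max(abs(w) for w in block) / 127 or 1.0  # avoid a zero scale
        scales.append(scale)
        q.extend(round(w / scale) for w in block)
    return q, scales


def fused_dot(q, scales, x):
    """Dequantize each weight on the fly while accumulating the dot product."""
    acc = 0.0
    for i, qi in enumerate(q):
        acc += (qi * scales[i // BLOCK]) * x[i]
    return acc


weights = [0.5, -1.0, 0.25, 0.0, 2.0, -2.0, 1.0, 0.5]
x = [1.0, 2.0, 3.0, 4.0, 1.0, 1.0, 1.0, 1.0]
q, scales = quantize(weights)
exact = sum(w * xi for w, xi in zip(weights, x))
approx = fused_dot(q, scales, x)
print("exact:", exact, "fused/quantized:", approx)
```

Each quantization variant needs a different inner loop like `fused_dot`, which is where the programmability comes in.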

bschne · on March 11, 2024

not a trick question, I'm just genuinely ignorant about the topic

cavisne · on March 10, 2024

I suspect that's basically it (operations one by one and then pipelining to saturate). That's basically what Groq does also AFAIK. From their website it seems the chips are designed to be connected together into one big topology, the "Galaxy" system. Also similar to TPUs, although they use HBM with only a few very powerful "cores" vs DRAM with low powered cores.

crq-yml · on March 10, 2024

It's another iteration of the unit record machine [0] - batch processes done with a physical arrangement of the steps in the process.

CPU design moved away from this analogy a long while ago because the tasks being done with CPUs involved more dynamic control flow structures and arbitrary workloads. But workloads that are linear batches of brute force math don't need that kind of dynamism, so gridded designs become fashionable as a way of expressing a configurable pipeline with clear semantics - everything on the grid is at a linear, integer-scalable distance, buffers will be of the same size, etc.

[0] https://en.m.wikipedia.org/wiki/Unit_record_equipment

transitionnel · on March 10, 2024

Definitely looks wacky! Has nice concept though, I like the "Network On Chip" reversed toruses.

Hopefully some of y'all tinkerers with time and dough can bear some of these ideas to fruition, keep Nvidia on their toes ;)

64gb is a good RAM amount IMO, cheap yet still vastly underutilized since we play to the LCD of users... guessing Linux will be able to make that pivot much faster/so little baggage.

Plus..."Grayskull"

bschne · on March 10, 2024

> I like the "Network On Chip" reversed toruses

What about them do you like as a design decision? (genuinely curious, as again, I don't understand it)

binarymax · on March 10, 2024

> 64gb is a good RAM amount IMO

It doesn’t have 64gb of RAM, it has 8. The system requirements otherwise need 64gb of RAM for model compilation

tormeh · on March 10, 2024

You want memory to be close to where it's used because at the speeds of high-performance ICs, the latency caused by distance is actually significant.

bschne · on March 10, 2024

but isn't that aspect common among this, CPUs, GPUs, ...? And it feels like the whole NOC thing would add quite some overhead to moving things around.

Are you saying proximity here more than offsets this vs. e.g. each core having its own cache as I think they do in a "normal" CPU? And if so, is this more true of ML inference workloads than other workloads, for some reason?

adgjlsfhk1 · on March 10, 2024

I think the distinction with ML inference workloads is that you often have very little control flow, so this type of architecture lets you match layers to adjacent cores so that each operation gets its data directly from the step before rather than from RAM.
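
As a minimal sketch of that layout, assuming each "core" can be modelled as a generator stage (the stage names and toy layer functions are hypothetical): every stage pulls its input straight from its neighbour, and nothing is written back to a shared buffer in between. Real hardware would run the stages concurrently; Python generators only show the wiring.

```python
def source(batch):
    """Feeds inputs into the first stage one at a time."""
    for x in batch:
        yield x


def core(layer_fn, upstream):
    """One stage: apply a layer to each item as it arrives from the previous stage."""
    for x in upstream:
        yield layer_fn(x)


def scale(x):
    return 2 * x


def bias(x):
    return x + 1


def relu(x):
    return max(0, x)


def build_pipeline(batch, layers):
    stream = source(batch)
    for layer in layers:
        stream = core(layer, stream)  # wire each stage to the one before it
    return stream


outputs = list(build_pipeline([-3, 0, 2], [scale, bias, relu]))
print(outputs)
```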

weebull · on March 11, 2024

Right, so latency is less of an issue (because control flow is predictable) and performance becomes about bandwidth.

loeg · on March 10, 2024

I think the NOC approach is fundamentally similar to Intel's rings that they used for core interconnect back in the mid-2010s. It works.

https://www.realworldtech.com/includes/images/articles/snbep...

https://en.wikichip.org/wiki/intel/microarchitectures/sandy_...

paulmd · on March 10, 2024

They still use this today, and the ring interconnect is also the topology inside zen3/zen4 CCXs (with 8 cores). Ring is one of the simplest and best systems for >4 cores, until you get to about 8-10 cores (at which point you generally split it into multiple “tiers” like multiple CCXs etc).

loeg · on March 10, 2024

Intel switched to a mesh model: https://en.wikichip.org/wiki/intel/mesh_interconnect_archite...

paulmd · on March 11, 2024

not for consumer chips, which still use the same ringbus design.

e-cores do have a CCX/core cluster, but the clusters themselves go on the ringbus lol

imtringued · on March 10, 2024

I honestly don't understand this latency obsession for LLMs. You are loading millions of parameters sequentially for each matrix. The access pattern is perfectly predictable. I just ran llama.cpp with profiling and 99.9% of the time is spent in matrix multiplication. This shocked me, honestly, because I genuinely thought that there was going to be much more variety.

adinb · on March 10, 2024

Seeing the topology I had a flashback to college and the MasPar [1] we were using in ‘92!

[1] https://en.wikipedia.org/wiki/MasPar

imtringued · on March 10, 2024

The real question is how do they plan to compete with say a Ryzen 8700G with 32 GB of overclocked RAM and the Ryzen AI NPU. 2x DDR5-6600 gives you more memory bandwidth than grayskull. Their primary advantage appears to be the large SRAM and not much else.
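
For reference, the DDR5 side of that comparison works out as below, assuming "2x DDR5-6600" means two channels with 64 data bits each. This is theoretical peak, not sustained, bandwidth, and the helper name is made up:

```python
def ddr_bandwidth_gb_s(mega_transfers_per_s, channels, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# Assumption: dual-channel DDR5-6600, 64 data bits per channel.
print(f"{ddr_bandwidth_gb_s(6600, channels=2):.1f} GB/s")
```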

bgnn · on March 10, 2024

In general putting memory physically close to compute is good. If two cores need to share that memory doesn't it make sense to place the memory at the interface?