
Been following this for a while b/c Jim Keller, but every time I look at the arch [1; as linked by other commenter] as someone who doesn't know the first thing about CPU/ASIC design it just looks sort of... "wacky"? Does anyone who understands this have a good explainer for the rationale behind having a grid of cores with memory and IFs interspersed between and then something akin to a network interconnecting them with that topology? What is it about the target workloads here that makes this a good approach?

1. https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware



Sounds like a manycore architecture. If you have played TIS-100, it is that exact same idea. If you haven't, but have played Factorio, think of it this way: instead of having a central area where all the work happens, you build a series of interconnected stations, each doing its own part of the calculation before passing it on to the next.

Upside is each core has its own code and is fully Turing complete and independent of the others. You can handle conditionals much better. And you lose the latency of having network hops for workers.

Downside is you need to break down your process to map onto specific nodes and flows.

(Assuming it is in fact manycore - which is not the same as multicore)


Sounds kind of like the IBM/Sony/Toshiba Cell? It made an appearance in the PlayStation 3, but was supposed to be a more general high-performance architecture. At some point IBM sold blade servers with Cell processors.



No, the Cell is a many-core architecture with a Power 'general' core and 6-8 'special purpose' vector processing units.


Yes. "Coarse-grain dataflow processor". It's like an FPGA. But LUTs are replaced with RISC-V cores.


> You can handle conditionals much better

Because you can have a core set up for each branch and just pass to that vs. “context-switching” your core to execute the branch that ends up being taken?


Both

> Because you can have a core set up for each branch

and

> because each core has its own instruction counter

You can have each core be a tighter loop because you can limit the types of cases it handles.

But because each core also has its own program counter, you can take all sorts of branches in a way that you can't with SIMD: as the name implies, SIMD gives you only a Single Instruction to deal with Multiple Data. So if you need to change the behaviour depending on the exact value of the data, you are better off using separate cores than wider vectors.
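
To make the SIMD point concrete, here is a rough sketch in plain Python (not real vector code, and not anything Tenstorrent-specific). The function names and the Collatz-style step are made up for illustration. The SIMD-style version evaluates both sides of the branch for every lane and selects with a mask; the per-core version only runs the branch that is actually taken.

```python
# Illustrative only: plain Python lists stand in for vector lanes and cores.

def collatz_step_simd(lanes):
    """SIMD-style: one instruction stream, so every lane computes BOTH
    branch results and a mask selects which one to keep."""
    mask = [x % 2 == 0 for x in lanes]   # single compare across all lanes
    even = [x // 2 for x in lanes]       # computed for every lane
    odd = [3 * x + 1 for x in lanes]     # also computed for every lane
    return [e if m else o for m, e, o in zip(mask, even, odd)]


def collatz_step_core(x):
    """Independent core: its own program counter, so only the taken
    branch is executed."""
    if x % 2 == 0:
        return x // 2
    return 3 * x + 1


def collatz_step_manycore(values):
    # Each element is handled by a (hypothetical) separate core.
    return [collatz_step_core(x) for x in values]


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(collatz_step_simd(data))
    print(collatz_step_manycore(data))
```

Both give the same answer; the difference is that the masked version always pays for both branches, which gets worse the more divergent the code is.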


because each core has its own instruction counter


This reminds me very much of transputers. The idea here is that each cpu can context switch extremely quickly with minimal latency to any resource and you have a topology that is great for matrix maths as a result.


What makes the resulting topology great for matrix math, vs. non-matrix math workloads? Naively if you know you're "only" going to multiply matrices, what do you need the flexibility and fast context-switching for? Is the end-game here that you can lay out the workload s.t. you have a series of closely colocated cores carrying out the operations of some linalg expression one after the other and the memory for intermediate results right in between, or something like that?


Is this some kind of trick question?

Your core needs to be fully programmable so you can do things like kernel fusion. The simplest form is to load quantized weights and dequantize them to bfloat16 as you go. Llama.cpp and its GGUF files support various types of quantization and most of them require programmability to efficiently support them.
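
A minimal sketch of what "dequantize as you go" means, in plain Python. The block layout here (a list of small integers plus one float scale per block) is a made-up stand-in, not the actual GGUF format, and all function names are hypothetical. The unfused version materializes the full-precision weights first; the fused version applies the scale inside the dot product so the dequantized weights never exist as a separate buffer.

```python
# Hypothetical block quantization: each block is a list of small ints
# sharing one float scale. This is NOT the real GGUF layout.

def dequantize(q_block, scale):
    """Expand one quantized block to full-precision weights."""
    return [scale * q for q in q_block]


def dot_unfused(q_blocks, scales, x):
    """Two passes: build the whole dequantized weight vector, then dot."""
    weights = []
    for q_block, scale in zip(q_blocks, scales):
        weights.extend(dequantize(q_block, scale))
    return sum(w * xi for w, xi in zip(weights, x))


def dot_fused(q_blocks, scales, x):
    """One pass: accumulate on the quantized values and apply the scale
    once per block, so no dequantized buffer is ever written out."""
    total = 0.0
    i = 0
    for q_block, scale in zip(q_blocks, scales):
        block_sum = 0.0
        for q in q_block:
            block_sum += q * x[i]
            i += 1
        total += scale * block_sum
    return total


if __name__ == "__main__":
    q_blocks = [[1, -2], [3, 4]]
    scales = [0.5, 0.25]
    x = [2.0, 1.0, 4.0, -1.0]
    print(dot_unfused(q_blocks, scales, x), dot_fused(q_blocks, scales, x))
```

The point is that each quantization scheme needs its own inner loop like this, which is why fixed-function matmul hardware alone is not enough.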


not a trick question, I'm just genuinely ignorant about the topic


I suspect that's basically it (operations one by one and then pipelining to saturate). That's basically what Groq does also AFAIK. From their website it seems the chips are designed to be connected together into one big topology, the "Galaxy" system. Also similar to TPUs, although they use HBM with only a few very powerful "cores" vs DRAM with low powered cores.


It's another iteration of the unit record machine [0] - batch processes done with a physical arrangement of the steps in the process.

CPU design moved away from this analogy a long while ago because the tasks being done with CPUs involved more dynamic control flow structures and arbitrary workloads. But workloads that are linear batches of brute force math don't need that kind of dynamism, so gridded designs become fashionable as a way of expressing a configurable pipeline with clear semantics - everything on the grid is at a linear, integer-scalable distance, buffers will be of the same size, etc.

[0] https://en.m.wikipedia.org/wiki/Unit_record_equipment


Definitely looks wacky! Has nice concept though, I like the "Network On Chip" reversed toruses.

Hopefully some of y'all tinkerers with time and dough can bear some of these ideas to fruition, keep Nvidia on their toes ;)

64gb is a good RAM amount IMO, cheap yet still vastly underutilized since we play to the LCD of users... guessing Linux will be able to make that pivot much faster/so little baggage.

Plus..."Grayskull"


> I like the "Network On Chip" reversed toruses

What about them do you like as a design decision? (genuinely curious, as again, I don't understand it)


> 64gb is a good RAM amount IMO

It doesn’t have 64gb of RAM, it has 8. The system requirements otherwise need 64gb of RAM for model compilation


You want memory to be close to where it's used because at the speeds of high-performance ICs, the latency caused by distance is actually significant.


but isn't that aspect common among this, CPUs, GPUs, ...? And it feels like the whole NOC thing would add quite some overhead to moving things around.

Are you saying proximity here more than offsets this vs. e.g. each core having its own cache as I think they do in a "normal" CPU? And if so, is this more true of ML inference workloads than other workloads, for some reason?


I think the distinction with ML inference workloads is that you often have very little control flow, so this type of architecture lets you match layers to adjacent cores so that each operation gets its data directly from the step before rather than from RAM.
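
As a toy model of "each operation gets its data directly from the step before", here is a sketch using Python generators. Each stage stands in for a core, and values stream from one stage to the next without an intermediate array in between (the stand-in for a round trip through RAM). The stage names and the scale/bias/ReLU "layers" are invented for illustration; this is not how the actual hardware or its compiler is programmed.

```python
# Toy dataflow pipeline: each generator plays the role of one core that
# consumes values from its upstream neighbour and emits to the next one.

def scale_stage(upstream, factor):
    for x in upstream:
        yield x * factor


def bias_stage(upstream, bias):
    for x in upstream:
        yield x + bias


def relu_stage(upstream):
    for x in upstream:
        yield max(0, x)


def run_pipeline(inputs):
    """Chain the stages; no intermediate list is built between them."""
    stream = iter(inputs)
    stream = scale_stage(stream, 2)
    stream = bias_stage(stream, -3)
    stream = relu_stage(stream)
    return list(stream)  # only the final result is materialized


if __name__ == "__main__":
    print(run_pipeline([0, 1, 2, 3]))
```

With no data-dependent control flow, the order in which stages hand data to each other is fixed up front, which is what makes this kind of static placement workable.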


Right, so latency is less of an issue (because control flow is predictable) and performance becomes about bandwidth.


I think the NOC approach is fundamentally similar to Intel's rings that they used for core interconnect back in the mid-2010s. It works.

https://www.realworldtech.com/includes/images/articles/snbep...

https://en.wikichip.org/wiki/intel/microarchitectures/sandy_...


They still use this today, and the ring interconnect is also the topology inside zen3/zen4 CCXs (with 8 cores). Ring is one of the simplest and best systems for >4 cores, until you get to about 8-10 cores (at which point you generally split it into multiple “tiers” like multiple CCXs etc).
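
For a rough feel of why a ring stops scaling, here is a small sketch that counts worst-case hops in a bidirectional ring versus a 2D torus (a grid with wraparound links in both directions). This is an idealized model under my own assumptions: it ignores link width, routing and contention, and it is not a description of any specific chip's NoC.

```python
# Worst-case hop counts for two idealized topologies.

def ring_hops(a, b, n):
    """Shortest distance between nodes a and b on a bidirectional ring."""
    d = abs(a - b) % n
    return min(d, n - d)


def ring_diameter(n):
    """Largest shortest-path distance on a ring of n nodes."""
    return max(ring_hops(0, b, n) for b in range(n))


def torus_diameter(width, height):
    """Largest shortest-path distance on a width x height 2D torus,
    where each dimension wraps around like its own ring."""
    return max(
        ring_hops(0, x, width) + ring_hops(0, y, height)
        for x in range(width)
        for y in range(height)
    )


if __name__ == "__main__":
    print(ring_diameter(8))       # small ring
    print(ring_diameter(64))      # same idea stretched to 64 nodes
    print(torus_diameter(8, 8))   # 64 nodes arranged as a torus
```

The ring's worst case grows linearly with node count, while the torus grows with the side length, so at 64 nodes the gap is already large.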



not for consumer chips, which still use the same ringbus design.

e-cores do have a CCX/core cluster, but the clusters themselves go on the ringbus lol


I honestly don't understand this latency obsession for LLMs. You are loading millions of parameters sequentially for each matrix. The access pattern is perfectly predictable. I just ran llama.cpp with profiling and 99.9% of the time is spent in matrix multiplication. This shocked me, honestly, because I genuinely thought that there was going to be much more variety.


Seeing the topology I had a flashback to college and the MasPar [1] we were using in ‘92!

[1] https://en.wikipedia.org/wiki/MasPar


The real question is how do they plan to compete with, say, a Ryzen 8700G with 32 GB of overclocked RAM and the Ryzen AI NPU. 2x DDR5-6600 gives you more memory bandwidth than Grayskull. Their primary advantage appears to be the large SRAM and not much else.
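
For reference, the DDR5 side of that comparison is simple arithmetic: theoretical peak bandwidth is transfers per second times bytes per transfer times number of channels. The sketch below assumes "2x DDR5-6600" means dual channel with a 64-bit data bus per channel; the function name is made up, and the Grayskull figure is not stated in the thread, so it is left out rather than guessed.

```python
def peak_dram_bandwidth_gb_s(mega_transfers_per_s, channels, bus_width_bits=64):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes every transfer moves bus_width_bits on each channel; real
    sustained bandwidth is lower.
    """
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    # Dual-channel DDR5-6600: 6600 MT/s * 8 bytes * 2 channels
    print(peak_dram_bandwidth_gb_s(6600, channels=2))
```

That works out to roughly 105.6 GB/s peak for the dual-channel DDR5-6600 setup.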


In general putting memory physically close to compute is good. If two cores need to share that memory doesn't it make sense to place the memory at the interface?



