AMD Demonstrates Stacked 3D V-Cache Technology: 192 MB at 2 TB/SEC (anandtech.com)
466 points by zdw on June 1, 2021 | 213 comments


Imagine 32GB of this V-Cache instead of RAM. Does each additional layer add latency? If so, I wonder how many layers you could have until it would reach RAM-equivalent latencies. Also, it would require a compiler change if you had 500 sections of RAM (at 64MB per section) with latency increases for each section.


In short, yes, higher capacity will involve higher latency.

The thing here is balancing core speed vs “effective” memory latency (factoring in cache hits, misses, etc).
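This balance is usually framed as average memory access time (AMAT). A toy calculation (illustrative numbers, not from the article) shows how a bigger-but-slower cache can still come out ahead:

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit cost plus expected miss cost."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# A small, fast cache vs. a big, slower one (made-up numbers):
small = amat(hit_time_ns=4.0, miss_rate=0.10, miss_penalty_ns=80.0)  # 12.0 ns
large = amat(hit_time_ns=8.0, miss_rate=0.02, miss_penalty_ns=80.0)  #  9.6 ns
```

Even though the large cache's hit latency is double, the lower miss rate makes its effective latency better, which is exactly the load-dependent equilibrium described above.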

Cache management is a hard problem, and the equilibrium point is load dependent (i.e. depends on what type of program you use).

AMD has been smart enough to understand that sometimes it’s better to just brute force your way in (with higher cache sizes) than being super clever about how you handle it.


At 192MB, just give me explicit access to this cache as memory and I’ll decide what goes into it. No need to try to predict what my code will want, I can just tell you.


Put everything in contiguous memory and load it, it'll be in cache


That's not a great solution. My working data set is likely larger than 192MB. But my application is much more likely to know what data is on the hot path than the CPU is to guess it. I might want to put my DB index into this cache, or only a part of it. I might want to preload some data, work on it for a bit, then load the next chunk. I might want one core preparing data in RAM while another core gets ready to work on it in this cache. Essentially, I want full access to it, independent of main RAM, because I can do a lot more with that. Think of the difference between RAM and disk: is RAM only useful as a cache for what's on disk?
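The preload-then-work pattern described above is essentially double buffering. A rough sketch in plain Python (names and chunk sizes are made up; a bounded queue stands in for the limited fast memory):

```python
import threading, queue

def producer(chunks, q):
    """One 'core' stages chunks into the fast buffer (modeled as a queue)."""
    for chunk in chunks:
        q.put(chunk)   # "preload" the next chunk; blocks if the buffer is full
    q.put(None)        # sentinel: no more data

def consumer(q, results):
    """The other 'core' works on whatever is already staged."""
    while (chunk := q.get()) is not None:
        results.append(sum(chunk))   # stand-in for the real hot-path work

chunks = [list(range(i, i + 4)) for i in range(0, 12, 4)]
q = queue.Queue(maxsize=2)   # bounded: only ~2 chunks "fit in cache" at once
results = []
t1 = threading.Thread(target=producer, args=(chunks, q))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
# results == [6, 22, 38]
```

With hardware-managed caches you can only hint at this (prefetch instructions, non-temporal stores); an explicitly addressable SRAM would let software schedule it directly.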


We know both positions are correct in practice. Automatic prediction/management algorithms perform better than hand-tuning on 90% of the code that runs on them. However, careful performance tuning over time by experts can outperform the generic solutions, sometimes dramatically.

I generally err on the side of "I'm not as smart as I think I am" in these discussions. It's not that I could never do it; I'm sure I could if I studied it a bit. It's that 99% of the lines of code I've ever written don't warrant that kind of attention. There are certainly uses when you're writing some core, fundamental algorithm in fields like compression, cryptography, hashing, or video processing. I don't work in those spaces, though, so the benefit is much more marginal.

It's possible 192MB is enough to start needing some explicit memory management. It makes the coding model much more complex, though. And in a greedy software system where your code isn't the only thing running, such complexity doesn't necessarily net overall wins. It's the reason we have drivers and OSes even though we started with each piece of software bundling explicit hardware support (at much better performance, generally).


Imagine coding for a 64-bit version of the 6502, where you have a 192MB "zero page" :)


If you boot on actual hardware (no VM), if your application code + kernel (or unikernel) fits in under 192 megabytes, you are golden.


Actually, this is where 3d stacking changes the game. With 1 additional vertical layer, you can get double the memory at practically the same latency, as the source of latency is related to the wire distance (and capacitance) in the 2d plane. Latency can be lowered with 3d stacking as well if you reduce the 2d area of the memory array. This is how the industry will keep scaling going when 2d feature sizes can no longer be shrunk.
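A toy model of that argument (my own back-of-envelope, not from the article): in 2D the critical wire run grows with the side of the array, i.e. with the square root of capacity, while splitting the array across layers shrinks the footprint at the cost of a short vertical hop:

```python
import math

def wire_length_2d(capacity, base_capacity=1.0, base_side=1.0):
    """Worst-case wire run scales with the side of the array (~sqrt of area)."""
    return base_side * math.sqrt(capacity / base_capacity)

def wire_length_3d(capacity, layers, base_capacity=1.0, base_side=1.0,
                   layer_pitch=0.01):
    """Split the array across layers: the 2D footprint shrinks, plus a tiny
    vertical hop (TSVs are short compared to cross-die wires)."""
    side = base_side * math.sqrt(capacity / (base_capacity * layers))
    return side + layers * layer_pitch

# Doubling capacity in 2D stretches the critical wire by ~41%,
# while two stacked layers keep it essentially unchanged.
```

The `layer_pitch` constant is a stand-in for the die-to-die hop; the point is only that it is small relative to planar wire lengths.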


What's the limitation on adding more vertical layers, though? One obvious limit is that each additional layer makes the vertical wires longer.


Heat dissipation, although the power consumption of SRAM is mostly affected by I/O rather than capacity.


Two limitations:

1. Heat doesn't go away with layers and gets harder to cool.

2. These are still layers. You want each layer to be as flat as possible. As the layers deform, you lose the ability to cleanly add more layers.

Current lithography is very much based on the notion that everything is flat.


AMD's SenseMI uses a neural network to predict data fetching. That's quite the opposite of brute force.

They add massive caches as a side effect of having loads of cores, enabled by their chiplet designs.


Depends on your definition of "brute force", I guess. Between explicitly deriving an algorithm from an understanding of causal relationships vs. training a neural net on billions of samples without encoding any broader understanding of the world, the latter is the approach I would call "brute force".


Using NNs for cache management is not that new, and it's a simpler approach than Intel's, as far as I know.


While it would be nice, we would be adding a lot more heat into the CPU block. Maybe they could add extra cooling on the other side, though that would really mean reinventing the CPU socket; thoughts of Intel's Slot-1 start to become appealing again as a form factor that allows such cooling solutions.

Now what we have over the years seen interest in is adding processing cores to the RAM itself and maybe having a small dedicated processor for some tasks attached to the RAM may well prove viable.

Imagine if we didn't have separate CPU and RAM sockets/slots: just a row of slots where you add modules that each contain CPU and RAM, up to the slot limit. But then, that's kind of how GPUs have already gone in many respects; look at how much RAM they hold and how large their cooling solutions are. That gives you an idea of the cooling needed for large amounts of processing and RAM packaged closely together.


> Now what we have over the years seen interest in is adding processing cores to the RAM itself and maybe having a small dedicated processor for some tasks attached to the RAM may well prove viable.

I think we call these 'GPUs' today.

> just a row of slots you addedd a module that had CPU/RAM

As you mentioned, GPUs fit this description, but some PCIe devices are full-blown embedded systems.

I think we're at the point where the only major improvements will take place on the CPU die itself.


GPUs aren't quite the same, because they're (mostly) SIMD. GPUs don't have the independent per-compute-core memory bandwidth required for every core to be running its own execution thread with an independent instruction pointer, independent data fetches, and (especially) non-correlated branch-prediction failures / cache misses. (This is most of the reason that branching, if even implemented in the GPU's ISA, is effectively useless.)

Whereas, what the parent poster is describing would be true MIMD: a bunch of tiny cores each with its own on-board RAM, its own instruction pointer, and then a bus (or a bunch of busses) fast enough to feed them all data from (probably NUMA) main memory.

GPUs don't provide any advantage for running e.g. a high-concurrency Erlang application server. But a true-MIMD system would.
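The divergence cost described above can be sketched in plain Python: a SIMD-style machine evaluates both arms of a branch for every lane and keeps results by mask, while a MIMD core only executes the arm it actually takes. (A deliberately simplified model, not real GPU semantics.)

```python
def simd_branch(xs):
    """SIMD-style: every lane pays for BOTH branch arms; a mask picks results.
    This is roughly what GPU warps do on divergence."""
    mask = [x > 0 for x in xs]
    then_results = [x * 2 for x in xs]   # all lanes compute the 'then' arm
    else_results = [x - 1 for x in xs]   # ...and all lanes the 'else' arm
    return [t if m else e
            for m, t, e in zip(mask, then_results, else_results)]

def mimd_branch(xs):
    """MIMD-style: each 'core' follows its own instruction pointer and
    executes only the arm it takes."""
    return [x * 2 if x > 0 else x - 1 for x in xs]

# Same answers, very different cost models under divergence.
```

Both return the same values; the difference is that the SIMD version's work grows with the number of divergent paths, which is why independent per-core control flow matters for workloads like a high-concurrency application server.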


I'm confident we'll see memory-on-chip within the next few years. Actually isn't that already the case for mobile CPU's?


Main memory, if you mean DRAM, no.

DRAM is incompatible with the logic processes used to make CPUs, though attempts to work around that have been numerous. Every few years somebody comes along claiming a passable CMOS+DRAM process, but none has been adopted so far.


I work on ferroelectric hafnium, which looks promising with regards to CMOS compatibility. It's non-volatile, though I guess you could also use it like DRAM. Endurance is an issue for now though (it's approaching RAM use-cases, but not SRAM yet).

As you said, I'm not sure about the density part. Unless it's completely done in BEOL, I don't see designers trade precious chip real estate for memory (unless pin or power-limited, of course).


Hafnium? Is it something FRAM related?

What do you think on FRAM vs. next generation MRAM?


I don't know much about STT-MRAM, so I'll avoid making a fool of myself; looking into the details is on my TODO list. Look at the datasheets and make your choice, as usual. I'm not sure how the situation will evolve on either side; my understanding is that FRAM could be more energy-efficient.

Well, hafnium can be used to make ferroelectric crystals, which are necessary for ferroelectric memories (FRAM). The most used ferroelectric material is PZT, which contains lead and is a nightmare for CMOS compatibility due to contamination issues and processing temperatures.

It's a bit light on the details, but maybe I should point you towards the video that explains our project: https://www.youtube.com/watch?v=M8tL-nN7G-A

The most exciting part might not be the performance (though it looks good), but the way ferroelectrics can be used for new circuits (variable-threshold transistors thanks to FeFETs). Hafnium has been used in gate oxides for a few years now, so it's quite compatible with CMOS.


It's not actually incompatible within the same die; the IBM z15 even uses eDRAM for the L1 caches.


Not L1, but L2 and on.

A very curious case, and a very clever hackery with a custom SOI process. It's still very, very far from coming to mainstream chips.


It says L1 here

"Both it and all levels of cache in the main processor from level 1 use eDRAM, instead of the traditionally used SRAM."

https://en.m.wikipedia.org/wiki/IBM_z15_(microprocessor)



IBM's technical guide also says it's eDRAM

https://www.redbooks.ibm.com/redbooks/pdfs/sg248851.pdf


It's almost sad to see the x86s compared to the IBM mainframes...


What is the main hurdle? Why are these processes incompatible?



Genuinely curious, could you say why logic and dram on the same chip are incompatible?


For practical purposes, both DRAM and CMOS fabrication need a lot of steps, and both are already dialed close to the thermal limits of the materials.

So if you make the CMOS first and leave empty space nearby, protected by something, to add the DRAM later, it will be very hard to fit into the tiny remaining thermal budget, past which the CMOS devices turn into schmoo.


Basically, at the low nanometer range,[a] if you want on-chip DRAM, it’d be a separate die (“chiplet”) connected by the substrate

[a]: what’s the upper limit? As in, when does it not work anymore? For example, would it work at 100+ nm?


I am not a specialist, but I think I can guess.

First, you need a wafer on which you can make a capacitor, which already means an SOI process, and SOI wafers.

The lowest nodes with SOI available to mortals are 40nm, and 14nm on GlobalFoundries, but god knows how one gets GloFo to collaborate on that.

Then, you need a CMOS device with enough thermal budget to survive both its own creation and the DRAM's.

Third, I think on-die DRAM will only make sense once it wins over SRAM on density. DRAM cells cannot physically shrink below the minimum size of a trench capacitor. I believe at 5-7nm nodes, 6T SRAM will already be smaller per bit than a reasonably fast eDRAM.

We know that IBM's z15 is a 14nm FinFET chip, and it has DRAM on board, with them probably somehow doing the DRAM first.


That's because high-density logic and high-density DRAM require very different steps during wafer fabrication.


> I'm confident we'll see memory-on-chip within the next few years.

Depending on what you mean by "memory-on-chip" we've already seen this with AMD's Fiji (2015) & Vega (2017) line of GPUs which used on-package memory in the form of HBM/HBM2 memory ( https://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x... and https://www.anandtech.com/show/11717/the-amd-radeon-rx-vega-... ).

It was incredibly fast, but also very expensive and limiting in SKU configurations. The resulting 16GB SKU for the Vega 56/64, for example, made basically nobody happy. It was too much for gamers, who then didn't want to pay money for something that didn't help, and it wasn't enough for the professional crowd, who were getting used to 24GB offerings from Nvidia.

> Actually isn't that already the case for mobile CPU's?

Nope. They "just" replace the slot with solder more or less. It's still externally packaged DRAM modules.


Why are you confident about this? Semiconductor processes are complicated for a ton of reasons, and a process which is optimized for making main memory (DRAM) has some differences from a process optimized for making logic (CPU), and flash memory is different still.

I see AMD's effort as a way to sidestep the whole problem of putting memory and CPU on the same chip, by making separate chips and stacking them. It makes sense. You can get CPU packages with CPU + DRAM + Flash, but these are separate chips which are wired together inside the package.

E.g.

https://en.wikipedia.org/wiki/Package_on_a_package

The whole point of these systems is to avoid putting memory and CPU (and flash) on the same chip. AMD's version is smaller & more integrated than package-on-package but it still achieves the same goal: multiple chips.


You can find stacked chips pretty commonly, but it's effectively two chips within a single package.


It's actually already a thing, but production of the memory isn't easy, so it's not as cost-effective yet. HBM is used on Vega graphics cards, and GPUs are a type of processor.


Mobile SoCs like A14 have a Package-on-Package architecture with DRAM die stacked on top of CPU, but it is not the same die.


As I imagine it, I immediately wonder how to cool it.


RAM typically doesn't generate much heat.


But IIUC you would at least need to pass the heat of the CPU through the RAM. So if you want to keep a lot of memory near the CPU, you are at the very least adding some sort of "blanket" between the heat source and the heat sink.


Alternatively, you could put the CPU core on top of RAM, closest to the heat sink.

You also don't want to run your CPU at 80°C, or the RAM will produce many more errors. (ECC is a must anyway, I suppose.)


You can definitely do that, but now your RAM is on average twice as far from the CPU as if you put RAM on all sides. The best solution is probably some sort of hybrid where you put RAM on all sides, but more on the bottom, to balance the thermal effects and the latency.


Thermal resistance is linear wrt the thickness of the material, so if the layer is sufficiently thin, it doesn't form a significant obstacle.
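That claim is easy to sanity-check with the slab-conduction formula dT = P·t/(k·A). With illustrative numbers (100 W pushed through 20 µm of silicon spread over an 80 mm² die; these are my assumptions, not figures from AMD), the temperature drop across the extra layer is a fraction of a degree:

```python
def delta_t(power_w, thickness_m, conductivity_w_mk, area_m2):
    """Temperature drop across a uniform slab: dT = P * t / (k * A)."""
    return power_w * thickness_m / (conductivity_w_mk * area_m2)

# 100 W through a 20 um silicon layer over an 80 mm^2 die,
# with k_silicon ~ 150 W/(m*K):
dt = delta_t(100, 20e-6, 150, 80e-6)   # ~0.17 K
```

Hotspots and interface resistance make the real picture worse than this uniform-slab model, but it shows why a sufficiently thin die adds little thermal resistance.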


They also mentioned both dies are thinner so they are mechanically compatible with current heat spreaders.


On the other hand, 128MB is enough for my entire desktop environment and most of the (non-game) apps I use (obviously not at the same time.)


I absolutely love that AMD is making technology open and cross platform on the GPU end. They don't get enough credit for doing that.


I'm sure Intel and NVIDIA would also like to make more of their platform and driver-stack open, but they are bound by confidentiality clauses in third-party technology that they've licensed. That's part of why huge chunks of Windows aren't open-source: because Microsoft doesn't "own" all of it, there's a lot of third-party licensed code compiled from source in Windows.


Nah, Nvidia massively profits from the marketability of proprietary tech: HairWorks/GameWorks a couple of years back, DLSS now. It's super cool tech, but it's kind of sad that [speculation] it'll eventually be replaced by a slightly worse version that works on all cards, just because Nvidia wants to market its exclusivity while it lasts.


Nvidia likes to use GameWorks etc. as a weapon against AMD, like when they turned tessellation up to ridiculous levels because AMD's cards didn't have the hardware for it at the time.


Crysis 2 simulating an entire ocean to render a puddle comes to mind. Also tessellating perfectly level floors, IIRC.


That has been thoroughly debunked a million times over. The ocean only renders in wireframe mode because occlusion culling doesn't happen in wireframe mode.


Nvidia never did that though.


Not fully owning your own drivers/OS is something only a galaxy-brained MBA could think of.


Not necessarily.

Bryan Cantrill has talked extensively about how long it took Sun Microsystems to open-source Solaris; a lot of the delay came from code that had been outsourced because it was "boring" and basically non-core tech (the example was i18n/l10n).


You can outsource while retaining ownership so that you can opensource it later if you want. Just need to pay for it and add it to the contract.


And they, you know, couldn't demand rights to the code of their own OS? Sounds like Sun had some galaxy-brainers of their own.


Underrated comment :)


Intel works on developing Mesa, together with AMD and VMWare. Intel graphics work quite well on the open source graphics stack.


Intel aggressively pushes their chips into Mesa, likely since they use Linux as the testbed for unreleased chips in development

AMD is something like 10% of the total Linux kernel size now which is ridiculous. But their commitment in recent years is admirable as hell

NVIDIA need to pull their fucking head out and just start working on Nouveau


Most of it is header files full of redundant register names. It's pretty ridiculous.

The driver code itself also has lots of code duplication. It's just strange.


What's strange is why Linus allows that kind of code in the kernel.


I think there are three reasons why overall this is not that bad for the Linux code base:

- it is a driver, not a core module

- the constants are implementation details of the driver

- active maintenance of code is a necessary condition for inclusion in the Kernel. The other Kernel developers are not supposed to maintain and refactor code dumps.


Because done is better than perfect.

I've had enough issues with ideologues preventing any meaningful progress.


It's also not the worst way of making sure new chips get supported relatively quickly. Those "giant" header files are mechanically generated from the RTL source for each GPU; it's not like they're horrendously inefficient bloat from an outsourced developer who doesn't care about writing clean code.


NVIDIA doesn't give a crap about its driver stack being open source. In fact, they actively want it to be closed. They refuse to even release signed firmware so the community can build an independent driver.

Intel's stack is already pretty open on Linux.


Even with a proprietary driver they didn't have to kill OpenCL once they got the slightest edge on it with CUDA.


They didn't kill OpenCL. 2.0 was evidently unpopular with everyone; not sure why you're trying to pin it on Nvidia. And AFAIK Nvidia's only stated position on CUDA is that it's royalty-free.


Who killed OpenCL? OpenCL 3.0 is supported on Nvidia's latest GPUs.


Yes, now, after letting the whole ecosystem die by not supporting OpenCL 2.x for ~10 years, which in turn caused the whole ML field to turn into a Nvidia/CUDA monoculture.


OpenCL 2.x wasn’t well supported by others either, AMD’s implementation was pretty unusable and only had printf as a debug mechanism.

There was no hardware at all on which you could rely on it, on top of it being much worse than CUDA.


Going by wikipedia it took AMD almost two years to release an SDK with full OpenCL 2.0 support and Intel wasn't that much faster. Most of the Open Source implementations also seem to have died around the OpenCL 1.2 mark, with some having incomplete 2.0 support. So while you can probably blame NVIDIA for kicking a dead horse it looks as if something else got to it first.


I believe OpenCL 2.0 had a mandatory feature that nvidia couldn't support (maybe something about sharing pointers between GPU and CPU?). OpenCL 3.0 solved this by making many features optional.


I tried to create a ML framework[0] that would work on both CUDA and OpenCL (and natively on the CPU) around 2015/2016, which included creating FFI wrappers for both CUDA and OpenCL. This is where my experience on the subject (and my contempt for NVIDIA) comes from.

My memory isn't perfect, but IIRC the situation was roughly the following: we were quite short on resources (both dev time and money), which meant we had to choose our scope wisely. Optimally we would have implemented both CUDA and OpenCL 2.0, but we had to settle for OpenCL 1.2 (which offered reduced performance, but was "good enough" for inference). IIRC OpenCL 2.0 was very similar to CUDA at the time in what capabilities it assumed and offered, and cards like the GTX Titan X had "compute capabilities" in CUDA that supported features like shared virtual memory between CPU and GPU. In fact, the advances around memory management (and async copying) present in CUDA but not in OpenCL 1.x were the main source of the performance differences between the two.

From everything that I can tell at that point in time, if NVIDIA would have wanted to support OpenCL 2.0 they could have done so based on technical requirements. What the reason for not doing so is, is just pure speculation (lack of internal resources due to focusing on devtools?), but to me it always looked like they were using the edge they got via their proprietary libraries like cuDNN to get a foot into the field of ML and then purposefully neglected OpenCL to prevent any competitors from catching up. Classic Embrace, Extend, Extinguish.

[0]: https://github.com/autumnai/leaf


Maybe you're right, however OpenCL 2.2 came out 4 years ago and (almost) nobody has adopted it yet, so the problem could be with the spec. OpenCL 3.0 was adopted by nvidia (although OCL 3.0 is the same as 1.2 with some optional features added)

https://en.wikipedia.org/wiki/OpenCL#OpenCL_2.2_support


But on that very same Wikipedia page you can see it took less than a year for both Intel and AMD to release OpenCL 2.0 drivers, yet Nvidia didn't even start evaluating it for years.

Then again, it sounds like OpenCL 2.0 required some flexibility the nvidia drivers or hardware wasn't able to provide.

It's pretty hard to speculate to which degree nvidia was intentionally sandbagging here, and to which degree it really was stuck.

However, it's a member of khronos, and it's hard to believe they as such a major manufacturer could not have either said beforehand that the spec was a problem or simply complied with it; as both AMD and intel did.

Also, given CUDA's success it's all rather convenient for nVidia - at the very least it looks like they didn't mind leveraging their market position for continued market dominance, even if it's unclear whether that was an intentionally anti-competitive aim from the get go, or simply a fortunate happenstance they didn't try to avoid.

Then again; with antitrust enforcement mostly remaining in vaporware mode it's a little hard to blame them.


Presumably Intel or nVidia could buy out any confidentiality, after all it really is just about dollars at that level.


I'm not sure they have much choice, since they're not the ones driving the innovation.


Here comes another knockdown punch for Intel.

Now they've got a solution to add hundreds of megs of cache for cheap.

More importantly SRAM can be binned/tested/KGDed separately!

And it can be fabbed on its own customised process!

And also, they can throw in eDRAM there instead at a moment's notice.

And as they already have a silicon interposer here, adding HBM2/3 will also be a trifle.


Intel has been showing off a lot of 3D stacking tech and mixed dies on packages recently, too. This isn't really a knockdown punch; it'll depend on who can ship what and when, not just who can show off demos. They've both shown demos.

See Intel's Foveros https://www.anandtech.com/tag/foveros or Ponte Vecchio that they've been talking about for 2 years now ( https://www.anandtech.com/show/16453/intel-teases-ponte-vecc... )


Sure, but Intel hasn't even been able to ship 10nm or 7nm in meaningful quantities for years. I have my doubts that stacking is yielding well for them...


For mobile (the cash cow) the extra power of off-die communication kills it before it's even out of design.


selling mobile phones is a cash cow, but selling mobile chips really isn't, unless you are Qualcomm and have monopoly rights on the concept of a cellular radio.


Even selling mobile phones isn't the cash cow it was just three years ago. Demand is plateauing and people hold on to their phones longer, which makes perfect sense in a world where almost every mobile company is constantly inflating prices.

Plus there's an upper limit to the computing power that people need from their phones as well, so the sleek ads for phones simply don't work on many people nowadays. They are content with what they have.


It's not off-die, it's pretty much on top of it, and I believe they are using very wide, parallel I/O


cash cow or not, the world is calculating on servers


SRAM / for cheap?


Yes, it would certainly be slower than an on-die cache, but very cheap.

They could seriously reduce the L3 cache area, which is currently something like 50% of the die, or get rid of it altogether.

Imagine: two times more dies per wafer, at nearly no cost except the extra packaging.


Separating the SRAM cache would improve chip yield and so possibly decrease cost somewhat, but I don't believe it makes SRAM "very cheap" or "near no cost".


Look at how defect rates work:

If a particle kills a repairable part of the die (which can be downgraded or cut off), you still lose from a quarter to half of the die area, and end up with a second-grade chip.

But if you can make 2x the dies per wafer, even at the same defect rate (it will usually be lower), and even though a small die can be killed completely by a local defect, you still get many more perfect dies in total, which is what matters for your bottom line.

Standalone SRAM can be very repairable and high-yield with a custom process. Adding a few spare columns or SRAM banks should cover far more defects than binning done per entire CPU die.
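The yield argument can be made concrete with the simple Poisson defect model Y = exp(-D0·A). The defect density and die areas below are illustrative, not AMD's numbers:

```python
import math

def poisson_yield(defect_density_per_cm2, die_area_cm2):
    """Fraction of fully working dies under a simple Poisson defect model."""
    return math.exp(-defect_density_per_cm2 * die_area_cm2)

d0 = 0.5                        # defects per cm^2 (illustrative)
big = poisson_yield(d0, 1.6)    # one monolithic 160 mm^2 die: ~45% yield
half = poisson_yield(d0, 0.8)   # an 80 mm^2 die: ~67% yield each
```

Two half-size dies each yield better than one big die, and the standalone SRAM die can additionally repair defects with spare columns, pushing its effective yield higher still.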


This. A great way to reduce the cost of your chip is to add SRAM to it. I'm not joking.


I don't know if it's "cheap", but it's probably cheaper than the monolithic alternative, right?


How do the thermals work out? The compute dies are already one heck of a hotplate, putting an entire layer of insulating silicon on top isn't going to improve things?


Cutress addresses thermal design of TSVs vs Intel Foveros: https://youtu.be/FqmcWOVv2eY?t=383

There's additional silicon stiffeners which should help with thermal transfer, granted at lower thermal efficiency than a single element.

-----

Cooling? Well, I'm using a 5900X right now, air cooled, with a high airflow case loaded with slow spinning 140mm case fans.

In games at 1440p120 (maxed settings, with RTX) I'm hotspotting at -- wait for it -- about 68C. With most of the CPU at 60C. In CPU intensive applications, more like 74C hotspot, with most of the chip at 65-67C. That's a setup that's still inaudible to me at 1m distance.

I feel this is going to be a 5900XT and 5950XT. Fills a price gap between the higher end X desktop CPUs and Threadripper for the HEDT market. Great for reasonably priced dev desktops (without falling down the workstation rabbit hole), as compilers love cache.

... though next year with 64-core EPYCs at 5nm with 768MB of L3? Oh. Dear. What's in the Xeon pipeline that can attempt to compete with A CPU that dominates on PPW and will be neither cache-starved nor core-starved? I guess it'll fuel a lot of Optane 5800-series sales, as driving down IO latency to sub-10μs will matter more.


"On top" in a flip chip means on the side facing the package. The silicon substrate's closest layer is the bottom layer, which is the hottest and still the closest to the IHS.


Silicon is a crystal, and like metals it conducts heat very well.

Copper: 400 W/(m·K)
Aluminium: 230 W/(m·K)
Silicon: 150 W/(m·K)
Iron/Steel: 50 W/(m·K)


These devices are often thermally constrained as it is, so the question of how this (adding extra silicon on top of the existing die [1]) affects thermals is important. Also, neither copper nor silicon is responsible for the bulk of the heat transfer; the goal is always to get to a heatpipe or vapor chamber as quickly as possible, because those deliver well above 10,000 W/(m·K).

[1] Though Intel thinned their dies recently to improve thermal performance (CPUs are flip-chip, so the metal layers and active circuitry are facing the interposer). Some were concerned about stability/cracking of the thinner dies. Perhaps AMD is doing the same here, thinning the compute die, then stacking the memory die on top to end up with a stack that's exactly the same thickness as before. Since they're bonded together, the structural integrity should be similar. That additionally has the advantage that you can keep using the exact same IHS as before.


Updates from the article:

> - The processor with V-Cache is the same z-height as current Zen 3 products - both the core chiplet and the V-Cache are thinned to have an equal z-height as the IOD die for seamless integration

> - As the V-Cache is built over the L3 cache on the main CCX, it doesn't sit over any of the hotspots created by the cores and so thermal considerations are less of an issue. The support silicon above the cores is designed to be thermally efficient.


Silicon isn't that bad of a conductor, and given that it is very thin I'd guess it doesn't affect things too badly.


I recall years ago IBM had an experimental solution to that that involved liquid in between the layers to transport the heat (vague memory, I think they used an analogy to blood).

A quick googling brings up this article, which I think may be the same thing: https://www.bbc.com/news/science-environment-24571219


There are copper to copper connections in between dies.


I'm pretty sure the longevity of chips manufactured now is going to be a lot shorter than for older chips, even if you can run them at the same temperature!

Both lithography scaling and the desperate search for infinite growth are going to hurt in the long run!

I'm stocking up on 14nm Atoms that I can passively cool at 42 degrees Celsius at full blast on 8 cores, and on 50nm (100,000 writes per bit) X25-E SSDs!


> I'm pretty sure the longevity of chips manufactured now is going to be a lot shorter than for older chips if you can run them at the same temperature!

Don't be so sure. If that were the case, we'd be changing CPUs like early Seagate disks. We've observed no measurable longevity loss in the CPUs in our data center.

We use every system for ~7 years at full load.


Good server CPUs should survive 7 years of use without problems.

Nevertheless, their expected lifetime is not much greater than that.

I normally replace CPUs because they become obsolete much earlier than they should show age problems.

Nevertheless, I had a pair of Opterons that were used continuously for about 10 years.

At the end, they developed a very high leakage current, leading to an idle power consumption several times higher than in the beginning.

Increased leakage current is one of the most frequent signs of aging in semiconductor integrated circuits.


How do you assess that increased consumption?


Server BMCs provide power consumption timelines, measured in real time by the PSUs themselves. Processors can also assess their own power consumption and report it to the OS and the platform.

Querying and recording these as time series provides very nice insights over time.
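Deriving average power from those recordings is straightforward: most platforms expose a cumulative energy counter (e.g. Linux RAPL's `energy_uj` files under `/sys/class/powercap`), so watts are just delta-energy over delta-time. A minimal sketch with synthetic readings:

```python
def average_power_w(samples):
    """samples: list of (timestamp_s, energy_uj) pairs from a cumulative
    energy counter (e.g. Linux RAPL's energy_uj). Returns mean watts."""
    (t0, e0), (t1, e1) = samples[0], samples[-1]
    return (e1 - e0) / 1e6 / (t1 - t0)   # microjoules -> joules -> watts

# Two readings 10 s apart, 650 J consumed in between -> 65 W average:
readings = [(0.0, 1_000_000_000), (10.0, 1_650_000_000)]
```

A real logger would also handle counter wraparound, which this sketch ignores.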


Nice, I'll look into that. I log those but have never looked at them.


What temp you got on those machines?

Since you haven't tried the 7nm CPUs yet I guess we'll have to see?

7 years sounds really short to me, I'm planning on running my machines for 100 years!


Just looked at a bunch of Xeon Gold 6258Rs. They're running at around 70 degrees Celsius per die. The machine is under full load; I'm sure its fans are at full speed.

That CPU's reported temperatures are 81 for high, 91 for critical (in degrees Celsius).

We get new systems almost every year, so we have a rolling set consisting of many systems. That gives us a good cross-section of systems to observe.


Server parts are big chips (or a bunch of chiplets) running at fairly moderate clock speeds. For example your XCC CPU should have a die size in the 600-700 mm² neighborhood for a power density of around 0.3 W/mm² (which neglects hotspots because probably half the CPU is cache, which does not consume half the power). Desktop parts like the 10900K can chug down 200+ W on a 200 mm² die, though that's basically impossible to cool. Even more reasonable CPUs like AMD's 5000 series run something like 40-50 W through an 80 mm² die, these are still hard to cool.


This is a different and refreshing perspective. I've never looked at it from the W/mm² angle. OTOH, the number I've given is calculated by eyeballing the per-core internal thermistors, provided by lm_sensors (hence by the CPU itself).

So per-die sensors are probably reading somewhat lower numbers, but the cores are cooking at 70 degrees C internally. Nevertheless, a bigger die surface inevitably allows better heat conduction and considerably reduces internal stresses compared to a desktop part.


Desktop use is also very uneven, leading to a lot of large changes in temperature (35 -> 80 and back), which causes more material stress, than when everything is loaded evenly all the time. The same is true for electrical transients.


If you have a good cooler with lots of thermal capacity and good fan (like an Arctic Cooling or Noctua, nothing at the extreme end), the increase is very gradual. Also, short spikes in loads are well absorbed with minimal temperature changes.

Nevertheless the numbers you mention are neither unrealistic, nor impossible in stock cooling and/or sustained load scenarios.


A 5950X with a top-of-the-line Noctua cooler jumps from 35°C to 80°C very quickly when a single core is loaded at 100%. That's the worst case, because the boost to almost 5GHz applies maximum voltage to the CPU cores; loading all 16 cores at 100% instead results in just 60°C sustained over long periods of full load. So, interestingly, compiling the Linux kernel with 32 threads is less thermally straining on the CPU than browsing the web with a single page loading 10 different animated ads.


This mirrors my experience with Zen 2 and 3 as well. Multi-core loads result in just a few W per core and decent temperatures, while few or single core loads push per-core power into the 12-16 W region and the temperature rise/fall is basically instant (brick-wall at 10 Hz update rate, though it's unclear what filtering is applied and how the CPU derives a single temperature from its probably numerous sensors), which suggests that they're not limited by the cooler, but rather by the thermal impedance from the active area to the cooler. The core itself is really tiny (iirc around 4 mm²)...


This is my conundrum when pondering buying a Threadripper 3970x / 3975x Pro workstation: aren't those chips cooked very aggressively when loaded, even with very good coolers? Doesn't that mean the CPU might just burn out one day?

I really want to buy a proper serious workstation and I can afford it in a few months but I keep wondering: is now the best time for it?


I wouldn't really worry about it. I have an overclocked 3970x and a Noctua air cooler keeps it much cooler than the Intel 6950X I had before (also aircooled, with an even larger Noctua cooler). Air cooling is more of a limit to how high an overclock you can get; normal operation will be no problem for any cooling system designed for Threadripper.

There's never a good time to buy a new computer. It's always going to be obsolete before you open the box. Though in this case, one might wait for the Zen 3 Threadrippers.


Thanks for the feedback, it's reassuring. As other posters said, I am worried because it's a workstation, which means it won't have a consistent load, and material stress from repeated thermal expansion and contraction is much more destructive. So if, in your ~8h workday, your station gets stressed for 2 or so hours at irregular intervals, how destructive is that, objectively?

--

As for the Zen 3 TRs, well, to be fair, I am not looking to blow $50k on a workstation. :) I am more interested how -- and if -- they will drive down the prices of the TR 3900 SKUs...


It probably won't drive down prices that much. You won't want the old thing, and the incremental cost between generations is not the expensive part of a workstation. I have bought old parts before (to replace broken parts), and my experience is that the price didn't change much. What was a $130 motherboard when brand new, is $130 when a generation old. Does it make any sense? Nope. But that seems to be how it is.

You won't spend $50,000 on a workstation just by using current-generation parts. I think when you see a workstation that costs that much, it's because it has multiple GPUs in it. Pro GPUs are always artificially overpriced, and given the GPU shortage, they're now even more overpriced.

I did a quick pcpartpicker expedition and found that using last-gen parts saves you about $1000 on a $6000 32 core Threadripper workstation. I compared last-gen SSDs, consumer GPUs, processor, and motherboard, and picked relatively high-end parts. You will save more money by dropping to 24 (or heaven forbid, 16) cores, not getting an extreme motherboard, getting 64G of RAM instead of 128, etc.

This could all be invalid in a few months. It is hard to separate "market is always crazy" from "this whole COVID thing is going on". Building a workstation during the pandemic was a pain -- I bought a used GPU, and didn't get ECC memory because nobody would sell me any. If you wait a year, that is likely to improve, and there will be newer hardware. But, if you need to do some computing between now and then, you don't have much choice but to buy what's available now, and it certainly makes for a very good computer.


Thanks for the check! Much appreciated, it did put my mind at ease.

I realized I was looking at several custom-made TR Pro workstations, and those of course carry a hefty markup on top: custom cases, custom cooling, several pre-added PCIe NVMe riser cards, plus a premium price tag, I suppose. Should have looked at PCPartPicker indeed.

And yeah, last-gen tech very rarely gets discounted, even by normal people who just post ads on local Craigslist-like websites (like OLX). Stuff that's 2 or more generations old is discounted, but not last gen. Puzzling indeed, especially bearing in mind that this last-gen tech is very soon going to be "two gens ago". Oh well.

As for ECC RAM, I hear you. I was unable to find any officially, but I lucked out: one guy on the local OLX had loads of it but just hadn't posted the ads due to being very busy (we were in touch because of other ads), and I bought 64GB of ECC DDR3 RAM from him for a home NAS that I'm gradually expanding.

I'm definitely very interested in having a TR Pro workstation; started getting sick of Macs and their artificial slowness. I got the iMac Pro and granted, I have it connected to my TV where it plays Twitch streams all day but hell, a lot of stuff on the terminal (that's not Python) just works slower than it does on a meager i3 Linux machine that I have lying around.

So I do want a TR Pro machine but I am very curious about TR 5000. A release is expected in August which isn't that far away. On the other hand, the TR Pro 5000 might take several more months on top. Hmmm. Decisions, decisions. :)


I wouldn't focus too much on the Pro SKU. It has a slower base frequency and boost frequency, with the upside that it has 8 memory channels and supports 2TB of RAM. If you need 2TB of RAM and 8 channels, it's what you need, but quad channel memory is still very good. Most desktop-class machines are still 2 channel.

There is something to be said for the prebuilt workstations from reputable vendors. It shows up in a box, and starts working. I have strongly considered that angle; being a system integrator is tough. If something doesn't work, it's a week cycle time where you find a new part, organize the RMA of the faulty part, etc. and there is a potentially unbounded amount of time spent tweaking. Meanwhile, HP or Lenovo just ships you a computer; they tested it and it works. You pay several thousand dollars for the privilege, but it might be worth it. (And if you want a TR Pro, you have no choice. AMD doesn't sell them to consumers.)


My rule of thumb is "if hardware is too hot for you to hold your finger on it indefinitely" it's too hot and it's going to break very soon!

Temperature kills hardware, you need to bring those temperatures down, and the only way to do that is to lower the wattage!

Atom is the perfect design, no crap and low power!


> "if hardware is too hot for you to hold your finger on it indefinitely" it's too hot and it's going to break very soon!

From my experience, that doesn't happen like that in enterprise hardware. Either wrong voltage (inside the system) or defective design causes premature death. If the server BMC says it's fine, it's fine.

> Temperature kills hardware, you need to bring those temperatures down, and the only way to do that is to lower the wattage!

High Performance Computing doesn't work like that, unfortunately :)

> Atom is the perfect design, no crap and low power!

I'm sure it has its own uses and can accomplish a lot, but in HPC, it won't cut it. I use small SBCs at home to do and try a lot of fun and useful stuff, but it has limits.


Your rule of thumb is wrong. The thing that kills hardware is temperature fluctuations. Keeping a processor running consistently at 90 degrees is much better than switching it off and on all the time. Running a CPU at full load has a very minor impact on its mechanical properties, even after running for 10 years. It's much less than storing it in your drawer for the same amount of time.


Temperature fluctuations damage the interfaces between the different materials inside an integrated circuit, so they will greatly lower the lifetime.

Nonetheless, there is also continuous aging of the metal traces (electromigration) and especially of the insulating layers, e.g. the MOS transistor gates, due to the diffusion of atoms.

This continuous aging is accelerated by steady-state high temperatures and it eventually results in either open circuits or short circuits somewhere, destroying the device.

Good MOS integrated circuits are designed for a lifetime at their maximum specified temperature of at least 10 years or even 20 years or more for the better of them.

Nevertheless, this is puny in comparison with the lifetime of many semiconductor components produced 40-50 years ago, before the continuous shrinking of the active device sizes, which could have lifetimes of hundreds of years, when free from fabrication defects.


Yes, so I'm betting on these 14nm Atom boards because they are very performant per watt (I get twice the server juice out of these at 25W than out of my 6600 desktop at 65W!) and I think they will be more reliable at 42 Celsius full blast than the still non-existent 10nm or smaller parts.

I even think, like you, that my 45nm D510MO will probably outlive these! But it's so underpowered compared to these (like 1/10 of the perf. at 15W) that I'm willing to take the risk!

The risk that new boards will ever be so much better that I'll have to throw these away is zero at this point, memory being the bottleneck!


> Temperature kills hardware, you need to bring those temperatures down, and the only way to do that is to lower the wattage!

The best way to keep your CPU intact for a long time is keep it powered off. But then you have probably no use for this CPU…

> it's too hot and it's going to break very soon!

“Very soon” on your scale from now to 100 years, probably. But most people prefer having a CPU working at full capacity for a few years over having a useless brick of silicon sitting around for a hundred years.


> "if hardware is too hot for you to hold your finger on it indefinitely" it's too hot and it's going to break very soon!

Are you talking about junction temperature, package temp, or heatsink temp? The only CPU I own with a package temp that's cold enough to touch is inside my phone.

Meanwhile some of my data center machines are a decade old and run at 75C all day. I've never had a CPU fail before the machine became obsolete


The Linux sensors command, wherever that reads from.


That data can't really be trusted, because various sensors often have large offsets or are just bogus on specific hardware. E.g. lm-sensors reports both a sensor with 0 °C and one that's usually something like 90-100 °C on my desktop, while also misreporting the CPU temperature due to not taking the AMD offset into account (might be patched in recent versions).


In terms of Intel and/or enterprise hardware, the values are reliable. You can cross-check them via IPMI Sensors (provided by BMC itself) or Intel powertop (by crosschecking cpu throttle states). Intel also writes relevant support code (cpufreqd, thermals, etc.) themselves.

The biggest offender in my experience is OrangePi Zero, but I need a thermal probe to verify it.


>if hardware is too hot for you to hold your finger on it indefinitely

That's 50°C (a bit less)... which is quite a low operating temperature and very far from anything that damages silicon. Temperature fluctuations are far worse, due to thermal expansion/contraction.


Does temperature kill hardware? I wasn't very careful with my first laptop (a 2007 model with a dual-core CPU and a discrete GPU), and often used to run it on my lap with the fans blocked. It regularly got up to a sustained temperature of 96°C. And it ran fine for 5 years (used for several hours most days) before the screen gave out. It actually still runs fine now with an external display, although it rarely gets used.


> "if hardware is too hot for you to hold your finger on it indefinitely"

That's only like ~45 °C or so.


More like 48, but yeah.


Enterprise networking chips happily run 80C+ junction temperature (much too hot to touch) for the part lifetime - which is 10+ years.

There are plenty of things that damage chips, especially those at 7nm and beyond but this too-hot-to-touch "rule" is rubbish.


Is this an exaggeration, or do you genuinely want to run something on the same hardware for the next 100 years?

If so, why? I can't imagine a workload that is so unchangeable in its nature, especially considering the world we live in and how often storage media changes that I'd want to plan ahead for 100 years.


A museum obviously


Things aren't going to change as fast from now on.

My drives are from 2011, peak SSD at 50nm (100,000 writes per bit).

My CPU/motherboards are from 2017, I think 2Gflops/watt is about where peak CPU is at.

Time will tell... but there is no point waiting for improvements now, time to go all in!


Why would that be “peak SSD” or “peak CPU”? Your CPU is already way behind the latest processors in both performance and performance per watt.


> I'm planning on running my machines for 100 years!

Heh, that's an interesting thought that I've also sometimes had.

For doing something like that, the hardware would need to be way more resilient and have failover for various components: for example, RAID for the HDDs/SSDs and some mechanism for clustering and failover for the apps and other stuff you'd want to run on them. But even with those in place, I'm not sure the hardware that's available to us wouldn't just die long before the 100-year mark.

Anyone have any idea what the oldest computers presently in continuous use are? The best I could find was this, but it was just turned on once rather than working continuously: https://www.smithsonianmag.com/smart-news/watch-the-worlds-o... Apart from that, all I can think of are mainframes and such.

I really doubt that any piece of currently modern hardware could last 100 years without becoming some hard to understand mess that's incredibly out of touch with the OSes and paradigms of the future. What would you even run on it? Debian? FreeBSD? Haiku? Would there be anyone to debug Python 2 or Python 3 in 100 years? What about Java, .NET, PHP, Golang, Rust, C++ or even C? Considering that many of those ecosystems are more and more migrating to integration with the Internet, especially for the dependencies, what are the chances that any of the software will survive that long?


I'm going to not use RAID and instead mount the drives "manually" (via fstab), because I've heard some nasty things about RAID.

Instead I made my own async. distributed database: http://root.rupy.se

That way I can just fix the drives as they fail.

As for OS/languages, I'm 100% sure Linux + Java is the final server platform for eternity.

Nothing has improved in 5+ years, since NIO got epoll!

io_uring and potentially user-space network/disk stuff to work around kernel overhead (in my case ~30% of the CPU) might happen, but also might not...


> Anyone have any idea what are the oldest computers presently in continuous use?

Voyager 1 and 2 are good candidates. In the end, they'll probably shut off because their generators won't provide enough power to operate their computers after 50 years or so.



I still have a Pentium 4 running at one location. Sometimes the temperature is scary to look at, but it doesn't care. Those things were made to last.


What temp. you get?


The wear on most electronics is more dependent on thermal __cycling__ than it is on temperature, assuming the part is run within its operating temperature regime.


Got any specs on those Atoms? A passively cooled homelab build sounds pretty interesting...


I have [1], which has a similar chip to what the parent described. It's a fun little computer.

[1] https://www.newegg.com/supermicro-mbd-a1sri-2758f-o-intel-at...


I'm planning on building a mini-ITX Atom-based system in the future either with a Jasper Lake (Pentium N6005) or a future Alder Lake-L CPU. ASRock makes mini-ITX motherboards with integrated Atom CPUs that are a bit more modern than Supermicro offerings.


I'm not waiting for an eventual tiny improvement in Gflops/watt... the SuperMicro boards are really industrial, and the Atom line is being cancelled in favor of a low-wattage version of the consumer chips!

I think these boards are the first and last to be able to run 100 years! Earlier models consume too much energy and subsequent will be too fragile/complex!


It's the SuperMicro ones from 2017.


Thermal resistance is linear in thickness, so if the Si on top is very thin, then perhaps it doesn't matter too much?
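Back-of-envelope version of that intuition (all numbers are illustrative assumptions: bulk silicon conductivity around 130 W/(m·K), a 20 µm thinned layer, an 80 mm² chiplet):

```python
# Conductive temperature drop across a thinned silicon layer.
# R_th = t / (k * A): thermal resistance of a slab in the stacking direction.
def theta_si(thickness_m, area_m2, k_si=130.0):
    """Thermal resistance (K/W) of a silicon slab; k_si is an assumed bulk value."""
    return thickness_m / (k_si * area_m2)

r = theta_si(20e-6, 80e-6)   # assumed 20 um layer over an assumed 80 mm^2 chiplet
delta_t = 50.0 * r           # temperature drop with ~50 W flowing through the slab
print(f"R_th = {r:.2e} K/W, delta T = {delta_t:.3f} K")
```

Under those assumptions the drop across the thinned layer itself comes out at a tiny fraction of a degree, so the slab's bulk resistance isn't the issue; the interfaces and hotspots are where the real penalty would be.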


They are not putting a layer of SRAM on top of the Core Hotplate. They are putting the SRAM on top of the SRAM section.


So the next logical step would be to remove the L3 cache from the compute chiplet altogether? That would let AMD either save money, since the chiplet is smaller, or add more logic in the same die space.

This could also mean a GPU chiplet on package with compute. Each chiplet gets at least one cache layer. The next few years could be pretty crazy.


Latency is still unclear, and probably worse than L3.


If you're on a high end chip then you already have to talk to the other chiplet for half your L3. A stacked chip should be a lot faster than that.


The additional L3 is per chiplet, so you're still going to have to talk to other chiplets and their L3. Latency numbers would be good to see. This is definitely better than eDRAM L4 on the I/O die, and that's still something they could do, so props for that.

The power cost will really determine if we see this in Laptops or not though.


> The additional L3 is per chiplet, so you're still going to have to talk to other chiplets and their L3.

You will sometimes, but traffic that still overflows to the other chiplet is probably almost all traffic that would have gone to RAM before.

And as a reference point, cross-chiplet L3 is about 50 nanoseconds slower than local L3. That will dwarf most things.


Going over TSV will definitely incur at least one clock cycle.


1. Latency (one layer) is the same, as updated in the article.

2. You will still (for now) need L3 cache on the chiplet where you have layers of SRAM on top.



Smart matter! So far we've only really been able to make smart paper (all chips are basically 2D). I remember reading in Stephenson's Diamond Age about a brick of compute, when I understood what it was it blew my mind...


This is packaging; the chips are still flat (though they do have several layers).

A solid brick of compute would have to run pretty slow; flat chips have the surface area for heat removal.


> A solid brick of compute would have to run pretty slow

That's not necessarily a problem if the cost of fabbing logic drops enough. It could actually make a lot of sense: throw more and more transistors at problems higher up the stack, but use them at a lower duty cycle to avoid thermal issues. We may end up living in a world built of logic bricks, gently throwing off heat and solved problems while hiding as structural or aesthetic components.


> A solid brick of compute would have to run pretty slow, square chips have the surface area for heat removal.

If we’re talking science fiction anyway, could it be a room-temperature superconductor? Or would it be impossible to build logic out of that?


Even if it's very conductive, it still has to dump all that heat somewhere, so it will get very hot or just melt/burn eventually.


Unless it's doing adiabatic computing, of course. When computations are reversible, you don't have to waste energy to destroy bits.

https://en.wikipedia.org/wiki/Reversible_computing#Reversibi...


A superconductor is a material with zero resistance, so it does not generate heat with electrical flow.


But the electrical paths can be made ultra short if arranged properly, resulting in less heat to dispose of. Designed properly, it could probably do the exact same thing with far shorter wire lengths, resulting in higher power efficiency.


Theoretically, could we run small heatpipes through chips to cool them down?


It's already proven to work[0], now we just have to cross our fingers that no big corp makes it super expensive for the masses.

[0]: https://www.youtube.com/watch?v=YdUgHxxVZcU


I believe there is active research in microfluidics to do exactly that.


Kinda like magic crystals in fantasy stories. With the complexity of modern processors, they are basically slab of magic stones to most people anyway.


That's unfair to Stephenson. The computer in his story was based on real design sketches of what you could do if you had atomically-precise manufacturing with an integrated liquid cooling network. (He did make the cooling network jet steam out at the end to go with the story's steampunk vibe -- that part seemed unrealistic.)


Many chips today are multilayer (many sheets of paper)


Multilayer interconnect or multiple layers of transistors?


Multiple layers of transistors - at least for RAM (HBM). Multiple logic layers are coming really soon.


Many chips as in flash (and recently also DRAM), but stacking logic is fairly new.


Exactly! Stacked logic is the key to the fabled smart matter. Heat dissipation is going to be an issue, although I think IBM had some ideas on microfluidics. Also, there's been a ton of work on dilution fridges for quantum chips, and some startups are doing cryo logic; these two could complement each other.


This is just an aside, but almost every post about writing high-performance code stresses the importance of data locality: related data must be packed together in memory, and a cache miss that goes to main memory costs thousands of instructions' worth of execution time. This implies that languages with tons of pointer indirection, like Java, will never be as fast as well-written C++ code.

I wonder how the continuous growth of caches in the recent past, as well as this approach will affect this performance requirement.


The general rule of thumb in software performance is that the larger the scale of the codebase is, the more the "hotspots" will matter.

The CPU word size, number of registers, memory sizes and cache all matter to this since they determine the "base unit size" that the hotspot code could address without dropping down a layer.

Pointer-chasing code can be slow, but if the entire structure fits in cache, it won't be that slow. It gets slow mostly because the CPU can't predict what's needed next and keep the pipeline fed. And if you were to have a pointer structure that also existed in memory mostly-linearly, it would most likely have performance characteristics similar to a flat array. At scale, the data format really is the bottleneck to performance - shaving off a few bytes with better packing can make the difference, and in this respect, yes, Java is going to impose some limits via lack of control.

That said, as a society we still process petabytes of JSON daily. We aren't constantly doing high-performance things - inputting and editing data and making it legible to humans are tasks that have the natural result of putting some "slack" and redundancy into data. Games tend to get results fast by "cheating" and turning the full-fat editable data into a lean, single-purpose rendering structure.
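A minimal sketch of the two layouts being discussed: array-of-structs (each record is a separately allocated object reached through a pointer) versus struct-of-arrays (each field is one contiguous buffer). The dramatic speedup only shows up in compiled languages where cache lines dominate; in Python the interpreter overhead hides it, so this only illustrates the shapes.

```python
import array

# Array-of-structs: each record is its own heap object reached via a reference,
# so a traversal hops around memory (the "pointer chasing" pattern).
class Particle:
    __slots__ = ("x", "mass")
    def __init__(self, x, mass):
        self.x, self.mass = x, mass

particles = [Particle(float(i), 1.0) for i in range(1000)]
aos_total = sum(p.x * p.mass for p in particles)

# Struct-of-arrays: each field is one contiguous buffer of machine doubles,
# so a scan walks memory linearly and every cache line fetched is fully used.
xs = array.array("d", (float(i) for i in range(1000)))
masses = array.array("d", (1.0 for _ in range(1000)))
soa_total = sum(x * m for x, m in zip(xs, masses))

assert aos_total == soa_total  # same answer, very different memory traffic
```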


Have they grown? I don't recall that L1 has grown much, and the bigger caches likely have more latency. AMD has some big L3s lately, but their L3 pays for it with a bit more latency.

These big caches will help indirect code be a few percent less bad, but it will still be far better to keep things in L1.

Now, maybe one day, with some unforeseen development, linked lists will be cool again.


L1, not so much, but comparing my old Haswell and my new Zen 2 processor, the L2 size has doubled, and the L3 has increased from 6MB to a whopping 32MB.


Right, but I bet the ratios of core cycle time to L2 and L3 latency have both gotten worse with that growth.


I checked, and it's about the same according to these: https://www.7-cpu.com/cpu/Haswell.html https://www.7-cpu.com/cpu/Zen.html

I think latency has a lot to do with the speed of light - since both of these CPUs run at similar frequencies, you can cover more SRAM cells worth of distance in the one with a denser process geometry, hence you can have access to a larger cache with the same latency.
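A rough sanity check on the speed-of-light argument (the 0.5c effective on-chip signal speed and the 40-cycle L3 latency below are assumed ballpark figures, not measured values; real L3 latency is also dominated by SRAM array access and arbitration, not just wire delay):

```python
# How far can a signal travel per clock cycle?
C = 299_792_458.0              # speed of light in vacuum, m/s
FREQ = 4e9                     # assumed 4 GHz core clock
cycle_s = 1.0 / FREQ
vacuum_mm = C * cycle_s * 1e3  # distance light covers in one cycle, in mm
on_chip_mm = vacuum_mm * 0.5   # assumed ~0.5c effective signal propagation
l3_cycles = 40                 # assumed ballpark L3 hit latency
print(f"~{vacuum_mm:.0f} mm/cycle in vacuum, ~{on_chip_mm:.0f} mm/cycle on chip")
print(f"~{on_chip_mm * l3_cycles:.0f} mm of wire budget within an L3 hit")
```

Even one cycle covers tens of millimeters, i.e. more than a whole die, which supports the point that a denser process lets you reach more SRAM within the same cycle budget.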


The speed of light is the absolute limit, but it isn't really the binding constraint in chip-level design. IIRC, it's more about capacitance and leakage current.

The less capacitance (and 3d finfets / other innovations reduce capacitance), the fewer electrons it will take to "turn on" or "turn off" a transistor.

That's why modern process designers are trying more and more exotic shapes to reduce capacitance and reduce the number of electrons that need to be moved anywhere for the "on-off" toggle to happen.

Leakage current is the number of electrons that "leak" even in the off state (and all CMOS designs are statically in an off state: even at 5V / on, there's a complementary transistor somewhere that is off and theoretically prevents current from flowing).

As such, the two sources of current (and therefore power usage) are capacitance (the number of electrons you need to move to turn a transistor from on to off, or vice versa). Or leakage (the number of electrons that flow despite the transistor being nominally off).

---------

The lower the leakage, the less power is used. The lower the capacitance, the less power is used AND the quicker the transistor turns on / off (because fewer electrons have to move, so the switch goes from .25 nanoseconds to .1 nanoseconds or whatever).
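The capacitance part of this is usually summarized by the standard dynamic-power relation P ≈ α·C·V²·f. A quick sketch with purely illustrative numbers (activity factor, capacitance, voltage, and frequency below are all assumed, not taken from any real chip):

```python
# CMOS dynamic (switching) power: P = alpha * C * V^2 * f,
# where alpha = activity factor, C = switched capacitance,
# V = supply voltage, f = clock frequency.
def dynamic_power(alpha, c_farads, v_volts, f_hz):
    return alpha * c_farads * v_volts ** 2 * f_hz

p_base = dynamic_power(0.1, 1e-9, 1.2, 4e9)      # assumed baseline
p_low_c = dynamic_power(0.1, 0.5e-9, 1.2, 4e9)   # halve the switched capacitance
p_low_v = dynamic_power(0.1, 1e-9, 1.08, 4e9)    # drop supply voltage by 10%
print(f"baseline {p_base:.3f} W, low-C {p_low_c:.3f} W, low-V {p_low_v:.3f} W")
```

Halving capacitance halves switching power (and speeds up the transition), while a 10% voltage drop alone saves about 19% thanks to the V² term, which is why process designers chase both.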


Talk to game developers: game code lives or dies by its ability to stay resident in cache.


Data locality is just one of the things needed for writing high-performance code. Another is having more development time for performance optimizations. C++ developers spend a significant amount of time on manual-memory-management-related bugs; developers in languages without manual memory management can spend that time on performance optimizations.

And regarding "Java will never be as fast as well-written C++ code": since Java is getting inline/value/data classes [1], it will also benefit from data locality.

[1] https://openjdk.java.net/jeps/169


Side question: is there a way to write code so that I'm sure my code and data are stored in the L3 cache? I understand it's filled automatically from RAM, but what if I want to hack it so that it's guaranteed to be in cache? What about the L2 cache?


Yes, kind of. It's called data-oriented design: https://en.wikipedia.org/wiki/Data-oriented_design


Seems related to Entity Component Systems.


Make the next piece of data you touch after an operation close to the one you're touching (statistically, on average).

The introductory talk on this is Mike Acton's at CppCon, but Jon Blow has been harping on it as well.
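One common trick from that school of thought (a sketch of the general technique, not taken from either talk): keep "linked" nodes in one contiguous pool and link by index instead of by pointer, so traversal order roughly matches memory order.

```python
# A linked list stored as parallel arrays in one contiguous pool.
# Nodes are linked by integer index, so an append-only list is laid out
# in exactly the order it is traversed: friendly to caches and prefetchers.
class IndexList:
    def __init__(self):
        self.values = []   # contiguous payload pool
        self.next = []     # next[i] = index of the following node, or -1 for end

    def append(self, v):
        if self.values:
            self.next[-1] = len(self.values)  # link previous tail to the new node
        self.values.append(v)
        self.next.append(-1)

    def traverse(self):
        i = 0 if self.values else -1
        while i != -1:
            yield self.values[i]
            i = self.next[i]

lst = IndexList()
for v in (3, 1, 4, 1, 5):
    lst.append(v)
assert list(lst.traverse()) == [3, 1, 4, 1, 5]
```

In Python the payloads are still boxed, so this is only a sketch of the shape; in C, C++, or Rust the same pool-plus-indices layout genuinely turns pointer chasing into near-linear scans.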


Yes, but I want it guaranteed, not statistical.

E.g. I want to run Quake1 right from the L3 cache.

E.g. with a filesystem, I'm sure data is persisted when I read or write it, as long as I use regular IO operations and no mmap-ing (of course there are a lot of quirks with drive caches not respecting fsync, etc., but let's skip that for the sake of this comment).


Intel has extensions to let you encourage that to happen: CAT and CDP. However, they require OS support.

It's a common feature on embedded processors, where portability is less of a concern.


So AMD doesn't have such commands?


The way I read this tweet (the author works for AMD), probably yes:

https://twitter.com/Underfox3/status/1399589740094099456


I honestly don't know; I'm not as familiar with AMD.


Linux surely must support that, right?


A quick search makes me think it's called "pseudo locking" but I couldn't find documentation on how to use it from userspace.


You'd probably need a VM professional to answer this. AFAIK, most dynamic-language VM optimization basically comes down to prediction and cache hits.

Would 512MB of cache, or say a 4GB cache, dramatically change the performance of these dynamic languages? It still wouldn't be C-like performance, but instead of the current 10-20x slowdown for Python or Ruby, would it be closer to 3-5x?

Or would it not make much difference at all? I could imagine that in the future there will be a system-level cache built on top of the IOD on EPYC, with 4-8GB of capacity.


I don't understand. Is it just cache put above the compute chip (so "layered", with the connections still just 2D), or is it really 3D-interconnected?


3D interconnected. Only 2 layers so perhaps more "2.5D" but they are different chips (potentially) from different wafers or processes.


I think the most interesting thing about this is combining a 7nm process for the vast L3 cache expansion with presumably 3-5nm processes for the compute dies. Cheap silicon enhancing expensive silicon with high-bandwidth interconnects.


Does anyone here have experience with the Xeon Phi (Knights Landing) processor, which had 16GB of L4 cache? Was it useful? Too bad Intel botched their process tech and this line of processors had to be stopped.


> processor which had 16GB L4 cache

That was HMC (a competitor to HBM), a DRAM technology that IIRC was slightly higher latency than the other DRAM on their board. So it was kind of hard to use. HMC had more bandwidth but worse latency, so it was only sometimes faster.

This V-Cache is SRAM, and therefore likely to be lower-latency than any DRAM technology, and therefore actually useful as a cache.

I don't have experience with Xeon Phi. I just remember people talking about it back then.


Maybe the OS core/kernel could live on the CPU, be able to make live optimizations, and have its own assembly code...


Not sure why you are downvoted, IMO this can really help with security in server environments -- or help some companies close down their kernel even more (like Google's Android closed-source extensions).


How do you get more bits per second than there are bits?

Can you do many writes to the same byte?


It's a rate, so the same way you can drive 60 miles per hour between two points that are less than 60 miles apart.
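Worked out with the figures from the headline (192 MB at 2 TB/s; whether "MB" and "TB" are decimal or binary here is an assumption, so treat this as order-of-magnitude):

```python
# Bandwidth is a rate: the same bits can be read over and over.
# 2 TB/s into a 192 MiB cache means the entire cache contents
# could be swept roughly ten thousand times every second.
BANDWIDTH_B_PER_S = 2e12        # 2 TB/s, taken as decimal bytes
CACHE_BYTES = 192 * 2 ** 20     # 192 MiB, taken as binary
sweeps_per_second = BANDWIDTH_B_PER_S / CACHE_BYTES
print(f"~{sweeps_per_second:,.0f} full cache sweeps per second")
```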



