Imagine 32GB of this V-Cache instead of RAM. Does each additional layer add latency? If so, I wonder how many layers you could have until it would reach RAM-equivalent latencies. Also, it would require a compiler change if you had 500 sections of RAM (at 64MB per section) with latency increases for each section.
In short, yes, higher capacity will involve higher latency.
The thing here is balancing core speed vs “effective” memory latency (factoring in cache hits, misses, etc).
Cache management is a hard problem, and the equilibrium point is load dependent (i.e. depends on what type of program you use).
AMD has been smart enough to understand that sometimes it’s better to just brute force your way in (with higher cache sizes) than being super clever about how you handle it.
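That balance can be sketched with the textbook average-memory-access-time (AMAT) model. All cycle counts and miss rates below are illustrative assumptions, not measurements of any real part:

```python
# Textbook average-memory-access-time (AMAT) model of "effective" latency.
# All cycle counts and miss rates are illustrative assumptions.

def amat(hit_time, miss_rate, miss_penalty):
    # effective latency = hit time + miss rate * miss penalty
    return hit_time + miss_rate * miss_penalty

# A bigger L3 costs a few extra cycles per hit, but if it converts enough
# misses (each costing a trip to DRAM) into hits, the net effect wins.
small_l3 = amat(hit_time=40, miss_rate=0.20, miss_penalty=300)  # cycles
big_l3   = amat(hit_time=46, miss_rate=0.10, miss_penalty=300)

print(small_l3)  # 100.0
print(big_l3)    # 76.0 - slower cache, faster "effective" memory
```

Which side wins is exactly the load-dependent part: a workload whose working set already fits the smaller cache gains nothing from the extra capacity and only pays the extra hit latency.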
At 192MB, just give me explicit access to this cache as memory and I’ll decide what goes into it. No need to try to predict what my code will want, I can just tell you.
That's not a great solution. My working data set is likely larger than 192MB. But my application is much more likely to know what data is in the hot path than the CPU is to guess it. I might want to put my DB index into this cache, or I might want to put only a part of it. I might want to preload the data, work on it for a bit, then load the next chunk. I might want one core preparing the data in RAM while another core gets ready to work on it in this cache. Essentially I want full access to it, independent of main RAM, because I can do a lot more with that. Think of this as the difference between RAM and disk: is RAM only useful as a cache for what's on disk?
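The preload-then-work pattern described here is essentially double buffering. A minimal sketch, with toy stand-ins for the load and compute steps and a thread playing the role of the second core:

```python
import threading
from queue import Queue

def load_chunk(i):
    # stand-in for copying chunk i from main RAM into the fast region
    return list(range(i * 4, i * 4 + 4))

def process(buf):
    # stand-in for the actual compute on the resident chunk
    return sum(buf)

def run(n_chunks):
    q = Queue(maxsize=1)  # at most one chunk prefetched ahead

    def loader():
        # the "core preparing the data": loads chunk i+1 while chunk i is processed
        for i in range(n_chunks):
            q.put(load_chunk(i))
        q.put(None)  # sentinel: no more chunks

    threading.Thread(target=loader, daemon=True).start()
    total = 0
    while (buf := q.get()) is not None:
        total += process(buf)  # work on the current chunk while the next loads
    return total

print(run(3))  # 66
```

The point of the sketch is the overlap: with explicit control over the fast memory, the load of the next chunk hides behind the processing of the current one, which a hardware prefetcher can only approximate by guessing.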
We know both positions are correct in practice. Most automatic prediction/management algorithms perform better on 90% of the code that’s run on them. However, careful performance tuning over time by experts can outperform generic solutions, sometimes dramatically.
I generally err on the side of "I'm not as smart as I think I am" in these kinds of discussions. It's not that I could never do it; I'm sure if I studied it a bit I could. The reason is that 99% of the lines of code I've ever written don't warrant that kind of attention. I'm sure there are uses, like when you're writing some core fundamental algorithm in fields like compression, cryptography, hashing, video processing, and so on. I don't work in those spaces though, so the benefit is much more marginal.
It’s possible 192 MB is enough to start needing some explicit memory management. It makes the coding model much more complex, though. And in a greedy software system where your code isn’t the only code running, such complexity doesn’t necessarily net overall wins. It’s the reason we have drivers and OSes, even though we started with each piece of software bundling explicit HW support (generally at much better perf).
Actually, this is where 3d stacking changes the game. With 1 additional vertical layer, you can get double the memory at practically the same latency, as the source of latency is related to the wire distance (and capacitance) in the 2d plane. Latency can be lowered with 3d stacking as well if you reduce the 2d area of the memory array. This is how the industry will keep scaling going when 2d feature sizes can no longer be shrunk.
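A first-order sketch of why stacking helps: in-plane wire length grows roughly with the square root of capacity, while a vertical hop adds only microns. The base capacity and unit length below are illustrative assumptions:

```python
import math

# First-order latency sketch: array access time is dominated by in-plane
# wire (wordline/bitline) delay, which grows with wire length. Doubling
# capacity in 2D doubles area, so linear dimensions grow by ~sqrt(2);
# a stacked layer adds only a ~micron-scale vertical hop instead.
# base_mb and base_len are illustrative assumptions, not real figures.

def wire_length_2d(capacity_mb, base_mb=32.0, base_len=1.0):
    return base_len * math.sqrt(capacity_mb / base_mb)

print(round(wire_length_2d(64), 2))  # 1.41 - 2x capacity in-plane: ~41% longer wires
print(round(wire_length_2d(32), 2))  # 1.0  - same footprint + one stacked layer: ~unchanged
```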
What's the limitation on adding additional vertical layers, though? One apparent limitation could be that adding several vertical layers requires longer wires.
1. Heat doesn't go away with layers and gets harder to cool.
2. These are still layers. You want each layer to be as flat as possible. As the layers deform you lose the ability to cleanly add more layers.
Current lithography is very much based on the notion that everything is flat.
Depends on your definition of “brute force”, I guess. Between explicitly deriving some algorithm from an understanding of causal relationships, v.s. training a neural net on billions of samples without encoding any broader understanding of the world, the latter approach is the one I would have called “brute force”.
Whilst it would be nice, we would be adding a lot more heat into the CPU block. Maybe they could add extra cooling on the other side, though then we'd really be looking at reinventing the CPU socket, and thoughts of Intel's Slot-1 form factor start to become appealing again as a way to allow such cooling solutions.
Now, what we have seen interest in over the years is adding processing cores to the RAM itself, and having a small dedicated processor for some tasks attached to the RAM may well prove viable.
Imagine if we didn't have any CPU or RAM sockets/slots, just a row of slots where you added modules that each had CPU and RAM in one, and you could add more up to the slot limit. But then, that is kind of how GPUs have gone already in many respects; look at how much RAM they hold and how large their cooling solutions are.
That gives you an idea of the cooling needed for large amounts of processing and RAM when closely packaged.
> Now what we have over the years seen interest in is adding processing cores to the RAM itself and maybe having a small dedicated processor for some tasks attached to the RAM may well prove viable.
I think we call these 'GPUs' today.
> just a row of slots you added a module that had CPU/RAM
As you mentioned, GPUs fit this description, but some PCIe devices are full-blown embedded systems.
I think we're at the point where the only major improvements will take place on the CPU die itself.
GPUs aren't quite the same, because they're (mostly) SIMD. GPUs don't have the independent per-compute-core memory bandwidth required for every core to be running its own execution thread with an independent instruction pointer, independent data fetches, and (especially) non-correlated branch-prediction failures / cache misses. (This is most of the reason that branching, if even implemented in the GPU's ISA, is effectively useless.)
Whereas, what the parent poster is describing would be true MIMD: a bunch of tiny cores each with its own on-board RAM, its own instruction pointer, and then a bus (or a bunch of busses) fast enough to feed them all data from (probably NUMA) main memory.
GPUs don't provide any advantage for running e.g. a high-concurrency Erlang application server. But a true-MIMD system would.
DRAM is incompatible with the logic family used in CPU making, though attempts to work around that have been numerous. Every few years, somebody comes along with claims of a passable CMOS+DRAM tech, but none has gained adoption so far.
I work on ferroelectric hafnium, which looks promising with regards to CMOS compatibility. It's non-volatile, though I guess you could also use it like DRAM. Endurance is an issue for now though (it's approaching RAM use-cases, but not SRAM yet).
As you said, I'm not sure about the density part. Unless it's completely done in BEOL, I don't see designers trade precious chip real estate for memory (unless pin or power-limited, of course).
I don't know much about STT-MRAM, so I'll avoid making a fool of myself; looking into the details is on my TODO list. Look at the datasheets and make your choice, as usual. I'm not sure how the situation will evolve on either side; my understanding is that FRAM could be more energy-efficient.
Well, hafnium can be used to make ferroelectric crystals, which are necessary for ferroelectric memories (FRAM). The most used ferroelectric material is PZT, which contains lead and is a nightmare for CMOS compatibility due to contamination issues and processing temperatures.
The most exciting part might not be the performance (though it looks good), but the way ferroelectrics can be used for new circuits (variable-threshold transistors thanks to FeFETs). Hafnium has been used in gate oxides for a few years now, so it's quite compatible with CMOS.
For practical considerations, both DRAM and CMOS need a lot of steps, and both are already dialed close to the thermal limits of their materials.
So if you first make the CMOS and leave protected empty space nearby to do the DRAM later, it will be very hard to fit into the tiny remaining thermal budget past which the CMOS devices turn into schmoo.
First, you need a wafer on which you can make a capacitor, which already means an SOI process, and SOI wafers.
The smallest nodes with SOI available to mortals are 40nm and 14nm at GlobalFoundries, but god knows how one gets GloFo to collaborate on that.
Then, you need a CMOS device which has enough thermal budget to survive both its own creation and the DRAM's.
Third, I think on-die DRAM will only make sense when it wins over SRAM. DRAM cells cannot physically go smaller than the minimal size of a trench capacitor. I believe at 5-7nm nodes, 6T SRAM will already be smaller than a reasonably fast eDRAM per area.
We know that IBM's Z15 is a 14nm FinFet chip, and it has DRAM on board, with them probably somehow doing DRAM first.
It was incredibly fast, but also very expensive and limiting in SKU configurations. The resulting 16GB SKU for the Vega 56/64, for example, made basically nobody happy. It was too much for gamers, who then didn't want to pay money for something that didn't help, and it wasn't enough for the professional crowd, who were getting used to 24GB offerings from Nvidia.
> Actually isn't that already the case for mobile CPU's?
Nope. They "just" replace the slot with solder, more or less. It's still externally packaged DRAM modules.
Why are you confident about this? Semiconductor processes are complicated for a ton of reasons, and a process which is optimized for making main memory (DRAM) has some differences from a process optimized for making logic (CPU), and flash memory is different still.
I see AMD's effort as a way to sidestep the whole problem of putting memory and CPU on the same chip, by making separate chips and stacking them. It makes sense. You can get CPU packages with CPU + DRAM + Flash, but these are separate chips which are wired together inside the package.
The whole point of these systems is to avoid putting memory and CPU (and flash) on the same chip. AMD's version is smaller & more integrated than package-on-package but it still achieves the same goal: multiple chips.
Actually, this is already a thing, but production of the memory isn't easy, so it's not as cost-effective yet. HBM is used on the Vega graphics cards. GPUs are a type of processor.
But IIUC you would at least need to pass the heat of the CPU through the RAM. So if you want to keep a lot of memory near the CPU, you are at the very least adding some sort of "blanket" between the heat source and the heat sink.
You can definitely do that, but now your RAM is on average twice as far from the CPU as if you put RAM on all sides. The best solution is probably some hybrid where you put RAM on all sides, but more on the bottom, to balance the thermal effects against the latency.
I'm sure Intel and NVIDIA would also like to make more of their platform and driver-stack open, but they are bound by confidentiality clauses in third-party technology that they've licensed. That's part of why huge chunks of Windows aren't open-source: because Microsoft doesn't "own" all of it, there's a lot of third-party licensed code compiled from source in Windows.
Nah, nVidia massively profits from the marketability of proprietary tech like HairWorks/GameWorks a couple of years back, and now it's DLSS. It's super cool tech, but it's kind of sad that [speculation] it'll be replaced down the line with a slightly worse version that works on all cards, just because nVidia wants to market its exclusivity while it lasts.
Nvidia likes to use Gameworks etc as a weapon against AMD. Like when they turned up tessellation to ridiculous levels because AMDs cards didn't have the hardware for it at the time.
That has been thoroughly debunked a million times over. The ocean only renders in wireframe mode because occlusion culling doesn't happen in wireframe mode.
Bryan Cantrill talked extensively about how long it took for Sun Microsystems to open-source Solaris; a lot of the delay came from code that had often been outsourced because it was "boring" and basically non-core tech (the example was i18n/l10n).
I think there are three reasons why overall this is not that bad for the Linux code base:
- it is a driver, not a core module
- the constants are implementation details of the driver
- active maintenance of code is a necessary condition for inclusion in the Kernel. The other Kernel developers are not supposed to maintain and refactor code dumps.
It’s also not the worst way of making sure new chips get supported relatively quickly. Those “giant” header files are mechanically generated from the RTL source for each GPU; it’s not like it’s some horrendously inefficient bloat from an outsourced developer who doesn’t care about writing clean code or something.
NVIDIA doesn't give a crap about its driver stack being open source. In fact, they actively want it to not be open. They refuse to even release signed firmware so the community can build an independent driver.
They didn't kill OpenCL. 2.0 was evidently unpopular with everyone. Not sure why you're trying to pin it on Nvidia. And AFAIK Nvidia's only stated position on CUDA is that it's royalty-free.
Yes, now, after letting the whole ecosystem die by not supporting OpenCL 2.x for ~10 years, which in turn caused the whole ML field to turn into a Nvidia/CUDA monoculture.
Going by wikipedia it took AMD almost two years to release an SDK with full OpenCL 2.0 support and Intel wasn't that much faster. Most of the Open Source implementations also seem to have died around the OpenCL 1.2 mark, with some having incomplete 2.0 support. So while you can probably blame NVIDIA for kicking a dead horse it looks as if something else got to it first.
I believe OpenCL 2.0 had a mandatory feature that nvidia couldn't support (maybe something about sharing pointers between GPU and CPU?). OpenCL 3.0 solved this by making many features optional.
I tried to create a ML framework[0] that would work on both CUDA and OpenCL (and natively on the CPU) around 2015/2016, which included creating FFI wrappers for both CUDA and OpenCL. This is where my experience on the subject (and my contempt for NVIDIA) comes from.
My memory isn't perfect, but IIRC the situation was roughly the following: we were quite short on resources (both dev time and money), which meant we had to choose our scope wisely. Optimally we would have implemented both CUDA and OpenCL 2.0, but we had to settle for OpenCL 1.2 (which offered reduced performance, but was "good enough" for inference). IIRC OpenCL 2.0 was very similar in the capabilities it assumed and offered to the CUDA version at the time, and cards like the GTX Titan X had "compute capabilities" in CUDA that supported features like shared virtual memory between CPU and GPU. In fact, the advances around memory management (and async copying) present in CUDA but not in OpenCL 1.x were the main source of the performance differences between the two.
From everything I can tell, at that point in time, if NVIDIA had wanted to support OpenCL 2.0 they could have done so based on the technical requirements. The reason for not doing so is pure speculation (lack of internal resources due to focusing on devtools?), but to me it always looked like they were using the edge they got via proprietary libraries like cuDNN to get a foot into the field of ML, and then purposefully neglected OpenCL to prevent competitors from catching up. Classic Embrace, Extend, Extinguish.
Maybe you're right, however OpenCL 2.2 came out 4 years ago and (almost) nobody has adopted it yet, so the problem could be with the spec. OpenCL 3.0 was adopted by nvidia (although OCL 3.0 is the same as 1.2 with some optional features added)
But on that very same Wikipedia page you can see it took less than a year for both Intel and AMD to release OpenCL 2.0 drivers, yet Nvidia didn't even start evaluating it for years.
Then again, it sounds like OpenCL 2.0 required some flexibility the nvidia drivers or hardware wasn't able to provide.
It's pretty hard to speculate to which degree nvidia was intentionally sandbagging here, and to which degree it really was stuck.
However, Nvidia is a member of Khronos, and it's hard to believe that such a major manufacturer could not have either said beforehand that the spec was a problem or simply complied with it, as both AMD and Intel did.
Also, given CUDA's success it's all rather convenient for nVidia - at the very least it looks like they didn't mind leveraging their market position for continued market dominance, even if it's unclear whether that was an intentionally anti-competitive aim from the get go, or simply a fortunate happenstance they didn't try to avoid.
Then again; with antitrust enforcement mostly remaining in vaporware mode it's a little hard to blame them.
Intel has been showing off a lot of 3D stacking tech and mixed dies on packages recently, too. This isn't really a knockdown punch. It'll depend on who can ship what and when, not just who can show off demos. They both have shown off demos.
Sure, but Intel hasn't even been able to ship 10nm or 7nm in meaningful quantities for years. I have my doubts that stacking is yielding well for them...
selling mobile phones is a cash cow, but selling mobile chips really isn't, unless you are Qualcomm and have monopoly rights on the concept of a cellular radio.
Even selling mobile phones isn't the cash cow it was a short three years ago. Demand is plateauing and people hold on to their phones for longer, which makes perfect sense in a world where almost every mobile company is constantly inflating prices.
Plus there's an upper limit to the computing power that people need from their phones as well, so the sleek ads for phones simply don't work on many people nowadays. They are content with what they have.
If a particle kills a repairable part of the die (which can be downgraded, or cut off), you still end up with a quarter to a half of the die area dead, and a second-grade chip.
But if you can make twice as many dies from a wafer, even with the same defect rate (it will usually be less), then even when a die dies completely from a local defect, you still get many more perfect dies in total, which is what matters for your bottom line.
Standalone SRAM can be very repairable and high-yield with a custom process. Adding a few spare columns or SRAM banks should cover far more defects than binning done per entire CPU die.
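The die-size argument can be made concrete with the classic Poisson yield model, yield = exp(-defect_density × area). The defect density below is an assumed illustrative value, not a real fab figure:

```python
import math

# Classic Poisson yield model: yield = exp(-defect_density * die_area).
# The defect density is an assumed illustrative value, not a real fab figure.

def yield_poisson(defects_per_cm2, area_cm2):
    return math.exp(-defects_per_cm2 * area_cm2)

d = 0.5                            # defects per cm^2 (assumed)
big_die   = yield_poisson(d, 2.0)  # one monolithic 2 cm^2 die: ~37% yield
small_die = yield_poisson(d, 1.0)  # one 1 cm^2 chiplet: ~61% yield

# Good silicon per 2 cm^2 of wafer: one big die vs two small dies
print(round(2.0 * big_die, 2))        # 0.74
print(round(2 * 1.0 * small_die, 2))  # 1.21 - splitting yields more good area
```

This is the same exponential that makes chiplets attractive in general: halving the die area more than doubles the fraction of defect-free dies once defect density is non-trivial.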
How do the thermals work out? The compute dies are already one heck of a hotplate, putting an entire layer of insulating silicon on top isn't going to improve things?
There are additional silicon stiffeners, which should help with thermal transfer, granted at lower thermal efficiency than a single element.
-----
Cooling? Well, I'm using a 5900X right now, air cooled, with a high airflow case loaded with slow spinning 140mm case fans.
In games at 1440p120 (maxed settings, with RTX) I'm hotspotting at -- wait for it -- about 68C. With most of the CPU at 60C. In CPU intensive applications, more like 74C hotspot, with most of the chip at 65-67C. That's a setup that's still inaudible to me at 1m distance.
I feel this is going to be a 5900XT and 5950XT. Fills a price gap between the higher end X desktop CPUs and Threadripper for the HEDT market. Great for reasonably priced dev desktops (without falling down the workstation rabbit hole), as compilers love cache.
... though next year, with 64-core EPYCs at 5nm with 768MB of L3? Oh. Dear. What's in the Xeon pipeline that can attempt to compete with a CPU that dominates on PPW and will be neither cache-starved nor core-starved? I guess it'll fuel a lot of Optane 5800-series sales, as driving IO latency down to sub-10μs will matter more.
"On top" in a flip chip means on the side facing the package. The silicon substrate's closest layer is the bottom layer, which is the hottest and still the closest to the IHS.
These devices are often thermally constrained as it is, so the question of how this (adding extra silicon on top of the existing die [1]) affects thermals is important. Also, neither copper nor silicon is responsible for the "bulk" of the heat transfer; the goal is always to get to a heatpipe or vapor chamber as quickly as possible, because those deliver well above 10,000 W/(m·K).
[1] Though Intel thinned their dies recently to improve thermal performance (CPUs are flip-chip, so the metal layers and active circuitry are facing the interposer). Some were concerned about stability/cracking of the thinner dies. Perhaps AMD is doing the same here, thinning the compute die, then stacking the memory die on top to end up with a stack that's exactly the same thickness as before. Since they're bonded together, the structural integrity should be similar. That additionally has the advantage that you can keep using the exact same IHS as before.
> - The processor with V-Cache is the same z-height as current Zen 3 products - both the core chiplet and the V-Cache are thinned to have an equal z-height as the IOD die for seamless integration
> - As the V-Cache is built over the L3 cache on the main CCX, it doesn't sit over any of the hotspots created by the cores and so thermal considerations are less of an issue. The support silicon above the cores is designed to be thermally efficient.
I recall that years ago IBM had an experimental solution to that involving liquid between the layers to transport the heat (vague memory; I think they used an analogy to blood).
> I'm pretty sure the longevity of chips manufactured now is going to be alot shorter than for older chips if you can run them at the same temperature!
Don't be so sure. If that were the case, we'd be changing CPUs like early Seagate disks at our datacenter. There's no measurable longevity loss in the CPUs we've observed in our data center.
Server BMCs provide power consumption timelines, which are measured in real time by the PSUs themselves. Also processors can assess their power consumption and report it to the OS and to the platform.
Querying and recording these as time series provides very nice insights over time.
I just looked at a bunch of Xeon Gold 6258Rs. They're running at around 70 degrees Celsius per die. The machine is under full load, and I'm sure its fans are at full speed.
That CPU's reported temperatures are 81 for high, 91 for critical (in degrees Celsius).
We get new systems almost every year, so we have a rolling set consisting of many generations. So, we have a good cross-section of systems to observe.
Server parts are big chips (or a bunch of chiplets) running at fairly moderate clock speeds. For example your XCC CPU should have a die size in the 600-700 mm² neighborhood for a power density of around 0.3 W/mm² (which neglects hotspots because probably half the CPU is cache, which does not consume half the power). Desktop parts like the 10900K can chug down 200+ W on a 200 mm² die, though that's basically impossible to cool. Even more reasonable CPUs like AMD's 5000 series run something like 40-50 W through an 80 mm² die, these are still hard to cool.
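For reference, the power-density figures above, worked out explicitly (all wattages and die areas are the approximate numbers from the comment itself, not datasheet values):

```python
# Power densities from the comment, worked out explicitly. All wattages and
# die areas are the approximate figures quoted there, not datasheet values.

def power_density(watts, area_mm2):
    return watts / area_mm2

server_xcc     = power_density(205, 650)  # big server die, moderate clocks
desktop_10900k = power_density(200, 200)  # small, hot desktop die
zen3_ccd       = power_density(45, 80)    # compute chiplet

print(round(server_xcc, 2))      # 0.32 W/mm^2
print(round(desktop_10900k, 2))  # 1.0 W/mm^2
print(round(zen3_ccd, 2))        # 0.56 W/mm^2
```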
This is a different and refreshing perspective. I've never looked from the perspective of W/mm². OTOH, the number I've given is calculated by eyeballing the per core internal thermistors, provided by lm_sensors (hence by CPU itself).
So, per die sensors are probably reading somewhat lower numbers, but the core's cooking at 70 degrees C internally. Nevertheless, bigger die surface inevitably allows better heat conduction and reduces internal stresses considerably, when compared to a desktop part.
Desktop use is also very uneven, leading to a lot of large changes in temperature (35 -> 80 and back), which causes more material stress, than when everything is loaded evenly all the time. The same is true for electrical transients.
If you have a good cooler with lots of thermal capacity and good fan (like an Arctic Cooling or Noctua, nothing at the extreme end), the increase is very gradual. Also, short spikes in loads are well absorbed with minimal temperature changes.
Nevertheless the numbers you mention are neither unrealistic, nor impossible in stock cooling and/or sustained load scenarios.
A 5950X with a top-of-the-line Noctua cooler jumps from 35°C to 80°C very quickly when a single core is loaded at 100% (this is the worst case, because it leads to maximum voltage being applied to the CPU cores due to the boost to almost 5GHz; loading all 16 cores at 100% instead results in just 60°C sustained over long periods of full load). So, interestingly, compiling the Linux kernel with 32 threads is less thermally straining on the CPU than browsing the web with a single page loading 10 different animated ads.
This mirrors my experience with Zen 2 and 3 as well. Multi-core loads result in just a few W per core and decent temperatures, while few or single core loads push per-core power into the 12-16 W region and the temperature rise/fall is basically instant (brick-wall at 10 Hz update rate, though it's unclear what filtering is applied and how the CPU derives a single temperature from its probably numerous sensors), which suggests that they're not limited by the cooler, but rather by the thermal impedance from the active area to the cooler. The core itself is really tiny (iirc around 4 mm²)...
This is my conundrum when pondering buying a Threadripper 3970x / 3975x Pro workstation: aren't those chips cooked very aggressively when loaded, even with very good coolers? Doesn't that mean that one day the CPU might just burn?
I really want to buy a proper serious workstation and I can afford it in a few months but I keep wondering: is now the best time for it?
I wouldn't really worry about it. I have an overclocked 3970x and a Noctua air cooler keeps it much cooler than the Intel 6950X I had before (also aircooled, with an even larger Noctua cooler). Air cooling is more of a limit to how high an overclock you can get; normal operation will be no problem for any cooling system designed for Threadripper.
There's never a good time to buy a new computer. It's always going to be obsolete before you open the box. Though in this case, one might wait for the Zen 3 Threadrippers.
Thanks for the feedback, it's reassuring. As other posters said, I'm worried because it's a workstation, which means it won't have a consistent load, and material stress is much more destructive under thermal cycles of expansion and contraction. So if in your ~8h workday your station gets stressed for 2 or so hours at irregular intervals, how destructive is that, objectively?
--
As for the Zen 3 TRs, well, to be fair, I am not looking to blow $50k on a workstation. :) I am more interested how -- and if -- they will drive down the prices of the TR 3900 SKUs...
It probably won't drive down prices that much. You won't want the old thing, and the incremental cost between generations is not the expensive part of a workstation. I have bought old parts before (to replace broken parts), and my experience is that the price didn't change much. What was a $130 motherboard when brand new, is $130 when a generation old. Does it make any sense? Nope. But that seems to be how it is.
You won't spend $50,000 on a workstation just by using current-generation parts. I think when you see a workstation that costs that much, it's because it has multiple GPUs in them. Pro GPUs are always artificially overpriced, and given the GPU shortage, they're now even more overpriced.
I did a quick pcpartpicker expedition and found that using last-gen parts saves you about $1000 on a $6000 32 core Threadripper workstation. I compared last-gen SSDs, consumer GPUs, processor, and motherboard, and picked relatively high-end parts. You will save more money by dropping to 24 (or heaven forbid, 16) cores, not getting an extreme motherboard, getting 64G of RAM instead of 128, etc.
This could all be invalid in a few months. It is hard to separate "market is always crazy" from "this whole COVID thing is going on". Building a workstation during the pandemic was a pain -- I bought a used GPU, and didn't get ECC memory because nobody would sell me any. If you wait a year, that is likely to improve, and there will be newer hardware. But, if you need to do some computing between now and then, you don't have much choice but to buy what's available now, and it certainly makes for a very good computer.
Thanks for the check! Much appreciated, it did put my mind at ease.
I realized I was looking at several custom-made TR Pro workstations, and those of course carry a hefty markup on top: custom cases, custom cooling, several pre-added PCIe NVMe riser cards, plus a "style" price tag, I suppose. I should have looked on PCPartPicker indeed.
And yeah, last-gen tech very rarely gets discounted, even by normal people who just post ads on local Craigslist-like websites (like OLX). Stuff that's two or more generations old is discounted, but not last gen. Puzzling indeed, especially bearing in mind that this last-gen tech is very soon going to be "two gens ago" itself. Oh well.
As for ECC RAM, I hear you. I was unable to find any through official channels, but I lucked out: one guy on the local OLX had loads of it but just hadn't posted the ads because he was very busy (we were in touch over other ads), and I bought 64GB of ECC DDR3 RAM from him for a home NAS that I'm gradually expanding.
I'm definitely very interested in having a TR Pro workstation; started getting sick of Macs and their artificial slowness. I got the iMac Pro and granted, I have it connected to my TV where it plays Twitch streams all day but hell, a lot of stuff on the terminal (that's not Python) just works slower than it does on a meager i3 Linux machine that I have lying around.
So I do want a TR Pro machine but I am very curious about TR 5000. A release is expected in August which isn't that far away. On the other hand, the TR Pro 5000 might take several more months on top. Hmmm. Decisions, decisions. :)
I wouldn't focus too much on the Pro SKU. It has a slower base frequency and boost frequency, with the upside that it has 8 memory channels and supports 2TB of RAM. If you need 2TB of RAM and 8 channels, it's what you need, but quad channel memory is still very good. Most desktop-class machines are still 2 channel.
There is something to be said for the prebuilt workstations from reputable vendors. It shows up in a box, and starts working. I have strongly considered that angle; being a system integrator is tough. If something doesn't work, it's a week cycle time where you find a new part, organize the RMA of the faulty part, etc. and there is a potentially unbounded amount of time spent tweaking. Meanwhile, HP or Lenovo just ships you a computer; they tested it and it works. You pay several thousand dollars for the privilege, but it might be worth it. (And if you want a TR Pro, you have no choice. AMD doesn't sell them to consumers.)
> "if hardware is too hot for you to hold your finger on it indefinitely" it's too hot and it's going to break very soon!
From my experience, that doesn't happen like that in enterprise hardware. Either wrong voltage (inside the system) or defective design causes premature death. If the server BMC says it's fine, it's fine.
> Temperature kills hardware, you need to bring those temperatures down, and the only way to do that is to lower the wattage!
High Performance Computing doesn't work like that, unfortunately :)
> Atom is the perfect design, no crap and low power!
I'm sure it has its own uses and can accomplish a lot, but in HPC, it won't cut it. I use small SBCs at home to do and try a lot of fun and useful stuff, but it has limits.
Your rule of thumb is wrong.
The thing that kills hardware is temperature fluctuations.
Keeping a processor running consistently at 90 degrees is much better than switching it off and on all the time.
Running a CPU at full load has a very minor impact on its mechanical properties. Even after running for 10 years. Much less than storing it in your drawer for the same amount of time.
Temperature fluctuations damage the interfaces between the different materials inside an integrated circuit, so they will greatly lower the lifetime.
Nonetheless, there is continuous aging of the metal traces (electromigration) and especially of the insulating layers, e.g. the MOS transistor gates, due to the diffusion of atoms.
This continuous aging is accelerated by steady-state high temperatures and it eventually results in either open circuits or short circuits somewhere, destroying the device.
Good MOS integrated circuits are designed for a lifetime at their maximum specified temperature of at least 10 years, or even 20 years or more for the better ones.
Nevertheless, this is puny in comparison with the lifetimes of many semiconductor components produced 40-50 years ago, before the continuous shrinking of active device sizes; when free from fabrication defects, those could last hundreds of years.
Yes, so I'm betting on these 14nm Atom boards because they are very performant per watt (I get twice the server juice out of these at 25W than I get out of my 6600 desktop at 65W!) and I think they will be more reliable at 42 Celsius full blast than the still nonexistent 10nm-or-smaller parts.
I even think like you that my 45nm D510MO probably will outlive these! But it's so underpowered (like 1/10 of the perf. at 15W) compared to these that I'm willing to take the risk!
The risk that new boards will ever be so much better that I'll have to throw these away is zero at this point, memory being the bottleneck!
> Temperature kills hardware, you need to bring those temperatures down, and the only way to do that is to lower the wattage!
The best way to keep your CPU intact for a long time is to keep it powered off. But then you probably have no use for this CPU…
> it's too hot and it's going to break very soon!
“very soon” on your scale from now to 100 years, probably. But most people would prefer having a CPU working at full capacity for a few years over having a useless brick of silicon sitting around for a hundred years.
> "if hardware is too hot for you to hold your finger on it indefinitely" it's too hot and it's going to break very soon!
Are you talking about junction temperature, package temp, or heatsink temp? The only CPU I own with a package temp that's cold enough to touch is inside my phone.
Meanwhile some of my data center machines are a decade old and run at 75C all day. I've never had a CPU fail before the machine became obsolete.
That data can't really be trusted, because various sensors often have large offsets or are just bogus on specific hardware. E.g. lm-sensors reports both a sensor with 0 °C and one that's usually something like 90-100 °C on my desktop, while also misreporting the CPU temperature due to not taking the AMD offset into account (might be patched in recent versions).
On Intel and/or enterprise hardware, the values are reliable. You can cross-check them via IPMI sensors (provided by the BMC itself) or Intel powertop (by cross-checking CPU throttle states). Intel also writes the relevant support code (cpufreqd, thermals, etc.) themselves.
The biggest offender in my experience is OrangePi Zero, but I need a thermal probe to verify it.
>if hardware is too hot for you to hold your finger on it indefinitely
That's 50C (a bit less)... which is quite a low operating temperature and very far from anything that damages silicon. Temperature fluctuations are far worse due to thermal expansion/contraction.
Does temperature kill hardware? I wasn't very careful with my first laptop (a 2007 model with a dual-core CPU and a discrete GPU), and often used to run it on my lap with the fans blocked. It regularly got up to a sustained temperature of 96C. And it ran fine for 5 years (used for several hours most days) before the screen gave out. It actually still runs fine now with an external display, although it rarely gets used.
Is this an exaggeration, or do you genuinely want to run something on the same hardware for the next 100 years?
If so, why? I can't imagine a workload so unchangeable in its nature, especially considering the world we live in and how often storage media change, that I'd want to plan ahead for 100 years.
> I'm planning on running my machines for 100 years!
Heh, that's an interesting thought that i've also sometimes had.
For doing something like that, the hardware would need to be way more resilient and have failover for various components - for example RAID for the HDDs/SSDs and some mechanism for clustering and failover for apps and other stuff you'd want to run on them. But even with those in place, I'm not sure that the hardware that's available to us wouldn't just die long before the 100 year mark.
Does anyone have any idea what the oldest computers presently in continuous use are? The best I could find was this, but it was just turned on once, rather than working continuously: https://www.smithsonianmag.com/smart-news/watch-the-worlds-o... Apart from that, all I can think of are mainframes and such.
I really doubt that any piece of currently modern hardware could last 100 years without becoming some hard to understand mess that's incredibly out of touch with the OSes and paradigms of the future. What would you even run on it? Debian? FreeBSD? Haiku? Would there be anyone to debug Python 2 or Python 3 in 100 years? What about Java, .NET, PHP, Golang, Rust, C++ or even C? Considering that many of those ecosystems are more and more migrating to integration with the Internet, especially for the dependencies, what are the chances that any of the software will survive that long?
> Anyone have any idea what are the oldest computers presently in continuous use?
Voyager 1 and 2 are good candidates. In the end, they'll probably shut off because their generators won't provide enough power to operate their computers after 50 years or so.
The wear on most electronics depends more on thermal __cycling__ than on absolute temperature, assuming it is run within its operating temperature regime.
I'm planning on building a mini-ITX Atom-based system in the future either with a Jasper Lake (Pentium N6005) or a future Alder Lake-L CPU. ASRock makes mini-ITX motherboards with integrated Atom CPUs that are a bit more modern than Supermicro offerings.
I'm not waiting for an eventual tiny improvement in Gflops/watt... the SuperMicro boards are really industrial, and the Atom line is being cancelled in favor of low-wattage versions of the consumer chips!
I think these boards are the first and last to be able to run 100 years! Earlier models consume too much energy and subsequent will be too fragile/complex!
So the next logical step would be to remove L3 cache from the compute chiplet all together? That would let AMD either save money since the chiplet is smaller or add more logic for the same die space.
This could also mean a GPU chiplet on package with compute. Each chiplet gets at least one cache layer. The next few years could be pretty crazy.
The additional L3 is per chiplet, so you're still going to have to talk to other chiplets and their L3. Latency numbers would be good to see. This is definitely better than eDRAM L4 on the I/O die, and that's still something they could do, so props for that.
The power cost will really determine if we see this in Laptops or not though.
Smart matter! So far we've only really been able to make smart paper (all chips are basically 2D). I remember reading in Stephenson's Diamond Age about a brick of compute, when I understood what it was it blew my mind...
> A solid brick of compute would have to run pretty slow
That's not necessarily a problem if the cost of fabbing logic drops enough. It could actually make a lot of sense: throw more and more transistors at solving problems higher up the stack, but use them with a lower duty cycle to avoid thermal issues. We may end up living in a world built of logic bricks, gently throwing off heat and solved problems while hiding as structural or aesthetic components.
But the electrical paths can be made ultra short if arranged properly, resulting in less heat to dissipate. Designed properly, it could probably do the exact same thing with far shorter wire lengths, resulting in higher power efficiency.
That's unfair to Stephenson. The computer in his story was based on real design sketches of what you could do if you had atomically-precise manufacturing with an integrated liquid cooling network. (He did make the cooling network jet steam out at the end to go with the story's steampunk vibe -- that part seemed unrealistic.)
Exactly! Stacked logic is the key to the fabled smart matter. Heat dissipation is going to be an issue, although I think IBM had some ideas involving microfluidics. Also, there's been a ton of work on dilution fridges for quantum chips, and some startups are doing cryo logic; those two could complement each other.
This is just an aside but almost every post about writing high-performance code stresses the importance of data locality, that related data must be packed together in memory, and a cache miss that goes to main memory costs thousands of instructions worth of executed code. This implies that languages with tons of pointer indirection like Java will never be as fast as well-written C++ code.
I wonder how the continuous growth of caches in the recent past, as well as this approach will affect this performance requirement.
The general rule of thumb in software performance is that the larger the scale of the codebase is, the more the "hotspots" will matter.
The CPU word size, number of registers, memory sizes and cache all matter to this since they determine the "base unit size" that the hotspot code could address without dropping down a layer.
Pointer-chasing code can be slow, but if the entire structure fits in cache, it won't be that slow. It gets slow mostly because the CPU can't predict what's needed next and keep the pipeline fed. And if you were to have a pointer structure that also existed in memory mostly-linearly, it would most likely have performance characteristics similar to a flat array. At scale, the data format really is the bottleneck to performance - shaving off a few bytes with better packing can make the difference, and in this respect, yes, Java is going to impose some limits via lack of control.
That said, as a society we still process petabytes of JSON daily. We aren't constantly doing high-performance things - inputting and editing data and making it legible to humans are tasks that have the natural result of putting some "slack" and redundancy into data. Games tend to get results fast by "cheating" and turning the full-fat editable data into a lean, single-purpose rendering structure.
Have they grown? I don't recall that L1 has grown much, and the bigger caches likely have more latency. AMD has some big L3s lately, but their L3 has a bit more latency for it.
These big caches will help indirect code be a few percent less bad, but it will still be much better to keep things in L1.
Now, maybe one day, with some unforeseen developments, linked lists will be cool again.
I think latency has a lot to do with the speed of light - since both of these CPUs run at similar frequencies, you can cover more SRAM cells worth of distance in the one with a denser process geometry, hence you can have access to a larger cache with the same latency.
The speed of light is the absolute limit but isn't really a useful measure for any chip-level design. IIRC, it's more about capacitance and leakage current.
The less capacitance (and 3d finfets / other innovations reduce capacitance), the fewer electrons it will take to "turn on" or "turn off" a transistor.
That's why modern process designers are trying more and more exotic shapes to reduce capacitance and reduce the number of electrons that need to be moved anywhere for the "on-off" toggle to happen.
Leakage current is the number of electrons that "leak" even in the off state (and all CMOS designs are statically in an off state: even at 5V / on, there's a complementary transistor somewhere that's off and theoretically prevents current from flowing).
As such, the two sources of current (and therefore power usage) are capacitance (the number of electrons you need to move to turn a transistor from on to off, or vice versa) and leakage (the number of electrons that flow despite the transistor being nominally off).
---------
The lower the leakage, the less power is used. The lower the capacitance, the less power is used AND the quicker the transistor turns on / off (because fewer electrons have to move, so the switch goes from .25 nanoseconds to .1 nanoseconds or whatever).
Data locality is just one of the things needed for writing high-performance code. Another is having more development time for performance optimizations. C++ developers spend a significant amount of time on manual-memory-management-related bugs. Developers in languages without manual memory management can spend that time on performance optimizations.
And regarding "Java will never be as fast as well-written C++ code" => since Java is getting inline/value/data classes [1], it will also benefit from data locality.
Side-question: is there a way to write code so I can be sure my code and data are stored in the L3 cache? I understand it's done automatically from RAM, but what if I want to hack it so that it's guaranteed to be in cache? What about the L2 cache?
Yes, but I want it to be guaranteed, not statistically.
E.g. I want to run Quake1 right from the L3 cache.
E.g. with filesystem I'm sure when I read or write it, if I use regular IO operations, no mmap-ing (of course there's a lot of quirks with drive caches not respecting fsync, etc, but let's skip that for sake of this comment).
Probably need a VM professional to answer this. AFAIK most dynamic-language VM optimization basically comes down to prediction and cache hits.
Would 512MB of cache, or say a 4GB cache, dramatically change the performance of these dynamic languages? It still wouldn't be C-like performance, but instead of the current 10-20x for Python or Ruby, would it be closer to 3-5x?
Or would it not make much difference at all? I could imagine that in the future there will be a system-level cache built on top of the IOD on EPYC with 4-8GB capacity.
I don't understand. Is it just cache put above the compute-chip (so "layered" and the connections are still just 2D) or is it really 3D-interconnected?
I think the most interesting thing about this is combining a 7nm process for the vast L3 cache expansion with presumably 3-5nm processes. Cheap silicon enhancing expensive silicon with high-bandwidth interconnects.
Does anyone here have experience with the Xeon Phi (Knights Landing) processor, which had a 16GB L4 cache? Was it useful?
Too bad that Intel botched up their process tech, and this line of processors had to be stopped.
That was HMC (a competitor to HBM), a DRAM technology that IIRC was slightly higher latency than the other DRAM on their board. So it was kind of hard to use. HMC had more bandwidth but worse latency, so it was only sometimes faster.
This V-Cache is SRAM, and therefore likely to be lower-latency than any DRAM technology, and therefore actually useful as a cache.
I don't have experience with Xeon Phi. I just remember people talking about it back then.
Not sure why you are downvoted, IMO this can really help with security in server environments -- or help some companies close down their kernel even more (like Google's Android closed-source extensions).