
It's not about multi-socket configuration; self-booting KNL machines can have two classes of memory. For brevity's sake you can think of them as "fast" and "huge." A regular malloc call gets you a piece of "huge", and there's a separate malloc function available to allocate "fast." This is the difference between the DDR4 and the MCDRAM you mentioned -- they're not accessed uniformly.

While Intel has done a ton of work to make sure you don't have to care about this, it's obviously in their best interest to have as much software as possible be able to care about this, especially because the KNL clock is so slow.



Sure, but it's not clear to me that the best way to deal with the two memory classes would be with NUMA-aware scheduling, unless you happen to have an application with "fast" and "slow" application threads (which I suspect describes relatively few applications in practice; and even if it does, wouldn't you have to tell the scheduler about it explicitly?). Seems to me like it will usually be much more efficient to use the MCDRAM either as L3 (default configuration) or as an explicit scratchpad (which a scheduler wouldn't really be able to exploit, given that if the scheduler has data structures that don't fit in L2 it's already probably screwed).

That being said, I did some more reading this morning and the sub-NUMA clustering configuration on the new Phi does provide tile-to-directory-to-MCDRAM affinity (via pin domains), which would make sense for maximizing its performance as either L3 or scratchpad; AFAICT this is not the case for the remote DDR4, though. So whether it's worth caring probably depends very much on your workload. I think KNL is most interesting for workloads with working datasets that are much larger than 16GB, since otherwise you could just use a GPU. You can get more usable working memory and much better bandwidth with something like a DGX-1 thanks to NVLink, but, unless I'm missing something, not at a remotely competitive price point. It's also unclear to me whether that's sustainable for larger working sets, since you can only transfer up to 80 GB/s from the CPU to the GPUs, which is lower than the 90 GB/s each Phi gets out of DDR4 on Triad (and a better comparison is probably the 115.2 GB/s theoretical peak for KNL anyway).
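
For what it's worth, the bandwidth figures above can be sanity-checked with a few lines of Python. This is only a sketch: the channel count and transfer rate (6 DDR4 channels at 2400 MT/s, 8 bytes per transfer) are my assumption about where the 115.2 figure comes from, and the 64 GB working set is a made-up example size, not a measurement.

```python
# Back-of-the-envelope check of the bandwidth numbers quoted in this comment.
# Assumption (not stated in the thread): the 115.2 GB/s DDR4 peak corresponds
# to 6 channels at 2400 MT/s with an 8-byte (64-bit) bus per channel.


def ddr_peak_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


def sweep_time_s(working_set_gb, bandwidth_gb_s):
    """Seconds needed to stream the whole working set once at a given rate."""
    return working_set_gb / bandwidth_gb_s


KNL_DDR4_PEAK = ddr_peak_gb_s(channels=6, mega_transfers_per_s=2400)
KNL_DDR4_TRIAD = 90.0   # GB/s, Triad figure quoted above
HOST_TO_GPU = 80.0      # GB/s, CPU-to-GPU figure quoted above

# Fraction of the theoretical DDR4 peak that the Triad number represents.
triad_efficiency = KNL_DDR4_TRIAD / KNL_DDR4_PEAK

# Hypothetical working set, chosen only to be well above 16 GB.
EXAMPLE_WORKING_SET_GB = 64

if __name__ == "__main__":
    print(f"KNL DDR4 theoretical peak: {KNL_DDR4_PEAK:.1f} GB/s")
    print(f"Triad as a fraction of peak: {triad_efficiency:.1%}")
    for label, bw in [("KNL DDR4 (Triad)", KNL_DDR4_TRIAD),
                      ("CPU-to-GPU link", HOST_TO_GPU)]:
        t = sweep_time_s(EXAMPLE_WORKING_SET_GB, bw)
        print(f"{label}: {t:.2f} s per pass over {EXAMPLE_WORKING_SET_GB} GB")
```

The only point is that once the working set no longer fits in device memory, each pass over it is bounded by whichever of these links feeds the compute, so the 80 vs. 90 GB/s comparison is the relevant one.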



