It's not about multi-socket configuration; self-booting KNL machines can have two...

Jweb_Guru · on Sept 9, 2016

Sure, but it's not clear to me that NUMA-aware scheduling is the best way to deal with the two memory classes, unless you happen to have an application with "fast" and "slow" application threads. I suspect that describes relatively few applications in practice, and even if it does, wouldn't you have to tell the scheduler about it explicitly? It seems to me that it will usually be much more efficient to use the MCDRAM either as L3 (the default configuration) or as an explicit scratchpad. A scheduler wouldn't really be able to exploit the scratchpad, given that if the scheduler has data structures that don't fit in L2, it's probably already screwed.

That being said, I did some more reading this morning, and the sub-NUMA clustering configuration on the new Phi does provide tile-to-directory-to-MCDRAM affinity (via pin domains), which would make sense for maximizing its performance as either L3 or scratchpad. AFAICT this is not the case for the remote DDR4, though. So whether it's worth caring probably depends very much on your workload. I think KNL is most interesting for workloads with working datasets that are much larger than 16GB, since otherwise you could just use a GPU. You can get more usable working memory with much better bandwidth from something like a DGX-1 thanks to NVLink, but, unless I'm missing something, not at a remotely competitive price point. It's also unclear to me whether that is sustainable for larger working sets, since you can only transfer up to 80 GB/s from the CPU to the GPUs, which is lower than the 90 GB/s each Phi gets out of DDR4 on Triad (and a better comparison is probably the 115.2 GB/s theoretical peak for KNL anyway).
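For anyone who wants to check the arithmetic behind those bandwidth numbers, here is a rough back-of-the-envelope sketch. It assumes the 115.2 figure comes from six channels of DDR4-2400 with 8-byte-wide channels; that configuration is my assumption, not something stated above. The 64 GB working set and all function and constant names are made up for illustration.

```python
# Assumption (not stated in the comment): the KNL DDR4 peak comes from
# 6 channels of DDR4-2400, each 64 bits (8 bytes) wide.
CHANNELS = 6
TRANSFERS_PER_SEC = 2400e6   # DDR4-2400 means 2400 mega-transfers per second
BYTES_PER_TRANSFER = 8       # 64-bit channel

# Figures quoted in the comment, in GB/s.
HOST_TO_GPU_GBS = 80.0       # CPU-to-GPU transfer ceiling
TRIAD_GBS = 90.0             # measured Triad bandwidth out of DDR4


def peak_bandwidth_gbs(channels, transfers_per_sec, bytes_per_transfer):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * transfers_per_sec * bytes_per_transfer / 1e9


def sweep_seconds(working_set_gb, bandwidth_gbs):
    """Time to stream a working set once at a given sustained bandwidth."""
    return working_set_gb / bandwidth_gbs


ddr4_peak = peak_bandwidth_gbs(CHANNELS, TRANSFERS_PER_SEC, BYTES_PER_TRANSFER)
triad_efficiency = TRIAD_GBS / ddr4_peak

if __name__ == "__main__":
    working_set_gb = 64.0  # hypothetical working set, well above 16 GB
    print(f"DDR4 theoretical peak: {ddr4_peak:.1f} GB/s")
    print(f"Triad as a fraction of peak: {triad_efficiency:.2%}")
    for label, bw in [("host-to-GPU", HOST_TO_GPU_GBS), ("KNL Triad", TRIAD_GBS)]:
        print(f"{label}: {sweep_seconds(working_set_gb, bw):.2f} s per pass")
```

This only compares one pass over the data at the quoted rates. It ignores MCDRAM caching, overlap of transfer and compute, and anything that fits in GPU memory, so treat it as a sanity check rather than a performance model.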