Intel SPMD Program Compiler: A Compiler for High-Performance SIMD Programming (ispc.github.io)
131 points by kick on Oct 25, 2019 | hide | past | favorite | 30 comments


Matt Pharr wrote a series of posts telling the story of ispc: https://pharr.org/matt/blog/2018/04/18/ispc-origins.html

I found them extremely interesting - highly recommended.


That is an interesting read - compiler writers got hung up on auto-vectorization, while Cuda is essentially manual vectorization. And that's the thing: once you've seen that writing a massively vectorized program on a GPU makes sense, why would you write a program where you have to hope it gets vectorized?

That said, as I understand things, vectorization can fail with Cuda if you launch more kernels than the chip can hold at once, in which case the chip may run the kernels serially, producing surprising results.


That isn't really a failure case in cuda (or opencl). It's very common to launch more blocks/workgroups than can be resident simultaneously on the GPU.


Neither an autovectorization failure nor Cuda executing kernels serially is a hard failure; both are fall-back behaviors that accomplish the task, just more slowly than the explicit instructions imply. That said, serialized execution can supposedly create problems if a programmer writes logic that assumes kernels are always moving in lock-step.


After reading, I do wonder, though: if one worked out the 'packet' datatype of the Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page) library, wouldn't that give the same kind of parallelism?

Combined with a parallel-for from, for example, Intel's TBB library, it would be parallel across cores as well.

Making a new language imposes a huge development cost. Code will have to be adjusted.

Although the story is interesting, especially with notes on company politics, it is a bit one-sided on the technical side. The compiler-chiefs might have had a point that went beyond 'politics'.


It's a variation on Clang, so calling it a new language seems overkill. It adds the OpenGL notion of uniform and varying variables, so that you can be explicit about SIMD without it being awkward (pragmas, compiler flags, prayers, assembly readings).
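For readers unfamiliar with the model, here is a scalar C sketch (illustrative only, not real ispc syntax) of what uniform vs. varying means: a gang of program instances runs in lockstep, uniform values are shared across the gang, and varying values have one instance per lane:

```c
/* Scalar C sketch of ispc's uniform/varying distinction -- illustrative,
   not real ispc. A "gang" of GANG program instances runs in lockstep;
   `scale` is uniform (one copy shared by all lanes), while in[i]/out[i]
   are varying (one instance per program instance). */
#define GANG 4

void scale_gang(const float *in, float *out, int n, float scale /* uniform */) {
    for (int base = 0; base < n; base += GANG) {
        for (int lane = 0; lane < GANG; ++lane) {  /* the gang, in lockstep */
            int i = base + lane;
            if (i < n)                             /* execution mask */
                out[i] = in[i] * scale;            /* varying computation */
        }
    }
}
```

In real ispc you write the body once, mark `scale` as `uniform float`, and the compiler maps the gang onto SIMD lanes instead of an inner loop.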

I rewrote some OpenCL kernels for ISPC and it was a great experience. The resulting shared object could just be loaded and called from ctypes in Python, where OpenCL requires a lot of rituals.


I only played with ISPC a little bit. What I found is that it is really great if you need to write a large volume of SIMD code and that code sticks to one lane size. Like, 4 32-bit floats or ints. But, if you want to switch mid-stream to 8 shorts or 16 bytes, you're gonna have a hard time. Or, if you just need a few instructions, it's easier to just use intrinsics.


I think that's somewhat intentional? The name stands for [Intel [Single Program Multiple Data] Program Compiler], not "SIMD"/"Instruction". I'd expect switching lane depth (4 shorts or 4 bytes) to work fine, but switching SIMD width (8 or 16 lanes) seems out of scope for "run 4 instances of this program in lockstep".


I would have to see an example of what you mean, but it should be entirely possible, though it might require converting without using vectorization.

Switching lane size doesn't make much sense to me because ideally you would want lanes that are as wide as possible and mostly be agnostic to their size.


I had some code that tried to stay 16x8-bit, but would occasionally _mm_unpacklo_epi8 / _mm_unpackhi_epi8 to two 8x16-bit vectors to keep precise intermediate results during some fixed-point math.

Writing it out like that, it sounds like it should have been easy. Don't remember what I ran into. Maybe didn't bang on it long enough.
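For context, the widening trick described above looks roughly like this in SSE2 intrinsics (a sketch; the function name is made up): interleaving with a zero vector zero-extends each unsigned byte to 16 bits, giving headroom for fixed-point intermediates.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Sketch of the widening pattern: take 16 unsigned 8-bit lanes and
   zero-extend them to two vectors of eight 16-bit lanes, so intermediate
   fixed-point math can't overflow. widen_u8_to_u16 is an illustrative name. */
void widen_u8_to_u16(const uint8_t src[16], uint16_t lo[8], uint16_t hi[8]) {
    __m128i v    = _mm_loadu_si128((const __m128i *)src);
    __m128i zero = _mm_setzero_si128();
    /* Interleaving each byte with a zero byte zero-extends it to 16 bits. */
    __m128i vlo  = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  -> words */
    __m128i vhi  = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 -> words */
    _mm_storeu_si128((__m128i *)lo, vlo);
    _mm_storeu_si128((__m128i *)hi, vhi);
}
```

After the 16-bit math, _mm_packus_epi16 narrows the two halves back to a single 16x8-bit vector.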


The original AVX instructions didn't have all the integer operations that the most modern chips have. It might have been Haswell that added the small-integer operations across the full 256-bit lane width.


Right. AVX (the original extension) only added 256b floating-point and non-destructive 128b integer. The 256b integer SIMD ops are all in AVX2 or later.
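Concretely, that split means a 256-bit integer intrinsic like _mm256_add_epi32 needs an AVX2 code path plus a runtime check, with a fallback for AVX-only chips. A hedged C sketch, assuming GCC/Clang (`__builtin_cpu_supports` and the `target` function attribute); the function names are illustrative:

```c
#include <immintrin.h>
#include <stdint.h>

/* _mm256_add_epi32 (256-bit integer add, VPADDD ymm) is AVX2-only, so it
   is compiled under a target attribute and called only after a runtime
   check; pre-Haswell AVX chips take the scalar fallback. */
__attribute__((target("avx2")))
static void add8_avx2(const int32_t *a, const int32_t *b, int32_t *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(va, vb));
}

void add8_i32(const int32_t *a, const int32_t *b, int32_t *out) {
    if (__builtin_cpu_supports("avx2")) {
        add8_avx2(a, b, out);        /* AVX2 path */
    } else {
        for (int i = 0; i < 8; ++i)  /* works on any x86-64 */
            out[i] = a[i] + b[i];
    }
}
```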


This would benefit from a comparison with current OpenMP/OpenACC (which supports offloading to attached processors in a standard way for C and Fortran, at least). Also, comparing with gcc 4.2 in the performance examples doesn't seem very useful; it didn't support AVX, regardless of auto-vectorization. (That's not meant to dismiss ISPC.)


It's open source, so it should be fine?

But Intel's history when it comes to compilers and applied optimisations makes me immediately uncomfortable with this.

The sort of PR work that these guys would need to do in order for me to consider them even remotely trustworthy is beyond even their budget.


You don't have to guess. The process of upstreaming and the [then] current state of ARM support is described here: https://pharr.org/matt/blog/2018/04/29/ispc-retrospective.ht...

Not sure what conclusions to draw from that, but it looks like ARM support was finally made first class this past August: https://github.com/ispc/ispc/blob/cf90189/docs/ReleaseNotes....

I think it might be difficult to purposefully cripple AMD in an open source project.


> I think it might be difficult to purposefully cripple AMD in an open source project.

It's not as explicit as it has been in the past, but the CPUID checks for very specific feature sets aligned with particular Intel models may not match AMD parts, producing worse code on AMD CPUs that support feature sets above baseline AVX2:

https://github.com/ispc/ispc/blob/master/check_isa.cpp#L106-...

That said, I don't assume malice here and I haven't investigated thoroughly. Most likely they just want to support their own silicon well and that's what they know. It's possible they would accept similar support for AMD µarchs in the OSS project (or maybe not).

I wouldn't draw too much inference from the ARM example, as I don't see ARM as an Intel competitor. AMD, on the other hand, is currently very competitive with Intel.


ARM seems a competitor to Intel in HPC, particularly in the Fujitsu post-K system (whose actual name I forget), but even the existing ThunderX2 systems.

Interestingly, Intel people don't necessarily have the information to optimize for Intel CPUs even, and the hardware may not tell you. It's really complicated and messy. An example that's relevant to linear algebra is figuring out whether to use FMA, e.g. https://github.com/jeffhammond/vpu-count


I’ve spent time writing CPU detection code for previous projects, and there is nothing that jumps out at me as biased in the linked ISA check. In fact that is really the bare minimum required to split the AVX variants, and will detect AMD support just the same as Intel.

You can compare it to other detection functions - one relatively easy to read, non-vendor biased example that does dig into all the extensions is this Go implementation (not mine): https://github.com/klauspost/cpuid/blob/master/cpuid.go
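As a point of comparison, a minimal vendor-neutral tier selection can be written entirely in terms of feature bits, with no vendor-string check. This sketch uses GCC/Clang builtins rather than raw CPUID, and the tier names are illustrative (loosely mirroring ispc's targets), not ispc's actual code:

```c
/* Minimal vendor-neutral ISA selection: dispatch on feature bits, never on
   the GenuineIntel/AuthenticAMD vendor string, so an AMD chip reporting
   AVX2 lands on the AVX2 path exactly like an Intel one. Illustrative
   sketch; tier names are made up. */
const char *best_x86_tier(void) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))  return "avx512";
    if (__builtin_cpu_supports("avx2"))     return "avx2";
    if (__builtin_cpu_supports("avx"))      return "avx";
    if (__builtin_cpu_supports("sse4.2"))   return "sse4";
    return "sse2";  /* baseline for x86-64 */
}
```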


Right, it looks pretty reasonable to me too. Zen 2 still doesn't have AVX-512, so the super-parallel paths this really aims to help aren't applicable anyway.

Zen1-2 should land on the "AVX 2 (Haswell)" path in the linked excerpt -- they have AVX/AVX2, F16C, OSXSAVE, and RDRAND -- which is the best ISA without AVX512 implemented in the compiler. That's entirely reasonable on Intel's part.

(I don't know why they look for RDRAND in a compiler, but whatever.)


Because it has an rdrand() function.
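That is, the language exposes an rdrand() function that compiles down to the RDRAND instruction, so the compiler has to know whether the target supports it. The guarded-use pattern looks roughly like this in C, assuming GCC/Clang (all names other than the intrinsic are made up):

```c
#include <immintrin.h>

/* The RDRAND intrinsic is compiled under a target attribute and called
   only after a runtime feature check, so the binary still runs on CPUs
   without the instruction. hw_rand32/get_random32 are illustrative names. */
__attribute__((target("rdrnd")))
static int hw_rand32(unsigned int *out) {
    return _rdrand32_step(out);   /* 1 on success, 0 on (rare) failure */
}

int get_random32(unsigned int *out) {
    if (__builtin_cpu_supports("rdrnd"))
        return hw_rand32(out);
    return 0;  /* no RDRAND: caller must fall back to another source */
}
```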


Just search the github repo for "cpuid." It's used in two files, and while it does select for featuresets of Intel-specific models, it does so using common feature bits that AMD can implement, and has a fallback path for baseline AVX2-only mode. It isn't clear that they intentionally cripple AMD in any way. There is no explicit check for GenuineIntel, for example.


I assume you're referring to Intel's robust history of "Intellectual Property" lawsuits?

Agreed. ISPC is under a BSD license, but I wonder if there's a lawsuit to be made from the use of the techniques.


He's probably referring to Intel compilers outputting code designed to slow down CPUs from other makers.


And this is an even bigger issue because several popular benchmarking tools were compiled with ICC, skewing CPU reviews in Intel's favour.


> And this is an even bigger issue because several popular benchmarking tools were compiled with ICC, skewing CPU reviews in Intel's favour.

A benchmark whose speed also depends on a compiler is not a benchmark of the CPU, but of a compiler/CPU combination. If this is not stated (including, of course, which compiler was used), it is a deception of the reader.


During the UNIX wars, it was suspected many companies tuned their compilers to generate unrealistically fast code for common benchmarks.


All benchmarks depend on both the CPU and the compiler -- except those coded in assembly language. Everybody knows it, so it is not (by itself) a deception.


Why would you need 'PR' to trust something? Just compile your program and try it.


I'm not saying that it won't work - but I'd be very surprised if it was well optimised for any vendor other than Intel.

This in and of itself is not proof that they are operating in bad faith; if they provide clean interfaces and reasonable default implementations that's all that can be justifiably expected of them.

The question of good faith actually only comes into play when another vendor provides their own optimised implementations and creates a pull request.

How will the project maintainers respond? Even getting to that point requires a trust-based buy-in to the project from the other vendor.

Given Intel's history of toxic behaviour in competition I think they need to do far more than they have done (and are likely to ever do) to earn that minimal baseline of trust.


This is an open source project that came from a single person inside Intel. Intel is a huge company, and this project is barely a blip on its list of priorities. You don't have to guess; go try it or do some research. It has been maintained for many years now, and you can see all of the history on GitHub.



