GCC OpenMP + MKL, Performance Tips / Optimizations and a Bug

Unfortunately, I have to report that MKL_THREADING_LAYER=sequential makes the reference UKS (unrestricted) optimization run single-threaded during the iterations. MKL_THREADING_LAYER=GNU keeps it multi-threaded but loads the cores to no more than 50%, even though all the “Threads” statements in the input say 8 threads (as they should) in both cases.
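For anyone who wants to repeat the comparison, here is a minimal sketch of how the two settings could be prepared from Python before launching a run. The helper name `threading_env` is hypothetical, the list of layer names is my understanding of what MKL accepts (an assumption, please check the MKL docs), and setting OMP_NUM_THREADS and MKL_NUM_THREADS alongside it is my own choice, not something I tested above.

```python
import os

# Layer names as I understand MKL accepts them; treat this set as an assumption.
KNOWN_LAYERS = {"INTEL", "GNU", "TBB", "SEQUENTIAL"}


def threading_env(layer, threads, base=None):
    """Return a copy of an environment with the threading variables set.

    layer   -- value for MKL_THREADING_LAYER (case-insensitive here)
    threads -- thread count written to OMP_NUM_THREADS and MKL_NUM_THREADS
    base    -- environment to copy; defaults to the current process environment
    """
    layer = layer.upper()
    if layer not in KNOWN_LAYERS:
        raise ValueError(f"unknown threading layer: {layer}")
    if threads < 1:
        raise ValueError("threads must be at least 1")

    # Copy so the caller's environment is never modified in place.
    env = dict(os.environ if base is None else base)
    env["MKL_THREADING_LAYER"] = layer
    env["OMP_NUM_THREADS"] = str(threads)
    env["MKL_NUM_THREADS"] = str(threads)
    return env


if __name__ == "__main__":
    # Print the settings for the two cases compared above.
    for layer in ("SEQUENTIAL", "GNU"):
        env = threading_env(layer, 8)
        print(layer, {k: env[k] for k in
                      ("MKL_THREADING_LAYER", "OMP_NUM_THREADS", "MKL_NUM_THREADS")})
```

The returned dictionary can be passed as `env=` to `subprocess.run` together with whatever command starts the calculation; I left the command out on purpose, since it depends on the installation.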

If there is a way to exclude either the Intel MP or the OpenMP runtime when using MKL … it would be wonderful. Unfortunately, from what I have noticed, there are portions of the program that rely on an external library for threading, and that library is not Intel’s.

I switched to GCC + OpenBLAS and the speed is at its maximum: 100% load on the cores, and it is visibly faster. No need to set any variables.

I have to give the conda option a fair shake before judging its performance, but I find it hard to believe that anything other than -O3 -march=native for GCC or -fast -march=native for Intel is really going to give me the highest performance. In no way was I stating that the program is not optimized programming-wise. On the contrary, I expect and hope that it is, and since I am not a programmer, there is no way for me to know. Just to be sure I was understood: what I meant was only the optimization flags. Code that has to run on quite a few previous architectures can rarely be as fast as code compiled only for one specific architecture. I do not want it merely tuned for … an arch a few before mine; I want it built exactly for mine.

But I am thinking about the conda option … especially if I keep getting problems.