Unfortunately, I have to say that MKL_THREADING_LAYER=sequential makes a reference uks (unrestricted) optimization run single-threaded during the iterations. MKL_THREADING_LAYER=GNU keeps it multi-threaded but loads the cores to no more than 50%, even though all “Threads” statements in the input say 8 threads (as they should) in both cases.
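For anyone who wants to put numbers on the "50% load" observation rather than eyeballing a system monitor, here is a minimal sketch in Python. It runs a command with a chosen MKL_THREADING_LAYER value and reports the wall time plus the average number of busy cores (child CPU time divided by wall time). The function name `run_with_threading_layer` is hypothetical, and it assumes a Unix-like system, since `os.times()` does not report child CPU time on Windows.

```python
import os
import subprocess
import time


def run_with_threading_layer(cmd, layer=None):
    """Run cmd with MKL_THREADING_LAYER set to `layer` (or unset if None).

    Returns (wall_seconds, average_busy_cores).
    """
    # Copy the current environment so the parent process is not modified.
    env = dict(os.environ)
    env.pop("MKL_THREADING_LAYER", None)
    if layer is not None:
        env["MKL_THREADING_LAYER"] = layer

    before = os.times()
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command fails.
    subprocess.run(cmd, env=env, check=True)
    wall = time.perf_counter() - start
    after = os.times()

    # CPU seconds (user + system) used by waited-for child processes.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    cores = cpu / wall if wall > 0 else 0.0
    return wall, cores
```

Calling it as `run_with_threading_layer(["my_program", "input.dat"], layer="GNU")` (the command here is a placeholder) and again with `layer="sequential"` gives two comparable figures. With 8 threads requested, a result near 8 busy cores would mean full load, and a result near 4 would match the roughly 50% load described above.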
If there is a way to exclude the Intel MP or the OpenMP in the case of MKL … it would be wonderful. Unfortunately, from what I have noticed, there are portions of the program that rely on an external library for threading, and that library is not Intel’s.
I switched to GCC + OpenBLAS and the speed is at max: 100% load, and it is visibly faster. No need to set any variables.
I have to give the conda option a fair shake before judging performance, but I would not believe that anything other than -O3 -march=native for GCC or -fast -march=native for Intel is really going to give me the highest performance. In no way was I stating that the program was not optimized programming-wise. On the contrary, I expect and hope it is optimized, and since I am not a programmer, there is no way for me to know. Just to be sure I was understood: what I meant was only the optimization flags. Code that runs on quite a few previous architectures can rarely be as fast as code compiled only for one specific architecture. I wanted it to be more than tuned for … an arch a few generations before mine; I want it to be built exactly for mine.
But I am thinking about the conda option … especially if I keep getting problems.