GCC OpenMP + MKL, Performance Tips / Optimizations and a Bug

Hello,

I noticed a way to increase performance (at least for DFT, but since it's really about the SMT/multithreading aspects, I'd guess for everything) while experimenting with the solution to a bug I had. The solution came from another thread: Nested parallelism? - #15 by LGS.

I'm starting a new thread because the previous one is only partially on topic (the bug part only), somewhat branched, with multiple solutions (each sacrificing a different amount of performance), and this way the important part is easier to read.

The problem is an error when a GCC-compiled OpenMP build is used with MKL:

OMP: Error #15: Initializing libiomp5.so, but found libomp.so.5 already initialized.
OMP: Hint This … As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results …

For one of the users in that thread, recompiling worked; for me it didn't (even recompiling from scratch).
For another user, a more automated installation method worked; I want to compile for performance reasons, so I haven't tried that.
For another user, setting certain env variables worked, and it did for me too:

export MKL_THREADING_LAYER=GNU
export MKL_INTERFACE_LAYER=GNU
export OMP_NESTED="FALSE"

However, that leads to a performance impact.

First of all, I think this is still a bug.
Using conda and / or a binary install may work, but I don't want less optimization.
I also understand that mixing GCC with MKL may not be a good idea, and there are warnings in the Psi4 sphere against doing so.

But what if I want to mix them? The Intel compilers don't always produce more optimized code. Linux is, generally speaking, built with GCC, and I don't want any potential compatibility issues (say, with libraries, interfaces, etc.) either.

Instead of using all three of the env variables above, I use only one:

export MKL_THREADING_LAYER=sequential

and that gives me a 50% increase in performance. The input file is a minimal DFT geometry optimization. There are two other options for that variable - gnu and intel; sequential is the one that causes no performance issues.

Setting MKL_INTERFACE_LAYER does nothing or next to nothing.
Setting OMP_NESTED to either TRUE or FALSE does nothing or next to nothing.
I have never even tried setting KMP_DUPLICATE_LIB_OK=TRUE, for the reasons stated in the error message.
Setting other OMP or MKL variables does nothing, not even to the number of threads reported in the .out file - it is always the number of cores I have. As a matter of fact, I use no OMP and no other MKL variables.
I invoke psi4 as an executable from the command line with the -n 8 switch (I've got 8 physical cores and have switched the logical ones off).
I don't have to source the mklvars.sh from the Intel MKL installation except during compilation.
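For illustration, roughly like this (the file names and the mklvars.sh path are placeholders, not my exact setup):

# run with 8 threads; input/output names are placeholders
psi4 -n 8 -i input.dat -o output.dat

# mklvars.sh only gets sourced in the shell used for compiling
source /opt/intel/mkl/bin/mklvars.sh intel64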

Only pure cmake and make for compilation.
I don’t have conda.
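That is, nothing fancier than the standard sequence (paths illustrative, starting from the psi4 source dir):

cd psi4
cmake -S. -Bobjdir
cd objdir
make -j8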

Another interesting thing, performance-wise, is that using OpenBLAS gives pretty much the same performance as MKL. Once again, this is with GCC only; I have not tried the Intel compilers. The difference is up to 5%, if any at all - that is, with MKL at its best env settings and OpenBLAS with simply no settings (it never needed any).
I understand that the OpenBLAS libraries may come with stability / compatibility and / or accuracy problems.

Switching Hyper-Threading (HT) off also gives a performance boost. Most people wouldn't expect it, but it does. I have a hard time imagining anyone at a computing group doing that … but it works. HT does not double the number of cores you have; it presents each physical core as 2 logical ones. If your code competes for different execution units within a core, then running the instructions in parallel (on the same core) is indeed faster than running them consecutively; but if the instruction streams compete for the same execution units (say, AVX2), then you have a problem - sequential would be faster. Hence, a speed-up if HT is turned off. This is a BIOS setting, although it may be accessible from a running OS if you have the software. For Psi4 it gives me an additional 30% increase in performance for geometry optimizations. All of this depends on hardware, OS and algorithm, as well.
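For reference, on recent kernels (4.17+, if I recall correctly) SMT can also be toggled at runtime, without a trip to the BIOS:

# disable SMT until reboot (requires root)
echo off | sudo tee /sys/devices/system/cpu/smt/control
# verify: prints 0 when SMT is off
cat /sys/devices/system/cpu/smt/active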

Setting the memory in the input to the max the system can afford also gives a performance boost, even if the job doesn't need all of it. (That is, generally speaking, the case with every program.)
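In the input file this is just the usual memory directive, e.g. (30 GB is simply what my machine can spare, not a recommendation):

memory 30 GB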

I'll say a few things about my system, for profiling purposes - in case anyone wants to go after this bug, even if there is (technically) a workaround. And why not, in case anyone wants to reproduce the performance improvements - so it's clear I'm not seeing something system-specific and it can work for everyone.

I'm using Intel's Clear Linux from late April 2020, with kernel 5.6.5-941.native. This Linux is known for very aggressive optimizations (and happens to be the fastest on average). If you need heavy CPU and / or memory and / or I/O performance, there is no substitute. It is more than just a testbed for optimizations; it is becoming a proper distribution, custom-patched by Intel.
My GCC is 9.3.0.
My intel compiler suite is parallel_studio_xe_2020.1.102 from late April 2020.
There are no performance settings I have made on the Linux side.

My CPU is a Core i9 9900KF @ 5.0 GHz, including the AVX2 units. The architecture is Skylake. The cache runs at 4.x GHz. The DDR4 is at 3.2 GHz, with some pretty sweet (low) timings. The mainboard and power supply are handling everything just fine. The overclock has never given me an issue. Temperatures are below 80 °C at 100% load (Noctua cooling). Core voltage is low - 1.26 V. Everything in that department is set just fine. The difference in performance is always reproducible as long as the compilation and env settings are the same - hence it is not hardware instability.
Come to think of it … I guess one bonus of non-Xeon systems is the non-ECC memory, which alone gives 5% higher performance. Plus non-ECC memory runs at higher clocks, and so on.

BTW, with this setup the CPU beats Xeons costing up to a couple of thousand USD. And that is just the Xeon, not to mention every other custom / cluster-grade piece of hardware in a node and every excuse for its astronomical price. I'm pretty sure I am not going to be competing against anyone with a cluster or a mainframe, but for people on a limited budget - here is something for a personal budget instead of a rich university's.

Clear Linux has its own aggressive optimization flags, mainly split into those for the kernel, for GCC and for all other software. The flags I compile with (or at least hope were used by make) are:

export FCFLAGS="-g -O3 -feliminate-unused-debug-types -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=32 -m64 -fasynchronous-unwind-tables -Wp,-D_REENTRANT -ftree-loop-distribute-patterns -Wl,-z -Wl,now -Wl,-z -Wl,relro -malign-data=abi -fno-semantic-interposition -ftree-vectorize -ftree-loop-vectorize -Wl,-sort-common -Wl,--enable-new-dtags -mtune=native -march=native"
export FFLAGS="-g -O3 -feliminate-unused-debug-types -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=32 -m64 -fasynchronous-unwind-tables -Wp,-D_REENTRANT -ftree-loop-distribute-patterns -Wl,-z -Wl,now -Wl,-z -Wl,relro -malign-data=abi -fno-semantic-interposition -ftree-vectorize -ftree-loop-vectorize -Wl,--enable-new-dtags -mtune=native -march=native -Wa,-mbranches-within-32B-boundaries"
export CXXFLAGS="-g -O3 -feliminate-unused-debug-types -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=32 -Wformat -Wformat-security -m64 -fasynchronous-unwind-tables -Wp,-D_REENTRANT -ftree-loop-distribute-patterns -Wl,-z -Wl,now -Wl,-z -Wl,relro -fno-semantic-interposition -ffat-lto-objects -fno-trapping-math -Wl,-sort-common -Wl,--enable-new-dtags -mtune=native -march=native -Wa,-mbranches-within-32B-boundaries -fvisibility-inlines-hidden -Wl,--enable-new-dtags"
export CFLAGS="-g -O3 -feliminate-unused-debug-types -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=32 -Wformat -Wformat-security -m64 -fasynchronous-unwind-tables -Wp,-D_REENTRANT -ftree-loop-distribute-patterns -Wl,-z -Wl,now -Wl,-z -Wl,relro -fno-semantic-interposition -ffat-lto-objects -fno-trapping-math -Wl,-sort-common -Wl,--enable-new-dtags -mtune=native -march=native -Wa,-mbranches-within-32B-boundaries"

I realize those flags may break something - I have not tested everything, yet.

I think a very close equivalent for Intel's compilers is:
-fast -march=skylake, put in the compiler's .cfg file - but then, I have not tried Intel's compilers yet.
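Something like this in icc.cfg / icpc.cfg next to the compiler binaries, if I understand the mechanism correctly (untested on my side):

-fast -march=skylake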

ctest -L smoke -j 8 reports all fine.
My patience for make pytest ran out at around 1x%, but it was all fine up to that point.

I want to say Psi4 is what I wanted: performance-oriented, with many capabilities, and actively improved. It is very cool that you make this program.

Thank you.

:smiley:

Welcome, great post!

Though a few things are at least controversial. :slight_smile: (which is fine)

Just for other readers: the psi4 conda package is fully optimized with the Intel compiler.
It is fast 'out of the box'; not compiling the source code yourself does not mean less optimization.
The heavy work is done by a BLAS package (MKL) and the integral library (libint).

We will soon switch to libint2, and there is also simint, which in theory has better vectorization.
Tweaking the compilation of libint is hard, but may be more fruitful than tweaking Psi4 itself.
You might enjoy playing with simint!
For timing tests you want to look at the timer.dat files (repeat the execution and average the timings).

How your OS is built is really not an issue.

Maybe. DGEMM benchmarks often have MKL ahead; this is problem-size and CPU dependent. And, as you mentioned, there have been bugs and missing AVX2 kernels for some CPUs.
It’s a cool project and hats off to the people involved! A lot of work.

True. This has been common knowledge among HPC people in our field for some time. Worth testing on new architectures, though.

Why would that help? If the program holds everything it wants in-core and does not need to batch intermediates, then more memory does not help. The OS should take care of I/O caching; better to leave it the remaining memory.

Yeah, but what if the non-ECC memory does lead to problems? While an SCF will heal from erroneous intermediate results or corrupted memory entries, an MP2 calculation may not, and will actually give a wrong result. I know of a user who saw this (a long time ago…).
Even 'just' for academic research I'd be careful.

Better check with make VERBOSE=1! cmake might not use those env flags. You can set them via the cmake equivalents, -DCMAKE_CXX_FLAGS='<flags>' etc. No need to delete the obj dir - changing the flags will force recompilation.
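E.g. something along these lines (the flag values here are purely illustrative):

cmake -S. -Bobjdir -DCMAKE_C_FLAGS='-O3 -march=native' -DCMAKE_CXX_FLAGS='-O3 -march=native' -DCMAKE_Fortran_FLAGS='-O3 -march=native'
cd objdir && make VERBOSE=1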

I'll have to try that! I suggested those other flags because, at the time, it was the only way I got things to work well.
It may be gomp- and MKL-version dependent, since some of those flags should be the default by now.
My kernel is also very old (outdated CentOS on a cluster).

Maybe you want to check out the script to test the threading behavior: psi4/share/psi4/script/test_threading.py

Hello,

and thank you for the post, I find it very informative.

On the memory setting: I only tried it quickly, but I guess the program has a rather conservative default, and simply setting the memory to at least as much as the calculation would actually like to use already increases performance; clearly, any amount above that is unnecessary. For me, the difference between no memory setting in the input and an explicit setting at my machine's maximum was 30% in speed, for a small system under DFT optimization with a decent basis set. Although I gave it 30 GB of memory, it used something like 5.x GB. I have noticed this behavior in other software packages as well.

Additionally, on HT on vs. off: the number in my first post was for GCC + OpenBLAS. For GCC + MKL, switching HT off gives 25% better performance. I was actually rather surprised, because I expected modern MKL to be hardcoded to use only physical cores (as many as it is allowed to use). If that is so, the difference must come from something else.

I found a way to check the compiler flags after the build process: the file CMakeCache.txt in the psi4 build dir (in my case psi4/objdir) shows that the flags used were exactly the flags I had set in the environment.

Unfortunately I have to say that MKL_THREADING_LAYER=sequential makes a reference UKS (unrestricted) optimization single-threaded during the iterations. MKL_THREADING_LAYER=GNU keeps it multi-threaded but loads the cores to no more than 50%, while in the output all "Threads" statements say 8 threads (as they should) in both cases.

If there is a way to exclude either the Intel OpenMP or the GNU OpenMP runtime when using MKL … it would be wonderful. Unfortunately, from what I have noticed, there are portions of the program that rely on an external lib for threading, and that lib is not Intel's.

I switched to GCC + OpenBLAS and the speed is at max: 100% load, and it is visibly faster. No need to set any variables.

I have to give the conda option a fair shake before judging performance, but I find it hard to believe that anything other than -O3 -march=native for GCC, or -fast -march=native for Intel, is really going to give me the highest performance. In no way was I stating that the program is not optimized programming-wise; on the contrary, I expect and hope it is, and besides, I am not a programmer, so there is no way for me to know. Just to be sure I am understood: I meant only the optimization flags. Code that runs on quite a few previous architectures can rarely be as fast as code compiled only for the specific one. I don't want it merely tuned for an arch a few generations before mine; I want it built exactly for mine.

But I am thinking about the conda option … especially if I keep getting problems.

Another thing to mention: Psi4 is also a Python package and interacts with numpy.
You want Psi4 and numpy to use the same BLAS/LAPACK and threading settings.
Your own numpy may be using either the default BLAS or even OpenBLAS.
To check this, and to find out whether your psi4 uses gomp and iomp together, you can run the threading test (python <path_to_psi4_bin>/share/psi4/script/test_threading.py).
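A quick way to see which BLAS/LAPACK your numpy links against (numpy.show_config is part of numpy itself):

python -c 'import numpy; numpy.show_config()'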

It makes a lot of sense for us to say: use the conda numpy and the conda MKL, as they use the same MKL version. Otherwise there are a lot of combinations between all the components that can have unforeseen issues and are hard for us to diagnose and support. (And by us I really mean @loriab.)

GCC+MKL is a combination that should be better supported, I agree. I fear changes between MKL versions make this harder than it should be. Luckily, the Intel compiler is free again for now (through the Intel oneAPI HPC toolkit), but that wasn't always the case.

How odd. Maybe you just didn't see it for RHF? There is OpenMP in Psi4 and OpenMP threading in MKL, potentially nested in some post-HF methods.

One can put multiple code paths into a single binary to counter this.

The conda packages are optimized for multiple architectures with the following settings:
-msse2 -axCORE-AVX512,CORE-AVX2,AVX -Wl,--as-needed -static-intel -wd10237
(see psi4meta/conda-recipes/conda_build_config.yaml at master · psi4/psi4meta · GitHub)

So you have everything from SSE2 to AVX-512 instruction sets available, depending on what your CPU supports.
The optimization flags should be cmake's standard -O3 (@loriab may correct me).
I believe it should give you binaries as fast as building with -O3 -xHost on your own machine.

Intel's -Ofast cheats on FP accuracy and is risky, but it may be a viable way to get a tiny bit more performance for some types of calculations.
The same may hold for other fine-tuning compiler settings, and especially for profile-guided optimization on a particular setup.

The heavy lifting is usually done by MKL anyway, which will use optimal settings for your CPU.
(For AMD you should set export MKL_DEBUG_CPU_TYPE=5 to force AVX2.)

If you build all external libraries from scratch, which is what you do when you don't use our conda setup, then everything uses the same BLAS/LAPACK routines. (Except Python/numpy, which you need to take care of yourself.)

Well, thank you again … this is useful, as well.

I did test it, all fine … using top, time and reading the .out files for the number of threads used.

I am certain I have both OpenMP runtimes at the same time … numpy is with OpenBLAS (from my Linux distro) and the rest is with MKL - so this must be the problem. Plus, when I build everything with OpenBLAS there are no problems at all; everything is faster and top shows 100% usage on all cores.

Do I need a special Python built with MKL's threading in mind, or just the numpy package? I hope there is no other lib - one not even mentioned in the dependencies, being part of the standard Python distribution (or of yet another lib's) - that will give me the same threading issues.

And finally: can I use conda not just for the binary installation, but also to compile psi4 and all dependencies in the same environment, with the same libs (MKL's threading vs. OpenMP)? And is there a manual for that?

And, once again … thanx for the info.

I think just numpy is enough.

Conda can also be used as a convenient environment for compiling, providing all those external libs ready to use, with a compatible python/numpy/mkl.

See this http://psicode.org/psi4manual/master/conda.html?highlight=psi4%20dev#how-to-use-conda-to-compile-psi4-faster-and-easier and the next section for an overview.
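Roughly along these lines, if I remember right (the exact package and channel names are in the linked manual section; treat this as a sketch):

conda create -n p4dev -c psi4 psi4-dev
conda activate p4dev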

You were right - the binary distribution seems to be quite fast, at least for the kind of tests I had done before.

conda in my distro is broken, with an error that occasionally happens in other distros as well. Still unresolved, at least not for everyone. Until recently there was no conda in Clear Linux at all, so they don't really play nice together. So a conda build is not an option.

A manual Intel-compiler build kept giving errors at the very first (cmake) stage, which I don't remember right now, so … it doesn't matter. icc was just fine, but icpc was bad.

But the binary option for psi4 (with the .bash install script) works really well. As a matter of fact, the speed is slightly better than my manual build with GCC + OpenBLAS. I guess Intel compiler + MKL is a really fast combination.