Hello,
I noticed a way to increase performance (at least for DFT, but since it’s really about the SMT/multithreading aspects, I suspect it applies to everything) while playing more with the workaround for a bug I had, which I found in another thread: Nested parallelism?.
I’m starting a new thread because the previous one is only partially on topic (the bug part only), somewhat branched, with multiple solutions (depending on how much performance you sacrifice), and I want to make the important part easier to read.
The problem is an error when GCC-compiled OpenMP is used together with MKL:
OMP: Error #15: Initializing libiomp5.so, but found libomp.so.5 already initialized.
OMP: Hint This … As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results …
For one of the users in that thread recompiling worked - for me it doesn’t (recompiling from scratch, that is).
For another user a more automated way of installation worked - I want to compile from source for performance reasons, so I have not tried that.
For another user setting certain environment variables worked - and so it did for me:
export MKL_THREADING_LAYER=GNU
export MKL_INTERFACE_LAYER=GNU
export OMP_NESTED="FALSE"
However, that leads to a performance impact.
First of all, I think this is still a bug.
Using conda and/or a binary install may work, but I don’t want to give up optimization.
I also understand that mixing GCC with MKL may not be a good idea, and that there are warnings in the Psi4 sphere against doing so.
But what if I want to mix them? The Intel compilers don’t always produce more optimized code, Linux is generally speaking built with GCC, and I don’t want any eventual compatibility issues (say with libraries, interfaces, etc.) either.
Instead of using all three of the environment variables above, I use only one:
export MKL_THREADING_LAYER=sequential
and that gives me a 50% increase in performance. The input file is a minimal DFT geometry optimization. There are two other options for that variable, gnu and intel; sequential is the one that causes no performance issues.
Setting MKL_INTERFACE_LAYER does nothing, or next to nothing.
Setting OMP_NESTED to either TRUE or FALSE does nothing, or next to nothing.
I have never even tried setting KMP_DUPLICATE_LIB_OK=TRUE, for the reasons given in the error message.
Setting other OMP or MKL variables does nothing, not even to the number of threads reported in the .out file - it is always the number of cores I have. As a matter of fact, I use no OMP variables and no MKL variables other than the one above.
I invoke psi4 as an executable from the command line with the -n 8 switch (I’ve got 8 physical cores and have switched the logical ones off).
I don’t have to source mklvars.sh from the Intel MKL installation except during compilation.
Compilation is pure cmake and make only.
I don’t have conda.
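To put the pieces together, a typical run in my case effectively looks like this (the input and output file names below are just placeholders):
export MKL_THREADING_LAYER=sequential
psi4 -n 8 input.dat output.dat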
Another interesting thing, performance-wise, is that using OpenBLAS gives pretty much the same performance as MKL. Once again, this is using GCC only; I have not tried the Intel compilers. The difference is up to 5%, if any at all - and that is with the best environment settings for MKL and simply no settings for OMP (it never needed any).
I understand that the OpenBLAS libraries may give stability, compatibility and/or accuracy problems.
Switching Hyper-Threading (HT) off also gives a performance boost. Most people wouldn’t expect it, but it does; I have a hard time imagining anyone at a computational group doing that, yet it works. HT does not double the number of cores you have, it presents each real core as two logical ones. If the two threads compete for different execution units within the core, then running the instructions in parallel (on the same core) is indeed faster than running them consecutively; but if they compete for the same execution units (say, the AVX2 units), then there is a problem and sequential execution would be faster - hence the speed-up when HT is turned off. This is a BIOS setting, although it may also be accessible from a running OS if you have the software. Psi4 gives me an additional 30% increase in performance for geometry optimizations. All of this depends on hardware, OS and algorithm as well.
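If you want to double-check from a running Linux whether HT/SMT is really off (just a sanity check, nothing Psi4-specific), something like this works on my kernel:
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|^CPU\(s\)'
cat /sys/devices/system/cpu/smt/control
With HT off, lscpu reports 1 thread per core and 8 CPUs on my machine, and the smt/control file reads off.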
Setting the memory in the input to the maximum the system can afford also gives a performance boost, even if the job doesn’t need all of it. (That is, generally speaking, the case with every program.)
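For concreteness, here is a minimal sketch of the kind of job I mean, with the memory set explicitly - the molecule, method, basis and memory value are placeholders, not my actual input:
cat > opt.in << 'EOF'
memory 28 GB
molecule {
0 1
O
H 1 0.96
H 1 0.96 2 104.5
}
set basis def2-svp
optimize('b3lyp')
EOF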
I’ll say a few things about my system, for profiling purposes - in case anyone wants to go after this bug (even if, technically, there is a workaround), or wants to make performance improvements, so that what I’m seeing isn’t something system-specific but can work for everyone.
I’m using Intel’s Clear Linux from late April 2020, with kernel 5.6.5-941.native. This distribution is known for very aggressive optimizations (and happens to be the fastest on average). If you need heavy CPU and/or memory and/or I/O workloads, there is no substitute. It is more than just a testbed for optimizations; it is becoming a proper distribution, custom-patched by Intel.
My GCC is 9.3.0.
My Intel compiler suite is parallel_studio_xe_2020.1.102, from late April 2020.
I have made no performance-related settings on the Linux side.
My CPU is a Core i9 9900KF @ 5.0 GHz, including the AVX2 units. The architecture is Skylake. The cache runs at 4.x GHz. The DDR4 is at 3.2 GHz, with some pretty sweet (low) timings. The mainboard and power supply handle everything just fine. The overclock has never given me an issue: temperatures are below 80 °C at 100% load (Noctua cooling), core voltage is low at 1.26 V, and everything in that department is set just fine. The difference in performance is always reproducible as long as the compilation and environment settings are the same - hence it is not hardware instability.
Come to think of it … I guess one bonus of non-Xeon systems is the non-ECC memory, which alone gives 5% higher performance. Plus non-ECC memory runs at higher clocks, and so on.
BTW, with this setup the CPU beats Xeons costing up to a couple of thousand USD. And that is just the Xeon itself, not to mention every other custom, cluster-like piece of hardware in a node and every excuse for its astronomical price. I’m pretty sure I am not going to be competing against anyone with a cluster or a mainframe, but for people on a limited budget - here is something for a personal rather than a rich-university budget.
Clear Linux has its own aggressive optimization flags, mainly split into sets for the kernel, GCC and all other software. The flags I compile with (or at least I hope make used them) are:
export CFFLAGS="-g -O3 -feliminate-unused-debug-types -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=32 -m64 -fasynchronous-unwind-tables -Wp,-D_REENTRANT -ftree-loop-distribute-patterns -Wl,-z -Wl,now -Wl,-z -Wl,relro -malign-data=abi -fno-semantic-interposition -ftree-vectorize -ftree-loop-vectorize -Wl,-sort-common -Wl,--enable-new-dtags -mtune=native -march=native"
export FFLAGS="-g -O3 -feliminate-unused-debug-types -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=32 -m64 -fasynchronous-unwind-tables -Wp,-D_REENTRANT -ftree-loop-distribute-patterns -Wl,-z -Wl,now -Wl,-z -Wl,relro -malign-data=abi -fno-semantic-interposition -ftree-vectorize -ftree-loop-vectorize -Wl,--enable-new-dtags -mtune=native -march=native -Wa,-mbranches-within-32B-boundaries"
export CXXFLAGS="-g -O3 -feliminate-unused-debug-types -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=32 -Wformat -Wformat-security -m64 -fasynchronous-unwind-tables -Wp,-D_REENTRANT -ftree-loop-distribute-patterns -Wl,-z -Wl,now -Wl,-z -Wl,relro -fno-semantic-interposition -ffat-lto-objects -fno-trapping-math -Wl,-sort-common -Wl,--enable-new-dtags -mtune=native -march=native -Wa,-mbranches-within-32B-boundaries -fvisibility-inlines-hidden -Wl,--enable-new-dtags"
export CFLAGS="-g -O3 -feliminate-unused-debug-types -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=32 -Wformat -Wformat-security -m64 -fasynchronous-unwind-tables -Wp,-D_REENTRANT -ftree-loop-distribute-patterns -Wl,-z -Wl,now -Wl,-z -Wl,relro -fno-semantic-interposition -ffat-lto-objects -fno-trapping-math -Wl,-sort-common -Wl,--enable-new-dtags -mtune=native -march=native -Wa,-mbranches-within-32B-boundaries"
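Since I can only hope make actually used the exported flags, one way to be explicit about it (a sketch - the build directory name is arbitrary, and CMake normally picks up CFLAGS/CXXFLAGS/FFLAGS from the environment at the first configure anyway) is to hand them straight to cmake:
cmake -S . -B objdir -DCMAKE_C_FLAGS="${CFLAGS}" -DCMAKE_CXX_FLAGS="${CXXFLAGS}" -DCMAKE_Fortran_FLAGS="${FFLAGS}"
cd objdir && make -j 8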
I realize those flags may break something - I have not tested everything yet.
I think a very close equivalent for Intel’s compilers is -fast -march=skylake, put in the compiler’s .cfg file, but then I have not tried Intel’s compilers yet.
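As far as I understand, the classic Intel compilers read extra default options from a .cfg file sitting next to the compiler binary, so (untested) it would be something along the lines of - the install path here is a placeholder:
echo "-fast -march=skylake" >> /path/to/intel/bin/icc.cfg
echo "-fast -march=skylake" >> /path/to/intel/bin/icpc.cfg
echo "-fast -march=skylake" >> /path/to/intel/bin/ifort.cfg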
ctest -L smoke -j 8 passes everything fine.
My patience for make pytest ran out at around 1x%, but everything was fine up to that point.
I want to say Psi4 is what I wanted: performance-oriented, with many capabilities, and actively improved. It is very cool that you make this program.
Thank you.