Suggestions for HPC installation?

Dear Psi4 experts,

I have used Psi4 for quite some time, but always on a single system with multi-threading. I am quite comfortable installing Psi4 from source under Linux and compiling Libint with higher angular momentum than the conda packages currently offer.

I am now planning to install Psi4 on a few HPC systems, and I am wondering whether there are any special considerations I should be aware of. The target systems are AMD EPYC (Rome) or Intel Xeon based. I anticipate running jobs with the DFOCC, FNOCC, and DETCI modules.

In particular:

a) For the AMD EPYC systems, should I use OpenBLAS, AOCL-BLIS, or MKL for linear algebra when running jobs with up to 64 threads on a single CPU?

b) What is the current state of MPI support in Psi4? Is any particular MPI implementation recommended?

c) Is there anything special about Libint compilation to optimally support MPI parallelization?

d) Apart from DePrince’s now decade-old DF-CCSD and the BrianQC SCF/DFT implementations, is there anything else in Psi4 that can benefit from GPUs?

e) Is there anything else I might not have considered?

Thanks

Thanks for your questions. Here are a few answers. I don’t know first-hand of any large-scale AMD HPC installations, so let us know if you find anything interesting, or if you have further questions.

(a) I know Psi4 runs correctly with MKL, OpenBLAS (make sure you use the OpenMP variant, not the pthreads one), and Accelerate (Mac). It’s probably not going to scale well beyond ~10 threads. The last time (roughly five years ago) we tried MKL on an AMD chip, the timings weren’t great compared to an Intel chip; FNOCC was part of that test. But that was a while ago.
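
If you go the OpenBLAS route, something along these lines gives you the OpenMP variant; this is an untested sketch, and the target and install prefix are placeholders you would adjust for your machine:

```bash
# Untested sketch: build the OpenMP (not pthreads) variant of OpenBLAS.
# TARGET and PREFIX are placeholders -- adjust for your machine.
make USE_OPENMP=1 NUM_THREADS=64 TARGET=ZEN
make PREFIX=$HOME/opt/openblas-omp install
```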

(b) There is no MPI in Psi4 (some modules had MPI support in the past, but those were removed before v1.0, IIRC).

(c) Libint itself has neither MPI nor OpenMP directives, so no special compilation is needed. Parallelism over the integrals happens at the Psi4 layer.
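
In other words, the thread count is requested at the Psi4 level and propagated to the threaded BLAS; a rough sketch, assuming an OpenMP-threaded BLAS:

```bash
# Thread control happens at the Psi4 layer, not inside Libint.
export OMP_NUM_THREADS=16   # picked up by Psi4's OpenMP regions and the BLAS
psi4 -n 16 input.dat        # -n sets Psi4's thread count explicitly
```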

(d) Yes! David Poole and David Williams-Young have worked to get the latter’s GauXC working in Psi4 for sn-LinK SCF. There’s a recent JCP paper. The PR is at https://github.com/psi4/psi4/pull/3150 ; merging only awaits some testing fixes.

Thank you so much for the guidance. Here are a couple of “interesting” things that have transpired:

a) The current Psi4 Git version (as well as the older 1.7) builds fine with CMake 3.29.8, GCC 12.4 or 13.3, Python 3.11.10, and OpenBLAS 0.3.28, as long as INTERFACE64=OFF is set when building OpenBLAS. All tests pass with the current Git version on znver3 (AMD Milan). However:

CMake Warning at src/psi4/libqt/CMakeLists.txt
Your BLAS/LAPACK library does not seem to be providing the DGGSVD3 and
DGGSVP3 subroutines. No re-routing is available.

is somewhat concerning. Does OpenBLAS not include these LAPACK subroutines? That seems inconsistent with what I gather from #2832.
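
To see whether the routines are really absent (rather than just undetected by CMake), I suppose one can inspect the exported symbols directly; the library path below is a placeholder for wherever OpenBLAS ends up installed:

```bash
# Check whether the OpenBLAS shared library exports the LAPACK routines
# CMake complains about (library path is a placeholder).
nm -D /opt/openblas/lib/libopenblas.so | grep -iE 'dggsvd3|dggsvp3'
```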

b) The current Git version of Psi4 fails at the configuration stage on the AMD EPYC systems with the icpx compiler from the 2024 oneAPI release. The issue seems to be that this compiler reports “4.2.1” for __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__, which I think is its built-in GNU-compatibility default and has little to do with whatever GCC is actually installed. The current logic in custom_cxxstandard.cmake decides that 4.2.1 is less than 4.9 and raises a FATAL_ERROR. When I comment out this check, I can build with the Intel 2024 compilers. Compilation with Intel Classic 2019 icc works out of the box, as it reports the correct GCC version:

Found base compiler version 12.4.0

versus icpx:

-- Found base compiler version
Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, MOVBE, POPCNT, AVX, F16C, FMA, BMI, LZCNT, AVX2 and ADX instructions.
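
For what it’s worth, the GNU-compatibility version a compiler advertises can be dumped straight from its preprocessor, e.g.:

```bash
# Dump the GNU-compatibility macros defined by the Intel LLVM compiler;
# icpx (being clang-based) reports 4.2.1 here regardless of the GCC
# actually installed on the system.
icpx -dM -E -x c++ /dev/null | grep -E '__GNUC__|__GNUC_MINOR__|__GNUC_PATCHLEVEL__'
```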

c) The recent introduction of Eigen3 headers into the Psi4 code gives me trouble when Eigen3 is installed in a custom location. I think CMake passes the Eigen3_DIR variable only to Libint’s external build, not to Psi4 itself. In any case, because I could not figure out the correct include directive for CMake, I had to hard-code the proper include path into
psi4/src/psi4/libfock/SplitJK.h
and
psi4/src/psi4/libmints/matrix.h
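
The hard-coded includes are of course just a workaround. What I would have expected to work, though I have not verified it for the Psi4-internal targets, is pointing CMake at the custom install in the usual way; the paths are placeholders:

```bash
# Usual ways to point CMake at a non-default Eigen3 install
# (install prefix below is a placeholder).
cmake -S. -Bobjdir \
    -DCMAKE_PREFIX_PATH=/opt/eigen-3.4.0 \
    -DEigen3_DIR=/opt/eigen-3.4.0/share/eigen3/cmake
```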

d) Requesting

MAX_AM=5 builds Libint 5-4-3-6-5-4 … kind of expected that

while

MAX_AM=6 builds Libint2 7-7-4-12-7-5 … where one might have expected 6-5-4-7-6-5

Not complaining, just surprised :slight_smile:
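
For reference, the angular-momentum request goes in at Psi4 configure time; a sketch of the kind of line I mean (MAX_AM_ERI is the option spelling I am aware of in the Psi4 superbuild, and MAX_AM above was shorthand; check your version’s CMake cache for the exact name):

```bash
# Sketch: ask the bundled Libint2 build for higher angular momentum at
# Psi4 configure time. MAX_AM_ERI is the option spelling I am aware of;
# verify against your version's CMake cache.
cmake -S. -Bobjdir -DMAX_AM_ERI=6 -DCMAKE_INSTALL_PREFIX=$HOME/opt/psi4
```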

I will share a bit more about relative performance and scaling later, but in general OLCCD scaling on znver1 (AMD Threadripper), znver2 (AMD Rome), and znver3 (AMD Milan) looks rather similar, and there is indeed not much to be gained by going beyond 8 threads.