Problem with building from source according to CMakeLists.txt

Dear Developers and Users,

I am new to this forum, so please correct me if I am asking these questions in the wrong place.

I tried to build Psi4 from source following the manual
http://psicode.org/psi4manual/master/external.html#compiling-and-installing-from-source
and the CMakeLists.txt file.
I would like to build an efficient executable on a computer cluster and run large-scale, parallel CC computations.

Question 1:
I kept every default option in the CMakeLists.txt file, including
option_with_print(ENABLE_OPENMP "Enables OpenMP parallelization" ON),
but I cannot run psi4 with more than 1 core on a multicore machine.
(OMP_NUM_THREADS and MKL_NUM_THREADS are set to 4; I also tried the command-line option -n 4, as shown below.)
OpenMP works perfectly with multiple threads on my local computer if I install the binary version using Psi4conda, which is why I think I am making some mistake when building from source on the cluster.
Could you help me set up OpenMP properly?
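For reference, this is roughly how I requested the threads (input.dat stands in for my actual input file):

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
psi4 input.dat
# alternatively, requesting threads on the command line:
psi4 -n 4 input.dat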

Question 2:
When I run the same CC job with the binary and with the compiled version (without OpenMP, using a single core) on my local computer, I find that the binary version is much faster (3-4 times!) than the locally compiled version. I expected the locally built one to be faster, so I think this misbehavior could be the result of other wrong settings during compilation or linking.
Executing

cmake -H. -Bobjdir

finds the path to the locally installed Intel MKL libraries (version 11.1.3).

Could you help me find the proper compiler settings to build a more efficient executable?
Unfortunately, I did not find any guidance on which BLAS/LAPACK library works best with Psi4 or which compiler options should be set for optimal performance.
Knowing the compiler versions/settings and the BLAS/LAPACK library version with which you build the binary would be very helpful.

Thank you very much, Peter

Nothing that you say you’ve done strikes me as problematic. For the most part, you needn’t tune the CMake options, especially if it’s picking up a local MKL installation. (The binary also uses MKL, only statically linked.) One thing to try initially is to look at an SCF job that was supposedly run multithreaded and check the line that reports how many threads the program thinks it has.
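You can grep for that line, for instance like this (output.dat stands in for your actual output file name):

grep -i threads output.dat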

Next, please post the output of your cmake -H. -Bobjdir and the file (within your objdir) psi4_core-prefix/src/psi4_core-build/src/CMakeFiles/core.dir/link.txt, which contains the final link line and should have the OpenMP flag and MKL libraries within it.
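For instance, from your source directory, something along these lines (assuming the default objdir layout described above):

cmake -H. -Bobjdir > cmake_stdout.log 2> cmake_stderr.log
cat objdir/psi4_core-prefix/src/psi4_core-build/src/CMakeFiles/core.dir/link.txt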

MKL is the BLAS/LAPACK that the developers use and recommend (not that we regularly try alternatives). The options for building the binary are here, but except for the compiler/python specification, many of those options are really not appropriate for an ordinary build.

Thank you, Lori, for your help.

Using the locally built version, the SCF output says "1 Threads" if I set OMP_NUM_THREADS=4 and MKL_NUM_THREADS=4, but the same job says "4 Threads" if I use the -n 4 option. However, both runs used only 1 CPU, judging by the wall times and by watching the top command.
Both ways of setting the number of OMP threads work properly for the binary version, which uses all 4 CPUs.

I copy below the outputs and files you requested. I did not see anything that would explain why OpenMP does not work or why the built code runs 4 times slower than the binary.
Thank you again!

Standard output of cmake -H. -Bobjdir

-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting (unspecified) option BUILD_SHARED_LIBS: OFF
-- Setting (unspecified) option ENABLE_OPENMP: ON
-- Setting (unspecified) option ENABLE_AUTO_BLAS: ON
-- Setting (unspecified) option ENABLE_AUTO_LAPACK: ON
-- Setting (unspecified) option ENABLE_XHOST: ON
-- Performing Test CMAKE_C_FLAGS [-xHost] - Failed
-- Performing Test CMAKE_C_FLAGS [-march=native] - Success, Appending
-- Performing Test CMAKE_CXX_FLAGS [-xHost] - Failed
-- Performing Test CMAKE_CXX_FLAGS [-march=native] - Success, Appending
-- Setting (unspecified) option ENABLE_CODE_COVERAGE: OFF
-- Setting (unspecified) option ENABLE_BOUNDS_CHECK: OFF
-- Setting (unspecified) option ENABLE_ASAN: OFF
-- Setting (unspecified) option ENABLE_TSAN: OFF
-- Setting (unspecified) option ENABLE_UBSAN: OFF
-- Setting (unspecified) option MAX_AM_ERI: 5
-- Setting (unspecified) option CMAKE_BUILD_TYPE: Release
-- Setting (unspecified) option FC_SYMBOL: 2
-- Setting (unspecified) option BUILD_FPIC: ON
-- Setting (unspecified) option CMAKE_INSTALL_LIBDIR: lib
-- Setting (unspecified) option PYMOD_INSTALL_LIBDIR: /
-- Setting (unspecified) option ENABLE_GENERIC: OFF
-- Setting (unspecified) option CMAKE_INSTALL_MESSAGE: LAZY
-- Setting (unspecified) option PSI4_CXX_STANDARD: 11
-- Found PythonInterp: /usr/bin/python (found version "2.7.12")
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython2.7.so (found suitable version "2.7.12", minimum required is "2")
-- Suitable pybind11 could not be located, building one instead.
-- Suitable libint could not be located, building one instead.
-- Suitable libefp could not be located, building one instead.
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp
-- Math lib search order is MKL;ESSL;ATLAS;ACML;SYSTEM_NATIVE
-- You can select a specific type by defining for instance -D BLAS_TYPE=ATLAS or -D LAPACK_TYPE=ACML
-- or by redefining MATH_LIB_SEARCH_ORDER
-- Found BLAS: MKL (/export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_gf_lp64.so;/export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_gnu_thread.so;/export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_core.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libm.so)
-- Found LAPACK: MKL (/export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_lapack95_lp64.a;/export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_gf_lp64.so)
-- No Doxygen, no docs.
-- No Sphinx, no docs. Pre-built documentation at http://psicode.org/psi4manual/master/index.html
-- No LaTeX (incl. pdflatex), no PDF docs. Pre-built documentation at http://psicode.org/psi4manual/master/index.html
-- Adding test cases: Psi4
-- Found CFOUR: /export/prog/cfour_publ/bin/xcfour
-- Adding test cases: Psi4 + CFOUR
-- Adding test cases: Psi4 + libefp
-- Configuring done
-- Generating done
-- Build files have been written to: /export/home/nape/programs/psi4_build/psi4-master_161130b/objdir

Error output of cmake -H. -Bobjdir

-- BLAS will be searched for based on MKLROOT=/export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl
-- LAPACK will be searched for based on MKLROOT=/export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl

psi4_core-prefix/src/psi4_core-build/src/CMakeFiles/core.dir/link.txt file:
/usr/bin/c++ -fPIC -march=native -fopenmp -O3 -DNDEBUG -shared -Wl,-soname,core.so -o core.so CMakeFiles/core.dir/export_psio.cc.o CMakeFiles/core.dir/export_mints.cc.o CMakeFiles/core.dir/export_fock.cc.o CMakeFiles/core.dir/export_functional.cc.o CMakeFiles/core.dir/export_oeprop.cc.o CMakeFiles/core.dir/export_plugins.cc.o CMakeFiles/core.dir/export_blas_lapack.cc.o CMakeFiles/core.dir/export_benchmarks.cc.o CMakeFiles/core.dir/export_efp.cc.o CMakeFiles/core.dir/export_cubeprop.cc.o CMakeFiles/core.dir/export_misc.cc.o CMakeFiles/core.dir/create_new_plugin.cc.o CMakeFiles/core.dir/read_options.cc.o CMakeFiles/versioned_code.dir/core.cc.o psi4/adc/libadc.a psi4/ccdensity/libccdensity.a psi4/ccenergy/libccenergy.a psi4/cceom/libcceom.a psi4/cchbar/libcchbar.a psi4/cclambda/libcclambda.a psi4/ccresponse/libccresponse.a psi4/ccsort/libccsort.a psi4/cctransort/libcctransort.a psi4/cctriples/libcctriples.a psi4/dcft/libdcft.a psi4/detci/libdetci.a psi4/dfmp2/libdfmp2.a psi4/dfocc/libdfocc.a psi4/efp_interface/libefp_interface.a psi4/findif/libfindif.a psi4/fisapt/libfisapt.a psi4/fnocc/libfnocc.a psi4/mcscf/libmcscf.a psi4/mrcc/libmrcc.a psi4/occ/libocc.a psi4/optking/liboptking.a psi4/psimrcc/libpsimrcc.a psi4/sapt/libsapt.a psi4/scfgrad/libscfgrad.a psi4/thermo/libthermo.a psi4/transqt2/libtransqt2.a psi4/gdma_interface/libgdma_interface.a psi4/dmrg/libdmrg.a -lpython2.7 psi4/libthce/libthce.a psi4/libcubeprop/libcubeprop.a psi4/libmoinfo/libmoinfo.a psi4/libsapt_solver/libsapt_solver.a psi4/libscf_solver/libscf_solver.a psi4/libdiis/libdiis.a psi4/libdpd/libdpd.a psi4/lib3index/lib3index.a psi4/libfock/libfock.a psi4/lib3index/lib3index.a psi4/libfock/libfock.a psi4/libfunctional/libfunctional.a psi4/libdisp/libdisp.a psi4/libplugin/libplugin.a -ldl psi4/libmints/libmints.a psi4/libtrans/libtrans.a psi4/libqt/libqt.a psi4/libefp_solver/libefp_solver.a psi4/libmints/libmints.a psi4/libtrans/libtrans.a psi4/libqt/libqt.a psi4/libefp_solver/libefp_solver.a psi4/libiwl/libiwl.a psi4/libpsi4util/libpsi4util.a /export/home/nape/programs/psi4_build/psi4-master_161130b/objdir/stage/export/home/nape/programs/psi4_build/psi4_bin_161130b/external/lib/libderiv.a /export/home/nape/programs/psi4_build/psi4-master_161130b/objdir/stage/export/home/nape/programs/psi4_build/psi4_bin_161130b/external/lib/libint.a psi4/libpsio/libpsio.a psi4/libciomr/libciomr.a psi4/libparallel/libparallel.a psi4/liboptions/liboptions.a psi4/libfilesystem/libfilesystem.a /export/home/nape/programs/psi4_build/psi4-master_161130b/objdir/stage/export/home/nape/programs/psi4_build/psi4_bin_161130b/external/lib/libefp.a -llapack -lblas -Wl,--start-group -Wl,-Bstatic -lmkl_lapack95_lp64 -Wl,-Bdynamic -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread -lm -Wl,--end-group -Wl,-rpath,/export/home/nape/programs/psi4_build/psi4_bin_161130b/lib:/export/home/nape/programs/psi4_build/psi4-master_161130b/objdir/stage//export/home/nape/programs/psi4_build/psi4_bin_161130b/lib:/usr/lib/x86_64-linux-gnu

What is your /usr/bin/c++? Is that clang on Linux, or gcc/g++?

I suspect the culprit is external/lib/libefp.a -llapack -lblas in the link.txt file. Unless you’re curious, I won’t go into the reasoning, but just try editing link.txt and removing the -llapack -lblas. Save and reissue make in your objdir, then test parallelism on the resulting build. If I’m right about this, it’s the second case this week, so I’ll definitely fix it.
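In sketch form (editing by hand works just as well; the sed pattern assumes the flags appear exactly as -llapack -lblas in your link.txt):

cd objdir
sed -i 's/ -llapack -lblas//' psi4_core-prefix/src/psi4_core-build/src/CMakeFiles/core.dir/link.txt
make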

Oh, and if you happen to have Intel >=2016 compilers, specifying those may help.
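That would look something like the following (icc/icpc are the usual Intel compiler driver names; adjust if your setup uses different ones):

cmake -H. -Bobjdir -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc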

It is g++; is that ok?

I removed the -llapack -lblas options and reissued make and make install in the objdir.
make did not do much: it just relinked core.so and installed it and metadata.py without any error message.

If I try to run the code built without -llapack -lblas, I get this error at runtime:
Intel MKL FATAL ERROR: Cannot load libmkl_mc3.so or libmkl_def.so

Can you suggest anything else to try? I did not find any post here in the forum about the first similar case you mentioned. Did I miss something?

I also tried to compile with Intel 2016 (restarting from the beginning with cmake…), but I got this error multiple times during compilation with make:
/usr/include/c++/5/bits/hashtable.h(1565): error: no instance of overloaded function “std::forward” matches the argument list
argument types are: (void *)
this->_M_allocate_node(std::forward<_Args>(__args)...);

According to internet wisdom, this might be related to a problem in Intel 16.0.2, which is what we have here. I’ll try to ask the admin to update the compiler, but this seems unrelated to the first (two) problems.

Yes, g++ is ok. I don’t usually mix gcc compilers and MKL math, but I know it can be done successfully. Now the big question: what does c++ --version say? Psi4 needs full C++11 compliance, so that means >= 4.9 for gcc, or Intel compilers.

The quick make after editing link.txt is expected. You can try LD_PRELOAD=/path/to/libmkl_mc3.so:/path/to/libmkl_def.so psi4 input.in; that might help the FATAL ERROR.
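With the MKLROOT your cmake run reported, that would look roughly like this (MKL64 is just a shorthand variable here, and input.in stands in for your actual input file):

MKL64=/export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl/lib/intel64
LD_PRELOAD=$MKL64/libmkl_mc3.so:$MKL64/libmkl_def.so psi4 input.in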

The other instance of the lapack/blas/libefp issue came in via email and involved ACML math libs.

In case you are dealing with old gcc, this may be helpful.

There have been substantial changes to Psi4 since I last built with Intel 2016.2, but I can confirm that 2016.3 works nicely, and people are using 2017 as well.

It is 5.4.0, so this should be fine.

If I try this, I get another error at runtime:

/usr/bin/python: symbol lookup error: /export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_def.so: undefined symbol: mkl_dft_fft_fix_twiddle_table_32f

I do not understand this new error message, but I suspect it is not the one that actually needs to be solved.

Thank you again. Please let me know if there is anything else I can try.

Apparently, twiddle-table errors are a thing. Does your MKL installation have a libmkl_rt.so? You can try LD_PRELOADing or linking to that instead. It’s supposed to replace the separate threading/interface/math layers of the start-group intellib intellib2 intellib3 end-group sequence. Other than that, I don’t know enough about c. 2013 MKL to help much.
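For the LD_PRELOAD variant, that would be roughly the following (assuming libmkl_rt.so sits in the same intel64 directory as the other MKL libraries; input.in is again a placeholder):

LD_PRELOAD=/export/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_rt.so psi4 input.in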

Thank you very much for all the help, Lori. LD_PRELOADing libmkl_rt.so worked, and the code seems to run properly now.
I would be interested in a more permanent fix for this problem, other than deleting -llapack -lblas.

Whew, glad something worked. I’ll let you know when the libefp fix goes in, after which you won’t have to delete the -llapack -lblas. We’re thinking of generally linking to libmkl_rt.so for common (mostly shared, as opposed to the conda) builds, so as not to encourage bad habits (which LD_PRELOAD and LD_LIBRARY_PATH rather are).

Now that psi4/psi4#591 is merged into Psi4, I think your lapack/blas/LD_PRELOAD problems should be solved. If you try it, let me know whether it works or fails. (And you’ll want a full recompile; even your CMAKE_INSTALL_PREFIX filesystem location should be empty.)
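A fully fresh rebuild would go roughly like this (the install prefix below is a placeholder for whatever you passed as CMAKE_INSTALL_PREFIX):

rm -rf objdir
rm -rf /path/to/install_prefix
cmake -H. -Bobjdir
cd objdir
make
make install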