SAPT2+ Calculation: Segmentation fault during Exch12 computation

@jgonthier, int overflow was what Daniel was hypothesizing the other day. If you want to try patching it, it’d be great, but there are probably multiple places. I remember Ed H. fixing this once before for adenine-thymine, but I guess that was only a partial fix.

@loriab I’ll have a look. Finding every instance by reading the code is impractical, but maybe there is a way to replace all int declarations in the code with a custom class that detects possible overflows, and then correct the original code…
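To make the idea of an overflow-detecting int replacement concrete, here is a minimal sketch. `CheckedInt` and `checked_int_demo` are hypothetical names invented for illustration; they are not part of Psi4 or of the patch discussed in this thread. The sketch assumes int is narrower than 64 bits, as on the usual x86_64 Linux toolchains.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Assumption: int is narrower than 64 bits, so widening makes * and + safe.
static_assert(sizeof(int) < sizeof(std::int64_t), "sketch assumes a 32-bit int");

// Hypothetical drop-in for int (not Psi4 code): throws instead of
// silently overflowing on multiplication and addition.
class CheckedInt {
  public:
    CheckedInt(int v = 0) : v_(v) {}
    int value() const { return v_; }

    friend CheckedInt operator*(CheckedInt a, CheckedInt b) {
        // Compute in 64 bits, then check that the result fits in an int.
        return CheckedInt::narrow(static_cast<std::int64_t>(a.v_) * b.v_, "multiplication");
    }
    friend CheckedInt operator+(CheckedInt a, CheckedInt b) {
        return CheckedInt::narrow(static_cast<std::int64_t>(a.v_) + b.v_, "addition");
    }

  private:
    static CheckedInt narrow(std::int64_t wide, const char* op) {
        if (wide > std::numeric_limits<int>::max() ||
            wide < std::numeric_limits<int>::min()) {
            throw std::overflow_error(std::string("int overflow in ") + op);
        }
        return CheckedInt(static_cast<int>(wide));
    }

    int v_;
};

// Usage sketch: a small product passes, a large one is caught.
void checked_int_demo() {
    assert((CheckedInt(1000) * CheckedInt(1000)).value() == 1000000);

    bool caught = false;
    try {
        // 50000 * 50000 = 2.5e9, which exceeds the 32-bit int maximum.
        CheckedInt r = CheckedInt(50000) * CheckedInt(50000);
        (void)r;
    } catch (const std::overflow_error&) {
        caught = true;
    }
    assert(caught);
}
```

A wrapper like this only helps locate the offending expressions during a debugging run; the real fix is still to change the types in the original code.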

@polyx15 Do you know which compiler was used to build the code? I think part of the error may be compiler-dependent.

@jgonthier You can always replace with size_t, we’re moving that way in most of the code anyway.
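On the size_t suggestion, one detail is easy to miss: changing only the type of the variable that receives the result is not enough, because a product of two int operands is still evaluated in int before it is converted. The sketch below illustrates this with hypothetical helper names (`flat_index`, `count_exceeds_int`) that do not come from the Psi4 source, and it assumes a 64-bit platform where size_t is 64 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper (not Psi4 code): offset of element (row, col) in a
// row-major nrow x ncol array. Casting the operands first forces the
// multiplication itself to be done in size_t rather than in int.
std::size_t flat_index(int row, int col, int nrow, int ncol) {
    assert(row >= 0 && row < nrow && col >= 0 && col < ncol);
    return static_cast<std::size_t>(row) * static_cast<std::size_t>(ncol) +
           static_cast<std::size_t>(col);
}

// True if nrow * ncol does not fit in an int, i.e. if the expression
// `int n = nrow * ncol;` would overflow. The check itself is done in
// 64 bits so that it never overflows.
bool count_exceeds_int(int nrow, int ncol) {
    const std::int64_t n = static_cast<std::int64_t>(nrow) * ncol;
    return n > std::numeric_limits<int>::max();
}
```

For example, a 50000 x 50000 array has 2.5e9 elements, so `count_exceeds_int(50000, 50000)` is true, while `flat_index` still returns the correct offset for its last element.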

Submitted a pull request with a patch that could hopefully solve the problem. Unfortunately, I cannot actually test the code on a large enough system.

@polyx15, are you running from a compiled Psi4 so that you can apply the patch and test it? Or are you running a conda binary, in which case, I can compile one with the patch for you to test. I’m afraid we don’t have ready access to hardware with enough memory to run your particular job.

I am running a compiled Psi4, compiled with GCC 5.2.0 (since @jgonthier asked).

GCC 5.2 should be fine (though I’m not familiar with any of its long int peculiarities). Can we expect that you are trying out the patch and will report back?

I will try out the patch and report back; it will take a few days, though.

I was using GCC 5.4 for testing, and I am certain the code would have failed with a system your size, because some multiplications would have overflowed.
I just hope I caught all of them, so maybe run with the debug 1 option so that we know where it fails if I missed something.

Hi loriab,

I am responsible for installing software on the Euler cluster, where polyx15 runs his calculations. I am not too familiar with github yet.

To get the code that includes the patch mentioned in this discussion, should I just “git clone” the master branch of psi4, or a different repository? Once I have the code, I can most likely get the installation ready the same day, so polyx15 can start testing as soon as possible.

Best regards

Sam

Hi @samfux84,

So you can either

  • git clone https://github.com/jgonthier/psi4.git followed by git checkout SAPT_patch to get on the patch branch.

or, if you have a fairly recent psi4 source lying around (from github.com/psi4/psi4), you can

  • just edit the source to apply these changes since the changeset isn’t too big.

Thanks for helping with this testing,
Lori

Hi Lori,

Thank you for the GitHub instructions to get the correct code with the patch. I have built a new version of psi4, but I noticed a substantial change in library size compared to my 1.1a1 build.

The core.so library grew from ca. 48 MB to about 596 MB.

Older 1.1a1 build:

[sfux@develop01 psi4]$ ls -ltr
total 46940
-rw-r--r-- 1 apps ID-HPC-APPS     1133 Jan 25 09:29 metadata.py
drwxr-xr-x 7 apps ID-HPC-APPS     4096 Jan 25 09:29 driver
-rw-r--r-- 1 apps ID-HPC-APPS     2800 Jan 25 09:29 header.py
-rw-r--r-- 1 apps ID-HPC-APPS     1727 Jan 25 09:29 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS     2706 Jan 25 09:29 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS     1505 Jan 25 09:29 config.py
-rwxr-xr-x 1 apps ID-HPC-APPS 47849225 Jan 25 09:29 core.so

Build of newest version + patch:

[apps@develop01 psi4]$ ls -ltr
total 583832
-rwxr-xr-x 1 apps ID-HPC-APPS 595471556 Feb 17 15:47 core.so
-rw-r--r-- 1 apps ID-HPC-APPS      2721 Feb 17 15:47 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS      1727 Feb 17 15:47 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS      2800 Feb 17 15:47 header.py
drwxr-xr-x 7 apps ID-HPC-APPS      4096 Feb 17 15:47 driver
-rw-r--r-- 1 apps ID-HPC-APPS      1147 Feb 17 15:47 metadata.py

Does this make sense? Have there been substantial changes to the code that can explain the growth of core.so?

Best regards

Sam

Hi,

In my release build of the SAPT_patch branch, with GCC 5.4 and OpenBLAS, my core.so is about 66 MB. It does seem weird that your core.so is so large. Which options did you use for building?

Best regards

Jerome

Hi Jerome,

Thank you for your feedback. It seems that I did something wrong when setting the CMake variables. I set quite a lot of them, so I am not listing them all here.

It is already helpful to know what the size of the library should be.

I think I found the culprit: BUILD_SHARED_LIBS was set to OFF. I will retry compiling the code on Monday.

Thank you for your help.

Best regards

Sam

Yes, I think BUILD_SHARED_LIBS=OFF will build static libraries, which are larger… Maybe @loriab can confirm?

Yes, BUILD_SHARED_LIBS would explain it. Both settings are equally valid and will give the same results. With ON, you’ll get several other libraries (libefp.so, libint.so, and libderiv.so) alongside a smaller core.so. With OFF, the contents of those libraries (including the integrals, which can be hefty at large AM) will all be in core.so.

Hi,

I am using the following CMake options, but the core.so library is still more than 400 MB:

ccmake -DBUILD_SHARED_LIBS="ON" -DCMAKE_BUILD_TYPE="RelWithDebInfo" -DCMAKE_CXX_FLAGS_RELWITHDEBINFO='-O2 -g -DNDEBUG -ftree-vectorize -march=corei7-avx -mavx' -DCMAKE_C_FLAGS_RELWITHDEBINFO='-O2 -g -DNDEBUG -ftree-vectorize -march=corei7-avx -mavx' -DCMAKE_INSTALL_PREFIX="/cluster/apps/psi4/1.1a1_p1/x86_64" -DCMAKE_VERBOSE_MAKEFILE="ON" -DMAX_AM_ERI="7" -DCMAKE_INSTALL_OLDINCLUDEDIR="" -DPYTHON_LIBRARY="/cluster/apps/python/2.7.12/x86_64/lib64/libpython2.7.so" -DPYTHON_LIBRARY_DEBUG="/cluster/apps/python/2.7.12/x86_64/lib64/libpython2.7.so" …

I am sure that there is a stupid error on my side, but I could not yet identify it. I will continue building psi4 tomorrow.

Best regards

Sam

It seems that I cannot post more than 3 replies to this topic, so I am adding my update here:

Hi,

Switching to -DCMAKE_BUILD_TYPE=Release solved the problem. Now the libraries are below 100 MB.

Thank you and best regards

Sam

That -DCMAKE_BUILD_TYPE="RelWithDebInfo" looks like the very likely cause. I’ve never tried it myself, but CMake sets up flags to include full debugging symbols in libraries built with that build type. You probably want -DCMAKE_BUILD_TYPE=Release (the default if you don’t include that option at all).

You can also save yourself some build time by compiling libint (https://github.com/psi4/libint) on its own, installing it, and then adding -DCMAKE_PREFIX_PATH=/path/to/libint/install to the psi4 configuration. Then you won’t be rebuilding libint every time you experiment with a psi4 build.

@loriab @jgonthier: Good news, I am currently re-running the calculation with the patched version. It has already finished Exch12_k11u_5 and is still running, so the patch seems to fix the problem. Thank you for your help.

@polyx15: Thank you for letting us know! We would appreciate it if you could
tell us whether the computation finishes without problems. These integer
overflows are difficult to locate.