I am trying to run a SAPT2+/aDZ calculation (the "silver standard" according to J. Chem. Phys. 2014, 140, 094106) on a fairly large system (162 electrons, 920 basis functions; I would rather not go into detail here about which system it is). The calculation runs fine through the Exch11 contribution; then the job crashes and I get the following error:
Exited with exit code 139.
6243 Segmentation fault (core dumped)
I have run this type of calculation on several smaller, similar systems with no problems. The next contributing term would be Exch12, so I assume the crash happens while that term is being computed. The estimated memory usage is “127060.6 MB”, and I am giving the job 140000 MB of memory. I also tried giving the job significantly more memory (more than double that), but I had the same problem. Does anyone have an idea what the problem might be?
If you need more information, please let me know. I could also send the corresponding input by e-mail on request, for somebody to test it.
Thank you in advance for your help.
How much memory does your computer have? One doesn’t usually give the memory keyword more than 95–98% of the computer system’s memory.
The computer I am running the job on has 256 GB of memory.
Can you set the option debug 1 and post the SAPT module printout to a gist?
I set the option “debug 1” and re-ran the calculation. Here is the output of the SAPT module: https://gist.github.com/anonymous/c54ab98d64fb028833e7e7565df3c0b2
@dgasmith, @loriab: looking at the code for the term that seems to fail (Exch12_k11u_4), I see a potential integer overflow in lines 1397, 1398, 1401 and 1402 of exch12.cc. The product aoccA_ * nvirA_ * aoccA_ * nvirA_ exceeds the range of a 32-bit signed integer for a system of this size, unless I am mistaken… Line 1402 in particular has no conversions to (long int).
If you agree that this could be the problem, I’ll patch it.
@jgonthier, int overflow was what Daniel was hypothesizing the other day. If you want to try patching it, it’d be great, but there are probably multiple places. I remember Ed H. fixing this once before for adenine-thymine, but I guess that was only a partial fix.
@loriab I’ll have a look. Finding every such case by reading the code is impossible, but maybe there is a way to replace all int declarations in the code with a custom class that detects possible overflows, and then correct the original code…
@polyx15 Do you know which compiler was used to build the code? I think part of the error may be compiler-dependent.
@jgonthier You can always replace with size_t, we’re moving that way in most of the code anyway.
Submitted a pull request with a patch that could hopefully solve the problem. Unfortunately, I cannot actually test the code on a large enough system.
@polyx15, are you running from a compiled Psi4 so that you can apply the patch and test it? Or are you running a conda binary, in which case, I can compile one with the patch for you to test. I’m afraid we don’t have ready access to hardware with enough memory to run your particular job.
I am running a compiled Psi4, compiled with GCC 5.2.0 (since @jgonthier asked).
GCC 5.2 should be fine (though I’m not familiar with any of its long int peculiarities). Can we expect that you are trying out the patch and will report back?
I will try out the patch and report back; it will take a few days, though.
I was using GCC 5.4 for testing, and I am certain the code would have failed with a system your size: some multiplications would have overflowed.
I just hope I caught all of them, so maybe run with the debug 1 option so that we can see where it fails if I missed something.
I am responsible for installing software on the Euler cluster, where polyx15 runs his calculations. I am not too familiar with github yet.
To get the code that includes the patch mentioned in this discussion, should I just “git clone” the master branch of psi4, or another particular repository? Once I have the code I can most likely get the installation ready the same day, so that polyx15 can start testing as soon as possible.
So you can either
- git clone https://github.com/jgonthier/psi4.git followed by git checkout SAPT_patch to get on the patch branch, or,
- if you have a fairly recent psi4 source lying around (from github.com/psi4/psi4), just edit the source to apply these changes, since the changeset isn’t too big.
Thanks for helping with this testing.
Thank you for the GitHub instructions for getting the correct code with the patch. I have built a new version of psi4, but I noticed a substantial change in library size compared to my 1.1a1 build.
The core.so library grew from ca. 48 MB to about 596 MB.
Older 1.1a1 build:
[sfux@develop01 psi4]$ ls -ltr
-rw-r--r-- 1 apps ID-HPC-APPS 1133 Jan 25 09:29 metadata.py
drwxr-xr-x 7 apps ID-HPC-APPS 4096 Jan 25 09:29 driver
-rw-r--r-- 1 apps ID-HPC-APPS 2800 Jan 25 09:29 header.py
-rw-r--r-- 1 apps ID-HPC-APPS 1727 Jan 25 09:29 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS 2706 Jan 25 09:29 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS 1505 Jan 25 09:29 config.py
-rwxr-xr-x 1 apps ID-HPC-APPS 47849225 Jan 25 09:29 core.so
Build of newest version + patch:
[apps@develop01 psi4]$ ls -ltr
-rwxr-xr-x 1 apps ID-HPC-APPS 595471556 Feb 17 15:47 core.so
-rw-r--r-- 1 apps ID-HPC-APPS 2721 Feb 17 15:47 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS 1727 Feb 17 15:47 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS 2800 Feb 17 15:47 header.py
drwxr-xr-x 7 apps ID-HPC-APPS 4096 Feb 17 15:47 driver
-rw-r--r-- 1 apps ID-HPC-APPS 1147 Feb 17 15:47 metadata.py
Does this make sense? Have there been substantial changes to the code that can explain the growth of core.so?
In my release build of the SAPT_patch branch, with GCC 5.4 and OpenBLAS, my core.so is about 66 MB. It does seem weird that your core.so is so large. Which options did you use for building?
Thank you for your feedback. It seems that I did something wrong when setting the CMake variables. I set quite a lot of them, so I am not listing them all here.
It is already helpful to know what the size of the library should be.
I think I found the culprit: BUILD_SHARED_LIBS was set to OFF. I will retry compiling the code on Monday.
Thank you for your help.