I am trying to run a SAPT2+/aDZ calculation (the "Silver Standard" according to J. Chem. Phys. 2014, 140, 094106) on a fairly large system (162 electrons, 920 basis functions; I would rather not go into detail here about what system it is exactly). The calculation runs fine through the Exch11 contribution; afterwards the job crashes with the following error:
```
Exited with exit code 139.
6243 Segmentation fault (core dumped)
```
I have run this type of calculation on several smaller (similar) systems with no problems. The next contributing term would be Exch12, so I assume the job crashes while trying to compute that. The estimated memory usage is 127060.6 MB, and I am giving the job 140000 MB of memory. I also tried giving the job significantly more memory (more than double that), but I had the same problem. Does anyone have an idea what the problem might be?
If you need more information, please let me know. I could also send the corresponding input by email on request for somebody to test.
@dgasmith, @loriab: looking at the code for the term that seems to fail (Exch12_k11u_4), I see a potential integer overflow in lines 1397, 1398, 1401 and 1402 of exch12.cc. The multiplication `aoccA_ * nvirA_ * aoccA_ * nvirA_` exceeds a 32-bit signed integer with the dimensions of this system, unless I am mistaken… Line 1402 in particular has no conversions to `long int`.
If you agree that this could be the problem, I’ll patch it.
@jgonthier, int overflow was what Daniel was hypothesizing the other day. If you want to try patching it, it’d be great, but there are probably multiple places. I remember Ed H. fixing this once before for adenine-thymine, but I guess that was only a partial fix.
@loriab I’ll have a look. Finding every instance by reading the code is impractical, but maybe there is a way to replace all `int` declarations in the code with a custom class that detects possible overflows, and then correct the original code…
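A minimal sketch of what such a class could look like (purely illustrative, not part of Psi4): a thin wrapper that temporarily replaces `int` in a suspect routine so every multiplication is range-checked at runtime.

```cpp
#include <climits>
#include <cstdio>
#include <cstdlib>

// Hypothetical checked-integer wrapper: any product that leaves the
// 32-bit signed range aborts with a diagnostic instead of silently
// overflowing.
struct CheckedInt {
    int v;
    CheckedInt(int x = 0) : v(x) {}
    operator int() const { return v; }
};

inline CheckedInt operator*(CheckedInt a, CheckedInt b) {
    long long p = static_cast<long long>(a.v) * b.v;
    if (p > INT_MAX || p < INT_MIN) {
        std::fprintf(stderr, "integer overflow: %d * %d\n", a.v, b.v);
        std::abort();
    }
    return CheckedInt(static_cast<int>(p));
}
```

GCC and Clang can catch much the same thing without code changes via `-fsanitize=signed-integer-overflow` (UBSan) or `-ftrapv`, which may be easier than touching every declaration.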
@polyx15 Do you know which compiler was used to build the code? I think part of the error may be compiler-dependent.
@polyx15, are you running from a compiled Psi4 so that you can apply the patch and test it? Or are you running a conda binary, in which case, I can compile one with the patch for you to test. I’m afraid we don’t have ready access to hardware with enough memory to run your particular job.
GCC 5.2 should be fine (though I’m not familiar with any of its `long int` peculiarities). Can we count on you to try out the patch and report back?
I was using GCC 5.4 for testing, and I am certain the code would have failed with a system your size; some of the multiplications would have overflowed.
I just hope I caught all of them, so maybe run with the `debug 1` option; that way we can see where it fails if I missed something.
I am responsible for installing software on the Euler cluster, where polyx15 runs his calculations. I am not too familiar with GitHub yet.
To get the code that includes the patch mentioned in this discussion, should I just `git clone` the master branch of psi4, or another particular repository? Once I have the code I can most likely get the installation ready the same day, so polyx15 can start testing as soon as possible.
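In case it helps, the usual pattern looks roughly like this. The fork and branch names below are assumptions on my part; the developers can confirm where the patch actually lives (it may also arrive as a pull request against psi4/psi4 master):

```shell
# Names below are illustrative; confirm the actual fork/branch first.
git clone https://github.com/psi4/psi4.git
cd psi4
# If the patch lives on a contributor's fork:
git remote add jgonthier https://github.com/jgonthier/psi4.git
git fetch jgonthier
git checkout -b SAPT_patch jgonthier/SAPT_patch
```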
Thank you for the GitHub instructions to get the correct code with the patch. I have built a new version of psi4, but I noticed a quite substantial change in library size compared to my 1.1a1 build: core.so grew from ca. 48 MB to about 596 MB.
Older 1.1a1 build:
```
[sfux@develop01 psi4]$ ls -ltr
total 46940
-rw-r--r-- 1 apps ID-HPC-APPS     1133 Jan 25 09:29 metadata.py
drwxr-xr-x 7 apps ID-HPC-APPS     4096 Jan 25 09:29 driver
-rw-r--r-- 1 apps ID-HPC-APPS     2800 Jan 25 09:29 header.py
-rw-r--r-- 1 apps ID-HPC-APPS     1727 Jan 25 09:29 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS     2706 Jan 25 09:29 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS     1505 Jan 25 09:29 config.py
-rwxr-xr-x 1 apps ID-HPC-APPS 47849225 Jan 25 09:29 core.so
```
Build of newest version + patch:
```
[apps@develop01 psi4]$ ls -ltr
total 583832
-rwxr-xr-x 1 apps ID-HPC-APPS 595471556 Feb 17 15:47 core.so
-rw-r--r-- 1 apps ID-HPC-APPS      2721 Feb 17 15:47 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS      1727 Feb 17 15:47 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS      2800 Feb 17 15:47 header.py
drwxr-xr-x 7 apps ID-HPC-APPS      4096 Feb 17 15:47 driver
-rw-r--r-- 1 apps ID-HPC-APPS      1147 Feb 17 15:47 metadata.py
```
Does this make sense? Have there been substantial changes to the code that could explain the growth of core.so?
In my release build of the SAPT_patch branch, with GCC 5.4 and OpenBLAS, my core.so is about 66 MB. It does seem weird that yours is so large. Which options did you use for building?
Thank you for your feedback. It seems I did something wrong when setting the CMake variables. I set quite a lot of them, so I will not list them all here in my comment. It is already helpful to know what the size of the library should be.
I think I found the culprit: `BUILD_SHARED_LIBS` was set to OFF. I will retry compiling the code on Monday.
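That would fit: with `BUILD_SHARED_LIBS=OFF` the component libraries are built as static archives and linked into core.so wholesale, which could account for a library of that size. A minimal configure step with the flag switched back on might look like this (the paths and option set are placeholders, not a complete Euler recipe):

```shell
# Illustrative configuration only; site-specific cache variables will differ.
mkdir objdir && cd objdir
cmake .. \
      -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_SHARED_LIBS=ON \
      -DCMAKE_INSTALL_PREFIX=/path/to/psi4/install
make -j8
```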