SAPT2+ Calculation: Segmentation fault during Exch12 computation

polyx15 · February 8, 2017, 11:49am

Dear all,

I am trying to run a SAPT2+/aDZ (Silver Standard according to J. Chem. Phys. 2014, 140, 094106) calculation on a fairly large system (162 electrons, 920 basis functions; I would rather not go into detail here about what system it is exactly). The calculation runs fine up to and including the Exch11 contribution. After that the job crashes and I get the following error:

Exited with exit code 139.
6243 Segmentation fault (core dumped)

I have run this type of calculation on several smaller (similar) systems with no problems. The next contributing term would be Exch12, so I assume the job crashes while trying to compute it. The estimated memory usage is “127060.6 MB”, and I am giving the job 140000 MB of memory. I also tried giving the job significantly more memory (more than double that), but I had the same problem. Does anyone have an idea what the problem might be?

If you need more information, please let me know. I could also send the corresponding input by email on request, for somebody to test it.

Thank you already in advance for your help.

Best,
Robert

loriab · February 8, 2017, 12:49pm

How much memory does your computer have? One doesn’t usually give the memory keyword more than 95–98% of the computer system’s memory.

polyx15 · February 8, 2017, 7:28pm

The computer I am running the job on has 256 GB of memory.

dgasmith · February 8, 2017, 7:39pm

Can you set the option debug 1 and post the SAPT module print out to a gist?

polyx15 · February 12, 2017, 5:05pm

I set the option “debug 1” and re-ran the calculation. Here is the output of the SAPT module: https://gist.github.com/anonymous/c54ab98d64fb028833e7e7565df3c0b2

jgonthier · February 12, 2017, 7:55pm

Hi,

@dgasmith, @loriab: looking at the code for the term that seems to fail (Exch12_k11u_4), I see a potential integer overflow in lines 1397, 1398, 1401 and 1402 of exch12.cc. With the dimensions of this system, the product aoccA_ * nvirA_ * aoccA_ * nvirA_ exceeds the range of a 32-bit signed integer, unless I am mistaken… Line 1402 in particular has no conversions to (long int).

If you agree that this could be the problem, I’ll patch it.
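
As a minimal sketch of the suspected overflow (not the actual exch12.cc code): the dimensions used in the note below are hypothetical, the helper names product64 and overflows_int32 are made up for illustration, and int is assumed to be 32 bits wide. Signed overflow is undefined behavior in C++, so the sketch never multiplies in int. It widens the first operand to 64 bits, which is what a (long int) cast on the first factor achieves on typical 64-bit Linux.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Product of four array dimensions, computed in 64-bit arithmetic.
// Casting the first operand widens the whole left-to-right chain, so no
// intermediate result is ever computed as a 32-bit int.
// Assumption: the dimensions are small enough that the product fits in 64 bits.
std::int64_t product64(int a, int b, int c, int d) {
    assert(a >= 0 && b >= 0 && c >= 0 && d >= 0);
    return static_cast<std::int64_t>(a) * b * c * d;
}

// True if the same product would not fit in a 32-bit signed integer.
bool overflows_int32(int a, int b, int c, int d) {
    return product64(a, b, c, d) > std::numeric_limits<std::int32_t>::max();
}
```

With hypothetical values of 60 active occupied and 800 virtual orbitals, product64(60, 800, 60, 800) is 2,304,000,000, which is above the 32-bit limit of 2,147,483,647, so overflows_int32 returns true. With 30 and 400 the product is 144,000,000 and fits.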

loriab · February 12, 2017, 8:22pm

@jgonthier, int overflow was what Daniel was hypothesizing the other day. If you want to try patching it, it’d be great, but there are probably multiple places. I remember Ed H. fixing this once before for adenine-thymine, but I guess that was only a partial fix.

jgonthier · February 14, 2017, 1:26am

@loriab I’ll have a look. Finding every instance by reading the code is practically impossible, but maybe there is a way to replace all int declarations in the code with a custom class that detects possible overflows, and then correct the original code…
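
The custom-class idea could be sketched roughly as follows. This is a hypothetical illustration, not code from Psi4: the class name CheckedInt and the helper product_overflows are invented here, only multiplication and addition are covered, and int is assumed to be 32 bits so that any product of two ints fits in a 64-bit intermediate.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical stand-in for int that throws instead of silently overflowing.
class CheckedInt {
public:
    CheckedInt(int v = 0) : value_(v) {}  // implicit, so CheckedInt * int works
    int value() const { return value_; }

    friend CheckedInt operator*(CheckedInt a, CheckedInt b) {
        return CheckedInt::narrow(static_cast<std::int64_t>(a.value_) * b.value_);
    }
    friend CheckedInt operator+(CheckedInt a, CheckedInt b) {
        return CheckedInt::narrow(static_cast<std::int64_t>(a.value_) + b.value_);
    }

private:
    // Convert a 64-bit result back to int, throwing if it does not fit.
    static CheckedInt narrow(std::int64_t wide) {
        if (wide > std::numeric_limits<int>::max() ||
            wide < std::numeric_limits<int>::min()) {
            throw std::overflow_error("CheckedInt: result does not fit in int");
        }
        return CheckedInt(static_cast<int>(wide));
    }
    int value_;
};

// Reports whether a * b * c * d leaves the int range at any step.
bool product_overflows(int a, int b, int c, int d) {
    try {
        CheckedInt p = CheckedInt(a) * b * c * d;
        (void)p;
        return false;
    } catch (const std::overflow_error&) {
        return true;
    }
}
```

Temporarily swapping a type like this in for the int dimensions would turn a silent wrap-around into an exception at the offending multiplication, which is one way to locate the places that still need widening.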

@polyx15 Do you know which compiler was used to build the code? I think part of the error may be compiler-dependent.

dgasmith · February 14, 2017, 1:44am

@jgonthier You can always replace with size_t, we’re moving that way in most of the code anyway.
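
A rough sketch of what the size_t route might look like for a four-index array. The function names, the (occ, vir, occ, vir) layout and the row-major ordering are assumptions made for illustration, not the layout used in the SAPT module, and a 64-bit platform is assumed.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: a 64-bit platform, where std::size_t is at least 64 bits wide.
static_assert(sizeof(std::size_t) >= 8, "this sketch assumes a 64-bit size_t");

// Number of elements in a hypothetical (occ, vir, occ, vir) four-index array.
// All operands are size_t, so the multiplication is done in 64-bit arithmetic.
std::size_t num_elements(std::size_t nocc, std::size_t nvir) {
    return nocc * nvir * nocc * nvir;
}

// Flattened row-major offset of element (a, r, b, s) in that array.
std::size_t offset(std::size_t a, std::size_t r, std::size_t b, std::size_t s,
                   std::size_t nocc, std::size_t nvir) {
    assert(a < nocc && r < nvir && b < nocc && s < nvir);
    return ((a * nvir + r) * nocc + b) * nvir + s;
}
```

Because every operand is already size_t, no individual cast can be forgotten, which is the main attraction compared with sprinkling (long int) conversions over existing int expressions.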

jgonthier · February 14, 2017, 8:00am

Submitted a pull request with a patch that could hopefully solve the problem. Unfortunately, I cannot actually test the code on a large enough system.

loriab · February 14, 2017, 8:52pm

@polyx15, are you running from a compiled Psi4 so that you can apply the patch and test it? Or are you running a conda binary, in which case, I can compile one with the patch for you to test. I’m afraid we don’t have ready access to hardware with enough memory to run your particular job.

polyx15 · February 15, 2017, 6:56am

I am running a compiled Psi4, compiled with GCC 5.2.0 (since @jgonthier asked).

loriab · February 15, 2017, 5:54pm

GCC 5.2 should be fine (though I’m not familiar with any of its long int peculiarities). Can we expect that you are trying out the patch and will report back?

polyx15 · February 15, 2017, 8:45pm

I will try out the patch and report back, but it will take a few days.

jgonthier · February 15, 2017, 9:08pm

I was using GCC 5.4 for testing, and I am certain the code would have failed with a system your size, because some multiplications would have overflowed.
I just hope I caught all of them. It may be best to run with the debug 1 option again, so that we can see where it fails if I missed something.

samfux84 · February 16, 2017, 9:44am

Hi loriab,

I am responsible for installing software on the Euler cluster, where polyx15 runs his calculations. I am not too familiar with github yet.

In order to get the code that includes the patch mentioned in this discussion, should I just “git clone” the master branch of psi4, or another particular repository? Once I have the code I can most likely get the installation ready the same day, so that polyx15 can start testing as soon as possible.

Best regards

Sam

loriab · February 16, 2017, 5:36pm

Hi @samfux84,

So you can either

git clone https://github.com/jgonthier/psi4.git followed by git checkout SAPT_patch to get on the patch branch,

or, if you have a fairly recent psi4 source lying around (from github.com/psi4/psi4), you can just edit the source to apply these changes, since the changeset isn’t too big.

Thanks for helping with this testing,
Lori

samfux84 · February 17, 2017, 2:56pm

Hi Lori,

Thank you for the github instructions to get the correct code with the patch. I have built a new version of psi4, but I noticed a quite substantial change in library size compared to my 1.1a1 build.

The core.so library grew from ca. 48 MB to about 596 MB.

Older 1.1a1 build:

[sfux@develop01 psi4]$ ls -ltr
total 46940
-rw-r--r-- 1 apps ID-HPC-APPS     1133 Jan 25 09:29 metadata.py
drwxr-xr-x 7 apps ID-HPC-APPS     4096 Jan 25 09:29 driver
-rw-r--r-- 1 apps ID-HPC-APPS     2800 Jan 25 09:29 header.py
-rw-r--r-- 1 apps ID-HPC-APPS     1727 Jan 25 09:29 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS     2706 Jan 25 09:29 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS     1505 Jan 25 09:29 config.py
-rwxr-xr-x 1 apps ID-HPC-APPS 47849225 Jan 25 09:29 core.so

Build of newest version + patch:

[apps@develop01 psi4]$ ls -ltr
total 583832
-rwxr-xr-x 1 apps ID-HPC-APPS 595471556 Feb 17 15:47 core.so
-rw-r--r-- 1 apps ID-HPC-APPS      2721 Feb 17 15:47 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS      1727 Feb 17 15:47 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS      2800 Feb 17 15:47 header.py
drwxr-xr-x 7 apps ID-HPC-APPS      4096 Feb 17 15:47 driver
-rw-r--r-- 1 apps ID-HPC-APPS      1147 Feb 17 15:47 metadata.py

Does this make sense? Have there been substantial changes to the code that can explain the growth of core.so?

Best regards

Sam

jgonthier · February 17, 2017, 3:34pm

Hi,

In my release build of the SAPT_patch branch, with GCC 5.4 and openBLAS, my core.so is about 66 MB. It does seem weird that your core.so is so large. Which options did you use for building?

Best regards

Jerome

samfux84 · February 17, 2017, 3:54pm

Hi Jerome,

Thank you for your feedback. It seems that I did something wrong when setting the CMake variables. I set quite a lot of them, so I am not listing them all here.

It is already helpful to know what the size of the library should be.

I think I found the culprit: BUILD_SHARED_LIBS was set to OFF. I will retry compiling the code on Monday.

Thank you for your help.

Best regards

Sam