SAPT2+ Calculation: Segmentation fault during Exch12 computation


#1

Dear all,

I am trying to run a SAPT2+/aDZ calculation (the Silver Standard according to J. Chem. Phys. 2014, 140, 094106) on a fairly large system (162 electrons, 920 basis functions; I would rather not go into detail here about which system it is exactly). The calculation runs fine up to the Exch11 contribution; after that the job crashes with the following error:

Exited with exit code 139.
6243 Segmentation fault (core dumped)

I have run this type of calculation on several smaller, similar systems without problems. The next contributing term would be Exch12, so I assume the crash happens while that term is being computed. The estimated memory usage is “127060.6 MB”, and I am giving the job 140000 MB of memory. I also tried giving the job significantly more memory (more than double that), but the problem remained. Does anyone have an idea what the problem might be?

If you need more information, please let me know. I could also send the corresponding input by email on request, so that somebody can test it.

Thank you in advance for your help.

Best,
Robert


#2

How much memory does your computer have? One doesn’t usually give the memory keyword more than 95–98% of the computer system’s memory.


#3

The computer I am running the job on has 256 GB of memory.


#4

Can you set the option debug 1 and post the SAPT module printout to a gist?


#5

I set the option “debug 1” and re-ran the calculation. Here is the output of the SAPT module: https://gist.github.com/anonymous/c54ab98d64fb028833e7e7565df3c0b2


#6

Hi,

@dgasmith, @loriab: looking at the code for the term that seems to fail (Exch12_k11u_4), I see a potential integer overflow in lines 1397, 1398, 1401 and 1402 of exch12.cc. For a system of this size, the product aoccA_ * nvirA_ * aoccA_ * nvirA_ exceeds the range of a 32-bit signed integer, unless I am mistaken… Line 1402 in particular has no conversions to (long int).

If you agree that this could be the problem, I’ll patch it.
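To illustrate the suspected overflow, here is a minimal sketch, not the actual code from exch12.cc. The function names and the orbital counts used below are hypothetical and not taken from the failing job. It evaluates the occ * vir * occ * vir product in 64-bit arithmetic so the check itself cannot overflow, and it assumes a platform where size_t is 64 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: true if nocc * nvir * nocc * nvir does not fit in a
// 32-bit signed int. The product is formed in 64-bit arithmetic.
bool overflows_int32(std::int64_t nocc, std::int64_t nvir) {
    const std::int64_t product = nocc * nvir * nocc * nvir;
    return product > std::numeric_limits<std::int32_t>::max();
}

// Hypothetical helper: the same element count computed safely. Casting the
// first operand promotes every following multiplication to size_t.
std::size_t t2_elements(int nocc, int nvir) {
    assert(nocc >= 0 && nvir >= 0);
    return static_cast<std::size_t>(nocc) * nvir * nocc * nvir;
}
```

With illustrative values of 60 active occupied and 840 virtual orbitals, overflows_int32(60, 840) is true, because the product is 2,540,160,000 while the 32-bit signed maximum is 2,147,483,647. A plain int expression of that product would be undefined behaviour in C++, which would fit a crash that appears only for larger systems.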


#7

@jgonthier, int overflow was what Daniel was hypothesizing the other day. If you want to try patching it, it’d be great, but there are probably multiple places. I remember Ed H. fixing this once before for adenine-thymine, but I guess that was only a partial fix.


#8

@loriab I’ll have a look. Finding every instance by reading the code is impossible, but maybe there is a way to replace all int declarations in the code with a custom class that detects possible overflows, and then correct the original code…

@polyx15 Do you know which compiler was used to build the code? I think part of the error may be compiler-dependent.
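The overflow-detecting class idea could look roughly like the sketch below. CheckedInt and t2_count_checked are hypothetical names, not part of Psi4, and the sketch assumes int is 32 bits so that a product of two ints always fits in long long. Each multiplication is done in the wider type and throws if the result would not fit back into an int.

```cpp
#include <cassert>
#include <limits>
#include <stdexcept>

// Hypothetical drop-in wrapper for int that throws instead of overflowing.
class CheckedInt {
public:
    explicit CheckedInt(int v) : value_(v) {}
    int value() const { return value_; }

    CheckedInt operator*(const CheckedInt& rhs) const {
        // Multiply in a wider type, then verify the result fits in int.
        const long long wide = static_cast<long long>(value_) * rhs.value_;
        if (wide > std::numeric_limits<int>::max() ||
            wide < std::numeric_limits<int>::min()) {
            throw std::overflow_error("int multiplication overflow");
        }
        return CheckedInt(static_cast<int>(wide));
    }

private:
    int value_;
};

// Hypothetical usage: occ * vir * occ * vir with every step checked.
int t2_count_checked(int nocc, int nvir) {
    assert(nocc >= 0 && nvir >= 0);
    const CheckedInt o(nocc), v(nvir);
    return (o * v * o * v).value();
}
```

Running the existing tests with such a class in place of int would flag each offending product at run time by throwing std::overflow_error, but only for products that the test systems actually push past the int range.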


#9

@jgonthier You can always replace with size_t, we’re moving that way in most of the code anyway.


#10

Submitted a pull request with a patch that could hopefully solve the problem. Unfortunately, I cannot actually test the code on a large enough system.


#11

@polyx15, are you running from a compiled Psi4 so that you can apply the patch and test it? Or are you running a conda binary, in which case, I can compile one with the patch for you to test. I’m afraid we don’t have ready access to hardware with enough memory to run your particular job.


#12

I am running a compiled Psi4, compiled with GCC 5.2.0 (since @jgonthier asked).


#13

GCC 5.2 should be fine (though I’m not familiar with any of its long int peculiarities). Can we expect that you are trying out the patch and will report back?


#14

I will try out the patch and report back; it will take a few days, though.


#15

I was using GCC 5.4 for testing, and I am certain the code would have failed with a system your size, because some multiplications would have overflowed.
I just hope I caught all of them, so please run with the debug 1 option so that we can see where it fails if I missed something.


#16

Hi loriab,

I am responsible for installing software on the Euler cluster, where polyx15 runs his calculations. I am not too familiar with github yet.

To get the code that includes the patch mentioned in this discussion, should I just “git clone” the master branch of psi4, or some other repository? Once I have the code I can most likely get the installation ready the same day, so polyx15 can start testing as soon as possible.

Best regards

Sam


#17

Hi @samfux84,

So you can either

  • git clone https://github.com/jgonthier/psi4.git followed by git checkout SAPT_patch to get on the patch branch.

or, if you have a fairly recent psi4 source lying around (from github.com/psi4/psi4), you can

  • just edit the source to apply these changes since the changeset isn’t too big.

Thanks for helping with this testing,
Lori


#18

Hi Lori,

Thank you for the github instructions to get the correct code with the patch. I have built a new version of psi4, but I noticed a quite substantial change in library size compared to my 1.1a1 build.

The core.so library grew from ca. 48 MB to about 596 MB.

Older 1.1a1 build:

[sfux@develop01 psi4]$ ls -ltr
total 46940
-rw-r--r-- 1 apps ID-HPC-APPS     1133 Jan 25 09:29 metadata.py
drwxr-xr-x 7 apps ID-HPC-APPS     4096 Jan 25 09:29 driver
-rw-r--r-- 1 apps ID-HPC-APPS     2800 Jan 25 09:29 header.py
-rw-r--r-- 1 apps ID-HPC-APPS     1727 Jan 25 09:29 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS     2706 Jan 25 09:29 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS     1505 Jan 25 09:29 config.py
-rwxr-xr-x 1 apps ID-HPC-APPS 47849225 Jan 25 09:29 core.so

Build of newest version + patch:

[apps@develop01 psi4]$ ls -ltr
total 583832
-rwxr-xr-x 1 apps ID-HPC-APPS 595471556 Feb 17 15:47 core.so
-rw-r--r-- 1 apps ID-HPC-APPS      2721 Feb 17 15:47 __init__.py
-rw-r--r-- 1 apps ID-HPC-APPS      1727 Feb 17 15:47 extras.py
-rw-r--r-- 1 apps ID-HPC-APPS      2800 Feb 17 15:47 header.py
drwxr-xr-x 7 apps ID-HPC-APPS      4096 Feb 17 15:47 driver
-rw-r--r-- 1 apps ID-HPC-APPS      1147 Feb 17 15:47 metadata.py

Does this make sense? Have there been substantial changes to the code that can explain the growth of core.so?

Best regards

Sam


#19

Hi,

In my release build of the SAPT_patch branch, with GCC 5.4 and OpenBLAS, my core.so is about 66 MB. It does seem weird that your core.so is so large. Which options did you use for building?

Best regards

Jerome


#20

Hi Jerome,

Thank you for your feedback. It seems that I did something wrong when setting the CMake variables. I set quite a lot of them, so I am not listing all of them here in my comment.

It is already helpful to know what the size of the library should be.

I think I found the culprit: BUILD_SHARED_LIBS was set to OFF. I will retry compiling the code on Monday.

Thank you for your help.

Best regards

Sam