Different DFT grid behaviour on different machines -> very different DFT energies

Sorry for the length but I’ve tried to include as much info as I can!

I’ve been testing various builds of Psi4 on several different computers, and I’m currently trying to track down why all tests run fine on all but one of them (given a high enough MAX_AM_ERI for the pywrap tests to run!). On the remaining machine a single test, dftd3-version, fails because the energies do not match those expected.

Initially, I thought I had an older version of dftd3 on the failing machine, but that’s not the issue: the dispersion parts all match perfectly! The part that fails is where dftd3 is called from the C++ side, i.e. where functionals such as “b3lyp-d3” are tested. The dispersion energies match the reference values, but the test fails when the total energy is compared because the energy from the DFT part is different.

Digging into this, I’ve taken the ethene dimer structure and run a set of simple test calcs on 3 different machines, A, B, and C (where dftd3-version fails on C):

molecule eeee { <ethene dimer structure> }
set basis 6-31G(d)
energy('b3lyp', molecule=eeee)

The total energies match on machines A and B to 9 dp, which seems pretty good to me, but only to 4 dp on machine C!

dftd3-version uses a fairly cut-down integration grid. If I use this cut down number of spherical and radial points in my test calcs, the total energies still match on A and B to 9 dp, but now C only matches to 3 dp!

One thing that I’ve spotted is that A and B use the same number of grid points: Total Points = 265160 (64564 for the smaller grid) and Max Points = 4197 (3539 for the smaller grid), and they give the same energy.

On machine C (the “odd one”), Total Points = 265208 (64592 for the smaller grid); Max Points = 4784 (3709 for the smaller grid).

For some reason, Machine C is using a different grid, and I suspect this could be why my energies don’t match!

The same is true in the dftd3-version output:
Total Points = 64396; Max Points = 3627 (Total Energy matches and test passed);
Total Points = 64400; Max Points = 3758 (Total Energy doesn’t match: test fails);

The supplied output.ref has:
Total Points = 64396; Max Points = 3525

Hence, for some reason, the failing machine generates a grid with extra integration points and gets a different energy.

I had a brief look at where the grid pruning happens and there didn’t seem to be anything too complicated going on, so I’m at a bit of a loss as to why this happens, seemingly on one machine! Unless this is some odd difference in rounding behaviour between the different machines that I’ve managed to discover.

All builds use Python 3.5.3 and are linked against MKL (although I doubt that’s involved here). On machines A and C, gcc 6.3.0 was used; on machine B, gcc 4.9.4 was used.

HF energies match to 9 dp on all three machines.

Interestingly, the dft-alone tests pass OK but I will look back at their outputs to check them a bit more closely.

Does anyone have any clues on where to look next?

I’ve been digging deeper into this…

Before I delve into trying to work out how the grid generation functions work, am I correct in thinking that the same set of coordinates, number of radial and spherical points, etc., should always give the same set of integration grid points?

Currently, I get different grid-point coordinates on two different machines, leading to different numbers of distant points being removed, giving different energies.

@laz I have seen this for that particular test case before, but not in production. Agreed that this is not correct behavior.

I strongly suspect this is occurring in the extents portion of the grid pruning, likely due to the bisection involved. Somewhere around libfock/cubature.cc:3700.

There’s definitely something odd going on with grid generation. I’ve added a printf at line 3576 of cubature.cc to output the coordinates of the grid points as they’re generated.

For several simple test systems (e.g. a He atom, or H2O with something like cc-pVDZ), I get identical grid points on all 3 machines (default grid parameters).

For the ethene dimer structure from the dftd3-version input.dat and an STO-3G basis set, the first few grid points on the two machines that pass the dftd3-version test are:

LAZ: grid gen: (-1.26121, -22.0888, -3.51321)
LAZ: grid gen: (-1.26121, 22.0888, -3.51321)
LAZ: grid gen: (20.8275, 1.31393e-15, -3.51321)
LAZ: grid gen: (-23.35, -1.39116e-15, -3.51321)
LAZ: grid gen: (-1.26121, -3.86134e-17, 18.5755)
LAZ: grid gen: (-1.26121, -3.86134e-17, -25.602)
LAZ: grid gen: (11.4917, -12.7529, 9.23974)
LAZ: grid gen: (11.4917, 12.7529, 9.23974)
LAZ: grid gen: (-14.0142, -12.7529, 9.23974)
LAZ: grid gen: (11.4917, -12.7529, -16.2662)

With the same input file on the machine that fails the dftd3-version test, I get:

LAZ: grid gen: (17.9752, 10.857, -3.51321)
LAZ: grid gen: (-20.4976, -10.857, -3.51321)
LAZ: grid gen: (-12.1182, 19.2364, -3.51321)
LAZ: grid gen: (9.59577, -19.2364, -3.51321)
LAZ: grid gen: (-1.26121, -3.86134e-17, 18.5755)
LAZ: grid gen: (-1.26121, -3.86134e-17, -25.602)
LAZ: grid gen: (3.57666, 17.3744, 9.23974)
LAZ: grid gen: (-18.6356, 4.83786, 9.23974)
LAZ: grid gen: (16.1132, -4.83786, 9.23974)
LAZ: grid gen: (3.57666, 17.3744, -16.2662)

The Z coordinates are the same for both but the X and Y components are completely different! Although, bizarrely, the 5th and 6th lines are the same on each.

I also converted the coordinates from each output file into massive XYZ files and compared them in VMD. Both have the same general features (a higher density of points at the atoms), but the two grids are visually different.

I’ll keep on digging. Having used DFT grids for years without giving them any thought, it’s quite nice to be digging into the mechanics of how they’re formed.

Great, that code is aging a bit and the new DFT overhaul doesn’t actually touch that section. Feel free to make any changes that are needed.

If you need help I can answer what I can, but I did not write that section of the code nor am I intimately familiar with the specifics.

-Daniel Smith

After a significant amount of digging, I think I’ve found the source of my problem (odd test failures due to different DFT grids).

The DFT grid points are generated for each atom and then transformed into the standard orientation for that atom (using an OrientationMgr). The transformation uses a matrix diagonalized with LAPACK’s DSYEV (through several wrappers: OrientationMgr::diagonalize(), Matrix::Diagonalize(), and sq_rsp()).

The matrix is constructed from atomic coordinates and nuclear charges (ca. line 3400 of libfock/cubature.cc): all works happily until a significant number of atoms lie on the axes. In that case (and apparently only on certain machines), numerical errors can creep in.

For test coordinates of the ethene dimer (from dftd3-version: the test that failed on one of my machines), the matrix for diagonalization is diagonal due to the input coordinates:

> 4.0783088445e+01 0.0000000000e+00 0.0000000000e+00
> 0.0000000000e+00 4.0783088445e+01 0.0000000000e+00
> 0.0000000000e+00 0.0000000000e+00 4.1939865126e+02

and the resulting eigenvectors from DSYEV (through the wrappers) form the identity matrix.

On the failing machine, the off-diagonal elements are not quite zero:

>  4.0783088445e+01 -6.1629758220e-33 1.7763568394e-15
> -6.1629758220e-33  4.0783088445e+01 1.7763568394e-15
>  1.7763568394e-15  1.7763568394e-15 4.1939865126e+02

and DSYEV gives the following “off-axis” eigenvectors:

>  0.5042242   -0.8635728    0.0000000
>  0.8635728    0.5042242    0.0000000
> -0.0000000    0.0000000    1.0000000

It would seem that DSYEV doesn’t behave well with tiny values! (To be fair to DSYEV, the first two eigenvalues here are degenerate, so any orthonormal basis of that two-dimensional eigenspace is a mathematically valid set of eigenvectors; the tiny off-diagonal noise just picks an arbitrary rotation within it.) I haven’t quite tracked down why I’m seeing subtly different summations on different machines, or whether it’s down to the processor model, libc, or compiler options (the same compiler version, gcc 6.3.0, was used on both). I also saw the same failure whether I built against MKL or OpenBLAS.

Since these matrices are used to transform the grid points, a different grid results in each case, giving different total energies. I suspect this only happens when the coordinates are already in standard orientation and values cancel out (or not!) when constructing the transformation matrix.

Looking at libciomr/sq_rsp.cc, the sq_rsp() function was rewritten from its own Fortran routine into a wrapper for LAPACK. The function has a tolerance parameter, but this is currently ignored (although I think that’s more for convergence of the diagonalization than for the input matrix).

A quick fix appears to be zeroing matrix elements smaller in magnitude than e.g. 1e-14 in OrientationMgr::diagonalize (ca. line 3081 of libfock/cubature.cc) before diagonalization. With this change, the dftd3-version test now passes and the energies match the reference values to about 9 dp (whereas it was about 4 dp before this change!).

Obviously, matrix diagonalization is used a lot during calculations, but I don’t think this change breaks anything else, because the change is within OrientationMgr rather than in sq_rsp(), which I think is called by other functions.

I’ve just run a full “make test” (with default build options) and all tests pass except for pywrap-checkrun-* (due to not increasing MAX_AM_ERI from the default of 5) and psimrcc-ccsd_t-2, which ran before but now fails with:

  @SCF  100      -38.856259245280       0.000000000000       0.000000000179
    ci      = [-0.87294010, 0.48782741]
    ci_grad = [ 0.00000000, 0.00000000]
  Setting level shift to 0.000 S/E
  @SCF  101      -38.856259245280       0.000000000000       0.000000000093
    ci      = [-0.87294010, 0.48782741]
    ci_grad = [ 0.00000000, 0.00000000]
  The calculation did not converge in 100 cycles
  Quitting MCSCF.

(It converges in 66 steps on another machine and 65 in output.ref.)

I was going to submit a pull request but perhaps it isn’t as simple as I thought!

FYI, I’ve linked this to https://github.com/psi4/psi4/issues/677 on GitHub.

@laz Wow, great find! If you are only editing the OrientationMgr class this will not be propagated to other areas of the code. Something else is likely going on with psimrcc-ccsd_t-2.

In the past I have seen issues where noise around machine epsilon can cause the diagonalization to find somewhat surprising solutions. The closed form for 3x3 matrix diagonalization is not exactly complex (and is easily copy/pasteable); I wonder if it would be worth implementing here. See here. Although, do note that the eigenvectors can be obtained more easily via Cayley rotations using known eigenvalues.

I wonder if that’s a bit of a hammer when the simpler solution should work just fine. @andysim

@loriab Can you poke Dom/Matt, this sounds like a problem for our new crack team.

Further digging (thought I might as well continue here rather than on the respective GitHub issue).

Having done a few simple tests with LAPACK (both the netlib and MKL versions), the weird eigenvectors seem to be “standard” behaviour across multiple processor types: feed DSYEV a matrix that is a tiny amount away from being diagonal, and weird eigenvectors result. I see the same behaviour from GNU Octave (its eig function), which uses LAPACK. Interestingly, Perl PDL (which I’ve used a fair bit in the past for diagonalising inertia tensors) uses its own Jacobi-based diagonalisation function, and that behaves as expected, i.e. it returns an identity matrix of eigenvectors. I assume this behaviour of DSYEV is known and it’s left to the user to be careful about what is fed into it (although I haven’t found any obvious mention of this anywhere!).

The odd processor-dependent behaviour (hardware, or compiler optimisations?) that leads to such an ever-so-slightly-non-diagonal matrix seems to occur in OrientationMgr::OrientationMgr(), where input coordinates “cancel out” during the addition (or not!).

E.g., during the summation Iq.xz += x * z * q;
(-1.2612083903 × -3.5132106406 × 6) + (1.2612083903 × -3.5132106406 × 6) = 0.0000000000 on an i3-3110M
(-1.2612083903 × -3.5132106406 × 6) + (1.2612083903 × -3.5132106406 × 6) = 1.7763568394e-15 on an i7-5960X

Hence, when coordinates that should lead to a diagonal matrix are in use, the off-diagonal elements can be either zero, as expected, and diagonalisation works fine, or slightly off-zero, giving unexpected diagonalisation results.

Clearly, these are extremely small numerical inaccuracies that creep in but they seem to have a big effect in DSYEV!

As mentioned above, zeroing any elements of Iq that are smaller in magnitude than 1e-14 leads to the expected behaviour on all CPUs I’ve tested, although for some reason, that build failed on the psimrcc-ccsd_t-2 test that did work before; it passed if I increased maxiter for mcscf (108 steps to converge when the default maximum is 100). No idea why it behaved differently, though.

I doubt using the analytical solution for the 3x3 matrix diagonalisation would help here: it looks horrendous and I expect that slightly non-zero off-diagonal elements would end up propagating through it and the input would still need clamping at 1e-14 or whatever…

Odd, that almost looks like an FMA rounding issue due to the high precision intermediate.

Note that the analytic solution should be much more numerically stable, as DSYEV can be implemented in a large variety of ways (often iteratively) whose numerical stability can vary. I wouldn’t implement the analytical solution by hand, but you can either get Mathematica to spit these things out as C++ code or do a find-and-replace on the text.

Please feel free to submit either solution as a pull request to the main Psi4 repository.

The analytical solution for a 3x3 matrix did seem tempting at first, but it would have involved having to deal with the presence (or not) of complex solutions! I did find the following link discussing the accuracy of various diagonalisation algorithms for matrices with values spanning massive numerical ranges, and Jacobi’s method is meant to be more stable under those conditions. I don’t think it’s possible to perform diagonalisation directly with LAPACK using a Jacobi algorithm, although it’s probably possible to do so through an SVD function.

I suspect the simplest option is just to zero any matrix elements smaller in magnitude than 1e-14 before diagonalisation in OrientationMgr, which shouldn’t cause any other issues. I’ve tested this and all tests pass (apart from the psimrcc-ccsd_t-2 test, which took 8 more iterations than the standard limit but then passed! This can’t be connected.)

I’ll submit a pull request.