Terrible parallelization of DFT? No checkpointing for optimization?

Dear Psi4 developers,

I'm having some trouble obtaining good performance for DF-DFT optimizations (using the wB97X-D functional and Dunning's cc basis sets). It appears that only about 16% of the calculation time runs in parallel (100% CPU on all cores), while the remaining time is spent on a single processor. I'm using the precompiled binaries on an Ubuntu 16.04 LTS system with an Intel Xeon (Skylake generation) and running Psi4 with the command "psi4 -n 24 -i input.dat". Could it be that the precompiled binaries are not suited for good parallel performance and I should compile Psi4 myself? Or is this the intended behavior, and should I instead move to a machine with fewer cores and more memory? The manual is sparse on this point.

The input card is as follows (with the molecule geometry data removed):

memory 32 GB
molecule aloha_dimer {
0 1

units angstrom

}
set globals {
basis cc-pvtz
guess sad
scf_type df
df_scf_guess true
dft_functional wB97X-D
}
optimize('wB97X-D')

Thanks!

I also found a section in the manual stating that only energy procedures are checkpointed. If I'm reading that correctly, optimizations cannot be restarted? Are there any plans to extend checkpoint files to cover procedures other than energies?

Psi4 1.0 and 1.1 did not thread the XC kernels. Since the code was originally developed on 6-core i7s and the XC part is a relatively small fraction of the work, this was not a particularly large issue at the time; Amdahl's law has obviously caught up with us on current-generation hardware. Significant work has gone into ensuring that the HF and DFT modules thread to 30+ NUMA cores for the next release. A current copy of Psi4 from the master branch includes some of these changes and should thread significantly better.
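If you want to confirm how many threads a given build will actually use, the Python API exposes this directly (a minimal PsiAPI sketch; psi4.set_num_threads and psi4.core.get_num_threads are the relevant calls):

import psi4

psi4.set_num_threads(24)            # same effect as the "psi4 -n 24" command-line flag
print(psi4.core.get_num_threads())  # a build without OpenMP will report 1 here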

On the restarting part, you can always copy the last good geometry into a new input file and start a new job. The only disadvantage is that you lose the history of the optimization. That history is stored in binary file "1", so you'd have to arrange to save it:

core.IOManager.shared_object().set_specific_retention(1, True)  # keep binary file 1 after the job finishes
core.IOManager.shared_object().set_specific_path(1, './')       # and write it to the current directory
# normal job input

Then, in the restarted job, you have to know where that file 1 from the previous job is, say restartfile, and copy it into place:

import shutil

shutil.copy(restartfile, p4util.get_psifile(1))  # restartfile: path to the saved file 1
# normal job input

We do this internally in the driver, but it hasn’t been set up to be easy for users in input files. I’ll make some adjustments and cook up a test case so I’m confident it’ll work user-side.
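For concreteness, a restarted-optimization input could look like this (a sketch only; the restartfile path is a placeholder for wherever you saved file 1, and p4util is already in scope inside a Psi4 input file):

import shutil

restartfile = './previous_job/file.1'            # placeholder: the saved file 1 from the first job
shutil.copy(restartfile, p4util.get_psifile(1))  # restore the optimization history before anything runs
# ... normal molecule/set/optimize input follows ...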

I see similar behavior. My v1.1 install from the conda channel writes the following to the log:
Python driver attempted to set threads to 8.
Psi4 was compiled without OpenMP, setting threads to 1.

Could that contribute to this behavior? Do you think recompiling would improve parallelization?

If Psi4 was compiled without OpenMP, some threading can still be picked up through MKL; however, this is fairly minimal. The DFT XC kernels will not thread without the development version of Psi4 from GitHub (github.com/psi4/psi4).
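If you want to control that residual MKL threading explicitly, you can pin it before Psi4 loads (a PsiAPI sketch; MKL_NUM_THREADS is Intel MKL's standard environment variable, and the geometry is a throwaway water molecule for illustration):

import os
os.environ["MKL_NUM_THREADS"] = "8"  # must be set before MKL initializes
import psi4

psi4.geometry("""
0 1
O  0.000  0.000  0.000
H  0.000  0.757  0.587
H  0.000 -0.757  0.587
""")
psi4.energy("wB97X-D/cc-pVTZ")  # BLAS-heavy steps can now use up to 8 MKL threads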