User time longer than wall time on one core

Hello!

I’m running a simple SAPT calculation with mostly default settings. The job is submitted to a cluster via Slurm, and I’ve specified that only one core should be used (I double-checked this). Still, when I look at the timings in timer.dat, the user times are longer than the wall times. How is this possible?
Also, the node I ran the calculation on is configured for one thread per core, so there should be no hyperthreading involved. Here’s the relevant part of the timer.dat file:

Wall Time:     3990.65 seconds

JK: (A|Q)^-1                :     20.783u      6.917s     12.704w      1 calls
JK: (A|mn)                  :     77.533u      2.317s     48.031w      2 calls
JK: (Q|mn)                  :    117.083u      3.833s     72.725w      2 calls
JK: (Q|mn) Write            :      0.100u      6.717s      4.092w      2 calls
HF: Form H                  :      0.050u      0.083s      0.388w      3 calls
HF: Form S/X                :      1.067u      0.067s      0.701w      3 calls
HF: Guess                   :      0.983u      0.033s      0.598w      3 calls
SAD Guess                   :      0.917u      0.017s      0.567w      3 calls
HF: Form G                  :   3188.617u    489.933s   2214.782w     48 calls
JK: D                       :      0.200u      0.017s      0.152w     48 calls
JK: USO2AO                  :      0.167u      0.017s      0.084w     48 calls
JK: JK                      :   3187.983u    489.717s   2214.219w     48 calls
JK: (Q|mn) Read             :      1.633u    166.800s    103.652w     96 calls
JK: J                       :     91.767u      1.150s     55.863w     96 calls
JK: J1                      :     43.967u      0.567s     26.807w     96 calls
JK: J2                      :     46.950u      0.567s     28.589w     96 calls
JK: K                       :   3035.233u     46.900s   1853.677w     96 calls
JK: K1                      :   2686.367u     41.767s   1640.748w     96 calls
JK: K2                      :    348.750u      5.067s    212.810w     96 calls
JK: AO2USO                  :      0.000u      0.000s      0.000w     48 calls
HF: Form F                  :      0.250u      0.133s      0.171w     48 calls
HF: DIIS                    :      6.467u      2.100s      5.135w     48 calls
DIISManager::add_entry      :      0.217u      0.317s      0.313w     45 calls
DIISManager::extrapolate    :      0.550u      1.567s      1.274w     42 calls
bMatrix setup               :      0.200u      0.967s      0.690w     42 calls
bMatrix pseudoinverse       :      0.000u      0.000s      0.005w     42 calls
New vector                  :      0.350u      0.600s      0.578w     42 calls
HF: Form C                  :     18.450u      0.133s     11.200w     48 calls
HF: Form D                  :      0.233u      0.000s      0.132w     48 calls
DF Integrals                :    483.633u     59.067s    326.559w      1 calls
W Integrals                 :      0.617u      2.350s      1.788w      1 calls
Elst10                      :      0.000u      0.000s      0.000w      1 calls
Exch10                      :      0.583u      0.733s      0.800w      1 calls
Exch10 S^2                  :      0.333u      0.700s      0.621w      1 calls
Ind20                       :    430.267u    334.433s    459.926w      1 calls
Exch-Ind20                  :      6.417u     10.017s      9.882w      1 calls
Exch-Disp20 N^5             :    599.167u     64.983s    399.453w      1 calls
Exch-Disp20 N^4             :    573.467u     71.383s    387.825w      1 calls

-----------------------------------------------------------
JK: (A|Q)^-1                :     20.783u      6.917s     12.704w      1 calls
JK: (A|mn)                  :     77.533u      2.317s     48.031w      2 calls
JK: (Q|mn)                  :    117.083u      3.833s     72.725w      2 calls
JK: (Q|mn) Write            :      0.100u      6.717s      4.092w      2 calls
HF: Form H                  :      0.050u      0.083s      0.388w      3 calls
HF: Form S/X                :      1.067u      0.067s      0.701w      3 calls
HF: Guess                   :      0.983u      0.033s      0.598w      3 calls
| SAD Guess                 :      0.917u      0.017s      0.567w      3 calls
HF: Form G                  :   3188.617u    489.933s   2214.782w     48 calls
| JK: D                     :      0.200u      0.017s      0.152w     48 calls
| JK: USO2AO                :      0.167u      0.017s      0.084w     48 calls
| JK: JK                    :   3187.983u    489.717s   2214.219w     48 calls
| | JK: (Q|mn) Read         :      1.633u    166.800s    103.652w     96 calls
| | JK: J                   :     91.767u      1.150s     55.863w     96 calls
| | | JK: J1                :     43.967u      0.567s     26.807w     96 calls
| | | JK: J2                :     46.950u      0.567s     28.589w     96 calls
| | JK: K                   :   3035.233u     46.900s   1853.677w     96 calls
| | | JK: K1                :   2686.367u     41.767s   1640.748w     96 calls
| | | JK: K2                :    348.750u      5.067s    212.810w     96 calls
| JK: AO2USO                :      0.000u      0.000s      0.000w     48 calls
HF: Form F                  :      0.250u      0.133s      0.171w     48 calls
HF: DIIS                    :      6.467u      2.100s      5.135w     48 calls
| DIISManager::add_entry    :      0.217u      0.317s      0.313w     45 calls
| DIISManager::extrapolate  :      0.550u      1.567s      1.274w     42 calls
| | bMatrix setup           :      0.200u      0.967s      0.690w     42 calls
| | bMatrix pseudoinverse   :      0.000u      0.000s      0.005w     42 calls
| | New vector              :      0.350u      0.600s      0.578w     42 calls
HF: Form C                  :     18.450u      0.133s     11.200w     48 calls
HF: Form D                  :      0.233u      0.000s      0.132w     48 calls
DF Integrals                :    483.633u     59.067s    326.559w      1 calls
W Integrals                 :      0.617u      2.350s      1.788w      1 calls
Elst10                      :      0.000u      0.000s      0.000w      1 calls
Exch10                      :      0.583u      0.733s      0.800w      1 calls
Exch10 S^2                  :      0.333u      0.700s      0.621w      1 calls
Ind20                       :    430.267u    334.433s    459.926w      1 calls
Exch-Ind20                  :      6.417u     10.017s      9.882w      1 calls
Exch-Disp20 N^5             :    599.167u     64.983s    399.453w      1 calls
Exch-Disp20 N^4             :    573.467u     71.383s    387.825w      1 calls

***********************************************************

Thanks for your help.

Also, if I add up the timings of the main subroutines (those that don’t start with | in the second part of the table), I get these results (a short script that reproduces the sum is sketched below the numbers):

User: 5867.72
Sys: 836.92
Wall: 4029.65
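
The sketch, for what it’s worth; timer_block.txt is just a placeholder name for a file holding the second, hierarchical part of the table:

    # sum user/system/wall over the top-level timers (lines not starting with "|")
    user_t = sys_t = wall_t = 0.0
    with open("timer_block.txt") as fh:
        for line in fh:
            if not line.strip() or line.lstrip().startswith("|"):
                continue                  # skip blank lines and nested timers
            try:
                fields = line.rsplit(":", 1)[1].split()
                u, s, w = (float(f.rstrip("usw")) for f in fields[:3])
            except (IndexError, ValueError):
                continue                  # not a timer line
            user_t += u
            sys_t += s
            wall_t += w
    print(f"User: {user_t:.2f}   Sys: {sys_t:.2f}   Wall: {wall_t:.2f}")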

On the other hand, the .dat file generated by Psi4 states the following:

Total time:
        user time   =    3550.08 seconds =      59.17 minutes
        system time =     505.44 seconds =       8.42 minutes
        total time  =       4065 seconds =      67.75 minutes

Hello Javier,

Those are some interesting results. CPU time greater than wall time can only happen if more than one thread was accumulating CPU time at the same moment, so either multithreading or hyperthreading occurred during the calculation.
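
Just to illustrate the mechanism: user time is summed over every thread of a process, while wall time is not, so any hidden threading pushes the ratio above one. A minimal Python sketch, not Psi4-specific and assuming NumPy is linked against a threaded BLAS (MKL or OpenBLAS):

    import time
    import numpy as np

    a = np.random.rand(3000, 3000)
    b = np.random.rand(3000, 3000)

    wall0 = time.perf_counter()    # wall-clock time
    cpu0 = time.process_time()     # user + system CPU time, summed over all threads

    c = a @ b                      # the BLAS behind NumPy may fan this out over several threads

    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    print(f"wall = {wall:.2f} s   cpu = {cpu:.2f} s   cpu/wall = {cpu / wall:.2f}")

On a genuinely single-threaded run the ratio stays at or just below one; with a threaded BLAS it typically lands well above one.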

It probably wouldn’t hurt to make sure that Psi4 really did run on 1 core (no multithreading), even though you specified this. Do you still have the output file generated by Psi4? You can run grep -i "threads" on it to see how many threads Psi4 reports having used.
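
If more than one thread shows up, one way to pin everything down for the next run is to limit every threading layer before Psi4 and its BLAS are loaded. A minimal sketch, assuming you drive Psi4 through its Python API; with a regular input file, set_num_threads(1) in the input or psi4 -n 1 on the command line does the same job:

    import os
    # limit every threading layer before psi4 (and numpy) are imported
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"        # only relevant for MKL builds
    os.environ["OPENBLAS_NUM_THREADS"] = "1"   # only relevant for OpenBLAS builds

    import psi4
    psi4.set_num_threads(1)
    print("Psi4 reports", psi4.core.get_num_threads(), "thread(s)")  # expect 1

In the Slurm script it also helps to request exactly one CPU (--cpus-per-task=1) so the scheduler and Psi4 agree on what is available.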

My best guess is that the CPU you ran on supports hyperthreading. You said the node is configured to run one thread per core, but does that refer to physical or virtual (logical) cores? It’s possible that the job ran on 1 physical core that the operating system exposes as 2 virtual cores, with two threads sharing it. That would fit your numbers, where the user times of the expensive routines come out at roughly 1.5 times the wall times instead of being capped at them.

You could check whether hyperthreading is enabled on that node. On my computer (Linux / Intel CPU) "grep -i 'ht' /proc/cpuinfo" tells me the CPU is at least hyperthreading-capable; whether it is actually in use is easier to read off lscpu ("Thread(s) per core:"). If you’re submitting jobs to a cluster you might have to run the check on the compute node itself, or ask the sysadmin about it.
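
For instance, a small Python check along these lines (run it on the compute node, e.g. from inside the job script, since the login node may be configured differently) compares the logical and physical core counts that Linux reports in /proc/cpuinfo:

    # hyperthreading is active when a package exposes more logical CPUs
    # ("siblings") than physical cores ("cpu cores")
    info = {}
    with open("/proc/cpuinfo") as fh:
        for line in fh:
            key, sep, val = line.partition(":")
            if sep:
                info.setdefault(key.strip(), val.strip())

    siblings = int(info.get("siblings", "1"))
    cores = int(info.get("cpu cores", "1"))
    print(f"{siblings} logical CPU(s) vs {cores} physical core(s) per package")
    print("hyperthreading appears active" if siblings > cores else "one thread per core")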

Your second observation, about the difference between the timer file and the output file, is a little puzzling. It makes sense that the wall time obtained by summing the subroutines (4029.65 s) comes out a bit below the total reported in the output file (4065 s), since not every operation within Psi4 is explicitly timed and written to timer.dat. The gap between the two user/system times, however, suggests that the timer file accounts for the extra (hyper)threads while the output file doesn’t. My suspicion is that the timer file is the more accurate of the two, but hopefully a developer with a little more knowledge of how timing works in Psi4 can chime in.

Best,
Zach