Parallelization improvement

Dear PSI4 Developers,

I would appreciate it if you could improve the parallelization of the PSI4 modules.
I am performing a geometry optimization of the first excited state using the EOM-CCSD method. It is a fairly large calculation (22 atoms, 330 basis functions) that I would like to use as reference data. The parallel run over 32 nodes works fine for computing the energy and for the CCLAMBDA-CCDENSITY parts, but when the final gradient is computed the job drops to a single running processor and takes more than 15 hours to finish this part. The last output text after the parallelization stops is:
Virial Theorem Data:

Kinetic energy (ref) =
Kinetic energy (corr) =
Kinetic energy (total) =
-V/T (ref) =
-V/T (corr) =
-V/T (total) =

Is it possible to parallelize this PSI4 module as well?

Many thanks

It should be.

I’ve added this to our issues list. I was planning to refactor the gradient code within the next year anyway, so I can look into this then.

Thank you very much for your kind answer.
I will be waiting for it.