Threading and Parallelism

I tried searching this forum but am still confused. My question is: is this the expected behavior? With 4 threads I only get a 2.2× speedup; with 8 threads, only 2.6×.
I am using conda-installed psi4.
I used psi4/share/psi4/scripts/test_threading.py to test the threading speed, and here is what I got.

I found a very old post, "Multithreading in downloaded binary distribution", but it was about version 1.3. I'm not sure whether things have changed in 1.9.

Time for threads 1, size 200: Psi4: 0.000388 NumPy: 0.000405
Time for threads 1, size 500: Psi4: 0.005445 NumPy: 0.005349
Time for threads 1, size 2000: Psi4: 0.318029 NumPy: 0.308266
Time for threads 1, size 4000: Psi4: 2.529851 NumPy: 2.515574
Time for threads 4, size 200: Psi4: 0.000124 NumPy: 0.000144
Time for threads 4, size 500: Psi4: 0.001536 NumPy: 0.001613
Time for threads 4, size 2000: Psi4: 0.093142 NumPy: 0.090224
Time for threads 4, size 4000: Psi4: 0.670489 NumPy: 0.667352
NumPy@n4 : Psi4@n4 ratio (want ~1): 1.00
Psi4@n1 : Psi4@n4 ratio (want ~4): 3.77
Running psi4 -i _thread_test_input_psi4_yo.in -o _thread_test_input_psi4_yo_n1.out -n1 …
Time for threads 1: Psi4: 85.772461
Running psi4 -i _thread_test_input_psi4_yo.in -o _thread_test_input_psi4_yo_n4.out -n4 …
Time for threads 4: Psi4: 38.372297
Psi4@n1 : Psi4@n4 ratio (want ~4): 2.24

Time for threads 1, size 200: Psi4: 0.000425 NumPy: 0.000443
Time for threads 1, size 500: Psi4: 0.005931 NumPy: 0.005959
Time for threads 1, size 2000: Psi4: 0.359246 NumPy: 0.348622
Time for threads 1, size 4000: Psi4: 2.764390 NumPy: 2.720615
Time for threads 8, size 200: Psi4: 0.000081 NumPy: 0.000106
Time for threads 8, size 500: Psi4: 0.000886 NumPy: 0.000946
Time for threads 8, size 2000: Psi4: 0.062190 NumPy: 0.052501
Time for threads 8, size 4000: Psi4: 0.378442 NumPy: 0.377958
NumPy@n8 : Psi4@n8 ratio (want ~1): 1.00
Psi4@n1 : Psi4@n8 ratio (want ~8): 7.30
Running psi4 -i _thread_test_input_psi4_yo.in -o _thread_test_input_psi4_yo_n1.out -n1 …
Time for threads 1: Psi4: 89.553097
Running psi4 -i _thread_test_input_psi4_yo.in -o _thread_test_input_psi4_yo_n8.out -n8 …
Time for threads 8: Psi4: 34.369661
Psi4@n1 : Psi4@n8 ratio (want ~8): 2.61

Yeah, this is basically the expected efficiency for this test case, which runs SAPT0 on a fairly small system. Different methods and system sizes will show different parallel scaling efficiencies; for larger systems, Psi4 can usually get quite good efficiency on 8 cores.

In other words, that test doesn't try to answer "how parallelizable is my computation?" It answers "is Psi4's parallelization significantly worse than NumPy's for SAPT0?" And, as expected, the answer is "no."
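As a rough sanity check (my own sketch, not part of the test script), you can plug the posted wall times into Amdahl's law, S(n) = 1 / (f + (1 - f) / n), and solve for the effective serial fraction f. Both runs imply that roughly a quarter to a third of the work is effectively serial for this small SAPT0 job, which is consistent with the speedup flattening out between 4 and 8 threads:

```python
# Back-of-envelope estimate of the serial fraction implied by the wall
# times posted above, assuming Amdahl's law S(n) = 1 / (f + (1 - f) / n).

def amdahl_serial_fraction(t1: float, tn: float, n: int) -> float:
    """Solve Amdahl's law for f, given measured speedup S = t1 / tn on n threads."""
    s = t1 / tn
    return (n / s - 1) / (n - 1)

# Timings copied verbatim from the test_threading.py output in the question.
f4 = amdahl_serial_fraction(85.772461, 38.372297, 4)  # 4-thread SAPT0 run
f8 = amdahl_serial_fraction(89.553097, 34.369661, 8)  # 8-thread SAPT0 run

print(f"speedup@4 = {85.772461 / 38.372297:.2f}, implied serial fraction ~ {f4:.2f}")
print(f"speedup@8 = {89.553097 / 34.369661:.2f}, implied serial fraction ~ {f8:.2f}")
# -> speedup@4 = 2.24, implied serial fraction ~ 0.26
# -> speedup@8 = 2.61, implied serial fraction ~ 0.30
```

Since the implied serial fraction is similar for both thread counts, the plateau looks like an Amdahl-style limit of this particular (small) workload rather than a broken threading setup.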