I have done some additional testing; hopefully the results end up being useful.
The 1.1 Conda package does not seem to make use of threaded BLAS. I tested this on Debian 8, Debian 9 and Fedora 26.
I have also grabbed a tarball from git and compiled it against OpenBLAS (on Debian 9). This was very useful, because OpenBLAS threading can be controlled through the OPENBLAS_NUM_THREADS environment variable.
Testing with this build revealed that, for the CCSD iterations of a DF-CCSD single-point energy calculation, psi4 does not set MKL threading properly (or at least the conda package doesn't). Running the conda package with -n2 gave the same timings as running with -n1. The git build with OpenBLAS produced practically identical timings to the conda package, and -n2 again did nothing for the DF-CCSD iteration times. However, with the OpenBLAS build I saw very significant speedups when I increased the BLAS thread count via the environment variable.
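For anyone who wants to repeat this kind of comparison, here is a minimal sketch of how the scan could be scripted. It is not the exact procedure I used. The helper names (build_env, time_command, scan) and the input file name are made up for illustration, and it assumes a psi4 executable on the PATH that accepts -n for the thread count. Note that it measures the wall time of the whole run, not the per-iteration CCSD timings, which still have to be read from the output file.

```python
import os
import subprocess
import sys
import time


def build_env(blas_threads, base_env=None):
    """Return a copy of the environment with OPENBLAS_NUM_THREADS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_NUM_THREADS"] = str(blas_threads)
    return env


def time_command(cmd, blas_threads):
    """Run cmd once and return its total wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=build_env(blas_threads), check=True)
    return time.perf_counter() - start


def scan(input_file, psi4_threads=(1, 2), blas_threads=(1, 2, 4)):
    """Time every combination of psi4 -n and OPENBLAS_NUM_THREADS.

    Assumes `psi4` is on the PATH; input_file is a hypothetical
    DF-CCSD single-point input.
    """
    results = {}
    for n in psi4_threads:
        for b in blas_threads:
            cmd = ["psi4", "-n", str(n), input_file]
            results[(n, b)] = time_command(cmd, b)
    return results
```

Calling scan("dfccsd.dat") (a hypothetical file name) returns a dictionary keyed by (psi4 threads, BLAS threads). If -n were being passed through to BLAS correctly, the timings should change with the first key; with the OpenBLAS build I would expect them to change only with the second.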
So while I cannot say what or where the problem is, I can say for certain that it is there somewhere.