I was recently chasing a p2p-centric correctness issue in QUDA. To get reproducible workload invocations, I decided to turn off tuning. To my surprise, this disables the p2p codepath inside the dslash policy. Other p2p components that run before dslash is applied are still invoked even with tuning off.
My surprise, however, may be someone else's expectation, so I wanted to open this ticket to at least start a discussion before calling it a bug.
I'll note that turning off p2p for dslash did help me narrow down where the problem was, but it also had the side effect of making the failure flakier.