25th September 2013


Quote:

It is the general consensus that unless a test applies statistical significance testing (using known probabilities and a null hypothesis, a minimum number of trials of roughly 10 to 25, and a minimum confidence level of 95%), the test cannot be deemed scientific. Of course, significance testing goes far beyond the field of audio.

A very brief description of ABX testing can be found here, which touches on what I'm referring to. There are large charts available that list the confidence level achieved for a given number of trials and correct answers.
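For illustration, a miniature version of such a chart can be generated directly from the binomial distribution. This is a sketch using only the Python standard library, and it assumes a two-choice test where pure guessing gives a 50/50 chance per trial:

```python
from math import comb

def p_value(correct, trials):
    """One-sided probability of getting >= `correct` answers right by pure guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Minimum correct answers needed per trial count for 95% confidence (p < 0.05)
for trials in range(10, 26, 5):
    needed = next(c for c in range(trials + 1) if p_value(c, trials) < 0.05)
    print(f"{trials} trials: need {needed}+ correct (p = {p_value(needed, trials):.4f})")
```

At ten trials, for example, nine or more correct answers are needed before the guessing hypothesis can be rejected at the 95% level.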

IMO, if a test produces an outcome different from that of its scientific equivalent, then it is flawed. If you disagree with this, then you're effectively arguing against science.

Is a test that implements fewer than ten trials, for example, flawed? Based on scientific reasoning, yes. And what if a subject performed ten trials and chose preamp 'A' 7/10 times? In other words, three times they chose preamp 'B' as their preference, and seven times they chose preamp 'A'. A person not familiar with statistical significance would likely say that preamp 'A' is definitely their preference. But according to science, it is not! In fact, according to science they have not demonstrated a preference (to the minimum degree of certainty). If someone says otherwise, then they should back up their reasoning with science.
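To check the 7/10 case concretely (a quick sketch, standard library only): under the null hypothesis of pure guessing, each trial is a 50/50 coin flip, and the chance of getting seven or more out of ten right by luck alone is about 17%, nowhere near the 5% cutoff:

```python
from math import comb

# Probability of >= 7 correct out of 10 under pure guessing (p = 0.5)
p = sum(comb(10, k) for k in range(7, 11)) / 2 ** 10
print(p)  # 0.171875 -> far above 0.05, so no preference is demonstrated
```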

In significance testing, we want to be relatively certain that the subject is not guessing, requiring a minimum confidence level of 95%. In other words, if a subject chose preamp 'A' 9/10 times in a **randomized scientific test**, then we can be highly confident that he/she is indeed not guessing (9/10 correct in ten trials corresponds to a confidence level above 95%).
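The same arithmetic confirms the 9/10 figure (again a sketch under the 50/50 guessing null hypothesis): the probability of that result arising by chance is about 1.1%, well under the 5% threshold.

```python
from math import comb

# Probability of >= 9 correct out of 10 under pure guessing (p = 0.5)
p = sum(comb(10, k) for k in range(9, 11)) / 2 ** 10
print(p)  # 0.0107421875 -> below 0.05, i.e. confidence above 95%
```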

Not following the above guidelines is just one of many ways a test can be flawed, but it's not that hard to come up with a test that is reasonably flaw-free.

As this thread goes on, you'll start to see how other flaws will be called out and/or questioned. But, how many tests even follow the above guidelines? 1, 2, 3 maybe on this entire board?