Audio Asylum Thread Printer
In Reply to: RE: statistics question posted by mike1127 on June 25, 2009 at 11:17:03
There are a lot of potential problems with conducting amateur DBTs (double-blind tests); see:
http://www.audioasylum.com/forums/prophead/messages/2190.html
http://www.audioasylum.com/forums/prophead/messages/2579.html
http://www.audioasylum.com/forums/prophead/messages/2580.html
for some discussion of the common problems and errors that get committed.
One of the biggest problems, from a test-methodology standpoint, is that even experienced listeners tire quickly. The suggestion (more like a demand) of the ABX folks that one use 16 trials has been one of the biggest problems, in my opinion.
In the first cited URL I state that:
"The benchmark for the 16 trials was to get 12 or more correct; this would then establish that the listener had less than a 5% chance of just guessing that many correct. It is what is known as a confidence level of 95%. The criterion for what was considered 'good enough' so as to not be just due to chance is supposed to be selected before the test, and then adhered to. Other confidence levels could be used, such as 99% (very strict, and usually extremely hard to do in these kinds of tests), or 90%. It should be noted that for a 95% confidence level, just conducting 20 runs would typically result in one that appeared to exceed the 95% confidence level, even if everything was just random choices. So in order to take the test results as a valid positive, one would have to do better than this on the average."
What this means is that you would have to perform the test more than once to satisfy most of the objectivist folks; otherwise they would be very likely to deny that a single test had any meaning.
If you ran, say, 10 such listening tests of 16 trials each, and more than half of them came in at 12 or more correct out of 16, this would be a strong indicator that the test results were showing something that was really there.
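For anyone who wants to check the arithmetic behind the 12-of-16 benchmark and the repeated-runs idea, here is a minimal Python sketch. It assumes a pure guesser is right half the time on each trial and that the runs are independent; the function name tail_prob is just a label chosen for this example.

```python
from math import comb

def tail_prob(n, k, p=0.5):
    """Probability of k or more correct answers in n trials,
    when each trial is answered correctly with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that a pure guesser scores 12 or more out of 16.
p_pass = tail_prob(16, 12)
print(f"P(12 or more of 16 by guessing) = {p_pass:.4f}")

# One fewer correct answer no longer meets a 95% criterion.
p_eleven = tail_prob(16, 11)
print(f"P(11 or more of 16 by guessing) = {p_eleven:.4f}")

# Chance that more than half of 10 independent runs (6 or more)
# each reach the 12-of-16 mark by guessing alone.
p_majority = tail_prob(10, 6, p_pass)
print(f"P(6 or more of 10 runs pass by guessing) = {p_majority:.2e}")
```

The last figure comes out far smaller than the single-run figure, which is the point of repeating the test: one passing run can be luck, a majority of passing runs is very hard to get by luck.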
However, as I said above, doing 16 trials tends to become a self-fulfilling prophecy: many such tests end up with null results.
Why is this?
I cover that in the three cited URLs. A quick version would be that by the time you get past about 8 or 10 trials, listening fatigue often sets in, and the rest of the results end up almost random.
If the last 7 or 8 trials are random, then even if you got 6 or 7 of the first 8 correct, the end result will usually fall below the cutoff and be declared "random results". Random answers on the last 8 trials add about 4 correct on average, for a total of 10 or 11, short of the 12 required.
Yet if you ran just 9 trials and got 7 correct, that is p=0.09 (or 9%), which meets a 90% criterion, though not a 95% criterion.
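The fatigue argument can be put into rough numbers with a short Python sketch. Everything about the listener here is an assumption made up for illustration: a hit rate of 0.85 while fresh (P_FRESH) and pure guessing once tired. The helper names are likewise just labels for this example.

```python
from math import comb

def pmf(n, p):
    """Binomial probabilities of 0..n correct answers in n trials."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def pass_prob(n_fresh, p_fresh, n_tired, p_tired, cutoff):
    """Chance that fresh-phase plus tired-phase correct answers reach the cutoff."""
    fresh = pmf(n_fresh, p_fresh)
    tired = pmf(n_tired, p_tired)
    return sum(fresh[a] * tired[b]
               for a in range(n_fresh + 1)
               for b in range(n_tired + 1)
               if a + b >= cutoff)

P_FRESH = 0.85  # hypothetical hit rate before fatigue sets in (an assumption)

# 16 trials: 8 fresh, then 8 answered at random; 12 correct needed.
long_test = pass_prob(8, P_FRESH, 8, 0.5, 12)

# 9 trials, all fresh; 7 correct needed (the 90% criterion).
short_test = pass_prob(9, P_FRESH, 0, 0.5, 7)

# Chance that a pure guesser passes the 9-trial test.
guess_9 = pass_prob(9, 0.5, 0, 0.5, 7)

print(f"16 trials with fatigue, passes 12/16: {long_test:.3f}")
print(f"9 fresh trials, passes 7/9:          {short_test:.3f}")
print(f"pure guesser passes 7/9:             {guess_9:.4f}")
```

Under these assumed numbers the same listener passes the short test far more often than the long one. The price is the looser criterion: a guesser passes the 9-trial test about 9% of the time, versus under 5% for 12 of 16.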
I talk about the number of trials, and how many to run, in the third of the cited URLs. The trial counts suggested there were all selected to minimize the number of trials per test, to help reduce listening fatigue and the poor results that come with it.
I strongly suggest that you read the three URLs cited, and make sure that any listening tests you intend to conduct avoid the most common and worst of the mistakes I list.
Jon Risch
Something you don't mention is the problem of Type II error. (The chance that we fail to reject the null hypothesis when it is false.)
Let p be the probability that a listener picks the right answer on a single trial. For a perfect test subject, p=1.0. For someone who's guessing, p=0.5. In reality, p is somewhere in between. Maybe it's 0.9, maybe it's 0.7, maybe it's 0.6. Something that rather shocked me was that for p=0.6, you need a LOT of trials to keep Type II error low. On the order of 50 trials if your significance level is 5%.
This may be one reason few amateur cable tests have shown a positive result. In a long-term listening style of test, who has time to do 50 trials? If p is only slightly above 0.5, the differences are real, but it takes many trials to reach a 5% significance level.
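Here is a minimal Python sketch for exploring how Type II error depends on the number of trials. It assumes a one-sided exact binomial test against guessing (p=0.5), and the function names are labels chosen for this example. How many trials count as "enough" depends on how much Type II error you are willing to tolerate, so the trial counts in the loop are only sample values.

```python
from math import comb

def tail_prob(n, k, p):
    """P(at least k correct out of n) when each trial is right with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_value(n, alpha=0.05):
    """Smallest score a guesser reaches with probability <= alpha.
    Returns n + 1 if no score qualifies (the test cannot be passed)."""
    for k in range(n + 1):
        if tail_prob(n, k, 0.5) <= alpha:
            return k
    return n + 1

def type_ii_error(n, p, alpha=0.05):
    """Chance that a listener with true hit rate p fails to reach the cutoff."""
    return 1.0 - tail_prob(n, critical_value(n, alpha), p)

# Sample trial counts for a listener with p = 0.6.
for n in (16, 50, 100, 150):
    k = critical_value(n)
    print(f"n={n:3d}  need {k:3d} correct  Type II error = {type_ii_error(n, 0.6):.2f}")
```

Changing 0.6 to 0.7 or 0.9 in the loop shows how quickly the required number of trials drops when the listener is more reliable.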
This is the conundrum. You want a lot of trials, so your confidence will be high, but a lot of trials causes listening fatigue, and the results drift toward random, leaving no possibility of finding statistical positives.
The alternative is to do lots of runs (spread out over time to allow for listening fatigue to be reduced), with a small number of trials, as I suggest in the cited URLs.
Jon Risch
Thanks for that. Very interesting. I will read all of it.
Regarding how often we are told, "No one has ever passed a cable blind test," I had a thought.
If there were really a lot of cable blind tests being run, a certain percentage of them would succeed from chance alone. If you run 100 tests at a significance level of 5%, you'd expect about 5 of them to reject the null hypothesis even if there is no audible difference at all.
So if there were really a healthy number of tests being run, we should have a practically endless supply of stories of successful tests. And I don't mean rigorous tests. Just informal tests... by chance alone, you would expect a lot of 16-trial/5% significance tests to reject the null hypothesis no matter whether they are well-run or poorly-run.
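As a rough check on that reasoning, here is a minimal Python sketch. It assumes every test is a 16-trial test with a 12-correct cutoff, that every listener is purely guessing, and that the tests are independent; the count of 100 tests is a hypothetical figure.

```python
from math import comb

N_TRIALS, CUTOFF = 16, 12
N_TESTS = 100  # hypothetical number of independent tests

# Exact chance that one guessing listener reaches the cutoff.
false_positive_rate = sum(comb(N_TRIALS, k)
                          for k in range(CUTOFF, N_TRIALS + 1)) / 2**N_TRIALS

# Expected number of tests that "pass" by luck alone.
expected_passes = N_TESTS * false_positive_rate

# Chance that at least one of the tests passes by luck alone.
p_at_least_one = 1 - (1 - false_positive_rate) ** N_TESTS

print(f"false positive rate per test: {false_positive_rate:.4f}")
print(f"expected chance passes in {N_TESTS} tests: {expected_passes:.1f}")
print(f"P(at least one chance pass): {p_at_least_one:.3f}")
```

Because scores come in whole numbers, the 12-of-16 cutoff gives a per-test rate a little under the nominal 5%, so the expected count is somewhat under 5 per 100 tests; the conclusion is the same either way.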
This tells me that practically no one is trying to run blind tests! In this case, I believe it is true that: Absence of evidence is not evidence of absence. It's evidence that no one is even looking for evidence.
You wrote:
"Regarding how often we are told, "No one has ever passed a cable blind test," ...."
Actually, this has been mis-worded or misquoted.
Many people have 'passed' a cable listening test. I have 'passed' such tests.
However, the objectivists often demand more than reasonable levels of proof.
Unless a listening test is fully written up and submitted to a peer-reviewed scientific journal, most objectivists won't accept it as any sort of proof. Even then, when faced with some sort of evidence, they often look for excuses to dismiss the data and call for even further tests to "back up the one conducted", dismissing it wholly until more test data is submitted.
Several examples:
I have conducted many blind listening tests, and been able to identify cables at p~0.10, and have done this over a large number of runs.
However, I have not written them up formally and submitted them to a peer-reviewed journal; therefore the objectivists dismiss them as lies and propaganda.
A fellow named Vandy at a chat board called Audio Review conducted a series of listening tests some years ago, using a methodology similar to what I have recommended, and found that he was able to generate a statistical positive. Yet he was so lambasted and flamed that he left the AR board, never to return, due to the vicious and relentless badgering and hammering from the objectivists.
Others have attempted amateur listening tests for cable differences, and tried to 'submit' their statistically positive results on the news groups, only to be flamed and denigrated and hounded by dozens of hard-core objectivists.
I seriously doubt (and so do most subjectivists) that any amateur work would ever be accepted by the objectivist crowd. It is too easy to be a critic, too easy to not do anything, too easy to say "no, it can't be".
I am not trying to discourage you, but this is the situation we find ourselves in, a highly polarized one where things are portrayed as black and white, right and wrong, and no room for less than perfection (which will never happen).
Jon Risch