Audio Asylum Thread Printer |
Get a view of an entire thread on one page |
I'm doing some blind tests to try to prove I can tell interconnect cables apart. I have a statistics question.
My understanding is that we have a "null hypothesis" which is that the cables can't be told apart. If I do well enough then we can reject the null hypothesis with a certain level of significance.
For example, I'm not sure if the numbers here are right, but if I do 16 trials and get 12 right, then there is a 5% chance I was guessing, so the null hypothesis has been rejected with a 5% level of significance.
My question now is: let's say I'm able to get a correct identification 80% of the time. How many trials would be needed for a 5% level of significance? For a 1% level of significance? For a 0.1% level?
There are a lot of potential problems with conducting amateur DBT's, see:
http://www.audioasylum.com/forums/prophead/messages/2190.html
http://www.audioasylum.com/forums/prophead/messages/2579.html
http://www.audioasylum.com/forums/prophead/messages/2580.html
for some discussion of the common problems and errors that get committed.
One of the biggest problems from a test methodology aspect, is that even experienced listeners get tired very easily and quickly. The suggestion (more like a demand) of the ABX folks that one use 16 trials has been one of the biggest problems in my opinion.
In the first cited URL I state that:
"The benchmark for the 16 trials was to get 12 or more correct, this would then establish that the listener had less than a 5% chance of just guessing that many correct. It is what is known as a confidence level of 95%. The criteria for what was considered 'good enough' so as to not be just due to chance, is supposed to be selected before the test, and then adhered to. Other confidence levels could be used, such as 99% (very strict, and usually extremely hard to do in these kinds of tests), or 90%. It should be noted, that for a 95% confidence level, that just conducting 20 runs would typically result in one that appeared to exceed the 95% confidence level, even if everything was just random choices. So in order to take the test results as a valid positive, one would have to do better than this on the average."
What this means, is that you would have to perform the test more than once to satisfy most of the objectivist folks, otherwise they would be very likely to deny that a single test had any meaning.
If you ran say 10 such listening tests with 16 trials, and had more than half of them get more than 12 of the 16 trials correct, this would tend to be a strong indicator that the test results were showing something that was really there.
However, as I said above, doing 16 trials tends to become a self-fulfilling prophecy: many such tests end up with null results.
Why is this?
I cover that in the three cited URL's. A quick version would be that by the time you get past about 8 or 10 trials, listening fatigue often sets in, and the rest of the results end up almost random.
If the last 7 or 8 trials are random, then even if you got 6 or 7 out of the first 8 correct, the end result falls below the cutoff, and is declared "random results".
Yet if you look at doing just 9 trials, and get 7 correct, that is a p=0.09 (or 9%), which meets a 90% criterion but not a 95% criterion.
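The two figures quoted in this post can be checked with a few lines of Python. This is only a sketch: the function name `tail_probability` is made up for illustration, and it assumes each trial is an independent 50/50 guess.

```python
from math import comb

def tail_probability(correct, trials, p=0.5):
    """Chance of scoring at least `correct` out of `trials` when every
    trial is an independent guess that succeeds with probability p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )

print(f"7 of 9:   {tail_probability(7, 9):.3f}")    # 0.090
print(f"12 of 16: {tail_probability(12, 16):.3f}")  # 0.038
```

So 7 of 9 clears a 90% criterion but not a 95% one, while 12 of 16 clears 95%.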
I talk about the number of trials in the 3rd part of the cited URL's, and how many trials to run. All of these were selected to minimize the number of trials per test, to help minimize listening fatigue and then getting poor results.
I strongly suggest that you read the three URL's cited, and make sure that any listening tests you intend to conduct avoid the most common and worst of the mistakes I list.
Jon Risch
Something you don't mention is the problem of Type II error. (The chance that we fail to reject the null hypothesis when it is false.)
Let p be the probability that a listener picks the right answer on a single trial. For a perfect test subject, p=1.0. For someone who's guessing, p=0.5. In reality, p is somewhere in-between. Maybe it's 0.9, maybe it's 0.7, maybe it's 0.6. Something that rather shocked me was that for p=0.6, you need a LOT of trials to minimize Type II error. On the order of 50 trials if your significance level is 5%.
This may be one reason few amateur cable tests have shown a positive result. In a long-term listening style of test, who has time to do 50 trials? If we are dealing with a small p, the differences are real, but it takes many trials to reach a 5% significance level.
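A rough way to see the Type II problem is to compute it directly. The sketch below is an illustration, not a result from the thread: the helper names are hypothetical, the test is assumed to be one-sided with the pass mark set under pure guessing (p=0.5), and `true_p` is the listener's assumed real hit rate.

```python
from math import comb

def tail(k_min, n, p):
    """Probability of at least k_min successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

def critical_score(n, alpha=0.05):
    """Smallest score whose chance under pure guessing is <= alpha.
    Returns n + 1 when no score can reach that level."""
    for k in range(n + 1):
        if tail(k, n, 0.5) <= alpha:
            return k
    return n + 1

def type_ii_error(n, true_p, alpha=0.05):
    """Chance that a listener with hit rate true_p misses the critical score."""
    return 1.0 - tail(critical_score(n, alpha), n, true_p)

for n in (16, 50, 100):
    print(n, critical_score(n), round(type_ii_error(n, 0.6), 3))
```

Running it for a few values of `n` and `true_p` shows how slowly the Type II error falls when the true hit rate is close to 0.5.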
This is the conundrum. You want a lot of trials, so your confidence will be high, but a lot of trials causes listening fatigue, the results drift towards random, and there is no possibility of finding statistical positives.
The alternative is to do lots of runs (spread out over time to allow for listening fatigue to be reduced), with a small number of trials, as I suggest in the cited URL's.
Jon Risch
Thanks for that. Very interesting. I will read all of it.
Regarding how often we are told, "No one has ever passed a cable blind test," I had a thought.
If there were really a lot of cable blind tests being run, a certain percentage of them would succeed from chance alone. If you run 100 tests at a significance level of 5% you'll probably have 5 of them that reject the null hypothesis.
So if there were really a healthy number of tests being run, we should have a practically endless supply of stories of successful tests. And I don't mean rigorous tests. Just informal tests... by chance alone, you would expect a lot of 16-trial/5% significance tests to reject the null hypothesis no matter whether they are well-run or poorly-run.
This tells me that practically no one is trying to run blind tests! In this case, I believe it is true that: Absence of evidence is not evidence of absence. It's evidence that no one is even looking for evidence.
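The arithmetic behind this argument can be sketched as follows, assuming the tests are independent and each has exactly a 5% false-positive rate (the usual 12-of-16 cutoff is actually a little stricter, about 3.8%). The function names are illustrative only.

```python
def expected_false_positives(tests, alpha):
    """Average number of tests that pass by luck alone."""
    return tests * alpha

def chance_of_any_false_positive(tests, alpha):
    """Probability that at least one of the tests passes by luck alone."""
    return 1.0 - (1.0 - alpha) ** tests

print(f"{expected_false_positives(100, 0.05):.1f}")      # 5.0
print(f"{chance_of_any_false_positive(100, 0.05):.3f}")  # 0.994
```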
You wrote:
"Regarding how often we are told, "No one has ever passed a cable blind test," ...."
Actually, this has been mis-worded or misquoted.
Many people have 'passed' a cable listening test. I have 'passed' such tests.
However, the objectivists often require and demand more than reasonable levels of proof.
Unless a listening test is fully written up, and submitted to a peer-reviewed scientific journal, most objectivists won't accept it as any sort of proof. Even then, when faced with some sort of evidence, they often look for excuses to dismiss the data and to call for even further tests to "back up the one conducted", dismissing it wholly until more test data is submitted.
Several examples:
I have conducted many blind listening tests, and been able to identify cables using p~0.10, and have done this over a large number of runs.
However, I have not written them up formally and submitted them to a peer-reviewed journal, therefore the objectivists dismiss them as lies and propaganda.
A fellow named Vandy at a chat board called Audio Review conducted a series of listening tests some years ago, using a methodology similar to what I have recommended, and found that he was able to generate a statistical positive, yet he was so lambasted and flamed that he left the AR board never to return, due to the vicious and relentless badgering and hammering from the objectivists.
Others have attempted amateur listening tests for cable differences, and tried to 'submit' their statistically positive results on the news groups, only to be flamed and denigrated and hounded by dozens of hard-core objectivists.
I seriously doubt (and so do most subjectivists) that any amateur work would ever be accepted by the objectivist crowd, it is too easy to be a critic, too easy to not do anything, too easy to say "no, it can't be".
I am not trying to discourage you, but this is the situation we find ourselves in, a highly polarized one where things are portrayed as black and white, right and wrong, and no room for less than perfection (which will never happen).
Jon Risch
If one wire is louder than the other,
or one wire picks up hum from a nearby wire,
and the other doesn't,
or one wire has corroded terminations,
and the other wire doesn't,
then no expertise in statistics
will create a good test design.
Make sure the terminations are clean and tight.
Make sure the SPLs at 1000 Hz are the same, using a mike on a tripod.
A perfect test design would require ONE trial.
Golden Ears think ONE sighted audition followed by endless pontification about the wires, is a perfect test design!
With imperfect test designs, 12 correct of 16 trials reduces the probability of lucky guessing to 3.8%.
A typical test has 100 A-B-X comparisons within the 16 trials, and lasts a few hours.
Make sure you think you can hear a difference under sighted conditions BEFORE you switch to blind conditions.
No winking or hand signals or smiling when the "good" wires are connected by the test "leader"
Each trial can last 10 minutes or 20 minutes or an hour or a whole day
(Please see link for more details on the statistics)
Richard BassNut Greene.....................................................................
The "Cliff Claven" of Audio
and the "Floyd R. Turbo of Bingham Farms Michigan"
It is not logical to "accept" a hypothesis, only to tentatively reject the null hypothesis. There is a good body of literature on the notion of statistical significance when you do not test with the entire population but rather a random sample of sufficient size. You take a chance with random samples that your sample is not representative of the population and thus falsely reject the true null hypothesis. When this is sufficiently improbable, typically only 5 samples in a hundred, most reject the null hypothesis.
You are really just talking about probabilities. If something is quite improbable, most would say that random guessing is the best explanation. The problem is that you merely are saying that you could be guessing. You cannot say that the cables sound alike.
You may find the link below interesting. It has a good discussion on the difference between subject-matter and statistical significance.
*
"Whoever undertakes to set himself up as a judge of truth and knowledge is shipwrecked by the laughter of the gods." - Albert Einstein
The author throws out the baby with the bath water here. While he is correct about the misuse of statistical significance, he infers that it is useless. In a properly designed experiment with a well-defined hypothesis, hypothesis testing is not only useful, but required.
Really, the misuse of significance testing comes into play when the investigator is more interested in the strength of an effect, which in itself has nothing to do with significance.
Bringing up Bayesian statistics is a tired old argument. This has even more problems than the hypothesis-testing approach.
Interesting also that the USGS distributes Blossom, a highly useful package for statistical testing.
In a well designed experiment you really need a random sample to use statistical significance to reject the null hypothesis. A good experiment will use a big enough random sample given the anticipated strength of the relationship to reject the null hypothesis.
Certainly statistical significance testing offers no insight into the strength of the relationship although it is not infrequently used that way. Statistical significance, of course, increases with the sample size as well as the strength of the relationship. I remember hearing a paper in international relations where rather than using 40 countries the author used their relations as dyads. This, of course, greatly increased his "sample" size and won him many "significant" relationships. He was bombastic when I noted this. I said that it really didn't matter as although he had many "significant" relationships, he explained little of the variance and thus his research was trivial. He was livid, but it did not matter as I was doing the hiring for the position he applied for.
Statistical hypothesis testing is too often used as a substitute for the "goodness" of an effect. But this is dependent on sample size and other factors. It's just a lazy way to put a pseudo-scientific stamp of approval on whatever findings were developed. In an audio double-blind test, you might get a significant result without your ears registering enough of an effect to make it worthwhile. Now this is disregarding all the other problems that may make the test worthless or invalid.
When conducting any study based on acquiring data from the natural (or polluted) environment, you are conducting an undesigned experiment. The need to determine a meaningful effect size (subject-matter significance) to test for becomes a very important element of the sampling and analysis program.
*
"Whoever undertakes to set himself up as a judge of truth and knowledge is shipwrecked by the laughter of the gods." - Albert Einstein
> My understanding is that we have a "null hypothesis" which is that the
> cables can't be told apart. If I do well enough then we can reject the
> null hypothesis with a certain level of significance.
This is not correct. The procedure you describe is to test the hypothesis that you can tell the difference between your interconnects to a given level of confidence. Failing this test does not mean you cannot sometimes tell the interconnects apart. It means that you cannot reliably tell the interconnects apart.
In order to test that you cannot tell the interconnects apart at all you can use the same results but have to perform a different analysis to test a different hypothesis. The hypothesis you would want to test is: are my results simply a set of random guesses to a high level of confidence? This is a standard test but not one I have seen audiophiles perform, which is perhaps not surprising. It would be used to test if, for example, a die was fair or if a statistical measurement technique was unbiased. As a test it requires more samples for a given level of confidence because one needs to test the whole probability distribution rather than simply the mean. For example, if you took 100 measurements and perceived a difference in the first 50 and perceived no difference in the last 50 then although the mean is the same as the average of a random one the distribution is clearly not random.
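The post does not name a specific test. One standard option for checking whether a sequence of yes/no results looks random in its ordering is the Wald-Wolfowitz runs test, sketched below under that assumption. The function name is made up, and the normal approximation used here is only reasonable for fairly long sequences.

```python
from math import erfc, sqrt

def runs_test(answers):
    """Wald-Wolfowitz runs test on a sequence of True/False results.
    Returns (runs, z, two_sided_p) using the normal approximation."""
    n1 = sum(1 for a in answers if a)
    n2 = len(answers) - n1
    n = n1 + n2
    if n1 == 0 or n2 == 0 or n < 3:
        raise ValueError("need a mix of outcomes and at least 3 results")
    # A run is a maximal block of identical consecutive outcomes.
    runs = 1 + sum(1 for prev, cur in zip(answers, answers[1:]) if prev != cur)
    mean = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mean) / sqrt(variance)
    p = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return runs, z, p

# The example from the post: a difference perceived in the first 50
# measurements and none in the last 50.
print(runs_test([True] * 50 + [False] * 50))
```

With only 2 runs where about 51 would be expected, the ordering is far from random even though the overall score is exactly 50%.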
> My question now is: let's say I'm able to get a correct identification
> 80% of the time. How many trials would be needed for a 5% level of
> significance? For a 1% level of significance? For a 0.1% level?
If you can only tell the cables apart 80% of the time then it does not matter how many samples you take since you are not drawing from a population that can tell the cables apart. You would need to change the hypothesis being tested to something else.
"""This is not correct. The procedure you describe is to test the hypothesis that you can tell the difference between your interconnects to a given level of confidence. Failing this test does not mean you cannot sometimes tell the interconnects apart. It means that you cannot reliably tell the interconnects apart."""
Well, that's not what my little primer on statistics says. There is a null hypothesis (H_0), alternative hypothesis, (H_1). If the sound of the cable has no influence on my perception, then the probability of guessing correctly is p=0.5. In other words, completely random. That's the null hypothesis. If we run 18 trials and I get 15 of them correct, then P=0.004. That means there was a 0.4% chance that p actually equals 0.5. There is a 99.6% chance that p is *something other than 0.5*. But the test cannot tell you what p is. Just that it's not 0.5.
Guessing right 80% of the time, over a large number of trials, is in fact strong evidence that my perception is influenced by the sound of the cables. It may also be influenced by what I had for lunch, or what random thoughts are going through my brain. But it is very likely influenced by the difference in sound---hence evidence that I can tell them apart (i.e. that their sound influences my perception).
"Well, that's not what my little primer on statistics says. There is a null hypothesis (H_0), alternative hypothesis, (H_1). If the sound of the cable has no influence on my perception, then the probability of guessing correctly is p=0.5. In other words, completely random. That's the null hypothesis. If we run 18 trials and I get 15 of them correct, then P=0.004. That means there was a 0.4% chance that p actually equals 0.5. There is a 99.6% chance that p is *something other than 0.5*. But the test cannot tell you what p is. Just that it's not 0.5."
You mostly have it except for the meanings of the various probabilities.
p=0.5 ... that just means the probability of guessing correct on a single independent trial.
P=0.004 ... that just means the probability of getting at least 15 correct out of 18 trials due to chance alone.
And, as per my other (longer) post, the test does not involve rejecting small "p"; that's just an attribute of the mathematical model associated with the null hypothesis. What we are really doing is comparing our result with what the model "says" about the likelihood of such a value.
Everything matters, don't forget to tweak your placebos!
Edits: 06/25/09
Use the calculator located at the link below.
Example (set p=0.5 and don't change):
12 Trials (n=12), 10 Correct Identifications (k=10) [83% correct] (P-value) = 0.019287 (1.93%) is within your 5% level of significance
In general, as the number of trials increases, the required percentage of correct identifications for a given level of significance decreases.
Example: 75 Trials: 46 correct identifications (61.33%) is within a 5% level of significance requirement.
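If the calculator is not at hand, the same P-values can be sketched in Python. The function names here are hypothetical; the P-value is the one-sided chance of getting at least k correct out of n by guessing with p=0.5.

```python
from math import comb

def p_value(correct, trials):
    """Chance of at least `correct` hits in `trials` pure guesses (p=0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

def min_correct(trials, alpha):
    """Smallest score with a P-value <= alpha, or None if none exists."""
    for k in range(trials + 1):
        if p_value(k, trials) <= alpha:
            return k
    return None

print(f"{p_value(10, 12):.6f}")  # 0.019287
for n in (12, 16, 75):
    k = min_correct(n, 0.05)
    print(n, k, f"{100 * k / n:.1f}% correct needed")
```

The loop shows the trend described above: the more trials, the lower the percentage correct needed to reach the same level of significance.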
---
More importantly you should take the time to read the contributions by Les Leventhal to The Highs & Lows of Double-Blind Testing article that appears on the Stereophile site (Leventhal's stuff starts on Page 2). This is an excellent primer on the statistics involved in typical DBT/ABX tests. In particular you will learn of the problems inherent in tests with a low number of trials, such as 16 trials, notably the high probability of what is called a Type 2 error (which in the case of a DBT/ABX test generally means "... mistakenly concluding that audible differences are inaudible" as Leventhal puts it).
...
Back to your question ... we found that a 10 correct out of 12 trials test (83.3% correct) meets the 5% level of significance (pretty close to the 80% correct case).
Of course the probability of Type 2 error for a 12 trial test is even worse than it is for a 16 trial test, and the probability of Type 2 error in a 16 trial test is itself unacceptably high, certainly too high to be considered "serious science" to say the very least.
That said if you *do* consistently get 10 out of 12 correct then that in itself is strong evidence of sonic difference, the Type 2 Error concern comes in when failing to "score" 10 out of 12... just to be clear on that point.
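To put a number on the original question (how many trials for a listener who is right 80% of the time), one also has to pick an acceptable Type 2 error. The sketch below assumes a target of 80% power, i.e. a 20% Type 2 error; that target is an assumption for illustration, not something stated in this thread, and the helper names are made up.

```python
from math import comb

def tail(k_min, n, p):
    """Probability of at least k_min successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

def power(n, true_p, alpha):
    """Chance that a listener with hit rate true_p passes an n-trial test
    whose pass mark is set at significance level alpha under guessing."""
    for k in range(n + 1):
        if tail(k, n, 0.5) <= alpha:
            return tail(k, n, true_p)
    return 0.0  # no score can reach this significance level with n trials

def trials_needed(true_p, alpha, target_power=0.8, max_trials=200):
    """Smallest n (up to max_trials) whose power reaches target_power."""
    for n in range(1, max_trials + 1):
        if power(n, true_p, alpha) >= target_power:
            return n
    return None

for alpha in (0.05, 0.01, 0.001):
    print(alpha, trials_needed(0.8, alpha))
```

Changing `target_power` or `true_p` shows how quickly the required number of trials grows as the listener gets closer to guessing.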
Everything matters, don't forget to tweak your placebos!
Edits: 06/25/09
Meaningful significance and statistical significance are quite different. One can get very great statistical significance with a large random sample while having no meaningful significance. A random sample of 25,000 would be sufficient for belt size to have a statistically significant impact on how people vote with no meaningful significance.
"""Meaningful significance and statistical significance are quite different. One can get very great statistical significance with a large random sample while having no meaningful significance. A random sample of 25,000 would be sufficient for belt size to have a statistically significant impact on how people vote with no meaningful significance."""
But belt size is a continuous variable. We are talking about an ABX test, in which the answers are binary; true/false. Aren't these different topics?
This would only matter were you trying to say that all people can hear a difference and were using a random sample. As I understand it, you are only seeking to satisfy yourself, not to generalize. You seek to only say that it is improbable not that it is statistically significant. Inferential statistics are quite different than descriptive statistics.
Could you explain one more thing? I am aware that one can find correlation without proving causation. Seems like this issue is irrelevant to ABX testing.
Let's say in an ABX test I have a slight tendency to pick A. Because X is totally random in each trial, there is no way this can influence the results. That's my understanding.
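That understanding can be checked with exact fractions. The sketch assumes a pure guesser whose answer is independent of what X really is; the function and parameter names are invented for the example.

```python
from fractions import Fraction

def chance_correct_when_guessing(prob_say_a, prob_x_is_a=Fraction(1, 2)):
    """Probability that a pure guesser is right when X is A with
    probability prob_x_is_a and the guesser answers "A" with
    probability prob_say_a, independently of X."""
    return prob_x_is_a * prob_say_a + (1 - prob_x_is_a) * (1 - prob_say_a)

# With X chosen by a fair coin, any bias towards A still scores 1/2.
for bias in (Fraction(0), Fraction(1, 4), Fraction(9, 10), Fraction(1)):
    print(bias, chance_correct_when_guessing(bias))

# If X were A three times out of four, always answering "A" would score 3/4.
print(chance_correct_when_guessing(Fraction(1), Fraction(3, 4)))
```

The last line shows why the randomization of X matters: the bias only becomes harmless when X is a fair coin flip on every trial.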
easier to hear and better.
I must say that personally I am okay in just putting cables in and hearing a difference that I like. When I am studying whether people vote their party loyalty or whether states with concealed handgun laws have less crime, I am engaged in science and must be concerned with causation, explanation, and whether the data are valid for the questions I am asking. When I am deciding whether one set of cables is better than another, I am not engaged in science. I am assessing my tastes in sound. The magnitude of the improvement becomes important. Often I try blind tests if the improvement is small, but often if it is small I just stick with what I have.
Depending on how you conduct your testing, how thorough and rigorous you are, your hypothesis and conclusions may vary. And some may call your testing methodology "poor science," but it's still science. I take issue with neither those that want more rigor nor those that want less; but I always appreciate tolerance for both sides.
Both sides of the river, there is bacteria; there must be meaning behind the moaning, is this living?
minded. When engaged in information gathering to assess regularities of some benefit to society, I insisted on valid measurement of concepts, random samples, careful methodology, and care to avoid spurious relationships, when you cannot do real experiments. How I would love to randomly pick 25 states to have concealed handgun laws and 25 none and wait 20 years to see what differences there are between the states. My null hypothesis would, of course, be no differences.
I think it helps a lot that I am the whole population and that even if I'm wrong, there is little downside.
It seems to hit the fan each time someone posts that they did thus and so with good results. To someone else that may seem the depths of impossibility so they ridicule the poster rather than either trying it or just deciding that it's so unlikely that they aren't going to waste the time to check it out. Ridiculing others rather than thoughtfully examining your own understanding is very tempting and I've fallen off the wagon a few times myself.
I enjoyed this thread but like you, I believe, it's hard for me to see how statistics have much value for an individual listener. If I can't hear it, it doesn't matter and if I can I'll try to choose the best compromise if it isn't clear-cut. And I may share the result. Even if it isn't reliably predictive it does provide insights into things to try. And that's where AA shines, getting ideas to play with.
If I want to learn more about the underlying processes then I'd turn to measurements and try to find ones that correlate to the listening and from there try to reproduce the results with known changes which would hopefully be enough information to understand and usefully model whatever the process is.
Ironically even if it can be proven beyond question that Joe reliably hears a difference by putting marbles under his clock radio, that chunk of data alone adds little more to predicting my results than just his assertion that he does. One of the nicest things about this hobby is that you can try this stuff at home without the neighbors knowing.
Rick
"Meaningful significance and statistical significance are quite different."
An essential distinction.
There are two steps involved in going from statistical significance to meaningful significance . The first step bridges the gap from correlation to causation . The second step bridges the gap between an effect and a meaningful effect . Both steps are frequently contentious, as can be seen in numerous threads in this forum. The first step can be accomplished with a causal model, the second step requires a set of values.
Those people who view the world in terms of crude (e.g. black or white) facts should stay well away from anything to do with statistics.
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
Causation and explanation are the next steps in providing an understanding. I am mainly concerned with people dropping "statistical" and assuming that what they have found is meaningfully significant. You find this quite prevalently in research literature in the social sciences.
pounding of ones' chest ... I suggest you do a little searching and add if you locate as the addition would be a nice finishing touch to your post.
Everything matters, don't forget to tweak your placebos!
Not a problem, I was merely keeping it simple, not even addressing the larger issue; nor do I consider myself up to that task for that matter.
Everything matters, don't forget to tweak your placebos!
*
I understand.
However it brings to mind an experience I had in 1st year university. I had this Economics professor who had a way of creating in his students' minds, mine included, a feeling of now understanding how it all works.
But in a class near the end of the semester he shocked us (well he certainly shocked me in any case) by declaring that everything we had learned was essentially incorrect, that as we progressed we'd discover it all to be egregious over-simplification. Yet he added that he still felt that his style of teaching with conviction, as he put it, was the correct way to approach a topic, basically a variation on the theme that one must first crawl to walk and that when at the crawl stage one should concentrate on doing that (alone) to the best of one's abilities.
It was a real-life lesson that stuck with me.
Everything matters, don't forget to tweak your placebos!
Edits: 06/26/09
I would outline why each was said to be better. Then I would use data to show they were irrelevant. This is essentially teaching against the textbook.
Now I have written my own text and develop everything from the data themselves. It is still confusing, but for many I get them to think critically. The data show that "merit selection" of judges, where voters merely say whether a judge deserved another term or not does nothing for the quality of justice but does get younger judges and those with degrees from more prestigious law schools.
> Example (set p=0.5 and don't change):
Thanks for the link. Does lower-case p represent the probability of the null hypothesis?
OK, a little on the null hypothesis...
A null hypothesis is some statement that can be modeled mathematically, and hence something you can compare experimental results against, specifically against what the mathematical model "says" about the value obtained experimentally.
For example we can mathematically model the probabilities associated with coin tosses, the probability of getting a head (or a tail) on a single independent toss (trial) is 50%, and we can answer questions like:
. what is the probability of getting exactly 12 Heads when we toss the coin 125 times.
. what is the probability of getting at least 23 Tails when we toss the coin 50 times.
As it turns out the coin toss experiment for the second question, probability of getting X Heads (or Tails) for n tosses (trials) is modeled by the Binomial Mass Function when p=.5 (the probability of getting a head (or a tail) on a single independent trial).
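Both coin-toss questions can be answered with the binomial formula. A minimal sketch, with illustrative function names and assuming a fair coin and independent tosses:

```python
from math import comb

def exactly(k, n, p=0.5):
    """Binomial mass function: chance of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def at_least(k, n, p=0.5):
    """Chance of k or more successes in n trials (sum of the upper tail)."""
    return sum(exactly(j, n, p) for j in range(k, n + 1))

print(exactly(12, 125))  # exactly 12 Heads in 125 tosses
print(at_least(23, 50))  # at least 23 Tails in 50 tosses
```

The "at least" form is the one that matters for listening tests, since a score better than the pass mark also counts as a pass.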
Hence jumping ahead we see that the traditional DBT/ABX test is similar to coin tosses when modeled mathematically.
But before we get there ...
---
OK, let's say we have a cable test. We want to test if the cables "sound different". So what is our null hypothesis? Is it...
The cables sound different.
So we do our test, say we get 34 out of 50 correct, what does it mean? What is the mathematical model for "The cables sound different"? There isn't one! So forget it, that's not the null hypothesis!
Instead we propose that when it comes to being able to distinguish between the two cables (their "sound") that such is determined by "chance" alone, and that on a single independent trial the chance of correctly identifying X (i.e. in a traditional ABX test) is exactly 50%. Now we are getting somewhere, in fact that is our null hypothesis but we can put it simply as...
Distinguishing between the cables is determined by chance alone with p=.5
Hence now when we run the test and get some result we have something to compare against, namely the Binomial Mass Function with p=.5 (simply BMF hereafter). Aren't we clever!
Then we get to level of significance. Well for any test where the null hypothesis is modeled by the BMF (or by some other function for that matter) we decide in advance what result we require to "reject" the null hypothesis, that is we set a "level of significance" (LOS). A 5% LOS means...
We will reject the null hypothesis for any result for which the probability of obtaining a result at least that good due to chance alone is less than 5%.
So we run a test getting X correct identifications and the BMF tells us that the probability of getting at least X correct identifications is 7.5%. Well that's greater than our 5% LOS so we *don't* reject the null hypothesis, in other words we accept that X correct identifications could have been due to chance alone... which fails to demonstrate a difference between the cables.
Now if we get Y correct identifications and the BMF tells us that the probability of getting at least Y correct identifications is 1.3% then that's less than our 5% LOS so we *do* reject the null hypothesis, in other words we agree that the result *could not have been due to chance alone* (given our LOS) ... which would then imply that there is a difference between the cables.
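The whole procedure fits in a few lines. This is a sketch of the decision rule as described above, with made-up function names; the LOS is passed in so it can be fixed before the test is run.

```python
from math import comb

def chance_of_at_least(correct, trials):
    """BMF upper tail with p=.5: chance of `correct` or more by guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

def reject_null(correct, trials, los=0.05):
    """True when the result is rarer than the chosen level of significance."""
    return chance_of_at_least(correct, trials) < los

# The 34-out-of-50 example from earlier in this post.
print(chance_of_at_least(34, 50), reject_null(34, 50))
print(reject_null(11, 16), reject_null(12, 16))
```

On a 16-trial test, 11 correct does not reject the null hypothesis at a 5% LOS while 12 correct does.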
Hope that helps.
Everything matters, don't forget to tweak your placebos!
Edits: 06/25/09 06/26/09