DBTs, ABX and the Meaning of Life? Part 1

Talking about Double Blind Tests is worse than discussing politics or religion, and the infamous DBT thread death-spiral is all too familiar to most of us who have perused the various audio message boards or newsgroups on the Internet.

I personally have been accused of being anti-DBT, because I moderate a message board forum, the Cable Asylum, that has an anti-DBT posting rule. Nothing could be further from the truth.

I have often come out against the unwarranted conclusions that certain folks draw from the results of certain amateur listening tests, primarily because I am more familiar than most (but not more than jj) with the problems and limitations of such listening tests.

So I am going to discuss some of the various issues and aspects of DBTs, and in this, I am primarily referring to the amateur type of listening tests, not the professionally conducted types that jj and the codec folks do. I am not going to place this disclaimer at the end of every paragraph, so jj, please print out that sentence and attach it manually. If I specifically refer to a professionally conducted test, I will say so, quite clearly.

I also may refer to audio cable testing at times, but really, what I am saying applies to almost all audio component testing.

Now that that is out of the way, let's get down to brass tacks.
What is valid? That is, what is a valid listening test, or what constitutes a scientifically valid set of data?

The gold standard for many years has been serious studies or papers published in a peer-reviewed professional journal. There are many reasons for this, and I am not going to cover all of them. Suffice it to say that this kind of presentation allows one to examine all the facts, the procedures, and the data. It provides for the review of the paper and its contents by peers in the field, and it is published where other professionals have access to it and can question it or raise points they feel have been overlooked. Does such publication guarantee that the conclusions reached by the author are correct? No, but it does provide a certain minimal level of information, screening and review that makes the data and conclusions useful to a certain point.

DBT test results from certain amateur listening tests get thrown about sometimes as if they were cold, hard facts; after all, it was "scientifically determined" that such and such was the case, right?

However, when we look at these DBT listening tests more closely, we find that most have not been published in a peer-reviewed professional journal; in fact, not one DBT on audio cables has been published in such a manner. None. Very few listening tests on other audio components, with the exception of codecs, have been so published either. There have been a few landmark studies on speakers, a la Toole, and most people agree that audio loudspeaker systems do sound different, so this is not one of the more controversial components of study.

So why all the noise about DBTs? Where have they been published, and are they valid evidence? Well, for audio cables, only a handful have been published in popular press magazines. Note that this is not the same thing as being published in a professional journal: an editor may or may not have an agenda, no one else may be reviewing the article for accuracy or proper scientific procedure, etc. When I say just a handful, this is literally the case, as there are only about a half dozen (depending on your criteria) on speaker cables, and a few on interconnects. For other audio components, there may be a half dozen articles or so. Not all of these came up with null results either, so it would be very hard to come to any sort of real conclusion based on the data from these articles.

What about web sites, message board posts, newsgroup posts? These are what is known as anecdotal data: they usually have not provided all the details of the tests, nor all of the data, nor have they been reviewed by anyone for proper scientific procedure.

The vast majority of listening test accounts are of an anecdotal nature, and anecdotes are traditionally not allowed to be considered as any sort of good scientifically based evidence.

So the very thing being argued about, DBT listening tests on audio components, is not of a nature that can be considered truly valid scientific evidence.

So what about these amateur listening tests, these anecdotal web sites, the popular press magazine articles: are they any good to make judgments from?

One of the great little catchphrases that gets used by some folks extolling these amateur DBTs is that "In 20/25/30 years of testing, no one has found XXX audio component to sound different, under controlled conditions, when nothing is broken", etc.

This is meant to sound like DBTs put the matter to bed years ago. It sounds all very well and fine, until you stop to ask: what was the state of the art 25 or 30 years ago? What kind of cables would have been compared 25 or 30 years ago? I can tell you: zip cords against zip cords. Several of the articles commonly referred to by folks citing popular press DBT results were this old. Some of the articles on CDPs are 14 to 17 years old. How far have CDPs come in that length of time? I mean, we are talking about CDPs that probably did not even deliver 15 or 16 true bits of resolution, used no dither, had multi-stage, multi-opamp analog output filters, etc.

So if you stop to think about how valid, how relevant some of these really old tests are to the current state of audio, including mid-fi, then it becomes clear that some of them are not really of any use for modern audio components.

What about the tests themselves, how were they conducted? Let's look at a typical scenario for one of the more popular testing paradigms of the day: an ABX style listening test. Note that this is not intended to represent ALL such tests, but merely to provide some idea of what went on in many of the amateur listening tests commonly cited.

First, an ABX switchbox was used to connect the two DUTs (Devices Under Test). In most cases, this required additional cables to be used to insert the switchbox into the signal chain, so it could control which unit was being heard at any given time. The extra cables were, almost without exception, just zip cords and/or el cheapo ICs. Even when an audio cable was the subject of the test, the extra cable portion was almost always a zip cord or an el cheapo IC. The reasoning here was that both units were subjected to the same conditions, so it shouldn't matter. So much for the weakest link.

For cable tests, this would be a serious limiting factor, as whatever losses or problems the zip cords or cheap ICs had were now superimposed on the test cables as well. Ironically, since the vast majority of testers did not believe that audio cables had any sonic impact, they created a situation that virtually guaranteed that it would be hard, at best, to hear what was going on.

Then the listener was asked to listen to the test units, and 'familiarize' themselves with the switchbox and listening protocol.
Typically, while the listener was listening and switching back and forth, the music was allowed to play on. The first portion was the so-called sighted portion of the test, where they knew the identity of each unit (which one was A, and which one was B). The listener was often encouraged to switch back and forth during this portion, and to state whether or not they felt they were hearing the same kinds of sonic differences they had heard under sighted listening without the switchbox. More on this aspect later.

Then after what might have been hundreds of switches back and forth, under what I would call fairly casual conditions, they would enter the forced choice portion of the listening test, and be asked to identify an unknown DUT, presented as X. They still had access to hearing DUT A or B, and still knew what device the A or B unit was, but X was an unknown, and they were asked to make a choice as to whether it was unit A or unit B.

Classically, they were exposed to a total of 16 trials where they had to select which unit they thought X was, and since this was what is known as a forced choice type of situation, even if they readily admitted that they did not think they could identify the DUT, or that they had listening fatigue, they were still supposed to make a choice.
Note that each trial could consist of as many switches as the listener wished, from A to B and back again, and to X and back again.
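
To make the bookkeeping concrete, here is a minimal Python sketch of a single run of this forced choice protocol, simulating a listener who is purely guessing (the function name and structure are mine, for illustration only, and not anyone's actual test software):

    import random

    def abx_run(num_trials=16):
        """Simulate one forced-choice ABX run for a purely guessing listener."""
        correct = 0
        for _ in range(num_trials):
            x_identity = random.choice(["A", "B"])  # hidden assignment of X for this trial
            guess = random.choice(["A", "B"])       # a choice is forced, fatigued or not
            if guess == x_identity:
                correct += 1
        return correct

    print(abx_run())  # scores near 8 out of 16 are typical for pure guessing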

A single listener might only participate in a single run of 16 trials, and there might only be a handful of such listeners.

Once the listening tests were completed, the test administrator would check the ABX hardware for the accuracy scores of the listener, and check these against a table of probabilities, to see how likely it was that the listener had actually been identifying the DUT beyond sheer chance.

The benchmark for the 16 trials was to get 12 or more correct; this would then establish that the listener had less than a 5% chance of having guessed that many correct. This is what is known as a confidence level of 95%. The criterion for what was considered 'good enough' so as to not be just due to chance is supposed to be selected before the test, and then adhered to. Other confidence levels could be used, such as 99% (very strict, and usually extremely hard to achieve in these kinds of tests), or 90%. It should be noted that, for a 95% confidence level, just conducting 20 runs would typically result in one that appeared to exceed the 95% confidence level, even if every choice was completely random. So in order to take the test results as a valid positive, one would have to do better than this on average.
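
Those figures are easy to verify with a few lines of Python (a sketch using only the standard library, with the 16-trial, 12-correct criterion from above):

    from math import comb

    def p_at_least(k, n=16, p=0.5):
        """Chance of getting k or more correct out of n trials by pure guessing."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    print(p_at_least(12))                # ~0.038: under 5%, so 12/16 'passes'
    print(p_at_least(11))                # ~0.105: 11/16 is not good enough
    print(1 - (1 - p_at_least(12))**20)  # ~0.54: better-than-even odds that at least
                                         # one of 20 all-guessing runs appears to 'pass'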

Much was made of these kinds of tests, mainly because they were Double Blind, due to the use of the automated switchbox hardware. The test administrator did not know the identity of the X unit until after the test was completed, and therefore was theoretically incapable of influencing the outcome of the tests.

What were the problems with these early amateur DBTs?

Unfortunately, they were legion.

It was often assumed that since these tests were double blind, they represented the only 'true' kind of valid listening test available. However, it was often overlooked that the mere fact that any given listening test was a DBT did not guarantee ANYTHING else at all. It could have been the world's worst listening test ever, and still have been double blind.

The long open (sighted) initial portion was not really training, nor was it a valid control of the test sensitivity. In my opinion, it was more of a fatigue-inducing situation than anything else. The listener seldom got any real training: they were not exposed to the forced choice scenario until it was time to 'perform', and they were not really trained in terms of what kinds of things to listen for, what kinds of things to home in on, etc.

The music was typically left to play on, and this is a huge error in procedure. In essence, the listener was never comparing the same signal on both DUTs at any given time; in fact, the same signal was NEVER compared, ONLY different signals were ever compared. This is such a big problem with the procedure that such listening tests could be summarily dismissed as invalid based on this alone.

In terms of listening fatigue, the listener was encouraged to switch back and forth as often and as much as they desired, and this often led inexperienced and untrained listeners to switch back and forth a huge number of times, all the while not really focusing on the musical presentation that much. Again, with the music playing on, it would be very hard to make any sort of valid choice, and just as hard to hear what the two units were doing even when you knew which one was which.

This, combined with the typically open-ended initial sighted portion, and the relatively large number of trials, each of which might include dozens or even hundreds of switches back and forth between the various units, was in my opinion the cause of a lot of listener fatigue, and therefore also a very significant factor in these kinds of tests coming up with null results.

Then there was the issue of the switchbox itself, and the extra cables, often of a very poor overall quality level. The relays inside the ABX boxes were of various types over the years; the early ones were mercury-wetted reed relays, the later ones supposedly had ruthenium-plated relay contacts. It has been argued that the switchbox was a source of significant degradation of the listening test's resolving power, due to the extra cables and contacts involved. The signal was exposed to magnetic fields inside the relay, and had to travel through a lot of extra wiring and contacts compared to a normal direct real-world connection.

Defenders claimed that the ABX switchbox had passed two tests that assured it was transparent, aside from the usual objective measurement standards of THD, noise and the like:
One, it had been tested using yet another ABX switchbox, and the results had turned up as a null.
Two, J. Gordon Holt, the golden ears of Stereophile fame, was said to have found it to be 'inaudible' during one of his listening sessions long ago.

Well, I hope that I don't have to explain the fallacy involved with the first assertion, and the second one is ironic, as one of the very things that the ABX folks were against was the acceptance of any pronouncements from golden-eared reviewers using sighted listening to review audio products. I find it incredible that they wanted to dismiss and discount all the other reviewers, and Mr. Holt as well when he was reviewing audio equipment, but it was OK to accept his pronouncement on THEIR unit as being transparent when he was using the same methods. Even so, it is a good idea to note that this occurred back in the '80s, so who knows what one would hear using modern high performance audio gear?

Finally, the confidence level chosen, as well as the particular number of trials, created a very high 'bar' to hurdle: the listener had to be hearing really definite things, and more subtle differences would not have been easily discerned under the requirements chosen.
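
To put a rough number on that bar: suppose a listener genuinely hears a subtle difference, but only reliably enough to pick the correct unit 60% of the time (an assumed figure, purely for illustration). A quick search in Python shows how long a run would have to be before such a listener could be expected to pass:

    from math import comb

    def tail(k, n, p):
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    def trials_needed(p_true=0.6, alpha=0.05, power=0.80):
        """Smallest run length at which a listener who is right p_true of the
        time passes the 95% criterion at least `power` of the time."""
        for n in range(16, 1000):
            # smallest passing score that keeps the pure-guessing odds under alpha
            k = next(k for k in range(n // 2, n + 1) if tail(k, n, 0.5) <= alpha)
            if tail(k, n, p_true) >= power:
                return n

    print(trials_needed())  # on the order of 150 trials, versus the 16 actually used

A subtle but real difference simply can not clear a 12-of-16 bar with any regularity.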

Despite all of this, certain folks try to cite these old DBTs as definitive evidence of no sonic differences for audio cables, CDPs, power amps, etc. Not only are the problems cited above good reasons not to do this; even if none of those problems had existed, and all the items objected to had been corrected, there would still be a fundamental problem with doing so.

This fundamental problem is the equating of a null result, that is, a listening test result that simply failed to reach the previously defined criterion for a statistically significant result, with a negative result.

If you have a controlled listening test, and it fails to reach the defined level of statistical confidence, then the result is often called a null result, or "failing to reject the null hypothesis". However, this kind of result really and truly has no other meaning. You can not legitimately equate a null result to a negative.

Some folks have tried to argue that the equating of a null with a negative is legitimate, and even cited a lone book as a reference. However, the vast majority of statistics books, professors, and accepted authorities still maintain that doing so is just not correct.

The primary reason for not doing so was touched on earlier: you can not know how sensitive the listening test set-up is unless you have performed a control experiment to determine this. Without such a control, a test that determines how sensitive both the listening test set-up and the listening subjects are to very subtle sonic issues, you can not have any chance of knowing that the listening test was even inherently capable of discerning what was being tested for!
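
Here is a concrete illustration (a sketch, with the 70% per-trial accuracy an assumption chosen for illustration): even a listener who really does pick the correct unit 70% of the time fails the 12-of-16 criterion more often than not.

    from math import comb

    def tail(k, n, p):
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    # A listener who genuinely hears the difference, and is right 70% of the time:
    print(tail(12, 16, 0.70))  # ~0.45: this listener fails the test more than half
                               # the time, and each failure produces a 'null'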

In the ABX style listening tests, the comments by listeners in the sighted portion that they are indeed hearing what they expected to hear are often cited as sufficient provision of this test sensitivity control.
However, this is NOT a scientific way to establish the control condition. It is another example of begging the question. Just as you can not use the test to test the test, you can not use a sighted portion to verify the performance of the forced choice portion. This is yet another example of the incorrect reasoning used to justify these kinds of listening tests, and how valid they are supposed to be.
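
For what it is worth, here is a sketch of what a real sensitivity control might look like (the 90% and 65% per-trial accuracies are illustrative assumptions): run the very same rig and pass/fail criterion on a difference already known to be barely audible, and see whether the test detects it.

    from math import comb

    def tail(k, n, p):
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    def control_pass_rate(p_listener, n=16, criterion=12):
        """Chance the rig passes a control run on a known-audible difference."""
        return tail(criterion, n, p_listener)

    print(control_pass_rate(0.90))  # ~0.98: a sensitive rig nearly always passes
    print(control_pass_rate(0.65))  # ~0.29: a degraded rig flunks its own control,
                                    #        exposing its lack of sensitivity

Only after a rig passes a control like this does a null result begin to carry any weight.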

Part 2 will cover the inherent problems and flaws with DBT listening tests, even when done impeccably. Part 3 will cover alternate methods and include some comments on doing your own DBTs.


Jon Risch

