In Reply to: Wanted : participants for long-term ABX test posted by Klaus on September 29, 2003 at 03:46:23:
This approach to a test will only work if certain requirements are met. First, the recording process must not eliminate significant differences in the results with each cable. Take volume/level, for example. If one IC has significantly different resistance from another IC, one would expect to hear a difference in volume when making the substitution. Using the ICs as part of the recording chain, the level going to the recorder should be higher with one than with the other. If you adjust the recording level to match input levels for both ICs, then you have removed part of the difference, just as you would remove part of the difference if you adjusted the listening level in the first instance so that the listener heard matched levels.
How do you propose to make the recording in a way that ensures that no significant differences are lost? You can't match recording levels because the signal level coming through the IC is one of the things that could distinguish it from other ICs. You also need to ensure that there's nothing in your recorder that accentuates the characteristics of one IC while diminishing those of the other.
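To make the level point concrete, here is a rough back-of-envelope sketch (my own illustration, not from the thread) of how one might estimate the level change caused purely by a difference in series resistance between two interconnects feeding a given input impedance. The resistance and load values are assumptions chosen only for the example.

```python
import math

def level_difference_db(r_cable_a, r_cable_b, r_load=10_000.0):
    """Level difference in dB between two cables of different series
    resistance driving the same load impedance (simple voltage divider)."""
    gain_a = r_load / (r_load + r_cable_a)
    gain_b = r_load / (r_load + r_cable_b)
    return 20.0 * math.log10(gain_a / gain_b)

# Hypothetical values: 0.1 ohm vs 2 ohm series resistance into a 10 kohm input.
print(f"Level difference: {level_difference_db(0.1, 2.0):.4f} dB")
```

Plugging in other assumed resistances or a lower input impedance shows how large the mismatch would have to be before the divider alone produces a level shift of any given size.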
Next, on the test design itself. Why ABX unless you're making a different CD for each individual? My understanding of ABX is that the X means that on some changes, no substitution is made, so sometimes the listener gets two As in a row and sometimes two Bs. Different listeners actually get presented with a different play order and the administrator has no idea of what the play order presented to the listener is. With 3 tracks you're only going to get 2 identical tracks in a row on one occasion, a very different situation to a proper ABX test where there are many different presentations with several duplications along the way.
Are you going to create different CDs for each listener, with a different track order and duplication, so one person gets AAB, one ABA, another BAA, another ABB, another BBA, and yet another BAB in order to cover all 6 possibilities of play order with only 3 tracks? That's the sort of detail in the test process that is needed to guarantee that the presentations are random enough to ensure that the play order isn't influencing outcomes. Of course you'd probably also need to up the number of participants so a reasonable number of people got each presentation.
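As a sketch of the bookkeeping this implies (my own illustration, not the poster's procedure), an administrator could spread the six possible three-track orders evenly and randomly across the volunteer list, for example:

```python
import random

def assign_orders(participants, seed=None):
    """Spread the six possible play orders of two A tracks and one B track
    (or vice versa) evenly, in a random sequence, over the participants."""
    orders = ["AAB", "ABA", "BAA", "ABB", "BAB", "BBA"]
    rng = random.Random(seed)
    assignments = {}
    for i, name in enumerate(participants):
        if i % len(orders) == 0:
            rng.shuffle(orders)          # new random round every six listeners
        assignments[name] = orders[i % len(orders)]
    return assignments

print(assign_orders([f"listener{n:02d}" for n in range(12)], seed=1))
```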
That then means that you need to keep a list of who gets which disc so that the results can be properly compiled. If you receive the results directly and pass them on to the umpire, there's no guarantee that things don't get altered in the process. You need a way of ensuring that you, as administrator, don't influence the result. You could do that by having people send results directly to the 'umpire' who compiles and analyses them, or ask people to send results to both you and the 'umpire' who both separately compile and analyse and either present your results individually or in a joint presentation. Given the nature of the net as a posting method, I'd prefer to see individual reporting by both you and the 'umpire'. Either way, the umpire needs to have their own copy of the list of who gets which CD.
If you aren't prepared to go to that sort of length, then why not a simple AB test with half the participants getting a disc with an AB play order and the other half getting a disc with a BA play order?
I'm interested, but only if I'm satisfied that the test methodology is reasonable. On the basis of your 3 track ABX suggestion as presented in your post, I have grave doubts that will be the case. I think you need to do a lot more work on your test design first, and be prepared to give a detailed account of what the test design is when you ask for volunteers. Since you're the one who regards blind testing as so much more reliable, you'll have to excuse me for demanding the utmost stringency in test design and procedure. This is only worth doing if it's worth doing extremely well.
Follow Ups:
1. I would not adjust the recording levels for the interconnect tracks; I would, however, adjust the level when using CD and MD as source (pink noise from EBU SQAM).
2. If my recorder accentuates one cable with respect to the other, this would be beyond my knowledge and beyond my control. If this is technically possible, I would like to have a technical explanation.
3. I would record several sets of tracks, not only one.
4. All CDs will be identical. The participants will, however, not know who else is participating (unless, obviously, they ask all of the inmates who declared their interest in the test). This ensures that no participant can communicate with the others during the test.
5. I will ask the participants to communicate their results to myself and the umpire. The identity of the umpire, however, will have to be kept secret until everybody has finished. Also, the umpire will not know who is participating, so as to avoid possible interfering communication.
6. For possible statistical analysis of the results I would need help of an expert inmate.
Proper tests use many more than 3 presentations. The reason for that is that the listener is presented with 4 different changes at various stages:
A followed by B
A followed by A
B followed by A
B followed by B
What this does is allow for the fact that the order of presentation may make a difference, so it may be easier or harder to detect a change when B follows A than it is when A follows B. It also gives instances when there is no change with both of the 2 options available. Finally, each of those 4 presentations is presented several times because, if the differences are subtle and people really are discriminating something that is close to the audibility threshold for the difference, results can be variable.
Even using a variety of different material, presenting each sample only 3 times with one sample repeated twice simply does not stack up to professional test standards. Only 2 transitions are represented - track 1 to track 2, and track 2 to track 3, one of which is one of the two changes possible and the other of which is one of the two non-changes possible. By not presenting the full range of presentations possible, you degrade the validity of the test and by only presenting each of your 2 transitions once, you degrade the validity again in a different way. This is simply not a good test approach.
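As a sketch of what a fuller presentation schedule could look like (my own construction, not a published protocol), each of the four transitions would appear the same number of times, in an order the listener cannot predict:

```python
import random

def balanced_transitions(repeats=4, seed=None):
    """Return a shuffled schedule containing each of the four transitions
    (A->B, A->A, B->A, B->B) exactly `repeats` times."""
    rng = random.Random(seed)
    schedule = [("A", "B"), ("A", "A"), ("B", "A"), ("B", "B")] * repeats
    rng.shuffle(schedule)
    return schedule

for first, second in balanced_transitions(repeats=2, seed=7):
    print(f"{first} -> {second}")
```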
You are also going to have problems with selection bias on this test because you are asking for volunteers, and a small number at that. You have said in the past that tweakers have a tendency to hear what they want to hear, but that tendency is not unique to tweakers. People who don't believe there are differences are equally prone to hearing what they want to hear, ie no difference in their case. You have no way of ensuring that your sample is unbiased, and a sample size of 15 is simply too small, especially given the fact that you are excessively reducing the transitions presented. Having a strong proportion of people who believe that there are no audible changes and who simply report 'no difference' to each change would distort the test result excessively.
Finally, what do you think you are going to prove by this? You are not going to prove anything about whether or not there is an audible difference. There are simply too few presentations and too few subjects to guarantee a result. In fact, it is probably impossible to show that a difference can be heard with this test design. The difference would need to be night and day to get results that would satisfy statistical requirements. The harder the difference is to show, the larger the test sample needs to be and the more presentations are required. You need really big tests if really small differences are to be demonstrated.
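A small worked example (my own arithmetic, not from the thread) shows why very few presentations cannot satisfy the usual statistical requirements: under the null hypothesis that the listener is guessing, each forced choice is a coin flip, so even a perfect score on n trials occurs by chance with probability 0.5^n.

```python
def p_value_perfect_score(n_trials):
    """One-sided chance of getting every trial right purely by guessing."""
    return 0.5 ** n_trials

for n in (2, 3, 5, 10, 16):
    print(f"{n:>2} trials, all correct: p = {p_value_perfect_score(n):.5f}")
```

With only two or three scored transitions, even a flawless listener stays above the conventional 0.05 threshold, which is the sense in which such a design cannot demonstrate a difference no matter what the participants hear.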
Bear in mind that for something to be audible, all that is really required is for 1 person to be able to hear it reliably, ie to accurately say whether or not there is a difference every time they are presented with a transition. Why then do we need studies of the sort under discussion? Simply because no single individual is that reliable in relation to differences that are close to the limits of perception, so it becomes more a question of 'can we hear it more often than not' rather than 'can we hear it every time' or 'does everybody hear it', and that is a very different sort of thing to telling the difference between red and green, where you know there is a problem if the person doesn't get it right.

There is a 'grey area' between the clear cut differences which everyone hears under normal circumstances unless they have a hearing impairment, and the opposite situation where the difference between the two things is genuinely so small that absolutely no-one can ever hear it. Not everyone has the same level of hearing acuity, so as genuine differences become more subtle, fewer and fewer people hear them until eventually no one can hear anything. The closer you get to the point where no one can hear it, the bigger the test needs to be if you really want to stand a chance of demonstrating that some people can hear it.

If cables really do make a difference, then that difference is in the 'grey area', because not everyone hears it and not all of those who do hear it hear it with every cable change, so your small study simply can't measure up to the standards required. If it can't possibly show a difference exists because the sample size isn't large enough for validity, then the test is simply useless.
Of course, if you only want to prove that most people can't hear the difference, it's a lot easier but you can't use a test which does that as a basis for claiming that no one can hear it.
Your test as described is simply too limited and your sample size far too small to be capable of demonstrating that people can hear anything other than a 'night and day' sort of difference. Don't believe me on face value on this - go and talk to a statistician or someone involved in hearing research and tell them you want to test for something where there is genuine disagreement about whether or not there is an audible difference. They will tell you the same thing, but at least they will be your chosen expert so you should be more prepared to trust them rather than me on this.
1. I agree that for the A-B samples both orders of presentation are necessary, A-B and B-A. The participants are perfectly able to repeat the sets of samples as often as they wish.
2. I was not aware that the order of presentation A-B-A was flawed in itself. At least, that argument has never, to my knowledge, been presented when discussing that method.
3. No listening test can be sure to have a good mix of believers and non-believers, unless you select the participants using that parameter. In this particular test, I could ask the ones who are going to participate whether they consider themselves believers or not.
4. I know that such a test will not prove anything, but it is able to add evidence to the discussion. If there are valid technical reasons that do show that this test is not reliable, and I'm still waiting for such reasons, then we'd better stop.
5. Of course I'm not in a position to verify if the participants are able to obtain consistent results, which would qualify them as test subjects. But the common audiophile is not verifying his consistency either when judging audio gear.
6. I know perfectly well that the sample size might be too small to give reliable results. One more reason for you to participate :-)
1. I agree that for the A-B samples both orders of presentation are necessary, A-B and B-A. The participants are perfectly able to repeat the sets of samples as often as they wish.
You miss the point. In a proper blind test, neither the listener nor the administrator knows the order of what is being presented. The use of the X in ABX - the random insertion of no-change transitions - jumbles the order, and neither the listener nor the administrator knows after the first presentation - the A presentation - whether they are listening to A or B. You're suggesting introducing the ability for the listener to know whether they're listening to track 1, 2 or 3 and to compare them at will - a totally different set of conditions to a blind test.
2. I was not aware that the order of presentation A-B-A was flawed in itself. At least, that argument has never, to my knowledge, been presented when discussing that method.
Once again you miss the point. First, are you saying that you are just going to present the 2 options in the order ABA? That is not what ABX does, and you're calling this an ABX test, which it simply isn't. Secondly, if people know the order, it isn't blind. The idea is that the subject doesn't know what the transition is from or to. That includes being able to identify which are the different tracks. Using an ABA order and stating it as you have just done turns this into a totally sighted test and also never presents the subjects with a no-change transition, an integral part of the real testing process.
3. No listening test can be sure to have a good mix of believers and non-believers, unless you select the participants using that parameter. In this particular test, I could ask the ones who are going to participate whether they consider themselves believers or not.
Wrong. University tests often use first year students and participation is compulsory. You get large numbers with effectively random choice - everyone taking a first year psych course - which means you may get some audiophiles of both persuasions and a whole lot of people with no interest in the audio as distinct from music and who have no idea what they are listening to. In fact, if the subjects aren't told what is being changed, even the subjects interested in audio have no idea whether the change is in a component, a cable, or even different recordings of the same signal with some sort of frequency altering going on. It's not always easy to avoid problems of subject bias, but there are ways and bigger samples always help.
4. I know that such a test will not prove anything, but it is able to add evidence to the discussion. If there are valid technical reasons that do show that this test is not reliable, and I'm still waiting for such reasons, then we'd better stop.
If it simply isn't capable of proving anything, it can't add evidence to the discussion. If the test isn't good enough to have the capacity to resolve a difference if one really does exist, it can't generate any evidence whatsoever. It's like giving a person a set of binoculars with the lenses painted black and asking them to describe a distant object by observing it through the binoculars. The fact that the person can't see anything doesn't prove a thing about their vision or the ability of binoculars to assist in viewing far objects. The test instrument has to be good enough to capture reliable data if the results are to provide any sort of evidence at all.
5. Of course I'm not in a position to verify if the participants are able to obtain consistent results, which would qualify them as test subjects. But the common audiophile is not verifying his consistency either when judging audio gear.
Consistency isn't a requirement for a test subject. At close to the threshold of audibility, no person - audiophile or otherwise - is consistent anyway. And consistency isn't necessary in most cases in judging gear. The audiophile simply has to reach a decision that satisfies him/her by whatever means they choose. After all, equipment choice and taste in sound are personal preferences, and are subject to change over time.
6. I know perfectly well that the sample size might be too small to give reliable results. One more reason for you to participate :-)
If the sample size is too small to give reliable results, it doesn't matter who participates. The results will always be unreliable because the sample size simply isn't up to demonstrating what you want. When you say that you know the test won't prove anything, and that the sample size may be too small to give reliable results, you are admitting that you can't draw any conclusions from the results at all. How can you if you know it won't prove anything? It's only worthwhile doing if it is capable of proving something. A single positive test is never sufficient for proof on its own - it needs to be replicated, possibly a few times, before the findings are accepted, but no one is even interested in trying to replicate a test that is incapable of proving anything. The results of such a test are simply meaningless.
I would be quite happy to participate in a meaningful test, but I'm not happy to participate in a meaningless test. I also think that it is quite improper for you to seek volunteers for such a test when you know that the procedure is flawed, and to say that you will present the results, because you are going to present something that is quite meaningless in a way that is misleading. That isn't genuine research and it isn't an honest approach. I would say it was a misguided approach if you weren't aware of the failings in the process, but based on some of the statements in your reply you do know enough to realise that there are significant problems with your approach, so to continue with it really is dishonest in my view.
David, even when big-scale ABX tests are conducted with sufficient sample size and appropriate statistical evaluation, you will find people like Robert Harley and others who will tell you that, and why, this kind of test is inherently flawed. Why on earth should I put tremendous effort into this test just to hear that it's useless anyway? On my scale the effort is limited. If you, Jon and others think that it's useless, fine. I don't share your view.
Btw, I'm still waiting for someone to present technical, not methodological, reasons that would speak against the test.
If a test isn't valid, then it's useless regardless of whether the reason is a technical reason or a methodological reason. Conducting a flawed test, regardless of the nature of the flaw, and publishing the results as if they mean something when they can't is simply dishonest, as I have said. If you want to conduct a methodologically flawed test, go right ahead, but you can't then honestly claim that the results mean anything at all. You can't use them to support your current view on the topic, and you can't use them as a reason for giving up your current view on the topic. There really is no reason to do the test unless it is both methodologically AND technically valid.
There really is nothing more to say. If that point means nothing to you, then you may as well believe whatever you like and forget about tests and any form of evidence whatsoever. Conducting a technically correct but methodologically flawed test isn't bad science - it just isn't science at all.
[ The use of the X in ABX - the random insertion of no change transitions, jumbles the order and neither the listener nor the administrator knows after the first presentation - the A presentation - whether they are listening to A or B. You're suggesting introducing the ability for the listener to be able to know whether they're listening to track 1, 2 or 3 and to compare them at will - a totally different set of conditions to a blind test. ]
Forgive me if I am laboring under a misconception, but from this and your original post, I get the impression that you are confused about what an ABX type test is. In so far as audio is concerned, I take it to mean that the listener (and administrator) know which unit A is and which unit B is, but X is presented as an unknown, and the listener is asked to make a forced choice as to whether they feel it is A or B.
The listener can switch back and forth between A and B as many times as they like, and in some instances, listen to X as many times as they like (the case here with CD tracks), and then make their decision.
This is the classic ABX test as proposed by Clark et al, and what most folks are referring to when they speak of an ABX type test.
See:
http://www.provide.net/~djcarlst/abx_new.htm
particularly
http://www.provide.net/~djcarlst/abx_p9.htm
A general description of the ABX test procedure when using the ABX switchbox is at:
http://www.bostonaudiosociety.org/bas_speaker/abx_testing.htm
My comments on this type of testing, and other amateur DBT tests, are at:
http://www.audioasylum.com/forums/prophead/messages/2190.html
and at:
http://www.audioasylum.com/forums/prophead/messages/2579.html
and
http://www.audioasylum.com/forums/prophead/messages/2580.html
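For illustration, here is a minimal sketch of the forced-choice procedure described above (my own rendering of that description, not software from the linked pages; the playback step is only a placeholder):

```python
import random

def run_abx_trials(n_trials=16, seed=None):
    """Run a bare-bones ABX session: X is secretly A or B on each trial and
    the listener must say which; returns the number of correct answers."""
    rng = random.Random(seed)
    correct = 0
    for trial in range(1, n_trials + 1):
        x_is_a = rng.random() < 0.5   # hidden assignment of X for this trial
        # A real session would let the listener replay A, B and X at will here.
        answer = input(f"Trial {trial}: is X the same as A or B? [A/B] ").strip().upper()
        if (answer == "A") == x_is_a:
            correct += 1
    print(f"Score: {correct}/{n_trials} correct")
    return correct

if __name__ == "__main__":
    run_abx_trials(n_trials=5)
```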
That does mean he could get away with 3 tracks as he originally suggested, but not his "ABA" giveaway in his reply to my initial criticisms. Most of my knowledge of testing is on health issues, especially epidemiological studies, which are a different kettle of fish, though the research I did was not epidemiological and was much simpler. I got enough experience to know I don't have a lot of experience, but at the same time enough to do some basic assessment of published data in my field and to appreciate the damage that sloppy tests which get wide publicity are capable of doing.
A good test is worth every bit of effort it takes, even more than it takes, and it takes a lot. A bad test just keeps on causing problems for years.
You said in your point 4: "If there are valid technical reasons that do show that this test is not reliable, and I'm still waiting for such reasons, then we'd better stop." I gave you valid reasons and so did Jon Risch. I also suggested that you discuss my reasons with a competent statistician or researcher, and you obviously haven't. It appears that you regard reasons that you don't like as not being valid reasons. I can understand your interest in trying to do some sort of test in this area, but that eagerness doesn't justify conducting an invalid test and then publishing the results as you have promised to do.
Jon Risch is a qualified electrical engineer who has presented engineering papers and is familiar with the requirements for testing.
I have a postgraduate degree in health and safety and was required to conduct research as part of the requirements for that degree. I had that research published in a peer reviewed journal and presented it at an optometrical science conference. I do not regard myself as a qualified researcher - I don't have a sufficiently strong statistical background nor am I sufficiently expert in test design. I had to submit a design for my research which was critiqued, and I was subject to supervision throughout the process. I have not conducted research independently and would not regard myself as qualified to do so. That doesn't mean that I know nothing about what constitutes reliable test data.
If you have 2 people with more experience than yourself telling you that your test design is fatally flawed, you need to seriously consider that advice. If you aren't experienced in test design and conduct, you should discuss the criticisms you have been given with someone who is.
This is not a matter of 2 people with different views to you saying that you are wrong in what you believe about what people hear. This is a matter of people telling you that you cannot gather reliable test support one way or another using your process. I think I can speak for Jon as well as myself when I say that both of us would welcome good, solid test results no matter what the outcome of the test turned out to be. Reliable results help audiophiles make better choices and help designers and manufacturers make gear better in the future, no matter what the results are. I think Jon would welcome good, reliable research and I know I would.
What we are saying is that your approach isn't good, reliable research. What I will personally add to that is that the outcomes of fatally flawed research do more harm than good. They hang around and are quoted without understanding, and are often accepted as fact when they aren't. They get in the way and muddy discussions and make it much harder for interested people to accept reliable results at a later date if those results are at variance with the flawed ones.
I don't believe you want to contribute to more confusion on this topic, but that is exactly what you will do if you use a flawed process and publish the results here or elsewhere. If you are going to publish results, you simply have a major obligation to ensure that those results are genuinely meaningful. You can't fulfil that obligation if you don't understand the requirements of a genuinely meaningful test.
Once again, I can only repeat: Get advice from someone sufficiently qualified in statistics and research on how to construct this test before you start. Then conduct it in a way that will generate meaningful data, and that includes getting a large enough group of subjects for your sample. If you're not prepared to do that, simply don't try the research. The results will have no value whatsoever and may do far more harm than good.
Jon Risch