
First Time Ever Music Test

Posted: Mon Nov 29, 2010 9:07 am
by rogerwimmer
Happy Thanksgiving! I'm such a workaholic I am actually typing this to you on T-day itself, from a hospital. Long story . . . I'm fine, but I'm spending the night with somebody I love who's sick. Anyway, enough about me. Let's talk about YOUR take on my first ever auditorium music test. OK so I guess it's still about me.

1. It was the volume-knob type of test. Listeners turned the dial one way for what they liked and the other way for what they didn't. You know the drill, I'm guessing. Scoring was on a 0 to 100 scale. We tested 350 songs with two groups of respondents: one with 35 people, the other with 55. The company promised no fewer than 90 respondents in total, so they'll likely come back to make good on the discrepancy.

2. First area of concern. Only two songs look to have scored in the 90% range, and maybe another half dozen in the 80% area. A majority of them fell in the 50 to 69% range. The researcher hasn't turned over the official results yet, so I'm relying on what I observed while viewing the test live from another room. So, the way I'm reading these numbers: A majority of my library consists of "D and F" songs (were we to apply school-style grades to them). Is that how you would take the information?

3. Let's say I learn from this AMT that maybe 70 songs out of my whole library are received favorably. I know a lot of stations that would say, "Just play those 70 then." I really doubt my listeners would agree, even if those were the 70 consensus titles. So I'm trying to determine how I can use these scores to clean the crap out of my library and still have a playlist that won't send my AWTE (Average Weekly Time Exposed) into the toilet. My first thought was to convert my results to z-scores and consider any song that's average or better as a possible keeper, while being wary of titles below that average. My consultant will probably pull out his "nuclear" and "toxic" definitions and tell me that anything below a certain cutoff (say, anything below 60 on that 0-to-100 scale) is "toxic" and risky to play.

4. My consultant might also contend that with so many songs in that 50-69 zone, my entire format isn't viable. I don't want to rule that out, but we've been on the air for almost a decade with this niche format, have always been in the black financially, and our cume has doubled over the past year by sticking with this format. So I guess my question here is how to truly assess that "not viable" contention. Does an AMT provide enough of the right information to even have that discussion?

5. I saw a lot of discrepancies between the AMT and what we get back from our online panels. (I know you LOVE the online stuff!) I don't expect the two to correlate down to the decimal point, but songs that rise way above average online were often mediocre in the AMT. How concerned should I be? I'm inclined to trust the AMT more since we saw the respondents face to face, but I'm wary of doing away with the online folks entirely, since all that leaves me with is my lovely gut. Short of that, I'm a little inclined to scrap my current online panel and build a new one from scratch. Oddly enough, the online people had to answer more questions to get in than the AMT people did. For the AMT, people qualified if they were in my target demo and a P1 to my station. The online people not only have to identify who they listen to, their age, and so on, but they also rate various clips of hooks in the screening so we can see how that correlates to their radio preferences. Would you lean towards trusting one over the other, given what I shared?

6. The AMT sessions took place at 9:00 a.m. and Noon. The 9:00 a.m. group was a lot more "nit picky" and overall scored songs noticeably lower than the Noon group. Have you seen that before? I certainly expect each group to react to the songs in different ways, but the fact that one group simply panned our music a lot more than the other was perplexing. One of our staffers suggested that maybe people willing to show up at 9:00 a.m. on a Saturday have natural personality leanings that make them pickier than those who opted for the noon session.

I am terribly sorry I took so long to type this out. You are of course welcome to edit however you wish to get your points/answers across. Thanks so much for your time and for sharing that powerful brain of yours! (Yes, that was pandering maybe, but I meant it.) - Anonymous


Anon: Sorry to hear about your loved one in the hospital. I hope things are OK now. At least it gave you time to type for a while. My "powerful brain?" Thanks and I will alert my wife. On to your questions . . .

1. Type of test. You said the research company promised no fewer than 90 respondents in total, and you had 35 in one group and 55 in the second group. I'm confused. That is 90 respondents. Are you saying that you were supposed to have two groups of 90 each to test the same 350 songs? That doesn't make sense to me because you don't need that many respondents (N = 180).

You also said that the respondents used a "volume knob" to record their ratings for the songs. I assume the researcher explained how to use the dial and also included a few control questions to determine whether the respondents were using the knob correctly. If not, some respondents may have been confused about how the "volume knob" works.

2. First area of concern. If only two songs scored in the 90 range, and the majority fell in the 50-69 range, then something "don't be right." I would need more information, but since you tested the music on your radio station with respondents who are supposedly P1s (fans, who listen most often), there seems to be something wrong with the sample. I find it hard to believe that your P1s would rate so many songs so poorly. That doesn't make sense.

3. Number of Songs. If you find that only about 70 songs in your music test are received favorably, then something is wrong with the sample and/or the measurement system (the knob, the songs you tested, the hooks, and other things).

You said your first thought was to convert your results to z-scores. You should always convert music test scores to z-scores, and this is especially true in your situation since you are using a "sample pooling" approach where the results from two different samples are combined. You should not look at the raw data for your test. That isn't a valid scientific approach.
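If you (or your researcher) want to see the mechanics, here is a rough sketch in Python of what "convert within each group, then pool" looks like. The song titles and scores are made up, and averaging the two groups' z-scores is just one simple way to pool; a researcher might instead weight by group size (35 vs. 55).

# Rough sketch only (not your researcher's actual method): convert each
# group's raw 0-100 scores to z-scores within its own group, then pool.
# Song titles and scores below are made up for illustration.
import statistics

def to_z_scores(scores):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(s - mean) / sd for s in scores]

group_1 = {"Song A": 78, "Song B": 55, "Song C": 91, "Song D": 62}  # 9:00 a.m. group
group_2 = {"Song A": 82, "Song B": 60, "Song C": 88, "Song D": 58}  # Noon group

z1 = dict(zip(group_1, to_z_scores(list(group_1.values()))))
z2 = dict(zip(group_2, to_z_scores(list(group_2.values()))))

# One simple way to pool: average each song's z-score across the groups.
pooled = {song: (z1[song] + z2[song]) / 2 for song in group_1}

for song, z in sorted(pooled.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{song}: z = {z:+.2f}")

Notice that a song's pooled z-score tells you how it performed relative to the rest of your library, not against an arbitrary 60 or 70 cutoff on the raw scale.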

4. Viable Format? You said that your consultant might contend that with so many songs in that 50-69 zone, your entire format isn't viable. I think it's a bit early to jump to that conclusion. You need more information about your sample and the entire testing procedure to determine why so many songs tested so poorly. As I mentioned earlier, if the respondents in the two groups were P1s to your radio station, then you should not have had so many low-testing songs. And "no," a music test, even if conducted perfectly, does not provide enough information to determine whether a radio station's format is viable.

5. AMT and Online Comparison. You are correct in saying that I don't have a lot of faith in online testing, but you can't compare the two procedures unless you convert all the data from both methods to z-scores.
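To give you an idea of what that comparison might look like once both sets of scores are on the same footing, here is a small sketch (again in Python, with invented numbers). A low or negative correlation between the two sets of z-scores would confirm the kind of disagreement you describe.

# Sketch only: put the AMT and online scores for the same songs on a
# common z-score scale, then look at how well the two methods agree.
# All numbers are hypothetical. Requires Python 3.10+ for statistics.correlation.
import statistics

def to_z_scores(scores):
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(s - mean) / sd for s in scores]

songs = ["Song A", "Song B", "Song C", "Song D", "Song E"]
amt_raw = [85, 62, 55, 90, 48]      # hypothetical AMT scores (0-100)
online_raw = [70, 88, 50, 92, 81]   # hypothetical online panel scores

amt_z = to_z_scores(amt_raw)
online_z = to_z_scores(online_raw)

# Pearson correlation between the two sets of z-scores.
r = statistics.correlation(amt_z, online_z)
print(f"AMT vs. online agreement: r = {r:.2f}")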

Would I lean towards trusting one over the other, given what you shared? With the information I have, the only thing I can say is that you should "trust" the information that seems most reliable and valid. You know your product and your listeners. Without any other cross-validating information, which method makes the most sense to you? The research term for that is "logical consistency."

6. AMT Sessions. Oh, I'm sure there is a possibility that the respondents who showed up for the 9:00 a.m. session may be a little different from those who showed up at Noon, but I think saying that the early group was a lot more "nit picky" may be stretching it a bit.

However, there is one way to test the theory that there are differences between the groups — run each group's scores independently and compare them. Ask your researcher to conduct something like a t-test to determine if the groups are similar or different. Then you'll know for sure (even though the differences will be controlled when the data are converted to z-scores).
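For what it's worth, here is what that comparison might look like in Python using SciPy's independent-samples t-test. The scores below are invented, and a real analysis would use the individual respondents' ratings rather than a handful of song averages.

# Sketch only: compare the 9:00 a.m. and Noon groups' scores with an
# independent-samples t-test. All numbers below are made up.
from scipy import stats

nine_am_scores = [55, 60, 48, 72, 65, 58, 70, 62, 51, 67]  # hypothetical
noon_scores    = [68, 75, 66, 80, 71, 69, 77, 74, 63, 79]  # hypothetical

t_stat, p_value = stats.ttest_ind(nine_am_scores, noon_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value (say, below .05) suggests the two groups really did
# score the music differently, beyond what chance alone would explain.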

I hope that helps. I didn't go into a lot of detail in my answers because if I did, I'd still be typing on New Year's Day.

