Paul has done a lot of work in the field of playlist evaluation.
I think that if you surveyed the researchers in the field, the "WTF test" would be considered fairly reasonable - especially for a quick-and-dirty evaluation. Can you point to any specific songs that he said were WTFs and you think aren't, or vice versa? If not, then it would appear to meet the objectivity criteria.
Using his own music collection might be slightly more suspect. Changing that might have flipped the outcome of iTunes vs EchoNest, but wouldn't have changed the real news here: Google does really, really badly.
I think you can argue about the individual songs, but the overall findings are sound - Google is poor, EchoNest is good, iTunes is good bar one song (which is suspicious). The methodology is a bit finger-in-the-air, but then so are the findings.