Hi everyone,
I agree with Oscar. There is so much we could do with Music Information
Retrieval techniques if we had access to the artist and track names: we could
analyse the audio (tempo, chords, melody, timbre, etc.), the scores, the lyrics,
the artists' connections, and much more. There is a growing community working on
these topics, and attempting to do music recommendation without any contextual
and/or content information other than the genres (which is a limited approach)
is simply ignoring this whole branch of research.
The release of this large music dataset is such a great step forward (thank you
for this), so why stop there when all we need is artist/song names (since we
can get the audio files and/or audio features and contextual metadata from other
resources)?
It seems to me that the de-anonymisation problem could be easily avoided by
asking the contestants to register and carefully checking their identity and
affiliation (and, if I am not mistaken, that's what you do for the R1, R2 and R3
datasets http://webscope.sandbox.yahoo.com/catalog.php?datatype=r). Researchers
are not interested in revealing users' musical guilty pleasures. The goal is to
advance the field of music recommendation.
You would get many more participants, original and varied solutions to this
recommendation problem and, I bet, a lot of support and publicity from the MIR
community, judging by the buzz (on Twitter, for instance) that the non-release
of these details has created.
So would you consider changing your mind and releasing that data?
Best wishes,
Amelie Anglade
--- In kddcup2011@yahoogroups.com, Oscar Celma <oscar.celma@...> wrote:
>
> Hi Markus,
>
> thanks for your prompt reply!
>
> Still... I don't get it! :-)
> Let me explain a little bit more, then.
>
> It seems that these other Y! datasets already contain, at least, the
> artist names:
> http://webscope.sandbox.yahoo.com/catalog.php?datatype=r (see datasets
> R1, R2 or R3)
>
> Also, anyone can get them (Yahoo! sends a couple of nifty DVDs with all
> this data).
>
> As you can see, there are already some people merging Y! datasets with
> other ones:
> http://labrosa.ee.columbia.edu/millionsong/pages/tasks-demos#yahoodata
> and here's the code:
>
> https://github.com/tb2332/MSongsDB/blob/master/Tasks_Demos/YahooRatings/README.txt
>
>
> Now, regarding users' de-anonymization (that I clearly understand, and
> respect Yahoo!'s position on that), it's not clear at all where the
> "users" of this dataset came from. Is it from the old Yahoo! Launchcast?
> Is it from Yahoo! Music ( http://new.music.yahoo.com/ )? Or do the
> users come from the old "Yahoo! Music Unlimited"? (You don't have to
> answer this, BTW :-) ).
> Furthermore, can anyone scrape my Yahoo! profile page (if there is one),
> get all the ratings I made back then (when the dataset was created),
> and then use this information to de-anonymize me?
> To me, it doesn't seem straightforward to de-anonymize these two datasets.
> Also, nowadays I'm not sure if people really care if they're
> de-anonymized or not (e.g. last.fm scrobbling). But that's another
> story, I guess...
>
>
> Of course, all the *user* information must be anonymized in the provided
> datasets, so no gender, age, or geolocation information should be provided.
> But not releasing artist or track names to avoid de-anonymization? I
> still think that's too much "paranoia", so to speak.
>
> In terms of research, I can clearly see *lots* of benefits if only the
> artist name and song name were available.
> All in all, I just think these two datasets do not contain the necessary
> information to provide any meaningful (music) recommendations.
>
> Cheers,
>
> Oscar Celma
>
> On 02/17/2011 12:47 AM, Markus Weimer wrote:
> > Hi Oscar,
> >
> >>> the artist, track and album names are not part of the data set. Thus, it
> >>> is impossible to add information from the sources you mention.
> >> So, is there any special reason for that? I didn't expect the dataset to
> >> contain any extra or fancy metadata. But, come on, containing neither
> >> artist nor song names! I don't get it...
> > I agree that having this data is valuable for building recommender systems
> > in this domain. Releasing it has been a strong consideration during the
> > preparation of the data set.
> >
> > However, releasing this data would also open the door to the kind of
> > de-anonymization techniques that have been applied to similar data sets
> > in the past. Yahoo! values the privacy promises it gives to its users
> > very highly. Therefore, we are unable to release data that is vulnerable
> > to known attacks.
> >
> > Take care,
> >
> > Markus
> >
>