Paul Ohm, who has highlighted the problems with supposedly anonymized data, has published a forceful commentary on Netflix’s recent announcement of its new contest. Ohm writes, in part:
Although I give Netflix a pass for its past privacy breach, I am astonished to learn from the New York Times that the company plans a second act:
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.
Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87%, according to Latanya Sweeney’s famous study). True, Netflix plans to release age, not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of “information entropy”: even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
I have no doubt that researchers will be able to use the techniques of Narayanan and Shmatikov, together with databases revealing sex, ZIP code, and age, to tie many people directly to these supposedly anonymized new records.
Because of this, if it releases the data, Netflix might be breaking the law. The Video Privacy Protection Act (VPPA), 18 USC 2710, prohibits a “video tape service provider” (a broadly defined term) from revealing “personally identifiable information” about its customers. Aggrieved customers can sue providers under the VPPA and courts can order “not less than $2500” in damages for each violation. If somebody brings a class action lawsuit under this statute, Netflix might face millions of dollars in damages.
Additionally, the FTC might decide to fine Netflix for violating its privacy policy as an unfair business practice.
Either a lawsuit under the VPPA or an FTC investigation would turn, in large part, on one sentence in Netflix’s privacy policy: “We may also disclose and otherwise use, on an anonymous basis, movie ratings, consumption habits, commentary, reviews and other non-personal information about customers.” If sued or investigated, Netflix will surely argue that its acts are immunized by the policy, because the data is disclosed “on an anonymous basis.” While this argument might have carried the day in 2006, before Narayanan and Shmatikov conducted their study, the argument is much weaker in 2009, now that Netflix has many reasons to know better, including, in part, my paper and the publicity surrounding it. A weak argument is made even weaker if Netflix includes the kind of data (ZIP code, age, and gender) that we have known for over a decade fails to anonymize.
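The "simple arithmetic" Ohm refers to can be sketched roughly as follows. This is only an illustration, not Ohm's own calculation: the ZIP code populations, the 80-year age range, and the assumption that residents are spread evenly across ages and genders are all hypothetical simplifications, and the function names are invented for this example.

```python
import math


def anonymity_set_size(zip_population, age_range_years=80, genders=2):
    """Expected number of people who share one (ZIP, age, gender) combination.

    Assumes, purely for illustration, that residents of the ZIP code are
    spread uniformly over `age_range_years` ages and `genders` genders.
    """
    return zip_population / (age_range_years * genders)


def bits_revealed(num_equally_likely_values):
    """Bits of identifying information carried by an attribute that takes
    `num_equally_likely_values` equally likely values."""
    return math.log2(num_equally_likely_values)


if __name__ == "__main__":
    # Hypothetical ZIP code populations, chosen only to show the scale.
    for population in (1_000, 10_000, 50_000):
        size = anonymity_set_size(population)
        print(f"ZIP population {population}: about {size:.1f} people per (age, gender) cell")

    # Gender and age together, under the same uniformity assumption.
    total_bits = bits_revealed(2) + bits_revealed(80)
    print(f"gender + age reveal roughly {total_bits:.2f} bits before ZIP code is considered")
```

Under these assumptions, even a ZIP code with 50,000 residents leaves only a few hundred people in each age-and-gender cell, and smaller ZIP codes leave far fewer. Real populations are not uniform, so actual cells can be much smaller or larger than this estimate.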