A Broad Definition of "Personal Information" is Needed in the Upcoming U.S. Federal Privacy Regulations

Following the rollout of the EU’s General Data Protection Regulation (GDPR) and a string of high-visibility privacy scandals at U.S. companies (like Equifax, Facebook, and Google), Federal regulators are finally getting serious about privacy.

The National Institute of Standards and Technology (NIST) is convening a series of workshops to develop a new Privacy Framework, which the body envisions will be a voluntary set of standards for assessing organizations’ privacy risks. This isn’t regulation and it won’t be mandatory, but it will create a benchmark against which companies’ data handling practices can be judged. And voluntary standards can become de facto mandates: if a company in the midst of a privacy PR disaster promises to do better by adhering to NIST standards, they can be held accountable by the FTC (under their “deceptive practices” enforcement authority) if they fail to do so.

Simultaneously, the National Telecommunications and Information Administration (NTIA) is developing, on behalf of the Department of Commerce, an “approach” to consumer privacy that could eventually inform privacy rules made by Federal agencies.

If you’re interested in all of this and want to weigh in during the policymaking process, now may be your best chance. Until November 9th, the NTIA is seeking public comment on a draft set of high-level goals and intended outcomes for the upcoming policy. You can write whatever you want, but effective public comments are brief, focused on a specific recommendation or point of clarification, and introduce relevant technical facts to make a cogent argument.

Below I’ve reproduced a comment I submitted in response to the NTIA’s request, arguing that the way they use the term “personal information” is too vague and that it must be defined more broadly than it has been in other U.S. data protection laws.

The Definition of “Personal Information” in Federal Privacy Standards Must Include Metadata

My name is Ben Kaiser and I am a graduate researcher in the Center for Information Technology Policy at Princeton University. I am responding to the NTIA’s request for comments on developing the Administration’s approach to consumer privacy.

The document outlining the intended privacy outcomes and desired high-level goals for the forthcoming policy uses the terms “personal data” and “personal information” synonymously and vaguely (I will use “information” going forward). How exactly this notion is defined will greatly impact the ability of the policy to achieve the desired outcomes; in particular, an excessively narrow definition will limit users’ control over their personal information (Outcome #2) and inhibit accurate risk modeling (Goal #4) by leaving out information that directly affects users’ privacy risks.

I am writing to recommend a broad definition of personal information that includes not only data objects that inform about a person but also metadata, which is information describing an informative data object. The distinction is clear when considering communication data: the contents of the communication are the data while the circumstances of the communication are the metadata. For a telephone call, the phone numbers of the participants, time, and duration are all metadata. For Internet traffic, packet-level metadata includes source and destination IP addresses while application-level metadata includes website titles and keywords indexed for search. Smartphones generate timestamped location metadata objects called Cell Site Location Information (CSLI) when they connect to cell towers.
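To make the data/metadata distinction concrete, here is a minimal sketch of how a single phone call might be modeled. The class and field names (CallRecord, CallMetadata, strip_content) are hypothetical, chosen only for illustration, and do not reflect how any real carrier stores call records.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallMetadata:
    """Circumstances of a call: everything except what was said."""
    caller_number: str
    callee_number: str
    start_time: datetime
    duration_seconds: int


@dataclass(frozen=True)
class CallRecord:
    """A call split into its data (content) and its metadata."""
    content: bytes          # the audio itself: the "data"
    metadata: CallMetadata  # who, when, and for how long: the "metadata"


def strip_content(record: CallRecord) -> CallMetadata:
    """Return only the metadata, discarding the content of the call."""
    return record.metadata
```

Even with the content discarded, the remaining record still says who called whom, when, and for how long.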

Many existing U.S. legal definitions of personal information, particularly those in state data breach laws, do not include metadata^[1]. They explicitly scope the definition to what is typically called Personally Identifiable Information, or PII. This covers users’ real names or account usernames, Social Security numbers or other unique identifiers, credit card numbers, medical information, and a few other specific categories. This information is necessary to protect, but it is not sufficient for protecting user privacy. It is essential that comprehensive Federal policy include metadata in its definition of protected personal information for two reasons:

1. Through aggregation and correlation, metadata can reveal PII, even if it has been stripped of identifying information (“de-identified” or “anonymized”).
2. Metadata that describes users’ interactions with end services is visible not only to the end service provider, but also to facilitating intermediaries like telecom providers and Internet Service Providers (ISPs) and outside observers like advertising companies. This compounds the privacy risk that the unregulated collection of metadata poses to users.

Multiple independent research efforts have demonstrated that analysis of metadata can reveal sensitive personal information. Even if metadata has been anonymized by stripping PII, the user whom the metadata describes can still be identified through pattern analysis and correlation with public information sources. Personal information about those users can then be inferred.

For example, consider telephone call metadata. Using a de-identified data set of call records and public data sources like Google and Facebook, researchers were able to connect phone numbers to the names of their owners and their city of residence^[2]. They were also able to identify the nature of the relationship between call participants (e.g., siblings, parent/child, or boss/employee). All of this information is clearly PII worthy of privacy protections, but this research shows that if the metadata is not protected, neither is the personal information.

The same researchers were also able to determine when users were calling sensitive organizations such as mental health facilities, sexual and reproductive health facilities, financial services, and religious organizations. Patterns in this data can be further analyzed to reason about protected personal traits: calls to mental health providers followed by a call to an insurance company suggest a patient searching for a new therapist; long calls to family members followed by a short call to Planned Parenthood suggest pregnancy or other family planning concerns; a dearth of phone calls after sundown on Friday may indicate a Shabbat-observant Jew.
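The kind of inference described above can be sketched in a few lines. This is a toy illustration, not the method used in the cited study: the directory, the call log, the pseudonymous caller IDs, and the function name are all invented for the example, and the phone numbers are fictional.

```python
from collections import Counter

# Hypothetical public directory: phone number -> category of organization.
PUBLIC_DIRECTORY = {
    "555-0101": "mental health provider",
    "555-0102": "insurance company",
    "555-0103": "pizza restaurant",
}

# Hypothetical de-identified call log: (pseudonymous caller ID, number dialed).
CALL_LOG = [
    ("user-a", "555-0101"),
    ("user-a", "555-0101"),
    ("user-a", "555-0102"),
    ("user-b", "555-0103"),
]

# Categories treated as sensitive in this example.
SENSITIVE = {"mental health provider"}


def sensitive_contacts(call_log, directory, sensitive):
    """Count, per pseudonymous caller, calls to sensitive categories."""
    counts = {}
    for caller, dialed in call_log:
        category = directory.get(dialed)  # None if the number is unlisted
        if category in sensitive:
            counts.setdefault(caller, Counter())[category] += 1
    return counts


result = sensitive_contacts(CALL_LOG, PUBLIC_DIRECTORY, SENSITIVE)
```

No names or call contents appear in the log, yet a simple lookup against public information is enough to flag that "user-a" repeatedly called a mental health provider.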

Privacy policies that fail to protect this information fail to truly grant users control over what personal information their telephone provider can collect. This is because users likely do not understand that when they conduct business with end service providers like organizations that they call or websites that they visit, they are providing intermediary service providers with such intimately revealing data. In the case of call records, this is just the phone service provider, but for Internet traffic, metadata is shared far more broadly, increasing its privacy risks.

The additional parties privy to Web browsing metadata are called third-party web trackers. These include advertising companies, analytics services, social media companies, content providers like news services, and content hosting platforms like WordPress^[3]. A 2016 study found 81,000 unique third-party trackers in existence (although most users will only share data with a small subset of these trackers in their normal browsing habits)^[4].

In some cases these third-party trackers collect information that clearly falls within even the narrowest definition of personal information, such as email addresses. But they sometimes argue that they collect only metadata such as de-identified browsing histories and that it is therefore not PII^[5]. This is a clear demonstration of the pitfalls of a too-narrow definition of personal information: the claim is technically true and could allow tracking companies to claim compliance with privacy regulations, but the metadata in question can be easily de-anonymized to reveal PII^[5].

It is therefore essential that Federal privacy regulations do not define “personal information” so narrowly that these trackers can claim that the de-identified metadata they collect is exempt while maintaining a store of data that does in fact reveal personal user information. If policy protects PII itself but not metadata from which PII can be inferred, the policy cannot be said to be outcome-based, which is a stated goal of the NTIA’s approach. Furthermore, when asked, users have regularly expressed the desire to not be tracked across the Web^[6][7], so privacy regulations that do not allow them to fulfill that desire do not achieve the goal of allowing users to exercise reasonable control over their personal information. Thus, in order to achieve the goals and outcomes outlined in the request for comments, the definition of “personal information” used in Federal privacy policy must include metadata.

References

[1] National Conference of State Legislatures (2018). Security Breach Notification Laws. [Online]. Available: http://www.ncsl.org/research/telecommunications-and-information-technology/security-breach-notification-laws.aspx

[2] Mayer, J., Mutchler, P., & Mitchell, J. C. (2016). Evaluating the privacy properties of telephone metadata. Proceedings of the National Academy of Sciences, 113(20), 5536–5541. https://doi.org/10.1073/pnas.1508081113

[3] Mayer, J. R., & Mitchell, J. C. (2012). Third-Party Web Tracking: Policy and Technology. In 2012 IEEE Symposium on Security and Privacy (pp. 413–427). San Francisco, CA, USA: IEEE. https://doi.org/10.1109/SP.2012.47

[4] Englehardt, S., & Narayanan, A. (2016). Online Tracking: A 1-million-site Measurement and Analysis. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security – CCS’16 (pp. 1388–1401). Vienna, Austria: ACM Press. https://doi.org/10.1145/2976749.2978313

[5] Narayanan, A. (2011). There is no such thing as anonymous online tracking. [Online]. Available: http://cyberlaw.stanford.edu/blog/2011/07/there-no-such-thing-anonymous-online-tracking

[6] Turow, J., King, J., Hoofnagle, C. J., Bleakley, A., & Hennessy, M. (2009). Americans Reject Tailored Advertising and Three Activities that Enable It. Technical report. https://repository.upenn.edu/asc_papers/524/

[7] Purcell, K., Brenner, J., & Rainie, L. (2012). Search Engine Use. Technical report. http://www.pewinternet.org/2012/03/09/search-engine-use-2012/
