Tuesday, May 24, 2011

Traitwise, PGP, data standards and the OSI model

I spent the day with Jason Bobe of the Personal Genome Project (PGP). PGP is a public effort to sequence a large number of genomes. Traitwise has formed a partnership with them to serve as part of the phenotyping solution.

Jason's role at PGP puts him at the center of a growing number of people who have competing desires to control or provide services around personal health (or "quantified self," as some are calling it). Jason is therefore in a unique position to promote and evangelize open standards.

An interesting part of the conversation focused on the need for open standards for public health and trait-related data. I used the OSI networking model as an example of how conceptual standards can be incredibly useful for increasing innovation. One of the best things about the OSI model is that it didn't attempt to define any actual standards; rather, it serves as a conceptual framework that defines how standards interface with each other. From that model a number of interoperable standards have emerged at all levels, and without it we certainly wouldn't live in the networked world we live in.

One aspect of the human data problem is privacy. Privacy comes in "levels of data privacy," and it occurred to us that these levels are somewhat analogous to the layers of the OSI framework.

Layer 1 - Raw identifiable data (post-privacy)
Layer 2 - Anonymized raw data (HIPAA compliant)
Layer 3 - Algorithmically open data (sand-boxed, machine readable)
Layer 4 - Aggregated data

Layer 1 data is non-private. It requires a consent certificate of some sort to go along with the data.

Layer 2 data is considered by many scientists to be adequately protected in many research circumstances, and indeed HIPAA seems satisfied with this. However, for open projects such as Traitwise I personally don't think it suffices, as de-anonymization has been demonstrated in many circumstances, such as the infamous AOL search records scandal.

Layer 3 is perhaps the most interesting and least discussed. A Layer 3 system would allow an algorithm written by a researcher to run against raw data, but within a sandbox that only allows the aggregated results to emerge. I haven't put a huge amount of thought into this, but it seems plausible to write an API that could enforce such constraints. Even without such an API, a human code reviewer could accomplish the same thing.
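
To make the Layer 3 idea a bit more concrete, here is a minimal Python sketch of what such an API might look like. Everything in it is hypothetical: the names (run_in_sandbox, CohortTooSmall, MIN_COHORT_SIZE), the minimum cohort size of 5, and the sample records are assumptions made for illustration, not anything Traitwise or PGP has built. The researcher supplies a selector and an extractor that run against raw records, but the only thing that comes back is a single number from a short whitelist of reducers, and only if enough people are in the cohort.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List, Mapping, Union

Record = Mapping[str, Union[str, int, float]]

# Hypothetical policy value: smallest cohort whose aggregate may be released.
MIN_COHORT_SIZE = 5

# Only these reducers may produce output; each collapses many values into one number.
ALLOWED_REDUCERS: Dict[str, Callable[[List[float]], float]] = {
    "count": len,
    "mean": mean,
}


class CohortTooSmall(Exception):
    """Raised when an aggregate would describe too few people to release."""


def run_in_sandbox(
    records: Iterable[Record],
    selector: Callable[[Record], bool],
    extractor: Callable[[Record], float],
    reducer_name: str,
    min_cohort: int = MIN_COHORT_SIZE,
) -> float:
    """Run researcher-supplied logic on raw records and return one aggregate."""
    if reducer_name not in ALLOWED_REDUCERS:
        raise ValueError(f"reducer {reducer_name!r} is not allowed")

    # The researcher's code sees raw records here, inside the boundary.
    values = [float(extractor(r)) for r in records if selector(r)]

    # Refuse to release aggregates over very small groups.
    if len(values) < min_cohort:
        raise CohortTooSmall(f"cohort of {len(values)} is below {min_cohort}")

    # Only a single number ever crosses the boundary.
    return float(ALLOWED_REDUCERS[reducer_name](values))


# Made-up sample data, for illustration only.
RECORDS: List[Record] = [
    {"id": "p1", "age": 30, "smoker": 1},
    {"id": "p2", "age": 40, "smoker": 1},
    {"id": "p3", "age": 50, "smoker": 1},
    {"id": "p4", "age": 60, "smoker": 1},
    {"id": "p5", "age": 20, "smoker": 1},
    {"id": "p6", "age": 40, "smoker": 1},
    {"id": "p7", "age": 35, "smoker": 0},
    {"id": "p8", "age": 45, "smoker": 0},
]

if __name__ == "__main__":
    mean_age = run_in_sandbox(
        RECORDS, lambda r: r["smoker"] == 1, lambda r: r["age"], "mean"
    )
    print(mean_age)
```

Note that this only illustrates the shape of the interface. Plain Python does not isolate the researcher's functions, so a selector or extractor could still leak records through side effects such as writing to a file; a real system would need process-level isolation, or the human code reviewer mentioned above.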

Layer 4 is what Traitwise and others are currently focusing on -- aggregated data that is not deanonymizable.

This is just one aspect of the data and communications problem, but it is an important one and it was fun talking to Jason about it today.

1 comment:

Sarah Gray said...

That is super cool and exciting! Hey, did you read the piece in last week's New Yorker called "The Secret Sharer"? http://nyr.kr/mmWEjz

It talks about ThinThread, an application written (and then later deconstructed & abused, but that's politics and politics of software) to run algorithms on pools of data in order to identify suspicious patterns -- but the data was anonymized. Made me think of item #3 on your list.