Today’s post is about Mayo Clinic’s Clinical Data Analytics Platform specifically and the case for using identified data more generally. I promise this is not the usual argument that you hear against deidentified data!
At the beginning of the year, Mayo Clinic announced the “Clinical Data Analytics Platform,” an ambitious venture to use federated learning and Mayo’s vast store of high-quality patient data to support drug development and the biopharma industry. Federated learning is only one part of a multi-pronged data privacy model; another core aspect is deidentifying patient data. If you want a deeper dive on what federated learning is and how it works, you can read my previous blog post here.
This week Mayo announced they’ve deidentified “structured portions of 10 million Mayo Clinic patient records, along with 2.5 million unstructured portions.” What a data set that is! But while Mayo’s effort to protect the privacy of their patients is laudable, I don’t think deidentifying patient data is the right approach, and I thought I’d take this opportunity to explain why.
As a very brief recap, health data in the US is afforded special legal protections under the Health Insurance Portability and Accountability Act, or HIPAA. However, HIPAA specifically creates an exemption for health data that has been stripped of information that could be used to identify a person, or “deidentified”. If an organization deidentifies health data in its custody, it can use or share that data without seeking consent from the data subjects themselves. The thinking behind this exemption is that deidentification theoretically minimizes privacy risks to individuals while still allowing their data to be used for good.
It is no longer necessary to deidentify data to preserve privacy
That thinking made sense when data had to be shared in order to learn from it. But with federated learning, sharing is no longer necessary! The compromise of deidentification is not needed in federated learning networks because the data never leaves the custodian’s servers, and third parties see only the insights derived from that data. In this way you can use identified data in a federated learning network while still keeping that data private, eliminating the need for deidentification.
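To make that mechanic concrete, here is a minimal, hypothetical sketch of federated averaging in Python. Three simulated “hospitals” each compute a model update on data that stays on their own machines, and only the updates are aggregated into a shared model. This is an illustration of the general technique, not a description of Mayo’s actual platform.

```python
import random

# Minimal sketch of federated averaging (hypothetical, not Mayo's actual
# implementation): each hospital computes a model update on its own data,
# and only the updates -- never the patient records -- leave the hospital.

def local_update(weights, local_data, lr=0.1):
    """One gradient step of a 1-D linear model on data that stays on-site."""
    grad = sum((weights * x - y) * x for x, y in local_data) / len(local_data)
    return weights - lr * grad

def federated_round(global_weights, hospitals):
    """Average per-hospital updates; the server never sees raw records."""
    updates = [local_update(global_weights, data) for data in hospitals]
    return sum(updates) / len(updates)

# Synthetic data at three "hospitals": y is roughly 2x, kept local per site.
random.seed(0)
hospitals = [
    [(x, 2 * x + random.gauss(0, 0.1)) for x in (random.random() for _ in range(50))]
    for _ in range(3)
]

w = 0.0
for _ in range(200):
    w = federated_round(w, hospitals)
print(round(w, 2))  # converges near the true slope of 2
```

The point of the sketch is the information flow: the only thing crossing the hospital boundary is `local_update`’s output, so the data itself can remain fully identified on-site.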
A number of additional safeguards could reinforce this approach. Mayo could set rules on the kinds of analysis that may be run and the data fields that may be used; obviously those rules should prohibit anything that would return patients’ identities or identifiable information. Furthermore, the insights generated by a federated learning network could themselves reveal sensitive information about patients. To combat this, Mayo should not allow any analysis that would return insights about a specific person or a small group of people.
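One simple way to implement that last safeguard is a query guard that refuses to release any aggregate computed over too few patients. The sketch below is hypothetical; the field allow-list and the cohort-size threshold are illustrative values, not Mayo policy.

```python
# Hypothetical query guard a data custodian might run before releasing any
# insight: only approved fields may be analyzed, and results computed over
# too few patients are blocked so no query can single out an individual or
# a small group.

MIN_COHORT_SIZE = 20  # illustrative threshold, not an actual Mayo policy
ALLOWED_FIELDS = {"age", "diagnosis_code", "lab_value"}  # illustrative allow-list

def release_insight(field, cohort, aggregate):
    """Return an aggregate over the cohort only if both safeguards pass."""
    if field not in ALLOWED_FIELDS:
        raise PermissionError(f"field {field!r} is not approved for analysis")
    if len(cohort) < MIN_COHORT_SIZE:
        raise PermissionError("cohort too small; result could identify patients")
    return aggregate(cohort)

# A query over 3 patients is refused; one over 100 is answered.
try:
    release_insight("lab_value", [5.1, 4.8, 6.0], lambda c: sum(c) / len(c))
except PermissionError as err:
    print("blocked:", err)
print(release_insight("lab_value", list(range(100)), lambda c: sum(c) / len(c)))
```

A real deployment would need more than a size check (small-group results can also be reconstructed from overlapping queries), but the guard shows where such rules would sit in the pipeline.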
Mayo would also need to change how their platform works operationally. In stark contrast to today, patients would need to know what their data is being used for and be able to opt in to or out of that use. There should also be strict oversight of who has access to patient data. Furthermore, where possible, differential privacy and other privacy-preserving technologies should be used. While these changes would be more of a burden for Mayo, they are certainly possible.
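For readers unfamiliar with differential privacy, the core idea can be shown in a few lines: before an aggregate is released, noise calibrated to the query’s sensitivity is added, so that any single patient’s record has only a bounded effect on the published number. The sketch below uses the classic Laplace mechanism; the cohort count and the epsilon value are made-up illustrative numbers.

```python
import math
import random

# Illustrative sketch of a differential-privacy release step (not Mayo's
# implementation): perturb a count with Laplace noise scaled to the query's
# sensitivity, bounding what the output reveals about any one patient.

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count perturbed by Laplace(0, sensitivity / epsilon) noise."""
    scale = sensitivity / epsilon
    # Inverse-CDF sampling of the Laplace distribution.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    noise = -scale * sign * math.log(1 - 2 * abs(u))
    return true_count + noise

random.seed(0)
# A hypothetical cohort count released under a privacy budget of epsilon = 1.
print(round(dp_count(128, epsilon=1.0), 1))
```

Smaller epsilon values mean more noise and stronger privacy; choosing that budget is a policy decision as much as a technical one.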
The case for using identified data
Thus far I have made a case for why using deidentified data in a federated learning network isn’t necessary. I think there is also a positive case for using identified data: by doing so we are able to share insights with the very people that helped find them.
Relying on deidentified data is inherently limiting in this regard. If deidentified data is used to create an algorithm that could improve patients’ lives, there is, by definition, no direct way of sharing that algorithm with the people whose data were used: because the training data is deidentified, there is no way of engaging those patients. If the algorithm did reach them eventually, it would have to do so incidentally, and patients would likely need to pay for access as well. This structural limitation is both ironic and unfortunate, as the patients who provided the data for an algorithm are likely those who would benefit most from it.
The most efficient and scalable way to distribute insights is to use identified data. That applies to all contexts where health data is analyzed. By using identified data the impact of new findings can be easily amplified by sharing them directly with the people whose data helped find those insights. Furthermore, by using identified data organizations can follow up with patients to ask for additional data if needed, which would improve the quality of insights.
Lastly, by using identified data organizations have an opportunity to forge a new kind of relationship with patients. Many people would happily contribute to research, and similarly many would contribute to more commercially oriented projects in the right contexts - perhaps only if they were paid. Asking patients to contribute and letting them engage on the terms that they feel most comfortable with is how you treat patients as true stakeholders. Moreover, if patients know their data is being used for specific and good purposes they could take pride in that and might provide more and higher quality data, which would produce even better insights and algorithms.
The use of deidentified data has undoubtedly furthered our understanding of human health. But deidentified data was given a special place in our healthcare system because of an assumption that is no longer true: that we must deidentify data to preserve privacy while making data useful. New technologies like federated learning disprove this assumption by enabling both privacy and data utility. In turn, this should cause us to rethink how we treat the deidentification of health data and whose interests current practices serve.
I think there is a compelling argument that deidentifying patient data in contexts where patient privacy can be protected by other means, such as federated learning, is the wrong thing to do. By deidentifying data the potential benefits that data subjects could get from the use of their data are limited. Using identified data in a federated learning network would provide more benefits to patients with the same level of risk. Doing so comes with higher operational and technical burdens for healthcare organizations, but I believe that is a tradeoff we should make.
For these reasons I think that more healthcare organizations, including but not limited to Mayo and their Clinical Data Analytics Platform, should use identified data.
Thanks for reading
If you found this newsletter valuable then you can click the button below to sign up for free.
If you’re an existing reader I would deeply appreciate it if you share this with people who would find it a valuable resource. You can also “like” this newsletter by clicking the heart just below this, which helps me get visibility on Substack.
My content is free, but if you would like to support me, you can do so on Patreon.