[Long take] Data self-sovereignty can't be achieved alone

Your data isn't just your data, and that undermines an individualistic notion of data self-sovereignty

Last week I discussed data self-sovereignty and the challenges in realizing it. To summarize my point of view, the reasons why we don’t have data self-sovereignty aren’t just technical; they are economic and social as well. Realizing data self-sovereignty will require new economic and social systems, in addition to technical systems. It is very likely that the system that does usher in data self-sovereignty will look very different from the status quo.

Self-sovereignty over your data is a nebulous thing, but the rallying cry for this notion is that “your data is your property,” and I take the goals of data self-sovereignty to be to “control, own, and profit from my data.” Longtime readers will know that I am sympathetic to this cause. However, this highly individualistic notion does not capture the richness and the social nature of data. As such, it will not achieve the goals that its adherents think it will, and would in fact undermine them.

Your data isn’t just your data

An individual’s data is never about just them. Data is social and there are always overlapping claims to “personal” data. Consider this newsletter, and you reading it. Substack, which I use as a newsletter service, will generate data showing that my newsletter was opened by you. That single data point has two claims to it: it is both data on who opened my newsletter and simultaneously it is an individual’s data on what newsletters they opened.

On one hand, that data could reveal something about your tastes (rest assured, all of my readers have the best taste), and you probably feel you have a claim to it. On the other hand I worked hard to create this newsletter and I feel I have at least some stake in the data about who opens it. Again, this is a single data point with at least two claims of “ownership” to it. So whose data is it?

The narrow idea of data self-sovereignty as an individual owning and controlling their data offers no solution to this conundrum. Under this paradigm you and I would both own, control, and profit from the same overlapping data! And dear reader, while I would never do anything with your data that you wouldn’t approve of, I don’t have the same confidence in others. Overlapping claims to data mean that one person’s right to control their data could infringe on another person’s rights. To see how this could be problematic consider some of the most personal data about a person: their genome.

Genomic data self-sovereignty

Your genome is central not only to who you are but also to who your family is. Armed with someone’s genome, you can also make inferences about their family members’ genomes. As such, whenever you, as an individual, share your genomic data you are also, in part, sharing your family’s genomic data. Many people have discovered this after their personal health information was entered into online genetics services by well-meaning relatives without their consent.

The fact that you share genomic data with your family and relatives means that your choices about your genetic data impact their privacy and well-being. Consider an extreme example: empowered to own and control your health data, you choose to sell your genome on a data marketplace of the future. Somehow it ends up in the hands of your relative’s prospective employer, who knows you two are related. Noticing that you have a gene highly correlated with a terribly expensive hereditary disease, the employer chooses not to hire your relative, as they are also likely to carry this gene and the employer would have to offer them insurance.

Your genomic data, and all of your other data, is more than just your data. Your data contains information about others, and thus actions you take with your data have consequences for others. The principle works both ways! Other people’s data contains information about you, and their actions have consequences for you. Again, the usual narrow idea of data self-sovereignty offers no answer to this.

Data about you that isn’t your data

Data about one person can be used - especially when combined with other data - to make inferences or predictions about other people, especially friends, family, and colleagues. That means that other people’s data can be used to learn information about me that my data does not contain.

For example, consider a healthy person whose individual medical records do not suggest they are at high risk of developing cancer, but whose family members have all had cancer. Taking the healthy individual’s data alone, you would conclude they are not at high risk of developing cancer. However, taking the data of their family into account, you would realize they are in fact at high risk.

That is sensitive information about you, but information that you do not control or own. There are some benefits to this too: when combined, other people’s data can help “fill in the gaps” of your data, thus making the collective dataset much richer and more useful. Again, a narrow idea of self-sovereignty as data ownership does not account for the positive or negative aspects of this phenomenon.

The social nature of data and markets

One of the common goals of the data self-sovereignty movement is to allow individuals to monetize their data, and oftentimes this desire manifests in marketplaces for data that pair individual sellers with buyers. There are two problems with this model: an individual’s data is not worth much alone and the social nature of data undermines the financial value of an individual’s data.

A single person’s information is practically useless for, say, training an algorithm to target ads or identify lung cancer. It is only when huge quantities of data are pooled together that algorithms can unearth the most valuable insights. As a result, an individual’s data doesn’t have much financial value alone, and only acquires large financial value when combined with the data of others.

There are also vast power asymmetries between data “buyers” and prospective “sellers.” For example, Google and Facebook are by far the dominant buyers of personal data used for online advertising. Just as a monopoly, an entity that controls the supply of a good, can artificially raise prices, highly concentrated buyers of a good (a monopsony, or an oligopsony when there are a few) can artificially lower prices. This dynamic is exacerbated because many data sellers will be competing against each other, which pushes prices down even further. An individual alone won’t get a good deal on their data from Google or Facebook.

Furthermore, the social nature of data means that one person’s actions drastically affect the value of other people’s data. To return to the example of genomic data, if I sell my genomic data to a third party then that third party also gains information on my family’s genomic data. In turn, the value of my family’s genomic data would be drastically reduced, as the third party already has some of it. Moreover, if my parents, relatives, and I all sold our genomic data to a third party then they would have little need for my brother’s data at all. If there is no coordination between data sellers then this dynamic could create a race to the bottom, where individuals hope to sell their data first, thus cashing out before a competing seller with overlapping data does.
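
To make the race to the bottom concrete, here is a rough toy sketch in Python. It is my own illustration under loud assumptions, not a real pricing or genetics model: the function names and the base price are hypothetical, the relatedness figures are the usual expected shares of DNA between relatives, and I assume (unrealistically) that each relative who has already sold independently reveals that share of your genome.

```python
# Toy model only. RELATEDNESS holds the standard expected share of DNA
# between relatives; treating each relative as independently "revealing"
# that share is a simplifying assumption, not a genetic result.
RELATEDNESS = {
    "parent": 0.5,
    "sibling": 0.5,
    "grandparent": 0.25,
    "cousin": 0.125,
}


def remaining_novelty(already_sold):
    """Fraction of a person's genomic information a buyer cannot yet infer.

    `already_sold` lists the relatives (named by their relationship to this
    person) whose genomes the buyer already holds.
    """
    novelty = 1.0
    for relation in already_sold:
        novelty *= 1.0 - RELATEDNESS[relation]
    return novelty


def price_offered(base_price, already_sold):
    """Hypothetical offer: the buyer pays only for information it lacks."""
    return base_price * remaining_novelty(already_sold)


if __name__ == "__main__":
    base = 100.0  # hypothetical price for a genome with no relatives on file
    print(price_offered(base, []))
    print(price_offered(base, ["sibling"]))
    print(price_offered(base, ["parent", "parent", "sibling"]))
```

In this sketch my brother is offered the full base price if he sells first, half of it once I have sold, and an eighth of it once both of our parents and I have sold. The exact numbers mean nothing; the point is that every uncoordinated sale lowers what the next relative can get, which is exactly the incentive to rush.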

Consider also that data becomes much richer when combined with other data. As in the example of an individual’s cancer risk above, other people’s data can be used to fill in gaps in my data, and combining them makes our collective data much richer, more useful for learning, and more financially valuable.

The social nature of data undermines the market-oriented goals of data self-sovereignty. To summarize: an individual’s data alone isn’t worth much, an individual seller has little bargaining power, and one person’s sale can undermine the value of others’ data. On the other hand, if individuals with overlapping data collectively make decisions about their combined data then their overlapping interests and claims to data are aligned, they have greater bargaining power, and their data becomes far richer, more useful, and thus more valuable. How do we achieve that?

Sovereignty through cooperation

The RadicalXChange community proposes a system of data cooperatives, sometimes also called data unions. These would be democratic organizations that have legal fiduciary duties to their members and act as intermediaries between individuals and those that want their data. Individuals could opt to join a data cooperative and pool their data together. That drastically improves the bargaining position of their members for a few reasons:

  • First, a cooperative provides the scale necessary to make data useful, which increases the collective financial value of the data.

  • Second, pooling data together would make the resulting datasets much richer, further improving their utility for algorithms and their financial value.

  • Third, data cooperatives provide a way for individuals with overlapping data claims to coordinate and ensure they don’t undercut each other.

Taking all of this into account, data cooperatives would be an effective counterweight to the power of data purchasers and help individuals get a fair deal. The similarity to labor unions is deliberate; data cooperatives are trying to be the labor union equivalent of the data economy.

Looking beyond the financial goals of individuals alone, data cooperatives would provide ways for individuals with overlapping claims to data to coordinate. Instead of many people with overlapping data acting alone, democratic processes in a data cooperative could be used to decide how the cooperative’s data is used. There will be some cases where the cooperative makes decisions that members don’t agree with, but on balance this is worth it. An uncoordinated group of people with overlapping data will almost certainly lead to some people’s data being used in a way they don’t agree with anyway. Further, data cooperatives give members a concrete way to exercise some control over information about them that is contained in other members’ data.

Beyond coordination and collective bargaining there are a range of other valuable services that a data cooperative could provide. For example:

  • Auditing and enforcing the terms of data use after they have been negotiated

  • Performing legal, accounting, and tax services related to the monetization of data

  • Providing quality certifications so data consumers know that the data they are using is high quality

  • Educating their members

Theoretically these are things that individuals could do themselves, but having an organization with expert staff dedicated to these things would take a huge burden off the shoulders of individuals.

To expand on one, the current model for how we agree to terms of use for data is woefully broken. Barely anyone bothers to read the terms and conditions, and reading those terms for all the services you interact with would take 76 working days! Moreover, it is very difficult to make informed decisions, as many threats to privacy result not from single pieces of data but from the aggregation of many different pieces of data. Without knowing beforehand how your data will be combined with other sources, it is impossible to properly weigh the costs and benefits. Lastly, it takes savvy vigilance to catch companies misleading you about how your data is being used. For example, bank apps ask for location data to help them determine what transactions might be fraudulent. But how can you be sure that is what they are using your data for? Do you know how often your location data is being recorded? Or where that data is stored and who has access to it?

There is still more to say on this topic, but I think you already get the point: having “control” of your data is not meaningful without reforms to how we agree to the terms of use for our data. In the absence of reforms, the problems I’ve highlighted above could be solved at scale by a data cooperative that has dedicated personnel with expertise.

Concluding thoughts

The goals of data self-sovereignty (controlling, owning, and profiting from your data) are noble ones. Unfortunately, data is not like property; it would be much simpler if that were the case! Data is social and full of overlapping claims, and this reality demands that we move past the popular individualistic notion of data ownership. What is required is a new set of institutions, like data cooperatives, that can coordinate individuals with overlapping claims, counterbalance concentrated market power, and provide essential services for the data economy.

Like any institution, data cooperatives, or something similar, would subsume individuals into a broader structure and restrict their autonomy to some degree. However, I hope that I have demonstrated that this is a necessary trade-off, as these institutions would enable people to have more meaningful control over, ownership of, and profit from their data than they would otherwise have had.

Thanks for reading, and if you found this valuable you can sign up below. If you’re an existing reader I would deeply appreciate it if you shared this with people who would find it a valuable resource. You can also “like” this newsletter by clicking the heart just below this, which helps me get visibility on Substack.

My content is free, but if you would like to support me, you can do so on Patreon here.

Feel free to reply to this email with comments, questions, or feedback. I host a blockchain and healthcare Telegram group that you can join. You can also find me on Twitter here.

[Image: Teacher unions striking in Chicago in the early 20th century.]