When it comes to AI, data is not always ‘the new oil’. Often, datafication in the service of AI development has dubious benefits and concrete risks.
Assumption: A principal enabler of AI development and deployment is data. Therefore, states wishing to increase their AI capacity should endeavour to collect, consolidate and distribute the greatest volume of relevant data.
Counterpoint: Not all applications of AI will necessarily benefit from the collection, centralization and distribution of data. Furthermore, any data collection and distribution activity carries serious risks. In some cases, those risks may outweigh the anticipated gains the data might yield for AI development.
One of the most embedded AI assumptions today is that it is impossible to become successful in the pursuit of AI without possessing a lot of data. National strategies often include specific measures to enhance the collection of data and to build centralized data infrastructures that serve to make datasets for AI development accessible to both public and private stakeholders. In one much-cited AI index, state readiness for using AI in public services is graded, in part, on national ‘data availability’ – a compound metric based on factors such as the amount of open data each government publishes and the level of mobile phone and internet use in the population (which is a proxy for how much digital data each citizen generates).
However, this assumption belies a much more complex reality. The value of data for AI varies by application and depends upon the user’s capacity for leveraging those data. Meanwhile, amassing and disseminating data can create risks and vulnerabilities that cannot necessarily be addressed through the privacy controls and security measures that states often promise as part of their AI campaigns.
The new oil?
It is often said that ‘data is the new oil’. But the relationship between the availability of data and the performance of AI is far more fraught than the relationship between an entity’s access to energy sources and its energy security. In reality, collecting data is much cheaper and easier than turning those data into an AI advantage.
In order to be useful for training most machine learning-based AI systems, data need to be well-curated and free from errors. Expunging errors from data is a non-trivial challenge. Just as crucially, data must be closely aligned with the purpose for which the AI system is being developed. For example, if a machine learning system is to be used for medical triage or diagnostics, it must be trained on vetted historical patient data that have the same statistical properties as those of the patient population it will be used on. It could not simply be trained on data from hospitals in another country; even using data from a hospital in a different area of the same country might degrade the system’s performance.
Those data that are sufficiently representative at the time of collection never remain so indefinitely. Many contexts in which AI is deployed will evolve gradually (or sometimes not so gradually), so that the data a system encounters in operation no longer match the data on which it was trained. This phenomenon, known as ‘distribution shift’, can significantly degrade an AI system’s performance over time, often in ways that are difficult to pre-empt.
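To make the mechanism concrete, the sketch below is a deliberately simplified illustration, with entirely synthetic data and invented thresholds rather than anything drawn from a real deployment. It trains a simple classifier on one population and then scores it on a population in which the relationship between the measured feature and the outcome has drifted; the reported accuracy falls accordingly.

```python
# A minimal, synthetic sketch of 'distribution shift' (illustrative only; the data,
# features and thresholds are invented and do not come from any real system).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_population(n, outcome_threshold, feature_mean):
    """Synthetic one-feature 'patients': the outcome is positive above a threshold."""
    feature = rng.normal(feature_mean, 1.5, n)
    outcome = (feature + rng.normal(0, 0.3, n) > outcome_threshold).astype(int)
    return feature.reshape(-1, 1), outcome

# Data available when the system was built.
X_train, y_train = make_population(5_000, outcome_threshold=5.0, feature_mean=5.0)
model = LogisticRegression().fit(X_train, y_train)

# Fresh data from the same population: performance looks acceptable.
X_same, y_same = make_population(5_000, outcome_threshold=5.0, feature_mean=5.0)
print(f"accuracy, same population:    {model.score(X_same, y_same):.2f}")

# Deployment data after the context has drifted: the same model degrades.
X_shift, y_shift = make_population(5_000, outcome_threshold=6.5, feature_mean=6.5)
print(f"accuracy, shifted population: {model.score(X_shift, y_shift):.2f}")
```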
Nor can datasets ever be, as one strategy claims optimistically, a ‘single platform of truth’. Datasets only reflect a numerical approximation of a reality that may be far more multi-faceted. A standardized test score, for example, is an incomplete indicator of a student’s total academic potential. Data also invariably harbour inconsistencies and biases. As the tasks that the AI is intended for grow in complexity, the challenge of producing and maintaining clean, representative and truthful data expands exponentially. A common refrain in the machine-learning community when discussing data is ‘garbage in, garbage out’. But the evidence suggests that for a sufficiently complex AI task, all data are ‘garbage’; they are naturally more limited, invariable, inflexible and biased than the reality that they purport to represent.
Even when an embarrassment of what might appear to be well-matched data is available for an AI system, that does not guarantee success. For example, despite being trained on unthinkably large volumes of data, large language models have proven adept at generating false or divisive written content. Nor does a representative dataset necessarily lead to AI that generates positive outcomes; a dataset of language from unmoderated internet forums, for example, may be representative of what people say in those spaces, but a chatbot trained on such speech would be undesirable.
The perils of datafication
Because an imagined application of AI can only be tested if data relating to that application are available, digitized and consolidated, the pursuit of AI is seen to require the mass ‘datafication’ of society. Or, as the scholars Ulises A. Mejias and Nick Couldry put it, ‘the transformation of human life into data through processes of quantification, and the generation of different kinds of value from data’. Though the potential benefits of such datafication are widely discussed in national AI strategies, its inherent risks receive less attention.
Every time information relating to people is turned into machine-readable data, it creates new privacy risks. Indeed, many of the characteristics that will make a dataset suitable for building AI will be particularly bad for the privacy of the people represented within those data. Machine learning-based AI often thrives when trained on massive, granular, multi-modal, labelled data that can reveal sensitive personal information even when individual datapoints are anonymized.
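As a stylized illustration (using invented records, not any real dataset), the short sketch below shows how removing names from granular records fails to anonymize them: a handful of quasi-identifiers, joined against an auxiliary list that still carries names, is enough to link individuals back to a sensitive label.

```python
# A minimal sketch of a linkage attack (hypothetical records, for illustration only).
import pandas as pd

# 'Anonymized' training data: direct identifiers removed, sensitive label kept.
health_records = pd.DataFrame({
    "postcode":   ["SW1A", "SW1A", "EC1V", "EC1V"],
    "birth_year": [1984,    1991,   1984,   1969],
    "sex":        ["F",     "M",    "F",    "M"],
    "diagnosis":  ["depression", "none", "none", "diabetes"],
})

# Auxiliary data that is publicly available or easily bought (e.g. an electoral
# roll or a leaked customer list), which still carries names.
public_list = pd.DataFrame({
    "name":       ["A. Example", "B. Sample"],
    "postcode":   ["SW1A",       "EC1V"],
    "birth_year": [1984,          1969],
    "sex":        ["F",           "M"],
})

# Joining on the quasi-identifiers links names back to sensitive diagnoses.
reidentified = public_list.merge(
    health_records, on=["postcode", "birth_year", "sex"], how="inner"
)
print(reidentified[["name", "diagnosis"]])
```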
These risks are multiplied whenever data are consolidated in the types of national strategic data repositories that many states are seeking to build as a foundation for their AI campaigns. Especially in countries lacking rigorous data protection regimes, such repositories could afford authorities and nefarious actors easy access to personal information that would previously have been siloed in separate streams. This creates novel possibilities for abuse.
Of course, these datasets and data infrastructures could be built and managed according to strict privacy principles, as many strategies note. However, privacy controls can always be revoked. A malevolent new regime that takes power could abuse data that were previously collected and used in tight compliance with privacy protections. A shifting landscape of criminal law can also have a direct effect on the risk that these datasets pose to the people represented within them. Data that reveal individuals to have engaged in activities that were once legal (such as seeking an abortion) may become problematic if those activities are suddenly made illegal.
Abuse by private actors is also a concern. Datasets that are made available to a diversity of stakeholders can become ‘runaway datasets’ that are so widely held, stored, distributed and reproduced that they cannot be recalled if they are discovered to be problematic. Such runaway datasets are already common, and their risks expand the longer they are accessible and as their scale and diversity increase. The longer a dataset is available, the greater the risk that it will end up being used for problematic or unproven applications. This has already been observed in the research sector. In one case, the US technology firm Microsoft took down a publicly available dataset of millions of images of more than 100,000 ‘celebrities’ (many of whom were journalists) after it was revealed that it had been used by a number of companies that build surveillance technologies, including IBM, Panasonic, Hitachi, SenseTime and Megvii.
Finally, datafication can increase the risk of cybercrime. Even closed data repositories that are only made available on a restricted basis can be hacked. Such attacks could make potentially sensitive data available for crimes such as identity theft or stalking. Large breaches could also potentially enable malign actors to ‘poison’ these datasets, so that any AI systems trained upon them could exhibit suboptimal or dangerous performance. The more that a dataset is disseminated, the higher the chance that it could be attacked.
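The sketch below illustrates the poisoning mechanism in miniature, using synthetic data and a deliberately crude attack: relabelling a share of one class in a shared training set is enough to measurably degrade a model trained on the tampered copy.

```python
# A minimal, synthetic sketch of label-flipping 'data poisoning' (illustrative only;
# the dataset and attack are invented for demonstration). An actor with write access
# to a shared training repository relabels much of one class, and a model trained on
# the tampered copy tends to overlook that class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model trained on the untampered repository.
clean_model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# The attack: flip 60% of the positive-class labels to 0 in the stored copy,
# e.g. to make records of one kind look unremarkable.
rng = np.random.default_rng(0)
positives = np.flatnonzero(y_train == 1)
flipped = rng.choice(positives, size=int(0.6 * len(positives)), replace=False)
y_poisoned = y_train.copy()
y_poisoned[flipped] = 0

poisoned_model = LogisticRegression(max_iter=1_000).fit(X_train, y_poisoned)

print(f"accuracy, trained on clean labels:    {clean_model.score(X_test, y_test):.2f}")
print(f"accuracy, trained on poisoned labels: {poisoned_model.score(X_test, y_test):.2f}")
```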
The datafication mindset
An emphasis on data availability for AI might stand in the way of the rigorous privacy protections that could prevent the kinds of harms described above. Many observers have noted that authoritarian regimes will have greater access to data for AI development than societies where the mass collection of personal information is subject to controls. While this observation has been helpful for illuminating the perils of mass datafication, there is a risk that it could serve arguments that undermine the push for better digital privacy. In a 2021 report comparing the AI capabilities of the US, the EU and China, one think-tank went so far as to argue that ‘[US] Congress should ensure any change to federal data privacy legislation does not limit data collection and use of AI’.
Datafication in the service of AI could also have profound secondary effects. Just as misinformed visions of the actual ‘intelligence’ of AI could cause governments to prioritize it over other areas of government investment (see previous chapter), an over-emphasis on data could privilege research efforts focused on large models (i.e. highly complex machine learning-based AI systems with many parameters). Unlike simpler tools, these large models have increasingly been shown to carry both societal and environmental harms.
By extension, the massive provision of data to industry could encourage a misplaced focus on data-driven approaches to problems that are rooted in societal issues. It is tempting to assume that any societal challenge can be solved with scientific exactitude by training a machine-learning model on that challenge. But datafying a particular problem might only succeed in giving rise to an inflexible machine-learning model that provides no true insight or predictive capacity in operational use – at worst, such efforts become a costly distraction from solutions that could have a much higher long-term probability of success.
Even if an AI system were to exhibit some positive performance in tackling a difficult challenge, that might not be enough to justify the risks associated with the necessary datafication. Consider, for example, AI efforts for suicide prevention, or for detecting welfare fraud; even if they were to create some efficiencies in tackling these noble causes, such initiatives require the collection and dissemination of highly sensitive data that could be breached or abused to harmful effect. This is not to say that reducing suicide or fraud is not a goal that states should pursue. Rather, if similar gains can be achieved by measures that do not involve the collection of data, those measures would be preferable to ones that do.
Nor can one necessarily expect the benefits of datafication to be distributed evenly among stakeholders. Mass datafication naturally privileges communities and sectors with the capacity and computational resources to process data – such as the tech sector and high-revenue industries such as finance. Meanwhile, mass datafication will yield few gains for those who do not have access to the necessary resources to protect themselves from the collection of information that can be used against their interests.
State AI strategies have yet to grapple fully with these concerns. It might therefore be helpful for states to assume that any datafication in the service of AI carries tangible risks, which must be weighed not only against the potential benefits of the resulting AI capabilities but also against the very real possibility that those capabilities will fail. In this analysis, some states may find that some of their proposed datafication measures simply do not justify the risk. As always, the buy-in of those groups who will be most affected by this datafication is a key criterion in such an assessment.