Until recently, AI has been developed in the open. Now, risk aversion and commercial imperatives are reversing this trend. But use of systems built solely behind closed corporate doors would bring unwelcome centralization of control, and could result in unequal distribution of AI’s benefits. Open-source AI must be allowed to thrive.
The meteoric rise of artificial intelligence (AI) in political consciousness has come alongside a major shift in how the technology is being built. Once a technology developed along open-source principles, AI is increasingly hidden away on grounds of safety, intellectual property rights or defence of trade secrets. This shift must be counteracted. From local councils to libraries, schools to universities, the power of AI-enabled technologies to transform our lives and the services we depend on is enormous.
Limiting AI’s development to only the most powerful corporations would be a major setback in ensuring its benefits are felt as equitably as possible. Proportional regulation, protections for open-source data, and public sector skills and investment to secure a place for public AI alongside private AI are all necessary.
Open beginnings
Modern AI development is founded on the principle of openness. For decades, AI was primarily a research discipline, existing in both academia and industry. AI research teams in companies openly shared innovations. Even today, TensorFlow and PyTorch, the two dominant machine-learning frameworks, built by Google and Facebook (now Meta) respectively, remain shared as open-source code. Similarly, the Transformer architecture, a novel and now widely used approach to deep learning, is an open-source innovation shared by Google Brain engineers. Such methods, combined with an open publishing culture embracing the use of preprint archives such as arXiv, have been crucial in allowing researchers to share ideas.
As recently as 2017, open-source approaches still seemed to be in the ascendant. Nick Bostrom, one of the thinkers who has provided an ideological underpinning for today’s AI development, observed that ‘leading AI developers operate with a high degree of openness’. Analysing the strategic implications of openness – which he understood to mean the sharing of public-domain source code, scientific discoveries and AI platforms – Bostrom concluded that the short- and mid-term effects of openness would most probably be net positive. That same year, OpenAI launched as a non-profit initiative, with openness explicit in its brand identity.
Closing up
Seven years later, things look different. Today, OpenAI is one of several commercial giants offering closed and non-transparent AI systems in an increasingly concentrated market. A company manifesto from February 2023 states that ‘we were wrong in our original thinking about openness’ and frames the new approach as being to ‘safely share access’. In early 2024, the French AI startup Mistral followed the same trajectory. Although it was launched in mid-2023 as an open-source AI lab, the company decided not to release its latest model, Mistral Large, openly. Critics have questioned whether such shifts are as much about safety as they are about technology companies protecting their market value, but safety is the touchstone for many arguing against openness in AI research. A recent op-ed in the Financial Times compared machine learning with pathogen research, a field premised on mitigating risk at any cost; the article was one of a number of interventions highlighting the risks of working on AI in the open. A widely shared white paper written by DeepMind researchers offers a taxonomy of risks related to the operation of language models; these risks include discrimination, the spread of hate speech, misinformation, bias and exclusion. The underlying argument here is that open systems lack control mechanisms to mitigate risk.
GPT-2, the first generative language model to find the limelight, was not immediately open-sourced. OpenAI argued that its decision to develop GPT-2 using a more closed approach was based on ethical considerations, and on the potential risk that the model would be used to create ‘deceptive, biased, or abusive language at scale’. With the deployment of the next generations of its model, GPT-3 and GPT-4, OpenAI has moved further still from openness and does not even provide basic documentation of these systems. Google is similarly not sharing its innovations, and gated API access is becoming the standard for AI services made available to the public. In 2023, Meta launched its Llama model (followed by Llama 2 later that year) using a hybrid approach: the model weights were made openly available, but with licence terms limiting their reuse.
The driving forces behind limiting access to AI for reasons of safety – regardless of the motivation – are often the big AI industry players. Their requests to regulate and license AI development may be presented as solutions to AI risks, but critics argue that this limits market competition. The most common narrative equates closed AI with responsible AI, and open models with AI risk. This is a line of reasoning repeatedly presented by industry, and picked up by the US government in its negotiation of voluntary commitments from AI companies. Europe has taken a different path with its new AI Act. This regulation focuses on mitigating the risks posed by high-risk AI systems, and includes general-purpose AI models – deemed by many to be riskier because of their wide range of applications – in its scope. Yet carve-outs from the obligations placed on AI developers have been included for open-source AI, with policymakers recognizing the benefits of open development and deployment of AI.
Nonetheless, while the regulatory trend is not yet settled, the dominant narratives supporting closed approaches are a cause for concern. Above all, they fail to account for a systemic risk that openness can mitigate: the risk of centralized control of powerful technologies and the monopolization of the beneficial outcomes of AI systems. This is a risk that Bostrom noted in his paper. A report from the UK Competition and Markets Authority, published in April 2024, outlines risks to competition in relation to AI foundation models. It is telling that the document comes from a market regulator, rather than from an AI policy institute. Such issues are largely ignored in policy debates. For example, the DeepMind risk taxonomy mentions the risk of concentration of power only indirectly. Yet the concentration of power is, in fact, a fundamental AI risk that can only be mitigated by measures to decentralize and democratize access to, and use of, these technologies.
The resilience of open-source systems
The move towards closed models is far from a fait accompli. In July 2022, the BigScience consortium released BLOOM, a fully open large language model comparable to GPT-3. BLOOM has open-source code, transparent training datasets, and a collaborative production model that involved over 1,000 researchers. In the same month, Stability AI released Stable Diffusion, a fully open text-to-image model that could produce images similar to those generated by proprietary models such as Midjourney or DALL-E. While BLOOM and Stable Diffusion have attracted mainstream attention, many other open-source solutions have been developed in recent years. These include Pythia, a suite of models built by the EleutherAI non-profit that allows researchers to better understand how AI models work, and StarCoder, a family of language models for computer code.
Together, these examples of open-source models signal the possibility of democratizing and decentralizing AI development. They demonstrate that a different trajectory is possible from that of centralization through proprietary solutions. Just as with browsers and operating systems in the past, open-source solutions have become viable alternatives and challengers to a potential AI oligopoly. Today, a robust field of open-source AI science is leading in areas such as training dataset creation, security research and model fine-tuning, and the models being built can be freely applied to non-commercial applications, large or small.
Decentralizing AI power should be a policy goal in itself, comparable to anti-trust efforts. Open-sourcing AI, as a decentralization method, would increase market competition. And while training new AI models from scratch is prohibitively expensive, their further development and fine-tuning can often be conducted at much lower cost, offering a business model for market entrants and smaller, less well-resourced companies.
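To make that cost asymmetry concrete, the sketch below shows one common low-cost route: parameter-efficient (LoRA) fine-tuning of an openly licensed model. It is a minimal illustration only, assuming the Hugging Face transformers and peft libraries; the checkpoint, adapter settings and module names are illustrative choices rather than anything prescribed in this chapter.

```python
# Minimal sketch only: parameter-efficient (LoRA) fine-tuning of an openly
# licensed model. Checkpoint, adapter settings and module names are
# illustrative assumptions, not requirements drawn from this chapter.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "bigscience/bloom-560m"  # a small, openly licensed BLOOM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adds small low-rank adapter matrices and trains only those,
# leaving the original model weights frozen.
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # BLOOM's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Because only the adapter weights are updated, a smaller organization can adapt an existing open model on modest hardware rather than funding a full training run; much of today’s open-source fine-tuning ecosystem rests on this logic.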
Open-source AI also has all the benefits associated with open research, giving researchers broad and equal access to the technologies involved. Contrary to mainstream narratives, the open-sourcing of models allows for greater scrutiny, and therefore helps address concerns such as bias, security and environmental impact.
Open-source approaches can also help efforts to diversify AI technologies and make them available to people around the globe. Large AI firms have a history of treating languages from around the world as raw resources that can be extracted, exploited and enclosed in proprietary systems (see also Chapter 4, ‘Community-based AI’, and Chapter 6, ‘Resisting colonialism – why AI systems must embed the values of the historically oppressed’). The reverse trend is being championed by open-source developers. The collaboratively built BLOOM model works in 46 languages, and is based on justly sourced data. The open-source Polyglot-Ko is among the best-performing language models that work in Korean. Another example is Te Hiku Media (see also Chapter 4), a Māori organization that uses open-source technology to build sovereign AI solutions that both preserve and protect Māori language and tribal knowledge.
Today, debates about AI focus on the development of technologies. We are still in the early phases of their deployment, for example through chatbots such as ChatGPT or Claude. Yet as AI solutions become more ubiquitous, we will face either a choice among a variety of solutions or a single corporate offering. This is a choice that every small business, every non-profit organization and every school system will have to make. The market is already skewed: today, any client of AI services most probably pays one of a handful of providers, and indirectly pays for the services of an even smaller set of cloud companies that provide the necessary computing power. In the field of AI services, market competition will also mean democratization.
What now?
Policymakers face a choice. As Frank Pasquale, a law professor and AI expert, has observed, one strategy could be to accept – or even promote – ‘digital gigantism’ and focus on regulating it. This is expressed in calls for licensing AI developers or focusing on AI safety in close cooperation with commercial AI giants.
This strategy may be building momentum, but it is not unchallenged. Around the world, governments and other public administrations are recognizing the need both to build on the strengths of open AI science and to ensure that non-commercial players remain free to create and deploy AI systems for non-commercial needs. Governments from the US to Sweden, and bodies such as the EU, are receptive to the importance of open AI science. This is a major reason for optimism.
The introduction of exceptions for open-source AI systems within the scope of the AI Act is a symbolic first step. The key elements of these amendments include provisions supporting open-source AI development and rules for increased transparency and governance of training data. Although such measures are absent from the voluntary commitments secured by the US government from the seven largest AI companies, the bipartisan CREATE AI Act (the ‘Creating Resources for Every American To Experiment with Artificial Intelligence Act of 2023’) pushes for ‘a shared national research infrastructure that provides AI researchers and students from diverse backgrounds with greater access to the complex resources, data, and tools needed to develop safe and trustworthy artificial intelligence’.
Striking a balance between commercial and non-commercial AI, open and closed AI, and safety and opportunity should be at the heart of AI policy. This means proportional regulation, strong data rights and public involvement. Three specific principles can be advocated:
Firstly, policies should support open-source AI development by making sure that regulation is proportional and does not unduly burden developers. In particular, proposals for licensing AI developers run the risk of concentrating development in the hands of major players. Self-governance practices developed in open-source projects – especially practices ensuring documentation and transparency – can serve as blueprints for regulation.
Secondly, policies must acknowledge that AI development depends on a robust corpus of data. While the spotlight in most of today’s policy debates is on the governance of AI models (e.g. their alignment with human values, accountability and responsible use), governance of training data is also a fundamental aspect of AI policy. The legal status of training AI to perform certain tasks (such as generating text) using content and data taken from the open web is currently unclear and depends on the jurisdiction. European text- and data-mining exceptions allow such ‘scraping’ of internet sources; in the US, fair-use status is currently being tested in the courts. At the same time, the practice has understandably met with opposition from creators and rightsholders in the creative sector. Various platforms, including user-generated content sites such as DeviantArt and Reddit, and publications such as the New York Times, have opposed such practices. There is a shared sense of an exploitative dynamic at play: a commons of publicly available knowledge and culture being used as a raw material for commercial services that may capture its value without giving back. Without proper regulation protecting the digital commons, digital content will be exploited as AI systems grow, and as data are siphoned away into closed models by companies unwilling to support original content creation. The approach adopted by the EU balances the freedom to ‘mine’ content for the purpose of AI training with an opt-out mechanism for those who want to reserve their rights.
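In practice, such an opt-out has to be expressed in machine-readable form so that crawlers assembling training corpora can detect and respect it. The sketch below is a minimal illustration of that idea, assuming robots.txt as the opt-out signal; the crawler name and helper function are hypothetical, not taken from any specific regulation or system.

```python
# Minimal sketch only: honouring a machine-readable opt-out before collecting
# web content for AI training, assuming robots.txt is the opt-out signal.
# The crawler name and helper function are hypothetical.
from urllib import robotparser

def may_collect(url: str, user_agent: str = "ExampleAITrainingBot") -> bool:
    """Return True only if the site's robots.txt permits this crawler to fetch the URL."""
    root = "/".join(url.split("/")[:3])          # scheme plus host
    parser = robotparser.RobotFileParser()
    parser.set_url(root + "/robots.txt")
    parser.read()                                # fetch and parse robots.txt
    return parser.can_fetch(user_agent, url)

# Pages whose publishers have reserved their rights are simply skipped.
if may_collect("https://example.org/articles/some-essay"):
    ...  # fetch the page and add it to the training corpus
```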
Text- and data-mining rules should strike a balance between allowing content to be reused freely and protecting intellectual property. Measures enabling content owners to opt out of allowing their data to train AI systems are important for ensuring this balance. But copyright rules themselves are not enough. A new social contract is needed to ensure that the profits generated by AI systems are recycled into funding production of the very content on which such systems rely; a financial levy might be one way of achieving this.
Thirdly, greater involvement of the public sector and increased public investment are needed to secure the public interest in responsible AI. Public research institutions and supercomputing centres are among the few actors that can compete with the commercial AI giants when it comes to research funding and computing infrastructure. In July 2023, the French government announced the outlines of a national AI initiative based on open-source principles, aimed at creating new models developed by national AI champions and complemented by publicly accessible training datasets. In the UK, the Labour Party has proposed a tenfold increase in the budget of the Foundation Models Taskforce in order to build ‘BritGPT’, ensuring that there is publicly owned capacity to develop and run foundation models. In the US, leading AI researchers have argued for ‘public option AI’ and the need to provide substantial funding to the National Artificial Intelligence Research Resource – a pilot scheme launched by the US National Science Foundation – so that it can offer public alternatives for computational power and data sources. Finally, in January 2024 the EU announced the creation of ALT-EDIC, a consortium tasked with creating publicly available language resources for training language models.
We are facing a challenging public debate, in which opinion leaders wield tremendous influence by virtue of often also being the owners of the companies that are rapidly concentrating power around the new AI technologies. These voices suggest that a safe AI future depends on societies trusting Big Tech to act as gatekeepers of technologies that are complex, powerful and supposedly even sentient in the foreseeable future. The debate itself may also be prone to centralization, as some parts of government are willing to treat the new AI giants as the only voices they need to consult. This is a vision of technological development that is not in line with democratic values. And fears of AI risks – most of them uncertain, extrapolated or exaggerated – are being used to cement this concentration of power.
Instead, we need an approach that is democratic, with technologies serving citizens and available for them to use in ways that are affordable and just. The open-source approach, while not without its challenges, offers one of the clearest paths to attaining these goals, especially when coupled with a strong public commitment to developing AI as public infrastructure.