A lawyer and a PhD researcher at the Max Planck Institute for Innovation and Competition in Munich, Carlos is focused on open innovation dynamics in the ICT field. He is a co-founder of HIGH Technology Law Forum, and also the co-lead of the Legal & Ethical WG at the BigScience workshop, an open collaborative AI research initiative. We talked to him about the legal aspects of training large language models, how Big Tech uses open source as a strategy to compete for the ecosystem, and how recent European regulation will impact applications of AI. For clarity: Carlos' views are his own and do not represent any of the organizations he is affiliated with.
Carlos, can you please share some words about your background and what got you interested in the legal aspects of technology and AI?
It's quite funny for a lawyer to be here today in this kind of interview, because traditionally, lawyers weren't supposed to deal with technical stuff, and even less to know about it. The traditional conception of a lawyer dealing with technical stuff was basically restricted to the patent lawyer. Nowadays, we are suddenly starting to have these kinds of tech lawyers (and I include myself there), mainly due to the ubiquity of digitalization.
My research deals with the interactions between open source and standards with a focus on the telecom industry: open source related business models are progressively having an impact within standardization bodies and standardization dynamics. Nowadays, we don't talk anymore about technical specifications alone, but also about software reference implementations. I started to be more and more interested in AI, e.g. the strategic use of open source licences by big tech firms, and how open source is used to compete for the leadership of the market. Open source is understood as an attraction mechanism to generate market dependency, for instance with machine learning frameworks, such as TensorFlow or PyTorch. Last summer, I started to collaborate with the BigScience workshop, where I help with coordinating the legal and ethical efforts. There we are dealing with many interesting legal challenges, e.g. related to intellectual property and personal data.
The BigScience workshop is about training very large language models, on very large sets of data collected from the public internet. Can you sketch out what the legal challenges are to collecting these datasets and using them for training AI models?
There are basically two kinds of data sources here. First, data sets from companies or institutions which BigScience is interested in obtaining for training. These kinds of data sets sometimes might be closed data sets, and we bilaterally negotiate with a company or an institution to obtain access to this data set for research purposes. Second, there is data obtained by web crawling. In this case, there are two main challenges: personal data and copyrighted material, and if we focus on EU jurisdictions, even database sui generis rights. Here, intellectual property exceptions play a major role. And as you might be aware, the debate is ongoing about text and data mining (TDM) exceptions applying in the EU – e.g. for research purposes. Exceptions can apply to TDM carried out by research organizations, or text and data mining carried out for whatever purpose, but based on lawfully obtained collections, and even then copyright holders can expressly reserve their rights. It’s not an easy legal framework and will need further interpretation and guidance from policy makers, courts, and the market.
So if I train a model on lots of data, there are copyright regulations and database rights in play on the data itself, but if I train my model on it, and then I throw away the data, in what sense am I still impacted by this? Is a trained model a derivative work of any kind? It’s now been transformed to a bunch of weights and parameters.
Yes, it could be argued it is a “derivative” work. Put simply, a derivative work is a work based on some pre-existing copyrighted work. “Derivative work” is a US-based legal concept, by the way. Within the EU, where copyright laws are not harmonized, it depends on the country. For example, in Spain, France or Germany we do not explicitly use the term “derivative work”. The general rule is the same though: if you use a pre-existing work, you have to obtain the permission of the author. However, the exceptions frameworks are different between the US and the EU.
In the case of training a neural network model, one should carefully understand how copyright laws and their exceptions work in order to comply with the law and respect the rights of others. For instance, if we go to the US, we have a safe harbor called “fair use”. Fair use is based on four essential criteria: the purpose and character of the use; the nature of the work; the amount of copyrighted material used; and, the impact of the unauthorized use of the copyrighted material in the copyright holder’s respective market.
For example, under US copyright law, let’s imagine someone crawls a number of copyrighted materials found on the web to train a model. Two of the questions that could be asked are:
Is there a commercial or research purpose? Does this model have a transformative character compared to the set of - e.g. - papers in electronic format used to train the model? Yes. Because at the end of the day, these papers were made for scientific purposes, as intellectual expressions for consumption by individuals. Conversely, the model doesn't focus on the intellectual expressions or the main aim of the author, it just looks at the text for mechanical parameter tuning purposes.
What is the impact of the use of the copyrighted material in the copyright holders’ markets for their intellectual property rights? For instance, if I write a book and sell it in an electronic format, does the use of the book for creating a model in the field of NLP impact my market for the selling of the book? I hardly think so.
So from a US-based perspective, one could hold that the use of such copyrighted material, such as books, for developing models amounts to fair use. However, this should not be seen as a general rule, but rather assessed on a case-by-case basis. For instance, it would be more difficult to prove fair use for the unauthorized use of datasets potentially designed for model training purposes.
Has this opinion been tested in court so far?
Not in the case of AI, to the extent of my knowledge. But in the US, there are, as far as I know, a few interesting cases. The main one, dating from 2015, is Authors Guild vs Google, relating to books scanned by Google to create an application to show snippets of the book, and the court ruled that this amounted to fair use. Again, this should not be regarded as a general rule: fair use is applied case by case, taking into account the specificities of each case.
Let’s talk a bit more about Big Tech, and the data and infrastructure advantages that they have. How do you see the current dynamic in AI, and the power balance between Big Tech and open access of data for society for research and general benefit?
Nowadays “openness” is possibly one of the terms most subject to industry interpretation. From a policy and strategy angle companies invest millions in capturing the interpretation of such concepts. We hear companies referring to open standards, but what do you mean by an open standard? Is it open from the perspective of access and exploitation of the tech? For instance OpenAI’s GPT-2. Or are we also dealing with openness of the tech development or standardization process? So you are not just welcome to use the technology, but also to participate in the development of this technology. Sometimes being present at the development stage makes you get all the necessary access to know-how, to further develop the technology, and this makes the difference.
And the other point is this: what’s the interrelation between openness and ethics? Openness, of course, is essential to fostering innovation in fields such as AI, but do open software development phenomena, such as open source, inherently mean responsible use of AI? I don't think so. The AI model might be open source, but may not have use case restrictions, so I could use the model to develop a digital weapon or an addictive chatbot for children, etc. We really have to strike a balance between fostering open access by means of open-style software licences, and taking care of the use of AI systems, such as large language models.
These very large models, that are quite hard to train for smaller resourced individual researchers or smaller companies, have been called foundational models, implying that they are the foundation for further development. Do you think there is an argument to be made that society should strive for these foundational models to be open source, so that equal chances are preserved for various market players?
It's always nice to reason by analogy. Let's think about the case of the Linux kernel, developed and adopted in the late 90s, early 2000s. Big companies, such as IBM, saw that open source was the future in terms of software development and new ICT business models, and created a defensive patent pool to protect the Linux kernel, via the Open Invention Network (OIN), because Linux could be defined as a core technology at the time. By doing so companies such as IBM or Red Hat were able to integrate open source as a business model while assuring to customers and investors the mitigation of patent-related disputes concerning Linux, thanks to the OIN. In other words, business models can be designed around open source, thus benefiting different economic interests at stake. It would indeed be interesting and beneficial to move towards opening foundational technologies, and BigScience is a clear example. I’m not in a position to judge whether the BigScience LLM release will be qualified as a “foundational” one, but anyway it will be a massive milestone for the global AI research community.
It would also be interesting in the mid term to see consortia or standards-initiatives focused on the development of foundational models, following both open source-style development processes and IPR policies mitigating potential disputes. Maybe I’m just a dreamer, but dreaming is free, right?
There's another element of competition. The AI research community is global, but there is a sense of competition around these very large foundational models between the US and China and Europe. What role should countries or international organizations play here?
The geopolitical perspective on AI development is very interesting. Not surprisingly, the European Commission has been promoting open source for three or four years now. Within the European Commission's long term policy strategy, open source is now one of the main goals, in my personal opinion because they think we have lost this so-called AI technological race. And open source is a specific software market phenomenon by which we can take back leverage and even try to compete in the long run with the US and China. If you look at China's strategy, there is a paper from 2018 (see references below), dealing with AI and open source, and how to foster it. In this paper, they analyze the US open source commercial model, which at the end of the day, has been the winner within the global open source community. Google, IBM, Red Hat, these guys invented open source commercial models, and they know how to play this game at scale. Just take a look at Android: it’s “open”, but at the same time it’s closed.
The teams training very large AI models realize the potential of open source from a market strategy perspective, and even broadly, from an ecosystem control perspective, because nowadays in ICT markets, companies do not target a single market anymore, some of them compete for the entire ecosystem, such as Google, Microsoft, or Meta. And for this kind of platform competition dynamics, open source is the perfect tool. You open source the core technology upon which several other technologies rely, and if successful, everyone will adopt this technology. You attract users, they become dependent on your tool, you extract benefits elsewhere in the market closely connected to the tool. So at the end of the day, you achieve a kind of control of the ecosystem and the market. It’s the battle of “openness”: you open, attract, control, and when needed, close.
Europe is on a national level very much trying to compete in the AI race. Germany, UK, France, etc. are investing a lot in building these national foundational models for their own languages. However, as a collective, the EU has excelled mainly in the legal domain, with the European AI act as a first significant attempt to regulate AI. What is your view on that?
The proposal for an AI Act, I would say, should be seen as having a transversal impact. In my personal opinion, the European Commission foresees that in the near future, AI will be applied in every single sector: AI will be like any other software in the near future – a commodity. This is why we need a Regulation capable of regulating the use of AI in every sector. One act to rule them all, as Gandalf would say. Some voices have been heard on the risk of over-regulating a nascent industry. Will regulation hinder or foster innovation? I think the AI Act should not be seen just as a burden of compliance like the GDPR, but also from the “opportunity” perspective. How do we get strategic leverage from the interpretation of the provisions within the AI Act? For instance, the Act will impose some kind of certification scheme for the quality of AI products. Now, who is going to develop these kinds of certifications? Are these certifications going to be carried out by standardization bodies, or are we going to create private consortia to issue these certifications? Standardization is a business in itself. I also like the idea of the AI sandbox, which is a specific set of provisions within the AI Act. Regulatory sandboxes have been making a lot of noise in the FinTech industry, with the UK pioneering this instrument back in 2015. A regulatory sandbox is basically a testbed whereby the regulator and companies come together to test disruptive technologies before these technologies reach the market, for regulatory purposes such as data protection or, in the case of FinTech, some financial certification. In the case of the AI Act, as a vendor you will have the opportunity to apply to the AI sandbox, and once you are in you will test with the specific public institution whether your AI system should be modified in order to be marketed.
The sandbox will also serve the regulators to empirically assess, based on the technologies and business models entering the sandbox, if there are any common patterns which could motivate amendments of existing regulations. The legal framework will thus be updated as a result of new technical and business dynamics in the AI industry. At the end of the day, regulatory sandboxes are double-edged instruments.
Typically, in regulation and legal frameworks, we fix the problems of the past, and it’s very hard to foresee the problems of the future. One reason is that regulators and legislators are not usually very tech savvy. Don’t you think that, even with sandboxes, auditing, and compliance enforcement under something like the AI Act, the techies create a reality that is so far ahead of what the regulators understand that regulation is not really effective?
With regulatory initiatives such as the AI Act, the first challenged party is not only the industry, but also the policymakers and the regulators. Imagine, as a public agency dealing with the AI sandbox in two or three years, how are you going to deal with a huge number of applications? Do you have the human capital, the qualified professionals to deal with the sandbox? And more precisely, are these professionals experts within the AI field? In other words, the AI Act will entail a serious investment in order to set the appropriate institutional infrastructure ensuring a smooth functioning of the regulation and the actors involved. To be honest, I don’t know, maybe we have to automate the functioning of sandboxes using AI! Crazy (but not impossible in the long run).
Thanks for the interview Carlos, and for sharing some tips for further reading with us:
Muñoz Ferrandis, Carlos and Duque Lizarralde, Marta, Open sourcing AI: intellectual property at the service of platform leadership (January 26, 2022). Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4018413
Ministry of Industry and Information Technology - Informatization and Software Services Division, White Paper on the Development of China’s Artificial Intelligence Open Source Software (AOSS), Jeffrey Ding (transl.) (2018).
OpenAI, Comments regarding the request for comments on IP protection for AI innovation, before the USPTO (2019).
Knut Blind et al., The impact of Open Source Software and Hardware on technological independence, competitiveness and innovation in the EU economy (European Commission, 2021), see pages 306ff for AI.
Alexandra Theben, Laura Gunderson, Laura López Forés, Gianluca Misuraca, Francisco Lupiáñez Villanueva, Challenges and limits of an open source approach to Artificial Intelligence, (European Parliament 2021) Study for the Special Committee on Artificial Intelligence in a Digital Age (AIDA), Policy Department for Economic, Scientific and Quality of Life Policies.
European Commission, Communication from the Commission, “Open Source Software Strategy 2020 – 2023, Think Open” (2020).
Ibrahim Haddad, Open Source AI Projects, Insights, and Trends (The Linux Foundation, 2018).