BREIN takes down a large language model for the second time in as many weeks
The BREIN Foundation has taken another Dutch AI language model offline a mere week after GEITje-7B. It once again concerns a model based on Mistral-7B. According to its creator, the model was trained on, among other things, many billions of tokens of Dutch-language literature, news and textbooks. In its documentation, the creator didn’t elaborate on what kind of materials those would be specifically, but with so much data, it was highly unlikely to be exclusively copyright-free material. The AI was primarily offered as a chatbot and could be downloaded and run by anyone.
BREIN reached out to the creator of the model and asked what those training data were, where the data came from, and whether the creator had a license to collect and process the data in that way. If these rights were lacking, then obviously the model would have to be taken offline. The alternative was a lawsuit.
Data sets for training AI have been known to be filled with materials from illegal sources. The names of certain so-called shadow libraries come up regularly in this context. On such unauthorized websites, protected works can be downloaded for free; these illegal sources have already been blocked by the Dutch access providers at BREIN’s request. If AI datasets and language models are based on such illegal copies, this is obviously undesirable for the authors and producers of the original works, and BREIN will take action.
The person behind the LLM undoubtedly understood the situation and decided to take his model offline without further discussion. The BREIN Foundation is satisfied with that result and continues to search for datasets and language models that violate copyright on a large scale.
In earlier reporting, BREIN already mentioned that we are not against AI and its training, but if books, news articles, music, etc. are used for AI training purposes, then permission from their rights holders must be obtained and, logically, a license fee should be paid. The AI providers themselves also charge money for use of their models. Works protected by intellectual property rights are necessary for training and that training results in software that can compete with the original training materials. Therefore, it can only be to the detriment of the income of authors and book and news publishers if their productions are used without compensation to create a language model.
In the United States, dozens of lawsuits are already pending against providers of AI models. In Europe, the first cases are now also being brought before the courts. Gradually the realization is dawning that copyright must be respected, and we are seeing the first licensing agreements being signed, for example between OpenAI and the Financial Times and, recently, the preliminary agreement between the major music companies and Anthropic.
“Ultimately, it’s about the tech industry also abiding by the law and respecting copyrights. Creators and producers should be able to earn an honest living and (big) tech should pay for the use of other people’s copyrighted material just like everyone else,” said BREIN managing director Bastiaan van Ramshorst.