BLOOM: Inside the radical new project to democratize AI
But Meta’s model is available only on request, and its license limits its use to research purposes. Hugging Face goes one step further. The meetings detailing its work over the past year are recorded and uploaded online, and anyone can download the model for free and use it for research or to build commercial applications.
A big focus for BigScience was bringing ethical considerations into the model from the outset, rather than treating them as an afterthought. LLMs are trained on huge amounts of data collected from the internet. This can be problematic, because these data sets include a lot of personal information and often reflect dangerous biases. The team developed data governance structures specific to LLMs that should make it clearer what data is being used, to whom it belongs, and where it comes from, and it drew on data sets from around the world that are not available online.
The group is also releasing a Responsible AI License, something like a terms of service agreement. It is designed to prevent the use of BLOOM in high-risk areas such as law enforcement or healthcare, or to harm, deceive, take advantage of, or impersonate people. The license is an experiment in self-regulating LLMs before regulation catches up, said Danish Contractor, an AI researcher who volunteered for the project and co-created the license. But in the end, there’s nothing stopping anyone from abusing BLOOM.
The project also set out its ethical principles in BLOOM’s code of ethics, said Giada Pistilli, the Hugging Face ethicist who drafted it. For example, it made a point of recruiting volunteers from a variety of backgrounds and locations, ensuring that outsiders could easily reproduce the project’s findings, and publishing its results openly.
All on board
This philosophy translates into a key difference between BLOOM and other LLMs currently available: the large number of human languages the model can understand. It can handle 46 of them, including French, Vietnamese, Mandarin, Indonesian, Catalan, 13 Indian languages (such as Hindi) and 20 African languages. Just over 30% of its training data is in English. This model also understands 13 programming languages.
This is very unusual in the world of large language models, where English predominates. It’s another consequence of the fact that LLMs are built by collecting data from the internet: English is the most commonly used language online.
BLOOM was able to improve on this situation because the team brought together volunteers from all over the world to build relevant data sets in other languages, even if those languages were not well represented online. For example, Hugging Face has held workshops with African AI researchers to try to find data sets, such as records from local governments or universities, that could be used to train models on African languages, said Chris Emezue, an intern at Hugging Face and a researcher at Masakhane, an organization that works on natural language processing for African languages.