Data generation and labeling for language models

We supported the BSC in creating a catalog of open-source corpora that companies can use to develop their artificial intelligence models.


The Barcelona Supercomputing Center (BSC), through the AINA project, needed specialized expertise and technology to strengthen the presence of the Catalan language in the digital world. To that end, BSC sought a multidisciplinary team of linguists and engineers, backed by an advanced Natural Language Processing (NLP) platform. The objective was to generate extensive Catalan language corpora, along with language models. These resources were intended not only for academic and research use but also for sharing with the broader industry, so that Catalan is better supported in digital communication and information processing in the region.


To meet BSC's requirements under the AINA project, we assembled a multidisciplinary team of 30 experts, including linguists and engineers. Our primary goal was to customize our Natural Language Processing (NLP) platform to support a range of linguistic projects. The customized platform handled five distinct project areas: Named-Entity Recognition (NER), Intent Creation, Question Answering, Sentiment Analysis, and Speech-to-Text. Each project was designed to improve how technology processes and understands the Catalan language.


The collaboration with BSC under the AINA project delivered its results in just four months. BSC created extensive corpora for training AI language models, a pivotal step in advancing Catalan language processing. The corpora comprise over 1,000 hours of audio data, 60,000 documents annotated with entities, more than 90,000 question-answer pairs, over 40,000 texts categorized for sentiment analysis, and 12,000 sentences representing 250 different intents. This dataset is a significant contribution to linguistic diversity and digital inclusivity in AI technologies.