Public Sector

Supported the Barcelona Supercomputing Center in creating a catalog of various open-source corpora for companies to develop their artificial intelligence models.

Barcelona Supercomputing Center
#AIEngineering
#NLPPlatform
PROBLEM

The Barcelona Supercomputing Center (BSC), within the AINA project, needed a team of linguists and engineers, along with an NLP platform, to generate a large volume of Catalan-language corpora and models to share with the industry.

SOLUTION

We built a multidisciplinary team of 30 linguists and engineers and customized our NLP Platform to support five projects: Named Entity Recognition (NER), Intent Creation, Question Answering, Sentiment Analysis and Speech to Text.

RESULT

The Barcelona Supercomputing Center (BSC) was able to create large corpora to train AI language models in 4 months. Those corpora included more than 1,000 hours of audio, 60,000 documents annotated with entities, more than 90,000 question-answer pairs, more than 40,000 texts with sentiment classification and 12,000 sentences covering 250 different intents.

Creating Catalan Language Corpora with an NLP Platform: How BSC Leveraged NLP to Enhance AI Language Models

The Barcelona Supercomputing Center (BSC), as part of its AINA project, needed to generate a large volume of Catalan-language corpora and models to share with the industry. To achieve this goal, it required a team of linguists and engineers, along with an NLP platform.

One of the challenges that the BSC likely faced was the limited availability of data in Catalan. As a regional language spoken in Catalonia and other parts of Spain, it may have fewer resources available for training language models. The BSC's decision to create its own corpora was a strategic move that allowed it to generate high-quality data specific to Catalan, which can be used to develop AI applications tailored to the needs of this particular region.

Our solution involved building a multidisciplinary team of 30 linguists and engineers and customizing our NLP platform to meet the specific needs of the BSC. The platform was adapted to support five different projects: Named Entity Recognition (NER), Intent Creation, Question Answering, Sentiment Analysis, and Speech to Text.

After four months of intensive work, the BSC was able to create large corpora that could be used to train AI language models. These corpora included over 1,000 hours of audio, 60,000 documents annotated with entities, more than 90,000 question-answer pairs, over 40,000 texts with sentiment classification, and 12,000 sentences covering 250 different intents.

With these results, the BSC was able to share valuable language resources with the industry and improve the development of AI language models in Catalan.

Our NLP platform and the expertise of our team of linguists and engineers allowed these corpora to be created rapidly and efficiently, demonstrating the potential of NLP technologies in advancing language research and development.