Multilingual-Sentiment-Analysis-with-Fine-tuned-Bidirectional-Encoder-Representations-from-Transformer
The following project uses a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model for sentiment analysis, evaluated on an open-source, multilingual dataset covering over 30 languages and annotated with Plutchik's eight core emotions.
Part 1: INTRODUCTION
Sentiment analysis involves understanding and scoring the emotions underlying a text. While significant work has been done on understanding emotions in a single language or a small set of languages, performing accurate analysis across many languages (ten or more) with a single model remains an unsolved research problem. With the advance of multilingual transformer-based models, machines have gotten better at understanding and interpreting multiple languages. While expert human translation and understanding of language remain unrivaled, the scale of data generated on the internet makes manual analysis impractical. The proposed approach, if successful, may provide a high-level view of national or regional sentiments and how those sentiments change over time.
Part 2: ALIGNMENT AND DEGREE OF INNOVATION
Sentiment analysis works well when the data is in a single language, particularly English. However, a single approach that works and scales across multiple, disparate languages remains a significant open problem in Natural Language Processing.
The base approach is derived from the seminal work on the Bidirectional Encoder Representations from Transformers (BERT) model. BERT has achieved state-of-the-art performance on most Natural Language Processing tasks and has spawned multiple variants. A recent variant, proposed in 2020, learns the encodings of multiple languages in a single model. While improvements to multilingual BERT are still being made, the model has not yet been tested on the sentiment analysis task. To the best of our knowledge, this is the first work utilizing multilingual BERT for sentiment analysis across ten or more languages. The dataset used for this task, XED, was also released in 2020 and includes written text in over 30 languages. The dataset is expected to expand to include 40 more languages.
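To make the setup concrete, the sketch below shows how such a multilingual BERT classifier could be assembled with the Hugging Face `transformers` library. This is a minimal illustration, not the project's final code: the checkpoint name (`bert-base-multilingual-cased`), the single-label framing, and the label ordering are all assumptions.

```python
# Minimal sketch: a multilingual BERT encoder with an 8-way classification
# head for Plutchik's core emotions. Checkpoint and label order are assumed.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

PLUTCHIK_EMOTIONS = [
    "anger", "anticipation", "disgust", "fear",
    "joy", "sadness", "surprise", "trust",
]

model_name = "bert-base-multilingual-cased"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(PLUTCHIK_EMOTIONS),
    id2label=dict(enumerate(PLUTCHIK_EMOTIONS)),
    label2id={e: i for i, e in enumerate(PLUTCHIK_EMOTIONS)},
)

# The same tokenizer and encoder handle text in any supported language.
inputs = tokenizer("Tämä elokuva oli upea!", return_tensors="pt")  # Finnish: "This movie was great!"
logits = model(**inputs).logits  # shape (1, 8); the head is untrained until fine-tuned
```

The classification head added by `from_pretrained` is randomly initialized; it is the part that gets fine-tuned on the labeled emotion data, as outlined in Part 4.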
The proposed approach, if successful, can be extended to include more languages and other data sources such as audio or video. Specifically, online and social media sources can be mined for national or regional sentiment irrespective of language, as sketched below. A dashboard tracking such a "change of perceptions" over time may also be built.
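As a hypothetical illustration of that aggregation step, assuming per-post emotion predictions have already been produced and tagged with region and date metadata (the column names and values here are invented):

```python
# Hypothetical post-processing sketch: turn per-post emotion predictions
# into regional emotion shares over time, the raw material for a dashboard.
import pandas as pd

predictions = pd.DataFrame({
    "region": ["FI", "FI", "DE", "DE"],
    "month": ["2020-01", "2020-02", "2020-01", "2020-02"],
    "emotion": ["joy", "trust", "anger", "joy"],  # model outputs per post
})

# Fraction of posts expressing each emotion, per region and month.
trend = predictions.groupby(["region", "month"])["emotion"].value_counts(normalize=True)
print(trend)
```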
Part 3: TEAM
Understanding emotion from text is a fundamental step towards Artificial General Intelligence. With the shift towards multilingual models, language models are improving rapidly. This proposed project represents an intellectual challenge and a growth opportunity for the team.
The team lead is an expert in machine learning, deep-learning approaches to computer vision, and natural language tasks. In addition to the team lead, the team comprises three goal-driven students experienced in Python programming. Under the direction of the team lead, the students will study, program, train, and evaluate the requisite model and novel variants using the available dataset for sentiment analysis. We see this project as an opportunity to learn and to add to existing knowledge in the field of Natural Language Processing.
Part 4: TECHNOLOGY AND CONCEPT VIABILITY
The base model provides state-of-the-art performance on natural language tasks compared to existing language-embedding models. While the model itself is large, fine-tuning it for sentiment classification requires far less computation than comparable approaches or than pretraining a model from scratch.
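A rough sketch of what such a fine-tuning run could look like with the `transformers` Trainer API is shown below. It assumes the `model` from the Part 2 sketch and a tokenized `train_dataset` with integer emotion labels; the hyperparameters are typical BERT fine-tuning defaults, not the project's tuned values.

```python
# Illustrative fine-tuning sketch; `model` is the classifier from the earlier
# sketch and `train_dataset` is an assumed tokenized, labeled dataset.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="mbert-xed-sentiment",  # assumed output path
    num_train_epochs=3,                # a few passes suffice for fine-tuning
    per_device_train_batch_size=16,
    learning_rate=2e-5,                # typical BERT fine-tuning rate
    weight_decay=0.01,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```

Because only the small classification head is trained from scratch while the encoder weights are merely adjusted, a run like this typically fits on a single GPU, in contrast to the much larger compute budget behind the pretrained checkpoint.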