
Comments (2)

mar-muel commented on September 26, 2024

responding to the tone of 'voice' in the comments, producing strong "false" or "misleading" signals if the input text is aggressive in nature?

A thorough error analysis can be very insightful! The problematic errors are the systematic ones, and it's important to reveal them and to know how they affect the summary statistics.

In general, your definition of "fake" might also overlap partially with "non-rational"/agitated/ALL CAPS comments, so it's worth conducting a similar analysis on your annotation set. Larger models usually require fewer samples to reach a decent accuracy level, so you might be able to clean your annotation data a bit as well (as long as you're not introducing another bias). This usually has a positive impact on scores because your objective becomes clearer.
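As a rough illustration of such an error analysis, the sketch below splits predictions by a simple "shouting" heuristic and compares error rates between the two groups. The helper names, the 0.6 threshold, the label names and the toy data are all assumptions made for this example; they are not taken from the project.

```python
from collections import defaultdict


def is_shouting(text, threshold=0.6):
    """Heuristic (an assumption): True if most letters are upper case."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    return sum(c.isupper() for c in letters) / len(letters) >= threshold


def error_rate_by_group(texts, gold, predicted, group_fn=is_shouting):
    """Return {group_value: (n_errors, n_total, error_rate)}."""
    counts = defaultdict(lambda: [0, 0])
    for text, g, p in zip(texts, gold, predicted):
        bucket = counts[group_fn(text)]
        bucket[0] += int(g != p)  # count a miss when prediction != gold
        bucket[1] += 1
    return {k: (e, n, e / n) for k, (e, n) in counts.items()}


# Toy data, invented purely for illustration.
texts = [
    "GREAT NEWS, THE CLINIC IS OPEN AGAIN",
    "Masks are handed out at the entrance.",
    "THEY ARE HIDING THE REAL NUMBERS",
    "The vaccine contains a tracking chip.",
]
gold = ["ok", "ok", "misleading", "misleading"]
predicted = ["misleading", "ok", "misleading", "misleading"]

rates = error_rate_by_group(texts, gold, predicted)
for shouting, (errors, total, rate) in sorted(rates.items()):
    print(f"shouting={shouting}: {errors}/{total} errors ({rate:.0%})")
```

A clearly higher error rate in the shouting group would suggest a systematic error. The same grouping can be run on the annotation labels alone to see whether the piles themselves are skewed.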

Just some thoughts - good luck with the analysis.

from covid-twitter-bert.

peregilk commented on September 26, 2024

Following up on Martin's comments here.

Firstly, COVID-TWITTER-BERT is starting to get a bit old. It was trained at the beginning of the pandemic. It still "thinks" that Malone is a basketball player and that alpha, delta and omicron are just letters in the Greek alphabet. In some cases, the stance or sentiment of a sentence requires you to know the meaning of these words. To fix this, one would have to do some additional pretraining on additional (unannotated) data. I'm not sure it would have a real impact in your case; it's just something you should think about.

Another point is the possibility that the model is picking up the "tone of voice", as you describe it, during finetuning. Take a minute to think about the process of finetuning on a classification task. Let's say your task is pro/anti vaccine. You do some annotation, and put the "pro" examples in pile A and the "anti" examples in pile B. In real life, a lot of these categorisations are really hard. Inter-rater reliability on tasks like this is typically below 0.8. Then you finetune your model on this. However, you are no longer finetuning on pro vs anti vaccine. You are finetuning on recreating pile A and pile B. There are a lot of other ways of recreating these piles, for instance through the use of specific words, or the level of anger, or the use of CAPS LOCK.
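One way to check how far such a shortcut goes is to see how well a trivial surface rule recreates the two piles on its own. The sketch below is only an illustration: the function names, the 0.5 threshold and the toy tweets are assumptions for this example, not real annotation data.

```python
def caps_ratio(text):
    """Fraction of alphabetic characters that are upper case."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def shortcut_accuracy(texts, labels, shout_label, calm_label, threshold=0.5):
    """Accuracy of a rule that ignores meaning and only looks at CAPS LOCK."""
    hits = 0
    for text, label in zip(texts, labels):
        guess = shout_label if caps_ratio(text) >= threshold else calm_label
        hits += int(guess == label)
    return hits / len(labels)


# Toy piles, invented for illustration: "pro" is pile A, "anti" is pile B.
texts = [
    "Got my booster today, feeling fine.",
    "NO WAY I AM TAKING THAT SHOT",
    "Vaccines saved my grandmother.",
    "STOP FORCING THIS ON US",
    "I will not take it, the trials were too short.",
]
labels = ["pro", "anti", "pro", "anti", "anti"]

acc = shortcut_accuracy(texts, labels, shout_label="anti", calm_label="pro")
print(f"CAPS-only rule recreates the piles with accuracy {acc:.2f}")
```

If a rule this crude already scores well above chance on the real annotation set, a finetuned model has an easy way to recreate the piles without learning the stance itself.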

There are ways of getting around this problem. One approach is to make the classification target-specific (where you include the labels of the piles in the input to give the classifier a hint about what you are looking for). Another approach is not to train on the classification task at all, but instead to treat it as a logical (natural language inference) task. We have made an MNLI version of the model that can be used for that.
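The logical framing could look roughly like the sketch below: each candidate label is turned into a hypothesis sentence, and the label whose hypothesis is most strongly entailed by the tweet wins. The template, the label wording and the function names are assumptions for this example. `toy_entailment_score` is only a keyword-overlap stand-in so the example runs without downloads; in practice it would be replaced by entailment scores from an NLI model such as the MNLI version mentioned above.

```python
def build_nli_pairs(text, candidate_labels, template="This tweet is {} vaccination."):
    """Turn each label into a (premise, hypothesis) pair."""
    return [(text, template.format(label)) for label in candidate_labels]


def classify_with_nli(text, candidate_labels, entailment_score,
                      template="This tweet is {} vaccination."):
    """Pick the label whose hypothesis gets the highest entailment score."""
    pairs = build_nli_pairs(text, candidate_labels, template)
    scores = {
        label: entailment_score(premise, hypothesis)
        for label, (premise, hypothesis) in zip(candidate_labels, pairs)
    }
    best = max(scores, key=scores.get)
    return best, scores


def _words(sentence):
    return {w.strip(".,!?").lower() for w in sentence.split()}


def toy_entailment_score(premise, hypothesis):
    """NOT a real model: share of hypothesis words that appear in the premise."""
    hyp = _words(hypothesis)
    return len(hyp & _words(premise)) / len(hyp)


label, scores = classify_with_nli(
    "I am against mandatory shots.",
    ["in favour of", "against"],
    toy_entailment_score,
)
print(label, scores)
```

The point of the design is that the label wording itself carries meaning, so the model is asked about the stance directly instead of being asked to recreate two unnamed piles.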

Best of luck with the competition!

