Comments (8)

avalanchesiqi commented on August 16, 2024

I had some success in investigating the code related to this issue.

Quick workaround to bypass this issue

Comment out these two lines. It won't perform the check and no errors will be thrown.

More investigations

The real question here is whether these assertion tests work as designed. I will do a brief line-by-line analysis.

note_status_history.py

233  if len(mergedStatuses) > c.minNumNotesForProdData:
234    _check_flips(mergedStatuses)

mergedStatuses is a dataframe that combines oldNoteStatusHistory and newNoteStatusHistory. If you use the published Community Notes data before Nov 25 2023, you will have a dataframe with 401490 entries. This is way over the preset threshold minNumNotesForProdData = 200. So the _check_flips(mergedStatuses) line will be executed.

The problem is in the lines below. I suspect they are not working as expected, or at least that a difference between the production environment and a local setup causes the exception.

170  # Prune to unlocked notes.
171  mergedStatuses = mergedStatuses[mergedStatuses[c.timestampMillisOfStatusLockKey].isna()]
172  # Identify new and old CRH notes.
173  oldCrhNotes = frozenset(
174    mergedStatuses[mergedStatuses[c.currentLabelKey] == c.currentlyRatedHelpful][c.noteIdKey]
175  )
176  newCrhNotes = frozenset(
177    mergedStatuses[mergedStatuses[c.finalRatingStatusKey] == c.currentlyRatedHelpful][c.noteIdKey]
178  )

Line 171 keeps only the rows of mergedStatuses whose timestampMillisOfStatusLock is NaN. In my run these are the new notes, which also show up in the training log: I have 8 new notes that will be added to the noteStatusHistory file.

total notes added to noteStatusHistory: 8

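Line 174 is where the error originates. It selects the notes with currentStatus="CURRENTLY_RATED_HELPFUL" among these 8 new notes. By definition there are none, because these notes are new and do not have a currentStatus yet; they are all NaN. oldCrhNotes is therefore empty, so dividing by len(oldCrhNotes) on line 181 raises the ZeroDivisionError.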

My local log is below:

>>> mergedStatuses[mergedStatuses['timestampMillisOfStatusLock'].isna()][["currentStatus", "finalRatingStatus"]]
       currentStatus   finalRatingStatus
401481           NaN  NEEDS_MORE_RATINGS
401482           NaN  NEEDS_MORE_RATINGS
401483           NaN  NEEDS_MORE_RATINGS
401484           NaN  NEEDS_MORE_RATINGS
401485           NaN  NEEDS_MORE_RATINGS
401486           NaN  NEEDS_MORE_RATINGS
401487           NaN  NEEDS_MORE_RATINGS
401488           NaN  NEEDS_MORE_RATINGS

In fact both oldCrhNotes and newCrhNotes will be empty.
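
To make the reasoning above concrete, here is a minimal pandas-free sketch that mimics lines 171 to 178 on plain dictionaries. The rows, the note IDs, and the helper names `is_missing` and `crh_sets` are hypothetical; only the column names and status strings are taken from the log above.

```python
import math

CRH = "CURRENTLY_RATED_HELPFUL"

def is_missing(value):
    """Return True for None or a float NaN, roughly like pandas' isna()."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def crh_sets(rows):
    """Mimic lines 171-178: prune to unlocked rows, then collect old/new CRH note ids."""
    unlocked = [r for r in rows if is_missing(r["timestampMillisOfStatusLock"])]
    old_crh = frozenset(r["noteId"] for r in unlocked if r["currentStatus"] == CRH)
    new_crh = frozenset(r["noteId"] for r in unlocked if r["finalRatingStatus"] == CRH)
    return old_crh, new_crh

# Hypothetical rows shaped like the 8 new notes in the log above.
rows = [
    {
        "noteId": i,  # made-up ids, for illustration only
        "timestampMillisOfStatusLock": float("nan"),
        "currentStatus": float("nan"),
        "finalRatingStatus": "NEEDS_MORE_RATINGS",
    }
    for i in range(8)
]

old_crh, new_crh = crh_sets(rows)
print(len(old_crh), len(new_crh))  # both sets are empty
```

With an empty `old_crh`, any ratio that divides by its length is undefined, which matches the exception reported in this thread.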

@jbaxter thoughts?

from communitynotes.

jbaxter commented on August 16, 2024

Hmm, seems like we released another version of the file with -1s. Should be fixed upon the next release (in a few hours)

jbaxter commented on August 16, 2024

We are not seeing this issue in production so I am wondering what data files you had downloaded and what command you used to invoke the run

christosporios commented on August 16, 2024

I ran main.py with the right dependencies and python version, as specified in the README. The data files are below, and they come from https://twitter.com/i/communitynotes/download-data.

noteStatusHistory-00000.tsv  ratings-00000.tsv  ratings-00002.tsv  userEnrollment-00000.tsv
notes-00000.tsv              ratings-00001.tsv  ratings-00003.tsv

christosporios commented on August 16, 2024

Should I be joining ratings-0000{0-3}.tsv together before running this?

avalanchesiqi commented on August 16, 2024

I also saw the same issue in my reproduction. I tried both a single ratings file and multiple ratings files; both yielded the same exception. In my case, I commented out the assert statement and then everything worked, but that may just be a hack on my side.

haelyons commented on August 16, 2024

I also saw this issue when running a build with the files provided on the Community Notes Twitter page. The traceback is:

Traceback (most recent call last):
  File "main.py", line 17, in <module>
    main()
  File "/home/hal/communitynotes/sourcecode/scoring/runner.py", line 105, in main
    dataLoader=dataLoader if args.parallel == True else None,
  File "/home/hal/communitynotes/sourcecode/scoring/run_scoring.py", line 702, in run_scoring
    noteStatusHistory, scoredNotes
  File "/home/hal/communitynotes/sourcecode/scoring/note_status_history.py", line 234, in update_note_status_history
    _check_flips(mergedStatuses)
  File "/home/hal/communitynotes/sourcecode/scoring/note_status_history.py", line 181, in _check_flips
    (len(newCrhNotes - oldCrhNotes) / len(oldCrhNotes)) < maxCrhChurn
ZeroDivisionError: division by zero
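
The traceback shows the churn ratio dividing by `len(oldCrhNotes)`. One possible way to make such a check tolerate an empty baseline is sketched below. The function names `crh_churn` and `check_flips`, the `None` return convention, and the threshold values are hypothetical; this is not the project's actual fix.

```python
def crh_churn(old_crh, new_crh):
    """Fraction of newly CRH notes relative to the old CRH set, or None if undefined."""
    if not old_crh:
        # No baseline to compare against, so the ratio is undefined.
        return None
    return len(new_crh - old_crh) / len(old_crh)

def check_flips(old_crh, new_crh, max_crh_churn):
    """Return True if churn is below the threshold or cannot be computed."""
    churn = crh_churn(old_crh, new_crh)
    return churn is None or churn < max_crh_churn

# Empty baseline: the check is skipped instead of raising ZeroDivisionError.
print(check_flips(frozenset(), frozenset(), 0.25))  # True

# One new CRH note against four old ones gives a churn of 0.25.
print(check_flips(frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), 0.5))  # True
```

Whether an empty baseline should pass, warn, or fail is a judgment call. Here it passes silently, which hides the kind of data problem that surfaced in this thread, so logging a warning in that branch may be preferable.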

jbaxter commented on August 16, 2024

Thank you all for tracking this down! The root cause was quite subtle: it looks like the script we use to process our internal TSV files for public release has been replacing NaNs with -1 instead of the empty string, as they were in the internal code. I've fixed that export script, so the code should run properly if you re-download the data files (commenting out this assert works too).
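
For anyone stuck with an older download, a possible local workaround is to map the `-1` placeholder back to a missing value while loading. The sketch below uses only the standard library. The sample TSV content, the helper name `load_rows`, and the assumption that `-1` appears as a placeholder in `timestampMillisOfStatusLock` are all hypothetical; check which columns are affected in your own files.

```python
import csv
import io

def load_rows(tsv_text, sentinel_columns, sentinel="-1"):
    """Parse TSV text, turning empty strings and the sentinel into None for the given columns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        for col in sentinel_columns:
            if row[col] in ("", sentinel):
                row[col] = None
        rows.append(row)
    return rows

# Made-up sample: note 1 uses the -1 placeholder, note 3 uses an empty field.
sample = (
    "noteId\ttimestampMillisOfStatusLock\tcurrentStatus\n"
    "1\t-1\tCURRENTLY_RATED_HELPFUL\n"
    "2\t1690000000000\tNEEDS_MORE_RATINGS\n"
    "3\t\tNEEDS_MORE_RATINGS\n"
)

rows = load_rows(sample, sentinel_columns=["timestampMillisOfStatusLock"])
unlocked = [r["noteId"] for r in rows if r["timestampMillisOfStatusLock"] is None]
print(unlocked)  # ['1', '3']
```

Without the normalization, note 1 would look locked, because `-1` is a real value rather than a missing one.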
