Comments (8)
I had some success in investigating the code related to this issue.
Quick workaround to bypass this issue
Comment out these two lines. It won't perform the check and no errors will be thrown.
More investigations
The real question here is whether these assertion tests work as they are designed. I will do a bit line by line analysis
233 if len(mergedStatuses) > c.minNumNotesForProdData:
234 _check_flips(mergedStatuses)
mergedStatuses
is a dataframe that combines oldNoteStatusHistory
and newNoteStatusHistory
. If you use the published Community Notes data before Nov 25 2023, you will have a dataframe with 401490 entries. This is way over the preset threshold minNumNotesForProdData = 200
. So the _check_flips(mergedStatuses)
line will be executed.
The problem is in those lines. And I suspect they are not working as expected, or at least there is a difference between the production environment and local setup which causes the exception.
170 # Prune to unlocked notes.
171 mergedStatuses = mergedStatuses[mergedStatuses[c.timestampMillisOfStatusLockKey].isna()]
172 # Identify new and old CRH notes.
173 oldCrhNotes = frozenset(
174 mergedStatuses[mergedStatuses[c.currentLabelKey] == c.currentlyRatedHelpful][c.noteIdKey]
175 )
176 newCrhNotes = frozenset(
177 mergedStatuses[mergedStatuses[c.finalRatingStatusKey] == c.currentlyRatedHelpful][c.noteIdKey]
178 )
Line 171 will filter out mergedStatuses
that has timestampMillisOfStatusLock
equals to na
. This is to find out new notes and can be found in the training log. I have 8 new notes that will be updated to the noteStatusHistory
file.
total notes added to noteStatusHistory: 8
Line 174 is the line causing the error. It selects currentStatus="CURRENTLY_RATED_HELPFUL"
notes in these 8 new notes. By definition, it will be zero because these notes are new and do not have a currentStatus
yet. They are all na
.
My local log below
>>> mergedStatuses[mergedStatuses['timestampMillisOfStatusLock'].isna()][["currentStatus", "finalRatingStatus"]]
currentStatus finalRatingStatus
401481 NaN NEEDS_MORE_RATINGS
401482 NaN NEEDS_MORE_RATINGS
401483 NaN NEEDS_MORE_RATINGS
401484 NaN NEEDS_MORE_RATINGS
401485 NaN NEEDS_MORE_RATINGS
401486 NaN NEEDS_MORE_RATINGS
401487 NaN NEEDS_MORE_RATINGS
401488 NaN NEEDS_MORE_RATINGS
In fact both oldCrhNotes
and newCrhNotes
will be empty.
@jbaxter thoughts?
from communitynotes.
Hmm, seems like we released another version of the file with -1s. Should be fixed upon the next release (in a few hours)
from communitynotes.
We are not seeing this issue in production so I am wondering what data files you had downloaded and what command you used to invoke the run
from communitynotes.
I ran main.py
with the right dependencies and python version, as specified in the README. The data files are below, and they come from https://twitter.com/i/communitynotes/download-data.
noteStatusHistory-00000.tsv ratings-00000.tsv ratings-00002.tsv userEnrollment-00000.tsv
notes-00000.tsv ratings-00001.tsv ratings-00003.tsv
from communitynotes.
Should I be joining ratings-0000{0-3}.tsv
together before running this?
from communitynotes.
I also saw the same issue in my reproduction. I have tried only one rating file and multiple rating files, both yielded the same exception. In my case, I commented out the assert
statement and then everything worked. But that may be my hack.
from communitynotes.
Also saw this issue when running a build with the files provided at on the Community Notes Twitter with the following traceback:
Traceback (most recent call last):
File "main.py", line 17, in <module>
main()
File "/home/hal/communitynotes/sourcecode/scoring/runner.py", line 105, in main
dataLoader=dataLoader if args.parallel == True else None,
File "/home/hal/communitynotes/sourcecode/scoring/run_scoring.py", line 702, in run_scoring
noteStatusHistory, scoredNotes
File "/home/hal/communitynotes/sourcecode/scoring/note_status_history.py", line 234, in update_note_status_history
_check_flips(mergedStatuses)
File "/home/hal/communitynotes/sourcecode/scoring/note_status_history.py", line 181, in _check_flips
(len(newCrhNotes - oldCrhNotes) / len(oldCrhNotes)) < maxCrhChurn
ZeroDivisionError: division by zero
from communitynotes.
Thank you all for tracking this down! The root cause was quite subtle: it looks like the script we use to process our internal TSV files for public release has replacing NaNs with -1 instead of empty string as they were in the internal code. I've fixed that export script so the code should run properly if you re-download the data files (or commenting out this assert works too).
from communitynotes.
Related Issues (20)
- Missing unit tests HOT 1
- Hydrating community notes? HOT 2
- Missing scipy in requirements.txt HOT 1
- On Increasing bridging/diversity property inside CN Contributors for non-US countries
- Link leads to not found page HOT 2
- Installing via pip requirements hits a Cython error HOT 1
- Mobile app (iPad) reports "Your account may not be allowed to perform this action" on Submit HOT 4
- "Rate it" shouldn't appear on a note which reader has already rated HOT 1
- Reusing notes HOT 1
- [WHITE-PAPER] Birdwatch paper and limited explanatory model HOT 1
- Options on Notification when a post you've interacted with gets note'd HOT 1
- Cannot write notes while Impact Score is >5 HOT 3
- Notes
- Security
- The page asks to join again to an already existing user HOT 2
- README recommends Python 3.7.9, but requirements.txt is not compatible HOT 1
- Some days, some data files are compressed
- "Enable Trending for Community Notes and Offer Insights on Account Notations" HOT 1
- X Premium to send messages issue HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from communitynotes.