In our newly polished generate_dataset.py,
we now output elements grouped by page.
However, the output sample_data.csv
(and the nflx variant as well) contains only annotated webelements.
This is a legacy artifact from the non-grouped data.
We want to add all the unannotated neighbors back in on a per-page basis. However, we probably still don't want annotated table elements in the neighbor groups. (Or do we? This could be an explicit, commented 'toggle' in the code.)
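
A minimal sketch of how such a toggle could look. It assumes each element is a dict with the hypothetical keys page_id, annotation (None when unannotated) and is_table; none of these names, nor the function itself, come from generate_dataset.py.

```python
from collections import defaultdict

# Hypothetical toggle: set to True to keep annotated table elements
# in the per-page neighbor groups.
INCLUDE_ANNOTATED_TABLES = False


def group_elements_by_page(elements, include_annotated_tables=INCLUDE_ANNOTATED_TABLES):
    """Group all elements by page, annotated and unannotated alike.

    Assumed (hypothetical) element schema:
      page_id    -- identifier of the page the element came from
      annotation -- label string, or None if the element is unannotated
      is_table   -- True if the element is a table
    """
    pages = defaultdict(list)
    for el in elements:
        is_annotated = el["annotation"] is not None
        # Skip annotated tables unless the toggle says to keep them.
        if is_annotated and el["is_table"] and not include_annotated_tables:
            continue
        pages[el["page_id"]].append(el)
    return dict(pages)
```

With the toggle off, a page whose only elements are annotated tables ends up with no group at all, so it never reaches the output.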
We will need downstream filtering that deletes or ignores pages containing nothing but annotated tables and/or unannotated elements, i.e. pages with no annotated non-table element.
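
The downstream filter could be sketched roughly as follows. It assumes pages is a dict mapping a page id to its list of element dicts, with the hypothetical keys annotation (None when unannotated) and is_table; the function names are illustrative only.

```python
def has_useful_annotation(page_elements):
    """True if the page has at least one annotated element that is not a table."""
    return any(
        el["annotation"] is not None and not el["is_table"]
        for el in page_elements
    )


def drop_uninformative_pages(pages):
    """Keep only pages with at least one annotated non-table element.

    Pages made up solely of annotated tables and/or unannotated
    elements are dropped.
    """
    return {
        page_id: page_elements
        for page_id, page_elements in pages.items()
        if has_useful_annotation(page_elements)
    }
```

Keeping this as a separate step, rather than folding it into the grouping, means the table toggle can be flipped without changing which pages count as useful.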
The data in its current state is really clean and impressive!