This is the code to train the model: a token-classification model with roberta-base (LB: 0.631) as the backbone. I also tried longformer-base (LB: 0.631). I used gradient accumulation to reach an effective batch size of 4 within the RTX 3070 Ti's memory.
I also experimented with creating CV folds based on topics extracted with LDA, reasoning that documents on similar topics would be structurally similar. Unfortunately, this hurt the model's overall performance. The LDA hyperparameters were not tuned; I picked values that produced topic clusters resembling those other users had shared on the forums.
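One way to implement this idea is to assign each document its dominant LDA topic and then distribute documents round-robin within each topic so every fold sees a similar topic mix. A hedged sketch with scikit-learn (the toy corpus, `n_components`, and fold count are placeholders, not the values used in the actual experiment):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "bake the bread in a hot oven",
    "knead the dough and let it rise",
    "simmer the sauce on low heat",
    "season the soup with fresh herbs",
    "compile the code and run the tests",
    "debug the function and fix the loop",
    "refactor the module into classes",
    "profile the script to find slow code",
]

# Fit LDA on raw term counts and take each document's dominant topic.
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(X).argmax(axis=1)

# Assign folds round-robin within each topic, so folds share a similar
# topic distribution.
n_folds = 2
folds = np.zeros(len(docs), dtype=int)
for t in np.unique(topics):
    idx = np.where(topics == t)[0]
    for i, j in enumerate(idx):
        folds[j] = i % n_folds
```

`folds` can then drive any standard train/validation split; the round-robin assignment avoids the empty-class failures a stratified splitter can hit when a topic ends up with very few documents.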