yte9pc / internet-archive-pdf-capstone
UVA Data Science Capstone project for the Internet Archive. This project aimed to classify PDFs as research or non-research documents using image-based and text-based approaches. For the image-based models, we leveraged CNN transfer learning; for the text-based approach, we used XGBoost.