This project develops machine learning models to distinguish AI-generated text from human-written text.
With the rise of AI-generated content, there is a growing need to distinguish between text generated by artificial intelligence systems and text authored by humans. This project addresses this challenge by leveraging machine learning techniques and feature engineering to classify text samples as either AI-generated or human-generated.
Four classifiers were trained and evaluated for essay classification (a minimal training sketch follows the list):
• Naive Bayes
• Logistic Regression
• Random Forest
• DistilBERT
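A minimal training sketch for the three classical models, assuming scikit-learn, TF-IDF features, and a CSV with `text` and `label` columns; the file and column names are illustrative, and DistilBERT, as a transformer model, would be fine-tuned separately (not shown).

```python
# Sketch only: trains the three classical classifiers on TF-IDF features.
# The CSV file name and the `text`/`label` column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("train_essays.csv")                      # illustrative file name
X = TfidfVectorizer(max_features=20000).fit_transform(df["text"])
y = df["label"]                                           # 0 = human, 1 = AI-generated

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```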
Feature engineering techniques were employed to augment the dataset and enhance classification performance. Three key features were engineered (see the computation sketch after the list):
• Text Length
• Lexical Diversity
• Flesch Reading Ease
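A minimal sketch of how these features could be computed with pandas and the `textstat` package; the exact implementation, column names, and the type-token-ratio definition of lexical diversity are assumptions, not taken from the notebooks.

```python
# Sketch only: computes the three engineered features for a text column.
import pandas as pd
import textstat

def lexical_diversity(text: str) -> float:
    # Type-token ratio: unique words divided by total words (assumed definition).
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

df = pd.DataFrame({"text": ["A short example essay.", "Another sample of writing."]})
df["text_length"] = df["text"].str.split().str.len()              # word count
df["lexical_diversity"] = df["text"].apply(lexical_diversity)
df["flesch_reading_ease"] = df["text"].apply(textstat.flesch_reading_ease)
print(df)
```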
Evaluation metrics such as precision, recall, F1-score, and accuracy were used to assess the performance of each classifier.
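A possible evaluation step, continuing from the training sketch above; scikit-learn's `classification_report` covers precision, recall, and F1 per class, with accuracy reported separately.

```python
# Sketch only: assumes `models`, `X_train`, `X_test`, `y_train`, `y_test` from the training sketch.
from sklearn.metrics import accuracy_score, classification_report

for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(name, "accuracy:", round(accuracy_score(y_test, y_pred), 3))
    print(classification_report(y_test, y_pred, digits=3))
```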
Datasets:
• https://www.kaggle.com/competitions/llm-detect-ai-generated-text (Report #1)
• https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset (Report #2)
Three experimental setups were used (a cross-validation sketch follows the list):
• Sample size of 10,000 with k-fold cross-validation, implemented in the notebook "ML models with 10000 sample".
• Sample size of 3,000 with k-fold cross-validation (Naive Bayes, Logistic Regression, Random Forest), implemented in the notebook "ML models with 3000 sample".
• Sample size of 3,000 without k-fold cross-validation (Naive Bayes, Logistic Regression, Random Forest, and DistilBERT), implemented in the notebook "ML models with 3000 sample (without K-fold)".
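A k-fold cross-validation sketch for the classical models, reusing `X`, `y`, and `models` from the training sketch; the fold count and scoring metric are illustrative, since the exact settings are not stated here.

```python
# Sketch only: k-fold cross-validation over the classical models.
from sklearn.model_selection import cross_val_score

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```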
Note: The notebooks take several hours to run, so their outputs have been saved in the notebooks for initial review.