Essay quality prediction

The objective is to create a model predicting the grade of an essay.

INSTALLATION

Our project was built in an environment using the packages listed in the text file 'requirement.txt'. They can be installed from the command line with: pip install -r requirement.txt

DATA DESCRIPTION

essay_id: A unique identifier for each individual student essay

essay_set: 1-8, an id for each set of essays

essay: The ASCII text of a student's response

rater1_domain1: Rater 1's domain 1 score; all essays have this

rater2_domain1: Rater 2's domain 1 score; all essays have this

rater3_domain1: Rater 3's domain 1 score; only some essays in set 8 have this.

domain1_score: Resolved score between the raters; all essays have this

rater1_domain2: Rater 1's domain 2 score; only essays in set 2 have this

rater2_domain2: Rater 2's domain 2 score; only essays in set 2 have this

domain2_score: Resolved score between the raters; only essays in set 2 have this

rater1_trait1 through rater3_trait6: Trait scores for sets 7-8
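
As a minimal sketch of how these columns might be read, the snippet below parses a tiny tab-separated sample with Python's standard csv module. The sample rows, the tab delimiter, and the helper names are assumptions made for illustration, not the project's actual data or loading code; a score column that a set does not use is simply left empty.

```python
import csv
import io

# Made-up sample rows (assumption): placeholders, not real essays or scores.
SAMPLE = (
    "essay_id\tessay_set\tessay\trater1_domain1\trater2_domain1\tdomain1_score\tdomain2_score\n"
    "1\t1\tFirst placeholder essay text\t4\t4\t8\t\n"
    "2\t2\tSecond placeholder essay text\t3\t4\t4\t3\n"
)

SCORE_COLUMNS = ("rater1_domain1", "rater2_domain1", "domain1_score", "domain2_score")

def to_int_or_none(value):
    """Convert a score cell to int; empty cells (score not available) become None."""
    return int(value) if value.strip() else None

def load_rows(text):
    """Parse tab-separated text into a list of dicts with typed id and score columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for raw in reader:
        row = dict(raw)
        row["essay_id"] = int(raw["essay_id"])
        row["essay_set"] = int(raw["essay_set"])
        for col in SCORE_COLUMNS:
            row[col] = to_int_or_none(raw[col])
        rows.append(row)
    return rows

rows = load_rows(SAMPLE)
# Essays that carry a domain 2 score (only set 2 in the description above).
with_domain2 = [r["essay_id"] for r in rows if r["domain2_score"] is not None]
print(with_domain2)
```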

FUNCTIONALITY

We approached this project by first correcting the imbalances in the dataset, using random undersampling and SMOTE.
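
The two rebalancing ideas can be sketched in plain Python as follows. This is a simplified illustration, not the project's code: the function names and toy data are hypothetical, treating each grade as a class label is an assumption, and a library implementation (for example imbalanced-learn) would normally be used in practice.

```python
import random
from collections import defaultdict

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every label keeps as many items as the rarest label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append(x)
    target = min(len(items) for items in by_label.values())
    out_x, out_y = [], []
    for y, items in by_label.items():
        for x in rng.sample(items, target):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours (the core idea of SMOTE).
    Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, sorted by squared Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy data (assumption): 2-D feature vectors where the label "high" is rare.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [1.0, 1.0], [1.2, 0.9]]
y = ["low", "low", "low", "low", "high", "high"]

X_under, y_under = random_undersample(X, y)
minority = [x for x, label in zip(X, y) if label == "high"]
X_new = smote_like(minority, n_new=2, k=1)
print(len(X_under), y_under.count("low"), y_under.count("high"), len(X_new))
```

Undersampling shrinks the frequent class, while the SMOTE-style step adds synthetic points that lie on the segment between two existing rare samples.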

We also encoded our dataset with a TF-IDF encoder, a target encoder, category codes (cat.codes), and a count vectorizer (CountVectorizer).
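
To make these four encodings concrete, here is a small pure-Python sketch. It is an illustration under assumptions rather than the project's pipeline: the helper names and toy inputs are hypothetical, whitespace tokenisation is a simplification, and the smoothed IDF formula is only one common variant (libraries differ in the details).

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of all distinct lowercase tokens across the documents."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def count_vector(doc, vocab):
    """Bag-of-words counts: one integer per vocabulary term."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]

def tfidf_vectors(docs, vocab):
    """Raw term counts times a smoothed inverse document frequency,
    idf = ln((1 + n) / (1 + df)) + 1, without any length normalisation."""
    n = len(docs)
    counts = [count_vector(doc, vocab) for doc in docs]
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    return [[c * w for c, w in zip(row, idf)] for row in counts]

def category_codes(values):
    """Map each distinct category to an integer code, in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def target_encode(values, targets):
    """Replace each category by the mean target observed for that category."""
    sums, counts = Counter(), Counter()
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]

# Toy inputs (assumption).
docs = ["the cat sat", "the dog sat down", "a cat"]
vocab = build_vocabulary(docs)
counts_doc1 = count_vector(docs[1], vocab)
tfidf = tfidf_vectors(docs, vocab)
codes = category_codes(["set2", "set1", "set2"])
encoded = target_encode(["set2", "set1", "set2"], [4.0, 8.0, 2.0])
print(vocab, counts_doc1, codes, encoded)
```

In the TF-IDF output, a word that appears in a single document ("dog") receives a larger weight than one shared by several documents ("the"). Target encoding should be computed on training data only, since it uses the target values.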

We then built our model with the XGBoost regressor (XGBRegressor).

Results: 65% accuracy, a mean squared error (MSE) of 0.02, and an R² of 0.65.

Before feature engineering, our XGBoost regressor reached an accuracy of 55%.

After feature engineering, accuracy rose from 55% to 65% and the mean squared error decreased, meaning that even when the model predicts a wrong grade, its prediction is closer to the correct one than before. R² rose from 0.53 to 0.65, meaning that adding our features as variables helped the model predict the target.
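
For reference, the three metrics quoted above can be computed as in the sketch below. How the project derived an accuracy from a regressor is not stated, so counting a prediction as correct when it equals the true grade after rounding is an assumption, as are the function names and toy values.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rounded_accuracy(y_true, y_pred):
    """Share of predictions equal to the true grade after rounding (assumed definition)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if round(p) == t)
    return hits / len(y_true)

# Toy values (assumption), not the project's predictions.
y_true = [1, 2, 3, 4]
y_pred = [1.2, 2.4, 2.4, 4.1]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
acc = rounded_accuracy(y_true, y_pred)
print(round(mse, 4), round(r2, 4), acc)
```

A lower MSE with the same accuracy would mean the wrong predictions moved closer to the true grades, which is the effect described above.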

ENHANCEMENT

Future steps to improve our performance score could include:

  • Improving our features

Our R² is not optimal, which shows that our features explain the target only partially. For a better result, one should create features exploring other aspects of the text, such as readability, sentence structure, content analysis, theme similarity, etc.
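
As a hedged illustration of what such features might look like, the sketch below computes a few simple surface statistics related to sentence structure and vocabulary. The feature names and the regular expressions are hypothetical choices, not features the project actually used.

```python
import re

def text_features(essay):
    """Compute a few simple, hypothetical surface features of an essay."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay.lower())
    n_words = len(words)
    n_sentences = len(sentences)
    return {
        "word_count": n_words,
        "sentence_count": n_sentences,
        "avg_sentence_length": n_words / n_sentences if n_sentences else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Distinct words divided by total words: a rough vocabulary-richness signal.
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }

features = text_features("Computers help people. Do they help everyone? I think so!")
print(features)
```

Each value can be added as an extra numeric column next to the encoded text before training the model.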

  • Changing our model hyperparameters

Modifying hyperparameters such as the number of trees, scale_pos_weight, or the maximum depth could help the model learn more accurately from the training data.

Increasing the number of trees and the maximum depth could help the model capture existing, non-obvious patterns in the data.

Changing scale_pos_weight, a parameter for handling imbalanced datasets (in XGBoost it balances the positive and negative classes), could have given more weight to the essay_set grades that were less represented than others.
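
A possible starting point for such a search is sketched below: it expands a few candidate values into parameter combinations and computes per-sample weights that favour rare grades. The candidate values and helper names are illustrative assumptions, not tuned settings, and the per-sample weights are shown as one alternative way to emphasise rare grades with a regressor.

```python
from collections import Counter
from itertools import product

def inverse_frequency_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its label),
    so that rare grades count more during training."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]

def parameter_grid(options):
    """Expand a dict of candidate values into a list of parameter dicts."""
    names = list(options)
    combos = product(*(options[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

# Candidate values are illustrative assumptions.
grid = parameter_grid({"n_estimators": [200, 500], "max_depth": [4, 6, 8]})
weights = inverse_frequency_weights([1, 1, 1, 2])
print(len(grid), grid[0], weights)
```

Each dictionary in the grid could then be passed to XGBoost's scikit-learn interface, for example XGBRegressor(**params), and the weights to fit(X, y, sample_weight=weights); that step is not run here.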

  • Changing the model entirely

We used the model that seemed the most appropriate and the easiest for everyone to understand, but there might exist better and more complex models for our prediction task.

REFERENCES

Below is a list of resources used during this project, especially for dealing with imbalance and choosing a model:

https://ichi.pro/fr/standardiser-ou-normaliser-exemples-en-python-250626184732156

https://www.geeksforgeeks.org/stratified-sampling-in-pandas/

https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html

https://machinelearningmastery.com/xgboost-for-imbalanced-classification/

https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e

https://www.youtube.com/watch?v=irHhDMbw3xo&ab_channel=DataSchool (Title: How do I encode categorical features using scikit-learn?)

Contributors

karl0706
