classifying-fake-and-real-job

Introduction:

This project was done as part of the 2021 Machine Learning course at Ariel University.

Finding a job today is quite difficult. Students and other job seekers look for work online in various job ads and posts, and most of them will not notice whether a posting is fake or real. A real posting is excellent, but a fake one can lead to a phishing site or cause sensitive information to be exposed.

With the help of advanced natural language processing, it is possible to build a classifier that distinguishes between fake and real jobs.

Prerequisites

numpy v1.20.1
pandas v1.2.1
scikit-learn v0.24.1
mlxtend v0.18.0
matplotlib v3.3.4
spacy v3.0.1
seaborn v0.11.1

Data:

The data comes from Kaggle: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
The data set contains about 18,000 job descriptions, of which about 800 are fake jobs. Using this data set, it is possible to create models that will learn and classify the fake jobs and the real ones.

All the fake jobs are tagged as 1 and all the real jobs are tagged as 0 under the 'fraudulent' column. The information in the other columns serves as the features.

The following is a list of columns from the data set:

['job_id', 'title', 'location', 'department', 'salary_range',
'company_profile', 'description', 'requirements', 'benefits',
'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
'required_experience', 'required_education', 'industry', 'function','fraudulent']
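As a minimal sketch of the labelling described above, the 'fraudulent' column can be separated from the feature columns with pandas. The helper name `split_features_and_label` and the three sample rows are made up for illustration; in the project the frame would come from the Kaggle CSV.

```python
import pandas as pd

LABEL_COLUMN = "fraudulent"  # 1 = fake posting, 0 = real posting


def split_features_and_label(df: pd.DataFrame):
    """Return (features, label), where label is the 'fraudulent' column."""
    features = df.drop(columns=[LABEL_COLUMN])
    label = df[LABEL_COLUMN]
    return features, label


# Tiny made-up frame standing in for the real data set (illustrative rows only).
sample = pd.DataFrame(
    {
        "job_id": [1, 2, 3],
        "title": ["Data Analyst", "Work From Home", "Backend Developer"],
        "telecommuting": [0, 1, 0],
        "fraudulent": [0, 1, 0],
    }
)

X, y = split_features_and_label(sample)
print(list(X.columns))             # feature column names
print(y.value_counts().to_dict())  # number of postings per class
```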

Data cleaning:

The cleaning process included:

  • Replacing NaN values with a number.
  • Deleting duplicate rows.
  • Deleting columns whose values are distributed almost identically in the fake jobs and the real jobs, since they will not help us much.
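The three cleaning steps above could look roughly like this in pandas. This is a sketch under assumptions: the fill value 0, the helper name `clean_postings`, and the choice of which column to drop are ours, and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def clean_postings(df, fill_value=0, drop_columns=()):
    """Fill NaNs, drop duplicate rows, and drop uninformative columns."""
    cleaned = df.fillna(fill_value)        # step 1: NaN -> number (0 is an assumption)
    cleaned = cleaned.drop_duplicates()    # step 2: remove exact duplicate rows
    cleaned = cleaned.drop(columns=list(drop_columns))  # step 3: drop chosen columns
    return cleaned.reset_index(drop=True)


# Made-up rows: the first two are duplicates and contain a NaN.
raw = pd.DataFrame(
    {
        "title": ["Clerk", "Clerk", "Engineer"],
        "has_company_logo": [np.nan, np.nan, 1.0],
        "telecommuting": [0, 0, 1],
        "fraudulent": [1, 1, 0],
    }
)

cleaned = clean_postings(raw, fill_value=0, drop_columns=["telecommuting"])
print(cleaned)
```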

Some conclusions about the data set:

  • The 'function' and 'department' columns are identical, so we will remove one of them.
  • Fake job postings were mainly aimed at full-time positions with very minimal requirements.
  • The values in the 'telecommuting' column are split in the same proportions as the real and fake jobs overall, so this column will not help us either.
  • After creating a word cloud, it was found that the real and fake postings had similar content, but the real ones were more job-specific.
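One way to check whether a column such as 'telecommuting' separates the two classes is to compare its value shares per class. The sketch below uses `pandas.crosstab` on made-up data; the helper name `share_by_class` is hypothetical and this is not necessarily how the comparison was done in the project.

```python
import pandas as pd


def share_by_class(df, column, label="fraudulent"):
    """For each class, return the share of each value found in `column`."""
    return pd.crosstab(df[label], df[column], normalize="index")


# Made-up data in which both classes have the same telecommuting share.
toy = pd.DataFrame(
    {
        "telecommuting": [0, 0, 0, 1, 0, 0, 0, 1],
        "fraudulent": [0, 0, 0, 0, 1, 1, 1, 1],
    }
)

shares = share_by_class(toy, "telecommuting")
print(shares)
```

When the rows of the resulting table are (nearly) identical, the column carries little information about the label and is a candidate for removal.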

Frequency of Words for Genuine / Fake applications:

The following pictures show how often different words occur in real posts compared with fake posts:

  • In 'company_profile', fake jobs contain fewer words than real jobs.
    picture
  • In 'description', 'requirements' and 'benefits', the distribution of words is relatively similar across the three graphs, but the distribution in real posts is more focused.
    picture picture picture
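A simple way to obtain word counts like those plotted above is to tokenize one text column per class and count the tokens. The sketch below is an assumption about the approach, not the project's code: the regular expression, the helper name `word_counts`, and the sample texts are ours.

```python
import re
from collections import Counter

import pandas as pd

# Lowercase letters and apostrophes only; a deliberately simple tokenizer.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def word_counts(texts):
    """Count word occurrences over an iterable of text values."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(str(text).lower()))
    return counts


# Made-up postings standing in for the 'description' column.
toy = pd.DataFrame(
    {
        "description": [
            "Develop and maintain backend services",
            "Earn money fast from home",
            "Maintain backend data pipelines",
        ],
        "fraudulent": [0, 1, 0],
    }
)

real_counts = word_counts(toy.loc[toy["fraudulent"] == 0, "description"])
fake_counts = word_counts(toy.loc[toy["fraudulent"] == 1, "description"])
print(real_counts.most_common(3))
print(fake_counts.most_common(3))
```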

Distribution over countries

  • The real jobs come from all sorts of countries, but the fake jobs are more concentrated in the United States.
    picture
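A per-country count can be sketched as follows, assuming the 'location' values look like "US, NY, New York" with the country code before the first comma. That format, the helper name `country_of`, and the sample rows are assumptions made for illustration.

```python
import pandas as pd


def country_of(location):
    """Return the text before the first comma, or 'unknown' if it is missing."""
    if not isinstance(location, str) or not location.strip():
        return "unknown"
    return location.split(",")[0].strip() or "unknown"


# Made-up rows; the last one has no location.
toy = pd.DataFrame(
    {
        "location": ["US, NY, New York", "GB, LND, London", "US, TX, Austin", None],
        "fraudulent": [1, 0, 1, 0],
    }
)

toy["country"] = toy["location"].apply(country_of)
per_country = toy.groupby(["fraudulent", "country"]).size()
print(per_country)
```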

WordCloud

Using WordCloud we can see a visual representation of the distribution of words.

Genuine Cloud

picture

Fraud Cloud

picture

Imbalanced Data

We noticed that our data is heavily imbalanced, as the following figure shows:
picture

  • We used RandomOverSampler to solve this issue.
    from imblearn.over_sampling import RandomOverSampler

Results

Overall results are quite impressive. The unbalanced data caused a lot of issues and overfitting problems.
We used a TF-IDF vectorizer to turn our text into features and feed it to our models.
from sklearn.feature_extraction.text import TfidfVectorizer
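The sketch below shows how `TfidfVectorizer` can feed a classifier through a scikit-learn pipeline. The README does not name the models used, so `LogisticRegression` is only a stand-in; `stop_words="english"` and the four sample texts are likewise assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up postings: label 0 = real, label 1 = fake.
texts = [
    "develop backend services with the engineering team",
    "maintain data pipelines and write unit tests",
    "earn money fast from home no experience needed",
    "send your bank details to receive easy money",
]
labels = [0, 0, 1, 1]

# The vectorizer turns each text into a TF-IDF weighted vector,
# which the classifier then receives as its input features.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(texts, labels)

tfidf = model.named_steps["tfidfvectorizer"]
print(len(tfidf.vocabulary_))                      # size of the learned vocabulary
print(model.predict(["easy money from home"]))     # predicted label for a new text
```

Wrapping both steps in one pipeline means the vocabulary is learned only from the data passed to `fit`, which keeps test texts from influencing the features.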

Comparisons

  • Overall scores

picture

  • Individual Training Scores

picture picture picture picture picture

Authors

Contributors: idoselmo, davidhct
