Introduction:

This project was done as part of the Machine Learning Course 2021 at Ariel University.

Finding a job is quite difficult today. Students and other job seekers search for work online through job ads and posts, and most of them will not notice whether a posting is real or fake. A real posting is fine, but a fake one can lead to a phishing site or expose the applicant's sensitive information.

With the help of natural language processing, it is possible to build a classifier that distinguishes between fake and real jobs.

Prerequisites

numpy v1.20.1
pandas v1.2.1
scikit-learn v0.24.1
mlxtend v0.18.0
matplotlib v3.3.4
spacy v3.0.1
seaborn v0.11.1

Data:

The data comes from Kaggle: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
The data set contains about 18,000 job descriptions, of which about 800 are fake. Using this data set, it is possible to train models that classify postings as fake or real.

Fake jobs are tagged as 1 and real jobs as 0 in the 'fraudulent' column. The information in the other columns serves as the features.

The following is a list of columns from the data set:

['job_id', 'title', 'location', 'department', 'salary_range',
'company_profile', 'description', 'requirements', 'benefits',
'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
'required_experience', 'required_education', 'industry', 'function', 'fraudulent']

Data cleaning:

The cleaning process included:

Replacing NaN values with a number.
Deleting duplicate rows.
Deleting columns whose values are distributed almost identically in fake and real jobs, since they do little to separate the two classes.
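
The cleaning steps above can be sketched with pandas as follows. This is only a minimal illustration on a made-up frame, not the project's actual code: the placeholder value 0 for NaN and the choice of 'telecommuting' as the dropped column are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kaggle CSV; all values are made up.
df = pd.DataFrame({
    "title": ["Data Analyst", "Data Analyst", "Home Typist"],
    "department": ["IT", "IT", np.nan],
    "telecommuting": [0, 0, 1],
    "fraudulent": [0, 0, 1],
})

# 1. Replace missing values with a number (0 is an assumed placeholder).
df = df.fillna(0)

# 2. Delete duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 3. Drop a column judged uninformative (assumed choice for this sketch).
df = df.drop(columns=["telecommuting"])

print(df)
```

On the real data the frame would be loaded from the downloaded CSV instead of being built by hand.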

Some conclusions about the data set:

The 'function' and 'department' columns are identical, so we will remove one of them.
Fake job postings mainly advertised full-time positions with very minimal requirements.
The values of the 'telecommuting' column are split in the same proportions as the real and fake jobs overall, so this column will not help us either.
After creating a word cloud, we found that fake and real postings had similar content, but the real ones were more job-specific.

Frequency of Words for Genuine / Fake applications:

The following pictures show how often different words occur in real posts compared with fake posts:

In 'company_profile', fake jobs contain fewer words than real jobs.
In 'description', 'requirements' and 'benefits', the word distributions in the three graphs are roughly the same, but the distribution in real posts is more focused.
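
A per-class word count of the kind plotted above could be computed roughly like this. The postings are made up, the tokenization is a plain whitespace split, and the helper name word_counts is hypothetical; the project may well have used spaCy or another tokenizer instead.

```python
from collections import Counter

# Made-up postings for illustration: (text, fraudulent label).
posts = [
    ("we are a software company hiring a python developer", 0),
    ("python developer with testing experience", 0),
    ("earn money from home no experience needed", 1),
]

def word_counts(posts, label):
    """Count lower-cased, whitespace-separated words in posts with the given label."""
    counts = Counter()
    for text, y in posts:
        if y == label:
            counts.update(text.lower().split())
    return counts

real_counts = word_counts(posts, 0)
fake_counts = word_counts(posts, 1)

print(real_counts.most_common(3))
print(fake_counts.most_common(3))
```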

Distribution over countries

Real jobs come from many different countries, while fake jobs are concentrated in the United States.
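
One way to obtain such a per-country breakdown is to take the country code from the 'location' field and count it per class. The sketch below assumes a "COUNTRY, STATE, CITY" layout for that field and uses made-up rows; the helper name country_counts is hypothetical.

```python
from collections import Counter

# Made-up rows: (location, fraudulent label).
# The "COUNTRY, STATE, CITY" layout is an assumption of this sketch.
rows = [
    ("US, NY, New York", 0),
    ("GB, LND, London", 0),
    ("US, TX, Houston", 1),
    ("US, CA, Los Angeles", 1),
]

def country_counts(rows, label):
    """Count postings per country code for the given class label."""
    counts = Counter()
    for location, y in rows:
        if y == label:
            country = location.split(",")[0].strip()
            counts[country] += 1
    return counts

real_by_country = country_counts(rows, 0)
fake_by_country = country_counts(rows, 1)

print(real_by_country)
print(fake_by_country)
```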

WordCloud

Using WordCloud, we can see a visual representation of the distribution of words.

Genuine Cloud

Fraud Cloud

Imbalanced Data

Our data is heavily imbalanced, as the following figure shows:

We used RandomOverSampler to address this issue.
from imblearn.over_sampling import RandomOverSampler
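
To make the idea concrete, here is a plain-Python illustration of random over-sampling: samples from the smaller class are duplicated at random until every class is as large as the biggest one. This is not imblearn's implementation, only a sketch of the technique; the function name random_oversample and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # All original samples that belong to this class.
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

# Toy data: four real postings (0) and one fake posting (1).
X = ["real a", "real b", "real c", "real d", "fake a"]
y = [0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Over-sampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.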

Results

The overall results are quite impressive. The imbalanced data caused a lot of issues, including overfitting.
We used a TF-IDF vectorizer to turn the text into features and feed it to our models.
from sklearn.feature_extraction.text import TfidfVectorizer
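
As a small illustration of what the vectorizer does, the sketch below fits TfidfVectorizer with its default settings on two made-up postings. The project's actual parameters and input columns are not stated here, so treat these as assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up postings for illustration.
texts = [
    "python developer with testing experience",
    "earn money from home no experience needed",
]

vectorizer = TfidfVectorizer()       # default settings, assumed for this sketch
X = vectorizer.fit_transform(texts)  # sparse matrix, one row per posting

vocab = vectorizer.vocabulary_       # maps each word to its column index
print(X.shape)
print(sorted(vocab))
```

Words that appear in every posting, such as "experience" here, receive a lower weight than words that are specific to one posting, which is what makes the representation useful for telling the classes apart.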

Comparisons

Overall scores

Individual Training Scores
