casebasedreasoning's Introduction

Missing Values

Case-based reasoning (CBR) is the process of solving new problems based on the solutions of similar past problems [1]. In data science, CBR can be used to fill in missing values in a given set of data. Each record (case) is described by its own set of characteristics. If a record is missing one or more characteristics, we can find similar records using Euclidean distance and predict the missing characteristics from them.
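The idea can be sketched in a few lines of pandas and numpy. This is a minimal illustration, not the code in FindMissingValues.py: the function name predict_missing, the parameter k, the min-max scaling step and the toy data are all assumptions made for this example, and the sketch assumes that the feature columns themselves contain no missing values.

```python
import numpy as np
import pandas as pd


def predict_missing(df: pd.DataFrame, target: str, k: int = 3) -> pd.Series:
    """Predict missing values of `target` as the mean of the k nearest complete cases."""
    features = [c for c in df.columns if c != target]

    # Convert everything (including booleans) to float and scale each feature
    # to [0, 1] so that no single column dominates the distance.
    x = df[features].astype(float)
    lo = x.min()
    span = (x.max() - lo).replace(0, 1)  # avoid division by zero for constant columns
    x = (x - lo) / span

    known_mask = df[target].notna()
    known_x = x[known_mask].to_numpy()
    known_y = df.loc[known_mask, target].to_numpy(dtype=float)

    predictions = {}
    for idx, row in x[~known_mask].iterrows():
        # Euclidean distance from this incomplete case to every complete case.
        dist = np.sqrt(((known_x - row.to_numpy()) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        predictions[idx] = known_y[nearest].mean()
    return pd.Series(predictions, name=target, dtype=float)


# Hypothetical toy cases; the last one has no age.
cases = pd.DataFrame({
    "customerAge": [20, 22, 60, 62, np.nan],
    "smsCountPerMonth": [300, 320, 10, 20, 310],
    "youtubeStream": [True, True, False, False, True],
})
result = predict_missing(cases, "customerAge", k=2)
print(result)
```

The incomplete case sits closest to the two young, SMS-heavy customers, so its predicted age is the mean of their ages.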

This repository contains two Python scripts: one finds missing values, and the other calculates which characteristics (columns) influence the missing value the most. It also includes already generated data (including missing values) and statistics collected from users of a telecommunication company. Using that data we can train a model and check the correctness of the results.

Getting started

Requirements

To run both scripts you should have Python version 3.x and the following modules installed:

  • pandas
  • numpy

All modules are available through pip:

pip install pandas

This command also installs the numpy module, because pandas depends on it.

Running

This git repository consists of the following files:

  • FindMissingValues.py
  • CalculateIV.py
  • telecom.csv
  • telecomStats.txt

The file telecom.csv contains all data from the telecommunication company in CSV format, with the following columns:

  • customerId : int
  • customerAge : int
  • customerPlansChanged : int
  • smsCountPerMonth : int
  • callMinutePerMonth : int
  • dataMBPerMonth : int
  • netflixStream : boolean
  • pickboxStream : boolean
  • youtubeStream : boolean
  • hboGoStream : boolean
  • viberFree : boolean
  • whatsappFree : boolean

The file has 4 000 rows and 12 columns. Six of those rows have missing values: 3 rows with customerAge missing and 3 rows with customerPlansChanged missing.
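A quick way to check such counts is to load the data with pandas and look for empty cells. The sketch below is an illustration only: the helper names rows_with_missing and missing_per_column are made up for this example, the three sample rows are hypothetical, and it assumes missing values are stored as empty fields. To inspect the real file, replace the in-memory sample with pd.read_csv("telecom.csv").

```python
import io

import pandas as pd


def rows_with_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows that have at least one missing value."""
    return df[df.isna().any(axis=1)]


def missing_per_column(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values for each column that has any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Hypothetical sample using the column names listed above.
sample = io.StringIO(
    "customerId,customerAge,customerPlansChanged,smsCountPerMonth,"
    "callMinutePerMonth,dataMBPerMonth,netflixStream,pickboxStream,"
    "youtubeStream,hboGoStream,viberFree,whatsappFree\n"
    "1,25,1,100,200,3000,True,False,True,False,True,True\n"
    "2,,2,50,120,1500,False,False,True,False,False,True\n"
    "3,41,,10,300,500,False,False,False,False,False,False\n"
)
df = pd.read_csv(sample)

print(rows_with_missing(df)[["customerId", "customerAge", "customerPlansChanged"]])
print(missing_per_column(df))
```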

If we run the script FindMissingValues.py with the following command:

python3 FindMissingValues.py

we get the following results:

customerId  predictedValue  realValue
3998        30              33
3999        70              48
4000        25              28
3995        0               0
3996        3               3
3997        4               5

As we can see, the model predicts values close to the real ones, except for people older than 40 (for example, customer 3999: predicted 70, real 48). Those people usually do not use any kind of data, SMS, or call plans, so they are not our target population for presenting new tariffs. It should also be mentioned that the data used for prediction is quite random, so edge cases (e.g. people older than 40) are not well covered.
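To make "close" concrete, the absolute error of each prediction can be computed directly from the results listed above. This is only a small sketch built on the six printed rows; the column name absError is introduced here for illustration and is not produced by the script.

```python
import pandas as pd

# The six rows printed by FindMissingValues.py, copied from the output above.
results = pd.DataFrame({
    "customerId": [3998, 3999, 4000, 3995, 3996, 3997],
    "predictedValue": [30, 70, 25, 0, 3, 4],
    "realValue": [33, 48, 28, 0, 3, 5],
})

# Absolute difference between the prediction and the real value, per row.
results["absError"] = (results["predictedValue"] - results["realValue"]).abs()

# The row with the largest error.
worst = results.loc[results["absError"].idxmax()]

print(results)
print("Largest error: customer", worst["customerId"], "off by", worst["absError"])
```

Five of the six predictions are within 3 of the real value; the outlier is customer 3999, which is off by 22.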

If we run CalculateIV.py, we get information about which columns influence the missing value (in this case, customer age is the column that contains missing values):

(14, 18):
#Positive influence:
customerPlansChanged = (2.0, 3.0)
#Opposite influence:
customerPlansChanged = (3.0, 4.0)
netflixStream = True
pickboxStream = True
youtubeStream = True
hboGoStream = True

(18, 28):
#Positive influence:
youtubeStream = True
#There is no opposite influence
...

From this partial output you can see which variables influence the missing value. For example, if a user has changed his/her tariff 2 or 3 times (positive influence) and does not stream Netflix, Pickbox, YouTube or HBO GO (opposite influence), then he/she is probably between 14 and 18 years old.
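Read as a rule, that interpretation can be written as a simple check. The sketch below only encodes the sentence above and is not part of CalculateIV.py: the function name matches_14_18_profile and the sample customers are hypothetical, and treating the interval (2.0, 3.0) as "2 or 3 changes, both included" is an assumption.

```python
STREAM_COLUMNS = ("netflixStream", "pickboxStream", "youtubeStream", "hboGoStream")


def matches_14_18_profile(customer: dict) -> bool:
    """Check whether a customer fits the (14, 18) age profile described above."""
    # Positive influence: the tariff was changed 2 or 3 times (assumed inclusive).
    changed_2_or_3 = 2 <= customer["customerPlansChanged"] <= 3
    # Opposite influence: none of the streaming options may be enabled.
    no_streams = not any(customer[column] for column in STREAM_COLUMNS)
    return changed_2_or_3 and no_streams


# Hypothetical customers for illustration.
teen_like = {
    "customerPlansChanged": 2,
    "netflixStream": False,
    "pickboxStream": False,
    "youtubeStream": False,
    "hboGoStream": False,
}
streamer = dict(teen_like, netflixStream=True)

print(matches_14_18_profile(teen_like))
print(matches_14_18_profile(streamer))
```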

Notice

This is a simple algorithm for finding missing values, and it has not been tested on real-world data or applications. Do not use it in production before double-checking that everything works as expected.

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

[1] https://en.wikipedia.org/wiki/Case-based_reasoning

casebasedreasoning's People

Contributors

sanjinkurelic
