shakedzy / dython

A set of data tools in Python

Home Page: http://shakedzy.xyz/dython/

License: MIT License

Python 100.00%

data analysis plot correlation python roc modeling

dython's Introduction

Dython

A set of Data analysis tools in pYTHON 3.x.

Dython was designed with analysis usage in mind - meaning ease-of-use, functionality and readability are the core values of this library.

Installation

Dython can be installed directly using pip:

pip install dython

or, via the conda package manager:

conda install -c conda-forge dython

Documentation

Module documentation can be found at shakedzy.xyz/dython. You can also learn more and see examples of the library's main methods in these blog posts.

Contributing

Contributions are always welcome. If you found something you can fix, or have an idea for a new feature, feel free to write it and open a pull request. Please make sure to go over the contribution guidelines.

Citing

Use this reference to cite if you use Dython in a paper:

@software{Zychlinski_dython_2018,
  author = {Zychlinski, Shaked},
  title = {{dython}},
  year = {2018},
  url = {https://github.com/shakedzy/dython},
  doi = {10.5281/zenodo.12698421}
}

dython's Issues

can you share example why boltzmann_sampling is so important to use

Great code, thanks.
Can you share an example of why boltzmann_sampling is so important to use?

A Google search gives nothing useful about boltzmann_sampling for ML.

Please confirm the conditional entropy equation

I think your conditional entropy equation in line 86 of nominal.py is inverted:

        entropy += p_xy * math.log(p_y / p_xy, log_base)

I believe that the equation should be:

        entropy += -1 * p_xy * math.log(p_xy / p_y, log_base)

This Wikipedia page has the equation as p(x,y) / p(x) but note: their equation is S(Y|X) while yours is S(X|Y).

In addition, with your equation, the numbers are positive (because p_y / p_xy > 1, since p_y > p_xy) but should be negative (hence the -1).

Can you please take another look and confirm the equation in case I misunderstood?
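
The two forms quoted in this issue can be compared numerically. The following is a minimal sketch, not dython's code: the probabilities are made-up values, chosen only so that each p_xy is no larger than its p_y.

```python
import math

# Hypothetical (p_xy, p_y) pairs; the joint probabilities sum to 1 and
# each p_y is the sum of the joint probabilities that share that y value.
pairs = [(0.10, 0.40), (0.30, 0.40), (0.25, 0.60), (0.35, 0.60)]

# Form quoted from nominal.py in this issue.
original = sum(p_xy * math.log(p_y / p_xy) for p_xy, p_y in pairs)

# Form proposed in this issue.
proposed = sum(-1 * p_xy * math.log(p_xy / p_y) for p_xy, p_y in pairs)

# log(a / b) == -log(b / a), so the two sums agree and are non-negative.
print(original, proposed)
```

Because log(p_y / p_xy) equals -log(p_xy / p_y), both expressions give the same non-negative value, so the sign is already accounted for in the first form.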

conditional_entropy function

Hello Shaked Zychlinski,

Thank you for your work. Excellent article!
Could you please complete the code of your theils_u function, specifically conditional_entropy and
correlation_ratio?

Thanks.

Specifying the range of correlation displayed on the legend and in the heatmap

Describe the new feature:

For some datasets, all of the observed correlations are positive. It would make things easier to visualize if we were able to specify the correlation range directly through dython. For example, I would like to specify a range of 0 to 1 instead of -1 to 1.

What is the current outcome?

For now, unless I am mistaken, there is a vmin and vmax option in Seaborn's heatmap, but this option does not work directly in dython.

Is it backward-compatible?

I believe it would probably be, but since I have no experience as a Python library developer, I can't confirm.

cramers_v_weighted

The weighted correlation feature for categorical variables is missing in Cramer's V, and this feature helps to give weightage to certain rows.

Adding weightage to certain rows depending on the target variable is a regular process in the retail/sales industry when it comes to correlation and incorporating this feature for categorical correlation is a huge plus.

The current implementation doesn't account for weighted correlation (this could be incorporated into Cramér's V for categorical variables).

This change should be added as a separate function in nominal.py, where the weight variable is passed to crosstab.
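
A rough idea of what such a function could look like is sketched below. It is an assumption for illustration only: cramers_v_weighted is a hypothetical name, it is not part of dython, it builds the weighted contingency table by hand instead of using crosstab, it applies no bias correction, and it assumes all weights are positive.

```python
import math
from collections import defaultdict

def cramers_v_weighted(x, y, weights):
    """Plain (uncorrected) Cramér's V on a weighted contingency table.

    Hypothetical helper, not part of dython: each row adds its weight,
    rather than a count of 1, to the cell for its (x, y) pair.
    """
    table = defaultdict(float)
    row_tot = defaultdict(float)
    col_tot = defaultdict(float)
    for xi, yi, w in zip(x, y, weights):
        table[(xi, yi)] += w
        row_tot[xi] += w
        col_tot[yi] += w
    n = sum(row_tot.values())
    r, k = len(row_tot), len(col_tot)
    if min(r, k) < 2:
        return 0.0  # convention chosen here; V is undefined for one category
    chi2 = 0.0
    for xi in row_tot:
        for yi in col_tot:
            expected = row_tot[xi] * col_tot[yi] / n
            chi2 += (table.get((xi, yi), 0.0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min(r - 1, k - 1)))

x = ["a", "a", "b", "b"]
y = ["u", "u", "v", "u"]
print(cramers_v_weighted(x, y, [1, 1, 1, 1]))  # every row counts equally
print(cramers_v_weighted(x, y, [1, 1, 5, 1]))  # the third row counts five times
```

With unit weights this reduces to the usual count-based table, so the weighted and unweighted results can be compared directly.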

correlation_ratio produces key error in flatten to cat_measures

Hi there,

Thanks for this handy set of tools and the excellent article you posted on towardsdatascience.com.

I have just been playing with the nominal tools and found that correlation_ratio throws an error when I pass it my array.

KeyError: '[ 0 ... 76] not in index

Digging into the function it seems that this is being generated at:
cat_measures = measurements[np.argwhere(fcat == i).flatten()]

I have double checked the array and tested the function with a range of data structures and the same error is returned. (Array consists of 8 continuous variables and 1 categorical). All indexes appear to be the same and carry through the prior steps of the function as you would expect.

I don't suppose you have any insights into why this might be occurring?

Full example of test:

df_dython = df_prep.drop(['Bedrock value'], axis = 1)
cols = df_dython.columns[0:9]

def correlation_ratio(categories, measurements):
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat)+1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(0,cat_num):
        cat_measures = measurements[np.argwhere(fcat == i).flatten()]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(np.multiply(y_avg_array,n_array))/np.sum(n_array)
    numerator = np.sum(np.multiply(n_array,np.power(np.subtract(y_avg_array,y_total_avg),2)))
    denominator = np.sum(np.power(np.subtract(measurements,y_total_avg),2))
    if numerator == 0:
        eta = 0.0
    else:
        eta = numerator/denominator
    return eta

correlation_ratio(df_dython['Bedrock'], df_dython[cols])

Customize heatmap size

Thanks for this amazing categorical correlation matrix using Cramer's V and Theil's U! I'm working in Jupyter Notebook and the heat map is very small and I can't read all the categories. I'm learning on the mushroom dataset and want to plot all features, like you did in your article.

My code:

column_names = ['class','cap-shape','cap-surface','cap-color','bruises?','odor','gill-attachment','gill-spacing','gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-ring','stalk-surface-below-ring','stalk-color-above-ring','stalk-color-below-ring','veil-type','veil-color','ring-number','ring-type','spore-print-color','population','habitat']

import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
df = pd.read_csv(url, header=None, names=column_names)
df.drop(columns=['veil-type'], inplace=True)

from dython import nominal
nominal.associations(df, theil_u=True, nominal_columns='all')

Is there a way to display the heat map larger in Jupyter Notebook? Thanks for creating this!

nan_strategy as 'drop_samples'

Hi,

Thanks for this amazing package. I am trying it on a dataset of 1220*500. I have categorized the columns into 3 types: continuous, ordinal and nominal. There are missing values in the dataset which I don't wish to impute, so it makes sense to use 'drop_samples' for the nan_strategy parameter. However, when I run the associations() method on the dataset, I get the error below.

ValueError: zero-size array to reduction operation maximum which has no identity

Can you please suggest a way to solve this? Also, does it make sense to use this method when I have ordinal variables as well, given that for those we would use Spearman's rank correlation?

Thanks in advance!
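
One possible reading of this error, offered as an assumption rather than a diagnosis: if dropping samples removes every row that has a missing value in any of the 500 columns, no rows may be left, and taking the maximum of an empty array produces exactly this message. The sketch below uses a hypothetical helper, count_complete_rows, to check how many rows would survive.

```python
import math

def count_complete_rows(rows):
    """Count rows that contain no missing value (None or NaN).

    Hypothetical helper for a quick sanity check; `rows` is a list of
    row sequences, e.g. the rows of a table read from a file.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return sum(1 for row in rows if not any(is_missing(v) for v in row))

rows = [
    [1.0, "red", None],
    [2.5, None, 3],
    [float("nan"), "blue", 7],
]
# Every row has at least one missing value, so nothing would be left.
print(count_complete_rows(rows))
```

If the count is zero, dropping columns with many missing values first, or using a replacement strategy, may be more practical than dropping samples.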

identify numeric columns

Describe the new feature:

This is useful to gather all numeric columns into a list so that later we can apply any transformation on them in one-liner style. For example, fill NAs in the numeric columns by their means.

What is the current outcome?

The nominal module already has a method, identify_nominal_columns, to find nominal columns. I followed its style and added a method identify_numeric_columns to my fork, but it seems I cannot make a pull request yet because there is another request pending. Can @shakedzy accept my previous pull request?

Is it backward-compatible?

Should be no problem, as it is independent from others.

Can I get a tutorial on how the library works?

I loved your article on correlation: https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9

I know you published the library, but since I am still new, I would love to see the code you used in the article, so I have some working examples of how the library is supposed to work.

Binary values

Describe the new feature:

Add a check for binary types and convert them to categorical, so that it runs much smoother.

What is the current outcome?

A ValueError is raised and you have to convert the values yourself; there should be no need for that.

Is it backward-compatible?

ValueError: If using all scalar values, you must pass an index in Cramers v

What actually brought me here was this: when using associations with theil_u=False, Cramér's V is used, and it tries to create a pd.crosstab into which two lists are passed. These lists are the result of

    if nan_strategy == REPLACE:
        x, y = replace_nan_with_value(x, y, nan_replace_value)
    elif nan_strategy == DROP:
        x, y = remove_incomplete_samples(x, y)

Which is then followed by

    confusion_matrix = pd.crosstab(x,y)

This throws the error: ValueError: If using all scalar values, you must pass an index

This seems to be because crosstab requires data with an index or that a manual index is passed (i.e. the input should be a Series or the index should be set manually). The easiest solution is likely to just pass the lists to a pd.Series before doing the crosstab.

Empty values and usability for feature selection

Hi Shaked,

I replied on your Medium article with a question about NaN values in a column, and I partially figured out how your functions deal with NaN values. Using 'cramers_v', calling pd.crosstab(x,y) drops all rows in the dataframe which contain a NaN value. Replacing NaN with for example the string 'empty' results in different values. Haven't figured out if theils_u does the same, maybe you know about it?

I also want to use both Cramér's V and Theil's U for feature selection in my machine learning project. Do you have any scientific papers about feature selection that refer to both Cramér's V and Theil's U? I have a lot of features in my dataset and I'm thinking about removing features which have a high correlation with another feature. What do you think about this? I'm also using Chi Square to find out whether features are correlated or not.

Hope you can help me forward.

Rick

How to plot heatmap just for categorical and numeric features?

Hi Shakedzy!

Thanks a lot for sharing your nice code.

Could you please provide an example that shows how to plot a heatmap just for categorical and numeric features? What I mean is how to get 3 separate heatmap plots:

numeric vs numeric
numeric vs categorical
categorical vs categorical.

Besides, should categorical features with more than 2 categories be converted into 0 or 1 using get_dummies? If there are 3 categories, isn't it allowed to use 1, 2, and 3 to represent each category?

Thank you in advance.

The heatmap doesn't scale well

I attached an example with 56 columns

Distinguish plot and show for associations parameters

This is more of a suggestion, since I find myself working around it.

Describe the new feature:

Have the option to show the plot (currently done with the plot parameter), but also the option to plot (currently no option). The function associations is very neat, but you don't always want to plot the results, but just get the dataframe. I suggest adding a show parameter.

What is the current outcome?

Currently, there is no way not to plot (i.e. create matplotlib object). This means, when working in a notebook, it will always show the plot at the end of a cell, disregarding the value of the plot parameter.

Is it backward-compatible?

No. show would replace plot in functionality, and plot would get a different meaning.

My current workaround is to create a figure/ax before calling associations, and pass ax to associations. Then I call plt.close(fig) to remove the plot.

nan_strategy as "drop_samples"

Version check:

Run and copy the output:

import sys, dython
print(sys.version_info)
print(dython.__version__)

0.5.1

Describe the bug:

Code to reproduce:

import pandas as pd
from dython.nominal import associations

data = pd.read_excel(r"\data.xlsx")

nominal_var = ['a4a','k17','cnd2','k16','bmj5']
dython_corr = associations(data, nominal_columns=nominal_var, plot=False, nan_strategy='drop_samples')

Error message:

return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

ValueError: zero-size array to reduction operation maximum which has no identity

Input data:

dython_test.xlsx

correlation ratio

Shall we take the square root of eta as the correlation ratio according to its definition? Thanks.
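
For reference, the correlation ratio is commonly defined as eta, where eta squared is the ratio of the between-category sum of squares to the total sum of squares. The correlation_ratio listing quoted in an earlier issue on this page returns numerator/denominator without a square root, which corresponds to eta squared. The sketch below is a pure-Python illustration of both quantities; the squared flag is a hypothetical parameter added for this example.

```python
import math
from collections import defaultdict

def correlation_ratio(categories, measurements, squared=False):
    """Correlation ratio between a categorical and a numeric sequence.

    Illustrative sketch only; `squared` is a hypothetical flag that
    switches between eta squared and eta.
    """
    groups = defaultdict(list)
    for cat, value in zip(categories, measurements):
        groups[cat].append(value)
    total_avg = sum(measurements) / len(measurements)
    # Between-category sum of squares: spread of the category means.
    numerator = sum(
        len(g) * (sum(g) / len(g) - total_avg) ** 2 for g in groups.values()
    )
    # Total sum of squares: spread of all measurements.
    denominator = sum((m - total_avg) ** 2 for m in measurements)
    eta_squared = 0.0 if numerator == 0 else numerator / denominator
    return eta_squared if squared else math.sqrt(eta_squared)

cats = ["a", "a", "b", "b"]
vals = [1, 2, 3, 4]
print(correlation_ratio(cats, vals, squared=True))  # eta squared
print(correlation_ratio(cats, vals))                # eta
```

Both values lie in [0, 1], and eta is never smaller than eta squared, so the choice affects how strong an association appears.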

Mixed Data Type

Hi,

Can I generate the heatmap for categorical-categorical and categorical-numeric at the same time?

conditional_entropy produces ZeroDivisionError

File "corr.py", line 278, in <module>
    associations(df,nominal_columns="all", theil_u=True)
  File "corr.py", line 196, in associations
    corr[columns[j]][columns[i]] = theils_u(dataset[columns[i]],dataset[columns[j]])
  File "corr.py", line 107, in theils_u
    s_xy = conditional_entropy(x,y)
  File "corr.py", line 64, in conditional_entropy
    entropy += p_xy * math.log(p_y/p_xy)
ZeroDivisionError: integer division or modulo by zero

I tried

df=pd.read_csv(sys.argv[1], sep=',', low_memory=False, nrows=1000)
associations(df,nominal_columns="all", theil_u=True)

A,B,C
a,b,c
a,k,c
a,d,c
b,r,l
b,d,l
b,z,l
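
The wording "integer division or modulo by zero" is the message Python 2 gives for integer division. One plausible cause, assuming the script ran under Python 2, is that dividing a count by the total truncated p_xy to 0, so p_y/p_xy then divided by zero. The sketch below is not dython's code; it recomputes the conditional entropy for the sample data above with an explicit float total.

```python
import math
from collections import Counter

def conditional_entropy(x, y):
    """Entropy of x given y, in nats, using explicit float division."""
    y_counter = Counter(y)
    xy_counter = Counter(zip(x, y))
    total = float(len(x))  # float total avoids truncating integer division
    entropy = 0.0
    for (_, y_val), count in xy_counter.items():
        p_xy = count / total
        p_y = y_counter[y_val] / total
        entropy += p_xy * math.log(p_y / p_xy)
    return entropy

A = ["a", "a", "a", "b", "b", "b"]
B = ["b", "k", "d", "r", "d", "z"]
C = ["c", "c", "c", "l", "l", "l"]

print(conditional_entropy(A, C))  # A is fully determined by C
print(conditional_entropy(B, A))  # B still varies within each value of A
```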

Feature request: autosize

It's easy to adjust figsize yourself, and you can pass it to associations(), but it would be nice if this method could estimate an appropriate figsize based on the number of columns.
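
A simple heuristic along these lines is sketched below. The function name estimate_figsize and its default sizes are assumptions made up for this example, not part of dython.

```python
def estimate_figsize(n_columns, cell_inches=0.6, min_inches=4.0, max_inches=30.0):
    """Return a square (width, height) that grows with the number of columns.

    Hypothetical helper; the default sizes are arbitrary starting points.
    """
    side = min(max(n_columns * cell_inches, min_inches), max_inches)
    return (side, side)

print(estimate_figsize(5))    # small matrices fall back to the minimum size
print(estimate_figsize(20))   # mid-sized matrices scale with the column count
print(estimate_figsize(56))   # very wide matrices are capped at the maximum
```

The result could then be passed as figsize to associations(), with the per-cell size tuned to the font size of the annotations.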

Update documentation to return results in associations.

In your docstring for cluster_correlations you have the following example:

 Example:
    --------
    >> correlations = associations(
        customers,
        return_results=True,
        plot=False
    )
    >> correlations, _ = cluster_corre

The key element here is return_results=True.
That is the only place I found this parameter. In fact, so far I could only get the correlations when this flag is set to True. If I don't set it, the function returns None instead of the results. I'm not quite sure whether this is a bug, but I think the return_results parameter should be in the documentation.

Or have I misunderstood something?

kind regards,

Nominal functions do not handle missing values

The nominal module was not designed to handle incomplete data. I would like to figure out a solution to this:

Should all methods accept incomplete data?
How should methods handle missing data?

Installation issue

Hi,
I installed the dython module as per your instructions. But then I could not import nominal or the other function. I get an error saying that they are not found in dython. Could you please help?
Thank you.

Searching for Categorical Correlation

Hi, thank you for the good job you are doing by imparting knowledge to others. I read your blog on searching for categorical correlation, and I found this code:
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x,y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))

However, I was wondering how you applied it to your Mushrooms datasets since we have so many variables there.

Please, I would appreciate it if you could help me with this. I have a related problem to solve with my dataset.

Thank you

Datetime is not supported in associations method

When using nominal.associations over data that contains datetime data, an error occurs:

import pandas as pd
from datetime import datetime, timedelta
from dython.nominal import associations

dt = datetime(2020, 12, 1)
end = datetime(2020, 12, 2)
step = timedelta(seconds=5)
result = []
while dt < end:
    result.append(dt.strftime('%Y-%m-%d %H:%M:%S'))
    dt += step

nums = list(range(len(result)))
df = pd.DataFrame({'dates':result, 'up': nums, 'down': sorted(nums, reverse=True)})
df['dates'] = pd.to_datetime(df['dates'], format="%Y-%m-%d %H:%M:%S")  # without this, this column is considered as object rather than dates

associations(df)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/shaked.z/self/dython/dython/nominal.py", line 559, in associations
    nan_strategy, nan_replace_value)
  File "/Users/shaked.z/self/dython/dython/nominal.py", line 373, in _comp_assoc
    dataset[columns[j]])
  File "/Users/shaked.z/myenv/lib/python3.6/site-packages/scipy/stats/stats.py", line 3851, in pearsonr
    dtype = type(1.0 + x[0] + y[0])
numpy.core._exceptions.UFuncTypeError: ufunc 'add' cannot use operands with types dtype('float64') and dtype('<M8[ns]')

Dates and times should be considered as continuous numerical values.
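
One way to treat dates as continuous values is to convert them to seconds elapsed since a fixed origin before computing correlations. The sketch below is an illustration in pure Python, not dython's implementation; to_seconds and pearson are hypothetical helpers, and the data mirrors the example above with fewer rows.

```python
import math
from datetime import datetime, timedelta

def to_seconds(values, origin=datetime(1970, 1, 1)):
    """Convert naive datetimes to seconds elapsed since `origin`."""
    return [(v - origin).total_seconds() for v in values]

def pearson(a, b):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

start = datetime(2020, 12, 1)
dates = [start + timedelta(seconds=5 * i) for i in range(100)]
up = list(range(100))
down = sorted(up, reverse=True)

seconds = to_seconds(dates)
# Dates increase with `up` and decrease with `down`.
print(pearson(seconds, up), pearson(seconds, down))
```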

Using natural logarithm instead of base 2

I wonder about the reason for using the natural log instead of base 2 when calculating the Conditional Entropy (CE).

Maybe it is more logical not to scale the CE to bits (log2) when calculating the correlation between categorical variables, but I would appreciate more clarification.
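
A point that may help here: the conditional entropy itself does change with the base (nats versus bits), but Theil's U is a ratio of entropies, so the base cancels out. The sketch below illustrates this with simplified stand-ins for the library functions; it is not dython's code, and the log_base parameter is used here only to switch bases.

```python
import math
from collections import Counter

def entropy(x, log_base=math.e):
    n = len(x)
    return -sum((c / n) * math.log(c / n, log_base) for c in Counter(x).values())

def conditional_entropy(x, y, log_base=math.e):
    n = len(x)
    y_counter = Counter(y)
    return sum(
        (c / n) * math.log((y_counter[y_val] / n) / (c / n), log_base)
        for (_, y_val), c in Counter(zip(x, y)).items()
    )

def theils_u(x, y, log_base=math.e):
    """Uncertainty coefficient U(x|y); illustrative sketch only."""
    h_x = entropy(x, log_base)
    if h_x == 0:
        return 1.0
    return (h_x - conditional_entropy(x, y, log_base)) / h_x

x = ["a", "a", "b", "b", "c", "c"]
y = ["u", "u", "u", "v", "v", "v"]

print(conditional_entropy(x, y), conditional_entropy(x, y, 2))  # nats vs bits
print(theils_u(x, y), theils_u(x, y, 2))                        # same ratio
```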

Nominal cramers_v

Cramer's V recently began throwing an error.

[To Replicate]

import pandas as pd
df = pd.read_fwf('http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data', header=None)
df.columns=['mpg','cylinders','displacement',
            'horsepower','weight','acceleration',
            'model_year','origin','car_name']
df_clean = df.dropna()
nominal.cramers_v(df_clean[['model_year']],df_clean[['car_name']])

[Expected Results]
Calculation of Cramer's V

[Error]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-c52c2b81dd0a> in <module>
----> 1 nominal.cramers_v(df_clean[['model_year']],df_clean[['car_name']])

~/apps/anaconda3/lib/python3.7/site-packages/dython/nominal.py in cramers_v(x, y, nan_strategy, nan_replace_value)
     78     elif nan_strategy == DROP:
     79         x, y = remove_incomplete_samples(x, y)
---> 80     confusion_matrix = pd.crosstab(x,y)
     81     chi2 = ss.chi2_contingency(confusion_matrix)[0]
     82     n = confusion_matrix.sum().sum()

~/.local/lib/python3.7/site-packages/pandas/core/reshape/pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, margins_name, dropna, normalize)
    560     from pandas import DataFrame
    561 
--> 562     df = DataFrame(data, index=common_idx)
    563     if values is None:
    564         df["__dummy__"] = 0

~/.local/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    408             )
    409         elif isinstance(data, dict):
--> 410             mgr = init_dict(data, index, columns, dtype=dtype)
    411         elif isinstance(data, ma.MaskedArray):
    412             import numpy.ma.mrecords as mrecords

~/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_dict(data, index, columns, dtype)
    255             arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
    256         ]
--> 257     return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    258 
    259 

~/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype)
     75     # figure out the index, if necessary
     76     if index is None:
---> 77         index = extract_index(arrays)
     78     else:
     79         index = ensure_index(index)

~/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py in extract_index(data)
    356 
    357         if not indexes and not raw_lengths:
--> 358             raise ValueError("If using all scalar values, you must pass an index")
    359 
    360         if have_series:

ValueError: If using all scalar values, you must pass an index

[Package versions]
Dython=='0.2.0'
Pandas=='0.25.0'

Optional parameter: cbar for associations

Describe the new feature:

It would be nice to have the option to disable the cbar for the associations plot. I know associations already has a ton of parameters, so I'm thinking of how to fix this neatly, but so far I do not have a solution.

What is the current outcome?

Currently, the cbar is always enabled and can no longer be disabled. This used to be possible by passing it in the kwargs, but with the explicit parameters, this is no longer possible.

Is it backward-compatible?

Yes, if default is True, nothing should change for older versions.

Nice template btw :)

Error in conditional_entropy function

Version check:

Run and copy the output:

import sys, dython
print(sys.version_info)
print(dython.__version__)

Describe the bug:

I believe your joint distribution for x and y is not properly calculated. Specifically you will find that sum(p_xy) > 1 because you are normalising by n_y instead of n_xy.

I have not run the code so none of the other fields in this form are applicable.
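
The claim can be checked with a few lines. This sketch assumes that total_occurrences is the number of samples and that the joint counts come from pairing x and y row by row, as the conditional_entropy lines quoted in a traceback elsewhere on this page suggest; it is not the library's code.

```python
from collections import Counter

x = ["a", "a", "b", "b", "b"]
y = ["u", "v", "u", "u", "v"]

y_counter = Counter(y)
xy_counter = Counter(zip(x, y))
total_occurrences = sum(y_counter.values())  # equals the number of samples

# Joint probability of each observed (x, y) pair.
p_xy = {xy: count / total_occurrences for xy, count in xy_counter.items()}

print(sum(p_xy.values()))
```

Under these assumptions the joint counts add up to the number of samples, so the joint probabilities sum to 1 up to floating-point rounding.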

Associations not returned for categorical variables

Hi, first of all thank you for this awesome package and the Medium article ;)

I am testing the associations function found in nominal.py with a mix of numerical and categorical variables. I provide below a sample (sample.csv) of the dataset that is returning an empty result.

2020-02-13 00:00:00.017,/131.161.10.118,GET,404,1830,569,1930
2020-02-13 00:00:00.183,/58.14.127.52,GET,406,1110,607,1210
2020-02-13 00:00:00.35,/93.40.70.85,GET,200,926,544,1026
2020-02-13 00:00:00.521,/93.40.70.85,GET,404,2229,502,2329
2020-02-13 00:00:01.02,/87.65.64.76,GET,404,2046,556,2146

My code is:

data = pd.read_csv('sample.csv', header=None)
associations(data)

The numerical columns are providing results that are fine, but I am not getting anything for the categorical ones. My result is:

How is it that nothing is returned?

When testing this with other datasets that have a mix of variables, I have had cases where everything was calculated just fine, cases where it wasn't, like the above, and cases where it wasn't and this warning was thrown:
RuntimeWarning: divide by zero encountered in double_scalars return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

Any help will be greatly appreciated, I can't wait to use this package more!

Running dython.nominal associations twice in same jupyter notebook environment

Hi,

dython version: 0.6.2

Thank you for your work! I'm experiencing some issues when running the dython.nominal associations function twice in the same jupyter notebook environment.

For the sake of reproducibility, I used the Adult dataset (https://www.kaggle.com/wenruliu/adult-income-dataset).

df = pd.read_csv('adult.csv')

associations(df.copy());

associations(df.copy());

When running this code twice, the second association matrix contains a nan value, where the first doesn't (see image below). I am not sure if this is a dython issue, or a scipy issue in general. Do you have any thoughts on this, or how to solve this issue?

How to Contribute?

Hi guys,

First of all, thanks a lot for the amazing work here. As a stats and 'metrics-trained data scientist who is working in Python now, this is a godsend.

I was using your theil's U and cramer's V functions, but made my own function to plot results for a variable number of inputs and for both functions.

How can I contribute this to the code base?

plt.subplot eopt markers are plotted in wrong axis

Version check:

Run and copy the output:
Python 3.8.6 (default, Sep 25 2020, 00:00:00)
[GCC 10.2.1 20200723 (Red Hat 10.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.

import sys, dython
print(sys.version_info)
sys.version_info(major=3, minor=8, micro=6, releaselevel='final', serial=0)
print(dython.__version__)
0.6.1

Describe the bug:

Code to reproduce:

import dython
fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(fig_height, sub_fig_width))
results = roc_graph(y_test, y_pred, class_names=class_names, ax=axs[0,1], eopt=True, plot=True, colors=p)

Error message:

[ WARNING ] - No handles with labels found to put in legend.

eopt is plotted in the last subplot (axs[1,1])
No legend is plotted

Input data:

Correlation ratio in nominal.py class

Hello and thank you for this useful package. I have been calculating the correlation with nominal.py class. I have a non-linear numeric data and categorical data. I was wondering which statistical method are you using when you calculate correlation ratio with non-linear numeric and categorical data?

calculation of cramer v

Hi, would like to know why you included the calculation of the following for cramer v?

phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
rcorr = r-((r-1)**2)/(n-1)
kcorr = k-((k-1)**2)/(n-1)

When wiki stated this formula:
phi2/min(k-1,r-1)
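
The extra terms match the bias-corrected form of Cramér's V, which subtracts the expected inflation of phi2 in small samples. The sketch below compares the plain and corrected versions. It is an illustration only: chi-squared is computed by hand without any continuity correction, the bias_correction flag is used here just to switch between the two formulas, and both variables are assumed to have at least two categories.

```python
import math
from collections import Counter

def cramers_v(x, y, bias_correction=True):
    """Cramér's V from raw chi-squared, with optional bias correction.

    Illustrative sketch; assumes x and y each have at least two categories.
    """
    n = len(x)
    row, col, cell = Counter(x), Counter(y), Counter(zip(x, y))
    r, k = len(row), len(col)
    chi2 = sum(
        (cell[(a, b)] - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
        for a in row
        for b in col
    )
    phi2 = chi2 / n
    if not bias_correction:
        return math.sqrt(phi2 / min(k - 1, r - 1))
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return math.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))

x = ["a", "a", "a", "b", "b", "b"]
y = ["u", "u", "v", "u", "v", "v"]
print(cramers_v(x, y, bias_correction=False))  # plain formula
print(cramers_v(x, y, bias_correction=True))   # corrected formula
```

On this tiny sample the plain formula reports a moderate association while the corrected one shrinks it to zero, which is the kind of small-sample overestimate the correction is meant to reduce.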

associations should handle single-value features in a better way

Currently, nominal.associations single-value feature handling isn't perfect. If set to use Cramer's V, it will print a warning:

RuntimeWarning: invalid value encountered in double_scalars
  return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

And output empty values for this feature:

If set to use Theil's U, it will plot a 1.0 and 0.0 columns:

While correct, this is a bit misleading if user is unaware of the fact that there's a single value in this column.

Suggested solution: Plot a single-color column and row, with an explicit label Single Value.

Add tests

At the moment, Dython has no tests. As the code gets more and more complex, these could be useful. This also means that the versions of the dependencies must be decided.

TypeError in correlation_ratio()

Hi! Thanks for sharing your project, the idea of it is really useful. However, I bumped into a problem when I tried to use the nominal.associations() function.

File "marton.py", line 79, in <module>
    assoc = nominal.associations(df, nominal_columns=['sexdet','Location', 'shade'])
  File "C:\Program Files\Python\lib\site-packages\dython\nominal.py", line 188, in associations
    cell = correlation_ratio(dataset[columns[i]], dataset[columns[j]])
  File "C:\Program Files\Python\lib\site-packages\dython\nominal.py", line 123, in correlation_ratio
    y_avg_array[i] = np.average(cat_measures)
  File "C:\Program Files\Python\lib\site-packages\numpy\lib\function_base.py", line 1110, in average
    avg = a.mean(axis)
  File "C:\Program Files\Python\lib\site-packages\numpy\core\_methods.py", line 82, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'

Snippets of the dataset:

Any help or advice where to look at is appreciated!
Thanks.

associations un-needed steps when plot=False

Hi, this library is amazing! I've noticed that when doing certain actions, un-necessary matplotlib steps are executed. One example would be when using the associations() function. I patched my installed version to test, but made these changes so that if plot=False we don't worry about doing any of the steps needed to plot.

    corr, columns, nominal_columns, inf_nan, single_value_columns = _comp_assoc(dataset, nominal_columns, mark_columns,
                                                                                theil_u, clustering, bias_correction,
                                                                                nan_strategy, nan_replace_value)
    if not plot:
        ax = None
    if ax is None and plot:
        plt.figure(figsize=figsize)
    if inf_nan.any(axis=None):
        inf_nan_mask = np.vectorize(lambda x: not bool(x))(inf_nan.values)
        if plot:
            ax = sns.heatmap(inf_nan_mask,
                            cmap=['white'],
                            annot=inf_nan if annot else None,
                            fmt='',
                            center=0,
                            square=True,
                            ax=ax,
                            mask=inf_nan_mask,
                            cbar=False)
    else:
        inf_nan_mask = np.ones_like(corr)
    if len(single_value_columns) > 0:
        sv = pd.DataFrame(data=np.zeros_like(corr),
                          columns=columns,
                          index=columns)
        for c in single_value_columns:
            sv.loc[:, c] = ' '
            sv.loc[c, :] = ' '
            sv.loc[c, c] = 'SV'
        sv_mask = np.vectorize(lambda x: not bool(x))(sv.values)
        if plot:
            ax = sns.heatmap(sv_mask,
                            cmap=[sv_color],
                            annot=sv if annot else None,
                            fmt='',
                            center=0,
                            square=True,
                            ax=ax,
                            mask=sv_mask,
                            cbar=False)
    else:
        sv_mask = np.ones_like(corr)
    mask = np.vectorize(lambda x: not bool(x))(inf_nan_mask) + np.vectorize(lambda x: not bool(x))(sv_mask)
    if plot:
        ax = sns.heatmap(corr,
                        cmap=cmap,
                        annot=annot,
                        fmt=fmt,
                        center=0,
                        vmax=1.0,
                        vmin=-1.0 if len(columns) - len(nominal_columns) >= 2 else 0.0,
                        square=True,
                        mask=mask,
                        ax=ax,
                        cbar=cbar)
    if plot:
        plt.show()
    
    return {'corr': corr,
            'ax': ax}

Module import not possible?

Version check:

sys.version_info(major=3, minor=7, micro=6, releaselevel='final', serial=0)
0.4.6

Describe the bug:

Code to reproduce:

import dython as dy

dy.nominal.associations(...)

Error message:

AttributeError: module 'dython' has no attribute 'nominal'

ValueError: math domain error, when setting theil_u=True

Hi Shaked,
Thank you for providing a great library and informative write-up!

I faced an issue with associations method, check below for details:

Usage:

All selected features are categorical of type (Object)

from dython.nominal import associations
accidents_data_catigorical = ['Speed_limit', 'Junction_Detail', 'Light_Conditions', 
                              'Weather_Conditions', 'Road_Surface_Conditions', 'Special_Conditions_at_Site',
                             'Carriageway_Hazards', 'Urban_or_Rural_Area', 'Accident_Severity', '1st_Road_Class']

corr = associations(accidents_data[accidents_data_catigorical], theil_u=True, 
                    return_results=True, plot=True, nominal_columns=accidents_data_catigorical)

Thrown error trace:

ValueError                                Traceback (most recent call last)
<ipython-input-105-99541e13a9e9> in <module>()
      5 
      6 corr = associations(accidents_data[accidents_data_catigorical], theil_u=True, 
----> 7                     return_results=True, plot=True, nominal_columns=accidents_data_catigorical)

/Applications/anaconda3/lib/python3.6/site-packages/dython/nominal.py in associations(dataset, nominal_columns, mark_columns, theil_u, plot, return_results, **kwargs)
    169                         if theil_u:
    170                             corr[columns[j]][columns[i]] = theils_u(dataset[columns[i]],dataset[columns[j]])
--> 171                             corr[columns[i]][columns[j]] = theils_u(dataset[columns[j]],dataset[columns[i]])
    172                         else:
    173                             cell = cramers_v(dataset[columns[i]],dataset[columns[j]])

/Applications/anaconda3/lib/python3.6/site-packages/dython/nominal.py in theils_u(x, y)
     76         in the range of [0,1]
     77     """
---> 78     s_xy = conditional_entropy(x,y)
     79     x_counter = Counter(x)
     80     total_occurrences = sum(x_counter.values())

/Applications/anaconda3/lib/python3.6/site-packages/dython/nominal.py in conditional_entropy(x, y)
     29         p_xy = xy_counter[xy] / total_occurrences
     30         p_y = y_counter[xy[1]] / total_occurrences
---> 31         entropy += p_xy * math.log(p_y/p_xy)
     32     return entropy
     33 

ValueError: math domain error

With the default theil_u=False, no error is thrown. However, since the correlation is between categorical features, it would be beneficial to use the uncertainty coefficient (Theil's U).
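
A note on what the traceback implies: p_xy is always positive for a pair that was counted, so math.log can only fail here if p_y is 0, meaning the lookup y_counter[xy[1]] found nothing. One way that can happen is a NaN value, since NaN never compares equal to another NaN. Whether that is what happened in this report is an assumption. The sketch below uses made-up values to show the mechanism.

```python
import math
from collections import Counter

# Two distinct NaN objects (for illustration only).
nan_a, nan_b = float("nan"), float("nan")

y_counter = Counter([nan_a, "Wet", "Dry"])
total_occurrences = 3

# NaN never compares equal to another NaN, so the lookup misses
# and Counter returns 0 for it.
p_y = y_counter[nan_b] / total_occurrences
p_xy = 1 / total_occurrences

try:
    math.log(p_y / p_xy)  # log(0.0) is outside the domain of math.log
    raised = False
except ValueError:
    raised = True
```

If this is the cause, dropping or filling missing values before calling associations (for example with pandas' dropna()) should avoid the error.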

Possible bug in nominal.py

Shouldn't line 25 in nominal.py be
xy_counter = Counter(list(product(x,y))) instead of xy_counter = Counter(list(zip(x,y)))?
The reasoning is that, when computing the joint distribution of (x, y), we should iterate over all possible pairs.

In the same vein, line 26 calculates total_occurrences, which is then used in line 29.
However, for computing the joint probability p_xy, shouldn't total_occurrences be changed to sum(xy_counter.values())?

I guess I am not sure about the validity of the calculations here. Can I work on this issue?
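
To illustrate the difference this question turns on (a sketch with made-up data, not dython's code): zip(x, y) pairs each observation's x with its own y, so counting those pairs gives the empirical joint distribution, and the counts sum to the number of observations. product(x, y) pairs every x with every y regardless of row, which yields the product of the marginals, as if x and y were independent.

```python
from collections import Counter
from itertools import product

# Made-up sample in which y is fully determined by x.
x = ["a", "a", "b", "b"]
y = ["u", "u", "v", "v"]

zip_counter = Counter(zip(x, y))
product_counter = Counter(product(x, y))

n = len(x)
# Observed pairs: only (a, u) and (b, v) occur, each with probability 1/2.
p_joint = {pair: count / n for pair, count in zip_counter.items()}
# Cartesian pairs: all four combinations get probability 1/4,
# which equals p(x) * p(y) and hides the dependence.
p_product = {pair: count / n**2 for pair, count in product_counter.items()}

# With zip, the pair counts already sum to the number of observations.
zip_total = sum(zip_counter.values())
```

In this example sum(xy_counter.values()) and the number of observations are the same number, so under zip the two choices of denominator agree.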

associations method should return fig and ax

Thanks for this amazing work. I have used it several times now to explore my data (with citation). Question - how can I save the heat map visualization?

nominal.associations(df_cats, theil_u=True, nominal_columns='all', figsize=(12,12))

I tried variations like the following:

f2,ax = nominal.associations(df_cats, theil_u=True, nominal_columns='all', figsize=(12,12))
f2.savefig('Categorical Corr Heat Map.png')

Thanks!
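
A minimal sketch of saving the plot, assuming a dython version in which associations returns a dict with an 'ax' key (as in the `return {'corr': corr, 'ax': ax}` statement quoted earlier on this page): a Matplotlib Axes exposes its parent figure as ax.figure, and the figure can be written to disk with savefig. A stand-in heat map drawn with imshow replaces the dython call so the snippet runs on its own; the file name is arbitrary.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Stand-in for: result = nominal.associations(df_cats, theil_u=True, ...)
fig, ax = plt.subplots(figsize=(12, 12))
ax.imshow([[1.0, 0.3], [0.3, 1.0]], vmin=0.0, vmax=1.0)
result = {"ax": ax}

# The Axes knows its parent Figure, which is what savefig is called on.
out_path = os.path.join(tempfile.mkdtemp(), "categorical_corr_heat_map.png")
result["ax"].figure.savefig(out_path, bbox_inches="tight")
plt.close(fig)
```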

Not ready to Production

Hi,
First, thank you for implementing this library and for writing the blog post The Search for Categorical Correlation.
Second:

Test coverage is a standard expectation for a Python package (see python.doc).
Without it, the current code is not production-ready.
You should use tox, pytest and mypy.
Here is an example of why this matters:

Version check:

Run and copy the output:

import sys
print(sys.version_info)
sys.version_info(major=3, minor=8, micro=1, releaselevel='final', serial=0)
print(dython.__version__)
0.6.1

Describe the bug:

pandas.DataFrame.as_matrix has been deprecated; see the pandas docs:
Deprecated since version 0.23.0: Use DataFrame.values instead.

Code to reproduce:

import dython
import pandas as pd
pd.DataFrame(data={'col1': [1, 2]})
   col1
0     1
1     2
df = pd.DataFrame(data={'col1': [1, 2]})
dython._private.convert(df, 'array')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\AS6SE\PycharmProjects\omri_code\venv\lib\site-packages\dython\_private.py", line 15, in convert
    converted = data.as_matrix()
  File "C:\Users\AS6SE\PycharmProjects\omri_code\venv\lib\site-packages\pandas\core\generic.py", line 5139, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'as_matrix'

Error message:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\AS6SE\PycharmProjects\omri_code\venv\lib\site-packages\dython\_private.py", line 15, in convert
    converted = data.as_matrix()
  File "C:\Users\AS6SE\PycharmProjects\omri_code\venv\lib\site-packages\pandas\core\generic.py", line 5139, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'as_matrix'
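
For reference, a sketch of the replacement call (not a patch to dython's _private.convert): DataFrame.as_matrix was removed in pandas 1.0, and the equivalent conversion is the DataFrame.values attribute or, from pandas 0.24 on, DataFrame.to_numpy().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={"col1": [1, 2]})

# .to_numpy() (pandas >= 0.24) replaces the removed .as_matrix();
# .values is the older attribute and returns the same array here.
converted = df.to_numpy()
same_as_values = np.array_equal(converted, df.values)
```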

Input data:

nan_strategy=_SKIP producing negative values in some cases

Hello shakedzy,
First of all, thank you for this fantastic package; it is a great contribution to Python.
Also, congratulations on the article published on TowardsDataScience.com.

I have one question/concern after using your package.
When computing category-to-category associations with Theil's U, I got negative results in some cases.
After analyzing your code, I found that these negative results are caused by the parameter "nan_strategy=_SKIP" in the function "_comp_assoc" (lines 409 and 415 in dython/dython/nominal.py).
I do not know whether this is intended or an error, since nan_strategy can already be specified as a parameter.

Thank you very much,
Regards,
Xavi
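
For context, a pure-Python sketch of Theil's U (written for illustration, not taken from dython): U(x|y) = (H(x) - H(x|y)) / H(x). Because 0 <= H(x|y) <= H(x) when both entropies are estimated from the same rows, the result lies in [0, 1]. A negative value therefore suggests the two entropies were computed on different subsets of the data, for example if missing values are skipped in one term but not the other. Dropping incomplete pairs up front, as below, keeps them consistent. The helper names are hypothetical.

```python
import math
from collections import Counter


def _is_missing(value):
    # Treat None and float NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


def theils_u_complete_pairs(x, y):
    """Theil's U(x|y), using only rows where both x and y are present."""
    pairs = [(a, b) for a, b in zip(x, y)
             if not _is_missing(a) and not _is_missing(b)]
    n = len(pairs)
    x_counts = Counter(a for a, _ in pairs)
    y_counts = Counter(b for _, b in pairs)
    xy_counts = Counter(pairs)

    # H(x) = -sum p(x) log p(x)
    h_x = -sum((c / n) * math.log(c / n) for c in x_counts.values())
    if h_x == 0:
        return 1.0  # convention used here: a constant x is fully "explained"

    # H(x|y) = sum p(x, y) log(p(y) / p(x, y)); the 1/n factors cancel in the ratio
    h_x_given_y = sum(
        (c / n) * math.log(y_counts[b] / c) for (_, b), c in xy_counts.items()
    )
    return (h_x - h_x_given_y) / h_x


x = ["a", "a", "b", "b", "a", float("nan")]
y = ["u", "u", "v", "v", "v", "u"]
u = theils_u_complete_pairs(x, y)  # the row with the NaN is ignored in every term
```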

Theil's Uncertainty coefficient matrix with negative values

Version check:

In [213]: import sys, dython                                                                             

In [214]: print(sys.version_info)                                                                        
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)

In [215]: print(dython.__version__)                                                                      
0.5.1

Describe the bug:

Hi,

I'm using Dython to compute a Theil's uncertainty coefficient matrix (great job, btw, thanks :)), and I get negative values. Theil's U should lie in [0, 1], shouldn't it? Am I making a mistake in how I use dython?

Thanks!

In [211]: associations(datacorr,nan_strategy='drop_samples', theil_u=True)                               
Out[211]: 
{'corr':           score1    score2      col0      col1  ...     col11     col12     col13     col14
 score1  1.000000  0.786042  0.025973  0.006398  ... -0.097800 -0.097800  0.215670  0.073910
 score2  0.786042  1.000000 -0.006524 -0.159151  ... -0.044973 -0.044973  0.343482  0.051602
 col0    0.025973 -0.006524  1.000000  0.512063  ... -0.033839 -0.033839  0.184816 -0.137152
 col1    0.006398 -0.159151  0.512063  1.000000  ... -0.058191 -0.058191  0.745826  0.011129
 col2   -0.047007 -0.157616 -0.030940  0.007882  ... -0.005629 -0.005629  0.078914  0.080133
 col3   -0.068517  0.073003  0.070584 -0.236361  ...  0.400682  0.400682  0.349443 -0.154374
 col4   -0.012441 -0.031822  0.002387  0.128089  ... -0.057034 -0.057034  0.131739  0.028288
 col5    0.033059  0.405524 -0.028680 -0.338379  ...  0.090785  0.090785  0.450301 -0.191195
 col6   -0.105568 -0.382011  0.036999  0.222112  ... -0.019065 -0.019065  0.248824  0.127296
 col7   -0.319328 -0.471742  0.056818  0.232227  ...  0.028656  0.028656  0.262430  0.176267
 col8   -0.203441 -0.269696  0.187738  0.680963  ... -0.028105 -0.028105  0.603658  0.097654
 col9   -0.063676  0.141670 -0.066039 -0.381560  ...  0.286273  0.286273  0.541820 -0.031728
 col10   0.215472  0.344284  0.310335  0.786511  ...  0.230519  0.230519  0.959587  0.463759
 col11  -0.097800 -0.044973 -0.033839 -0.058191  ...  1.000000  1.000000  0.227054 -0.154119
 col12  -0.097800 -0.044973 -0.033839 -0.058191  ...  1.000000  1.000000  0.227054 -0.154119
 col13   0.215670  0.343482  0.184816  0.745826  ...  0.227054  0.227054  1.000000  0.462736
 col14   0.073910  0.051602 -0.137152  0.011129  ... -0.154119 -0.154119  0.462736  1.000000
 
 [17 rows x 17 columns],
 'ax': <matplotlib.axes._subplots.AxesSubplot at 0x7fb451635940>}

In [212]: datacorr.to_csv('/tmp/datacorr.csv',index=True)

Input data:

datacorr.csv.txt

Consider integrating another heatmap visualization

Consider integrating the heatmap code from drazenz/heatmap into nominal.associations.

Associations nan_replace_value affects original input data

Thanks for all the work on this package - it's been really helpful.

I noticed a minor bug. When running associations or compute_associations, the NaN replacement (nan_replace_value) modifies the original input data. Hence, if you have a dataframe with NaN values in certain columns, these get replaced in the original input. It would be better to work on a copy of the original data instead of processing it in place.

Pandas gives the following warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

How can I change the size of the heat map?

I have a heat map with many attributes generated with:

associations(dfVasp,nominal_columns=['user', 'partition_name', 'qos', 'account'], theil_u=True)

But the values are cropped since there is no more space for them.
