Comments (16)
-
Ah ok, many thanks!
-
Yes, sorry: that is because the example data contains only the first 10 rows. Across all 20,000 rows there are 7 classes.
Will have a go in light of 1) and report back! Many thanks.
from dython.
Hi there,
Please help with the doubts above.
from dython.
Hi, thanks for the feedback! Could you please post the array you're sending? I don't know what df_dython = df_prep.drop(['Bedrock value'], axis = 1) is.
from dython.
Apologies, I should have attached that from the start. That drop is just me removing the pre-encoded data from the df, as your function does that itself. Example data from the array used as input is attached.
from dython.
Hi, two things:
- You're passing several columns to correlation_ratio as the second argument, which is not how you use it. Check the function's documentation: you should be passing "A sequence of continuous measurements". I see that there could be some confusion, as I accidentally wrote that you can pass a DataFrame. I fixed that in the documentation, so thanks.
- The Bedrock column only has a single value in it (at least in the data you uploaded), so there's no real meaning to a Correlation Ratio, as there is only one class.
Perhaps you could elaborate on what exactly you're trying to achieve, and I can try to guide you through it.
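To illustrate the second point, here is a minimal sketch (not the library's code) of why a single-class column gives nothing to measure. The data is made up: "Bedrock" mirrors the column name from this thread, while "depth" and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up example data: one categorical column with a single class and
# one hypothetical continuous measurement.
df = pd.DataFrame({
    "Bedrock": ["granite"] * 5,
    "depth": [1.2, 3.4, 2.2, 5.1, 4.0],
})

# Count the distinct classes before computing anything.
n_classes = df["Bedrock"].nunique()
print("classes:", n_classes)

# With one class, the only group mean is the overall mean, so the
# between-group sum of squares (the numerator of eta squared) is zero.
group_means = df.groupby("Bedrock")["depth"].transform("mean")
overall_mean = df["depth"].mean()
between_ss = float(((group_means - overall_mean) ** 2).sum())
total_ss = float(((df["depth"] - overall_mean) ** 2).sum())

eta = 0.0 if np.isclose(between_ss, 0.0) else float(np.sqrt(between_ss / total_ss))
print("eta:", eta)
```

A check like `nunique()` on the categorical column before calling the function makes this situation visible early.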
from dython.
Hi, firstly, thank you so much for the brilliant article! I can't tell you how helpful it has been in clarifying my doubts about correlations. I appreciate all your work.
I am facing an issue while implementing the Correlation Ratio.
~ Dataset: Kaggle's Titanic Train dataset (https://www.kaggle.com/startupsci/titanic-data-science-solutions/data)
~ Aim: Calculate the Correlation Ratio between 2 categorical features ('Survived' & 'Gender') and 2 continuous features ('Fare' & 'Age')
~ Code I tried:
numcols = ['Fare','Age']
catcols = ['Survived','Sex']
def correlation_ratio(categories, measurements):
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat)+1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(0,cat_num):
        cat_measures = measurements[np.argwhere(fcat == i).flatten()]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(np.multiply(y_avg_array,n_array))/np.sum(n_array)
    numerator = np.sum(np.multiply(n_array,np.power(np.subtract(y_avg_array,y_total_avg),2)))
    denominator = np.sum(np.power(np.subtract(measurements,y_total_avg),2))
    if numerator == 0:
        eta = 0.0
    else:
        eta = numerator/denominator
    return eta
correlation_ratio(catcols, numcols)
~ Error in line: cat_measures = measurements[np.argwhere(fcat == i).flatten()]
~ Error: TypeError: only integer scalar arrays can be converted to a scalar index
Can you please help me understand where I am going wrong? I would really appreciate any help in this regard. Thanks in advance.
from dython.
If the code you pasted is exactly what you ran, then numcols and catcols are simply lists of strings, not the columns of the data. You didn't extract the actual columns.
Also, you can't pass two columns at once: categories and measurements each hold only one column. Please read the function's documentation.
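As a rough sketch of what that means in practice, the snippet below indexes a DataFrame with each name to get the real column and calls the function once per (categorical, numeric) pair. The DataFrame is a tiny made-up stand-in, not the real Titanic data, and the function body is a condensed rewrite for illustration rather than the library's exact code.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories, measurements):
    """Correlation ratio (eta) between ONE categorical and ONE numeric column."""
    categories = np.asarray(categories)
    measurements = np.asarray(measurements, dtype=float)
    fcat, _ = pd.factorize(categories)
    y_total_avg = measurements.mean()
    numerator = 0.0
    for i in range(fcat.max() + 1):
        cat_measures = measurements[fcat == i]
        numerator += len(cat_measures) * (cat_measures.mean() - y_total_avg) ** 2
    denominator = np.sum((measurements - y_total_avg) ** 2)
    return 0.0 if numerator == 0 else float(np.sqrt(numerator / denominator))


# Made-up rows standing in for the Titanic training data (assumption).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 8.46],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
})

numcols = ["Fare", "Age"]
catcols = ["Survived", "Sex"]

# df[name] is the actual column; the bare lists above are only names.
results = {
    (cat, num): correlation_ratio(df[cat], df[num])
    for cat in catcols
    for num in numcols
}
for pair, eta in results.items():
    print(pair, round(eta, 3))
```

The real 'Age' column contains missing values, so those would need to be dropped or filled before a computation like this gives a meaningful number.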
from dython.
Thanks for the great article.
I am getting the error below while using this code:
REPLACE = 'replace'
DROP = 'drop'
DROP_SAMPLES = 'drop_samples'
DROP_FEATURES = 'drop_features'
SKIP = 'skip'
DEFAULT_REPLACE_VALUE = 0.0
def correlation_ratio(categories, measurements):
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat)+1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(0,cat_num):
        cat_measures = measurements[np.argwhere(fcat == i).flatten()]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(np.multiply(y_avg_array,n_array))/np.sum(n_array)
    numerator = np.sum(np.multiply(n_array,np.power(np.subtract(y_avg_array,y_total_avg),2)))
    denominator = np.sum(np.power(np.subtract(measurements,y_total_avg),2))
    if numerator == 0:
        eta = 0.0
    else:
        eta = np.sqrt(numerator/denominator)
    return eta
correlation_ratio(car_sales_cat.columns,car_sales_num.columns)
Error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
from dython.
please help
from dython.
Where's the data you're using? You just pasted my function.
from dython.
This is the function of yours that I'm using:
def correlation_ratio(categories, measurements):
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat)+1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(0,cat_num):
        cat_measures = measurements[np.argwhere(fcat == i).flatten()]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(np.multiply(y_avg_array,n_array))/np.sum(n_array)
    numerator = np.sum(np.multiply(n_array,np.power(np.subtract(y_avg_array,y_total_avg),2)))
    denominator = np.sum(np.power(np.subtract(measurements,y_total_avg),2))
    if numerator == 0:
        eta = 0.0
    else:
        eta = np.sqrt(numerator/denominator)
    return eta
Data:
categories: car_sales_cat.columns (categorical column)
measurements: car_sales_num.columns (numerical column)
from dython.
What is car_sales_cat? How do you expect me to debug this without the data?
from dython.
Hi, I'm really sorry, I should have attached the file at the beginning. From the attached file I separated the categorical and numerical data, and I'm trying to pass them to the function like this:
categories: car_sales_cat.columns (categorical column)
measurements: car_sales_num.columns (numerical column)
from dython.
Dude, you're making this super hard to help you, as I still don't know how you split the data. You might be doing it wrong.
Anyway, I assume that car_sales_cat and car_sales_num are pandas DataFrames. That means you're passing the column names, not the actual data. I answered this exact same thing in the comment right above your question.
from dython.
Hi, thanks. Yes, those are pandas DataFrames: car_sales_cat has only the categorical data and car_sales_num has only the numerical data.
Please suggest a way to pass them to the function.
I tried car_sales[car_sales_cat], but this is also not working. Please help, as I'm new to Python.
from dython.
I answered your question in my last comment:
you're passing the column names, not the actual data
Refer to the pandas documentation and the DataFrame API.
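For readers new to pandas, here is a small sketch of the distinction. The frames reuse the names car_sales_cat and car_sales_num from this thread, but their columns and values are made up, and the eta helper is an assumed compact equivalent of the correlation ratio, not the library's code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two frames; column names and values are invented.
car_sales_cat = pd.DataFrame({"fuel": ["gas", "diesel", "gas", "gas"],
                              "body": ["sedan", "suv", "suv", "sedan"]})
car_sales_num = pd.DataFrame({"price": [13000.0, 21000.0, 18500.0, 12500.0],
                              "mpg": [31.0, 24.0, 22.0, 33.0]})

# .columns is an Index holding the column NAMES (strings) ...
names = car_sales_num.columns
print(list(names))

# ... while indexing with one name returns that column's VALUES as a Series.
prices = car_sales_num["price"]
print(prices.to_numpy())


def eta(categories, measurements):
    # Assumed compact correlation ratio for one categorical / one numeric column.
    y = pd.Series(np.asarray(measurements, dtype=float))
    groups = pd.Series(np.asarray(categories))
    between = ((y.groupby(groups).transform("mean") - y.mean()) ** 2).sum()
    total = ((y - y.mean()) ** 2).sum()
    return 0.0 if total == 0 else float(np.sqrt(between / total))


# One call per (categorical, numeric) pair, collected into a small table.
table = pd.DataFrame(
    {num: [eta(car_sales_cat[cat], car_sales_num[num]) for cat in car_sales_cat.columns]
     for num in car_sales_num.columns},
    index=car_sales_cat.columns,
)
print(table)
```

Passing `car_sales_num.columns` hands the function strings such as 'price', which is consistent with the "unsupported operand type(s) for /: 'str' and 'int'" error reported above.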
from dython.