maxhalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA

Home Page: https://maxhalford.github.io/prince

License: MIT License

Python 99.65% Makefile 0.35%

pandas pca ca mca python svd factor-analysis correspondence-analysis principal-component-analysis scikit-learn

prince's Introduction

Prince is a Python library for multivariate exploratory data analysis. It includes a variety of methods for summarizing tabular data, including principal component analysis (PCA) and correspondence analysis (CA). Prince provides efficient implementations behind a scikit-learn-style API.

Example usage

>>> import prince

>>> dataset = prince.datasets.load_decathlon()
>>> decastar = dataset.query('competition == "Decastar"')

>>> pca = prince.PCA(n_components=5)
>>> pca = pca.fit(decastar, supplementary_columns=['rank', 'points'])
>>> pca.eigenvalues_summary
          eigenvalue % of variance % of variance (cumulative)
component
0              3.114        31.14%                     31.14%
1              2.027        20.27%                     51.41%
2              1.390        13.90%                     65.31%
3              1.321        13.21%                     78.52%
4              0.861         8.61%                     87.13%

>>> pca.transform(dataset).tail()
component                       0         1         2         3         4
competition athlete
OlympicG    Lorenzo      2.070933  1.545461 -1.272104 -0.215067 -0.515746
            Karlivans    1.321239  1.318348  0.138303 -0.175566 -1.484658
            Korkizoglou -0.756226 -1.975769  0.701975 -0.642077 -2.621566
            Uldal        1.905276 -0.062984 -0.370408 -0.007944 -2.040579
            Casarsa      2.282575 -2.150282  2.601953  1.196523 -3.571794

>>> chart = pca.plot(dataset)

This chart is interactive, which doesn't show on GitHub. The green points are the column loadings.

>>> chart = pca.plot(
...     dataset,
...     show_row_labels=True,
...     show_row_markers=False,
...     row_labels_column='athlete',
...     color_rows_by='competition'
... )

Installation

pip install prince

🎨 Prince uses Altair for making charts.

Methods

flowchart TD
    cat?(Categorical data?) --> |"✅"| num_too?(Numerical data too?)
    num_too? --> |"✅"| FAMD
    num_too? --> |"❌"| multiple_cat?(More than two columns?)
    multiple_cat? --> |"✅"| MCA
    multiple_cat? --> |"❌"| CA
    cat? --> |"❌"| groups?(Groups of columns?)
    groups? --> |"✅"| MFA
    groups? --> |"❌"| shapes?(Analysing shapes?)
    shapes? --> |"✅"| GPA
    shapes? --> |"❌"| PCA

Principal component analysis (PCA)

Correspondence analysis (CA)

Multiple correspondence analysis (MCA)

Multiple factor analysis (MFA)

Factor analysis of mixed data (FAMD)

Generalized procrustes analysis (GPA)

Correctness

Prince is tested against scikit-learn and FactoMineR. For the latter, rpy2 is used to run code in R, and convert the results to Python, which allows running automated tests. See more in the tests directory.

Citation

Please use this citation if you use this software as part of a scientific publication.

@software{Halford_Prince,
    author = {Halford, Max},
    license = {MIT},
    title = {{Prince}},
    url = {https://github.com/MaxHalford/prince}
}

Support

I made Prince when I was at university, back in 2016. I've had very little time over the years to maintain this package. I spent a significant amount of time in 2022 to revamp the entire package. Prince has now been downloaded over 1 million times. I would be grateful to anyone willing to sponsor me. Sponsorships allow me to spend more time working on open source software, including Prince.

License

The MIT License (MIT). Please see the license file for more information.

prince's Issues

FAMD example does not work

The example provided to illustrate the Factor Analysis of Mixed Data (FAMD) does not work. The code is the following (same as in the documentation):

import pandas as pd
X = pd.DataFrame(
...     data=[
...         ['A', 'A', 'A', 2, 5, 7, 6, 3, 6, 7],
...         ['A', 'A', 'A', 4, 4, 4, 2, 4, 4, 3],
...         ['B', 'A', 'B', 5, 2, 1, 1, 7, 1, 1],
...         ['B', 'A', 'B', 7, 2, 1, 2, 2, 2, 2],
...         ['B', 'B', 'B', 3, 5, 6, 5, 2, 6, 6],
...         ['B', 'B', 'A', 3, 5, 4, 5, 1, 7, 5]
...     ],
...     columns=['E1 fruity', 'E1 woody', 'E1 coffee',
...              'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
...              'E3 fruity', 'E3 butter', 'E3 woody'],
...     index=['Wine {}'.format(i+1) for i in range(6)]
... )
X['Oak type'] = [1, 2, 2, 2, 1, 1]

import prince
famd = prince.FAMD(
...     n_components=2,
...     n_iter=3,
...     copy=True,
...     check_input=True,
...     engine='auto',
...     random_state=42
... )
famd = famd.fit(X.drop('Oak type', axis='columns'))

This throws the following error ValueError: could not convert string to float: 'A'.

If converting the columns E1 fruity, E1 woody and E1 coffee to categorical, the resulting error is instead ValueError: Not all columns in "Categorical" group are of the same type.

I am aware that the issues https://github.com/MaxHalford/prince/issues/26 and https://github.com/MaxHalford/prince/issues/27 refer to the same problem, and this should be fixed since version 0.4.3. However, I downloaded version 0.5.0 and it does not appear to be so.

Any help would be highly appreciated. Thanks

Chi square distances

Is there a way to get the chi square distances in CA between rows, like in formula (17.18) from Izenman book?
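
For reference, the usual chi-square distance between two rows of a contingency table compares their row profiles, weighting each column by the inverse of its mass. The sketch below uses plain NumPy rather than Prince; the function name and the toy counts are made up for illustration, and whether this matches formula (17.18) term for term is an assumption.

```python
import numpy as np


def chi_square_row_distances(table):
    """Pairwise chi-square distances between the rows of a contingency table."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()              # correspondence matrix
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    profiles = P / r[:, None]    # row profiles, each sums to 1
    # Dividing column j by sqrt(c_j) turns the chi-square distance
    # into an ordinary Euclidean distance.
    scaled = profiles / np.sqrt(c)
    diff = scaled[:, None, :] - scaled[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


# Toy contingency table (made-up counts, for illustration only).
toy = np.array([[20, 10, 5],
                [10, 10, 10],
                [5, 10, 20]])
D = chi_square_row_distances(toy)
print(np.round(D, 4))
```

The result is a symmetric matrix with zeros on the diagonal; entry (i, k) is the distance between rows i and k.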

How to reconstruct/recover the dataset after a FAMD transform? (inverse_transform-like)

Hi! I've started using the Prince package to apply FAMD to my binary and categorical dataset (btw, thanks @MaxHalford for this amazing library!!!).

Now I'm trying to find a way to reconstruct my original dataset (something like the inverse_transform function of sklearn's PCA), so I can check which points are in each cluster. But I could not find a way to do this, nor a function that gives me the original indexes.

Any ideas?

Thanks!!

FAMD implementation

Hi,

Any updates on FAMD? I'm trying to get some statistical analysis work done using Python, but unfortunately can't find many tools. I appreciate the effort you have put into this package though!

Problem with typing

Whenever I use MCA in this library, I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-8-44ba06c0d9cb> in <module>()
----> 1 newMCA = prince.mca.MCA(simulationFrame,n_components = 2)

/Users/michaelrosenberg/Library/Python/2.7/lib/python/site-packages/prince/mca.pyc in __init__(self, dataframe, n_components, use_benzecri_rates, plotter)
     40         )
     41 
---> 42         super(MCA, self).__init__(
     43             dataframe=pd.get_dummies(dataframe),
     44             n_components=n_components,

TypeError: must be type, not classobj

What is going on here?

Total Inertia = 0.0 using indicator matrix as input

I'm analyzing product category purchases by customers and attempting to use MCA to identify clusters of product categories that go together. My data set looks like:

	cat1	cat2	cat3	cat4	cat5
user1	1	0	1	1	0
user2	0	0	1	0	1
user3	0	1	1	0	0
user4	1	1	0	1	0
...	...	...	...	...	...

The shape of this df is 20,000 x 100

Because the instantiation of an MCA object converts the incoming df to dummies (and my data is effectively already dummified), I convert my 1 values to string and 0 values to np.nan. This enables MCA to build mca.X as an indicator matrix (ie the same matrix as what I am inputting).

The issue however is that mca.total_inertia = 0.0. This leads to divide by zero errors in many of the plotting functions. The error, as I understand it, comes from:

After executing mca = prince.MCA(df, n_components=-1)
I find:

mca.n_components = 100
mca.q = 100

which leads to

@property
    def total_inertia(self):
        """The total inertia."""
        return (self.n_columns - self.q) / self.q

resulting in mca.total_inertia = 0.0

Any suggestions or methods to manipulate my df and get this working? I think if q = 1, then this would work quite well, but I was unable to change the value of q as there is no setter method.
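
The formula quoted above gives zero whenever the number of indicator columns equals the number of original variables, which is what happens when every variable has a single observed level. A minimal sketch with pandas only (the helper name and the toy data are hypothetical, and this does not call Prince) shows how the encoding changes the count:

```python
import numpy as np
import pandas as pd

# Toy purchase matrix (made-up values): 1 = bought, 0 = not bought.
df = pd.DataFrame(
    {"cat1": [1, 0, 0, 1], "cat2": [0, 0, 1, 1], "cat3": [1, 1, 1, 0]},
    index=["user1", "user2", "user3", "user4"],
)


def indicator_inertia(categorical_df):
    """Return (J, Q, (J - Q) / Q) for the one-hot encoding of a categorical frame."""
    Q = categorical_df.shape[1]                    # number of original variables
    J = pd.get_dummies(categorical_df).shape[1]    # number of indicator columns
    return J, Q, (J - Q) / Q


# Encoding described in the report: 1 -> "1", 0 -> missing.
# Each variable then has a single level, so J == Q.
one_level = df.apply(lambda col: col.map({1: "1", 0: np.nan}))

# Alternative: keep both outcomes as explicit levels, so J == 2 * Q.
two_level = df.apply(lambda col: col.map({1: "yes", 0: "no"}))

print(indicator_inertia(one_level))
print(indicator_inertia(two_level))
```

With two explicit levels per variable the same formula gives 1.0 instead of 0.0. Whether treating "not bought" as its own category is appropriate for the analysis is a modelling decision, not something the sketch settles.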

Results have minor errors compared to step-by-step computation

Hi, I tried to implement CA myself and our results do not match. The principal coordinates and standard coordinates for both rows and columns always show a small numerical difference (around 7%, so not a round-off issue).

I spent a day checking this and compared the computations at every step. Apart from the coordinates, everything else matches (such as row_masses and standardized_residuals), and I can't see why the results don't match. I cannot even reproduce the output by extracting row_masses and standardized_residuals and doing the multiplication myself.

Please check whether the code can reproduce the hair-eye example on p. 649 of the following book (Alan J. Izenman, Modern Multivariate Statistical Techniques), where more accurate (1e-4) results are provided.
http://ce.aut.ac.ir/~shiry/lecture/Advanced%20Machine%20Learning/Manifold_Modern_Multivariate%20Statistical%20Techniques%20-%20Regres.pdf

Great work! Looking forward to the development of FAMD part!
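
For comparison, a textbook step-by-step CA can be written in a few lines of NumPy. This is a sketch of the standard SVD formulation, not Prince's code; the function name, the dictionary keys, and the toy counts are made up for illustration. Typical sources of mismatch between implementations are standard versus principal coordinates, singular values versus eigenvalues, and sign flips of individual axes, though which one applies here is not established.

```python
import numpy as np


def correspondence_analysis(table):
    """Textbook CA of a contingency table via SVD of the standardized residuals."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    row_standard = U / np.sqrt(r)[:, None]
    col_standard = Vt.T / np.sqrt(c)[:, None]
    return {
        "row_masses": r,
        "col_masses": c,
        "eigenvalues": sigma ** 2,               # principal inertias
        "row_standard": row_standard,
        "col_standard": col_standard,
        "row_principal": row_standard * sigma,   # standard coords scaled by singular values
        "col_principal": col_standard * sigma,
    }


# Toy contingency table (made-up counts, for illustration only).
toy = np.array([[20, 10, 5],
                [10, 10, 10],
                [5, 10, 20]])
res = correspondence_analysis(toy)
print(np.round(res["row_principal"], 4))
```

Two quick sanity checks are that the eigenvalues sum to the chi-square statistic divided by the grand total, and that the mass-weighted mean of the row principal coordinates is zero on every axis.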

FAMD doesn't work on Wine Data

I'm attempting to use FAMD on the Wine Data, which is all numeric. (As I don't see an example for FAMD -- this would be very helpful!)

Trying to fit the FAMD results in a ValueError:

import prince
import pandas as pd

X = pd.DataFrame(
    data=[
        [1, 6, 7, 2, 5, 7, 6, 3, 6, 7],
        [5, 3, 2, 4, 4, 4, 2, 4, 4, 3],
        [6, 1, 1, 5, 2, 1, 1, 7, 1, 1],
        [7, 1, 2, 7, 2, 1, 2, 2, 2, 2],
        [2, 5, 4, 3, 5, 6, 5, 2, 6, 6],
        [3, 4, 4, 3, 5, 4, 5, 1, 7, 5]
    ],
    columns=['E1 fruity', 'E1 woody', 'E1 coffee',
             'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
             'E3 fruity', 'E3 butter', 'E3 woody'],
    index=['Wine {}'.format(i+1) for i in range(6)]
)

famd = prince.FAMD()
famd.fit(X)

`---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
18
19 famd = prince.FAMD()
---> 20 famd.fit(X)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\prince\famd.py in fit(self, X, y)
35
36 # One-hot encode the categorical columns
---> 37 self.one_hot_ = one_hot.OneHotEncoder().fit(cat)
38
39 # Apply PCA to the indicator matrix

~\AppData\Local\Continuum\anaconda3\lib\site-packages\prince\one_hot.py in fit(self, X, y)
152 raise ValueError('X must be a pandas.DataFrame')
153
--> 154 self = super().fit(X)
155 self.column_names_ = list(itertools.chain(*[
156 [

~\AppData\Local\Continuum\anaconda3\lib\site-packages\prince\one_hot.py in fit(self, X, y)
51 "supported")
52
---> 53 X_temp = utils.check_array(X, dtype=None)
54 if not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_):
55 X = utils.check_array(X, dtype=np.object)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
468 " a minimum of %d is required%s."
469 % (n_features, shape_repr, ensure_min_features,
--> 470 context))
471
472 if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(6, 0)) while a minimum of 1 is required.`

back to py2

Hi,

I need Prince to work with Python 2, which unfortunately it does not.
I saw that the @ operator is not supported in Python 2, for example in ca.py:
data=X @ sparse.diags(self.row_masses_ ** -0.5) @ self.U_

I am not familiar with the @ operator.
Is it possible to rewrite it using np.matmul?
For example:
data=np.matmul(np.matmul(X, sparse.diags(self.row_masses_ ** -0.5)), self.U_)

I looked into going back to an old version, but I also use Prince in Python 3 and like the way it works now.
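
For dense NumPy arrays, the @ operator, np.matmul and the .dot method give the same product, so the rewrite is mechanical. The sketch below uses made-up stand-ins (X, masses, U) for the arrays in ca.py and replaces sparse.diags with a dense np.diag, because applying np.matmul directly to scipy.sparse objects is not the recommended route; with sparse operands the .dot method is the usual choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))            # stand-in for the data matrix
masses = rng.random(3) + 0.1      # stand-in for the masses (strictly positive)
U = rng.random((3, 2))            # stand-in for the singular vectors

D = np.diag(masses ** -0.5)       # dense diagonal scaling matrix

with_operator = X @ D @ U                          # Python 3.5+ only
with_matmul = np.matmul(np.matmul(X, D), U)        # no @ required
with_dot = X.dot(D).dot(U)                         # no @ required
with_broadcast = (X * masses ** -0.5).dot(U)       # avoids building D at all

print(np.allclose(with_operator, with_matmul))
```

The broadcasting form is worth noting because multiplying by a diagonal matrix on the right only rescales columns, so the diagonal matrix never has to be materialized.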

Mistake in examples documentation

There is a mistake in the examples documentation for FAMD. The following line

famd = mfa.fit(X.drop('Oak type', axis='columns')) # No need for 'Oak type'

should be

famd = famd.fit(X.drop('Oak type', axis='columns')) # No need for 'Oak type'

The FAMD example calls mfa where it should call famd.

Contributing to prince

Is there any way I can contribute to the project? I use this package a lot and I have added methods which I would like to be part of the original package.

FAMD does not work

X = pd.DataFrame(
    data=[
        ["A", "A", "A", 2, 5, 7, 6, 3, 6, 7],
        ["A", "A", "A", 4, 4, 4, 2, 4, 4, 3],
        ["B", "A", "B", 5, 2, 1, 1, 7, 1, 1],
        ["B", "A", "B", 7, 2, 1, 2, 2, 2, 2],
        ["B", "B", "B", 3, 5, 6, 5, 2, 6, 6],
        ["B", "B", "A", 3, 5, 4, 5, 1, 7, 5]
    ]
)
famd = FAMD(
    n_components=3,
    n_iter=3,
    copy=True,
    engine='auto',
    random_state=4
)
famd = famd.fit(X)
famd_result = famd.transform(X)

raises

ValueError: could not convert string to float: 'A'

Error with FAMD

So I am using the example given by you for FAMD, but I am getting an error:

C:\ProgramData\Anaconda3\lib\site-packages\scipy\_lib\_util.py in _asarray_validated(a, check_finite, sparse_ok, objects_ok, mask_ok, as_inexact)
239 if not objects_ok:
240 if a.dtype is np.dtype('O'):
--> 241 raise ValueError('object arrays are not supported')
242 if as_inexact:
243 if not np.issubdtype(a.dtype, np.inexact):

ValueError: object arrays are not supported

I am using version 0.5.2, and I have also tried converting the object columns to categories, but I still get the same error.

How to reconstruct the original data after MCA transformation

There is mca.eigenvalues_, which gives 2 values.
But this page mentions that eigenvectors are required to do that, and the README doesn't mention them.

Calling fit on FAMD object with numerical data only gives ValueError exception

Observed behavior: I used the example Iris dataset that is given in the documentation. Instead of creating a PCA object, I create a FAMD object. I call fit on it just like in the documentation and I get a Value error:

ValueError: Found array with 0 feature(s) (shape=(150, 0)) while a minimum of 1 is required.

Not sure exactly why this is happening.

Expected behavior: For purely numerical data (such as the Iris dataset) I would expect FAMD to produce results identical to PCA. I see the documentation says to use PCA in this case, so I guess it's a matter of interpretation as to what the expected behavior should be. But at the very least, a clearer exception such as 'use PCA for purely numerical data' would help. Anyway, please let me know your thoughts.
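
A check along these lines could produce the clearer message. This is a hypothetical helper written with pandas only, not part of Prince; the function name and the error wording are made up for illustration.

```python
import pandas as pd


def split_numeric_categorical(df):
    """Split a frame into numeric and non-numeric columns, failing early with a clear message."""
    num = df.select_dtypes(include="number")
    cat = df.select_dtypes(exclude="number")
    if cat.shape[1] == 0:
        raise ValueError(
            "No categorical columns found; use PCA for purely numerical data."
        )
    if num.shape[1] == 0:
        raise ValueError(
            "No numerical columns found; use MCA for purely categorical data."
        )
    return num, cat


# Made-up example frame with one numeric and one categorical column.
mixed = pd.DataFrame({"length": [5.1, 4.9, 6.3], "species": ["a", "a", "b"]})
num, cat = split_numeric_categorical(mixed)
print(list(num.columns), list(cat.columns))
```

Calling the helper on a frame with only numeric columns raises the ValueError before any array with zero features reaches the downstream validation.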

Error transforming test dataset

FAMD: TypeError: cannot concatenate object

Hi!
While trying to fit FAMD:

famd = prince.FAMD(n_components=2, n_iter=3, copy=True, check_input=True, engine='auto')
famd = famd.fit(df)

Where df contains the following dtypes:

ip                     object
app                    object
channel                object
DAY(click_time)         int64
YEAR(click_time)        int64
MONTH(click_time)       int64
WEEKDAY(click_time)     int64
users.device           object
users.os               object

The object df seems to be of the right type, <class 'pandas.core.frame.DataFrame'>, and so do its columns, but I get the following error:
TypeError: cannot concatenate object of type "<class 'scipy.sparse.csr.csr_matrix'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

Does it look like something coming from my side? Or is it a known bug in the library?

Best,

Docs depend on subdirectory ./source which does not exist

In ./docs subdirectory, Makefile references source directory to contain the Sphinx sources. EG:

ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
source subdir does not exist.

Permissions to push changes to my branch

Hi @MaxHalford , I have another feature that I want to add. However, I am unable to push the changes to my branch. Please advise. Thanks!

MCA: "ValueError: All values in X should be positive"

I have a dataframe like this:
restDf.head()

  | a | b | c | d
-- | -- | -- | -- | --
109 | 4 | 4 | 0
2 | 4 | 4 | 0
243 | 4 | 6 | 1
130 | 3 | 4 | 0
181 | 4 | 6 | 1

trying to run:

import prince
mca = prince.MCA()
test = mca.fit(restDf)

gives me:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-75-d72ed9bacef2> in <module>()
      2 
      3 mca = prince.MCA(n_components=2)
----> 4 test = mca.fit(restDf)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\prince\mca.py in fit(self, X, y)
     26 
     27         # Apply correspondence analysis to the indicator matrix
---> 28         super().fit(X)
     29 
     30         # Compute the total inertia

~\AppData\Local\Continuum\anaconda3\lib\site-packages\prince\ca.py in fit(self, X, y)
     24         # Check all values are positive
     25         if np.any(X < 0):
---> 26             raise ValueError("All values in X should be positive")
     27 
     28         if isinstance(X, pd.DataFrame):

ValueError: All values in X should be positive

This also happens when I run the program with the same dataframe where everything <=0 is replaced with a positive number:
restDf2.head()

  | a | b | c | d
-- | -- | -- | -- | --
109 | 4 | 4 | 42069
2 | 4 | 4 | 42069
243 | 4 | 6 | 1
130 | 3 | 4 | 42069
181 | 4 | 6 | 1

Any help?

Batch implementation of row_coordinates

The dataset I am working with is quite large (roughly 5M entries and about 50 features, most of which are categorical) so I can't fit all my data at once into the famd.fit function, therefore I use a subset of my data for the fitting, hoping it is representative enough of my entire dataset.

However, when I want to compute row_coordinates or call plot_row_coordinates, I would like to do it over the entire data, but I most often run into a MemoryError in mfa.py at l.80.

I guess sklearn tries to allocate a huge array somewhere in check_array. It would be convenient to have internal minibatch processing in row_coordinates so the user doesn't need to worry about it.
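
Until something like that exists, batching can be done on the caller's side. The sketch below is a generic helper, not a Prince function: it assumes the fitted object exposes a transform method that returns a DataFrame and treats each row independently. MeanCenterer is a made-up stand-in estimator used only so the example runs without Prince.

```python
import numpy as np
import pandas as pd


def transform_in_batches(model, X, batch_size=100_000):
    """Apply model.transform to consecutive row slices of X and stack the results.

    Assumes X has at least one row and that transforming a row depends only
    on the fitted state of the model, not on the other rows in the batch.
    """
    parts = [
        model.transform(X.iloc[start:start + batch_size])
        for start in range(0, len(X), batch_size)
    ]
    return pd.concat(parts)


class MeanCenterer:
    """Stand-in for a fitted estimator (hypothetical, for illustration only)."""

    def fit(self, X):
        self.mean_ = X.mean()
        return self

    def transform(self, X):
        return X - self.mean_


X = pd.DataFrame({"a": np.arange(10.0), "b": np.arange(10.0) ** 2})
model = MeanCenterer().fit(X)
batched = transform_in_batches(model, X, batch_size=3)
print(batched.shape)
```

Peak memory is then governed by batch_size rather than by the full dataset, at the cost of one concatenation at the end.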

Large sparse matrices and MemoryError

Hi, I am trying to use Prince MCA on the Dorothea dataset. I ran into a MemoryError when calculating the standardised residuals with np.diag(). Is it possible to substitute np.diag() with scipy.sparse.diags, which works with sparse matrices? Like this: 'S = diags(r ** -0.5) @ (X - np.outer(r, c)) @ diags(c ** -0.5)'? With this modification the code runs and the results are roughly the same, with minor differences. Any insight into problems with using scipy's diags?
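
Another way to avoid the n-by-n and m-by-m diagonal matrices is plain broadcasting, since multiplying by a diagonal matrix only rescales rows or columns. The sketch below uses NumPy only (so it is a different technique from scipy's diags) and made-up random counts; it checks that both forms agree. Note that np.outer(r, c) is itself a dense n-by-m array, so centering still densifies the data either way.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(1, 10, size=(6, 4)).astype(float)   # made-up counts
P = N / N.sum()
r = P.sum(axis=1)    # row masses
c = P.sum(axis=0)    # column masses

# Form that builds two dense diagonal matrices (n x n and m x m).
S_diag = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)

# Same quantity with broadcasting: rescale rows by 1/sqrt(r), columns by 1/sqrt(c).
S_bcast = (P - np.outer(r, c)) / np.sqrt(r)[:, None] / np.sqrt(c)[None, :]

print(np.allclose(S_diag, S_bcast))
```

The two results are mathematically identical, so any difference should be at floating-point level only.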

IndexingError: Too many indexers

Using example code:

import pandas as pd
import prince

df = pd.read_csv('ogm.csv')
mca = prince.MCA(df, n_components=-1)
mca.plot_rows(show_points=True, show_labels=False, color_by='Position Al A', ellipse_fill=True)

Error:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    mca.plot_rows(show_points=True, show_labels=False, color_by='Position Al A', ellipse_fill=True)
  File "/Users/hewgreen/anaconda/lib/python3.6/site-packages/prince/mca.py", line 154, in plot_rows
    ellipse_fill=ellipse_fill
  File "/Users/hewgreen/anaconda/lib/python3.6/site-packages/prince/plot/mpl/pca.py", line 30, in row_principal_coordinates
    data = principal_coordinates.iloc[:, axes].copy() # Active rows
  File "/Users/hewgreen/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1325, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/hewgreen/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1662, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/Users/hewgreen/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py", line 189, in _has_valid_tuple
    if not self._has_valid_type(k, i):
  File "/Users/hewgreen/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1599, in _has_valid_type
    return self._is_valid_list_like(key, axis)
  File "/Users/hewgreen/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1648, in _is_valid_list_like
    raise IndexingError('Too many indexers')
pandas.core.indexing.IndexingError: Too many indexers

I tried to find out how to fix this myself, to no avail. Many thanks in advance for your help.

Regards

hewgreen

MCA crashes with high dimensions

What was the largest number of dimensions you have tested this with?

Thanks, and kudos

standardization/Normalization in FAMD

Hi,
I'm looking into FAMD with Python. In PCA, standardizing the data before running the algorithm is very important (https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html), but how does this work in FAMD? Why can't we standardize the data before applying it? Does the algorithm already do this? If so, how is it done?

Thanks.
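
In the textbook formulation of FAMD, numerical columns are standardized to zero mean and unit variance, and each indicator column is divided by the square root of its category proportion before centering. The sketch below implements that preprocessing with pandas; it is an assumption that Prince follows the same steps internally, and the function name and toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def famd_preprocess(df):
    """Textbook FAMD scaling: z-score numeric columns, rescale one-hot columns."""
    num = df.select_dtypes(include="number")
    cat = df.select_dtypes(exclude="number")

    # Numeric part: zero mean, unit (population) variance.
    num_scaled = (num - num.mean()) / num.std(ddof=0)

    # Categorical part: one-hot encode, divide by sqrt(proportion), then center.
    dummies = pd.get_dummies(cat).astype(float)
    proportions = dummies.mean()
    cat_scaled = dummies / np.sqrt(proportions)
    cat_scaled = cat_scaled - cat_scaled.mean()

    return pd.concat([num_scaled, cat_scaled], axis=1)


# Made-up toy data, for illustration only.
toy = pd.DataFrame({
    "price": [10.0, 12.0, 9.0, 15.0],
    "weight": [1.0, 2.5, 2.0, 3.0],
    "color": ["red", "blue", "red", "green"],
})
out = famd_preprocess(toy)
print(out.round(3))
```

Under this scheme, standardizing the numerical columns beforehand does no harm, because standardizing already standardized data leaves it unchanged.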

mca.plot_rows_columns labeling and colorbar ticks are inconsistent

This script

import numpy as np
import prince
import pandas as pd
import matplotlib.pyplot as plt


s = pd.DataFrame({
    'A': pd.Categorical(np.random.choice(list('abcde'), size=(1000,)),
                        categories=['e', 'd', 'c', 'a', 'b']),
    'C': pd.Categorical(np.random.choice(list('abcde'), size=(1000,)),
                        categories=['e', 'd', 'c', 'a', 'b']),
    'B': pd.Categorical(np.random.choice(list('abcde'), size=(1000,)),
                        categories=['e', 'd', 'c', 'a', 'b']),
})
mca = prince.MCA(s)
fig, ax = mca.plot_rows_columns(show_column_labels=True)

plt.savefig('cat_label_bug.png')

Produces this figure

It looks like the annotations and the tick labels are inconsistent with each other (I'm guessing one goes in column order, while the other is alphabetical). I haven't dug into whether it's the annotations or the tick labels that are wrong.

Unreliable results while FAMD model is returned from a function.

Hi,

I did notice that when you return the FAMD transform object from a function, the next results are sometimes unreliable. The following code snippet elucidates the issue :

import prince
import pandas as pd
from sklearn import linear_model
cv_train = pd.read_csv('cv_training3.csv')
famd = prince.FAMD(
    n_components=25,
    n_iter=10,
    copy=True,
    check_input=True, 
    engine='auto',
    random_state=42)
x_cols = [col for col in cv_train.columns if col != 'LogSalePrice']
print(x_cols)

X = cv_train[x_cols]
famd_model = famd.fit(X)
transformed_X = famd_model.transform(X)

my_model = linear_model.LinearRegression()
my_model.fit(transformed_X, cv_train['LogSalePrice'].values.ravel())

cv_validation = pd.read_csv('cv_validation3.csv')

old_predictions = my_model.predict(transformed_X)
print(old_predictions.mean())

new_X = cv_validation[x_cols]

new_transformed_X = famd_model.transform(new_X)

new_predictions = my_model.predict(new_transformed_X)
print(new_predictions.mean())

We have the following output :

['MSSubClass', 'LogGrLivArea', 'MSZoning']
12.030180212569563
-2158687260190.6848

Now consider the following code snippet :

import pandas as pd
import prince
from sklearn import linear_model

def get_trained_model_and_transform(X, Y, num_components=25, num_iter=20):
    famd = prince.FAMD(
        n_components=num_components,
        n_iter=num_iter,
        copy=True,
        check_input=True, 
        engine='auto',
        random_state=42)

    famd_model = famd.fit(X)
    transformed_X = famd_model.transform(X)
    
    my_model = linear_model.LinearRegression()
    
    my_model.fit(transformed_X,Y)

    return (my_model, famd_model)

cv_train = pd.read_csv('cv_training3.csv')

x_cols = [col for col in cv_train.columns if col != 'LogSalePrice']
print(x_cols)

X = cv_train[x_cols]

(my_model, famd_model) = get_trained_model_and_transform(X, 
                                                         cv_train['LogSalePrice'].values.ravel())

cv_validation = pd.read_csv('cv_validation3.csv')
transformed_X = famd_model.transform(X)
old_predictions = my_model.predict(transformed_X)
print(old_predictions.mean())

new_X = cv_validation[x_cols]

new_transformed_X = famd_model.transform(new_X)

new_predictions = my_model.predict(new_transformed_X)
print(new_predictions.mean())

The output is the following :

['MSSubClass', 'LogGrLivArea', 'MSZoning']
12.0300579767497
11.946205619238583

The only difference between the 2 snippets, is that we have wrapped the transform and model building to a function in the second whereas we have not done that in the first (and the results are drastically different).

I am attaching the input data files here as well. Feel free to let me know if you need more information.

NOTE : For the sake of ease of testing, I have included both these snippets as two scripts (script1.py and script2.py ) in the attached folder. I have added a screenshot of the output on my mac terminal as well.

However, I also tested these two scripts on an online python interface (https://repl.it/languages/python3) and they look to be giving identical outputs. I am using the production version of prince module for these scripts and had to change my matplotlib backend to Agg on my mac to get this working on my mac. Would that be the reason for the different result ?

data.zip

Thanks

How does FAMD identify the principal components?

Hi,
How do I specify what are the principal components as shown in the MFAD? (e.g. principal component 1 is Oak type 1 and principal component 2 is Oak Type 2).
For example, in the df below, I wish to set principal component 1 as "Donor is Teacher" == Yes and principal component 2 as "Donor is Teacher" == No. How do I specify that?

Thank you.

Donor State	Donor Is Teacher	Project Grade Level Category	School Metro Type	School State	# Donations Received	Teacher Project Posted Sequence	Project Subject Category Tree	Project Subject Subcategory Tree	Project Resource Category	Resource Vendor Name	Amount Needed
Oklahoma	No	Grades 6-8	suburban	Oklahoma	8	5	Math & Science	Applied Sciences, Health & Life Science	Supplies	Carolina Biological Supply Company,Carolina Bi...	738.15
Oklahoma	No	Grades 9-12	suburban	Oklahoma	6	2	Music & The Arts	Visual Arts	Supplies	Amazon Business,Amazon Business	18.39
Maryland	Yes	Grades PreK-2	urban	Maryland	5	20	Special Needs, Music & The Arts	Special Needs, Visual Arts	Supplies	Amazon Business,Amazon Business,Amazon Busines...	10.59
Maryland	Yes	Grades PreK-2	urban	Maryland	2	11	Applied Learning, Special Needs	Early Development, Special Needs	Books	AKJ Education,AKJ Education,AKJ Education,AKJ ...	283.19
Massachusetts	No	Grades 3-5	urban	Massachusetts	1	3	Health & Sports	Health & Wellness, Team Sports	Sports & Exercise Equipment	School Specialty,Staples Advantage,School Spec...	322.88

`KeyError 0` exception triggered at `CA.column_coordinates()`

The exception happens when the argument X is a pandas.DataFrame whose index is not integer (or it is integer but does not start at 0).

A potential simple fix is to update line 125 in ca.py so it reads as follows:

    data=X @ sparse.diags(self.row_masses_.values ** -0.5) @ self.U_,

While I did not encounter it, it is possible the same bug happens in CA.row_coordinates()
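
The underlying pandas behavior can be reproduced without Prince. A Series keeps its index through arithmetic, integer lookups on an integer index are label-based, and operations between two Series align on labels. The sketch below uses made-up values and is only meant to show why taking .values (or .to_numpy()) sidesteps the problem.

```python
import numpy as np
import pandas as pd

# Masses indexed by labels that do not start at 0 (made-up values).
masses = pd.Series([0.2, 0.3, 0.5], index=[10, 11, 12])

# Label-based lookup: there is no label 0, so this raises KeyError.
try:
    masses[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Arithmetic aligns on labels: no label is shared, so every result is NaN.
other = pd.Series([1.0, 2.0, 3.0])          # default index 0, 1, 2
aligned = masses * other

# Dropping the index gives purely positional behavior.
plain = masses.to_numpy() * other.to_numpy()
scaling = np.diag(masses.to_numpy() ** -0.5)

print(lookup_failed, aligned.isna().all(), plain)
```

Converting to a NumPy array before building the diagonal matrix therefore makes the computation independent of how the DataFrame is indexed.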

Calling column_correlations on FAMD gives ValueError

Observed behavior: Using the wine dataset example that is given in the documentation, I call column_correlations method of the FAMD object and get the following ValueError exception:

All the input array dimensions except for the concatenation axis must match exactly

Not sure exactly why this is happening.

Expected behavior: The column correlations to be returned since the shape of the X vector parameter passed into the method is (6, 10).

MCA is broken, can't call init

using the provided example code:

In [1]: import prince

In [2]: mca = prince.MCA(
   ...:      n_components=2,
   ...:      n_iter=3,
   ...:      copy=True,
   ...:      engine='auto',
   ...:      random_state=42)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-1a9495ebebd9> in <module>()
      4      copy=True,
      5      engine='auto',
----> 6      random_state=42)

TypeError: __init__() got an unexpected keyword argument 'engine'

In [3]:

ca.total_inertia_ > 1.0

Hi,

I was surprised to get a ca.total_inertia_ equal to 1.45.
If I np.sum(ca.explained_inertia_), I then get something very close to 1.

I thought that 1 was supposed to be the correct answer if we ask for a lot of components.
Am I misunderstanding something?
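
In CA, the total inertia equals the chi-square statistic divided by the grand total, and it is bounded by min(rows, columns) - 1 rather than by 1. The per-axis shares of that total are what sum to 1. The NumPy sketch below illustrates this on a made-up table; it assumes that explained_inertia_ holds eigenvalues divided by the total inertia, which is consistent with the sum reported above but not confirmed here.

```python
import numpy as np


def ca_inertia(table):
    """Return (total inertia, share of inertia per axis) for a contingency table."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    eigenvalues = np.linalg.svd(S, compute_uv=False) ** 2
    total = eigenvalues.sum()
    return total, eigenvalues / total


# Made-up table with perfect row/column association.
total, explained = ca_inertia(np.diag([10, 10, 10]))
print(total)             # 2.0 up to floating-point error, i.e. min(3, 3) - 1
print(explained.sum())   # the shares sum to 1
```

So a total inertia of 1.45 together with explained shares summing to roughly 1 is not contradictory: the first is an absolute quantity, the second a set of ratios.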

Using FAMD for numerical and binary data

Hey,

I am working with a dataset with a combination of continuous (~35%) and binary variables (1/0 representing Y/N) (~65%).

The binary variables were generated in a data preparation phase where I extracted information from categorical variables. They aren't strictly groups of dummy variables as there is some overlap.

My understanding is that PCA isn't fully appropriate here and using FAMD is the way to go, but since the data is already encoded as numerical variables the FAMD doesn't work.

Looking at the underlying code it seems to split numerical and categorical columns into two separate "groups" then runs MFA on both. In order to get around the error thrown back can I just manually run MFA supplying the numerical and binary columns as separate groups?

TypeError: must be type, not classobj (in Python 2.7)

Here is my code
import matplotlib.pyplot as plt
import pandas as pd
from prince import MCA

df = pd.read_csv('data/ogm.csv')
mca = MCA(df, n_components=-1)

Exception

TypeError Traceback (most recent call last)
in ()
4
5 df = pd.read_csv('data/ogm.csv')
----> 6 mca = MCA(df, n_components=-1)

/home/sambhu/env-attrition/local/lib/python2.7/site-packages/prince-0.2.6-py2.7.egg/prince/mca.pyc in __init__(self, dataframe, n_components, use_benzecri_rates, plotter)
40 )
41
---> 42 super(MCA, self).__init__(
43 dataframe=pd.get_dummies(dataframe),
44 n_components=n_components,

TypeError: must be type, not classobj

FAMD modifies original DataFrame

Hi,
As it says in the title already, FAMD modifies original DataFrame.

Example:

df = pd.DataFrame(
    data=[
        [1, 1, 1, 2, 5, 7],
        [1, 1, 1, 4, 4, 4],
        [2, 1, 2, 5, 2, 1],
        [2, 1, 2, 7, 2, 1],
        [2, 2, 2, 3, 5, 6],
        [2, 2, 1, 3, 5, 4]
    ]
)
for c in range(3):
    df[c] = df[c].astype("object")
famd = FAMD(
    n_components=3,
    n_iter=3,
    copy=True,
    engine='auto',
    random_state=4
)
famd = famd.fit(df)
famd_result = famd.transform(df)

famd_result contains the result (as expected).
df however was modified. My current workaround copies the DataFrame, before passing it to fit() and transform().

Is this modification of df intended?

got multiple values for argument 'n_components'

I was trying to run the CA.
I started by the 'presidentielles' example and got this error:

Traceback (most recent call last):

File "", line 5, in
engine='auto'

TypeError: __init__() got multiple values for argument 'n_components'
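
One plausible cause, assuming the example was written for an older constructor that took the dataframe as its first argument: if the constructor now starts with `n_components`, a positional dataframe fills that slot and the keyword then collides with it. The `CA` class below is a hypothetical stand-in used to reproduce the message in plain Python, not prince's actual class.

```python
class CA:
    """Hypothetical estimator whose first parameter is n_components."""

    def __init__(self, n_components=2, n_iter=3, engine="auto"):
        self.n_components = n_components
        self.n_iter = n_iter
        self.engine = engine

    def fit(self, X):
        self.n_rows_ = len(X)
        return self


table = [[10, 20], [30, 40]]

# Old-style call: the data is passed positionally, so it is bound to
# n_components, and the explicit keyword then clashes with it.
try:
    CA(table, n_components=2, n_iter=3, engine="auto")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# scikit-learn style call: hyperparameters in the constructor, data in fit().
model = CA(n_components=2, n_iter=3, engine="auto").fit(table)
```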

MCA vs OneHot+SVD

Hi, not reporting a bug, but was just wondering if you could explain or point me to a reference on how MCA is different from just doing a one-hot encoding followed by SVD? From the code I can see that the difference is a normalization you apply to the onehot matrix first, but I'm not sure what the effect of this normalization is.

Also, would it make sense to apply MCA or onehot+SVD to a single categorical column? Asking because in that case all the rows would be "orthogonal" and I'm not sure if any useful correlation/structure can be found in that case?
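
To make the comparison concrete, here is a minimal NumPy sketch of the textbook correspondence-analysis normalisation applied to a one-hot matrix. It is not taken from prince's source and may differ from it in details; the function name `mca_singular_values` is made up for this example. The matrix is divided by its grand total, the product of row and column masses is subtracted (the expected value under independence), and each cell is scaled by the square root of that product before the SVD.

```python
import numpy as np
import pandas as pd


def mca_singular_values(df):
    """Singular values of the standardised residuals of the one-hot matrix."""
    Z = pd.get_dummies(df).to_numpy(dtype=float)
    P = Z / Z.sum()                # correspondence matrix
    r = P.sum(axis=1)              # row masses
    c = P.sum(axis=0)              # column masses
    expected = np.outer(r, c)      # what independence would predict
    S = (P - expected) / np.sqrt(expected)
    return np.linalg.svd(S, compute_uv=False)


two_columns = pd.DataFrame({
    "colour": ["r", "g", "b", "r", "g", "b"],
    "size": ["s", "l", "s", "l", "s", "l"],
})
s_two = mca_singular_values(two_columns)
# Total inertia equals (number of categories / number of variables) - 1.
total_inertia = float((s_two ** 2).sum())

one_column = two_columns[["colour"]]
s_one = mca_singular_values(one_column)
```

With a single column, every non-trivial singular value equals 1, so no direction is preferred over another and there is no structure to extract, which matches the intuition in the question. A plain SVD of the raw one-hot matrix skips the centring and the mass weighting, so frequent categories dominate its leading components.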

DOC: MCA argument. Contingency table or raw observations?

In https://github.com/MaxHalford/Prince/blob/a854849c9e1ee2b331bb0fd4f0f939b1c01b0f01/prince/mca.py#L13 it says that MCA takes a dataframe that's a contingency table. From the examples (like mca-ogm.py) it looks like MCA takes the actual N obs x K variables DataFrame, not the summarized contingency table. Or I could be confused.

Overall, this looks great! One code comment: in places like here where you're filtering by dtype, you might be able to use DataFrame.select_dtypes.
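
For reference, a small example of the `DataFrame.select_dtypes` suggestion. The column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [1.7, 1.8],
    "count": [3, 5],
    "colour": ["red", "blue"],
})

# Split the frame by dtype without looping over columns by hand.
numeric = df.select_dtypes(include="number")
categorical = df.select_dtypes(exclude="number")
```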

Two potential issues with MFA.py

(I'm a novice at Python so apologies if these are not correct observations)

In MFA.py, in line 55, 'all_num' is static since it is set in a prior loop in line 50. Should you either add 'all_num = all_nums[name]' or merge the two loops (line 45 and line 54)?
On line 88, you have "X[cols] / self.partial_factor_analysis_[name].s_[0]", but it does not seem to work with categorical pandas dataframes as X[cols] is not numeric and cannot be divided by a scalar

Kernel dies when applying FAMD to a large dataset

I have a dataset X with mixed data. The shape of X is (110000, 24).
I ran the following code on this dataset and the kernel died. Any suggestions? Thanks!

famd = prince.FAMD(
    n_components=24,
    n_iter=10,
    copy=True,
    engine='auto',
    random_state=42
)
famd = famd.fit(X)

Is there a way to transform new data after fitting with FAMD?

Hello,

I just discovered this package and it seems very interesting. Is there a way to apply the transform function to new, unseen data after calling FAMD fit, analogous to how PCA works in sklearn?

When I try to do this I get an error:

X)
102 X = self.scaler_.transform(X)
103
--> 104 return pd.DataFrame(data=X.dot(self.V_.T), index=index)
105
106 def row_standard_coordinates(self, X):

ValueError: shapes (2,20) and (49,2) not aligned: 20 (dim 1) != 49 (dim 0)

Basically it looks like it doesn't understand there are a different number of "training examples" as opposed to when the fit occurred.

Cheers,

Kuhan
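
The shape mismatch (20 columns against 49) would be consistent with the new rows being one-hot encoded on their own, so that categories absent from the new data produce no columns. That cause is an assumption, not something the traceback confirms. Under that assumption, a generic pandas-only sketch of aligning the encoded columns with those seen at fit time looks like this; `encode_like_training` is a hypothetical helper name.

```python
import pandas as pd


def encode_like_training(new_df, training_columns):
    """One-hot encode new rows, then align them to the training columns."""
    encoded = pd.get_dummies(new_df, dtype=int)
    # Categories missing from the new rows become all-zero columns;
    # categories never seen during training are dropped.
    return encoded.reindex(columns=training_columns, fill_value=0)


train = pd.DataFrame({
    "colour": ["red", "green", "blue"],
    "size": ["S", "M", "L"],
})
train_encoded = pd.get_dummies(train, dtype=int)

new = pd.DataFrame({"colour": ["red", "red"], "size": ["S", "L"]})
naive_encoded = pd.get_dummies(new, dtype=int)
aligned = encode_like_training(new, train_encoded.columns)
```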

pandas.core.indexing.IndexingError: Too many indexers

Occurs when running the example pca-iris.py:

(py3) ➜  examples git:(master) python pca-iris.py 

Traceback (most recent call last):
  File "pca-iris.py", line 11, in <module>
    fig2, ax2 = pca.plot_rows(color_by='class', ellipse_fill=True)
  File "/storage/anaconda2/envs/py3/lib/python3.6/site-packages/prince/pca.py", line 287, in plot_rows
    ellipse_fill=ellipse_fill
  File "/storage/anaconda2/envs/py3/lib/python3.6/site-packages/prince/plot/mpl/pca.py", line 30, in row_principal_coordinates
    data = principal_coordinates.iloc[:, axes].copy() # Active rows
  File "/storage/anaconda2/envs/py3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1325, in __getitem__
    return self._getitem_tuple(key)
  File "/storage/anaconda2/envs/py3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1662, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/storage/anaconda2/envs/py3/lib/python3.6/site-packages/pandas/core/indexing.py", line 189, in _has_valid_tuple
    if not self._has_valid_type(k, i):
  File "/storage/anaconda2/envs/py3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1599, in _has_valid_type
    return self._is_valid_list_like(key, axis)
  File "/storage/anaconda2/envs/py3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1648, in _is_valid_list_like
    raise IndexingError('Too many indexers')
pandas.core.indexing.IndexingError: Too many indexers
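
The failing line is `principal_coordinates.iloc[:, axes]`. If `axes` is a tuple such as `(0, 1)`, which is an assumption because the traceback does not show its value, pandas refuses to treat it as a column indexer. Converting it to a list before indexing avoids the ambiguity. The frame below is made-up data.

```python
import pandas as pd

principal_coordinates = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
)
axes = (0, 1)  # assumed value; a tuple here is what pandas rejects

# A list is an unambiguous positional column selection.
data = principal_coordinates.iloc[:, list(axes)].copy()
```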

ValueError: array must not contain infs or NaNs

When I try CA, the algorithm stops while computing the SVD and I get a ValueError :/
I verified that I have no NaN or inf values.
My dataframe is the result of a TFvectorizer with a dictionary, so the values in the sparse matrix (used as a dataframe) are greater than or equal to zero.

print(pd.__version__) 0.20.3
print(np.__version__) 1.13.3
print(scipy.__version__) 1.0.0
Python 3.6.4
Windows 64bits

When I run the SVD decomposition with scipy it works, but not with prince, which is surprising.

Could anybody help me, please?
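
One possible explanation, stated as an assumption: correspondence analysis divides by row and column totals, so a term-count matrix with an all-zero row or column yields NaN during normalisation even though the input itself contains none. A NumPy sketch of detecting and dropping such rows and columns, with a hypothetical helper name, follows.

```python
import numpy as np


def drop_empty_rows_and_columns(X):
    """Remove rows and columns whose total is zero."""
    X = np.asarray(X, dtype=float)
    keep_rows = X.sum(axis=1) > 0
    keep_cols = X.sum(axis=0) > 0
    return X[keep_rows][:, keep_cols]


counts = np.array([
    [1, 0, 2],
    [0, 0, 0],   # a document with no known terms
    [3, 0, 1],
])

# Row profiles of the raw matrix: the empty row becomes 0/0 = NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    raw_profiles = counts / counts.sum(axis=1, keepdims=True)

cleaned = drop_empty_rows_and_columns(counts)
clean_profiles = cleaned / cleaned.sum(axis=1, keepdims=True)
```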

Global PCA in MFA for categorical fields

Thanks for fixing the previous issue so quickly. There seems to still be an issue with line 76 in MFA.py when "super().fit(self._build_X_global(X))" is running the .fit() for PCA:

For categorical groups, _build_X_global retains the columns as they are (line 90 in MFA.py)
However, line 40 of PCA.py (within the .fit() method) and below assumes the dataframe is numeric

You can recreate the issue by using the sample data (and ignoring the scaling):
import pandas as pd

X = pd.DataFrame(
    data=[
        ["A", "A", "A", 2, 5, 7, 6, 3, 6, 7],
        ["A", "A", "A", 4, 4, 4, 2, 4, 4, 3],
        ["B", "A", "B", 5, 2, 1, 1, 7, 1, 1],
        ["B", "A", "B", 7, 2, 1, 2, 2, 2, 2],
        ["B", "B", "B", 3, 5, 6, 5, 2, 6, 6],
        ["B", "B", "A", 3, 5, 4, 5, 1, 7, 5]
    ],
    columns=['E1 fruity', 'E1 woody', 'E1 coffee',
             'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
             'E3 fruity', 'E3 butter', 'E3 woody'],
    index=['Wine {}'.format(i+1) for i in range(6)]
)

Thanks!

MemoryError issue

My machine has 120 GB of memory, of which 40 GB is available for the MCA computation.

The DataFrame has a shape of (1244210, 37), and I have processed it with the get_dummies() function in pandas.

I want to get 10 components, but I get a MemoryError:

>>> mca_result = prince.MCA(X_MCA, n_components=10)
MemoryError                               Traceback (most recent call last)
<ipython-input-20-ee2308cc121f> in <module>()
----> 1 mca_result = prince.MCA(X_MCA, n_components=10)

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/mca.py in __init__(self, dataframe, n_components, use_benzecri_rates, plotter)
     43             dataframe=pd.get_dummies(dataframe),
     44             n_components=n_components,
---> 45             plotter=plotter
     46         )
     47 

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in __init__(self, dataframe, n_components, plotter)
     26         self._set_plotter(plotter_name=plotter)
     27 
---> 28         self._compute_svd()
     29 
     30     def _compute_svd(self):

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in _compute_svd(self)
     29 
     30     def _compute_svd(self):
---> 31         self.svd = SVD(X=self.standardized_residuals, k=self.n_components)
     32 
     33     def _set_plotter(self, plotter_name):

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in standardized_residuals(self)
    123         """
    124         residuals = (self.P - self.expected_frequencies).values
--> 125         return self.row_masses.dot(residuals).dot(self.column_masses)
    126 
    127     @property

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in row_masses(self)
     99             represents the weight of the matching row; the non-diagonal cells are equal to 0.
    100         """
--> 101         return np.diag(1 / np.sqrt(self.row_sums))
    102 
    103     @property

/home/libertatis/anaconda3/lib/python3.6/site-packages/numpy/lib/twodim_base.py in diag(v, k)
    247     if len(s) == 1:
    248         n = s[0]+abs(k)
--> 249         res = zeros((n, n), v.dtype)
    250         if k >= 0:
    251             i = k

MemoryError:

I have 40 GB of memory available, and I can apply PCA to the same DataFrame. How can I solve this?

I found a similar issue on this problem: esafak/mca#15
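
The traceback ends in `np.diag(1 / np.sqrt(self.row_sums))`, which builds a dense square matrix with one row and one column per observation. With 1,244,210 rows and 8-byte floats, that is about 12 TB, far beyond the available 40 GB. Multiplying by a diagonal matrix only scales rows or columns, so broadcasting gives the same result without materialising the matrix. The sketch below uses small random arrays and is not a patch for prince.

```python
import numpy as np

n_rows = 1_244_210
dense_diag_bytes = n_rows ** 2 * 8   # float64 entries of an n x n matrix

rng = np.random.default_rng(0)
residuals = rng.random((5, 3))
row_sums = rng.random(5) + 0.1
col_sums = rng.random(3) + 0.1

# Approach from the traceback: explicit diagonal matrices.
with_diag = (
    np.diag(1 / np.sqrt(row_sums)) @ residuals @ np.diag(1 / np.sqrt(col_sums))
)

# Equivalent scaling with broadcasting: O(rows * cols) memory only.
with_broadcast = (
    residuals / np.sqrt(row_sums)[:, None] / np.sqrt(col_sums)[None, :]
)
```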

IndexingError: Too many indexers

When I run this code in a virtualenv, I get an error:

import matplotlib.pyplot as plt
import pandas as pd
import prince

df = pd.read_csv('data/ogm.csv')
mca = prince.MCA(df, n_components=-1)
fig1, ax1 = mca.plot_cumulative_inertia()
fig2, ax2 = mca.plot_rows(show_points=True, show_labels=False, color_by='Position Al A', ellipse_fill=True)
fig3, ax3 = mca.plot_rows_columns()
fig4, ax4 = mca.plot_relationship_square()
plt.show()

===========Exception===============

   IndexingError                             Traceback (most recent call last)

in ()
10
11 fig1, ax1 = mca.plot_cumulative_inertia()
---> 12 fig2, ax2 = mca.plot_rows(show_points=True, show_labels=False, color_by='Position Al A')
13 fig3, ax3 = mca.plot_rows_columns()
14 fig4, ax4 = mca.plot_relationship_square()

~/envname/lib/python3.4/site-packages/pandas/core/indexing.py in _is_valid_list_like(self, key, axis)
1646 # so don't treat a tuple as a valid indexer
1647 if isinstance(key, tuple):
-> 1648 raise IndexingError('Too many indexers')
1649
1650 # coerce the key to not exceed the maximum size of the index

 IndexingError: Too many indexers

Unable to transform test data after MCA fitting the training data

import pandas as pd
import prince

data = pd.read_csv("data/training set.csv")
X = data.loc[:, 'OS.1':'DSA.1']

df = pd.DataFrame(X)

mca = prince.MCA(
    n_components=2,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='auto',
    random_state=42
)

mca = mca.fit(df)

df_new = df.loc[0:5, :]
I = mca.transform(df_new)
print(I)

Output:
File "C:/../clustering/k means.py", line 62, in
I = mca.transform(df_new)
File "C:..\clustering\interpreter2\lib\site-packages\prince\mca.py", line 47, in transform
return self.row_coordinates(X)
File "C:..\clustering\interpreter2\lib\site-packages\prince\mca.py", line 37, in row_coordinates
return super().row_coordinates(self.one_hot_.transform(X))
File "C:..\clustering\interpreter2\lib\site-packages\prince\ca.py", line 111, in row_coordinates
X = X / X.sum(axis=1)
File "C:\python36\lib\site-packages\scipy\sparse\base.py", line 1015, in sum
np.ones((n, 1), dtype=res_dtype))
File "C:\python36\lib\site-packages\scipy\sparse\base.py", line 499, in __mul__
result = self._mul_vector(np.ravel(other))
File "C:\python36\lib\site-packages\scipy\sparse\coo.py", line 571, in _mul_vector
other.dtype.char))
File "C:\python36\lib\site-packages\scipy\sparse\sputils.py", line 60, in upcast_char
t = upcast(*map(np.dtype, args))
File "C:\python36\lib\site-packages\scipy\sparse\sputils.py", line 52, in upcast
raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('O'), dtype('O'))

This is what the data looks like:

print(df_new)

output:
    0  1   2   3
0  9  8   9   9
1  8  7   8   6
2  8  7   9   9
3  8  7   9   9
4  8  7   8   7
5  9  8  10  10

python 3.6.4
scikit 0.20.2
numpy 1.16.1
pandas 0.24.1

problem with FAMD

Hi ,
When I try to apply FAMD to my mixed dataset (it contains continuous and categorical variables), an error message appears:

ValueError: FAMD works with categorical and numerical data but you only have categorical data; you should consider using MCA

I'm sure I have mixed data, but I don't know what the problem is!
Any help, please?
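
A first thing to check, assuming the numeric columns were loaded as text (a common outcome when reading a CSV with stray characters), is what pandas reports as the dtype of each column. The data and column names below are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["23", "35", "41"],        # numbers stored as strings
    "city": ["Paris", "Lyon", "Paris"],
})

# Before conversion no column counts as numeric.
numeric_before = df.select_dtypes(include="number").columns.tolist()

# Convert the columns that are meant to be continuous.
df["age"] = pd.to_numeric(df["age"])
numeric_after = df.select_dtypes(include="number").columns.tolist()
```

Printing `df.dtypes` right before fitting shows which columns would be treated as categorical.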

MCA not working (TypeError: __init__() got an unexpected keyword argument 'categories')

MCA() throws an error when calling the fit() method on a categorical dataframe.

MCA with categorical vs ordinal data

Hi, thanks for a terrific package.

I have a question regarding the use of categorical vs ordinal data with MCA. What are your suggestions for using prince in such cases?

I found an R package called homals (github), where they restrict the matrices differently for each data type (nominal/categorical, ordinal, numerical).

The package implements a generalized version of MCA called homogeneity analysis (paper). The calculations regarding restricting the data types (or levels as they call it) start on page 6 (Level constraints: Optimal scaling).

Would it be possible to take it into account in your package as well?

Cheers.