acclab / dabest-python

Data Analysis with Bootstrapped ESTimation

Home Page: https://acclab.github.io/DABEST-python/

License: Apache License 2.0

Python 6.47% Jupyter Notebook 93.26% CSS 0.27%

data-analysis data-visualization estimation python statistics

dabest-python's Issues

compare all condition-condition pairs?

If I understand correctly, the package only supports comparing all conditions to one reference condition.

I thought it would be really neat if dabest could plot more than one reference condition. For example, I think it would be useful to do all pairwise comparisons (compare condition i vs. condition j for all i, j such that i != j). One possible layout is to use the 0-th row to plot the raw data, and plot a grid of mean difference distributions underneath, where the i-th row uses the i-th condition as the reference condition.
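The enumeration of comparisons proposed here can be sketched with the standard library alone. The condition names are placeholders, and passing each pair to `dabest.load(..., idx=pair)` one at a time is only an assumption about how the existing API might be driven; it is not a built-in grid feature.

```python
from itertools import permutations

# Hypothetical condition names; replace with the labels in your own data.
conditions = ["A", "B", "C"]

# Every ordered (reference, test) pair with reference != test.
pairs = list(permutations(conditions, 2))

# Group the pairs by reference, mirroring the proposed grid layout in which
# row i uses condition i as the reference.
rows = {ref: [pair for pair in pairs if pair[0] == ref] for ref in conditions}

for ref, row in rows.items():
    print(ref, row)
```

Each entry of `rows` corresponds to one row of the suggested grid of mean difference distributions.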

Thanks!

Adjusting font size on legends

I'm loving this plotting and analysis pipeline, and am trying to incorporate it into my own work. I'm working on building a script in Python to streamline and make plots that fit my general presentation style for presentations and publications. Currently, as I resize the plot the axis text becomes too large and begins overlapping (example attached). I've included the code I'm using to generate this, and any tips on where/how to revise this to change axis text would be greatly appreciated! I'd like to resize the text, but could also foresee wanting to change the orientation of the text and removing the "n=".

# Load data into DABEST as a paired dataset
data = dabest.load(dataset, idx=(cntl_dataset, expt_dataset), id_col='Image', paired=True)

# Plot results
fig1 = data.mean_diff.plot(
    fig_size=(3.5, 3),
    dpi=150,
    reflines_kwargs={'linestyle': 'dashed',
                     'linewidth': .8,
                     'color': 'black'},
    swarm_label=yaxis_label,
    show_pairs=False,
    swarm_desat=1,
    group_summaries='mean_sd',
    group_summaries_offset=0.15,
    swarmplot_kwargs={'size': 5},
    group_summary_kwargs={'lw': 3, 'alpha': 0.8},
    float_contrast=True,
    contrast_label='paired mean difference',
    es_marker_size=5,
    halfviolin_desat=1,
    halfviolin_alpha=0.8,
    violinplot_kwargs={'widths': 0.5},
)
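One possible approach to the text-size question, sketched under the assumption that the object returned by `.plot()` is an ordinary matplotlib Figure (check this against your installed version). The helper `shrink_text` is hypothetical and uses only standard matplotlib calls; a plain two-panel figure stands in for the DABEST output.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt


def shrink_text(fig, tick_size=7, label_size=8, rotation=0):
    """Resize tick and axis-label text on every Axes of `fig`."""
    for ax in fig.axes:
        ax.tick_params(axis="both", labelsize=tick_size)
        ax.tick_params(axis="x", labelrotation=rotation)
        ax.xaxis.label.set_fontsize(label_size)
        ax.yaxis.label.set_fontsize(label_size)
    return fig


# Stand-in figure with two Axes; with DABEST this would be the returned figure.
fig, axes = plt.subplots(1, 2, figsize=(3.5, 3))
axes[0].set_ylabel("value")
axes[1].set_ylabel("paired mean difference")
shrink_text(fig, tick_size=7, label_size=8, rotation=45)
```

Because the helper loops over `fig.axes`, it treats the raw-data panel and the contrast panel the same way.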

pairing mean_diff with Mann-Whitney test

It seems to me that there's a conceptual issue with the default behavior shown in the tutorial: the documentation there states "By default, DABEST will report the two-sided p-value of the most conservative test that is appropriate for the effect size.", which in the example means that the mean_diff effect size reports a Mann-Whitney test p-value.
However, the Mann-Whitney test is fundamentally unrelated to the mean difference in general, corresponding to the median difference instead. It's possible that a positive mean difference and negative median difference could both be valid conclusions for the right data!
Maybe the defaults could restrict to pairing mean- and median-based effect sizes and tests?
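The claim that the mean difference and the median difference can point in opposite directions is easy to check with toy numbers. The values below are invented purely for illustration.

```python
from statistics import mean, median

# Toy data: one large value in the test group pulls its mean up,
# while its median stays below the control median.
control = [1, 1, 1, 1, 1]
test = [0, 0, 0, 0, 10]

mean_diff = mean(test) - mean(control)        # 2 - 1
median_diff = median(test) - median(control)  # 0 - 1
print(mean_diff, median_diff)
```

Here the mean difference is positive and the median difference is negative for the same pair of samples.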

Remove NaNs from _plot_data.

np.random.seed(88888)

df = pd.DataFrame({"Group": np.repeat("A", 12).tolist() +
                            np.repeat("B", 12).tolist(),
                   "Value": np.random.randint(1, 10, 12).tolist() +
                            np.random.randint(20, 30, 8).tolist() +
                            np.repeat(np.nan, 4).tolist()})

db = dabest.load(df, x='Group', y='Value', 
                 idx=("A", "B"))

db.hedges_g.plot();

db2 = dabest.load(df.dropna(), x='Group', y='Value', 
                 idx=("A", "B"))
db2.hedges_g.plot();

Therefore, we need to drop NaNs automatically during the internal munging...

Plots for Shared Control Groups with Different Number of Units

Dear DABEST team,

Thank you for providing this beautiful way to show estimation statistics!

When I tried to plot shared control groups with different numbers of units, it only showed the mean and the SD for the group with the largest number of units. But it worked fine when I uploaded my CSV and analyzed it online at estimationstats.com. Do you have any idea why this is happening? Thank you so much!

Best,
Ethan

Bug Plotting: Nan Dropping Drops Rows Prematurely

I am creating a DataFrame with, say, 10 columns and 100 rows. When I use DABEST to create an object and then plot column 1 vs. column 2, any rows that have NaNs in other columns (e.g., 8 or 9) are not used in the subsequent analysis: effect size estimation, plotting, etc.

_dabest = dabest.load(data=feature_df, 
                          x='nonancol', 
                          y="stat", 
                         idx=(("S", "F"),))
_dabest.cohens_d.plot()

does not work, unless I run this first:

feature_df.drop(['nancol1', 'nancol2'], axis=1, inplace=True)

That is, I first have to drop every column that contains NaNs. I presume this is because the NaN dropping happens too early: even though there are no NaNs in the columns I want to analyze, the affected rows are still discarded.

DABEST-python/dabest/_classes.py

Line 149 in e6cd4b8

plot_data.dropna(axis=0, how='any', subset=[self.__yvar], inplace=True)

It looks like some related issues in the past were fixed with that line.

I presume an easy fix is to drop, just before that line, any columns we are not looking at (i.e., those not passed as the "x" and "y" variables). Let me know if this is the right fix, and I can open a PR.
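The behaviour described in this report can be reproduced with pandas alone. The sketch below is not DABEST's internal code; the column names are made up, and it only contrasts dropping rows on any NaN with dropping rows on NaNs in the analysed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: x and y columns are complete, an unrelated column is not.
df = pd.DataFrame({
    "group": ["S", "S", "F", "F"],
    "stat": [1.0, 2.0, 3.0, 4.0],
    "nancol1": [np.nan, 5.0, np.nan, 6.0],
})

# Checking every column discards rows whose x and y values are valid.
dropped_any = df.dropna(axis=0, how="any")

# Restricting the check to the analysed columns keeps every row.
dropped_subset = df.dropna(axis=0, how="any", subset=["group", "stat"])

print(len(dropped_any), len(dropped_subset))
```

Selecting only the needed columns first, as in `df[["group", "stat"]].dropna()`, gives the same rows as the `subset` variant.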

AttributeError: module 'pandas' has no attribute 'CategoricalDtype'

Hi,

I get an AttributeError when running the following code:

import numpy as np
import pandas as pd
import dabest
from pandas.api.types import CategoricalDtype

from scipy.stats import norm # Used in generation of populations.

np.random.seed(9999) # Fix the seed so the results are replicable.
Ns = 20 # The number of samples taken from each population

# Create samples
c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)


# Add a gender column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control 1' : c1,  'Test 1' : t1,
                   'Control 2'  : c2,  'Test 2' : t2,
                   'Control 3'  : c3,  'Test 3' : t3,
                   'Test 4'     : t4,  'Test 5' : t5,
                   'Test 6'     : t6,
                   'Gender'    : gender, 'ID'  : id_col
                   })

two_groups_unpaired = dabest.load(df, idx=("Control 1", "Test 1"),
                                  resamples=5000)
two_groups_unpaired 


two_groups_paired = dabest.load(df, idx=("Control 1", "Test 1"),
                                paired=True, id_col="ID")
two_groups_paired

two_groups_unpaired.mean_diff

two_groups_unpaired.mean_diff.results
two_groups_unpaired.mean_diff.statistical_tests

two_groups_unpaired.mean_diff.plot(float_contrast=False)

multi_2group = dabest.load(df, idx=(("Control 1", "Test 1",),
                                    ("Control 2", "Test 2")
                                    ))
multi_2group.mean_diff.plot()

shared_control = dabest.load(df, idx=("Control 1", "Test 1",
                                      "Test 2", "Test 3",
                                      "Test 4", "Test 5", "Test 6"))
shared_control.mean_diff.plot()

This is the raised error:

  File "<ipython-input-118-1ba5bbb2090e>", line 55, in <module>
    two_groups_unpaired.mean_diff.plot(float_contrast=False)

  File "/home/dbau/.local/lib/python3.6/site-packages/dabest/_classes.py", line 1217, in plot
    out = EffectSizeDataFramePlotter(self, **all_kwargs)

  File "/home/dbau/.local/lib/python3.6/site-packages/dabest/plotter.py", line 376, in EffectSizeDataFramePlotter
    **group_summary_kwargs)

  File "/home/dbau/.local/lib/python3.6/site-packages/dabest/plot_tools.py", line 147, in gapped_lines
    if isinstance(data[x].dtype, pd.CategoricalDtype):

AttributeError: module 'pandas' has no attribute 'CategoricalDtype'

Libraries versions:

pandas
0.23.4

numpy
1.14.5

dabest
0.2.1

The resulting plot will not include the bootstrap effect sizes below the raw data.

Any idea of why this is happening?

Thanks,
Davide
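The traceback shows the failing check is `isinstance(data[x].dtype, pd.CategoricalDtype)`. A version-tolerant form of that check, offered here as a possible workaround rather than the maintainers' fix, imports the class from `pandas.api.types`, which is the import the script above already uses. The helper name `is_categorical` is hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def is_categorical(series):
    """Return True if `series` has a categorical dtype."""
    return isinstance(series.dtype, CategoricalDtype)


groups = pd.Series(["Control 1", "Test 1", "Control 1"], dtype="category")
values = pd.Series([3.0, 3.5, 2.9])
print(is_categorical(groups), is_categorical(values))
```

The other route is to upgrade pandas to a release that exposes the name at the top level.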

scaling issues

Hello,

Thanks for making this package available, it seems very useful!

I am experiencing scaling issues as seen in the attached screenshot. This happens on Mac and CentOS, conda with python 3.6. I am able to save semi-sane pdfs after downscaling the font and stretching the figure a lot, but it's not great. Do you have any insight?

On a related note, do you have any suggestions for integrating these figures in subplots?

Thank you,
Max

plot absolute mean differences as percentage of control

Hi, @josesho
I'm wondering whether it is possible to change the mean_diff subplot to a percentage subplot.
To clarify, instead of showing the difference of means, can DABEST show the mean difference divided by the reference mean?
This would make it easier to see whether a difference is big or small relative to the control condition.
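The quantity being requested can be written down directly. This is a sketch with invented numbers, not an existing DABEST effect size.

```python
from statistics import mean

# Invented measurements for illustration only.
control = [3.0, 2.8, 3.2, 3.1, 2.9]
test = [3.6, 3.4, 3.9, 3.5, 3.6]

mean_diff = mean(test) - mean(control)
percent_of_control = 100 * mean_diff / mean(control)
print(f"{percent_of_control:.1f}% of the control mean")
```

A confidence interval for this ratio would need the same transformation applied inside each bootstrap resample, which is not shown here.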

Median rather than mean - Multi two groups

Hello,
Regarding the multi two-group estimation plot, is it possible to plot the median ± standard deviation of the measurements for each group, instead of the mean? Thanks in advance.

need to run in multithread mode

How can I run dabest.plot() in multithreaded mode? I have a very large matrix.

axis labels for y axes

First off, many thanks for DABEST and everyone's work on estimation statistics.

I was looking at the initial example in README.md and see that there is no label on the y axis of the swarmplot. Checking the docstring of iris_dabest.mean_diff.plot, I found the kwarg swarm_label which allows me to set this. Indeed, using this works. However, the description of the kwarg suggests that the axis will be labeled automatically:

If `swarm_label` is not specified, it defaults to
"value", unless a column name was passed to `y`. If
`contrast_label` is not specified, it defaults to the effect size
being plotted.

However, the behavior I see (also shown in the README.md plot) is neither "value" nor the column name ("petal_width" in this case), but empty. So, I think there is either a bug in the docstring or the implementation.

Furthermore, I think all plots should have the y axis of the swarmplots labeled by default. I see in the tutorial, for example, that this is not the case.

Package the license file with the source distribution

To fulfill the requirements of the BSD-3-Clause license, perhaps add a MANIFEST.in to include the license file with the source distribution.

Presence of NaN in unrelated columns breaks DABEST

When trying to work on a large dataframe, containing several columns, some of which could be analyzed using dabest, I realized that other columns that are unrelated to the comparison I'm trying to do (i.e. columns that are not included in the x/y parameters) are interfering with the results.

Demonstration:

dabest.__version__
'0.2.4'

create example dataframe

df = pd.DataFrame(
    {'groups': np.random.choice(['Group 1', 'Group 2', 'Group 3'], size=(100,)),
     'value': np.random.random(size=(100,))})
df['unrelated'] = np.nan
df.head()

groups	value	unrelated
Group 1	0.592223	NaN
Group 1	0.432398	NaN
Group 3	0.714241	NaN
Group 1	0.889762	NaN
Group 1	0.388109	NaN

compare Group 1 vs Group 2:

test = dabest.load(data=df, x='groups', y='value', idx=['Group 1', 'Group 2'])
test.mean_diff

This generates a bunch of warnings:

.../numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
.../numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
.../dabest/_stats_tools/confint_2group_diff.py:157: RuntimeWarning: invalid value encountered in less
prop_less_than_es = sum(B < effsize) / len(B)
.../dabest/_classes.py:545: UserWarning: The lower limit of the BCa interval cannot be computed. It is set to the effect size itself. All bootstrap values were likely all the same.
stacklevel=0)
.../dabest/_classes.py:550: UserWarning: The upper limit of the BCa interval cannot be computed. It is set to the effect size itself. All bootstrap values were likely all the same.
stacklevel=0)
.../scipy/stats/stats.py:5001: RuntimeWarning: divide by zero encountered in double_scalars
z = (bigu - meanrank) / sd
.../numpy/core/fromnumeric.py:3367: RuntimeWarning: Degrees of freedom <= 0 for slice
**kwargs)
.../numpy/core/_methods.py:110: RuntimeWarning: invalid value encountered in true_divide
arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
.../numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)

and then the result is incorrect:

(...)
The unpaired mean difference between Group 1 and Group 2 is nan [95%CI nan, nan].
The two-sided p-value of the Mann-Whitney test is 0.0.
(...)

running the same analysis but keeping only the columns that are relevant generates the correct result

test = dabest.load(data=df[['groups','value']], x='groups', y='value', idx=['Group 1', 'Group 2'])
test.mean_diff

(...)
The unpaired mean difference between Group 1 and Group 2 is -0.0708 [95%CI -0.202, 0.0631].
The two-sided p-value of the Mann-Whitney test is 0.268.
(...)

Alternatively, if the unrelated column(s) do not contain NaNs, everything works as expected:

df.unrelated = 0

test = dabest.load(data=df, x='groups', y='value', idx=['Group 1', 'Group 2'])
test.mean_diff

(...)
The unpaired mean difference between Group 1 and Group 2 is -0.0708 [95%CI -0.202, 0.0631].
The two-sided p-value of the Mann-Whitney test is 0.268.
(...)

Implement permutation tests

This addresses #92, by providing a robust non-parametric bootstrap-based permutation p-value.
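For readers unfamiliar with the idea, a plain two-sided permutation test for a difference in means can be sketched as follows. This is a generic textbook version using only the standard library; it is not the implementation in this pull request, and the function name, defaults, and data are hypothetical.

```python
import random
from statistics import mean


def permutation_pvalue(control, test, n_permutations=5000, seed=12345):
    """Two-sided permutation p-value for the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(test) - mean(control))
    pooled = list(control) + list(test)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if abs(diff) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


control = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [11.0, 12.0, 13.0, 14.0, 15.0]
p = permutation_pvalue(control, test)
print(p)
```

The p-value is the share of label shufflings whose mean difference is at least as large in magnitude as the observed one.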

Infs in bootstrap differences jams up plotting

c = [1.45246, 1.19208, 1.61360, 1.12898, 1.27610]
t = [1.88, 2.33249, 0.80159, 1.44444]

df = pd.DataFrame({"group": np.repeat("control", len(c)).tolist() + np.repeat("test", len(t)).tolist(),
                   "value": c + t})

db = dabest.load(df, x="group", y="value", idx=["control", "test"])

db.hedges_g

produces

/Users/whho/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/dabest/_stats_tools/effsize.py:224: RuntimeWarning: divide by zero encountered in double_scalars
  return M / pooled_sd

DABEST v0.2.5
=============
             
Good afternoon!
The current time is Tue Oct  1 16:16:10 2019.

The unpaired Hedges' g between control and test is 0.553 [95%CI -1.58, 2.41].
The two-sided p-value of the Mann-Whitney test is 0.54.

5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.

To get the results of all valid statistical tests, use `.hedges_g.statistical_tests`

and an attempt to create the Hedges' g estimation plot results in

IndexError: list index out of range

Inspection of db.hedges_g.results.bootstraps reveals

0    [-inf, -7.413427045394007, -5.693852539205496,...
Name: bootstraps, dtype: object

We will have to discard the ±Infs before saving the bootstrap values....

First noted on the estimationstats webapp.
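Discarding the non-finite values could look like the sketch below. It is a minimal standard-library illustration, not the patch that was applied; the first three values are copied from the output above and the rest are made up.

```python
import math

# Illustrative bootstrap values; the first three come from the output above.
bootstraps = [float("-inf"), -7.413427045394007, -5.693852539205496,
              0.553, float("inf")]

# math.isfinite is False for +inf, -inf and NaN, so all three are dropped.
finite_bootstraps = [b for b in bootstraps if math.isfinite(b)]
n_discarded = len(bootstraps) - len(finite_bootstraps)
print(finite_bootstraps, n_discarded)
```

Keeping track of `n_discarded` makes it possible to warn the user when resamples were thrown away.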

Not enough unique values to generate halfviolin plot?

Hi, @josesho,

I got the following error when there are only two unique values in one of the groups being compared:

"""
Traceback (most recent call last):
  File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/_classes.py", line 1295, in plot
    out = EffectSizeDataFramePlotter(self, **all_kwargs)
  File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/plotter.py", line 488, in EffectSizeDataFramePlotter
    halfviolin(v, fill_color=fc, alpha=halfviolin_alpha)
  File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/plot_tools.py", line 17, in halfviolin
    V = b.get_paths()[0].vertices
IndexError: list index out of range
"""

It seems that DABEST needs at least three points to generate a half-violin plot; is that so?

Concretely, I have several groups to compare. Most of them have more than three unique values.
However, one group has a very flat distribution of data points (a lot of 0s and 1s, and nothing else).

Frankly, I'm not sure what is happening.
Do you think the error is coming from here?

Best wishes
Tong

Statistical Question

Hi,
Thanks for this wonderful tool you provided!

I was wondering if it is statistically possible to include nuisance parameters in the group comparison?
For example, if I want to compare a biomarker (DV) of two groups (IV1) and control for gender (IV2) and age (IV3).
In a multiple regression setting, I would include these variables in the model or use them as random effects (Biomarker ~ Group + Gender + Age, or Biomarker ~ Group + (1 | Gender) + (1 | Age), respectively).
Is this also possible with bootstrapping?

Thank you and best regards,

Marco

Problem with plotting mean_diff

The following code should produce two plots, but only the swarm plot is produced, and an error is thrown.

new_df = pd.read_csv('%s/all_ringrmsd_data_only.txt' % rootdir, sep='\t')
s_control = db.load(new_df,idx=('OMEGA','MOE','Macromodel','Desmond','RDKit','Prime'))

s_control.cohens_d

s_control.mean_diff.plot(raw_marker_size=3)

IndexError: index 12 is out of bounds for size 12

all_ringrmsd_data_only.txt

reflines_kwargs are not honored when float_contrast=False

When using float_contrast=False, the zero reference line does not honor the attributes passed in reflines_kwargs.
Compare these two plots for example:

two_groups_unpaired.mean_diff.plot(reflines_kwargs=dict(linestyle='dotted'));
two_groups_unpaired.mean_diff.plot(float_contrast=False,reflines_kwargs=dict(linestyle='dotted'))

multi-group plot from long dataframe

Hi! I really like DABEST as it's easy to use for Python novices such as myself, and I would love to implement it in my analysis pipeline, but I've come across the following 'problem' when using long data frames (I am sorry if this is related to the closed issues #55 and #43):

from scipy.stats import norm
t1 = norm.rvs(loc=3.5, scale=0.5, size=4)
df = pd.DataFrame({'parameter 1' : t1, 'parameter 2' : t1, 'parameter 3' : t1, 'group':('group1','group1','group2','group2')})

It would be very convenient to be able to do a multi-group plot for each parameter comparing group1 with group2, but I am afraid it is not possible to use nested tuples for the y values (= parameters in this example) while keeping the x (= 'group') and idx values (= 'group1', 'group2') constant to get something like this as a result:

I am not sure whether this is possible at all because, as stated in #43, repeated use of the same group in idx raises an error to avoid problems. I know I could restructure my data to something like this

to use the multi-group feature in the normal way and fix the labeling in post-processing:
dabest.load(df, idx=(("group1_parameter 1", "group2_parameter 1"), ("group1_parameter 2", "group2_parameter 2"), ("group1_parameter 3", "group2_parameter 3") ))

Bottom line: is there or will there be a way to do multi group analysis/plots as shown above from a long/melted data frame ?

Best
Lukas
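The restructuring mentioned in this question can be sketched with pandas. The data values and the `label` column are made up for illustration; the final `dabest.load(long_df, x="label", y="value", idx=idx)` call, left as a comment, follows the call pattern used in other issues on this page and is untested here.

```python
import pandas as pd

# Hypothetical data in the shape described above: one column per parameter.
df = pd.DataFrame({
    "parameter 1": [3.1, 3.4, 3.9, 3.6],
    "parameter 2": [2.0, 2.2, 2.9, 2.7],
    "group": ["group1", "group1", "group2", "group2"],
})

# Melt to long format, then build one combined label per group and parameter.
long_df = df.melt(id_vars="group", var_name="parameter", value_name="value")
long_df["label"] = long_df["group"] + "_" + long_df["parameter"]

# One (control, test) tuple per parameter, ready to pass as `idx`.
parameters = ["parameter 1", "parameter 2"]
idx = tuple((f"group1_{p}", f"group2_{p}") for p in parameters)
print(idx)

# Possible next step (assumption, not run here):
# dabest.load(long_df, x="label", y="value", idx=idx)
```

Because every label is unique to one group and parameter, no group name is repeated inside `idx`.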

Hypothesis testing interpretation?

I'm not entirely sure how to properly interpret the graphs. Let's take Figure 1 from the paper:

From what I understand, Figure 1d represents p(data|H0), and here it seems to be just at the 95% threshold (i.e., p<0.05, although there is no confidence interval plotted here). Figure 1e, the essence of this toolbox, looks to me to represent p(H0|data), or more precisely here p(C|T), which again seems to be just at the 95% threshold.

Is this interpretation correct? If so, is it possible with DABEST to plot the H0 distribution as in Figure 1d?

Indeed, I agree that p(H0|data) is more informative than p(data|H0), but most of NHST still relies on the latter, and being able to plot both inferences would bring the best of both worlds. What I would imagine would be something like this:

What do you think, would this be pertinent?

Specify axes for plots

In some figures, the estimation plot will only be one panel of the figure. It would be useful to have the ability to specify which axes to use to draw the plot, similar to the ax option in seaborn. In this case, two axes will be required for the two elements of the plot.

In principle, one could save the figure as an image and put together a figure in Illustrator or similar, but it would be nice to generate the whole figure in code.

Example error

The example (https://github.com/ACCLAB/DABEST-python) doesn't work.

python3 0
Traceback (most recent call last):
File "0", line 12, in <module>
iris_dabest.mean_diff.plot();
File "/usr/local/lib/python3.7/site-packages/dabest/_classes.py", line 1258, in plot
out = EffectSizeDataFramePlotter(self, **all_kwargs)
File "/usr/local/lib/python3.7/site-packages/dabest/plotter.py", line 377, in EffectSizeDataFramePlotter
**group_summary_kwargs)
File "/usr/local/lib/python3.7/site-packages/dabest/plot_tools.py", line 162, in gapped_lines
quantiles = data.groupby(x)[y].quantile([0.25, 0.75])
File "/usr/local/lib64/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1953, in quantile
return result.take(indices)
File "/usr/local/lib64/python3.7/site-packages/pandas/core/series.py", line 4432, in take
new_index = self.index.take(indices)
File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/multi.py", line 2032, in take
na_value=-1,
File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/multi.py", line 2060, in _assert_take_fillable
taken = [lab.take(indices) for lab in self.codes]
File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/multi.py", line 2060, in <listcomp>
taken = [lab.take(indices) for lab in self.codes]
IndexError: index 6 is out of bounds for size 6

Requirement already satisfied, skipping upgrade: scipy>=0.14.0 in /usr/local/lib64/python3.7/site-packages (from seaborn) (1.3.1)
Requirement already satisfied, skipping upgrade: matplotlib>=1.4.3 in /usr/local/lib64/python3.7/site-packages (from seaborn) (3.1.1)
Requirement already satisfied, skipping upgrade: numpy>=1.9.3 in /usr/local/lib64/python3.7/site-packages (from seaborn) (1.17.1)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in /usr/lib/python3.7/site-packages (from pandas>=0.15.2->seaborn) (2.8.0)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /usr/local/lib64/python3.7/site-packages (from matplotlib>=1.4.3->seaborn) (1.1.0)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/site-packages (from matplotlib>=1.4.3->seaborn) (2.4.2)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /usr/local/lib/python3.7/site-packages (from matplotlib>=1.4.3->seaborn) (0.10.0)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.15.2->seaborn) (1.12.0)
Requirement already satisfied, skipping upgrade: setuptools in /usr/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn) (41.0.0)
Requirement already up-to-date: pandas in /usr/local/lib64/python3.7/site-packages (0.25.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /usr/local/lib64/python3.7/site-packages (from pandas) (1.17.1)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in /usr/lib/python3.7/site-packages (from pandas) (2.8.0)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /usr/local/lib/python3.7/site-packages (from pandas) (2019.2)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0)
Requirement already up-to-date: dabest in /usr/local/lib/python3.7/site-packages (0.2.5)

Show bootstrap 95% confidence interval for Control?

As the title suggests, I'd like to show the bootstrap 95% confidence interval for the control group too, is there a way of doing that?

Thanks
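Independently of DABEST's plotting, a bootstrap interval for a single group can be computed by hand. The sketch below uses a simple percentile interval, which is an assumption made for brevity and differs from the bias-corrected and accelerated interval that DABEST reports; the function name and data are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(data, n_resamples=5000, alpha=0.05, seed=12345):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Invented control-group measurements.
control = [2.9, 3.1, 3.4, 2.7, 3.0, 3.3, 2.8, 3.2]
low, high = bootstrap_ci(control)
print(f"mean = {mean(control):.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

The resulting interval could then be drawn on the raw-data axes with ordinary matplotlib calls.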

Adding a Likelihood Q-Ratio Test

Hi, I was wondering what your thoughts are on adding a robust statistic, such as the LqRT, either to replace the t-test or as an additional column in the statistical testing. A quick summary: compared to Wilcoxon rank-sum tests, it is more robust when the model is misspecified under a gross error model. See Figure 9 of the paper for the most compelling result.

Proposed solution:
Add the https://github.com/alyakin314/lqrt package to requirements.txt and incorporate its output into the statistical results dataframe.

Reference:

http://www3.stat.sinica.edu.tw/statistica/oldpdf/A27n422.pdf

small issues

Thank you for the great resource!

I just have a little comment regarding the website version. The vertical bars indicating the SD are not defined in the auto-generated legend and do not appear at all in the two-group plots.

Also the given email address [email protected] does not work.

Allow ANOVA-style replotting of data in multi-group plots

import seaborn as sns
import dabest

# Requires internet access.
iris = sns.load_dataset("iris")

iris_db = dabest.load(iris, x="species", y="sepal_length",
                     idx=(("setosa", "versicolor"),
                          ("setosa", "virginica"),
                          ("versicolor", "virginica"))
                      )

iris_db.mean_diff

DABEST v0.2.4
=============
             
Good evening!
The current time is Tue May 28 18:30:11 2019.

The unpaired mean difference between setosa and versicolor is 0.93 [95%CI 0.76, 1.1].
The two-sided p-value of the Mann-Whitney test is 8.35e-14.

The unpaired mean difference between setosa and virginica is 1.58 [95%CI 1.38, 1.78].
The two-sided p-value of the Mann-Whitney test is 6.4e-17.

The unpaired mean difference between versicolor and virginica is 0.652 [95%CI 0.428, 0.878].
The two-sided p-value of the Mann-Whitney test is 5.87e-07.

5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.

To get the results of all valid statistical tests, use `.mean_diff.statistical_tests`

iris_db.mean_diff.plot();

The plot does not match the textual output above.

Tests do not pass with pytest

When I run either pytest or python -m pytest tests at the repo root, I get the following error.

_________________________ ERROR collecting tests/test_02_plotting.py _________________________
ImportError while importing test module '/home/tongli/Documents/DABEST-python/dabest/tests/test_02_plotting.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_02_plotting.py:13: in <module>
    import seaborn as sns
E   ModuleNotFoundError: No module named 'seaborn'

I'm sure that I have seaborn installed.
Some people suggest that removing the __init__.py in tests/ will do the job,
but this is what I got:

______________________ ERROR collecting tests/test_01_effsizes_pvals.py ______________________
ImportError while importing test module '/home/tongli/Documents/DABEST-python/dabest/tests/test_01_effsizes_pvals.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_01_effsizes_pvals.py:12: in <module>
    from .._stats_tools import effsize
E   ImportError: attempted relative import with no known parent package
_________________________ ERROR collecting tests/test_02_plotting.py _________________________
ImportError while importing test module '/home/tongli/Documents/DABEST-python/dabest/tests/test_02_plotting.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_02_plotting.py:13: in <module>
    import seaborn as sns
E   ModuleNotFoundError: No module named 'seaborn'
_________________________ ERROR collecting tests/test_03_confint.py __________________________
ImportError while importing test module '/home/tongli/Documents/DABEST-python/dabest/tests/test_03_confint.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_03_confint.py:8: in <module>
    from .._api import load
E   ImportError: attempted relative import with no known parent package

I believe it comes from the structure of dabest.
I'm not familiar with Python packaging. Can you point me in the right direction?
Thanks !

Plot controls

I'm sorry, I'm sure this is easy and I apologize if I've missed a duplicate issue. I'm looking to create and export plots out of DABEST for inclusion with a manuscript. Is there a recommended way of saving high-res plots with readable axis titles?

Thanks

setup.py doesn't specify dependencies correctly

Thanks for this package, it's exactly what I've been looking for!

On a fresh Ubuntu install, pip3 install dabest does not install any of the dependencies:

$ sudo apt install python3 python3-pip
# ...
done.
$ pip3 install dabest
Collecting dabest
  Downloading https://files.pythonhosted.org/packages/4b/05/baaf3990f30347f9780977b6b102036b6cbe22a8027e13f331c13e585873/dabest-0.2.5-py2.py3-none-any.whl (61kB)
    100% |################################| 71kB 1.5MB/s 
Installing collected packages: dabest
Successfully installed dabest-0.2.5

If you (unconditionally) specify the dependencies in setup.py then pip will install them automatically. It'd also stop everyone re-reporting the same issues because they have incompatible versions of dependencies (#52, #60, #67).

I'll send you a PR.

Paired t-test plot 95% CI shifted

I used the data you created at https://acclab.github.io/DABEST-python-docs/tutorial.html
to make a paired t-test plot. If I choose the median difference, the histogram corresponding to the 95% CI (grey histogram) is not aligned with the interval (black line).

Here is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import dabest
from scipy.stats import norm # Used in generation of populations.

np.random.seed(9999) # Fix the seed so the results are replicable.
Ns = 20 # The number of samples taken from each population

# Create samples
c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control 1' : c1, 'Test 1' : t1,
                   'Control 2' : c2, 'Test 2' : t2,
                   'Control 3' : c3, 'Test 3' : t3,
                   'Test 4' : t4, 'Test 5' : t5,
                   'Test 6' : t6,
                   'Gender' : gender, 'ID' : id_col
                   })

two_groups_paired = dabest.load(df, idx=("Test 6", "Test 5"),
                                paired=True, id_col="ID")

plt.figure()
two_groups_paired.median_diff.plot()
plt.show()

0.2.8 much slower than 0.2.7

I updated to v0.2.8 today, and I noticed that my code is much slower than before. This seems to be related to the inclusion of the lqrt test in the results.

test 1: virtual env with python 3.7.5 pandas 0.24.0 dabest 0.2.7

import numpy as np
import pandas as pd
import dabest

np.random.seed(1234)
df = pd.DataFrame({'Group1':np.random.normal(loc=0, size=(1000,)),
                   'Group2':np.random.normal(loc=1, size=(1000,))})
test = dabest.load(df, idx=['Group1','Group2'])
%time print(test.mean_diff)

DABEST v0.2.7

Good morning!
The current time is Tue Dec 31 11:46:00 2019.

The unpaired mean difference between Group1 and Group2 is 1.03 [95%CI 0.941, 1.11].
The two-sided p-value of the Mann-Whitney test is 2.63e-97.

5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.

To get the results of all valid statistical tests, use .mean_diff.statistical_tests

CPU times: user 558 ms, sys: 5.83 ms, total: 564 ms
Wall time: 564 ms

test 2: virtual env with python 3.7.5 pandas 0.25.3 dabest 0.2.8

import numpy as np
import pandas as pd
import dabest

np.random.seed(1234)
df = pd.DataFrame({'Group1':np.random.normal(loc=0, size=(1000,)),
                   'Group2':np.random.normal(loc=1, size=(1000,))})
test = dabest.load(df, idx=['Group1','Group2'])
%time print(test.mean_diff)

DABEST v0.2.8

Good morning!
The current time is Tue Dec 31 11:47:09 2019.

The unpaired mean difference between Group1 and Group2 is 1.03 [95%CI 0.941, 1.11].
The two-sided p-value of the Mann-Whitney test is 2.63e-97.

5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.

To get the results of all valid statistical tests, use .mean_diff.statistical_tests

CPU times: user 2.46 s, sys: 8.69 ms, total: 2.47 s
Wall time: 2.47 s

Would it be possible to delay doing the statistical tests to when effect_size.statistical_tests is called instead of calculating all the tests a priori?
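
A minimal sketch of the deferred evaluation being asked for, assuming Python 3.8+ for `functools.cached_property`. The `LazyEffectSize` class, its attributes, and the placeholder test are hypothetical and are not dabest's implementation; the point is only that the expensive step runs on first access and is then cached.

```python
from functools import cached_property


class LazyEffectSize:
    """Hypothetical stand-in for an effect-size object (not dabest's class)."""

    def __init__(self, control, test):
        self.control = list(control)
        self.test = list(test)
        self.tests_run = 0  # counts how often the expensive step executes

    @property
    def mean_diff(self):
        # Cheap summary that is always available.
        return (sum(self.test) / len(self.test)
                - sum(self.control) / len(self.control))

    @cached_property
    def statistical_tests(self):
        # Placeholder for the slow tests; runs only on first access.
        self.tests_run += 1
        return {"placeholder_test": "computed on demand"}


es = LazyEffectSize([1, 2, 3], [2, 3, 4])
print(es.mean_diff)          # no statistical tests have run yet
print(es.statistical_tests)  # first access triggers the computation
print(es.statistical_tests)  # second access reuses the cached result
```

Printing the effect size would then cost only the bootstrap, and the tests would be paid for only by users who ask for them.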

Shared control paired ?

Dear Dabest team,

Thank you for providing such a tool, which makes estimation statistics easy.

When trying to make a shared control plot with a paired design using:
dabest.load(Data, idx=('Baseline','Post','Off'), paired=True, id_col='ID')
where Data is an 8x3 array, I get the following error:
ValueError: is_paired is True, but some idx in ('Baseline', 'Post', 'Off') does not consist only of two groups.

How can I fix this in order to make a shared control plot with a paired design?

Maybe a paired design is not supported for shared controls? Maybe one day?

Thank you

Kind regards,

Alex
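
Until something like this is supported, the paired differences against a shared baseline can be estimated by hand. The sketch below is an assumption-laden illustration: the data values are made up, `paired_diff_ci` is a hypothetical helper, and it uses a plain percentile bootstrap rather than the bias-corrected and accelerated interval that DABEST reports.

```python
import random


def paired_diff_ci(control, test, n_boot=5000, alpha=0.05, seed=12345):
    """Mean within-subject difference with a percentile bootstrap CI."""
    diffs = [t - c for c, t in zip(control, test)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    # Resample the per-subject differences with replacement.
    boots = sorted(
        sum(sample) / len(sample)
        for sample in (rng.choices(diffs, k=len(diffs)) for _ in range(n_boot))
    )
    low = boots[round(n_boot * alpha / 2)]
    high = boots[round(n_boot * (1 - alpha / 2)) - 1]
    return observed, low, high


# Made-up measurements for 8 subjects under 3 conditions.
baseline = [10, 12, 11, 13, 9, 14, 10, 12]
conditions = {
    "Post": [12, 13, 13, 15, 10, 15, 12, 13],
    "Off": [10, 11, 12, 13, 9, 13, 11, 12],
}

results = {name: paired_diff_ci(baseline, values)
           for name, values in conditions.items()}
for name, (diff, low, high) in results.items():
    print(f"{name} - Baseline: {diff:.2f} [{low:.2f}, {high:.2f}]")
```

Each comparison reuses the same baseline column, which is what makes the control "shared"; the subjects must be listed in the same order in every column.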

Refactor Lq-RT tests to utility function

See #91. The goal would be to refactor the LqRT functionality into a helper function that can be called on the data and idx properties of the Dabest object, independent of the main functions, and at the pleasure of the user.

CC @adam2392

`slopegraph_kwargs` not working

Trying to change the appearance of the slope lines in a paired graph using slopegraph_kwargs results in an exception:

two_groups_paired.mean_diff.plot(slopegraph_kwargs=dict(linestyle='dotted'));

UnboundLocalError: local variable 'slopegraph_kwargs' referenced before assignment

Also, there is no documentation for the parameter slopegraph_kwargs on https://acclab.github.io/DABEST-python-docs/api.html
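
An `UnboundLocalError` of this kind usually means a local name is assigned on only one branch. The sketch below shows that general pattern and one way to avoid it; both functions and the default value are hypothetical and are not taken from dabest's source.

```python
def build_kwargs_buggy(slopegraph_kwargs_in=None):
    # The local name is assigned only when the caller passes nothing.
    if slopegraph_kwargs_in is None:
        slopegraph_kwargs = {"linewidth": 1}
    # Missing else branch: the name is unbound if the caller passed a dict.
    return slopegraph_kwargs


def build_kwargs_fixed(slopegraph_kwargs_in=None):
    # Start from the defaults, then let user values override them.
    slopegraph_kwargs = {"linewidth": 1}
    if slopegraph_kwargs_in is not None:
        slopegraph_kwargs.update(slopegraph_kwargs_in)
    return slopegraph_kwargs


try:
    build_kwargs_buggy({"linestyle": "dotted"})
except UnboundLocalError as err:
    print("buggy version:", err)

print("fixed version:", build_kwargs_fixed({"linestyle": "dotted"}))
```

Assigning the defaults unconditionally and updating them afterwards keeps every code path bound to a value.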

pandas==0.21.0 breaks the plotting

Specifically, if the dataframe passed is "wide" (aka un-melted), it throws this error:

~/anaconda3/envs/dabest-dev-py3.6/lib/python3.6/site-packages/dabest/main.py in plot(data, idx, x, y, color_col, float_contrast, paired, show_pairs, group_summaries, custom_palette, swarm_label, contrast_label, swarm_ylim, contrast_ylim, fig_size, font_scale, stat_func, ci, n_boot, show_group_count, swarmplot_kwargs, violinplot_kwargs, reflines_kwargs, group_summary_kwargs, legend_kwargs, aesthetic_kwargs)
    392         color_groups = data_in[x].unique()
    393     else:
--> 394         color_groups = data_in[color_col].unique()
    395 
    396     if custom_palette is None:

~/anaconda3/envs/dabest-dev-py3.6/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615 
   3616     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'unique'

Does not run with more recent versions of dependencies

I tried to run the test code from the README with the example dataset and got an error (TypeError: must be real number, not list) on the line:

iris_dabest.mean_diff.plot(). It occurs due to this line in the gapped_lines function:

--> 162 quantiles = data.groupby(x)[y].quantile([0.25, 0.75])
163 .unstack()
164 .reindex(index=group_order)

I am using Python3.7 with the following modules:

scipy 1.3.0
seaborn 0.9.0
pandas 0.25.0
numpy 1.17.0
matplotlib 3.1.1

which I notice are more recent versions of the dependencies. I installed dabest using pip. I have also built the cloned version to test and get the same error.

I have forked a copy of the code to see if there is a simple fix.
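
One possible workaround, sketched under the assumption that only the list form of `quantile` is affected: request each quantile as a scalar and combine the results. `grouped_quartiles` is a hypothetical helper and the data are made up; this has not been tested against dabest itself.

```python
import pandas as pd


def grouped_quartiles(data, x, y, group_order):
    """Return the 25th and 75th percentiles of `y` for each group in `x`."""
    grouped = data.groupby(x)[y]
    # One scalar quantile per call instead of quantile([0.25, 0.75]).
    quartiles = pd.concat(
        {0.25: grouped.quantile(0.25), 0.75: grouped.quantile(0.75)},
        axis=1,
    )
    return quartiles.reindex(index=group_order)


example = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "value": [1, 2, 3, 4, 10, 20, 30, 40],
})
print(grouped_quartiles(example, "group", "value", ["b", "a"]))
```

The result has one row per group and one column per quantile, which is the same shape the original `quantile([0.25, 0.75]).unstack()` call produces.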

Sina plot instead of beeswarm plot?

Hello,

I appreciate the approach taken by this framework a lot, and I would like to implement it in my publications. However, I would prefer to use a sina plot instead of a beeswarm, as it has 2 advantages:

1- apart from kernel density function estimation, it does not produce an artificial structuring on the data (ie, the "branch-like" lines in the beeswarm),

2- each class's sina plot's width is normalized across all classes, so that we can get an impression of the difference in sample size at a glance.

I think the last point in particular can very well complement the ideas put forward by the DABEST framework. There is a Python implementation of Sina plots in the plotnine package (geom_sina).

Also maybe it would be interesting, if possible at all, to generalize the possibility of using other kinds of plots, as I guess different users might have different preferences?

Paired t-test alignment

A question related to the issue: Paired t-test plot 95% CI shifted #46.

The curve showing the bootstrap distribution should be aligned with the black line showing the 95% CI. Most of the curve should be inside the black line; only 5% of the data could be outside. In this image that is not the case at all: more than 5% of the distribution falls outside the black line. What am I missing? Thanks in advance.
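
One way to check this numerically is to count how much of a bootstrap distribution lies inside a given interval. The sketch below uses made-up paired differences and a plain percentile interval, which by construction holds about 95% of the resamples. DABEST reports a bias-corrected and accelerated interval, whose endpoints can sit at different percentiles, so this is a diagnostic under stated assumptions and not a reproduction of dabest's computation.

```python
import random
from statistics import median


def percentile_interval(boots, alpha=0.05):
    """Percentile bootstrap interval from a list of resampled statistics."""
    ordered = sorted(boots)
    n = len(ordered)
    low = ordered[round(n * alpha / 2)]
    high = ordered[round(n * (1 - alpha / 2)) - 1]
    return low, high


def fraction_inside(boots, low, high):
    """Share of resampled statistics that fall within [low, high]."""
    return sum(low <= b <= high for b in boots) / len(boots)


rng = random.Random(9999)
diffs = [rng.gauss(0.0, 0.5) for _ in range(20)]  # made-up paired differences
boots = [median(rng.choices(diffs, k=len(diffs))) for _ in range(5000)]

low, high = percentile_interval(boots)
coverage = fraction_inside(boots, low, high)
print(f"interval [{low:.3f}, {high:.3f}] holds {coverage:.1%} of resamples")
```

Passing the interval endpoints reported by DABEST to `fraction_inside` would show whether the mismatch is in the plot or in the interval itself.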

variance +/- 5.

For some reason, the variance seems to be +/- 5 in all of your examples! I wonder whether that's because of the model or rather a suboptimal selection of the example data.

Typo in docstring of plot()

DABEST-python/dabest/_classes.py

Line 1182 in 7fb35f3

    
                   The first axes (accessible with ``FigName.axes()[0]``) contains the rawdata swarmplot; the second axes (accessible with ``FigName.axes()[1]``) has the bootstrap distributions and effect sizes (with confidence intervals) plotted on it.

The Figure.axes property of the Figure class is a list, and not a function.
https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure.axes

You would therefore access the two axes using:
FigName.axes[0] and FigName.axes[1]

Add Ns for each group to the `results` DataFrame

The results, presented as a pandas DataFrame, do not include the counts for each group. This feature will be added in v0.2.5.

UnboundLocalError: local variable 'rightmin' referenced before assignment

line 533, in EffectSizeDataFramePlotter
contrast_axes.set_ylim(rightmin, rightmax)
UnboundLocalError: local variable 'rightmin' referenced before assignment

plotting fails in pandas==0.25.0

Same code works in pandas==0.24.0
After updating to 0.25.0, it fails and returns the following error:

Traceback (most recent call last):
......
  File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/_classes.py", line 1235, in plot
    out = EffectSizeDataFramePlotter(self, **all_kwargs)
  File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/plotter.py", line 377, in EffectSizeDataFramePlotter
    **group_summary_kwargs)
  File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/plot_tools.py", line 162, in gapped_lines
    quantiles = data.groupby(x)[y].quantile([0.25, 0.75])\
  File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1908, in quantile
    interpolation=interpolation,
  File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 2248, in _get_cythonized_result
    func(**kwargs)  # Call func to modify indexer values in place
  File "pandas/_libs/groupby.pyx", line 692, in pandas._libs.groupby.group_quantile
TypeError: must be real number, not list

plot with float_contrast=False causes pandas groupby error

Using jupyter notebook in jupyter/scipy-notebook docker container
Dockerfile:

FROM jupyter/scipy-notebook:2ce7c06a61a1
RUN pip install dabest

Using example from github:

import pandas as pd
import dabest

iris = pd.read_csv("https://github.com/mwaskom/seaborn-data/raw/master/iris.csv")
iris_dabest = dabest.load(data=iris, x="species", y="petal_width",
                          idx=("setosa", "versicolor", "virginica"))
iris_dabest.mean_diff.plot()

output:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-eb33a02ff120> in <module>
      1 # Produce a Cumming estimation plot.
----> 2 iris_dabest.mean_diff.plot()

/opt/conda/lib/python3.7/site-packages/dabest/_classes.py in plot(self, color_col, raw_marker_size, es_marker_size, swarm_label, contrast_label, swarm_ylim, contrast_ylim, custom_palette, swarm_desat, halfviolin_desat, halfviolin_alpha, float_contrast, show_pairs, group_summaries, group_summaries_offset, fig_size, dpi, swarmplot_kwargs, violinplot_kwargs, slopegraph_kwargs, reflines_kwargs, group_summary_kwargs, legend_kwargs)
   1233         del all_kwargs["self"]
   1234 
-> 1235         out = EffectSizeDataFramePlotter(self, **all_kwargs)
   1236 
   1237         return out

/opt/conda/lib/python3.7/site-packages/dabest/plotter.py in EffectSizeDataFramePlotter(EffectSizeDataFrame, **plot_kwargs)
    375                          gap_width_percent=1.5,
    376                          type=group_summaries, ax=rawdata_axes,
--> 377                          **group_summary_kwargs)
    378 
    379 

/opt/conda/lib/python3.7/site-packages/dabest/plot_tools.py in gapped_lines(data, x, y, type, offset, ax, line_color, gap_width_percent, **kwargs)
    160 
    161     medians   = data.groupby(x)[y].median().reindex(index=group_order)
--> 162     quantiles = data.groupby(x)[y].quantile([0.25, 0.75])\
    163                                   .unstack()\
    164                                   .reindex(index=group_order)

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in quantile(self, q, interpolation)
   1906             post_processing=post_processor,
   1907             q=q,
-> 1908             interpolation=interpolation,
   1909         )
   1910 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _get_cythonized_result(self, how, grouper, aggregate, cython_dtype, needs_values, needs_mask, needs_ngroups, result_is_index, pre_processing, post_processing, **kwargs)
   2246                 func = partial(func, ngroups)
   2247 
-> 2248             func(**kwargs)  # Call func to modify indexer values in place
   2249 
   2250             if result_is_index:

pandas/_libs/groupby.pyx in pandas._libs.groupby.group_quantile()

TypeError: must be real number, not list

I have noticed this error tends to happen when using float_contrast=False (either specified with two samples, or when using more than two samples).

plotting error: 95% range overlaps with data

Hi.
I'm loving this code so far. When there is quite a lot of data, there is a slight formatting error where the 95% range marker overlaps with the data. See pic.

This is also the case when I tried manually specifying a wider aspect ratio using the fig_size parameter.

Change range and tick frequency on contrast axis independently from the rawswarm yaxis

I am trying to find a way to change the breaks/tick frequency on the contrast axis independently from the rawswarm y axis, on either a single or a tiled Gardner-Altman plot.

My first language is R so please bear with me.

import numpy as np
import pandas as pd
import dabest
from matplotlib import ticker
from scipy.stats import norm

np.random.seed(9999) # Fix the seed so the results are replicable.
pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population

# Create populations
pop1 = norm.rvs(loc=3, scale=0.4, size=pop_size)
pop2 = norm.rvs(loc=3.5, scale=0.5, size=pop_size)
pop3 = norm.rvs(loc=2.5, scale=0.6, size=pop_size)
pop4 = norm.rvs(loc=3, scale=0.75, size=pop_size)
pop5 = norm.rvs(loc=3.5, scale=0.75, size=pop_size)
pop6 = norm.rvs(loc=3.25, scale=0.4, size=pop_size)

# Sample from the populations
sampling_kwargs = dict(size=Ns, replace=False)

g1 = np.random.choice(pop1, **sampling_kwargs)
g2 = np.random.choice(pop2, **sampling_kwargs)
g3 = np.random.choice(pop3, **sampling_kwargs)
g4 = np.random.choice(pop4, **sampling_kwargs)
g5 = np.random.choice(pop5, **sampling_kwargs)
g6 = np.random.choice(pop6, **sampling_kwargs)

# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

# Add an `id` column for paired data plotting.
# More info below!
id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control' : g1,
  'Group 1' : g2,
  'Group 2' : g3,
  'Group 3' : g4,
  'Group 4' : g5,
  'Group 5' : g6,
  'Gender'  : gender,
  'ID'      : id_col
})

f1, r1 = dabest.plot(df, idx=(('Group 2','Group 3'),('Group 4','Group 5')))
This produces plot 5 in the dabest tutorial. Let's say you want to change the y axis to span some predefined range.

f2, r2 = dabest.plot(df, idx=(('Group 2','Group 3'),('Group 4','Group 5')), swarm_ylim=(1,6))
Here we are changing the range of the rawswarm y axis. This also alters the contrast axis, which is my main problem. Given my predefined range on the rawswarm y axis, I want a different range and a higher frequency of ticks on the contrast axis.

f3, r3 = dabest.plot(df, idx=(('Group 2','Group 3'),('Group 4','Group 5')),
                     swarm_ylim=(1,6))

contrast_axes = f3.axes[2]
contrast_axes.set_ylim(-1,1)
contrast_axes.yaxis.set_major_locator(ticker.MultipleLocator(0.1))

While this changes the range and tick frequency of the contrast axis, it also widens the mean difference distribution and shifts the zero and mean difference lines upwards. Another thing I don't understand is that when you increase the range of the contrast axis, e.g., by setting contrast_axes.set_ylim(-3,3), the mean difference distribution becomes less stretched, almost as it originally appeared in f1.

Any help on the matter would be highly appreciated.
Thanks.

Add paired Cohen's d

Is there a way to calculate Cohen's d for paired data in DABEST? Currently DABEST appears to return only unpaired Cohen's d.
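
Conventions for a paired Cohen's d differ between authors. The sketch below shows two commonly used definitions: d_z (mean difference divided by the standard deviation of the differences) and d_av (mean difference divided by the average of the two group standard deviations). The function names and data are hypothetical, and nothing here indicates which definition, if any, DABEST would adopt.

```python
from statistics import mean, stdev


def cohens_d_z(control, test):
    """Mean of the paired differences over the SD of those differences."""
    diffs = [t - c for c, t in zip(control, test)]
    return mean(diffs) / stdev(diffs)


def cohens_d_av(control, test):
    """Mean of the paired differences over the average of the two group SDs."""
    diffs = [t - c for c, t in zip(control, test)]
    return mean(diffs) / ((stdev(control) + stdev(test)) / 2)


# Made-up paired measurements; position i in both lists is the same subject.
control = [1, 2, 3, 4]
test = [2, 4, 5, 7]

print("d_z  =", round(cohens_d_z(control, test), 3))
print("d_av =", round(cohens_d_av(control, test), 3))
```

The two values can differ substantially when the paired measurements are strongly correlated, so it is worth reporting which definition was used.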

Is there a way to change the panel size of one of the group?

Hi,

Thank you for the excellent package. I am analysing survey data with many data points loaded onto one response value. Is there a way to enlarge the panel of group "A" in the attached swarm plot so that it accurately shows all the data points?

Here is some example data. I am new to dabest and Python, so I apologize if the question has been asked before.

import numpy as np
import pandas as pd
import dabest

# Create an example dataframe
x = np.arange(1, 6)
xd = np.repeat(x, [70, 80, 100, 300, 150])
f = np.array(["A", "B", "C"], dtype=str)  # np.str was removed from recent NumPy releases
fd = np.repeat(f, [510, 80, 110])

np.random.seed(123)  # Seed NumPy's generator, which np.random.shuffle uses
np.random.shuffle(fd)

xf = np.vstack((xd, fd))
eg = pd.DataFrame(data=xf, index=["Resp", "Type"])
eg = eg.T
eg['Resp'] = eg['Resp'].astype('float')
egdf = dabest.load(eg, idx=("A", "B", "C"),
                   x="Type", y="Resp")
egdfclf = egdf.cliffs_delta
egplt = egdf.cliffs_delta.plot(raw_marker_size=2, fig_size=[12, 6])