acclab / dabest-python
Data Analysis with Bootstrapped ESTimation
Home Page: https://acclab.github.io/DABEST-python/
License: Apache License 2.0
If I understand correctly, the package only supports comparing all conditions to one reference condition.
I thought it would be really neat if dabest could plot more than one reference condition. For example, I think it would be useful to do all pairwise comparisons (compare condition i vs. condition j for all i, j s.t. i != j). One possible layout is to use the 0-th row to plot the raw data, and plot a grid of mean difference distributions underneath, where the i-th row uses the i-th condition as the reference condition.
Thanks!
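For what it's worth, here is a minimal sketch of the pairing described above, using only the standard library. It enumerates every ordered pair (i, j) with i != j and groups the pairs by reference condition, one entry per row of the proposed grid. The condition names are placeholders, and whether dabest accepts such pairs as a nested idx depends on the installed version, so that part is an assumption to verify.

```python
from itertools import permutations

# Placeholder condition names; substitute the groups in your own data.
conditions = ["A", "B", "C"]

# Every ordered (reference, test) pair with reference != test.
pairs = list(permutations(conditions, 2))

# Group the pairs by reference condition: one entry per row of the grid.
rows = {ref: [pair for pair in pairs if pair[0] == ref] for ref in conditions}

for ref, row in rows.items():
    print(ref, row)
```

Each value in `rows` holds the comparisons that would share one reference condition in the layout sketched in the request.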
I'm loving this plotting and analysis pipeline, and am trying to incorporate it into my own work. I'm working on building a script in Python to streamline and make plots that fit my general presentation style for presentations and publications. Currently, as I resize the plot the axis text becomes too large and begins overlapping (example attached). I've included the code I'm using to generate this, and any tips on where/how to revise this to change axis text would be greatly appreciated! I'd like to resize the text, but could also foresee wanting to change the orientation of the text and removing the "n=".
`
data = dabest.load(dataset, idx=(cntl_dataset, expt_dataset), id_col='Image', paired=True)
fig1 = data.mean_diff.plot(
    fig_size=(3.5, 3),
    dpi=150,
    reflines_kwargs={'linestyle': 'dashed', 'linewidth': .8, 'color': 'black'},
    swarm_label=yaxis_label,
    show_pairs=False,
    swarm_desat=1,
    group_summaries='mean_sd',
    group_summaries_offset=0.15,
    swarmplot_kwargs={'size': 5},
    group_summary_kwargs={'lw': 3, 'alpha': 0.8},
    float_contrast=True,
    contrast_label='paired mean difference',
    es_marker_size=5,
    halfviolin_desat=1,
    halfviolin_alpha=0.8,
    violinplot_kwargs={'widths': 0.5},
)`
It seems to me that there's a conceptual issue with the default behavior shown in the tutorial: the documentation there states "By default, DABEST will report the two-sided p-value of the most conservative test that is appropriate for the effect size.", which in the example means that the mean_diff effect size reports a Mann-Whitney test p-value.
However, the Mann-Whitney test is fundamentally unrelated to the mean difference in general, corresponding to the median difference instead. It's possible that a positive mean difference and negative median difference could both be valid conclusions for the right data!
Maybe the defaults could restrict to pairing mean- and median-based effect sizes and tests?
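To make the point about disagreeing signs concrete, here is a small sketch with made-up numbers (they are not from the tutorial). A single large value in the test group pulls its mean above the control mean while its median stays below the control median.

```python
from statistics import mean, median

# Toy data, invented for illustration only.
control = [1, 2, 3, 4, 5]
test = [0, 1, 2, 3, 20]  # one large value pulls the mean up

mean_diff = mean(test) - mean(control)
median_diff = median(test) - median(control)

print("mean difference:", mean_diff)
print("median difference:", median_diff)
```

With data like these, a mean-based effect size and a median-based summary point in opposite directions, which is the situation the report is worried about.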
import numpy as np
import pandas as pd
import dabest

np.random.seed(88888)
df = pd.DataFrame({"Group": np.repeat("A", 12).tolist() +
                            np.repeat("B", 12).tolist(),
                   "Value": np.random.randint(1, 10, 12).tolist() +
                            np.random.randint(20, 30, 8).tolist() +
                            np.repeat(np.nan, 4).tolist()})
# Load the frame with the NaNs still present in 'Value'.
db = dabest.load(df, x='Group', y='Value',
                 idx=("A", "B"))
db.hedges_g.plot();
# Load the same frame after dropping the NaN rows by hand.
db2 = dabest.load(df.dropna(), x='Group', y='Value',
                  idx=("A", "B"))
db2.hedges_g.plot();
Therefore, we need to drop NaNs automatically during the internal munging...
Dear DABEST team,
Thank you for providing this beautiful way to show estimation statistics!
When I tried to plot shared control groups with different numbers of units, it only showed the mean and the SD for the group with the largest number of units. But it worked fine when I uploaded my CSV and analyzed it online at estimationstats.com. Do you have any idea why this is happening? Thank you so much!
Best,
Ethan
I am creating a DataFrame with, say, 10 columns and 100 rows. When I use DABEST to create an object and then plot column 1 vs. column 2, any row that has NaNs in other columns (e.g. 8 or 9) is excluded from the subsequent analysis: effect size estimation, plotting, etc.
_dabest = dabest.load(data=feature_df,
                      x='nonancol',
                      y="stat",
                      idx=(("S", "F"),))
_dabest.cohens_d.plot()
does not work, unless I run this first:
feature_df.drop(['nancol1', 'nancol2'], axis=1, inplace=True)
That is, dropping every column that contains NaNs. I presume the NaN dropping happens too early, before the relevant columns are selected. So even though there are no NaNs in the columns I want, those data points are still not used.
DABEST-python/dabest/_classes.py
Line 149 in e6cd4b8
Looks like some related issues in the past were fixed with that line:
I would presume an easy fix is to add a step in front of that line that removes any columns we are not looking at (i.e. those not passed in the "x" and "y" arguments). Let me know if this is the right fix, and I can open a PR.
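A minimal pandas sketch of the proposed idea, assuming the goal is simply to ignore NaNs outside the columns in use. It is not the code at the referenced line in _classes.py; the toy frame reuses the column names from the report, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame: 'nonancol' and 'stat' are complete, 'nancol1' is not.
feature_df = pd.DataFrame({
    "nonancol": ["S", "S", "F", "F"],
    "stat": [1.0, 2.0, 3.0, 4.0],
    "nancol1": [np.nan, 0.5, np.nan, 0.7],
})

x, y = "nonancol", "stat"

# Dropping on every column loses rows that are complete in x and y.
dropped_everywhere = feature_df.dropna()

# Restricting the check to the columns in use keeps all four rows.
dropped_on_used = feature_df.dropna(subset=[x, y])

print(len(dropped_everywhere), len(dropped_on_used))
```

Using `subset` keeps the unrelated columns in the frame, which may matter if they are needed later (for example as a color column), whereas selecting `feature_df[[x, y]]` would discard them.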
Hi,
I get an AttributeError when running the following code:
import numpy as np
import pandas as pd
import dabest
from pandas.api.types import CategoricalDtype
from scipy.stats import norm # Used in generation of populations.
np.random.seed(9999) # Fix the seed so the results are replicable.
Ns = 20 # The number of samples taken from each population
# Create samples
c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
# Add a gender column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer repeat count
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males
# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))
# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control 1' : c1, 'Test 1' : t1,
'Control 2' : c2, 'Test 2' : t2,
'Control 3' : c3, 'Test 3' : t3,
'Test 4' : t4, 'Test 5' : t5,
'Test 6' : t6,
'Gender' : gender, 'ID' : id_col
})
two_groups_unpaired = dabest.load(df, idx=("Control 1", "Test 1"),
resamples=5000)
two_groups_unpaired
two_groups_paired = dabest.load(df, idx=("Control 1", "Test 1"),
paired=True, id_col="ID")
two_groups_paired
two_groups_unpaired.mean_diff
two_groups_unpaired.mean_diff.results
two_groups_unpaired.mean_diff.statistical_tests
two_groups_unpaired.mean_diff.plot(float_contrast=False)
multi_2group = dabest.load(df, idx=(("Control 1", "Test 1",),
("Control 2", "Test 2")
))
multi_2group.mean_diff.plot()
shared_control = dabest.load(df, idx=("Control 1", "Test 1",
"Test 2", "Test 3",
"Test 4", "Test 5", "Test 6"))
shared_control.mean_diff.plot()
This is the raised error:
File "<ipython-input-118-1ba5bbb2090e>", line 55, in <module>
two_groups_unpaired.mean_diff.plot(float_contrast=False)
File "/home/dbau/.local/lib/python3.6/site-packages/dabest/_classes.py", line 1217, in plot
out = EffectSizeDataFramePlotter(self, **all_kwargs)
File "/home/dbau/.local/lib/python3.6/site-packages/dabest/plotter.py", line 376, in EffectSizeDataFramePlotter
**group_summary_kwargs)
File "/home/dbau/.local/lib/python3.6/site-packages/dabest/plot_tools.py", line 147, in gapped_lines
if isinstance(data[x].dtype, pd.CategoricalDtype):
AttributeError: module 'pandas' has no attribute 'CategoricalDtype'
Library versions:
pandas 0.23.4
numpy 1.14.5
dabest 0.2.1
The resulting plot will not include the bootstrap effect sizes below the raw data.
Any idea of why this is happening?
Thanks,
Davide
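The traceback shows that `pd.CategoricalDtype` is missing from the top-level pandas namespace in pandas 0.23.4. The script in the report already imports `CategoricalDtype` from `pandas.api.types`, so a check written against that location should work on both older and newer pandas. The helper name `is_categorical` below is hypothetical and this is only a sketch of the check, not the change made in plot_tools.py.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def is_categorical(series):
    """Return True if the Series has a categorical dtype (hypothetical helper)."""
    return isinstance(series.dtype, CategoricalDtype)


plain = pd.Series(["a", "b", "a"])
categorical = plain.astype("category")

print(is_categorical(plain), is_categorical(categorical))
```

Upgrading pandas is the other obvious route, if that is an option in your environment.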
Hello,
Thanks for making this package available, it seems very useful! I am experiencing scaling issues as seen in the attached screenshot. This happens on Mac and CentOS, conda with python 3.6. I am able to save semi-sane pdfs after downscaling the font and stretching the figure a lot, but it's not great. Do you have any insight?
On a related note, do you have any suggestions for integrating these figures in subplots?
Thank you,
Max
Hi, @josesho
I'm wondering whether it is possible to change the mean_diff subplot to a percentage subplot.
To clarify, instead of giving the difference of means, can DABEST show the mean diff / reference mean?
Just in order to see whether it's a big difference or a small difference as compared with the control condition.
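Until something like this exists in the package, the quantity can be computed by hand. The sketch below expresses the mean difference as a percentage of the control mean and attaches a plain percentile bootstrap interval. This is an assumption-laden illustration: it is not the bias-corrected and accelerated interval that DABEST reports, the function names are made up, and it only makes sense when the control mean is well away from zero.

```python
import numpy as np


def percent_mean_diff(control, test):
    """Mean difference expressed as a percentage of the control mean."""
    control_mean = np.mean(control)
    return 100.0 * (np.mean(test) - control_mean) / control_mean


def bootstrap_percent_ci(control, test, resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the percentage difference."""
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    test = np.asarray(test, dtype=float)
    estimates = np.empty(resamples)
    for i in range(resamples):
        # Resample each group independently, with replacement.
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(test, size=test.size, replace=True)
        estimates[i] = percent_mean_diff(c, t)
    lower, upper = np.percentile(
        estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper


# Toy data, invented for illustration.
control = [2.0, 3.0, 4.0]
test = [3.0, 4.5, 6.0]

point = percent_mean_diff(control, test)
low, high = bootstrap_percent_ci(control, test)
print(point, low, high)
```

The ratio is resampled as a whole, so the interval reflects uncertainty in the control mean as well as in the difference.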
Hello,
Regarding the multi two-group estimation plot, is it possible to plot the median ± standard deviation of the measurements for each group, instead of the mean? Thanks in advance.
How can I use dabest.plot() in multithreaded mode? I have a really big matrix.
First off, many thanks for DABEST and everyone's work on estimation statistics.
I was looking at the initial example in README.md and see that there is no label on the y axis of the swarmplot. Checking the docstring of iris_dabest.mean_diff.plot, I found the kwarg swarm_label, which allows me to set this. Indeed, using this works. However, the description of the kwarg suggests that the axis will be automatically labeled:
If `swarm_label` is not specified, it defaults to
"value", unless a column name was passed to `y`. If
`contrast_label` is not specified, it defaults to the effect size
being plotted.
However, the behavior I see (also shown in the README.md plot) is neither "value" nor the column name ("petal_width" in this case), but empty. So, I think there is either a bug in the docstring or the implementation.
Furthermore, I think all plots should have the y axis of the swarmplots labeled by default. I see in the tutorial, for example, that this is not the case.
To fulfill the requirements of the BSD-3-Clause license, perhaps add a MANIFEST.in to include the license file with the source distribution.
When trying to work on a large dataframe, containing several columns, some of which could be analyzed using dabest, I realized that other columns that are unrelated to the comparison I'm trying to do (i.e. columns that are not included in the x/y parameters) are interfering with the results.
Demonstration:
dabest.__version__
'0.2.4'
df = pd.DataFrame(
{'groups': np.random.choice(['Group 1', 'Group 2', 'Group 3'], size=(100,)),
'value': np.random.random(size=(100,))})
df['unrelated'] = np.nan
df.head()
groups | value | unrelated
---|---|---
Group 1 | 0.592223 | NaN
Group 1 | 0.432398 | NaN
Group 3 | 0.714241 | NaN
Group 1 | 0.889762 | NaN
Group 1 | 0.388109 | NaN
test = dabest.load(data=df, x='groups', y='value', idx=['Group 1', 'Group 2'])
test.mean_diff
This generates a bunch of warnings:
.../numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
.../numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
.../dabest/_stats_tools/confint_2group_diff.py:157: RuntimeWarning: invalid value encountered in less
prop_less_than_es = sum(B < effsize) / len(B)
.../dabest/_classes.py:545: UserWarning: The lower limit of the BCa interval cannot be computed. It is set to the effect size itself. All bootstrap values were likely all the same.
stacklevel=0)
.../dabest/_classes.py:550: UserWarning: The upper limit of the BCa interval cannot be computed. It is set to the effect size itself. All bootstrap values were likely all the same.
stacklevel=0)
.../scipy/stats/stats.py:5001: RuntimeWarning: divide by zero encountered in double_scalars
z = (bigu - meanrank) / sd
.../numpy/core/fromnumeric.py:3367: RuntimeWarning: Degrees of freedom <= 0 for slice
**kwargs)
.../numpy/core/_methods.py:110: RuntimeWarning: invalid value encountered in true_divide
arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
.../numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
(...)
The unpaired mean difference between Group 1 and Group 2 is nan [95%CI nan, nan].
The two-sided p-value of the Mann-Whitney test is 0.0.
(...)
test = dabest.load(data=df[['groups','value']], x='groups', y='value', idx=['Group 1', 'Group 2'])
test.mean_diff
(...)
The unpaired mean difference between Group 1 and Group 2 is -0.0708 [95%CI -0.202, 0.0631].
The two-sided p-value of the Mann-Whitney test is 0.268.
(...)
df.unrelated = 0
test = dabest.load(data=df, x='groups', y='value', idx=['Group 1', 'Group 2'])
test.mean_diff
(...)
The unpaired mean difference between Group 1 and Group 2 is -0.0708 [95%CI -0.202, 0.0631].
The two-sided p-value of the Mann-Whitney test is 0.268.
(...)
This addresses #92, by providing a robust non-parametric bootstrap-based permutation p-value.
c = [1.45246, 1.19208, 1.61360, 1.12898, 1.27610]
t = [1.88, 2.33249, 0.80159, 1.44444]
df = pd.DataFrame({"group": np.repeat("control", len(c)).tolist() + np.repeat("test", len(t)).tolist(),
"value": c + t})
db = dabest.load(df, x="group", y="value", idx=["control", "test"])
db.hedges_g
produces
/Users/whho/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/dabest/_stats_tools/effsize.py:224: RuntimeWarning: divide by zero encountered in double_scalars
return M / pooled_sd
DABEST v0.2.5
=============
Good afternoon!
The current time is Tue Oct 1 16:16:10 2019.
The unpaired Hedges' g between control and test is 0.553 [95%CI -1.58, 2.41].
The two-sided p-value of the Mann-Whitney test is 0.54.
5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.
To get the results of all valid statistical tests, use `.hedges_g.statistical_tests`
and an attempt to create the Hedges' g estimation plot results in
IndexError: list index out of range
Inspection of db.hedges_g.results.bootstraps reveals
0 [-inf, -7.413427045394007, -5.693852539205496,...
Name: bootstraps, dtype: object
We will have to discard the ±Infs before saving the bootstrap values....
First noted on the estimationstats webapp.
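A minimal sketch of the filtering step mentioned above, assuming the bootstrap values are held in a NumPy array. The numbers are illustrative stand-ins, not the actual bootstrap output, and the percentile call is only there to show that downstream summaries work again once the non-finite values are gone.

```python
import numpy as np

# Illustrative bootstrap values, including the problematic non-finite entries.
bootstraps = np.array([-np.inf, -7.41, -5.69, 0.55, 2.41, np.inf, np.nan])

# Keep finite values only; np.isfinite is False for +Inf, -Inf and NaN.
finite = bootstraps[np.isfinite(bootstraps)]

low, high = np.percentile(finite, [2.5, 97.5])
print(finite.size, low, high)
```

Note that discarding values changes the effective number of resamples, so it may be worth reporting how many were dropped.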
Hi, @josesho,
I got the following error when there are only two unique values in one of the groups being compared:
"""
Traceback (most recent call last):
File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/_classes.py", line 1295, in plot
out = EffectSizeDataFramePlotter(self, **all_kwargs)
File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/plotter.py", line 488, in EffectSizeDataFramePlotter
halfviolin(v, fill_color=fc, alpha=halfviolin_alpha)
File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/plot_tools.py", line 17, in halfviolin
V = b.get_paths()[0].vertices
IndexError: list index out of range
"""
It seems that DABEST needs at least three points to generate a half violin plot, is that so?
Concretely, I have several groups to be compared. Most of them have more than 3 unique values.
However, I have one group with a very flat distribution of data points (a lot of 0s and 1s). Nothing else.
Frankly, I'm not sure what is happening...
Do you think the error is coming from here?
Best wishes
Tong
Hi,
Thanks for this wonderful tool you provided!
I was wondering if it is statistically possible to include nuisance parameters in the group comparison?
For example, if I want to compare a biomarker (DV) of two groups (IV1) and control for gender (IV2) and age (IV3).
In a multiple regression setting, I would include these variables in the model or use them as random effects. (Biomarker ~ Group + Gender + Age, respectively Biomarker ~ Group + (1 | Gender) + (1 | Age))
Is this also possible in bootstrapping?
Thank you and best regards,
Marco
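In general, yes: a covariate-adjusted effect can be bootstrapped by resampling whole subjects and refitting the regression each time. The sketch below does this with ordinary least squares in NumPy on synthetic data. It is not a DABEST feature, every number in it is made up, and it covers only the fixed-effects formula (Biomarker ~ Group + Gender + Age) with a percentile interval.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic data: all values below are invented for illustration.
group = rng.integers(0, 2, size=n)    # 0 = control, 1 = test group
gender = rng.integers(0, 2, size=n)
age = rng.uniform(20, 80, size=n)
biomarker = 2.0 * group + 0.5 * gender + 0.03 * age + rng.normal(0, 0.5, size=n)


def adjusted_group_effect(group, gender, age, biomarker):
    """OLS coefficient of `group` in biomarker ~ group + gender + age."""
    design = np.column_stack([np.ones(len(group)), group, gender, age])
    coefficients, *_ = np.linalg.lstsq(design, biomarker, rcond=None)
    return coefficients[1]


point = adjusted_group_effect(group, gender, age, biomarker)

# Case resampling: redraw whole subjects (rows) with replacement and refit.
estimates = []
for _ in range(2000):
    rows = rng.integers(0, n, size=n)
    estimates.append(
        adjusted_group_effect(group[rows], gender[rows], age[rows], biomarker[rows])
    )

low, high = np.percentile(estimates, [2.5, 97.5])
print(point, low, high)
```

The adjusted coefficient plays the role of the mean difference here; a mixed model with random effects would need a dedicated library and a resampling scheme that respects the grouping.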
The following code should produce two plots, but only the swarm plot is produced, and an error is thrown.
new_df = pd.read_csv('%s/all_ringrmsd_data_only.txt' % rootdir, sep='\t')
s_control = db.load(new_df,idx=('OMEGA','MOE','Macromodel','Desmond','RDKit','Prime'))
s_control.cohens_d
s_control.mean_diff.plot(raw_marker_size=3)
IndexError: index 12 is out of bounds for size 12
When using float_contrast=False, the zero reference line does not honor the attributes passed in reflines_kwargs.
Compare these two plots for example:
two_groups_unpaired.mean_diff.plot(reflines_kwargs=dict(linestyle='dotted'));
two_groups_unpaired.mean_diff.plot(float_contrast=False,reflines_kwargs=dict(linestyle='dotted'))
Hi! I really like DABEST as it's easy to use for Python novices like me, and I would love to use it in my analysis pipeline, but I've come across the following 'problem' when using long data frames (I am sorry if this is related to the closed issues #55 and #43):
from scipy.stats import norm
t1 = norm.rvs(loc=3.5, scale=0.5, size=4)
df = pd.DataFrame({'parameter 1' : t1, 'parameter 2' : t1, 'parameter 3' : t1, 'group':('group1','group1','group2','group2')})
It would be very convenient to be able to do a multi-group plot for each parameter comparing group1 with group2, but I am afraid it is not possible to use nested tuples for the y values (= parameters in this example) while keeping the x (= 'group') and idx values (= 'group1', 'group2') constant to get something like this as a result:
I am not sure whether this is possible at all since, as stated in #43, repeated use of the same group in idx raises an error to avoid problems. I know I could restructure my data to something like this
to use the multi-group feature in the normal way and fix the labeling in post-processing
dabest.load(df, idx=(("group1_parameter 1", "group2_parameter 1"), ("group1_parameter 2", "group2_parameter 2"), ("group1_parameter 3", "group2_parameter 3") ))
Bottom line: is there, or will there be, a way to do multi-group analysis/plots as shown above from a long/melted data frame?
Best
Lukas
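In the meantime, the restructuring mentioned above can be scripted. This sketch builds one column per group and parameter, named in the "group1_parameter 1" style used in the idx shown in the report, and generates the matching nested idx. The data values are invented, and the sketch assumes both groups have the same number of rows; unequal groups would be padded with NaN.

```python
import pandas as pd

# Toy frame in the same shape as the example; values are invented.
df = pd.DataFrame({
    "parameter 1": [3.1, 3.6, 3.9, 4.2],
    "parameter 2": [2.9, 3.4, 3.3, 3.8],
    "parameter 3": [3.5, 3.0, 4.1, 3.7],
    "group": ["group1", "group1", "group2", "group2"],
})

groups = ["group1", "group2"]
params = [column for column in df.columns if column != "group"]

# One column per (group, parameter) combination.
wide = pd.DataFrame({
    f"{g}_{p}": df.loc[df["group"] == g, p].reset_index(drop=True)
    for p in params
    for g in groups
})

# One (control, test) tuple per parameter.
idx = tuple(tuple(f"{g}_{p}" for g in groups) for p in params)

print(wide)
print(idx)
```

The resulting `wide` and `idx` could then be handed to dabest.load as in the call quoted in the report, with the tick labels fixed afterwards.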
I'm not entirely sure how to properly interpret the graphs. Let's take Figure 1 from the paper:
From what I understand, Figure 1d represents p(data|H0), and here it seems to be just at the 95% threshold (i.e., p<0.05, although there is no confidence interval plotted here). Figure 1e, the essence of this toolbox, looks to me to represent p(H0|data), or more precisely here p(C|T), which here seems just at the 95% threshold again.
Is this interpretation correct? If so, is it possible with DABEST to plot H0 distribution like in Figure 1d?
Indeed, I agree that p(H0|data) is more informative than p(data|H0), but most of NHST still relies on the latter, and being able to plot both inferences would bring the best of both worlds. What I would imagine would be something like this:
What do you think, would this be pertinent?
In some figures, the estimation plot will only be one panel of the figure. It would be useful to have the ability to specify which axes to use to draw the plot, similar to the ax option in seaborn. In this case, two axes will be required for the two elements of the plot.
In principle, one could save the figure as an image and put together a figure in Illustrator or similar, but it would be nice to generate the whole figure in code.
The example (https://github.com/ACCLAB/DABEST-python) doesn't work.
python3 0
Traceback (most recent call last):
File "0", line 12, in
iris_dabest.mean_diff.plot();
File "/usr/local/lib/python3.7/site-packages/dabest/_classes.py", line 1258, in plot
out = EffectSizeDataFramePlotter(self, **all_kwargs)
File "/usr/local/lib/python3.7/site-packages/dabest/plotter.py", line 377, in EffectSizeDataFramePlotter
**group_summary_kwargs)
File "/usr/local/lib/python3.7/site-packages/dabest/plot_tools.py", line 162, in gapped_lines
quantiles = data.groupby(x)[y].quantile([0.25, 0.75])
File "/usr/local/lib64/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1953, in quantile
return result.take(indices)
File "/usr/local/lib64/python3.7/site-packages/pandas/core/series.py", line 4432, in take
new_index = self.index.take(indices)
File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/multi.py", line 2032, in take
na_value=-1,
File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/multi.py", line 2060, in _assert_take_fillable
taken = [lab.take(indices) for lab in self.codes]
File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/multi.py", line 2060, in
taken = [lab.take(indices) for lab in self.codes]
IndexError: index 6 is out of bounds for size 6
Requirement already satisfied, skipping upgrade: scipy>=0.14.0 in /usr/local/lib64/python3.7/site-packages (from seaborn) (1.3.1)
Requirement already satisfied, skipping upgrade: matplotlib>=1.4.3 in /usr/local/lib64/python3.7/site-packages (from seaborn) (3.1.1)
Requirement already satisfied, skipping upgrade: numpy>=1.9.3 in /usr/local/lib64/python3.7/site-packages (from seaborn) (1.17.1)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in /usr/lib/python3.7/site-packages (from pandas>=0.15.2->seaborn) (2.8.0)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /usr/local/lib64/python3.7/site-packages (from matplotlib>=1.4.3->seaborn) (1.1.0)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/site-packages (from matplotlib>=1.4.3->seaborn) (2.4.2)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /usr/local/lib/python3.7/site-packages (from matplotlib>=1.4.3->seaborn) (0.10.0)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.15.2->seaborn) (1.12.0)
Requirement already satisfied, skipping upgrade: setuptools in /usr/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn) (41.0.0)
Requirement already up-to-date: pandas in /usr/local/lib64/python3.7/site-packages (0.25.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /usr/local/lib64/python3.7/site-packages (from pandas) (1.17.1)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in /usr/lib/python3.7/site-packages (from pandas) (2.8.0)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /usr/local/lib/python3.7/site-packages (from pandas) (2019.2)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0)
Requirement already up-to-date: dabest in /usr/local/lib/python3.7/site-packages (0.2.5)
As the title suggests, I'd like to show the bootstrap 95% confidence interval for the control group too. Is there a way of doing that?
Thanks
Hi, I was wondering what your thoughts are on adding a robust statistic, such as the LqrT, either to replace the t-test or as an additional column in the statistical testing? A quick summary: compared to the Wilcoxon rank-sum test, it is more robust when the model is misspecified under a gross error model. See Figure 9 of the paper for the most compelling result.
Proposed solution:
Add the https://github.com/alyakin314/lqrt package to requirements.txt and incorporate its result into the statistical results dataframe.
Reference:
Thank you for the great resource!
I just have a small comment regarding the website version. The vertical bars indicating the SD are not defined in the auto-generated legend and do not appear at all in the two-group plots.
Also the given email address [email protected] does not work.
import seaborn as sns
import dabest
# Requires internet access.
iris = sns.load_dataset("iris")
iris_db = dabest.load(iris, x="species", y="sepal_length",
idx=(("setosa", "versicolor"),
("setosa", "virginica"),
("versicolor", "virginica"))
)
iris_db.mean_diff
DABEST v0.2.4
=============
Good evening!
The current time is Tue May 28 18:30:11 2019.
The unpaired mean difference between setosa and versicolor is 0.93 [95%CI 0.76, 1.1].
The two-sided p-value of the Mann-Whitney test is 8.35e-14.
The unpaired mean difference between setosa and virginica is 1.58 [95%CI 1.38, 1.78].
The two-sided p-value of the Mann-Whitney test is 6.4e-17.
The unpaired mean difference between versicolor and virginica is 0.652 [95%CI 0.428, 0.878].
The two-sided p-value of the Mann-Whitney test is 5.87e-07.
5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.
To get the results of all valid statistical tests, use `.mean_diff.statistical_tests`
iris_db.mean_diff.plot();
When I run either pytest or python -m pytest tests at the repo root, I get the following error.
_________________________ ERROR collecting tests/test_02_plotting.py _________________________
ImportError while importing test module '/home/tongli/Documents/DABEST-python/dabest/tests/test_02_plotting.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_02_plotting.py:13: in <module>
import seaborn as sns
E ModuleNotFoundError: No module named 'seaborn'
I'm sure that I have seaborn installed.
People said that removing the __init__.py in tests/ will do the job.
But this is what I got:
______________________ ERROR collecting tests/test_01_effsizes_pvals.py ______________________
ImportError while importing test module '/home/tongli/Documents/DABEST-python/dabest/tests/test_01_effsizes_pvals.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_01_effsizes_pvals.py:12: in <module>
from .._stats_tools import effsize
E ImportError: attempted relative import with no known parent package
_________________________ ERROR collecting tests/test_02_plotting.py _________________________
ImportError while importing test module '/home/tongli/Documents/DABEST-python/dabest/tests/test_02_plotting.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_02_plotting.py:13: in <module>
import seaborn as sns
E ModuleNotFoundError: No module named 'seaborn'
_________________________ ERROR collecting tests/test_03_confint.py __________________________
ImportError while importing test module '/home/tongli/Documents/DABEST-python/dabest/tests/test_03_confint.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_03_confint.py:8: in <module>
from .._api import load
E ImportError: attempted relative import with no known parent package
I believe it comes from the structure of dabest.
I'm not familiar with Python packaging. Can you point me in the right direction?
Thanks !
I'm sorry, I'm sure this is easy and I apologize if I've missed a duplicate issue. I'm looking to create and export plots out of DABEST for inclusion with a manuscript. Is there a recommended way of saving high-res plots with axis titles that are readable?
Thanks
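One approach that relies only on matplotlib, assuming the object returned by `.plot()` is a regular matplotlib Figure: adjust the text on each of the figure's axes, then save at a high DPI with a tight bounding box. The helper name `shrink_text_and_save` is made up, and the demo below uses an ordinary two-panel figure as a stand-in for a DABEST plot.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt


def shrink_text_and_save(fig, target, fontsize=8, rotation=0, dpi=300,
                         **savefig_kwargs):
    """Resize tick and axis labels on every axes of `fig`, then save it."""
    for ax in fig.axes:
        ax.tick_params(axis="both", labelsize=fontsize)
        ax.tick_params(axis="x", labelrotation=rotation)
        ax.xaxis.label.set_size(fontsize)
        ax.yaxis.label.set_size(fontsize)
    fig.savefig(target, dpi=dpi, bbox_inches="tight", **savefig_kwargs)


# Stand-in figure with two panels; a DABEST figure would be passed the same way.
fig, axes = plt.subplots(1, 2, figsize=(3.5, 3))
axes[0].set_ylabel("value")
axes[1].set_ylabel("mean difference")

buffer = io.BytesIO()
shrink_text_and_save(fig, buffer, fontsize=7, rotation=45, format="png")
print(buffer.getbuffer().nbytes)
```

Passing a file name such as "figure.pdf" instead of the buffer writes a vector file, which journals often prefer over raster images.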
Thanks for this package, it's exactly what I've been looking for!
On a fresh Ubuntu install, pip3 install dabest does not install any of the dependencies:
$ sudo apt install python3 python3-pip
# ...
done.
$ pip3 install dabest
Collecting dabest
Downloading https://files.pythonhosted.org/packages/4b/05/baaf3990f30347f9780977b6b102036b6cbe22a8027e13f331c13e585873/dabest-0.2.5-py2.py3-none-any.whl (61kB)
100% |################################| 71kB 1.5MB/s
Installing collected packages: dabest
Successfully installed dabest-0.2.5
If you (unconditionally) specify the dependencies in setup.py then pip will install them automatically. It'd also stop everyone re-reporting the same issues because they have incompatible versions of dependencies (#52, #60, #67).
I'll send you a PR.
I used the data you created in https://acclab.github.io/DABEST-python-docs/tutorial.html
I made a paired plot. If I choose the median difference, the histogram corresponding to the 95% CI (grey histogram) is not aligned with the interval (black line).
Here is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import dabest
from scipy.stats import norm # Used in generation of populations.
np.random.seed(9999) # Fix the seed so the results are replicable.
Ns = 20 # The number of samples taken from each population
c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males
# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))
df = pd.DataFrame({'Control 1' : c1, 'Test 1' : t1,
'Control 2' : c2, 'Test 2' : t2,
'Control 3' : c3, 'Test 3' : t3,
'Test 4' : t4, 'Test 5' : t5,
'Test 6' : t6,
'Gender' : gender, 'ID' : id_col
})
two_groups_paired = dabest.load(df, idx=("Test 6", "Test 5"),
paired=True, id_col="ID")
plt.figure()
two_groups_paired.median_diff.plot()
plt.show()
I updated to v0.2.8 today, and I noticed my code is much slower than before. This seems to be related to the inclusion of the lqrt test in the results.
import numpy as np
import pandas as pd
import dabest
np.random.seed(1234)
df = pd.DataFrame({'Group1':np.random.normal(loc=0, size=(1000,)),
'Group2':np.random.normal(loc=1, size=(1000,))})
test = dabest.load(df, idx=['Group1','Group2'])
%time print(test.mean_diff)
DABEST v0.2.7
Good morning!
The current time is Tue Dec 31 11:46:00 2019.
The unpaired mean difference between Group1 and Group2 is 1.03 [95%CI 0.941, 1.11].
The two-sided p-value of the Mann-Whitney test is 2.63e-97.
5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.
To get the results of all valid statistical tests, use `.mean_diff.statistical_tests`
CPU times: user 558 ms, sys: 5.83 ms, total: 564 ms
Wall time: 564 ms
import numpy as np
import pandas as pd
import dabest
np.random.seed(1234)
df = pd.DataFrame({'Group1':np.random.normal(loc=0, size=(1000,)),
'Group2':np.random.normal(loc=1, size=(1000,))})
test = dabest.load(df, idx=['Group1','Group2'])
%time print(test.mean_diff)
DABEST v0.2.8
Good morning!
The current time is Tue Dec 31 11:47:09 2019.
The unpaired mean difference between Group1 and Group2 is 1.03 [95%CI 0.941, 1.11].
The two-sided p-value of the Mann-Whitney test is 2.63e-97.
5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.
To get the results of all valid statistical tests, use `.mean_diff.statistical_tests`
CPU times: user 2.46 s, sys: 8.69 ms, total: 2.47 s
Wall time: 2.47 s
Would it be possible to delay doing the statistical tests until effect_size.statistical_tests is called, instead of calculating all the tests a priori?
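For illustration, deferred computation of this kind can be expressed with `functools.cached_property` (Python 3.8+). The class below is entirely hypothetical and is not how DABEST is structured; the expensive tests are replaced by a trivial stand-in so the sketch can show that the work runs once, on first access.

```python
from functools import cached_property


class LazyEffectSize:
    """Hypothetical container: tests run on first access, not at construction."""

    def __init__(self, control, test):
        self.control = control
        self.test = test
        self.test_runs = 0  # counts how often the expensive step executes

    @cached_property
    def statistical_tests(self):
        self.test_runs += 1
        # Stand-in for the expensive tests: just the difference in means.
        diff = sum(self.test) / len(self.test) - sum(self.control) / len(self.control)
        return {"mean_diff": diff}


es = LazyEffectSize([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(es.test_runs)      # nothing has been computed at this point
es.statistical_tests     # first access triggers the computation
es.statistical_tests     # second access reuses the cached result
print(es.test_runs)
```

Printing the effect size would then only need the bootstrap results, and the slower tests would be paid for only by users who ask for them.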
Dear Dabest team,
Thank you for providing such a tool, which makes estimation statistics easy.
When trying to make a shared-control plot with a paired design using:
dabest.load(Data,idx=('Baseline','Post','Off'),paired = True ,id_col = 'ID')
where Data is an 8x3 array, I got the following error:
ValueError: is_paired is True, but some idx in ('Baseline', 'Post', 'Off') does not consist only of two groups.
How can I fix this in order to make a shared-control plot with a paired design?
Maybe a paired design is not supported for shared control? Maybe one day?
Thank you
Kind regards,
Alex
Trying to change the appearance of the slope lines in a paired graph using slopegraph_kwargs results in an exception:
two_groups_paired.mean_diff.plot(slopegraph_kwargs=dict(linestyle='dotted'));
UnboundLocalError: local variable 'slopegraph_kwargs' referenced before assignment
Also, there is no documentation for the parameter slopegraph_kwargs on https://acclab.github.io/DABEST-python-docs/api.html
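An UnboundLocalError of this kind typically arises when a local name is assigned in only one branch and then used unconditionally. Without having checked the plotter source, here is a sketch of a pattern that avoids it by always binding the merged result. The helper name and the default values are made up for illustration.

```python
def resolve_kwargs(user_kwargs=None, defaults=None):
    """Merge user-supplied keyword arguments over defaults.

    Hypothetical helper: the result is always bound, whether or not
    the caller passed anything.
    """
    merged = dict(defaults or {})
    if user_kwargs is not None:
        merged.update(user_kwargs)
    return merged


# Invented defaults, not the ones used by the plotting code.
default_slopegraph = {"linewidth": 1, "alpha": 0.5}

print(resolve_kwargs(None, default_slopegraph))
print(resolve_kwargs({"linestyle": "dotted"}, default_slopegraph))
```

Copying the defaults into a new dict also means a user's overrides never leak into later calls.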
Specifically, if the dataframe passed is "wide" (aka un-melted), it throws this error:
~/anaconda3/envs/dabest-dev-py3.6/lib/python3.6/site-packages/dabest/main.py in plot(data, idx, x, y, color_col, float_contrast, paired, show_pairs, group_summaries, custom_palette, swarm_label, contrast_label, swarm_ylim, contrast_ylim, fig_size, font_scale, stat_func, ci, n_boot, show_group_count, swarmplot_kwargs, violinplot_kwargs, reflines_kwargs, group_summary_kwargs, legend_kwargs, aesthetic_kwargs)
392 color_groups = data_in[x].unique()
393 else:
--> 394 color_groups = data_in[color_col].unique()
395
396 if custom_palette is None:
~/anaconda3/envs/dabest-dev-py3.6/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
3612 if name in self._info_axis:
3613 return self[name]
-> 3614 return object.__getattribute__(self, name)
3615
3616 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'unique'
I have tried to run the test code from the README with the example dataset and get an error (TypeError: must be real number, not list) on the line iris_dabest.mean_diff.plot(). It occurs due to this line in the gapped_lines function:
--> 162 quantiles = data.groupby(x)[y].quantile([0.25, 0.75])
163 .unstack()
164 .reindex(index=group_order)
I am using Python3.7 with the following modules:
scipy 1.3.0
seaborn 0.9.0
pandas 0.25.0
numpy 1.17.0
matplotlib 3.1.1
which I notice are more recent versions of the dependencies. I installed dabest using pip. I have also built the cloned version to test and get the same error.
I have forked a copy of the code to see if there is a simple fix.
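A possible workaround, assuming the failure comes only from passing a list of quantiles to a grouped Series under pandas 0.25.0, is to request each quantile separately and combine the results. The helper group_quartiles below is hypothetical and is not the fix adopted in dabest.

```python
import pandas as pd


def group_quartiles(data, x, y, group_order):
    """Per-group 25th and 75th percentiles, one scalar quantile call each."""
    grouped = data.groupby(x)[y]
    quartiles = pd.DataFrame({
        0.25: grouped.quantile(0.25),
        0.75: grouped.quantile(0.75),
    })
    # Same row order as the plot's group order.
    return quartiles.reindex(index=group_order)


df = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "width": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
q = group_quartiles(df, "species", "width", ["b", "a"])
print(q)
```

The result has the same shape as the unstacked output of the original call: one row per group and one column per quantile.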
Hello,
I appreciate the approach taken by this framework a lot, and I would like to implement it in my publications. However, I would prefer to use a sina plot instead of a beeswarm, as it has 2 advantages:
1- apart from the kernel density estimation, it does not impose an artificial structure on the data (i.e., the "branch-like" lines in the beeswarm),
2- each class's sina plot's width is normalized across all classes, so that we can get an impression of the difference in sample size at a glance.
I think the last point in particular can very well complement the ideas put forward by the DABEST framework. There is a Python implementation of Sina plots in the plotnine package (geom_sina).
Also maybe it would be interesting, if possible at all, to generalize the possibility of using other kinds of plots, as I guess different users might have different preferences?
A question related to the issue: Paired t-test plot 95% CI shifted #46.
The curve corresponding to the 95% CI distribution should be aligned with the black line corresponding to the 95% CI. Most of the curve should be inside the black line; only 5% of the data could be outside. In this image that is not the case at all: more than 5% of the CI distribution falls outside the black line. What am I missing? Thanks in advance.
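The expectation in this question can be checked numerically. The sketch below resamples two made-up groups, builds the bootstrap distribution of the mean difference, and confirms that a percentile interval covers 95% of it. It uses a plain percentile interval as a simplifying assumption. dabest reports a bias-corrected and accelerated interval, whose endpoints can be shifted relative to the plain percentiles.

```python
import random


def bootstrap_mean_diffs(control, test, n_boot=5000, seed=12345):
    """Sorted bootstrap distribution of mean(test) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    return sorted(diffs)


def percentile_ci(sorted_diffs, ci=95):
    """Plain percentile interval (not the BCa interval)."""
    n = len(sorted_diffs)
    tail = int(round(n * (100 - ci) / 200))  # resamples cut from each tail
    return sorted_diffs[tail], sorted_diffs[n - tail - 1]


# Made-up data for illustration only.
rng = random.Random(0)
control = [rng.gauss(3.0, 0.4) for _ in range(20)]
test = [rng.gauss(3.5, 0.5) for _ in range(20)]

diffs = bootstrap_mean_diffs(control, test)
low, high = percentile_ci(diffs)
inside = sum(low <= d <= high for d in diffs) / len(diffs)
print(low, high, inside)
```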
For some reason, in all the examples you have, the variance seems to be +/-5! I wonder if that's because of the model, or rather a suboptimal selection of the example data?
DABEST-python/dabest/_classes.py
Line 1182 in 7fb35f3
The Figure.axes property of the Figure class is a list, and not a function.
https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure.axes
You would therefore access the two axes using FigName.axes[0] and FigName.axes[1].
The results, presented as a pandas DataFrame, do not include the counts for each group. This feature will be added in v0.2.5.
line 533, in EffectSizeDataFramePlotter
contrast_axes.set_ylim(rightmin, rightmax)
UnboundLocalError: local variable 'rightmin' referenced before assignment
The same code works with pandas==0.24.0.
After updating to 0.25.0, it fails and returns the following error:
Traceback (most recent call last):
......
File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/_classes.py", line 1235, in plot
out = EffectSizeDataFramePlotter(self, **all_kwargs)
File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/plotter.py", line 377, in EffectSizeDataFramePlotter
**group_summary_kwargs)
File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/dabest/plot_tools.py", line 162, in gapped_lines
quantiles = data.groupby(x)[y].quantile([0.25, 0.75])\
File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1908, in quantile
interpolation=interpolation,
File "/home/tongli/miniconda3/envs/maars/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 2248, in _get_cythonized_result
func(**kwargs) # Call func to modify indexer values in place
File "pandas/_libs/groupby.pyx", line 692, in pandas._libs.groupby.group_quantile
TypeError: must be real number, not list
Using jupyter notebook in jupyter/scipy-notebook docker container
DockerFile
FROM jupyter/scipy-notebook:2ce7c06a61a1
RUN pip install dabest
Using example from github:
import pandas as pd
import dabest
iris = pd.read_csv("https://github.com/mwaskom/seaborn-data/raw/master/iris.csv")
iris_dabest = dabest.load(data=iris, x="species", y="petal_width",
idx=("setosa", "versicolor", "virginica"))
iris_dabest.mean_diff.plot()
output:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-eb33a02ff120> in <module>
1 # Produce a Cumming estimation plot.
----> 2 iris_dabest.mean_diff.plot()
/opt/conda/lib/python3.7/site-packages/dabest/_classes.py in plot(self, color_col, raw_marker_size, es_marker_size, swarm_label, contrast_label, swarm_ylim, contrast_ylim, custom_palette, swarm_desat, halfviolin_desat, halfviolin_alpha, float_contrast, show_pairs, group_summaries, group_summaries_offset, fig_size, dpi, swarmplot_kwargs, violinplot_kwargs, slopegraph_kwargs, reflines_kwargs, group_summary_kwargs, legend_kwargs)
1233 del all_kwargs["self"]
1234
-> 1235 out = EffectSizeDataFramePlotter(self, **all_kwargs)
1236
1237 return out
/opt/conda/lib/python3.7/site-packages/dabest/plotter.py in EffectSizeDataFramePlotter(EffectSizeDataFrame, **plot_kwargs)
375 gap_width_percent=1.5,
376 type=group_summaries, ax=rawdata_axes,
--> 377 **group_summary_kwargs)
378
379
/opt/conda/lib/python3.7/site-packages/dabest/plot_tools.py in gapped_lines(data, x, y, type, offset, ax, line_color, gap_width_percent, **kwargs)
160
161 medians = data.groupby(x)[y].median().reindex(index=group_order)
--> 162 quantiles = data.groupby(x)[y].quantile([0.25, 0.75])\
163 .unstack()\
164 .reindex(index=group_order)
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in quantile(self, q, interpolation)
1906 post_processing=post_processor,
1907 q=q,
-> 1908 interpolation=interpolation,
1909 )
1910
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _get_cythonized_result(self, how, grouper, aggregate, cython_dtype, needs_values, needs_mask, needs_ngroups, result_is_index, pre_processing, post_processing, **kwargs)
2246 func = partial(func, ngroups)
2247
-> 2248 func(**kwargs) # Call func to modify indexer values in place
2249
2250 if result_is_index:
pandas/_libs/groupby.pyx in pandas._libs.groupby.group_quantile()
TypeError: must be real number, not list
I have noticed this error tends to happen when using float_contrast=False (either specified with two samples, or when using more than two samples).
I am trying to find a way to change the breaks/tick frequency on the contrast axis independently from the rawswarm y axis, on either a single or a tiled Gardner-Altman plot.
My first language is R, so please bear with me.
import numpy as np
import pandas as pd
import dabest
from matplotlib import ticker
from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population
# Create populations
pop1 = norm.rvs(loc=3, scale=0.4, size=pop_size)
pop2 = norm.rvs(loc=3.5, scale=0.5, size=pop_size)
pop3 = norm.rvs(loc=2.5, scale=0.6, size=pop_size)
pop4 = norm.rvs(loc=3, scale=0.75, size=pop_size)
pop5 = norm.rvs(loc=3.5, scale=0.75, size=pop_size)
pop6 = norm.rvs(loc=3.25, scale=0.4, size=pop_size)
# Sample from the populations
sampling_kwargs = dict(size=Ns, replace=False)
g1 = np.random.choice(pop1, **sampling_kwargs)
g2 = np.random.choice(pop2, **sampling_kwargs)
g3 = np.random.choice(pop3, **sampling_kwargs)
g4 = np.random.choice(pop4, **sampling_kwargs)
g5 = np.random.choice(pop5, **sampling_kwargs)
g6 = np.random.choice(pop6, **sampling_kwargs)
# Add a `gender` column for coloring the data.
# Integer division, because np.repeat needs an integer repeat count.
females = np.repeat('Female', Ns // 2).tolist()
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males
# Add an `id` column for paired data plotting.
# More info below!
id_col = pd.Series(range(1, Ns+1))
# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control' : g1,
'Group 1' : g2,
'Group 2' : g3,
'Group 3' : g4,
'Group 4' : g5,
'Group 5' : g6,
'Gender' : gender,
'ID' : id_col
})
f1, r1 = dabest.plot(df, idx=(('Group 2','Group 3'),('Group 4','Group 5')))
This produces plot 5 in the dabest tutorial. Let's say you want to change the y axis to span some predefined range.
f2, r2 = dabest.plot(df, idx=(('Group 2','Group 3'),('Group 4','Group 5')), swarm_ylim=(1,6))
Here we are changing the range of the rawswarm y axis. This also alters the contrast axis, which is my main problem. Given my predefined range on the rawswarm y axis, I want a different range and a higher frequency of ticks on the contrast axis:
f3, r3 = dabest.plot(df, idx=(('Group 2','Group 3'),('Group 4','Group 5')),
swarm_ylim=(1,6))
contrast_axes = f3.axes[2]
contrast_axes.set_ylim(-1,1)
contrast_axes.yaxis.set_major_locator(ticker.MultipleLocator(0.1))
While this changes the range and tick frequency of the contrast axis, it also widens the mean difference distribution and shifts the zero and mean difference lines upwards. Another thing I don't understand is that when you increase the range of the contrast axis, e.g., by setting contrast_axes.set_ylim(-3,3), the mean difference distribution becomes less stretched, almost as it originally appeared in f1.
Any help on the matter would be highly appreciated.
Thanks.
Is there a way to calculate Cohen's d for paired data in DABEST? Currently DABEST appears to return only unpaired Cohen's d.
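For reference, one common definition of a paired Cohen's d (often written d_z) divides the mean of the within-pair differences by their standard deviation. The sketch below computes it by hand with made-up numbers. It is not a dabest function, and other paired variants (for example, those standardising by the average of the two group standard deviations) give different values.

```python
from statistics import mean, stdev


def cohens_d_paired(before, after):
    """Cohen's d_z: mean of the paired differences over their sample SD."""
    if len(before) != len(after):
        raise ValueError("paired data must have equal lengths")
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / stdev(diffs)


# Made-up paired measurements for illustration.
before = [1.0, 2.0, 3.0, 4.0]
after = [2.0, 4.0, 5.0, 5.0]
print(cohens_d_paired(before, after))
```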
Hi,
Thank you for the excellent package. I am analysing survey data with many data points loaded onto one response value. Is there a way I can enlarge the panel of group "A" in the attached swarm plot to accurately show all the data points?
Here is some example data. I am new to dabest and Python, so I apologize if the question has been asked before.
import numpy as np
import pandas as pd
import dabest
# Responses 1-5, repeated to give 700 data points in total.
x = np.arange(1,6)
xd = np.repeat(x, [70,80,100,300,150])
# np.str was removed from recent NumPy releases; the built-in str is equivalent.
f = np.array(["A", "B", "C"], dtype = str)
fd = np.repeat(f, [510,80,110])
# Seed NumPy's generator, which is the one np.random.shuffle uses.
np.random.seed(123)
np.random.shuffle(fd)
xf = np.vstack((xd,fd))
eg = pd.DataFrame(data = xf, index = ["Resp", "Type"])
eg = eg.T
eg['Resp'] = eg['Resp'].astype('float')
egdf = dabest.load(eg, idx=("A", "B", "C"),
x="Type", y="Resp")
egdfclf = egdf.cliffs_delta
egplt = egdf.cliffs_delta.plot(raw_marker_size = 2, fig_size = [12, 6])