mckinsey / causalnex

A Python library that helps data scientists to infer causation rather than observing correlation.

Home Page: http://causalnex.readthedocs.io/

License: Other

Makefile 0.09% Python 99.17% Shell 0.63% Dockerfile 0.11%

causal-inference causal-models causal-networks bayesian-networks bayesian-inference machine-learning data-science causalnex

causalnex's Introduction


What is CausalNex?

"A toolkit for causal reasoning with Bayesian Networks."

CausalNex aims to become one of the leading libraries for causal reasoning and "what-if" analysis using Bayesian Networks. It helps to simplify the steps:

To learn causal structures,
To allow domain experts to augment the relationships,
To estimate the effects of potential interventions using data.

Why CausalNex?

CausalNex is built on our collective experience to leverage Bayesian Networks to identify causal relationships in data so that we can develop the right interventions from analytics. We developed CausalNex because:

We believe leveraging Bayesian Networks is a more intuitive way to describe causality than traditional machine learning methodologies, which are built on pattern recognition and correlation analysis.
Causal relationships are more accurate if we can easily encode or augment domain expertise in the graph model.
We can then use the graph model to assess the impact from changes to underlying features, i.e. counterfactual analysis, and identify the right intervention.

In our experience, a data scientist generally has to use at least 3-4 different open-source libraries before arriving at the final step of finding the right intervention. CausalNex aims to simplify this end-to-end process for causality and counterfactual analysis.

What are the main features of CausalNex?

The main features of this library are:

Use state-of-the-art structure learning methods to understand conditional dependencies between variables
Allow domain knowledge to augment model relationships
Build predictive models based on structural relationships
Fit the probability distributions of the Bayesian Networks
Evaluate model quality with standard statistical checks
Simplify how causality is understood in Bayesian Networks through visualisation
Analyse the impact of interventions using Do-calculus

How do I install CausalNex?

CausalNex is a Python package. To install it, simply run:

pip install causalnex

Use the all extra for a full installation of dependencies:

pip install "causalnex[all]"

See more detailed installation instructions, including how to set up Python virtual environments, in our installation guide and get started with our tutorial.

How do I use CausalNex?

You can find the documentation for the latest stable release here. It explains:

An end-to-end tutorial on how to use CausalNex
The main concepts and methods in using Bayesian Networks for Causal Inference

Note: You can find the notebook and markdown files used to build the docs in docs/source.

Can I contribute?

Yes! We'd love you to join us and help us build CausalNex. Check out our contributing documentation.

How do I upgrade CausalNex?

We use SemVer for versioning. The best way to upgrade safely is to check our release notes for any notable breaking changes.

How do I cite CausalNex?

You may click "Cite this repository" under the "About" section of this repository to get the citation information in APA and BibTeX formats.

What licence do you use?

See our LICENSE for more detail.

We're hiring!

Do you want to be part of the team that builds CausalNex and other great products at QuantumBlack? If so, you're in luck! QuantumBlack is currently hiring Machine Learning Engineers who love using data to drive their decisions. Take a look at our open positions and see if you're a fit.


causalnex's Issues

'BDeu' bayes_prior throwing an error

Description

'K2' as bayes_prior is working but 'BDeu' throws an error.

Steps to Reproduce

bn = BayesianNetwork(graph_largest_sub)
bn = bn.fit_node_states(train)
bn = bn.fit_cpds(train, method='BayesianEstimator', bayes_prior='BDeu')

Actual Result

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-ec000222de66> in <module>
      1 bn = BayesianNetwork(graph_largest_sub)
      2 bn = bn.fit_node_states(train)
----> 3 bn = bn.fit_cpds(train, method="BayesianEstimator", bayes_prior='BDeu')

/usr/local/lib/python3.7/site-packages/causalnex/network/network.py in fit_cpds(self, data, method, bayes_prior, equivalent_sample_size)
    368                 prior_type=bayes_prior,
    369                 equivalent_sample_size=equivalent_sample_size,
--> 370                 state_names=state_names,
    371             )
    372         else:

/usr/local/lib/python3.7/site-packages/pgmpy/models/BayesianModel.py in fit(self, data, estimator, state_names, complete_samples_only, **kwargs)
    695         _estimator = estimator(self, data, state_names=state_names,
    696                                    complete_samples_only=complete_samples_only)
--> 697         cpds_list = _estimator.get_parameters(**kwargs)
    698         self.add_cpds(*cpds_list)
    699 

/usr/local/lib/python3.7/site-packages/pgmpy/estimators/BayesianEstimator.py in get_parameters(self, prior_type, equivalent_sample_size, pseudo_counts)
     71                                     prior_type=prior_type,
     72                                     equivalent_sample_size=_equivalent_sample_size,
---> 73                                     pseudo_counts=_pseudo_counts)
     74             parameters.append(cpd)
     75 

/usr/local/lib/python3.7/site-packages/pgmpy/estimators/BayesianEstimator.py in estimate_cpd(self, node, prior_type, pseudo_counts, equivalent_sample_size)
    131             pseudo_counts = [1] * node_cardinality
    132         elif prior_type == 'BDeu':
--> 133             alpha = float(equivalent_sample_size) / (node_cardinality * np.prod(parents_cardinalities))
    134             pseudo_counts = [alpha] * node_cardinality
    135         elif prior_type == 'dirichlet':

TypeError: float() argument must be a string or a number, not 'NoneType'
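The last frame of the traceback shows that the BDeu branch divides `equivalent_sample_size` by the number of state combinations, so the call fails when that argument arrives as `None`. The sketch below only mirrors that arithmetic in plain Python; `bdeu_pseudo_counts` is a hypothetical helper, not part of pgmpy or CausalNex. Since the `fit_cpds` signature in the traceback accepts `equivalent_sample_size`, passing a number explicitly may avoid the error, though the thread does not confirm this.

```python
from math import prod


def bdeu_pseudo_counts(equivalent_sample_size, node_cardinality, parents_cardinalities):
    """Hypothetical helper mirroring the BDeu branch shown in the traceback."""
    alpha = float(equivalent_sample_size) / (node_cardinality * prod(parents_cardinalities))
    return [alpha] * node_cardinality


# With an explicit equivalent sample size the prior is well defined.
counts = bdeu_pseudo_counts(10, node_cardinality=2, parents_cardinalities=[2, 5])

# With None, float() raises the same kind of TypeError as in the report.
try:
    bdeu_pseudo_counts(None, node_cardinality=2, parents_cardinalities=[2, 5])
    failed = False
except TypeError:
    failed = True
```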

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

CausalNex version used (pip show causalnex): 0.5.0
Python version used (python -V): 3.7
Operating system and version: Ubuntu 16.04

Possibility to manually define the CPTs

Hello, is there a way to manually define the CPTs (Conditional Probability Tables)? I couldn't find anything related to this in the documentation.

If there isn't, it would be interesting to have this feature.
Thank you all for the great library.

How to define my dataset label in CausalNex, as with DecisionTreeClassifier?

When I use DecisionTreeClassifier, as below:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(train, label, test_size=0.2, random_state=33)
dtc = DecisionTreeClassifier(max_depth=12)
dtc.fit(X_train, y_train)

How can I define the label of the dataset when training with CausalNex?

do_intervention never ends running despite simple query

Hi QB––

Description

I am running a do-intervention on a small dataset (116x32) with 2 to 4 discretised buckets.
The BN fits the CPDs in 2 seconds, so performance is relatively good.

However, a simple do-intervention never finishes running. I waited several hours and then interrupted the kernel.

Steps to Reproduce

from causalnex.inference import InferenceEngine
ie = InferenceEngine(bn)
ie.do_intervention("cD_TropCycl", {1: 0.2, 2: 0.8})
print("distribution after do", ie.query()["cD_TropCycl"])

Expected Result

Shouldn't it be running just a few seconds given the low number of buckets?
How long does it normally take?

Actual Result

No results were returned after hours of running a simple query.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

CausalNex version used (pip show causalnex): 0.5.0
Python version used (python -V): python 3.7.6
Operating system and version: osx 10.15.14 on 2.3 GHz Quad-Core Intel Core i5

Thank you very much!!

Questions about SCMs

Do you think it is proper to use CausalNex with time series data? (~55 years of annual records, pct change applied). I know on the website, it says it is recommended to use CausalNex with at least 1000 instances but I keep the number of nodes as 5 or 6, so I think maybe it can work. But I'm not really sure about that.
What are the edge weights representing? How important are their quantities? And what should we understand if they are positive or negative?
How should we decide on the threshold?

I would be really glad if you can help. Thanks.

NaN and Missing Value Management

Description

CausalNex should give the option to treat missing values as a category for each variable of interest.

Context

NaN or missing values are common in any study and can themselves reveal a lot of information.

plot_structure does not plot anything

Description

plot_structure does not plot anything since the update. I am just trying to replicate the tutorial in a jupyter notebook.

Steps to Reproduce

from causalnex.structure import StructureModel
from causalnex.plots import plot_structure

sm = StructureModel()
sm.add_edges_from([
    ('health', 'absences'),
    ('health', 'G1')
])
_, _, _ = plot_structure(sm)

Environment

CausalNex version used : causalnex==0.5.0
Python version used (python -V): 3.7
Operating system and version: Ubuntu 16

Installation instruction does not contain set up with Docker

Description

Most developers use Python with Docker nowadays, but the instructions only cover conda and PyCharm. I suggest adding a Docker setup as well.

Context

I still find it difficult to set up causalnex locally, but with Docker it was much simpler and reproducible. I would therefore like to add this feature and, if needed, I can do that and contribute as well.

Possible Implementation

We could use a base python image or a standard image loaded with other packages and extend it (if needed I can share how I did it)

Possible Alternatives

Alternatives are already available

how to fit_cpds on batch?

Description

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."

Context

Why is this change important to you? How would you use it? How can it benefit other users?

Possible Implementation

(Optional) Suggest an idea for implementing the addition or change.

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

Edit on Github in readthedocs links to a 404

The edit on Github links to a non-existing https://github.com/quantumblacklabs/causalnex/blob/master/docs/index.rst file

Instead it should probably lead to the repo.

Some links in README are broken

Description

Firstly, congrats on open sourcing this project, awesome work everyone! 🙂 Just a quick issue: it seems there are some broken links in the tutorial section of the README.

Steps to Reproduce

Go to README
Click on link in "An end-to-end tutorial on how to use CausalNex" or "The main concepts and methods in using Bayesian Networks for Causal Inference"

Expected Result

Loads correct tutorial page

Actual Result

404

Dependencies shouldn't have pinned requirements

The dependencies specified in setup.py come directly from requirements.txt, where they are pinned to exact versions (e.g. pandas == 0.24.0). This is not good practice for a package except in some edge cases (e.g. a known incompatibility or a known bug, but even in that case, it is preferable to declare a "less-than" requirement than an equality).

Due to these pinned versions, it is not easy to use causalnex in an existing environment (where, e.g., another version than pandas 0.24.0 is needed).

Could I suggest declaring more permissive constraints? It is quite common in Python to define only a lower bound (e.g. pandas>=0.24.0), but this is a very optimistic way of specifying a dependency constraint, since a package is unlikely to remain fully compatible with all future updates of a dependency. A better option would be to use ~= (i.e. "all updates supposedly compatible with the given version").

Btw, for the specific case of pandas, I suggest also adopting updates from the 1.0.0 branch.
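To make the effect of ~= concrete, here is a minimal sketch of the compatible-release rule for plain numeric versions only (no pre-releases, epochs or local versions). `satisfies_compatible_release` is a hypothetical helper written for illustration; real tooling should rely on pip's own resolver.

```python
def satisfies_compatible_release(installed, spec):
    """Return True if `installed` satisfies `~=spec` for plain numeric versions."""
    want = [int(part) for part in spec.split(".")]
    have = [int(part) for part in installed.split(".")]
    if len(want) < 2:
        raise ValueError("~= needs at least two version segments")
    # Pad the installed version with zeros so that "0.24" compares like "0.24.0".
    have += [0] * (len(want) - len(have))
    # Every segment except the last one in the spec must match exactly ...
    prefix_matches = have[: len(want) - 1] == want[:-1]
    # ... and the installed version must not be older than the spec.
    return prefix_matches and have >= want


patch_update = satisfies_compatible_release("0.24.2", "0.24.0")
minor_update = satisfies_compatible_release("0.25.3", "0.24.0")
major_update = satisfies_compatible_release("1.0.0", "0.24.0")
looser_spec = satisfies_compatible_release("0.25.3", "0.24")
```

Under this rule, `~=0.24.0` accepts 0.24.2 but rejects both 0.25.3 and 1.0.0, so admitting the pandas 1.0.0 branch would need a wider constraint.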

Syntax Error

Description

I am currently running Bayesian Network Tutorial. I am getting the following syntax error while running the code.

Context

How has this bug affected you? What were you trying to accomplish?
Building a Bayesian Network. Unable to build it because of the syntax error.

Steps to Reproduce

from causalnex.structure import StructureModel

# Encoding the causal graph suggested by an expert
#        d
#     ↙  ↓  ↘
#    a ← b → c
#        ↑  ↗
#          e

sm_manual = StructureModel()
sm_manual.add_edges_from(
[
("b", "a", origin="expert"),

)

Run it in jupyter

Expected Result

It should form a Bayesian Graph

Actual Result

Getting syntax Error

-- If you received an error, place it here.
 File "<ipython-input-5-127b20ebb195>", line 11
    ("b", "a", origin="expert"),
                     ^
SyntaxError: invalid syntax

-- Separate them if you have more than one.
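The error comes from Python itself rather than from CausalNex: a keyword argument such as origin="expert" cannot appear inside a tuple literal. The sketch below shows this with the standard library only. `add_edges_from` here is a hypothetical stand-in, not the library method; I believe the tutorial passes `origin` as a keyword to the real `add_edges_from` call, after the list of edges, but that is an assumption to check against the documentation.

```python
# A keyword argument inside a tuple literal is not valid Python.
bad_edge = '("b", "a", origin="expert")'
try:
    compile(bad_edge, "<edge>", "eval")
    is_valid = True
except SyntaxError:
    is_valid = False


def add_edges_from(edges, origin="unknown"):
    """Hypothetical stand-in: attach one `origin` label to every edge."""
    return [(u, v, {"origin": origin}) for u, v in edges]


# The keyword belongs to the call, after the list of plain (source, target) tuples.
edges = add_edges_from([("b", "a"), ("b", "c")], origin="expert")
```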

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

CausalNex version used (pip show causalnex): 0.4.2
Python version used (python -V): 3.7.4
Operating system and version: Ubuntu 18.04

[minor issue] `make lint` fails to install `black` if already not installed

Description

make lint fails if black is not installed. It fails with the message: permission denied

This is not a huge issue (we can install it on our own), but I believe the expected behaviour is for Make lint to install black if Python >=3.6 and

Context

cloning the repo and contributing.

Steps to Reproduce

I tested initialising environments with the versions of python 3.5, 3.6 and 3.7, installing the requirements and then running make lint

3.5: all passed
3.6 and 3.7: the following error message

Traceback (most recent call last):
  File "/opt/anaconda3/envs/test_37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/envs/test_37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/gabriel_azevedo_ferreira/Documents/Projects/CAUSALNEX_RND/causalnex/tools/min_version.py", line 46, in <module>
    subprocess.run(run_cmd, check=True)
  File "/opt/anaconda3/envs/test_37/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/anaconda3/envs/test_37/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/opt/anaconda3/envs/test_37/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'black'


Your Environment

Include as many relevant details about the environment in which you experienced the bug:

Python version used (python -V):
Operating system and version: Mac OS catalina

Obtaining conditional probability distribution of the whole graph after a do-intervention

Description

I'm trying to obtain an interventional data distribution, i.e., I want to intervene on a specific node and see how that affects the conditional probabilities in the entire graph. Currently, it's possible to intervene on a specific node and query the marginals of every other node. Let's say I have a graph with 3 nodes (A, B, C), one of them (A) is confounding the causal relation between the other two (B --> C). I can intervene on the confounding node A, and I want to obtain the probabilities P(B | A) and P(C | B, A). I suppose the former is directly inferred from the marginal P(B) (since we know the interventional value of A). But how can I obtain P(C | B, A)?

Context

Database repair problems, where you'd want to remove unwanted causal effects from the distribution

Possible Implementation

Possibility to query conditionals along with marginals, or to sample from the distributions resulting from an intervention?

ValueError fitting different data in train and test

Description

A ValueError is raised when trying to access bn.cpds after probabilities have been fit. This occurs when there are fewer states in the data given to fit_cpds than in the data given to fit_node_states. For example, when fit_node_states is called with the full dataset, and fit_cpds with only a training portion.

Context

It prevents me from inspecting the CPDs fit using training data.

Steps to Reproduce

from causalnex.structure import StructureModel
from causalnex.network import BayesianNetwork
import pandas as pd

sm = StructureModel([("a", "b"), ("c", "b")])
bn = BayesianNetwork(sm)

train = pd.DataFrame(data=[[0, 0, 1], [1, 0, 1], [1, 1, 1]], columns=["a", "b", "c"])
test = pd.DataFrame(data=[[0, 0, 1], [1, 0, 1], [1, 1, 2]], columns=["a", "b", "c"])
data = pd.concat([train, test])

bn.fit_node_states(data)
bn.fit_cpds(train)
bn.cpds

Expected Result

The CPDs should be available

Actual Result

a ValueError is received.

ValueError: cannot reshape array of size 4 into shape (2,4)
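The numbers in the error message can be reproduced by counting states. This is a plain-Python sketch of the bookkeeping only, not the library's code; `states` and `cpd_shape` are hypothetical helpers. Node b has parents a and c, and state 2 of c appears only in the test rows.

```python
from math import prod

train = [{"a": 0, "b": 0, "c": 1}, {"a": 1, "b": 0, "c": 1}, {"a": 1, "b": 1, "c": 1}]
test = [{"a": 0, "b": 0, "c": 1}, {"a": 1, "b": 0, "c": 1}, {"a": 1, "b": 1, "c": 2}]


def states(rows, column):
    """Sorted distinct values observed for one column."""
    return sorted({row[column] for row in rows})


def cpd_shape(rows, node, parents):
    """(number of node states, number of parent state combinations)."""
    return (len(states(rows, node)), prod(len(states(rows, p)) for p in parents))


expected_shape = cpd_shape(train + test, "b", ["a", "c"])  # states from all data
observed_shape = cpd_shape(train, "b", ["a", "c"])         # states seen in train only
observed_size = observed_shape[0] * observed_shape[1]
```

The table built from the training rows holds 4 values, while the node states learned from the full data call for a (2, 4) table, which matches the reshape error above.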

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

pip show causalnex

Name: causalnex
Version: 0.4.3
Summary: Toolkit for causal reasoning (Bayesian Networks / Inference)
Home-page: https://github.com/quantumblacklabs/causalnex
Author: QuantumBlack Labs
Author-email: [email protected]
License: Apache Software License (Apache 2.0)
Location: /Users/ben_horsburgh/opt/anaconda3/lib/python3.7/site-packages
Requires: networkx, scikit-learn, pgmpy, matplotlib, pandas, numpy, wrapt, prettytable, scipy
Required-by: 
Note: you may need to restart the kernel to use updated packages.

Python 3.7.4
MacOS 10.15.2

Evaluating "Learning Sparse Nonparametric DAGs"

Description

[Update]DAGs with NO TEARS
https://github.com/xunzheng/notears

Zheng, X., Dan, C., Aragam, B., Ravikumar, P., & Xing, E. P. (2020). Learning sparse nonparametric DAGs (AISTATS 2020, to appear).

https://arxiv.org/pdf/1909.13189.pdf

Possible Implementation

https://github.com/xunzheng/notears

Please evaluate arXiv:1909.13189.

Thanks!

pygraphviz is difficult to install

Description

I have never used pygraphviz before, and most of the Google and Stack Overflow results for it are about errors during installation, on both Mac and PC.

Context

It has been frustrating to run CausalNex in Jupyter. It works in a terminal but is cumbersome. Why are Seaborn or Matplotlib not used instead? (Graphviz and pygraphviz are apparently powerful, but they are rarely used compared to other Python visualisation tools.)


log_uniform method for Discretiser

Description

The Discretiser class is great. It would be quite helpful to have a log_uniform method alongside uniform.

Context

Attribute distributions are often zero-inflated or otherwise non-uniform / log-uniform.

Possible Implementation

Use the same method as in uniform, but first take the logarithm of the data and then convert the split points back to the original scale.
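A minimal sketch of that idea, assuming strictly positive values: compute equal-width split points in log space and exponentiate them. Both function names are hypothetical and not part of the Discretiser API; zero-inflated data would need an offset or a separate bucket for zeros, which is left out here.

```python
import math


def uniform_split_points(values, num_buckets):
    """Equal-width split points between the minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    return [lo + width * k for k in range(1, num_buckets)]


def log_uniform_split_points(values, num_buckets):
    """Hypothetical helper: equal-width buckets in log space (values must be > 0)."""
    if min(values) <= 0:
        raise ValueError("log-uniform buckets need strictly positive values")
    logs = [math.log(v) for v in values]
    return [math.exp(point) for point in uniform_split_points(logs, num_buckets)]


values = [1, 3, 10, 30, 100, 300, 1000]
linear = uniform_split_points(values, 3)
log_spaced = log_uniform_split_points(values, 3)
```

On this sample the linear split points fall at 334 and 667, leaving most values in the first bucket, while the log-spaced ones fall near 10 and 100.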

Is causalnex really causal?

In the lib repo you write:

A Python library that helps data scientists to infer causation rather than observing correlation

In docs:

In this package we are mostly interested in the case where BNs are causal. Hence, the edge between nodes should be seen as cause -> effect relationship.

Also:

Bayesian Network consists of a DAG, a causal graph where nodes represents random variables

A bayesian network can equivalently encode dependency between variables as a -> b -> c and a <- b <- c. https://en.wikipedia.org/wiki/Bayesian_network#Causal_networks. What helps causalnex find causal relationships and does it really do that? Thanks
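The equivalence mentioned above can be checked numerically. The sketch below builds a joint distribution from the chain a -> b -> c using made-up probabilities, then re-factorises the same joint along a <- b <- c and compares the two. It illustrates why observational data alone cannot distinguish these two structures; it says nothing about how CausalNex orients edges.

```python
from itertools import product

# Chain a -> b -> c with binary variables (illustrative numbers).
p_a = {0: 0.3, 1: 0.7}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.25, 1: 0.75}}  # p_c_given_b[b][c]

joint = {
    (a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
    for a, b, c in product((0, 1), repeat=3)
}


def marginal(keep):
    """Sum the joint over every variable whose index is not in `keep`."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


p_c, p_b, p_bc, p_ab = marginal([2]), marginal([1]), marginal([1, 2]), marginal([0, 1])

# Re-factorise the same joint along the reversed chain: P(c) * P(b | c) * P(a | b).
reversed_joint = {
    (a, b, c): p_c[(c,)] * (p_bc[(b, c)] / p_c[(c,)]) * (p_ab[(a, b)] / p_b[(b,)])
    for a, b, c in product((0, 1), repeat=3)
}

max_gap = max(abs(joint[k] - reversed_joint[k]) for k in joint)
```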

Why does my model have some negative-weight edges?

Hi,
After learning the BN model, I used 'sm.edges(data)' and found some negative-weight edges in my model.
What is the meaning of those negative-weight edges?

Plotting the DAGs without pygraphviz

Hi, Thank you for the package, it makes lots of causal inference jobs much easier. And the tutorial helps with getting started quickly.

I was just wondering if there is any way of plotting the DAGs inside causalnex without pygraphviz. It's a bit of a pain to get that running on Windows.

Thanks

fit_cpds and fit_node_states_and_cpds showing deprecated error.

Description

There is a bug associated with 'fit_cpds' and 'fit_node_states_and_cpds' functions associated with pandas.

Context

The functions work fine, but the warnings below are still emitted.

Steps to Reproduce

A colab jupyter notebook can be accessed from here:
https://colab.research.google.com/drive/1uY4b_gXSwRvUYe774pm6cX7Whigexzk0?usp=sharing

# Create a Bayesian Network with a manually defined DAG
import pandas as pd

from causalnex.structure.structuremodel import StructureModel
from causalnex.network import BayesianNetwork
from causalnex.inference import InferenceEngine

sm = StructureModel()
sm.add_edges_from([
    ('rush_hour', 'traffic'),
    ('weather', 'traffic'),
])
data = pd.DataFrame({
    'rush_hour': [True, False, False, False, True, False, True],
    'weather': ['Terrible', 'Good', 'Bad', 'Good', 'Bad', 'Bad', 'Good'],
    'traffic': ['heavy', 'light', 'heavy', 'light', 'heavy', 'heavy', 'heavy'],
})
bn = BayesianNetwork(sm)

# Inference can only be performed on the `BayesianNetwork` with learned node states and CPDs
bn = bn.fit_node_states_and_cpds(data)

-- If you received an error, place it here.

/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py:5191: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  object.__getattribute__(self, name)
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py:5192: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  return object.__setattr__(self, name, value)
/usr/local/lib/python3.6/dist-packages/pgmpy/estimators/base.py:54: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  states = sorted(list(self.data.ix[:, variable].dropna().unique()))
/usr/local/lib/python3.6/dist-packages/pgmpy/estimators/base.py:111: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  state_count_data = data.ix[:, variable].value_counts()
/usr/local/lib/python3.6/dist-packages/pgmpy/estimators/MLE.py:128: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  state_counts.ix[:, (state_counts == 0).all()] = 1
{'heavy': 0.7142857142857142, 'light': 0.2857142857142857}

-- Separate them if you have more than one.


Your Environment

Include as many relevant details about the environment in which you experienced the bug:

CausalNex version used (pip show causalnex): 0.8.1
Python version used (python -V): 3.6
Pandas version used: 0.25.3
Operating system and version: colab session

About from_pandas

Description

When I use 'from_pandas' to learn a causal graph with NOTEARS and run 'watch -n 1 free -m', it shows only about 3 of 16 GB in use. I am running on 370 thousand rows, yet only 3 GB of memory is used. How can I improve efficiency?

Context

Every 1.0s: free -m    Tue Jun 30 16:18:36 2020

          total    used    free  shared  buff/cache  available
Mem:      16384    2799   12213       0        1371      13584
Swap:         0       0       0

I have a question about Dynotears

Description

I would like to know the expected input data format and output for DYNOTEARS.

Context

I tried to use dynotears.from_pandas with the DREAM4 challenge data, but got an empty graph.
I constructed a list of 10 dataframes, as below.
In each dataframe, the columns are nodes and the rows are time points, for example:
   g1  g2
1   1   2
2   4   2
3   3   1

Normalization of data

Hi, I have been working with the library, especially with NOTEARS. While running experiments I observed that a structure learned from the raw data can differ from the structure learned from normalized data (MinMaxScaler). Is there any reason for the method to be affected by scale? From reading the paper I couldn't figure out a reason. Or is there some mistake in my application?
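One possible explanation, offered as a sketch rather than a statement about the library's internals: if the learned edge weights behave like linear regression coefficients, and weak edges are pruned with a fixed threshold (as with remove_edges_below_threshold, mentioned in another issue here), then rescaling a variable changes the size of its weights and therefore which edges survive. The example uses an ordinary least-squares slope as a stand-in for an edge weight; `ols_slope`, `min_max` and the threshold value are illustrative assumptions.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x, used here as a stand-in for an edge weight."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def min_max(values):
    """Rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


x = [0, 10, 20, 30, 40]
y = [1.0, 1.1, 1.2, 1.3, 1.4]

raw = ols_slope(x, y)                       # small weight on the raw scale
scaled = ols_slope(min_max(x), min_max(y))  # much larger weight after scaling

threshold = 0.3  # illustrative pruning threshold
kept_raw = abs(raw) >= threshold
kept_scaled = abs(scaled) >= threshold
```

The relationship between x and y is identical in both cases, yet only the scaled version would pass the threshold.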

Python 3.8 support

Hi there,

Just wondering if there are any plans for causalnex to support Python 3.8? I've had a look at the dependencies in the requirements.txt file and they all seem to support Python 3.8 on PyPI.

Thanks, and great documentation by the way!

Uniform discretiser and quantile discretiser always produce the same result

Description

When we call Discretiser(method='uniform', num_buckets=5) and Discretiser(method='quantile', num_buckets=5), it seems we always get the same result. Concretely, it seems both of these two provide quantile discretisation method. It would be better for the uniform discretiser to produce uniform discretisation method.

Context

I wanted to use uniform discretisation as a discretiser, but I could not achieve it with the implemented Discretiser.

Steps to Reproduce

# import modules
from causalnex.discretiser import Discretiser
import numpy as np
# create dummy data following a Gaussian distribution
data = np.random.normal(0, 1, 10000)
# create a uniform discretiser and a quantile discretiser, then check the results: the numeric split points are the same
Discretiser(method='uniform', num_buckets=5).fit(data).numeric_split_points
Discretiser(method='quantile', num_buckets=5).fit(data).numeric_split_points

Expected Result

Uniform discretiser is expected to produce numeric split points the delta of which is constant. Ref: sklearn's KBinsDiscretiser

Actual Result

Uniform discretiser produces numeric split points which separate data so that the number of data points for each bucket is the same. Also, this is the same as what quantile discretiser does.
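For reference, the difference between the two behaviours can be sketched in plain Python. These helpers are hypothetical and are not the Discretiser implementation; they only show what equal-width and equal-frequency split points look like on a skewed sample.

```python
def uniform_split_points(values, num_buckets):
    """Equal-width buckets: the gap between consecutive split points is constant."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    return [lo + width * k for k in range(1, num_buckets)]


def quantile_split_points(values, num_buckets):
    """Equal-frequency buckets: each bucket holds roughly the same number of points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // num_buckets] for k in range(1, num_buckets)]


# Skewed sample so that the two methods visibly disagree.
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 100]
uniform = uniform_split_points(data, 5)
quantile = quantile_split_points(data, 5)
```

Here the uniform split points are evenly spaced between 0 and 100, while the quantile ones cluster where the data is dense.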

Your Environment

causalnex==0.7.0
python 3.7.7
ProductName: Mac OS X
ProductVersion: 10.15.5

Learning with confounder variables

Description

Is the current structure learning algorithm valid when there are confounder variables?
If not, could you add an alternative that takes them into account?

I want to find causal relationships from some columns to a label I defined using from_pandas (NOTEARS), but I get causal relationships from the label to the other columns

Description

I want to find causal relationships from some columns to a label I defined using from_pandas (NOTEARS), but I get causal relationships from the label to the other columns.

Context

First, my dataset has 300K rows × 150 columns. I defined the label (result) and want to find which columns (reasons) contribute to it. The label is imbalanced: more than 280,000 rows are 0 and more than 10,000 are 1.
Second, I use from_pandas() to learn the structure.
Finally, I use sm.remove_edges_below_threshold(0.01), but every causal relationship involving the label points from the label to the other columns.

AttributeError: 'DataFrame' object has no attribute 'ix'

There is an attribute error encountered after running the below statement in the tutorial.

bn = bn.fit_cpds(train, method="BayesianEstimator", bayes_prior="K2")

Pandas version 0.25.1 is used, not version 1.

FYI: networkx and pandas deprecated features in later libraries

This is just an "FYI" - not a bug report.

I am using causalnex with Apache RAPIDS with a view to using GPUs for the networkx activity.
There are a couple of version incompatibilities because RAPIDS uses more recent Python libraries than causalnex. These are:

with networkx 2.4 the weakly_connected_component_subgraphs has been removed (it was announced as deprecated in 2.1; it's now gone).
with pandas 0.25.0 I get the following attribute error (maybe it's actually a Python 3.7 -> Python 3.6 issue as RAPIDS is using 3.6?):
----> 2 classification_report(bn, test, "G1")

/opt/conda/envs/rapids/lib/python3.6/site-packages/causalnex/evaluation/evaluation.py in classification_report(bn, data, node)
205 )
206
--> 207 return pd.DataFrame.from_dict(report, orient="index")

/opt/conda/envs/rapids/lib/python3.6/site-packages/pandas/core/frame.py in from_dict(cls, data, orient, dtype, columns)
1172 # TODO speed up Series case
1173 if isinstance(list(data.values())[0], (Series, dict)):
-> 1174 data = _from_nested_dict(data)
1175 else:
1176 data, index = list(data.values()), list(data.keys())

/opt/conda/envs/rapids/lib/python3.6/site-packages/pandas/core/frame.py in _from_nested_dict(data)
8473 new_data = OrderedDict()
8474 for index, s in data.items():
-> 8475 for col, v in s.items():
8476 new_data[col] = new_data.get(col, OrderedDict())
8477 new_data[col][index] = v

AttributeError: 'float' object has no attribute 'items'
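The traceback suggests that the report dictionary mixes nested dictionaries with at least one bare float: the first value passes the isinstance check, and a later one fails on .items(). The sketch below reproduces that pattern with the standard library only. The report contents, including the "accuracy" key, are assumptions for illustration, and `from_nested_dict` is a simplified imitation of the loop in the traceback, not the pandas function.

```python
report = {
    "Fail": {"precision": 0.5, "recall": 0.4},
    "Pass": {"precision": 0.8, "recall": 0.9},
    "accuracy": 0.75,  # assumed scalar entry; the real report contents are not shown
}


def from_nested_dict(data):
    """Simplified imitation of the loop shown in the traceback."""
    new_data = {}
    for index, s in data.items():
        for col, v in s.items():
            new_data.setdefault(col, {})[index] = v
    return new_data


try:
    from_nested_dict(report)
    error = None
except AttributeError as exc:
    error = str(exc)

# Keeping only the nested entries avoids the failure.
nested_only = {key: value for key, value in report.items() if isinstance(value, dict)}
table = from_nested_dict(nested_only)
```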

classification_report

Description

Hi! I have one problem with classification_report method. It throws an AttributeError: 'float' object has no attribute 'items'. The roc_auc method is working properly called with the same parameters as in the case of classification_report

Actual Result

File "C:\Users......\anaconda3\lib\site-packages\causalnex\evaluation\evaluation.py", line 207, in classification_report
return pd.DataFrame.from_dict(report, orient="index")
File "C:\Users......\anaconda3\lib\site-packages\pandas\core\frame.py", line 1179, in from_dict
data = _from_nested_dict(data)
File "C:\Users......\anaconda3\lib\site-packages\pandas\core\frame.py", line 8486, in _from_nested_dict
for col, v in s.items():
AttributeError: 'float' object has no attribute 'items'

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

CausalNex version used (pip show causalnex): 0.5.0
Python version used (python -V): 3.7
Operating system and version: windows 10

Independence unit test not stable on python 3.5

The unit test for independence of variables in the data generator sometimes fails on Python 3.5 (maybe on other versions as well, but I haven't observed it).

The test is unavoidably stochastic; I am not sure why the failure is specific to Python 3.5 (numpy dependencies/RNG?).

test_mixed_type_independence

Question about Sklearn Interface

Hi,

Thanks for this awesome package!

I wanted to get some more info on the Sklearn Interface.

How does the Sklearn Interface relate to structural models and Bayesian networks that can be used through other parts of the API?

Are the DAGRegressors just non-probabilistic structural causal models, whereas Bayesian Networks are probabilistic? Is this the key difference?

Cheers,
Nick

I want to use from_pandas (NOTEARS) to find causal relationships from some columns to a label that I defined, but I got causal relationships from the label to the other columns

Description

I want to use from_pandas (NOTEARS) to find causal relationships from some columns to a label that I defined, but I got causal relationships from the label to the other columns.

Context

DYNOTEARS implementation

Description

This FR addresses the existing structure learning ability in causalnex.

Context

It provides an alternative to the NOTEARS algorithm that performs comparatively better.

Details: https://www.groundai.com/project/dynotears-structure-learning-from-time-series-data/1

Do-Calculus using CausalNex

Description

How to assess the effect of an intervention? How to compute see and do probabilities?

Context

I want to perform Level 2 of Causal Inference - Do-Calculus.

Process

Consider the student performance example on the CausalNex website.

I am using the "higher" node as the intervention and performing do_intervention on it. I get a probability distribution for G1 based on the intervention. When I compare it with the probability distribution of G1 obtained before the intervention, the two are the same.

pred = ie.query({"higher": "yes"}), gives
'higher': {'no': 0.0, 'yes': 1.0}
'G1': {'Fail': 0.206829529425519, 'Pass': 0.793170470574481}

ie.do_intervention("higher",{"yes":1.0,"no":0.0})
ie.query()
'higher': {'no': 0.0, 'yes': 0.9999999999999998}
'G1': {'Fail': 0.20682952942551894, 'Pass': 0.7931704705744809}

I want to understand how do_intervention differs. How do I compute the see probabilities (before the intervention) and the do probabilities?
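
One general point may help when reading these numbers: conditioning (seeing) and intervening (doing) give identical results whenever the intervened node has no parents, because there is no incoming edge for the intervention to cut. Whether `higher` is a root node in the tutorial graph is not stated in this thread, so that part is an assumption. The sketch below uses plain NumPy on a made-up binary network `Z -> X`, `Z -> Y`, `X -> Y` (all probabilities are illustrative) to show the two quantities diverging once a confounder `Z` influences `X`, and coinciding when it does not.

```python
import numpy as np

# Made-up binary network: Z -> X, Z -> Y, X -> Y.
p_z = np.array([0.7, 0.3])                 # P(Z)
# P(Y=1 | X, Z), indexed as [x, z]
p_y1_given_xz = np.array([[0.1, 0.5],
                          [0.3, 0.9]])


def see_and_do(p_x_given_z):
    """Return (P(Y=1 | X=1), P(Y=1 | do(X=1))) for a given P(X | Z), indexed [z, x]."""
    p_zx = p_z[:, None] * p_x_given_z      # joint P(Z, X)
    # Seeing X=1 updates our belief about Z: P(Z | X=1)
    p_z_given_x1 = p_zx[:, 1] / p_zx[:, 1].sum()
    p_see = float(p_y1_given_xz[1] @ p_z_given_x1)
    # Doing X=1 cuts the edge Z -> X, so Z keeps its prior P(Z)
    p_do = float(p_y1_given_xz[1] @ p_z)
    return p_see, p_do


# X depends on Z: the confounder makes seeing and doing differ.
confounded = see_and_do(np.array([[0.9, 0.1],
                                  [0.2, 0.8]]))
# X ignores Z (X behaves like a root node): seeing and doing agree.
root_x = see_and_do(np.array([[0.6, 0.4],
                              [0.6, 0.4]]))

print("confounded: see=%.4f do=%.4f" % confounded)
print("root X:     see=%.4f do=%.4f" % root_x)
```

In the second case the two probabilities match, which is the same pattern as the identical `G1` distributions reported above.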

Typo in documentation

Description

I found a typo in the tutorial (https://causalnex.readthedocs.io/en/latest/03_tutorial/03_tutorial.html):

It should be 'us' instead of 'as'.

pygraphviz is not listed in the requirements

Since the latest update, plot_structure requires pygraphviz, which is missing from requirements.txt.

Guidelines for improving accuracy of prediction

Dear Team,

I have a continuous dataset with ~850 samples and 13 variables (including 1 target variable). I would like to identify the causes of the effect (target variable). The tutorial section has nicely explained steps to work with Causalnex. I am using causalnex 0.4.3.

Following this, I created the Structure Model manually by including possible causes and the relationships among them. I followed all other steps as given in the tutorial, and the model runs without errors up to the prediction step. However, the final evaluation metrics are not very good. For instance, the accuracy of predictions on the train set and on the test set for the minority (less frequent) classes is always somewhere around 60-70%.

I have experimented with different combinations of the following modifications to the model.

As my dataset is largely imbalanced (class 1 : class 2 ~= 40:800), I used the SMOTE technique for oversampling.
Increased/decreased the number of classes for all variables in the Discretizer.
Used the 'BDeu' bayes_prior along with varying equivalent_sample_size.
Used different test_size values in train_test_split.

I am struggling to figure out ways to optimize the model. I could not find any supporting material on the Internet.

Are there any guidelines or 'best practices' for data pre-processing techniques and dataset requirements that work particularly well for this library/model? Could you please suggest any troubleshooting methods to identify the cause of the low accuracy/metrics?

Any ideas would be of great help! Thanks in Advance.

Incorrect probabilities learnt for Nodes with one parent only

Description

If a node has only one parent (e.g. A->B), this node is always assigned a flat (uniform) distribution when we fit the probabilities.

I dug in and found that the problem comes from pgmpy. I will raise the same issue there too, but I am not sure how we want to handle it in CausalNex in the meantime.

Steps to Reproduce

import numpy as np
import pandas as pd
from causalnex.structure import StructureModel
from causalnex.network import BayesianNetwork
sm = StructureModel()
sm.add_edge('A','B')
np.random.seed(11)
vals = [1,2,3]
A = np.random.choice(vals,size=3000,p=[.1,.3,.6])
B = [np.random.choice(vals,p=[.9,.05,.05]) if a==1 else # you can put any values here, the result will be the same
     np.random.choice(vals,p=[.1,.2,.7]) if a==2 else
     np.random.choice(vals,p=[.85,.1,.05]) if a==3 else
     np.random.choice(vals) for a in A]

df = pd.DataFrame([A,B],index=['A','B']).T
#####
bn = BayesianNetwork(sm)
bn = bn.fit_node_states(df)
bn = bn.fit_cpds(df, method="MaximumLikelihoodEstimator")
print(bn.cpds['B'].round(decimals=2))

Expected Result

A     1     2     3
B                  
1  0.92  0.11  0.85
2  0.02  0.21  0.10
3  0.06  0.68  0.05

Actual Result

A     1     2     3
B                  
1  0.33  0.33  0.33
2  0.33  0.33  0.33
3  0.33  0.33  0.33

Your Environment

CausalNex version used (pip show causalnex):

Name: causalnex
Version: 0.5.0
Summary: Toolkit for causal reasoning (Bayesian Networks / Inference)
Home-page: https://github.com/quantumblacklabs/causalnex
Author: QuantumBlack Labs
Author-email: [email protected]
License: Apache Software License (Apache 2.0)
Location: /opt/anaconda3/envs/rehoww/lib/python3.6/site-packages
Requires: scikit-learn, pandas, prettytable, wrapt, pgmpy, scipy, numpy, networkx
Required-by:
* Python version used (`python -V`):
Python 3.6.10 :: Anaconda, Inc.
* Operating system and version:
MAC OS

* pandas version: 0.24

## CAUSE:
This comes from pgmpy, specifically the file `pgmpy/estimators/base.py`, around line 127.

parents_states = [self.state_names[parent] for parent in parents]
state_count_data = data.groupby([variable] + parents).size().unstack(parents)

row_index = self.state_names[variable]
column_index = pd.MultiIndex.from_product(parents_states, names=parents)
state_counts = state_count_data.reindex(index=row_index, columns=column_index).fillna(0) # <----Where the error is

If the node has more than one parent, the columns of `state_count_data` are a `MultiIndex` from the start, so `state_count_data.reindex(..., columns=column_index)` causes no problem.

If the node has a single parent, however, the columns of `state_count_data` are not a `MultiIndex` but a flat index. In that case, `state_count_data.reindex(..., columns=column_index)` returns a DataFrame full of NaNs.

## Dirty solution:
Convert `state_count_data.columns` to a `MultiIndex` before reindexing:

parents_states = [self.state_names[parent] for parent in parents]
state_count_data = data.groupby([variable] + parents).size().unstack(parents)

row_index = self.state_names[variable]
if len(parents) == 1:  # ADD THIS IF CONDITION
    # wrap the flat column index in a list: from_product expects a list of iterables
    state_count_data.columns = pd.MultiIndex.from_product([state_count_data.columns], names=parents)
column_index = pd.MultiIndex.from_product(parents_states, names=parents)
state_counts = state_count_data.reindex(index=row_index, columns=column_index).fillna(0)
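
A self-contained way to try this workaround outside pgmpy, assuming only pandas: the sketch below rebuilds a single-parent count table from made-up data, promotes the flat column index to a one-level `MultiIndex`, and then reindexes in the same way as the snippet above. The variable names mirror the pgmpy code, but `state_names` and the data are illustrative, and `self` is dropped because no estimator object is involved.

```python
import pandas as pd

# Made-up observations for a network A -> B.
data = pd.DataFrame({"A": [1, 1, 2, 3, 3, 3], "B": [1, 2, 3, 1, 1, 2]})
variable, parents = "B", ["A"]
state_names = {"A": [1, 2, 3], "B": [1, 2, 3]}

# Count co-occurrences of the child state (rows) and the parent state (columns).
state_count_data = data.groupby([variable] + parents).size().unstack(parents)

# With one parent, unstack() yields a flat column Index; promote it so that it
# matches the MultiIndex used for reindexing below.
if not isinstance(state_count_data.columns, pd.MultiIndex):
    state_count_data.columns = pd.MultiIndex.from_product(
        [state_count_data.columns], names=parents
    )

row_index = state_names[variable]
column_index = pd.MultiIndex.from_product(
    [state_names[parent] for parent in parents], names=parents
)
state_counts = state_count_data.reindex(index=row_index, columns=column_index).fillna(0)
print(state_counts)
```

Checking `isinstance(..., pd.MultiIndex)` instead of `len(parents) == 1` is a design choice of this sketch; both conditions target the same single-parent case.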

pandas error

I am trying to run the model on Google Colab, and pandas raises errors when importing some names.
The first fatal one relates to 'OrderedDict'.
code:
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/html.py in ()
8 from textwrap import dedent
9
---> 10 from pandas.compat import OrderedDict, lzip, map, range, u, unichr, zip
11
12 from pandas.core.dtypes.generic import ABCMultiIndex

ImportError: cannot import name 'OrderedDict'

The second fatal one relates to importing lmap:

/usr/local/lib/python3.6/dist-packages/pandas/core/config.py in ()
55
56 import pandas.compat as compat
---> 57 from pandas.compat import lmap, map, u
58
59 DeprecatedOption = namedtuple('DeprecatedOption', 'key msg rkey removal_ver')

ImportError: cannot import name 'lmap'

With a standalone pandas installation I did not get these issues, so it seems to me that Google Colab is missing something.
Any suggestions, please?

Negative probabilities (linear model in BN?)

First, thanks for the great package!

Using the InferenceEngine I can "do" and obtain negative probabilities.
I was trying something like this from the example in the tutorial:
ie.do_intervention("higher", {'yes': x, 'no': 1-x})
print("updated marginal G1", ie.query()["G1"])
Here, I can set values for x outside the range of [0,1], which I think is not ideal.

A more relevant question occurred to me from observing that the output of ie.query()["G1"] after the intervention is a perfectly linear function of x, extending into the negative. Are the CPDs in your model linear functions by design? Shouldn't these functions be bounded?
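
The linearity is what the law of total probability predicts for a discrete network: the marginal of a downstream node is a weighted sum of its distributions under each forced state of the intervened node, with the intervention distribution as the weights. It is therefore linear in `x` and can leave [0, 1] as soon as `x` does. The sketch below illustrates this with made-up conditional probabilities (not the tutorial's fitted CPDs) and a hypothetical `validate_distribution` check that a caller could run before passing a distribution to `do_intervention`.

```python
import math

# Made-up values of P(G1 = "Pass" | do(higher = state)); not the tutorial's numbers.
p_pass_given_higher = {"yes": 0.8, "no": 0.5}


def validate_distribution(dist, tol=1e-9):
    """Hypothetical check: every probability in [0, 1] and the total equal to 1."""
    if any(p < 0.0 or p > 1.0 for p in dist.values()):
        raise ValueError(f"probabilities must lie in [0, 1]: {dist}")
    if not math.isclose(sum(dist.values()), 1.0, abs_tol=tol):
        raise ValueError(f"probabilities must sum to 1: {dist}")
    return dist


def marginal_pass(x):
    """P(G1 = "Pass") when higher is set to {'yes': x, 'no': 1 - x}."""
    return p_pass_given_higher["yes"] * x + p_pass_given_higher["no"] * (1 - x)


# The marginal is a straight line in x, so invalid weights give invalid "probabilities".
for x in (0.0, 0.5, 1.0, 3.0, -2.0):
    print(x, marginal_pass(x))

# Guarding the input rejects the out-of-range cases before any inference runs.
for x in (0.5, -2.0):
    try:
        validate_distribution({"yes": x, "no": 1 - x})
        print(x, "accepted")
    except ValueError as err:
        print(x, "rejected:", err)
```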

API docs are not up to date

For example:

https://causalnex.readthedocs.io/en/latest/source/api_docs/causalnex.plots.plot_structure.html#causalnex.plots.plot_structure

vs.

https://github.com/quantumblacklabs/causalnex/blob/master/causalnex/plots/plots.py

Parameter learning with continuous data

Description

Using the discrete parameter learning functionality on a standard BN structure (20-30 nodes, each with an intuitive discretisation) requires huge amounts of memory. Parameter learning on continuous data, as implemented in other packages (Gaussian BN), is faster and more memory-efficient.

Context

When suitable normality assumptions are met, Gaussian BNs perform well: they don't lose information through discretisation, and the memory requirements of parameter learning are greatly reduced. I've also seen in your docs that this is already on your roadmap, so an ETA would be awesome!
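
To make the memory argument concrete: a discrete CPD stores one probability per combination of child state and parent states, so its size grows multiplicatively with the number of parents, while a linear-Gaussian CPD needs only one weight per parent plus an intercept and a variance. Whether table size is the main cause of the memory use reported here is an assumption. The sketch below counts both for an illustrative node; the cardinalities are made up and the helper names are hypothetical.

```python
def discrete_cpd_size(child_states, parent_states):
    """Number of entries in a conditional probability table."""
    size = child_states
    for n_states in parent_states:
        size *= n_states
    return size


def linear_gaussian_cpd_size(n_parents):
    """One weight per parent, plus an intercept and a variance."""
    return n_parents + 2


# Illustrative node: 10 bins for the child and for each of its 6 parents.
parent_states = [10] * 6
print("discrete entries:", discrete_cpd_size(10, parent_states))
print("linear-Gaussian parameters:", linear_gaussian_cpd_size(len(parent_states)))
```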

Tutorial raises error on conditional probability distributions

Description

Following the tutorial, this call raises an error (traceback below):

bn = bn.fit_cpds(train, method="BayesianEstimator", bayes_prior="K2")

It would be great to have a fully working Jupyter notebook as an example.

Steps to Reproduce

/usr/local/lib/python3.7/site-packages/causalnex/network/network.py in fit_cpds(self, data, method, bayes_prior, equivalent_sample_size)
    344 
    345         transformed_data = data.copy(deep=True)  # type: pd.DataFrame
--> 346         transformed_data = self._state_to_index(transformed_data[self.nodes])
    347 
    348         if method == "MaximumLikelihoodEstimator":

/usr/local/lib/python3.7/site-packages/causalnex/network/network.py in _state_to_index(self, df, nodes)
    307         cols = nodes if nodes else df.columns
    308         for col in cols:
--> 309             df[col] = df[col].map(self._node_states[col])
    310         df.is_copy = True
    311         return df

TypeError: 'NoneType' object is not subscriptable

Your Environment

CausalNex version used (pip show causalnex): causalnex==0.4.3
Python version used (python -V): 3.7
Operating system and version: Ubuntu 16

Question about w_threshold

Hi,
I don't understand the meaning of the parameter 'w_threshold' in "from_pandas". I can get a BN model using "hill_climb" in pgmpy, but with from_pandas the number of edges in the model varies with the value of w_threshold, so I don't know which value is correct.
This problem does not exist with "hill_climb".
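
As far as the CausalNex documentation describes it, NOTEARS learns a real-valued weight for every candidate edge, and `w_threshold` then drops edges whose absolute weight falls below the threshold, so the edge count necessarily depends on the value chosen. The sketch below mimics only that final pruning step with NumPy on a made-up weight matrix. It does not call CausalNex, `prune_edges` is a hypothetical helper, and the strict `>` comparison is an assumption of this sketch.

```python
import numpy as np

nodes = ["a", "b", "c", "d"]
# Made-up learned weights; weights[i, j] is the weight of the edge nodes[i] -> nodes[j].
weights = np.array([
    [0.0, 0.9, 0.05, 0.0],
    [0.0, 0.0, -0.4, 0.2],
    [0.0, 0.0, 0.0, -0.7],
    [0.0, 0.0, 0.0, 0.0],
])


def prune_edges(weights, w_threshold):
    """Hypothetical helper: keep edges whose absolute weight exceeds the threshold."""
    keep = np.abs(weights) > w_threshold
    return [
        (nodes[i], nodes[j], float(weights[i, j]))
        for i, j in zip(*np.nonzero(keep))
    ]


# A higher threshold keeps fewer, stronger edges.
for threshold in (0.0, 0.1, 0.3, 0.8):
    edges = prune_edges(weights, threshold)
    print(threshold, len(edges), edges)
```

No single threshold is correct in general. A common approach is to inspect the weights, choose a value that removes the weak edges, and then review the remaining edges with domain knowledge.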

Do Intervention

Hello,
Thank you very much for CausalNex! I am new to Bayesian Networks and I am trying to understand them. I have one question. If we set the probability of one value of a feature to 1, as in the tutorial (100% of students want to pursue higher education), we obtain a certain rate. Should the obtained rate be the same if we instead set the probability of the other possible value of the feature to 1? If not, why not? Moreover, is setting a single value for all the instances in the dataset equivalent to removing that feature?

Can CausalNex support word embeddings, and would that be useful?
