causallift package
CausalLift
Subpackages
- causallift.nodes package
- Submodules
- causallift.nodes.estimate_propensity module
- causallift.nodes.model_for_each module
ModelForTreated
ModelForTreatedOrUntreated
ModelForUntreated
bundle_treated_and_untreated_models()
model_for_treated_fit()
model_for_treated_predict_proba()
model_for_treated_simulate_recommendation()
model_for_untreated_fit()
model_for_untreated_predict_proba()
model_for_untreated_simulate_recommendation()
- causallift.nodes.utils module
add_cate_to_df()
apply_method()
bundle_train_and_test_data()
compute_cate()
concat_train_test()
concat_train_test_df()
conf_mat_df()
estimate_effect()
gain_tuple()
get_cols_features()
impute_cols_features()
initialize_model()
len_o()
len_t()
len_to()
outcome_fraction_()
overall_uplift_gain_()
recommend_by_cate()
score_df()
treatment_fraction_()
treatment_fractions_()
- causallift.context package
Submodules
causallift.causal_lift module
- class causallift.causal_lift.CausalLift(train_df=None, test_df=None, cols_features=None, col_treatment='Treatment', col_outcome='Outcome', col_propensity='Propensity', col_proba_if_treated='Proba_if_Treated', col_proba_if_untreated='Proba_if_Untreated', col_cate='CATE', col_recommendation='Recommendation', col_weight='Weight', min_propensity=0.01, max_propensity=0.99, verbose=2, uplift_model_params={'cv': 3, 'estimator': 'xgboost.XGBClassifier', 'n_jobs': -1, 'param_grid': {'base_score': [0.5], 'booster': ['gbtree'], 'colsample_bylevel': [1], 'colsample_bytree': [1], 'gamma': [0], 'learning_rate': [0.1], 'max_delta_step': [0], 'max_depth': [3], 'min_child_weight': [1], 'missing': [None], 'n_estimators': [100], 'n_jobs': [-1], 'nthread': [None], 'objective': ['binary:logistic'], 'random_state': [0], 'reg_alpha': [0], 'reg_lambda': [1], 'scale_pos_weight': [1], 'subsample': [1], 'verbose': [0]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, enable_ipw=True, enable_weighting=False, propensity_model_params={'cv': 3, 'estimator': 'sklearn.linear_model.LogisticRegression', 'n_jobs': -1, 'param_grid': {'C': [0.1, 1, 10], 'class_weight': [None], 'dual': [False], 'fit_intercept': [True], 'intercept_scaling': [1], 'max_iter': [100], 'multi_class': ['ovr'], 'n_jobs': [1], 'penalty': ['l1', 'l2'], 'random_state': [0], 'solver': ['liblinear'], 'tol': [0.0001], 'warm_start': [False]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, index_name='index', partition_name='partition', runner='SequentialRunner', conditionally_skip=False, df_print=<function display>, dataset_catalog={'df_03': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'estimated_effect_df': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'propensity_model': <kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet object>, 'treated__sim_eval_df': 
<kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'untreated__sim_eval_df': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'uplift_models_dict': <kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet object>}, logging_config={'disable_existing_loggers': False, 'formatters': {'json_formatter': {'class': 'pythonjsonlogger.jsonlogger.JsonFormatter', 'format': '[%(asctime)s|%(name)s|%(funcName)s|%(levelname)s] %(message)s'}, 'simple': {'format': '[%(asctime)s|%(name)s|%(levelname)s] %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'level': 'INFO', 'stream': 'ext://sys.stdout'}, 'error_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './errors.log', 'formatter': 'simple', 'level': 'ERROR', 'maxBytes': 10485760}, 'info_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './info.log', 'formatter': 'simple', 'level': 'INFO', 'maxBytes': 10485760}}, 'loggers': {'anyconfig': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'causallift': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.io': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'kedro.pipeline': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.runner': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}}, 'root': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO'}, 'version': 1})[source]
Bases: object
Set up datasets for uplift modeling. Optionally, propensity scores are estimated based on logistic regression.
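As a rough illustration of how estimated propensity scores can feed into inverse probability weighting (the enable_ipw, min_propensity, and max_propensity options of this class), the sketch below clips each score and derives a per-sample weight. It uses the textbook IPW formula; whether CausalLift computes its weights in exactly this way is an assumption, and the helper names clip_propensity and ipw_weight are hypothetical.

```python
def clip_propensity(p, min_propensity=0.01, max_propensity=0.99):
    """Keep a propensity score inside [min_propensity, max_propensity]."""
    return min(max(p, min_propensity), max_propensity)


def ipw_weight(treatment, propensity, min_propensity=0.01, max_propensity=0.99):
    """Textbook IPW weight (an assumption, not taken from the library):
    1/p for treated samples, 1/(1-p) for untreated samples."""
    p = clip_propensity(propensity, min_propensity, max_propensity)
    return 1.0 / p if treatment == 1 else 1.0 / (1.0 - p)


# Hypothetical (treatment, estimated propensity) pairs.
samples = [(1, 0.5), (0, 0.5), (1, 0.001), (0, 0.8)]
weights = [ipw_weight(t, p) for t, p in samples]
```

Clipping matters for the third sample: without it, a propensity of 0.001 would give that single row a weight of 1000 and let it dominate the fit.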
- Parameters:
train_df (Optional[DataFrame]) – Pandas DataFrame containing the samples used for training.
test_df (Optional[DataFrame]) – Pandas DataFrame containing the samples used for testing.
cols_features (Optional[List[str]]) – List of column names used as features. If None (default), all columns except the outcome, propensity, CATE, and recommendation columns are used.
col_treatment (str) – Name of the treatment column. ‘Treatment’ by default.
col_outcome (str) – Name of the outcome column. ‘Outcome’ by default.
col_propensity (str) – Name of the propensity column. ‘Propensity’ by default.
col_cate (str) – Name of the CATE (Conditional Average Treatment Effect) column. ‘CATE’ by default.
col_recommendation (str) – Name of the recommendation column. ‘Recommendation’ by default.
col_weight (str) – Name of the weight column. ‘Weight’ by default.
min_propensity (float) – Minimum propensity score. 0.01 by default.
max_propensity (float) – Maximum propensity score. 0.99 by default.
verbose (int) – How much info to show. Valid values are:
0 to show nothing
1 to show only warnings
2 (default) to show useful info
3 to show more info
uplift_model_params (Union[Dict[str, List[Any]], Type[BaseEstimator]]) – Parameters used to fit 2 XGBoost classifier models.
Optionally use the search_cv key to specify the Search CV class name, e.g. sklearn.model_selection.GridSearchCV. Refer to https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Use the estimator key to specify the estimator class name, e.g. xgboost.XGBClassifier. Refer to https://xgboost.readthedocs.io/en/latest/parameter.html
Optionally use the const_params key to specify the constant parameters used to construct the estimator.
If None (default):
    dict(
        search_cv="sklearn.model_selection.GridSearchCV",
        estimator="xgboost.XGBClassifier",
        scoring=None,
        cv=3,
        return_train_score=False,
        n_jobs=-1,
        param_grid=dict(
            max_depth=[3],
            learning_rate=[0.1],
            n_estimators=[100],
            verbose=[0],
            objective=["binary:logistic"],
            booster=["gbtree"],
            n_jobs=[-1],
            nthread=[None],
            gamma=[0],
            min_child_weight=[1],
            max_delta_step=[0],
            subsample=[1],
            colsample_bytree=[1],
            colsample_bylevel=[1],
            reg_alpha=[0],
            reg_lambda=[1],
            scale_pos_weight=[1],
            base_score=[0.5],
            missing=[None],
        ),
    )
Alternatively, an estimator model object is acceptable. The object must have the following methods, compatible with the scikit-learn estimator interface:
fit()
predict()
predict_proba()
enable_ipw (bool) – Enable Inverse Probability Weighting based on the estimated propensity score. True by default.
enable_weighting (bool) – Enable Weighting. False by default.
propensity_model_params (Dict[str, List[Any]]) – Parameters used to fit the logistic regression model that estimates the propensity score.
Optionally use the search_cv key to specify the Search CV class name, e.g. sklearn.model_selection.GridSearchCV. Refer to https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Use the estimator key to specify the estimator class name, e.g. sklearn.linear_model.LogisticRegression. Refer to https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Optionally use the const_params key to specify the constant parameters used to construct the estimator.
If None (default):
    dict(
        search_cv="sklearn.model_selection.GridSearchCV",
        estimator="sklearn.linear_model.LogisticRegression",
        scoring=None,
        cv=3,
        return_train_score=False,
        n_jobs=-1,
        param_grid=dict(
            C=[0.1, 1, 10],
            class_weight=[None],
            dual=[False],
            fit_intercept=[True],
            intercept_scaling=[1],
            max_iter=[100],
            multi_class=["ovr"],
            n_jobs=[1],
            penalty=["l1", "l2"],
            solver=["liblinear"],
            tol=[0.0001],
            warm_start=[False],
        ),
    )
index_name (str) – Index name of the pandas DataFrame after resetting the index. ‘index’ by default. If None, the index will not be reset.
partition_name (str) – Additional index name to indicate the partition, train or test. ‘partition’ by default.
runner (str) – If set to ‘SequentialRunner’ (default) or ‘ParallelRunner’, the pipeline is run by Kedro sequentially or in parallel, respectively. If set to None, the pipeline is run by native Python. Refer to https://kedro.readthedocs.io/en/latest/04_user_guide/05_nodes_and_pipelines.html#runners
conditionally_skip (bool) – [Effective only if runner is set to either ‘SequentialRunner’ or ‘ParallelRunner’] Skip running the pipeline if the output files already exist. False by default.
df_print – Callable used to show output data frames. IPython.display.display by default.
dataset_catalog (Dict[str, AbstractDataSet]) – [Effective only if runner is set to either ‘SequentialRunner’ or ‘ParallelRunner’] Specify dataset files to save in Dict[str, kedro.io.AbstractDataSet] format.
To find available file formats, refer to https://kedro.readthedocs.io/en/latest/kedro.io.html#data-sets
By default:
    dict(
        # args_raw = CSVLocalDataSet(filepath='../data/01_raw/args_raw.csv', version=None),
        # train_df = CSVLocalDataSet(filepath='../data/01_raw/train_df.csv', version=None),
        # test_df = CSVLocalDataSet(filepath='../data/01_raw/test_df.csv', version=None),
        propensity_model = PickleLocalDataSet(
            filepath='../data/06_models/propensity_model.pickle', version=None
        ),
        uplift_models_dict = PickleLocalDataSet(
            filepath='../data/06_models/uplift_models_dict.pickle', version=None
        ),
        df_03 = CSVLocalDataSet(
            filepath='../data/07_model_output/df.csv',
            load_args=dict(index_col=['partition', 'index'], float_precision='high'),
            save_args=dict(index=True, float_format='%.16e'),
            version=None,
        ),
        treated__sim_eval_df = CSVLocalDataSet(
            filepath='../data/08_reporting/treated__sim_eval_df.csv',
            version=None,
        ),
        untreated__sim_eval_df = CSVLocalDataSet(
            filepath='../data/08_reporting/untreated__sim_eval_df.csv',
            version=None,
        ),
        estimated_effect_df = CSVLocalDataSet(
            filepath='../data/08_reporting/estimated_effect_df.csv',
            version=None,
        ),
    )
logging_config (Optional[Dict[str, Any]]) – Specify logging configuration.
Refer to https://docs.python.org/3.6/library/logging.config.html#logging-config-dictschema
By default:
    {
        'disable_existing_loggers': False,
        'formatters': {
            'json_formatter': {
                'class': 'pythonjsonlogger.jsonlogger.JsonFormatter',
                'format': '[%(asctime)s|%(name)s|%(funcName)s|%(levelname)s] %(message)s',
            },
            'simple': {
                'format': '[%(asctime)s|%(name)s|%(levelname)s] %(message)s',
            },
        },
        'handlers': {
            'console': {
                'class': 'logging.StreamHandler',
                'formatter': 'simple',
                'level': 'INFO',
                'stream': 'ext://sys.stdout',
            },
            'info_file_handler': {
                'class': 'logging.handlers.RotatingFileHandler',
                'level': 'INFO',
                'formatter': 'simple',
                'filename': './info.log',
                'maxBytes': 10485760,  # 10MB
                'backupCount': 20,
                'encoding': 'utf8',
                'delay': True,
            },
            'error_file_handler': {
                'class': 'logging.handlers.RotatingFileHandler',
                'level': 'ERROR',
                'formatter': 'simple',
                'filename': './errors.log',
                'maxBytes': 10485760,  # 10MB
                'backupCount': 20,
                'encoding': 'utf8',
                'delay': True,
            },
        },
        'loggers': {
            'anyconfig': {
                'handlers': ['console', 'info_file_handler', 'error_file_handler'],
                'level': 'WARNING',
                'propagate': False,
            },
            'kedro.io': {
                'handlers': ['console', 'info_file_handler', 'error_file_handler'],
                'level': 'WARNING',
                'propagate': False,
            },
            'kedro.pipeline': {
                'handlers': ['console', 'info_file_handler', 'error_file_handler'],
                'level': 'INFO',
                'propagate': False,
            },
            'kedro.runner': {
                'handlers': ['console', 'info_file_handler', 'error_file_handler'],
                'level': 'INFO',
                'propagate': False,
            },
            'causallift': {
                'handlers': ['console', 'info_file_handler', 'error_file_handler'],
                'level': 'INFO',
                'propagate': False,
            },
        },
        'root': {
            'handlers': ['console', 'info_file_handler', 'error_file_handler'],
            'level': 'INFO',
        },
        'version': 1,
    }
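Since logging_config follows the standard library's dictConfig schema, a custom configuration can be checked on its own before it is passed to the constructor. The following is a minimal sketch that assumes only console output is wanted; the reduced dictionary console_only_config is illustrative and is not a library default.

```python
import logging
import logging.config

# Illustrative, reduced configuration: console output only, no log files.
console_only_config = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "[%(asctime)s|%(name)s|%(levelname)s] %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "simple",
            "level": "INFO",
            "stream": "ext://sys.stdout",
        },
    },
    "loggers": {
        "causallift": {"handlers": ["console"], "level": "INFO", "propagate": False},
    },
    "root": {"handlers": ["console"], "level": "WARNING"},
}

# dictConfig raises ValueError if the dictionary does not match the schema.
logging.config.dictConfig(console_only_config)
logger = logging.getLogger("causallift")
logger.info("logging configured")
```

A dictionary validated this way could then be supplied as logging_config=console_only_config.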
- __init__(...)[source]
The constructor takes the same parameters, with the same defaults, as the class signature above.
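Because both parameter dictionaries are handed to a search CV class, the size of param_grid determines how many models get fitted. The sketch below counts the combinations in the default propensity grid shown in the class signature. The helper count_fits is hypothetical, and the count assumes an exhaustive grid search with one fit per fold (any final refit is ignored).

```python
from itertools import product

# Default propensity param_grid, copied from the class signature.
propensity_param_grid = {
    "C": [0.1, 1, 10],
    "class_weight": [None],
    "dual": [False],
    "fit_intercept": [True],
    "intercept_scaling": [1],
    "max_iter": [100],
    "multi_class": ["ovr"],
    "n_jobs": [1],
    "penalty": ["l1", "l2"],
    "random_state": [0],
    "solver": ["liblinear"],
    "tol": [0.0001],
    "warm_start": [False],
}


def count_fits(param_grid, cv):
    """Number of candidate combinations times the number of CV folds."""
    n_candidates = sum(1 for _ in product(*param_grid.values()))
    return n_candidates * cv


n_fits = count_fits(propensity_param_grid, cv=3)
```

Only C (3 values) and penalty (2 values) vary here, so the grid holds 6 candidates; widening any other list multiplies the cost accordingly.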
- estimate_cate_by_2_models()[source]
Estimate CATE (Conditional Average Treatment Effect) using 2 XGBoost classifier models.
- Return type: Tuple[DataFrame, DataFrame]
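The two-model idea can be sketched without any dependencies: fit one outcome model on treated rows, another on untreated rows, and take the difference of their predicted probabilities as the CATE. In this sketch the "models" are plain per-group outcome rates standing in for the XGBoost classifiers, and the toy data and the helper fit_rate_model are hypothetical.

```python
from collections import defaultdict


def fit_rate_model(rows):
    """Stand-in 'classifier': outcome rate per feature value (not XGBoost)."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [n_positive, n_rows]
    for feature, outcome in rows:
        counts[feature][0] += outcome
        counts[feature][1] += 1
    return {feature: pos / n for feature, (pos, n) in counts.items()}


# Hypothetical toy data: (feature, treatment, outcome).
data = [
    ("new", 1, 1), ("new", 1, 1), ("new", 0, 0), ("new", 0, 1),
    ("loyal", 1, 0), ("loyal", 1, 1), ("loyal", 0, 1), ("loyal", 0, 1),
]

# One model per treatment arm.
model_treated = fit_rate_model([(f, y) for f, t, y in data if t == 1])
model_untreated = fit_rate_model([(f, y) for f, t, y in data if t == 0])

# CATE = P(outcome | treated, x) - P(outcome | untreated, x).
cate = {f: model_treated[f] - model_untreated[f] for f in model_treated}
```

In this toy data the "new" group has a positive CATE (treatment helps) and the "loyal" group a negative one (treatment hurts).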
- estimate_recommendation_impact(cate_estimated=None, treatment_fraction_train=None, treatment_fraction_test=None, verbose=None)[source]
Estimate the impact of recommendation based on uplift modeling.
- Parameters:
cate_estimated (Optional[Type[Series]]) – Pandas Series containing the CATE. If None (default), use the values calculated by the estimate_cate_by_2_models method.
treatment_fraction_train (Optional[float]) – The fraction of treatment in the train dataset. If None (default), use the value calculated by the estimate_cate_by_2_models method.
treatment_fraction_test (Optional[float]) – The fraction of treatment in the test dataset. If None (default), use the value calculated by the estimate_cate_by_2_models method.
verbose (Optional[int]) – How much info to show. If None (default), use the value set in the constructor.
- Return type: Type[DataFrame]
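One plausible reading of how a treatment fraction and estimated CATE values combine into a recommendation is to rank samples by CATE and recommend treatment for the top share. The sketch below illustrates that reading only; it is an assumption rather than a description of this method's internals, and recommend_top_fraction and the sample values are hypothetical.

```python
def recommend_top_fraction(cate_values, treatment_fraction):
    """Recommend treatment (1) for the top `treatment_fraction` of samples by CATE."""
    n_recommend = int(round(len(cate_values) * treatment_fraction))
    # Indices sorted from the highest CATE to the lowest.
    ranked = sorted(
        range(len(cate_values)), key=lambda i: cate_values[i], reverse=True
    )
    chosen = set(ranked[:n_recommend])
    return [1 if i in chosen else 0 for i in range(len(cate_values))]


# Hypothetical CATE estimates for six samples.
cate_estimated = [0.30, -0.10, 0.05, 0.20, -0.25, 0.00]
recommendation = recommend_top_fraction(cate_estimated, treatment_fraction=0.5)
```

Keeping the number of recommended treatments equal to the observed treatment fraction makes the simulated outcome comparable with the observed one at the same treatment budget.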
causallift.generate_data module
The original code is at https://github.com/wayfair/pylift/blob/master/pylift/generate_data.py, licensed under the BSD 2-Clause “Simplified” License. Copyright 2018, Wayfair, Inc.
This code is an enhanced (backward-compatible) version that can simulate observational datasets that include “sleeping dogs.”
“Sleeping dogs” (people who will “buy” if not treated but will not “buy” if treated) can be simulated by negative values in the tau parameter. Observational data that includes confounding can be simulated by non-zero values in the propensity_coef parameter. An A/B test (RCT) with a 50:50 split can be simulated by all-zero values in the propensity_coef parameter (default). The first element in each list parameter specifies the intercept.
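The link between propensity_coef and the treatment split can be made concrete with a logistic function: the coefficients give the log-odds of being treated, so all-zero coefficients yield a probability of exactly 0.5. This is a minimal sketch of that relationship, not the module's code; treatment_probability and the example coefficients are hypothetical.

```python
import math
import random


def treatment_probability(features, propensity_coef):
    """Logistic propensity; the first coefficient is the intercept."""
    log_odds = propensity_coef[0] + sum(
        c * x for c, x in zip(propensity_coef[1:], features)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


rng = random.Random(0)
features = [rng.random() for _ in range(3)]  # three features in [0, 1)

# All-zero coefficients: every sample is treated with probability 0.5 (RCT).
p_rct = treatment_probability(features, [0, 0, 0, 0])

# A non-zero coefficient ties treatment to the first feature (confounding).
p_confounded = treatment_probability(features, [0, 2.0, 0, 0])
```

With the non-zero coefficient, samples with a larger first feature are more likely to be treated, which is what makes the simulated data observational rather than randomized.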
- causallift.generate_data.generate_data(N=1000, n_features=3, beta=[1, -2, 3, -0.8], error_std=0.5, tau=3, discrete_outcome=False)[source]
Generates random data with a ground truth data generating process. Draws random values for features from [0, 1), errors from a 0-centered distribution with std error_std, and creates an outcome y.
- Parameters:
N (Optional[int]) – Number of observations.
n_features (Optional[int]) – Number of features.
beta (Optional[List[float]]) – Array of beta coefficients to multiply by X to get y.
error_std (Optional[float]) – Standard deviation (scale) of the distribution from which errors are drawn.
tau (Union[List[float], float]) – Array of coefficients to multiply by X to get y if treated. More or larger negative values will simulate more “sleeping dogs.” If a float scalar is given, the effect of features is not considered.
tau_std (Optional[float]) – When not None, draws tau from a normal distribution centered around tau with standard deviation tau_std rather than using a constant value of tau.
discrete_outcome (Optional[bool]) – If True, outcomes are 0 or 1; otherwise continuous.
seed (Optional[int]) – Random seed fed to np.random.seed to allow for deterministic behavior.
feature_effect (Optional[float]) – Effect of beta on outcome if treated.
propensity_coef (Optional[List[float]]) – Array of coefficients to multiply by X to get the log-odds of the propensity to be treated.
index_name (Optional[str]) – Index name in the output DataFrame. If None (default), the index name will not be set.
- Returns: df – A DataFrame containing the generated data.
- Return type: pd.DataFrame
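To see why negative tau values produce “sleeping dogs,” the sketch below uses a simplified, noise-free stand-in for the data generating process: a person "buys" when a linear score is positive, and treatment adds a tau-based term to that score. The thresholding rule, the omission of the error term, and the helpers linear_score and count_sleeping_dogs are assumptions made for illustration; this is not the function's actual code.

```python
import random


def linear_score(features, coef):
    """coef[0] is the intercept; the remaining entries multiply the features."""
    return coef[0] + sum(c * x for c, x in zip(coef[1:], features))


def count_sleeping_dogs(n, beta, tau, seed=0):
    """Count samples that buy when untreated but not when treated."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        x = [rng.random() for _ in range(len(beta) - 1)]  # features in [0, 1)
        base = linear_score(x, beta)
        buys_if_untreated = base > 0  # assumed thresholding rule
        buys_if_treated = base + linear_score(x, tau) > 0
        if buys_if_untreated and not buys_if_treated:
            count += 1
    return count


beta = [1, -2, 3, -0.8]
dogs_positive_tau = count_sleeping_dogs(1000, beta, tau=[1, 0, 0, 0])
dogs_negative_tau = count_sleeping_dogs(1000, beta, tau=[-1, 0, 0, 0])
```

A positive treatment term can only raise the score, so it never turns a buyer into a non-buyer; a negative term pushes some buyers below the threshold, and those are the sleeping dogs.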
causallift.pipeline module
Pipeline construction.