Advanced usage#
This section provides a deeper look at how to use the InsightSolver API client.
Let’s revisit the Titanic demo.
Once the solver is fitted, we can do more than simply print the results.
This becomes particularly important when integrating the InsightSolver API client into a Python pipeline.
Conventions#
In InsightSolver, the parameter `target_goal` specifies the target modality of the target variable, i.e. the modality for which rules should capture a large number of 1’s.
By convention:
- A data point is considered a `1` if it matches the target modality specified by `target_goal`.
- A data point is considered a `0` otherwise.
It is important to note that these 0’s and 1’s are conventions used internally by InsightSolver and should not be confused with the actual values or modalities of the target variable in the dataset.
For instance, consider the Titanic dataset used in the Titanic example:
- Dataset: `kaggle_titanic_train.csv`.
- Target variable: `target_name='Survived'`.
- Target modalities: `0` (non-survivor), `1` (survivor).
- Target goals: either `target_goal=0` (looking for non-survivors) or `target_goal=1` (looking for survivors).
Here, the modalities 0 and 1 are specific to the Titanic dataset and represent whether a passenger survived or not.
- Total passengers: `891` rows.
- Non-survivors (`Survived=0`): 549 rows.
- Survivors (`Survived=1`): 342 rows.
Case 1: Looking for Survivors (target_goal=1)
When the goal is to identify survivors:
- `M=891`: Total population.
- `M0=549`: Number of `0`’s, representing non-survivors (`Survived=0`).
- `M1=342`: Number of `1`’s, representing survivors (`Survived=1`).
Case 2: Looking for Non-Survivors (target_goal=0)
When the goal is to identify non-survivors:
- `M=891`: Total population.
- `M0=342`: Number of `0`’s, representing survivors (`Survived=1`).
- `M1=549`: Number of `1`’s, representing non-survivors (`Survived=0`).
InsightSolver operates under the principle of capturing 1’s and rejecting 0’s, regardless of the specific meaning of these values in a given dataset.
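To make this convention concrete, here is a small sketch in plain Python (no API calls; the helper name `population_counts` is ours) that maps the Titanic modality counts onto the internal 0/1 convention for both target goals:

```python
# Titanic counts from above: 549 non-survivors (Survived=0), 342 survivors (Survived=1).
counts = {0: 549, 1: 342}

def population_counts(counts, target_goal):
    """Map raw modality counts to InsightSolver's internal 0/1 convention."""
    M = sum(counts.values())
    M1 = counts[target_goal]  # points matching the target modality become internal 1's
    M0 = M - M1               # every other point becomes an internal 0
    return M, M0, M1

print(population_counts(counts, target_goal=1))  # (891, 549, 342)
print(population_counts(counts, target_goal=0))  # (891, 342, 549)
```

Note how swapping `target_goal` swaps `M0` and `M1` while `M` stays the same.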
Attributes of the solver#
The solver object includes several relevant attributes, which are described exhaustively here.
For now, let’s take a brief look at the most important ones:
- `M`: The total number of points in the population.
- `M0`: The number of points classified as `0` in the population.
- `M1`: The number of points classified as `1` in the population.
- `rule_mining_results`: A dictionary containing the results of the rule mining process. Below, we’ll explore methods to access and parse specific aspects of these results.
- `benchmark_scores`: A dictionary containing the best scores obtained on shuffled data. This is useful to compare the scores of the rules found in the real data against the scores of rules found in random data.
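These counting attributes satisfy simple identities, which the snippet below checks against the values reported for Titanic with `target_goal=1`. The formulas for `sigma_pop` (sample standard deviation, `ddof=1`) and `F1_pop` (F1-score of the trivial rule covering the whole population) are inferences from the reported numbers, not documented behavior:

```python
import math

# Values reported by the solver on Titanic with target_goal=1.
M, M0, M1 = 891, 549, 342

# The 0's and 1's partition the population.
assert M == M0 + M1

# mu_pop is the average of the internal 0/1 target, i.e. M1/M.
mu_pop = M1 / M
assert abs(mu_pop - 0.3838383838383838) < 1e-12

# sigma_pop matches the sample (ddof=1) standard deviation of a 0/1 variable.
sigma_pop = math.sqrt(mu_pop * (1 - mu_pop) * M / (M - 1))
assert abs(sigma_pop - 0.48659245426485753) < 1e-6

# F1_pop matches the F1-score of the rule that captures every point: 2*M1/(M+M1).
f1_pop = 2 * M1 / (M + M1)
assert abs(f1_pop - 0.5547445255474452) < 1e-12
```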
Counting the number of rules#
To obtain the number of rules found by the solver, we can use the ruleset_count method:
solver.ruleset_count() # 3
# 3 rules are found by the solver
Each rule in the solver is indexed by an integer, conventionally denoted as i.
Getting the index of the rules#
To retrieve the range of rule indices, we can use the get_range_i method:
solver.get_range_i() # [0, 1, 2]
This shows that the index i can take the values 0, 1 or 2.
Knowing this range is useful when iterating over individual rules in the solver.
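For example, a small helper (the name `iter_rules` is ours) can combine the documented methods to iterate over every rule:

```python
def iter_rules(solver):
    """Yield (index, rule_dictionary) pairs for every rule found by the solver.

    Uses only the two methods documented here: get_range_i() and i_to_rule().
    """
    for i in solver.get_range_i():
        yield i, solver.i_to_rule(i=i)

# Typical usage on a fitted solver:
# for i, rule in iter_rules(solver):
#     print(i, rule["rule_S"])
```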
Exhaustive dictionary of a given rule#
Let’s take a closer look at the rule at position `i=0`.
We can retrieve an exhaustive dictionary of this rule as follows:
solver.i_to_rule(i=0)
# {
# "m": 170,
# "m0": 9,
# "m1": 161,
# "coverage": 0.19079685746352412,
# "m1/M1": 0.47076023391812866,
# "mu_rule": 0.9470588235294117,
# "mu_pop": 0.3838383838383838,
# "sigma_pop": 0.48659245426485753,
# "lift": 2.4673374613003096,
# "p_value": 1.925558554763681e-67,
# "F_score": 0.62890625,
# "Z_score": 16.767366956025956,
# "rule_S": {
# "Sex": "female",
# "Pclass": [
# 1,
# 2
# ]
# },
# "complexity_S": 2,
# "F1_pop": 0.5547445255474452,
# "G_bad_class": 0.17059483726150393,
# "G_information": 0.24588549145241542,
# "G_gini": 0.14958927829841417,
# "p_value_ratio_S": {
# "Pclass": 5.359920512293736e-08,
# "Name": 1.0,
# "Sex": 5.022554571114061e-46,
# "Age": 1.0,
# "SibSp": 1.0,
# "Parch": 1.0,
# "Ticket": 1.0,
# "Fare": 1.0,
# "Cabin": 1.0,
# "Embarked": 1.0
# },
# "F_score_ratio_S": {
# "Pclass": 1.2812499999999998,
# "Name": 1.0,
# "Sex": 0.9302417652027029,
# "Age": 1.0,
# "SibSp": 1.0,
# "Parch": 1.0,
# "Ticket": 1.0,
# "Fare": 1.0,
# "Cabin": 1.0,
# "Embarked": 1.0
# },
# "subrules_S": [
# {
# "M": 891,
# "M0": 549,
# "M1": 342,
# "mu_pop": 0.3838383838383838,
# "sigma_pop": 0.48659245426485753,
# "F1_pop": 0.5547445255474452,
# "m": 314,
# "m0": 81,
# "m1": 233,
# "coverage": 0.35241301907968575,
# "m1/M1": 0.6812865497076024,
# "mu_rule": 0.7420382165605095,
# "lift": 1.9332048273550118,
# "mc": 577,
# "m0c": 468,
# "m1c": 109,
# "p_value": 3.592513266469419e-60,
# "F_score": 0.7103658536585366,
# "Z_score": 16.20063097451895,
# "G_bad_class": 0.17059483726150393,
# "G_information": 0.21766010666061436,
# "G_gini": 0.13964795747285225,
# "complexity": 1,
# "subrule_S": {
# "Sex": "female"
# },
# "var_name": "Sex",
# "var_rule": "female",
# "p_value_ratio": 5.022554571114061e-46,
# "shuffling_scores": {
# "p_value": {
# "cohen_d": 75.86463627446636,
# "effect_size": "6. huge",
# "wy_ratio": 0.0
# },
# "Z_score": {
# "cohen_d": 37.71820414056423,
# "effect_size": "6. huge",
# "wy_ratio": 0.0
# },
# "F_score": {
# "cohen_d": 51.16755087975539,
# "effect_size": "6. huge",
# "wy_ratio": 0.0
# }
# }
# },
# {
# "M": 891,
# "M0": 549,
# "M1": 342,
# "mu_pop": 0.3838383838383838,
# "sigma_pop": 0.48659245426485753,
# "F1_pop": 0.5547445255474452,
# "m": 170,
# "m0": 9,
# "m1": 161,
# "coverage": 0.19079685746352412,
# "m1/M1": 0.47076023391812866,
# "mu_rule": 0.9470588235294117,
# "lift": 2.4673374613003096,
# "mc": 721,
# "m0c": 540,
# "m1c": 181,
# "p_value": 1.925558554763681e-67,
# "F_score": 0.62890625,
# "Z_score": 16.767366956025956,
# "G_bad_class": 0.17059483726150393,
# "G_information": 0.24588549145241542,
# "G_gini": 0.14958927829841417,
# "complexity": 2,
# "subrule_S": {
# "Sex": "female",
# "Pclass": [
# 1,
# 2
# ]
# },
# "var_name": "Pclass",
# "var_rule": [
# 1,
# 2
# ],
# "p_value_ratio": 5.359920512293736e-08,
# "shuffling_scores": {
# "p_value": {
# "cohen_d": 85.9832032893128,
# "effect_size": "6. huge",
# "wy_ratio": 0.0
# },
# "Z_score": {
# "cohen_d": 39.52974355288319,
# "effect_size": "6. huge",
# "wy_ratio": 0.0
# },
# "F_score": {
# "cohen_d": 23.36359433204899,
# "effect_size": "6. huge",
# "wy_ratio": 0.0
# }
# }
# }
# ],
# "feature_contributions_S": {
# "rule_S": {
# "Sex": "female",
# "Pclass": "[1, 2]"
# },
# "p_value_contribution": {
# "Sex": 0.8616919700920942,
# "Pclass": 0.13830802990790583
# },
# "F_score_contribution": {
# "Sex": 1.0,
# "Pclass": 0.0
# },
# "Z_score_contribution": {
# "Sex": 0.9417181496074654,
# "Pclass": 0.05828185039253464
# },
# "G_bad_class_contribution": {
# "Sex": 1.0,
# "Pclass": 0.0
# },
# "G_information_contribution": {
# "Sex": 0.857675593215252,
# "Pclass": 0.142324406784748
# },
# "G_gini_contribution": {
# "Sex": 0.9099458875412633,
# "Pclass": 0.0900541124587367
# }
# },
# "shuffling_scores": {
# "p_value": {
# "cohen_d": 85.9832032893128,
# "effect_size": "6. huge",
# "wy_ratio": 0.0
# },
# "Z_score": {
# "cohen_d": 39.52974355288319,
# "effect_size": "6. huge",
# "wy_ratio": 0.0
# },
# "F_score": {
# "cohen_d": 23.36359433204899,
# "effect_size": "6. huge",
# "wy_ratio": 0.0
# }
# }
# }
This dictionary contains detailed information and statistics about the rule at position i=0.
Here are some of the key entries:
"m": 170: This is the number of points captured by the rule. The rule contains 170 points in total."m0": 9: This is the number of0captured by the rule. The rule contains 9 non-survivors."m1": 161: This is the number of1captured by the rule. The rule contains 161 survivors."coverage": 0.19079685746352412: This is the coverage of the rule, i.e. the ratiom/M. The rule covers 19.1% of the population."m1/M1": 0.47076023391812866: This is the sensitivity of the rule, i.e. the capture rate of1. The rule captures 47.1% of the survivors."mu_rule": 0.9470588235294117: This is the average of the target variable in the rule, i.e. the ratiom1/m. Here we have a survival rate of 94.7% in the rule."mu_pop": 0.3838383838383838: This the average of the target variable in the population, i.e. the rationM1/M. Here we have a survival rate of 38.4% in the population."sigma_pop": 0.48659245426485753: This is the standard deviation of the target variable in the population."lift": 2.4673374613003096: This is the lift of the rule, i.e. the ratiomu_rule/mu_pop."p_value": 1.925558554763681e-67: This is the p-value (according to the hypergeometric probability law, not the chi-squared) of the rule."F_score": 0.62890625: This is the F1-score of the rule."Z_score": 16.767366956025956: This is the Z-score of the rule."rule_S": {"Sex": "female","Pclass": [1,2]}: The rule reads Females in first or second class."complexity_S": 2: The complexity of the rule is 2, i.e. two variables are involded in the rule ("Sex"and"Pclass")."F1_pop": 0.5547445255474452: This is the F1-score of the population."G_bad_class": 0.17059483726150393: This is the bad classification gain of the rule."G_information": 0.24588549145241542: This is the information gain of the rule."G_gini": 0.14958927829841417: This is the Gini gain of the rule."shuffling_scores": This contains the scores that measure how strong is the rule compared to what would be found in shuffled data.
DataFrame of subrules#
We can retrieve a DataFrame of the subrules for the rule at position i=0 as follows:
solver.i_to_subrules_dataframe(i=0)
# p_value_ratio variable rule complexity p_value F_score ... m0c m1c G_bad_class G_information G_gini subrule_S
# 0 5.022555e-46 Sex female 1 3.592513e-60 0.710366 ... 468 109 0.170595 0.217660 0.139648 {'Sex': 'female'}
# 1 5.359921e-08 Pclass [1, 2] 2 1.925559e-67 0.628906 ... 540 181 0.170595 0.245885 0.149589 {'Sex': 'female'...
The DataFrame of subrules begins with a rule of complexity 1 (e.g. `{"Sex": "female"}`) and progresses to higher complexities, such as complexity 2 (e.g. `{"Sex": "female", "Pclass": [1, 2]}`).
As we can observe, increasing the complexity from 1 to 2 improves the p-values and the information gain but degrades the F1-score.
The purpose of the subrules DataFrame is to assist in deciding the optimal level of rule complexity based on various metrics.
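As a sketch of such a decision, the snippet below hand-copies the two subrules shown above and selects a complexity level under two different criteria; the selection logic is ours, not part of the API:

```python
# Hand-copied excerpt of the two subrules shown above.
subrules = [
    {"complexity": 1, "p_value": 3.592513e-60, "F_score": 0.710366,
     "subrule_S": {"Sex": "female"}},
    {"complexity": 2, "p_value": 1.925559e-67, "F_score": 0.628906,
     "subrule_S": {"Sex": "female", "Pclass": [1, 2]}},
]

# Two selection criteria that disagree for this rule:
best_by_p_value = min(subrules, key=lambda r: r["p_value"])   # smaller is better
best_by_f_score = max(subrules, key=lambda r: r["F_score"])   # larger is better

print(best_by_p_value["complexity"])  # 2
print(best_by_f_score["complexity"])  # 1
```

The disagreement mirrors the observation above: adding `Pclass` improves the p-value but degrades the F1-score.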
DataFrame of feature contributions#
We can retrieve a DataFrame showing the contributions of the features for the rule at position i=0 as follows:
solver.i_to_feature_contributions_S(i=0)
# p_value F_score Z_score G_bad_class G_information G_gini
# feature_name
# Sex 0.861692 1.0 0.941718 1.0 0.857676 0.909946
# Pclass 0.138308 0.0 0.058282 0.0 0.142324 0.090054
As we can observe, the variable "Sex" provides the largest contribution.
The variable "Pclass" adds a slight positive contribution to both the p-value and the information gain, as including it in the rule improves these metrics.
However, the contribution of "Pclass" is zero for the F-score.
This indicates that it does not enhance the F-score (in fact, it degrades it, but by convention, contributions are kept nonnegative).
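The contributions for each metric appear to be normalized: nonnegative values summing to 1 across features. This can be checked on the values hand-copied from the DataFrame above (no API calls):

```python
# Feature contributions hand-copied from the DataFrame above.
contributions = {
    "p_value": {"Sex": 0.861692, "Pclass": 0.138308},
    "F_score": {"Sex": 1.0, "Pclass": 0.0},
    "Z_score": {"Sex": 0.941718, "Pclass": 0.058282},
}

# Each metric's contributions are nonnegative and sum to 1.
for metric, contrib in contributions.items():
    assert all(value >= 0.0 for value in contrib.values())
    assert abs(sum(contrib.values()) - 1.0) < 1e-6
```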
Printing modes#
Earlier, in the Titanic example, we saw the dense printing mode.
There are three printing modes:
- `full`: A full print of the results.
- `light`: A lighter version of the full print.
- `dense`: A very compact version of the print.
Column types#
The columns of a Pandas DataFrame are associated with a type known as a dtype, such as int64, float64, object, and so on.
In addition to these, InsightSolver introduces a complementary layer of types called btype.
While dtypes describe the encoding of the data (e.g., integers or floats), btypes define how the data should be interpreted when mining for rules. These btypes include:
- `binary`: The variable is treated as a binary categorical variable, and rule mining will focus on finding subsets.
- `multiclass`: The variable is treated as a multiclass categorical variable, and rule mining will aim to find subsets.
- `continuous`: The variable is treated as an ordered variable, and rule mining will focus on identifying meaningful intervals.
- `ignore`: The variable is excluded from rule mining.
The btypes of the columns are automatically detected by InsightSolver, so it’s not mandatory to explicitly specify a btype for each variable.
However, if the user wishes to specify the btype for some or all variables, this can be done using the columns_types dictionary (keys are column names, values are btypes).
The columns_types dictionary can be passed as a parameter of the solver.
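For example, a `columns_types` dictionary for some of the Titanic columns might look as follows; the btype choices shown are illustrative, and passing the dictionary to the solver is assumed from the text above:

```python
# Illustrative btype overrides for some Titanic columns.
columns_types = {
    "Sex": "binary",         # two modalities: "male" / "female"
    "Pclass": "multiclass",  # three modalities: 1, 2, 3
    "Age": "continuous",     # rule mining will look for meaningful intervals
    "Ticket": "ignore",      # high-cardinality identifier, excluded from mining
}

# Every value must be one of the four supported btypes.
assert set(columns_types.values()) <= {"binary", "multiclass", "continuous", "ignore"}
```

Columns not listed in the dictionary keep their automatically detected btype.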