insightsolver module#
Organization: InsightSolver Solutions Inc.
Project Name: InsightSolver
Module Name: insightsolver
File Name: insightsolver.py
Author: Noé Aubin-Cadot
Description#
This file contains the InsightSolver class.
This class is meant to ingest data, specify rule mining parameters and make rule mining API calls.
Note#
A service key is necessary to use the API client.
License#
Exclusive Use License - see LICENSE for details.
- insightsolver.insightsolver.compute_admissible_btypes(M: int, dtype: str, nunique: int, name: str)#
This function computes the admissible btypes a column can take. The btypes are:
'binary', 'multiclass', 'continuous', 'ignore'.
- insightsolver.insightsolver.compute_columns_names_to_admissible_btypes(df: DataFrame) dict[str, list[str]]#
This function computes a dict that maps the column names of df to lists of admissible btypes.
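A minimal sketch of how these two helpers might be called, assuming they are importable from insightsolver.insightsolver as documented above; the DataFrame and its columns are illustrative assumptions, not part of the API reference.

# Hypothetical DataFrame used only to illustrate the helpers
import pandas as pd
df = pd.DataFrame({'Survived': [0, 1, 1, 0], 'Age': [22.0, 38.0, 26.0, 35.0]})

from insightsolver.insightsolver import (
    compute_admissible_btypes,
    compute_columns_names_to_admissible_btypes,
)

# Admissible btypes for a single column (expected to be among 'binary', 'multiclass', 'continuous', 'ignore')
btypes_age = compute_admissible_btypes(M=len(df), dtype=str(df['Age'].dtype), nunique=df['Age'].nunique(), name='Age')

# Admissible btypes for every column of the DataFrame
columns_names_to_admissible_btypes = compute_columns_names_to_admissible_btypes(df=df)
print(columns_names_to_admissible_btypes)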
- insightsolver.insightsolver.validate_class_integrity(df: DataFrame | None, target_name: str | int | None, target_goal: str | Real | uint8 | None, columns_types: Dict | None, columns_descr: Dict | None, threshold_M_max: int | None, specified_constraints: Dict | None, top_N_rules: int | None, filtering_score: str, n_benchmark_original: int, n_benchmark_shuffle: int, do_strict_types: bool = False, verbose: bool = False) dict#
This function validates the integrity of the parameter values passed during the instantiation of the InsightSolver class.
Parameters#
- df: DataFrame
The DataFrame that contains the data to analyse (a target column and various feature columns).
- target_name: str
Name of the column of the target variable.
- target_goal: str (or other modality of the target variable)
Target goal.
- columns_types: dict
Types of the columns.
- columns_descr: dict
Descriptions of the columns.
- threshold_M_max: int
Threshold on the maximum number of observations to consider, above which we sample observations.
- specified_constraints: dict
Dictionary of the specified constraints on m_min, m_max, coverage_min, coverage_max.
- top_N_rules: int
An integer that specifies the maximum number of rules to get from the rule mining.
- filtering_score: str
A string that specifies the filtering score to be used when selecting rules.
- n_benchmark_original: int
An integer that specifies the number of benchmarking runs to execute where the target is not shuffled.
- n_benchmark_shuffle: int
An integer that specifies the number of benchmarking runs to execute where the target is shuffled.
- do_strict_types: bool (default False)
A boolean that specifies if we want a strict evaluation of types.
- verbose: bool (default False)
Verbosity.
Returns#
- columns_types: dict
A dict of the columns types after adjusting the types.
- insightsolver.insightsolver.format_value(value, format_type='scientific', decimals=1, verbose=False)#
This function formats values depending on the type of values (float or mpmath) and the type of the format to show:
- 'percentage': shows the values as a percentage.
- 'scientific': shows the values in scientific notation with 4 decimals (default).
- 'scientific_no_decimals': shows the values in scientific notation without decimals.
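A minimal sketch of calling format_value; the exact output strings depend on the implementation and are not specified here.

from insightsolver.insightsolver import format_value

# Scientific notation with 1 decimal
print(format_value(0.000123, format_type='scientific', decimals=1))

# Same value shown as a percentage
print(format_value(0.000123, format_type='percentage'))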
- insightsolver.insightsolver.S_to_index_points_in_rule(solver, S: dict, verbose: bool = False, df: DataFrame | None = None) Index#
This function takes a rule S and returns the index of the points of a DataFrame that lie inside the rule. If no DataFrame is provided, the one used to train the solver is used.
- insightsolver.insightsolver.resolve_language(language: str = 'auto', default_language: str = 'english') str#
- insightsolver.insightsolver.gain_to_percent(gain: float, decimals: int = 2) str#
This function formats the gain to either a positive percentage or a negative percentage.
Parameters#
- gain: float
The gain (gain = lift - 1).
- decimals: int
Number of decimals to show.
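A minimal sketch of gain_to_percent; the exact output string is an assumption.

from insightsolver.insightsolver import gain_to_percent

# A lift of 1.25 corresponds to a gain of 0.25
lift = 1.25
gain = lift - 1
print(gain_to_percent(gain, decimals=2))  # expected to print something like '+25.00%' (illustrative)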
- insightsolver.insightsolver.search_best_ruleset_from_API_public(df: DataFrame, computing_source: str = 'auto', input_file_service_key: str | None = None, user_email: str | None = None, target_name: str | int | None = None, target_goal: str | Real | uint8 | None = None, columns_names_to_btypes: Dict | None = {}, columns_names_to_descr: Dict | None = {}, threshold_M_max: int | None = None, specified_constraints: Dict | None = {}, top_N_rules: int | None = 10, verbose: bool = False, filtering_score: str = 'auto', api_source: str = 'auto', do_compress_data: bool = True, do_compute_memory_usage: bool = True, n_benchmark_original: int = 5, n_benchmark_shuffle: int = 20, do_llm_readable_rules: bool = False, llm_source: str = 'remote_gemini', llm_language: str = 'auto', do_store_llm_cache: bool = True, do_check_llm_cache: bool = True) dict#
This function is meant to make a rule mining API call; a usage sketch follows the Returns section below.
Parameters#
- df: DataFrame
The DataFrame that contains the data to analyse (a target column and various feature columns).
- computing_source: str
Whether the rule mining should be computed locally or remotely.
- input_file_service_key: str
The string that specifies the path to the service key (necessary to use the remote Cloud Function from outside GCP).
- user_email: str
The email of the user (necessary to use the remote Cloud Function from inside GCP).
- target_name: str
Name of the column of the target variable.
- target_goal: str (or other modality of the target variable)
Target goal.
- columns_names_to_btypes: dict
A dict that specifies the btypes of the columns.
- columns_names_to_descr: dict
A dict that specifies the descriptions of the columns.
- threshold_M_max: int
Threshold on the maximum number of points to use during the rule mining (max. 10000 pts in the public API).
- specified_constraints: dict
A dict that specifies constraints to be used during the rule mining.
- top_N_rules: int
An integer that specifies the maximum number of rules to get from the rule mining.
- verbose: bool
Verbosity.
- filtering_score: str
A string that specifies the filtering score to be used when selecting rules.
- api_source: str
A string used to identify the source of the API call.
- do_compress_data: bool
A boolean that specifies if we want to compress the data.
- do_compute_memory_usage: bool
A bool that specifies if we want to get the memory usage of the computation.
- n_benchmark_original: int
Number of benchmarking runs to execute where the target is not shuffled.
- n_benchmark_shuffle: int
Number of benchmarking runs to execute where the target is shuffled.
- do_llm_readable_rules: bool
If we want to convert the rules to a readable format using an LLM.
- llm_source: str
Source where the LLM is running.
- llm_language: str
Language of the LLM.
- do_store_llm_cache: bool
If we want to store the result of the LLM in the cache (makes future LLM calls faster).
- do_check_llm_cache: bool
If we want to check if the results of the prompt are found in the cache (makes LLM calls faster).
Returns#
- response: requests.models.Response
A response object obtained from the API call that contains the rule mining results.
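A minimal sketch of a public API call, assuming a local service key file and the Titanic dataset used in the class example further below; file names and column names are assumptions.

import pandas as pd
from insightsolver.insightsolver import search_best_ruleset_from_API_public

df = pd.read_csv('kaggle_titanic_train.csv')  # example dataset (assumed available locally)

results = search_best_ruleset_from_API_public(
    df                     = df,
    computing_source       = 'auto',
    input_file_service_key = 'name_of_your_service_key.json',  # path to your service key (assumed)
    target_name            = 'Survived',
    target_goal            = 1,
    top_N_rules            = 10,
    verbose                = True,
)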
- insightsolver.insightsolver.get_credits_available(computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None)#
This function is meant to retrieve from the server the number of credits available.
- class insightsolver.insightsolver.InsightSolver(verbose: bool = False, df: DataFrame = None, target_name: str | int | None = None, target_goal: str | Real | uint8 | None = None, columns_types: Dict | None = {}, columns_descr: Dict | None = {}, threshold_M_max: int | None = 10000, specified_constraints: Dict | None = {}, top_N_rules: int | None = 10, filtering_score: str = 'auto', n_benchmark_original: int = 5, n_benchmark_shuffle: int = 20)#
Bases: Mapping
The class InsightSolver is meant to:
- Take input data.
- Make insightsolver API calls to the server.
- Present the results of the rule mining.
Attributes#
- df: DataFrame
The DataFrame that contains the data to analyse.
- target_name: str (default None)
Name of the target variable (by default it’s the first column).
- target_goal: (str or int)
Target goal.
- target_threshold: (int or float)
Threshold used to convert a continuous target variable to a binary target variable.
- M: int
Number of points in the population.
- M0: int
Number of points with target value 0 in the population.
- M1: int
Number of points with target value 1 in the population.
- columns_types: dict
Types of the columns.
- columns_descr: dict
Textual descriptions of the columns.
- other_modalities: dict
Modalities that are mapped to the modality ‘other’.
- threshold_M_max: int (default 10000)
Threshold on the maximum number of observations to consider, above which we undersample the observations.
- specified_constraints: dict
Dictionary of the specified constraints on m_min, m_max, coverage_min, coverage_max.
- top_N_rules: int (default 10)
An integer that specifies the maximum number of rules to get from the rule mining.
- filtering_score: str (default ‘auto’)
A string that specifies the filtering score to be used when selecting rules.
- n_benchmark_original: int (default 5)
An integer that specifies the number of benchmarking runs to execute where the target is not shuffled.
- n_benchmark_shuffle: int (default 20)
An integer that specifies the number of benchmarking runs to execute where the target is shuffled.
- monitoring_metadata: dict
Dictionary of monitoring metadata.
- benchmark_scores: dict
Dictionary of the benchmarking scores against shuffled data.
- rule_mining_results: dict
Dictionary that contains the results of the rule mining.
- _is_fitted: bool
Boolean that tells if the solver is fitted.
Methods#
- __init__: None
Initialization of an instance of the class InsightSolver.
- __str__: None
Converts the solver to a string as provided by the print method.
- ingest_dict: None
Ingests a Python dict.
- ingest_json_string: None
Ingests a JSON string.
- is_fitted: bool
Returns a boolean that tells if the solver is fitted.
- fit: None
Fits the solver.
- S_to_index_points_in_rule: Pandas Index
Returns the index of the points in a rule S.
- S_to_s_points_in_rule: Pandas Series
Returns a boolean Pandas Series that tells if the point is in the rule S.
- S_to_df_filtered: Pandas DataFrame
Returns the filtered df of rows that are in the rule S.
- ruleset_count: int
Counts the number of rules held by the InsightSolver.
- i_to_rule: dict
Gives the rule i of the InsightSolver.
- i_to_S: dict
Returns the rule S for the rule at index i.
- i_to_subrules_dataframe: Pandas DataFrame
Returns a DataFrame containing the information about the subrules of the rule i.
- i_to_feature_contributions_S: Pandas DataFrame
Returns a DataFrame of the feature contributions of the variables in the rule S at position i.
- i_to_readable_text: str
Returns the readable text of the rule i if it is available.
- i_to_print: None
Prints the content of the rule i in the InsightSolver.
- get_range_i: list
Gives the range of i in the InsightSolver.
- print: None
Prints the content of the InsightSolver.
- print_light: None
Prints the content of the InsightSolver (‘light’ mode).
- print_dense: None
Prints the content of the InsightSolver (‘dense’ mode).
- to_dict: dict
Exports the content of the InsightSolver object to a Python dict.
- to_json_string: str
Exports the content of the InsightSolver object to a JSON string.
- to_dataframe: Pandas DataFrame
Exports the rule mining results to a Pandas DataFrame.
- to_csv: str
Exports the rule mining results to a CSV string and/or a local CSV file.
- to_excel: None
Exports the rule mining results to an Excel file.
- to_excel_string: str
Exports the rule mining results to an Excel string.
- get_credits_needed_for_computation: int
Get the number of credits needed for the fitting computation of the solver.
- get_df_credits_infos: Pandas DataFrame
Get a DataFrame of the transactions involving credits.
- get_credits_available: int
Get the number of credits available.
- convert_target_to_binary: pd.Series
Converts the target variable to a binary {0,1}-valued Pandas Series.
- compute_mutual_information: pd.Series
Computes a Pandas Series of the mutual information between features and the target variable.
- to_pdf: str
Generates a PDF containing all visualization figures for the solver.
- to_zip: str
Exports the rule mining results to a ZIP file.
Example#
Here’s a sample code to use the class InsightSolver:

# Specify the service key
service_key = 'name_of_your_service_key.json'

# Import some data
import pandas as pd
df = pd.read_csv('kaggle_titanic_train.csv')

# Specify the name of the target variable
target_name = 'Survived' # We are interested in whether the passengers survived or not

# Specify the target goal
target_goal = 1 # We are searching rules that describe survivors

# Import the class InsightSolver from the module insightsolver
from insightsolver import InsightSolver

# Create an instance of the class InsightSolver
solver = InsightSolver(
    df          = df,          # A dataset
    target_name = target_name, # Name of the target variable
    target_goal = target_goal, # Target goal
)

# Fit the solver
solver.fit(
    service_key = service_key, # Use your API service key here
)

# Print the rule mining results
solver.print()
- __init__(verbose: bool = False, df: DataFrame = None, target_name: str | int | None = None, target_goal: str | Real | uint8 | None = None, columns_types: Dict | None = {}, columns_descr: Dict | None = {}, threshold_M_max: int | None = 10000, specified_constraints: Dict | None = {}, top_N_rules: int | None = 10, filtering_score: str = 'auto', n_benchmark_original: int = 5, n_benchmark_shuffle: int = 20)#
The initialization occurs when an InsightSolver class instance is created.
Parameters#
- verbose: bool (default False)
If we want the initialization to be verbose.
- df: DataFrame
The DataFrame that contains the data to analyse (a target column and various feature columns).
- target_name: str
Name of the column of the target variable.
- target_goal: str (or other modality of the target variable)
Target goal.
- columns_types: dict
Types of the columns.
- columns_descr: dict
Descriptions of the columns.
- threshold_M_max: int
Threshold on the maximum number of observations to consider, above which we sample observations.
- specified_constraints: dict
Dictionary of the specified constraints on m_min, m_max, coverage_min, coverage_max.
- top_N_rules: int (default 10)
An integer that specifies the maximum number of rules to get from the rule mining.
- filtering_score: str (default ‘auto’)
A string that specifies the filtering score to be used when selecting rules.
- n_benchmark_original: int (default 5)
An integer that specifies the number of benchmarking runs to execute where the target is not shuffled.
- n_benchmark_shuffle: int (default 20)
An integer that specifies the number of benchmarking runs to execute where the target is shuffled.
Returns#
- solver: InsightSolver
An instance of the class InsightSolver.
Example#
Here’s a sample code to instantiate the class InsightSolver:

# Import the class InsightSolver from the module insightsolver
from insightsolver import InsightSolver

# Create an instance of the class InsightSolver
solver = InsightSolver(
    df          = df,          # A Pandas DataFrame
    target_name = target_name, # Name of the target variable
    target_goal = target_goal, # Target goal
)
- ingest_dict(d: dict, verbose: bool = False) None#
This method ingests a Python dict into the solver.
- ingest_json_string(json_string: str, verbose: bool = False) None#
This method ingests a JSON string into the solver.
- is_fitted()#
This method returns a boolean that tells if the solver is fitted.
- fit(verbose: bool = False, computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None, api_source: str = 'auto', do_compress_data: bool = True, do_compute_memory_usage: bool = True, do_check_enough_credits: bool = False, do_llm_readable_rules: bool = True, llm_source: str = 'auto', llm_language: str = 'auto', do_store_llm_cache: bool = True, do_check_llm_cache: bool = True) None#
This method aims to fit the solver; a usage sketch follows the parameter list below.
Parameters#
- verbose: bool (default False)
If we want the fitting to be verbose.
- computing_source: str (default ‘auto’)
Specify where the rule mining computation is done (‘local_cloud_function’ or ‘remote_cloud_function’).
- service_key: str (default None)
Path+name of the service key.
- user_email: str (default None)
User email.
- api_source: str (default ‘auto’)
Source of the API call.
- do_compress_data: bool (default True)
If we want to compress the data for the communications with the server.
- do_compute_memory_usage: bool (default True)
If we want to monitor the first thread memory usage on the server side.
- do_check_enough_credits: bool (default False)
Check if there are enough credits to fit the solver.
- do_llm_readable_rules: bool (default True)
If we want to convert the rules to a readable format using an LLM.
- llm_source: str (default ‘auto’)
Source where the LLM is running.
- llm_language: str (default ‘auto’)
Language of the LLM.
- do_store_llm_cache: bool (default True)
If we want to store the result of the LLM in the cache (makes future LLM calls faster).
- do_check_llm_cache: bool (default True)
If we want to check if the results of the prompt are found in the cache (makes LLM calls faster).
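A minimal sketch of a fit call with explicit options, assuming solver is an InsightSolver instance and a local service key file exists (the path is an assumption).

# Fit the solver remotely with explicit options
solver.fit(
    computing_source      = 'remote_cloud_function',
    service_key           = 'name_of_your_service_key.json',  # path to your service key (assumed)
    do_llm_readable_rules = True,
    llm_language          = 'auto',
    verbose               = True,
)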
- S_to_index_points_in_rule(S: dict, verbose: bool = False, df: DataFrame | None = None) Index#
This method returns the index of the points inside a rule S.
- S_to_s_points_in_rule(S: dict, verbose: bool = False, df: DataFrame | None = None) Series#
This method returns a boolean Series that tells if the points are in the rule S or not.
- S_to_df_filtered(S: dict, verbose: bool = False, df: DataFrame | None = None)#
This method returns the DataFrame of rows of df that lie inside a rule S.
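A minimal sketch of inspecting the points covered by a rule, assuming solver is a fitted InsightSolver instance.

# Retrieve the rule S at position 0
S = solver.i_to_S(0)

# Index of the points inside the rule
index_in_rule = solver.S_to_index_points_in_rule(S=S)

# Boolean Series telling which rows fall inside the rule
s_in_rule = solver.S_to_s_points_in_rule(S=S)

# Rows of the training DataFrame that lie inside the rule
df_in_rule = solver.S_to_df_filtered(S=S)
print(df_in_rule.shape)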
- ruleset_count() int#
This method returns the number of rules held in an instance of the solver.
- i_to_rule(i: int) dict#
- i_to_S(i)#
This method returns the rule S at position i.
- i_to_subrules_dataframe(i: int = 0) DataFrame#
This method returns a DataFrame which contains the information about the subrules of the rule i.
- i_to_feature_contributions_S(i: int, do_rename_cols: bool = True, do_ignore_col_rule_S: bool = True) DataFrame#
This method returns a DataFrame of the feature contributions of the variables in the rule S at position i.
- i_to_feature_names(i: int, do_sort: bool = True)#
Returns the list of feature names in the rule at position i. The features are sorted by contribution, descending.
Parameters#
- i: int
Index of the rule in the solver.
- do_sort: bool
If we want to sort the features by contribution, descending.
- i_to_readable_text(i) str | None#
Returns the readable text of the rule i if it is available.
- i_to_print(i: int, indentation: str = '', do_print_shuffling_scores: bool = True, do_print_rule_DataFrame: bool = False, do_print_subrules_S: bool = True, do_show_coverage_diff: bool = False, do_show_cohen_d: bool = True, do_show_wy_ratio: bool = True, do_print_feature_contributions_S: bool = True) None#
This method prints the content of the rule i in the solver.
Parameters#
- i: int
Index of the rule to print.
- indentation: str
Indentation of some printed elements.
- do_print_shuffling_scores: bool
If we want to print the shuffling scores.
- do_print_rule_DataFrame: bool
If we want to print a DataFrame of the rule.
- do_print_subrules_S: bool
If we want to print the DataFrame of subrules.
- do_show_coverage_diff: bool
If we want to show the differences of coverage in the DataFrame of subrules.
- do_show_cohen_d: bool
If we want to show the Cohen d in the DataFrame of subrules.
- do_show_wy_ratio: bool
If we want to show the WY ratio in the DataFrame of subrules.
- do_print_feature_contributions_S: bool
If we want to print the DataFrame of feature contributions.
- get_range_i(complexity_max: int | None = None) list#
This method gives the range of i in the solver. If the integer complexity_max is specified, only this number of elements is returned.
- print(verbose: bool = False, r: int | None = None, do_print_dataset_metadata: bool = True, do_print_monitoring_metadata: bool = False, do_print_benchmark_scores: bool = True, do_print_shuffling_scores: bool = True, do_show_cohen_d: bool = True, do_show_wy_ratio: bool = True, do_print_rule_mining_results: bool = True, do_print_rule_DataFrame: bool = False, do_print_subrules_S: bool = True, do_show_coverage_diff: bool = False, do_print_feature_contributions_S: bool = True, separation_width_between_rules: int | None = 79, do_print_last_separator: bool = True, mode: str = 'full') None#
This method prints the content of the InsightSolver instance.
- print_light(print_format: str = 'list', do_print_shuffling_scores: bool = True, do_print_last_separator: bool = True) None#
This method does a ‘light’ print of the solver.
Two formats:
- 'list': shows the rules via a loop of prints.
- 'compact': shows the rules in a single DataFrame.
- print_dense(do_print_lifts: bool = False, do_print_shuffling_scores: bool = True) None#
This method is aimed at printing a ‘dense’ version of the solver.
Parameters#
- do_print_lifts: bool
If we want to show the lifts.
- do_print_shuffling_scores: bool
If we want to show the shuffling scores.
- to_dict() dict#
This method aims to export the content of the solver to a dictionary.
- to_json_string(verbose=False) str#
This method aims to export the content of the solver to a JSON string.
- to_dataframe(verbose=False, do_append_datetime=False, do_rename_cols=False) DataFrame#
This method aims to export the content of the solver to a DataFrame.
- to_csv(output_file=None, verbose=False, do_rename_cols=False) str#
This method is meant to export the content of the solver to a CSV file.
- to_excel(output_file, verbose=False, do_rename_cols=False) None#
This method is meant to export the solver to an Excel file.
- to_excel_string(verbose=False, do_rename_cols=False) str#
This method is meant to export the solver to an Excel string.
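A minimal sketch of the export methods, assuming solver is a fitted InsightSolver instance; the output file names are illustrative.

# Rule mining results as a Pandas DataFrame
df_rules = solver.to_dataframe()

# Export to CSV and Excel files (file names are illustrative)
solver.to_csv(output_file='rules.csv')
solver.to_excel(output_file='rules.xlsx')

# Full solver content as a Python dict or a JSON string
d = solver.to_dict()
json_string = solver.to_json_string()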
- get_credits_needed_for_computation() int#
This method is meant to compute the number of credits needed for the fitting computation of the solver.
- get_df_credits_infos(computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None) DataFrame#
This method is meant to retrieve from the server the transactions involving credits.
- get_credits_available(computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None) int#
This method is meant to retrieve from the server the number of credits available.
- convert_target_to_binary()#
This method converts the target variable to a binary {0,1}-valued Pandas Series.
To use this method, the attribute solver.target_goal must be populated because it specifies how to convert the target variable to binary. As a reminder, the target goal must be one of the following:
- A modality of the target variable in the case of a categorical (i.e. 'binary' or 'multiclass') target variable.
- 'min', 'min_q0', 'min_q1', 'min_q2', 'min_q3', 'min_c00', 'min_c01', …, 'min_c98', 'min_c99'.
- 'max', 'max_q1', 'max_q2', 'max_q3', 'max_q4', 'max_c01', 'max_c02', …, 'max_c99', 'max_c100'.
Returns#
- s_target: pd.Series
A {0,1}-valued Pandas Series representing the target variable.
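A minimal sketch, assuming solver.target_goal is already populated.

# Convert the target to a {0,1}-valued Series according to solver.target_goal
s_target = solver.convert_target_to_binary()
print(s_target.value_counts())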
- compute_mutual_information(n_samples: int = 1000) Series#
This method computes the mutual information between the features and the target variable. The result is returned as a Pandas Series.
Parameters#
- n_samples: int
An integer that specifies the number of data rows to use in the computation of the mutual information.
Returns#
- s_mi: pd.Series
A Pandas Series that contains the mutual information of the features with the target variable.
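A minimal sketch, assuming solver is an InsightSolver instance with its data and target set.

# Mutual information between each feature and the target, computed on a sample of rows
s_mi = solver.compute_mutual_information(n_samples=1000)
print(s_mi.sort_values(ascending=False))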
- plot(language: str = 'en', do_mutual_information: bool = True, do_banner: bool = True, do_contributions: bool = True, do_distributions: bool = True, do_mosaics_rule_vs_comp: bool = True, do_mosaics_rule_vs_pop: bool = True, do_legend: bool = True) None#
Displays all visualization figures for the solver; a usage sketch follows the parameter list below.
Parameters#
- language: str
Language for the plots (‘en’ or ‘fr’).
- do_mutual_information: bool
Whether to show the mutual information figure.
- do_banner: bool
Whether to show the banner figures.
- do_contributions: bool
Whether to show feature contributions.
- do_distributions: bool
Whether to show feature distributions.
- do_mosaics_rule_vs_comp: bool
Whether to show the mosaics of rule vs complement figures.
- do_mosaics_rule_vs_pop: bool
Whether to show the mosaics of rule vs population figures.
- do_legend: bool
Whether to show the legend figure.
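A minimal sketch, assuming solver is a fitted InsightSolver instance; the option values are illustrative.

# Display all figures for the solver, with French labels and without the legend figure
solver.plot(language='fr', do_legend=False)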
- to_pdf(output_file: str | None = None, verbose: bool = False, do_mutual_information: bool = True, do_banner: bool = True, do_contributions: bool = True, do_distributions: bool = True, do_mosaics_rule_vs_comp: bool = True, do_mosaics_rule_vs_pop: bool = True, do_legend: bool = True, language: str = 'en')#
Export a PDF file containing various results and figures of the solver; a usage sketch follows the Returns section below.
This method is now a simple wrapper around visualization.make_pdf().
Parameters#
- output_file: str, optional
Path where the PDF should be exported.
- verbose: bool, default False
Verbosity.
- do_mutual_information: bool
Include mutual information figure.
- do_banner: bool
Include banner figures.
- do_contributions: bool
Include contribution figures.
- do_distributions: bool
Include distribution figures.
- do_mosaics_rule_vs_comp: bool
Include mosaics of rule vs complement figures.
- do_mosaics_rule_vs_pop: bool
Include mosaics of rule vs population figures.
- do_legend: bool
Include the legend figure.
- language: str
Language for the plots (‘en’ or ‘fr’).
Returns#
- pdf_base64: str
The PDF content encoded as a base64 string, suitable for in-memory use.
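A minimal sketch, assuming solver is a fitted InsightSolver instance; the output file name is illustrative.

import base64

# Export a PDF to disk and keep the base64-encoded content returned by the method
pdf_base64 = solver.to_pdf(output_file='insightsolver_results.pdf', language='en')

# The returned string can also be decoded in memory if needed
pdf_bytes = base64.b64decode(pdf_base64)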
- to_zip(output_file: str | None = None, verbose: bool = False, do_png: bool = True, do_csv: bool = True, do_json: bool = True, do_excel: bool = True, do_pdf: bool = True, language: str = 'en')#
Export the solver content to a ZIP file.
This method is now a simple wrapper around visualization.make_zip().
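A minimal sketch, assuming solver is a fitted InsightSolver instance; the output file name is illustrative.

# Export figures and tabular results to a ZIP archive
solver.to_zip(
    output_file = 'insightsolver_results.zip',
    do_png      = True,
    do_csv      = True,
    do_json     = True,
    do_excel    = True,
    do_pdf      = True,
    language    = 'en',
)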