insightsolver module#
Organization: InsightSolver Solutions Inc.
Project Name: InsightSolver
Module Name: insightsolver
File Name: insightsolver.py
Author: Noé Aubin-Cadot
Description#
This file contains the InsightSolver class.
This class is meant to ingest data, specify rule mining parameters and make rule mining API calls.
Note#
A service key is necessary to use the API client.
License#
Exclusive Use License - see LICENSE for details.
- insightsolver.insightsolver.compute_admissible_btypes(M: int, dtype: str, nunique: int, name: str)#
This function computes the admissible btypes a column can take. The btypes are:
'binary', 'multiclass', 'continuous', 'ignore'.
- insightsolver.insightsolver.compute_columns_names_to_admissible_btypes(df: DataFrame) dict[str, list[str]]#
This function computes a dict that maps the column names of df to lists of admissible btypes.
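A minimal sketch of how these two helpers might be called, assuming they are importable from insightsolver.insightsolver as documented above; the DataFrame and its columns are illustrative assumptions, not part of the API reference.

# Hypothetical DataFrame used only to illustrate the helpers
import pandas as pd
df = pd.DataFrame({'Survived': [0, 1, 1, 0], 'Age': [22.0, 38.0, 26.0, 35.0]})

from insightsolver.insightsolver import (
    compute_admissible_btypes,
    compute_columns_names_to_admissible_btypes,
)

# Admissible btypes for a single column (expected to be among 'binary', 'multiclass', 'continuous', 'ignore')
btypes_age = compute_admissible_btypes(M=len(df), dtype=str(df['Age'].dtype), nunique=df['Age'].nunique(), name='Age')

# Admissible btypes for every column of the DataFrame
columns_names_to_admissible_btypes = compute_columns_names_to_admissible_btypes(df=df)
print(columns_names_to_admissible_btypes)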
- insightsolver.insightsolver.validate_class_integrity(df: DataFrame | None, target_name: str | int | None, target_goal: str | Real | uint8 | None, columns_types: Dict | None, columns_descr: Dict | None, threshold_M_max: int | None, specified_constraints: Dict | None, top_N_rules: int | None, filtering_score: str, n_benchmark_original: int, n_benchmark_shuffle: int, do_strict_types: bool = False, verbose: bool = False) dict#
This function validates the integrity of the parameter values passed during the instantiation of the InsightSolver class.
Parameters#
- df: DataFrame
The DataFrame that contains the data to analyse (a target column and various feature columns).
- target_name: str
Name of the column of the target variable.
- target_goal: str (or other modality of the target variable)
Target goal.
- columns_types: dict
Types of the columns.
- columns_descr: dict
Descriptions of the columns.
- threshold_M_max: int
Threshold on the maximum number of observations to consider, above which we sample observations.
- specified_constraints: dict
Dictionary of the specified constraints on m_min, m_max, coverage_min, coverage_max.
- top_N_rules: int
An integer that specifies the maximum number of rules to get from the rule mining.
- filtering_score: str
A string that specifies the filtering score to be used when selecting rules.
- n_benchmark_original: int
An integer that specifies the number of benchmarking runs to execute where the target is not shuffled.
- n_benchmark_shuffle: int
An integer that specifies the number of benchmarking runs to execute where the target is shuffled.
- do_strict_types: bool (default False)
A boolean that specifies if we want a strict evaluation of types.
- verbose: bool (default False)
Verbosity.
Returns#
- columns_types: dict
A dict of the columns types after adjusting the types.
- insightsolver.insightsolver.format_value(value, format_type='scientific', decimals=1, verbose=False)#
This function formats values depending on the type of values (float or mpmath) and the type of the format to show:
- 'percentage': shows the values as a percentage.
- 'scientific': shows the values in scientific notation with 4 decimals (default).
- 'scientific_no_decimals': shows the values in scientific notation without decimals.
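A minimal sketch of calling format_value; the exact output strings depend on the implementation and are not specified here.

from insightsolver.insightsolver import format_value

# Scientific notation with 1 decimal
print(format_value(0.000123, format_type='scientific', decimals=1))

# Same value shown as a percentage
print(format_value(0.000123, format_type='percentage'))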
- insightsolver.insightsolver.S_to_index_points_in_rule(solver, S: dict, verbose: bool = False, df: DataFrame | None = None) Index#
This function takes a rule S and returns the index of the points of a DataFrame that lie inside the rule. If no DataFrame is provided, the one used to train the solver is used.
- insightsolver.insightsolver.resolve_language(language: str = 'auto', default_language: str = 'english') str#
- insightsolver.insightsolver.gain_to_percent(gain: float, decimals: int = 2) str#
This function formats the gain to either a positive percentage or a negative percentage.
Parameters#
- gain: float
The gain (gain = lift - 1).
- decimals: int
Number of decimals to show.
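A minimal sketch of gain_to_percent; the exact output string is an assumption.

from insightsolver.insightsolver import gain_to_percent

# A lift of 1.25 corresponds to a gain of 0.25
lift = 1.25
gain = lift - 1
print(gain_to_percent(gain, decimals=2))  # expected to print something like '+25.00%' (illustrative)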
- insightsolver.insightsolver.search_best_ruleset_from_API_public(df: DataFrame, computing_source: str = 'auto', input_file_service_key: str | None = None, user_email: str | None = None, target_name: str | int | None = None, target_goal: str | Real | uint8 | None = None, columns_names_to_btypes: Dict | None = {}, columns_names_to_descr: Dict | None = {}, threshold_M_max: int | None = None, specified_constraints: Dict | None = {}, top_N_rules: int | None = 10, verbose: bool = False, filtering_score: str = 'auto', api_source: str = 'auto', do_compress_data: bool = True, do_compute_memory_usage: bool = True, n_benchmark_original: int = 5, n_benchmark_shuffle: int = 20, do_llm_readable_rules: bool = False, llm_source: str = 'remote_gemini', llm_language: str = 'auto', do_store_llm_cache: bool = True, do_check_llm_cache: bool = True) dict#
This function is meant to make a rule mining API call; a usage sketch follows the Returns section below.
Parameters#
- df: DataFrame
The DataFrame that contains the data to analyse (a target column and various feature columns).
- computing_source: str
Whether the rule mining should be computed locally or remotely.
- input_file_service_key: str
The string that specifies the path to the service key (necessary to use the remote Cloud Function from outside GCP).
- user_email: str
The email of the user (necessary to use the remote Cloud Function from inside GCP).
- target_name: str
Name of the column of the target variable.
- target_goal: str (or other modality of the target variable)
Target goal.
- columns_names_to_btypes: dict
A dict that specifies the btypes of the columns.
- columns_names_to_descr: dict
A dict that specifies the descriptions of the columns.
- threshold_M_max: int
Threshold on the maximum number of points to use during the rule mining (max. 10000 pts in the public API).
- specified_constraints: dict
A dict that specifies constraints to be used during the rule mining.
- top_N_rules: int
An integer that specifies the maximum number of rules to get from the rule mining.
- verbose: bool
Verbosity.
- filtering_score: str
A string that specifies the filtering score to be used when selecting rules.
- api_source: str
A string used to identify the source of the API call.
- do_compress_data: bool
A boolean that specifies if we want to compress the data.
- do_compute_memory_usage: bool
A bool that specifies if we want to get the memory usage of the computation.
- n_benchmark_original: int
Number of benchmarking runs to execute where the target is not shuffled.
- n_benchmark_shuffle: int
Number of benchmarking runs to execute where the target is shuffled.
- do_llm_readable_rules: bool
If we want to convert the rules to a readable format using an LLM.
- llm_source: str
Source where the LLM is running.
- llm_language: str
Language of the LLM.
- do_store_llm_cache: bool
If we want to store the result of the LLM in the cache (makes future LLM calls faster).
- do_check_llm_cache: bool
If we want to check if the results of the prompt are found in the cache (makes LLM calls faster).
Returns#
- response: requests.models.Response
A response object obtained from the API call that contains the rule mining results.
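A minimal sketch of a public API call, assuming a local service key file and the Titanic dataset used in the class example further below; file names and column names are assumptions.

import pandas as pd
from insightsolver.insightsolver import search_best_ruleset_from_API_public

df = pd.read_csv('kaggle_titanic_train.csv')  # example dataset (assumed available locally)

results = search_best_ruleset_from_API_public(
    df                     = df,
    computing_source       = 'auto',
    input_file_service_key = 'name_of_your_service_key.json',  # path to your service key (assumed)
    target_name            = 'Survived',
    target_goal            = 1,
    top_N_rules            = 10,
    verbose                = True,
)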
- insightsolver.insightsolver.get_credits_available(computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None)#
This function is meant to retrieve from the server the number of credits available.
- class insightsolver.insightsolver.InsightSolver(verbose: bool = False, df: DataFrame = None, target_name: str | int | None = None, target_goal: str | Real | uint8 | None = None, columns_types: Dict | None = {}, columns_descr: Dict | None = {}, threshold_M_max: int | None = 10000, specified_constraints: Dict | None = {}, top_N_rules: int | None = 10, filtering_score: str = 'auto', n_benchmark_original: int = 5, n_benchmark_shuffle: int = 20)#
Bases: Mapping
The class InsightSolver is meant to:
- Take input data.
- Make insightsolver API calls to the server.
- Present the results of the rule mining.
Attributes#
- df: DataFrame
The DataFrame that contains the data to analyse.
- target_name: str (default None)
Name of the target variable (by default it’s the first column).
- target_goal: (str or int)
Target goal.
- target_threshold: (int or float)
Threshold used to convert a continuous target variable to a binary target variable.
- M: int
Number of points in the population.
- M0: int
Number of points with target value 0 in the population.
- M1: int
Number of points with target value 1 in the population.
- columns_types: dict
Types of the columns.
- columns_descr: dict
Textual descriptions of the columns.
- other_modalities: dict
Modalities that are mapped to the modality ‘other’.
- threshold_M_max: int (default 10000)
Threshold on the maximum number of observations to consider, above which we undersample the observations.
- specified_constraints: dict
Dictionary of the specified constraints on m_min, m_max, coverage_min, coverage_max.
- top_N_rules: int (default 10)
An integer that specifies the maximum number of rules to get from the rule mining.
- filtering_score: str (default ‘auto’)
A string that specifies the filtering score to be used when selecting rules.
- n_benchmark_original: int (default 5)
An integer that specifies the number of benchmarking runs to execute where the target is not shuffled.
- n_benchmark_shuffle: int (default 20)
An integer that specifies the number of benchmarking runs to execute where the target is shuffled.
- monitoring_metadata: dict
Dictionary of monitoring metadata.
- benchmark_scores: dict
Dictionary of the benchmarking scores against shuffled data.
- rule_mining_results: dict
Dictionary that contains the results of the rule mining.
- _is_fitted: bool
Boolean that tells if the solver is fitted.
Methods#
- __init__: None
Initialization of an instance of the class InsightSolver.
- __str__: None
Converts the solver to a string as provided by the print method.
- ingest_dict: None
Ingests a Python dict.
- ingest_json_string: None
Ingests a JSON string.
- is_fitted: bool
Returns a boolean that tells if the solver is fitted.
- fit: None
Fits the solver.
- S_to_index_points_in_rule: Pandas Index
Returns the index of the points in a rule S.
- S_to_s_points_in_rule: Pandas Series
Returns a boolean Pandas Series that tells if the point is in the rule S.
- S_to_df_filtered: Pandas DataFrame
Returns the filtered df of rows that are in the rule S.
- ruleset_count: int
Counts the number of rules held by the InsightSolver.
- i_to_rule: dict
Gives the rule i of the InsightSolver.
- i_to_S: dict
Returns the rule S for the rule at index i.
- i_to_subrules_dataframe: Pandas DataFrame
Returns a DataFrame containing the information about the subrules of the rule i.
- i_to_feature_contributions_S: Pandas DataFrame
Returns a DataFrame of the feature contributions of the variables in the rule S at position i.
- i_to_readable_text: str
Returns the readable text of the rule i if it is available.
- i_to_print: None
Prints the content of the rule i in the InsightSolver.
- get_range_i: list
Gives the range of i in the InsightSolver.
- print: None
Prints the content of the InsightSolver.
- print_light: None
Prints the content of the InsightSolver (‘light’ mode).
- print_dense: None
Prints the content of the InsightSolver (‘dense’ mode).
- to_dict: dict
Exports the content of the InsightSolver object to a Python dict.
- to_json_string: str
Exports the content of the InsightSolver object to a JSON string.
- to_dataframe: Pandas DataFrame
Exports the rule mining results to a Pandas DataFrame.
- to_csv: str
Exports the rule mining results to a CSV string and/or a local CSV file.
- to_excel: None
Exports the rule mining results to an Excel file.
- to_excel_string: str
Exports the rule mining results to an Excel string.
- get_credits_needed_for_computation: int
Get the number of credits needed for the fitting computation of the solver.
- get_df_credits_infos: Pandas DataFrame
Get a DataFrame of the transactions involving credits.
- get_credits_available: int
Get the number of credits available.
- convert_target_to_binary: pd.Series
Converts the target variable to a binary {0,1}-valued Pandas Series.
- compute_mutual_information: pd.Series
Computes a Pandas Series of the mutual information between features and the target variable.
- to_pdf: str
Generates a PDF containing all visualization figures for the solver.
- to_zip: str
Exports the rule mining results to a ZIP file.
Example#
Here’s a sample code to use the class InsightSolver:

# Specify the service key
service_key = 'name_of_your_service_key.json'

# Import some data
import pandas as pd
df = pd.read_csv('kaggle_titanic_train.csv')

# Specify the name of the target variable
target_name = 'Survived' # We are interested in whether the passengers survived or not

# Specify the target goal
target_goal = 1 # We are searching rules that describe survivors

# Import the class InsightSolver from the module insightsolver
from insightsolver import InsightSolver

# Create an instance of the class InsightSolver
solver = InsightSolver(
    df          = df,          # A dataset
    target_name = target_name, # Name of the target variable
    target_goal = target_goal, # Target goal
)

# Fit the solver
solver.fit(
    service_key = service_key, # Use your API service key here
)

# Print the rule mining results
solver.print()
- __init__(verbose: bool = False, df: DataFrame = None, target_name: str | int | None = None, target_goal: str | Real | uint8 | None = None, columns_types: Dict | None = {}, columns_descr: Dict | None = {}, threshold_M_max: int | None = 10000, specified_constraints: Dict | None = {}, top_N_rules: int | None = 10, filtering_score: str = 'auto', n_benchmark_original: int = 5, n_benchmark_shuffle: int = 20)#
The initialization occurs when an InsightSolver class instance is created.
Parameters#
- verbose: bool (default False)
If we want the initialization to be verbose.
- df: DataFrame
The DataFrame that contains the data to analyse (a target column and various feature columns).
- target_name: str
Name of the column of the target variable.
- target_goal: str (or other modality of the target variable)
Target goal.
- columns_types: dict
Types of the columns.
- columns_descr: dict
Descriptions of the columns.
- threshold_M_max: int
Threshold on the maximum number of observations to consider, above which we sample observations.
- specified_constraints: dict
Dictionary of the specified constraints on m_min, m_max, coverage_min, coverage_max.
- top_N_rules: int (default 10)
An integer that specifies the maximum number of rules to get from the rule mining.
- filtering_score: str (default ‘auto’)
A string that specifies the filtering score to be used when selecting rules.
- n_benchmark_original: int (default 5)
An integer that specifies the number of benchmarking runs to execute where the target is not shuffled.
- n_benchmark_shuffle: int (default 20)
An integer that specifies the number of benchmarking runs to execute where the target is shuffled.
Returns#
- solver: InsightSolver
An instance of the class InsightSolver.
Example#
Here’s a sample code to instantiate the class InsightSolver:

# Import the class InsightSolver from the module insightsolver
from insightsolver import InsightSolver

# Create an instance of the class InsightSolver
solver = InsightSolver(
    df          = df,          # A Pandas DataFrame
    target_name = target_name, # Name of the target variable
    target_goal = target_goal, # Target goal
)
- ingest_dict(d: dict, verbose: bool = False) None#
This method ingests a Python dict into the solver.
- ingest_json_string(json_string: str, verbose: bool = False) None#
This method ingests a JSON string into the solver.
- is_fitted()#
This method returns a boolean that tells if the solver is fitted.
- fit(verbose: bool = False, computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None, api_source: str = 'auto', do_compress_data: bool = True, do_compute_memory_usage: bool = True, do_check_enough_credits: bool = False, do_llm_readable_rules: bool = True, llm_source: str = 'auto', llm_language: str = 'auto', do_store_llm_cache: bool = True, do_check_llm_cache: bool = True) None#
This method aims to fit the solver; a usage sketch follows the parameter list below.
Parameters#
- verbose: bool (default False)
If we want the fitting to be verbose.
- computing_source: str (default ‘auto’)
Specify where the rule mining computation is done (‘local_cloud_function’ or ‘remote_cloud_function’).
- service_key: str (default None)
Path+name of the service key.
- user_email: str (default None)
User email.
- api_source: str (default ‘auto’)
Source of the API call.
- do_compress_data: bool (default True)
If we want to compress the data for the communications with the server.
- do_compute_memory_usage: bool (default True)
If we want to monitor the first thread memory usage on the server side.
- do_check_enough_credits: bool (default False)
Check if there are enough credits to fit the solver.
- do_llm_readable_rules: bool (default True)
If we want to convert the rules to a readable format using an LLM.
- llm_source: str (default ‘auto’)
Source where the LLM is running.
- llm_language: str (default ‘auto’)
Language of the LLM.
- do_store_llm_cache: bool (default True)
If we want to store the result of the LLM in the cache (makes future LLM calls faster).
- do_check_llm_cache: bool (default True)
If we want to check if the results of the prompt are found in the cache (makes LLM calls faster).
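A minimal sketch of a fit call with explicit options, assuming solver is an InsightSolver instance and a local service key file exists (the path is an assumption).

# Fit the solver remotely with explicit options
solver.fit(
    computing_source      = 'remote_cloud_function',
    service_key           = 'name_of_your_service_key.json',  # path to your service key (assumed)
    do_llm_readable_rules = True,
    llm_language          = 'auto',
    verbose               = True,
)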
- S_to_index_points_in_rule(S: dict, verbose: bool = False, df: DataFrame | None = None) Index#
This method returns the index of the points inside a rule S.
- S_to_s_points_in_rule(S: dict, verbose: bool = False, df: DataFrame | None = None) Series#
This method returns a boolean Series that tells if the points are in the rule S or not.
- S_to_df_filtered(S: dict, verbose: bool = False, df: DataFrame | None = None)#
This method returns the DataFrame of rows of df that lie inside a rule S.
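A minimal sketch of inspecting the points covered by a rule, assuming solver is a fitted InsightSolver instance.

# Retrieve the rule S at position 0
S = solver.i_to_S(0)

# Index of the points inside the rule
index_in_rule = solver.S_to_index_points_in_rule(S=S)

# Boolean Series telling which rows fall inside the rule
s_in_rule = solver.S_to_s_points_in_rule(S=S)

# Rows of the training DataFrame that lie inside the rule
df_in_rule = solver.S_to_df_filtered(S=S)
print(df_in_rule.shape)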
- ruleset_count() int#
This method returns the number of rules held in an instance of the solver.
- i_to_rule(i: int) dict#
- i_to_S(i)#
This method returns the rule S at position i.
- i_to_subrules_dataframe(i: int = 0) DataFrame#
This method returns a DataFrame which contains the information about the subrules of the rule i.
- i_to_feature_contributions_S(i: int, do_rename_cols: bool = True, do_ignore_col_rule_S: bool = True) DataFrame#
This method returns a DataFrame of the feature contributions of the variables in the rule S at position i.
- i_to_feature_names(i: int, do_sort: bool = True)#
Returns the list of feature names in the rule at position i. The features are sorted by contribution, descending.
Parameters#
- i: int
Index of the rule in the solver.
- do_sort: bool
If we want to sort the features by contribution, descending.
- i_to_readable_text(i) str | None#
Returns the readable text of the rule i if it is available.
- i_to_print(i: int, indentation: str = '', do_print_shuffling_scores: bool = True, do_print_rule_DataFrame: bool = False, do_print_subrules_S: bool = True, do_show_coverage_diff: bool = False, do_show_cohen_d: bool = True, do_show_wy_ratio: bool = True, do_print_feature_contributions_S: bool = True) None#
This method prints the content of the rule i in the solver.
Parameters#
- i: int
Index of the rule to print.
- indentation: str
Indentation of some printed elements.
- do_print_shuffling_scores: bool
If we want to print the shuffling scores.
- do_print_rule_DataFrame: bool
If we want to print a DataFrame of the rule.
- do_print_subrules_S: bool
If we want to print the DataFrame of subrules.
- do_show_coverage_diff: bool
If we want to show the differences of coverage in the DataFrame of subrules.
- do_show_cohen_d: bool
If we want to show the Cohen d in the DataFrame of subrules.
- do_show_wy_ratio: bool
If we want to show the WY ratio in the DataFrame of subrules.
- do_print_feature_contributions_S: bool
If we want to print the DataFrame of feature contributions.
- get_range_i(complexity_max: int | None = None) list#
This method gives the range of i in the solver. If the integer complexity_max is specified, only this number of elements is returned.
- print(verbose: bool = False, r: int | None = None, do_print_dataset_metadata: bool = True, do_print_monitoring_metadata: bool = False, do_print_benchmark_scores: bool = True, do_print_shuffling_scores: bool = True, do_show_cohen_d: bool = True, do_show_wy_ratio: bool = True, do_print_rule_mining_results: bool = True, do_print_rule_DataFrame: bool = False, do_print_subrules_S: bool = True, do_show_coverage_diff: bool = False, do_print_feature_contributions_S: bool = True, separation_width_between_rules: int | None = 79, do_print_last_separator: bool = True, mode: str = 'full') None#
This method prints the content of the InsightSolver instance.
- print_light(print_format: str = 'list', do_print_shuffling_scores: bool = True, do_print_last_separator: bool = True) None#
This method does a ‘light’ print of the solver.
Two formats:
- 'list': shows the rules via a loop of prints.
- 'compact': shows the rules in a single DataFrame.
- print_dense(do_print_lifts: bool = False, do_print_shuffling_scores: bool = True) None#
This method is aimed at printing a ‘dense’ version of the solver.
Parameters#
- do_print_lifts: bool
If we want to show the lifts.
- do_print_shuffling_scores: bool
If we want to show the shuffling scores.
- to_dict() dict#
This method aims to export the content of the solver to a dictionary.
- to_json_string(verbose=False) str#
This method aims to export the content of the solver to a JSON string.
- to_dataframe(verbose=False, do_append_datetime=False, do_rename_cols=False) DataFrame#
This method aims to export the content of the solver to a DataFrame.
- to_csv(output_file=None, verbose=False, do_rename_cols=False) str#
This method is meant to export the content of the solver to a CSV file.
- to_excel(output_file, verbose=False, do_rename_cols=False) None#
This method is meant to export the solver to an Excel file.
- to_excel_string(verbose=False, do_rename_cols=False) str#
This method is meant to export the solver to an Excel string.
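A minimal sketch of the export methods, assuming solver is a fitted InsightSolver instance; the output file names are illustrative.

# Rule mining results as a Pandas DataFrame
df_rules = solver.to_dataframe()

# Export to CSV and Excel files (file names are illustrative)
solver.to_csv(output_file='rules.csv')
solver.to_excel(output_file='rules.xlsx')

# Full solver content as a Python dict or a JSON string
d = solver.to_dict()
json_string = solver.to_json_string()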
- get_credits_needed_for_computation() int#
This method is meant to compute the number of credits needed for the fitting computation of the solver.
- get_df_credits_infos(computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None) DataFrame#
This method is meant to retrieve from the server the transactions involving credits.
- get_credits_available(computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None) int#
This method is meant to retrieve from the server the number of credits available.
- convert_target_to_binary()#
This method converts the target variable to a binary {0,1}-valued Pandas Series.
To use this method, the attribute solver.target_goal must be populated because it specifies how to convert the target variable to binary. As a reminder, the target goal must be one of the following:
- A modality of the target variable in the case of a categorical (i.e. 'binary' or 'multiclass') target variable.
- 'min', 'min_q0', 'min_q1', 'min_q2', 'min_q3', 'min_c00', 'min_c01', …, 'min_c98', 'min_c99'.
- 'max', 'max_q1', 'max_q2', 'max_q3', 'max_q4', 'max_c01', 'max_c02', …, 'max_c99', 'max_c100'.
Returns#
- s_target: pd.Series
A {0,1}-valued Pandas Series representing the target variable.
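A minimal sketch, assuming solver.target_goal is already populated.

# Convert the target to a {0,1}-valued Series according to solver.target_goal
s_target = solver.convert_target_to_binary()
print(s_target.value_counts())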
- compute_mutual_information(n_samples: int = 1000) Series#
This method computes the mutual information between the features and the target variable. The result is returned as a Pandas Series.
Parameters#
- n_samples: int
An integer that specifies the number of data rows to use in the computation of the mutual information.
Returns#
- s_mi: pd.Series
A Pandas Series that contains the mutual information of the features with the target variable.
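A minimal sketch, assuming solver is an InsightSolver instance with its data and target set.

# Mutual information between each feature and the target, computed on a sample of rows
s_mi = solver.compute_mutual_information(n_samples=1000)
print(s_mi.sort_values(ascending=False))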
- plot(language: str = 'en', do_mutual_information: bool = True, do_banner: bool = True, do_contributions: bool = True, do_distributions: bool = True, do_mosaics_rule_vs_comp: bool = True, do_mosaics_rule_vs_pop: bool = True, do_legend: bool = True) None#
Displays all visualization figures for the solver; a usage sketch follows the parameter list below.
Parameters#
- language: str
Language for the plots (‘en’ or ‘fr’).
- do_mutual_information: bool
Whether to show the mutual information figure.
- do_banner: bool
Whether to show the banner figures.
- do_contributions: bool
Whether to show feature contributions.
- do_distributions: bool
Whether to show feature distributions.
- do_mosaics_rule_vs_comp: bool
Whether to show the mosaics of rule vs complement figures.
- do_mosaics_rule_vs_pop: bool
Whether to show the mosaics of rule vs population figures.
- do_legend: bool
Whether to show the legend figure.
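A minimal sketch, assuming solver is a fitted InsightSolver instance; the option values are illustrative.

# Display all figures for the solver, with French labels and without the legend figure
solver.plot(language='fr', do_legend=False)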
- to_pdf(output_file: str | None = None, verbose: bool = False, do_mutual_information: bool = True, do_banner: bool = True, do_contributions: bool = True, do_distributions: bool = True, do_mosaics_rule_vs_comp: bool = True, do_mosaics_rule_vs_pop: bool = True, do_legend: bool = True, language: str = 'en')#
Export a PDF file containing various results and figures of the solver; a usage sketch follows the Returns section below.
This method is now a simple wrapper around visualization.make_pdf().
Parameters#
- output_file: str, optional
Path where the PDF should be exported.
- verbose: bool, default False
Verbosity.
- do_mutual_information: bool
Include mutual information figure.
- do_banner: bool
Include banner figures.
- do_contributions: bool
Include contribution figures.
- do_distributions: bool
Include distribution figures.
- do_mosaics_rule_vs_comp: bool
Include mosaics of rule vs complement figures.
- do_mosaics_rule_vs_pop: bool
Include mosaics of rule vs population figures.
- do_legend: bool
Include the legend figure.
- language: str
Language for the plots (‘en’ or ‘fr’).
Returns#
- pdf_base64: str
The PDF content encoded as a base64 string, suitable for in-memory use.
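A minimal sketch, assuming solver is a fitted InsightSolver instance; the output file name is illustrative.

import base64

# Export a PDF to disk and keep the base64-encoded content returned by the method
pdf_base64 = solver.to_pdf(output_file='insightsolver_results.pdf', language='en')

# The returned string can also be decoded in memory if needed
pdf_bytes = base64.b64decode(pdf_base64)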
- to_zip(output_file: str | None = None, verbose: bool = False, do_png: bool = True, do_csv: bool = True, do_json: bool = True, do_excel: bool = True, do_pdf: bool = True, language: str = 'en')#
Export the solver content to a ZIP file.
This method is now a simple wrapper around visualization.make_zip().
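A minimal sketch, assuming solver is a fitted InsightSolver instance; the output file name is illustrative.

# Export figures and tabular results to a ZIP archive
solver.to_zip(
    output_file = 'insightsolver_results.zip',
    do_png      = True,
    do_csv      = True,
    do_json     = True,
    do_excel    = True,
    do_pdf      = True,
    language    = 'en',
)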