insightsolver module#

  • Organization: InsightSolver Solutions Inc.

  • Project Name: InsightSolver

  • Module Name: insightsolver

  • File Name: insightsolver.py

  • Author: Noé Aubin-Cadot

  • Email: noe.aubin-cadot@insightsolver.com

Description#

This file contains the InsightSolver class, which is meant to ingest data, specify rule mining parameters, and make rule mining API calls.

Note#

A service key is necessary to use the API client.

License#

Exclusive Use License - see LICENSE for details.


insightsolver.insightsolver.compute_admissible_btypes(M: int, dtype: str, nunique: int, name: str)#

This function computes the admissible btypes a column can take. The possible btypes are:

  • 'binary'

  • 'multiclass'

  • 'continuous'

  • 'ignore'

insightsolver.insightsolver.compute_columns_names_to_admissible_btypes(df: DataFrame) dict[str, list[str]]#

This function computes a dict that maps the column names of df to lists of admissible btypes.
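
Here is a minimal usage sketch for these two helpers, assuming the import path insightsolver.insightsolver and passing arguments that follow the signature of compute_admissible_btypes above (the exact semantics of dtype are an assumption):

# Hypothetical usage sketch based on the signatures documented above.
import pandas as pd
from insightsolver.insightsolver import (
    compute_admissible_btypes,
    compute_columns_names_to_admissible_btypes,
)

df = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age':      [22.0, 38.0, 26.0, 35.0],
    'Sex':      ['male', 'female', 'female', 'male'],
})

# Admissible btypes for a single column (argument values are assumptions).
btypes_age = compute_admissible_btypes(
    M       = len(df),               # Number of observations
    dtype   = str(df['Age'].dtype),  # Column dtype
    nunique = df['Age'].nunique(),   # Number of unique values
    name    = 'Age',                 # Column name
)

# Admissible btypes for every column of the DataFrame.
columns_names_to_admissible_btypes = compute_columns_names_to_admissible_btypes(df)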

insightsolver.insightsolver.validate_class_integrity(df: DataFrame | None, target_name: str | int | None, target_goal: str | Real | uint8 | None, columns_types: Dict | None, columns_descr: Dict | None, threshold_M_max: int | None, specified_constraints: Dict | None, top_N_rules: int | None, filtering_score: str, n_benchmark_original: int, n_benchmark_shuffle: int, do_strict_types: bool = False, verbose: bool = False) dict#

This function validates the integrity of the parameter values passed during the instantiation of the InsightSolver class.

Parameters#

df: DataFrame

The DataFrame that contains the data to analyse (a target column and various feature columns).

target_name: str

Name of the column of the target variable.

target_goal: str (or other modality of the target variable)

Target goal.

columns_types: dict

Types of the columns.

columns_descr: dict

Descriptions of the columns.

threshold_M_max: int

Threshold on the maximum number of observations to consider, above which we sample observations.

specified_constraints: dict

Dictionary of the specified constraints on m_min, m_max, coverage_min, coverage_max.

top_N_rules: int

An integer that specifies the maximum number of rules to get from the rule mining.

filtering_score: str

A string that specifies the filtering score to be used when selecting rules.

n_benchmark_original: int

An integer that specifies the number of benchmarking runs to execute where the target is not shuffled.

n_benchmark_shuffle: int

An integer that specifies the number of benchmarking runs to execute where the target is shuffled.

do_strict_types: bool (default False)

A boolean that specifies if we want a strict evaluation of types.

verbose: bool (default False)

Verbosity.

Returns#

columns_types: dict

A dict of the columns types after adjusting the types.
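
As an illustrative sketch, the validation helper can be called with the same keyword arguments that are passed to the InsightSolver constructor (the values below are placeholders):

# Illustrative sketch: validate the parameters before instantiating the class.
# 'df' is assumed to be a Pandas DataFrame loaded beforehand.
from insightsolver.insightsolver import validate_class_integrity

columns_types = validate_class_integrity(
    df                    = df,
    target_name           = 'Survived',
    target_goal           = 1,
    columns_types         = {},
    columns_descr         = {},
    threshold_M_max       = 10000,
    specified_constraints = {},
    top_N_rules           = 10,
    filtering_score       = 'auto',
    n_benchmark_original  = 5,
    n_benchmark_shuffle   = 20,
    do_strict_types       = False,
    verbose               = False,
)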

insightsolver.insightsolver.format_value(value, format_type='scientific', decimals=1, verbose=False)#

This function formats values depending on the type of value (float or mpmath) and the requested format:

  • 'percentage': shows the values as a percentage.

  • 'scientific': shows the values in scientific notation with 4 decimals (default).

  • 'scientific_no_decimals': shows the values in scientific notation without decimals.
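
A small illustrative call using the parameters of the signature above (the exact returned strings depend on the implementation):

# Illustrative calls; the exact output strings depend on the implementation.
from insightsolver.insightsolver import format_value

print(format_value(0.0314159, format_type='percentage'))
print(format_value(0.0314159, format_type='scientific', decimals=1))
print(format_value(0.0314159, format_type='scientific_no_decimals'))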

insightsolver.insightsolver.S_to_index_points_in_rule(solver, S: dict, verbose: bool = False, df: DataFrame | None = None) Index#

This function takes a rule S and returns the index of the points of a DataFrame that fall inside the rule. If no DataFrame is provided, the one used to train the solver is used.

insightsolver.insightsolver.resolve_language(language: str = 'auto', default_language: str = 'english') str#
insightsolver.insightsolver.gain_to_percent(gain: float, decimals: int = 2) str#

This function formats the gain as a signed percentage (positive or negative).

Parameters#

gain: float

The gain (gain = lift - 1).

decimals: int

Number of decimals to show.
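
For example, with gain = lift - 1 (the values are illustrative and the exact output string depends on the implementation):

# Illustrative sketch: format a gain computed from a lift.
from insightsolver.insightsolver import gain_to_percent

lift = 1.25
gain = lift - 1                            # gain = 0.25
print(gain_to_percent(gain, decimals=2))   # Expected to be a positive percentage
print(gain_to_percent(-0.10, decimals=2))  # Expected to be a negative percentage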

insightsolver.insightsolver.search_best_ruleset_from_API_public(df: DataFrame, computing_source: str = 'auto', input_file_service_key: str | None = None, user_email: str | None = None, target_name: str | int | None = None, target_goal: str | Real | uint8 | None = None, columns_names_to_btypes: Dict | None = {}, columns_names_to_descr: Dict | None = {}, threshold_M_max: int | None = None, specified_constraints: Dict | None = {}, top_N_rules: int | None = 10, verbose: bool = False, filtering_score: str = 'auto', api_source: str = 'auto', do_compress_data: bool = True, do_compute_memory_usage: bool = True, n_benchmark_original: int = 5, n_benchmark_shuffle: int = 20, do_llm_readable_rules: bool = False, llm_source: str = 'remote_gemini', llm_language: str = 'auto', do_store_llm_cache: bool = True, do_check_llm_cache: bool = True) dict#

This function is meant to make a rule mining API call.

Parameters#

df: DataFrame

The DataFrame that contains the data to analyse (a target column and various feature columns).

computing_source: str

Whether the rule mining should be computed locally or remotely.

input_file_service_key: str

The string that specifies the path to the service key (necessary to use the remote Cloud Function from outside GCP).

user_email: str

The email of the user (necessary to use the remote Cloud Function from inside GCP).

target_name: str

Name of the column of the target variable.

target_goal: str (or other modality of the target variable)

Target goal.

columns_names_to_btypes: dict

A dict that specifies the btypes of the columns.

columns_names_to_descr: dict

A dict that specifies the descriptions of the columns.

threshold_M_max: int

Threshold on the maximum number of points to use during the rule mining (maximum of 10000 points in the public API).

specified_constraints: dict

A dict that specifies constraints to be used during the rule mining.

top_N_rules: int

An integer that specifies the maximum number of rules to get from the rule mining.

verbose: bool

Verbosity.

filtering_score: str

A string that specifies the filtering score to be used when selecting rules.

api_source: str

A string used to identify the source of the API call.

do_compress_data: bool

A boolean that specifies if we want to compress the data.

do_compute_memory_usage: bool

A bool that specifies if we want to get the memory usage of the computation.

n_benchmark_original: int

Number of benchmarking runs to execute where the target is not shuffled.

n_benchmark_shuffle: int

Number of benchmarking runs to execute where the target is shuffled.

do_llm_readable_rules: bool

If we want to convert the rules to a readable format using an LLM.

llm_source: str

Source where the LLM is running.

do_store_llm_cache: bool

If we want to store the result of the LLM in the cache (makes future LLM calls faster).

do_check_llm_cache: bool

If we want to check if the results of the prompt are found in the cache (makes LLM calls faster).

Returns#

response: requests.models.Response

A response object obtained from the API call that contains the rule mining results.
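
Here is a sketch of a public API call based on the signature above; the file name, service key path and column names are placeholders taken from the class example further below:

# Sketch of a rule mining API call through the public function.
import pandas as pd
from insightsolver.insightsolver import search_best_ruleset_from_API_public

df = pd.read_csv('kaggle_titanic_train.csv')

results = search_best_ruleset_from_API_public(
    df                     = df,
    input_file_service_key = 'name_of_your_service_key.json',  # Service key (placeholder)
    target_name            = 'Survived',
    target_goal            = 1,
    top_N_rules            = 10,
    verbose                = True,
)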

insightsolver.insightsolver.get_credits_available(computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None)#

This function is meant to retrieve from the server the amount of credits available.

class insightsolver.insightsolver.InsightSolver(verbose: bool = False, df: DataFrame = None, target_name: str | int | None = None, target_goal: str | Real | uint8 | None = None, columns_types: Dict | None = {}, columns_descr: Dict | None = {}, threshold_M_max: int | None = 10000, specified_constraints: Dict | None = {}, top_N_rules: int | None = 10, filtering_score: str = 'auto', n_benchmark_original: int = 5, n_benchmark_shuffle: int = 20)#

Bases: Mapping

The class InsightSolver is meant to:

  1. Take input data.

  2. Make an insightsolver API call to the server.

  3. Present the results of the rule mining.

Attributes#

df: DataFrame

The DataFrame that contains the data to analyse.

target_name: str (default None)

Name of the target variable (by default it’s the first column).

target_goal: (str or int)

Target goal.

target_threshold: (int or float)

Threshold used to convert a continuous target variable to a binary target variable.

M: int

Number of points in the population.

M0: int

Number of points 0 in the population.

M1: int

Number of points 1 in the population.

columns_types: dict

Types of the columns.

columns_descr: dict

Textual descriptions of the columns.

other_modalities: dict

Modalities that are mapped to the modality ‘other’.

threshold_M_max: int (default 10000)

Threshold on the maximum number of observations to consider, above which we undersample the observations.

specified_constraints: dict

Dictionary of the specified constraints on m_min, m_max, coverage_min, coverage_max.

top_N_rules: int (default 10)

An integer that specifies the maximum number of rules to get from the rule mining.

filtering_score: str (default ‘auto’)

A string that specifies the filtering score to be used when selecting rules.

n_benchmark_original: int (default 5)

An integer that specifies the number of benchmarking runs to execute where the target is not shuffled.

n_benchmark_shuffle: int (default 20)

An integer that specifies the number of benchmarking runs to execute where the target is shuffled.

monitoring_metadata: dict

Dictionary of monitoring metadata.

benchmark_scores: dict

Dictionary of the benchmarking scores against shuffled data.

rule_mining_results: dict

Dictionary that contains the results of the rule mining.

_is_fitted: bool

Boolean that tells if the solver is fitted.

Methods#

__init__: None

Initialization of an instance of the class InsightSolver.

__str__: str

Converts the solver to a string, as displayed by the print method.

ingest_dict: None

Ingests a Python dict.

ingest_json_string: None

Ingests a JSON string.

is_fitted: bool

Returns a boolean that tells if the solver is fitted.

fit: None

Fits the solver.

S_to_index_points_in_rule: Pandas Index

Returns the index of the points in a rule S.

S_to_s_points_in_rule: Pandas Series

Returns a boolean Pandas Series that tells if the point is in the rule S.

S_to_df_filtered: Pandas DataFrame

Returns the filtered df of rows that are in the rule S.

ruleset_count: int

Counts the number of rules held by the InsightSolver.

i_to_rule: dict

Gives the rule i of the InsightSolver.

i_to_S: dict

Returns the rule S for the rule at index i.

i_to_subrules_dataframe: Pandas DataFrame

Returns a DataFrame containing the information about the subrules of the rule i.

i_to_feature_contributions_S: Pandas DataFrame

Returns a DataFrame of the feature contributions of the variables in the rule S at position i.

i_to_readable_text: str

Returns the readable text of the rule i if it is available.

i_to_print: None

Prints the content of the rule i in the InsightSolver.

get_range_i: list

Gives the range of i in the InsightSolver.

print: None

Prints the content of the InsightSolver.

print_light: None

Prints the content of the InsightSolver (‘light’ mode).

print_dense: None

Prints the content of the InsightSolver (‘dense’ mode).

to_dict: dict

Exports the content of the InsightSolver object to a Python dict.

to_json_string: str

Exports the content of the InsightSolver object to a JSON string.

to_dataframe: Pandas DataFrame

Exports the rule mining results to a Pandas DataFrame.

to_csv: str

Exports the rule mining results to a CSV string and/or a local CSV file.

to_excel: None

Exports the rule mining results to an Excel file.

to_excel_string: str

Exports the rule mining results to an Excel string.

get_credits_needed_for_computation: int

Get the number of credits needed for the fitting computation of the solver.

get_df_credits_infos: Pandas DataFrame

Get a DataFrame of the transactions involving credits.

get_credits_available: int

Get the number of credits available.

convert_target_to_binary: pd.Series

Converts the target variable to a binary {0,1}-valued Pandas Series.

compute_mutual_information: pd.Series

Computes a Pandas Series of the mutual information between features and the target variable.

to_pdf: str

Generates a PDF containing all visualization figures for the solver.

to_zip: str

Exports the rule mining results to a ZIP file.

Example#

Here’s some sample code to use the class InsightSolver:

# Specify the service key
service_key = 'name_of_your_service_key.json'

# Import some data
import pandas as pd
df = pd.read_csv('kaggle_titanic_train.csv')

# Specify the name of the target variable
target_name = 'Survived' # We are interested in whether the passengers survived or not

# Specify the target goal
target_goal = 1 # We are searching rules that describe survivors

# Import the class InsightSolver from the module insightsolver
from insightsolver import InsightSolver

# Create an instance of the class InsightSolver
solver = InsightSolver(
    df          = df,          # A dataset
    target_name = target_name, # Name of the target variable
    target_goal = target_goal, # Target goal
)

# Fit the solver
solver.fit(
    service_key = service_key, # Use your API service key here
)

# Print the rule mining results
solver.print()
__init__(verbose: bool = False, df: DataFrame = None, target_name: str | int | None = None, target_goal: str | Real | uint8 | None = None, columns_types: Dict | None = {}, columns_descr: Dict | None = {}, threshold_M_max: int | None = 10000, specified_constraints: Dict | None = {}, top_N_rules: int | None = 10, filtering_score: str = 'auto', n_benchmark_original: int = 5, n_benchmark_shuffle: int = 20)#

The initialization occurs when an InsightSolver class instance is created.

Parameters#

verbose: bool (default False)

If we want the initialization to be verbose.

df: DataFrame

The DataFrame that contains the data to analyse (a target column and various feature columns).

target_name: str

Name of the column of the target variable.

target_goal: str (or other modality of the target variable)

Target goal.

columns_types: dict

Types of the columns.

columns_descr: dict

Descriptions of the columns.

threshold_M_max: int

Threshold on the maximum number of observations to consider, above which we sample observations.

specified_constraints: dict

Dictionary of the specified constraints on m_min, m_max, coverage_min, coverage_max.

top_N_rules: int (default 10)

An integer that specifies the maximum number of rules to get from the rule mining.

filtering_score: str (default ‘auto’)

A string that specifies the filtering score to be used when selecting rules.

n_benchmark_original: int (default 5)

An integer that specifies the number of benchmarking runs to execute where the target is not shuffled.

n_benchmark_shuffle: int (default 20)

An integer that specifies the number of benchmarking runs to execute where the target is shuffled.

Returns#

solver: InsightSolver

An instance of the class InsightSolver.

Example#

Here’s some sample code to instantiate the class InsightSolver:

# Import the class InsightSolver from the module insightsolver
from insightsolver import InsightSolver

# Create an instance of the class InsightSolver
solver = InsightSolver(
    df          = df,          # A Pandas DataFrame
    target_name = target_name, # Name of the target variable
    target_goal = target_goal, # Target goal
)
ingest_dict(d: dict, verbose: bool = False) None#

This method aims to ingest a Python dict into the solver.

ingest_json_string(json_string: str, verbose: bool = False) None#

This method aims to ingest a JSON string into the solver.
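
A round-trip sketch combining these ingestion methods with the export methods to_dict and to_json_string documented below (solver is assumed to be an already fitted InsightSolver instance):

# Round-trip sketch: export a fitted solver and re-ingest it into fresh instances.
from insightsolver import InsightSolver

d = solver.to_dict()                  # 'solver' is assumed to be a fitted InsightSolver
solver_from_dict = InsightSolver()
solver_from_dict.ingest_dict(d)

json_string = solver.to_json_string() # The same round trip through a JSON string
solver_from_json = InsightSolver()
solver_from_json.ingest_json_string(json_string)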

is_fitted()#

This method returns a boolean that tells if the solver is fitted.

fit(verbose: bool = False, computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None, api_source: str = 'auto', do_compress_data: bool = True, do_compute_memory_usage: bool = True, do_check_enough_credits: bool = False, do_llm_readable_rules: bool = True, llm_source: str = 'auto', llm_language: str = 'auto', do_store_llm_cache: bool = True, do_check_llm_cache: bool = True) None#

This method aims to fit the solver.

Parameters#

verbose: bool (default False)

If we want the fitting to be verbose.

computing_source: str (default ‘auto’)

Specify where the rule mining computation is done (‘local_cloud_function’ or ‘remote_cloud_function’).

service_key: str (default None)

Path+name of the service key.

user_email: str (default None)

User email.

api_source: str (default ‘auto’)

Source of the API call.

do_compress_data: bool (default True)

If we want to compress the data for the communications with the server.

do_compute_memory_usage: bool (default True)

If we want to monitor the memory usage of the first thread on the server side.

do_check_enough_credits: bool (default False)

Check if there are enough credits to fit the solver.

do_llm_readable_rules: bool (default True)

If we want to convert the rules to a readable format using an LLM.

llm_source: str (default ‘auto’)

Source where the LLM is running.

llm_language: str (default ‘auto’)

Language of the LLM.

do_store_llm_cache: bool (default True)

If we want to store the result of the LLM in the cache (makes future LLM calls faster).

do_check_llm_cache: bool (default True)

If we want to check if the results of the prompt are found in the cache (makes LLM calls faster).
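
Here is a sketch of a fit call that overrides a few of the parameters above (the service key path is a placeholder):

# Sketch of a fit call; only a few parameters are overridden here.
solver.fit(
    service_key           = 'name_of_your_service_key.json',  # Path to your service key (placeholder)
    computing_source      = 'auto',
    do_llm_readable_rules = True,   # Convert rules to readable text with an LLM
    llm_language          = 'auto',
    verbose               = True,
)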

S_to_index_points_in_rule(S: dict, verbose: bool = False, df: DataFrame | None = None) Index#

This method returns the index of the points inside a rule S.

S_to_s_points_in_rule(S: dict, verbose: bool = False, df: DataFrame | None = None) Series#

This method returns a boolean Series that tells if the points are in the rule S or not.

S_to_df_filtered(S: dict, verbose: bool = False, df: DataFrame | None = None)#

This method returns the DataFrame of rows of df that lie inside a rule S.
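
These three methods accept the same rule dict S, which can be obtained from a fitted solver via i_to_S (documented below). A minimal sketch:

# Sketch: retrieve the rule at position 0 and select the rows it covers.
S = solver.i_to_S(0)                                   # Rule dict of the first rule

index_in_rule = solver.S_to_index_points_in_rule(S)    # Pandas Index of the covered points
s_in_rule     = solver.S_to_s_points_in_rule(S)        # Boolean Pandas Series
df_in_rule    = solver.S_to_df_filtered(S)             # Filtered DataFrame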

ruleset_count() int#

This method returns the number of rules held in an instance of the solver.

i_to_rule(i: int) dict#
i_to_S(i)#

This method returns the rule S at position i.

i_to_subrules_dataframe(i: int = 0) DataFrame#

This method returns a DataFrame which contains the information about the subrules of the rule i.

i_to_feature_contributions_S(i: int, do_rename_cols: bool = True, do_ignore_col_rule_S: bool = True) DataFrame#

This method returns a DataFrame of the feature contributions of the variables in the rule S at position i.

i_to_feature_names(i: int, do_sort: bool = True)#

Returns the list of feature names in the rule at position i. The features are sorted by contribution in descending order.

Parameters#

i: int

Index of the rule in the solver.

do_sort: bool

If we want to sort the features by contribution, descending.

i_to_readable_text(i) str | None#

Returns the readable text of the rule i if it is available.

i_to_print(i: int, indentation: str = '', do_print_shuffling_scores: bool = True, do_print_rule_DataFrame: bool = False, do_print_subrules_S: bool = True, do_show_coverage_diff: bool = False, do_show_cohen_d: bool = True, do_show_wy_ratio: bool = True, do_print_feature_contributions_S: bool = True) None#

This method prints the content of the rule i in the solver.

Parameters#

i: int

Index of the rule to print.

indentation: str

Indentation of some printed elements.

do_print_shuffling_scores: bool

If we want to print the shuffling scores.

do_print_rule_DataFrame: bool

If we want to print a DataFrame of the rule.

do_print_subrules_S: bool

If we want to print the DataFrame of subrules.

do_show_coverage_diff: bool

If we want to show the differences of coverage in the DataFrame of subrules.

do_show_cohen_d: bool

If we want to show the Cohen’s d in the DataFrame of subrules.

do_show_wy_ratio: bool

If we want to show the WY ratio in the DataFrame of subrules.

do_print_feature_contributions_S: bool

If we want to print the DataFrame of feature contributions.

get_range_i(complexity_max: int | None = None) list#

This method gives the range of i in the solver. If the integer complexity_max is specified, only that number of elements is returned.
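
A sketch that iterates over the rules of a fitted solver using get_range_i together with the i_to_* accessors documented above:

# Sketch: loop over the rules of a fitted solver.
for i in solver.get_range_i():
    text = solver.i_to_readable_text(i)   # Readable text (None if not available)
    if text is not None:
        print(f'Rule {i}: {text}')
    else:
        solver.i_to_print(i)              # Fall back to the detailed printout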

print(verbose: bool = False, r: int | None = None, do_print_dataset_metadata: bool = True, do_print_monitoring_metadata: bool = False, do_print_benchmark_scores: bool = True, do_print_shuffling_scores: bool = True, do_show_cohen_d: bool = True, do_show_wy_ratio: bool = True, do_print_rule_mining_results: bool = True, do_print_rule_DataFrame: bool = False, do_print_subrules_S: bool = True, do_show_coverage_diff: bool = False, do_print_feature_contributions_S: bool = True, separation_width_between_rules: int | None = 79, do_print_last_separator: bool = True, mode: str = 'full') None#

This method prints the content of the InsightSolver instance.

print_light(print_format: str = 'list', do_print_shuffling_scores: bool = True, do_print_last_separator: bool = True) None#

This method does a ‘light’ print of the solver.

Two formats:

  • 'list': shows the rules via a loop of prints.

  • 'compact': shows the rules in a single DataFrame.

print_dense(do_print_lifts: bool = False, do_print_shuffling_scores: bool = True) None#

This method is aimed at printing a ‘dense’ version of the solver.

Parameters#

do_print_lifts: bool

If we want to show the lifts.

do_print_shuffling_scores: bool

If we want to show the shuffling scores.
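
A sketch of the three printing methods on a fitted solver:

# Sketch: different levels of detail when printing a fitted solver.
solver.print()                               # Full printout
solver.print_light(print_format='compact')   # Rules shown in a single DataFrame
solver.print_dense(do_print_lifts=True)      # Dense printout including the lifts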

to_dict() dict#

This method aims to export the content of the solver to a dictionary.

to_json_string(verbose=False) str#

This method aims to export the content of the solver to a JSON string.

to_dataframe(verbose=False, do_append_datetime=False, do_rename_cols=False) DataFrame#

This method aims to export the content of the solver to a DataFrame.

to_csv(output_file=None, verbose=False, do_rename_cols=False) str#

This method is meant to export the content of the solver to a CSV file.

to_excel(output_file, verbose=False, do_rename_cols=False) None#

This method is meant to export the solver to an Excel file.

to_excel_string(verbose=False, do_rename_cols=False) str#

This method is meant to export the solver to an Excel string.
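
A sketch of the tabular export methods on a fitted solver (the output file names are placeholders):

# Sketch: export the rule mining results to tabular formats.
df_results   = solver.to_dataframe(do_rename_cols=True)
csv_string   = solver.to_csv(output_file='rule_mining_results.csv')
solver.to_excel(output_file='rule_mining_results.xlsx')
excel_string = solver.to_excel_string()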

get_credits_needed_for_computation() int#

This method is meant to compute the number of credits needed for the fitting computation of the solver.

get_df_credits_infos(computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None) DataFrame#

This method is meant to retrieve from the server the transactions involving credits.

get_credits_available(computing_source: str = 'auto', service_key: str | None = None, user_email: str | None = None) int#

This method is meant to retrieve from the server the amount of credits available.
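
A sketch of the credits-related methods (the service key path is a placeholder):

# Sketch: inspect the credits before fitting the solver.
service_key = 'name_of_your_service_key.json'   # Placeholder

credits_needed    = solver.get_credits_needed_for_computation()
credits_available = solver.get_credits_available(service_key=service_key)
df_credits_infos  = solver.get_df_credits_infos(service_key=service_key)

if credits_available >= credits_needed:
    solver.fit(service_key=service_key)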

convert_target_to_binary()#

This method converts the target variable to a binary {0,1}-valued Pandas Series.

To use this method, the attribute solver.target_goal must be populated because it specifies how to convert the target variable to binary. As a reminder, the target goal must be one of the following:

  • A modality of the target variable in the case of a categorical (i.e. 'binary' or 'multiclass') target variable.

  • 'min', 'min_q0', 'min_q1', 'min_q2', 'min_q3', 'min_c00', 'min_c01', …, 'min_c98', 'min_c99'.

  • 'max', 'max_q1', 'max_q2', 'max_q3', 'max_q4', 'max_c01', 'max_c02', …, 'max_c99', 'max_c100'.

Returns#

s_target: pd.Series

A {0,1}-valued Pandas Series representing the target variable.
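
A sketch for a continuous target, where a quantile-based goal is used to binarize it (the column name and goal are illustrative):

# Sketch: binarize a continuous target using a quantile-based target goal.
from insightsolver import InsightSolver

solver = InsightSolver(
    df          = df,        # DataFrame with a continuous 'Price' column (illustrative)
    target_name = 'Price',
    target_goal = 'max_q4',  # Focus on the top quartile of the target
)
s_target = solver.convert_target_to_binary()   # {0,1}-valued Pandas Series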

compute_mutual_information(n_samples: int = 1000) Series#

This method computes the mutual information between the features and the target variable. The result is returned as a Pandas Series.

Parameters#

n_samples: int

An integer that specifies the number of data rows to use in the computation of the mutual information.

Returns#

s_mi: pd.Series

A Pandas Series that contains the mutual information of the features with the target variable.
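
A sketch of the mutual information computation on an instantiated solver:

# Sketch: compute and inspect the mutual information of the features.
s_mi = solver.compute_mutual_information(n_samples=1000)
print(s_mi.sort_values(ascending=False).head(10))   # Top 10 most informative features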

plot(language: str = 'en', do_mutual_information: bool = True, do_banner: bool = True, do_contributions: bool = True, do_distributions: bool = True, do_mosaics_rule_vs_comp: bool = True, do_mosaics_rule_vs_pop: bool = True, do_legend: bool = True) None#

Displays all visualization figures for the solver.

Parameters#

language: str

Language for the plots (‘en’ or ‘fr’).

do_mutual_information: bool

Whether to show the mutual information figure.

do_banner: bool

Whether to show the banner figures.

do_contributions: bool

Whether to show feature contributions.

do_distributions: bool

Whether to show feature distributions.

do_mosaics_rule_vs_comp: bool

Whether to show the mosaics of rule vs complement figures.

do_mosaics_rule_vs_pop: bool

Whether to show the mosaics of rule vs population figures.

do_legend: bool

Whether to show the legend figure.
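
A sketch that displays only a subset of the figures of a fitted solver:

# Sketch: plot a subset of the figures, in English.
solver.plot(
    language                = 'en',
    do_mutual_information   = True,
    do_banner               = True,
    do_contributions        = True,
    do_distributions        = False,   # Skip the distributions figures
    do_mosaics_rule_vs_comp = False,   # Skip the rule vs complement mosaics
    do_mosaics_rule_vs_pop  = False,   # Skip the rule vs population mosaics
    do_legend               = True,
)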

to_pdf(output_file: str | None = None, verbose: bool = False, do_mutual_information: bool = True, do_banner: bool = True, do_contributions: bool = True, do_distributions: bool = True, do_mosaics_rule_vs_comp: bool = True, do_mosaics_rule_vs_pop: bool = True, do_legend: bool = True, language: str = 'en')#

Export a PDF file containing various results and figures of the solver.

This method is now a simple wrapper around visualization.make_pdf().

Parameters#

output_file: str, optional

Path where the PDF should be exported.

verbose: bool, default False

Verbosity.

do_mutual_information: bool

Include mutual information figure.

do_banner: bool

Include banner figures.

do_contributions: bool

Include contribution figures.

do_distributions: bool

Include distribution figures.

do_mosaics_rule_vs_comp: bool

Include mosaics of rule vs complement figures.

do_mosaics_rule_vs_pop: bool

Include mosaics of rule vs population figures.

do_legend: bool

Include the legend figure.

language: str

Language for the plots (‘en’ or ‘fr’).

Returns#

pdf_base64: str

The PDF content encoded as a base64 string, suitable for in-memory use.
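
A sketch of the PDF export; the returned base64 string can also be decoded for in-memory use (the output file name is a placeholder):

# Sketch: export a PDF report and decode the returned base64 string.
import base64

pdf_base64 = solver.to_pdf(
    output_file      = 'insightsolver_report.pdf',  # Placeholder path
    do_distributions = False,                       # Skip the distributions figures
    language         = 'en',
)
pdf_bytes = base64.b64decode(pdf_base64)            # Raw PDF bytes for in-memory use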

to_zip(output_file: str | None = None, verbose: bool = False, do_png: bool = True, do_csv: bool = True, do_json: bool = True, do_excel: bool = True, do_pdf: bool = True, language: str = 'en')#

Export the solver content to a ZIP file.

This method is now a simple wrapper around visualization.make_zip().
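
A sketch of the ZIP export (the output file name is a placeholder):

# Sketch: export the solver content to a ZIP archive with selected artifacts.
solver.to_zip(
    output_file = 'insightsolver_results.zip',  # Placeholder path
    do_png      = True,
    do_csv      = True,
    do_json     = True,
    do_excel    = True,
    do_pdf      = False,   # Skip the PDF to speed up the export
    language    = 'en',
)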