lale.lib.aif360.datasets module

Fetcher methods to load fairness datasets and provide fairness_info for them.

See the notebook demo_fairness_datasets for an example of using these functions, along with some tables and figures about the datasets. There is also an arXiv paper about these datasets. Some of the fetcher methods have a preprocess argument that defaults to False. The notebook does not use that argument, instead demonstrating how to do any required preprocessing in the context of a Lale pipeline. Most of the datasets come from OpenML; a few come from MEPS (meps.ahrq.gov) or ProPublica; and most of them have been used in various papers. The Lale library does not distribute the datasets themselves; it only provides methods for downloading them.
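
All of the fetchers share the same calling convention: they return a tuple of features X, labels y, and a fairness_info dictionary. The following minimal sketch (illustrative, not taken from the notebook) shows that pattern; it assumes network access to OpenML and that fairness_info uses the favorable_labels and protected_attributes keys expected by the metrics and mitigation operators in lale.lib.aif360.

    # illustrative sketch of the common fetcher calling convention
    from lale.lib.aif360.datasets import fetch_creditg_df

    X, y, fairness_info = fetch_creditg_df(preprocess=True)
    print(X.shape, y.shape)                        # features and labels as pandas objects
    print(fairness_info["favorable_labels"])       # outcomes treated as favorable
    print(fairness_info["protected_attributes"])   # protected features and privileged groups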

lale.lib.aif360.datasets.fetch_adult_df(preprocess: bool = False)[source]

Fetch the adult dataset from OpenML and add fairness_info. It contains information about individuals from the 1994 U.S. census. The prediction task is binary classification on whether a person's income exceeds $50K per year. Without preprocessing, the dataset has 48,842 rows and 14 columns. There are two protected attributes, sex and race, and the disparate impact is 0.23. The data includes both categorical and numeric columns and has some missing values. A short usage sketch follows this entry.

Parameters

preprocess (boolean, optional, default False) – If True, impute missing values; encode protected attributes in X as 0 or 1 to indicate privileged groups; encode labels in y as 0 or 1 to indicate favorable outcomes; and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple
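
The disparate impact quoted above can be reproduced, roughly, with the metric factory in lale.lib.aif360. This is a hedged sketch, not taken from this documentation: it assumes that disparate_impact(**fairness_info) returns a scorer object with a score_data method, as used in the Lale fairness notebooks.

    # sketch: reproduce the dataset-level disparate impact reported above
    # (assumes disparate_impact(**fairness_info) returns a scorer with a
    # score_data method, as in the Lale fairness notebooks)
    from lale.lib.aif360 import disparate_impact
    from lale.lib.aif360.datasets import fetch_adult_df

    X, y, fairness_info = fetch_adult_df(preprocess=True)
    di_scorer = disparate_impact(**fairness_info)
    print(di_scorer.score_data(X=X, y_pred=y))  # compare with the 0.23 quoted above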

lale.lib.aif360.datasets.fetch_bank_df(preprocess: bool = False)[source]

Fetch the bank-marketing dataset from OpenML and add fairness_info.

It contains information from marketing campaigns of a Portuguese bank. The prediction task is binary classification on whether the client will subscribe to a term deposit. Without preprocessing, the dataset has 45,211 rows and 16 columns. There is one protected attribute, age, and the disparate impact is 0.84. The data includes both categorical and numeric columns, with no missing values.

Parameters

preprocess (boolean, optional, default False) – If True, encode protected attributes in X as 0 or 1 to indicate privileged groups; encode labels in y as 0 or 1 to indicate favorable outcomes; and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_compas_df(preprocess: bool = False)[source]

Fetch the compas-two-years dataset, also known as ProPublica recidivism, from GitHub and add fairness_info.

It contains information about individuals with a binary classification for recidivism, indicating whether they were re-arrested within two years after the first arrest. Without preprocessing, the dataset has 6,172 rows and 51 columns. There are two protected attributes, sex and race, and the disparate impact is 0.75. The data includes numeric and categorical columns, with some missing values.

Parameters

preprocess (boolean, optional, default False) – If True, encode protected attributes in X as 0 or 1 to indicate privileged groups (1 if Female or Caucasian for the corresponding sex and race columns respectively); and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_compas_violent_df(preprocess: bool = False)[source]

Fetch the compas-two-years-violent dataset, also known as ProPublica violent recidivism, from GitHub and add fairness_info.

It contains information about individuals with a binary classification for violent recidivism, indicating whether they were re-arrested within two years after the first arrest. Without preprocessing, the dataset has 4,020 rows and 51 columns. There are three protected attributes, sex, race, and age, and the disparate impact is 0.85. The data includes numeric and categorical columns, with some missing values.

Parameters

preprocess (boolean, optional, default False) – If True, encode protected attributes in X as 0 or 1 to indicate privileged groups (1 if Female, Caucasian, or at least 25 for the corresponding sex, race, and age columns respectively); and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_creditg_df(preprocess: bool = False)[source]

Fetch the credit-g dataset from OpenML and add fairness_info.

It contains information about individuals with a binary classification into good or bad credit risks. Without preprocessing, the dataset has 1,000 rows and 20 columns. There are two protected attributes, personal_status/sex and age, and the disparate impact is 0.75. The data includes both categorical and numeric columns, with no missing values. A sketch of the effect of the preprocess flag follows this entry.

Parameters

preprocess (boolean, optional, default False) – If True, encode protected attributes in X as 0 or 1 to indicate privileged groups; encode labels in y as 0 or 1 to indicate favorable outcomes; and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple
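
As a rough illustration of the preprocess flag described above (illustrative only, not part of the API), fetching the dataset twice shows how encoding changes the shape of X while the number of rows stays the same:

    # sketch: compare the raw and preprocessed versions of credit-g
    from lale.lib.aif360.datasets import fetch_creditg_df

    X_raw, y_raw, info = fetch_creditg_df()                 # 1,000 rows, 20 mixed-type columns
    X_prep, y_prep, _ = fetch_creditg_df(preprocess=True)   # protected attributes and labels as 0/1,
                                                            # remaining categoricals one-hot encoded
    print(X_raw.shape, X_prep.shape)   # same row count, more columns after one-hot encoding
    print(sorted(y_prep.unique()))     # [0, 1]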

lale.lib.aif360.datasets.fetch_default_credit_df()[source]

Fetch the Default of Credit Card Clients Dataset from OpenML and add fairness_info. The prediction task is binary classification on whether the customer defaults in the next month (1) or not (0). The dataset has 30,000 rows and 24 columns, all numeric. The protected attribute is sex and the disparate impact is 0.957.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_heart_disease_df()[source]

Fetch the heart-disease dataset from OpenML and add fairness_info. The prediction task is binary classification for heart disease on the Cleveland database, with 303 rows and 13 columns, all numeric. The protected attribute is age and the disparate impact is 0.589.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_law_school_df()[source]

Fetch the law school dataset from OpenML and add fairness_info. This function returns both X and y unchanged, since the dataset was already binarized by the OpenML contributors, with the target of predicting whether the GPA is greater than 3. The protected attribute is race1 and the disparate impact is 0.704. The dataset has 20,800 rows and 11 columns (5 categorical and 6 numeric columns).

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_meps_panel19_fy2015_df(preprocess: bool = False)[source]

Fetch a subset of the MEPS dataset from aif360 and add fairness_info.

It contains information collected on a nationally representative sample of the civilian noninstitutionalized population of the United States, specifically reported medical expenditures and civilian demographics. This dataframe corresponds to data from panel 19 from the year 2015. Without preprocessing, the dataframe contains 16,578 rows and 1,825 columns. (With preprocessing, the dataframe contains 15,830 rows and 138 columns.) There is one protected attribute, race, and the disparate impact is 0.496 if preprocessing is not applied and 0.490 if preprocessing is applied. The data includes numeric and categorical columns, with some missing values.

Note: in order to use this dataset, be sure to follow the instructions found in the AIF360 documentation and accept the corresponding license agreement; a sketch of what happens before those steps are completed follows this entry.

Parameters

preprocess (boolean, optional, default False) – If True, encode the protected attribute race in X as 0 or 1 to indicate privileged groups; encode labels in y as 0 or 1 to indicate favorable outcomes; rename columns that are panel- or round-specific; drop columns, such as ID columns, that are not relevant to the task at hand; and drop rows where features are unknown.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple
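
The MEPS fetchers rely on raw data files that AIF360 expects in its own data directory; until those files are downloaded as described in the note above, the underlying loader typically fails with a message that explains the required steps. A defensive sketch (illustrative only):

    # sketch: the MEPS data files must first be placed where aif360 expects them
    # (see the note above); otherwise the fetcher raises an informative error
    from lale.lib.aif360.datasets import fetch_meps_panel19_fy2015_df

    try:
        X, y, fairness_info = fetch_meps_panel19_fy2015_df(preprocess=True)
    except Exception as exc:  # aif360 usually reports the download instructions here
        print(exc)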

lale.lib.aif360.datasets.fetch_meps_panel20_fy2015_df(preprocess: bool = False)[source]

Fetch a subset of the MEPS dataset from aif360 and add fairness_info.

It contains information collected on a nationally representative sample of the civilian noninstitutionalized population of the United States, specifically reported medical expenditures and civilian demographics. This dataframe corresponds to data from panel 20 from the year 2015. Without preprocessing, the dataframe contains 18,849 rows and 1,825 columns. (With preprocessing, the dataframe contains 17,570 rows and 138 columns.) There is one protected attribute, race, and the disparate impact is 0.493 if preprocessing is not applied and 0.488 if preprocessing is applied. The data includes numeric and categorical columns, with some missing values.

Note: in order to use this dataset, be sure to follow the instructions found in the AIF360 documentation and accept the corresponding license agreement.

Parameters

preprocess (boolean, optional, default False) – If True, encode the protected attribute race in X as 0 or 1 to indicate privileged groups; encode labels in y as 0 or 1 to indicate favorable outcomes; rename columns that are panel- or round-specific; drop columns, such as ID columns, that are not relevant to the task at hand; and drop rows where features are unknown.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_meps_panel21_fy2016_df(preprocess: bool = False)[source]

Fetch a subset of the MEPS dataset from aif360 and add fairness_info.

It contains information collected on a nationally representative sample of the civilian noninstitutionalized population of the United States, specifically reported medical expenditures and civilian demographics. This dataframe corresponds to data from panel 21 from the year 2016. Without preprocessing, the dataframe contains 17,052 rows and 1,936 columns. (With preprocessing, the dataframe contains 15,675 rows and 138 columns.) There is one protected attribute, race, and the disparate impact is 0.462 if preprocessing is not applied and 0.451 if preprocessing is applied. The data includes numeric and categorical columns, with some missing values.

Note: in order to use this dataset, be sure to follow the instructions found in the AIF360 documentation and accept the corresponding license agreement.

Parameters

preprocess (boolean, optional, default False) – If True, encode the protected attribute race in X as 0 or 1 to indicate privileged groups; encode labels in y as 0 or 1 to indicate favorable outcomes; rename columns that are panel- or round-specific; drop columns, such as ID columns, that are not relevant to the task at hand; and drop rows where features are unknown.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_nlsy_df()[source]

Fetch the National Longitudinal Survey of Youth (NLSY) dataset, also known as the “University of Michigan Health and Retirement Study (HRS)”, from OpenML and add fairness_info.

It is a binary classification to predict whether the income at a certain time exceeds a threshold, with 4,908 rows and 15 columns (comprising 6 categorical and 9 numerical columns). The protected attributes are age and gender and the disparate impact is 0.668.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_nursery_df(preprocess: bool = False)[source]

Fetch the nursery dataset from OpenML and add fairness_info.

It contains data gathered from applicants to public schools in Ljubljana, Slovenia during a competitive time period. Without preprocessing, the dataset has 12,960 rows and 8 columns. There is one protected attribute, parents, and the disparate impact is 0.46. The data has categorical columns (with numeric ones if preprocessing is applied), with no missing values. A training sketch follows this entry.

Parameters

preprocess (boolean, optional, default False) – If True, encode protected attributes in X as 0 or 1 to indicate privileged groups and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple
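
Fetcher results plug directly into an estimator or Lale pipeline. A minimal training sketch (illustrative only; LogisticRegression here is the Lale wrapper of the scikit-learn estimator):

    # sketch: train and evaluate a plain classifier on a preprocessed fetcher result
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from lale.lib.sklearn import LogisticRegression
    from lale.lib.aif360.datasets import fetch_nursery_df

    X, y, fairness_info = fetch_nursery_df(preprocess=True)
    train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)
    trained = LogisticRegression(max_iter=1000).fit(train_X, train_y)
    print(accuracy_score(test_y, trained.predict(test_X)))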

lale.lib.aif360.datasets.fetch_ricci_df(preprocess: bool = False)[source]

Fetch the ricci_vs_destefano dataset from OpenML and add fairness_info.

It contains test scores for 2003 New Haven Fire Department promotion exams with a binary classification into promotion or no promotion. Without preprocessing, the dataset has 118 rows and 5 columns. There is one protected attribute, race, and the disparate impact is 0.50. The data includes both categorical and numeric columns, with no missing values.

Parameters

preprocess (boolean, optional, default False) – If True, encode protected attributes in X as 0 or 1 to indicate privileged groups; encode labels in y as 0 or 1 to indicate favorable outcomes; and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_speeddating_df(preprocess: bool = False)[source]

Fetch the SpeedDating dataset from OpenML and add fairness_info.

It contains data gathered from participants in experimental speed dating events from 2002 to 2004, with a binary classification into match or no match. Without preprocessing, the dataset has 8,378 rows and 122 columns. There are two protected attributes, whether the other candidate has the same race and the importance of having the same race, and the disparate impact is 0.85. The data includes both categorical and numeric columns, with some missing values.

Parameters

preprocess (boolean, optional, default False) – If True, encode protected attributes in X as 0 or 1 to indicate privileged groups; encode labels in y as 0 or 1 to indicate favorable outcomes; and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_student_math_df()[source]

Fetch the Student Performance (Math) dataset from OpenML and add fairness_info.

The original prediction target is an integer math grade from 1 to 20. This function returns X unchanged but with a binarized version of the target y, using 1 for values >=10 and 0 otherwise. The two protected attributes are sex and age and the disparate impact is 0.894. The dataset has 395 rows and 32 columns, including both categorical and numeric columns. A sketch of the binarization follows this entry.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple
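
The binarization described above, which also applies to the Portuguese variant below, is a simple threshold on the original grade. A sketch of the equivalent pandas operation (illustrative only, not the library's internal code):

    # sketch of the target binarization described above (illustrative only)
    import pandas as pd

    raw_grades = pd.Series([4, 9, 10, 15])   # original grades on the 1-20 scale
    y = (raw_grades >= 10).astype(int)       # 1 = grade of at least 10 (favorable), else 0
    print(list(y))                           # [0, 0, 1, 1]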

lale.lib.aif360.datasets.fetch_student_por_df()[source]

Fetch the Student Performance (Portuguese) dataset from OpenML and add fairness_info.

The original prediction target is an integer Portuguese grade from 1 to 20. This function returns X unchanged but with a binarized version of the target y, using 1 for values >=10 and 0 otherwise. The two protected attributes are sex and age and the disparate impact is 0.858. The dataset has 649 rows and 32 columns, including both categorical and numeric columns.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_tae_df(preprocess: bool = False)[source]

Fetch the tae dataset from OpenML and add fairness_info.

It contains information from teaching assistant (TA) evaluations at the University of Wisconsin–Madison. The prediction task is a classification of the type of rating a TA receives (1=Low, 2=Medium, 3=High). Without preprocessing, the dataset has 151 rows and 5 columns. There is one protected attribute, “whether_of_not_the_ta_is_a_native_english_speaker” [sic], and the disparate impact is 0.45. The data includes both categorical and numeric columns, with no missing values. A sketch of the three preprocess modes follows this entry.

Parameters

preprocess (boolean or "y", optional, default False) – If True, encode protected attributes in X as 0 or 1 to indicate the privileged group (“native_english_speaker”); encode labels in y as 0 or 1 to indicate favorable outcomes; and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes. If “y”, leave features X unchanged and only encode labels y as 0 or 1. If False, encode neither features X nor labels y.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple
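
The tae fetcher documents a three-way preprocess argument rather than a plain boolean. A sketch of the three modes described above (illustrative only):

    # sketch of the three preprocess modes of fetch_tae_df described above
    from lale.lib.aif360.datasets import fetch_tae_df

    X_raw, y_raw, info = fetch_tae_df(preprocess=False)   # raw features and raw 1/2/3 ratings
    X_same, y_bin, _ = fetch_tae_df(preprocess="y")       # raw features, labels encoded as 0/1
    X_enc, y_enc, _ = fetch_tae_df(preprocess=True)       # encoded features and encoded labels
    print(sorted(y_raw.unique()), sorted(y_bin.unique()))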

lale.lib.aif360.datasets.fetch_titanic_df(preprocess: bool = False)[source]

Fetch the Titanic dataset from OpenML and add fairness_info.

It contains data gathered from passengers on the Titanic with a binary classification into “survived” or “did not survive”. Without preprocessing, the dataset has 1,309 rows and 13 columns. There is one protected attribute, sex, and the disparate impact is 0.26. The data includes both categorical and numeric columns, with some missing values.

Parameters

preprocess (boolean, optional, default False) – If True, encode protected attributes in X as 0 or 1 to indicate privileged groups; and apply one-hot encoding to any remaining features in X that are categorical and not protected attributes.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple

lale.lib.aif360.datasets.fetch_us_crime_df()[source]

Fetch the us_crime (also known as “communities and crime”) dataset from OpenML and add fairness_info. The original dataset has several columns with a large number of missing values, which this function drops. The binary protected attribute is blackgt6pct, which is derived by thresholding racepctblack > 0.06 and dropping the original racepctblack. The binary target is derived by thresholding the original target at > 0.70. The disparate impact is 0.888. The resulting dataset has 1,994 rows and 102 columns, all but one of which are numeric.

Returns

result

  • item 0: pandas DataFrame

    Features X, including both protected and non-protected attributes.

  • item 1: pandas Series

    Labels y.

  • item 2: fairness_info

    JSON meta-data following the format understood by fairness metrics and mitigation operators in lale.lib.aif360.

Return type

tuple