lale.datasets.multitable.util module

lale.datasets.multitable.util.multitable_train_test_split(dataset: List[Any], main_table_name: str, label_column_name: str, test_size: float = 0.25, random_state: Optional[Union[RandomState, int]] = None) → Tuple

Splits X and y into random train and test subsets stratified by labels and protected attributes.

Behaves similarly to the train_test_split function from scikit-learn.

Parameters
  • dataset (list of either Pandas or Spark dataframes) – Each dataframe in the list corresponds to an entity/table in the multi-table setting.

  • main_table_name (string) – The name of the main table. The split is performed on the rows of this table.

  • label_column_name (string) – The name of the label column from the main table.

  • test_size (float or int, default=0.25) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.

  • random_state (int, RandomState instance or None, default=None) –

    Controls the shuffling applied to the data before applying the split. Pass an integer for reproducible output across multiple function calls.

    • None

      RandomState used by numpy.random

    • numpy.random.RandomState

      Use the provided random state, only affecting other users of that same random state instance.

    • integer

      Explicit seed.

Returns

result

  • item 0: train_X, List of datasets corresponding to the train split

  • item 1: test_X, List of datasets corresponding to the test split

  • item 2: train_y

  • item 3: test_y

Return type

tuple

Raises

jsonschema.ValueError – Bad configuration. Either the table name was not found, or the provided list does not contain Spark or pandas dataframes.
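
The following is a minimal, standard-library-only sketch of the idea behind a main-table split; it is not lale's implementation. It only illustrates how the parameters and the four returned items relate to each other. Everything in it is an assumption made for illustration: the helper name sketch_multitable_split, the use of lists of row dicts in place of Pandas or Spark dataframes, a dict keyed by table name in place of the list of named dataframes, rounding a fractional test_size up, removing the label column from the main table, and a plain shuffle with no stratification.

```python
import math
import random
from typing import Any, Dict, List, Optional, Tuple, Union

# Hypothetical stand-in for a dataframe: a list of row dicts.
Table = List[Dict[str, Any]]


def sketch_multitable_split(
    tables: Dict[str, Table],
    main_table_name: str,
    label_column_name: str,
    test_size: Union[float, int] = 0.25,
    seed: Optional[int] = None,
) -> Tuple[Dict[str, Table], Dict[str, Table], List[Any], List[Any]]:
    """Illustrative only: split the main table's rows, keep other tables whole."""
    if main_table_name not in tables:
        raise ValueError(f"table {main_table_name!r} not found")

    main = tables[main_table_name]
    n_rows = len(main)

    # Float: proportion of rows; int: absolute number of test rows.
    # Rounding up is an assumption of this sketch.
    if isinstance(test_size, float):
        n_test = math.ceil(test_size * n_rows)
    else:
        n_test = int(test_size)

    # A seeded generator makes the shuffle reproducible across calls.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])

    def take(idx: List[int]) -> Tuple[Table, List[Any]]:
        # Separate the label column from the remaining columns.
        rows = [main[i] for i in idx]
        features = [
            {k: v for k, v in row.items() if k != label_column_name}
            for row in rows
        ]
        labels = [row[label_column_name] for row in rows]
        return features, labels

    train_main, train_y = take(train_idx)
    test_main, test_y = take(test_idx)

    # Only the main table differs between the two splits.
    train_X = {**tables, main_table_name: train_main}
    test_X = {**tables, main_table_name: test_main}
    return train_X, test_X, train_y, test_y


# Made-up example data: "orders" is the main table, "returned" is the label.
tables = {
    "orders": [
        {"order_id": i, "customer_id": i % 3, "returned": i % 2}
        for i in range(8)
    ],
    "customers": [{"customer_id": c, "region": "EU"} for c in range(3)],
}
train_X, test_X, train_y, test_y = sketch_multitable_split(
    tables, "orders", "returned", test_size=0.25, seed=0
)
```

With 8 rows in the main table and test_size=0.25, this sketch places 2 rows in the test split and 6 in the train split, while the secondary table is passed through unchanged in both. Calling it again with the same seed yields the same split.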