lale.datasets.multitable.util module

lale.datasets.multitable.util.multitable_train_test_split(dataset: List[Any], main_table_name: str, label_column_name: str, test_size: float = 0.25, random_state: Optional[Union[RandomState, int]] = None) → Tuple

Splits X and y into random train and test subsets stratified by labels and protected attributes.

Behaves similarly to the train_test_split function from scikit-learn.

Parameters

dataset (list of either Pandas or Spark dataframes) – Each dataframe in the list corresponds to an entity/table in the multi-table setting.
main_table_name (string) – The name of the main table. The split is based on the rows of this table.
label_column_name (string) – The name of the label column from the main table.
test_size (float or int, default=0.25) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
random_state (int, RandomState instance or None, default=None) – Controls the shuffling applied to the data before applying the split. Pass an integer for reproducible output across multiple function calls.
- None: RandomState used by numpy.random.
- numpy.random.RandomState: Use the provided random state, only affecting other users of that same random state instance.
- integer: Explicit seed.
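
The two readings of test_size described above can be illustrated with a small sketch. The helper name resolve_test_count is hypothetical and is not part of lale. Rounding a fractional count down and rejecting out-of-range values are assumptions of this sketch, so the library may behave differently at the edges.

```python
from typing import Union


def resolve_test_count(n_rows: int, test_size: Union[float, int] = 0.25) -> int:
    """Hypothetical helper: turn test_size into a number of test rows."""
    if isinstance(test_size, float):
        # A float is read as the proportion of rows that go to the test split.
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie between 0.0 and 1.0")
        # Assumption: fractional counts are rounded down.
        return int(n_rows * test_size)
    # An int is read as the absolute number of test samples.
    if not 0 < test_size < n_rows:
        raise ValueError("an int test_size must be positive and below n_rows")
    return test_size


print(resolve_test_count(200, 0.25))  # proportion of 200 rows
print(resolve_test_count(200, 30))    # absolute number of rows
```

Note that 1 and 1.0 fall on different branches here: the int is an absolute count of one row, while the float is rejected as a proportion.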

Returns

result –

item 0: train_X, List of datasets corresponding to the train split
item 1: test_X, List of datasets corresponding to the test split
item 2: train_y
item 3: test_y

Return type

tuple
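
The shape of the returned tuple can be seen in a dependency-free sketch. This is not lale's implementation: sketch_split and the dict-based Table type are hypothetical stand-ins for the real function and for named Pandas or Spark dataframes. The sketch also assumes that tables other than the main one are passed through unchanged, that the label column is removed from the main table, and it does not stratify.

```python
import random
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical stand-in for a named dataframe: {"name": ..., "rows": [dict, ...]}
Table = Dict[str, Any]


def sketch_split(
    dataset: List[Table],
    main_table_name: str,
    label_column_name: str,
    test_size: float = 0.25,
    seed: Optional[int] = None,
) -> Tuple[List[Table], List[Table], List[Any], List[Any]]:
    names = [table["name"] for table in dataset]
    if main_table_name not in names:
        raise ValueError(f"table {main_table_name!r} not found among {names}")
    main_idx = names.index(main_table_name)
    rows = dataset[main_idx]["rows"]

    # Pick the test rows of the main table reproducibly from the seed.
    rng = random.Random(seed)
    n_test = int(len(rows) * test_size)
    test_ids = set(rng.sample(range(len(rows)), n_test))

    def part(keep_test: bool) -> Tuple[List[Table], List[Any]]:
        chosen = [r for i, r in enumerate(rows) if (i in test_ids) == keep_test]
        y = [r[label_column_name] for r in chosen]
        features = [
            {k: v for k, v in r.items() if k != label_column_name} for r in chosen
        ]
        # Assumption: tables other than the main one are passed through unchanged.
        tables = list(dataset)
        tables[main_idx] = {"name": main_table_name, "rows": features}
        return tables, y

    train_X, train_y = part(keep_test=False)
    test_X, test_y = part(keep_test=True)
    return train_X, test_X, train_y, test_y


orders = {
    "name": "orders",
    "rows": [{"id": i, "cust": i % 3, "late": i % 2} for i in range(8)],
}
customers = {"name": "customers", "rows": [{"cust": c} for c in range(3)]}

# Unpack in the documented order: train_X, test_X, train_y, test_y.
train_X, test_X, train_y, test_y = sketch_split(
    [orders, customers], "orders", "late", test_size=0.25, seed=0
)
print(len(train_X), len(test_X), len(train_y), len(test_y))
```

In this sketch train_X and test_X each hold one entry per input table, while train_y and test_y hold only the label values taken from the main table.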

Raises

jsonschema.ValueError – Bad configuration. Either the table name was not found, or the provided list does not contain Spark or pandas dataframes.