lale.datasets.multitable.util module
- lale.datasets.multitable.util.multitable_train_test_split(dataset: List[Any], main_table_name: str, label_column_name: str, test_size: float = 0.25, random_state: Optional[Union[RandomState, int]] = None) → Tuple
Splits X and y into random train and test subsets stratified by labels and protected attributes.
Behaves similarly to the train_test_split function from scikit-learn.
- Parameters
dataset (list of either Pandas or Spark dataframes) – Each dataframe in the list corresponds to an entity/table in the multi-table setting.
main_table_name (string) – The name of the main table; the split is based on this table.
label_column_name (string) – The name of the label column from the main table.
test_size (float or int, default=0.25) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
random_state (int, RandomState instance or None, default=None) – Controls the shuffling applied to the data before applying the split. Pass an integer for reproducible output across multiple function calls.
None – RandomState used by numpy.random.
numpy.random.RandomState – Use the provided random state, only affecting other users of that same random state instance.
integer – Explicit seed.
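The two readings of test_size described above can be sketched as follows. This is a minimal illustration, not lale's implementation: the helper name resolve_test_count is hypothetical, and rounding a fractional count up is an assumption of this sketch.

```python
import math
from typing import Union


def resolve_test_count(n_rows: int, test_size: Union[float, int] = 0.25) -> int:
    """Hypothetical helper: turn test_size into a number of test rows."""
    if isinstance(test_size, float):
        # Float: interpreted as a proportion of the main table.
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie strictly between 0.0 and 1.0")
        return math.ceil(n_rows * test_size)  # rounding up is an assumption
    # Int: interpreted as an absolute number of test rows.
    if not 0 < test_size < n_rows:
        raise ValueError("an int test_size must be positive and smaller than n_rows")
    return test_size


print(resolve_test_count(200))       # default proportion of 0.25
print(resolve_test_count(200, 0.1))  # 10% of the rows
print(resolve_test_count(200, 30))   # exactly 30 rows
```

Note that 1 and 1.0 are treated differently under this reading: the first asks for a single test row, while the second is rejected as an out-of-range proportion.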
- Returns
result –
item 0: train_X, List of datasets corresponding to the train split
item 1: test_X, List of datasets corresponding to the test split
item 2: train_y, labels corresponding to the train split
item 3: test_y, labels corresponding to the test split
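The shape of the returned tuple can be pictured with a toy model built on plain Python data. Everything here is illustrative and not lale's code: the function toy_multitable_split and the dict-based Table type are hypothetical, a plain seeded shuffle stands in for the stratification described above, and the sketch assumes that only the main table is split while the other tables are carried over unchanged. Whether the label column is removed from the main table is not stated in this reference; the sketch keeps it.

```python
import random
from typing import Any, Dict, List, Optional, Tuple

Table = Dict[str, Any]  # hypothetical stand-in for a named dataframe


def toy_multitable_split(
    dataset: List[Table],
    main_table_name: str,
    label_column_name: str,
    test_size: float = 0.25,
    random_state: Optional[int] = None,
) -> Tuple[List[Table], List[Table], List[Any], List[Any]]:
    """Toy model of the return structure; not lale's implementation."""
    names = [table["name"] for table in dataset]
    if main_table_name not in names:
        raise ValueError(f"table {main_table_name!r} not found among {names}")
    main_index = names.index(main_table_name)
    rows = dataset[main_index]["rows"]

    # Shuffle row positions with a seeded generator, then cut off the test part.
    positions = list(range(len(rows)))
    random.Random(random_state).shuffle(positions)
    n_test = int(len(rows) * test_size)
    test_rows = [rows[i] for i in positions[:n_test]]
    train_rows = [rows[i] for i in positions[n_test:]]

    # Assumption: only the main table is split; other tables pass through as-is.
    train_X, test_X = list(dataset), list(dataset)
    train_X[main_index] = {"name": main_table_name, "rows": train_rows}
    test_X[main_index] = {"name": main_table_name, "rows": test_rows}
    train_y = [row[label_column_name] for row in train_rows]
    test_y = [row[label_column_name] for row in test_rows]
    return train_X, test_X, train_y, test_y


orders = {"name": "orders", "rows": [{"id": i, "late": i % 2} for i in range(8)]}
customers = {"name": "customers", "rows": [{"cid": c} for c in range(3)]}

train_X, test_X, train_y, test_y = toy_multitable_split(
    [orders, customers], "orders", "late", test_size=0.25, random_state=0
)
print(len(train_y), len(test_y))  # prints "6 2"
```

Items 0 and 1 keep the same list layout as the input, so a table can be looked up at the same position in the train and test lists.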
- Return type
tuple
- Raises
jsonschema.ValueError – Bad configuration. Either the table name was not found, or the provided list does not contain Spark or Pandas dataframes.