lale.datasets.multitable.fetch_datasets module

lale.datasets.multitable.fetch_datasets.fetch_creditg_multitable_dataset(datatype: Literal['pandas', 'spark'] = 'pandas')

Fetches the [credit-g](https://www.openml.org/d/31) dataset from OpenML and transforms it into a multi-table format. The dataset is split into 3 tables: loan_application, bank_account_info and existing_credits_info. The table loan_application serves as the primary table, and the other two tables provide additional information about the applicant’s bank account and existing credits. This resembles a real-life scenario where information is spread across multiple tables in normalized form. A primary key column id was created as a proxy for the loan applicant’s identity number.

Parameters

datatype (string, optional, default 'pandas') – If ‘pandas’, returns a list of singleton dictionaries (each element of the list is one table from the dataset) after reading the downloaded CSV files. The key of each dictionary is the name of the table and the value is a pandas dataframe containing the data.

Returns

dataframes_list

Return type

list of singleton dictionaries of pandas dataframes
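The return shape described above (a list in which each element is a one-entry dictionary mapping a table name to its dataframe) can be awkward to index by name. A minimal sketch of one way to flatten it follows. The helper name `tables_by_name` is hypothetical and not part of lale, and the stand-in values are plain lists rather than real dataframes so that the snippet runs without lale or a network connection.

```python
from typing import Any, Dict, List


def tables_by_name(singletons: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge a list of one-entry dicts into a single name -> table dict.

    Hypothetical helper, not part of lale.
    """
    merged: Dict[str, Any] = {}
    for entry in singletons:
        if len(entry) != 1:
            raise ValueError(f"expected a singleton dictionary, got {len(entry)} keys")
        # Unpack the only (name, table) pair of the singleton dictionary.
        ((name, table),) = entry.items()
        if name in merged:
            raise ValueError(f"duplicate table name: {name!r}")
        merged[name] = table
    return merged


# Stand-in data shaped like the documented return value. With lale installed,
# the list would instead come from fetch_creditg_multitable_dataset().
example = [
    {"loan_application": [{"id": 1}]},
    {"bank_account_info": [{"id": 1}]},
    {"existing_credits_info": [{"id": 1}]},
]
tables = tables_by_name(example)
print(sorted(tables))
```

With the merged dictionary, a table can be looked up directly, for example `tables["loan_application"]`.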

lale.datasets.multitable.fetch_datasets.fetch_go_sales_dataset(datatype: Literal['pandas', 'spark'] = 'pandas')

Fetches the Go_Sales dataset from IBM’s Watson Machine Learning samples. It contains information about daily sales, methods, retailers and products of a company in the form of 5 CSV files. This method downloads these 5 CSV files and stores them under the ‘lale/lale/datasets/multitable/go_sales_data’ directory, creating the directory if it does not exist.

Dataset URL: https://github.com/IBM/watson-machine-learning-samples/raw/master/cloud/data/go_sales/

Parameters

datatype (string, optional, default 'pandas') –

If ‘pandas’, returns a list of singleton dictionaries (each element of the list is one table from the dataset) after reading the downloaded CSV files. The key of each dictionary is the name of the table and the value is a pandas dataframe containing the data.

If ‘spark’, returns a list of singleton dictionaries (each element of the list is one table from the dataset) after reading the downloaded CSV files. The key of each dictionary is the name of the table and the value is a spark dataframe containing the data, extended with an index column.

Otherwise, throws an error, as no other return type is supported.

Returns

go_sales_list

Return type

list of singleton dictionaries of pandas / spark dataframes
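The download-and-store behaviour described for this function (create the target directory if needed, then fetch the CSV files into it) can be sketched as below. This is an illustration under stated assumptions, not lale's implementation: the helper `ensure_csv_files`, the file name `example.csv`, and the URL `https://example.invalid/` are all hypothetical, and skipping files that already exist is a choice made for this sketch. The downloader is passed in as a parameter so the demonstration runs offline with a fake one.

```python
import os
import tempfile
import urllib.request
from typing import Callable, Iterable, List


def ensure_csv_files(
    directory: str,
    file_names: Iterable[str],
    base_url: str,
    download: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> List[str]:
    """Return local paths for file_names, downloading only the missing ones.

    Hypothetical helper, not part of lale.
    """
    os.makedirs(directory, exist_ok=True)  # create the directory if absent
    paths = []
    for name in file_names:
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            download(base_url + name, path)
        paths.append(path)
    return paths


# Offline demonstration: a fake downloader records each request and writes
# a tiny placeholder CSV instead of contacting a server.
calls: List[str] = []


def fake_download(url: str, path: str) -> None:
    calls.append(url)
    with open(path, "w") as handle:
        handle.write("id,value\n1,a\n")


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "go_sales_data")
    first = ensure_csv_files(target, ["example.csv"], "https://example.invalid/", fake_download)
    second = ensure_csv_files(target, ["example.csv"], "https://example.invalid/", fake_download)
```

In this sketch the second call finds the file already on disk and does not invoke the downloader again.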

lale.datasets.multitable.fetch_datasets.fetch_imdb_dataset(datatype: Literal['pandas', 'spark'] = 'pandas')

Fetches the IMDB movie dataset from the Relational Dataset Repository. It contains information about directors, actors, roles and genres of multiple movies in the form of 7 CSV files. This method downloads these 7 CSV files and stores them under the ‘lale/lale/datasets/multitable/imdb_data’ directory, creating the directory if it does not exist.

Dataset URL: https://relational.fit.cvut.cz/dataset/IMDb

Parameters

datatype (string, optional, default 'pandas') –

If ‘pandas’, returns a list of singleton dictionaries (each element of the list is one table from the dataset) after reading the already existing CSV files. The key of each dictionary is the name of the table and the value is a pandas dataframe containing the data.

If ‘spark’, returns a list of singleton dictionaries (each element of the list is one table from the dataset) after reading the downloaded CSV files. The key of each dictionary is the name of the table and the value is a spark dataframe containing the data, extended with an index column.

Otherwise, throws an error, as no other return type is supported.

Returns

imdb_list

Return type

list of singleton dictionaries of pandas / spark dataframes

Raises

jsonschema.ValueError – dataset not found
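Because the fetch functions accept only 'pandas' or 'spark' and raise an error for anything else, a caller may prefer to validate the argument up front, before any download is attempted. A minimal sketch of such a check follows. The helper `check_datatype` is hypothetical, and raising `ValueError` is an assumption made here; the documentation above does not state which exception type lale raises for an unsupported datatype.

```python
from typing import Literal, get_args

# Mirrors the Literal annotation shown in the function signatures.
Datatype = Literal["pandas", "spark"]


def check_datatype(datatype: str) -> str:
    """Return datatype unchanged if supported, otherwise raise ValueError.

    Hypothetical helper, not part of lale; the exception type is an assumption.
    """
    allowed = get_args(Datatype)  # ("pandas", "spark")
    if datatype not in allowed:
        raise ValueError(f"datatype must be one of {allowed}, got {datatype!r}")
    return datatype


checked = check_datatype("spark")
```

The checked value could then be passed as the `datatype` argument of any of the fetch functions on this page.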

lale.datasets.multitable.fetch_datasets.get_data_from_csv(datatype: Literal['pandas', 'spark'], data_file_name)