lale.lib.sklearn.tfidf_vectorizer module
- class lale.lib.sklearn.tfidf_vectorizer.TfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype='float64', norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
Bases: PlannedIndividualOp
TF-IDF vectorizer transformer from scikit-learn that turns text into term frequency-inverse document frequency (TF-IDF) numeric features.
This documentation is auto-generated from JSON schemas.
- Parameters
input (‘filename’, ‘file’, or ‘content’, not for optimizer, default ‘content’) –
encoding (string, not for optimizer, default 'utf-8') –
decode_error (‘strict’, ‘ignore’, or ‘replace’, not for optimizer, default ‘strict’) –
strip_accents (‘ascii’, ‘unicode’, or None, not for optimizer, default None) –
lowercase (boolean, not for optimizer, default True) –
preprocessor (union type, not for optimizer, default None) –
callable, not for optimizer
or None
tokenizer (union type, not for optimizer, default None) –
callable, not for optimizer
or None
See also constraint-1.
analyzer (union type, default 'word') –
‘word’, ‘char’, or ‘char_wb’
or callable, not for optimizer
See also constraint-1, constraint-2.
stop_words (union type, not for optimizer, default None) –
None or ‘english’
or array of items : string
See also constraint-2.
token_pattern (string, optional, not for optimizer, default '(?u)\b\w\w+\b') –
ngram_range (union type, default (1, 1)) –
tuple, >=2 items for optimizer, <=2 items for optimizer, not for optimizer of items : integer, >=1 for optimizer, <=3 for optimizer
or (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), or (3, 3)
max_df (union type, default 1.0) –
float, >=0.0, >=0.8 for optimizer, <=1.0, <=0.9 for optimizer, uniform distribution
float in range [0.0, 1.0]
or integer, not for optimizer
min_df (union type, default 1) –
float, >=0.0, >=0.0 for optimizer, <=1.0, <=0.1 for optimizer, uniform distribution
float in range [0.0, 1.0]
or integer, not for optimizer
max_features (union type, not for optimizer, default None) –
integer, >=1, <=10000 for optimizer
or None
vocabulary (union type, not for optimizer, default None) –
Mapping or iterable, optional
dict
or None
binary (boolean, default False) –
dtype (string, not for optimizer, default 'float64') – type, optional
norm (‘l1’, ‘l2’, or None, default ‘l2’) –
use_idf (boolean, default True) –
smooth_idf (boolean, default True) –
sublinear_tf (boolean, default False) –
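The weighting parameters binary, norm, use_idf, smooth_idf, and sublinear_tf are listed above without descriptions. As a rough illustration of how they interact, here is a minimal pure-Python sketch that follows the formulas given in the scikit-learn documentation. The function name tfidf_rows is hypothetical, and tokenization is simplified to lowercasing plus whitespace splitting (an assumption; the default token_pattern behaves differently, for example it drops single-character tokens). This is not the library implementation.

```python
import math
from collections import Counter


def tfidf_rows(docs, use_idf=True, smooth_idf=True, sublinear_tf=False,
               norm="l2", binary=False):
    """Hypothetical helper: weight documents with simplified TF-IDF."""
    # Assumption: lowercase + whitespace split stands in for the real analyzer.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = []
        for term in vocab:
            tf = float(counts[term])
            if binary and tf > 0:
                tf = 1.0                      # binary: presence only
            if sublinear_tf and tf > 0:
                tf = 1.0 + math.log(tf)       # sublinear_tf: 1 + ln(tf)
            if use_idf:
                if smooth_idf:
                    # smooth_idf: add one to numerator and denominator
                    idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                else:
                    idf = math.log(n_docs / df[term]) + 1.0
                tf *= idf
            row.append(tf)
        # norm: scale each row to unit l2 or l1 length, or leave as-is.
        if norm == "l2":
            scale = math.sqrt(sum(v * v for v in row))
        elif norm == "l1":
            scale = sum(abs(v) for v in row)
        else:
            scale = 1.0
        rows.append([v / scale if scale else v for v in row])
    return vocab, rows


vocab, rows = tfidf_rows(["the cat sat", "the dog sat down"])
```

Each row has one entry per vocabulary term, which matches the "array of items : array of items : float" shape given for transform, although the real operator may return a sparse matrix.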
Notes
constraint-1 : union type
tokenizer, only applies if analyzer == ‘word’
analyzer : ‘word’
or tokenizer : None
constraint-2 : union type
stop_words can be a list only if analyzer == ‘word’
stop_words : negated type of array of items : string
or analyzer : ‘word’
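The two constraints above can be read as simple boolean conditions over the hyperparameters. The sketch below restates them in plain Python; check_constraints is a hypothetical helper written for illustration and is not part of Lale, which validates hyperparameters against its JSON schemas.

```python
def check_constraints(analyzer="word", tokenizer=None, stop_words=None):
    """Hypothetical helper: return the names of violated constraints."""
    violated = []

    # constraint-1: analyzer == 'word' OR tokenizer is None
    if not (analyzer == "word" or tokenizer is None):
        violated.append("constraint-1")

    # constraint-2: stop_words is NOT an array of strings OR analyzer == 'word'
    is_string_array = isinstance(stop_words, (list, tuple)) and all(
        isinstance(word, str) for word in stop_words
    )
    if not (not is_string_array or analyzer == "word"):
        violated.append("constraint-2")

    return violated


# A custom tokenizer is only meaningful with the 'word' analyzer.
print(check_constraints(analyzer="char", tokenizer=str.split))
# A stop-word list is only allowed with the 'word' analyzer.
print(check_constraints(analyzer="char_wb", stop_words=["the", "a"]))
```

Note that stop_words='english' is a string rather than an array of strings, so it satisfies constraint-2 for any analyzer.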
- fit(X, y=None, **fit_params)
Train the operator.
Note: The fit method is not available until this operator is trainable.
Once this method is available, it will have the following signature:
- Parameters
X (union type) –
Features; the outer array is over samples.
array of items : string
or array of items : array, >=1 items, <=1 items of items : string
y (any type, optional) – Target class labels; the array is over samples.
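The schema for X accepts two shapes: a flat array of strings, or an array of rows where each row holds exactly one string (for example, a single-column table). The sketch below shows one way to check input against that schema and reduce both shapes to a flat list. The helper as_flat_strings is hypothetical, and whether the operator performs an equivalent conversion internally is not stated here.

```python
def as_flat_strings(X):
    """Hypothetical helper: accept ['a', 'b'] or [['a'], ['b']], return ['a', 'b']."""
    flat = []
    for index, row in enumerate(X):
        if isinstance(row, str):
            # First alternative: array of items : string
            flat.append(row)
        elif (isinstance(row, (list, tuple)) and len(row) == 1
              and isinstance(row[0], str)):
            # Second alternative: each row is an array with exactly one string
            flat.append(row[0])
        else:
            raise ValueError(
                f"row {index} is neither a string nor a one-item array of strings"
            )
    return flat


print(as_flat_strings(["first document", "second document"]))
print(as_flat_strings([["first document"], ["second document"]]))
```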
- transform(X, y=None)
Transform the data.
Note: The transform method is not available until this operator is trained.
Once this method is available, it will have the following signature:
- Parameters
X (union type) –
Features; the outer array is over samples.
array of items : string
or array of items : array, >=1 items, <=1 items of items : string
- Returns
result – Output data schema for predictions (projected data) using the TfidfVectorizer model from scikit-learn.
- Return type
array of items : array of items : float