lale.lib.sklearn.tfidf_vectorizer module

class lale.lib.sklearn.tfidf_vectorizer.TfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype='float64', norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Bases: PlannedIndividualOp

TF-IDF vectorizer transformer from scikit-learn that turns text into numeric term frequency-inverse document frequency features.

This documentation is auto-generated from JSON schemas.

Parameters
  • input (‘filename’, ‘file’, or ‘content’, not for optimizer, default ‘content’) –

  • encoding (string, not for optimizer, default 'utf-8') –

  • decode_error (‘strict’, ‘ignore’, or ‘replace’, not for optimizer, default ‘strict’) –

  • strip_accents (‘ascii’, ‘unicode’, or None, not for optimizer, default None) –

  • lowercase (boolean, not for optimizer, default True) –

  • preprocessor (union type, not for optimizer, default None) –

    • callable, not for optimizer

    • or None

  • tokenizer (union type, not for optimizer, default None) –

    • callable, not for optimizer

    • or None

    See also constraint-1.

  • analyzer (union type, default 'word') –

    • ‘word’, ‘char’, or ‘char_wb’

    • or callable, not for optimizer

    See also constraint-1, constraint-2.

  • stop_words (union type, not for optimizer, default None) –

    • None or ‘english’

    • or array of items : string

    See also constraint-2.

  • token_pattern (string, optional, not for optimizer, default '(?u)\b\w\w+\b') –

  • ngram_range (union type, default (1, 1)) –

    • tuple, >=2 items for optimizer, <=2 items for optimizer, not for optimizer of items : integer, >=1 for optimizer, <=3 for optimizer

    • or (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), or (3, 3)

  • max_df (union type, default 1.0) –

    • float, >=0.0, >=0.8 for optimizer, <=1.0, <=0.9 for optimizer, uniform distribution

      float in range [0.0, 1.0]

    • or integer, not for optimizer

  • min_df (union type, default 1) –

    • float, >=0.0, >=0.0 for optimizer, <=1.0, <=0.1 for optimizer, uniform distribution

      float in range [0.0, 1.0]

    • or integer, not for optimizer

  • max_features (union type, not for optimizer, default None) –

    • integer, >=1, <=10000 for optimizer

    • or None

  • vocabulary (union type, not for optimizer, default None) –

    Mapping or iterable, optional.

    • dict

    • or None

  • binary (boolean, default False) –

  • dtype (string, not for optimizer, default 'float64') – type, optional

  • norm (‘l1’, ‘l2’, or None, default ‘l2’) –

  • use_idf (boolean, default True) –

  • smooth_idf (boolean, default True) –

  • sublinear_tf (boolean, default False) –
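To make the interaction of lowercase, token_pattern, and ngram_range more concrete, the following is a minimal standard-library sketch of word-level analysis. It is not the scikit-learn or Lale implementation; the helper name word_ngrams is hypothetical, and the ordering of the returned n-grams is an assumption made for illustration.

```python
import re

# Default token_pattern from the parameter list: runs of 2 or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def word_ngrams(text, ngram_range=(1, 1), lowercase=True):
    """Hypothetical helper: tokenize text, then build space-joined word n-grams."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(TOKEN_PATTERN, text)
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Single-character tokens such as "a" do not match the default pattern.
print(word_ngrams("A cat sat on the mat"))
# With ngram_range=(1, 2), unigrams and bigrams are both produced.
print(word_ngrams("A cat sat", ngram_range=(1, 2)))
```

Widening ngram_range grows the vocabulary quickly, which may be why the schema limits the optimizer's search to n-grams of at most three words.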

Notes

constraint-1 : union type

tokenizer, only applies if analyzer == ‘word’

  • analyzer : ‘word’

  • or tokenizer : None

constraint-2 : union type

stop_words can be a list only if analyzer == ‘word’

  • stop_words : negated type of array of items : string

  • or analyzer : ‘word’
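The two constraints above can be read as simple boolean conditions over the hyperparameters. The sketch below restates them in plain Python, assuming a hypothetical helper check_constraints; Lale performs its own schema validation, and this is not that code.

```python
def check_constraints(analyzer="word", tokenizer=None, stop_words=None):
    """Hypothetical check: return the names of the constraints that are violated."""
    violated = []

    # constraint-1: analyzer == 'word' OR tokenizer is None.
    if not (analyzer == "word" or tokenizer is None):
        violated.append("constraint-1")

    # constraint-2: stop_words is NOT an array of strings OR analyzer == 'word'.
    is_string_list = isinstance(stop_words, list) and all(
        isinstance(item, str) for item in stop_words
    )
    if not (not is_string_list or analyzer == "word"):
        violated.append("constraint-2")

    return violated


# The defaults satisfy both constraints.
print(check_constraints())
# A custom tokenizer combined with a character analyzer breaks constraint-1.
print(check_constraints(analyzer="char", tokenizer=str.split))
# A stop-word list combined with a character analyzer breaks constraint-2.
print(check_constraints(analyzer="char_wb", stop_words=["the", "and"]))
```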

fit(X, y=None, **fit_params)

Train the operator.

Note: The fit method is not available until this operator is trainable.

Once this method is available, it will have the following signature:

Parameters
  • X (union type) –

    Features; the outer array is over samples.

    • array of items : string

    • or array of items : array, >=1 items, <=1 items of items : string

  • y (any type, optional) – Target class labels; the array is over samples.
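The schema for X accepts either a flat array of strings or an array of one-item arrays of strings (for example, a single text column). The sketch below illustrates both shapes with a hypothetical helper, to_flat_documents; whether the operator normalizes its input this way internally is not stated here, so treat it only as an illustration of the accepted shapes.

```python
def to_flat_documents(X):
    """Hypothetical helper: accept either documented shape of X, return a list of strings."""
    docs = []
    for row in X:
        if isinstance(row, str):
            # Shape 1: array of items : string
            docs.append(row)
        elif len(row) == 1 and isinstance(row[0], str):
            # Shape 2: array of one-item arrays of strings
            docs.append(row[0])
        else:
            raise ValueError(
                "each sample must be a string or a one-item array of strings"
            )
    return docs


flat = ["red apple", "green apple"]
single_column = [["red apple"], ["green apple"]]
print(to_flat_documents(flat) == to_flat_documents(single_column))
```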

transform(X, y=None)

Transform the data.

Note: The transform method is not available until this operator is trained.

Once this method is available, it will have the following signature:

Parameters
  • X (union type) –

    Features; the outer array is over samples.

    • array of items : string

    • or array of items : array, >=1 items, <=1 items of items : string

Returns

result – Output data schema for predictions (projected data) using the TfidfVectorizer model from scikit-learn.

Return type

array of items : array of items : float
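As a rough illustration of what the returned floats represent, the sketch below computes TF-IDF weights in pure Python for norm='l2', following the weighting scikit-learn documents: with smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and with sublinear_tf=True the raw count tf is replaced by 1 + ln(tf). The function name tfidf_rows and the whitespace tokenization are simplifying assumptions, and the output here is a dense list of lists, whereas the underlying scikit-learn transformer produces a sparse matrix.

```python
import math


def tfidf_rows(docs, smooth_idf=True, sublinear_tf=False):
    """Minimal sketch: whitespace tokens, idf weighting, l2 row normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)

    # Document frequency: number of documents containing each term.
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    if smooth_idf:
        idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    else:
        idf = {t: math.log(n / df[t]) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        row = []
        for t in vocab:
            tf = toks.count(t)
            if sublinear_tf and tf > 0:
                tf = 1 + math.log(tf)
            row.append(tf * idf[t])
        # l2 normalization: scale each row to unit Euclidean length.
        length = math.sqrt(sum(v * v for v in row))
        rows.append([v / length if length else 0.0 for v in row])
    return vocab, rows


vocab, rows = tfidf_rows(["red apple", "green apple"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that occur in every document ("apple" here) receive the minimum idf of 1, so they end up with a smaller weight than terms unique to one document.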