* `step_dummy_hash()`, `step_texthash()`, `step_tf()`, and `step_tfidf()` have gained a `sparse` argument. When set to `"yes"`, they will produce sparse vectors. (#277)
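
  A minimal sketch of the new argument on toy data; the `"yes"` value comes from the entry above, and the other accepted values are documented in the steps themselves:

  ```r
  library(textrecipes)  # also attaches recipes

  tiny <- data.frame(text = c("sparse vectors save memory",
                              "a small example"))

  rec <- recipe(~ text, data = tiny) |>
    step_tokenize(text) |>
    step_tf(text, sparse = "yes")  # request sparse output (#277)

  bake(prep(rec), new_data = NULL)
  ```
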
* Documentation for tidy methods for all steps has been improved to describe the return value more accurately. (#262)
* Calling `?tidy.step_*()` now sends you to the documentation for `step_*()`, where the outcome is documented. (#261)
* `step_textfeatures()` has been made faster and more robust. (#265)
* Fixed a bug in `step_clean_levels()` where it would produce `NA`s for character columns. (#274)
* textfeatures has been removed from Suggests. (#255)
* `step_textfeatures()` no longer returns a politeness feature. (#254)
* `step_untokenize()` and `step_normalization()` now return factors instead of strings. (#247)
* `step_clean_names()` now throws an informative error if needed non-standard role columns are missing during `bake()`. (#235)
* The `keep_original_cols` argument has been added to `step_tokenmerge()`. This change should mean that every step that produces new columns has the `keep_original_cols` argument. (#242)
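
  A sketch of the new argument; the column names are invented for illustration:

  ```r
  library(textrecipes)

  df <- data.frame(title = "a short title",
                   body  = "a longer body text")

  rec <- recipe(~ ., data = df) |>
    step_tokenize(title, body) |>
    step_tokenmerge(title, body, keep_original_cols = TRUE)  # also keep `title` and `body`
  ```
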
* Many internal changes to improve consistency and slight speed increases.
* Fixed a bug where `step_dummy_hash()` and `step_texthash()` would add new columns before old columns. (#235)
* Fixed a bug where `vocabulary_size` wasn't tunable in `step_tokenize_bpe()`. (#239)
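
  With the fix, the argument can be marked for tuning in the usual way; a minimal sketch (tuning itself would then go through the tune package):

  ```r
  library(textrecipes)
  library(tune)  # provides the tune() placeholder

  rec <- recipe(~ text, data = data.frame(text = "some text")) |>
    step_tokenize_bpe(text, vocabulary_size = tune())  # now tunable (#239)
  ```
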
* Steps with tunable arguments now have those arguments listed in the documentation.
* All steps that add new columns will now informatively error if a name collision occurs.
* Fixed a bug where `step_tf()` wasn't tunable for the `weight` argument.
* Setting `token = "tweets"` in `step_tokenize()` has been deprecated due to `tokenizers::tokenize_tweets()` being deprecated. (#209)
* `step_sequence_onehot()`, `step_dummy_hash()`, and `step_dummy_texthash()` now return integers. `step_tf()` returns integers when `weight_scheme` is `"binary"` or `"raw count"`.
* All steps now have `required_pkgs()` methods.
* Examples now use `if (require(...))` code.
* Removed use of okc_text in vignette.
* Fixed a bug in the printing of tokenlists.
* `step_tfidf()` now correctly saves the idf values and applies them to the testing data set.
* `tidy.step_tfidf()` now returns the calculated IDF weights.
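
  Sketched with toy data: the idf values are estimated by `prep()` on the training data and reused by `bake()` on new data:

  ```r
  library(textrecipes)

  train <- data.frame(text = c("cats and dogs", "dogs and birds"))
  test  <- data.frame(text = "cats and birds")

  rec <- recipe(~ text, data = train) |>
    step_tokenize(text) |>
    step_tfidf(text) |>
    prep()

  bake(rec, new_data = test)  # applies the idf values learned from `train`
  tidy(rec, number = 2)       # the stored IDF weights
  ```
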
* `step_dummy_hash()` generates binary indicators (possibly signed) from simple factor or character vectors.
* `step_tokenize()` has gotten a couple of cousin functions: `step_tokenize_bpe()`, `step_tokenize_sentencepiece()`, and `step_tokenize_wordpiece()`, which wrap {tokenizers.bpe}, {sentencepiece}, and {wordpiece} respectively. (#147)
* Added `all_tokenized()` and `all_tokenized_predictors()` to more easily select tokenized columns. (#132)
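
  For example (toy data; the selectors pick up every tokenlist column):

  ```r
  library(textrecipes)

  df <- data.frame(text = c("one example", "two more examples"))

  rec <- recipe(~ text, data = df) |>
    step_tokenize(text) |>
    step_tokenfilter(all_tokenized_predictors(), max_tokens = 10) |>
    step_tf(all_tokenized_predictors())
  ```
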
* Use `show_tokens()` to more easily debug a recipe involving tokenization.
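
  For example:

  ```r
  library(textrecipes)

  recipe(~ text, data = data.frame(text = "the quick brown fox")) |>
    step_tokenize(text) |>
    show_tokens(text)  # prints the tokens produced for the `text` column
  ```
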
* Reorganized documentation for all recipe step tidy methods. (#126)
* Steps now have a dedicated subsection detailing what happens when `tidy()` is applied. (#163)
* All recipe steps now officially support empty selections to be more aligned with dplyr and other packages that use tidyselect. (#141)
* `step_ngram()` has been given a speed increase to put it in line with other packages' performance.
* `step_tokenize()` will now try to error if the vocabulary size is too low when using `engine = "tokenizers.bpe"`. (#119)
* The warning given by `step_tokenfilter()` when filtering failed to apply now correctly refers to the right argument name. (#137)
* `step_tf()` now returns 0 instead of NaN when there aren't any tokens present. (#118)
* `step_tokenfilter()` now has a new argument, `filter_fun`, which takes a function that can be used to filter tokens. (#164)
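
  A sketch, assuming `filter_fun` receives each row's tokens as a character vector and returns the tokens to keep; see `?step_tokenfilter` for the exact contract:

  ```r
  library(textrecipes)

  df <- data.frame(text = "the quick brown fox jumps over the lazy dog")

  rec <- recipe(~ text, data = df) |>
    step_tokenize(text) |>
    # assumed contract: character vector in, kept tokens out
    step_tokenfilter(text, filter_fun = function(x) x[nchar(x) > 3])
  ```
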
* `tidy.step_stem()` now correctly shows whether a custom stemmer was used.
* Added the `keep_original_cols` argument to `step_lda()`, `step_texthash()`, `step_tf()`, `step_tfidf()`, `step_word_embeddings()`, `step_dummy_hash()`, `step_sequence_onehot()`, and `step_textfeatures()`. (#139)
* The `prefix` argument now creates names according to the pattern `prefix_variablename_name/number`. (#124)
* Fixed a bug in `step_tokenfilter()` and `step_sequence_onehot()` that sometimes caused crashes in R 4.1.0.
* `step_lda()` now takes a tokenlist instead of a character variable. See the readme for more detail.
* `step_sequence_onehot()` now takes tokenlists as input.
* Added the `tokenizers.bpe` engine to `step_tokenize()`.
* Added the `udpipe` engine to `step_tokenize()`.
* Added `step_clean_names()` and `step_clean_levels()`. (#101)
* `step_ngram()` gained an argument `min_num_tokens` to be able to return multiple n-grams together. (#90)
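
  For example, returning uni-, bi-, and trigrams in one pass (toy data):

  ```r
  library(textrecipes)

  df <- data.frame(text = "a b c d")

  rec <- recipe(~ text, data = df) |>
    step_tokenize(text) |>
    step_ngram(text, num_tokens = 3, min_num_tokens = 1) |>
    step_tf(text)

  bake(prep(rec), new_data = NULL)
  ```
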
* Added `step_text_normalization()` to perform unicode normalization on character vectors. (#86)
* `step_word_embeddings()` got an argument `aggregation_default` to specify the value used in cases where no words match the embedding.
* `step_tokenize()` got an `engine` argument to specify packages other than tokenizers to tokenize with.
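
  A sketch of the argument; `"tokenizers"` is the default, and non-default engines such as spacyr (next entry) need their backing package installed and set up:

  ```r
  library(textrecipes)

  rec <- recipe(~ text, data = data.frame(text = "some example text")) |>
    step_tokenize(text, engine = "spacyr")
  ```
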
* spacyr has been added as an engine to `step_tokenize()`.
* `step_lemma()` has been added to extract the lemma attribute from tokenlists.
* `step_pos_filter()` has been added to allow filtering of tokens based on their part of speech tags.
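
  A sketch combining the spacyr engine with the new lemma step (it requires a working spacyr installation, so it is not run here; `step_pos_filter()` slots into the same kind of pipeline):

  ```r
  library(textrecipes)

  rec <- recipe(~ text, data = data.frame(text = "The cats are running.")) |>
    step_tokenize(text, engine = "spacyr") |>  # provides lemma and POS attributes
    step_lemma(text) |>
    step_tf(text)
  ```
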
* `step_ngram()` has been added to generate n-grams from tokenlists.
* `step_stem()` now correctly uses the `options` argument. (Thanks to @grayskripko for finding the bug, #64)
* `step_word2vec()` has been renamed to `step_lda()` to reflect what is actually happening.
* `step_word_embeddings()` has been added. Allows for use of pre-trained word embeddings to convert token columns to vectors in a high-dimensional "meaning" space. (@jonthegeek, #20)
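
  A sketch with a made-up embedding table; real use would pass pre-trained embeddings (e.g. from the textdata package), and the column names here are purely illustrative:

  ```r
  library(textrecipes)
  library(tibble)

  embeddings <- tibble(
    tokens = c("cat", "dog"),  # first column: tokens
    d1     = c(0.1, 0.4),      # remaining columns: embedding dimensions
    d2     = c(0.3, 0.2)
  )

  rec <- recipe(~ text, data = data.frame(text = "cat and dog")) |>
    step_tokenize(text) |>
    step_word_embeddings(text, embeddings = embeddings)

  bake(prep(rec), new_data = NULL)
  ```
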
* `step_tfidf()` calculations are slightly changed due to a flaw in the original implementation: https://github.com/dselivanov/text2vec/issues/280.
* `step_textfeatures()` has been added; allows multiple numerical features to be pulled from text.
* `step_sequence_onehot()` has been added; allows one-hot encoding of sequences of fixed width.
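
  A sketch in the current API, where each document is padded or truncated to a fixed number of tokens:

  ```r
  library(textrecipes)

  df <- data.frame(text = c("a short sentence",
                            "another example sentence right here"))

  rec <- recipe(~ text, data = df) |>
    step_tokenize(text) |>
    step_sequence_onehot(text, sequence_length = 5)

  bake(prep(rec), new_data = NULL)
  ```
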
* `step_word2vec()` has been added; calculates word2vec dimensions.
* `step_tokenmerge()` has been added; combines multiple list columns into one list column.
* `step_texthash()` now correctly accepts the `signed` argument.
* `step_tf()` and `step_tfidf()`.
* First CRAN version.