tune_args() would error if argument to
step had a parsnip object with tuned arguments. (#1506)
step_select() has started its deprecation process.
See ?step_select() for alternatives. (#1488)
The strings_as_factors argument of
prep.recipe() has been soft-deprecated in favor of
recipe(strings_as_factors). If both are provided, the value
in recipe() takes precedence. This allows control of recipe
behavior within a workflow, which wasn’t previously possible. (@smingerson, #331,
#287)
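A minimal sketch of setting the option on the recipe itself, assuming a recipes version that includes this change; the data frame `dat` is made up for illustration:

```r
library(recipes)

# Hypothetical data with one character predictor
dat <- data.frame(
  y = c(1.2, 3.4, 5.6, 7.8),
  x = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Set the option on the recipe itself rather than in prep()
rec <- recipe(y ~ x, data = dat, strings_as_factors = FALSE)

baked <- bake(prep(rec), new_data = NULL)
class(baked$x)
```

Because the setting lives on the recipe object, it travels with the recipe when the recipe is added to a workflow.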
step_mutate_at() has been superseded in favor of
step_mutate() when used with across().
(#662)
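As a rough illustration of the replacement pattern (the choice of columns and of `log` is arbitrary, not taken from the release notes):

```r
library(recipes)

# Log-transform two columns with step_mutate() and dplyr::across()
rec <- recipe(~ ., data = iris) %>%
  step_mutate(dplyr::across(c(Sepal.Length, Sepal.Width), log))

baked <- bake(prep(rec), new_data = NULL)
head(baked)
```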
step_num2factor() has gotten improved documentation
to avoid getting NAs as output. (#575)
step_impute_bag() now has a much smaller memory
footprint when prepped. (#638)
The following arguments in steps can now take bare names as input
instead of strings, calls to vars(),
imp_vars(), and denom_vars(). (#1225)
step_classdist_shrunken(class)
step_classdist(class)
step_depth(class)
step_impute_bag(impute_with)
step_impute_knn(impute_with)
step_impute_linear(impute_with)
step_pls(outcome)
step_profile(profile)
step_ratio(denom)
More informative error for some failures of
step_impute_bag(). (#209)
step_impute_bag() and step_impute_knn()
now give more informative warnings when impute_with data
contains all NAs. (#1385)
step_spline_b(), step_spline_convex(),
step_spline_monotone(), step_spline_natural(),
and step_spline_nonnegative() now give informative errors
when applied to zero variance predictors. (#1455)
step_dummy() has gained a contrasts
argument. This change soft-deprecates the use of
getOption("contrasts") with step_dummy().
(#1349)
Fixed printing for step_geodist() when no variables
are selected. (#1423)
Fixed bug where extract_fit_time() would throw a
warning when the recipe didn’t have any steps. (#1475)
step_interact() now works with empty selections
instead of erroring. (#1417)
Fixed bug where step_nnmf_sparse() required that the
Matrix package was loaded. (#1141)
Fixed bug where recipe() would error on sf objects.
(#1393)
step_cut() no longer errors on NA values in
bake(). (#1304)
Fixed bug where step_impute_knn() would error on
character vectors when strings_as_factors = TRUE.
(#926)
Make it so recipe.formula() can’t take table objects
as input, in accordance with documentation. (#1416)
Fixed bug where step_lincomb() would remove both
variables if they were identical. (#1357)
Fixed bugs where step_bs(), step_depth(),
step_harmonic(), step_invlogit(),
step_isomap(), step_logit(),
check_range(), step_poly_bernstein(),
step_spline_b(), step_spline_convex(),
step_spline_monotone(), step_spline_natural(), and
step_spline_nonnegative() would error in bake() with
zero-row data. (#1219)
Fixed bug where bake.step_discretize() would error
if applied to a predictor containing only NAs.
(#1350)
Added developer function check_options().
(#1269)
Officially deprecate printer() in favor of
print_step(). (#1243)
recipe(), prep(), and
bake() now work with sparse tibbles. (#1364,
#1366)
recipe(), prep(), and
bake() now work with sparse matrices. (#1364, #1368,
#1369)
The following steps have gained the argument sparse.
When set to "yes", they will produce sparse vectors.
(#1392)
step_count()
step_dummy_extract()
step_dummy_multi_choice()
step_dummy()
step_holiday()
step_indicate_na()
step_regex()
The following steps have been modified to preserve sparsity in their input. (#1395)
step_arrange()
step_filter_missing()
step_filter()
step_impute_mean()
step_impute_median()
step_lag()
step_rename_at()
step_rename()
step_rm()
step_sample()
step_scale()
step_select()
step_shuffle()
step_slice()
step_sqrt()
step_zv()
All steps and checks now require arguments trained,
skip, role, and id at all times.
(#1387)
Example for step_novel() now better illustrates how
it works. (@Edgar-Zamora, #1248)
prep.recipe(..., strings_as_factors = TRUE) now only
converts string variables that have role “predictor” or “outcome”.
(@dajmcdon, #1358,
#1376)
Improved error message for misspelled argument in step functions. (#1318)
recipe() can now take data.frames with list-columns
or sf data.frames as input to data. (#1283)
recipe() will now show a better error when columns are
misspelled in formula (#1283).
add_role() now errors if a column would
simultaneously have roles "outcome" and
"predictor". (#935)
prep() will now error if the ptype of the data
doesn’t match the one that was used to define the recipe. (#793)
Added more documentation in ?selections about how
tidyselect::everything() works in recipes. (#1259)
New extract_fit_time() method has been added that
returns the time it took to train the recipe. (#1071)
step_spline_b(), step_spline_convex(),
step_spline_monotone(), and
step_spline_nonnegative() now throw informative errors if
the degree, deg_free, and
complete_set arguments cause an error. (#1170)
step_mutate() gained .pkgs argument to
specify what packages need to be loaded for step to work.
(#1282)
step_interact() now gives better error if
terms isn’t a formula. (#1299)
The prefix argument of
step_dummy_multi_choice() is now properly documented.
(#1298)
Significant speedup in step_dummy() when applied to
many columns. (#1305)
step_dummy() now gives an informative error on
attempt to generate too many columns to fit in memory. (#828)
step_dummy() and step_unknown() now
throw more informative warnings for unseen levels. (#450)
step_dummy() now throws more informative warnings
for NA values. (#450)
step_date() now accepts "mday" as a
possible feature. (@Edgar-Zamora, #1211)
NA levels in factors aren’t dropped when passed to
recipe(). (#1291)
recipe() no longer crashes when given long formula
expression (#1283).
Fixed bug in step_ns() and step_bs()
where knots field in options argument wasn’t
correctly used. (#1297)
Bug fixed in step_interact() where long formulas
were used. (#1231, #1289)
Fixed documentation mistake where the default value of the
keep_original_cols argument was wrong. (#1314)
Developer helper function recipes_ptype() has been
added, returning expected input data for prep() and
bake() for a given recipe object. (#1329)
Developer helper function recipes_ptype_validate()
has been added, to validate new data is compatible with recipe ptype.
(#793)
Developer helper functions
recipes_names_predictors() and
recipes_names_outcomes() have been added to aid variable
selection in steps. (#1026)
Fixed bug where step_log() broke legacy recipe objects
by indexing names(object) in bake(). (@stufield, #1284)
Minor speed-up and reduced memory consumption for
step_pca() in the bake() stage by reducing
unused multiplications (@jkennel, #1265)
Document that update_role(), add_role()
and remove_role() are applied before steps and checks.
(#778)
Documentation for tidy methods for all steps has been added when missing and improved to describe the return value more accurately. (#936)
step_dummy() will now error if passed character
instead of loudly ignoring them. Only applicable when setting
strings_as_factors = FALSE. (#1233)
It is now documented that step_spline_b() can be
made periodic. (#1223)
prep() now correctly throws a warning when
training argument is set when prepping a prepped recipe,
telling the user that it will be ignored. (#1244)
When errors are thrown about wrongly typed input to steps, the offending variables and their types are now listed. (#1217)
All warnings and errors have been updated to use the cli package for increased clarity and consistency. (#1237)
Added warnings when step_scale(),
step_normalise(), step_center() or
step_range() result in NaN columns. (@mastoffel,
#1221)
Fixed bug where step_factor2string() if
strings_as_factors = TRUE is set in prep().
(#317)
Fixed bug where tidy.step_cut() always returned zero
row tibbles for trained recipes. (#1229)
spline2_apply (#1200)
step_ns(),
step_bs(), step_spline_b(),
step_spline_convex(), step_spline_monotone(),
step_spline_natural(),
step_spline_nonnegative()) would error if baked with 1 row.
(#1191)
step_classdist_shrunken(), a regularized version of
step_classdist(), was added. (#1185)
step_bs() and step_ns() have gained the
keep_original_cols argument. (#1164)
The keep_original_cols argument has been added to
step_classdist(), step_count(),
step_depth(), step_geodist(),
step_indicate_na(), step_interact(),
step_lag(), step_poly(),
step_regex(), step_window(). The default for
each step is set to preserve past behavior. This change should mean that
every step that produces new columns has the
keep_original_cols argument. (#1167)
Fixed bugs where step_classdist(),
step_count(), step_depth(),
step_geodist(), step_interact(),
step_nnmf_sparse(), and step_regex() didn’t
work with empty selection. All steps now leave data unmodified when
having empty selections. (#1142)
step_classdist(), step_count() and
step_depth() no longer return a column with all
NAs with empty selections. (#1142)
step_regex() no longer returns a column with all 0s
with empty selections. (#1142)
The tidy() methods for step_geodist(),
step_nnmf_sparse(), and step_sample() now
correctly return zero-row tibbles when used with empty selections.
(#1144)
step_poly_bernstein(), step_profile(),
step_spline_b(), step_spline_convex(),
step_spline_monotone(), step_spline_natural(),
and step_spline_nonnegative() now correctly return a zero
row tibble when used with empty selection. (#1133)
Fixed bug where the tidy() method for
step_sample() didn’t return an id column.
(#1144)
check_class(), check_missing(),
check_new_values(), check_range(),
step_naomit(), step_poly_bernstein(),
step_spline_b(), step_spline_convex(),
step_spline_monotone(), step_spline_natural(),
step_spline_nonnegative(), and
step_string2factor() now throw an informative error if
needed non-standard role columns are missing during bake().
(#1145)
step_window() now throws an error instead of
silently overwriting if names argument overlaps with
existing columns. (#1172)
step_regex() and step_count() will now
informatively error if name collision occurs. (#1169)
Added developer function remove_original_cols() to
help remove original columns that are no longer needed. (#1149)
Added developer function recipes_remove_cols() to
provide standardized way to remove columns by column names.
(#1155)
Steps with tunable arguments now have those arguments listed in the documentation.
All steps that add new columns will now informatively error if name collision occurs. (#983)
Fixed bug in step_spline_b(),
step_spline_convex(), step_spline_monotone(),
and step_spline_nonnegative() where you weren’t able to tune the
degree argument.
step_range() now correctly performs clipping
on recipes created before 1.0.3. (#1097)
The tidy() methods for step_impute_mean(),
step_impute_median(), and step_impute_mode()
now return the imputed value in a column named value instead of
model. This is in line with the output of
step_impute_lower(). (#826)
Added outside argument to
step_percentile() to determine different ways of handling
values outside the range of the training data.
step_range() is now backwards compatible with
respect to the clipping argument that was added in 1.0.3, and
old saved recipes can now be baked. (#1090)
Updated print methods to use the cli package for formatting. (#426)
Print methods no longer error for untrained recipes with long selections. (#1083)
The recipe, step, and
check methods for generics::tune_args() are
now registered unconditionally (tidymodels/workflows#192).
Added a conditionMessage() method for
recipes_errors to consistently point out which step errors
occurred in when reporting errors. (#1080)
Added missing tidy method for step_intercept() and
step_lag(). (#730)
Errors in prep() and bake() will now
indicate which step caused the error. (#420)
Developer focused check_type() got a new
types argument for more precise checking of column
types.
recipes_extension_check() has been added. This
developer focused function checks that steps have all the required S3
methods.
recipe() now errors more informatively when
data is missing. (#1042)
step_dummy() no longer returns integer columns as
there are a number of contrast methods that return fractional values.
(#1053)
Fixed a 0-length recycling bug in
step_dummy_extract() exposed by the development version of
purrr (#1052).
Types of variables have been made granular.
"nominal" has been split into "ordered" and
"unordered" and "numeric" has been split into
"double" and "integer". (#993)
New selectors: all_double(),
all_ordered(), all_unordered(),
all_date() and all_datetime(), in addition to
the existing all_numeric() and all_nominal().
All selectors come with a *_predictors() variant.
(#993)
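A short sketch of one of the new `*_predictors()` variants in use; the choice of `step_normalize()` and the `iris` data is only for illustration:

```r
library(recipes)

# Normalize only the double-typed predictors; the factor outcome is untouched
rec <- recipe(Species ~ ., data = iris) %>%
  step_normalize(all_double_predictors()) %>%
  prep()

baked <- bake(rec, new_data = NULL)
summary(baked$Sepal.Length)
```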
Developer focused .get_data_types() generic has been
added to designate types of columns. Exported for use in extension
packages that deal with types not supported in recipes directly.
(#993)
The step_date() function now defaults to using the
clock package to format day-of-week and month labels. (#1048)
step_range() has gained an argument
clipping that when set to FALSE no longer
clips the data to be between min and
max.
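A small sketch of the effect, using made-up training and new data: with clipping turned off, new values outside the training range are rescaled but not truncated to the target interval.

```r
library(recipes)

train <- data.frame(x = c(0, 5, 10))   # hypothetical training data
new   <- data.frame(x = c(-10, 20))    # values outside the training range

rec <- recipe(~ x, data = train) %>%
  step_range(x, min = 0, max = 1, clipping = FALSE) %>%
  prep()

out <- bake(rec, new_data = new)
out$x
```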
Added documentation regarding developer functions
?developer_functions. (#1163)
A new set of basis functions were added:
step_spline_b(), step_spline_convex(),
step_spline_monotone(), step_spline_natural(),
step_spline_nonnegative(), and
step_poly_bernstein().
step_date(), step_dummy(),
step_dummy_extract(), step_holiday(),
step_ordinalscore(), and step_regex() now
return integer results when appropriate. (#766)
The default for the strict argument in
step_integer() has been changed from FALSE to
TRUE. The function will thus return integers, rather than
whole-number numerics, by default. (#766)
The default for the value argument in
step_intercept() has been changed from 1 to
1L. (#766)
step_holiday() didn’t work if it didn’t
have any missing values. (#1019)
Added support for case weights in the following steps
step_center()
step_classdist()
step_corr()
step_dummy_extract()
step_filter_missing()
step_impute_linear()
step_impute_mean()
step_impute_median()
step_impute_mode()
step_normalize()
step_nzv()
step_other()
step_percentile()
step_pca()
step_sample()
step_scale()
A number of developer focused functions to deal with case weights
are added: are_weights_used(),
get_case_weights(), averages(),
medians(), variances(),
correlations(), covariances(), and
pca_wts()
recipes now checks that all columns in the data
supplied to recipe() are also present in the
new_data supplied to bake(). An exception is
made for columns with roles of either "outcome" or
"case_weights", which are typically not required at
bake() time. The new
update_role_requirements() function can be used to adjust
whether or not columns of a particular role are required at
bake() time if you need to opt out of this check
(#1011).
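The following sketch shows one way to opt out of the check for a custom role. The data and the role name "id" are invented for the example:

```r
library(recipes)

# Hypothetical training data with an identifier column
train <- data.frame(y = c(1, 2, 3), x = c(4, 5, 6), id = c("a", "b", "c"))

rec <- recipe(y ~ x + id, data = train) %>%
  update_role(id, new_role = "id") %>%
  # Columns with role "id" are no longer required at bake() time
  update_role_requirements("id", bake = FALSE) %>%
  step_center(x) %>%
  prep()

# `id` (and the outcome `y`) are absent from the new data
out <- bake(rec, new_data = data.frame(x = c(7, 8)))
out
```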
The summary() method for recipe objects now contains
an extra column to indicate which columns are required when
bake() is used.
step_time() has been added that extracts time features
such as hour, minute, or second. (#968)
Fixed bug in which functions that step_hyperbolic()
uses (#932).
step_dummy_multi_choice() now respects factor-levels
of the selected variables when creating dummies. (#916)
step_dummy() now works correctly with recipes trained
on version 0.1.17 or earlier. (#921)
Fixed a bug where setting fresh = TRUE in
prep() wouldn’t result in re-prepping the recipe.
(#492)
Bug was fixed in step_holiday() which used to error
when it was applied to variable with missing values. (#743)
A bug was fixed in step_normalize() which used to
error if 1 variable was selected. (#963)
Finally removed step_upsample() and
step_downsample() in recipes as they are now available in
the themis package.
discretize() and step_discretize() now
can return factor levels similar to cut(). (#674)
step_naomit() now actually has its default for
skip changed to TRUE, as was stated in release
0.1.13. (#934)
step_dummy() has been made more robust to
non-standard column names. (#879)
step_pls() now allows you to use multiple outcomes
if they are numeric. (#651)
step_normalize() and step_scale()
ignore columns with zero variance, generate a warning and suggest to use
step_zv() (#920).
Printing for step_impute_knn() now shows variables
that were imputed instead of variables used for imputing.
(#837)
step_discretize() and discretize() will
automatically remove missing values if keep_na = TRUE,
removing the need to specify keep_na = TRUE and
na.rm = TRUE. (#982)
prep() and bake() check and error if the
output of bake.bake_*() isn’t a tibble.
step_date() now has a locale argument that can be
used to control how the month and dow features
are returned. (#1000)
step_nnmf_sparse() uses a different implementation
of non-negative matrix factorization that is much faster and enables
regularized estimation. (#790)
step_dummy_extract() creates multiple variables from
a character variable by extracting elements using regular expressions
and counting those elements.
step_filter_missing() can filter columns based on
proportion of missingness (#270).
step_percentile() replaces the value of a variable
with its percentile from the training set. (#765)
All recipe steps now officially support empty selections to be
more aligned with dplyr and other packages that use tidyselect (#603,
#531). For example, if a previous step removed all of the columns needed
for a later step, the recipe does not fail when it is estimated (with
the exception of step_mutate()). The documentation in
?selections has been updated with advice for writing
selectors when filtering steps are used. (#813)
Fixed bug in step_harmonic() printing and changed
defaults to role = "predictor" and
keep_original_cols = FALSE (#822).
Improved the efficiency of computations for the Box-Cox transformation (#820).
When a feature extraction step (e.g., step_pca(),
step_ica(), etc.) has zero components specified, the
tidy() method now lists the selected columns in the
terms column.
Deprecation has started for step_nnmf() in favor of
step_nnmf_sparse(). (#790)
Steps now have a dedicated subsection detailing what happens when
tidy() is applied. (#876)
step_ica() now runs fastICA() using a
specific set of random numbers so that initialization is
reproducible.
tidy.recipe() now returns a zero row tibble instead
of an error when applied to an empty recipe. (#867)
step_zv() now has a group argument. The
same filter is applied but looks for zero-variance within 1 or more
columns that define groups. (#711)
detect_step() is no longer restricted to steps
created in recipes (#869).
New extract_parameter_set_dials() and
extract_parameter_dials() methods to extract parameter sets
and single parameters from recipe objects.
step_other() now allows for setting
threshold = 0 which will result in no othering.
(#904)
step_ica() now indirectly uses the
fastICA package since that package has increased their R
version requirement. Recipe objects from previous versions will error
when applied to new data. (#823)
step_kpca*() now directly use the
kernlab package. Recipe objects from previous versions will
error when applied to new data.
bake() will now error if new_data
doesn’t contain all the required columns. (#491)
print_step() instead of printer(). This is
done for a smoother transition to use cli in the next
version. (#871)
Added new step_harmonic() (#702).
Added a new step called step_dummy_multi_choice(),
which will take multiple nominal variables and produces shared dummy
variables. (#716)
The deprecation for step_upsample() and
step_downsample() has been escalated from a deprecation
warning to a deprecation error; these functions are available in the
themis package.
Escalate deprecation for old versions of imputation steps (such
as step_bagimpute()) from a soft deprecation to a regular
deprecation; these imputation steps have new names like
step_impute_bag() (#753).
step_kpca() was un-deprecated and gained the
keep_original_cols argument.
The deprecation of the preserve argument to
step_pls() and step_dummy() was escalated from
a soft deprecation to regular deprecation.
The deprecation of the options argument to
step_nzv() was escalated to a deprecation error.
Fix imputation steps for new data that is all NA,
and generate a warning for recipes created under previous versions that
cannot be imputed with this fix (#719).
A bug was fixed where imputed values via bagged trees would have the wrong levels.
The computations for the Yeo-Johnson transformation were made more efficient (#782).
New recipes_eval_select() which is a developer tool
that is useful for creating new recipes steps. It powers the tidyselect
semantics that are specific to recipes and supports the modern
tidyselect API introduced in tidyselect 1.0.0. Additionally, the older
terms_select() has been deprecated in favor of this new
helper (#739).
Speed-up/simplification to
step_spatialsign()
When only the terms attributes are desired from
model.frame use the first row of data to improve speed and
memory use (#726).
Use Haversine formula for latitude-longitude pairs in
step_geodist() (#725).
Reorganize documentation for all recipe step tidy
methods (#701).
Generate warning when user attempts a Box-Cox transformation of non-positive data (@LiamBlake, #713).
step_logit() gained an offset argument for cases
where the input is either zero or one (#784)
The tidy() methods for objects from
check_new_values(), check_class() and
step_nnmf() are now exported.
Added a new step called step_indicate_na(), which
will create and append additional binary columns to the data set to
indicate which observations are missing (#623).
Added new step_select() (#199).
The threshold argument of step_pca() is
now tunable() (#534).
Integer variables used in step_profile() are now
kept as integers (and not doubles).
Preserve multiple roles in last_term_info so
bake() can correctly respond to has_roles.
(#632)
Fixed behavior of the retain flag in prep()
(#652).
The tidy() method for step_nnmf() was
rewritten since it was not great (#665), and step_nnmf()
now no longer fully loads underlying packages (#685).
Two new selectors that combine role and data type were added:
all_numeric_predictors() and
all_nominal_predictors(). (#620)
Changed the names of all imputation steps, for example, from
step_knnimpute() or step_medianimpute() (old)
to step_impute_knn() or step_impute_median()
(new) (#614).
Added keep_original_cols argument to several
steps:
step_pca(), step_ica(),
step_nnmf(), step_kpca_rbf(),
step_kpca_poly(), step_pls(),
step_isomap() which all default to FALSE
(#635).
step_ratio(), step_holiday(),
step_date() which all default to TRUE to
maintain original behavior, as well as step_dummy() which
defaults to FALSE (#645).
Added allow_rename argument to
recipes_eval_select() (#646).
Performance improvements for step_bs() and
step_ns(). The prep() step no longer evaluates
the basis functions on the training set and the bake()
step only evaluates the basis functions once for each unique input
value (#574)
The neighbors parameter’s default range for
step_isomap() was changed to be 20-80.
The deprecation for step_upsample() and
step_downsample() has been escalated from a soft
deprecation to a regular deprecation; these functions are available in
the themis package.
Re-licensed package from GPL-2 to MIT. See consent from copyright holders here.
The full tidyselect DSL is now allowed inside recipes
step_*() functions. This includes the operators
&, |, - and !
and the new where() function. Additionally, the restriction
preventing user defined selectors from being used has been lifted
(#572).
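A brief sketch of what such a selection can look like; the step and columns are chosen arbitrarily for the example:

```r
library(recipes)

# Center every numeric column except Sepal.Length, combining
# where() with the & and ! operators
rec <- recipe(Species ~ ., data = iris) %>%
  step_center(where(is.numeric) & !Sepal.Length) %>%
  prep()

baked <- bake(rec, new_data = NULL)
colMeans(baked[, 1:4])
```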
If steps that drop/add variables are skipped when baking the test set, the resulting column ordering of the baked test set will now be relative to the original recipe specification rather than relative to the baked training set. This is often more intuitive.
More infrastructure work to make parallel processing on Windows less buggy with PSOCK clusters
fully_trained() now returns FALSE when
an unprepped recipe is used.
prep() gained an option to print a summary of which
columns were added and/or removed during execution.
To reduce confusion between bake() and
juice(), the latter is superseded in favor of using
bake(object, new_data = NULL). The new_data
argument now has no default, so a NULL value must be
explicitly used in order to emulate the results of juice().
juice() will remain in the package (and used internally)
but most communication and training will use
bake(object, new_data = NULL). (#543)
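In practice the change looks like the sketch below, which uses `mtcars` purely as example data:

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_predictors()) %>%
  prep()

# Preferred: ask bake() for the processed training set explicitly
train_baked <- bake(rec, new_data = NULL)

# Superseded but still available
train_juiced <- juice(rec)
```

Both calls return the processed training data, so existing code that uses juice() keeps working.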
Tim Zhou added a step to use linear models for imputation (#555)
step_filter(), step_slice(),
step_sample(), and step_naomit() had their
defaults for skip changed to TRUE. In the vast
majority of applications, these steps should not be applied to the test
or assessment sets.
tidyr version 1.0.0 or later is now
required.
step_pls() was changed so that it uses the
Bioconductor mixOmics package. Objects created with previous versions of
recipes can still use juice() and
bake(). With the current version, the categorical outcomes
can be used, but multivariate models are no longer supported. Also, the new method
allows for sparse results.
As suggested by @StefanBRas, step_ica() now
defaults to the C engine (#518)
Avoided partial matching on seq() arguments in
internal functions.
Improved error messaging, for example when a user tries to
prep() a tuneable recipe.
step_upsample() and step_downsample()
are soft deprecated in recipes as they are now available in the themis
package. They will be removed in the next version.
step_zv() now handles NA values so that
variables with zero variance plus missing values are removed.
The selectors all_of() and any_of() can
now be used in step selections (#477).
The tune package can now use recipes with
check operations (but also requires tune >=
0.1.0.9000).
The tidy method for step_pca() now has
an option for returning the variance statistics for each
component.
While recipes does not directly depend on
dials, it has several S3 methods for generics in
dials. Version 0.0.5 of dials added stricter
validation for these methods, so changes were required for
recipes.
step_cut() enables you to create a factor from a
numeric based on provided break (contributed by Edwin Thoen)
yj_trans() to yj_transform() to
avoid conflicts.
Added flexible naming options for new columns created by
step_depth() and step_classdist()
(#262).
Small changes for base R’s stringsAsFactors
change.
Delayed S3 method registration for tune::tunable()
methods that live in recipes will now work correctly on R >=4.0.0 (#439, tidymodels/tune#146).
step_relevel() added.
recipes 0.1.8
The imputation steps do not change the data type being imputed now. Previously, if the data were integer, the data would be changed to numeric (for some step types). The change is breaking since the underlying data of imputed values are now saved as a list instead of a vector (for some step types).
The data sets were moved to the new modeldata
package.
step_num2factor() was rewritten due to a bug that
ignored the user-supplied levels (#425). The
results of the transform argument are now required to be a
function and levels must now be supplied.
Using a minus in the formula to recipes() is no
longer allowed (it didn’t remove variables anyway).
step_rm() or update_role() can be used
instead.
When using a selector that returns no columns,
juice() and bake() will now return a tibble
with as many rows as the original template data or the
new_data respectively. This is more consistent with how
selectors work in dplyr (#411).
Code was added to explicitly register tunable
methods when recipes is loaded. This is required because of
changes occurring in R 4.0.
check_class() checks if a variable is of the
designated class. Class is either learned from the train set or provided
in the check. (contributed by Edwin Thoen)
step_normalize() and step_scale()
gained a factor argument with values of 1 or 2 that can
scale the standard deviations used to transform the data. (#380)
bake() now produces a tibble with columns in the
same order as juice() (#365)
recipes 0.1.7
Release driven by changes in tidyr (v 1.0.0).
format_selector()’s wdth argument has been
renamed to width (#250).
step_mutate_at(), step_rename(), and
step_rename_at() were added.
The use of varying() will be deprecated in favor of
an upcoming function tune(). No changes are needed in this
version, but subsequent versions will work with
tune().
format_ch_vec() and format_selector()
are now exported (#250).
check_new_values breaks bake if
variable contains values that were not observed in the train set
(contributed by Edwin Thoen)
When no outcomes are in the recipe, using
juice(object, all_outcomes()) and
bake(object, new_data, all_outcomes()) will return a tibble
with zero rows and zero columns (instead of failing). (#298). This
will also occur when the selectors select no columns.
As alternatives to step_kpca(), two separate steps
were added called step_kpca_rbf() and
step_kpca_poly(). The use of step_kpca() will
print a deprecation message that it will be going away.
step_nzv() and step_poly() had
arguments promoted out of their options slot.
options can be used in the short term but is
deprecated.
step_downsample() will replace the
ratio argument with under_ratio and
step_upsample() will replace it with
over_ratio. ratio still works (for now) but
issues a deprecation message.
step_discretize() has arguments moved out of
options too; the main arguments are now
num_breaks (instead of cuts) and
min_unique. Again, deprecation messages are issued with the
old argument structure.
Models using the dimRed package
(step_kpca(), step_isomap(), and
step_nnmf()) would silently fail if the projection method
failed. An error is issued now.
Methods were added for a future generic called
tunable(). This outlines which parameters in a step
can/could be tuned.
recipes 0.1.6
Release driven by changes in rlang.
Since 2018, a warning has been issued when the wrong argument was
used in bake(recipe, newdata). The deprecation period is
over and new_data is officially required.
Previously, if step_other() did not
collapse any levels, it would still add an “other” level to the factor.
This would lump new factor levels into “other” when data were baked (as
step_novel() does). This no longer occurs since it was
inconsistent with ?step_other, which said that
“If no pooling is done the data are unmodified”.
step_normalize() centers and scales the data (if you
are, like Max, too lazy to use two separate steps).
step_unknown() will convert missing data in categorical
columns to “unknown” and update factor levels.
If the threshold argument of step_other is
greater than one then it specifies the minimum sample size before the
levels of the factor are collapsed into the “other” category. (#289)
step_knnimpute() can now pass two options to the
underlying knn code, including the number of threads (#323).
Due to changes by CRAN, step_nnmf() only works on
versions of R >= 3.6.0 due to dependency issues.
step_dummy() and step_other() are now
tolerant to cases where that step’s selectors do not capture any
columns. In this case, no modifications to the data are made. (#290, #348)
step_dummy() can now retain the original columns
that are used to make the dummy variables. (#328)
step_other()’s print method only reports the
variables with collapsed levels (as opposed to any column that was
tested to see if it needed collapsing). (#338)
step_pca(), step_kpca(),
step_ica(), step_nnmf(),
step_pls(), and step_isomap() now accept zero
components. In this case, the original data are returned.
recipes 0.1.5
Small release driven by changes in sample() in the
current r-devel.
A new vignette discussing roles has been added.
To provide infrastructure for finalizing varying parameters, an
update() method for recipe steps has been added. This
allows users to alter information in steps that have not yet been
trained.
step_interact will no longer fail if an interaction
uses a column that has been previously filtered
from the data. A warning is issued when this happens and no interaction
terms will be created.
step_corr was made more fault tolerant for cases
where the data contain a zero-variance column or columns with missing
values.
Set the embedded environment to NULL in
prep.step_dummy to reduce the file size of serialized
recipe class objects when using saveRDS.
tidy method for step_dummy now returns
the original variable and the levels of the future dummy
variables.
NA roles of existing columns (#296).
recipes 0.1.4
Several argument names were changed to be consistent with other
tidymodels packages (e.g. dials) and the
general tidyverse naming conventions.
K in step_knnimpute was changed to
neighbors. step_isomap had the number of
neighbors promoted to a main argument called neighbors.
step_pca, step_pls,
step_kpca, step_ica now use
num_comp instead of num;
step_isomap uses num_terms instead of
num.
step_bagimpute moved nbagg out of the
options and into a main argument trees.
step_bs and step_ns have degrees of freedom
promoted to a main argument with name deg_free. Also,
step_bs had degree promoted to a main
argument.
step_BoxCox and step_YeoJohnson had
nunique change to num_unique.
bake, juice and other functions have
newdata changed to new_data. For this
version only, using newdata will only result in a
warning.
na.rm changed to
na_rm.
prep and a few steps had stringsAsFactors
changed to strings_as_factors.
add_role() can now only add new additional
roles. To alter existing roles, use update_role(). This
change also allows for the possibility of having multiple roles/types
for one variable. #221
All steps gain an id field that will be used in the
future to reference other steps.
The retain option to prep is now
defaulted to TRUE. If verbose = TRUE, the
approximate size of the data set is printed. #207
step_integer converts data to ordered integers similar
to LabelEncoder
#123 and
#185
step_geodist can be used to calculate the distance
between geocodes and a single reference location.
step_arrange, step_filter,
step_mutate, step_sample, and
step_slice implement their dplyr analogs.
step_nnmf computes the non-negative matrix
factorization for data.
The rsample function prepper was moved to
recipes (issue).
step_string2factor will now accept
factors and leave them as-is.
step_knnimpute now excludes missing data in the
variable to be imputed from the nearest-neighbor calculation. Previously,
including them could leave some missing data unimputed (i.e. return
another missing value).
step_dummy now produces a warning (instead of failing)
when non-factor columns are selected. Only factor columns are used; no
conversion is done for character data. issue
#186
dummy_names gained a separator argument. issue
#183
step_downsample and step_upsample now have
seed arguments for more control over randomness.
broom is no longer used to get the tidy
generic. These are now contained in the generics
package.
recipes 0.1.3
check_range breaks bake if variable
range in new data is outside the range that was learned from the train
set (contributed by Edwin Thoen)
step_lag can lag variables in the data set
(contributed by Alex Hayes).
step_naomit removes rows with missing data for
specific columns (contributed by Alex Hayes).
step_rollimpute can be used to impute data in a
sequence or series by estimating their values within a moving
window.
step_pls can conduct supervised feature extraction
for predictors.
step_log gained an offset
argument.
step_log gained a signed argument
(contributed by Edwin Thoen).
The internal functions sel2char and
printer have been exported to enable other packages
to contain steps.
When training new steps after some steps have been
previously trained, the retain = TRUE option should be set
on previous
invocations of prep.
For step_dummy:
one_hot = TRUE option. Thanks to Davis Vaughan.
contrast option was removed. The step uses the
global option for contrasts.
step_other will now convert novel levels of the
factor to the “other” level.
step_bin2factor now has an option to choose how the values
are translated to the levels (contributed by Michael Levy).
bake and juice can now export basic
data frames.
The okc data were updated with two additional
columns.
issue 125 that prevented several steps from working with dplyr grouped data frames. (contributed by Jeffrey Arnold)
issue
127 where options to step_discretize were not being
passed to discretize.
recipes 0.1.2
Edwin Thoen suggested adding validation
checks for certain data characteristics. This fed into the existing
notion of expanding recipes beyond steps (see the non-step steps
project). A new set of operations, called
checks, can now be used. These should
throw an informative error when the check conditions are not met and
return the existing data otherwise.
Steps now have a skip option that will not apply
preprocessing when bake is used. See the article on skipping
steps for more information.
check_missing will validate that none of the
specified variables contain missing data.
detect_step can be used to check if a recipe
contains a particular preprocessing operation.
step_num2factor can be used to convert numeric data
(especially integers) to factors.
step_novel adds a new factor level to nominal
variables that will be used when new data contain a level that did not
exist when the recipe was prepared.
step_profile can be used to generate design matrix
grids for prediction profile plots of additive models where one variable
is varied over a grid and all of the others are fixed at a single
value.
step_downsample and step_upsample can
be used to change the number of rows in the data based on the frequency
distributions of a factor variable in the training set. By default, this
operation is only applied to the training set; bake ignores
this operation.
step_naomit drops rows when specified columns
contain NA, similar to
tidyr::drop_na.
step_lag allows for the creation of lagged predictor
columns.
step_spatialsign now has the option of removing missing
data prior to computing the norm.
recipes 0.1.1
bake was changed from
all_predictors() to everything().
verbose option for prep is now
defaulted to FALSE
step_dummy was fixed that makes sure that the correct
binary variables are generated despite the levels or values of the
incoming factor. Also, step_dummy now requires factor
inputs.
step_dummy also has a new default naming function that
works better for factors. However, there is an extra argument
(ordinal) now to the functions that can be passed to
step_dummy.
step_interact now allows for selectors
(e.g. all_predictors() or
starts_with("prefix")) to be used in the interaction
formula.
step_YeoJohnson gained an na.rm
option.
dplyr::one_of
was added to the list of selectors.
step_bs adds B-spline basis functions.
step_unorder converts ordered factors to unordered
factors.
step_count counts the number of instances that a
pattern exists in a string.
step_string2factor and step_factor2string
can be used to move between encodings.
step_lowerimpute is for numeric data where the values
cannot be measured below a specific value. For these cases, random
uniform values are used for the truncated values.
step_zv).
tidy methods were added for recipes and
many (but not all) steps.
bake.recipe, the argument newdata is
now without a default.
bake and juice can now save the
final processed data set in sparse
format. Note that, as the steps are processed, a non-sparse data
frame is used to store the results.
recipes 0.1.0
First CRAN release.
prepare to prep per issue
#59
recipes 0.0.1.9003
learn has become prepare and
process has become bake
recipes 0.0.1.9002
step_lincomb removes variables involved in linear
combinations to resolve them.
step_bin2factor)
step_regex applies a regular expression to a character
or factor vector to create dummy variables.
step_dummy and step_interact do a better
job of respecting missing values in the data set.
recipes 0.0.1.9001
recipe objects was changed so that
pipes can be
used to create the recipe with a formula.
process.recipe lost the role argument in
favor of a general set of selectors.
If no selector is used, all the predictors are returned.