The first vignette showed how featdelta helps move
features created in R into a database table. This workshop focuses on
the part of the workflow that most data scientists will edit most often:
the feature definitions.
Feature definitions are the bridge between exploratory R code and a
repeatable feature pipeline. When features are kept only as lines inside
an ad-hoc script, they are harder to review, reuse, test, and refresh.
With fd_define(), the feature logic becomes a structured
object. That object can then be computed locally with
fd_compute() or passed to fd_run() for the
full database pipeline.
In this workshop, we will build feature definitions step by step,
working up from single expressions to multi-column steps with fd_block().
We will use a small version of the built-in mtcars
dataset. The data is simple enough to inspect directly, but it lets us
demonstrate the same patterns used in larger feature engineering
projects.
library(featdelta)
raw_cars <- mtcars
raw_cars$car_id <- seq_len(nrow(raw_cars))
raw_cars <- raw_cars[, c("car_id", "mpg", "cyl", "disp", "hp", "wt", "am")]
head(raw_cars)
#> car_id mpg cyl disp hp wt am
#> Mazda RX4 1 21.0 6 160 110 2.620 1
#> Mazda RX4 Wag 2 21.0 6 160 110 2.875 1
#> Datsun 710 3 22.8 4 108 93 2.320 1
#> Hornet 4 Drive 4 21.4 6 258 110 3.215 0
#> Hornet Sportabout 5 18.7 8 360 175 3.440 0
#> Valiant 6 18.1 6 225 105 3.460 0
The car_id column is the key. It identifies each row and
will be preserved in the computed feature table.
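Because every later step relies on car_id identifying exactly one row, it can be worth checking the key before computing anything. The following is a minimal base R sketch. It does not call featdelta, and the check is a suggestion rather than something the package is known to require.

```r
# Rebuild the example data so this block runs on its own
raw_cars <- mtcars
raw_cars$car_id <- seq_len(nrow(raw_cars))
raw_cars <- raw_cars[, c("car_id", "mpg", "cyl", "disp", "hp", "wt", "am")]

# A usable key has no missing values and no duplicated values
key_ok <- !anyNA(raw_cars$car_id) && anyDuplicated(raw_cars$car_id) == 0
key_ok
```

If key_ok is FALSE, the duplicated or missing identifiers should be resolved before the data is used for feature computation.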
The most direct use of fd_define() is to write one
expression per feature. Each named expression becomes one output
column.
defs_basic <- fd_define(
  # In mtcars, am is coded 0 = automatic, 1 = manual
  transmission = ifelse(am == 1, "manual", "automatic"),
  hp_per_cyl = hp / cyl,
  wt_per_hp = wt / hp
)
defs_basic
#> <featdelta_defs>
#> Definition steps (3):
#> - [column] transmission -> ifelse(am == 1, "manual", "automatic")
#> - [column] hp_per_cyl -> hp/cyl
#> - [column] wt_per_hp -> wt/hp
The printed object gives a quick overview of the feature set. Note that the feature names and expressions are stored together and can be passed around as one object.
Now compute the definitions on the raw data.
features_basic <- fd_compute(
  data = raw_cars,
  defs = defs_basic,
  key = "car_id"
)
head(features_basic)
#> car_id transmission hp_per_cyl wt_per_hp
#> 1 1 manual 18.33333 0.02381818
#> 2 2 manual 18.33333 0.02613636
#> 3 3 manual 23.25000 0.02494624
#> 4 4 automatic 18.33333 0.02922727
#> 5 5 automatic 21.87500 0.01965714
#> 6 6 automatic 17.50000 0.03295238
The output contains the key column plus the computed features. This
is the feature table shape that can later be written to the database by
fd_run().
Definitions are evaluated in order. This means a later feature can
use columns created by earlier definitions in the same
fd_compute() call.
defs_ordered <- fd_define(
  hp_per_cyl = hp / cyl,
  strong_engine = hp_per_cyl > 30,
  engine_label = ifelse(strong_engine, "strong", "regular")
)
features_ordered <- fd_compute(
  data = raw_cars,
  defs = defs_ordered,
  key = "car_id"
)
head(features_ordered)
#> car_id hp_per_cyl strong_engine engine_label
#> 1 1 18.33333 FALSE regular
#> 2 2 18.33333 FALSE regular
#> 3 3 23.25000 FALSE regular
#> 4 4 18.33333 FALSE regular
#> 5 5 21.87500 FALSE regular
#> 6 6 17.50000 FALSE regular
This is useful when your feature engineering naturally has stages.
You can first create a base transformation, then reuse it in flags,
labels, scores, or other derived features. The important habit is to
keep the order intentional. If a later feature uses
hp_per_cyl, then hp_per_cyl must be defined
earlier.
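The ordering rule can be illustrated without featdelta. The sketch below uses a hypothetical helper, eval_in_order(), that evaluates named expressions one at a time and adds each result to the working data before the next step runs. It is only a base R analogy for the principle, not a description of how fd_compute() is implemented.

```r
raw_cars <- mtcars[, c("mpg", "cyl", "hp", "wt")]

# Hypothetical helper: evaluate each named expression against the data,
# storing the result as a new column before moving to the next step
eval_in_order <- function(data, steps) {
  for (name in names(steps)) {
    data[[name]] <- eval(steps[[name]], data)
  }
  data
}

good_order <- list(
  hp_per_cyl    = quote(hp / cyl),
  strong_engine = quote(hp_per_cyl > 30)
)
out <- eval_in_order(raw_cars, good_order)

# Reversing the steps fails: hp_per_cyl does not exist when it is needed
result <- tryCatch(
  eval_in_order(raw_cars, rev(good_order)),
  error = function(e) conditionMessage(e)
)
result
```

Here result holds the error message from the reversed run, which names the missing hp_per_cyl column.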
Sometimes feature definitions are created outside the
fd_define() call. For example, you might keep a small
library of expressions, generate definitions from a configuration file,
or reuse the same expression across projects.
log_hp_expr <- expression(log(hp))
heavy_car_expr <- expression(wt > 3.5)
defs_programmatic <- fd_define(
  log_hp = log_hp_expr,
  heavy_car = heavy_car_expr
)
features_programmatic <- fd_compute(
  data = raw_cars,
  defs = defs_programmatic,
  key = "car_id"
)
head(features_programmatic)
#> car_id log_hp heavy_car
#> 1 1 4.700480 FALSE
#> 2 2 4.700480 FALSE
#> 3 3 4.532599 FALSE
#> 4 4 4.700480 FALSE
#> 5 5 5.164786 FALSE
#> 6 6 4.653960 FALSE
The benefit is organizational: the feature set can be built from named pieces instead of being rewritten by hand each time. This matters when the feature catalog grows beyond a few simple columns.
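Stored expressions can also be checked on their own before they go into a definitions object. The sketch below keeps a small named list of expressions and evaluates each one against the data with base R's eval(). The list name feature_exprs is illustrative, and the example deliberately avoids assuming anything about how fd_define() accepts a whole list at once.

```r
raw_cars <- mtcars[, c("hp", "wt")]

# A small named catalogue of stored expressions (names are illustrative)
feature_exprs <- list(
  log_hp    = expression(log(hp)),
  heavy_car = expression(wt > 3.5)
)

# Evaluate each stored expression against the data, outside any pipeline.
# e[[1]] extracts the single call held inside the expression() object.
checked <- lapply(feature_exprs, function(e) eval(e[[1]], raw_cars))

str(checked)
```

A check like this catches typos in column names early, while the expression is still a small isolated piece.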
One expression per feature is convenient for small feature sets. In
real projects, however, a single conceptual feature step may naturally
produce several columns. That is what fd_block() is
for.
An fd_block() is a multi-column definition step. It must
return a data.frame, and each column of that data frame
becomes a feature.
defs_block <- fd_define(
  engine_ratios = fd_block({
    data.frame(
      hp_per_cyl = hp / cyl,
      disp_per_cyl = disp / cyl,
      wt_per_hp = wt / hp
    )
  })
)
features_block <- fd_compute(
  data = raw_cars,
  defs = defs_block,
  key = "car_id"
)
head(features_block)
#> car_id hp_per_cyl disp_per_cyl wt_per_hp
#> 1 1 18.33333 26.66667 0.02381818
#> 2 2 18.33333 26.66667 0.02613636
#> 3 3 23.25000 27.00000 0.02494624
#> 4 4 18.33333 43.00000 0.02922727
#> 5 5 21.87500 45.00000 0.01965714
#> 6 6 17.50000 37.50000 0.03295238
This is useful when you want one named definition step, such as
engine_ratios, to produce a small family of related
columns.
The block body does not have to be a single data.frame()
call. It can be a small R script. You can create temporary variables,
reuse intermediate calculations, and return only the final columns you
want to store.
defs_script_block <- fd_define(
  engine_script = fd_block({
    hp_per_cyl <- hp / cyl
    disp_per_cyl <- disp / cyl
    ratio_average <- (hp_per_cyl + disp_per_cyl) / 2
    high_ratio <- ratio_average > stats::median(ratio_average, na.rm = TRUE)
    data.frame(
      hp_per_cyl = hp_per_cyl,
      disp_per_cyl = disp_per_cyl,
      engine_ratio_average = ratio_average,
      high_engine_ratio = high_ratio
    )
  })
)
features_script_block <- fd_compute(
  data = raw_cars,
  defs = defs_script_block,
  key = "car_id"
)
head(features_script_block)
#> car_id hp_per_cyl disp_per_cyl engine_ratio_average high_engine_ratio
#> 1 1 18.33333 26.66667 22.50000 FALSE
#> 2 2 18.33333 26.66667 22.50000 FALSE
#> 3 3 23.25000 27.00000 25.12500 FALSE
#> 4 4 18.33333 43.00000 30.66667 TRUE
#> 5 5 21.87500 45.00000 33.43750 TRUE
#> 6 6 17.50000 37.50000 27.50000 FALSE
This pattern is often easier to read than forcing every intermediate expression into a separate top-level feature. Temporary variables stay inside the block, while the returned data frame defines the columns that become part of the final feature table.
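One way to prototype a script-style block body is to run it with base R's with(), which also evaluates code with the data's columns in scope. This is only an approximation for drafting purposes; whether fd_block() evaluates its body in exactly the same way is an assumption.

```r
raw_cars <- mtcars[, c("cyl", "disp", "hp")]

prototype <- with(raw_cars, {
  # Temporary variables live only inside this with() call
  hp_per_cyl <- hp / cyl
  disp_per_cyl <- disp / cyl
  ratio_average <- (hp_per_cyl + disp_per_cyl) / 2
  data.frame(
    hp_per_cyl = hp_per_cyl,
    disp_per_cyl = disp_per_cyl,
    engine_ratio_average = ratio_average
  )
})

# Only the returned data frame survives, with one row per input row
names(prototype)
nrow(prototype) == nrow(raw_cars)
```

Once the body produces the expected columns, it can be moved into the block unchanged.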
As feature logic grows, it is often better to move it into a regular
R function. This is especially helpful when you want to test the feature
script separately, reuse it across projects, or keep the
fd_define() call compact.
make_engine_features <- function(data) {
  hp_per_cyl <- data$hp / data$cyl
  disp_per_cyl <- data$disp / data$cyl
  data.frame(
    hp_per_cyl = hp_per_cyl,
    disp_per_cyl = disp_per_cyl,
    engine_index = hp_per_cyl + disp_per_cyl
  )
}
defs_function_block <- fd_define(
  engine_features = fd_block(make_engine_features)
)
features_function_block <- fd_compute(
  data = raw_cars,
  defs = defs_function_block,
  key = "car_id"
)
head(features_function_block)
#> car_id hp_per_cyl disp_per_cyl engine_index
#> 1 1 18.33333 26.66667 45.00000
#> 2 2 18.33333 26.66667 45.00000
#> 3 3 23.25000 27.00000 50.25000
#> 4 4 18.33333 43.00000 61.33333
#> 5 5 21.87500 45.00000 66.87500
#> 6 6 17.50000 37.50000 55.00000
Function-based blocks are a good fit for code that already looks like
a small feature-engineering script. The function receives the current
working data and returns a data.frame of feature
columns.
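Because the block is an ordinary function, it can be tested separately from the pipeline. The sketch below calls make_engine_features() directly on a few rows and asserts the properties a block is expected to have. The specific assertions are suggestions, not checks that featdelta is known to perform.

```r
make_engine_features <- function(data) {
  hp_per_cyl <- data$hp / data$cyl
  disp_per_cyl <- data$disp / data$cyl
  data.frame(
    hp_per_cyl = hp_per_cyl,
    disp_per_cyl = disp_per_cyl,
    engine_index = hp_per_cyl + disp_per_cyl
  )
}

# Call the function directly on a small sample, with no pipeline involved
sample_rows <- head(mtcars, 3)
out <- make_engine_features(sample_rows)

stopifnot(
  is.data.frame(out),
  nrow(out) == nrow(sample_rows),
  identical(names(out), c("hp_per_cyl", "disp_per_cyl", "engine_index")),
  isTRUE(all.equal(out$engine_index, out$hp_per_cyl + out$disp_per_cyl))
)
```

The same assertions could live in a test file so that changes to the feature logic are caught before the definitions are used elsewhere.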
Some feature sets are not known column by column in advance. For example, you might want to apply the same transformation to a selected group of numeric variables. A function-based block can generate those columns in a loop.
make_scaled_features <- function(data) {
  vars <- c("hp", "disp", "wt")
  out <- list()
  for (var in vars) {
    center <- mean(data[[var]], na.rm = TRUE)
    spread <- stats::sd(data[[var]], na.rm = TRUE)
    out[[paste0(var, "_scaled")]] <- (data[[var]] - center) / spread
  }
  as.data.frame(out)
}
defs_loop_block <- fd_define(
  scaled_inputs = fd_block(make_scaled_features)
)
features_loop_block <- fd_compute(
  data = raw_cars,
  defs = defs_loop_block,
  key = "car_id"
)
head(features_loop_block)
#> car_id hp_scaled disp_scaled wt_scaled
#> 1 1 -0.5350928 -0.57061982 -0.610399567
#> 2 2 -0.5350928 -0.57061982 -0.349785269
#> 3 3 -0.7830405 -0.99018209 -0.917004624
#> 4 4 -0.5350928 0.22009369 -0.002299538
#> 5 5 0.4129422 1.04308123 0.227654255
#> 6 6 -0.6080186 -0.04616698 0.248094592
This pattern is useful when the number of output columns depends on a vector of input names, a configuration object, or another piece of project logic. The important rule remains the same: the block must return a data frame with one row per input row.
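Loop-generated columns are easy to get subtly wrong, so it can help to compare them against a known reference. The sketch below checks the hand-written standardization against base R's scale(), which centers by the mean and divides by the standard deviation by default. It runs on mtcars directly and does not involve featdelta.

```r
make_scaled_features <- function(data) {
  vars <- c("hp", "disp", "wt")
  out <- list()
  for (var in vars) {
    center <- mean(data[[var]], na.rm = TRUE)
    spread <- stats::sd(data[[var]], na.rm = TRUE)
    out[[paste0(var, "_scaled")]] <- (data[[var]] - center) / spread
  }
  as.data.frame(out)
}

out <- make_scaled_features(mtcars)

# Compare each generated column with scale() applied to the same variable
vars <- c("hp", "disp", "wt")
agrees <- vapply(vars, function(v) {
  isTRUE(all.equal(out[[paste0(v, "_scaled")]], as.numeric(scale(mtcars[[v]]))))
}, logical(1))
agrees
```

One caveat of this formula: a constant input column has a standard deviation of zero, so its scaled values become NaN. A production version may want to handle that case explicitly.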
You do not have to choose between ordinary definitions and blocks. A single definition object can contain both. Later steps can also use columns produced by earlier steps, including columns produced by blocks.
defs_combined <- fd_define(
  # In mtcars, am is coded 0 = automatic, 1 = manual
  transmission = ifelse(am == 1, "manual", "automatic"),
  engine_features = fd_block(make_engine_features),
  scaled_inputs = fd_block(make_scaled_features),
  engine_per_weight = engine_index / wt
)
features_combined <- fd_compute(
  data = raw_cars,
  defs = defs_combined,
  key = "car_id"
)
head(features_combined)
#> car_id transmission hp_per_cyl disp_per_cyl engine_index hp_scaled
#> 1 1 manual 18.33333 26.66667 45.00000 -0.5350928
#> 2 2 manual 18.33333 26.66667 45.00000 -0.5350928
#> 3 3 manual 23.25000 27.00000 50.25000 -0.7830405
#> 4 4 automatic 18.33333 43.00000 61.33333 -0.5350928
#> 5 5 automatic 21.87500 45.00000 66.87500 0.4129422
#> 6 6 automatic 17.50000 37.50000 55.00000 -0.6080186
#> disp_scaled wt_scaled engine_per_weight
#> 1 -0.57061982 -0.610399567 17.17557
#> 2 -0.57061982 -0.349785269 15.65217
#> 3 -0.99018209 -0.917004624 21.65948
#> 4 0.22009369 -0.002299538 19.07724
#> 5 1.04308123 0.227654255 19.44041
#> 6 -0.04616698 0.248094592 15.89595
This is where feature definitions become a practical organizing tool. You can keep simple expressions simple, move related feature families into blocks, and still evaluate the full set as one ordered pipeline.
Sometimes you want a block to have an expected output schema. This is useful when the block may return only some columns in some situations, but the database feature table should still have a stable set of columns.
defs_expected <- fd_define(
  optional_engine_flags = fd_block(
    {
      data.frame(
        high_hp = hp > 150
      )
    },
    expected_names = c("high_hp", "high_disp")
  )
)
features_expected <- fd_compute(
  data = raw_cars,
  defs = defs_expected,
  key = "car_id"
)
head(features_expected)
#> car_id high_hp high_disp
#> 1 1 FALSE NA
#> 2 2 FALSE NA
#> 3 3 FALSE NA
#> 4 4 FALSE NA
#> 5 5 TRUE NA
#> 6 6 FALSE NA
The block returned high_hp, but high_disp
was declared as an expected output. fd_compute() includes
the missing expected column and fills it with NA. This can
help when you want the feature table to keep a predictable schema.
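The padding behaviour described above can be pictured with a few lines of base R. The helper pad_to_schema() below is hypothetical and is not part of featdelta; it only mimics the observed result, in which a declared but missing column appears filled with NA.

```r
# Hypothetical helper, not a featdelta function: add any expected column
# that the block did not return, filled with NA, and put expected columns first
pad_to_schema <- function(out, expected_names) {
  missing_cols <- setdiff(expected_names, names(out))
  for (col in missing_cols) {
    out[[col]] <- NA
  }
  out[, union(expected_names, names(out)), drop = FALSE]
}

# The block output from the example: only high_hp is returned
block_out <- data.frame(high_hp = mtcars$hp > 150)
padded <- pad_to_schema(block_out, c("high_hp", "high_disp"))
head(padded)
```

How the package treats columns that a block returns but that are not listed in expected_names is not covered here; this sketch simply keeps them after the expected ones.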
Feature definitions are where the package lets you turn R feature engineering into a reusable pipeline component.
Use ordinary fd_define() expressions when each feature
is simple and readable on one line. Use fd_block() when a
feature step naturally produces several columns, needs temporary
variables, or belongs in a reusable function. Use function-based blocks
when the logic is long enough to test separately or when the output
columns are generated programmatically.
Once the definitions object is ready, the same object can be used in two ways:
# Local computation while developing feature logic
fd_compute(raw_data, defs, key = "id")
# Full database pipeline once the definitions are ready
fd_run(con, sql, defs, key = "id", feat_table_name = "feature_table")
That is the main workflow: develop feature logic in R, store it as a definitions object, test it locally, and then use it in the incremental database pipeline.