Feature definitions workshop

Purpose of this workshop

The first vignette showed how featdelta helps move features created in R into a database table. This workshop focuses on the part of the workflow that most data scientists will edit most often: the feature definitions.

Feature definitions are the bridge between exploratory R code and a repeatable feature pipeline. When features are kept only as lines inside an ad-hoc script, they are harder to review, reuse, test, and refresh. With fd_define(), the feature logic becomes a structured object. That object can then be computed locally with fd_compute() or passed to fd_run() for the full database pipeline.

In this workshop, we will build feature definitions step by step:

simple one-column features;
features that depend on earlier features;
reusable definitions supplied programmatically;
multi-column scripts with fd_block();
function-based blocks that generate an unknown number of columns;
definitions that mix ordinary features and blocks;
blocks with declared expected output columns.

Workshop data

We will use a trimmed copy of the built-in mtcars dataset: all 32 rows, a handful of columns, and an added key column. The data is simple enough to inspect directly, but it lets us demonstrate the same patterns used in larger feature engineering projects.

library(featdelta)

raw_cars <- mtcars
raw_cars$car_id <- seq_len(nrow(raw_cars))

raw_cars <- raw_cars[, c("car_id", "mpg", "cyl", "disp", "hp", "wt", "am")]

head(raw_cars)
#>                   car_id  mpg cyl disp  hp    wt am
#> Mazda RX4              1 21.0   6  160 110 2.620  1
#> Mazda RX4 Wag          2 21.0   6  160 110 2.875  1
#> Datsun 710             3 22.8   4  108  93 2.320  1
#> Hornet 4 Drive         4 21.4   6  258 110 3.215  0
#> Hornet Sportabout      5 18.7   8  360 175 3.440  0
#> Valiant                6 18.1   6  225 105 3.460  0

The car_id column is the key. It identifies each row and will be preserved in the computed feature table.
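
Because the key ties each feature row back to its source row, it can be worth confirming up front that it is complete and unique. A minimal base-R check, independent of featdelta, might look like this (the name key_ok is only illustrative):

```r
# Rebuild the workshop data so this check runs on its own.
raw_cars <- mtcars
raw_cars$car_id <- seq_len(nrow(raw_cars))

# A usable key has no missing values and no duplicates.
key_ok <- !anyNA(raw_cars$car_id) && anyDuplicated(raw_cars$car_id) == 0
key_ok
```

If a check like this fails on real data, it is usually easier to fix the key before writing any feature definitions.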

Start with simple feature definitions

The most direct use of fd_define() is to write one expression per feature. Each named expression becomes one output column.

defs_basic <- fd_define(
  # In mtcars, am is coded 0 = automatic, 1 = manual
  transmission = ifelse(am == 1, "manual", "automatic"),
  hp_per_cyl = hp / cyl,
  wt_per_hp = wt / hp
)

defs_basic
#> <featdelta_defs>
#> Definition steps (3):
#>   - [column] transmission -> ifelse(am == 1, "manual", "automatic")
#>   - [column] hp_per_cyl -> hp/cyl
#>   - [column] wt_per_hp -> wt/hp

The printed object gives a quick overview of the feature set. Note that the feature names and expressions are stored together and can be passed around as one object.

Now compute the definitions on the raw data.

features_basic <- fd_compute(
  data = raw_cars,
  defs = defs_basic,
  key = "car_id"
)

head(features_basic)
#>   car_id transmission hp_per_cyl  wt_per_hp
#> 1      1       manual   18.33333 0.02381818
#> 2      2       manual   18.33333 0.02613636
#> 3      3       manual   23.25000 0.02494624
#> 4      4    automatic   18.33333 0.02922727
#> 5      5    automatic   21.87500 0.01965714
#> 6      6    automatic   17.50000 0.03295238

The output contains the key column plus the computed features. This is the feature table shape that can later be written to the database by fd_run().

Use earlier features in later features

Definitions are evaluated in order. This means a later feature can use columns created by earlier definitions in the same fd_compute() call.

defs_ordered <- fd_define(
  hp_per_cyl = hp / cyl,
  strong_engine = hp_per_cyl > 30,
  engine_label = ifelse(strong_engine, "strong", "regular")
)

features_ordered <- fd_compute(
  data = raw_cars,
  defs = defs_ordered,
  key = "car_id"
)

head(features_ordered)
#>   car_id hp_per_cyl strong_engine engine_label
#> 1      1   18.33333         FALSE      regular
#> 2      2   18.33333         FALSE      regular
#> 3      3   23.25000         FALSE      regular
#> 4      4   18.33333         FALSE      regular
#> 5      5   21.87500         FALSE      regular
#> 6      6   17.50000         FALSE      regular

This is useful when your feature engineering naturally has stages. You can first create a base transformation, then reuse it in flags, labels, scores, or other derived features. The important habit is to keep the order intentional. If a later feature uses hp_per_cyl, then hp_per_cyl must be defined earlier.
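
To see why the order matters, here is a conceptual base-R sketch of sequential evaluation. It is not featdelta's implementation, and eval_in_order is a hypothetical helper: each expression is evaluated against the working data, and its result is added as a column before the next step runs.

```r
# Conceptual sketch only: evaluate named expressions one at a time,
# adding each result to the working data before the next step runs.
eval_in_order <- function(data, steps) {
  for (name in names(steps)) {
    # baseenv() as enclosure: only data columns and base functions are visible.
    data[[name]] <- eval(steps[[name]], envir = data, enclos = baseenv())
  }
  data
}

steps_ok <- list(
  hp_per_cyl    = quote(hp / cyl),
  strong_engine = quote(hp_per_cyl > 30)
)
res_ok <- eval_in_order(mtcars, steps_ok)

# Reversing the order fails because hp_per_cyl does not exist yet.
res_bad <- tryCatch(
  eval_in_order(mtcars, rev(steps_ok)),
  error = function(e) conditionMessage(e)
)
res_bad
```

In the reversed run, strong_engine is evaluated first and cannot find hp_per_cyl, so res_bad holds an error message instead of a data frame.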

Keep programmatic definitions reusable

Sometimes feature definitions are created outside the fd_define() call. For example, you might keep a small library of expressions, generate definitions from a configuration file, or reuse the same expression across projects.

log_hp_expr <- expression(log(hp))
heavy_car_expr <- expression(wt > 3.5)

defs_programmatic <- fd_define(
  log_hp = log_hp_expr,
  heavy_car = heavy_car_expr
)

features_programmatic <- fd_compute(
  data = raw_cars,
  defs = defs_programmatic,
  key = "car_id"
)

head(features_programmatic)
#>   car_id   log_hp heavy_car
#> 1      1 4.700480     FALSE
#> 2      2 4.700480     FALSE
#> 3      3 4.532599     FALSE
#> 4      4 4.700480     FALSE
#> 5      5 5.164786     FALSE
#> 6      6 4.653960     FALSE

The benefit is organizational: the feature set can be assembled from named, reusable pieces instead of being rewritten by hand each time. This matters once the feature catalog grows beyond a few simple columns.
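
As one illustration of where such pieces might come from, the sketch below turns a hypothetical text configuration into expression objects with base R's parse(). The names feature_config, feature_exprs, and preview are assumptions made for this example. How a whole list of expressions would be passed to fd_define() is not covered here, so the sketch stops at building and previewing the expressions.

```r
# Hypothetical configuration: feature name -> R code stored as text.
feature_config <- c(
  log_hp    = "log(hp)",
  heavy_car = "wt > 3.5"
)

# parse() returns "expression" objects, the same class that expression() creates.
feature_exprs <- lapply(feature_config, function(code) {
  parse(text = code, keep.source = FALSE)
})

# Preview each piece against the raw columns before using it in a definition.
preview <- lapply(feature_exprs, function(e) eval(e[[1]], envir = mtcars))
str(preview)
```

Previewing the expressions this way catches typos in the configuration early, before they reach a definitions object.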

Use fd_block() when one feature step returns several columns

One expression per feature is convenient for small feature sets. In real projects, however, a single conceptual feature step may naturally produce several columns. That is what fd_block() is for.

An fd_block() is a multi-column definition step. It must return a data.frame with one row per input row, and each column of that data frame becomes a feature.

defs_block <- fd_define(
  engine_ratios = fd_block({
    data.frame(
      hp_per_cyl = hp / cyl,
      disp_per_cyl = disp / cyl,
      wt_per_hp = wt / hp
    )
  })
)

features_block <- fd_compute(
  data = raw_cars,
  defs = defs_block,
  key = "car_id"
)

head(features_block)
#>   car_id hp_per_cyl disp_per_cyl  wt_per_hp
#> 1      1   18.33333     26.66667 0.02381818
#> 2      2   18.33333     26.66667 0.02613636
#> 3      3   23.25000     27.00000 0.02494624
#> 4      4   18.33333     43.00000 0.02922727
#> 5      5   21.87500     45.00000 0.01965714
#> 6      6   17.50000     37.50000 0.03295238

This is useful when you want one named definition step, such as engine_ratios, to produce a small family of related columns.

Write a small script inside fd_block()

The block body does not have to be a single data.frame() call. It can be a small R script. You can create temporary variables, reuse intermediate calculations, and return only the final columns you want to store.

defs_script_block <- fd_define(
  engine_script = fd_block({
    hp_per_cyl <- hp / cyl
    disp_per_cyl <- disp / cyl

    ratio_average <- (hp_per_cyl + disp_per_cyl) / 2
    high_ratio <- ratio_average > stats::median(ratio_average, na.rm = TRUE)

    data.frame(
      hp_per_cyl = hp_per_cyl,
      disp_per_cyl = disp_per_cyl,
      engine_ratio_average = ratio_average,
      high_engine_ratio = high_ratio
    )
  })
)

features_script_block <- fd_compute(
  data = raw_cars,
  defs = defs_script_block,
  key = "car_id"
)

head(features_script_block)
#>   car_id hp_per_cyl disp_per_cyl engine_ratio_average high_engine_ratio
#> 1      1   18.33333     26.66667             22.50000             FALSE
#> 2      2   18.33333     26.66667             22.50000             FALSE
#> 3      3   23.25000     27.00000             25.12500             FALSE
#> 4      4   18.33333     43.00000             30.66667              TRUE
#> 5      5   21.87500     45.00000             33.43750              TRUE
#> 6      6   17.50000     37.50000             27.50000             FALSE

This pattern is often easier to read than forcing every intermediate expression into a separate top-level feature. Temporary variables stay inside the block, while the returned data frame defines the columns that become part of the final feature table.

Use function-based blocks for larger feature scripts

As feature logic grows, it is often better to move it into a regular R function. This is especially helpful when you want to test the feature script separately, reuse it across projects, or keep the fd_define() call compact.

make_engine_features <- function(data) {
  hp_per_cyl <- data$hp / data$cyl
  disp_per_cyl <- data$disp / data$cyl

  data.frame(
    hp_per_cyl = hp_per_cyl,
    disp_per_cyl = disp_per_cyl,
    engine_index = hp_per_cyl + disp_per_cyl
  )
}

defs_function_block <- fd_define(
  engine_features = fd_block(make_engine_features)
)

features_function_block <- fd_compute(
  data = raw_cars,
  defs = defs_function_block,
  key = "car_id"
)

head(features_function_block)
#>   car_id hp_per_cyl disp_per_cyl engine_index
#> 1      1   18.33333     26.66667     45.00000
#> 2      2   18.33333     26.66667     45.00000
#> 3      3   23.25000     27.00000     50.25000
#> 4      4   18.33333     43.00000     61.33333
#> 5      5   21.87500     45.00000     66.87500
#> 6      6   17.50000     37.50000     55.00000

Function-based blocks are a good fit for code that already looks like a small feature-engineering script. The function receives the current working data and returns a data.frame of feature columns.
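
Because the block is an ordinary function, it can be checked on its own before it is used in a definition. The following is a minimal sketch of such a check in base R. The function is repeated so the snippet is self-contained, and the specific assertions are only suggestions.

```r
# Same function as above, repeated so this check runs on its own.
make_engine_features <- function(data) {
  hp_per_cyl <- data$hp / data$cyl
  disp_per_cyl <- data$disp / data$cyl

  data.frame(
    hp_per_cyl = hp_per_cyl,
    disp_per_cyl = disp_per_cyl,
    engine_index = hp_per_cyl + disp_per_cyl
  )
}

# Call the function directly, without featdelta, and check its contract.
out <- make_engine_features(mtcars)

stopifnot(
  is.data.frame(out),
  nrow(out) == nrow(mtcars),
  identical(names(out), c("hp_per_cyl", "disp_per_cyl", "engine_index")),
  isTRUE(all.equal(out$engine_index, (mtcars$hp + mtcars$disp) / mtcars$cyl))
)
```

Checks like these could live in a regular test file, so that the feature logic is verified independently of any database connection.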

Generate an unknown number of features in a loop

Some feature sets are not known column by column in advance. For example, you might want to apply the same transformation to a selected group of numeric variables. A function-based block can generate those columns in a loop.

make_scaled_features <- function(data) {
  vars <- c("hp", "disp", "wt")
  out <- list()

  for (var in vars) {
    center <- mean(data[[var]], na.rm = TRUE)
    spread <- stats::sd(data[[var]], na.rm = TRUE)

    out[[paste0(var, "_scaled")]] <- (data[[var]] - center) / spread
  }

  as.data.frame(out)
}

defs_loop_block <- fd_define(
  scaled_inputs = fd_block(make_scaled_features)
)

features_loop_block <- fd_compute(
  data = raw_cars,
  defs = defs_loop_block,
  key = "car_id"
)

head(features_loop_block)
#>   car_id  hp_scaled disp_scaled    wt_scaled
#> 1      1 -0.5350928 -0.57061982 -0.610399567
#> 2      2 -0.5350928 -0.57061982 -0.349785269
#> 3      3 -0.7830405 -0.99018209 -0.917004624
#> 4      4 -0.5350928  0.22009369 -0.002299538
#> 5      5  0.4129422  1.04308123  0.227654255
#> 6      6 -0.6080186 -0.04616698  0.248094592

This pattern is useful when the number of output columns depends on a vector of input names, a configuration object, or another piece of project logic. The important rule remains the same: the block must return a data frame with one row per input row.
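
If the list of variables should come from outside the function, one option is a small function factory. The sketch below is an assumption-laden illustration: make_scaler is a hypothetical helper, not part of featdelta. It returns a function with the same shape as make_scaled_features, taking the working data and returning a data frame.

```r
# Hypothetical helper: returns a block function for a chosen set of columns.
make_scaler <- function(vars) {
  force(vars)
  function(data) {
    out <- lapply(vars, function(var) {
      x <- data[[var]]
      (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
    })
    names(out) <- paste0(vars, "_scaled")
    as.data.frame(out)
  }
}

scale_two <- make_scaler(c("hp", "wt"))
scale_four <- make_scaler(c("hp", "disp", "wt", "mpg"))

# Column count follows the input vector; row count matches the input data.
dim(scale_two(mtcars))
dim(scale_four(mtcars))
```

A function produced this way could presumably be handed to fd_block() in the same way as make_scaled_features, since both take the working data and return a data frame.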

Combine ordinary features and blocks

You do not have to choose between ordinary definitions and blocks. A single definition object can contain both. Later steps can also use columns produced by earlier steps, including columns produced by blocks.

defs_combined <- fd_define(
  transmission = ifelse(am == 1, "manual", "automatic"),
  engine_features = fd_block(make_engine_features),
  scaled_inputs = fd_block(make_scaled_features),
  engine_per_weight = engine_index / wt
)

features_combined <- fd_compute(
  data = raw_cars,
  defs = defs_combined,
  key = "car_id"
)

head(features_combined)
#>   car_id transmission hp_per_cyl disp_per_cyl engine_index  hp_scaled
#> 1      1       manual   18.33333     26.66667     45.00000 -0.5350928
#> 2      2       manual   18.33333     26.66667     45.00000 -0.5350928
#> 3      3       manual   23.25000     27.00000     50.25000 -0.7830405
#> 4      4    automatic   18.33333     43.00000     61.33333 -0.5350928
#> 5      5    automatic   21.87500     45.00000     66.87500  0.4129422
#> 6      6    automatic   17.50000     37.50000     55.00000 -0.6080186
#>   disp_scaled    wt_scaled engine_per_weight
#> 1 -0.57061982 -0.610399567          17.17557
#> 2 -0.57061982 -0.349785269          15.65217
#> 3 -0.99018209 -0.917004624          21.65948
#> 4  0.22009369 -0.002299538          19.07724
#> 5  1.04308123  0.227654255          19.44041
#> 6 -0.04616698  0.248094592          15.89595

This is where feature definitions become a practical organizing tool. You can keep simple expressions simple, move related feature families into blocks, and still evaluate the full set as one ordered pipeline.

Declare expected block outputs when useful

Sometimes you want a block to have an expected output schema. This is useful when the block may return only some columns in some situations, but the database feature table should still have a stable set of columns.

defs_expected <- fd_define(
  optional_engine_flags = fd_block(
    {
      data.frame(
        high_hp = hp > 150
      )
    },
    expected_names = c("high_hp", "high_disp")
  )
)

features_expected <- fd_compute(
  data = raw_cars,
  defs = defs_expected,
  key = "car_id"
)

head(features_expected)
#>   car_id high_hp high_disp
#> 1      1   FALSE        NA
#> 2      2   FALSE        NA
#> 3      3   FALSE        NA
#> 4      4   FALSE        NA
#> 5      5    TRUE        NA
#> 6      6   FALSE        NA

The block returned high_hp, but high_disp was declared as an expected output. fd_compute() includes the missing expected column and fills it with NA. This can help when you want the feature table to keep a predictable schema.
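
The padding behaviour can be pictured with a few lines of base R. This is a conceptual sketch only, not featdelta's implementation, and pad_expected is a hypothetical helper: any expected column that the block did not return is added and filled with NA.

```r
# Conceptual sketch only, not featdelta's implementation:
# add any expected column that a block did not return, filled with NA.
pad_expected <- function(out, expected_names) {
  for (name in setdiff(expected_names, names(out))) {
    out[[name]] <- NA
  }
  out
}

block_out <- data.frame(high_hp = mtcars$hp > 150)
padded <- pad_expected(block_out, c("high_hp", "high_disp"))
head(padded, 3)
```

The row count is unchanged. Only the set of columns is widened to match the declared names.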

What to remember

Feature definitions turn R feature-engineering code into a reusable pipeline component.

Use ordinary fd_define() expressions when each feature is simple and readable on one line. Use fd_block() when a feature step naturally produces several columns, needs temporary variables, or belongs in a reusable function. Use function-based blocks when the logic is long enough to test separately or when the output columns are generated programmatically.

Once the definitions object is ready, the same object can be used in two ways:

# Local computation while developing feature logic
fd_compute(raw_data, defs, key = "id")

# Full database pipeline once the definitions are ready
fd_run(con, sql, defs, key = "id", feat_table_name = "feature_table")

That is the main workflow: develop feature logic in R, store it as a definitions object, test it locally, and then use it in the incremental database pipeline.