Feature engineering and selection

CHAPTER 7

For many problems in predictive analysis, the variation in the response can be explained by the independent variables acting individually; for some of the variation, however, the explanation lies in the joint effect of two or more variables working together.

Two or more predictors are said to interact if their combined effect is different (less or greater) than what we would expect if we were to add up the impact of each effect considered alone. Interactions are always defined in the context of how the predictors relate to the outcome; correlation among predictors does not necessarily imply interaction.

Mathematically, an interaction is represented as:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$$
Here:

  • $\beta_0$ represents the overall average response.
  • $\beta_1$, $\beta_2$ represent the average rates of change due to $x_1$ and $x_2$ respectively.
  • $\beta_3$ represents the incremental change due to the combined effect of $x_1$ and $x_2$ that goes beyond what the two variables can explain alone.
  • $\epsilon$ represents the random variation in $y$ that goes beyond what the deterministic part of the equation can explain.

Once the model is fitted, for example by least squares estimation or by logistic regression, there are four cases that can occur (a small code sketch follows this list):

  1. If $\beta_3$ is not significantly different from 0, then the interaction between $x_1$ and $x_2$ is not useful in explaining the variation in the response. In this case the relationship between $x_1$ and $x_2$ is called additive.
  2. If the coefficient is meaningfully negative, while $x_1$ and $x_2$ alone also affect the response, then the relationship is said to be antagonistic.
  3. If the coefficient is meaningfully positive, while $x_1$ and $x_2$ alone also affect the response, then the relationship is said to be synergistic.
  4. When the interaction term is significant in the model, but one or both of $x_1$ and $x_2$ do not affect the response, then the average response of $x_1$ across the values of $x_2$ (and vice versa) has a rate of change that is essentially 0. This is referred to as an atypical interaction. Visualization is important for understanding the data here.
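
As a minimal sketch of checking these cases (the variable names and effect sizes below are simulated assumptions, not from the source), the significance and sign of the interaction coefficient can be inspected with an ordinary least squares fit:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a built-in synergistic interaction between x1 and x2
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 + 1.5 * df.x1 + 0.8 * df.x2 + 2.0 * df.x1 * df.x2 + rng.normal(size=n)

# Main effects plus interaction; "x1:x2" is the interaction term
fit = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(fit.summary().tables[1])  # inspect the estimate and p-value on x1:x2
```

A small p-value with a positive coefficient on x1:x2 corresponds to case 3 (synergy), which is how the data above were generated.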

Interactions can be found either through domain knowledge or by using more complicated techniques. Research has shown that tree-based models are good at uncovering interaction effects through their recursive splitting. However, it is important to identify interactions in a domain-driven fashion, since this leads to meaningfully built features that are explainable and relevant.

Guiding principles in the search for interactions

  1. Expert knowledge in the area of analysis is critical for guiding the process of selecting interaction terms. Hence, expert-selected interaction features should be identified and explored. If expert knowledge is not available, sound guiding principles are needed that will lead to a reasonable selection of interactions.

  2. An important area of research that has actively studied interaction effects is the statistical design and analysis of experiments.

![[DoE.png]]

Here experiments are run to determine how factors affect a given response variable, using ANOVA, and it is possible to model the interaction between the factors. For example, an experiment conducted to analyze the effect of different fertilizer levels and water on crop yield could be used to estimate the interactive effect of a given level of fertilizer and water on the yield. See factorial experiments.
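
As a rough sketch of such a factorial analysis (the column names fertilizer, water and yield_ are hypothetical), a two-way ANOVA with an interaction term could look like this:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def two_way_anova(df: pd.DataFrame) -> pd.DataFrame:
    # df is assumed to hold one row per plot, with columns:
    #   fertilizer ("low"/"high"), water ("low"/"high"), yield_ (numeric)
    # C() treats the columns as categorical factors; "*" expands to the
    # main effects plus the fertilizer:water interaction
    model = smf.ols("yield_ ~ C(fertilizer) * C(water)", data=df).fit()
    return anova_lm(model, typ=2)  # ANOVA table including the interaction row
```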

  3. Determining which interactions are significant in explaining variation in the response is of key interest. Three principles are useful here:
  • Hierarchy principle: The higher the degree of interaction, the less likely it is that the interaction will explain variation in the response. Therefore, pairwise interactions should be the first set tried, then three-way, then four-way, etc. Care should be taken when dealing with higher degrees of interaction, since they are rarely significant and may lack interpretability.
  • Effect sparsity: States that only a fraction of the possible effects truly explain a significant amount of variation in the response. This principle greatly narrows down the candidate main and interaction effects.
  • Effect heredity: Asserts that interaction terms may only be considered if the lower-order terms preceding the interaction are effective at explaining response variation.
    • Strong heredity: States that the interaction $x_1*x_2$ can only be included in the model if the main effects of both $x_1$ and $x_2$ are significant in explaining a decent amount of variation in the response.
    • Weak heredity: Relaxes strong heredity and allows any interaction where at least one of the terms is significant in explaining response variation. If $x_1$ is a significant predictor but $x_2$ is not, then weak heredity still allows the interaction $x_1*x_2$ to be included in the model. (A small filtering sketch follows this list.)
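
As an illustration of how the two heredity rules narrow the candidate set (a sketch; the predictor names and the set of significant main effects are hypothetical):

```python
from itertools import combinations

predictors = ["x1", "x2", "x3", "x4"]
significant_mains = {"x1", "x3"}  # assumed result of a main-effects screen

# Strong heredity: both parent main effects must be significant
strong = [(a, b) for a, b in combinations(predictors, 2)
          if a in significant_mains and b in significant_mains]

# Weak heredity: at least one parent main effect must be significant
weak = [(a, b) for a, b in combinations(predictors, 2)
        if a in significant_mains or b in significant_mains]

print("strong heredity candidates:", strong)  # [('x1', 'x3')]
print("weak heredity candidates:", weak)      # 5 of the 6 possible pairs
```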

Even though higher-order interaction terms are usually not desirable, they do occur in nature, and it is therefore important to explore the possibility of higher-order interaction terms, especially when expert knowledge points to the possibility of their relevance in the model.

  4. A brute-force search over all possible pairwise interactions can be useful, but it is computationally intensive, since as the number of predictors $p$ increases, so does the number of pairwise interactions, i.e. $p(p-1)/2$. For example, for 100 predictors we would have $4{,}950$ pairwise terms, and for 500 predictors we would have $124{,}750$ pairwise terms.

    A challenge with the brute-force approach is that it becomes easier to find interaction terms that are related to the response purely by random chance. These are called false-positive findings. For the model:

    $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

    Adding the interaction term to the model gives:

    $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$

    The two models above are called nested models, since the first model is a subset of the second. In linear regression, for instance, a statistical test can be constructed that compares the residuals of the two models (adjusted for degrees of freedom) to check whether there is any significant improvement upon adding the interaction term (see the sketch after this item).

    This approach can be taken further: resampling is first used to create a number of nested models (one pair per resample), and these models are then evaluated using an objective function, e.g. area under the ROC curve, sensitivity, specificity, RMSE on predictions, ANOVA, $R^2$, etc.

    Measures to deal with false-positive findings include doing nothing about them, or applying multiplicity corrections, which are usually viewed as very strict (read on: Bonferroni correction).
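
A minimal sketch of such a nested-model comparison on simulated data (an F-test on the two fits, assuming statsmodels):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 0.5 * df.x1 - 0.3 * df.x2 + 1.2 * df.x1 * df.x2 + rng.normal(size=n)

main_only = smf.ols("y ~ x1 + x2", data=df).fit()          # nested (smaller) model
with_int  = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()  # adds the interaction

# F-test comparing residual sums of squares, adjusted for degrees of freedom
print(anova_lm(main_only, with_int))
```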

  5. Another question that emerges when creating interaction terms is whether to construct them before or after pre-processing (centering, scaling, normalizing, dimension expansion/reduction, etc.). The order should be evaluated carefully; in general it is more plausible to construct the interaction terms on the original measurement scales, prior to any pre-processing steps.
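
One way to encode that ordering (a sketch with scikit-learn; pairwise interaction columns are generated from the raw predictors first, and only then is everything centered and scaled):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

# Main effects plus pairwise products are built on the original scale,
# then the expanded feature set is centered and scaled before modelling.
model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
# model.fit(X_train, y_train)  # X_train / y_train are assumed to exist
```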

  6. Another issue that emerges when trying to add interaction effects is the possibility that $p > n$, i.e. the number of predictors becomes larger than the number of samples. This poses a challenge to linear models (e.g. linear regression, logistic regression, etc.) but not to models such as KNN, ANNs, SVMs, etc. It hurts the process, since the goal of feature engineering is to identify features that improve model performance while also improving interpretability.

A workaround is penalized models. These are models built to handle the $p > n$ issue when interpretability is still desired. Recall that in fitting models such as linear regression, the error function is of the form:

$$Q = \sum{(y_i - \hat{y_i})^2}$$

Solving for the regression coefficients involves inverting the cross-product matrix of the predictors:

$$\beta = (X^T X)^{-1} (X^T Y)$$

However, in the presence of many predictors it is likely that some predictors will be highly correlated; the $X^T X$ matrix will then not be of full rank and thus NOT invertible, and the regression coefficients grow large and unstable.
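
A tiny numerical sketch of this instability (simulated, nearly collinear predictors; adding $\lambda I$ to $X^T X$ before solving is exactly what ridge regression does):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)   # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(size=n)

XtX = X.T @ X
print(np.linalg.cond(XtX))  # huge condition number: the matrix is near-singular

beta_ols   = np.linalg.solve(XtX, X.T @ y)                    # unstable, huge coefficients
beta_ridge = np.linalg.solve(XtX + 0.1 * np.eye(2), X.T @ y)  # shrunken, stable coefficients
print(beta_ols, beta_ridge)
```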

A workaround is to change the error function of the regression model to:

$$Q_{L2} = \sum{(y_i - \hat{y_i})^2} + \lambda_r \sum{\beta^2_j}$$

The $\lambda_r$ term is called a penalty, and this technique is called ridge regression. The penalty grows as the coefficients grow large, so minimizing the objective forces the coefficients to stay small. As a result, the penalty causes the regression coefficients to shrink towards 0, which makes the model more interpretable. A different penalized regression model is shown below:

$$Q_{L1} = \sum{(y_i - \hat{y_i})^2} + \lambda_L \sum{|\beta_j|}$$

This method is called the least absolute shrinkage and selection operator (lasso). By modifying the penalty, the lasso forces some regression coefficients to exactly 0, and in doing so it effectively selects model terms down to an optimal number of predictors. This makes it a feature selection method.

  • Ridge regression is commonly used to combat collinearity between predictors.
  • Lasso regression is commonly used to eliminate predictors.
  • However, we can use both the lasso and ridge penalties in the same model, so that we deal with collinearity and feature selection at the same time. This model becomes:

$$Q = \sum{(y_i - \hat{y_i})^2} + \lambda [ (1 - \alpha)\sum{\beta^2_j} + \alpha \sum{| \beta_j |} ]$$

Here, $\lambda = \lambda_r + \lambda_L$, and the proportion of $\lambda$ associated with the lasso penalty is denoted $\alpha$. Thus, selecting $\alpha = 1$ gives a full lasso penalty model, whereas $\alpha = 0.5$ is an even mix of the ridge and lasso penalties.

Read: GLMNET in R
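
A rough Python analogue of this mixed penalty (a sketch with scikit-learn, whose ElasticNet uses l1_ratio for the mixing proportion called $\alpha$ above and alpha for the overall $\lambda$):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Simulated p > n style problem: 100 samples, 200 predictors
X, y = make_regression(n_samples=100, n_features=200, n_informative=10,
                       noise=5.0, random_state=0)

# alpha ~ overall penalty (lambda above), l1_ratio ~ lasso proportion (alpha above)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print((enet.coef_ != 0).sum(), "predictors kept out of", X.shape[1])
```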

  7. In the presence of a large number of predictors, complete enumeration is impossible, and hence there is need for more efficient search procedures. The two-stage approach: in this approach, we start by constructing a model that does not account for interaction terms. In a linear regression setting, the error terms from such a model usually account for:
  • The unexplained variation in the response that is due to random measurement error.
  • Variation due to predictors that were not available or not measured.
  • Interactions among the observed predictors that were not included in the model.

Suppose that the data-generating process is as shown below:

$$y = x_1 + x_2+ 10 x_1 x_2 + error$$

Then suppose that when the data was collected, only $x_1$ was measured, so that the approximating equation becomes:

$$y = \beta_1 x_1 + error*$$

Since the true response depends on $x_2$ and $x_1 x_2$, the error term $error*$ contains the information about these important terms that were absent from the model. Once $x_2$ is collected, it becomes possible to separate their contribution through a second model:

$$error* = \beta_2 x_2 + \beta_3 x_1 x_2 + error$$
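
A minimal sketch of this two-stage idea on simulated data (fit the main effects first, then regress the stage-one residuals on the candidate interaction term):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = df.x1 + df.x2 + 10 * df.x1 * df.x2 + rng.normal(size=n)

# Stage 1: main effects only; the interaction signal ends up in the residuals
stage1 = smf.ols("y ~ x1 + x2", data=df).fit()
df["resid"] = stage1.resid

# Stage 2: model the residuals with the candidate interaction term
stage2 = smf.ols("resid ~ x1:x2", data=df).fit()
print(stage2.params)  # the x1:x2 coefficient should come out roughly 10
```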

In the presence of many predictors, the efficient search for interaction terms can be narrowed down using the three principles outlined above (hierarchy, sparsity, heredity):

  • hierarchy: first look at pairwise interactions before going to higher-level interactions.
  • sparsity: if there are relevant interactions, only a few of them exist.
  • heredity: only search for interactions among the significant predictors.

![[tree.png]]

This is an example of a recursive splitting model for a synergistic relationship between $x_1$ and $x_2$. For node 4, the equation becomes:

$$Node_4 = I(x_1 < 0.655) * I(x_2 < 0.5) * I(x_2 < 0.316) * 0.939$$

$0.939$ is the mean of the training set data falling into this node.

$I(x) = 1$, if x is true
$I(x) = 0$, if x is false

In tree-based models, the final prediction equation is computed by adding similar terms from the other 8 terminal nodes (all of which are 0 except the single rule that applies when predicting).
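
A small sketch of such a recursive splitting fit on synergistic data (simulated; the printed rules split on both x1 and x2, which is how a tree encodes the interaction locally):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(3)
X = rng.uniform(size=(1000, 2))                            # predictors x1 and x2
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=1000)  # synergistic response

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
# Each root-to-leaf path is a product of indicator rules, as in the Node_4 equation
print(export_text(tree, feature_names=["x1", "x2"]))
```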

Tree-based models have some shortcomings, such as high variability: a tree can change drastically when the data changes by even a small amount. To correct for this, ensemble methods such as bagging and boosting are useful.

Bagging: involves sampling many times, with replacement, from the original data in an independent manner, and building a tree model on each sample. For predictions, the model averages the predicted responses across all of the trees. Random forests are a further modification of bagging.

Boosting: boosting models also utilize a sequence of trees, but create them in a different manner: instead of building trees to maximum depth, a boosted model restricts tree depth. Boosting also uses the predictive performance of the previous tree on the samples to reweight poorly predicted samples. When predicting, the contribution of each tree to the model is weighted using statistics related to the individual model fit.
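
A hedged sketch of the two ensembles in scikit-learn (a random forest standing in for the bagging-style model, gradient boosting with shallow trees as the boosted model):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Bagging-style: many deep trees on bootstrap resamples, predictions averaged
bagged = RandomForestRegressor(n_estimators=500, random_state=0)

# Boosting: a sequence of shallow trees, each fit to correct its predecessors
boosted = GradientBoostingRegressor(n_estimators=500, max_depth=2,
                                    learning_rate=0.05, random_state=0)

# bagged.fit(X_train, y_train); boosted.fit(X_train, y_train)  # X_train / y_train assumed
```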

Tree-based models are useful for uncovering interaction effects and modelling them, especially locally, which would be difficult for linear regression models.

Predictor importance information: