Simulation Study:
Machine Learning and Evaluation
(MLE_SIM_ACC)


1 Introduction

Prediction models are everywhere. They predict if an email contains spam, forecast tomorrow’s stock prices or diagnose if a patient is diseased. Such models take some feature data as input and return a prediction of the target variable. Features could be specific word frequencies, recent market movements or X-ray images. More often than not, machine-learning (ML) algorithms are used to train such models from data. An ML algorithm takes some data, a collection of feature-label pairs, as input and outputs a prediction model. Once a model has been trained, it can be put into practice. Theoretically…

Of course, things are rarely that simple. In practice, it is never known beforehand which algorithm and hyperparameter choices lead to the model with the highest performance. That is why, in practice, many different approaches are usually (and rightly) tried out and compared. Unfortunately, if many such comparisons are conducted, it is hard to distinguish truly good models from models that only appear to perform well enough empirically. This is particularly true when the amount of available data is limited. Thus, for serious applications, a rigorous evaluation before implementation in practice becomes very important. After all, while there are certainly benefits to implementing a good model, there are also costs to implementing a bad one.

This document contains the results of an extensive simulation study regarding the selection and evaluation of machine learned prediction models described by Westphal & Brannath (2019, ICML). The main purpose of this document is to act as a supplementary report of results. It shows several additional analyses which could not be included in the recent publication. Moreover, this report may also serve as a starting point for non-technical readers as complex mathematical notation and theoretical details are avoided whenever possible. For more detailed information regarding the employed methods we refer the reader to Westphal & Brannath (2019, ICML) and our previous work Westphal & Brannath (2019, SMMR).

Any question or suggestion regarding this report and our underlying research is very welcome - feel free to contact me. While the above-mentioned publications are joint work with Werner Brannath, I compiled this report on my own and hence I am solely responsible for any potential errors. An overview of other projects in the context of model evaluation can be found at https://maxwestphal.github.io/SEPM.PUB/.

1.1 Background

The usual recommendation in applied machine learning is to select a prediction model based on its empirical performance on a hold-out validation set to avoid overfitting to the training data. When computationally feasible, the results of different data splits (learning = training + validation) can be averaged. Such cross-validation and bootstrapping approaches are however not investigated in this study - instead we focus on simple hold-out validation. Once a final model is selected, its performance should be evaluated on a separate test dataset, as illustrated in the following diagram:

The main goal of this simulation study is to compare this default model selection approach with other strategies. In particular, we highlight differences to selecting multiple promising models (based on the validation data) for a simultaneous evaluation on the test data. This novel approach of assessing several models at once makes it possible to also utilise the test data for model selection. In contrast, model selection and evaluation are usually strictly separated to avoid selection-induced bias. Our approach counters this overoptimism with a particular simultaneous test procedure, the so-called maxT-approach, which is based on the approximate joint distribution of the performance estimates. This expanded machine-learning and evaluation pipeline is shown below.

1.2 Setup

In the following, some preliminary steps for data handling are carried out. In theory, showing these steps makes the whole analysis reproducible. To this end, however, all simulations first need to be conducted locally, which is very time consuming. For us, this process took roughly 3 months on a 16-core CPU (AMD Ryzen Threadripper 1950X). If you are just interested in the results of the study, feel free to skip this section.

After loading all necessary packages (assuming they are installed already), a few global options are set which may alter the appearance of this document.
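A minimal sketch of this setup step; the package list is inferred from the functions used later in this report, and the config entries shown are illustrative placeholders (the actual setup chunk may differ).

## packages used throughout this report
library(dplyr)      # data manipulation (filter, mutate, summarize, ...)
library(tidyr)      # reshaping (unite, spread, gather)
library(readr)      # fast CSV import (read_csv)
library(ggplot2)    # plotting
library(grid)       # unit() used in the ggplot theme below
library(DT)         # interactive tables (datatable, formatStyle)

## illustrative configuration object (actual values and further entries may differ)
config <- list(path        = "results",  # folder containing the simulation CSV files
               subset.size = NA,         # NA = analyse all simulation instances
               text.size   = 12,         # axis text size for plots
               title.size  = 14)         # axis/legend title size for plots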

Finally, a few custom objects and functions are defined.

## default ggplot theme:
owntheme <- function(x=NULL, pars=config){
  theme(axis.text    = element_text(size=pars$text.size),
        axis.title   = element_text(size=pars$title.size,face="bold"),
        legend.text  = element_text(size=pars$title.size),
        legend.title = element_text(size=pars$title.size, face="bold"),
        strip.text.x = element_text(size=pars$title.size, face="bold"),
        strip.text.y = element_text(size=pars$title.size, face="bold", angle=90),
        legend.direction = "horizontal", legend.position = "bottom", legend.box = "vertical",
        legend.spacing.x = unit(0.5, 'char'),
        legend.background = element_rect(fill="white", size=0.75, linetype="solid", colour="black"))}

## read in simulation results
read_data <- function(sim, path=config$path){
  bind_rows(lapply(sim, function(x) read_csv(paste0(path, "/", x, ".csv"), col_types = cols())))
}

## data preprocessing
preproc_data <- function(data, size=config$subset.size, seed=1337,
                         paired=TRUE, test=TRUE){
  if(inherits(data, "list")){return(data)}
  if(!is.na(size)){
    set.seed(seed)
    data <- filter(data, load.id %in% sample(unique(data$load.id), size, replace=FALSE))
  }
  out <- list()
  # 1st slot: raw data
  out$raw <- data %>%
    mutate(select.args = ifelse(is.na(select.args), "", select.args)) %>%
    mutate(select.rule = as.character(interaction(select.method, select.args, select.limit))) %>%
    mutate(select.rule = recode(select.rule, 
                                oracle.learn.sqrt = "oracle[learn]",
                                oracle.train.sqrt = "oracle[train]",
                                `se.c=1.sqrt`     = "within1SE",
                                `rank.r=1.sqrt`   = "default",
                                `simplest.en..sqrt` = "simple",
                                `rank.r=100.none` = "S=100")) %>%
    mutate(select.rule = ordered(select.rule, 
                                 levels = info$select.rule)) %>%
    mutate(score = recode(scenario,
                          EOMPM_A2 = "A",
                          EOMPM_B2 = "B",
                          MLE_SIM_F1_prev30 = "C",
                          MLE_SIM_F1_prev15 = "D",
                          MLE_SIM_F13_prev30 = "E",
                          MLE_SIM_F13_prev15 = "F")) %>%
    mutate(redundancy = recode(red,
                               `0` = "[I]",
                               `1` = "[R]")) %>%
    mutate(scenario = paste0(score, redundancy)) %>%
    mutate(n.eval = eval.n) %>%
    mutate(abs.dev = final.mod.thetahat-final.mod.learn.theta,
           rel.dev = abs.dev/final.mod.learn.theta,
           abs.dev.c = final.mod.thetatilde-final.mod.learn.theta,
           rel.dev.c = abs.dev.c/final.mod.learn.theta,
           maxS.sqrt = S == round(sqrt(n.eval))) %>%
    unite(eval.strat, select.rule, infer.method, sep="|", remove=FALSE) 
  

  # 2nd slot for paired analysis
  if(paired){
    out$paired <- out$raw %>%
    dplyr::select(load.id, M.start, M, n.eval, n.learn, scenario, methods,
                  select.rule, infer.method, final.mod.learn.theta, opt.theta) %>%
    dplyr::filter(select.rule %in% c("within1SE", "default")) %>%
    tidyr::spread(select.rule, final.mod.learn.theta)
  }
  
  # 3rd slot for analysis of rejection rate
  if(test){
    out$test <- out$raw %>%
      preproc_shift() %>%
      aggr_results()
  }

  return(out)
}

## read in theory data:
read_theory <- function(sim, path=config$path, at=18:67, vars=5:54){
  data.in <- bind_rows(lapply(sim, function(x) read_csv(paste0(path, "/", x, ".csv"), col_types = cols())))
  data.in %>%
    group_by(M, n.learn, eval.n, scenario) %>%
    summarise_at(at, function(x) mean(x, na.rm=T)) %>%
    gather("by", "var", vars) %>%
    # x = midpoint of each interval-style column name, e.g. "(a,b]" -> (a+b)/2
    mutate(x = unname(sapply(lapply(strsplit(sapply(by, function(x) substr(x, 2, nchar(x)-1)), ","), as.numeric), mean)))
}

1.3 Data preparation

1.3.1 Loading

To reproduce the results shown in this report, several numerical experiments have to be conducted first. It is assumed that this has been done such that the resulting CSV files can be loaded.

1.3.2 Preprocessing

Next, we perform several preprocessing steps via the custom function preproc_data, defined above. To reduce the computational demand of compiling this Rmd, only a subset of the data may be analysed. To do so, the variable config$subset.size needs to be set accordingly.
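A sketch of the corresponding calls is given below; the simulation file names are hypothetical placeholders for the CSV files produced by the local simulation runs.

## load and preprocess the simulation results (file names are illustrative)
data <- read_data(sim = c("MLE_SIM_ACC_part1", "MLE_SIM_ACC_part2")) %>%
  preproc_data()

## inspect the structure of the resulting list (raw, paired, test)
str(data, max.level = 1)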

## List of 3
##  $ raw   :Classes 'tbl_df', 'tbl' and 'data.frame':  7200000 obs. of  55 variables:
##  $ paired:Classes 'tbl_df', 'tbl' and 'data.frame':  1800000 obs. of  11 variables:
##  $ test  :Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':    1920 obs. of  9 variables:
##   ..- attr(*, "vars")= chr [1:5] "shift" "select.rule" "infer.method" "M" ...
##   ..- attr(*, "drop")= logi TRUE

The following main analysis is based on 72000 instances of the complete machine learning and evaluation pipeline.


2 Preliminaries

2.1 Machine Learning

Every instance of the simulation study consists of a triple \((\mathcal{T}, \mathcal{V}, \mathcal{E})\) of training, validation and evaluation (test) data. All candidate algorithms \(A_m, m\in \mathcal{M}= \{1,\ldots,M\},\) are applied to the training data. A selection rule is used to select one or more models deemed suitable for evaluation. This selection \(\mathcal{S}\subset \mathcal{M}\) is based on the validation data. The selected algorithms \(A_m, m\in \mathcal{S},\) are then applied again to training and validation data - referred to as the learning data \(\mathcal{L}= \mathcal{T}\cup \mathcal{V}\). In our simulations, we consider the following four learning algorithms implemented in the caret package:

  • EN: Elastic Net: Penalized Logistic Regression (glmnet)
  • CART: Cost-Sensitive Classification and Regression Trees (rpartCost)
  • SVM: L2 Regularized Linear Support Vector Machines with Class Weights (svmLinearWeights2)
  • XGB: eXtreme Gradient Boosting (xgbTree)

If not specified otherwise, an equal number of algorithms from each class with different (randomly sampled) hyperparameters is applied per training dataset, i.e. \(M/4\) models of each type. In our main analysis we investigate the cases of \(M \in \{40, 100, 200\}\) candidate models.
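As an illustration of this step, the following sketch trains a handful of elastic net (glmnet) candidates with randomly drawn hyperparameters via caret; the data-generating code and hyperparameter ranges are purely illustrative and not taken from the actual simulation code.

library(caret)

## small synthetic binary classification problem (illustrative only)
set.seed(42)
dat.train <- data.frame(matrix(rnorm(300*10), 300, 10))
dat.train$Y <- factor(ifelse(rowSums(dat.train[, 1:3]) + rnorm(300) > 0, "pos", "neg"))

## train several candidate models, each with one randomly sampled hyperparameter configuration
candidates <- lapply(1:10, function(m){
  train(Y ~ ., data = dat.train, method = "glmnet",
        tuneGrid  = data.frame(alpha  = runif(1, 0, 1),
                               lambda = 10^runif(1, -4, 0)),
        trControl = trainControl(method = "none"))
})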

2.2 Prediction tasks

We consider binary classification tasks and classification accuracy (the proportion of correct predictions) as our performance measure of interest. The following table gives an overview of the considered learning tasks in terms of important characteristics. The optimal performance \(\vartheta_{opt}\) is defined as the performance of the classifier which results from thresholding the true (data-generating) risk score \(\mathbb{P}(Y=1|\mathbf X)\) at \(0.5\). The four rightmost columns show the mean of \(\vartheta_{max}=\max_{m\in \mathcal{M}}\vartheta_m\) over all simulation instances for different combinations of learning sample size \(n_\mathcal{L}\in \{400, 800\}\) and independent [I] or redundant [R] features. \(\vartheta_{max}\) can be seen as a (rough) measure of how well the candidate algorithms are able to learn the true relationship between features \(\mathbf X\) and target \(Y \in \{0,1\}\). In all simulations, a 3:1 ratio between training and validation data is used, i.e. \(n_\mathcal{L}= 400 = 300+100 = n_\mathcal{T}+ n_\mathcal{V}\).
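To make the definition of \(\vartheta_{opt}\) concrete, here is a small Monte Carlo sketch under a hypothetical logistic data-generating model (not one of the actual simulation scenarios):

## approximate the optimal (Bayes) accuracy by thresholding the true risk score at 0.5
set.seed(1)
n <- 1e6
X <- matrix(rnorm(n*5), n, 5)                     # features
p <- as.vector(plogis(X %*% c(2, -1, 1, 0, 0)))   # true risk score P(Y=1|X)
Y <- rbinom(n, 1, p)                              # binary target
theta.opt <- mean((p > 0.5) == Y)                 # accuracy of the Bayes classifier
theta.opt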

2.3 Selection rules

The following selection rules will be investigated in this analysis. The color coding is fixed for the entire report. The oracle rules cannot be employed in practice as they are based on population quantities, but they serve as additional comparators in this simulation study. The median (IQR) number of models within one standard error of the best validation model is 9 (11) for \(n_\mathcal{L}=400\) and 8 (10) for \(n_\mathcal{L}=800\), given \(M=200\) initial candidate models.
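A minimal sketch of the within1SE rule; the validation accuracies and the binomial standard error formula are illustrative and the actual implementation may differ in details.

## select all models within one standard error of the best validation model
within1SE <- function(acc.val, n.val){
  best    <- max(acc.val)
  se.best <- sqrt(best * (1 - best) / n.val)   # standard error of the best validation accuracy
  which(acc.val >= best - se.best)             # indices of the selected models S
}

## example with 5 hypothetical candidate models and 100 validation samples
within1SE(c(0.80, 0.83, 0.84, 0.78, 0.82), n.val = 100)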

2.4 Model evaluation

Finally, the performances of the selected models \(\hat{f}_m = A_m(\mathcal{L}), m \in \mathcal{S}\subset \mathcal{M},\) are assessed based on the evaluation data. If multiple models are evaluated, the final model \(\hat{f}_{m^*}\) is defined as the model with the highest empirical test performance. In addition, we are interested in a formal test decision for the null hypotheses \(H_0^m: \vartheta_m \leq \vartheta_0\), where \(\vartheta_0\) is a given comparator (performance threshold or benchmark model). That is to say, the goal of the evaluation study is to provide evidence that at least one model (the final model) has a sufficiently high performance.
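The following sketch illustrates the maxT-approach for this test decision: the critical value is taken as the equicoordinate \(1-\alpha\) quantile of a multivariate normal distribution with the estimated correlation matrix of the performance estimates. All inputs (accuracies, correlation matrix, benchmark) are hypothetical and the actual implementation may differ.

library(mvtnorm)

## hypothetical evaluation results for S = 3 selected models
theta.hat <- c(0.86, 0.84, 0.85)                 # estimated accuracies on the test data
n.eval    <- 400                                 # test set size
theta0    <- 0.80                                # performance benchmark
R         <- matrix(0.6, 3, 3); diag(R) <- 1     # estimated correlation of the estimates

se    <- sqrt(theta.hat * (1 - theta.hat) / n.eval)
tstat <- (theta.hat - theta0) / se               # one test statistic per evaluated model

alpha  <- 0.025
c.maxT <- qmvnorm(1 - alpha, tail = "lower.tail", corr = R)$quantile
any(tstat > c.maxT)                              # evidence that at least one model beats the benchmark?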

3 Results

3.1 Model performance

3.1.1 Overall analysis

The first figure shows the final model performance \(\vartheta_{m^*}\) of the employed selection rules minus the best model performance \(\vartheta_{max} = \max_{m\in \mathcal{M}} \vartheta_m\) among all candidate models. The results are stratified by the number of learning samples \(n_\mathcal{L}\) and the number of evaluation samples \(n_\mathcal{E}\). Remember that we use an \(n_\mathcal{T}:n_\mathcal{V}= 3:1\) ratio to split the learning data into training and validation data, i.e. \(n_\mathcal{V}=1/4 \cdot n_\mathcal{L}\). The red symbol on top of the boxplots indicates the sample average of the corresponding observations. The first plot is restricted to the case of \(M=200\) initial candidate models.

The next plot shows the same loss as above, but now relative to \(\vartheta_{max}\).

As we are most interested in comparing the within 1 SE selection rule with the default selection rule, the following figure shows exactly this comparison.

3.1.2 Stratified analysis

In the following, detailed results regarding the final model performance are presented. First, the last figure is stratified by learning task. The analysis is restricted to the case of \(M=200\) candidate models.

In the following, the previous results are given in detail in two tables. The quantity of main interest here is the performance gain \(\Delta=\mathbb{E}[\vartheta_{m^*|\text{w1SE}}-\vartheta_{m^*|\text{default}}]\) (theta.delta), for which (unadjusted) two-sided 95% confidence intervals and p-values regarding the null hypothesis \(H_0: \Delta=0\) are given.

Next, we show the same table, additionally stratified for the prediction task scenario, for a more detailed analysis.

3.2 Estimation

In this section we investigate how well the true final model performance (accuracy) \(\vartheta_{m^*}\) can be estimated. In particular, we are interested in the distribution of the relative deviation \((\hat{\vartheta}_{m^*} - \vartheta_{m^*})/\vartheta_{m^*}\) between the point estimate \(\hat{\vartheta}_{m^*}\) and the true value \(\vartheta_{m^*}\).

3.2.1 Naive performance estimation

The obvious naive estimator for classification accuracy is just the sample proportion of correct predictions. In our simulations, we implemented a slightly different naive estimator by adding two pseudo-observations (1 correct and 1 false prediction) to the observed evaluation data before calculating this sample proportion. This procedure has a Bayesian motivation and results in a slightly more conservative (downward biased) estimate. The main reason for using this estimator was to prevent a singular covariance matrix due to estimates \(\hat{\vartheta}_m \in \{0,1\}\). The sample mean of the deviations (red) is an estimator for the bias of the estimator. The slight downward bias observed for the default approach (evaluating only a single model) results solely from this modification.
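In code, this modified naive estimator is simply the following (numbers are illustrative):

## n.correct correct predictions out of n.eval test samples, plus two pseudo-observations
n.correct   <- 355; n.eval <- 400
theta.naive <- (n.correct + 1) / (n.eval + 2)    # approx. 0.8856 instead of 355/400 = 0.8875
theta.naive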

3.2.2 Corrected performance estimation

The corrected estimator \(\tilde{\vartheta}\) takes into account the number of models in the evaluation study as well as the correlation structure between the raw performance estimates. It is computed as a simultaneous lower 50%-confidence bound for all model performances and is thus a median-conservative point estimator. In effect, the corrected estimate is biased downward rather than upward. This bias again decreases with increasing \(n_\mathcal{E}\). Note that the naive and corrected estimators coincide if only a single model is evaluated (default selection rule).
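Such a simultaneous lower 50% bound can be sketched by replacing the \(1-\alpha\) quantile from the sketch in section 2.4 with the 50% equicoordinate quantile; all inputs (estimates, standard errors, correlation matrix) are again hypothetical.

library(mvtnorm)

theta.hat <- c(0.86, 0.84, 0.85)             # naive accuracy estimates of the evaluated models
se        <- sqrt(theta.hat * (1 - theta.hat) / 400)
R         <- matrix(0.6, 3, 3); diag(R) <- 1 # estimated correlation of the estimates

c.50      <- qmvnorm(0.5, tail = "lower.tail", corr = R)$quantile
theta.cor <- theta.hat - c.50 * se           # corrected, median-conservative point estimates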

3.2.3 Detailed results

Finally, more detailed results are given in the following table. We investigate the relative bias, RMSE and MAD of the naive [N] and corrected [C] estimators. All these numbers are given as percent values, i.e. they are multiplied by 100.

bias.active <- info$select.rule %in% config$bias.methods  

## range of observed proportions:
config$p.range <- c(-5,5)

## estimator properties:
data$raw %>%
  filter(select.rule %in% config$bias.methods) %>%
  group_by(M, n.learn, n.eval, select.rule) %>%
  summarize(Nsim = n(),
            bias.naive = mean(rel.dev),
            bias.corrected = mean(rel.dev.c),
            rmse.naive  = sqrt(mean(rel.dev^2)),
            rmse.corrected  = sqrt(mean(rel.dev.c^2)),
            mad.naive  = mean(abs(rel.dev)),
            mad.corrected  = mean(abs(rel.dev.c))) %>%
  mutate_at(vars(matches("bias|mse|mad")), function(x)round(100*x, 3)) %>%
  arrange(-M, n.learn, n.eval) %>%
  select(-Nsim) %>%
  datatable(class=c("compact"), rownames = FALSE, filter="top", extensions = 'FixedHeader',
  options = list(pageLength = 20, fixedHeader = TRUE),
  colnames=c("M", "n.learn", "n.eval", "select.rule", #"Nsim",
                       "rBias [N]", "rBias [C]", "rRMSE [N]", "rRMSE [C]", "rMAD [N]", "rMAD [C]")) %>%
  formatStyle(columns = 1:12, color="black") %>%
  formatStyle(columns = 1:4, color="black", fontWeight = "bold") %>%
    formatStyle('select.rule', #target = 'row',
              color = styleEqual(info$select.rule[bias.active], info$col[bias.active])) %>%
  formatStyle(
    c('bias.naive', 'rmse.naive', 'mad.naive'),
    background = styleColorBar(config$p.range, '#8d99ae'),
    backgroundSize = '100% 90%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center',
    fontWeight = "bold"
  ) %>% 
  formatStyle(
    c('bias.corrected', 'rmse.corrected', 'mad.corrected'),
    background = styleColorBar(config$p.range, '#f05028'),
    backgroundSize = '100% 90%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center',
    fontWeight = "bold"
  ) 

3.3 Test decisions

In our framework, a prediction model \(\hat{f}_m\) can be seen as positively evaluated if the null hypothesis \(H_0^m: \vartheta_m \leq \vartheta_0\) can be rejected in the evaluation study. Hereby, \(\vartheta_0\) can be the performance of a reference model or, in our case, a prespecified performance benchmark, e.g. an accuracy of \(90\%\). In the following, we will investigate the type 1 and type 2 error rates of the employed statistical test when the number of evaluated models changes.

3.3.1 Type 1 error rate

The type 1 error rate or family-wise error rate (FWER) in the multiple testing context is given as the probability of rejecting at least one true null hypothesis. The FWER is required to be bounded by the significance level \(\alpha\), which is set to \(2.5\%\) in all our simulations. The first three figures show only results for the case \(M=200\).

3.3.2 Statistical power

Statistical power is the probability of correctly rejecting at least one false null hypothesis. This is equal to the probability of correctly rejecting the global null hypothesis \(G = \cap_{m\in \mathcal{M}} H_0^m\) if it is false. In our model evaluation framework, the global null \(G\) corresponds to the case that none of the considered candidate models performs well enough, i.e. all performances \(\vartheta_m\) are less than or equal to the prespecified threshold \(\vartheta_0\). Therefore, the minimal requirement for a successful evaluation study from a statistical viewpoint is the rejection of \(G\). Comparing statistical power is an accepted way to choose between different tests with the same type 1 error rate.

3.3.3 Stratified analysis

In the following we will investigate the previous findings from a different perspective. We will include all cases \(M \in \{40, 100, 200\}\) but only look at the type 1 error rate (false positive test decisions) at \(\delta = \vartheta_{max} - \vartheta_0 = 0\) and statistical power (true positive test decisions) at \(\delta=0.05\).

3.4 Further results

In the following, the results of different sensitivity and auxiliary analyses are presented.

3.4.1 Model performance and similarity

The following table gives a more detailed overview of the model quality per learning algorithm, task and sample size. The column theta.max indicates the maximum true performance \(\vartheta_{max}\) stratified by algorithm, averaged over all simulation runs. The column S shows how many models (trained by the respective algorithm) are within 1 SE of the best validation model. The final column dependence contains (the mean of) a dimensionless measure of dependence defined as \(\varrho=1 - (c_{\alpha}-c_d)/(c_i-c_d)\). Hereby, \(c_\alpha\) is the actual critical value, \(c_d\) is the critical value under perfect dependence (i.e. the corresponding quantile of the univariate normal distribution) and \(c_i\) is the corresponding quantile of the \(S\)-dimensional normal distribution under the assumption of independence (of the model predictions). This implies \(\varrho=0\) if the performances of the \(S\) evaluated models are completely independent and \(\varrho=1\) if the models are effectively all the same (at least on the test data, which in this analysis has a size of \(n_\mathcal{E}=8000\)).
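As a small illustration, the following sketch computes this dependence measure from a given critical value \(c_\alpha\) and the number of evaluated models \(S\); it is a reconstruction based on the definition above, not the original implementation.

library(mvtnorm)

dependence <- function(c.alpha, S, alpha = 0.025){
  c.d <- qnorm(1 - alpha)                                                  # perfect dependence
  c.i <- qmvnorm(1 - alpha, tail = "lower.tail", corr = diag(S))$quantile  # independence
  1 - (c.alpha - c.d) / (c.i - c.d)
}

## example: S = 10 evaluated models with an actual critical value of 2.5
dependence(c.alpha = 2.5, S = 10)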

data.details %>% preproc_data() %>% {.$raw} %>%
  filter(select.rule=="within1SE", n.eval==8000) %>%
  mutate(methods = recode(methods, glmnet="EN", rpartCost="CART", svmLinearWeights2="SVM", xgbTree="XGB")) %>% 
  mutate(theta.opt = sapply(score, function(x) switch(x, A=0.885, B=0.86, C=0.951, D=0.966, E=0.95, F=0.963))) %>%
  group_by(score, n.learn, methods) %>%
  summarize(Nsim=n(),
            theta.opt = mean(theta.opt),
            theta.max.m = mean(opt.theta),
            theta.max.sd = sd(opt.theta),
            S.m = mean(S),
            S.sd = sd(S),
            dependence.m = mean(dependence, na.rm=T),
            dependence.sd = sd(dependence, na.rm=T)) %>%
  mutate_at(5:11, function(x) round(x, 3)) %>%
  select(score, n.learn, algorithm=methods, theta.opt, theta.max=theta.max.m, S=S.m, dependence=dependence.m) %>%
  mutate(S = round(S, 1)) %>%
  datatable(class=c("compact"), rownames = FALSE, filter="top", extensions = 'FixedHeader',
  options = list(pageLength = 8, fixedHeader = FALSE)) %>%
  formatStyle(columns = 1:12, color="black") %>%
  formatStyle(columns = 1:2, color="black", fontWeight = "bold") %>%
  formatStyle('algorithm', fontWeight = "bold",
              color = styleEqual(unique(data.details$methods),
                                 colorspace::rainbow_hcl(4))) %>%
  formatStyle(
    c('theta.max'),
    background = styleColorBar(c(0.50, 1), '#f05028'),
    backgroundSize = '100% 90%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center',
    fontWeight = "bold"
  ) %>%
  formatStyle(
    c('S'),
    background = styleColorBar(c(1:50), '#8d99ae'),
    backgroundSize = '100% 90%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center',
    fontWeight = "bold"
  ) %>%
    formatStyle(
    c('dependence'),
    background = styleColorBar(c(0:1), '#8d99ae'),
    backgroundSize = '100% 90%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center',
    fontWeight = "bold"
  ) 

3.4.2 Advantages of the maxT-approach

In this brief analysis, we compare the employed maxT-test with the better-known Bonferroni correction, which is simpler but also more conservative. On the other hand, we compare with a “naive” evaluation strategy without any statistical testing (i.e. concluding success if the point estimate is above the performance threshold).
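For intuition, the corresponding critical values can be compared directly; a tiny sketch with a hypothetical correlation matrix:

library(mvtnorm)

alpha <- 0.025; S <- 10
R <- matrix(0.6, S, S); diag(R) <- 1        # hypothetical correlation of the performance estimates
c.bonf <- qnorm(1 - alpha/S)                # Bonferroni critical value
c.maxT <- qmvnorm(1 - alpha, tail = "lower.tail", corr = R)$quantile
c(bonferroni = c.bonf, maxT = c.maxT)       # Bonferroni is larger, i.e. more conservative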

The statistical testing approach does not affect the final model performance but it does affect the rejection rate. We can observe that the Bonferroni approach results in more conservative test decisions, i.e. fewer rejections. The naive approach of concluding \(\vartheta_{m^*}>\vartheta_0\) whenever \(\hat{\vartheta}_{m^*} > \vartheta_0\), on the other hand, results in a far too optimistic procedure with type 1 error rates of up to 50%.

3.4.3 Selection based on model complexity

There are different model selection approaches which focus on trading off performance against model complexity, penalizing the latter. In general, it may however be hard to find such a criterion which is applicable to all candidate models at hand when a diverse set of learning algorithms is used.

We performed a small sensitivity analysis in this regard by considering a traditional selection approach for the elastic net (EN) algorithm. Hereby, the number of nonzero model coefficients is minimized (or the L1-penalty maximized) under the side condition that the validation performance is at most one standard error below the best validation model. The results concerning this simple rule (also referred to as the 1-SE rule) are given below.

Quite surprisingly, this approach does not even beat the default approach, neither regarding the expected model performance nor the rejection rate of the corresponding statistical test. The bias is not shown here, as this method also allows unbiased estimation (\(|\mathcal{S}|=1\)).

3.4.4 Optimal number of evaluated models

For this simulation study, we prespecified the within1SE selection rule as a comparator for the default selection rule. However, choosing models within one (and not e.g. two) standard error(s) is not necessarily optimal. In our numerical experiments, the within1SE rule led to a mean number of evaluated models \(S\) of 9 to 10, averaged over all learning tasks (in reality the number is lower due to our self-imposed limit \(S \leq \lfloor \sqrt{n_\mathcal{E}} \rceil\)).

In the following we inspect how the expected model performance changes depending on \(S\). Two important conclusions can be drawn from the following figure:

  • Given \(n_\mathcal{E}\geq n_\mathcal{V}= n_\mathcal{L}/4\), it would be more reasonable to conduct model selection completely on the test data (\(S=M\)) than only on the validation data (\(S=1\)). This is expected, as the final models (after re-training on the complete learning data) can be assessed directly only on the test data, since the hold-out validation estimate is downward biased by construction.
  • However, neither \(S=M\) nor \(S=1\) is close to the optimal \(S\). We rather observe an optimum between \(S=10\) and \(S=15\). The magnitude of the decline in expected model performance beyond that point (i.e. when \(S \rightarrow M\)) is inversely related to the test set size \(n_\mathcal{E}\).

We have also conducted this analysis stratified by learning task and the results did not change qualitatively. We remark that this analysis only covers the aspect of expected model performance. The estimation bias can be expected to increase in \(S\), while for the statistical power we expect a similar curve as above (an increase at first, followed by a decrease). We expect the power loss to be immense at some point due to the necessary adjustment for multiplicity. We have not investigated these aspects in detail so far, mainly due to the high computational demand (for the calculation of critical values for \(S \gg 10\)).


4 Summary

In this simulation study, we examined whether the default evaluation strategy in machine learning can be improved. Our main finding is that evaluating multiple promising models instead of only a single one has several benefits:

  • In general, our multiple testing approach leads to greater flexibility in the evaluation study.
  • As more data informs the model selection, the expected final model performance is increased. The magnitude of the performance gain is between 0.5% and 1% accuracy in our simulations (when the test data has the same size as the validation data). This effect increases with the size of the test set.
  • The probability of successfully finding a sufficiently good model increases. This gain in statistical power can be immense, up to 20% in our simulations.

These benefits come at the cost of a biased estimation of the final model performance. Luckily, we are able to correct the upward bias of the naive estimator quite well. The corrected estimate is rather biased downward, which is preferable. The other properties (MSE, MAD) of the point estimator, however, do not differ strongly between evaluating a single model and evaluating multiple models.

In practice, other considerations than raw predictive power may of course also play a role. A reasonable strategy would be to check suitable side conditions before the evaluation study. For instance, if a model would not be implemented in practice because it is implausible or too complex, it should not be considered for evaluation in the first place.

Altogether, evaluating multiple promising models instead of a single candidate seems to be preferable. How to choose the number of models to be evaluated in an optimal way is still an open question. We observed good results with the within1SE rule, i.e. when all models within one standard error of the best validation model are evaluated. In the future, we plan to investigate the performance of more elaborate approaches.


5 Resources

5.1 References

  • Westphal & Brannath (2019, SMMR): Westphal, M. and Brannath, W. Evaluation of multiple prediction models: a novel view on model selection and performance assessment. Statistical Methods in Medical Research, forthcoming 2019.
  • Westphal & Brannath (2019, ICML): Westphal, M. and Brannath, W. Improving Model Selection by Employing the Test Data. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, forthcoming 2019.

5.2 Software

5.3 Version History

  • 13/05/2019 - Version 1: Initial release
  • 16/05/2019 - Version 2: Added ‘Selection based on model complexity’

Max Westphal

https://github.com/maxwestphal
https://www.linkedin.com/in/maxwestphal/
https://www.anstat.uni-bremen.de/node/22

11 June, 2019