Not an issue per se, but some thoughts I had that may be useful.
A paper to look at is "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation" (link). I think it goes over most of the "best practices" relevant to matbench.
The main topic of the paper is the types of bias and variance that arise in model selection and evaluation. An important case is how over-fitting during model selection can cause the final model to be either under- or over-fit. The more models / hyper-parameters that are searched over, the more likely it is that the "best" model is chosen due to variance, which can result in a worse final model. The paper notes that minimizing bias during model selection is less important than minimizing variance, since (assuming a roughly uniform bias across models) the best model will still be chosen. The paper gives a few options for dealing with over-fitting during model selection, such as regularization and stopping criteria. A practical choice would be to use 5-fold CV instead of 10-fold CV, as it will likely have somewhat higher bias but lower variance, and is less computationally expensive. This also highlights why it is so important to have a final held-out test set to evaluate the final chosen model, as the cross-validation scores can be heavily biased. Another option for small datasets (though it is computationally expensive) is nested cross-validation.
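To make the nested CV idea concrete, here is a rough sketch of what it could look like with scikit-learn; the dataset, estimator, and parameter grid are just placeholders for illustration, not a recommendation for matbench:

```python
# Sketch of nested cross-validation (placeholder data/estimator/grid).
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Inner loop: model selection (hyper-parameter search) via 5-fold CV.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"max_depth": [4, 8, None], "n_estimators": [100, 300]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=inner_cv,
                      scoring="neg_mean_absolute_error")

# Outer loop: estimates the generalization error of the *whole* selection
# procedure, so the reported score is not biased by the search itself.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv,
                         scoring="neg_mean_absolute_error")
print(f"nested CV MAE: {-scores.mean():.3f} +/- {scores.std():.3f}")
```

The point is just that the entire search is repeated inside each outer fold, so the outer score reflects the selection procedure, not one lucky configuration.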
This quote gives the main conclusion:
model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.
In practice this means there must be a train / test split, with the training set then used for train / validation either via a single split or via cross validation, and with the entirety of model selection (including most types of feature selection, hyper-parameter optimization, etc.) internal to that cross validation. The TPOT code is a good example of this: "pipelines" are compared to each other as a whole using cross validation, followed by a final evaluation of the best pipeline on a test set to estimate the true generalization error. A useful idea in model comparison would be to use the variance of the estimates of generalization error (e.g. the variance of the CV error) to check whether differences are statistically significant, but this is very hard to do correctly.
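For illustration, here is a minimal sketch of that workflow (a simple GridSearchCV as a stand-in for TPOT-style pipeline comparison, again with placeholder data and models):

```python
# Sketch: keep all model selection internal to the training data,
# then evaluate the chosen pipeline once on a held-out test set.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Hold out a test set first; it is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Preprocessing and hyper-parameter choices all happen inside the CV
# folds on the training set only.
pipeline = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(pipeline, {"ridge__alpha": [0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

# One final evaluation of the best pipeline on the untouched test set
# estimates the true generalization error.
print("selected:", search.best_params_)
print("test MAE:", -search.score(X_test, y_test))
```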
(Sidenote) With a goal of simply comparing many models (that is, without the goal of choosing a single best one at the end), the bias-variance trade-off is tricky. Theoretically we can either choose biased results to minimize variance, and hope that this gives an accurate ranking of performance even though it may not reflect the true generalization error, or choose unbiased results that may reflect the generalization error but may not give an accurate ranking. In practice we hope to land somewhere in the middle that gives reasonable results. The hold-out test set does not necessarily help here (though it may be better than nothing?), as using it to evaluate many models leads to the same kind of bias as in model selection. For example, after years of competitions training NNs to do well on CIFAR-10, the selected models may simply be over-fitting to the CIFAR-10 competition test set, and would do worse than previous models on new test data, even if they are currently winning by some small percentage increase in classification accuracy. This is just a thought; I'm not sure how to deal with this issue, and I haven't read much of the ML literature on large-scale model comparison.
Let me know if you have any questions, want clarification, more references, or anything else.