If you follow the news, you’ve probably heard about the recent advances in – and seemingly superhuman capabilities of – deep learning neural networks like DALL-E and GPT. You’ve probably also heard the speculation about what use-cases they may support – and who (or what) they might replace.
But in the world of business, where tabular, structured data is most common, tree-based machine learning (ML) models still rule the roost. Indeed, a quick look at Kaggle, a website that hosts ML competitions, reveals that when combined with feature engineering, tree-based ensemble models like XGBoost and CatBoost continue to bring home the prizes.
In this article, I unpack some of the technical reasons why tree-based models have maintained the lead, and what this means for Solidus’ near-term machine learning strategy.
Deep learners, unlike tree learners, still struggle with tabular data
A recent paper, promisingly titled ‘Why do tree-based models still outperform deep learning on tabular data?’, concludes, based on multiple benchmarks, that tree-based models give better predictions at lower computational cost. The authors give two main reasons for this:
1) Neural networks are biased toward smooth solutions
The diagram above shows what happens when the patterns in the data are irregular rather than smooth. The decision trees that make up tree-based models like Random Forests (left) can split the data at hard boundaries, whereas neural networks are also influenced by spatially adjacent data. Tree learners can therefore learn irregular patterns piecewise, whereas multilayer perceptrons, or MLPs (right) – sometimes referred to as ‘vanilla’ neural networks – struggle.
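To make this concrete, here is a minimal, hypothetical sketch in scikit-learn: a one-dimensional target with a hard jump at x = 0.5. A decision tree can place a split exactly at the boundary, while an MLP smooths across it. (The dataset and model settings are illustrative, not from the paper.)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# Hypothetical 1-D example: a piecewise-constant target with a hard jump at x = 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.where(X[:, 0] < 0.5, 0.0, 1.0)  # irregular, non-smooth pattern

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X, y)

grid = np.linspace(0.45, 0.55, 11).reshape(-1, 1)
print(tree.predict(grid))  # jumps sharply at the learned split
print(mlp.predict(grid))   # smooth transition across the boundary
```

Far from the jump, the tree predicts the plateau values exactly; the MLP approximates them with a smooth curve.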
2) Uninformative features affect MLP-like NNs more
Again, this is well illustrated by a diagram from the paper. Here we can see that when unhelpful columns are added to the data, performance drops most rapidly for ResNet, which has an MLP-like architecture. Interestingly, the Feature Tokenizer (FT) Transformer, a neural network that uses attention-based mechanisms similar to GPT, already shows greater resistance to uninformative columns.
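You can run a toy version of this experiment yourself: append pure-noise columns to a simple dataset and compare models. This is only a small illustrative sketch (the dataset, model choices, and hyperparameters are mine, and results on toy data vary); the paper’s benchmark is the real evidence.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=(n, 2))                 # two genuinely useful columns
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

results = {}
for n_noise in (0, 20):
    # Append n_noise columns of pure Gaussian noise to the informative ones.
    X = np.hstack([informative, rng.normal(size=(n, n_noise))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    for name, model in [
        ("random forest", RandomForestClassifier(random_state=0)),
        ("MLP", MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)),
    ]:
        acc = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
        results[(n_noise, name)] = acc
        print(f"{n_noise:>2} noise cols, {name}: {acc:.3f}")
```

The random forest largely ignores the noise columns, because splits on them rarely improve purity and are rarely chosen.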
One reason that uninformative features are a hindrance is that these neural networks are ‘rotationally invariant’: if you rotate the data with a rotation matrix, the results stay the same. That property is very useful when dealing with images, but it is wasted effort – and confusing to the model – when the data sits in meaningful columns, as it does in business applications.
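A quick NumPy sketch shows what rotation does to tabular data. After multiplying by an orthogonal (rotation) matrix, every new column is a blend of all the originals, so a column like a customer’s account age no longer exists as a distinct feature – yet a rotationally invariant learner treats the two datasets as the same problem. (The data here is random and purely illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 rows, 3 tabular columns

# Build a random 3x3 orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X_rot = X @ Q  # every rotated column mixes all the original columns

print(np.allclose(Q @ Q.T, np.eye(3)))  # Q is orthogonal
print(X_rot)                            # original column meanings are gone
```

Rotation preserves geometric structure (distances and norms), which is exactly why a rotationally invariant model can’t tell it happened.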
In the near term, it looks like tree-based models will continue to outperform deep learning models on tabular data.
Neural networks are typically capable of finding their own ‘features’, or characteristics of a phenomenon. In the tree-based world, however, data scientists must engineer these attributes themselves. In Solidus’ case, this means creating features like deposit-to-withdrawal ratios and customer behavioral profiles that are informed by our crypto market expertise, as well as our engagement with compliance professionals and financial regulators. In the near term, this deep subject matter knowledge will remain necessary.
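As a minimal sketch of what such feature engineering looks like in practice, here is a deposit-to-withdrawal ratio computed with pandas. The column names and the +1 smoothing constant are hypothetical, not Solidus’ actual schema.

```python
import pandas as pd

# Hypothetical per-customer aggregates; names and values are illustrative only.
df = pd.DataFrame({
    "customer":    ["a", "b", "c"],
    "deposits":    [1000.0, 50.0, 0.0],
    "withdrawals": [100.0, 45.0, 10.0],
})

# Deposit-to-withdrawal ratio, with +1 in the denominator to avoid division by zero.
df["dep_wd_ratio"] = df["deposits"] / (df["withdrawals"] + 1.0)
print(df)
```

A tree model can then split directly on this ratio – a hard, human-interpretable boundary like “ratio above 10” – which is exactly the kind of irregular, expert-informed pattern trees handle well.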
Even when deep learning does overtake tree models – which, based on recent history, could be a good medium-term bet – professionals in domains like trade surveillance and financial services will still have to justify their processes to customers and regulators. In other words, AI explainability will continue to depend on understandable features verified by experts in the domain.
*Punny headline generated by ChatGPT