If you’ve ever fiddled with data science before, you’ve probably come across some version of the following chart:
So while it’s no secret that feature engineering can be a lot of work, it can also be a treasure trove of creative ideas, and can hold the key to features that make your machine learning (ML) model highly performant.
We recently had to use ratios in one of our clustering algorithms, and it proved to be far less trivial than we thought when we first started down the rabbit hole.
User Segmentation - a problem as old as time
Recently, we were tasked with segmenting traders at crypto exchanges into different groups.
Users come in many shapes and sizes, and whether a user is a high frequency trader, an institutional investor, or a one-time buyer influences the behavioral patterns we expect to see when we look at our data.
Our objective, then, was to define clear trading profiles, such that we could attribute users of interest to a profile and then determine when a deviation from that profile took place.
We decided on a few features that we would focus on for our model, one of which was the Deposit / Withdrawal Ratio. As its name suggests, it’s pretty straightforward.
The naive approach - an asymmetrical ratio
We first checked to see if we could get away with something simple:
Ratio = deposit / withdrawal
This had one big advantage going for it: it’s intuitive, so it’s easy to explain to our users what’s going on. Explainability is important, as we'll discuss later.
The main drawback was the asymmetry between cases where:
- deposits < withdrawals: the ratio falls in the range between 0 and 1.
- deposits > withdrawals: the ratio falls in the range between 1 and, theoretically, infinity.
The result would be that a certain group of data-points would exist along one scale of values, and another group of data-points would be distributed along a very different scale. This means we would have an asymmetric range.
The problem with this asymmetry would arise when we tried to throw this data into a machine-learning model. Models take feature-vectors as inputs, where each feature represents a coordinate on some axis in space. Having a feature with inconsistent distances would make it hard for a machine-learning model to do its job.
When you think about it, this is especially intuitive for clustering, which is very clearly distance-based! In order for it to work properly, the algorithm would have to learn that different ranges on an axis have different scales. For example, it would have to learn that a distance of 0.3 means something very different depending on whether it occurs in the range between 0 and 1 or among values greater than 1. But clustering algorithms don’t work that way, and so it’s up to us to help them help us.
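To make the asymmetry concrete, here is a minimal sketch with made-up numbers (the trader amounts and the function name `naive_ratio` are illustrative assumptions, not values from our data). Two traders who are mirror images of each other end up at very different distances from a perfectly balanced trader:

```python
def naive_ratio(deposit: float, withdrawal: float) -> float:
    """Plain deposit / withdrawal ratio (assumes withdrawal > 0)."""
    return deposit / withdrawal


# Hypothetical traders: one deposits 4x what they withdraw,
# the other withdraws 4x what they deposit.
balanced = naive_ratio(100, 100)          # 1.0
heavy_depositor = naive_ratio(400, 100)   # 4.0
heavy_withdrawer = naive_ratio(100, 400)  # 0.25

# Distance of each mirror-image trader from the balanced one.
dist_up = abs(heavy_depositor - balanced)
dist_down = abs(heavy_withdrawer - balanced)

print(dist_up, dist_down)  # 3.0 0.75
```

The two behaviors are equally extreme, yet a distance-based algorithm would see one as four times farther from "balanced" than the other.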
In an attempt to right this wrong, we took inspiration from the logistic regression’s use of the logit function and went with the natural log (also known as “ln”):
Ratio = log(deposit / withdrawal)
And now, with the magic of mathematics, our data would be distributed within this feature like this:
Amazing! It’s symmetrical. Problem solved.
Or is it?
The current function might be symmetrical, but the possible range is now anywhere between negative infinity and infinity, not to mention that we've lost a good deal of intuitiveness and explainability. Once again, we’ve reached a problem with the scale of our data: it might be symmetrical, but infinity isn’t easy for machine learning models either. So we need a way to compress the range of possible values.
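Both properties, the symmetry and the unbounded range, can be checked with a small sketch. The function name `log_ratio` and the handling of zero amounts are our assumptions for illustration; in particular, returning NaN when there is no activity at all is just one possible choice:

```python
import math


def log_ratio(deposit: float, withdrawal: float) -> float:
    """Natural log of deposit / withdrawal, with explicit edge cases."""
    if deposit == 0 and withdrawal == 0:
        return math.nan   # assumption: no activity means the ratio is undefined
    if withdrawal == 0:
        return math.inf   # deposits only
    if deposit == 0:
        return -math.inf  # withdrawals only
    return math.log(deposit / withdrawal)


# Mirror-image traders now sit at equal distances from zero.
print(log_ratio(400, 100))  # ln(4), roughly 1.386
print(log_ratio(100, 400))  # -ln(4), roughly -1.386
print(log_ratio(100, 100))  # 0.0

# Traders who only deposit or only withdraw land at infinity.
print(log_ratio(100, 0), log_ratio(0, 100))  # inf -inf
```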
(If you’ve ever worked with neural networks, you can probably guess what’s about to come).
Containing the scale
There aren’t many functions in the realm of machine learning as well-known as the “Sigmoid function.”
In case you’ve been living under an ML-rock, this activation function takes an unbounded range of values and squeezes it into a bounded one; in the sigmoid's case, between 0 and 1. We settled on this function.
Ratio = 1 / (1 + e^(-log(deposit / withdrawal)))
And so, at last, our data was ready to be used in our model.
A quick glance at the distribution looks promising. Notice the high number of “0” and “1” values; these would have been -∞ and ∞ had we not used this function.
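As a rough sketch of this step (function names and the zero-handling are illustrative assumptions, not our production code), the sigmoid of the log ratio can be computed directly. It is worth noting that, since e^(-ln(d / w)) = w / d, the expression simplifies algebraically to deposit / (deposit + withdrawal), which also handles the deposit-only and withdrawal-only cases without touching infinity:

```python
import math


def sigmoid_log_ratio(deposit: float, withdrawal: float) -> float:
    """sigmoid(ln(deposit / withdrawal)), step by step (both amounts > 0)."""
    log_ratio = math.log(deposit / withdrawal)
    return 1.0 / (1.0 + math.exp(-log_ratio))


def deposit_share(deposit: float, withdrawal: float) -> float:
    """Algebraically equivalent closed form: deposit / (deposit + withdrawal)."""
    total = deposit + withdrawal
    if total == 0:
        return math.nan  # assumption: no activity means the ratio is undefined
    return deposit / total


# The two forms agree (up to floating-point error) for positive amounts.
print(sigmoid_log_ratio(400, 100), deposit_share(400, 100))  # both about 0.8
print(sigmoid_log_ratio(100, 400), deposit_share(100, 400))  # both about 0.2

# The edge cases map to the ends of the range instead of to infinity.
print(deposit_share(0, 100), deposit_share(100, 0))  # 0.0 1.0
```

A balanced trader lands at 0.5, and mirror-image traders sit at equal distances on either side of it.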
One glaring disadvantage of this metric is that it’s not very intuitive. This problem is compounded by the fact that the end-users of our system are compliance analysts, and making sense of the alerts that we throw at them is their job.
While the metric itself might be more mathematically accurate, and can even get the clustering algorithm to spit out pretty good results, we would have a hard time explaining to them why a certain trader’s “deposit-to-withdrawal ratio” is acting the way it is, if we had to explain it in terms of logs and sigmoids.
This is a classic product consideration, with a nifty solution.
We would keep the sigmoid metric as an internal calculation of the system, as it is not absolutely necessary to show it to our users - as long as we can provide them with equivalent information that is easier for humans to understand. (The metric would still be used in the clustering logic, in order to ensure that the results we’re surfacing are on-point).
In order to show the ratio on our front-end, we create a user-friendly version of it by dividing the deposits and withdrawals by their greatest common divisor, which we compute with the Euclidean algorithm.
The algorithm takes an input like this: Deposits = 500, Withdrawals = 200
And produces an output like this: Deposit / Withdrawal Ratio is 5:2
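A minimal sketch of that display logic might look like the following. The function names are hypothetical, and we assume the amounts are whole numbers (for example, counted in the smallest unit of the currency); Python's standard library also offers `math.gcd`, which could replace the hand-written loop:

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor via the Euclidean algorithm."""
    while b:
        a, b = b, a % b
    return a


def display_ratio(deposits: int, withdrawals: int) -> str:
    """Reduce deposits:withdrawals to lowest terms for the front-end."""
    divisor = gcd(deposits, withdrawals)
    if divisor == 0:
        return "0:0"  # assumption: no activity at all is shown as 0:0
    return f"{deposits // divisor}:{withdrawals // divisor}"


print(display_ratio(500, 200))  # 5:2
```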
Along with some additional trader stats, our analysts have everything they need in order to keep their exchange manipulation-free!
It’s important to remember that having creative ideas for new features is not always enough, since these ideas often depend on effective execution. Whenever we come across a feature that is not 100% straightforward, we find it helpful to think about how this feature would behave across an entire range of values, and whether it would display inconsistencies that deteriorate results.
Solid data explorations & visualizations are your best friends when tackling such endeavors.