Everything You Need To Know About Modelling Click Stream Data
Ad Click Prediction: A View from the Trenches - Learnings from a research paper by Google focused on modelling click-through rate
What is Click-Through Rate?
Click-through rate (CTR) is the most common metric for measuring engagement within an email campaign. It measures how many people clicked through an email relative to how many emails were delivered.
Essentially, the click rate is the percentage of delivered emails that received at least one click from a subscriber, and it tells you whether your campaign was engaging enough for your audience.
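For instance, a minimal calculation looks like this (the numbers are made up purely for illustration):

```python
def click_through_rate(unique_clicks: int, delivered: int) -> float:
    """CTR = unique clicks / delivered emails, expressed as a percentage."""
    if delivered == 0:
        return 0.0
    return 100.0 * unique_clicks / delivered

# 420 subscribers clicked out of 12,000 delivered emails -> 3.5% CTR
print(click_through_rate(unique_clicks=420, delivered=12_000))  # 3.5
```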
Once you know what CTR is, how to measure it, and why it matters for email, you can use the metric in your next campaign analysis. Analytics play an important role in any email campaign, and it pays to keep best practices front of mind.
Online advertising plays a vital role now that the world has moved into the e-commerce era, and click-through rate (CTR) prediction is its essential driving technology. Given a user, the items on offer, and the context, a CTR model predicts the probability that the user will click on an online advertisement, using the customer's digital traces to produce more precise predictions.
Key takeaways from modelling CTR
According to the authors, the following are the key takeaways and learnings from modelling click-through rates:
The authors [2] experimented with dropout rates from 0.1 to 0.5, tuned via grid search, and found that dropout training does not improve predictive accuracy metrics or generalization ability, and was often detrimental. They believe that, in contrast to vision tasks where dropout has shown excellent performance and the data is dense, CTR data is sparse and the labels are noisy. In their words: “In the dense setting, dropout serves to separate effects from strongly correlated features, resulting in a more robust classifier. But in our sparse, noisy setting adding in dropout appears to simply reduce the amount of data available for learning.”
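To make that intuition concrete, here is a minimal, hypothetical sketch (not the authors' code) of inverted dropout applied to a sparse binary feature vector. With a rate of 0.5, roughly half of the few active features vanish from each training example, which is why dropout mostly throws away scarce signal in this setting:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "CTR-style" example: a very sparse binary feature vector
# (most features are zero, only a handful are active).
x = np.zeros(100_000)
x[rng.choice(x.size, size=20, replace=False)] = 1.0   # 20 active features

def dropout(v: np.ndarray, rate: float) -> np.ndarray:
    """Inverted dropout: zero each coordinate with probability `rate`
    and rescale the survivors so the expected value is unchanged."""
    mask = rng.random(v.shape) >= rate
    return v * mask / (1.0 - rate)

x_dropped = dropout(x, rate=0.5)
print(int(x.sum()), int((x_dropped > 0).sum()))   # e.g. 20 active -> ~10 active
```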
Online gradient descent (OGD) is not particularly effective at producing sparse models; the Regularized Dual Averaging (RDA) algorithm produces better accuracy-versus-sparsity tradeoffs.
The “Follow The Regularized Leader” algorithm, or FTRL-Proximal, gives significantly improved sparsity with the same or better prediction accuracy.
Follow The Regularized Leader (FTRL) is an optimization algorithm developed at Google for click-through rate prediction in the early 2010s. It is best suited to shallow models with large, sparse feature spaces. The algorithm supports both shrinkage-type L2 regularization (an L2 penalty added to the loss function) and online L2 regularization. According to the authors [1], FTRL-Proximal with L1 regularization significantly outperformed the Regularized Dual Averaging (RDA) algorithm in terms of size-versus-accuracy tradeoffs.
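Below is a minimal Python sketch of the per-coordinate FTRL-Proximal update with L1 and L2 regularization for logistic regression on sparse data, following the update described in the Google paper [1]; the hyperparameter values are only illustrative:

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse examples
    represented as {feature_index: value} dicts. Hyperparameter names follow
    the paper (alpha, beta, L1, L2); the default values are illustrative."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)   # accumulated (shifted) gradients
        self.n = defaultdict(float)   # accumulated squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:         # L1 threshold drives the weight to exactly 0
            return 0.0
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, x):
        wx = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(wx, 35.0), -35.0)))

    def update(self, x, y):
        """One online step on example (x, y) with label y in {0, 1}."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v                                   # log-loss gradient
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)          # uses the old weight
            self.n[i] += g * g
```

Note that the weights are never stored directly: only z and n are kept per coordinate, and the L1 threshold produces exact zeros, which is where the memory savings and model sparsity come from.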
Compared with traditional machine learning algorithms such as logistic regression and factorization machines, DNN-based CTR models can better handle combinations of feature vectors and higher-order features, and they have structural advantages in improving prediction accuracy. In this line of work, FO-FTRL-DCN was introduced as a new DNN-based CTR prediction model, built on the well-known Deep & Cross Network (DCN) and augmented with the Follow The Regularized Leader (FTRL) optimization technique for DNNs.
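As a rough illustration (this is not the FO-FTRL-DCN authors' code, and the dimensions are made up), a single DCN cross layer computes x_{l+1} = x_0 (x_l^T w_l) + b_l + x_l, adding one higher order of explicit feature interaction per layer while the residual term preserves the lower-order ones:

```python
import numpy as np

def cross_layer(x0: np.ndarray, xl: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One DCN cross layer: x_{l+1} = x_0 * (x_l . w) + b + x_l."""
    return x0 * np.dot(xl, w) + b + xl

rng = np.random.default_rng(0)
d = 8                                  # embedding dimension (illustrative)
x0 = rng.normal(size=d)                # concatenated embeddings / dense features
x = x0
for _ in range(3):                     # a three-layer cross network
    w, b = rng.normal(size=d), np.zeros(d)
    x = cross_layer(x0, x, w, b)
print(x.shape)                         # (8,), fed into the final prediction layer
```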
The authors [2] also suggest using L1 regularization to save memory when modelling large-scale data, since coefficients driven exactly to zero need not be stored at prediction time. This technique might, however, fare poorly compared to FO-FTRL-DCN.
The authors [2] state that in many scenarios with high-dimensional data, the majority of features are extremely rare, to the extent that some occur only a handful of times in millions of examples. They found that “Bloom filter inclusion” gives good tradeoffs between RAM savings and loss in predictive quality. According to the authors, “We use a rolling set of counting Bloom filters to detect the first n times a feature is encountered in training. Once a feature has occurred more than n times (according to the filter), we add it to the model and use it for training in subsequent observations as above. Note that this method is also probabilistic, because a counting bloom filter is capable of false positives (but not false negatives). That is, we will sometimes include a feature that has actually occurred less than n times.”
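A simplified sketch of the idea is shown below. It uses a single counting Bloom filter rather than the rolling set the authors describe, and the filter size, hash count and threshold n are illustrative:

```python
import hashlib

class CountingBloomFilter:
    """Approximate per-feature counters: increments can collide, so counts
    may be overestimated (false positives) but never underestimated."""

    def __init__(self, size=2**20, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.counts = [0] * size

    def _indexes(self, feature: str):
        for k in range(self.num_hashes):
            h = hashlib.blake2b(f"{k}:{feature}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.size

    def add(self, feature: str) -> int:
        """Increment and return the (approximate) count for `feature`."""
        idxs = list(self._indexes(feature))
        for i in idxs:
            self.counts[i] += 1
        return min(self.counts[i] for i in idxs)

# Probabilistic feature inclusion: only start learning a coefficient for a
# feature once the filter says it has been seen at least n times.
n = 3
bloom = CountingBloomFilter()
model_features = set()
for feature in ["q=shoes", "q=shoes", "q=rare-term", "q=shoes"]:
    if bloom.add(feature) >= n:
        model_features.add(feature)
print(model_features)   # {'q=shoes'} -- the rare feature never enters the model
```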
Naive implementations of OGD use 32- or 64-bit floating-point values to store coefficients. The authors [2] found that switching from 64-bit floats to a q2.13 fixed-point encoding caused no measurable loss in model quality while saving 75% of the RAM used for coefficient storage. The q2.13 encoding uses two bits to the left of the binary decimal point, thirteen bits to the right of it, and one sign bit, for a total of 16 bits per value. The key idea is an explicit randomized rounding step, which ensures that the discretization error has zero mean.
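Here is a small sketch of the q2.13 idea (an illustration of the technique, not Google's production code): coefficients are scaled by 2^13, rounded randomly so the rounding is unbiased, and stored in 16 bits:

```python
import math
import random

Q_BITS = 13                    # fractional bits in the q2.13 format
SCALE = 1 << Q_BITS            # 2**13 quantization steps per unit
MAX_CODE = (1 << 15) - 1       # 16-bit storage: 1 sign + 2 integer + 13 fractional bits

def encode_q213(value: float) -> int:
    """Quantize a coefficient to 16 bits using randomized rounding, so the
    expected decoded value equals the input (zero-mean discretization error)."""
    scaled = value * SCALE
    low = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    code = low + (1 if random.random() < (scaled - low) else 0)
    return max(-MAX_CODE, min(MAX_CODE, code))   # clip to the representable range

def decode_q213(code: int) -> float:
    return code / SCALE

w = 0.123456789
codes = [encode_q213(w) for _ in range(10_000)]
print(sum(decode_q213(c) for c in codes) / len(codes))   # averages back to ~0.12346
```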
References
This article was originally published at https://www.vevesta.com/blog/26-Ad-Click-Prediction .