In today’s data-driven marketing landscape, predicting which users are likely to convert is crucial for optimizing strategies and improving ROI. In this blog post, we’ll walk through the process of building user conversion prediction models using BigQuery ML and the Google Analytics Sample Dataset. We’ll cover feature development, model building, and evaluation, all using SQL.
Contents
- Feature Development
- Building and Training Predictive Models
- Logistic Regression Model
- Random Forest Model
- XGBoost Model
- Confusion Matrix Results
- Performance Comparison
- Feature Importance
- Conclusion
1. Feature Development
Let’s create features that could be indicative of user conversion:
CREATE OR REPLACE TABLE `your-project.your-dataset.user_features` AS
WITH user_sessions AS (
SELECT
fullVisitorId,
PARSE_DATE('%Y%m%d', date) AS visit_date,
totals.transactions,
totals.timeOnSite,
totals.pageviews,
device.deviceCategory,
geoNetwork.country,
trafficSource.medium
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
)
SELECT
fullVisitorId,
MAX(CASE WHEN transactions > 0 THEN 1 ELSE 0 END) AS has_converted,
COUNT(DISTINCT visit_date) AS num_visits,
AVG(timeOnSite) AS avg_time_on_site,
AVG(pageviews) AS avg_pageviews,
MAX(deviceCategory) AS device_category,
MAX(country) AS country,
MAX(medium) AS traffic_medium,
SUM(pageviews) AS total_pageviews,
SUM(timeOnSite) AS total_time_on_site,
DATE_DIFF(MAX(visit_date), MIN(visit_date), DAY) AS days_since_first_visit
FROM
user_sessions
GROUP BY
fullVisitorId;
In this query, we’ve created several features that might be predictive of conversion:
- has_converted: Whether the user has made a purchase (our target variable)
- num_visits: Number of visits by the user
- avg_time_on_site: Average time spent on the site per visit
- avg_pageviews: Average number of pages viewed per visit
- device_category: The user’s device type
- country: The user’s country
- traffic_medium: The traffic source medium
- total_pageviews: Total number of pages viewed across all visits
- total_time_on_site: Total time spent on the site across all visits
- days_since_first_visit: Number of days between the user’s first and last visit
2. Building and Training Predictive Models
Let’s build three different types of models to predict user conversion: Logistic Regression, Random Forest, and XGBoost.
Logistic Regression Model
CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_logistic`
OPTIONS(model_type='logistic_reg', input_label_cols=['has_converted']) AS
SELECT
* EXCEPT(fullVisitorId)
FROM
`your-project.your-dataset.user_features`;
Random Forest Model
CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_random_forest`
OPTIONS(model_type='random_forest_classifier', input_label_cols=['has_converted']) AS
SELECT
* EXCEPT(fullVisitorId)
FROM
`your-project.your-dataset.user_features`;
XGBoost Model
CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_xgboost`
OPTIONS(model_type='boosted_tree_classifier', input_label_cols=['has_converted']) AS
SELECT
* EXCEPT(fullVisitorId)
FROM
`your-project.your-dataset.user_features`;
3. Confusion Matrix Results
Let’s evaluate the performance of all three models:
-- Function to evaluate a model
CREATE TEMP FUNCTION EvaluateModel(model_name STRING)
RETURNS TABLE<
model STRING,
accuracy FLOAT64,
precision FLOAT64,
recall FLOAT64,
f1_score FLOAT64,
log_loss FLOAT64,
roc_auc FLOAT64
>
AS ((
SELECT
model_name AS model,
accuracy,
precision,
recall,
f1_score,
log_loss,
roc_auc
FROM
ML.EVALUATE(MODEL `your-project.your-dataset.${model_name}`,
(
SELECT
* EXCEPT(fullVisitorId)
FROM
`your-project.your-dataset.user_features`
)
)
));
-- Evaluate all models
SELECT * FROM EvaluateModel('user_conversion_logistic')
UNION ALL
SELECT * FROM EvaluateModel('user_conversion_random_forest')
UNION ALL
SELECT * FROM EvaluateModel('user_conversion_xgboost')
ORDER BY model;
Confusion matrices provide a detailed breakdown of the classification results by showing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Here’s an analysis of the confusion matrices for Logistic Regression, Random Forest, and XGBoost models.
Logistic Regression
Logistic Regression shows a high number of true negatives (59,339) and a low number of false positives (184), which indicates that it correctly identifies non-converters most of the time. However, it struggles with correctly identifying converters, as shown by the relatively high number of false negatives (791) and the low number of true positives (209).
Predicted: 0 (not converted) | Predicted: 1 (converted) | |
Actual: 0 (not converted) | (TN) 59,339 | (FP) 184 |
Actual: 1 (converted) | (FN) 791 | (TP) 209 |
Random Forest
The Random Forest model improves significantly over Logistic Regression in terms of detecting true positives (639 vs. 209), which indicates that it is better at identifying actual converters. The number of false negatives is also reduced to 361, meaning fewer actual converters are missed. The model also maintains a low false positive rate (126), which means it remains accurate in not falsely predicting conversions when there are none. Overall, Random Forest offers a better balance between detecting converters and avoiding false alarms.
Predicted: 0 (not converted) | Predicted: 1 (converted) | |
Actual: 0 (not converted) | (TN) 59,397 | (FP) 126 |
Actual: 1 (converted) | (FN) 361 | (TP) 639 |
XGBoost
XGBoost provides a mixed performance compared to Random Forest. It has slightly more false positives (246) than Random Forest. It also shows a higher number of false negatives (512) than Random Forest, meaning it misses more actual converters. Despite having a higher true positive count (488) than Logistic Regression, XGBoost underperforms compared to Random Forest in detecting actual converters.
Predicted: 0 (not converted) | Predicted: 1 (converted) | |
Actual: 0 (not converted) | (TN) 59,277 | (FP) 246 |
Actual: 1 (converted) | (FN) 512 | (TP) 488 |
Based on the confusion matrix analysis, Random Forest emerges as the top-performing model for predicting user conversion due to its superior balance between identifying actual conversions and minimizing false predictions.
XGBoost is a viable alternative but may require adjustments, while Logistic Regression is the least effective in this context, particularly due to its high false negative rate.
4. Performance Comparison
Logistic Regression | Random Forest | XGBoost | |
Precision | 0.532 | 0.835 | 0.665 |
Recall | 0.209 | 0.639 | 0.488 |
Accuracy | 0.984 | 0.992 | 0.987 |
F1 Score | 0.300 | 0.724 | 0.563 |
Log Loss | 0.046 | 0.144 | 0.028 |
AUC | 0.986 | 0.986 | 0.993 |
When evaluating predictive models, key metrics include precision, recall, accuracy, F1 score, Log Loss, and AUC.
Precision measures how often positive predictions are correct, while recall assesses how well the model identifies actual positives.
Accuracy gives an overall correctness measure, but can be misleading in imbalanced datasets.
F1 Score balances precision and recall.
Log Loss penalizes incorrect predictions, and AUC shows the model’s ability to distinguish between classes.
1. Precision
- Logistic Regression (0.532): This model shows moderate precision, meaning that 53.2% of the users it predicted as converters were indeed converters. However, this also indicates that nearly half of the positive predictions were false positives.
- Random Forest (0.835): Random Forest excels in precision with 83.5%, making it very effective in minimizing false positives. This model is ideal if the goal is to avoid wasting resources on users who are unlikely to convert.
- XGBoost (0.665): XGBoost offers a precision of 66.5%, which is higher than Logistic Regression but lower than Random Forest.
2. Recall
- Logistic Regression (0.209): The recall of 20.9% suggests that Logistic Regression misses a significant portion of actual converters. While it is conservative in predicting conversions, it could lead to lost opportunities by not identifying potential converters.
- Random Forest (0.639): With a recall of 63.9%, Random Forest is much better at capturing true positives. It balances between identifying converters and maintaining precision, making it a strong candidate for conversion prediction.
- XGBoost (0.488): XGBoost’s recall of 48.8% falls between Logistic Regression and Random Forest. It is less aggressive than Random Forest but still better than Logistic Regression in identifying potential converters.
3. Accuracy
- Logistic Regression (0.984): Despite its lower recall and precision, Logistic Regression shows high accuracy at 98.4%. However, accuracy can be misleading in imbalanced datasets, where the majority class dominates.
- Random Forest (0.992): Random Forest’s accuracy of 99.2% indicates it performs well overall, reducing both false positives and false negatives.
- XGBoost (0.987): XGBoost also shows high accuracy at 98.7%, slightly below Random Forest but still indicating strong overall performance.
4. F1 Score
- Logistic Regression (0.300): The F1 score of 30.0% is low, reflecting the model’s struggle to balance precision and recall. This low score indicates that the model is not well-suited for conversion prediction, especially if both precision and recall are important.
- Random Forest (0.724): Random Forest achieves a high F1 score of 72.4%, suggesting a good balance between precision and recall. This makes it a robust choice for conversion prediction where both metrics are crucial.
- XGBoost (0.563): XGBoost’s F1 score of 56.3% suggests it provides a better balance than Logistic Regression but is slightly less effective than Random Forest.
5. Log Loss
- Logistic Regression (0.046): Logistic Regression has the lowest log loss at 0.046, indicating that it is very confident in its predictions, even though the overall effectiveness may be compromised by low recall.
- Random Forest (0.144): Random Forest has a higher log loss of 0.144, indicating that while it is accurate, it is less confident in its predictions compared to Logistic Regression.
- XGBoost (0.028): XGBoost outperforms the other models with the lowest log loss of 0.028, indicating high confidence in its predictions. This, combined with its balanced performance, makes it a strong contender.
6. AUC
- Logistic Regression (0.986): The AUC of 98.6% indicates that Logistic Regression is very good at distinguishing between converters and non-converters, despite its lower recall.
- Random Forest (0.986): Random Forest also achieves an AUC of 98.6%, matching Logistic Regression and showing strong overall performance in distinguishing between classes.
- XGBoost (0.993): XGBoost tops the AUC at 99.3%, indicating that it is the most effective model at distinguishing between converters and non-converters, making it particularly powerful for this task.
Conclusion
- Logistic Regression: Offers high accuracy and low log loss, but its low recall and F1 score suggest it is less effective for predicting user conversions when both precision and recall are essential.
- Random Forest: Strikes a strong balance between precision, recall, and overall accuracy. Its high F1 score and solid AUC make it a very reliable model for conversion prediction.
- XGBoost: Delivers the best AUC and the lowest log loss, indicating strong predictive power and confidence. It provides a good balance, though it may require fine-tuning to optimize precision and recall fully.
For predicting user conversion, XGBoost stands out as a robust model, particularly for scenarios where distinguishing between converters and non-converters is critical.
However, Random Forest remains a strong alternative, offering a slightly better balance between precision and recall.
5. Feature Importance
Given that both the Confusion Matrix results and the performance metrics comparison indicate the Random Forest model outperforms the other two models, I will focus on the Random Forest model to discuss feature importance.
Understanding the key features driving the Random Forest model’s success is crucial. The model highlights “avg_time_on_site” and “total_pageviews” as the top factors, indicating that user engagement and browsing behavior are pivotal in predicting conversion potential within a non-linear context.
avg_time_on_site | 14436.0 |
total_pageviews | 14121.0 |
avg_pageviews | 7509.0 |
total_time_on_site | 7108.0 |
days_since_first_visit | 7085.0 |
num_visits | 5667.0 |
traffic_medium | 3484.0 |
device_category | 1852.0 |
country | 825.0 |
While the Random Forest model stands out in this analysis, it’s also worth noting that Logistic Regression and XGBoost, while slightly less effective, provide valuable insights.
However, they may not fully capture the complexity of user behavior that the Random Forest model does, particularly in terms of how engagement metrics and traffic sources interact to influence conversion outcomes.
6. Conclusion
In the competitive landscape of digital marketing, accurately predicting user conversion is essential for optimizing strategies and maximizing ROI.
I built and compared three models: Logistic Regression, Random Forest, and XGBoost.
Each model provided valuable insights, with Random Forest emerging as the most balanced in terms of precision, recall, and overall accuracy.
XGBoost, while showing exceptional AUC and low log loss, also proved to be a strong contender, especially in distinguishing between converters and non-converters.
The analysis of feature importance within the Random Forest model underscored the significance of user engagement metrics like average time on site and total page views in predicting conversions.
These findings highlight the importance of understanding user behavior holistically to develop more effective marketing strategies.
You can find the complete code in my GitHub repository.