Predicting Hotel Reservation Cancellations¶
This is a dataset provided by the MIT IDSS Data Science and ML Course. We will use it to predict cancellations.
Loading Data and Preprocessing¶
Loading¶
Loading the relevant libraries and data.
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score,
                             roc_curve, precision_recall_curve)
from xgboost import XGBClassifier, plot_importance
# Load the dataset (upload your file to Colab first or use local path)
file_path = '/content/INNHotelsGroup.csv' # Update if needed
data = pd.read_csv(file_path)
Check for Missing¶
So far this looks like a very clean dataset with no missing data. That is rarely the case with real-world data, but it is convenient here. Keep in mind, though, that the absence of missing values tells us nothing about erroneous entries.
If we did have missing values, we would have to decide how to handle them: various forms of imputation, dropping the affected observations, or choosing model types that tolerate missing data, such as XGBoost.
This output also tells us about the datatypes and how we may need to encode some.
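As a purely hypothetical sketch (this dataset turns out to have no missing values), a simple imputation step with scikit-learn's SimpleImputer could look like the following; the median/most-frequent strategies are illustrative choices, not part of the original analysis.
# Hypothetical sketch (not needed here): median-impute numeric columns and
# most-frequent-impute categorical columns if missing values were present.
from sklearn.impute import SimpleImputer
num_cols = data.select_dtypes(include='number').columns
cat_cols = data.select_dtypes(include='object').columns
data_imputed = data.copy()
data_imputed[num_cols] = SimpleImputer(strategy='median').fit_transform(data[num_cols])
data_imputed[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(data[cat_cols])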
# View basic info about the dataset
print("Dataset Info:")
print(data.info())
# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   lead_time                   36275 non-null  int64
 1   market_segment_type         36275 non-null  object
 2   no_of_special_requests      36275 non-null  int64
 3   avg_price_per_room          36275 non-null  float64
 4   no_of_adults                36275 non-null  int64
 5   no_of_weekend_nights        36275 non-null  int64
 6   arrival_date                36275 non-null  object
 7   required_car_parking_space  36275 non-null  int64
 8   no_of_week_nights           36275 non-null  int64
 9   booking_status              36275 non-null  object
dtypes: float64(1), int64(6), object(3)
memory usage: 2.8+ MB
None

Missing Values:
lead_time                     0
market_segment_type           0
no_of_special_requests        0
avg_price_per_room            0
no_of_adults                  0
no_of_weekend_nights          0
arrival_date                  0
required_car_parking_space    0
no_of_week_nights             0
booking_status                0
dtype: int64
Encode Categorical Features/Variables¶
Here we encode the two categorical columns, market_segment_type and booking_status, as integers. Each has only two levels, so the encoded columns are effectively binary (0, 1).
# Encode categorical variables
label_enc = LabelEncoder()
data['market_segment_type'] = label_enc.fit_transform(data['market_segment_type'])
data['booking_status'] = label_enc.fit_transform(data['booking_status']) # Canceled = 1, Not_Canceled = 0
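LabelEncoder assigns integer codes in alphabetical order of the class labels, so it is worth confirming which string actually ended up as 1 rather than relying on the comment above; a quick check:
# Confirm the integer-to-label mapping produced by the encoder
# (classes_[0] -> 0, classes_[1] -> 1) and the resulting class balance
print(label_enc.classes_)
print(data['booking_status'].value_counts(normalize=True))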
Feature Engineering¶
Here we create new features from the ones we already have.
We keep no_of_weekend_nights and no_of_week_nights and add a new feature, total_nights, which may be informative for our model.
# Feature Engineering
# Create a new feature: total nights stayed
data['total_nights'] = data['no_of_weekend_nights'] + data['no_of_week_nights']
# Drop irrelevant features (if any, e.g., arrival_date)
data = data.drop(columns=['arrival_date'])
The Distribution of the Target Feature¶
This is important to know when deciding which types of models to try. It can also tell us whether to consider oversampling or synthetic sampling such as SMOTE. For example, logistic regression tends to do best when the classes are roughly balanced (somewhere between a 40/60 and a 50/50 split). Some models are less sensitive to imbalance, but many will overfit to the majority class.
# Visualize cancellation rates
plt.figure(figsize=(8, 5))
sns.countplot(data=data, x='booking_status')
plt.title('Cancellation Distribution')
plt.xticks(ticks=[0, 1], labels=['Not_Canceled', 'Canceled'])
plt.show()
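If the imbalance here were more severe, SMOTE would be one option; a hedged sketch follows (it requires the imbalanced-learn package, which is not used elsewhere in this notebook, and it should only ever be applied to the training split created later, never to the test set).
# Hypothetical sketch: oversample the minority class with SMOTE on the
# training split only (X_train / y_train are created further below)
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(pd.Series(y_train_res).value_counts())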
Correlations¶
When we include highly correlated features in a model, we risk introducing multicollinearity, which can lead to overfitting or poor model selection.
Multicollinearity occurs when two or more features are highly correlated, leading to redundancy. This can complicate feature selection because:
- Redundancy: Highly correlated features convey overlapping information, making it harder to determine which feature contributes most to the target variable.
- Instability: Multicollinearity can cause instability in model coefficients, especially for linear models, as small changes in the data can lead to large coefficient variations.
- Reduced Interpretability: It becomes challenging to assess the individual impact of correlated features on the model outcome.
Notice how the feature that we engineered (total_nights) is highly correlated with its component features (no_of_weekend_nights and no_of_week_nights). That makes sense, since it is their sum, so we will have to be careful when selecting among those three.
booking_status and lead_time show some correlation, but that is fine since booking_status is our target feature. This suggests that lead_time is an important predictor and will likely end up in the final model. Of course, other predictors could still matter in a nonlinear way, since Pearson correlation only captures linear relationships.
# Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(data.corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
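Beyond the pairwise heatmap, variance inflation factors (VIFs) give a more direct multicollinearity check. A sketch using statsmodels (not part of the original analysis) is below; expect total_nights and its two components to show extreme values, since one is an exact sum of the others.
# Optional multicollinearity check: variance inflation factor per feature
# (requires statsmodels); exactly collinear features produce huge or infinite VIFs
from statsmodels.stats.outliers_influence import variance_inflation_factor
features = data.drop(columns=['booking_status'])
vif = pd.DataFrame({
    'feature': features.columns,
    'VIF': [variance_inflation_factor(features.values, i) for i in range(features.shape[1])]
})
print(vif.sort_values('VIF', ascending=False))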
Summary Statistics¶
Let’s check for strange values, similar to before when we looked for missing values.
Notice that avg_price_per_room, no_of_adults, and total_nights all have minimum values of 0. That seems a bit odd. A booking with no adults raises the question of whether minors are even allowed to rent rooms alone; if so, maybe these rows are real. Rooms could be comped (free because of vouchers or similar), but such cases may be trivial for our prediction. As for total_nights = 0, that seems very odd, and I cannot come up with a plausible explanation other than error.
Some of the max values are also a bit high but not impossible. lead_time = 443 is over a year out; again, does hotel policy even allow booking that far ahead? avg_price_per_room = 540 just seems high when the 75th percentile is 120, but it could be the nicest suite during a busy period, so just stay aware of it.
# Summary statistics for numerical columns
summary_stats = data.describe().T # Transpose for better readability
print(summary_stats)
                             count        mean        std  min   25%    50%    75%    max
lead_time                  36275.0   85.232557  85.930817  0.0  17.0  57.00  126.0  443.0
market_segment_type        36275.0    0.709773   0.453873  0.0   0.0   1.00    1.0    1.0
no_of_special_requests     36275.0    0.619655   0.786236  0.0   0.0   0.00    1.0    5.0
avg_price_per_room         36275.0  103.423539  35.089424  0.0  80.3  99.45  120.0  540.0
no_of_adults               36275.0    1.844962   0.518715  0.0   2.0   2.00    2.0    4.0
no_of_weekend_nights       36275.0    0.810724   0.870644  0.0   0.0   1.00    2.0    7.0
required_car_parking_space 36275.0    0.030986   0.173281  0.0   0.0   0.00    0.0    1.0
no_of_week_nights          36275.0    2.204300   1.410905  0.0   1.0   2.00    3.0   17.0
booking_status             36275.0    0.672364   0.469358  0.0   0.0   1.00    1.0    1.0
total_nights               36275.0    3.015024   1.786017  0.0   2.0   3.00    4.0   24.0
How many of each have a value of 0?¶
Maybe these are errors, promotions, or uncommon situations.
# Count the rows where each value equals zero
print("Zero Value Counts:")
print((data[['avg_price_per_room', 'no_of_adults', 'total_nights']] == 0).sum())
Zero Value Counts:
avg_price_per_room    545
no_of_adults          139
total_nights           78
dtype: int64
How common is cancellation when the value = 0?¶
Notice that cancellation is very common in these groups, so we will keep them. They are useful predictors even if the bookings themselves seem trivial. Perhaps they are automatically cancelled when no rooms are actually booked, or some other process cancels them.
# Calculate booking status distributions for each case where the value is 0
zero_cases_summary = {
"avg_price_per_room = 0": data[data['avg_price_per_room'] == 0]['booking_status'].value_counts(),
"no_of_adults = 0": data[data['no_of_adults'] == 0]['booking_status'].value_counts(),
"total_nights = 0": data[data['total_nights'] == 0]['booking_status'].value_counts(),
}
# Convert results to a readable DataFrame
zero_cases_df = pd.DataFrame(zero_cases_summary).fillna(0)
# Display the summary
zero_cases_df.index.name = "Booking Status"
zero_cases_df.reset_index(inplace=True)
zero_cases_df
Booking Status | avg_price_per_room = 0 | no_of_adults = 0 | total_nights = 0
---|---|---|---
1 | 539 | 95 | 76
0 | 6 | 44 | 2
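The same comparison expressed as proportions (a small follow-up, not in the original notebook) makes the pattern easier to read:
# Share of each booking_status value within the zero-value groups
for col in ['avg_price_per_room', 'no_of_adults', 'total_nights']:
    rates = data.loc[data[col] == 0, 'booking_status'].value_counts(normalize=True)
    print(f"\n{col} == 0:\n{rates}")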
Weirdly Cheap Rooms¶
Among rooms with avg_price_per_room < 25, most actually have avg_price_per_room = 0. This could be some error or a promotion.
# Filter rows where avg_price_per_room < 25
low_price_data = data[data['avg_price_per_room'] < 25]
# Plot the distribution
plt.figure(figsize=(10, 6))
sns.histplot(low_price_data['avg_price_per_room'], bins=20, kde=True)
plt.title('Distribution of avg_price_per_room (< 25)', fontsize=14)
plt.xlabel('Average Price Per Room', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
The high cancellation pattern continues here and could be an important signal. I wonder if these are reservations that are often never completed (especially the zero-price cases). Perhaps many are auto-cancelled by a process that we do not know about.
# Filter rows where avg_price_per_room < 25
low_price_cases = data[data['avg_price_per_room'] < 25]
# Check the distribution of booking_status for these cases
low_price_distribution = low_price_cases['booking_status'].value_counts()
# Print the results
print("Distribution of booking_status for cases where avg_price_per_room < 25:")
print(low_price_distribution)
Distribution of booking_status for cases where avg_price_per_room < 25:
booking_status
1    612
0     16
Name: count, dtype: int64
Separating the Dataset from the Target¶
So we end up with a feature matrix X and a target vector y.
# Define features (X) and target (y)
X = data.drop(columns=['booking_status'])
y = data['booking_status']
Preprocessing¶
We already did some preprocessing earlier, encoding the categoricals and creating a new feature, to make EDA a bit easier.
Here we standardize the data. Scaling keeps large-magnitude features from overwhelming small-magnitude ones. Depending on the model, unscaled features can slow convergence or drown out some predictors; not every model cares about this.
# Normalize numerical features (optional)
# Note: fit_transform returns a NumPy array, so column names are dropped here
# and reattached later, before the XGBoost cross-validation
scaler = StandardScaler()
X = scaler.fit_transform(X)
Train-Test Split¶
I performed the exploratory data analysis on the entire dataset. Outside an educational or testing environment I would recommend against that, because it can lead to data leakage: information about the test set creeps into modeling decisions. We are not going to do a huge amount of model testing on this set, so it will not be a major issue here. Just know that with a sufficiently sized dataset, EDA and scaling really should be done after creating the train-test split, and the scaler fitted on the training data should then be applied unchanged to the test set.
Here we set 20% of the data aside for testing and use random_state=42 for reproducibility, which lets us pull the same split later. We would not always set a random state when building a model meant to be used in practice.
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("\nTraining and test data prepared!")
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
# Save preprocessed data for modeling
joblib.dump((X_train, X_test, y_train, y_test), 'preprocessed_data.pkl')
print("Preprocessed data saved as 'preprocessed_data.pkl'")
Training and test data prepared!
Training set size: (29020, 9)
Test set size: (7255, 9)
Preprocessed data saved as 'preprocessed_data.pkl'
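As the caution above suggests, a leakage-free variant would split first and fit the scaler on the training portion only. A minimal sketch of that pattern (not what was run above, where scaling preceded the split):
# Sketch of the leakage-free ordering: split the raw features first,
# then learn scaling parameters from the training split alone
X_raw = data.drop(columns=['booking_status'])
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, test_size=0.2, random_state=42, stratify=y)
split_scaler = StandardScaler().fit(X_tr)      # means/stds estimated on training data only
X_tr_scaled = split_scaler.transform(X_tr)
X_te_scaled = split_scaler.transform(X_te)     # same transform applied to the test set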
Model Building¶
Logistic Regression: Baseline Model¶
I like to start with an all-feature linear or logistic regression model as my baseline. I probably won’t keep it as my final model, especially with collinear features, but it helps establish where I stand. If it performs well, I may tweak it and stick with a simple linear model. Otherwise, it provides an approximate floor. I think of this as Ockham’s Razor of Model Building.
Since my receiver operating characteristic (ROC) curve has an area under the curve (AUC) of 0.86, this is probably my floor for other models, especially given less stellar precision and recall. The AUC isn’t bad, and this model may be worth revisiting, but for now, I won’t analyze it further. Instead, I’ll explore other approaches—if they don’t work out, I’ll come back to this one.
# Initialize Logistic Regression with class weights to handle imbalance
model = LogisticRegression(class_weight='balanced', random_state=42)
# Fit the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1] # Get probabilities for the positive class
# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))
# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC-AUC Score: {roc_auc:.2f}")
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid()
plt.show()
Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.76      0.69      2377
           1       0.87      0.79      0.83      4878

    accuracy                           0.78      7255
   macro avg       0.76      0.78      0.76      7255
weighted avg       0.80      0.78      0.79      7255

Confusion Matrix:
[[1799  578]
 [1002 3876]]

ROC-AUC Score: 0.86
XGBoost¶
Next we will try XGBoost. XGBoost (Extreme Gradient Boosting) is a powerful, optimized implementation of gradient boosting that improves efficiency, accuracy, and scalability by using parallel processing, regularization, and advanced tree-pruning techniques. It is widely used for structured data problems, often outperforming traditional machine learning models in classification and regression tasks.
We get a mean cross-validated ROC-AUC of 0.9373, a clear improvement over the 0.86 logistic baseline (measured on the held-out test set rather than by cross-validation, but the gap is large) even without hyperparameter tuning. So this model is worth trying to improve further.
# Feature names
feature_names = ['lead_time',
'market_segment_type',
'no_of_special_requests',
'avg_price_per_room',
'no_of_adults',
'no_of_weekend_nights',
'required_car_parking_space',
'no_of_week_nights',
'total_nights']
# Convert X to DataFrame with column names
X = pd.DataFrame(X, columns=feature_names) # Ensure X retains feature names
# Initialize the XGBoost classifier
xgb_model = XGBClassifier(
scale_pos_weight=(len(y) - sum(y)) / sum(y), # Handle imbalance
eval_metric='logloss', # Prevent warnings
random_state=42
)
# 5-Fold Stratified Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc_scores = []
for train_index, val_index in skf.split(X, y):
X_train_fold, X_val_fold = X.iloc[train_index], X.iloc[val_index] # Use .iloc for DataFrame indexing
y_train_fold, y_val_fold = y[train_index], y[val_index]
# Fit the model on the training fold
xgb_model.fit(X_train_fold, y_train_fold)
# Predict probabilities on the validation fold
y_val_pred_proba = xgb_model.predict_proba(X_val_fold)[:, 1]
# Calculate AUC for the validation fold
auc_score = roc_auc_score(y_val_fold, y_val_pred_proba)
cv_auc_scores.append(auc_score)
# Print results
print(f"5-Fold Cross-Validation AUC Scores: {cv_auc_scores}")
print(f"Mean AUC: {np.mean(cv_auc_scores):.4f}")
print(f"Standard Deviation of AUC: {np.std(cv_auc_scores):.4f}")
5-Fold Cross-Validation AUC Scores: [0.9329960243228853, 0.9422372873286998, 0.9361036984370684, 0.9359622582342777, 0.9392755381066643]
Mean AUC: 0.9373
Standard Deviation of AUC: 0.0032
Feature Importance¶
This feature importance plot from XGBoost shows how often each feature was used to split nodes in the decision trees during training. The "F score" (frequency score) on the x-axis is the number of times a feature was used in a split across all trees in the model.
Key Takeaways:¶
- Most Important Features: lead_time and avg_price_per_room are the most influential features, meaning they were frequently used in decision splits. This suggests that how far in advance a booking is made and the average room price strongly impact the model's predictions.
- Moderate Importance: no_of_week_nights, total_nights, and no_of_special_requests also play significant roles, but to a lesser extent than the top two.
- Least Important Features: market_segment_type and required_car_parking_space contribute the least to model decisions, implying they have minimal impact on predictions.
# Plot feature importance (pass an Axes so the figure size applies to this plot
# instead of leaving an empty figure behind)
fig, ax = plt.subplots(figsize=(10, 6))
plot_importance(xgb_model, max_num_features=10, ax=ax)  # Show top 10 features
plt.title("Feature Importance")
plt.show()
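The plot above uses the default importance_type='weight' (split counts); 'gain', which measures the average loss reduction a feature contributes when it is used, often ranks features differently and can be worth comparing. A short sketch of that comparison:
# Compare against gain-based importance (average improvement per split)
fig, ax = plt.subplots(figsize=(10, 6))
plot_importance(xgb_model, max_num_features=10, importance_type='gain', ax=ax)
plt.title("Feature Importance (gain)")
plt.show()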
Hyperparameter Tuning¶
After hyperparameter tuning, the XGBoost model achieved a mean ROC-AUC of about 0.940 across 5-fold cross-validation, compared with 0.937 for the untuned model above. The improvement is marginal, which suggests the default parameters were already near optimal for this dataset.
Warning: this may be a bit slow as it tests many models.
Here’s a breakdown of the XGBoost hyperparameters and how they impact model performance:
1. max_depth: [3, 5, 7]¶
- Controls the maximum depth of each tree.
- Lower values (3) → prevents overfitting, making the model simpler but possibly underfitting.
- Higher values (7) → captures more complex patterns but increases overfitting risk.
- Trade-off: balance depth for interpretability and generalization.
2. learning_rate: [0.01, 0.1, 0.2]¶
- Controls how much the model adjusts in each boosting iteration.
- Lower values (0.01) → more gradual learning; needs more trees (n_estimators) but generalizes well.
- Higher values (0.2) → faster convergence but may overshoot optimal weights and overfit.
- Trade-off: a lower learning_rate with a higher n_estimators is often preferred.
3. n_estimators: [100, 200, 300]¶
- Number of boosting rounds (trees).
- Lower values (100) → faster training, but may not fully capture patterns.
- Higher values (300) → more trees improve learning but increase training time and overfitting risk.
- Trade-off: more trees call for a lower learning_rate to avoid overfitting.
4. subsample: [0.7, 0.8, 1.0]¶
- Fraction of training rows randomly sampled for each boosting round.
- Lower values (0.7, 0.8) → introduces randomness, which helps prevent overfitting.
- Higher value (1.0) → uses the full dataset and can overfit.
- Trade-off: typically 0.7–0.8 gives a good balance of generalization and robustness.
5. colsample_bytree: [0.7, 0.8, 1.0]¶
- Fraction of features (columns) randomly selected per tree.
- Lower values (0.7, 0.8) → reduces overfitting by preventing trees from relying on specific features.
- Higher value (1.0) → uses all features and can overfit if the dataset has many features.
- Trade-off: 0.7–0.8 is often a good range for diverse learning.
Summary of Tuning Strategy¶
- Prevent overfitting: lower max_depth, colsample_bytree, subsample, and learning_rate.
- Improve learning: increase n_estimators, balanced against learning_rate.
- Increase model complexity: higher max_depth, subsample, and colsample_bytree.
# Define parameter grid
param_grid = {
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2],
'n_estimators': [100, 200, 300],
'subsample': [0.7, 0.8, 1.0],
'colsample_bytree': [0.7, 0.8, 1.0]
}
# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Function to evaluate model
def evaluate_model(params):
cv_auc_scores = []
for train_index, val_index in skf.split(X, y):
        X_train_fold, X_val_fold = X.iloc[train_index], X.iloc[val_index]  # Use .iloc: X is a DataFrame at this point
y_train_fold, y_val_fold = y[train_index], y[val_index]
# Initialize model with current parameters
model = XGBClassifier(
max_depth=params['max_depth'],
learning_rate=params['learning_rate'],
n_estimators=params['n_estimators'],
subsample=params['subsample'],
colsample_bytree=params['colsample_bytree'],
scale_pos_weight=(len(y) - sum(y)) / sum(y), # Handle imbalance
eval_metric='logloss',
random_state=42
)
# Fit and evaluate
model.fit(X_train_fold, y_train_fold)
y_val_pred_proba = model.predict_proba(X_val_fold)[:, 1]
auc_score = roc_auc_score(y_val_fold, y_val_pred_proba)
cv_auc_scores.append(auc_score)
return np.mean(cv_auc_scores)
# Grid search
best_score = 0
best_params = None
for max_depth in param_grid['max_depth']:
for learning_rate in param_grid['learning_rate']:
for n_estimators in param_grid['n_estimators']:
for subsample in param_grid['subsample']:
for colsample_bytree in param_grid['colsample_bytree']:
params = {
'max_depth': max_depth,
'learning_rate': learning_rate,
'n_estimators': n_estimators,
'subsample': subsample,
'colsample_bytree': colsample_bytree
}
score = evaluate_model(params)
if score > best_score:
best_score = score
best_params = params
print(f"Best AUC Score: {best_score}")
print(f"Best Parameters: {best_params}")
Best AUC Score: 0.9399748046702175 Best Parameters: {'max_depth': 7, 'learning_rate': 0.2, 'n_estimators': 300, 'subsample': 1.0, 'colsample_bytree': 1.0}
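For reference, the same search can be expressed more compactly (and run in parallel) with scikit-learn's GridSearchCV. This is a sketch of an equivalent setup, not what was run above:
# Equivalent grid search using GridSearchCV with ROC-AUC scoring
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
    estimator=XGBClassifier(
        scale_pos_weight=(len(y) - sum(y)) / sum(y),
        eval_metric='logloss',
        random_state=42
    ),
    param_grid=param_grid,
    scoring='roc_auc',
    cv=skf,
    n_jobs=-1
)
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)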
# Initialize the XGBoost classifier
# (default hyperparameters here; the best_params found above could also be passed in)
xgb_model = XGBClassifier(
    scale_pos_weight=(len(y_train) - sum(y_train)) / sum(y_train),  # Handle class imbalance
    eval_metric='logloss',
    random_state=42
)
# Train the model
xgb_model.fit(X_train, y_train)
# Predictions
y_pred_proba = xgb_model.predict_proba(X_test)[:, 1]
y_pred = xgb_model.predict(X_test)
# Compute metrics
roc_auc = roc_auc_score(y_test, y_pred_proba)
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot ROC-AUC curve
plt.figure(figsize=(8, 6))
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
# Display confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Display classification report
report = classification_report(y_test, y_pred, output_dict=True)
df_report = pd.DataFrame(report).transpose()
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.79      0.84      0.81      2377
           1       0.92      0.89      0.90      4878

    accuracy                           0.87      7255
   macro avg       0.85      0.86      0.86      7255
weighted avg       0.88      0.87      0.87      7255
Conclusion¶
The XGBoost model outputs binary predictions (0 = no cancellation, 1 = cancellation), but we can also extract the probability scores (predict_proba). This allows us to adjust our business logic dynamically rather than relying on a hard cutoff.
How This Works:¶
- Binary Output (predict) – The model classifies each booking as likely to cancel (1) or not cancel (0).
- Probability Scores (predict_proba) – Instead of just 0 or 1, we retrieve the actual probability of cancellation. For example:
  - Booking A → 0.30 (30%) cancellation risk → Standard booking.
  - Booking B → 0.85 (85%) cancellation risk → Require a deposit or offer a discount for a non-refundable rate.
Next Steps: Implementing Business Logic¶
- Threshold-Based Decisioning (a sketch follows below):
  - If p(cancellation) > 0.8 → Require a deposit.
  - If 0.5 < p(cancellation) ≤ 0.8 → Offer non-refundable incentives.
  - If p(cancellation) ≤ 0.5 → Allow standard booking.
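A minimal sketch of applying such a rule to the model's probability scores; the thresholds and action labels are the illustrative ones above, not tuned values:
# Map predicted cancellation probabilities to illustrative business actions
proba = xgb_model.predict_proba(X_test)[:, 1]
actions = pd.cut(
    proba,
    bins=[0.0, 0.5, 0.8, 1.0],
    labels=['standard booking', 'offer non-refundable incentive', 'require deposit'],
    include_lowest=True
)
print(pd.Series(actions).value_counts())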