The goal of this project is to understand the underlying dynamics of shopping duration at an online store.
The natural path forward is to break the problem down into three parts:
1) understand the data via exploration (EDA),
2) design a workflow to transform raw data into the feature space of the model,
3) model selection and prediction.
Exploratory Data Analysis
The data comprises timestamps, categorical and continuous variables:
Next, I established the following hypotheses to guide my exploration of how the features affect shopping duration.
Shopping basket size proportional to shopping time
Summing all items on the users’ baskets and plotting against the shopping duration for each shopping event:
For the bulk of events, which have a basket of between 1 and 50 items, there is a weak linear relationship between basket size and shopping time (Pearson coefficient: 0.39).
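A minimal sketch of how such a coefficient can be computed, assuming each shopping event has been reduced to a (basket size, shopping time) pair. The toy numbers below are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: one (basket size, shopping minutes) pair per event.
basket_sizes = [3, 8, 15, 22, 40, 47, 120]
shopping_minutes = [12, 20, 18, 35, 30, 55, 60]

# Keep only the bulk of events (1 to 50 items), as in the text.
pairs = [(b, t) for b, t in zip(basket_sizes, shopping_minutes) if 1 <= b <= 50]
r = pearson([b for b, _ in pairs], [t for _, t in pairs])
print(f"Pearson r = {r:.2f}")
```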
Basket complexity drives longer duration
I figured a proxy for basket complexity might be the number of distinct items and the number of distinct departments per trip.
The density of the distribution is concentrated around 1-20 distinct items per basket, although there is considerable density dispersed up to 150 (a long-tailed distribution). Since there is a wide range of cases, I split the sample into deciles and drew box plots of shopping duration for each:
Interesting! A few things to note.
10% of the sample has a range of 200 distinct items, while 90% are within 30 items.
Shopping duration variance increases with the number of distinct items in the basket, so I expect those events to be harder to predict.
A linear effect is notable, i.e. the more distinct items, the longer the shopping time.
From the number of outliers in each decile, the data doesn’t seem to be normally distributed.
Although shopping time increases with distinct items per trip, the range that they span is very similar!
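The decile split can be sketched as follows. This is an illustration only: the events are synthetic, the helper name decile_groups is hypothetical, and the groups are formed by rank so that each holds the same number of events.

```python
import statistics

def decile_groups(events, n_groups=10):
    """Sort events by distinct-item count and cut them into equal-sized groups."""
    ordered = sorted(events, key=lambda event: event[0])
    n = len(ordered)
    return [ordered[i * n // n_groups:(i + 1) * n // n_groups]
            for i in range(n_groups)]

# Synthetic events: (distinct items in basket, shopping minutes).
events = [(i % 37 + 1, 10.0 + 0.8 * (i % 37) + (i % 5)) for i in range(200)]

groups = decile_groups(events)
for k, group in enumerate(groups, start=1):
    items = [d for d, _ in group]
    times = [t for _, t in group]
    # Quartiles are the numbers a box plot would draw for this decile.
    q1, median, q3 = statistics.quantiles(times, n=4)
    print(f"decile {k:2d}: items {min(items):3d}-{max(items):3d}, "
          f"median {median:5.1f} min, IQR {q3 - q1:4.1f}")
```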
Taking a similar approach for distinct departments:
With this exploration, one might think a linear regression could be a good starting point to model shopping time.
Reordering is faster
I computed a metric that translates to how much a shopper reorders items.
It is the sum, over a shopper's items, of the number of orders in which each item appears, normalized by the number of distinct items that shopper has ordered; I call it the reordering factor.
Take the shopper with ID 52, for example: for each of her items, the metric counts the separate orders in which that item is present. The sum is then normalized by the total number of items she has ordered, without counting recurrences.
|52||489||1||(1+2+3) / 3 = 2.0|
A reordering factor of 1 corresponds to a user who doesn’t show any reordering behaviour; larger values correspond to heavier reordering by that user.
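Here is a small sketch of the metric as I read the definition above: for each user, sum the number of orders in which each distinct item appears, then divide by the number of distinct items. The row layout (user_id, order_id, item_id) and the item ids are assumptions made for the example.

```python
from collections import defaultdict

def reordering_factor(order_items):
    """Return {user_id: reordering factor} from (user_id, order_id, item_id) rows."""
    # For every (user, item) pair, collect the distinct orders containing the item.
    orders_per_item = defaultdict(set)
    for user_id, order_id, item_id in order_items:
        orders_per_item[(user_id, item_id)].add(order_id)

    order_counts = defaultdict(int)    # sum of per-item order counts
    distinct_items = defaultdict(int)  # number of distinct items per user
    for (user_id, _item_id), orders in orders_per_item.items():
        order_counts[user_id] += len(orders)
        distinct_items[user_id] += 1

    return {user: order_counts[user] / distinct_items[user] for user in order_counts}

# Hypothetical rows: user 52 buys item 101 once, 102 twice and 103 three times;
# user 7 places a single order and never reorders.
rows = [
    (52, 1, 101), (52, 1, 102), (52, 1, 103),
    (52, 2, 102), (52, 2, 103),
    (52, 3, 103),
    (7, 1, 201), (7, 1, 202),
]
factors = reordering_factor(rows)
print(factors[52])  # (1 + 2 + 3) / 3 = 2.0
print(factors[7])   # no reordering -> 1.0
```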
There is an interesting effect here: the reordering factor does indeed seem to be inversely related to shopping time.
I should explore further why the trend is not monotonically decreasing.
Temporality influences shopper’s time
I weighted the data proportionally to the frequency of starting shopping time. This shows an abstraction of the shopping habits of the entire sample.
There is an obvious trend of relatively high shopping times on Mondays from 10 am to 1 pm and on Sundays from 1 pm to 4 pm. One idea would be to create a categorical feature that flags events whose shopping start time falls within those periods.
Also, a quick inspection on mean shopping time per store reveals significant variation among them; a dummy feature for store_id would be appropriate for our model.
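Both feature ideas can be sketched in a few lines. The window boundaries come from the observation above; the function names, the choice to treat the end hour as exclusive, and the store ids other than 54 are assumptions for illustration.

```python
from datetime import datetime

# (weekday, start hour, end hour) with Monday=0 ... Sunday=6; end hour is exclusive.
PEAK_WINDOWS = [(0, 10, 13), (6, 13, 16)]

def is_peak_start(start_time):
    """Return 1 if the shopping start time falls inside a peak window, else 0."""
    return int(any(
        start_time.weekday() == day and lo <= start_time.hour < hi
        for day, lo, hi in PEAK_WINDOWS
    ))

def one_hot_store(store_id, known_stores):
    """Encode store_id as one 0/1 dummy column per known store."""
    return {f"store_{s}": int(s == store_id) for s in known_stores}

print(is_peak_start(datetime(2024, 1, 1, 11, 30)))  # a Monday, 11:30
print(is_peak_start(datetime(2024, 1, 2, 11, 30)))  # a Tuesday, 11:30
print(one_hot_store(54, known_stores=[3, 29, 54]))
```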
Older cohorts, shorter times
Finally, I built a cohort-analysis heat map to visualize how shopping time progresses, that is, how it varies with recurrent usage of the platform.
There are two interesting effects to note here.
Shoppers who made their first purchase during weeks 36 - 39 (assuming their first trip in this dataset is their first trip on the platform) share a high shopping duration. It would be interesting to explore the attributes they have in common.
It is visible that, as time progresses, users do get more time-efficient on the platform.
I built a class called Preprocessing which handles the input files (order_items and events_data) and transforms them into a feature space ready for modeling. It also generates the engineered features from the EDA.
The usage is as follows:
Now that we have an iterable workflow, let’s build a model to predict shopping_time.
When going through the data I started thinking about the purpose of the model.
A model’s purpose is either to predict a variable that’s hard to obtain or to infer how the predictors affect the response variable.
Since the data already includes information about the basket whose duration is to be predicted, the model is more likely to be useful for inference; in other words, for understanding the drivers of shopping duration.
With that in mind and from the hints of linearity in the EDA, it makes sense to make our first model a linear regression.
To validate the model before predicting on the test set, I further split the training dataset into training and validation sets.
My predictors are a combination of categorical, discrete and continuous variables, based on the previous exploration.
As a baseline, I predicted shopping time for all shopping events in the validation set using the mean of the training data.
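A sketch of that baseline, assuming mean squared error as the metric. The shopping times are invented for the example.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical shopping times in minutes.
y_train = [20.0, 35.0, 28.0, 41.0, 26.0]
y_valid = [22.0, 38.0, 30.0]

# The baseline predicts the training mean for every validation event.
baseline_pred = sum(y_train) / len(y_train)
baseline_mse = mse(y_valid, [baseline_pred] * len(y_valid))
print(f"baseline prediction: {baseline_pred:.1f} min, MSE: {baseline_mse:.2f}")
```

Any model worth keeping should beat this number on the same validation set.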
I fitted the data to an L1-regularized regression (Lasso). The default parameters in sklearn over-penalized the majority of the predictor coefficients, yielding a very poor model. With a cross-validation approach, I tuned the alpha parameter to find the set of coefficients that yields the lowest error with as little variance as possible.
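One way to do this kind of tuning is sklearn's LassoCV, which fits the model along a path of alpha values and keeps the one with the lowest cross-validated error. This is a sketch on synthetic data and not necessarily the exact procedure used here; the feature scaling step and the 5-fold setting are my assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of six columns drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Scale the features so that the L1 penalty treats them comparably,
# then let cross-validation pick alpha.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print("cross-validated alpha:", lasso.alpha_)

# Rank features by absolute coefficient; irrelevant ones shrink towards zero.
ranked = sorted(enumerate(lasso.coef_), key=lambda kv: abs(kv[1]), reverse=True)
for index, coef in ranked:
    print(f"feature {index}: {coef:+.3f}")
```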
At this point, the model outperforms the Means-Model by 21% and it’s now capable of explaining 31% of the variance.
Also, we can look into the feature selection (since Lasso penalizes coefficients and drives them towards zero) with the cross-validated alpha.
Let’s look at the Top 12 (absolute effect):
Some questions come to mind:
Looks like the shoppers of Store 54 are far more time-efficient than any other. What do they sell, what could we learn from them?
Why is reordering not as big of a factor? Could we optimize basket recommendations?
Random Forest Regressor
I tried a quick non-linear approach to the problem as well.
So after all the work on the Lasso, it turns out a Random Forest outperforms it in terms of bias.
Still, each has its advantages. Depending on our goal for this model, these are some considerations to take into account going forward.
Lasso over Tree-based Model:
- Since it is a linear model, we can provide measurable insight on how our features influence shopping duration, providing extra value for teams within the company to guide optimizations.
- Fitting it is not easily distributed, yet it is not computationally demanding.
- This model could be updated online as more data becomes available using stochastic gradient descent, instead of batch processing.
- As we saw from plotting and the MSE, there are highly influential points to be addressed before fitting the model.
- We should explore increasing the complexity of this model, while cross-validating to avoid over-fitting. Ideas would be to create cross-effects between our continuous variables and some variable transformations (squared, cubed).
Tree-based Model over Lasso:
- Can be parallelized
- Fast predictions once it has been trained
- Lower bias but prone to over-fitting
- We can look at a relative measure of feature importance to gauge each feature's influence on shopping duration. Still, it is not as interpretable as the Lasso counterpart.
- Needs to be tuned. Naturally we could increase the number of estimators and vary the maximum number of random features per split. I would use cross-validation to pick the parameters that decrease the error metric.
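The tuning described in the last bullet could look roughly like this with sklearn's GridSearchCV. The data is synthetic and the grid values are arbitrary placeholders, so treat it as a starting point and not a recommended configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with a non-linear interaction between the first two features.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
y = 10.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

# Placeholder grid: number of trees and maximum features considered per split.
param_grid = {"n_estimators": [50, 100], "max_features": [2, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # higher (closer to zero) is better
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated MSE:", -search.best_score_)
print("relative feature importances:", search.best_estimator_.feature_importances_)
```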
In any case, more time should be spent understanding the data since better features always beat better models!