Given a synthetic dataset of coordinates, closing dates, and prices, the goal is to code a k-NN model that predicts home closing prices and to evaluate its main trade-offs in terms of computational expense, scalability, and performance.

Here is a sample of the data:

latitude longitude close_date close_price
0 1.501986 86.350685 2014-08-16 22:25:31.925431 1.302246e+06
1 36.367095 -98.664280 2014-08-05 06:34:00.165876 1.475045e+05
2 36.599284 -97.924700 2014-08-12 23:48:00.887510 1.374006e+05
3 67.994791 64.688589 2014-08-17 05:27:01.404296 -1.411200e+04
4 36.647982 -97.866100 2014-08-09 04:00:40.358242 2.391053e+05

Before jumping into the solution, note a particular implementation constraint: to avoid time leakage, a home j should be considered a neighbor of a given home i only if the close date of j occurred prior to the close date of i.

Solution:

I framed the solution to this problem by building two Python classes.

a) Preprocessing Class - Includes four preprocessing methods to run prior to the prediction model in order to avoid time leakage.

b) KNearestNeighbors Class - Initializes with the number of neighbors k, the distance metric used to compare the similarity of elements, and the preprocessing class. The class then runs a kNN algorithm to predict the closing price for a given home.

Here is the code for the predictive class:
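(The full class is omitted here; below is a minimal sketch of what the predictive class looks like. The brute-force Euclidean distance on raw coordinates, the uniform mean over neighbors, and the method names are illustrative assumptions consistent with the description above.)

```python
from dataclasses import dataclass
import numpy as np
import pandas as pd

@dataclass
class KNearestNeighbors:
    """Brute-force kNN price predictor with a time-leakage filter."""
    k: int = 5

    def fit(self, df: pd.DataFrame) -> None:
        # Keep only the columns the model needs; assumes the schema shown above.
        self.df = df[["latitude", "longitude", "close_date", "close_price"]].copy()

    def predict_one(self, lat: float, lon: float, close_date) -> float:
        # Time-leakage constraint: only homes that closed strictly earlier
        # are eligible neighbors.
        eligible = self.df[self.df["close_date"] < close_date]
        if eligible.empty:
            return float("nan")
        # Euclidean distance on raw coordinates (an assumption; haversine
        # distance would be more appropriate for real geographic data).
        d = np.sqrt((eligible["latitude"] - lat) ** 2
                    + (eligible["longitude"] - lon) ** 2)
        nearest = d.nsmallest(self.k).index
        # Uniform weighting: a plain mean of the k nearest closing prices.
        return float(eligible.loc[nearest, "close_price"].mean())
```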

Now some questions:

2) What is the performance of the model measured in Median Relative Absolute Error?

The data pipeline I built in the allotted timeframe is not optimized for computational efficiency (details in Q5). I truncated the computations after 1,000 predictions, which accounts for only 1.1% of the dataset.

The Median Relative Absolute Error for the current model on the truncated sample is:

• 18.7%
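For reference, the metric itself is straightforward to compute; here `y_true` and `y_pred` stand for the actual and predicted prices of the evaluated homes (the function name is illustrative):

```python
import numpy as np

def median_relative_absolute_error(y_true, y_pred):
    """Median over homes of |predicted - actual| / |actual|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.median(np.abs(y_pred - y_true) / np.abs(y_true)))
```

The median (rather than the mean) keeps the metric robust to the extreme relative errors that outliers like negative or near-zero prices produce.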

3) What would be an appropriate methodology to determine the optimal k?

Assuming the optimal k is the one that yields the lowest model error, I would cross-validate over a range of k values, plot the error for each, and select the value that generalizes with the lowest error.

Equally important, prior knowledge of this business/market could provide other heuristics to consider when cross-validating and selecting the optimal k.
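The sweep described above can be sketched as follows; `fit_predict` and `error_fn` are assumed hooks (train a kNN model with a given k and return predictions for the validation set, and score those predictions, e.g. with the median relative absolute error):

```python
import numpy as np

def select_k(train_df, val_df, candidate_ks, fit_predict, error_fn):
    """Sweep candidate k values and return (best_k, errors_by_k).

    `fit_predict(k, train_df, val_df)` is assumed to train a kNN model
    with the given k and return predictions for `val_df`;
    `error_fn(y_true, y_pred)` scores them (lower is better).
    """
    errors = {}
    for k in candidate_ks:
        preds = fit_predict(k, train_df, val_df)
        errors[k] = error_fn(val_df["close_price"].to_numpy(), preds)
    # Pick the k with the lowest validation error.
    best = min(errors, key=errors.get)
    return best, errors
```

In practice this would be repeated over several folds (respecting the time ordering) and the per-fold errors averaged before picking k.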

4) How would you improve this model?

Note that the only feature I'm using to compute similarity between homes is geographic distance. The model also currently assumes a uniform contribution per data point in the local neighborhood; the next step would be to cross-validate whether uniform weighting is indeed better than weighting proportionally to distance.

With this limited dataset I would also introduce a temporal feature, with the hypothesis that the closer the closing dates of two homes, the more correlated their closing prices, since they share temporal market conditions. I would then test whether the hypothesis holds and, based on the temporal contribution to the housing price, adjust the weights to improve the model (again via cross-validation).

On the other hand, since the value of a home is determined by characteristics beyond location and timing, the most logical step would be to identify the key features that describe and correlate with the value of a home. We could do this via quantitative analysis (e.g., an L1-regularized linear regression) as well as by gathering intangibles/heuristics from field agents.
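An L1-regularized regression (Lasso) is a natural quantitative screen because it drives the coefficients of uninformative features to exactly zero. A sketch on hypothetical home features (the feature names, data, and regularization strength are all illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500
# Hypothetical features: only the first two actually drive the price.
sqft = rng.uniform(500, 4000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
noise_feature = rng.normal(size=n)  # uninformative by construction
price = 150 * sqft + 20_000 * bedrooms + rng.normal(0, 5_000, n)

X = np.column_stack([sqft, bedrooms, noise_feature])
model = Lasso(alpha=1000.0).fit(X, price)

# The L1 penalty zeroes out the uninformative coefficient,
# surfacing the features that matter.
kept = [name for name, coef in
        zip(["sqft", "bedrooms", "noise"], model.coef_)
        if abs(coef) > 1e-6]
```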

In terms of which model or function to use, an interesting consideration is whether interpretability is of value. We could most likely use a non-parametric model (e.g., Random Forest) and achieve higher accuracy; however, in this company's particular case, a client interested in transparency would presumably want to know what information was taken into account, and with what weights, to arrive at the final price. One must therefore navigate a trade-off between interpretability and accuracy.

5) How would you productionize this model?

The model I built is very slow at the moment for various reasons:

• kNN is all about saving data into memory. In this case I'm doing a computationally expensive procedure, since I'm copying a distinct dataset for every single home and repeatedly calculating distances under the time-leakage constraint (the constraint is actually beneficial to the overall runtime, since it reduces the row count N). My approach is currently brute force, which without the time-leakage constraint would be O(N²), which is fairly unreasonable even with one feature.

• A remedy to the first point would be to partition the feature space so that, for every new prediction, the amount of computation is reduced because a data point falls within a single partition. If we keep the dimensionality of the feature space low, a possibility would be to build a K-Dimensional tree to look up the nearest homes rather than computing all pairwise distances repeatedly.

• Another idea with the current dataset would be to apply a clustering technique that assigns geographic centroids and associates homes with them before computing distances.

• Finally, before production we would want to test the cross-validated model on out-of-sample data.
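To illustrate the K-Dimensional tree point above, `scipy.spatial.cKDTree` makes each neighbor lookup roughly O(log N) after a one-time O(N log N) build. This sketch assumes the tree is built only over homes that closed before the home being priced, so the time-leakage constraint is satisfied (the coordinates and prices are made-up values matching the sample data's shape):

```python
import numpy as np
from scipy.spatial import cKDTree

# Historical closings (all assumed to predate the home we are pricing,
# so the time-leakage constraint is already satisfied).
coords = np.array([[36.36, -98.66], [36.59, -97.92], [36.64, -97.87]])
prices = np.array([147_504.5, 137_400.6, 239_105.3])

tree = cKDTree(coords)                        # built once: O(N log N)
dist, idx = tree.query([36.60, -97.90], k=2)  # each lookup: ~O(log N)
estimate = prices[idx].mean()                 # uniform kNN average
```

In production the tree would be rebuilt (or incrementally maintained) as new closings arrive, rather than recomputing all pairwise distances per prediction.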