Given a synthetic dataset of coordinates, closing dates, and prices, the goal is to build a kNN model that predicts home closing prices and to evaluate its main trade-offs in terms of computational expense, scalability, and performance.
Here is a sample of the data:
    latitude  longitude                close_date   close_price
0   1.501986  86.350685  20140816 22:25:31.925431  1.302246e+06
1  36.367095  98.664280  20140805 06:34:00.165876  1.475045e+05
2  36.599284  97.924700  20140812 23:48:00.887510  1.374006e+05
3  67.994791  64.688589  20140817 05:27:01.404296  1.411200e+04
4  36.647982  97.866100  20140809 04:00:40.358242  2.391053e+05
Before jumping into the solution: a particular implementation constraint was to avoid time leakage, meaning that for a given home i, a home j should be considered a neighbor of i only if j's close date occurred prior to i's close date.
I framed the solution to this problem by building two Python classes.
a) Preprocessing class: includes four preprocessing methods to run prior to the prediction model in order to avoid time leakage.
b) KNearestNeighbors class: initializes with the number of neighbors k, the type of distance used to compare similarity between elements, and the preprocessing class. The class then runs the kNN algorithm to predict the closing price for a given home.
Here is the code for the predictive class:
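(The original implementation is not reproduced in this excerpt. Below is a minimal, hypothetical sketch of what such a predictive class could look like, assuming the column names from the data sample above, Euclidean distance on raw coordinates, and uniform neighbor weighting; it is an illustration, not the original code.)

```python
# Hypothetical sketch of the predictive class described above.
# Column names follow the data sample; everything else is illustrative.
import numpy as np
import pandas as pd

class KNearestNeighbors:
    def __init__(self, k=5):
        self.k = k

    def fit(self, df):
        # Store the training set sorted by close date so the time-leakage
        # filter in predict() is a simple boolean mask.
        self.df = df.sort_values("close_date").reset_index(drop=True)
        return self

    def predict(self, lat, lon, close_date):
        # Time-leakage constraint: only homes that closed strictly before
        # the target home's close date may be considered neighbors.
        prior = self.df[self.df["close_date"] < close_date]
        if prior.empty:
            return np.nan
        # Euclidean distance on raw coordinates (geographic distance is
        # the only feature, as in the original model).
        d = np.sqrt((prior["latitude"] - lat) ** 2 +
                    (prior["longitude"] - lon) ** 2)
        nearest = d.nsmallest(min(self.k, len(prior))).index
        # Uniform weighting: plain mean of the k nearest closing prices.
        return prior.loc[nearest, "close_price"].mean()
```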
Now some questions:
2) What is the performance of the model measured in Median Relative Absolute Error?
The data pipeline I built in the allotted timeframe is not optimized for computational efficiency (details in Q5). I truncated the computation after 1,000 predictions, which accounts for only 1.1% of the dataset.
The Median Relative Absolute Error for the current model on the truncated sample is 18.7%.
3) What would be an appropriate methodology to determine the optimal k?
Assuming the optimal k is the one that yields the lowest model error, I would cross-validate over a range of k values, plot the error for each, and select the value that generalizes with the lowest error.
Equally important, prior knowledge of this business/market could provide other heuristics to consider when cross-validating and selecting the optimal k.
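The procedure above can be sketched as follows. This is an illustration on synthetic data using scikit-learn; the real pipeline would also have to respect the time-leakage constraint when constructing the cross-validation splits.

```python
# Sketch: score a grid of k values by cross-validated mean absolute error
# and pick the k with the lowest error. Synthetic data, for illustration.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 2))         # latitude, longitude
y = X[:, 0] * 1000 + rng.normal(0, 50, 500)    # synthetic closing price

errors = {}
for k in [1, 3, 5, 10, 25, 50]:
    model = KNeighborsRegressor(n_neighbors=k)
    # cross_val_score maximizes, so the MAE scorer is negated; flip the sign.
    errors[k] = -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()

best_k = min(errors, key=errors.get)  # k with the lowest CV error
```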
4) How would you improve this model?
Note that the only feature I'm considering to compute similarity between homes is geographic distance. The model also currently assumes a uniform contribution per data point in the local neighborhood; the next step would be to cross-validate whether uniform weighting is indeed better than weighting each neighbor inversely to its distance.
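That comparison is straightforward to sketch with scikit-learn, whose KNeighborsRegressor exposes both schemes through its `weights` parameter (`"uniform"` vs. `"distance"`, the latter weighting neighbors by inverse distance). Synthetic data again, purely illustrative:

```python
# Sketch: cross-validate uniform vs. inverse-distance neighbor weighting.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(400, 2))          # latitude, longitude
y = X.sum(axis=1) * 500 + rng.normal(0, 100, 400)

results = {}
for weights in ["uniform", "distance"]:
    model = KNeighborsRegressor(n_neighbors=10, weights=weights)
    results[weights] = -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
# Keep whichever weighting generalizes with the lower error.
```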
With this limited dataset I would also introduce a temporal feature, with the hypothesis that the closer the closing dates of two homes, the more correlated their closing prices will be, since they share temporal market conditions. I would then check whether the hypothesis holds and, based on the temporal contribution to the housing price, adjust the weights to improve the model (again via cross-validation).
On the other hand, considering that the value of a home is determined by characteristics beyond location and timing, the most logical step would be to understand the key features that describe and correlate with a home's value. For example, we could do this via quantitative analysis (e.g., L1-regularized linear regression) as well as by gathering intangibles/heuristics from field agents.
In terms of which model or function to use, an interesting consideration is whether interpretability is of value. We could most likely use a nonparametric model (e.g., a random forest) and achieve higher accuracy; however, in this company's particular context a client is interested in transparency, and would presumably value knowing both the information taken into account and the weights used to arrive at the final price, so one must navigate a trade-off between interpretability and accuracy.
5) How would you productionize this model?
The model I built is very slow at the moment for various reasons:

kNN is all about saving data into memory. In this case I'm doing something computationally expensive: copying a distinct dataset for every single home and repeatedly recalculating distances under the time-leakage constraint (the constraint is actually beneficial to the overall running time, since it reduces the row count N). My current approach is brute force, which without the time-leakage constraint would be O(N^2), fairly unreasonable even with a single feature.

A first step would be to partition the feature space so that, for every new prediction, the amount of computation is reduced to the partition the data point falls in. If we keep the dimensionality of the feature space low, one possibility is to build a k-dimensional (k-d) tree to look up nearby homes instead of computing every pairwise distance repeatedly.
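The query mechanics can be sketched with `scipy.spatial.cKDTree`. Honoring the time-leakage constraint exactly would require extra machinery, e.g. rebuilding the tree on a rolling window of already-closed homes, so this shows only the neighbor lookup:

```python
# Sketch: replace the brute-force distance scan with a k-d tree.
# Build is O(N log N) once; each nearest-neighbor query is ~O(log N).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
coords = rng.uniform(0, 100, size=(10_000, 2))   # latitude, longitude
prices = rng.uniform(1e4, 1e6, 10_000)           # synthetic closing prices

tree = cKDTree(coords)                       # built once over the data
dist, idx = tree.query([50.0, 50.0], k=5)    # 5 nearest homes to a query point
prediction = prices[idx].mean()              # uniform-weight kNN estimate
```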

Another idea with the current dataset would be to apply a clustering technique that assigns geographic centroids and associates homes with them before computing distances.
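A sketch of that idea with k-means: pre-assign homes to geographic centroids, then restrict the neighbor search for a new home to the homes sharing its cluster. The cluster count here is arbitrary and illustrative.

```python
# Sketch: cluster homes by location so neighbor search is limited to
# the candidate homes in the query point's own cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
coords = rng.uniform(0, 100, size=(2_000, 2))    # latitude, longitude

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(coords)
labels = kmeans.labels_

# For a new home, only homes assigned to its centroid are distance candidates.
new_home = np.array([[10.0, 10.0]])
cluster = kmeans.predict(new_home)[0]
candidates = coords[labels == cluster]           # much smaller than N
```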

Finally, before production we would want to test the cross-validated model on out-of-sample data.