
Given a synthetic dataset of coordinates, closing dates and pricing, the goal is to code a k-NN model to predict home closing prices and evaluate its main trade-offs in terms of computational expense, scalability and performance.

Here is a sample of the data:

#Importing pandas to manipulate data.
import pandas as pd
#Load data to dataframe.
df = pd.read_csv('../data/data.csv')
#Display head of dataframe.
df.head(5)
latitude longitude close_date close_price
0 1.501986 86.350685 2014-08-16 22:25:31.925431 1.302246e+06
1 36.367095 -98.664280 2014-08-05 06:34:00.165876 1.475045e+05
2 36.599284 -97.924700 2014-08-12 23:48:00.887510 1.374006e+05
3 67.994791 64.688589 2014-08-17 05:27:01.404296 -1.411200e+04
4 36.647982 -97.866100 2014-08-09 04:00:40.358242 2.391053e+05


Before jumping into the solution – a particular implementation constraint was to avoid time leakage, meaning that for a given home i, a home j should be considered a neighbor to home i only if the close date of j occurred prior to the close date of i.
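
Concretely, once close_date is parsed as a datetime, the constraint reduces to a boolean mask. A minimal sketch, assuming df is the dataframe loaded above and i is the integer index of home i:

#Keep only candidate neighbors that closed before home i (illustrative sketch).
i = 0  #index of home i, purely illustrative
candidates = df[df.close_date < df.iloc[i].close_date]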

Solution:


I framed the solution to this problem by building two Python classes.

a) Preprocessing Class - Includes four preprocessing methods that run before the prediction model to avoid time leakage.

#Geodesic distance from geopy, aliased to the name used below (assumed import; the original post does not show it).
from geopy.distance import geodesic as geodistance

class Preprocessing(object):

    '''
    Preprocess data to avoid time-leakage.

    Parameters
    ----------
    data: path
        Path to file with data

    Output
    ------
    Returns feature matrix and house closing prices via Preprocessing.get_training_data().
    '''

    def __init__(self, path2data):

        # read dataset to pandas dataframe
        self.df = pd.read_csv(path2data)

        # Create a temporary dataframe for home i, used to avoid time leakage.
        self.df_home_i = None

        # Create temporary pointer to home_i
        self.home_i = None

    def _datatypes(self):
        '''
        Convert column datatypes to forms suitable for manipulation.
        '''
        #Convert to DatetimeIndex
        self.df.close_date = pd.DatetimeIndex(self.df.close_date)

        return self

    def _time_leakage(self, home_i):
        '''
        Filters the dataframe for a specified home i so that a home j
        is considered a neighbor of home i only if the close date of j
        occurred prior to the close date of i.

        Parameters
        ----------
        home_i: integer
            Index of a home in the dataframe.

        Output
        ------
        None (Dataframe for home_i was modified).

        '''
        #Index homes.
        self.home_i = home_i

        #Mask to avoid time leakage: keep only homes that closed before home i.
        self.df_home_i = self.df.loc[self.df.close_date < self.df.iloc[self.home_i].close_date].copy()

    def _distance(self):
        '''
        Compute distances between home_i and neighbors.

        Parameters
        ----------
        None (uses self.home_i set by _time_leakage).

        Output
        ------
        None (Dataframe for home_i was modified).

        '''
        #Index Home i
        home_i = self.df.iloc[self.home_i]

        #Get home_i coordinates
        home_i_coord = (home_i.latitude, home_i.longitude)

        #Create lat_lon feature for simple reference
        self.df_home_i['lat_lon'] = list(zip(self.df_home_i.latitude, self.df_home_i.longitude))

        #Apply distance function element-wise to get miles from home i.
        self.df_home_i['dist2home_i_miles'] = self.df_home_i.lat_lon.apply(lambda coord2: geodistance(home_i_coord, coord2).miles)

    def get_training_data(self, home_i):
        '''
        Parameters
        ----------
        home_i: integer
            Index number of home to predict value for.

        Output
        ------
        X: array-like
            Returns feature matrix.

        Y: array-like
            Returns home closing price.
        '''
        #Convert to DatetimeIndex
        self._datatypes()

        #Filter dataframe avoiding time leakage for home i.
        self._time_leakage(home_i)

        #Compute distances from home i to all remaining homes.
        self._distance()

        #Get closing price as y.
        y = self.df_home_i.pop('close_price').values

        #Get Distance as X.
        X = self.df_home_i.pop('dist2home_i_miles').values

        return X, y

b) KNearestNeighbors Class - Initializes with the number of neighbors k, the distance function used to compare the similarity of elements, and the Preprocessing class. The class then runs a kNN algorithm to predict the closing price for a given home.

Here is the code for the predictive class:

import numpy as np

class KNearestNeighbors(object):
    '''
    KNN regressor to calculate house pricing.

    Parameters
    ----------

    k: integer
        Number of k nearest neighbors.

    distance: function
        Function to calculate distance (not necessarily spatial).
    '''

    def __init__(self, k=4, distance=geodistance, weights=None):
        self.k = k
        self.distance = distance
        self.Preprocessing = None #Holds the Preprocessing instance passed to fit().
        self.predictions = [] #To store model predictions.
        self.weights = weights #If we were to make this model more complex, suitable weights would be selected based on cross-validated feature importance.
        self.model_performance = None #Attribute to store the MRAE.

    def fit(self, Preprocessing):
        '''
        Takes in the Preprocessing class to build the feature matrix (X) and closing prices (y), ensuring there is no time leakage.
        '''
        #Loads the data and preprocessing class.
        self.Preprocessing = Preprocessing

    def predict(self):
        '''
        Predict home price.
        '''
        #For each home, iterate to get X_train, y_train and predict from the k nearest neighbors with weights w.
        #Note the Preprocessing class handles calculating the distance.

        #Create y_pred, y_actual results.
        self.predictions = []

        #Limit predictions to the first 1000 homes.
        for home_i in self.Preprocessing.df.index.tolist()[:1000]:

            #Get training data
            X_train, y_train = self.Preprocessing.get_training_data(home_i)

            #Sort and get top k.
            top_k = y_train[X_train.argsort()[:self.k]]  #sort and take top k

            #Use weights to compute the house price.
            #In this case I use the mean, i.e. for k = 4, w = 0.25.
            home_i_pred = np.mean(np.absolute(top_k)) #About 4% of the dataset has negative close prices; I treat those as positive for business sense.

            #Get actual home price.
            home_i_actual = np.absolute(self.Preprocessing.df.iloc[home_i].close_price)

            #Append to list as tuple for later conversion to numpy array.
            self.predictions.append((home_i_pred, home_i_actual))

        #Truncate df to only the first 1000 predictions.
        self.df_truncated = self.Preprocessing.df.iloc[:1000,:].copy()

        #Insert Column with predictions
        self.df_truncated['pred'] = np.array(self.predictions)[:,0] #from above, note the position for predictions in the tuple is 0.

    def performance(self):
        '''
        Compute performance of model element wise and overall.
        '''
        #Compute RAE element-wise.
        self.df_truncated['RAE'] = np.absolute(np.absolute(self.df_truncated.close_price) - self.df_truncated.pred) / np.absolute(self.df_truncated.close_price)
        #Compute MRAE.
        self.model_performance = np.median(self.df_truncated.RAE)

        #Return Model Performance
        print('Model Median Relative Absolute Error: {0:.1%}'.format(self.model_performance))

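Putting the two classes together, a minimal usage sketch (assuming the same CSV path used earlier) would look like:

#Instantiate the preprocessing and model classes, then predict and evaluate.
preprocessing = Preprocessing('../data/data.csv')
knn = KNearestNeighbors(k=4)
knn.fit(preprocessing)
knn.predict()      #predicts the first 1000 homes
knn.performance()  #prints the Median Relative Absolute Error
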
Now some questions:

2) What is the performance of the model measured in Median Relative Absolute Error?

The data pipeline I built in the allotted timeframe is not optimized for computational efficiency (details in Q.5). I truncated the computations after 1000 predictions, which accounts for only 1.1% of the dataset.
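
For reference, with y_i the actual close price (taken in absolute value, as in the code) and \hat{y}_i the kNN prediction, performance() computes

RAE_i = \frac{\big|\,|y_i| - \hat{y}_i\,\big|}{|y_i|}, \qquad \mathrm{MRAE} = \operatorname{median}_i\big(RAE_i\big)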

The Median Relative Absolute Error for the current model on the truncated sample is:

  • 18.7%

3) What would be an appropriate methodology to determine the optimal k?

Assuming the optimal k is the one that yields the lowest model error, I would cross-validate over a range of k values, plot the error, and select the value that generalizes with the lowest error.
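
A sketch of that sweep, reusing the preprocessing object and classes above (in practice the truncated sample would be replaced by proper train/validation folds):

#Sweep candidate values of k and record the resulting MRAE (sketch).
errors = {}
for k in range(1, 21):
    knn = KNearestNeighbors(k=k)
    knn.fit(preprocessing)
    knn.predict()
    knn.performance()
    errors[k] = knn.model_performance

#Candidate k with the lowest error; a plot of error vs. k helps confirm it generalizes.
best_k = min(errors, key=errors.get)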

Equally important, prior knowledge on this business/market could provide other heuristics to consider when cross-validating and selecting the optimal k.

4) How would you improve this model?

Note the only feature I’m considering to compute similarity between homes is geographic distance. My model also currently assumes a uniform contribution per data point in the local neighborhood; the next step would be to cross-validate whether uniform weighting is indeed better than weighting each neighbor by (inverse) distance.
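
As a sketch of that alternative, a hypothetical helper that could replace the np.mean line inside predict() (X_train holds distances and y_train the closing prices, as returned by get_training_data):

import numpy as np

def weighted_prediction(X_train, y_train, k):
    '''Predict a price from the k nearest homes, weighting by inverse distance (sketch).'''
    idx = X_train.argsort()[:k]            #indices of the k nearest homes
    weights = 1.0 / (X_train[idx] + 1e-6)  #closer homes contribute more; epsilon avoids division by zero
    weights /= weights.sum()               #normalize weights to sum to 1
    return np.dot(weights, np.absolute(y_train[idx]))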

With this limited dataset I would also introduce a temporal feature, with the hypothesis that the closer the closing dates of two homes, the more correlated their closing prices, since they share temporal market conditions. I would then test whether the hypothesis holds and, based on the temporal contribution to the housing price, adjust the weights to improve the model (again via cross-validation).
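
A sketch of that temporal feature, written as one extra line inside Preprocessing._distance (the column name days_before_close_i is hypothetical):

#Days between each candidate home's closing and home i's closing (sketch).
#Would sit alongside dist2home_i_miles inside Preprocessing._distance.
self.df_home_i['days_before_close_i'] = (
    self.df.iloc[self.home_i].close_date - self.df_home_i.close_date
).dt.days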

On the other hand, considering the value of a home is determined by characteristics other than location and timing, the most logical step would be to understand the key features that describe and correlate with the value of a home. For example, we could do this via quantitative analysis (e.g. an L1-regularized linear regression) as well as by gathering intangibles/heuristics from field agents.
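
For the quantitative side, a sketch with scikit-learn's LassoCV; the feature columns are placeholders for attributes (square footage, bedrooms, etc.) that this dataset does not contain:

#Sketch: L1-regularized regression to surface the features that matter.
from sklearn.linear_model import LassoCV

feature_cols = ['latitude', 'longitude']  #placeholder; would be extended with richer home features
X = df[feature_cols].values
y = df.close_price.abs().values

lasso = LassoCV(cv=5).fit(X, y)
print(dict(zip(feature_cols, lasso.coef_)))  #near-zero coefficients flag uninformative features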

In terms of which model or function to use, an interesting consideration is whether interpretability is of value. We could most likely use a non-parametric model (e.g. Random Forest) and achieve higher accuracy; however, in this company’s particular case a client is interested in transparency and would presumably value knowing which information was taken into account, and how it was weighted, to arrive at the final price, so one must navigate a trade-off between interpretability and accuracy.

5) How would you productionize this model?

The model I built is very slow at the moment for various reasons:

  • kNN keeps the entire training set in memory. In this case the procedure is computationally expensive because I copy a distinct dataframe for every single home and repeatedly recalculate distances under the time-leakage constraint (the constraint is actually beneficial to the running time, since it reduces the row count N). My current approach is brute force, which without the time-leakage constraint would be O[N^2], which is unreasonable even with a single feature.

  • A first step would be to partition the feature space so that, for every new prediction, the number of computations is reduced to the partition the query point falls into. If the dimensionality of the feature space stays low, one possibility is to build a k-dimensional tree (or ball tree) to query distances between homes instead of recomputing them repeatedly; see the sketch after this list.

  • Another idea with the current dataset would be to implement a clustering technique to assign geographical centroids and associate homes before computing distances.

  • Finally, before production we would want to test the cross-validated model with out-of-sample data.
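
As an illustration of the tree-based idea from the second bullet, scikit-learn's BallTree supports the haversine metric on latitude/longitude directly (a sketch: coordinates must be in radians, returned distances are scaled by the Earth's radius, and the time-leakage filter would still have to be applied to the returned candidates):

#Sketch: query nearest homes with a BallTree instead of brute-force distances.
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8

coords_rad = np.radians(df[['latitude', 'longitude']].values)
tree = BallTree(coords_rad, metric='haversine')

#Distances (in miles) to the 5 nearest homes of the first home, including itself at distance 0.
dist, idx = tree.query(coords_rad[:1], k=5)
dist_miles = dist * EARTH_RADIUS_MILES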

Pablo Felgueres