Sklearn Linear Regression Tutorial with Boston House Dataset
The Boston Housing dataset contains information about various houses in Boston, described by different parameters. The data was originally part of the UCI Machine Learning Repository and has since been removed from it.
There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices using the given features. The dataset itself is available at this link. However, here we will import it from scikit-learn itself (note that the load_boston loader was removed in scikit-learn 1.2, so this requires an older version).
Let’s start by importing some libraries.
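A typical set of imports for this kind of tutorial (numerical arrays, dataframes, and plotting; the exact set is an assumption based on the libraries used later in the article):

```python
import numpy as np               # numerical arrays and reshaping
import pandas as pd              # dataframes for EDA
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # styled plots and heatmaps
```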
Now we need to import the load_boston loader from sklearn.datasets.
load_boston returns a Bunch, a dictionary-like object. The interesting attributes are: ‘data’, the data to learn; ‘target’, the regression targets; ‘DESCR’, the full description of the dataset; and ‘filename’, the physical location of the boston CSV dataset. We can verify this with the following operations.
There are 4 keys in the bunch [‘data’, ‘target’, ‘feature_names’, ‘DESCR’] as mentioned above. The data has 506 rows and 13 feature variables. Notice that this doesn’t include the target variable. The names of the columns are also extracted. Details about the features and more information about the dataset can be seen by using boston.DESCR.
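A sketch of the loading step. Because load_boston was removed in scikit-learn 1.2, this example falls back to the original CMU source that scikit-learn’s removal notice points to (network access is assumed for the fallback):

```python
import numpy as np
import pandas as pd

try:
    # Available only in scikit-learn < 1.2; the loader was removed later.
    from sklearn.datasets import load_boston
    boston = load_boston()
    print(boston.keys())  # includes 'data', 'target', 'feature_names', 'DESCR'
    data, target = boston.data, boston.target
except ImportError:
    # Fallback: the original source referenced by scikit-learn's removal notice.
    # Each record spans two lines in the raw file, hence the slicing below.
    raw = pd.read_csv("http://lib.stat.cmu.edu/datasets/boston",
                      sep=r"\s+", skiprows=22, header=None)
    data = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
    target = raw.values[1::2, 2]

print(data.shape)    # 506 rows, 13 feature variables
print(target.shape)  # 506 regression targets
```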
Before applying any EDA or model we have to convert this to a pandas DataFrame, which we can do by calling pd.DataFrame on boston.data. We also add the target variable to the DataFrame from boston.target.
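A minimal sketch of that conversion. Synthetic stand-in arrays with the dataset’s shapes are used here so the snippet runs standalone; in the tutorial, data, target, and feature_names come from the boston bunch:

```python
import numpy as np
import pandas as pd

# Stand-ins with the dataset's shapes; in the tutorial these come from
# boston.data, boston.feature_names and boston.target.
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
data = np.random.rand(506, 13)
target = np.random.rand(506)

bos = pd.DataFrame(data, columns=feature_names)  # features only: 506 x 13
bos["PRICE"] = target                            # append the target as a new column

print(bos.shape)  # 506 rows, 14 columns (13 features + PRICE)
```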
Data preprocessing
After loading the data, it’s a good practice to see if there are any missing values in the data. We count the number of missing values for each feature using .isnull()
bos.isnull().sum()
As mentioned in the dataset description, there are no null values in the dataset, and here we can confirm the same.
print(bos.describe())
Exploratory Data Analysis
Exploratory Data Analysis is a very important step before training the model. Here, we will use visualizations to understand the relationship of the target variable with other features.
Let’s first plot the distribution of the target variable. We will use the histogram plot function from the matplotlib library.
sns.set(rc={'figure.figsize':(11.7,8.27)})
plt.hist(bos['PRICE'], bins=30)
plt.xlabel("House prices in $1000")
plt.show()
We can see from the plot that the values of PRICE are distributed normally with a few outliers. Most of the houses are in the 20–24 range (on the $1000 scale).
Now, we create a correlation matrix that measures the linear relationships between the variables. The correlation matrix can be computed with the corr method of the pandas DataFrame. We will use the heatmap function from the seaborn library to plot it.
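A sketch of the heatmap step. A small synthetic stand-in for the bos DataFrame (three of its columns) is used so the snippet runs standalone:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the `bos` dataframe built earlier.
rng = np.random.default_rng(0)
bos = pd.DataFrame(rng.random((506, 3)), columns=["RM", "LSTAT", "PRICE"])

correlation_matrix = bos.corr().round(2)
# annot=True writes the coefficient value into each cell of the heatmap
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()
```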
The correlation coefficient ranges from -1 to 1. If the value is close to 1, it means that there is a strong positive correlation between the two variables. When it is close to -1, the variables have a strong negative correlation.
Notice
- By looking at the correlation matrix we can see that RM has a strong positive correlation with PRICE (0.7), whereas LSTAT has a high negative correlation with PRICE (-0.74).
- An important point in selecting features for a linear regression model is to check for multicollinearity. The features RAD and TAX have a correlation of 0.91. Such strongly correlated feature pairs can affect the model. The same goes for the features DIS and AGE, which have a correlation of -0.75.
But for now we will keep all the features.
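Scatter plots of the two most correlated features against the target help verify these relationships visually. A minimal sketch, again with a synthetic stand-in for bos so it runs standalone:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for the `bos` dataframe built earlier.
rng = np.random.default_rng(0)
bos = pd.DataFrame(rng.random((506, 3)), columns=["RM", "LSTAT", "PRICE"])

plt.figure(figsize=(20, 5))
for i, col in enumerate(["LSTAT", "RM"]):
    plt.subplot(1, 2, i + 1)          # one panel per feature
    plt.scatter(bos[col], bos["PRICE"])
    plt.xlabel(col)
    plt.ylabel("PRICE")
plt.show()
```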
Notice
- The prices increase linearly as the value of RM increases. There are a few outliers and the data seems to be capped at 50.
- The prices tend to decrease with an increase in LSTAT, though the relationship doesn’t look exactly linear.
Since it is really hard to visualize with multiple features, we will first predict the house price with just one variable and then move to regression with all features.
Since ‘RM’ shows a strong positive correlation with the house prices, we will use this variable.
X_rooms = bos.RM
y_price = bos.PRICE
X_rooms = np.array(X_rooms).reshape(-1,1)
y_price = np.array(y_price).reshape(-1,1)
print(X_rooms.shape)
print(y_price.shape)
Both arrays now have the dimensions [506, 1].
Splitting the data into training and testing sets
Since we need to test our model, we split the data into training and testing sets. We train the model with 80% of the samples and test with the remaining 20%. We do this to assess the model’s performance on unseen data.
To split the data we use the train_test_split function provided by the scikit-learn library. Finally, we print the shapes of our training and test sets to verify that the split occurred properly.
The training data has the shape [404, 1] and the test data [102, 1].
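A sketch of the split. Synthetic stand-ins with the same shapes as X_rooms and y_price are used so the snippet runs standalone; random_state is an arbitrary choice for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins shaped like X_rooms / y_price from the tutorial (506 x 1 each).
X_rooms = np.random.rand(506, 1)
y_price = np.random.rand(506, 1)

# 80% train / 20% test split
X_train, X_test, y_train, y_test = train_test_split(
    X_rooms, y_price, test_size=0.2, random_state=5)

print(X_train.shape)  # (404, 1)
print(X_test.shape)   # (102, 1)
```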
Training and testing the model
Here we use scikit-learn’s LinearRegression to train our model on the training set, evaluate it on the test set, and check the model’s performance on the training data.
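A sketch of the fit-and-evaluate step. The data here is a synthetic stand-in with a built-in linear signal (the slope, intercept, and noise level are arbitrary) so the snippet runs standalone; RMSE and R² on the training set are common choices for regression metrics:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-ins shaped like X_rooms / y_price, with a linear signal
# so the fit is meaningful (coefficients chosen arbitrarily).
rng = np.random.default_rng(0)
X_rooms = rng.uniform(4, 9, size=(506, 1))
y_price = 9 * X_rooms - 30 + rng.normal(0, 2, size=(506, 1))

X_train, X_test, y_train, y_test = train_test_split(
    X_rooms, y_price, test_size=0.2, random_state=5)

reg_1 = LinearRegression()
reg_1.fit(X_train, y_train)

# Performance on the training data
y_train_pred = reg_1.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2 = r2_score(y_train, y_train_pred)
print("RMSE:", rmse)
print("R^2:", r2)
```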
Plotting the model fitted line on the output variable.
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1,1)
plt.scatter(X_rooms,y_price)
plt.plot(prediction_space, reg_1.predict(prediction_space), color = 'black', linewidth = 3)
plt.ylabel('value of house/1000($)')
plt.xlabel('number of rooms')
plt.show()
Regression Model for All the variables
Now we will create a model considering all the features in the dataset. The process and the model evaluation are almost the same, but in this case visualization is not possible in a 2D space.
The steps are otherwise exactly the same.
We can see how well our model predicts by plotting a scatter plot of the original house prices against the predicted house prices.
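The full-feature model and the actual-vs-predicted scatter can be sketched as follows. A synthetic stand-in for bos (13 features plus a linearly generated PRICE, with arbitrary coefficients) keeps the snippet standalone:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `bos`: 506 rows, 13 features, linear target.
rng = np.random.default_rng(0)
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
X_all = pd.DataFrame(rng.random((506, 13)), columns=feature_names)
y_all = X_all @ rng.normal(size=13) + rng.normal(0, 0.1, size=506)

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=5)

reg_all = LinearRegression().fit(X_train, y_train)
y_test_pred = reg_all.predict(X_test)

# A perfect model's points would lie on the diagonal y = x.
plt.scatter(y_test, y_test_pred)
plt.xlabel("Actual prices")
plt.ylabel("Predicted prices")
plt.title("Actual vs. predicted house prices")
plt.show()
```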
This is all I have for this read. Next time I will cover how to fit a polynomial rather than a linear model to the data in order to achieve better performance. You can get all the code from my GitHub repository at this link. Till then, Happy Learning..!!