Week 2 - ML Zoom Camp
Car Price Prediction Project
This week we learn our first ML algorithm, linear regression, through a machine learning project. The dataset can be downloaded from here. It has many features such as Make, Year, and City MPG, and the target variable is MSRP (the car's price).
Project Plan
- Prepare data and do EDA
- Use linear regression for predicting price
- Understanding internals of linear regression (Vector form) and implementing it
- Evaluating model with RMSE
- Feature Engineering
- Regularization
Data Preparation
import numpy as np
import pandas as pd
# Uncomment below line to download dataset
# !wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
df = pd.read_csv('data.csv')
df.head() # looking first few rows to understand data
Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |
Observations from .head()
- Column names are separated either by whitespace or by an underscore
- Column names mix lowercase and uppercase letters
- String values in the data also mix lowercase and uppercase letters
The column names and string values should be cleaned up and brought to one consistent format
df.columns # display column names in the data
Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
'highway MPG', 'city mpg', 'Popularity', 'MSRP'],
dtype='object')
# data cleaning on column names
df.columns = df.columns.str.lower().str.replace(" ","_")
# convert the column names to lowercase and replace whitespace with '_'
df.columns
Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
'engine_cylinders', 'transmission_type', 'driven_wheels',
'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
'highway_mpg', 'city_mpg', 'popularity', 'msrp'],
dtype='object')
df.dtypes # display data types of each column
make object
model object
year int64
engine_fuel_type object
engine_hp float64
engine_cylinders float64
transmission_type object
driven_wheels object
number_of_doors float64
market_category object
vehicle_size object
vehicle_style object
highway_mpg int64
city_mpg int64
popularity int64
msrp int64
dtype: object
strings = list(df.dtypes[df.dtypes == 'object'].index)
strings
['make',
'model',
'engine_fuel_type',
'transmission_type',
'driven_wheels',
'market_category',
'vehicle_size',
'vehicle_style']
- df.dtypes == 'object' returns True for each column whose dtype is object
- df.dtypes[df.dtypes == 'object'] keeps only the columns with object dtype
- df.dtypes[df.dtypes == 'object'].index returns those column names as an Index
- list(df.dtypes[df.dtypes == 'object'].index) converts that Index into a plain list
for column in strings:
df[column] = df[column].str.lower().str.replace(" ","_")
# exactly how we did for column names
df.select_dtypes("object").head() # display only object datatypes
make | model | engine_fuel_type | transmission_type | driven_wheels | market_category | vehicle_size | vehicle_style | |
---|---|---|---|---|---|---|---|---|
0 | bmw | 1_series_m | premium_unleaded_(required) | manual | rear_wheel_drive | factory_tuner,luxury,high-performance | compact | coupe |
1 | bmw | 1_series | premium_unleaded_(required) | manual | rear_wheel_drive | luxury,performance | compact | convertible |
2 | bmw | 1_series | premium_unleaded_(required) | manual | rear_wheel_drive | luxury,high-performance | compact | coupe |
3 | bmw | 1_series | premium_unleaded_(required) | manual | rear_wheel_drive | luxury,performance | compact | coupe |
4 | bmw | 1_series | premium_unleaded_(required) | manual | rear_wheel_drive | luxury | compact | convertible |
Exploratory Data Analysis (EDA)
for column in df.columns:
print(column) # display column name
print('-'*10)
print(df[column].unique()[:5]) # display first 5 unique values of each column
print(df[column].nunique())
print()
make
----------
['bmw' 'audi' 'fiat' 'mercedes-benz' 'chrysler']
48
model
----------
['1_series_m' '1_series' '100' '124_spider' '190-class']
914
year
----------
[2011 2012 2013 1992 1993]
28
engine_fuel_type
----------
['premium_unleaded_(required)' 'regular_unleaded'
'premium_unleaded_(recommended)' 'flex-fuel_(unleaded/e85)' 'diesel']
10
engine_hp
----------
[335. 300. 230. 320. 172.]
356
engine_cylinders
----------
[ 6. 4. 5. 8. 12.]
9
transmission_type
----------
['manual' 'automatic' 'automated_manual' 'direct_drive' 'unknown']
5
driven_wheels
----------
['rear_wheel_drive' 'front_wheel_drive' 'all_wheel_drive'
'four_wheel_drive']
4
number_of_doors
----------
[ 2. 4. 3. nan]
3
market_category
----------
['factory_tuner,luxury,high-performance' 'luxury,performance'
'luxury,high-performance' 'luxury' 'performance']
71
vehicle_size
----------
['compact' 'midsize' 'large']
3
vehicle_style
----------
['coupe' 'convertible' 'sedan' 'wagon' '4dr_hatchback']
16
highway_mpg
----------
[26 28 27 25 24]
59
city_mpg
----------
[19 20 18 17 16]
69
popularity
----------
[3916 3105 819 617 1013]
48
msrp
----------
[46135 40650 36350 29450 34500]
6049
df.describe() # by default describe shows for numeric values
year | engine_hp | engine_cylinders | number_of_doors | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|
count | 11914.000000 | 11845.00000 | 11884.000000 | 11908.000000 | 11914.000000 | 11914.000000 | 11914.000000 | 1.191400e+04 |
mean | 2010.384338 | 249.38607 | 5.628829 | 3.436093 | 26.637485 | 19.733255 | 1554.911197 | 4.059474e+04 |
std | 7.579740 | 109.19187 | 1.780559 | 0.881315 | 8.863001 | 8.987798 | 1441.855347 | 6.010910e+04 |
min | 1990.000000 | 55.00000 | 0.000000 | 2.000000 | 12.000000 | 7.000000 | 2.000000 | 2.000000e+03 |
25% | 2007.000000 | 170.00000 | 4.000000 | 2.000000 | 22.000000 | 16.000000 | 549.000000 | 2.100000e+04 |
50% | 2015.000000 | 227.00000 | 6.000000 | 4.000000 | 26.000000 | 18.000000 | 1385.000000 | 2.999500e+04 |
75% | 2016.000000 | 300.00000 | 6.000000 | 4.000000 | 30.000000 | 22.000000 | 2009.000000 | 4.223125e+04 |
max | 2017.000000 | 1001.00000 | 16.000000 | 4.000000 | 354.000000 | 137.000000 | 5657.000000 | 2.065902e+06 |
df.describe(include=[object]) # show statistics for categorical values
make | model | engine_fuel_type | transmission_type | driven_wheels | market_category | vehicle_size | vehicle_style | |
---|---|---|---|---|---|---|---|---|
count | 11914 | 11914 | 11911 | 11914 | 11914 | 8172 | 11914 | 11914 |
unique | 48 | 914 | 10 | 5 | 4 | 71 | 3 | 16 |
top | chevrolet | silverado_1500 | regular_unleaded | automatic | front_wheel_drive | crossover | compact | sedan |
freq | 1123 | 156 | 7172 | 8266 | 4787 | 1110 | 4764 | 3048 |
Describing Price
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = [12,10]
%matplotlib inline
sns.histplot(df.msrp, bins=10);
Based on the figure above, the price (msrp) distribution has a long tail: a few cars have very high prices, and these extreme values would confuse our model.
sns.histplot(df.msrp[df.msrp < 100000], bins=50);
Logarithmic function on values
To avoid this, we apply the logarithmic function to the msrp values so that even very large values are compressed into a small range. Check the example below:
np.log([1,10,1000,10000,100000])
array([ 0. , 2.30258509, 6.90775528, 9.21034037, 11.51292546])
# but if we apply log to zero, the result is negative infinity
np.log(0)
/tmp/ipykernel_19857/1310345647.py:2: RuntimeWarning: divide by zero encountered in log
np.log(0)
-inf
# to avoid that, we add 1 before taking the log: np.log(0 + 1) = np.log(1) = 0
# numpy has a built-in function, np.log1p, which adds the 1 automatically
np.log1p(0)
0.0
# Coming back to our dataset: it has no zero msrp values, but we use log1p anyway
price_logs = np.log1p(df.msrp)
price_logs
0 10.739349
1 10.612779
2 10.500977
3 10.290483
4 10.448744
...
11909 10.739024
11910 10.945018
11911 10.832122
11912 10.838031
11913 10.274913
Name: msrp, Length: 11914, dtype: float64
# plotting a histogram of price_logs
sns.histplot(price_logs, bins=50);
Missing values
df.isnull().sum().sort_values(ascending=False)
market_category 3742
engine_hp 69
engine_cylinders 30
number_of_doors 6
engine_fuel_type 3
make 0
model 0
year 0
transmission_type 0
driven_wheels 0
vehicle_size 0
vehicle_style 0
highway_mpg 0
city_mpg 0
popularity 0
msrp 0
dtype: int64
Setting validation framework
# we are splitting data into 60% for training, 20% for validation and remaining 20% for testing
n = len(df)
n_val = int(0.2 * n) #20% validation
n_test = int(0.2 * n) #20% testing
n_train = n - (n_test + n_val)
n_train, n_test, n_val
(7150, 2382, 2382)
# Separating dataframe based on above ratio
df_train = df.iloc[:n_train]
df_val = df.iloc[n_train:n_val+n_train] # rows 7150 to 9531 (2382 rows)
df_test = df.iloc[n_val+n_train:] # rows 9532 to 11913 (2382 rows)
df_test
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9532 | chevrolet | silverado_1500 | 2015 | regular_unleaded | 355.0 | 8.0 | automatic | rear_wheel_drive | 4.0 | NaN | large | crew_cab_pickup | 23 | 16 | 1385 | 47575 |
9533 | chevrolet | silverado_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | rear_wheel_drive | 4.0 | flex_fuel | large | extended_cab_pickup | 24 | 18 | 1385 | 31940 |
9534 | chevrolet | silverado_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | rear_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 24 | 18 | 1385 | 38335 |
9535 | chevrolet | silverado_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | rear_wheel_drive | 4.0 | flex_fuel | large | extended_cab_pickup | 24 | 18 | 1385 | 35870 |
9536 | chevrolet | silverado_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | rear_wheel_drive | 2.0 | flex_fuel | large | regular_cab_pickup | 24 | 18 | 1385 | 28155 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11909 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 46120 |
11910 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 56670 |
11911 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 50620 |
11912 | acura | zdx | 2013 | premium_unleaded_(recommended) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 50920 |
11913 | lincoln | zephyr | 2006 | regular_unleaded | 221.0 | 6.0 | automatic | front_wheel_drive | 4.0 | luxury | midsize | sedan | 26 | 17 | 61 | 28995 |
2382 rows × 16 columns
But there is a problem with splitting this way: the split is sequential and the original data is ordered (for example, all the BMWs come first), so each part would cover a different slice of the data. We therefore shuffle the data before splitting, so the model learns general patterns instead of memorizing a particular ordering, which would lead to overfitting.
idx = np.arange(n) # capturing the indexes of dataframe
idx
array([ 0, 1, 2, ..., 11911, 11912, 11913])
np.random.seed(2)
np.random.shuffle(idx)
idx # shuffled randomly
array([2735, 6720, 5878, ..., 6637, 2575, 7336])
idx[:n_train] # first n_train shuffled indices; the remaining ones are used for val and test
array([2735, 6720, 5878, ..., 9334, 5284, 2420])
df_train = df.iloc[idx[:n_train]]
df_val = df.iloc[idx[n_train:n_val+n_train]] # rows at the next 2382 shuffled indices
df_test = df.iloc[idx[n_val+n_train:]] # rows at the last 2382 shuffled indices
df_train.head()
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2735 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 | 14410 |
6720 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 | 19685 |
5878 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 | 19795 |
11190 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 | 2000 |
4554 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 | 56260 |
# we have to convert our y values as seen before
np.log1p([1,10,1000,10000000])
array([ 0.69314718, 2.39789527, 6.90875478, 16.11809575])
y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)
y_test = np.log1p(df_test.msrp.values)
# after separating target values, deleting target variables from train, val, test dataframe
del df_train['msrp']
del df_test['msrp']
del df_val['msrp']
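The cells above can also be wrapped into a single helper so the whole split is reproducible. This is just a sketch of what we did by hand; the name split_dataset and the seed argument are my own additions, not part of the course code.

def split_dataset(df, val_frac=0.2, test_frac=0.2, seed=2):
    """Shuffle the dataframe and split it into train/val/test frames plus log-price targets."""
    n = len(df)
    n_val = int(val_frac * n)
    n_test = int(test_frac * n)
    n_train = n - n_val - n_test

    idx = np.arange(n)
    np.random.seed(seed)      # fix the seed so the split is reproducible
    np.random.shuffle(idx)

    df_train = df.iloc[idx[:n_train]].copy()
    df_val = df.iloc[idx[n_train:n_train + n_val]].copy()
    df_test = df.iloc[idx[n_train + n_val:]].copy()

    y_train = np.log1p(df_train.msrp.values)
    y_val = np.log1p(df_val.msrp.values)
    y_test = np.log1p(df_test.msrp.values)

    for part in (df_train, df_val, df_test):
        del part['msrp']      # keep the target only in the y arrays

    return df_train, df_val, df_test, y_train, y_val, y_test

# df_train, df_val, df_test, y_train, y_val, y_test = split_dataset(df)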
Linear Regression
Example
df_train.iloc[10]
make rolls-royce
model phantom_drophead_coupe
year 2015
engine_fuel_type premium_unleaded_(required)
engine_hp 453.0
engine_cylinders 12.0
transmission_type automatic
driven_wheels rear_wheel_drive
number_of_doors 2.0
market_category exotic,luxury,performance
vehicle_size large
vehicle_style convertible
highway_mpg 19
city_mpg 11
popularity 86
Name: 7557, dtype: object
xi = [453, 11, 86] # feature values (engine_hp, city_mpg, popularity) of the car above
w0 = 5.17 # bias term, picked arbitrarily for this example
w = [.01, .04, .002] # example weights, one per feature
def linear_regression(xi):
"""Simple Linear Regression"""
n_loops = len(xi)
pred = w0 #initial weight for bias
for j in range(n_loops):
pred += w[j] * xi[j]
return pred
print(linear_regression(xi))
# Note that the prediction is small because the model works with log-transformed prices
10.312
# To get the actual price, we undo the log transform with the exponential (np.expm1)
np.expm1(10.312) # compare this predicted price with 'y' value
30090.55961642849
np.log1p(30090.55961642849)
10.312
Linear Regression in Vector Form
def dot_mat(xi,w):
"""Dot Product of xi vector"""
n_runs = len(xi)
result = 0.0
for run in range(n_runs):
result += xi[run] * w[run]
return result
[w0] + w # w0 = 5.17
[5.17, 0.01, 0.04, 0.002]
[1] + xi # xi0 =1
[1, 453, 11, 86]
def lr_form(xi):
w_new = [w0] + w
xi = [1] + xi
return dot_mat(xi,w_new)
print(lr_form(xi))
10.312
Example - With Multiple features
w0 = 5.17
w_new = [w0] + w
w_new
[5.17, 0.01, 0.04, 0.002]
# three example cars from our data; the leading 1 is the bias feature, followed by engine_hp, city_mpg, popularity
x1 = [1, 148, 24, 1385]
x2 = [1, 132, 25, 2031]
x10 = [1, 453, 11, 86]
X = [x1, x2, x10]
X = np.array(X)
X
array([[ 1, 148, 24, 1385],
[ 1, 132, 25, 2031],
[ 1, 453, 11, 86]])
def lr_multiple_features(xi,w):
return xi.dot(w)
print(lr_multiple_features(xi=X,w=w_new))
[10.38 11.552 10.312]
Training Linear Regression
# Multiple features
X = [
[148, 24, 1385],
[132, 25, 2031],
[453, 11, 86],
[158, 24, 185],
[172, 25, 201],
[413, 11, 86],
[38, 54, 185],
[142, 25, 431],
[453, 31, 86],
]
X = np.array(X)
X
array([[ 148, 24, 1385],
[ 132, 25, 2031],
[ 453, 11, 86],
[ 158, 24, 185],
[ 172, 25, 201],
[ 413, 11, 86],
[ 38, 54, 185],
[ 142, 25, 431],
[ 453, 31, 86]])
np.set_printoptions(suppress=True) # print plain numbers instead of scientific notation like 1e+07
# creating a column of ones as the bias term for X; the bias captures the baseline price of a car
ones = np.ones(X.shape[0])
ones
array([1., 1., 1., 1., 1., 1., 1., 1., 1.])
# prepend the column of ones to X so that it acts as the bias feature
X = np.column_stack([ones,X])
X
array([[ 1., 148., 24., 1385.],
[ 1., 132., 25., 2031.],
[ 1., 453., 11., 86.],
[ 1., 158., 24., 185.],
[ 1., 172., 25., 201.],
[ 1., 413., 11., 86.],
[ 1., 38., 54., 185.],
[ 1., 142., 25., 431.],
[ 1., 453., 31., 86.]])
# y values
y = [10000, 20000, 15000, 20050, 10000, 20000, 15000, 25000, 12000]
y
[10000, 20000, 15000, 20050, 10000, 20000, 15000, 25000, 12000]
XTX = X.T.dot(X)
XTX
array([[ 9., 2109., 230., 4676.],
[ 2109., 696471., 44115., 718540.],
[ 230., 44115., 7146., 118803.],
[ 4676., 718540., 118803., 6359986.]])
XTX_inv = np.linalg.inv(XTX)
XTX_inv
array([[ 3.30686958, -0.00539612, -0.06213256, -0.00066102],
[-0.00539612, 0.00001116, 0.0000867 , 0.00000109],
[-0.06213256, 0.0000867 , 0.00146189, 0.00000858],
[-0.00066102, 0.00000109, 0.00000858, 0.00000036]])
X.T
array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 148., 132., 453., 158., 172., 413., 38., 142., 453.],
[ 24., 25., 11., 24., 25., 11., 54., 25., 31.],
[1385., 2031., 86., 185., 201., 86., 185., 431., 86.]])
# putting together everything to calculate W
w_full = XTX_inv.dot(X.T).dot(y)
w_full
array([25844.75405577, -16.08906468, -199.47254894, -1.22802883])
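What we just computed is the normal equation of least squares, w = (X^T X)^(-1) X^T y. Because X contains the column of ones, the first entry of w_full is the bias w0 and the remaining entries are the feature weights.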
# Separating w0 and other weights (w)
w0 = w_full[0]
w = w_full[1:]
w0, w
(25844.75405576679, array([ -16.08906468, -199.47254894, -1.22802883]))
X
array([[ 1., 148., 24., 1385.],
[ 1., 132., 25., 2031.],
[ 1., 453., 11., 86.],
[ 1., 158., 24., 185.],
[ 1., 172., 25., 201.],
[ 1., 413., 11., 86.],
[ 1., 38., 54., 185.],
[ 1., 142., 25., 431.],
[ 1., 453., 31., 86.]])
Putting together everything
X = [
[148, 24, 1385],
[132, 25, 2031],
[453, 11, 86],
[158, 24, 185],
[172, 25, 201],
[413, 11, 86],
[38, 54, 185],
[142, 25, 431],
[453, 31, 86],
]
X = np.array(X)
X
array([[ 148, 24, 1385],
[ 132, 25, 2031],
[ 453, 11, 86],
[ 158, 24, 185],
[ 172, 25, 201],
[ 413, 11, 86],
[ 38, 54, 185],
[ 142, 25, 431],
[ 453, 31, 86]])
y
[10000, 20000, 15000, 20050, 10000, 20000, 15000, 25000, 12000]
# putting everything inside a function - DRY (Don't Repeat Yourself)
def linear_regression(X,y):
"""Calculate linear regression"""
ones = np.ones(X.shape[0])
X = np.column_stack([ones,X])
XTX = X.T.dot(X)
XTX_inv = np.linalg.inv(XTX)
w_full = XTX_inv.dot(X.T).dot(y)
return w_full[0], w_full[1:]
print(linear_regression(X,y))
(25844.75405576679, array([ -16.08906468, -199.47254894, -1.22802883]))
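As a quick sanity check (my own addition, not from the course notebook), the same coefficients can be obtained with NumPy's built-in least-squares solver, which avoids forming the inverse explicitly:

# sketch: verify the normal-equation solution with np.linalg.lstsq
ones = np.ones(X.shape[0])
X_b = np.column_stack([ones, X])   # add the bias column, just like inside linear_regression
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
w_lstsq                            # first entry is the bias; should closely match linear_regression(X, y)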
Car price baseline model
df_train.columns
Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
'engine_cylinders', 'transmission_type', 'driven_wheels',
'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
'highway_mpg', 'city_mpg', 'popularity'],
dtype='object')
# selecting few features to create baseline model
base = ['engine_hp', 'engine_cylinders', 'highway_mpg',
'city_mpg', 'popularity']
# before creating baseline model, check for missing values
df_train[base].isnull().sum().sort_values(ascending=False)
engine_hp 40
engine_cylinders 14
highway_mpg 0
city_mpg 0
popularity 0
dtype: int64
# fill missing values with 0 for simplicity (a zero feature contributes nothing to the prediction)
X_train = df_train[base].fillna(0).values # .values convert dataframe to numpy array
X_train
array([[ 148., 4., 33., 24., 1385.],
[ 132., 4., 32., 25., 2031.],
[ 148., 4., 37., 28., 640.],
...,
[ 285., 6., 22., 17., 549.],
[ 563., 12., 21., 13., 86.],
[ 200., 4., 31., 22., 873.]])
# calculate the bias and weights with the help of our custom function
w0, w = linear_regression(X_train,y_train)
w0
7.927257388070117
w
array([ 0.0097059 , -0.15910349, 0.01437921, 0.01494411, -0.00000907])
# Calculating predictions from the bias and weights
y_pred = w0 + X_train.dot(w)
y_pred
array([ 9.54792783, 9.38733977, 9.67197758, ..., 10.30423015,
11.9778914 , 9.99863111])
# Plot and compare our results
sns.histplot(y_pred, alpha = 0.5, color='red', bins=50) # alpha for transparency
sns.histplot(y_train, alpha = 0.5, color='pink', bins=50);
As the figure shows, the two distributions do not fully overlap: for expensive cars the model predicts prices that are too low.
RMSE
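RMSE (root mean squared error) measures how far the predictions are from the actual values on average: RMSE = sqrt( mean( (y - y_pred)^2 ) ). Lower is better, and here it is computed on the log-transformed prices, which is exactly what the function below implements.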
def rmse(y,y_pred):
"""Calculate RMSE value of a model"""
sq_error = (y - y_pred)**2
mean_sq_error = sq_error.mean()
return np.sqrt(mean_sq_error)
rmse(y_train,y_pred)
0.7554192603920132
Validating the Model on Validation Dataset
Based on the data preparation steps we used earlier (listed below), we are going to create a reusable function:
- base = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg', 'popularity']
- df_train[base].isnull().sum().sort_values(ascending=False)
- X_train = df_train[base].fillna(0).values
- w0, w = linear_regression(X_train,y_train)
- y_pred = w0 + X_train.dot(w)
# from earlier selected features
base = ['engine_hp', 'engine_cylinders', 'highway_mpg','city_mpg', 'popularity']
def prepare_X(df):
df_num = df[base] # selecting features from df
df_num = df_num.fillna(0) # filling missing values with 0
X = df_num.values # converting dataframe to NumPy array
return X
# From earlier written linear regression
def linear_regression(X,y):
"""Calculate linear regression"""
ones = np.ones(X.shape[0])
X = np.column_stack([ones,X])
XTX = X.T.dot(X)
XTX_inv = np.linalg.inv(XTX)
w_full = XTX_inv.dot(X.T).dot(y)
return w_full[0], w_full[1:]
X_train = prepare_X(df_train) # prepare the training feature matrix (missing values filled)
w0, w = linear_regression(X_train,y_train) # learn the bias and the feature weights
X_val = prepare_X(df_val) # prepare the validation feature matrix the same way
y_pred = w0 + X_val.dot(w) # apply the w0, w learned on the training data
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
0.7616530991301601
Ways to improve
Simple Feature Engineering
A car's age plays an important role in its price, and we missed it in our baseline features.
df_train.columns
Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
'engine_cylinders', 'transmission_type', 'driven_wheels',
'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
'highway_mpg', 'city_mpg', 'popularity'],
dtype='object')
df_train["year"].max() # max. collected year of car is 2017
2017
2017 - df_train.year
2735 9
6720 5
5878 1
11190 26
4554 0
..
434 2
1902 2
9334 2
5284 3
2420 0
Name: year, Length: 7150, dtype: int64
# Let's add this feature when preparing our training data
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features = base + ['age'] # adding 'age' feature with existing features in base list
df_num = df[features] # selecting features from df
df_num = df_num.fillna(0) # filling missing values with 0
X = df_num.values # converting dataframe to NumPy array
return X
X_train = prepare_X(df_train) # preparing X_train dataset after cleaning missing values
w0, w = linear_regression(X_train,y_train) # produces weights of base and parameters
X_val = prepare_X(df_val) # Preparing validation dataset to compare with train dataset
y_pred = w0 + X_val.dot(w) # applying known values of w0, w from train dataset
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
0.5172055461058335
Compared with the earlier RMSE, the value decreased from 0.76 to 0.51, so this model is better: adding an important feature brings the predictions much closer to the actual target values.
# Let's plot to visualize the impact of adding the new feature
sns.histplot(y_pred, label='prediction', color='red', alpha=0.5, bins=50)
sns.histplot(y_val, label='target', color='blue', alpha=0.5, bins=50)
plt.legend();
Categorical Variables
Number Of Doors
df_train.dtypes
make object
model object
year int64
engine_fuel_type object
engine_hp float64
engine_cylinders float64
transmission_type object
driven_wheels object
number_of_doors float64
market_category object
vehicle_size object
vehicle_style object
highway_mpg int64
city_mpg int64
popularity int64
dtype: object
df_train.number_of_doors.head() # treated as float datatype by the model
2735 2.0
6720 4.0
5878 4.0
11190 3.0
4554 4.0
Name: number_of_doors, dtype: float64
As we can see from the dtypes, number_of_doors is stored as 'float64' even though it is really a categorical feature.
Rather than treating it as a number, we will encode it as categories; first, let's look at the other categorical columns.
df_train.select_dtypes('object').columns # displays categorical columns in the dataset
Index(['make', 'model', 'engine_fuel_type', 'transmission_type',
'driven_wheels', 'market_category', 'vehicle_size', 'vehicle_style'],
dtype='object')
df_train.select_dtypes('object').head() # first 5 rows of categorical variables
make | model | engine_fuel_type | transmission_type | driven_wheels | market_category | vehicle_size | vehicle_style | |
---|---|---|---|---|---|---|---|---|
2735 | chevrolet | cobalt | regular_unleaded | manual | front_wheel_drive | NaN | compact | coupe |
6720 | toyota | matrix | regular_unleaded | automatic | front_wheel_drive | hatchback | compact | 4dr_hatchback |
5878 | subaru | impreza | regular_unleaded | automatic | all_wheel_drive | hatchback | compact | 4dr_hatchback |
11190 | volkswagen | vanagon | regular_unleaded | manual | rear_wheel_drive | NaN | large | passenger_minivan |
4554 | ford | f-150 | flex-fuel_(unleaded/e85) | automatic | four_wheel_drive | flex_fuel | large | crew_cab_pickup |
# How to encode number_of_doors as binary (one-hot) columns
(df_train.number_of_doors==4).astype(int)
# converts the True/False from condition to integer '0s' and '1s'
2735 0
6720 1
5878 1
11190 0
4554 1
..
434 0
1902 0
9334 1
5284 1
2420 1
Name: number_of_doors, Length: 7150, dtype: int64
# Instead of writing this out separately for 2, 3 and 4 doors and creating each column by hand, we can generate the column names in a loop
'num_doors_%s' %4
'num_doors_4'
for doors in [2,3,4]:
print('num_doors_%s' %doors)
num_doors_2
num_doors_3
num_doors_4
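As a side note (an alternative I am adding here, not used in the course code), pandas has a built-in helper for this kind of one-hot encoding. We keep the manual loop because it guarantees the same columns on train, validation and test, whereas get_dummies only creates columns for the values present in the given dataframe (and names them after the float values, e.g. num_doors_2.0):

# sketch: the same binary door columns via pandas' built-in one-hot encoder
pd.get_dummies(df_train.number_of_doors, prefix='num_doors').head()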
# applying it in our prepare_X function
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
features = base.copy() # Creating Copy of base features
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features.append('age') # adding 'age' feature with existing features in base list
for doors in [2,3,4]:
# Creating new columns for each door value
# Convert them into binary value wherever condition meets (One-Hot Encoder)
df['num_doors_%s' %doors] = (df.number_of_doors == doors).astype(int)
features.append('num_doors_%s' %doors) # Adding new features to existing feature list
df_num = df[features] # selecting features from df
df_num = df_num.fillna(0) # filling missing values with 0
X = df_num.values # converting dataframe to NumPy array
return X
X_train = prepare_X(df_train) # preparing X_train dataset after cleaning missing values
w0, w = linear_regression(X_train,y_train) # produces weights of base and parameters
X_val = prepare_X(df_val) # Preparing validation dataset to compare with train dataset
y_pred = w0 + X_val.dot(w) # applying known values of w0, w from train dataset
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
0.5157995641502353
- Compared with the previous RMSE value, the error decreased only slightly
- Let's add the top car makes as columns and check how the model performs
Car Make
list(df.make.value_counts().head().index) # Select the top car brands and do one-hot encoding
['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge']
car_make = ['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge']
# Add this to our prepare_X function
# applying it in our prepare_X function
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
features = base.copy() # Creating Copy of base features
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features.append('age') # adding 'age' feature with existing features in base list
for doors in [2,3,4]:
# Creating new columns for each door value
# Convert them into binary value wherever condition meets (One-Hot Encoder)
df['num_doors_%s' %doors] = (df.number_of_doors == doors).astype(int)
features.append('num_doors_%s' %doors) # Adding new features to existing feature list
for brand in car_make:
# Creating new columns for each top car brand
df['car_%s' %brand] = (df.make == brand).astype(int)
features.append('car_%s' %brand)
df_num = df[features] # selecting features from df
df_num = df_num.fillna(0) # filling missing values with 0
X = df_num.values # converting dataframe to NumPy array
return X
X_train = prepare_X(df_train) # preparing X_train dataset after cleaning missing values
w0, w = linear_regression(X_train,y_train) # produces weights of base and parameters
X_val = prepare_X(df_val) # Preparing validation dataset to compare with train dataset
y_pred = w0 + X_val.dot(w) # applying known values of w0, w from train dataset
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
0.5076038849557035
- After adding the car make features to the training data, the RMSE decreased from 0.517 to 0.507
Adding more features
- What happens if we add even more features? Will the error keep decreasing, or will the model overfit?
# Select some more features from the categorical columns
categorical_columns = [
'make', 'model', 'engine_fuel_type', 'driven_wheels', 'market_category',
'vehicle_size', 'vehicle_style']
categorical = {}
for c in categorical_columns:
# Selecting top most 5 unique values from each column and making them new column in training
categorical[c] = list(df_train[c].value_counts().head().index)
# applying in our prepare_X function
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
features = base.copy() # Creating Copy of base features
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features.append('age') # adding 'age' feature with existing features in base list
for doors in [2,3,4]:
# Creating new columns for each door value
# Convert them into binary value wherever condition meets (One_Hot Encoder)
df['num_doors_%s' %doors] = (df.number_of_doors == doors).astype(int)
features.append('num_doors_%s' %doors) # Adding new features to existing feature list
for name, values in categorical.items():
# name - name of the column
# values - top most value in each column and looped to create as new feature
for value in values:
df['%s_%s' % (name, value)] = (df[name] == value).astype(int)
features.append('%s_%s' % (name, value))
df_num = df[features]
df_num = df_num.fillna(0)
X = df_num.values
return X
X_train = prepare_X(df_train) # preparing X_train dataset after cleaning missing values
w0, w = linear_regression(X_train,y_train) # produces weights of base and parameters
X_val = prepare_X(df_val) # Preparing validation dataset to compare with train dataset
y_pred = w0 + X_val.dot(w) # applying known values of w0, w from train dataset
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
24.780192001181355
- As we can see, the RMSE exploded (from about 0.51 to about 24.8) after adding more features.
Regularization
The huge RMSE is likely caused by columns in the feature matrix that are (almost) identical to other columns; such duplicate columns make XTX nearly singular, so its inverse contains enormous numbers.
# For example
X = [
[4, 4, 4],
[3, 5, 5],
[5, 1, 1],
[5, 4, 4],
[7, 5, 5],
[4, 5, 5.00000001],
] # columns 2 and 3 are almost exact duplicates
X = np.array(X)
y = [1,2,3,1,2,3]
X, y
(array([[4. , 4. , 4. ],
[3. , 5. , 5. ],
[5. , 1. , 1. ],
[5. , 4. , 4. ],
[7. , 5. , 5. ],
[4. , 5. , 5.00000001]]),
[1, 2, 3, 1, 2, 3])
# Calculate normal equation
XTX = X.T.dot(X)
XTX
array([[140. , 111. , 111.00000004],
[111. , 108. , 108.00000005],
[111.00000004, 108.00000005, 108.0000001 ]])
XTX_inv = np.linalg.inv(XTX)
XTX_inv
array([[ 3.86409478e-02, -1.26839821e+05, 1.26839770e+05],
[-1.26839767e+05, 2.88638033e+14, -2.88638033e+14],
[ 1.26839727e+05, -2.88638033e+14, 2.88638033e+14]])
XTX_inv.dot(X.T).dot(y)
array([ -0.19390888, -3618543.74936484, 3618546.42894508])
As we can see, the first weight is about -0.194 while the other two are around ±3.6 * 10^6: the near-duplicate columns make the weights huge and unstable, which is exactly what inflated our RMSE before.
To tackle this problem, we add a small number to the diagonal of XTX.
# let's see how this works on a small example
XTX = [
[1,2,2],
[2,1,1.0001],
[2,1.0001,1]
]
XTX = np.array(XTX)
XTX
array([[1. , 2. , 2. ],
[2. , 1. , 1.0001],
[2. , 1.0001, 1. ]])
np.linalg.inv(XTX) # 1
array([[ -0.33335556, 0.33333889, 0.33333889],
[ 0.33333889, -5000.08333472, 4999.91666528],
[ 0.33333889, 4999.91666528, -5000.08333472]])
- The entries in the 2nd and 3rd columns of the inverse are very large
- Let's add a small number (0.01) to the diagonal and check the result
# same matrix as above, but with 0.01 added to each diagonal entry
XTX = [
[1.01,2,2],
[2,1.01,1.0001],
[2,1.0001,1.01]
]
XTX = np.array(XTX)
XTX
array([[1.01 , 2. , 2. ],
[2. , 1.01 , 1.0001],
[2. , 1.0001, 1.01 ]])
np.linalg.inv(XTX) # 2
array([[ -0.3367115 , 0.33501965, 0.33501965],
[ 0.33501965, 50.42045804, -50.58964297],
[ 0.33501965, -50.58964297, 50.42045804]])
- Comparing # 1 and # 2, the large values have shrunk considerably
# How to add the small number to the diagonal programmatically
XTX = [
[1,2, 2],
[2,1,1.0001],
[2,1.0001,1]
]
XTX
[[1, 2, 2], [2, 1, 1.0001], [2, 1.0001, 1]]
XTX = XTX + 0.01 * np.eye(3)
XTX
array([[1.01 , 2. , 2. ],
[2. , 1.01 , 1.0001],
[2. , 1.0001, 1.01 ]])
np.linalg.inv(XTX)
array([[ -0.3367115 , 0.33501965, 0.33501965],
[ 0.33501965, 50.42045804, -50.58964297],
[ 0.33501965, -50.58964297, 50.42045804]])
# Applying the regularization parameter in our linear regression function
def linear_regression_reg(X, y, r=0.001):
ones = np.ones(X.shape[0]) # Creating bias term for dataset
X = np.column_stack([ones, X]) # Adding column wise, bias with dataset
XTX = X.T.dot(X) # Matrix Multiplication
XTX = XTX + r * np.eye(XTX.shape[0]) # Adding regularization parameter at the diagonals
XTX_inv = np.linalg.inv(XTX) # Inverse of XTX
w_full = XTX_inv.dot(X.T).dot(y) # Normal equation to find the coefficients of bias and weights
return w_full[0], w_full[1:] # Bias term, Weights - w1,..wn
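What this function computes is the closed-form ridge (L2-regularized) regression solution, w = (X^T X + r*I)^(-1) X^T y. Adding r to the diagonal keeps X^T X well conditioned, so the inverse no longer blows up when columns are (almost) duplicated. Note that this simple version also adds r to the row belonging to the column of ones, so the bias term gets regularized as well.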
X_train = prepare_X(df_train)
w0, w = linear_regression_reg(X_train, y_train, r=0.01)
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
0.4608208286029829
Our earlier RMSE value was 0.51, and with regularization it dropped to 0.46.
Next we have to tune 'r' to find the value that gives the lowest RMSE.
Tuning the Model
for r_value in [0.0,0.01,0.001,0.0001,0.00001,10]: # Manually given r_values
# Generate training results for each r_value to find out the optimum r-value
X_train = prepare_X(df_train)
w0, w = linear_regression_reg(X_train, y_train, r=r_value)
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
score = rmse(y_val, y_pred)
print(r_value, w0, score)
0.0 -1892716650172720.2 24.780192001181355
0.01 7.1183820236285555 0.4608208286029829
0.001 7.130829068671119 0.4608158583369783
0.0001 7.139881370176266 0.46081536403011203
1e-05 3.6757689730217615 0.46081532315296586
10 4.729512585698256 0.472609877266825
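A small extension (my own sketch, not part of the course notebook) is to collect the scores in a dictionary and pick the r with the lowest validation RMSE automatically instead of reading it off the printout:

# sketch: select the best regularization parameter programmatically
scores = {}
X_train = prepare_X(df_train)
X_val = prepare_X(df_val)
for r_value in [0.0, 0.00001, 0.0001, 0.001, 0.01, 10]:
    w0, w = linear_regression_reg(X_train, y_train, r=r_value)
    y_pred = w0 + X_val.dot(w)
    scores[r_value] = rmse(y_val, y_pred)
best_r = min(scores, key=scores.get)   # r value with the smallest validation RMSE
best_r, scores[best_r]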
Using the Model
- Apply everything
- Combine train + validation as train dataset
- Compare results with test dataset
# Combine df_train and df_val as single train dataset
df_full_train = pd.concat([df_train,df_val])
df_full_train.head()
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2735 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
6720 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
5878 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 |
11190 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 |
4554 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 |
# Since the index is still shuffled, let's reset it
df_full_train = df_full_train.reset_index(drop=True) # drop = True will drop existing index
df_full_train.head()
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
2 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 |
3 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 |
4 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 |
df_full_train
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
2 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 |
3 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 |
4 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9527 | volvo | v60 | 2015 | regular_unleaded | 240.0 | 4.0 | automatic | front_wheel_drive | 4.0 | luxury | midsize | wagon | 37 | 25 | 870 |
9528 | maserati | granturismo_convertible | 2015 | premium_unleaded_(required) | 444.0 | 8.0 | automatic | rear_wheel_drive | 2.0 | exotic,luxury,high-performance | midsize | convertible | 20 | 13 | 238 |
9529 | cadillac | escalade_hybrid | 2013 | regular_unleaded | 332.0 | 8.0 | automatic | rear_wheel_drive | 4.0 | luxury,hybrid | large | 4dr_suv | 23 | 20 | 1624 |
9530 | mitsubishi | lancer | 2016 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 4.0 | NaN | compact | sedan | 34 | 24 | 436 |
9531 | kia | sorento | 2015 | regular_unleaded | 290.0 | 6.0 | automatic | front_wheel_drive | 4.0 | crossover | midsize | 4dr_suv | 25 | 18 | 1720 |
9532 rows × 15 columns
# The feature engineering function (prepare_X) written earlier
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
features = base.copy() # Creating Copy of base features
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features.append('age') # adding 'age' feature with existing features in base list
for doors in [2,3,4]:
# Creating new columns for each door value
# Convert them into binary value wherever condition meets (One_Hot Encoder)
df['num_doors_%s' %doors] = (df.number_of_doors == doors).astype(int)
features.append('num_doors_%s' %doors) # Adding new features to existing feature list
for name, values in categorical.items():
# name - name of the column
# values - top most value in each column and looped to create as new feature
for value in values:
df['%s_%s' % (name, value)] = (df[name] == value).astype(int)
features.append('%s_%s' % (name, value))
df_num = df[features]
df_num = df_num.fillna(0)
X = df_num.values
return X
# prepare and clean full dataset
X_full_train = prepare_X(df_full_train)
X_full_train
array([[148., 4., 33., ..., 1., 0., 0.],
[132., 4., 32., ..., 0., 0., 1.],
[148., 4., 37., ..., 0., 0., 1.],
...,
[332., 8., 23., ..., 0., 0., 0.],
[148., 4., 34., ..., 0., 0., 0.],
[290., 6., 25., ..., 0., 0., 0.]])
# Combine y values of train and validation dataset together
y_full_train = np.concatenate([y_train,y_val])
y_full_train
array([ 9.57574708, 9.887663 , 9.89323518, ..., 11.21756062,
9.77542688, 10.1924563 ])
# train the x_full_train and y_full_train dataset to find the coefficients
w0, w = linear_regression_reg(X_full_train, y_full_train, r=0.001)
# Apply w0, w to find the prediction values
X_test = prepare_X(df_test) # feature engineering on the test set
y_pred = w0 + X_test.dot(w) # predictions on the test set
score = rmse(y_test, y_pred) # comparing the actual y values with the predictions
score
0.46007539687771004
Testing Out the Model
- Let's apply the model to unseen data and check its performance
df_test.iloc[5]
make audi
model a3
year 2015
engine_fuel_type premium_unleaded_(recommended)
engine_hp 220.0
engine_cylinders 4.0
transmission_type automated_manual
driven_wheels all_wheel_drive
number_of_doors 2.0
market_category luxury
vehicle_size compact
vehicle_style convertible
highway_mpg 32
city_mpg 23
popularity 3105
Name: 1027, dtype: object
car = df_test.iloc[5].to_dict()
car
{'make': 'audi',
'model': 'a3',
'year': 2015,
'engine_fuel_type': 'premium_unleaded_(recommended)',
'engine_hp': 220.0,
'engine_cylinders': 4.0,
'transmission_type': 'automated_manual',
'driven_wheels': 'all_wheel_drive',
'number_of_doors': 2.0,
'market_category': 'luxury',
'vehicle_size': 'compact',
'vehicle_style': 'convertible',
'highway_mpg': 32,
'city_mpg': 23,
'popularity': 3105}
# To get a prediction for this car,
# we have to apply the same feature engineering as for the training data (the prepare_X function)
# To do that, we first convert the dictionary into a dataframe,
# since prepare_X accepts a dataframe only
df_test_car = pd.DataFrame([car])
df_test_car
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a3 | 2015 | premium_unleaded_(recommended) | 220.0 | 4.0 | automated_manual | all_wheel_drive | 2.0 | luxury | compact | convertible | 32 | 23 | 3105 |
# Now apply prepare_X to create the age, number-of-doors and categorical features
X_test_car = prepare_X(df_test_car)
# Let's check our model's prediction for this test car
y_pred = w0 + X_test_car.dot(w) # We already know w0, w
y_pred = y_pred[0]
y_pred
10.474552314979627
# The model predicts log prices; convert back to see the actual MSRP
np.expm1(y_pred)
35402.01676990125
# Let's compare with the actual y value
np.expm1(y_test[5])
41149.999999999985
Comparing the two, our prediction is about $5K below the actual price, which is a reasonable result given that we did only a little feature engineering and trained on just a handful of variables.
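To round things off, here is a minimal sketch (my own wrapper, assuming the w0 and w obtained from the full training above) that turns the whole pipeline into a single prediction helper for any car given as a dictionary:

def predict_price(car_dict, w0, w):
    """Predict the MSRP (in dollars) of a single car described as a dictionary."""
    df_car = pd.DataFrame([car_dict])   # prepare_X expects a dataframe
    X_car = prepare_X(df_car)           # same feature engineering as for training
    log_price = w0 + X_car.dot(w)       # prediction in log space
    return np.expm1(log_price[0])       # undo the log transform

predict_price(car, w0, w)  # roughly 35402 for the audi a3 above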