Week 2 - ML Zoom Camp
Car Price Prediction Project
This week we learn our first ML algorithm, linear regression, through a machine learning project. The dataset can be downloaded from here. It has many features such as Make, Year, and City MPG, and the target variable is MSRP (the car's price).
Project Plan
- Prepare data and do EDA
- Use linear regression for predicting price
- Understanding internals of linear regression (Vector form) and implementing it
- Evaluating model with RMSE
- Feature Engineering
- Regularization
Data Preparation
import numpy as np
import pandas as pd
# Uncomment below line to download dataset
# !wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
df = pd.read_csv('data.csv')
df.head() # looking first few rows to understand data
Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |
Observations from .head()
- Column names are separated either by whitespace or by an underscore
- Column names mix lowercase and uppercase letters
- String values in the data also mix lowercase and uppercase letters
The column names and string values should be cleaned up and brought to one consistent format
df.columns # display column names in the data
Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
'highway MPG', 'city mpg', 'Popularity', 'MSRP'],
dtype='object')
# data cleaning on column names
df.columns = df.columns.str.lower().str.replace(" ","_")
# convert the column names to lowercase and replace whitespace with '_'
df.columns
Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
'engine_cylinders', 'transmission_type', 'driven_wheels',
'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
'highway_mpg', 'city_mpg', 'popularity', 'msrp'],
dtype='object')
df.dtypes # display data types of each column
make object
model object
year int64
engine_fuel_type object
engine_hp float64
engine_cylinders float64
transmission_type object
driven_wheels object
number_of_doors float64
market_category object
vehicle_size object
vehicle_style object
highway_mpg int64
city_mpg int64
popularity int64
msrp int64
dtype: object
strings = list(df.dtypes[df.dtypes == 'object'].index)
strings
['make',
'model',
'engine_fuel_type',
'transmission_type',
'driven_wheels',
'market_category',
'vehicle_size',
'vehicle_style']
- df.dtypes == 'object' returns True for each column whose dtype is object
- df.dtypes[df.dtypes == 'object'] keeps only the columns with object dtype
- df.dtypes[df.dtypes == 'object'].index returns those column names as an Index
- list(df.dtypes[df.dtypes == 'object'].index) converts that Index into a plain list
for column in strings:
df[column] = df[column].str.lower().str.replace(" ","_")
# exactly how we did for column names
df.select_dtypes("object").head() # display only object datatypes
make | model | engine_fuel_type | transmission_type | driven_wheels | market_category | vehicle_size | vehicle_style | |
---|---|---|---|---|---|---|---|---|
0 | bmw | 1_series_m | premium_unleaded_(required) | manual | rear_wheel_drive | factory_tuner,luxury,high-performance | compact | coupe |
1 | bmw | 1_series | premium_unleaded_(required) | manual | rear_wheel_drive | luxury,performance | compact | convertible |
2 | bmw | 1_series | premium_unleaded_(required) | manual | rear_wheel_drive | luxury,high-performance | compact | coupe |
3 | bmw | 1_series | premium_unleaded_(required) | manual | rear_wheel_drive | luxury,performance | compact | coupe |
4 | bmw | 1_series | premium_unleaded_(required) | manual | rear_wheel_drive | luxury | compact | convertible |
Exploratory Data Analysis (EDA)
for column in df.columns:
print(column) # display column name
print('-'*10)
print(df[column].unique()[:5]) # display first 5 unique values of each column
print(df[column].nunique())
print()
make
----------
['bmw' 'audi' 'fiat' 'mercedes-benz' 'chrysler']
48
model
----------
['1_series_m' '1_series' '100' '124_spider' '190-class']
914
year
----------
[2011 2012 2013 1992 1993]
28
engine_fuel_type
----------
['premium_unleaded_(required)' 'regular_unleaded'
'premium_unleaded_(recommended)' 'flex-fuel_(unleaded/e85)' 'diesel']
10
engine_hp
----------
[335. 300. 230. 320. 172.]
356
engine_cylinders
----------
[ 6. 4. 5. 8. 12.]
9
transmission_type
----------
['manual' 'automatic' 'automated_manual' 'direct_drive' 'unknown']
5
driven_wheels
----------
['rear_wheel_drive' 'front_wheel_drive' 'all_wheel_drive'
'four_wheel_drive']
4
number_of_doors
----------
[ 2. 4. 3. nan]
3
market_category
----------
['factory_tuner,luxury,high-performance' 'luxury,performance'
'luxury,high-performance' 'luxury' 'performance']
71
vehicle_size
----------
['compact' 'midsize' 'large']
3
vehicle_style
----------
['coupe' 'convertible' 'sedan' 'wagon' '4dr_hatchback']
16
highway_mpg
----------
[26 28 27 25 24]
59
city_mpg
----------
[19 20 18 17 16]
69
popularity
----------
[3916 3105 819 617 1013]
48
msrp
----------
[46135 40650 36350 29450 34500]
6049
df.describe() # by default describe shows for numeric values
year | engine_hp | engine_cylinders | number_of_doors | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|
count | 11914.000000 | 11845.00000 | 11884.000000 | 11908.000000 | 11914.000000 | 11914.000000 | 11914.000000 | 1.191400e+04 |
mean | 2010.384338 | 249.38607 | 5.628829 | 3.436093 | 26.637485 | 19.733255 | 1554.911197 | 4.059474e+04 |
std | 7.579740 | 109.19187 | 1.780559 | 0.881315 | 8.863001 | 8.987798 | 1441.855347 | 6.010910e+04 |
min | 1990.000000 | 55.00000 | 0.000000 | 2.000000 | 12.000000 | 7.000000 | 2.000000 | 2.000000e+03 |
25% | 2007.000000 | 170.00000 | 4.000000 | 2.000000 | 22.000000 | 16.000000 | 549.000000 | 2.100000e+04 |
50% | 2015.000000 | 227.00000 | 6.000000 | 4.000000 | 26.000000 | 18.000000 | 1385.000000 | 2.999500e+04 |
75% | 2016.000000 | 300.00000 | 6.000000 | 4.000000 | 30.000000 | 22.000000 | 2009.000000 | 4.223125e+04 |
max | 2017.000000 | 1001.00000 | 16.000000 | 4.000000 | 354.000000 | 137.000000 | 5657.000000 | 2.065902e+06 |
df.describe(include=[object]) # show statistics for categorical values
make | model | engine_fuel_type | transmission_type | driven_wheels | market_category | vehicle_size | vehicle_style | |
---|---|---|---|---|---|---|---|---|
count | 11914 | 11914 | 11911 | 11914 | 11914 | 8172 | 11914 | 11914 |
unique | 48 | 914 | 10 | 5 | 4 | 71 | 3 | 16 |
top | chevrolet | silverado_1500 | regular_unleaded | automatic | front_wheel_drive | crossover | compact | sedan |
freq | 1123 | 156 | 7172 | 8266 | 4787 | 1110 | 4764 | 3048 |
Describing Price
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = [12,10]
%matplotlib inline
sns.histplot(df.msrp, bins=10);
Based on the figure above, the price (msrp) distribution has a long tail: a few cars have very high prices, and these extreme values would confuse our model.
sns.histplot(df.msrp[df.msrp < 100000], bins=50);
Logarithmic function on values
To avoid this, we apply the logarithmic function to the msrp values so that even very large values are compressed into a small range. Check the example below:
np.log([1,10,1000,10000,100000])
array([ 0. , 2.30258509, 6.90775528, 9.21034037, 11.51292546])
# but if we apply log to zero, the result is negative infinity
np.log(0)
/tmp/ipykernel_19857/1310345647.py:2: RuntimeWarning: divide by zero encountered in log
np.log(0)
-inf
# to avoid that, we add 1 before taking the log: np.log(0 + 1) = np.log(1) = 0
# numpy has a built-in function, np.log1p, which adds the 1 automatically
np.log1p(0)
0.0
# Coming back to our dataset: it has no zero msrp values, but we use log1p anyway
price_logs = np.log1p(df.msrp)
price_logs
0 10.739349
1 10.612779
2 10.500977
3 10.290483
4 10.448744
...
11909 10.739024
11910 10.945018
11911 10.832122
11912 10.838031
11913 10.274913
Name: msrp, Length: 11914, dtype: float64
# plotting a histogram of price_logs
sns.histplot(price_logs, bins=50);
Missing values
df.isnull().sum().sort_values(ascending=False)
market_category 3742
engine_hp 69
engine_cylinders 30
number_of_doors 6
engine_fuel_type 3
make 0
model 0
year 0
transmission_type 0
driven_wheels 0
vehicle_size 0
vehicle_style 0
highway_mpg 0
city_mpg 0
popularity 0
msrp 0
dtype: int64
Setting validation framework
# we are splitting data into 60% for training, 20% for validation and remaining 20% for testing
n = len(df)
n_val = int(0.2 * n) #20% validation
n_test = int(0.2 * n) #20% testing
n_train = n - (n_test + n_val)
n_train, n_test, n_val
(7150, 2382, 2382)
# Separating dataframe based on above ratio
df_train = df.iloc[:n_train]
df_val = df.iloc[n_train:n_val+n_train] # rows 7150 to 9531 (2382 rows)
df_test = df.iloc[n_val+n_train:] # rows 9532 to 11913 (2382 rows)
df_test
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9532 | chevrolet | silverado_1500 | 2015 | regular_unleaded | 355.0 | 8.0 | automatic | rear_wheel_drive | 4.0 | NaN | large | crew_cab_pickup | 23 | 16 | 1385 | 47575 |
9533 | chevrolet | silverado_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | rear_wheel_drive | 4.0 | flex_fuel | large | extended_cab_pickup | 24 | 18 | 1385 | 31940 |
9534 | chevrolet | silverado_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | rear_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 24 | 18 | 1385 | 38335 |
9535 | chevrolet | silverado_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | rear_wheel_drive | 4.0 | flex_fuel | large | extended_cab_pickup | 24 | 18 | 1385 | 35870 |
9536 | chevrolet | silverado_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | rear_wheel_drive | 2.0 | flex_fuel | large | regular_cab_pickup | 24 | 18 | 1385 | 28155 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11909 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 46120 |
11910 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 56670 |
11911 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 50620 |
11912 | acura | zdx | 2013 | premium_unleaded_(recommended) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 50920 |
11913 | lincoln | zephyr | 2006 | regular_unleaded | 221.0 | 6.0 | automatic | front_wheel_drive | 4.0 | luxury | midsize | sedan | 26 | 17 | 61 | 28995 |
2382 rows × 16 columns
But there is a problem with splitting this way: the split is sequential and the original data is ordered (for example, all the BMWs come first), so each part would cover a different slice of the data. We therefore shuffle the data before splitting, so the model learns general patterns instead of memorizing a particular ordering, which would lead to overfitting.
idx = np.arange(n) # capturing the indexes of dataframe
idx
array([ 0, 1, 2, ..., 11911, 11912, 11913])
np.random.seed(2)
np.random.shuffle(idx)
idx # shuffled randomly
array([2735, 6720, 5878, ..., 6637, 2575, 7336])
idx[:n_train] # first n_train shuffled indices; the remaining ones are used for val and test
array([2735, 6720, 5878, ..., 9334, 5284, 2420])
df_train = df.iloc[idx[:n_train]]
df_val = df.iloc[idx[n_train:n_val+n_train]] # rows at the next 2382 shuffled indices
df_test = df.iloc[idx[n_val+n_train:]] # rows at the last 2382 shuffled indices
df_train.head()
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2735 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 | 14410 |
6720 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 | 19685 |
5878 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 | 19795 |
11190 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 | 2000 |
4554 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 | 56260 |
# we have to convert our y values as seen before
np.log1p([1,10,1000,10000000])
array([ 0.69314718, 2.39789527, 6.90875478, 16.11809575])
y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)
y_test = np.log1p(df_test.msrp.values)
# after separating target values, deleting target variables from train, val, test dataframe
del df_train['msrp']
del df_test['msrp']
del df_val['msrp']
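The cells above can also be wrapped into a single helper so the whole split is reproducible. This is just a sketch of what we did by hand; the name split_dataset and the seed argument are my own additions, not part of the course code.

def split_dataset(df, val_frac=0.2, test_frac=0.2, seed=2):
    """Shuffle the dataframe and split it into train/val/test frames plus log-price targets."""
    n = len(df)
    n_val = int(val_frac * n)
    n_test = int(test_frac * n)
    n_train = n - n_val - n_test

    idx = np.arange(n)
    np.random.seed(seed)      # fix the seed so the split is reproducible
    np.random.shuffle(idx)

    df_train = df.iloc[idx[:n_train]].copy()
    df_val = df.iloc[idx[n_train:n_train + n_val]].copy()
    df_test = df.iloc[idx[n_train + n_val:]].copy()

    y_train = np.log1p(df_train.msrp.values)
    y_val = np.log1p(df_val.msrp.values)
    y_test = np.log1p(df_test.msrp.values)

    for part in (df_train, df_val, df_test):
        del part['msrp']      # keep the target only in the y arrays

    return df_train, df_val, df_test, y_train, y_val, y_test

# df_train, df_val, df_test, y_train, y_val, y_test = split_dataset(df)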
Linear Regression
Example
df_train.iloc[10]
make rolls-royce
model phantom_drophead_coupe
year 2015
engine_fuel_type premium_unleaded_(required)
engine_hp 453.0
engine_cylinders 12.0
transmission_type automatic
driven_wheels rear_wheel_drive
number_of_doors 2.0
market_category exotic,luxury,performance
vehicle_size large
vehicle_style convertible
highway_mpg 19
city_mpg 11
popularity 86
Name: 7557, dtype: object
xi = [453, 11, 86] # feature values (engine_hp, city_mpg, popularity) of the car above
w0 = 5.17 # bias term, picked arbitrarily for this example
w = [.01, .04, .002] # example weights, one per feature
def linear_regression(xi):
"""Simple Linear Regression"""
n_loops = len(xi)
pred = w0 #initial weight for bias
for j in range(n_loops):
pred += w[j] * xi[j]
return pred
print(linear_regression(xi))
# Note that the prediction is small because the model works with log-transformed prices
10.312
# To get the actual price, we undo the log transform with the exponential (np.expm1)
np.expm1(10.312) # compare this predicted price with 'y' value
30090.55961642849
np.log1p(30090.55961642849)
10.312
Linear Regression in Vector Form
def dot_mat(xi,w):
"""Dot Product of xi vector"""
n_runs = len(xi)
result = 0.0
for run in range(n_runs):
result += xi[run] * w[run]
return result
[w0] + w # w0 = 5.17
[5.17, 0.01, 0.04, 0.002]
[1] + xi # xi0 =1
[1, 453, 11, 86]
def lr_form(xi):
w_new = [w0] + w
xi = [1] + xi
return dot_mat(xi,w_new)
print(lr_form(xi))
10.312
Example - With Multiple features
w0 = 5.17
w_new = [w0] + w
w_new
[5.17, 0.01, 0.04, 0.002]
# three example cars from our data; the leading 1 is the bias feature, followed by engine_hp, city_mpg, popularity
x1 = [1, 148, 24, 1385]
x2 = [1, 132, 25, 2031]
x10 = [1, 453, 11, 86]
X = [x1, x2, x10]
X = np.array(X)
X
array([[ 1, 148, 24, 1385],
[ 1, 132, 25, 2031],
[ 1, 453, 11, 86]])
def lr_multiple_features(xi,w):
return xi.dot(w)
print(lr_multiple_features(xi=X,w=w_new))
[10.38 11.552 10.312]
Training Linear Regression
# Multiple features
X = [
[148, 24, 1385],
[132, 25, 2031],
[453, 11, 86],
[158, 24, 185],
[172, 25, 201],
[413, 11, 86],
[38, 54, 185],
[142, 25, 431],
[453, 31, 86],
]
X = np.array(X)
X
array([[ 148, 24, 1385],
[ 132, 25, 2031],
[ 453, 11, 86],
[ 158, 24, 185],
[ 172, 25, 201],
[ 413, 11, 86],
[ 38, 54, 185],
[ 142, 25, 431],
[ 453, 31, 86]])
np.set_printoptions(suppress=True) # print plain numbers instead of scientific notation like 1e+07
# creating a column of ones as the bias term for X; the bias captures the baseline price of a car
ones = np.ones(X.shape[0])
ones
array([1., 1., 1., 1., 1., 1., 1., 1., 1.])
# prepend the column of ones to X so that it acts as the bias feature
X = np.column_stack([ones,X])
X
array([[ 1., 148., 24., 1385.],
[ 1., 132., 25., 2031.],
[ 1., 453., 11., 86.],
[ 1., 158., 24., 185.],
[ 1., 172., 25., 201.],
[ 1., 413., 11., 86.],
[ 1., 38., 54., 185.],
[ 1., 142., 25., 431.],
[ 1., 453., 31., 86.]])
# y values
y = [10000, 20000, 15000, 20050, 10000, 20000, 15000, 25000, 12000]
y
[10000, 20000, 15000, 20050, 10000, 20000, 15000, 25000, 12000]
XTX = X.T.dot(X)
XTX
array([[ 9., 2109., 230., 4676.],
[ 2109., 696471., 44115., 718540.],
[ 230., 44115., 7146., 118803.],
[ 4676., 718540., 118803., 6359986.]])
XTX_inv = np.linalg.inv(XTX)
XTX_inv
array([[ 3.30686958, -0.00539612, -0.06213256, -0.00066102],
[-0.00539612, 0.00001116, 0.0000867 , 0.00000109],
[-0.06213256, 0.0000867 , 0.00146189, 0.00000858],
[-0.00066102, 0.00000109, 0.00000858, 0.00000036]])
X.T
array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 148., 132., 453., 158., 172., 413., 38., 142., 453.],
[ 24., 25., 11., 24., 25., 11., 54., 25., 31.],
[1385., 2031., 86., 185., 201., 86., 185., 431., 86.]])
# putting together everything to calculate W
w_full = XTX_inv.dot(X.T).dot(y)
w_full
array([25844.75405577, -16.08906468, -199.47254894, -1.22802883])
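What we just computed is the normal equation of least squares, w = (X^T X)^(-1) X^T y. Because X contains the column of ones, the first entry of w_full is the bias w0 and the remaining entries are the feature weights.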
# Separating w0 and other weights (w)
w0 = w_full[0]
w = w_full[1:]
w0, w
(25844.75405576679, array([ -16.08906468, -199.47254894, -1.22802883]))
X
array([[ 1., 148., 24., 1385.],
[ 1., 132., 25., 2031.],
[ 1., 453., 11., 86.],
[ 1., 158., 24., 185.],
[ 1., 172., 25., 201.],
[ 1., 413., 11., 86.],
[ 1., 38., 54., 185.],
[ 1., 142., 25., 431.],
[ 1., 453., 31., 86.]])
Putting together everything
X = [
[148, 24, 1385],
[132, 25, 2031],
[453, 11, 86],
[158, 24, 185],
[172, 25, 201],
[413, 11, 86],
[38, 54, 185],
[142, 25, 431],
[453, 31, 86],
]
X = np.array(X)
X
array([[ 148, 24, 1385],
[ 132, 25, 2031],
[ 453, 11, 86],
[ 158, 24, 185],
[ 172, 25, 201],
[ 413, 11, 86],
[ 38, 54, 185],
[ 142, 25, 431],
[ 453, 31, 86]])
y
[10000, 20000, 15000, 20050, 10000, 20000, 15000, 25000, 12000]
# putting everything inside a function - DRY (Don't Repeat Yourself)
def linear_regression(X,y):
"""Calculate linear regression"""
ones = np.ones(X.shape[0])
X = np.column_stack([ones,X])
XTX = X.T.dot(X)
XTX_inv = np.linalg.inv(XTX)
w_full = XTX_inv.dot(X.T).dot(y)
return w_full[0], w_full[1:]
print(linear_regression(X,y))
(25844.75405576679, array([ -16.08906468, -199.47254894, -1.22802883]))
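As a quick sanity check (my own addition, not from the course notebook), the same coefficients can be obtained with NumPy's built-in least-squares solver, which avoids forming the inverse explicitly:

# sketch: verify the normal-equation solution with np.linalg.lstsq
ones = np.ones(X.shape[0])
X_b = np.column_stack([ones, X])   # add the bias column, just like inside linear_regression
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
w_lstsq                            # first entry is the bias; should closely match linear_regression(X, y)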
Car price baseline model
df_train.columns
Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
'engine_cylinders', 'transmission_type', 'driven_wheels',
'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
'highway_mpg', 'city_mpg', 'popularity'],
dtype='object')
# selecting few features to create baseline model
base = ['engine_hp', 'engine_cylinders', 'highway_mpg',
'city_mpg', 'popularity']
# before creating baseline model, check for missing values
df_train[base].isnull().sum().sort_values(ascending=False)
engine_hp 40
engine_cylinders 14
highway_mpg 0
city_mpg 0
popularity 0
dtype: int64
# fill missing values with 0 for simplicity (a zero feature contributes nothing to the prediction)
X_train = df_train[base].fillna(0).values # .values convert dataframe to numpy array
X_train
array([[ 148., 4., 33., 24., 1385.],
[ 132., 4., 32., 25., 2031.],
[ 148., 4., 37., 28., 640.],
...,
[ 285., 6., 22., 17., 549.],
[ 563., 12., 21., 13., 86.],
[ 200., 4., 31., 22., 873.]])
# calculate the bias and weights with the help of our custom function
w0, w = linear_regression(X_train,y_train)
w0
7.927257388070117
w
array([ 0.0097059 , -0.15910349, 0.01437921, 0.01494411, -0.00000907])
# Calculating predictions from the bias and weights
y_pred = w0 + X_train.dot(w)
y_pred
array([ 9.54792783, 9.38733977, 9.67197758, ..., 10.30423015,
11.9778914 , 9.99863111])
# Plot and compare our results
sns.histplot(y_pred, alpha = 0.5, color='red', bins=50) # alpha for transparency
sns.histplot(y_train, alpha = 0.5, color='pink', bins=50);
As the figure shows, the two distributions do not fully overlap: for expensive cars the model predicts prices that are too low.
RMSE
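RMSE (root mean squared error) measures how far the predictions are from the actual values on average: RMSE = sqrt( mean( (y - y_pred)^2 ) ). Lower is better, and here it is computed on the log-transformed prices, which is exactly what the function below implements.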
def rmse(y,y_pred):
"""Calculate RMSE value of a model"""
sq_error = (y - y_pred)**2
mean_sq_error = sq_error.mean()
return np.sqrt(mean_sq_error)
rmse(y_train,y_pred)
0.7554192603920132
Validating the Model on Validation Dataset
Based on the data preparation steps we used earlier (listed below), we are going to create a reusable function:
- base = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg', 'popularity']
- df_train[base].isnull().sum().sort_values(ascending=False)
- X_train = df_train[base].fillna(0).values
- w0, w = linear_regression(X_train,y_train)
- y_pred = w0 + X_train.dot(w)
# from earlier selected features
base = ['engine_hp', 'engine_cylinders', 'highway_mpg','city_mpg', 'popularity']
def prepare_X(df):
df_num = df[base] # selecting features from df
df_num = df_num.fillna(0) # filling missing values with 0
X = df_num.values # converting dataframe to NumPy array
return X
# From earlier written linear regression
def linear_regression(X,y):
"""Calculate linear regression"""
ones = np.ones(X.shape[0])
X = np.column_stack([ones,X])
XTX = X.T.dot(X)
XTX_inv = np.linalg.inv(XTX)
w_full = XTX_inv.dot(X.T).dot(y)
return w_full[0], w_full[1:]
X_train = prepare_X(df_train) # prepare the training feature matrix (missing values filled)
w0, w = linear_regression(X_train,y_train) # learn the bias and the feature weights
X_val = prepare_X(df_val) # prepare the validation feature matrix the same way
y_pred = w0 + X_val.dot(w) # apply the w0, w learned on the training data
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
0.7616530991301601
Ways to improve
Simple Feature Engineering
A car's age plays an important role in its price, and we missed it in our baseline features.
df_train.columns
Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
'engine_cylinders', 'transmission_type', 'driven_wheels',
'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
'highway_mpg', 'city_mpg', 'popularity'],
dtype='object')
df_train["year"].max() # max. collected year of car is 2017
2017
2017 - df_train.year
2735 9
6720 5
5878 1
11190 26
4554 0
..
434 2
1902 2
9334 2
5284 3
2420 0
Name: year, Length: 7150, dtype: int64
# Let's add this feature when preparing our training data
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features = base + ['age'] # adding 'age' feature with existing features in base list
df_num = df[features] # selecting features from df
df_num = df_num.fillna(0) # filling missing values with 0
X = df_num.values # converting dataframe to NumPy array
return X
X_train = prepare_X(df_train) # preparing X_train dataset after cleaning missing values
w0, w = linear_regression(X_train,y_train) # produces weights of base and parameters
X_val = prepare_X(df_val) # Preparing validation dataset to compare with train dataset
y_pred = w0 + X_val.dot(w) # applying known values of w0, w from train dataset
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
0.5172055461058335
Compared with the earlier RMSE, the value decreased from 0.76 to 0.51, so this model is better: adding an important feature brings the predictions much closer to the actual target values.
# Let's plot to visualize the impact of adding the new feature
sns.histplot(y_pred, label='prediction', color='red', alpha=0.5, bins=50)
sns.histplot(y_val, label='target', color='blue', alpha=0.5, bins=50)
plt.legend();
Categorical Variables
Number Of Doors
df_train.dtypes
make object
model object
year int64
engine_fuel_type object
engine_hp float64
engine_cylinders float64
transmission_type object
driven_wheels object
number_of_doors float64
market_category object
vehicle_size object
vehicle_style object
highway_mpg int64
city_mpg int64
popularity int64
dtype: object
df_train.number_of_doors.head() # treated as float datatype by the model
2735 2.0
6720 4.0
5878 4.0
11190 3.0
4554 4.0
Name: number_of_doors, dtype: float64
As we can see from the dtypes, number_of_doors is stored as 'float64' even though it is really a categorical feature.
Rather than treating it as a number, we will encode it as categories; first, let's look at the other categorical columns.
df_train.select_dtypes('object').columns # displays categorical columns in the dataset
Index(['make', 'model', 'engine_fuel_type', 'transmission_type',
'driven_wheels', 'market_category', 'vehicle_size', 'vehicle_style'],
dtype='object')
df_train.select_dtypes('object').head() # first 5 rows of categorical variables
make | model | engine_fuel_type | transmission_type | driven_wheels | market_category | vehicle_size | vehicle_style | |
---|---|---|---|---|---|---|---|---|
2735 | chevrolet | cobalt | regular_unleaded | manual | front_wheel_drive | NaN | compact | coupe |
6720 | toyota | matrix | regular_unleaded | automatic | front_wheel_drive | hatchback | compact | 4dr_hatchback |
5878 | subaru | impreza | regular_unleaded | automatic | all_wheel_drive | hatchback | compact | 4dr_hatchback |
11190 | volkswagen | vanagon | regular_unleaded | manual | rear_wheel_drive | NaN | large | passenger_minivan |
4554 | ford | f-150 | flex-fuel_(unleaded/e85) | automatic | four_wheel_drive | flex_fuel | large | crew_cab_pickup |
# How to encode number_of_doors as binary (one-hot) columns
(df_train.number_of_doors==4).astype(int)
# converts the True/False from condition to integer '0s' and '1s'
2735 0
6720 1
5878 1
11190 0
4554 1
..
434 0
1902 0
9334 1
5284 1
2420 1
Name: number_of_doors, Length: 7150, dtype: int64
# Instead of writing this out separately for 2, 3 and 4 doors and creating each column by hand, we can generate the column names in a loop
'num_doors_%s' %4
'num_doors_4'
for doors in [2,3,4]:
print('num_doors_%s' %doors)
num_doors_2
num_doors_3
num_doors_4
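As a side note (an alternative I am adding here, not used in the course code), pandas has a built-in helper for this kind of one-hot encoding. We keep the manual loop because it guarantees the same columns on train, validation and test, whereas get_dummies only creates columns for the values present in the given dataframe (and names them after the float values, e.g. num_doors_2.0):

# sketch: the same binary door columns via pandas' built-in one-hot encoder
pd.get_dummies(df_train.number_of_doors, prefix='num_doors').head()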
# applying it in our prepare_X function
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
features = base.copy() # Creating Copy of base features
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features.append('age') # adding 'age' feature with existing features in base list
for doors in [2,3,4]:
# Creating new columns for each door value
# Convert them into binary value wherever condition meets (One-Hot Encoder)
df['num_doors_%s' %doors] = (df.number_of_doors == doors).astype(int)
features.append('num_doors_%s' %doors) # Adding new features to existing feature list
df_num = df[features] # selecting features from df
df_num = df_num.fillna(0) # filling missing values with 0
X = df_num.values # converting dataframe to NumPy array
return X
X_train = prepare_X(df_train) # preparing X_train dataset after cleaning missing values
w0, w = linear_regression(X_train,y_train) # produces weights of base and parameters
X_val = prepare_X(df_val) # Preparing validation dataset to compare with train dataset
y_pred = w0 + X_val.dot(w) # applying known values of w0, w from train dataset
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
0.5157995641502353
- Compared with the previous RMSE value, the error decreased only slightly
- Let's add the top car makes as columns and check how the model performs
Car Make
list(df.make.value_counts().head().index) # Select the top car brands and do one-hot encoding
['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge']
car_make = ['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge']
# Add this to our prepare_X function
# applying it in our prepare_X function
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
features = base.copy() # Creating Copy of base features
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features.append('age') # adding 'age' feature with existing features in base list
for doors in [2,3,4]:
# Creating new columns for each door value
# Convert them into binary value wherever condition meets (One-Hot Encoder)
df['num_doors_%s' %doors] = (df.number_of_doors == doors).astype(int)
features.append('num_doors_%s' %doors) # Adding new features to existing feature list
for brand in car_make:
# Creating new columns for each top car brand
df['car_%s' %brand] = (df.make == brand).astype(int)
features.append('car_%s' %brand)
df_num = df[features] # selecting features from df
df_num = df_num.fillna(0) # filling missing values with 0
X = df_num.values # converting dataframe to NumPy array
return X
X_train = prepare_X(df_train) # preparing X_train dataset after cleaning missing values
w0, w = linear_regression(X_train,y_train) # produces weights of base and parameters
X_val = prepare_X(df_val) # Preparing validation dataset to compare with train dataset
y_pred = w0 + X_val.dot(w) # applying known values of w0, w from train dataset
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
0.5076038849557035
- After adding the car make features to the training data, the RMSE decreased from 0.517 to 0.507
Adding more features
- What happens if we add even more features? Will the error keep decreasing, or will the model overfit?
# Select some more features from the categorical columns
categorical_columns = [
'make', 'model', 'engine_fuel_type', 'driven_wheels', 'market_category',
'vehicle_size', 'vehicle_style']
categorical = {}
for c in categorical_columns:
# Selecting top most 5 unique values from each column and making them new column in training
categorical[c] = list(df_train[c].value_counts().head().index)
# applying in our prepare_X function
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
features = base.copy() # Creating Copy of base features
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features.append('age') # adding 'age' feature with existing features in base list
for doors in [2,3,4]:
# Creating new columns for each door value
# Convert them into binary value wherever condition meets (One_Hot Encoder)
df['num_doors_%s' %doors] = (df.number_of_doors == doors).astype(int)
features.append('num_doors_%s' %doors) # Adding new features to existing feature list
for name, values in categorical.items():
# name - name of the column
# values - top most value in each column and looped to create as new feature
for value in values:
df['%s_%s' % (name, value)] = (df[name] == value).astype(int)
features.append('%s_%s' % (name, value))
df_num = df[features]
df_num = df_num.fillna(0)
X = df_num.values
return X
X_train = prepare_X(df_train) # preparing X_train dataset after cleaning missing values
w0, w = linear_regression(X_train,y_train) # produces weights of base and parameters
X_val = prepare_X(df_val) # Preparing validation dataset to compare with train dataset
y_pred = w0 + X_val.dot(w) # applying known values of w0, w from train dataset
# Comparing the evaluated validation prediction with existing validation target values
rmse(y_val,y_pred)
24.780192001181355
- As we can see, the RMSE exploded (from about 0.51 to about 24.8) after adding more features.
Regularization
The huge RMSE is likely caused by columns in the feature matrix that are (almost) identical to other columns; such duplicate columns make XTX nearly singular, so its inverse contains enormous numbers.
# For example
X = [
[4, 4, 4],
[3, 5, 5],
[5, 1, 1],
[5, 4, 4],
[7, 5, 5],
[4, 5, 5.00000001],
] # columns 2 and 3 are almost exact duplicates
X = np.array(X)
y = [1,2,3,1,2,3]
X, y
(array([[4. , 4. , 4. ],
[3. , 5. , 5. ],
[5. , 1. , 1. ],
[5. , 4. , 4. ],
[7. , 5. , 5. ],
[4. , 5. , 5.00000001]]),
[1, 2, 3, 1, 2, 3])
# Calculate normal equation
XTX = X.T.dot(X)
XTX
array([[140. , 111. , 111.00000004],
[111. , 108. , 108.00000005],
[111.00000004, 108.00000005, 108.0000001 ]])
XTX_inv = np.linalg.inv(XTX)
XTX_inv
array([[ 3.86409478e-02, -1.26839821e+05, 1.26839770e+05],
[-1.26839767e+05, 2.88638033e+14, -2.88638033e+14],
[ 1.26839727e+05, -2.88638033e+14, 2.88638033e+14]])
XTX_inv.dot(X.T).dot(y)
array([ -0.19390888, -3618543.74936484, 3618546.42894508])
As we can see, the first weight is about -0.194 while the other two are around ±3.6 * 10^6: the near-duplicate columns make the weights huge and unstable, which is exactly what inflated our RMSE before.
To tackle this problem, we add a small number to the diagonal of XTX.
# let's see how this works on a small example
XTX = [
[1,2,2],
[2,1,1.0001],
[2,1.0001,1]
]
XTX = np.array(XTX)
XTX
array([[1. , 2. , 2. ],
[2. , 1. , 1.0001],
[2. , 1.0001, 1. ]])
np.linalg.inv(XTX) # 1
array([[ -0.33335556, 0.33333889, 0.33333889],
[ 0.33333889, -5000.08333472, 4999.91666528],
[ 0.33333889, 4999.91666528, -5000.08333472]])
- The entries in the 2nd and 3rd columns of the inverse are very large
- Let's add a small number (0.01) to the diagonal and check the result
# same matrix as above, but with 0.01 added to each diagonal entry
XTX = [
[1.01,2,2],
[2,1.01,1.0001],
[2,1.0001,1.01]
]
XTX = np.array(XTX)
XTX
array([[1.01 , 2. , 2. ],
[2. , 1.01 , 1.0001],
[2. , 1.0001, 1.01 ]])
np.linalg.inv(XTX) # 2
array([[ -0.3367115 , 0.33501965, 0.33501965],
[ 0.33501965, 50.42045804, -50.58964297],
[ 0.33501965, -50.58964297, 50.42045804]])
- Comparing # 1 and # 2, the large values have shrunk considerably
# How to add the small number to the diagonal programmatically
XTX = [
[1,2, 2],
[2,1,1.0001],
[2,1.0001,1]
]
XTX
[[1, 2, 2], [2, 1, 1.0001], [2, 1.0001, 1]]
XTX = XTX + 0.01 * np.eye(3)
XTX
array([[1.01 , 2. , 2. ],
[2. , 1.01 , 1.0001],
[2. , 1.0001, 1.01 ]])
np.linalg.inv(XTX)
array([[ -0.3367115 , 0.33501965, 0.33501965],
[ 0.33501965, 50.42045804, -50.58964297],
[ 0.33501965, -50.58964297, 50.42045804]])
# Applying the regularization parameter in our linear regression function
def linear_regression_reg(X, y, r=0.001):
ones = np.ones(X.shape[0]) # Creating bias term for dataset
X = np.column_stack([ones, X]) # Adding column wise, bias with dataset
XTX = X.T.dot(X) # Matrix Multiplication
XTX = XTX + r * np.eye(XTX.shape[0]) # Adding regularization parameter at the diagonals
XTX_inv = np.linalg.inv(XTX) # Inverse of XTX
w_full = XTX_inv.dot(X.T).dot(y) # Normal equation to find the coefficients of bias and weights
return w_full[0], w_full[1:] # Bias term, Weights - w1,..wn
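What this function computes is the closed-form ridge (L2-regularized) regression solution, w = (X^T X + r*I)^(-1) X^T y. Adding r to the diagonal keeps X^T X well conditioned, so the inverse no longer blows up when columns are (almost) duplicated. Note that this simple version also adds r to the row belonging to the column of ones, so the bias term gets regularized as well.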
X_train = prepare_X(df_train)
w0, w = linear_regression_reg(X_train, y_train, r=0.01)
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
0.4608208286029829
Our earlier RMSE value was 0.51, and with regularization it dropped to 0.46.
Next we have to tune 'r' to find the value that gives the lowest RMSE.
Tuning the Model
for r_value in [0.0,0.01,0.001,0.0001,0.00001,10]: # Manually given r_values
# Generate training results for each r_value to find out the optimum r-value
X_train = prepare_X(df_train)
w0, w = linear_regression_reg(X_train, y_train, r=r_value)
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
score = rmse(y_val, y_pred)
print(r_value, w0, score)
0.0 -1892716650172720.2 24.780192001181355
0.01 7.1183820236285555 0.4608208286029829
0.001 7.130829068671119 0.4608158583369783
0.0001 7.139881370176266 0.46081536403011203
1e-05 3.6757689730217615 0.46081532315296586
10 4.729512585698256 0.472609877266825
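A small extension (my own sketch, not part of the course notebook) is to collect the scores in a dictionary and pick the r with the lowest validation RMSE automatically instead of reading it off the printout:

# sketch: select the best regularization parameter programmatically
scores = {}
X_train = prepare_X(df_train)
X_val = prepare_X(df_val)
for r_value in [0.0, 0.00001, 0.0001, 0.001, 0.01, 10]:
    w0, w = linear_regression_reg(X_train, y_train, r=r_value)
    y_pred = w0 + X_val.dot(w)
    scores[r_value] = rmse(y_val, y_pred)
best_r = min(scores, key=scores.get)   # r value with the smallest validation RMSE
best_r, scores[best_r]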
Using the Model
- Apply everything
- Combine train + validation as train dataset
- Compare results with test dataset
# Combine df_train and df_val as single train dataset
df_full_train = pd.concat([df_train,df_val])
df_full_train.head()
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2735 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
6720 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
5878 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 |
11190 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 |
4554 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 |
# Since the index is still shuffled, let's reset it
df_full_train = df_full_train.reset_index(drop=True) # drop = True will drop existing index
df_full_train.head()
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
2 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 |
3 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 |
4 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 |
df_full_train
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
2 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 |
3 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 |
4 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9527 | volvo | v60 | 2015 | regular_unleaded | 240.0 | 4.0 | automatic | front_wheel_drive | 4.0 | luxury | midsize | wagon | 37 | 25 | 870 |
9528 | maserati | granturismo_convertible | 2015 | premium_unleaded_(required) | 444.0 | 8.0 | automatic | rear_wheel_drive | 2.0 | exotic,luxury,high-performance | midsize | convertible | 20 | 13 | 238 |
9529 | cadillac | escalade_hybrid | 2013 | regular_unleaded | 332.0 | 8.0 | automatic | rear_wheel_drive | 4.0 | luxury,hybrid | large | 4dr_suv | 23 | 20 | 1624 |
9530 | mitsubishi | lancer | 2016 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 4.0 | NaN | compact | sedan | 34 | 24 | 436 |
9531 | kia | sorento | 2015 | regular_unleaded | 290.0 | 6.0 | automatic | front_wheel_drive | 4.0 | crossover | midsize | 4dr_suv | 25 | 18 | 1720 |
9532 rows × 15 columns
# The feature engineering function (prepare_X) written earlier
def prepare_X(df):
# copying just to make sure adding/modifying new features shouldn't affect original data
df = df.copy()
features = base.copy() # Creating Copy of base features
df['age'] = 2017 - df['year'] # creating new feature in copied dataset
features.append('age') # adding 'age' feature with existing features in base list
for doors in [2,3,4]:
# Creating new columns for each door value
# Convert them into binary value wherever condition meets (One_Hot Encoder)
df['num_doors_%s' %doors] = (df.number_of_doors == doors).astype(int)
features.append('num_doors_%s' %doors) # Adding new features to existing feature list
for name, values in categorical.items():
# name - name of the column
# values - top most value in each column and looped to create as new feature
for value in values:
df['%s_%s' % (name, value)] = (df[name] == value).astype(int)
features.append('%s_%s' % (name, value))
df_num = df[features]
df_num = df_num.fillna(0)
X = df_num.values
return X
# prepare and clean full dataset
X_full_train = prepare_X(df_full_train)
X_full_train
array([[148., 4., 33., ..., 1., 0., 0.],
[132., 4., 32., ..., 0., 0., 1.],
[148., 4., 37., ..., 0., 0., 1.],
...,
[332., 8., 23., ..., 0., 0., 0.],
[148., 4., 34., ..., 0., 0., 0.],
[290., 6., 25., ..., 0., 0., 0.]])
# Combine y values of train and validation dataset together
y_full_train = np.concatenate([y_train,y_val])
y_full_train
array([ 9.57574708, 9.887663 , 9.89323518, ..., 11.21756062,
9.77542688, 10.1924563 ])
# train the x_full_train and y_full_train dataset to find the coefficients
w0, w = linear_regression_reg(X_full_train, y_full_train, r=0.001)
# Apply w0, w to find the prediction values
X_test = prepare_X(df_test) # feature engineering on the test set
y_pred = w0 + X_test.dot(w) # predictions on the test set
score = rmse(y_test, y_pred) # comparing the actual y values with the predictions
score
0.46007539687771004
Testing Out the Model
- Let's apply the model to unseen data and check its performance
df_test.iloc[5]
make audi
model a3
year 2015
engine_fuel_type premium_unleaded_(recommended)
engine_hp 220.0
engine_cylinders 4.0
transmission_type automated_manual
driven_wheels all_wheel_drive
number_of_doors 2.0
market_category luxury
vehicle_size compact
vehicle_style convertible
highway_mpg 32
city_mpg 23
popularity 3105
Name: 1027, dtype: object
car = df_test.iloc[5].to_dict()
car
{'make': 'audi',
'model': 'a3',
'year': 2015,
'engine_fuel_type': 'premium_unleaded_(recommended)',
'engine_hp': 220.0,
'engine_cylinders': 4.0,
'transmission_type': 'automated_manual',
'driven_wheels': 'all_wheel_drive',
'number_of_doors': 2.0,
'market_category': 'luxury',
'vehicle_size': 'compact',
'vehicle_style': 'convertible',
'highway_mpg': 32,
'city_mpg': 23,
'popularity': 3105}
# To get a prediction for this car,
# we have to apply the same feature engineering as for the training data (the prepare_X function)
# To do that, we first convert the dictionary into a dataframe,
# since prepare_X accepts a dataframe only
df_test_car = pd.DataFrame([car])
df_test_car
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a3 | 2015 | premium_unleaded_(recommended) | 220.0 | 4.0 | automated_manual | all_wheel_drive | 2.0 | luxury | compact | convertible | 32 | 23 | 3105 |
# Now apply prepare_X to create the age, number-of-doors and categorical features
X_test_car = prepare_X(df_test_car)
# Let's check our model's prediction for this test car
y_pred = w0 + X_test_car.dot(w) # We already know w0, w
y_pred = y_pred[0]
y_pred
10.474552314979627
# The model predicts log prices; convert back to see the actual MSRP
np.expm1(y_pred)
35402.01676990125
# Let's compare with the actual y value
np.expm1(y_test[5])
41149.999999999985
Comparing the two, our prediction is about $5K below the actual price, which is a reasonable result given that we did only a little feature engineering and trained on just a handful of variables.
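To round things off, here is a minimal sketch (my own wrapper, assuming the w0 and w obtained from the full training above) that turns the whole pipeline into a single prediction helper for any car given as a dictionary:

def predict_price(car_dict, w0, w):
    """Predict the MSRP (in dollars) of a single car described as a dictionary."""
    df_car = pd.DataFrame([car_dict])   # prepare_X expects a dataframe
    X_car = prepare_X(df_car)           # same feature engineering as for training
    log_price = w0 + X_car.dot(w)       # prediction in log space
    return np.expm1(log_price[0])       # undo the log transform

predict_price(car, w0, w)  # roughly 35402 for the audi a3 above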