[Regression and Prediction] 판매에서 광고의 효과

2024. 5. 7. 15:43노트/Data Science : 데이터과학

지도학습 (Supervised Learning) 에 대한 소개 : Regression

문제

  • 판매에 대한 광고 효과를 정량화 하는것은 흥미로운 응용 회귀문제이다. 광고의 다양한 채널은 신문, TV, 라디오 등등이 있다.
  • 이 케이스에서 한 회사의 광고 데이터를 살펴보고 광고의 효과를 고려해볼것이다.
  • 또한 광고의 다양한 파라미터를 고려해 보면서 판매액을 예측해 볼 것이다.

데이터 정보

광고 소비 관련 3가지 피쳐가 있고, 타겟 변수는 순판매액이다. 속성들은 다음과 같다.

  • TV - TV 광고 예산을 수치화한 독립변수
  • Radio - Radio 광고 예산을 수치화한 독립변수
  • News - news 광고를 수치화한 독립변수
  • Sales - 종속 변수
import pandas as pd
import numpy as np
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
Ad_df = pd.read_csv('data/Advertising.csv')
Ad_df.head()

# we can drop the first column as it is just the index
Ad_df.drop(columns = 'Unnamed: 0', inplace=True)

Ad_df.info()
>>> 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB
  • 모든 데이터 타입이 실수형 데이터 타입이다.

이제 단순 선형회귀를 시작해보자. 우리는 한번에 한 피쳐를 사용할 것이고 타겟 변수를 살펴볼 것이다.

 

Ad_df.Sales.values.reshape(len(Ad_df['Sales']), 1)

# Dataset is stored in a Pandas Dataframe. Let us take out all the variables in a numpy array.
Sales = Ad_df.Sales.values.reshape(len(Ad_df['Sales']), 1)
TV = Ad_df.TV.values.reshape(len(Ad_df['Sales']), 1)
Radio = Ad_df.Radio.values.reshape(len(Ad_df['Sales']), 1)
Newspaper = Ad_df.Newspaper.values.reshape(len(Ad_df['Sales']), 1)

 

# let us fit the simple linear regression model with the TV feature
tv_model = LinearRegression()
tv_model.fit(TV, Sales)
coeffs_tv = np.array(list(tv_model.intercept_.flatten()) + list(tv_model.coef_.flatten())) ## flatten함수 : 1차원 배열로 변환 + 원래 계수가 변화해도 값이 바뀌지 않도록 복사 
coeffs_tv = list(coeffs_tv)

# let us fit the simple linear regression model with the Radio feature
radio_model = LinearRegression()
radio_model.fit(Radio, Sales)
coeffs_radio = np.array(list(radio_model.intercept_.flatten()) + list(radio_model.coef_.flatten()))
coeffs_radio = list(coeffs_radio)

# let us fit the simple linear regression model with the Newspaper feature
newspaper_model = LinearRegression() 
newspaper_model.fit(Newspaper, Sales)
coeffs_newspaper = np.array(list(newspaper_model.intercept_.flatten()) + list(newspaper_model.coef_.flatten()))
coeffs_newspaper = list(coeffs_newspaper)
coeffs_tv # b0 과 b1 
>>> [7.032593549127693, 0.047536640433019764]
dict_Sales = {}
dict_Sales["TV"] = coeffs_tv
dict_Sales["Radio"] = coeffs_radio 
dict_Sales["Newspaper"] = coeffs_newspaper 

metric_Df_SLR = pd.DataFrame(dict_Sales)
metric_Df_SLR.index = ['intercept', 'Coefficient']
metric_Df_SLR

# Let us now calculate R^2
tv_rsq = tv_model.score(TV, Sales)
radio_rsq = radio_model.score(Radio, Sales)
newspaper_rsq  = newspaper_model.score(Newspaper, Sales)

print("TV simple linear regression R-Square :", tv_rsq)
print("Radio simple linear regression R-Square :", radio_rsq)
print("Newspaper simple linear regression R-Square :", newspaper_rsq)
list_rsq = [tv_rsq, radio_rsq, newspaper_rsq]
list_rsq
>>> 
TV simple linear regression R-Square : 0.611875050850071
Radio simple linear regression R-Square : 0.33203245544529525
Newspaper simple linear regression R-Square : 0.05212044544430516
[0.611875050850071, 0.33203245544529525, 0.05212044544430516]
metric_Df_SLR.loc['R-Squared'] = list_rsq
metric_Df_SLR

  • TV가 61%의 높은 R^2 값을 가지고 있는 것을 볼 수 있습니다. 그다음은 Radio, Newspaper 순입니다.

regression plot으로 최적합선을 시각화

plt.scatter(TV, Sales,  color='red')
plt.xlabel('TV Ads')
plt.ylabel('Sales')
plt.plot(TV, tv_model.predict(TV), color='blue', linewidth=3)
plt.show()

plt.scatter(Radio, Sales,  color='red')
plt.xlabel('Radio Ads')
plt.ylabel('Sales')
plt.plot(Radio, radio_model.predict(Radio), color='blue', linewidth=3)
plt.show()

plt.scatter(Newspaper, Sales,  color='red')
plt.xlabel('Newspaper Ads')
plt.ylabel('Sales')
plt.plot(Newspaper, newspaper_model.predict(Newspaper), color='blue', linewidth=3)
plt.show()

다중선형회귀 (Multiple Linear Regression)

mlr_model = LinearRegression() 
mlr_model.fit(Ad_df[['TV', 'Radio', 'Newspaper']], Ad_df['Sales'])
Ad_df['Sales_Predicted'] = mlr_model.predict(Ad_df[['TV', 'Radio', 'Newspaper']])
Ad_df['Error'] = (Ad_df['Sales_Predicted'] - Ad_df['Sales'])**2 
MSE_MLR = Ad_df['Error'].mean()
MSE_MLR
>>> 2.784126314510936
mlr_model.score(Ad_df[['TV', 'Radio', 'Newspaper']], Ad_df['Sales'])
>>> 0.8972106381789522

다중 선형 회귀의 𝑅2 값이 89.7 %로 나왔습니다. 단순 선형 회귀보다 더 높은 수치입니다.

이제 모델 해석을 위해 statsmodel을 사용해봅시다.

 

import statsmodels.formula.api as smf 
lm1 = smf.ols(formula = 'Sales ~ TV+Radio+Newspaper', data = Ad_df).fit() 
lm1.params
>>> 
Intercept    2.938889
TV           0.045765
Radio        0.188530
Newspaper   -0.001037
dtype: float64
print(lm1.summary()) # Inferential statistics

print("*************Parameters**************")
print(lm1.params)
print("*************P-Values**************")
print(lm1.pvalues)
print("************Standard Errors***************")
print(lm1.bse) 
print("*************Confidence Interval**************")
print(lm1.conf_int())
print("*************Error Covariance Matrix**************")
print(lm1.cov_params())
*************Parameters**************
Intercept    2.938889
TV           0.045765
Radio        0.188530
Newspaper   -0.001037
dtype: float64
*************P-Values**************
Intercept    1.267295e-17
TV           1.509960e-81
Radio        1.505339e-54
Newspaper    8.599151e-01
dtype: float64
************Standard Errors***************
Intercept    0.311908
TV           0.001395
Radio        0.008611
Newspaper    0.005871
dtype: float64
*************Confidence Interval**************
                  0         1
Intercept  2.323762  3.554016
TV         0.043014  0.048516
Radio      0.171547  0.205513
Newspaper -0.012616  0.010541
*************Error Covariance Matrix**************
           Intercept            TV         Radio     Newspaper
Intercept   0.097287 -2.657273e-04 -1.115489e-03 -5.910212e-04
TV         -0.000266  1.945737e-06 -4.470395e-07 -3.265950e-07
Radio      -0.001115 -4.470395e-07  7.415335e-05 -1.780062e-05
Newspaper  -0.000591 -3.265950e-07 -1.780062e-05  3.446875e-05

단순 선형회귀의 신뢰 구간을 시각화해 봅시다.

import seaborn as sns 
sns.lmplot(x = 'TV', y = 'Sales', data = Ad_df)
sns.lmplot(x = 'Radio', y = 'Sales', data = Ad_df)
sns.lmplot(x = 'Newspaper', y = 'Sales', data = Ad_df)

모델 평가 : Cross validation and Bootstrapping

  • newspaper는 p-value에 근거하여 중요 변수 리스트에서 생략될 수 있다는 것을 확인했습니다.
  • 이제 각 피쳐들의 곱한 피쳐를 추가하여 회귀 분석을 수행해봅시다.
Ad_df['TVandRadio'] = Ad_df['TV']*Ad_df['Radio']

# let us remove the sales_predicted and the error column generated earlier
Ad_df.drop(columns = ["Error", "Sales_Predicted"], inplace = True)

# Let us do the modelling with the new feature.
import statsmodels.formula.api as smf 
lm2 = smf.ols(formula = 'Sales ~ TV+Radio+Newspaper+TVandRadio', data = Ad_df).fit()
lm2.params
>>> 
Intercept     6.728412
TV            0.019067
Radio         0.027992
Newspaper     0.001444
TVandRadio    0.001087
dtype: float64
print(lm2.summary()) # Inferencial Statistics

  • R-square가 증가함을 확인할 수 있습니다. 그러나 prediction에 유용한 방법일까요? 보여지지 않는 데이터를 잘 예측할까요? 확인해 봅시다.
Ad_df.head()

Ad_df.describe()

  • TV and Radio 의 계수가 0.001 로 매우 작은것으로 보이나, 데이터 범위를 확인해보면 Radio랑 Newspaper랑 Sales와 같은 스케일의 값이 아니라서 Rescaling이 필요함
  • TVandradio_rescale = TV(normalization) * Radio(normalization) * 100 값으로 대체
Ad_df['Radio_nor'] = Ad_df['Radio'].map(lambda x : (x - Ad_df['Radio'].min()) / (Ad_df['Radio'].max() - Ad_df['Radio'].min()))
Ad_df['TV_nor'] = Ad_df['TV'].map(lambda x : (x- Ad_df['TV'].min()) / (Ad_df['TV'].max() - Ad_df['TV'].min()))
Ad_df['TVandRadio_rsc']  = round(Ad_df['Radio_nor'] * Ad_df['TV_nor'] * 100, 1)

Ad_df = Ad_df.drop(['Radio_nor', 'TV_nor'], axis = 1) 

Ad_df.describe()

lm3 = smf.ols(formula = 'Sales ~ TV+Radio+Newspaper+TVandRadio_rsc', data = Ad_df).fit()
lm3.params
>>> 
Intercept         6.727052
TV                0.019078
Radio             0.028774
Newspaper         0.001462
TVandRadio_rsc    0.159436

퍼포먼스 평가, testing 과 validation

Train, Test, and Validation set

  • 하나는 모델 학습용, 하나는 모델 검증용, 마지막은 모델 테스트용으로 3개 데이터셋으로 분할하겠습니다.
from sklearn.model_selection import train_test_split

features_base = [i for i in Ad_df.columns if i not in ("Sales", "TVandRadio")]
features_added = [i for i in Ad_df.columns if i not in ("Sales")]
target = 'Sales'
train, test = train_test_split(Ad_df, test_size = 0.10, train_size = 0.90)
train, validation = train_test_split(train, test_size = 0.20, train_size = 0.80)

train.shape, validation.shape, test.shape
>>> ((144, 5), (36, 5), (20, 5))

 

# now let us start with the modelling
from sklearn.linear_model import LinearRegression 

mlr = LinearRegression() 
mlr.fit(train[features_base], train[target]) 
print("*********Training set Metrics**************")
print("R-Squared: ", mlr.score(train[features_base], train[target]))

se_train = (train[target] - mlr.predict(train[features_base]))**2
mse_train = se_train.mean() 
print('MSE: ', mse_train)

print("********Validation set Metrics**************")
print("R-Sqaured: ", mlr.score(validation[features_base], validation[target]))

se_val = (validation[target] - mlr.predict(validation[features_base]))**2
mse_val = se_val.mean() 
print('MSE: ', mse_val)

>>> 
*********Training set Metrics**************
R-Squared:  0.903216337464553
MSE:  2.7599592052059285
********Validation set Metrics**************
R-Sqaured:  0.8415880634273336
MSE:  3.2029463467232415

 

# Can we increase the model performance by adding the new feature? 
# We found that to be the case in the analysis above but let's check the same for the validation dataset


mlr_added_feature = LinearRegression()
mlr_added_feature.fit(train[features_added], train[target])
print("*********Training set Metrics**************")
print("R-Squared:", mlr_added_feature.score(train[features_added], train[target]))
se_train = (train[target] - mlr_added_feature.predict(train[features_added]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Squared:", mlr_added_feature.score(validation[features_added], validation[target]))
se_val = (validation[target] - mlr_added_feature.predict(validation[features_added]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)
>>> 
*********Training set Metrics**************
R-Squared: 0.9675029996104462
MSE:  0.9267100770636859
********Validation set Metrics**************
R-Squared: 0.9572813685448501
MSE:  0.8637321625919551

 

feature를 증가함에 따라 R-sqaured 가 증가했고, error는 감소함을 확인했습니다. 정규화된 모델에 피팅해봅시다.

features_added
>>> ['TV', 'Radio', 'Newspaper', 'TVandRadio']
from sklearn.linear_model import Ridge 
from sklearn.linear_model import Lasso 

# fitting Ridge with the default features 
ridge = Ridge() 
ridge.fit(train[features_added], train[target])

print("*********Training set Metrics**************")
print("R-Squared: ", ridge.score(train[features_added], train[target]))
se_train = (train[target] - ridge.predict(train[features_added]))**2
mse_train = se_train.mean() 
print('MSE: ', mse_train)

print("********Validation set Metrics**************")
print("R-Squared:", ridge.score(validation[features_added], validation[target]))
se_val = (validation[target] - ridge.predict(validation[features_added]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)
>>> 
*********Training set Metrics**************
R-Squared:  0.9675029995924045
MSE:  0.9267100775781751
********Validation set Metrics**************
R-Squared: 0.9572809715934404
MSE:  0.8637401885911019
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

#fitting Lasso with the default features
lasso = Lasso()
lasso.fit(train[features_added], train[target])

print("*********Training set Metrics**************")
print("R-Squared:", lasso.score(train[features_added], train[target]))
se_train = (train[target] - lasso.predict(train[features_added]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Squared:", lasso.score(validation[features_added], validation[target]))
se_val = (validation[target] - lasso.predict(validation[features_added]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)
>>> 
*********Training set Metrics**************
R-Squared: 0.9666645766262986
MSE:  0.9506192077199765
********Validation set Metrics**************
R-Squared: 0.9535527261445209
MSE:  0.9391219457911119
#Let us predict on the unseen data using Ridge

rsq_test = ridge.score(test[features_added], test[target])
se_test = (test[target] - ridge.predict(test[features_added]))**2
mse_test = se_test.mean()

print("*****************Test set Metrics******************")

print("Rsquared: ", rsq_test)
print("MSE: ", mse_test)
print("Intercept is {} and Coefficients are {}".format(ridge.intercept_, ridge.coef_))
>>> 
*****************Test set Metrics******************
Rsquared:  0.9752629628269709
MSE:  0.6312520830999426
Intercept is 6.644670991994743 and Coefficients are [0.01920693 0.02087446 0.00495735 0.00110206]
#Let us predict on the unseen data using lasso

rsq_test = lasso.score(test[features_added], test[target])
se_test = (test[target] - lasso.predict(test[features_added]))**2
mse_test = se_test.mean()

print("*****************Test set Metrics******************")

print("Rsquared: ", rsq_test)
print("MSE: ", mse_test)
print("Intercept is {} and Coefficients are {}".format(lasso.intercept_, lasso.coef_))
>>> 
*****************Test set Metrics******************
Rsquared:  0.9724048268716567
MSE:  0.7041874254756284
Intercept is 7.139320593749483 and Coefficients are [0.01659925 0.         0.0040236  0.00121284]
  • Lidge 정규화를 사용하여 선형모델을 생성한 것이 MSE가 더 작고 R-sqaured가 더 높음
  • 이제 LooCV 와 KFold 방식을 사용해서 성능을 평가해볼 것이다.

 

from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

ridgeCV = Ridge() 
cvs = cross_val_score(ridgeCV, Ad_df[features_added], Ad_df[target], cv = 10 ) 
print("Mean Score:")
print(cvs.mean(), '\n')
print("Confidence Interval:")
cvs.mean() - cvs.std(), cvs.mean() + cvs.std() 

>>> 
Mean Score:
0.9649887636257691 

Confidence Interval:
(0.9430473456799695, 0.9869301815715688)

Statsmodel to fit regularized model

# You can also use the statsmodel for the regularization using the below code
import statsmodels.formula.api as smf
# We will use the below code to fit a regularized regression.

# Here, lasso is fit
lm3 = smf.ols(formula = 'Sales ~ TV+Radio+Newspaper+TVandRadio', data = Ad_df).fit_regularized(method = 'elastic_net', L1_wt = 0)  # L1_wt = 0 : Ridge fit 
print("*************Parameters**************")
print(lm3.params)
>>> *************Parameters**************
[6.72841195e+00 1.90668162e-02 2.79916606e-02 1.44424442e-03
 1.08733333e-03]

 

# Here, lasso regularization has been fit
lm4 = smf.ols(formula= 'Sales ~ TV+Radio+Newspaper+TVandRadio', data = Ad_df).fit_regularized(method = 'elastic_net', L1_wt = 1) # L1_wt : 1 Lasso fit 
print("*************Parameters**************")
print(lm4.params)
>>> 
*************Parameters**************
Intercept     6.471607
TV            0.020325
Radio         0.036049
Newspaper     0.002202
TVandRadio    0.001044
dtype: float64

 

https://datascienceschool.net/03%20machine%20learning/06.05%20%EC%A0%95%EA%B7%9C%ED%99%94%20%EC%84%A0%ED%98%95%ED%9A%8C%EA%B7%80.html

 

6.5 정규화 선형회귀 — 데이터 사이언스 스쿨

.ipynb .pdf to have style consistency -->

datascienceschool.net

 

Bootstrapping

# let us get a more detailed model through statsmodel.
import statsmodels.formula.api as smf

lm2 = smf.ols(formula = 'Sales ~ TV', data = Ad_df).fit()
lm2.params
>>> 
Intercept    7.032594
TV           0.047537
dtype: float64
print(lm2.summary())

#Now, let us calculate the slopes a 1000 times using bootstrapping

import statsmodels.formula.api as smf 

Slope = [] 
for i in range(1000):
    bootstrap_df = Ad_df.sample(n = 200, replace = True) 
    lm3 = smf.ols(formula = 'Sales ~ TV', data = bootstrap_df).fit()
    Slope.append(lm3.params.TV)

    plt.xlabel('TV Ads')
    plt.ylabel('Sales')
    plt.plot(bootstrap_df['TV'], lm3.predict(bootstrap_df['TV']), color = 'green', linewidth = 3)
    
plt.scatter(Ad_df['TV'], Ad_df['Sales'], color = (0, 0, 0.5)) 
plt.show()

# Let's now find out the 2.5 and 97.5 percentile for the slopes obtained
import numpy as np

Slope = np.array(Slope)
Sort_Slope = np.sort(Slope)


Slope_limits = np.percentile(Sort_Slope, (2.5, 97.5))
Slope_limits

>>> array([0.04168595, 0.05293238])

 

# Plotting the slopes and the upper and the lower limits

plt.hist(Slope, 50)
plt.axvline(Slope_limits[0], color = 'r')
plt.axvline(Slope_limits[1], color = 'r')