[Regression and Prediction] The Effect of Advertising on Sales
An Introduction to Supervised Learning: Regression
Problem
- Quantifying the effect of advertising on sales is an interesting applied regression problem. Advertising comes through various channels such as newspaper, TV, and radio.
- In this case study, we will look at one company's advertising data and consider the effect of its advertising.
- We will also predict sales while taking the various advertising parameters into account.
Data Information
There are three features related to advertising spend, and the target variable is net sales. The attributes are as follows.
- TV - independent variable quantifying the TV advertising budget
- Radio - independent variable quantifying the radio advertising budget
- Newspaper - independent variable quantifying the newspaper advertising budget
- Sales - dependent variable
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
Ad_df = pd.read_csv('data/Advertising.csv')
Ad_df.head()
# we can drop the first column as it is just the index
Ad_df.drop(columns = 'Unnamed: 0', inplace=True)
Ad_df.info()
>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TV 200 non-null float64
1 Radio 200 non-null float64
2 Newspaper 200 non-null float64
3 Sales 200 non-null float64
dtypes: float64(4)
memory usage: 6.4 KB
- All columns are of float data type.
Now let's start with simple linear regression. We will use one feature at a time and examine its relationship with the target variable.
# The dataset is stored in a pandas DataFrame. Let us pull out each variable as a NumPy array.
Sales = Ad_df.Sales.values.reshape(len(Ad_df['Sales']), 1)
TV = Ad_df.TV.values.reshape(len(Ad_df['Sales']), 1)
Radio = Ad_df.Radio.values.reshape(len(Ad_df['Sales']), 1)
Newspaper = Ad_df.Newspaper.values.reshape(len(Ad_df['Sales']), 1)
# let us fit the simple linear regression model with the TV feature
tv_model = LinearRegression()
tv_model.fit(TV, Sales)
coeffs_tv = np.array(list(tv_model.intercept_.flatten()) + list(tv_model.coef_.flatten()))  # flatten(): converts to a 1-D array and returns a copy, so these values won't change even if the fitted coefficients do
coeffs_tv = list(coeffs_tv)
# let us fit the simple linear regression model with the Radio feature
radio_model = LinearRegression()
radio_model.fit(Radio, Sales)
coeffs_radio = np.array(list(radio_model.intercept_.flatten()) + list(radio_model.coef_.flatten()))
coeffs_radio = list(coeffs_radio)
# let us fit the simple linear regression model with the Newspaper feature
newspaper_model = LinearRegression()
newspaper_model.fit(Newspaper, Sales)
coeffs_newspaper = np.array(list(newspaper_model.intercept_.flatten()) + list(newspaper_model.coef_.flatten()))
coeffs_newspaper = list(coeffs_newspaper)
coeffs_tv # b0 and b1
>>> [7.032593549127693, 0.047536640433019764]
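These two numbers define the fitted line Sales ≈ 7.033 + 0.0475 × TV. As a minimal sketch (the budget value 100 below is just an illustrative input), the fitted model can be used directly for a prediction:
# Predicted sales for a hypothetical TV budget of 100 (same units as the TV column);
# equivalent to coeffs_tv[0] + coeffs_tv[1] * 100
example_budget = np.array([[100.0]])
print(tv_model.predict(example_budget))  # roughly 7.03 + 0.0475 * 100 ≈ 11.79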
dict_Sales = {}
dict_Sales["TV"] = coeffs_tv
dict_Sales["Radio"] = coeffs_radio
dict_Sales["Newspaper"] = coeffs_newspaper
metric_Df_SLR = pd.DataFrame(dict_Sales)
metric_Df_SLR.index = ['intercept', 'Coefficient']
metric_Df_SLR
# Let us now calculate R^2
tv_rsq = tv_model.score(TV, Sales)
radio_rsq = radio_model.score(Radio, Sales)
newspaper_rsq = newspaper_model.score(Newspaper, Sales)
print("TV simple linear regression R-Square :", tv_rsq)
print("Radio simple linear regression R-Square :", radio_rsq)
print("Newspaper simple linear regression R-Square :", newspaper_rsq)
list_rsq = [tv_rsq, radio_rsq, newspaper_rsq]
list_rsq
>>>
TV simple linear regression R-Square : 0.611875050850071
Radio simple linear regression R-Square : 0.33203245544529525
Newspaper simple linear regression R-Square : 0.05212044544430516
[0.611875050850071, 0.33203245544529525, 0.05212044544430516]
metric_Df_SLR.loc['R-Squared'] = list_rsq
metric_Df_SLR
- We can see that TV has the highest R², at about 61%, followed by Radio and then Newspaper.
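For reference, .score() here returns R² = 1 − SS_res / SS_tot. A minimal sketch verifying the TV value by hand, using the TV, Sales, and tv_model objects defined above:
# Manual R^2 for the TV model: 1 - (residual sum of squares / total sum of squares)
pred = tv_model.predict(TV)
ss_res = ((Sales - pred) ** 2).sum()
ss_tot = ((Sales - Sales.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # should match tv_model.score(TV, Sales), about 0.612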
Visualizing the best-fit line with a regression plot
plt.scatter(TV, Sales, color='red')
plt.xlabel('TV Ads')
plt.ylabel('Sales')
plt.plot(TV, tv_model.predict(TV), color='blue', linewidth=3)
plt.show()
plt.scatter(Radio, Sales, color='red')
plt.xlabel('Radio Ads')
plt.ylabel('Sales')
plt.plot(Radio, radio_model.predict(Radio), color='blue', linewidth=3)
plt.show()
plt.scatter(Newspaper, Sales, color='red')
plt.xlabel('Newspaper Ads')
plt.ylabel('Sales')
plt.plot(Newspaper, newspaper_model.predict(Newspaper), color='blue', linewidth=3)
plt.show()
Multiple Linear Regression
mlr_model = LinearRegression()
mlr_model.fit(Ad_df[['TV', 'Radio', 'Newspaper']], Ad_df['Sales'])
Ad_df['Sales_Predicted'] = mlr_model.predict(Ad_df[['TV', 'Radio', 'Newspaper']])
Ad_df['Error'] = (Ad_df['Sales_Predicted'] - Ad_df['Sales'])**2
MSE_MLR = Ad_df['Error'].mean()
MSE_MLR
>>> 2.784126314510936
mlr_model.score(Ad_df[['TV', 'Radio', 'Newspaper']], Ad_df['Sales'])
>>> 0.8972106381789522
The R² of the multiple linear regression came out to 89.7%, higher than any of the simple linear regressions.
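Keep in mind that plain R² never decreases when more features are added, so adjusted R² is a fairer comparison across models with different numbers of predictors. A minimal sketch, assuming n = 200 observations and p = 3 predictors:
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 = mlr_model.score(Ad_df[['TV', 'Radio', 'Newspaper']], Ad_df['Sales'])
n, p = len(Ad_df), 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2)  # slightly below the raw 0.897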
Now let's use statsmodels for model interpretation.
import statsmodels.formula.api as smf
lm1 = smf.ols(formula = 'Sales ~ TV+Radio+Newspaper', data = Ad_df).fit()
lm1.params
>>>
Intercept 2.938889
TV 0.045765
Radio 0.188530
Newspaper -0.001037
dtype: float64
print(lm1.summary()) # Inferential statistics
print("*************Parameters**************")
print(lm1.params)
print("*************P-Values**************")
print(lm1.pvalues)
print("************Standard Errors***************")
print(lm1.bse)
print("*************Confidence Interval**************")
print(lm1.conf_int())
print("*************Error Covariance Matrix**************")
print(lm1.cov_params())
*************Parameters**************
Intercept 2.938889
TV 0.045765
Radio 0.188530
Newspaper -0.001037
dtype: float64
*************P-Values**************
Intercept 1.267295e-17
TV 1.509960e-81
Radio 1.505339e-54
Newspaper 8.599151e-01
dtype: float64
************Standard Errors***************
Intercept 0.311908
TV 0.001395
Radio 0.008611
Newspaper 0.005871
dtype: float64
*************Confidence Interval**************
0 1
Intercept 2.323762 3.554016
TV 0.043014 0.048516
Radio 0.171547 0.205513
Newspaper -0.012616 0.010541
*************Error Covariance Matrix**************
Intercept TV Radio Newspaper
Intercept 0.097287 -2.657273e-04 -1.115489e-03 -5.910212e-04
TV -0.000266 1.945737e-06 -4.470395e-07 -3.265950e-07
Radio -0.001115 -4.470395e-07 7.415335e-05 -1.780062e-05
Newspaper -0.000591 -3.265950e-07 -1.780062e-05 3.446875e-05
Let's visualize the confidence intervals of the simple linear regressions.
import seaborn as sns
sns.lmplot(x = 'TV', y = 'Sales', data = Ad_df)
sns.lmplot(x = 'Radio', y = 'Sales', data = Ad_df)
sns.lmplot(x = 'Newspaper', y = 'Sales', data = Ad_df)
Model Evaluation: Cross-Validation and Bootstrapping
- We saw that, based on its p-value, Newspaper could be omitted from the list of important variables (a quick check of this is sketched right after this list).
- Now let's add an interaction feature, the product of TV and Radio, and run the regression again.
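As a quick sanity check of the p-value-based observation above, a reduced model without Newspaper can be refit; its R² should barely differ from the full model's (a minimal sketch, reusing the smf import from the earlier cell; the name lm_reduced is just illustrative):
# Refit without Newspaper; R-squared should remain close to the full model's ~0.897
lm_reduced = smf.ols(formula = 'Sales ~ TV + Radio', data = Ad_df).fit()
print(lm_reduced.rsquared)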
Ad_df['TVandRadio'] = Ad_df['TV']*Ad_df['Radio']
# let us remove the sales_predicted and the error column generated earlier
Ad_df.drop(columns = ["Error", "Sales_Predicted"], inplace = True)
# Let us do the modelling with the new feature.
import statsmodels.formula.api as smf
lm2 = smf.ols(formula = 'Sales ~ TV+Radio+Newspaper+TVandRadio', data = Ad_df).fit()
lm2.params
>>>
Intercept 6.728412
TV 0.019067
Radio 0.027992
Newspaper 0.001444
TVandRadio 0.001087
dtype: float64
print(lm2.summary()) # Inferencial Statistics
- We can see that R² increases. But is this actually useful for prediction? Does the model predict unseen data well? Let's check.
Ad_df.head()
Ad_df.describe()
- The coefficient of TVandRadio looks very small at about 0.001, but checking the data ranges, it is not on the same scale as Radio, Newspaper, and Sales, so rescaling is needed.
- Replace it with TVandRadio_rsc = TV (min-max normalized) × Radio (min-max normalized) × 100.
Ad_df['Radio_nor'] = Ad_df['Radio'].map(lambda x : (x - Ad_df['Radio'].min()) / (Ad_df['Radio'].max() - Ad_df['Radio'].min()))
Ad_df['TV_nor'] = Ad_df['TV'].map(lambda x : (x- Ad_df['TV'].min()) / (Ad_df['TV'].max() - Ad_df['TV'].min()))
Ad_df['TVandRadio_rsc'] = round(Ad_df['Radio_nor'] * Ad_df['TV_nor'] * 100, 1)
Ad_df = Ad_df.drop(['Radio_nor', 'TV_nor'], axis = 1)
Ad_df.describe()
lm3 = smf.ols(formula = 'Sales ~ TV+Radio+Newspaper+TVandRadio_rsc', data = Ad_df).fit()
lm3.params
>>>
Intercept 6.727052
TV 0.019078
Radio 0.028774
Newspaper 0.001462
TVandRadio_rsc 0.159436
Performance Evaluation, Testing, and Validation
Train, Test, and Validation Sets
- We will split the data into three sets: one for training the model, one for validating it, and one for testing it.
from sklearn.model_selection import train_test_split
features_base = [i for i in Ad_df.columns if i not in ("Sales", "TVandRadio")]
features_added = [i for i in Ad_df.columns if i not in ("Sales")]
target = 'Sales'
train, test = train_test_split(Ad_df, test_size = 0.10, train_size = 0.90)
train, validation = train_test_split(train, test_size = 0.20, train_size = 0.80)
train.shape, validation.shape, test.shape
>>> ((144, 5), (36, 5), (20, 5))
# now let us start with the modelling
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(train[features_base], train[target])
print("*********Training set Metrics**************")
print("R-Squared: ", mlr.score(train[features_base], train[target]))
se_train = (train[target] - mlr.predict(train[features_base]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Sqaured: ", mlr.score(validation[features_base], validation[target]))
se_val = (validation[target] - mlr.predict(validation[features_base]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)
>>>
*********Training set Metrics**************
R-Squared: 0.903216337464553
MSE: 2.7599592052059285
********Validation set Metrics**************
R-Squared: 0.8415880634273336
MSE: 3.2029463467232415
# Can we increase the model performance by adding the new feature?
# We found that to be the case in the analysis above but let's check the same for the validation dataset
mlr_added_feature = LinearRegression()
mlr_added_feature.fit(train[features_added], train[target])
print("*********Training set Metrics**************")
print("R-Squared:", mlr_added_feature.score(train[features_added], train[target]))
se_train = (train[target] - mlr_added_feature.predict(train[features_added]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Squared:", mlr_added_feature.score(validation[features_added], validation[target]))
se_val = (validation[target] - mlr_added_feature.predict(validation[features_added]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)
>>>
*********Training set Metrics**************
R-Squared: 0.9675029996104462
MSE: 0.9267100770636859
********Validation set Metrics**************
R-Squared: 0.9572813685448501
MSE: 0.8637321625919551
As features were added, R-squared increased and the error decreased. Let's now fit regularized models.
features_added
>>> ['TV', 'Radio', 'Newspaper', 'TVandRadio']
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
# fitting Ridge with the default features
ridge = Ridge()
ridge.fit(train[features_added], train[target])
print("*********Training set Metrics**************")
print("R-Squared: ", ridge.score(train[features_added], train[target]))
se_train = (train[target] - ridge.predict(train[features_added]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Squared:", ridge.score(validation[features_added], validation[target]))
se_val = (validation[target] - ridge.predict(validation[features_added]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)
>>>
*********Training set Metrics**************
R-Squared: 0.9675029995924045
MSE: 0.9267100775781751
********Validation set Metrics**************
R-Squared: 0.9572809715934404
MSE: 0.8637401885911019
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
#fitting Lasso with the default features
lasso = Lasso()
lasso.fit(train[features_added], train[target])
print("*********Training set Metrics**************")
print("R-Squared:", lasso.score(train[features_added], train[target]))
se_train = (train[target] - lasso.predict(train[features_added]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Squared:", lasso.score(validation[features_added], validation[target]))
se_val = (validation[target] - lasso.predict(validation[features_added]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)
>>>
*********Training set Metrics**************
R-Squared: 0.9666645766262986
MSE: 0.9506192077199765
********Validation set Metrics**************
R-Squared: 0.9535527261445209
MSE: 0.9391219457911119
#Let us predict on the unseen data using Ridge
rsq_test = ridge.score(test[features_added], test[target])
se_test = (test[target] - ridge.predict(test[features_added]))**2
mse_test = se_test.mean()
print("*****************Test set Metrics******************")
print("Rsquared: ", rsq_test)
print("MSE: ", mse_test)
print("Intercept is {} and Coefficients are {}".format(ridge.intercept_, ridge.coef_))
>>>
*****************Test set Metrics******************
Rsquared: 0.9752629628269709
MSE: 0.6312520830999426
Intercept is 6.644670991994743 and Coefficients are [0.01920693 0.02087446 0.00495735 0.00110206]
#Let us predict on the unseen data using lasso
rsq_test = lasso.score(test[features_added], test[target])
se_test = (test[target] - lasso.predict(test[features_added]))**2
mse_test = se_test.mean()
print("*****************Test set Metrics******************")
print("Rsquared: ", rsq_test)
print("MSE: ", mse_test)
print("Intercept is {} and Coefficients are {}".format(lasso.intercept_, lasso.coef_))
>>>
*****************Test set Metrics******************
Rsquared: 0.9724048268716567
MSE: 0.7041874254756284
Intercept is 7.139320593749483 and Coefficients are [0.01659925 0. 0.0040236 0.00121284]
- The linear model built with Ridge regularization has the lower test MSE and the higher test R-squared.
- Now let's evaluate performance using LOOCV and K-Fold cross-validation.
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
ridgeCV = Ridge()
cvs = cross_val_score(ridgeCV, Ad_df[features_added], Ad_df[target], cv = 10 )
print("Mean Score:")
print(cvs.mean(), '\n')
print("Confidence Interval:")
cvs.mean() - cvs.std(), cvs.mean() + cvs.std()
>>>
Mean Score:
0.9649887636257691
Confidence Interval:
(0.9430473456799695, 0.9869301815715688)
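Only 10-fold cross-validation is shown above; for LOOCV, the same cross_val_score call can be given a LeaveOneOut splitter. A minimal sketch, with negative MSE chosen as the scoring metric (my assumption) because per-fold R² is undefined when each test fold holds a single row:
from sklearn.model_selection import LeaveOneOut, cross_val_score
# Leave-one-out CV: each of the 200 folds holds out one observation; report the average MSE
loo_scores = cross_val_score(Ridge(), Ad_df[features_added], Ad_df[target],
                             cv = LeaveOneOut(), scoring = 'neg_mean_squared_error')
print("LOOCV mean MSE:", -loo_scores.mean())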
Using statsmodels to fit a regularized model
# You can also use the statsmodel for the regularization using the below code
import statsmodels.formula.api as smf
# We will use the below code to fit a regularized regression.
# Here, a ridge fit is performed (L1_wt = 0)
lm3 = smf.ols(formula = 'Sales ~ TV+Radio+Newspaper+TVandRadio', data = Ad_df).fit_regularized(method = 'elastic_net', L1_wt = 0) # L1_wt = 0 : Ridge fit
print("*************Parameters**************")
print(lm3.params)
>>> *************Parameters**************
[6.72841195e+00 1.90668162e-02 2.79916606e-02 1.44424442e-03
1.08733333e-03]
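fit_regularized returns the parameters as a plain array rather than a labeled Series, in the column order of the formula's design matrix. A minimal sketch of attaching the names back, assuming the same Intercept-first order seen in the labeled lm2 output above:
# Label the ridge-fit coefficients; order follows the design matrix of the formula
ridge_params = pd.Series(lm3.params,
                         index = ['Intercept', 'TV', 'Radio', 'Newspaper', 'TVandRadio'])
print(ridge_params)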
# Here, lasso regularization has been fit
lm4 = smf.ols(formula= 'Sales ~ TV+Radio+Newspaper+TVandRadio', data = Ad_df).fit_regularized(method = 'elastic_net', L1_wt = 1) # L1_wt = 1 : Lasso fit
print("*************Parameters**************")
print(lm4.params)
>>>
*************Parameters**************
Intercept 6.471607
TV 0.020325
Radio 0.036049
Newspaper 0.002202
TVandRadio 0.001044
dtype: float64
Bootstrapping
# let us get a more detailed model through statsmodels.
import statsmodels.formula.api as smf
lm2 = smf.ols(formula = 'Sales ~ TV', data = Ad_df).fit()
lm2.params
>>>
Intercept 7.032594
TV 0.047537
dtype: float64
print(lm2.summary())
# Now, let us calculate the slope 1000 times using bootstrapping
import statsmodels.formula.api as smf
Slope = []
for i in range(1000):
    bootstrap_df = Ad_df.sample(n = 200, replace = True)
    lm3 = smf.ols(formula = 'Sales ~ TV', data = bootstrap_df).fit()
    Slope.append(lm3.params.TV)
plt.xlabel('TV Ads')
plt.ylabel('Sales')
plt.plot(bootstrap_df['TV'], lm3.predict(bootstrap_df['TV']), color = 'green', linewidth = 3)
plt.scatter(Ad_df['TV'], Ad_df['Sales'], color = (0, 0, 0.5))
plt.show()
# Let's now find out the 2.5 and 97.5 percentile for the slopes obtained
import numpy as np
Slope = np.array(Slope)
Sort_Slope = np.sort(Slope)
Slope_limits = np.percentile(Sort_Slope, (2.5, 97.5))
Slope_limits
>>> array([0.04168595, 0.05293238])
# Plotting the slopes and the upper and the lower limits
plt.hist(Slope, 50)
plt.axvline(Slope_limits[0], color = 'r')
plt.axvline(Slope_limits[1], color = 'r')
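For comparison, the analytic 95% confidence interval from the single statsmodels fit should be close to these bootstrap limits (a minimal sketch reusing lm2 from above):
# Analytic 95% CI for the TV slope, to compare with the bootstrap percentile limits
print(lm2.conf_int().loc['TV'])  # roughly [0.042, 0.053]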