[Kaggle Transcription] Titanic Tutorial 1 - Exploratory data analysis, visualization, machine learning

2021. 5. 10. 14:00ㆍReference/Kaggle

Source : https://kaggle-kr.tistory.com/17?category=868316

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

sns.set(font_scale = 2)
plt.style.use('dark_background')

import missingno as msno 

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Process

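Checking the dataset - check for null data and fix it
EDA (exploratory data analysis) - analyze individual features, check correlations, draw insights from visualization
feature engineering - one-hot encoding, splitting into classes, binning into intervals, processing text data
modeling - using sklearn (ML) / tensorflow, pytorch (DL)
training, prediction - train the model on the train set, predict on the test set
evaluation - judge whether the prediction performance reaches the desired level.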

1. Checking the dataset

df_train = pd.read_csv('./train.csv')
df_test = pd.read_csv('./test.csv')

df_train.head()

df_train.describe() # returns summary statistics for each numeric feature

df_test.describe()

1.1 Null data check

for col in df_train.columns:
    msg = 'column: {:>10}\t Percent of NaN value : {:.2f}%'.format(col, 
            100 *(df_train[col].isnull().sum() / df_train[col].shape[0]))
    print(msg)

for col in df_test.columns:
    msg = 'column: {:>10}\t Percent of NaN value : {:.2f}%'.format(col,
            100* (df_test[col].isnull().sum() / df_test[col].shape[0]))
    print(msg)
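The same percentages can be computed without an explicit loop. The following is a minimal sketch, assuming a small made-up frame (`toy` is a hypothetical stand-in for df_train): the column-wise mean of `isnull()` is the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives a boolean frame; its column-wise mean is the NaN fraction
nan_percent = toy.isnull().mean() * 100
print(nan_percent.round(2))
```

On the real data, `df_train.isnull().mean() * 100` should give the same numbers as the loop above, just as a single Series.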

msno.matrix(df=df_train.iloc[:,:], figsize=(8,8), color = (0.8,0.5,0.2))

msno.bar(df = df_train.iloc[:,:], figsize = (8,8), color = (0.8,0.5,0.2))

msno.bar(df = df_test.iloc[:, :], figsize = (8,8) , color = (0.8,0.5,0.2))

1.2 Checking the target label

f, ax = plt.subplots(1,2, figsize = (18,8))
df_train['Survived'].value_counts().plot.pie(explode=[0,0.1], 
                                             autopct = '%1.1f%%', 
                                             textprops = {'fontsize':20 , 'color':'Black'},
                                             ax= ax[0], 
                                             shadow = True)
ax[0].set_title('Pie plot - Survived', fontsize = 20)
ax[0].set_ylabel('')

sns.countplot(x = 'Survived', data = df_train, ax = ax[1])
# ax[1].set_yticklabels(ax[1].get_yticks(), size = 15)
# ax[1].set_xticklabels(ax[1].get_xticks(), size = 15)

ax[1].set_title('Count plot - Survived', fontsize = 20)
plt.show()

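Survived : 38.4%
The target label distribution is fairly balanced.
cf) When the labels are imbalanced:
(if 99 out of 100 samples are 1 and only one is 0, a model that predicts 1 for everything still scores 99% accuracy;
if the task is to find the 0s, that 99% accuracy does not deliver the result we want)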
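The 99-to-1 example can be checked numerically. This is a small sketch with made-up labels, not part of the source notebook: a "model" that always predicts 1 reaches 99% accuracy while finding none of the 0s.

```python
import numpy as np

# 100 labels: 99 ones and a single zero, as in the example above
y_true = np.array([1] * 99 + [0])

# A "model" that always predicts 1
y_pred = np.ones_like(y_true)

accuracy = (y_true == y_pred).mean()

# Recall for class 0: fraction of true zeros that were predicted as zero
zero_mask = (y_true == 0)
recall_zero = (y_pred[zero_mask] == 0).mean()

print(accuracy)     # 0.99
print(recall_zero)  # 0.0
```

This is why accuracy alone can be misleading on imbalanced labels, and why a per-class measure such as recall is worth checking as well.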

2. Exploratory data analysis (EDA)

Visualization libraries used : matplotlib, seaborn, plotly

2.1 Pclass

Ordinal data: a categorical data type whose categories have an order.

"""
Pclass: ticket class (categorical feature)
1 : 1st
2 : 2nd 
3 : 3rd
"""
df_train[['Pclass', 'Survived']]

df_train[['Pclass', 'Survived']].groupby(['Pclass'], as_index = True).count()

df_train[['Pclass', 'Survived']].groupby(['Pclass'], as_index = True).sum()

pd.crosstab(df_train['Pclass'],
            df_train['Survived'], 
            margins = True).style.background_gradient(cmap='summer_r')

# survival rate per class
df_train[['Pclass', 'Survived']].groupby(['Pclass'], as_index = True).mean()

df_train[['Pclass', 'Survived']].groupby(['Pclass'], 
                                         as_index = True).mean().sort_values(by = 'Survived', 
                                                                             ascending = False).plot.bar()
# the better the Pclass (1st), the higher the survival rate

y_position = 1.02 
f, ax = plt.subplots(1,2, figsize = (18,8))

df_train['Pclass'].value_counts().plot.bar(
    color = ['#CD7F32', '#FFDF00', '#D3D3D3'], ax = ax[0])
ax[0].set_title('Number of Passengers By Pclass', y = y_position)
ax[0].set_ylabel('Count')

sns.countplot(x = 'Pclass', hue = 'Survived', data = df_train, ax = ax[1])
plt.setp(ax[1].get_legend().get_title(), fontsize = '21')
ax[1].set_title('Pclass: Survived vs Dead', y = y_position)
plt.show()

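Conclusion: the higher the class, the higher the chance of survival.
For Pclass 1, 2, 3 the survival rates are roughly 63%, 47%, and 24%.
Pclass appears to have a large influence on survival.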
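Per-class survival rates like these can also be read directly from a row-normalized crosstab. This is a minimal sketch on made-up data (`toy` is hypothetical and much smaller than the real train set), using the `normalize='index'` option of `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy data; the real rates come from df_train
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# normalize='index' divides each row by its total, giving per-class rates
rates = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index')
print(rates)
```

The column labelled 1 holds the survival rate of each class, matching what `groupby(['Pclass']).mean()` returns for a 0/1 target.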

2.2 Sex

f, ax = plt.subplots(1,2 ,figsize = (18,8))
df_train[['Sex', 'Survived']].groupby(['Sex'], as_index = True).mean().plot.bar(ax = ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation = 0)
ax[0].set_title('Survived vs Sex')

sns.countplot(x = 'Sex', hue = 'Survived', data = df_train, ax = ax[1])
ax[1].set_title('Sex: Survived vs Dead')
plt.show()

Women are more likely to survive.

df_train[['Sex', 'Survived']].groupby(['Sex'], 
                                      as_index = False).mean().sort_values(by = 'Survived', ascending = False)

pd.crosstab(df_train['Sex'], 
            df_train['Survived'], 
            margins = True).style.background_gradient(cmap = 'summer_r')

2.3 Both Sex and Pclass

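Using seaborn catplot (named factorplot in older seaborn versions) : a plot over three dimensions

sns.catplot(x = 'Pclass', y = 'Survived', hue = 'Sex', data = df_train, kind = 'point', height = 5, aspect = 1.5)

In every class, females are more likely to survive than males.
For both men and women, the higher the class, the higher the chance of survival.

sns.catplot(x = 'Sex', y = 'Survived', col = 'Pclass', 
            data = df_train, kind = 'point', 
            height = 4, aspect = 1)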

2.4 Age

print('Oldest passenger : {:.1f} years'.format(df_train['Age'].max()))
print('Youngest passenger : {:.1f} years'.format(df_train['Age'].min()))
print('Mean passenger age : {:.1f} years'.format(df_train['Age'].mean()))
>>> 
Oldest passenger : 80.0 years
Youngest passenger : 0.4 years
Mean passenger age : 29.7 years

fig, ax = plt.subplots(1,1 ,figsize = (9,5))
sns.kdeplot(df_train[df_train['Survived'] == 1]['Age'], ax = ax)
sns.kdeplot(df_train[df_train['Survived'] == 0]['Age'], ax = ax)
plt.legend(['Survived == 1', 'Survived ==0'])
plt.show()

Among the survivors, a relatively large share are young.

# Age distribution within classes
plt.figure(figsize = (8,6))
df_train['Age'][df_train['Pclass'] == 1].plot(kind = 'kde')
df_train['Age'][df_train['Pclass'] == 2].plot(kind = 'kde')
df_train['Age'][df_train['Pclass'] == 3].plot(kind = 'kde')

plt.xlabel('Age')
plt.title('Age Distribution within classes')
plt.legend(['1st Class', '2nd Class', '3rd Class'])

The higher the class (toward 1st), the larger the share of older passengers.

cumulative_survival_ratio = []
for i in range(1, 80):
    younger = df_train[df_train['Age'] < i]['Survived']  # passengers younger than i
    cumulative_survival_ratio.append(younger.sum() / len(younger))
plt.figure(figsize = (7,7))
plt.plot(cumulative_survival_ratio)
plt.title('Survival rate change depending on range of Age', y =1.02)
plt.ylabel('Survival rate')
plt.xlabel('Range of Age(0~x)')
plt.show()

The younger the passenger, the higher the survival rate.
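To make clear what the curve above measures, here is a small sketch on made-up data (`toy` and `thresholds` are hypothetical): for each age threshold, it computes the survival rate among passengers younger than that threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({
    'Age':      [0.5, 5.0, 30.0, 60.0, np.nan],
    'Survived': [1,   1,   0,    0,    1],
})

thresholds = [1, 10, 40, 80]
# Rows with NaN age never satisfy Age < t, so they are excluded automatically
rates = [toy.loc[toy['Age'] < t, 'Survived'].mean() for t in thresholds]
print(rates)
```

Each value is a rate over the whole range from 0 up to the threshold, so the curve is cumulative rather than a per-age survival rate.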

2.5 Pclass, Sex, Age

Using violinplot
x-axis : (Pclass, Sex)
y-axis : (Age)

f, ax = plt.subplots(1,2 , figsize = (18, 8))
sns.violinplot(x = 'Pclass', y = 'Age', hue = 'Survived', 
               data = df_train, scale ='count', 
               split = True, ax = ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))

sns.violinplot(x = 'Sex', y = 'Age', hue = 'Survived',
               data = df_train, scale = 'count', 
               split = True, ax = ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

In every class, younger passengers survived more often.
From the right-hand plot alone, women clearly survived more often.
Women and children were looked after first.

2.6 Embarked

Port of embarkation

f,ax = plt.subplots(1,1, figsize = (7,7))
df_train[['Embarked', 'Survived']].groupby(['Embarked'], 
                                           as_index = True).mean().sort_values(by = 'Survived' 
                                                                               , ascending = False).plot.bar(ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)

Port C has the highest survival rate.

f, ax = plt.subplots(2,2 , figsize = (20,15))
sns.countplot(x = 'Embarked', data = df_train, ax = ax[0,0])
ax[0,0].set_title('(1) No. Of Passengers Boarded')

sns.countplot(x = 'Embarked', hue = 'Sex', data = df_train, ax = ax[0,1])
ax[0,1].set_title('(2) Male - Female Split for Embarked')

sns.countplot(x = 'Embarked', hue = 'Survived', data = df_train, ax = ax[1,0])
ax[1,0].set_title('(3) Embarked vs Survived')

sns.countplot(x = 'Embarked', hue = 'Pclass', data = df_train, ax = ax[1,1])
ax[1,1].set_title('(4) Embarked vs Pclass')
plt.subplots_adjust(wspace= 0.2, hspace = 0.5)
plt.show()

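Figure(1) - The most passengers boarded at S.
Figure(2) - C and Q have similar male/female ratios; S has more men.
Figure(3) - For S the survival rate is considerably lower.
Figure(4) - C's high survival rate comes from its many higher-class passengers. S has many 3rd-class passengers, so its survival rate comes out low.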

2.7 Family - SibSp (siblings, spouses) + Parch (parents, children)

df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1 # including oneself
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1 # including oneself

print('Maximum size of Family: ' , df_train['FamilySize'].max())
print('Minimum size of Family: ' , df_train['FamilySize'].min())
>>> 
Maximum size of Family:  11
Minimum size of Family:  1

f, ax = plt.subplots(1,3 , figsize = (25,6))
sns.countplot(x = 'FamilySize', data = df_train, ax = ax[0])
ax[0].set_title('(1) No. Of Passengers Boarded', y = 1.02)

sns.countplot(x = 'FamilySize', hue = 'Survived', data = df_train, ax = ax[1])
ax[1].set_title('(2) Survived countplot depending on FamilySize', y =1.02)

df_train[['FamilySize', 'Survived']].groupby(['FamilySize'], 
                                             as_index = True).mean().sort_values(by = 'Survived', ascending = False).plot.bar(ax=ax[2])
ax[2].set_xticklabels(ax[2].get_xticklabels(), rotation = 0)
ax[2].set_title('(3) Survived rate depending on FamilySize', y = 1.02)

plt.subplots_adjust(wspace = 0.2, hspace = 0.5)
plt.show()

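Figure (1) - Family size ranges from 1 to 11; most passengers have size 1, followed by 2, 3, and 4.
Figure (2), (3) - Comparing survival by family size, a family of 4 has the highest survival rate.
As the family gets larger (5, 6, 7, 8, 11), the survival rate drops.
Both very small (1) and very large (5, 6, 8, 11) families have a low survival rate.
Survival is highest at around 3 to 4 members.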

2.8 Fare

Ticket fare : continuous feature

fig, ax = plt.subplots(1, 1, figsize = (8,8))
g = sns.distplot(df_train['Fare'], color = 'w', 
                 label = 'Skewness : {:.2f}'.format(df_train['Fare'].skew()) , ax = ax)
g = g.legend(loc = 'best')

The distribution is highly asymmetric (high skewness).
If the model reacts too sensitively to the few outliers, its actual predictions may suffer.
To reduce the influence of outliers, take the log of Fare.

# replace the missing value with the mean
df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean() 

df_train['Fare'] = df_train['Fare'].map(lambda i : np.log(i) if i > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda i : np.log(i) if i > 0 else 0)

fig, ax = plt.subplots(1, 1, figsize = (8,8))
g = sns.distplot(df_train['Fare'], color = 'w', 
                label = 'Skewness : {:.2f}'.format(df_train['Fare'].skew()), ax = ax )
g = g.legend(loc = 'best')

After taking the log, much of the asymmetry disappears.
This kind of transformation can help the model perform better
(a form of feature engineering).
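The effect of the log transform on skewness can be reproduced without the Titanic files. This is a minimal sketch on synthetic data: the log-normal distribution, its parameters, and the name `fare` are assumptions chosen only to imitate a right-skewed fare column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "fares"; log-normal data is an assumption for illustration
fare = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

skew_before = fare.skew()
# Same rule as above: log for positive values, 0 otherwise
skew_after = fare.map(lambda v: np.log(v) if v > 0 else 0).skew()

print(round(skew_before, 2), round(skew_after, 2))
```

The skewness after the transform should be much closer to 0 than before, the same kind of change the two distplot figures show for the real Fare column.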

2.9 Cabin

About 77% of the values are NaN, so this feature is left out of the model.

df_train['Cabin'].isnull().sum()/len(df_train['Cabin'])
>>> 0.7710437710437711

2.10 Ticket

There are no NaN values, but since this is string data, we need an idea for how to use it in the actual model.

df_train['Ticket'].isnull().sum()
>>> 0

df_train['Ticket'].value_counts()
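One possible idea, offered here as an assumption rather than as the approach taken in the source tutorial, is to keep only the alphabetic prefix of each ticket as a categorical feature. The sketch below uses a few made-up ticket strings, and `ticket_prefix` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical ticket strings in the style of the Titanic data
tickets = pd.Series(['A/5 21171', 'PC 17599', '113803', 'STON/O2. 3101282'])

def ticket_prefix(ticket):
    """Return the first whitespace-separated token if the ticket has
    more than one token, otherwise the placeholder 'NONE'."""
    parts = ticket.split()
    # A ticket with a single token is treated as having no prefix
    return parts[0] if len(parts) > 1 else 'NONE'

prefixes = tickets.map(ticket_prefix)
print(prefixes.tolist())  # ['A/5', 'PC', 'NONE', 'STON/O2.']
```

Whether such a prefix carries any signal would still have to be checked against Survived, for example with the same groupby and countplot steps used for the other features.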
