[Classification] k-means clustering

2020. 11. 16. 07:05 · Notes / R: Statistics

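Clustering: the task of dividing a data set into groups called clusters.

From a business perspective, it is useful for target marketing: customer data is divided into groups that share similar purchasing patterns.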

Data

Loading the data

teens<-read.csv("C:\\Users\\LG\\Documents\\Data\\snsdata.csv")

str(teens)

# Handle missing values, starting with gender
table(teens$gender, useNA = "ifany")

summary(teens$age)

teens$age<-ifelse(teens$age>=13 & teens$age <20, teens$age , NA)

summary(teens$age)

teens$female <- ifelse(teens$gender =="F" & !is.na(teens$gender),1,0)

table(teens$female)

teens$no_gender <- ifelse(is.na(teens$gender),1,0)

table(teens$gender, useNA='ifany')

table(teens$no_gender)

mean(teens$age, na.rm = TRUE)

myagg <- aggregate(data = teens , age~gradyear , mean)

class(aggregate(data= teens, age~gradyear, mean)) # dataframe

# Compute the mean age for each group (graduation year), returned once per row

avg_ave <- ave(teens$age, teens$gradyear, FUN = function(x)mean(x,na.rm= TRUE))

class(avg_ave) # numeric 

teens$age <- ifelse(is.na(teens$age), avg_ave, teens$age)
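
The difference between aggregate() and ave() is easier to see on a small example. The sketch below uses made-up toy values (toy_ages is hypothetical, not taken from snsdata.csv): aggregate() returns one row per group, while ave() returns one value per original row, which is why it can be dropped straight into ifelse() for imputation.

```r
# Toy data (hypothetical values) with one missing age per graduation year
toy_ages <- data.frame(gradyear = c(2006, 2006, 2006, 2007, 2007, 2007),
                       age      = c(18, NA, 19, 17, 18, NA))

# aggregate(): one row per group (rows with NA age are dropped by the formula interface)
aggregate(age ~ gradyear, data = toy_ages, FUN = mean)

# ave(): one value per original row, so it lines up with toy_ages$age
group_mean <- ave(toy_ages$age, toy_ages$gradyear,
                  FUN = function(x) mean(x, na.rm = TRUE))
group_mean    # 18.5 18.5 18.5 17.5 17.5 17.5

# Replace only the missing ages with the mean of their graduation year
toy_ages$age <- ifelse(is.na(toy_ages$age), group_mean, toy_ages$age)
toy_ages$age  # 18.0 18.5 19.0 17.0 18.0 17.5
```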

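** Caution
k-means clustering groups observations by the distances between them, so the features need to be put on a common scale first, either by standardization (subtracting the mean and dividing by the standard deviation) or by normalization.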
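
As a quick check of what standardization does, the sketch below uses made-up toy values (toy_scale is hypothetical, not taken from snsdata.csv). It shows that scale() gives the same result as computing the z-score by hand, and that both columns end up on a comparable scale.

```r
# Toy data (hypothetical values): two features on very different scales
toy_scale <- data.frame(friends = c(10, 200, 35, 80),
                        age     = c(14.5, 17.2, 15.8, 18.1))

# scale() centers each column on its mean and divides by its standard deviation
toy_z <- as.data.frame(scale(toy_scale))

# The same z-score computed by hand for one column
manual_z <- (toy_scale$friends - mean(toy_scale$friends)) / sd(toy_scale$friends)

all.equal(toy_z$friends, manual_z)  # TRUE
sapply(toy_z, sd)                   # every column now has standard deviation 1
```

Without this step, a feature with a large range (such as friends here) would dominate the distance calculation.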

interests <- teens[5:40]
set.seed(2345)

interests_z <- as.data.frame(lapply(interests, scale))
head(interests_z)

teen_clusters<-kmeans(interests_z, 5)
teen_clusters$size

teen_clusters$centers

teens$cluster<-teen_clusters$cluster

teens[1:5, c("cluster","gender","age","friends")]

Means of age, share of females, and number of friends for each cluster:

aggregate(data= teens, female~cluster,mean)

aggregate(data=teens, age~cluster,mean)

aggregate(data= teens ,friends~cluster,mean)

Reference

파이썬 라이브러리를 활용한 머신러닝 (Introduction to Machine Learning with Python, Korean edition), p. 225

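k-means clustering:

- Minimizes the sum of squared distances between each data point and its cluster center (the within-cluster sum of squares)
- Equivalently, maximizes the separation between clusters (the between-cluster sum of squares)

1. Decide k, the number of groups to split the data into
2. Set k initial center points
3. Assign each data point to the cluster whose center is closest to it
4. Update the center points (repeat steps 3 and 4 until the assignments stop changing)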
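
The four steps above can be sketched directly in R. This is a minimal illustration only: simple_kmeans is a hypothetical helper written for this note, not the implementation behind stats::kmeans(), and the seed, max_iter, and choice of iris columns are assumptions.

```r
# Minimal k-means sketch (Lloyd's algorithm); simple_kmeans is a hypothetical helper
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 2: pick k distinct rows as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every point to every center,
    # then assign each point to the nearest center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments stopped changing
    cluster <- new_cluster
    # Step 4: move each center to the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # arbitrary seed; the result depends on the random initial centers
fit <- simple_kmeans(iris[, c("Petal.Length", "Petal.Width")], k = 3)
table(fit$cluster)
fit$centers
```

Because the starting centers are random, different seeds can give different clusterings. For real work, kmeans() with several random starts (its nstart argument) is the safer choice.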

# Practice with the iris data

iris
str(iris)
head(iris)

colSums(is.na(iris)) # count missing values in each column

panel.fun<-function(x,y,...){
  horizontal <-(par("usr")[1] + par("usr")[2])/2; 
  vertical <-(par("usr")[3] + par("usr")[4])/2;
  text(horizontal, vertical, format(abs(cor(x,y)), digits=2))
  }


pairs(iris[1:4], pch = 21,
      bg=c("red","green","blue")[unclass(iris$Species)],
      upper.panel = panel.fun,
      main = "Scatter") # pairs() draws a scatter plot matrix

# Note:
# geom_point() in the ggplot2 package: draws a scatter plot for a single pair of variables
# corrplot package: draws a correlation matrix

pairs(iris[-5], log = "xy") # plot all variables on log scale

pairs(iris, log = 1:4, # log the first four
      main = "Lengths and Widths in [log]", line.main=1.5, oma=c(2,2,3,2))

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21 , bg = c("red", "green3", "blue")[unclass(iris$Species)])

# Practice with the airquality data

str(airquality)

airquality_1<-airquality[,c(1:4)]

str(airquality_1)

colSums(is.na(airquality_1)) # count missing values in each column

cor(airquality_1)

# Remove rows that contain missing values
airquality_2<-na.omit(airquality_1)

str(airquality_2)

colSums(is.na(airquality_2)) # count missing values in each column

The three panel functions below were copied from the bottom of the pairs help page (?pairs).

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(USJudgeRatings, lower.panel = panel.smooth, upper.panel = panel.cor,
      gap=0, row1attop=FALSE)

panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}

panel.lm <- function(x, y, col = par("col"), bg = NA,
                     pch = par("pch"),
                     cex = 1, col.smooth = "black", ...) {
  # Draw the points, then overlay the least-squares regression line
  points(x, y, pch = pch, col = col, bg = bg, cex = cex)
  abline(stats::lm(y ~ x), col = col.smooth, ...)
}


pairs(airquality_2 , pch="*" , main="scatter plot", 
      lower.panel = panel.lm, 
      upper.panel = panel.cor,
      diag.panel = panel.hist)

Which attributes should be used for clustering?

library(ggplot2)

iris_plot<-ggplot(data = iris, aes(x=Petal.Length, y = Petal.Width, colour = Species))+ geom_point(shape = 19 , size = 4)

iris_plot

iris_plot2<-iris_plot+
  annotate("text", x = 1.5, y = 0.7 , label = "Setosa")+
  annotate("text", x = 3.5, y = 1.5 , label = "Versicolor")+
  annotate("text", x = 6, y = 2.7 , label = "Virginica")


iris_plot2+
  annotate("rect", xmin=0, xmax=2.6 , ymin =0 , ymax = 0.8 , alpha=0.1, fill ="red")+
  annotate("rect", xmin=2.6, xmax=4.9 , ymin =0.8 , ymax = 1.5 , alpha=0.1, fill ="green")+ # ymax = 1.5 is an assumed bound so the box has a visible height
  annotate("rect", xmin=4.8, xmax=7.2 , ymin =1.5 , ymax = 2.7 , alpha=0.1, fill ="blue")

iris_plot2

iris_kmeans <-kmeans(iris[,c("Petal.Length","Petal.Width")],3)

iris_kmeans
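
One way to judge whether the two petal attributes are a good choice is to cross-tabulate the cluster labels against the known species. The sketch below is a variant of the call above under my own assumptions: the seed value and nstart = 25 are additions made for reproducibility, and iris_km is a hypothetical name.

```r
set.seed(2345)  # arbitrary seed so the result is reproducible
iris_km <- kmeans(iris[, c("Petal.Length", "Petal.Width")],
                  centers = 3, nstart = 25)

# Rows: true species, columns: cluster labels (the label numbers are arbitrary)
table(iris$Species, iris_km$cluster)

# Ratio of between-cluster to total sum of squares (closer to 1 means tighter clusters)
iris_km$betweenss / iris_km$totss
```

If most of each species falls into a single column of the table, the chosen attributes separate the groups well.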
