๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ


machine learning >> clustering(1)

Unsupervised Learning

: 'Unsupervised learning' means there is no supervisor. In other words, there are no labels for the input X.

  • ์ข…๋ฅ˜

    • Density Estimation(KDE): y label์€ ํ•„์š”์—†๊ณ , x data๋งŒ ํ•„์š”ํ•˜๋‹ค.
    • Clustering : kMeans, MoG
    • Dimension Reduction : x data๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋†’์€ ์ฐจ์›์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚ฎ์€ ์ฐจ์›์˜ ๋ฐ์ดํ„ฐ์— projection ํ•ด์ฃผ๋Š” ๊ฒƒ์ด๋‹ค. ์ด๊ฑด 'compression'์ด๋ž‘ ๋น„์Šทํ•˜๋‹ค.
    • Factor analysis: ์ฃผ์–ด์ง„ signal์„ ๋ฐœ์ƒ์‹œํ‚ค๋Š” ๋ฐ์— ์ฃผ์š”์ธ์ด ๋ฌด์—‡์ธ๊ฐ€?
    • Representation Learning
  • Clustering

    1. Understanding the dataset 📈
      : First, we need to know the distribution of the data. Since grasping it fully is hard, the easiest things to look at first are the mean, the variance, and the min/max values. But what if the distribution is multimodal, i.e., it has several bumps (modes)? In that case the mean does not carry very meaningful information, and this is where clustering helps a lot.

    2. Clustering
      : 'Grouping the samples.' We need to group the samples, but how many groups would be meaningful? For now, we play the role of the supervisor and have to provide the model with something (such as the number of clusters).

    3. ์ ‘๊ทผ๋ฒ•

      • connectivity based
      • centroid based
      • distribution based
      • hierarchical clustering
        • 100๊ฐœ์˜ ์ƒ˜ํ”Œ๋กœ 100๊ฐœ์˜ ํด๋Ÿฌ์Šคํ„ฐ๋กœ ํ•˜๋Š”๊ฒจ, ๊ทธ ๋‹ค์Œ์— ๊ฐ€๊นŒ์šด ์• ๋“ค์„ ๊ฒฐํ•ฉ์‹œํ‚ค๊ณ  ์ ์  ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์ค„์—ฌ๊ฐ€๋Š” ๊ฑฐ์ง€. ๊ทธ๋Ÿผ ๋ช‡๊ฐœ์˜ ํด๋Ÿฌ์Šคํ„ฐ๊ฐ€ ํ•„์š”ํ•œ๊ฐ€? ๊ทธ๊ฑฐ๋Š” ๋ชจ๋ฅธ๋‹ค. (bottom up)
        • 100๊ฐœ์˜ ์ƒ˜ํ”Œ์„ ์ฒ˜์Œ์— 1๊ฐœ์˜ ํด๋Ÿฌ์Šคํ„ฐ์— ๋ชจ๋‘ ์ง‘์–ด๋„ฃ๊ณ  ์‹œ์ž‘ํ•˜๋Š” ๊ฑฐ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ์ด๊ฑธ splitํ•ด๋‚˜๊ฐ„๋‹ค.
      • Graph theoretic : spectral clusterint
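The bottom-up (agglomerative) idea above can be sketched in a few lines. A minimal sketch using single linkage (the function name and linkage choice are mine, not from the notes):

```python
import numpy as np

def agglomerative(X, k):
    """Bottom-up clustering sketch: start with one cluster per sample
    and repeatedly merge the closest pair of clusters (single linkage:
    distance between the closest members) until k clusters remain."""
    clusters = [[i] for i in range(len(X))]   # each sample is its own cluster
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the two closest members
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters.pop(b))   # merge the closest pair
    return clusters
```

Note the answer to "how many clusters?" still has to be supplied as k; the hierarchy itself does not choose it.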
  • kMeans ๐Ÿ’ฅ

    • ํŠน์ง•?!?!๐Ÿง
      • ์ฒ˜์Œ์— ๋ช‡๊ฐœ์˜ ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์ œ๊ณตํ•ด์ค„์ง€ 'k'๋ฅผ ์ •ํ•ด์ค€๋‹ค. ์ด ๋ชจ๋ธ์€ 'global convergence' ๋ฅผ ๊ฐ€์ง€๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋””์„œ ์‹œ์ž‘ํ•˜๋˜ ๊ฐ„์— ์–ด๋”˜๊ฐ€์— ์ˆ˜๋ ดํ•œ๋Œฑ. ์ฐธ๊ณ ๋กœ 'global optimization'์ด๋ž‘ ๋‹ค๋ฅธ ๋‹จ์–ด์ด๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ์ฒ˜์Œ์— initialization์„ ์–ด๋–ป๊ฒŒ ํ•ด์ฃผ๋Š๋ƒ๊ฐ€ ์ค‘์š”ํ•˜๋‹ค. it is sensitive to initialization and outliers.
      • 1 sample์ด 2๊ฐœ์˜ cluster์— ์†ํ•  ์ˆ˜๋Š” ์—†๋‹ค.
      • dataset์— ๋Œ€ํ•ด์„œ ์•„๋ฌด๊ฒƒ๋„ ๋ชจ๋ฅผ ๊ฒฝ์šฐ์—” ์œ ํด๋ฆฌ๋””์–ธ ๊ฑฐ๋ฆฌ๊ฐ€ ์ œ์ผ ์ข‹์€ ์˜ต์…˜์ด๋‹ค.๊ฑฐ์˜ ๋ชจ๋“  ๊ฒฝ์šฐ ๊ทธ๋ ‡๋‹ค๊ณ  ํ•œ๋Œฑ.
      • M(i) = mean vector ์œ ํด๋ฆฌ๋””์–ธ ๊ฑฐ๋ฆฌ์™€ ํ•˜๋‚˜์˜ ํด๋Ÿฌ์Šคํ„ฐ ์•ˆ์˜ ๋ชจ๋“  ์ƒ˜ํ”Œ๋“ค๊ณผ ๊ฑฐ๋ฆฌ๋ฅผ ๊ตฌํ•ด์ค€๋‹ค.
        (Objective from the screenshot: J = Σ_i Σ_{x ∈ cluster i} ‖x − M(i)‖², the total of squared distances from each sample to its cluster's mean vector.)
      • ๋ฐ–์˜ summation์—์„œ๋Š” 5๊ฐœ์˜ ํด๋Ÿฌ์Šคํ„ฐ๊ฐ€ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฉด ๊ฐ cluster๋งˆ๋‹ค ๋ฐ์ดํ„ฐ์™€ mean์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๋”ํ•ด์ค€๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋ชจ๋“  cluster์˜ ๊ฑฐ๋ฆฌ์˜ ํ•ฉ์„ ๋”ํ•ด์„œ ์ตœ์†Œํ™”๊ฐ€ ๋˜๋„๋ก ํ•ด์ค€๋‹ค.
    • ๊ณผ์ •?!?๐Ÿง
      1. ์ฒ˜์Œ centroid๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ์ดˆ๊ธฐํ™” ํ•ด์ค€๋‹ค.
      2. ํ•˜๋‚˜์˜ sample์— ๋Œ€ํ•ด ๋ชจ๋“  centroid์™€์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ํ•ด๋‹น cluster๋ฅผ ์ง€์ •ํ•ด์ค€๋‹ค.
      3. ๊ฐ๊ฐ์˜ ํด๋Ÿฌ์Šคํ„ฐ๋‚ด์˜ mean vector๋ฅผ ๊ณ„์‚ฐํ•ด์ค€๋‹ค. ์ฆ‰ ์ƒˆ๋กœ์šด centroid๋ฅผ ๊ณ„์‚ฐํ•ด์ฃผ๊ณ  ์—…๋ฐ์ดํŠธ ํ•ด์ค€๋‹น.
      4. 2๋ฒˆ๊ณผ 3๋ฒˆ์„ ๊ณ„์† ๋ฐ˜๋ณตํ•ด์ค€๋‹ค.์–ธ์ œ๊นŒ์ง€?? ์—๋Ÿฌ๊ฐ€ ์ฆ๊ฐ€ํ•˜์ง€ ์•Š๊ฑฐ๋‚˜! ๊ฐ์†Œํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ! ๋˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธํ•  ๋•Œ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์—…๋ฐ์ดํŠธ๋ฅผ ํ†ตํ•ด ๋ฐ”๋€Œ์ง€ ์•Š๋Š”๋‹ค๋ฉด ํ•™์Šต์„ ๋ฉˆ์ถฐ์•ผ ํ•œ๋‹น.

PCA?

๋ฐ์ดํ„ฐ์˜ ์ฐจ์›์ด ๋†’์•„์งˆ์ˆ˜๋ก ์šฐ๋ฆฌ๊ฐ€ ์ข‹์•„ํ•˜๋Š” Euclidean distance์— ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋ฐฉ์‹์œผ๋กœ ๋™์ž‘ํ•  ์ˆ˜๊ฐ€ ์žˆ๋‹ค. ๋ณดํ†ต ์ฐจ์›์˜ ํฌ๊ธฐ๋ฅผ 'feature์˜ ๊ฐœ์ˆ˜'๋ผ๊ณ  ํ•œ๋‹ค.


Dimensionality reduction

https://brunch.co.kr/@rlawlgy43/33

  • PCA (Principal Component Analysis):

    • The problem of finding the projection matrix that maximizes the variance of the projected data. Why? The larger the variance of the projected data, the less information is lost (https://nittaku.tistory.com/291). Saying the variance along a direction is large means exactly the same thing as saying the corresponding eigenvalue is large. So from the covariance matrix we find the eigenvectors with the largest eigenvalues and move the data onto them to reduce the dimensionality: sort the eigenvectors in descending order of eigenvalue, pick as many as the target dimension, and project the data onto them (fit_transform).
    • Projecting the original data onto this eigenspace means placing the data in the coordinate system of those new vectors.
  • fit_transform: fits the model with X and applies the dimensionality reduction to X at the same time.

    • Returns: X_new: array-like, shape (n_samples, n_components)
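The eigen-decomposition procedure described above can also be written out by hand. A minimal numpy sketch (the function name is mine) whose output has the same shape as sklearn's fit_transform result:

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Minimal PCA sketch: project X onto the eigenvectors of its
    covariance matrix that have the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the features
    vals, vecs = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]           # sort eigenvalues descending
    W = vecs[:, order[:n_components]]        # top eigenvectors = projection matrix
    return Xc @ W                            # shape (n_samples, n_components)
```

For data lying exactly on a line, a single component already captures all of the variance, which is the "large variance = little information lost" idea above.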

kMeans ๊ตฌํ˜„ํ•˜๊ธฐ

1. ๋žœ๋ค์œผ๋กœ centroids๋ฅผ ์„ ์ •ํ•ด์ค€๋‹ค.
2. centroid์— ๊ทผ๊ฑฐํ•˜์—ฌ assign each observation to a clsuter.
3. ๊ฐ ํด๋Ÿฌ์Šคํ„ฐ์˜ ํ‰๊ท  ์ขŒํ‘œ๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ, ๊ทธ๊ฒƒ์ด ์ƒˆ๋กœ์šด centroid๊ฐ€ ๋œ๋‹ค. 
4. ์ƒˆ๋กœ์šด centroid์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ํด๋Ÿฌ์Šคํ„ฐ ์žฌ ํ• ๋‹น
5. 3๋ฒˆ๊ณผ 4๋ฒˆ์„ ์ˆ˜๋ ดํ• ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณต

* ์—ฌ๊ธฐ์„œ ํ‰๊ฐ€๊ธฐ์ค€์œผ๋กœ ์‚ผ์„๋งŒํ•œ acc๋ฅผ ๊ณ„์‚ฐํ•˜์ž.