Deep Learning >> Sequence Models and the Attention Mechanism (deeplearning.ai lecture)

A review written after studying week 3 of course 5 from deeplearning.ai ✍🏻

Basic Models

How to train the model

Here we will learn about sequence-to-sequence models.

  • In a typical machine translation problem, the model is trained on sentence pairs: the input (x) is a sentence in one language (French in the example used below) and the output (y) is its translation (English).

  • The model consists of two parts, an encoder and a decoder. The encoder is built from the LSTM or GRU units covered earlier: it reads the input sequence and produces a vector that represents that input. The decoder then generates the output sentence from that vector.

  • This architecture is similar to the one used for image captioning, because image captioning also takes an input (a photo) and generates a sentence (the caption) as output.
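To make the "encoder produces a vector" idea concrete, here is a minimal sketch. It uses a plain tanh RNN as a stand-in for the LSTM/GRU mentioned above, and the sizes, random weights, and the helper names `rnn_encode` and `random_matrix` are all made up for illustration. The point is only that sentences of different lengths end up as vectors of the same size.

```python
import math
import random

def rnn_encode(inputs, W_x, W_h, b):
    """Run a plain tanh RNN over the input vectors; return the last hidden state."""
    hidden = [0.0] * len(b)
    for x_t in inputs:  # one time step per input word vector
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(W_x[i], x_t))
                + sum(w * h for w, h in zip(W_h[i], hidden))
                + b[i]
            )
            for i in range(len(b))
        ]
    return hidden  # fixed-size vector that would be handed to the decoder

random.seed(0)
EMBED_DIM, HIDDEN_DIM = 4, 6  # hypothetical sizes

def random_matrix(rows, cols):
    """Small random weights; a trained model would learn these instead."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_x = random_matrix(HIDDEN_DIM, EMBED_DIM)
W_h = random_matrix(HIDDEN_DIM, HIDDEN_DIM)
b = [0.0] * HIDDEN_DIM

short_sentence = random_matrix(3, EMBED_DIM)  # 3 stand-in word vectors
long_sentence = random_matrix(7, EMBED_DIM)   # 7 stand-in word vectors

short_code = rnn_encode(short_sentence, W_x, W_h, b)
long_code = rnn_encode(long_sentence, W_x, W_h, b)
print(len(short_code), len(long_code))
```

Both encodings have `HIDDEN_DIM` entries no matter how long the input was, which is exactly why very long sentences become a problem later on.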


Picking the Most Likely Sentence

  • The model predicts the y that maximizes the conditional probability P(y | x). This is why machine translation is also called a conditional language model.

  • Example:
    X = "Jane visite l’Afrique en septembre."
    Y may be:

    • Jane is visiting Africa in September.
    • Jane is going to be visiting Africa in September.
    • In September, Jane will visit Africa.
    • As these show, many different translations are possible for y. Rather than sampling one of them at random, we want the most likely one.
  • The most commonly used algorithm for this is beam search. So why use beam search instead of a greedy approach?

    • Because picking the single most probable word at each step can still produce a less optimal sentence overall!
    • That is why we need to learn a 'search algorithm'.
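Here is a small sketch of why greedy decoding can lose. The next-word probabilities in `NEXT` are invented numbers, not output from a real model, and the function names are hypothetical. Greedy takes the most probable word at each step, while the brute-force search scores every complete sentence.

```python
# Toy next-word distributions P(word | x, prefix); all numbers are made up.
NEXT = {
    (): {"Jane": 1.0},
    ("Jane",): {"is": 1.0},
    ("Jane", "is"): {"going": 0.6, "visiting": 0.4},
    ("Jane", "is", "going"): {"to": 0.55, "home": 0.45},
    ("Jane", "is", "visiting"): {"Africa": 0.9, "us": 0.1},
}

def sentence_prob(words):
    """P(y | x) as the product of the per-word conditional probabilities."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= NEXT[tuple(words[:i])][word]
    return prob

def greedy_decode():
    """Always take the single most probable next word."""
    prefix = ()
    while prefix in NEXT:
        dist = NEXT[prefix]
        prefix += (max(dist, key=dist.get),)
    return prefix

def all_sentences(prefix=()):
    """Enumerate every complete sentence (only feasible for a toy model)."""
    if prefix not in NEXT:
        yield prefix
        return
    for word in NEXT[prefix]:
        yield from all_sentences(prefix + (word,))

greedy = greedy_decode()
best = max(all_sentences(), key=sentence_prob)
print(" ".join(greedy), sentence_prob(greedy))
print(" ".join(best), sentence_prob(best))
```

Greedy commits to "going" because it is the more probable third word, but the sentence that starts with "visiting" has the higher overall probability. Enumerating every sentence is impossible with a real vocabulary, which is where beam search comes in.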

Beam Search

Beam search is a heuristic search algorithm: it is not guaranteed to find the most likely sentence.

Setting B, the beam width, to 3 means that for the first word of the translation we evaluate every word in the vocabulary and keep the 3 most likely ones. Say [in, jane, september] were chosen. In step 2, for each of those 3 words we compute the probability of every candidate second word, and again keep only the 3 most likely candidates, this time ranked by the joint probability of the first and second word together. If the vocabulary has 10,000 words, step 2 therefore evaluates 3 × 10,000 = 30,000 candidates. Every step keeps exactly 3 partial sentences in this way, and the probabilities keep being extended until the sentences end. And if B = 1, this simply turns into greedy search.
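The steps above can be sketched in a few lines. This is a simplified version under my own assumptions: `TOY_MODEL` is an invented table of next-word probabilities standing in for the decoder RNN, sentences end with a hypothetical `<eos>` token, and scores are summed log-probabilities without any length normalization.

```python
import math

EOS = "<eos>"

# Toy next-word distributions P(word | x, prefix); all numbers are invented.
TOY_MODEL = {
    (): {"in": 0.4, "jane": 0.35, "september": 0.25},
    ("in",): {"september": 0.6, EOS: 0.4},
    ("jane",): {"visits": 0.9, EOS: 0.1},
    ("september",): {EOS: 1.0},
    ("in", "september"): {EOS: 1.0},
    ("jane", "visits"): {"africa": 0.9, EOS: 0.1},
    ("jane", "visits", "africa"): {EOS: 1.0},
}

def beam_search(model, beam_width, max_len=10):
    """Return (sentence, log_prob) for the best finished hypothesis found."""
    beams = [((), 0.0)]  # (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        # Extend every kept prefix by every possible next word.
        candidates = []
        for prefix, score in beams:
            for word, p in model[prefix].items():
                candidates.append((prefix + (word,), score + math.log(p)))
        # Keep only the `beam_width` most probable candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == EOS:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])

greedy = beam_search(TOY_MODEL, beam_width=1)
wide = beam_search(TOY_MODEL, beam_width=3)
print(greedy[0], math.exp(greedy[1]))
print(wide[0], math.exp(wide[1]))
```

With `beam_width=1` the search locks onto "in" in the first step and never recovers, while `beam_width=3` keeps "jane" alive long enough to find the more probable sentence.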


Refinements to Beam Search

  • Beam search aims to maximize the probability, but multiplying many individual probabilities together yields an extremely small number. So to stay numerically stable we take logs and sum the log-probabilities instead. On top of that we apply length normalization: every extra word adds another negative term, so the raw score favors short translations. To compensate, we normalize the score by the number of words in the translation.

  • So how should the beam width be tuned?
    • Larger B: more candidates are considered, so the result tends to be better, but the search gets slower.
    • Smaller B: the result tends to be worse, but the search gets faster.
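A quick sketch of the two refinements: summing logs instead of multiplying probabilities, and dividing by the sentence length. The probabilities are made-up numbers, the function names are my own, and `alpha=0.7` is just an assumed value for the softening exponent (`alpha=1` would be full normalization, `alpha=0` none).

```python
import math

def product_of_probs(probs):
    """Naive score: multiply the per-word probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_score(probs):
    """Numerically stable score: sum of log-probabilities."""
    return sum(math.log(p) for p in probs)

def normalized_log_score(probs, alpha=0.7):
    """Length-normalized score: log score divided by (number of words) ** alpha."""
    return log_score(probs) / (len(probs) ** alpha)

# 200 words with probability 0.01 each: the product underflows, the log sum does not.
very_long = [0.01] * 200
print(product_of_probs(very_long))
print(log_score(very_long))

# A short and a longer candidate translation (invented per-word probabilities).
short = [0.5, 0.5]
longer = [0.6, 0.6, 0.6, 0.6]
print(log_score(short), log_score(longer))
print(normalized_log_score(short), normalized_log_score(longer))
```

Without normalization the short candidate wins even though each of its words is less probable. After dividing by the length, the longer candidate scores higher.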

Error Analysis

  • Error analysis lets us determine whether the problem lies with beam search (its hyperparameter B) or with the RNN.

  • First, call the human translation y* and the machine translation ŷ. Then use the RNN to compute P(y* | X) and P(ŷ | X). Comparing the two values gives two cases:

    1. Case 1 (P(y* | X) > P(ŷ | X)): beam search chose ŷ even though the model assigns y* a higher probability, so beam search is at fault and the beam width should be increased.
    2. Case 2 (P(y* | X) <= P(ŷ | X)): the human translation is better, yet the RNN gives it a probability no higher than ŷ, so the RNN model is at fault. In that case, consider adding more training data, making the network deeper, or trying a different network architecture.

    x = "Jane visite l’Afrique en septembre."
    y* = "Jane visits Africa in September." - right answer
    ŷ = "Jane visited Africa last September." - answer produced by model
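The two cases boil down to a single comparison per example. Below is a minimal sketch that tallies the blame over a set of mistranslated dev examples. The probability pairs are invented, and `attribute_error` and `error_breakdown` are names I made up for illustration.

```python
def attribute_error(p_y_star, p_y_hat):
    """Decide which component to blame for one mistranslated example."""
    if p_y_star > p_y_hat:
        return "beam search"  # case 1: the model preferred y*, the search missed it
    return "RNN"              # case 2: the model itself scored y* no higher than y_hat

def error_breakdown(examples):
    """Count how many errors are attributed to each component."""
    counts = {"beam search": 0, "RNN": 0}
    for p_y_star, p_y_hat in examples:
        counts[attribute_error(p_y_star, p_y_hat)] += 1
    return counts

# Invented (P(y* | x), P(y_hat | x)) pairs for four mistranslated examples.
dev_errors = [(2e-10, 1e-10), (1e-10, 3e-10), (5e-11, 9e-11), (4e-10, 4e-10)]
print(error_breakdown(dev_errors))
```

Whichever component collects most of the errors is the one worth working on first: a larger B if it is beam search, more data or a different model if it is the RNN.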

Attention Model Intuition

The technique called 'attention' is an improvement on the encoder-decoder sequence model. With a long sequence and no attention, the encoder has to squeeze the entire sentence into a single vector, and the decoder has to turn that one vector into the translated sentence. With an attention model, performance on such long sentences improves considerably.

  • The key question: when producing one translated word, how much attention should be paid to each word of the input sentence?

Attention Model

For the purposes of this explanation, suppose we have a<t'>. It holds the activations of both directions of a bidirectional RNN at time step t'.
Next we compute the context c. For the first output word, it is the vector obtained by summing alpha<1, t'> times a<t'> over all t'.

Now let's look at the attention weight alpha<t, t'> (alpha<1, t'> for the first output word). alpha<t, t'> indicates how much attention y<t> should pay to a<t'>.
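Written out as formulas, the context and the attention weights look roughly like this. This is my reconstruction of the usual formulation, not something spelled out in the notes above: the score e<t, t'>, the previous decoder state s<t-1>, the input length T_x, and the scoring function f (typically a small neural network) are symbols I am assuming here.

```latex
% Context for output step t: attention-weighted sum of the encoder activations
c^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle} \, a^{\langle t' \rangle}

% Attention weights: softmax over the scores, so they sum to 1 over t'
\alpha^{\langle t, t' \rangle}
  = \frac{\exp\left(e^{\langle t, t' \rangle}\right)}
         {\sum_{t''=1}^{T_x} \exp\left(e^{\langle t, t'' \rangle}\right)},
\qquad
\sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle} = 1

% Scores: assumed to come from a small network f of the previous
% decoder state and the encoder activation
e^{\langle t, t' \rangle} = f\left(s^{\langle t-1 \rangle}, a^{\langle t' \rangle}\right)
```

The softmax is what makes the weights for one output word non-negative and sum to 1, so the context is a weighted average of the a<t'> vectors.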

The drawback of this algorithm is that it takes quadratic time, so it is computationally expensive: an attention weight has to be computed for every pair of an output position and an input position. These attention weights can also be visualized.
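Here is a rough sketch of where the quadratic cost comes from. To keep it short I score each pair with a plain dot product, which is my own simplification (a small neural network would normally produce the scores), and the vectors are invented. The result is a T_y × T_x matrix of weights, which is also the thing you would plot to visualize attention.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(decoder_states, encoder_activations):
    """Return (weights, contexts); weights[t][t2] is the attention that
    output step t pays to input step t2."""
    weights, contexts = [], []
    for s in decoder_states:  # T_y output steps
        # One score per input step: T_x scores for every output step.
        scores = [sum(si * ai for si, ai in zip(s, a)) for a in encoder_activations]
        alphas = softmax(scores)
        # Context = attention-weighted sum of the encoder activations.
        context = [
            sum(alpha * a[d] for alpha, a in zip(alphas, encoder_activations))
            for d in range(len(encoder_activations[0]))
        ]
        weights.append(alphas)
        contexts.append(context)
    return weights, contexts

encoder_activations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T_x = 3 (made up)
decoder_states = [[2.0, 0.0], [0.0, 2.0]]                   # T_y = 2 (made up)

weights, contexts = attention(decoder_states, encoder_activations)
for row in weights:  # each row is one line of the attention "heat map"
    print(" ".join(f"{w:.2f}" for w in row))
print(sum(len(row) for row in weights), "weights computed")
```

The number of weights is T_y × T_x, so doubling both the input and the output length quadruples the work.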

Speech Recognition

For speech recognition, the input X is an audio clip and the output Y is its transcript.