
๋”ฅ๋Ÿฌ๋‹ ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ >> NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE ๋ฆฌ๋ทฐ

์˜ค๋Š˜์€ NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE์„ ๊ณต๋ถ€ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹น!๐Ÿค“


Introduction

Neural machine translation์€ machine translation๋ถ„์•ผ์—์„œ ์ƒˆ๋กœ ๋ฐœ๊ฒฌ๋œ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ํ•˜๋‚˜์˜, ์ปค๋‹ค๋ž€ ์‹ ๊ฒฝ๋ง์„ ์„ค๊ณ„ํ•˜๊ณ  ํ•™์Šต์‹œํ‚ด์œผ๋กœ์จ ์˜ฌ๋ฐ”๋ฅธ ๋ฒˆ์—ญ์„ ํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ๋ณดํ†ต ์ด๋Ÿฌํ•œ ์‹ ๊ฒฝ๋ง์„ ํ†ตํ•œ ๊ธฐ๊ณ„๋ฒˆ์—ญ์—” ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”๋กœ ๊ตฌ์„ฑ์ด ๋ฉ๋‹ˆ๋‹ค. ์ธ์ฝ”๋” ์‹ ๊ฒฝ๋ง(encoder nerual network)๋Š” source sentence(๋ฒˆ์—ญํ•ด์•ผ ํ•˜๋Š” ๋ฌธ์žฅ)์„ ๊ณ ์ •๋œ ํฌ๊ธฐ์˜ ๋ฒกํ„ฐ๋กœ ์ธ์ฝ”๋”ฉํ•ด ์ค๋‹ˆ๋‹ค. ๋””์ฝ”๋” ์‹ ๊ฒฝ๋ง์€ ์ธ์ฝ”๋”ฉ๋œ ๋ฒกํ„ฐ๋กœ๋ถ€ํ„ฐ ๋ฒˆ์—ญ์„ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋‹ค์Œ, encoder-decoder system์„ ํ†ตํ•ด์„œ ์˜ฌ๋ฐ”๋ฅธ ๋ฒˆ์—ญ์ผ ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™” ์‹œํ‚ค๋„๋ก ํ•™์Šต์‹œ์ผœ ์ค๋‹ˆ๋‹ค.

์ด ์ธ์ฝ”๋”ฉ์˜ ์—ญํ• ์„ ์‹ ๊ฒฝ๋ง์ด ํ•ด์ค„ ์ˆ˜ ์žˆ๋Š”๋ฐ ๊ทธ๋ ‡๊ฒŒ ๋˜๋ฉด ๊ธธ์ด๊ฐ€ ๊ธด ๋ฌธ์žฅ์€ ํ•ด์„ํ•˜๊ธฐ๊ฐ€ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์ด ๋…ผ๋ฌธ์—์„œ๋Š” align๊ณผ translate์„ ํ•จ๊ป˜ ํ•ด์ค„ ์ˆ˜ ์žˆ๋Š” encoder-decoder model์˜ ๋” ๋‚˜์€ ๋ฒ„์ „์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated.

source sentence์—์„œ ๊ด€๋ จ์žˆ๋Š” ์ •๋ณด๊ฐ€ ๋ชฐ๋ ค์žˆ๋Š” ๋ถ€๋ถ„์„ ๋‚˜ํƒ€๋‚ด๋Š” context vector๋ฅผ ๊ฐ€์ง€๊ณ  target word๋ฅผ ์˜ˆ์ธกํ•ด์ค๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ธํ’‹์œผ๋กœ ๋“ค์–ด๊ฐ€๋Š” ๋ฌธ์žฅ ์ „์ฒด๋ฅผ ๋‹จ์ผ ๊ณ ์ •๊ธธ์ด์˜ ๋ฒกํ„ฐ๋กœ ๋ฐ”๊ฟ”์ฃผ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ adaptiveํ•˜๊ฒŒ ๋ฒกํ„ฐ์˜ ๋ถ€๋ถ„์ง‘ํ•ฉ๋งŒ์„ ์„ ํƒํ•ด์ฃผ๊ณ  ๊ทธ๊ฒƒ์„ ๊ฐ€์ง€๊ณ  ๋ฒˆ์—ญ์„ ํ•ด์ค๋‹ˆ๋‹ค.


Background

ํ™•๋ฅ ์ ์ธ ๊ด€์ ์—์„œ '๋ฒˆ์—ญ'์ด๋ž€ ๋ฌด์—‡์„ ์˜๋ฏธํ•˜๋Š” ๊ฒƒ์ผ๊นŒ์š”?

translation is equivalent to finding a target sentence y that maximizes the conditional probability of y given a source sentence x, i.e. arg max_y p(y | x).

Once this conditional distribution has been learned by a translation model, then given a source sentence, we search for the sentence that maximizes the conditional probability. Recently, models that fit this conditional distribution with neural networks have been proposed, and they use a total of two RNNs: as mentioned above, one plays the role of the encoder and the other the role of the decoder. Even though this is a recent approach, it has already shown very promising results.

2.1 RNN Encoder-Decoder

  1. First, the encoder reads x = (x_1, ..., x_{T_x}) into a vector c.

    • h_t = f(x_t, h_{t-1}) (where f is, e.g., an LSTM)
    • c = q({h_1, ..., h_{T_x}}) (e.g. q({h_1, ..., h_T}) = h_T)
  2. Here h_t is the hidden state at time t, and c is a vector generated from the sequence of hidden states.

  3. The decoder is trained to predict y_{t'} given c and all the previously predicted words {y_1, ..., y_{t'-1}}. The decoder then computes the probability p(y_{t'} | {y_1, ..., y_{t'-1}}, c) = g(y_{t'-1}, s_{t'}, c).

  4. Here g is a nonlinear, multi-layered function that outputs the probability of y_t, and s_t is the hidden state of the RNN. A toy sketch of this whole setup follows below.
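
To make the vanilla encoder-decoder concrete, here is a minimal NumPy sketch. It uses a plain tanh RNN in place of the LSTM, and all sizes and weights are made-up toy values, not anything from the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 20, 8, 16                      # toy vocab, embedding, hidden sizes
embed = rng.normal(size=(V, E))
W_e, U_e = rng.normal(size=(H, E)), rng.normal(size=(H, H))
W_d, U_d, C_d = rng.normal(size=(H, E)), rng.normal(size=(H, H)), rng.normal(size=(H, H))
W_out = rng.normal(size=(V, H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(src_ids):
    h = np.zeros(H)
    for i in src_ids:
        h = np.tanh(W_e @ embed[i] + U_e @ h)   # h_t = f(x_t, h_{t-1})
    return h                                     # c = q({h_1,...,h_Tx}) = h_Tx

def decode_step(y_prev_id, s_prev, c):
    # s_t depends on the previous word, the previous state, and the fixed vector c
    s = np.tanh(W_d @ embed[y_prev_id] + U_d @ s_prev + C_d @ c)
    p = softmax(W_out @ s)                       # p(y_t | y_<t, c) = g(y_{t-1}, s_t, c)
    return s, p

c = encode([3, 7, 1])                            # whole sentence squeezed into one vector
s, p = decode_step(0, np.zeros(H), c)
print(p.argmax())                                # greedy pick of the first target word
```

Note how the single vector c is the only thing the decoder ever sees of the source sentence; this is exactly the bottleneck the paper's attention mechanism removes.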


Learning to align and translate

3.1 Decoder

The conditional probability is computed through p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i), so three ingredients are needed! First, s_i is the decoder's RNN hidden state, computed through s_i = f(s_{i-1}, y_{i-1}, c_i). Second, c_i is computed from the annotations (h_1, ..., h_{T_x}); these are the hidden states the encoder produced while mapping the input sentence. Each h_i therefore carries information that is most strongly influenced by the parts surrounding the i-th word. Here c_i is computed using α_ij and h_j.

And α_ij is computed with the equation below:

α_ij = exp(e_ij) / Σ_{k=1}^{T_x} exp(e_ik), where e_ij = a(s_{i-1}, h_j) is an alignment model scoring how well the inputs around position j match the output at position i.

: ์ด๋Ÿฐ์‹์œผ๋กœ annotation๋“ค์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•œ ํ•ฉ๊ณผ ๊ฐ™์€ ์ ‘๊ทผ๋ฒ•์„ expected annotation์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ ์ด๋ผ๊ณ ๋„ ๋ถ€๋ฆ…๋‹ˆ๋‹ค. alpha ij๋ฅผ ํƒ€๊ฒŸ ๋‹จ์–ด yi๊ฐ€ source word xj์— align๋  ํ™•๋ฅ ์ด๋ผ๊ณ  ๋ด…์‹œ๋‹ค. ๊ทธ๋Ÿผ ci๋Š” alpha ij๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ณ„์‚ฐ๋œ expected annotation์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๊ฒŸ์ฃ . ์™œ๋ƒํ•˜๋ฉด ci๋Š” alpha ij * hj์˜ ํ•ฉ์ด๋‹ˆ๊นŒ์š”. ์ง๊ด€์ ์œผ๋กœ ๋งํ•˜์ž๋ฉด, ์ด๋Š” ๋””์ฝ”๋”์—์„œ mechanism of attention ์„ ์‹คํ˜„ํ–ˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ๋””์ฝ”๋”์˜ ์ด๋Ÿฌํ•œ ํŠน์„ฑ ๋•๋ถ„์— ์ธ์ฝ”๋”๊ฐ€ ๋ชจ๋“  ๋ฌธ์žฅ์„ ์ •ํ•ด์ง„ ๊ธธ์ด์˜ ๋ฒกํ„ฐ๋กœ ๋ฐ”๊ฟ”์•ผํ•˜๋Š” ๋ถ€๋‹ด์„ ๋œ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค!

3.2 Encoder

์ธ์ฝ”๋”๋Š” input sequence x๋ฅผ ์ฝ์Šต๋‹ˆ๋‹ค. encoder์—๋Š” BiRNN์„ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ๋Š”๋ฐ, BiRNN์€ forward RNN๊ณผ backward RNN์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹น. forward RNN์€ ๋ฌด์Šจ ์—ญํ• ์„ ํ• ๊นŒ์š”? input sequence๋ฅผ ์ฝ์–ด๋“ค์ด๊ณ  forward hidden states( ( h 1, · · · , h Tx )๋ฅผ ๊ณ„์‚ฐํ•ด์ค๋‹ˆ๋‹ค. ๋ฐ˜๋Œ€๋กœ backward RNN์€ ๋ฐ˜๋Œ€ ์ˆœ์„œ๋กœ sequence๋ฅผ ์ฝ์–ด๋“ค์ด๊ณ  backward hidden states๋ฅผ ๊ณ„์‚ฐํ•ด์ค๋‹ˆ๋‹ค. ์ด๋Ÿฐ์‹์œผ๋กœ xj์ฃผ๋ณ€์„ ํฌํ•จํ•˜๋Š” annotation hj๊ฐ€ ๊ณ„์‚ฐ๋˜๋ฉด ์ด๋Š” ๋””์ฝ”๋”์—์„œ ์‚ฌ์šฉ์ด ๋ฉ๋‹ˆ๋‹ค.


Experiment settings

4.1 Dataset

Parallel corpora were used, not a monolingual corpus (one made up of a single language). To train the models, a shortlist of the 30,000 most frequent words in each language was used.

4.2 Models

โ“sigle maxout hidden layer

  • ๊ธฐ์กด์˜ RNN Encoder-Decoder ๋ชจ๋ธ, ์ด ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•œ RNNsearch๋ชจ๋ธ ๋‘๊ฐ€์ง€๋ฅผ ํ•™์Šต.
  • ์ฒ˜์Œ์—” 30๊ฐœ ๋‹จ์–ด๋ฅผ ํฌํ•จํ•œ RNNencdec-30, RNNsearch-30์„ ๋‘๋ฒˆ์งธ์—” 50๊ฐœ ๋‹จ์–ด๋ฅผ ํฌํ•จํ•œ RNNencdec-50, RNNsearch-50์„ ์‚ฌ์šฉ.
  • ๊ฐ ๋‹จ์–ด์˜ˆ์ธก์— ํ•„์š”ํ•œ conditional probability๊ณ„์‚ฐ์„ ์œ„ํ•ด์„œ multilayer network with a single maxout hidden layer์‚ฌ์šฉ
  • SGD ์‚ฌ์šฉ
  • ์ œ์ผ ๊ทผ์ ‘ํ•˜๊ฒŒ conditional probability๋ฅผ ์ตœ๋Œ€ํ™” ์‹œํ‚ค๊ธฐ ์œ„ํ•ด beam search ์‚ฌ์šฉ

Results

5.1 Quantitative results

  • Measured by BLEU score, RNNsearch far outperforms RNNencdec.
  • RNNencdec's performance drops as sentences get longer, whereas RNNsearch holds up well.

5.2 Qualitative analysis

  1. alignment

    • The alignment between English and French is largely monotonic.
    • In each matrix, the largest weights can be seen along the diagonal.
    • The advantage of soft-alignment? When translating "the man", the translation of "the" depends heavily on "man", and soft alignment makes handling this possible.
  2. long sentences

    • The model is clearly better than previous models at handling long sentences.

Related Work

6.1 Learning to align

A similar aligning approach was proposed for handwriting synthesis; the difference is that there the modes of the alignment weights move in only one direction. This is said to be a severe limitation for machine translation, because translating into a grammatically correct sentence often requires (long-distance) reordering.

์ด ์ ‘๊ทผ๋ฒ•์€ source sentence์˜ ๋ชจ๋“  ๋‹จ์–ด์— annotation weight์„ ๊ณ„์‚ฐํ•ด์ค€๋‹ค.

6.2 Neural networks for machine translation

Previously, neural networks were used either to provide a single feature to an existing statistical machine translation system or to re-rank a list of candidate translations. They were also used as a target-side language model, whereas the model proposed in this paper generates the translation directly from the source sentence and operates as a complete system on its own.


Conclusion

This frees the model from having to encode a whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word.

๋”ฐ๋ผ์„œ RNNsearch๋Š” ๊ธด ๋ฌธ์žฅ์„ ๋ฒˆ์—ญํ•  ๋•Œ ๊ต‰์žฅํžˆ ์œ ์šฉํ•˜๊ณ  ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ƒ…๋‹ˆ๋‹ค. ์•ž์œผ๋กœ๋Š” unknown , rare ๋‹จ์–ด๋“ค์„ ๋” ์ž˜ ๋‹ค๋ฃจ๋Š” ๊ฒƒ์„ ํ•ด๊ฒฐํ•ด์•ผ ํ•œ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.