
Transformer์˜ positional encoding (PE)

๋ณต๋งŒ 2021. 10. 18. 18:09

Transformer์„ ๊ตฌ์„ฑํ•˜๋Š” Multi-Head Self-Attention layer๋Š” permutation equivariantํ•œ ํŠน์„ฑ์„ ๊ฐ–๊ธฐ ๋•Œ๋ฌธ์—, postitional encoding์ด ํ•„์ˆ˜์ ์œผ๋กœ ํ•„์š”ํ•˜๋‹ค. 


Transformer์—์„œ ์‚ฌ์šฉํ•˜๋Š” positional encoding

 

First, the positional encoding used in the Transformer is defined as follows.

 

$PE_{(pos,2i)}=\sin\!\left(pos/10000^{2i/d_{model}}\right)$

$PE_{(pos,2i+1)}=\cos\!\left(pos/10000^{2i/d_{model}}\right)$

 

Writing this out, the encoding of position $pos$ is a $d_{model}$-dimensional vector whose entries alternate between sines and cosines of geometrically decreasing frequencies:

$PE_{pos} = \left[\sin\!\left(\frac{pos}{10000^{0/d_{model}}}\right),\ \cos\!\left(\frac{pos}{10000^{0/d_{model}}}\right),\ \sin\!\left(\frac{pos}{10000^{2/d_{model}}}\right),\ \cos\!\left(\frac{pos}{10000^{2/d_{model}}}\right),\ \dots\right]$

Visualizing the full matrix of encodings (position on one axis, dimension on the other) gives the familiar striped pattern of sinusoids whose wavelength grows with the dimension index.

 

 
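As a concrete reference, the matrix can be built with a few lines of PyTorch. This is a minimal sketch, not code from the original post; `max_len` and `d_model` are assumed parameters.

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of Transformer positional encodings."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # 0, 2, 4, ... (the 2i in the formula)
    div = torch.pow(10000.0, two_i / d_model)                       # 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos / div)   # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(max_len=50, d_model=128)   # plotting this matrix reproduces the usual PE heatmap
```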

๋ณธ ๊ธ€์—์„œ๋Š” ์™œ transformer์˜ positional encoding์ด ์ด๋ ‡๊ฒŒ ๋ณต์žกํ•œ ํ˜•ํƒœ๋ฅผ ์‚ฌ์šฉํ•˜๊ฒŒ ๋๋Š”์ง€๋ฅผ ์•Œ์•„๋ณด๊ธฐ๋กœ ํ•œ๋‹ค.


Other possible methods

์œ„์˜ ์‚ฌ์šฉํ•œ ๋ฐฉ๋ฒ• ์™ธ์—๋„ positional encoding์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค.

 

Method 1

The simplest approach is to assign each position a value between 0 and 1, evenly spaced along the sequence.

PyTorch๋กœ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

import torch
pos = torch.arange(max_len) / max_len  # max_len = sequence length; evenly spaced values in [0, 1)

 

https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3

 

์ด ๋ฐฉ๋ฒ•์˜ ๋‹จ์ ์€, ๋ฌธ์žฅ ๊ธธ์ด๊ฐ€ ๋‹ฌ๋ผ์ง€๋ฉด ๊ฐ time step ๊ฐ„์˜ ์ฐจ์ด, ์ฆ‰ delta๊ฐ’์ด ๋‹ฌ๋ผ์ง„๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

 

Method 2

The second approach is to use $i$ itself as the position value of the $i$-th token. The step between adjacent time steps is now constant, but the position values grow with the input length, and since they are not normalized (they are no longer confined to the 0~1 range), training can become very unstable.
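A one-line sketch of this variant, again assuming `torch` is imported and `max_len` is the sequence length:

pos = torch.arange(max_len, dtype=torch.float32)  # 0, 1, 2, ..., max_len-1: unbounded as sequences get longer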

 

https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3

 

Method 3

The third approach is to write Method 2's position values in binary. Each scalar position value is expressed in binary, and the positional encoding is given as many dimensions as needed ($d_{model}$), turning the scalar into a vector; the positional encoding is now a position vector. Since every entry is either 0 or 1, the problem of Method 2 disappears.
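A minimal sketch of this binarization, with names of my own choosing:

```python
import torch

def binary_pe(max_len: int, d_model: int) -> torch.Tensor:
    """(max_len, d_model) matrix whose i-th row is position i in binary, least-significant bit first."""
    pos = torch.arange(max_len).unsqueeze(1)    # (max_len, 1) integer positions
    bit = torch.arange(d_model)                 # bit index j = 0 .. d_model - 1
    return ((pos >> bit) & 1).float()           # extract the j-th bit of each position

print(binary_pe(8, 4))  # column j alternates 0/1 with period 2^(j+1)
```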

 

๋‹จ, ์ด ๋ฐฉ๋ฒ•์˜ ๋‹จ์ ์€ ๊ฐ ๊ฐ’๋“ค์ด ์–ด๋– ํ•œ continuous function์˜ ์ด์ง„ํ™”๋œ ๊ฒฐ๊ณผ๋กœ ๋งŒ๋“ค์–ด์ง„ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, discrete function์œผ๋กœ๋ถ€ํ„ฐ ์™”๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ interpolation ๊ฐ’์„ ์–ป๊ธฐ๊ฐ€ ํž˜๋“ค๋‹ค. ์šฐ๋ฆฌ๋Š” ์—ฐ์†์ ์ธ ํ•จ์ˆ˜๋กœ๋ถ€ํ„ฐ ๊ฐ position vector์˜ ๊ฐ’์„ ์–ป๋Š” ๋ฐฉ๋ฒ•์„ ์ฐพ๊ณ ์ž ํ•œ๋‹ค.

 

https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3

 


Continuous binary vector

 

Method 3์—์„œ ๊ฐ position ๊ฐ’์„ ์ด์ง„ํ™”ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด, ๊ฐ ์ฐจ์›์˜ ๊ฐ’๋“ค์ด 0๊ณผ 1์„ ์ˆœํ™˜ํ•˜๋ฉฐ ๋ณ€ํ™”ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

 

https://skyjwoo.tistory.com/entry/positional-encoding%EC%9D%B4%EB%9E%80-%EB%AC%B4%EC%97%87%EC%9D%B8%EA%B0%80

 

์˜ˆ๋ฅผ ๋“ค์–ด, ๊ฐ€์žฅ ์ž‘์€ bit๋Š” ํ•œ ์ˆซ์ž๋งˆ๋‹ค 0๊ณผ 1์ด ๋ฐ”๋€Œ๊ณ , ๋‘ ๋ฒˆ์งธ๋กœ ์ž‘์€ bit๋Š” ๋‘ ์ˆซ์ž๋งˆ๋‹ค 0๊ณผ 1์ด ๋ฐ”๋€๋‹ค. ์ด๋Ÿฌํ•œ ๊ทœ์น™์„ ํ†ตํ•ด ์šฐ๋ฆฌ๋Š” ์—ฐ์†์ ์ด๋ฉด์„œ 0๊ณผ 1์„ ์ˆœํ™˜ํ•˜๋Š” ํ•จ์ˆ˜์ธ ์‚ผ๊ฐํ•จ์ˆ˜๋ฅผ ์ƒ๊ฐํ•ด ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

As a simple analogy, imagine a set of dials that control a volume, each dial adjusting the volume by a different amount. The first dial adjusts the volume very finely, one step at a time; the second dial moves it two steps at a time; the third dial four steps at a time; and so on. This is exactly how binary numbers work: a volume with 256 levels can be covered with 8 such dials ($2^8 = 256$). And because the dials themselves turn smoothly, they can also express continuous values in between.

 

https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3

 

์ด์ œ ์ด๋ฅผ ์ˆ˜์‹์œผ๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค. ์ฒซ ๋ฒˆ์žฌ ๋‹ค์ด์–ผ์€ ๋ณผ๋ฅจ์ด ํ•˜๋‚˜ ์ปค์งˆ ๋•Œ๋งˆ๋‹ค 0<->1๋กœ ๊ฐ’์ด ๋ฐ”๋€Œ๊ณ , ๋‘ ๋ฒˆ์งธ ๋‹ค์ด์–ผ์€ ๋ณผ๋ฅจ์ด ๋‘ ๊ฐœ ์ปค์งˆ ๋•Œ๋งˆ๋‹ค, ์„ธ ๋ฒˆ์งธ ๋‹ค์ด์–ผ์€ ๋ณผ๋ฅจ์ด ๋„ค ๊ฐœ ์ปค์งˆ ๋•Œ๋งˆ๋‹ค ๋ฐ”๋€Œ์–ด์•ผ ํ•œ๋‹ค. ์ฆ‰, ๊ฐ ๋‹ค์ด์–ผ์˜ ์ฃผ๊ธฐ๊ฐ€ $\pi/2$, $\pi/4$, $\pi/8$์ธ sineํ•จ์ˆ˜๋ฅผ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด๋‹ค. 

 

 

๋”ฐ๋ผ์„œ, ์šฐ๋ฆฌ๋Š” ๊ฐ positional encoding tensor์„ ๋‹ค์Œ๊ณผ ๊ฐ™์€ matrix $M$์œผ๋กœ ๋‚œํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค. $i$๋Š” ๊ฐ sequence์˜ index๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ , $j$๋Š” position encoding์˜ dimension์„ ๋‚˜ํƒ€๋‚ธ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค.

 

$M_{ij} = \sin\!\left(2\pi i/2^j\right) = \sin(x_i w_j)$, where $x_i = i$ and $w_j = 2\pi/2^j$.
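
The following sketch (again with names of my own) builds this matrix; unlike the binary version, the rows now come from a continuous function of the position:

```python
import math
import torch

def continuous_binary_pe(max_len: int, d_model: int) -> torch.Tensor:
    """(max_len, d_model) matrix with M[i, j] = sin(2 * pi * i / 2**j), j = 1 .. d_model."""
    x = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # x_i = i
    j = torch.arange(1, d_model + 1, dtype=torch.float32)         # dimension index j
    w = 2 * math.pi / (2 ** j)                                    # w_j = 2*pi / 2^j
    return torch.sin(x * w)

M = continuous_binary_pe(16, 4)   # intermediate (non-integer) positions can now be interpolated
```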

 

์ด๋กœ์จ ์šฐ๋ฆฌ๋Š” interpolation์ด ๊ฐ€๋Šฅํ•œ position encoding ๋ฐฉ๋ฒ•์„ ์ฐพ์•˜๋‹ค. Position encoding vector์˜ dimension์„ 3์ด๋ผ๊ณ  ํ–ˆ์„ ๋•Œ ๊ฐ point๋ฅผ ์‹œ๊ฐํ™”ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3

 


Problem 1: closed curve

 

๊ทธ๋Ÿผ์—๋„ ์—ฌ์ „ํžˆ ๋ช‡ ๊ฐ€์ง€ ๋ฌธ์ œ๊ฐ€ ๋‚จ์•„์žˆ๋Š”๋ฐ, ์ฒซ ๋ฒˆ์งธ๋Š” position encoding์˜ ๊ฐ’๋“ค์˜ ๋ฒ”์œ„๊ฐ€ ๋‹ซํ˜€ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ํ‘œํ˜„ ๊ฐ€๋Šฅํ•œ ๋งˆ์ง€๋ง‰ position์ด n์ด๋ผ๊ณ  ํ•  ๋•Œ, ๋‹ค์Œ ์ˆ˜์ธ n+1์€ ์ฒซ ๋ฒˆ์งธ ๊ฐ’๊ณผ ๊ฐ™์€ position encoding ๊ฐ’์„ ๊ฐ–๊ฒŒ ๋˜์–ด ๋ฒ„๋ฆฐ๋‹ค. 

 

To solve this, instead of letting the encoding values cycle up and down and wrap around, we make them keep increasing over the positions we care about. The formula for the positional encoding tensor changes as follows: the frequencies are spaced geometrically down to a very small value, so the slowest dimensions keep increasing without ever approaching the boundary ($-1$ or $1$), and the curve no longer closes within any realistic sequence length.

 

$M_{ij} = \sin\!\left(x_i\, w_0^{\,j/d_{model}}\right)$

 

Transformer์—์„œ๋Š” $w_0$์˜ ๊ฐ’์„ 1/10000์œผ๋กœ ์„ค์ •ํ•˜์˜€๋‹ค.

 


Problem 2: linearly transformable

 

Position encoding์„ transformer์—์„œ ์ด์šฉํ•  ๋•Œ, ๊ฐ ์œ„์น˜์˜ position encoding ๊ฐ’์„ ๋‹ค๋ฅธ ์œ„์น˜์˜ position encoding ๊ฐ’์œผ๋กœ linear translation๋งŒ์„ ํ†ตํ•ด ๋ณ€ํ™˜ํ•˜๋Š” ๊ฒƒ์ด ํ•„์š”ํ•  ๋•Œ๊ฐ€ ์žˆ๋‹ค.

 

์ด๋Š” attention layer์—์„œ ์ด์šฉํ•˜๊ธฐ ์œ„ํ•จ์ธ๋ฐ, ์˜ˆ๋ฅผ ๋“ค์–ด "I am going to eat" ์ด๋ผ๋Š” ๋ฌธ์žฅ์ด ์žˆ๋‹ค๊ณ  ํ•  ๋•Œ, "eat"์„ ๋ฒˆ์—ญํ•˜๊ธฐ ์œ„ํ•ด "I"๋ผ๋Š” ๋‹จ์–ด์— ์ดˆ์ ์„ ๋งž์ถ”์–ด์•ผ ํ•˜๋Š”๋ฐ, ๋‘˜์˜ ์œ„์น˜๋Š” ๋ฉ€๋ฆฌ ๋–จ์–ด์ ธ ์žˆ๋‹ค. ์ด ๋•Œ ๋‘ ๋‹จ์–ด์˜ position encoding์ด linear transformation์„ ํ†ตํ•ด ๋ณ€ํ™˜์ด ๊ฐ€๋Šฅํ•˜๋‹ค๋ฉด attention์„ ์‚ฌ์šฉํ•˜๊ธฐ์— ๋ณด๋‹ค ์šฉ์ดํ•  ๊ฒƒ์ด๋‹ค. 

 

๊ฐ position vector ๊ฐ„ linear transformation์ด ๊ฐ€๋Šฅํ•˜๋ ค๋ฉด ๋‹ค์Œ์˜ ์‹์„ ๋งŒ์กฑํ•˜๋Š” linear transformation $T(dx)$๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค.

 

$PE(x+\Delta x) = T(\Delta x) \cdot PE(x)$

 

Because the PE is built from trigonometric functions, such a $T$ can be constructed from rotation matrices.

 

Recall the 2D rotation matrix:

$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$

 

Position encoding์„ ๋ณ€ํ™˜ํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ตฌ์„ฑํ•œ๋‹ค๋ฉด, 

 

 

$T(\Delta x)$๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค.

 

 

 

 

 
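As a sanity check, here is a minimal numeric sketch (names are my own) verifying that such a block-diagonal rotation maps $PE(x)$ to $PE(x+\Delta x)$:

```python
import math
import torch

d_model, x, dx = 8, 5.0, 3.0
w = (1.0 / 10000) ** (2 * torch.arange(d_model // 2, dtype=torch.float32) / d_model)

def pe(pos: float) -> torch.Tensor:
    """Positional encoding as interleaved (sin, cos) pairs, one pair per frequency."""
    return torch.stack([torch.sin(w * pos), torch.cos(w * pos)], dim=1).reshape(-1)

# Block-diagonal T(dx): a 2x2 rotation by w_k * dx for each frequency k.
blocks = [torch.tensor([[math.cos(wk * dx), math.sin(wk * dx)],
                        [-math.sin(wk * dx), math.cos(wk * dx)]]) for wk in w.tolist()]
T = torch.block_diag(*blocks)

print(torch.allclose(T @ pe(x), pe(x + dx), atol=1e-5))  # True: T(dx) depends only on dx, not on x
```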

Source: https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3

๋ฐ˜์‘ํ˜•