Errors during training and how to fix them (DataLoader killed, Connection reset by peer, Exception 0 SIGKILL)

For the first time, I got the chance to download an open-source deep learning project from GitHub, run it, and then study and research(??) ways to improve its performance. But the very first dataset I had to handle happened to be a reeeally big one, so there were a lot of hurdles....😭😭😭 I was only trying to run the thing... but since everything was new to me, it took an enormous amount of time... and I hadn't even properly started yet.

1. RuntimeError : DataLoader worker (pid ~~) is killed by signal: Killed.

Seriously, because of this error I read every single post that comes up when you search for it on Google.

1.1 Cause of the error

1.2 Attempts

1) Reducing the batch size ❌

The first fix that comes up on Google is to reduce the batch size, so I tried every value from 512 down to 32, but the error still did not go away.

2) Setting num_workers = 0 ❌

With this setting.. the estimated training time was so long that it looked like it would never finish, so I stopped it right away..

3) Using multiple GPUs with DataParallel ❌

Yeah... no..... Even with several GPUs, the DataLoader kept dying at a certain point, for example after about a day of training.. On top of that, the GPUs were only 20%~30% utilized and never got anywhere near 100%..😭 Then I watched memory usage with htop and realized that, every time, after about a day the CPU memory (62.8G of host RAM) filled up completely, and that was exactly when the DataLoader killed error occurred.. So, annoying as it was, I went looking for yet another approach..
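Instead of staring at htop for a whole day, the same check can be logged from inside the training script. Below is a minimal sketch, assuming a Linux machine where /proc/meminfo exists and reports MemTotal and MemAvailable; the helper names (parse_meminfo, used_fraction, log_host_memory) are made up for this post and are not part of PyTorch.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def used_fraction(info):
    """Fraction of host RAM in use, based on MemTotal and MemAvailable."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return (total - available) / total


def log_host_memory(path="/proc/meminfo"):
    """Return a one-line summary of host RAM usage (Linux only)."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    total_gib = info["MemTotal"] / 1024 ** 2
    used_gib = (info["MemTotal"] - info["MemAvailable"]) / 1024 ** 2
    return (
        f"host RAM used: {used_gib:.1f} / {total_gib:.1f} GiB "
        f"({used_fraction(info):.0%})"
    )
```

Printing log_host_memory() every few hundred iterations would show whether host RAM is creeping up steadily, which is the pattern I saw before the workers got killed.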

4) Using multiple GPUs with DistributedDataParallel (reference: https://github.com/pytorch/examples/blob/master/imagenet/main.py)
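As far as I can tell, the referenced script can run one process per GPU, with each process reading only its own slice of the dataset. Here is a rough pure-Python illustration of that slicing, modeled on my understanding of torch.utils.data.DistributedSampler with shuffling turned off. shard_indices is a hypothetical helper written for this post, not a PyTorch function, and this is not the code I actually ran.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Dataset indices that one process (rank) would read, without shuffling.

    The index list is padded by wrapping around to the start so that every
    rank receives the same number of samples.
    """
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Wrap-around padding: positions past the end reuse the first indices.
    padded = [i % dataset_len for i in range(total)]
    # Each rank takes every world_size-th index, starting at its own rank.
    return padded[rank:total:world_size]


if __name__ == "__main__":
    for r in range(4):
        print(r, shard_indices(10, 4, r))
```

With 10 samples and 4 processes, every rank gets 3 indices and two samples are read twice because of the padding.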

2.

3. Exception 0 SIGKILL