hye-log

[Boostcamp AI Tech] WEEK 02_DAY 09

Boostcourse/AI Tech 4th cohort

iihye_ 2022. 9. 29. 20:49

🎀 Individual study


[8] Multi-GPU Training

1. Concept summary

1) Single (one) vs. Multi (two or more)

2) GPU vs. Node (a system, i.e. one computer)

3) Single Node Single GPU (one GPU in one computer)

4) Single Node Multi GPU (multiple GPUs in one computer)

5) Multi Node Multi GPU (multiple GPUs spread across multiple computers, e.g. a server room)

 

2. Model parallel

1) Two ways to distribute training across multiple GPUs: split the model or split the data

2) Model parallelism is a difficult task because of bottlenecks in the model, the difficulty of pipelining, and so on

3) AlexNet: an early example of model parallelism, with the network split across two GPUs

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Communications of the ACM 60.6 (2017): 84-90.

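# Model parallel (requires at least two GPUs: cuda:0 and cuda:1)

import torch.nn as nn
from torchvision.models.resnet import ResNet, Bottleneck

num_classes = 1000  # assumed value; not given in my notes


class ModelParallelResNet50(ResNet):
    def __init__(self, *args, **kwargs):
        super(ModelParallelResNet50, self).__init__(
            Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs)

        # Assign the first part of the model to cuda:0
        self.seq1 = nn.Sequential(self.conv1, self.bn1, self.relu, self.maxpool, self.layer1, self.layer2).to('cuda:0')

        # Assign the second part of the model to cuda:1
        self.seq2 = nn.Sequential(self.layer3, self.layer4, self.avgpool).to('cuda:1')

        # The final layer must sit on the same device as seq2's output
        self.fc.to('cuda:1')

    # Connect the two parts: move seq1's output from cuda:0 to cuda:1
    def forward(self, x):
        x = self.seq2(self.seq1(x).to('cuda:1'))
        return self.fc(x.view(x.size(0), -1))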

 

3. Data parallel

1) Split the data, assign the pieces to the GPUs, then average the results

2) The computation runs on several GPUs at once

3) DataParallel: simply distributes the data and then averages -> unbalanced GPU usage, reduced batch size

4) DistributedDataParallel: creates a process per CPU and assigns each one to its own GPU -> each process computes the average on its own
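To make the "split the data, then average" idea concrete for myself, here is a minimal sketch in plain Python with no GPUs involved. Everything in it is an assumption for illustration: the toy model `y_hat = w * x`, the helper names `grad_mse` and `split`, and `n_workers` standing in for the number of GPUs. It only shows that averaging per-shard gradients matches the full-batch gradient when the shards are the same size.

```python
# Toy data-parallel step in plain Python (illustrative only, no GPUs).
# Model: y_hat = w * x, loss: mean squared error.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split(seq, n_parts):
    """Split seq into n_parts equal contiguous shards (length must divide evenly)."""
    size = len(seq) // n_parts
    return [seq[i * size:(i + 1) * size] for i in range(n_parts)]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
n_workers = 2  # hypothetical number of GPUs

# Each "worker" computes the gradient on its own shard of the batch.
shard_grads = [
    grad_mse(w, sx, sy)
    for sx, sy in zip(split(xs, n_workers), split(ys, n_workers))
]

# Averaging the per-shard gradients gives the full-batch gradient.
avg_grad = sum(shard_grads) / n_workers
full_grad = grad_mse(w, xs, ys)
print(shard_grads, avg_grad, full_grad)
```

With unequal shard sizes a plain average would no longer match exactly, which is one reason the shards are usually kept the same size.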

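# Data parallel
# (model and train_data are assumed to be defined earlier)

import torch

parallel_model = torch.nn.DataParallel(model)

# Distributed Data parallel
# (DistributedSampler expects the process group to be initialized already)

train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
shuffle = False  # must be False when a sampler is passed; the sampler does the shuffling
pin_memory = True

train_loader = torch.utils.data.DataLoader(
    train_data, batch_size=20,
    pin_memory=pin_memory, num_workers=3, shuffle=shuffle,
    sampler=train_sampler)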
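The sampler is the part that makes each process see a different slice of the data. Below is a rough plain-Python sketch of that idea, assuming no shuffling and padding by wrapping around to the start of the dataset. `shard_indices` is a hypothetical helper I wrote for illustration; it is my own approximation of the behavior, not the library's code.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Indices one process would see (no shuffling, wrap-around padding).

    Assumes the padding needed is smaller than the dataset itself.
    """
    num_samples = math.ceil(dataset_len / num_replicas)  # samples per process
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad with indices from the start so every process gets the same count.
    indices += indices[: total_size - dataset_len]
    # Each process takes every num_replicas-th index, starting at its rank.
    return indices[rank:total_size:num_replicas]


# Hypothetical setup: 10 samples shared by 4 processes.
shards = [shard_indices(10, 4, r) for r in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Every process ends up with the same number of samples, and a few indices are repeated to fill the gap when the dataset size is not a multiple of the number of processes.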

[9] Hyperparameter Tuning

1. Hyperparameter Tuning

1) Values the model does not learn on its own are set by a person

- e.g. learning rate, model size, optimizer

2) Grid search: split the range at regular intervals and try the resulting combinations of values, e.g. 0.1, 0.01, 0.001

3) Random search: pick combinations of values at random
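The difference between the two searches is easier to see in code. This is a minimal sketch using only the standard library. `validation_loss` is a made-up stand-in for a real train-and-evaluate run, and the `hidden` sizes are assumptions; only the learning rates 0.1, 0.01, 0.001 come from my notes.

```python
import itertools
import math
import random


def validation_loss(lr, hidden):
    """Hypothetical stand-in for training a model and measuring its loss."""
    return (math.log10(lr) + 2) ** 2 + abs(hidden - 64) / 64


# Grid search: try every combination of a fixed set of values.
lrs = [0.1, 0.01, 0.001]
hiddens = [32, 64]
grid = list(itertools.product(lrs, hiddens))
best_grid = min(grid, key=lambda cfg: validation_loss(*cfg))

# Random search: draw the same number of combinations at random.
rng = random.Random(0)  # fixed seed so the run is repeatable
trials = [
    (10 ** rng.uniform(-3, -1), rng.choice([32, 64, 128]))
    for _ in range(len(grid))
]
best_random = min(trials, key=lambda cfg: validation_loss(*cfg))

print("grid:", best_grid)
print("random:", best_random)
```

Sampling the learning rate on a log scale (`10 ** uniform(-3, -1)`) is a design choice on my part so that each order of magnitude gets roughly equal coverage.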

 

2. Ray

https://docs.ray.io/en/latest/index.html

1) A module that supports multi-node multi-processing

2) A module developed for parallel processing in ML/DL (the standard among ML/DL modules)

3) Collecting good data matters more than tuning hyperparameters!


🎀 Today's retrospective

In the morning I studied PyTorch lectures 8 and 9. Honestly, I have not yet had a reason to train on multiple GPUs, so the material did not fully sink in, but I listened carefully for the day I eventually do. DataParallel and DistributedDataParallel feel difficult no matter how many times I hear about them ㅠㅠ I had heard that hyperparameter tuning can improve model performance by changing a few hyperparameter values even after the model, the data, and the training are all in place, but this was my first time tuning hyperparameters with a module called Ray, so I decided to use it the next time I train a model!

In the afternoon there was mentoring, a peer session, and office hours. In mentoring, our mentor introduced several deep learning communities and handy tools, and answered questions about career concerns and computer vision ㅎㅎ It only happens once a week, but I learned a lot beyond the AI coursework itself, so it was a useful session. In the peer session, now that the assignment deadline had passed, we talked about what each of us had found difficult. Compared with last week's assignment, studying PyTorch in earnest made things harder and I could not finish everything.. (^.ㅠ) After about an hour's break, office hours covered the solutions to this week's basic assignment. The assignment is so thoroughly organized that I think I will keep coming back to it whenever I run into a PyTorch concept I do not know. As for the basic assignment, I will watch one lecture tomorrow and then review it again! 🌞
