hye-log

[Boostcamp AI Tech] WEEK 05_DAY 23

iihye_ 2022. 10. 22. 16:45

🥔 Individual study


[9] Multi-modal

1. Overview of multi-modal learning

1) Multi-modal learning : learning from data that differ in type, form, and characteristics

2) Challenges

- Each kind of data has a different form, so each is represented differently

- The amount of information coming from different modalities is unbalanced

- A model that uses several modalities can become biased toward some of them

3) Multi-modal learning is used in various ways, such as matching, translating, and referencing

 

2. Multi-modal tasks(1) - Visual data & Text

1) Text embedding

- Characters are hard to use directly in ML -> represent them as dense vectors

- The resulting embeddings generalize

e.g. the offset from man to woman is similar to the offset from king to queen (man - woman ≈ king - queen)

- skip-gram model : uses the relationships between words to predict the N words surrounding a given word
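
To make the skip-gram idea concrete, here is a minimal sketch of how (center word, context word) training pairs could be built from a sentence. The function name `skipgram_pairs` and the `window` parameter are my own illustrative choices, not something from the lecture, and the actual model training is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the context window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the king rules the land".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A skip-gram model would then be trained to predict the context word from the center word for each of these pairs.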

2) Joint embedding

(1) Image tagging

- Generate tags from a given image, or produce an image from given tags

- Represent image and text as vectors of the same dimension, so that the distance between the embedding vectors is small when they are similar and large when they are different

- Once both are mapped into the same embedding space, do metric learning: shrink the distance if the text and image are a pair, and make the distance large if they are not
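
The metric-learning rule above can be sketched with a margin-based contrastive loss. This is only one common way to write it, chosen here as an assumption; the lecture does not say which loss was used, and the toy 2-D vectors and the name `contrastive_loss` are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(img_vec, txt_vec, is_pair, margin=1.0):
    """Pull matching pairs together, push non-matching pairs at least `margin` apart."""
    d = euclidean(img_vec, txt_vec)
    if is_pair:
        return d ** 2                    # a matching pair is penalized for being far apart
    return max(0.0, margin - d) ** 2     # a non-pair is penalized only if closer than the margin


# Hypothetical embeddings that already live in the same 2-D space.
img = [0.9, 0.1]
tag_dog = [1.0, 0.0]
tag_car = [0.0, 1.0]

print(contrastive_loss(img, tag_dog, is_pair=True))    # small: the pair is already close
print(contrastive_loss(img, tag_car, is_pair=False))   # zero: the non-pair is already far apart
```

Minimizing this loss over many image/text examples is what moves pairs closer and non-pairs apart.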

(2) Image & food recipe retrieval

- The recipe is passed through an RNN to extract a fixed-length vector

- A cosine similarity loss is used: the loss is low when a matching recipe and image have high similarity, and high similarity between an unrelated recipe and image is penalized

- A semantic regularization loss is used to incorporate high-level semantics
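
As a rough sketch of the cosine similarity loss idea, the snippet below scores a recipe vector against an image vector. The exact formula (including the `margin` value) is an assumption in the style of a cosine embedding loss, not the formula from the retrieval paper, and both vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_loss(recipe_vec, image_vec, is_match, margin=0.1):
    """Low loss when a matching pair is similar or a non-matching pair is dissimilar."""
    sim = cosine_similarity(recipe_vec, image_vec)
    if is_match:
        return 1.0 - sim              # matching pair: want similarity close to 1
    return max(0.0, sim - margin)     # non-matching pair: want similarity below the margin


recipe = [1.0, 0.0]
same_dish = [1.0, 0.0]
other_dish = [0.0, 1.0]

print(cosine_loss(recipe, same_dish, is_match=True))
print(cosine_loss(recipe, other_dish, is_match=False))
print(cosine_loss(recipe, same_dish, is_match=False))
```

The third call shows the case the loss is meant to discourage: two embeddings that look identical even though they are labeled as unrelated.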

3) Cross modal translation

(1) Image captioning

- The image is learned with a CNN and the sentence is learned with an RNN

(2) Show and tell

- The encoder is a CNN model pre-trained on ImageNet

- The decoder is an LSTM module

(3) Show, attend and tell

- Its key idea is to look at the salient parts first, the way a person's gaze moves (attention)

- The feature map (heatmap) obtained by passing the image through a CNN is combined with the attention grid produced by the RNN, and the combined vector is the output
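
The "combine the feature map with the attention grid" step can be sketched as soft attention: score each image region against the decoder state, turn the scores into weights with a softmax, and take the weighted sum of the region features. Scoring with a plain dot product is a simplification I am assuming here for brevity; the real model learns its scoring function, and the names `attend`, `features`, and `hidden` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, hidden):
    """features: one vector per image region; hidden: decoder state of the same size."""
    # Score each region by its dot product with the decoder state (assumed scoring rule).
    scores = [sum(f_k * h_k for f_k, h_k in zip(f, hidden)) for f in features]
    weights = softmax(scores)          # the attention grid, flattened to a list
    dim = len(features[0])
    # Context vector: attention-weighted sum of the region features.
    context = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
    return weights, context


regions = [[1.0, 0.0], [0.0, 1.0]]     # two toy image regions
weights, context = attend(regions, hidden=[10.0, 0.0])
print(weights)
print(context)
```

With this decoder state almost all of the weight lands on the first region, so the context vector is close to that region's feature.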

(4) Visual question answering

- The image stream extracts features from the image, and the question stream encodes the text sequence with an RNN

 

3. Multi-modal tasks(2) - Visual data & Audio

1) Sound representation

https://hyunlee103.tistory.com/ ⓒ Naver Connect Foundation

- Waveform -> Power spectrum -> Spectrogram

- Fourier transform : converts the waveform into a power spectrum -> switches from the time axis to the frequency axis to show how much of each sinusoid (trigonometric function) the signal contains

- Spectrogram : shows how the frequency components change over time
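
The Waveform -> Power spectrum -> Spectrogram pipeline can be sketched with a naive discrete Fourier transform. This is a slow, stdlib-only illustration under my own assumptions: the frame size, hop size, and test signal are arbitrary, and no window function or log/mel scaling is applied, which a practical implementation would normally add.

```python
import cmath
import math


def power_spectrum(frame):
    """Power of each frequency bin (0 .. n/2) of one frame, via a naive DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectrogram(wave, frame_size=16, hop=8):
    """Slide a frame over the waveform and stack the power spectra over time."""
    starts = range(0, len(wave) - frame_size + 1, hop)
    return [power_spectrum(wave[s:s + frame_size]) for s in starts]


# Toy waveform: a sine that completes 2 cycles every 16 samples.
wave = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
spec = spectrogram(wave)

for row in spec:
    peak_bin = max(range(len(row)), key=row.__getitem__)
    print("strongest frequency bin:", peak_bin)
```

Each row of `spec` is one time step and each column is one frequency bin, which is exactly the time-versus-frequency picture a spectrogram draws.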

2) Joint embedding

(1) Scene recognition by sound

- SoundNet : learns an audio representation from the RGB frames of videos (the visual side supervises the audio network)

3) Cross modal translation

(1) Speech2Face

- A model that hears a voice and imagines the speaker's face

(2) Image-to-speech synthesis

- Looks at an image and produces speech

- The image is fed into a CNN model that outputs units, and the units are fed into a TTS (Text-to-Speech) model that outputs speech

4) Cross modal reasoning

(1) Sound source localization

- Find where in the image the sound is coming from

- The image and the audio are each fed into a CNN, the spatial features of the image are kept, and a localization score is output
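
One way to picture the localization score is to compare a single audio feature vector with the visual feature at every spatial position. The sketch below assumes the two CNNs have already produced these features and uses a plain inner product as the score; the function names and the toy 2x2 grid are hypothetical, and the real model learns this comparison rather than hard-coding it.

```python
def localization_scores(visual_grid, audio_vec):
    """visual_grid: H x W grid of feature vectors. Returns an H x W grid of scores."""
    return [
        [sum(v * a for v, a in zip(cell, audio_vec)) for cell in row]
        for row in visual_grid
    ]


def best_location(scores):
    """Return the (row, col) of the highest localization score."""
    best = max((s, (r, c)) for r, row in enumerate(scores) for c, s in enumerate(row))
    return best[1]


# Toy 2x2 visual feature map; keeping the grid is what preserves spatial information.
visual_grid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.5]],
]
audio_vec = [0.0, 1.0]

scores = localization_scores(visual_grid, audio_vec)
print(scores)
print("sound most likely comes from cell", best_location(scores))
```

Because the image features are never pooled into a single vector, each score can be mapped back to a position in the image.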



🥔 Today's retrospective

Mentoring started today! During the session our mentor resolved(?) one of my career-related worries and reviewed a paper for us. They explained how they came to choose it, which parts of a paper deserve a careful read, and even how to apply a paper to your own field, so I now know what to focus on when reviewing a paper. In the afternoon I took one lecture. I had only ever heard of multi-modal and this was my first time studying the theory, so it was hard, but it was new to me that models are trained on, and get information from, a wider variety of data than I expected. In the special peer session we each talked about how our teams are being formed right now and what topics to pick, and everyone said the same thing: we have only just started CV, so choosing a team is really hard... In the peer session we wrote a retrospective to wrap up the week and discussed how to run Git for the competition that starts next week. Honestly I still don't know anything, so I can't tell what the right answer is, but since this is everyone's first project we agreed that we should just dive in and learn as much as we can. This week passed safely too, so let's work hard in next week's competition!!!!
