
[Google AI Blog] PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization

코드아키택트 2021. 8. 7. 21:21

https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html

 

PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization

Posted by Peter J. Liu and Yao Zhao, Software Engineers, Google Research Students are often tasked with reading a document and producing...

ai.googleblog.com

※ These notes summarize the post above and add some supplementary material.

 

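Intro

- Students are often asked to read a document and write a summary of it.

- When we summarize, we do not simply copy and paste sentences; we extract the meaning and restate it (paraphrasing). This is called abstractive text summarization.

- Abstractive text summarization is one of the most challenging tasks in natural language processing,

- because it involves understanding long passages, compressing information, and generating language.

- The dominant machine learning approach is sequence-to-sequence (seq2seq) learning, which maps an input sequence to an output sequence.

- seq2seq models were originally built on RNNs, but Transformer encoder-decoder models are now preferred because they more effectively model the dependencies in the long sequences that summarization involves.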

 

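- Transformer models combined with self-supervised pre-training (BERT, GPT-2, RoBERTa, etc.) are a powerful framework for language processing.

- One motivation for PEGASUS: the self-supervised objectives used in prior work were largely agnostic to the downstream task, so the authors wanted to test whether an objective closer to summarization would perform better.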

 

- PEGASUS was published in 2020 and achieved state-of-the-art results on 12 summarization datasets (each with its own specialized domain). The code is on GitHub.

 

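A self-supervised objective for summarization

- The hypothesis: the closer the self-supervised pre-training objective is to the final downstream task, the better the fine-tuning performance.

- In PEGASUS pre-training, whole sentences are removed from a document and the model must recover them.

- This is hard even for people, and the authors do not expect perfect results.

- Still, the task pushes the model to learn general facts about the world and to distill information from the whole document.

- An advantage of this kind of self-supervision is that you can create as many training examples as there are documents,

- whereas purely supervised approaches need a person to annotate each example.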
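
The pre-training setup in the bullets above can be sketched in a few lines of Python. This is a minimal illustration and not the PEGASUS implementation: `MASK_TOKEN`, `split_sentences`, `make_gap_sentence_example`, the default `mask_ratio` of 0.3, and the regex sentence splitter are all assumptions made for the example, and sentences are picked at random only to keep it short.

```python
import random
import re
from typing import List, Tuple

# Placeholder for a removed sentence; the exact token string is an assumption.
MASK_TOKEN = "<mask_1>"


def split_sentences(document: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def make_gap_sentence_example(
    document: str, mask_ratio: float = 0.3, seed: int = 0
) -> Tuple[str, str]:
    """Turn one unlabeled document into an (input, target) training pair.

    Whole sentences are replaced by MASK_TOKEN in the input; the removed
    sentences, concatenated in document order, become the target.
    """
    sentences = split_sentences(document)
    if not sentences:
        raise ValueError("document contains no sentences")
    n_mask = min(len(sentences), max(1, round(len(sentences) * mask_ratio)))
    rng = random.Random(seed)  # random choice is a simplification for this sketch
    masked = set(rng.sample(range(len(sentences)), n_mask))
    source = " ".join(
        MASK_TOKEN if i in masked else s for i, s in enumerate(sentences)
    )
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target
```

Because the target is carved out of the document itself, every unlabeled document yields a training pair without any human annotation.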

 

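- Masking the "important" sentences worked best, because it makes the self-supervised target look even more like a summary.

- Important sentences are identified automatically with the ROUGE score: the sentences most similar to the rest of the document are chosen.

- ROUGE scores the similarity of two texts by how much their n-grams overlap.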
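
To make the selection step concrete, here is a rough sketch of an n-gram overlap score and of ranking sentences by their similarity to the rest of the document. It is a simplified stand-in, not the official ROUGE package or the PEGASUS code: the function names `rouge_n_f1` and `top_important_sentences`, the lowercase whitespace tokenization, the use of the F1 variant, and the 0 to 100 scaling are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1 (0 to 100) from clipped n-gram overlap."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # Counter '&' keeps the minimum counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 100 * 2 * precision * recall / (precision + recall)


def top_important_sentences(sentences: List[str], k: int = 1) -> List[int]:
    """Indices of the k sentences most similar to the rest of the document."""
    scored = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scored.append((rouge_n_f1(sentence, rest), i))
    # Highest score first; ties are broken by document order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return sorted(i for _, i in ranked[:k])
```

The indices returned by `top_important_sentences` are the sentences one would mask, so the model learns to regenerate the parts of the document that best represent the whole.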

 

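- Like T5 and other recent methods, the model was pre-trained on a very large corpus of web documents and then fine-tuned on the 12 downstream datasets.

- It achieved very strong results with only 5% of the number of parameters of T5.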

 

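Fine-tuning with a small number of examples

- PEGASUS showed remarkable performance on large datasets, but the surprise was that it reached near state-of-the-art performance with only a small number of examples.

Figure: ROUGE score versus number of training examples. The horizontal dotted line is the full-supervision performance of a Transformer encoder-decoder without pre-training.

- With only about 1,000 fine-tuning examples, it performed at or above that baseline.

- In other words, the model is sample-efficient: useful summarization is possible with little labeled data.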

 

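Human-quality summaries

- ROUGE gives an objective score, but it does not tell us how a summary compares with one written by a person, or how natural the sentences read.

- So the authors ran a human evaluation in which raters compared PEGASUS summaries with human-written ones (somewhat like a Turing test).

Raters were given the original document and the summaries and asked to rate them. The summaries were a mix of human-written and PEGASUS-generated ones.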

 

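- The experiment was run on 3 datasets, and raters did not consistently prefer the human-written summaries (sometimes PEGASUS was rated higher).

- Moreover, models fine-tuned with only 1,000 examples performed nearly as well.

- On XSum and CNN/Dailymail, the model reached human-like performance.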

 

A test of comprehension: counting ships

- Below is one article from the XSum dataset, together with the abstractive summary PEGASUS produced for it: https://www.bbc.com/news/uk-england-21326309

 

Navy frigates in Portsmouth 'to be sunk or scrapped'

More than 20 parties have come forward with bids to either recycle or sink four Royal Navy frigates, the BBC learns.

www.bbc.com

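- The model paraphrased the four frigates:

- it rendered HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall as "four Royal Navy frigates".

- An extractive method cannot do this, because the number "four" never appears in the article.

- To find out whether getting "four" right was a fluke or something the model actually worked out, the authors tested it.

- One way is to add or remove ships from the list and look at the result.

※ Abstractive summary: a summary that preserves the meaning through paraphrasing and similar rewording. The alternative is an extractive summary (selecting the important sentences).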

 

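- In the test, the model counts 2 to 5 ships correctly.

- With 6 ships (HMS Alphabet added to the five-ship list), the model miscounts them as seven.

- This shows that the model has learned to count small numbers of items, but that it does not generalize as elegantly as we would hope.

- Even so, the result is impressive, because the model was never explicitly taught to count.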
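
The add-or-remove test can be set up with a small helper that rebuilds the article's opening sentence for any list of ship names. This is only a sketch of how such inputs might be generated: the template follows the article's opening sentence, while `LEAD_TEMPLATE`, `join_names`, and `make_lead_sentence` are hypothetical names introduced for illustration.

```python
from typing import List

# Opening sentence of the article, with the ship list left as a slot.
LEAD_TEMPLATE = (
    "The decommissioned Type 22 frigates {ships} "
    "are currently moored in Portsmouth Harbour."
)


def join_names(names: List[str]) -> str:
    """Join names in English list style: 'A, B and C'."""
    if not names:
        raise ValueError("at least one ship name is required")
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def make_lead_sentence(names: List[str]) -> str:
    """Build the perturbed opening sentence for a given list of ships."""
    return LEAD_TEMPLATE.format(ships=join_names(names))


if __name__ == "__main__":
    base = ["HMS Cumberland", "HMS Campbeltown", "HMS Chatham", "HMS Cornwall"]
    # Five-ship variant: insert a made-up ship before the last name.
    print(make_lead_sentence(base[:3] + ["HMS Google"] + base[3:]))
```

Each generated sentence is followed by the unchanged article body and fed to the model; only the number word in the resulting summary needs to be checked.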

 

PEGASUS code and model release

- The PEGASUS model and checkpoints are released on GitHub: https://github.com/google-research/pegasus

 


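- The GitHub repository also includes code for fine-tuning PEGASUS for other uses.

The PEGASUS test results follow.

In each example, the bolded part (the list of ship names) is what gets varied; the names tell you how many ships are moored in Portsmouth Harbour. The main text, from "Bidders" onward, says that no one has submitted a proposal to preserve the decommissioned British frigates.

The Summary at the end condenses the text into one line; the thing to watch is how many ships it counts.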

The decommissioned Type 22 frigates
HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall
are currently moored in Portsmouth Harbour.
Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government's Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK's industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: "For anyone that has served on a ship it's your home, you've literally been through the wars with it... and you want them to have a noble second life. "My preference is to go for the reef and diving attraction. "We've got to get best value for the budget but a reef would also generate income for part of the country through tourism." The Ministry of Defence has previously said it will "consider all options" for the frigates to ensure "best financial return for the taxpayer". A spokeswoman would not comment on the number or nature of the bids received due to "commercial sensitivity". Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.

Model Summary: No proposals have been submitted to preserve four Royal Navy frigates for reuse, the BBC has learned.

Four ships are named in the input, and the summary also says four.

 

The decommissioned Type 22 frigates
HMS Cumberland, HMS Campbeltown, HMS Chatham, HMS Google and HMS Cornwall
are currently moored in Portsmouth Harbour.
Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government's Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK's industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: "For anyone that has served on a ship it's your home, you've literally been through the wars with it... and you want them to have a noble second life. "My preference is to go for the reef and diving attraction. "We've got to get best value for the budget but a reef would also generate income for part of the country through tourism." The Ministry of Defence has previously said it will "consider all options" for the frigates to ensure "best financial return for the taxpayer". A spokeswoman would not comment on the number or nature of the bids received due to "commercial sensitivity". Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.

Model Summary: No bids have been submitted for the sale of five Royal Navy frigates, the BBC understands.

The five ships are counted correctly.

 

The decommissioned Type 22 frigates
HMS Google and HMS Alphabet
are currently moored in Portsmouth Harbour.
Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government's Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK's industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: "For anyone that has served on a ship it's your home, you've literally been through the wars with it... and you want them to have a noble second life. "My preference is to go for the reef and diving attraction. "We've got to get best value for the budget but a reef would also generate income for part of the country through tourism." The Ministry of Defence has previously said it will "consider all options" for the frigates to ensure "best financial return for the taxpayer". A spokeswoman would not comment on the number or nature of the bids received due to "commercial sensitivity". Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.

Model Summary: Two Royal Navy frigates set to be sold as scrap are unlikely to be preserved, the BBC understands.

The two ships are counted correctly.

 

The decommissioned Type 22 frigates
HMS Cumberland, HMS Campbeltown and HMS Cornwall
are currently moored in Portsmouth Harbour.
Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government's Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK's industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: "For anyone that has served on a ship it's your home, you've literally been through the wars with it... and you want them to have a noble second life. "My preference is to go for the reef and diving attraction. "We've got to get best value for the budget but a reef would also generate income for part of the country through tourism." The Ministry of Defence has previously said it will "consider all options" for the frigates to ensure "best financial return for the taxpayer". A spokeswoman would not comment on the number or nature of the bids received due to "commercial sensitivity". Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.

Model Summary: No proposals have been submitted to preserve three Royal Navy frigates for reuse, the BBC has learned.

The three ships are counted correctly.

 

The decommissioned Type 22 frigates
HMS Cumberland, HMS Campbeltown, HMS Chatham, HMS Google, HMS Alphabet and HMS Cornwall
are currently moored in Portsmouth Harbour.
Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government's Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK's industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: "For anyone that has served on a ship it's your home, you've literally been through the wars with it... and you want them to have a noble second life. "My preference is to go for the reef and diving attraction. "We've got to get best value for the budget but a reef would also generate income for part of the country through tourism." The Ministry of Defence has previously said it will "consider all options" for the frigates to ensure "best financial return for the taxpayer". A spokeswoman would not comment on the number or nature of the bids received due to "commercial sensitivity". Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.

Model Summary: Seven Royal Navy frigates are set to be put up for sale.

As noted earlier, the six ships are miscounted as seven.
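
The five results can also be checked mechanically by pulling the number word out of each summary and comparing it with the number of ships named in the input. The sketch below is only a convenience for reading the results: `NUMBER_WORDS`, `count_in_summary`, and `RESULTS` are names made up for this example, and the summaries are copied from the model outputs shown in this post.

```python
import re
from typing import Optional

NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7}


def count_in_summary(summary: str) -> Optional[int]:
    """Return the first spelled-out number (two to seven) in the summary, else None."""
    for word in re.findall(r"[a-z]+", summary.lower()):
        if word in NUMBER_WORDS:
            return NUMBER_WORDS[word]
    return None


# (number of ships named in the input, model summary)
RESULTS = [
    (4, "No proposals have been submitted to preserve four Royal Navy frigates for reuse, the BBC has learned."),
    (5, "No bids have been submitted for the sale of five Royal Navy frigates, the BBC understands."),
    (2, "Two Royal Navy frigates set to be sold as scrap are unlikely to be preserved, the BBC understands."),
    (3, "No proposals have been submitted to preserve three Royal Navy frigates for reuse, the BBC has learned."),
    (6, "Seven Royal Navy frigates are set to be put up for sale."),
]

for expected, summary in RESULTS:
    found = count_in_summary(summary)
    status = "ok" if found == expected else "MISMATCH"
    print(f"ships={expected} summary_count={found} {status}")
```

Only the six-ship input produces a mismatch, which matches the observation that counting holds up for 2 to 5 ships and breaks at 6.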

 

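Personal thoughts

For about a month I did a lot of research on summarization within natural language processing. Summarization broadly falls into abstractive (paraphrasing) and extractive (selecting important sentences) approaches. Abstractive summarization is, unsurprisingly, the harder task. To start with the downside: judging by the PEGASUS examples above alone, the model looks extremely capable, but when I tested it on other text (different subject matter, for example a TV series broadcast schedule) it sometimes behaved strangely. So there is clearly still a long way to go. Even so, I think it is impressive that AI can produce summaries of this quality at all.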
