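This post translates and summarizes the TensorFlow tutorial "Classify text with BERT". Most BERT examples rely on Hugging Face, but I figured that code written directly in TensorFlow or PyTorch, without Hugging Face, would be better for studying BERT, and I found this tutorial while looking for one.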
Original article link: (its Korean translation is not very good)
Setup
- Install tensorflow-text for preprocessing
pip install -q -U "tensorflow-text==2.8.*"
- Install the following to use the AdamW optimizer
pip install -q tf-models-official==2.7.0
- The packages used in this example are as follows.
import os
import shutil
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization # to create AdamW optimizer
import matplotlib.pyplot as plt
tf.get_logger().setLevel('ERROR')
Prepare data
Download the dataset
This example uses the IMDB dataset, a collection of movie reviews, to perform sentiment analysis that classifies each review as positive or negative.
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
dataset = tf.keras.utils.get_file('aclImdb_v1.tar.gz', url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
train_dir = os.path.join(dataset_dir, 'train')
# remove unused folders to make it easier to load the data
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
Prepare the datasets and the train-validation split
Split the training data 8:2 into a training set and a validation set, then declare the train, validation, and test datasets.
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42
# train dataset
raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)
# validation dataset
val_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
# test dataset
test_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
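The step counts that show up later in the training and evaluation logs follow directly from this split. Below is a small arithmetic sketch, assuming the labeled IMDB train and test sets contain 25,000 reviews each and that the last partial batch is kept. The helper name split_and_batch is only for illustration and is not part of the tutorial.

```python
import math

def split_and_batch(num_examples, validation_split, batch_size):
    """Return (train_examples, val_examples, train_batches, val_batches)."""
    # Assumption: the validation subset takes validation_split of the examples
    # and the final partial batch is kept rather than dropped.
    val_examples = int(num_examples * validation_split)
    train_examples = num_examples - val_examples
    train_batches = math.ceil(train_examples / batch_size)
    val_batches = math.ceil(val_examples / batch_size)
    return train_examples, val_examples, train_batches, val_batches

# 25,000 labeled training reviews, 8:2 split, batch size 32
print(split_and_batch(25000, 0.2, 32))
# 25,000 test reviews, batch size 32
print(math.ceil(25000 / 32))
```

Under these assumptions the training set yields 625 batches per epoch and the test set 782 batches, which is what the progress bars in the logs report.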
Inspect the data
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(f'Review: {text_batch.numpy()[i]}')
        label = label_batch.numpy()[i]
        print(f'Label : {label} ({class_names[label]})')
[Output]
Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label : 0 (neg)
Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into complicated situations, and so does the perspective of the viewer.<br /><br />So is 'Homicide' which from the title tries to set the mind of the viewer to the usual crime drama. The principal characters are two cops, one Jewish and one Irish who deal with a racially charged area. The murder of an old Jewish shop owner who proves to be an ancient veteran of the Israeli Independence war triggers the Jewish identity in the mind and heart of the Jewish detective.<br /><br />This is were the flaws of the film are the more obvious. The process of awakening is theatrical and hard to believe, the group of Jewish militants is operatic, and the way the detective eventually walks to the final violent confrontation is pathetic. The end of the film itself is Mamet-like smart, but disappoints from a human emotional perspective.<br /><br />Joe Mantegna and William Macy give strong performances, but the flaws of the story are too evident to be easily compensated."
Label : 0 (neg)
Review: b'Great documentary about the lives of NY firefighters during the worst terrorist attack of all time.. That reason alone is why this should be a must see collectors item.. What shocked me was not only the attacks, but the"High Fat Diet" and physical appearance of some of these firefighters. I think a lot of Doctors would agree with me that,in the physical shape they were in, some of these firefighters would NOT of made it to the 79th floor carrying over 60 lbs of gear. Having said that i now have a greater respect for firefighters and i realize becoming a firefighter is a life altering job. The French have a history of making great documentary\'s and that is what this is, a Great Documentary.....'
Label : 1 (pos)
2022-03-29 12:30:15.775528: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Download the pre-trained model
The weights of a pre-trained BERT model can be downloaded from TensorFlow Hub through the URLs below. Models other than BERT, such as ALBERT and ELECTRA, are also available; see the original article for details.
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
Preprocessing model
Declare the model that converts the input text into token ids as follows.
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
It can be tested as follows.
text_test = ['this is such an amazing movie!']
text_preprocessed = bert_preprocess_model(text_test)
print(f'Keys : {list(text_preprocessed.keys())}')
print(f'Shape : {text_preprocessed["input_word_ids"].shape}')
print(f'input_word_ids : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'input_mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'input_type_ids : {text_preprocessed["input_type_ids"][0, :12]}')
[Output]
Keys : ['input_type_ids', 'input_word_ids', 'input_mask']
Shape : (1, 128)
input_word_ids : [ 101 2023 2003 2107 2019 6429 3185 999 102 0 0 0]
input_mask : [1 1 1 1 1 1 1 1 1 0 0 0]
input_type_ids : [0 0 0 0 0 0 0 0 0 0 0 0]
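The preprocessing model produces the following three outputs.
- input_word_ids: the id of each token
- input_mask: whether a position is padding; 0 for padded tokens, 1 for real tokens
- input_type_ids: an id that distinguishes the sentences (segments) of an input. Every input in this example is a single sentence, so it is always 0.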
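How the three outputs relate to each other can be sketched in plain Python. This is only an illustration of the packing step, not the actual preprocessing model (which also does the tokenization). It assumes that 101 and 102 are the [CLS] and [SEP] ids and 0 is the padding id, as the output above suggests; the function name pack_single_segment is hypothetical.

```python
def pack_single_segment(token_ids, seq_length=128, cls_id=101, sep_id=102, pad_id=0):
    """Build the three BERT inputs for one single-sentence example."""
    # Truncate so that [CLS] and [SEP] still fit into seq_length.
    token_ids = list(token_ids)[: seq_length - 2]
    word_ids = [cls_id] + token_ids + [sep_id]
    num_real = len(word_ids)
    num_pad = seq_length - num_real
    return {
        "input_word_ids": word_ids + [pad_id] * num_pad,   # token ids, then padding
        "input_mask": [1] * num_real + [0] * num_pad,      # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,                # single segment, so all 0
    }

# Token ids copied from the output above for 'this is such an amazing movie!'
tokens = [2023, 2003, 2107, 2019, 6429, 3185, 999]
packed = pack_single_segment(tokens)
print(packed["input_word_ids"][:12])
print(packed["input_mask"][:12])
print(packed["input_type_ids"][:12])
```

The first 12 entries of each list reproduce the values printed by the real preprocessing model for the test sentence.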
BERT model
Declare the BERT model as follows.
bert_model = hub.KerasLayer(tfhub_handle_encoder)
It can be tested as follows.
bert_results = bert_model(text_preprocessed)
print(f'Loaded BERT: {tfhub_handle_encoder}')
print(f'Pooled Outputs Shape:{bert_results["pooled_output"].shape}')
print(f'Pooled Outputs Values:{bert_results["pooled_output"][0, :12]}')
print(f'Sequence Outputs Shape:{bert_results["sequence_output"].shape}')
print(f'Sequence Outputs Values:{bert_results["sequence_output"][0, :12]}')
[Output]
Loaded BERT: https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Pooled Outputs Shape:(1, 512)
Pooled Outputs Values:[ 0.76262873 0.99280983 -0.1861186 0.36673835 0.15233682 0.65504444
0.9681154 -0.9486272 0.00216158 -0.9877732 0.0684272 -0.9763061 ]
Sequence Outputs Shape:(1, 128, 512)
Sequence Outputs Values:[[-0.28946388 0.3432126 0.33231565 ... 0.21300787 0.7102078
-0.05771166]
[-0.28742015 0.31981024 -0.2301858 ... 0.58455074 -0.21329722
0.7269209 ]
[-0.66157013 0.6887685 -0.87432927 ... 0.10877253 -0.26173282
0.47855264]
...
[-0.2256118 -0.28925604 -0.07064401 ... 0.4756601 0.8327715
0.40025353]
[-0.29824278 -0.27473143 -0.05450511 ... 0.48849759 1.0955356
0.18163344]
[-0.44378197 0.00930723 0.07223766 ... 0.1729009 1.1833246
0.07897988]]
The BERT model produces the following three outputs.
- pooled_output: the embedding of the whole sentence, with shape (batch_size, H)
- sequence_output: the embedding of each token, with shape (batch_size, seq_length, H)
- encoder_outputs: the intermediate activations of the L transformer blocks; a list of L tensors, each with shape (batch_size, seq_length, H)
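For the encoder used here, the shapes can be written out concretely. The sketch below assumes that L-4_H-512 in the model handle means L = 4 transformer blocks and hidden size H = 512, and that every encoder block outputs tensors of hidden size H. The helper bert_output_shapes is illustrative only.

```python
def bert_output_shapes(batch_size, seq_length, hidden_size, num_layers):
    """Expected output shapes of a BERT encoder with the given dimensions."""
    return {
        "pooled_output": (batch_size, hidden_size),
        "sequence_output": (batch_size, seq_length, hidden_size),
        # One tensor per transformer block
        "encoder_outputs": [(batch_size, seq_length, hidden_size)] * num_layers,
    }

# One input sentence, padded to 128 tokens, small BERT with L=4 and H=512
shapes = bert_output_shapes(batch_size=1, seq_length=128, hidden_size=512, num_layers=4)
for name, shape in shapes.items():
    print(name, shape)
```

The pooled_output and sequence_output entries agree with the shapes (1, 512) and (1, 128, 512) printed above.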
Define entire model
Add a classifier on top of the two models above and declare the full model for sentiment analysis as follows.
def build_classifier_model():
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    dropout = tf.keras.layers.Dropout(0.1)
    classifier = tf.keras.layers.Dense(1, activation=None, name='classifier')
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    encoder_inputs = preprocessing_layer(text_input)
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = dropout(net)
    net = classifier(net)
    return tf.keras.Model(text_input, net)
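The preprocessing model converts tokens into ids, and the BERT model turns those ids into embeddings.
Because this is sentence-level classification, only the pooled_output (the embedding of the whole sentence) is used; it passes through a dropout layer and a dense layer (the classifier) to produce the final prediction.
The model output can be tested as follows.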
classifier_model = build_classifier_model()
bert_raw_result = classifier_model(tf.constant(text_test))
print(tf.sigmoid(bert_raw_result))
[Output]
tf.Tensor([[0.79717386]], shape=(1, 1), dtype=float32)
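The model itself returns a raw logit; applying the sigmoid turns it into a value between 0 and 1. Since the model has not been fine-tuned yet, this particular value is not a meaningful prediction.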
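The last two steps, the Dense(1) layer without activation and the sigmoid applied to its logit, can be sketched in plain Python. The 3-dimensional vector and the weights below are made-up stand-ins (the real pooled_output has 512 dimensions and learned weights), and dropout is left out since it is inactive at inference time.

```python
import math

def sigmoid(x):
    """Map a real-valued logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def classifier_head(pooled_output, weights, bias):
    """Dense(1, activation=None): a dot product plus a bias, giving one logit."""
    return sum(w * x for w, x in zip(weights, pooled_output)) + bias

pooled = [0.5, -1.0, 0.25]   # made-up stand-in for pooled_output
weights = [1.0, 0.5, 2.0]    # made-up classifier weights
logit = classifier_head(pooled, weights, bias=0.0)
prob = sigmoid(logit)
print(logit, prob)
```

Keeping the sigmoid outside the model is what allows the loss to be computed directly from logits later on.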
The structure of the full model is as follows.
tf.keras.utils.plot_model(classifier_model)
Train
Loss function
Since this is a binary classification problem, binary cross-entropy is used as the loss. Accuracy is used as the evaluation metric.
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = tf.metrics.BinaryAccuracy()
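Because the classifier outputs a raw logit, the loss is created with from_logits=True. What that means for a single example can be sketched as below. This is an illustrative re-derivation using the usual numerically stable form, not the library's source code; the function names are hypothetical.

```python
import math

def bce_from_logits(logit, label):
    """Binary cross-entropy for one example, computed from the raw logit.

    Stable form: max(x, 0) - x * z + log(1 + exp(-|x|)),
    where x is the logit and z is the 0/1 label.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, label):
    """Same quantity via an explicit sigmoid; breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

print(bce_from_logits(0.0, 1))    # a logit of 0 means p = 0.5, so the loss is log(2)
print(bce_from_logits(2.0, 1), bce_naive(2.0, 1))
print(bce_from_logits(1000.0, 0)) # stays finite where the naive form would fail
```

Working from logits avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1.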
Optimizer
The optimizer is AdamW, which combines Adam with weight decay.
epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)
init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
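The num_warmup_steps argument implies a learning-rate schedule on top of AdamW. A rough sketch of such a schedule follows, assuming a linear warm-up from 0 to init_lr followed by a linear decay that reaches 0 at num_train_steps. The exact schedule inside optimization.create_optimizer may differ in its details, so treat this as an approximation; the step counts come from 625 steps per epoch times 5 epochs.

```python
def learning_rate(step, init_lr=3e-5, num_train_steps=3125, num_warmup_steps=312):
    """Assumed schedule: linear warm-up, then linear decay to zero."""
    if step < num_warmup_steps:
        # Warm-up phase: grow linearly from 0 towards init_lr
        return init_lr * step / num_warmup_steps
    # Decay phase: shrink linearly so that the rate is 0 at num_train_steps
    remaining = max(0.0, 1.0 - step / num_train_steps)
    return init_lr * remaining

steps_per_epoch = 625
num_train_steps = steps_per_epoch * 5            # 3125
num_warmup_steps = int(0.1 * num_train_steps)    # 312

for step in (0, 156, 312, 1500, 3125):
    print(step, learning_rate(step))
```

The warm-up keeps the first updates small while the randomly initialized classifier layer is still far from a sensible solution.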
Train
Training proceeds as follows.
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)
print(f'Training model with {tfhub_handle_encoder}')
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)
[Output]
Training model with https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Epoch 1/5
625/625 [==============================] - 90s 136ms/step - loss: 0.4784 - binary_accuracy: 0.7528 - val_loss: 0.3713 - val_binary_accuracy: 0.8350
Epoch 2/5
625/625 [==============================] - 83s 133ms/step - loss: 0.3295 - binary_accuracy: 0.8525 - val_loss: 0.3675 - val_binary_accuracy: 0.8472
Epoch 3/5
625/625 [==============================] - 83s 133ms/step - loss: 0.2503 - binary_accuracy: 0.8963 - val_loss: 0.3905 - val_binary_accuracy: 0.8470
Epoch 4/5
625/625 [==============================] - 83s 133ms/step - loss: 0.1930 - binary_accuracy: 0.9240 - val_loss: 0.4566 - val_binary_accuracy: 0.8506
Epoch 5/5
625/625 [==============================] - 83s 133ms/step - loss: 0.1526 - binary_accuracy: 0.9429 - val_loss: 0.4813 - val_binary_accuracy: 0.8468
Evaluation
Once training is complete, evaluate the model on the test set.
loss, accuracy = classifier_model.evaluate(test_ds)
print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')
[Output]
782/782 [==============================] - 59s 75ms/step - loss: 0.4568 - binary_accuracy: 0.8557
Loss: 0.45678260922431946
Accuracy: 0.855679988861084
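One detail worth keeping in mind when reading this accuracy: the model outputs logits, while a binary accuracy metric compares predictions against a threshold. The sketch below assumes the metric simply checks prediction > threshold with a default threshold of 0.5, as I understand BinaryAccuracy to behave. The logits and labels are made up for illustration.

```python
def binary_accuracy(predictions, labels, threshold):
    """Fraction of examples where (prediction > threshold) matches the 0/1 label."""
    hits = sum(int(p > threshold) == y for p, y in zip(predictions, labels))
    return hits / len(labels)

logits = [2.3, 0.2, -1.1, -0.4]   # made-up raw model outputs
labels = [1, 1, 0, 0]

# Threshold 0.5 applied to logits corresponds to a probability cut-off of about 0.62
print(binary_accuracy(logits, labels, threshold=0.5))
# Threshold 0.0 applied to logits corresponds to a probability cut-off of 0.5
print(binary_accuracy(logits, labels, threshold=0.0))
```

Under that assumption, a threshold of 0.0 on the logits (or a sigmoid applied before the metric) would correspond to the usual 0.5 probability cut-off, so the reported number may differ slightly from an accuracy computed that way.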