Document classification with a CNN in Keras

TL;DR (lol)

Read this and you're done.

I want to classify Japanese documents. If you segment the text into words with MeCab and convert it into suitable integer arrays, you're set. Keras provides a handy Tokenizer class for that array conversion, so we use it.

The code is below. It's almost identical to the example I referenced, so I can't claim much originality here.
```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Embedding
from keras.layers import Convolution1D, MaxPooling1D
from keras import backend as K
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.cross_validation import cross_val_score
import pandas as pd
import MeCab

# Prepare suitable data yourself
df = pd.read_csv('caesar.csv', header=None)

# Segment the text into space-separated words
def tokenize(text):
    wakati = MeCab.Tagger('-O wakati')
    return wakati.parse(text)

tokenized_text_list = [tokenize(texts) for texts in df[1]]

max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 5

tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokenized_text_list)
seq = tokenizer.texts_to_sequences(tokenized_text_list)
X = sequence.pad_sequences(seq, maxlen=maxlen)
Y = df[2]

# Build and return the model
def build_model():
    print('Build model...')
    model = Sequential()

    # we start off with an efficient embedding layer which maps
    # our vocab indices into embedding_dims dimensions
    model.add(Embedding(max_features,
                        embedding_dims,
                        input_length=maxlen,
                        dropout=0.2))

    # we add a Convolution1D, which will learn nb_filter
    # word group filters of size filter_length:
    model.add(Convolution1D(nb_filter=nb_filter,
                            filter_length=filter_length,
                            border_mode='valid',
                            activation='relu',
                            subsample_length=1))
    # we use max pooling:
    model.add(MaxPooling1D(pool_length=model.output_shape[1]))

    # We flatten the output of the conv layer,
    # so that we can add a vanilla dense layer:
    model.add(Flatten())

    # We add a vanilla hidden layer:
    model.add(Dense(hidden_dims))
    model.add(Dropout(0.2))
    model.add(Activation('relu'))

    # We project onto a single unit output layer, and squash it with a sigmoid:
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=build_model, nb_epoch=nb_epoch, batch_size=batch_size)

# evaluate using 10-fold cross validation
kfold = StratifiedKFold(y=Y, n_folds=10, shuffle=True)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```
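As a sanity check on `tokenize`: MeCab's wakati output is just the words joined by spaces. The sample sentence below is my own, and the exact whitespace may vary with your dictionary, but with a standard IPA dictionary it looks roughly like this:

```python
>>> tokenize('今日はいい天気です')
'今日 は いい 天気 です \n'
```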
Things I don't understand yet

What kind of array should the text be converted into?

This time I used Tokenizer to convert the texts into arrays.
```python
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokenized_text_list)
seq = tokenizer.texts_to_sequences(tokenized_text_list)
```
I haven't read the implementation, so I don't know the details, but it goes roughly like this:

- Assign a number to each word that appears in the document set
- Replace the words in each document with their assigned numbers to get an array

I'm bad at explaining this in prose, so here is a concrete example.
Suppose we are given this set of documents:

['I am a pen', 'this is a pen', 'he is poop']
Assign a number to each word that appears in the document set

Just number them in order:

{ I: 1, am: 2, a: 3, pen: 4, this: 5, is: 6, he: 7, poop: 8 }

Something like that.
Replace the words in each document with their assigned numbers to get an array

'I am a pen' becomes [1, 2, 3, 4]...
'this is a pen' becomes [5, 6, 3, 4]...
'he is poop' becomes [7, 6, 8].
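You can check this behavior directly. Here is a minimal sketch with the toy documents above; note that the real Tokenizer assigns indices by word frequency (frequent words like 'pen', 'a', and 'is' get the smallest numbers), so its output won't exactly match my hand-numbered example:

```python
from keras.preprocessing.text import Tokenizer

docs = ['I am a pen', 'this is a pen', 'he is poop']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)

# word -> integer index, more frequent words get smaller indices
print(tokenizer.word_index)
# each document replaced by the indices of its words
print(tokenizer.texts_to_sequences(docs))
```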
So we have our three arrays, but we're not done yet. Look closely: only the last array has a different length. Apparently the training examples you feed into the model all have to be the same length, so we add the following code to pad them.
```python
X = sequence.pad_sequences(seq, maxlen=maxlen)
```
Setting maxlen to 5 pads or truncates every array to length 5. For example, 'he is poop' becomes [0, 0, 7, 6, 8].
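A quick sketch of what that does to the toy sequences above (by default, pad_sequences puts the zeros at the front and returns a NumPy array):

```python
from keras.preprocessing import sequence

seq = [[1, 2, 3, 4], [5, 6, 3, 4], [7, 6, 8]]
print(sequence.pad_sequences(seq, maxlen=5))
# [[0 1 2 3 4]
#  [0 5 6 3 4]
#  [0 0 7 6 8]]
```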
I wonder whether a one-hot representation would also work, or whether such sparse arrays are a bad idea. I'd like to try it later.
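If I do try it, Tokenizer already offers one such representation; a hedged, untested sketch of its binary bag-of-words mode, which throws away word order and so wouldn't fit the Conv1D model above as-is:

```python
# Binary bag-of-words: one row per document, one column per word index,
# with 1 where the word occurs. Word order is lost entirely.
X_bow = tokenizer.texts_to_matrix(tokenized_text_list, mode='binary')
```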
Results

The run produced the following output.
```
Epoch 1/5
355/355 [==============================] - 8s - loss: 0.6807 - acc: 0.5746
Epoch 2/5
355/355 [==============================] - 7s - loss: 0.6491 - acc: 0.6704
Epoch 3/5
355/355 [==============================] - 7s - loss: 0.5765 - acc: 0.7211
Epoch 4/5
355/355 [==============================] - 6s - loss: 0.4416 - acc: 0.8310
Epoch 5/5
355/355 [==============================] - 7s - loss: 0.3151 - acc: 0.9099
41/41 [==============================] - 0s
Build model...
Epoch 1/5
355/355 [==============================] - 7s - loss: 0.6869 - acc: 0.4930
Epoch 2/5
355/355 [==============================] - 7s - loss: 0.6508 - acc: 0.6817
Epoch 3/5
355/355 [==============================] - 7s - loss: 0.5727 - acc: 0.7803
Epoch 4/5
355/355 [==============================] - 9s - loss: 0.4280 - acc: 0.8704
Epoch 5/5
355/355 [==============================] - 9s - loss: 0.2860 - acc: 0.8817
41/41 [==============================] - 0s
Build model...
Epoch 1/5
356/356 [==============================] - 7s - loss: 0.6876 - acc: 0.5843
Epoch 2/5
356/356 [==============================] - 7s - loss: 0.6607 - acc: 0.7444
Epoch 3/5
356/356 [==============================] - 9s - loss: 0.5929 - acc: 0.7472
Epoch 4/5
356/356 [==============================] - 7s - loss: 0.4880 - acc: 0.8118
Epoch 5/5
356/356 [==============================] - 7s - loss: 0.3670 - acc: 0.8455
40/40 [==============================] - 0s
Build model...
Epoch 1/5
356/356 [==============================] - 7s - loss: 0.6900 - acc: 0.5084
Epoch 2/5
356/356 [==============================] - 7s - loss: 0.6609 - acc: 0.6770
Epoch 3/5
356/356 [==============================] - 8s - loss: 0.5998 - acc: 0.7444
Epoch 4/5
356/356 [==============================] - 13s - loss: 0.5011 - acc: 0.8090
Epoch 5/5
356/356 [==============================] - 14s - loss: 0.3820 - acc: 0.8596
40/40 [==============================] - 0s
Build model...
Epoch 1/5
357/357 [==============================] - 7s - loss: 0.6835 - acc: 0.6387
Epoch 2/5
357/357 [==============================] - 7s - loss: 0.6412 - acc: 0.6527
Epoch 3/5
357/357 [==============================] - 8s - loss: 0.5600 - acc: 0.8067
Epoch 4/5
357/357 [==============================] - 9s - loss: 0.4166 - acc: 0.8599
Epoch 5/5
357/357 [==============================] - 7s - loss: 0.2947 - acc: 0.8711
39/39 [==============================] - 0s
Build model...
Epoch 1/5
357/357 [==============================] - 9s - loss: 0.6833 - acc: 0.6415
Epoch 2/5
357/357 [==============================] - 11s - loss: 0.6450 - acc: 0.6639
Epoch 3/5
357/357 [==============================] - 9s - loss: 0.5558 - acc: 0.8095
Epoch 4/5
357/357 [==============================] - 13s - loss: 0.4496 - acc: 0.8067
Epoch 5/5
357/357 [==============================] - 15s - loss: 0.3376 - acc: 0.8711
39/39 [==============================] - 0s
Build model...
Epoch 1/5
357/357 [==============================] - 8s - loss: 0.6913 - acc: 0.5070
Epoch 2/5
357/357 [==============================] - 8s - loss: 0.6571 - acc: 0.6723
Epoch 3/5
357/357 [==============================] - 8s - loss: 0.5728 - acc: 0.8151
Epoch 4/5
357/357 [==============================] - 8s - loss: 0.4539 - acc: 0.8431
Epoch 5/5
357/357 [==============================] - 7s - loss: 0.2871 - acc: 0.9020
39/39 [==============================] - 0s
Build model...
Epoch 1/5
357/357 [==============================] - 7s - loss: 0.6844 - acc: 0.5406
Epoch 2/5
357/357 [==============================] - 8s - loss: 0.6517 - acc: 0.6443
Epoch 3/5
357/357 [==============================] - 7s - loss: 0.5713 - acc: 0.7619
Epoch 4/5
357/357 [==============================] - 9s - loss: 0.4495 - acc: 0.8347
Epoch 5/5
357/357 [==============================] - 8s - loss: 0.3262 - acc: 0.8908
39/39 [==============================] - 0s
Build model...
Epoch 1/5
357/357 [==============================] - 9s - loss: 0.6865 - acc: 0.5966
Epoch 2/5
357/357 [==============================] - 8s - loss: 0.6529 - acc: 0.6359
Epoch 3/5
357/357 [==============================] - 8s - loss: 0.5760 - acc: 0.7787
Epoch 4/5
357/357 [==============================] - 6s - loss: 0.4226 - acc: 0.8571
Epoch 5/5
357/357 [==============================] - 7s - loss: 0.3148 - acc: 0.8739
39/39 [==============================] - 0s
Build model...
Epoch 1/5
357/357 [==============================] - 7s - loss: 0.6872 - acc: 0.5434
Epoch 2/5
357/357 [==============================] - 6s - loss: 0.6561 - acc: 0.7003
Epoch 3/5
357/357 [==============================] - 7s - loss: 0.5928 - acc: 0.7591
Epoch 4/5
357/357 [==============================] - 6s - loss: 0.4490 - acc: 0.8431
Epoch 5/5
357/357 [==============================] - 6s - loss: 0.3598 - acc: 0.8627
39/39 [==============================] - 0s
0.790078175204
```
For all that convolution, the accuracy is nothing special: about 0.79 averaged over the 10 folds.
- Increase the amount of training data
- Specify stop words (right now, symbols like 『, which seem useless or even harmful for classification, are mixed in)
- Tune the hyperparameters with grid search (a sketch follows below)

Maybe doing this sort of thing would improve it.
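For the grid-search item, a minimal sketch using the old-style scikit-learn API that matches the sklearn.cross_validation imports above; the parameter values are placeholders I made up, not tuned ones:

```python
from sklearn.grid_search import GridSearchCV

# KerasClassifier forwards nb_epoch and batch_size to fit(),
# so they can be searched without changing build_model().
param_grid = {
    'nb_epoch': [3, 5, 10],      # placeholder values
    'batch_size': [16, 32, 64],  # placeholder values
}
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X, Y)
print(grid_result.best_score_)
print(grid_result.best_params_)
```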