Trying "Classification of text documents using sparse features" on imbalanced data
I went through the whole pipeline of taking scraped book data and classifying each title as "interesting to me" or not.
Since I suffer from a congenital total lack of curiosity, "interested" makes up only a tiny fraction of the whole.
In other words: imbalanced data.
When I tried predicting on this imbalanced data with a Bernoulli naive Bayes classifier, it scored about 97% accuracy.
That has to be a lie.
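On data this skewed, accuracy by itself says almost nothing: a model that always answers "not interested" scores just as well. A minimal sketch, using made-up labels that mirror the roughly 97:3 split described here:

```python
# Hypothetical labels mirroring the imbalance: 3 "interested" out of 100.
y_true = [0] * 97 + [1] * 3

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # → 0.97, with no learning at all
```

So a 97% accuracy here is exactly what doing nothing would get you.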
I wondered how other classifiers would do.
But I didn't really know what other classifiers were out there, and I was about to give up when I found this:
Classification of text documents using sparse features – scikit-learn 0.17.1 documentation
It benchmarks all sorts of classifiers, so I gave it a Like (the way you'd Like a total stranger's Facebook post).
I'll imitate it here.
Data
The book data came from DMM, as usual.
https://gist.github.com/nihon-taro/b195c6a3b1a0f59c7a5f705232e5ab2f
Source code
The code looks like this.
```python
# coding: utf-8
from time import time

import pandas as pd
import numpy as np
import MeCab
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import (RidgeClassifier, SGDClassifier, Perceptron,
                                  PassiveAggressiveClassifier)
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics

df = pd.read_csv('book_data.csv', header=None)


def _split_to_words_mecab(text, to_stem=False):
    """
    Input:  'すべて自分のほうへ'
    Output: ['すべて', '自分', 'の', 'ほう', 'へ']
    """
    tagger = MeCab.Tagger("-Ochasen")
    mecab_result = tagger.parse(text)
    info_of_words = mecab_result.split('\n')
    words = []
    for info in info_of_words:
        # MeCab output ends with an empty string, preceded by 'EOS'
        if info == 'EOS' or info == '':
            break
        # info => 'な\t助詞,終助詞,*,*,*,*,な,ナ,ナ'
        info_elems = info.split('\t')
        # The third field holds the base (uninflected) form;
        # when it is '*', fall back to the surface form
        if info_elems[2] == '*':
            words.append(info_elems[0])
            continue
        if to_stem:
            # use the base form
            words.append(info_elems[2])
            continue
        # use the surface form as-is
        words.append(info_elems[0])
    return words


def japanese_token_stemming(text):
    return _split_to_words_mecab(text=text, to_stem=True)


stop_word_list = ["て", "に", "を", "は", "が", "へ", "した", "しました",
                  "ました", "です", "する", "ある", "いる", "。", "、"]

# Note: when `analyzer` is a callable, scikit-learn bypasses `stop_words`;
# to actually apply them, filter inside the analyzer instead.
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_word_list,
                                   analyzer=japanese_token_stemming,
                                   sublinear_tf=True,
                                   max_df=0.5)

train_data = list(df[1])[:900]
X_train_tfidf = tfidf_vectorizer.fit_transform(train_data)

test_data = list(df[1])[900:1000]
X_test_tfidf = tfidf_vectorizer.transform(test_data)

train_target = list(map(int, list(df[3])[:900]))
test_target = list(map(int, list(df[3])[900:1000]))


def benchmark(clf):
    t0 = time()
    clf.fit(X_train_tfidf, train_target)
    train_time = time() - t0

    t0 = time()
    predicted = clf.predict(X_test_tfidf)
    test_time = time() - t0

    score = metrics.accuracy_score(test_target, predicted)
    print("accuracy: %0.6f" % score)

    if hasattr(clf, 'coef_'):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))
        print()

    print("classification report:")
    print(metrics.classification_report(test_target, predicted,
                                        target_names=['no', 'yes']))
    print("confusion matrix:")
    print(metrics.confusion_matrix(test_target, predicted))

    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time


results = []

# n_iter is the scikit-learn 0.17-era parameter; newer versions use max_iter
for clf, name in (
        (RidgeClassifier(tol=1e-2, solver='sag'), "Ridge Classifier"),
        (Perceptron(n_iter=50), "Perceptron"),
        (PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive"),
        (KNeighborsClassifier(n_neighbors=10), "kNN"),
        (RandomForestClassifier(n_estimators=100), "Random forest")):
    print('=' * 80)
    print(name)
    results.append(benchmark(clf))

for penalty in ["l2", "l1"]:
    print('=' * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(loss='squared_hinge', penalty=penalty,
                                       dual=False, tol=1e-3)))
    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                           penalty=penalty)))

# Train SGD with Elastic Net penalty
print('=' * 80)
print("Elastic-Net penalty")
results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                       penalty="elasticnet")))

# Train NearestCentroid without threshold
print('=' * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))

# Train sparse Naive Bayes classifiers
print('=' * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=.01)))
results.append(benchmark(BernoulliNB(alpha=.01)))

print('=' * 80)
print("LinearSVC with L1-based feature selection")
# The smaller C, the stronger the regularization.
# The more regularization, the more sparsity.
results.append(benchmark(Pipeline([
    ('feature_selection', LinearSVC(penalty="l1", dual=False, tol=1e-3)),
    ('classification', LinearSVC())
])))
```
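One knob the scikit-learn linear models expose for exactly this situation, which none of the classifiers above use, is `class_weight='balanced'`: it scales the misclassification penalty inversely to class frequency, so errors on the rare class cost more. A sketch on synthetic 2-D data standing in for the TF-IDF vectors (the data here is made up, not from this experiment):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# 97 majority-class points around the origin, 3 minority points around (3, 3),
# mirroring a 97:3 imbalance.
X = np.vstack([rng.normal(0.0, 1.0, size=(97, 2)),
               rng.normal(3.0, 0.3, size=(3, 2))])
y = np.array([0] * 97 + [1] * 3)

# class_weight='balanced' weights each class by n_samples / (n_classes * count),
# so the 3 rare examples carry as much total weight as the 97 common ones.
clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)
print(clf.predict([[3.0, 3.0]]))  # the rare class is no longer ignored
```

Whether this helps on the real TF-IDF features is something I'd have to measure; it is the cheapest thing to try first.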
Since the documents are Japanese, I tokenized them with MeCab before vectorizing.
I used the following two articles pretty much as-is:
http://qiita.com/katryo/items/f86971afcb65ce1e7d40
http://qiita.com/HirofumiYashima/items/9308ea0607312218b20c#_reference-980db7bbab5dfcf49155
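MeCab's `-Ochasen` output, which the tokenizer above parses, is one tab-separated line per token (surface form, reading, base form, part of speech, ...), terminated by `EOS`. A self-contained sketch of that parsing step, using a hard-coded sample string so it runs without MeCab installed:

```python
# Hard-coded sample of MeCab -Ochasen output for 'すべて自分のほうへ'
# (fields: surface \t reading \t base form \t POS ...).
sample = (
    "すべて\tスベテ\tすべて\t名詞-副詞可能\n"
    "自分\tジブン\t自分\t名詞-一般\n"
    "の\tノ\tの\t助詞-連体化\n"
    "ほう\tホウ\tほう\t名詞-非自立-一般\n"
    "へ\tヘ\tへ\t助詞-格助詞-一般\n"
    "EOS\n"
)

def parse_chasen(mecab_result, to_stem=False):
    """Extract a token list from ChaSen-format MeCab output."""
    words = []
    for line in mecab_result.split('\n'):
        # output ends with an empty string, preceded by 'EOS'
        if line == 'EOS' or line == '':
            break
        fields = line.split('\t')
        # fields[2] is the base (uninflected) form; '*' means none available
        if to_stem and fields[2] != '*':
            words.append(fields[2])
        else:
            words.append(fields[0])
    return words

print(parse_chasen(sample, to_stem=True))
# → ['すべて', '自分', 'の', 'ほう', 'へ']
```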
Results
```
================================================================================
Ridge Classifier
accuracy: 0.970000
dimensionality: 10858
density: 1.000000

classification report:
             precision    recall  f1-score   support

         no       0.97      1.00      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.97      0.96       100

confusion matrix:
[[97  0]
 [ 3  0]]
================================================================================
Perceptron
accuracy: 0.970000
dimensionality: 10858
density: 0.302910

classification report:
             precision    recall  f1-score   support

         no       0.98      0.99      0.98        97
        yes       0.50      0.33      0.40         3

avg / total       0.97      0.97      0.97       100

confusion matrix:
[[96  1]
 [ 2  1]]
================================================================================
Passive-Aggressive
accuracy: 0.960000
dimensionality: 10858
density: 0.864155

classification report:
             precision    recall  f1-score   support

         no       0.97      0.99      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.96      0.95       100

confusion matrix:
[[96  1]
 [ 3  0]]
================================================================================
kNN
accuracy: 0.970000
classification report:
             precision    recall  f1-score   support

         no       0.97      1.00      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.97      0.96       100

confusion matrix:
[[97  0]
 [ 3  0]]
================================================================================
Random forest
accuracy: 0.960000
classification report:
             precision    recall  f1-score   support

         no       0.97      0.99      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.96      0.95       100

confusion matrix:
[[96  1]
 [ 3  0]]
================================================================================
L2 penalty
accuracy: 0.970000
dimensionality: 10858
density: 1.000000

classification report:
             precision    recall  f1-score   support

         no       0.97      1.00      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.97      0.96       100

confusion matrix:
[[97  0]
 [ 3  0]]
accuracy: 0.960000
dimensionality: 10858
density: 0.687235

classification report:
             precision    recall  f1-score   support

         no       0.97      0.99      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.96      0.95       100

confusion matrix:
[[96  1]
 [ 3  0]]
================================================================================
L1 penalty
accuracy: 0.960000
dimensionality: 10858
density: 0.009578

classification report:
             precision    recall  f1-score   support

         no       0.97      0.99      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.96      0.95       100

confusion matrix:
[[96  1]
 [ 3  0]]
accuracy: 0.950000
dimensionality: 10858
density: 0.062442

classification report:
             precision    recall  f1-score   support

         no       0.97      0.98      0.97        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.95      0.95       100

confusion matrix:
[[95  2]
 [ 3  0]]
================================================================================
Elastic-Net penalty
accuracy: 0.960000
dimensionality: 10858
density: 0.276847

classification report:
             precision    recall  f1-score   support

         no       0.97      0.99      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.96      0.95       100

confusion matrix:
[[96  1]
 [ 3  0]]
================================================================================
NearestCentroid (aka Rocchio classifier)
accuracy: 0.940000
classification report:
             precision    recall  f1-score   support

         no       0.97      0.97      0.97        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.94      0.94       100

confusion matrix:
[[94  3]
 [ 3  0]]
================================================================================
Naive Bayes
accuracy: 0.960000
dimensionality: 10858
density: 1.000000

classification report:
             precision    recall  f1-score   support

         no       0.97      0.99      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.96      0.95       100

confusion matrix:
[[96  1]
 [ 3  0]]
accuracy: 0.960000
dimensionality: 10858
density: 1.000000

classification report:
             precision    recall  f1-score   support

         no       0.97      0.99      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.96      0.95       100

confusion matrix:
[[96  1]
 [ 3  0]]
================================================================================
LinearSVC with L1-based feature selection
accuracy: 0.960000
classification report:
             precision    recall  f1-score   support

         no       0.97      0.99      0.98        97
        yes       0.00      0.00      0.00         3

avg / total       0.94      0.96      0.95       100

confusion matrix:
[[96  1]
 [ 3  0]]
```
Every classifier is basically going "just answer NO to everything and you're fine lol", which is sad. The confusion matrices confirm it: the "yes" row is almost always [3 0].
The test split is just as imbalanced, so evaluating this way looks hopeless too.
I'd like to keep a record of the trial and error it takes to get from this sorry state to something decent.
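One obvious first step (my own assumption, not something tried above): split by class rather than by row position, so the few "interested" examples are guaranteed to land on both sides. A minimal pure-Python sketch; `stratified_split` is a hypothetical helper, and scikit-learn's `StratifiedShuffleSplit` does this properly:

```python
import random

def stratified_split(items, labels, test_ratio=0.1, seed=0):
    """Split so each class appears in both train and test at roughly test_ratio."""
    rng = random.Random(seed)
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)
    train, train_y, test, test_y = [], [], [], []
    for label, members in by_class.items():
        rng.shuffle(members)
        # every class contributes at least one test example
        n_test = max(1, round(len(members) * test_ratio))
        for m in members[:n_test]:
            test.append(m)
            test_y.append(label)
        for m in members[n_test:]:
            train.append(m)
            train_y.append(label)
    return train, train_y, test, test_y

# Toy data with the same 97:3 imbalance as the book labels.
docs = ["doc%d" % i for i in range(100)]
labels = [1 if i < 3 else 0 for i in range(100)]
tr, tr_y, te, te_y = stratified_split(docs, labels)
print(sum(te_y), len(te))  # the minority class shows up on both sides of the split
```

This only fixes the evaluation side; making the classifiers actually learn the minority class is the next problem.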