A while ago I took part in a rather idiotic online competition, view-based domain sentiment analysis (基于视角的领域情感分析); the homepage is here. The task is to pick out the entities in a passage and judge the sentiment toward each of them. For example, in "我喜欢本田,我不喜欢丰田" ("I like Honda, I don't like Toyota"), you should mark "本田" (Honda) and "丰田" (Toyota): from Honda's point of view the sentiment is positive, and from Toyota's it is negative. In other words, the task is equivalent to combining entity recognition with sentiment analysis.
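
To make the output format concrete, here is a minimal sketch of one labelled example, written as plain Python data in the (SentenceId, View, Opinion) layout used by the submission files produced by the code below; the SentenceId value is made up for illustration.

# one hypothetical labelled sentence in the (SentenceId, View, Opinion) layout
sentence = u'我喜欢本田,我不喜欢丰田'   # "I like Honda, I don't like Toyota"
labels = [
    (1, u'本田', 'pos'),              # sentiment towards Honda is positive
    (1, u'丰田', 'neg'),              # sentiment towards Toyota is negative
]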

Rant

It sounds fancy, so what exactly is idiotic about it? The task itself is actually decent and worth studying; it is the organizers who are idiotic, mainly in two ways. 1. The competition has three stages: a preliminary, a semifinal, and a final. The preliminary ran for over a month, after which some teams advanced to the semifinal, which simply swapped in a little new data; the task and the data domain were unchanged, and the semifinal also ran for a month. What on earth is the point of such a semifinal? 2. Just look at what the contestants were discussing in the group chat:

嗷嗷嗷嗷 17:40:54
128004 【杭州德奥奥迪品荐二手车】奥迪ttcoupe45tfsiquattro2015年53.69万
嗷嗷嗷嗷 17:40:57
@国双赛题指导
嗷嗷嗷嗷 17:41:09
How far should the view span extend in this one?
国双赛题指导 17:41:19
奥迪tt

风云 20:19:47
没开过好车,感觉本田的操控比丰田 日产好吧 ("never driven a great car, but Honda's handling feels better than Toyota's or Nissan's"): should "丰田" and "日产" here be neg or neu?
风云 20:20:00
The preliminary and the semifinal don't seem to apply this standard consistently.
风云 20:20:12
@国双赛题指导 @国双赛题指导3
国双赛题指导 21:29:52
neu

Kk_asd 10:15:00
@国双赛题指导 For 上海大众 (Shanghai Volkswagen), should "上海" be dropped?
国双赛题指导 10:15:18
bu (i.e. "no")

出门向右 20:49:06
Are there views like 进口福特 (imported Ford)? @国双赛题指导
出门向右 20:49:16
Or 进口宝马 (imported BMW)?
国双赛题指导 20:54:43
No

Kk_asd 10:57:28
起亚律动 shows up a lot; should 起亚 (Kia) be labeled? @国双赛题指导
国双赛题指导 11:43:04
No

I won't say any more. If the organizers think this is machine learning, then fine, it is machine learning; to me it looks more like "administrator learning".

In any case, it is an idiotic competition, so I might as well play the idiot for once. I am not hoping for any ranking, and although the competition is not over yet, I am releasing my model now. If your score is lower than mine, feel free to use this as a template and push your score up a bit.

Model

My approach to this task is essentially the same as in 《基于双向LSTM和迁移学习的seq2seq核心实体识别》 (seq2seq core entity recognition with bidirectional LSTMs and transfer learning): treat it as a sequence labeling problem, except that the LSTM is replaced with a GRU, which has fewer parameters. This time I used character-level tagging: 0 marks non-entity characters, 1 marks the characters of positive entities, 2 neutral entities, and 3 negative entities, and that is all. Since the labeled corpus is from the automotive domain, I crawled some additional automotive text myself and wrote a GRU-based language model to train the character embeddings, because I felt that Word2Vec's approach to character embeddings is too crude and may not work well on a small corpus.
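
For concreteness, here is a minimal sketch of the character-level tag encoding just described. The encode helper is my own illustration, not part of the competition code, though it does the same job as the label2tag function in the full script below.

#! -*- coding:utf-8 -*-
import numpy as np

tags = {'pos': 1, 'neu': 2, 'neg': 3}   # 0 is reserved for non-entity characters

def encode(sentence, entities):
    # turn (view, opinion) pairs into a per-character tag sequence
    r = np.zeros(len(sentence), dtype=np.int32)
    for view, opinion in entities:
        start = sentence.find(view)
        while start != -1:
            r[start:start + len(view)] = tags[opinion]
            start = sentence.find(view, start + len(view))
    return r

print encode(u'我喜欢本田,我不喜欢丰田', [(u'本田', 'pos'), (u'丰田', 'neg')])
# [0 0 0 1 1 0 0 0 0 0 3 3]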

Then what? Nothing more, really; the rest is essentially a repeat of 《基于双向LSTM和迁移学习的seq2seq核心实体识别》, and even the code is the same. One addition: the organizers provided a list of automotive-domain entities, so in post-processing I used that list to force an alignment during Viterbi decoding. The final transfer-learning step did not improve things much; take it or leave it.
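
The forced alignment itself is straightforward: every span of the sentence that matches the official entity list (the code below uses an Aho-Corasick automaton from acora for the matching) has the log-scores of its entity tags boosted before a second Viterbi pass, so the decoder is pushed to mark that span as some entity. Below is a minimal sketch of the boosting step with hypothetical names; the real version lives inside the predict function further down.

# nodes[t] maps each tag '0'..'3' to its log-score at character position t;
# (start, word) is one match of an official entity against the sentence.
BOOST = 100.0

def force_align(nodes, start, word):
    for t in range(start, start + len(word)):
        for tag in '123':                     # push Viterbi towards an entity tag
            nodes[t][tag] += BOOST
    if start > 0:                             # encourage clean boundaries
        nodes[start - 1]['0'] += BOOST / 2
    if start + len(word) < len(nodes):
        nodes[start + len(word)]['0'] += BOOST / 2
    return nodes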

The one thing I am fairly pleased with is that the whole pipeline is end-to-end: once the corpus is downloaded there is almost no manual intervention, and it would run just as quickly on a corpus from a different domain.

Results

My accuracy was 0.56 in the preliminary and is currently 0.55 in the semifinal, which is not great. The best score on the leaderboard is 0.67; I do not know what methods they used and would welcome pointers from anyone who does. Either way, I do not plan to work on it any further.

Code

#! -*- coding:utf-8 -*-
 
import numpy as np
import pandas as pd
from tqdm import tqdm
import re
import time
import os
 
print u'read data ...'
train_data = pd.read_csv('Train.csv', index_col='SentenceId', delimiter='\t', encoding='utf-8')
test_data = pd.read_csv('Test.csv', index_col='SentenceId', delimiter='\t', encoding='utf-8')
train_label = pd.read_csv('Label.csv', index_col='SentenceId', delimiter='\t', encoding='utf-8')
addition_data = pd.read_csv('addition_data.csv', header=None, encoding='utf-8')[0]
train_data.dropna(inplace=True) # drop some empty sentences
neg_data = pd.read_excel('neg.xls', header=None)[0]
pos_data = pd.read_excel('pos.xls', header=None)[0]
 
script_name = 'shibie.py'
now = int(time.time())
os.system('mkdir %s'%now)
os.system('cp %s %s'%(script_name, now))
os.system('cp addition_data.csv %s'%now)
 
# some parameters
min_count = 5
maxlen = 100
word_size = 64
 
print u'making mapping dictionary ...'
word2id = ''.join(train_data['Content']) + ''.join(test_data['Content']) + ''.join(addition_data)
word2id = pd.Series(list(word2id)).value_counts()
word2id = word2id[word2id >= min_count]
word2id[:] = range(1, len(word2id)+1)
print u'keep %s words.'%len(word2id)
 
def doc2id(s):
    return list(word2id[list(s)].fillna(len(word2id)+1).astype(np.int32))
 
print u'translating texts into id sequences ...'
train_data['doc2id'] = map(lambda i: doc2id(train_data.loc[i, 'Content']), tqdm(iter(train_data.index)))
test_data['doc2id'] = map(lambda i: doc2id(test_data.loc[i, 'Content']), tqdm(iter(test_data.index)))
addition_data[:] = map(lambda i: doc2id(addition_data[i]), tqdm(iter(addition_data.index)))
pos_data[:] = map(lambda i: doc2id(pos_data[i]), tqdm(iter(pos_data.index)))
neg_data[:] = map(lambda i: doc2id(neg_data[i]), tqdm(iter(neg_data.index)))
 
# make n-grams for training the language model
n = 8
def gen_ngrams(s):
    s = [0]*(n-1) + s + [0]*(n-1)
    return zip(*[s[i:] for i in range(n)])
 
print u'generating ngrams ...'
from itertools import chain
ngrams = pd.concat([train_data['doc2id'].apply(gen_ngrams),
                    test_data['doc2id'].apply(gen_ngrams),
                    addition_data.apply(gen_ngrams),
                    pos_data.apply(gen_ngrams),
                    neg_data.apply(gen_ngrams)])
ngrams = np.array(list(chain(*ngrams)))
 
def findall(sub_string, string):
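    # return the start indices of all non-overlapping occurrences of sub_string in string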
    start = 0
    idxs = []
    while True:
        idx = string[start:].find(sub_string)
        if idx == -1:
            return idxs
        else:
            idxs.append(start + idx)
            start += idx + len(sub_string)
 
tags = {'pos':1, 'neu':2, 'neg':3}
 
def label2tag(i):
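    # convert the (View, Opinion) labels of sentence i into a per-character tag sequence (0 = non-entity)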
    s = train_data.loc[i]['Content']
    r = np.array([0]*len(s))
    try:
        l = train_label.loc[[i]].as_matrix()
    except:
        return r
    for i in l:
        for j in findall(i[0], s):
            r[j:j+len(i[0])] = tags[i[1]]
    return r
 
print u'translating target into tags ...'
train_data['label'] = map(label2tag, tqdm(iter(train_data.index)))
print u'keep %s train samples.'%len(train_data)
 
from keras.layers import Input, Embedding, GRU, Dense, TimeDistributed, Bidirectional
from keras.models import Model
from keras.utils import np_utils
 
RNN = GRU # which type of RNN to use; try LSTM or GRU
 
# in order to get good word embeddings, we use a GRU to train an n-gram language model;
# it costs more time, but produces better word embeddings.
print u'training language model ...'
lm_input = Input(shape=(n-1,), dtype='int32')
lm_embedded = Embedding(len(word2id)+2,
                         word_size,
                         input_length=n-1,
                         mask_zero=True)(lm_input)
lm_rnn = RNN(64)(lm_embedded)
lm_output = Dense(len(word2id)+2, activation='softmax')(lm_rnn)
language_model = Model(input=lm_input, output=lm_output)
language_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
 
def lm_generator(ngrams, batch_size):
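    # yield shuffled batches of (n-1)-gram contexts and one-hot next-character targets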
    while True:
        np.random.shuffle(ngrams)
        for p in np.split(ngrams, range(batch_size, len(ngrams), batch_size)):
            yield p[:, :-1], np_utils.to_categorical(p[:, -1], len(word2id)+2)
 
nb_epoch = 8 # accuracy changes only slightly after 5 epochs
batch_size = 4096
lm_history = language_model.fit_generator(lm_generator(ngrams, batch_size), nb_epoch=nb_epoch, samples_per_epoch=len(ngrams))
language_model.save_weights('%s/language_model_weights.model'%now)
structure = open('%s/language_model_structure.model'%now, 'w')
structure.write(language_model.to_json())
structure.close()
 
# here we use 2 layers of bidirectional GRU to make a sequence tagging model
print u'training ner model ...'
ner_input = Input(shape=(maxlen,), dtype='int32')
ner_embedded = Embedding(len(word2id)+2,
                         word_size,
                         input_length=maxlen,
                         mask_zero=True,
                         trainable=False,
                         weights=[language_model.get_weights()[0]])(ner_input)
ner_brnn = Bidirectional(RNN(64, return_sequences=True), merge_mode='sum')(ner_embedded)
ner_brnn = Bidirectional(RNN(32, return_sequences=True), merge_mode='sum')(ner_brnn)
ner_output = TimeDistributed(Dense(5, activation='softmax'))(ner_brnn)
ner_model = Model(input=ner_input, output=ner_output)
ner_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
 
ner_data = train_data['doc2id'].apply(lambda s: s[:maxlen] + [0]*(maxlen - len(s[:maxlen])))
ner_data = np.array(list(ner_data))
ner_target = train_data['label'].apply(list).apply(lambda s: s[:maxlen] + [0]*(maxlen - len(s[:maxlen])))
ner_target = np.array(list(ner_target))
ner_target = np.array(map(lambda y:np_utils.to_categorical(y,5), ner_target))
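# down-weight training sentences that contain many neutral-tagged characters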
sample_weight = (3/(train_data['label'].apply(lambda s:(np.array(s)==2).sum())+3)).as_matrix()
 
nb_epoch = 300
batch_size = 1024
ner_history_1 = ner_model.fit(ner_data, ner_target, batch_size=batch_size, nb_epoch=nb_epoch, sample_weight=sample_weight)
ner_model.save_weights('%s/ner_model_weights_1.model'%now)
structure = open('%s/ner_model_structure_1.model'%now, 'w')
structure.write(ner_model.to_json())
structure.close()
 
test_ner_data = test_data['doc2id'].apply(lambda s: s[:maxlen] + [0]*(maxlen - len(s[:maxlen])))
test_ner_data = np.array(list(test_ner_data))
 
print u'predicting ...'
train_data['predict'] = list(ner_model.predict(ner_data, batch_size=batch_size, verbose=1))
test_data['predict'] = list(ner_model.predict(test_ner_data, batch_size=batch_size, verbose=1))
 
def viterbi(nodes):
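    # decode the best tag path from per-character tag scores; zy holds the allowed (log) transition scores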
    paths = nodes[0]
    for l in range(1,len(nodes)):
        paths_ = paths.copy()
        paths = {}
        for i in nodes[l].keys():
            nows = {}
            for j in paths_.keys():
                if j[-1]+i in zy.keys():
                    nows[j+i]= paths_[j]+nodes[l][i]+zy[j[-1]+i]
            k = np.argmax(nows.values())
            paths[nows.keys()[k]] = nows.values()[k]
    return paths.keys()[np.argmax(paths.values())]
 
# allowed tag transitions: characters inside one entity keep a single polarity,
# and different polarities can only be separated by the non-entity tag '0'
zy = {'00':1,
      '01':1,
      '02':1,
      '03':1,
      '10':1,
      '11':1,
      '20':1,
      '22':1,
      '30':1,
      '33':1}
 
zy = {i:np.log(zy[i]) for i in zy.keys()}
 
from acora import AcoraBuilder
views = pd.read_csv('View.csv', delimiter='\t', encoding='utf-8')['View']
views = AcoraBuilder(*views)
views = views.build()
 
def predict(i, data):
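    # decode sentence i twice (without and with entity-list boosting) and collect (SentenceId, View, Opinion) triples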
    y_pred = data.loc[i, 'predict']
    s = data.loc[i, 'Content'][:maxlen]
    nodes = [dict(zip(['0','1','2','3'], k)) for k in np.log(y_pred[:len(s)])]
    tags_pred_1 = viterbi(nodes)
    for j in views.finditer(s):
        for k in range(j[1], j[1]+len(j[0])):
            nodes[k]['1'] += 100
            nodes[k]['2'] += 100
            nodes[k]['3'] += 100
        try:
            nodes[j[1]-1]['0'] += 50
            nodes[k+1]['0'] += 50
        except:
            pass
    tags_pred_2 = viterbi(nodes)
    r = []
    for j in re.finditer('1+|2+|3+', tags_pred_2):
        t = pd.Series(list(tags_pred_1[j.start():j.end()])).value_counts()
        t = t[t.index != '0']
        if len(t) == 0:
            continue
        else:
            if t.index[0] == '1':
                r.append((i, s[j.start():j.end()], 'pos'))
            elif t.index[0] == '2':
                r.append((i, s[j.start():j.end()], 'neu'))
            else:
                r.append((i, s[j.start():j.end()], 'neg'))
    return r
 
print u'creating the final export ...'
train_data['pred'] = map(lambda i: predict(i, train_data), tqdm(iter(train_data.index)))
test_data['pred'] = map(lambda i: predict(i, test_data), tqdm(iter(test_data.index)))
 
result_1 = pd.DataFrame(list(chain(*test_data['pred'])), columns=['SentenceId', 'View', 'Opinion'])
result_1 = result_1.drop_duplicates()
result_1.to_csv('%s/result_1.csv'%now, index=None, encoding='utf-8')
 
# transfer learning
# append the test set, labelled with the first-round predictions, to the training data and train the ner model again
result_1['SentenceId'] = result_1['SentenceId'].apply(int)
result = result_1.set_index('SentenceId')
 
def label2tag(i):
    s = test_data.loc[i]['Content']
    r = np.array([0]*len(s))
    try:
        l = result.loc[[i]].as_matrix()
    except:
        return r
    for i in l:
        for j in findall(i[0], s):
            r[j:j+len(i[0])] = tags[i[1]]
    return r
 
test_data['label'] = map(label2tag, tqdm(iter(test_data.index)))
ner_data = train_data['doc2id'].append(test_data['doc2id']).apply(lambda s: s[:maxlen] + [0]*(maxlen - len(s[:maxlen])))
ner_data = np.array(list(ner_data))
ner_target = train_data['label'].append(test_data['label']).apply(list).apply(lambda s: s[:maxlen] + [0]*(maxlen - len(s[:maxlen])))
ner_target = np.array(list(ner_target))
ner_target = np.array(map(lambda y:np_utils.to_categorical(y, 5), ner_target))
 
nb_epoch = 100
batch_size = 1024
ner_history_2 = ner_model.fit(ner_data, ner_target, batch_size=batch_size, nb_epoch=nb_epoch)
ner_model.save_weights('%s/ner_model_weights_2.model'%now)
structure = open('%s/ner_model_structure_2.model'%now, 'w')
structure.write(ner_model.to_json())
structure.close()
 
print u'predicting again ...'
test_data['predict'] = list(ner_model.predict(test_ner_data, batch_size=batch_size, verbose=1))
 
print u'creating the final export again ...'
test_data['pred'] = map(lambda i: predict(i, test_data), tqdm(iter(test_data.index)))
 
result_2 = pd.DataFrame(list(chain(*test_data['pred'])), columns=['SentenceId', 'View', 'Opinion'])
result_2 = result_2.drop_duplicates()
result_2.to_csv('%s/result_2.csv'%now, index=None, encoding='utf-8')

Packaged download: 基于视角的领域感情分析_打包.7z


When reposting, please include the address of this article: http://spaces.ac.cn/archives/4118/
