Clustering Sogou Chinese news with word2vec


ChineseParticiple

This project clusters Sogou Chinese news using word2vec.

(1) Download the Sogou data: http://www.sogou.com/labs/sogoudownload/SogouCA/news_tensite_xml.full.zip

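(2) Strip the markup: convert the GBK-encoded dump to UTF-8 and keep only the lines that hold the article bodies: cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt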
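
The grep in step (2) keeps whole lines, so any wrapper tags on those lines survive into corpus.txt. The steps here do not say whether they are removed before segmentation. A minimal sketch of one possible cleanup, assuming each article body sits on a single line wrapped in `<content>...</content>`; the sample file names and sample text are hypothetical and not part of the repository:

```shell
# Hypothetical sample mimicking the assumed layout of corpus.txt:
# one article body per line, wrapped in <content>...</content>.
# (The real corpus holds Chinese text; ASCII is used here for brevity.)
printf '%s\n' \
  '<content>first article body</content>' \
  '<content></content>' \
  '<content>second article body</content>' > corpus.sample.txt

# Remove the opening and closing tags, then delete lines left empty.
sed -e 's/<content>//' \
    -e 's/<\/content>//' \
    -e '/^[[:space:]]*$/d' corpus.sample.txt > corpus.clean.sample.txt

cat corpus.clean.sample.txt
```

On the real data the same sed expression would read corpus.txt instead of the sample file.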

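(3) Word segmentation: the text can be segmented with the Java library ANSJ. word2vec expects space-separated tokens, and the segmented output is the file the following steps read as resultbig.txt.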

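(4) Train the word vectors (skip-gram via -cbow 0, hierarchical softmax via -hs 1 with negative sampling turned off via -negative 0, 200-dimensional vectors): ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1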

(5) Compute distances: ./distance vectors.bin (interactive: enter a word at the prompt to list its nearest neighbours)
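
The distance tool ranks neighbours by the cosine similarity between word vectors. A rough sketch of that measure for a single pair of words, assuming the vectors were saved in text form (training with -binary 0 instead of -binary 1) so that awk can read them. The `cosine` helper, the sample file and its 3-dimensional vectors are hypothetical:

```shell
# Hypothetical vectors in word2vec text format:
# a header line "vocab_size dim", then "word v1 v2 ..." per line.
cat > vectors.sample.txt <<'EOF'
3 3
a 1 0 0
b 1 1 0
c 0 0 1
EOF

# cosine WORD1 WORD2 FILE: print the cosine similarity of two word vectors.
cosine() {
  awk -v w1="$1" -v w2="$2" '
    NR == 1  { next }                                   # skip the header
    $1 == w1 { for (i = 2; i <= NF; i++) a[i] = $i; fa = 1 }
    $1 == w2 { for (i = 2; i <= NF; i++) b[i] = $i; fb = 1 }
    END {
      if (!fa || !fb) { print "word not found" > "/dev/stderr"; exit 1 }
      for (i in a) { dot += a[i] * b[i]; na += a[i] ^ 2; nb += b[i] ^ 2 }
      printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
    }' "$3"
}

cosine a b vectors.sample.txt   # vectors 45 degrees apart
cosine a c vectors.sample.txt   # orthogonal vectors
```

Unlike distance, this compares one pair only; it does not scan the whole vocabulary for the nearest neighbours.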

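(6) Cluster: with -classes 500, word2vec runs k-means over the word vectors and writes each word with its class ID instead of the vectors: ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500

Sort by class ID so that words in the same cluster end up next to each other: sort classes.txt -k 2 -n > classes.sorted.txt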
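
Once sorted, the class file can be inspected with standard tools. A small sketch, assuming the "word class_id" line format that -classes writes; the sample words, the cluster IDs and the *.sample.txt file names are made up for illustration:

```shell
# Hypothetical excerpt in "word class_id" format.
cat > classes.sample.txt <<'EOF'
beijing 2
football 0
shanghai 2
basketball 0
stock 1
EOF

# Same sort as in step (6), applied to the sample.
sort -k 2 -n classes.sample.txt > classes.sorted.sample.txt

# List the words of one cluster (here cluster 2), one per line.
awk -v c=2 '$2 == c { print $1 }' classes.sorted.sample.txt

# Count the words in each cluster, printed as "class_id count".
awk '{ n[$2]++ } END { for (c in n) print c, n[c] }' classes.sorted.sample.txt \
  | sort -n > cluster_sizes.sample.txt
cat cluster_sizes.sample.txt
```

Cluster sizes give a quick check on whether 500 classes splits the vocabulary evenly or leaves a few very large clusters.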