# ChineseParticiple

Clustering Sogou Chinese news with word2vec.

(1) Download the Sogou data

http://www.sogou.com/labs/sogoudownload/SogouCA/news_tensite_xml.full.zip

(2) Remove the HTML tags

    cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "" > corpus.txt

(3) Word segmentation

The text can be segmented with the Java package ANSJ.

(4) Train the word vectors

    ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1

(5) Compute distances

    ./distance vectors.bin

(6) Cluster

    ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500
    sort classes.txt -k 2 -n > classes.sorted.txt
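
To inspect the result of the clustering step, it can help to group the words by cluster id. The following is a minimal Python sketch, not part of this repository. It assumes that each line of `classes.sorted.txt` has the form `<word> <cluster_id>` separated by whitespace, and that the file is UTF-8 encoded (as produced by the `iconv` step). The function names `group_by_cluster` and `load_clusters` are hypothetical helpers introduced here for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_cluster(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Group words by cluster id from lines of the form '<word> <cluster_id>'."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, cluster_id = parts
        clusters[int(cluster_id)].append(word)
    return dict(clusters)


def load_clusters(path: str) -> Dict[int, List[str]]:
    """Read a classes file (assumed UTF-8) and group its words by cluster id."""
    with open(path, encoding="utf-8") as f:
        return group_by_cluster(f)
```

Calling `load_clusters("classes.sorted.txt")` would return a dictionary mapping each cluster id to its list of words, which can then be printed or sampled to judge whether the clusters look meaningful.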