
* Remove data files.

Add MATLAB code

Accidentally deleted main.py

Organize the project

Update 'README.md'
liuyuqi-dellpc 7 years ago
commit
eb5896528d

+ 98 - 0
.gitignore

@@ -0,0 +1,98 @@
+# ---> R
+# History files
+.Rhistory
+.Rapp.history
+
+# Example code in package build process
+*-Ex.R
+
+# RStudio files
+.Rproj.user/
+
+# produced vignettes
+vignettes/*.html
+vignettes/*.pdf
+
+# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
+.httr-oauth
+
+# ---> Python
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*,cover
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# ---> Matlab
+##---------------------------------------------------
+## Remove autosaves generated by the Matlab editor
+## We have git for backups!
+##---------------------------------------------------
+
+# Windows default autosave extension
+*.asv
+
+# OSX / *nix default autosave extension
+*.m~
+
+# Compiled MEX binaries (all platforms)
+*.mex*
+
+# Simulink Code Generation
+slprj/
+
+/data
+/output

+ 72 - 0
LICENSE

@@ -0,0 +1,72 @@
+Apache License 
+Version 2.0, January 2004 
+http://www.apache.org/licenses/
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+1. Definitions.
+
+"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
+
+"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
+
+"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
+
+"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
+
+"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
+
+"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
+
+"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
+
+"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
+
+"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
+
+"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
+
+2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
+
+3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
+
+4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
+
+(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
+
+(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
+
+(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
+
+(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
+
+You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
+
+5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
+
+6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
+
+7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
+
+8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
+
+9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
+
+END OF TERMS AND CONDITIONS
+
+APPENDIX: How to apply the Apache License to your work.
+
+To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.
+
+Copyright [yyyy] [name of copyright owner]
+
+Licensed under the Apache License, Version 2.0 (the "License"); 
+you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software 
+distributed under the License is distributed on an "AS IS" BASIS, 
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and 
+limitations under the License.

+ 37 - 0
README.md

@@ -0,0 +1,37 @@
+# haoxin
+
+Technical documentation for the "Haoxin Cup" competition; important reference material on transfer learning.
+
+### Competition source
+Kesci platform: [Qianhai Credit (前海征信) "Haoxin Cup" Big Data Algorithm Competition](https://www.kesci.com/apps/home/competition/58e46b3b9ed26b1e09bfbbb7)
+
+### Solution overview
+The solution applies the basic idea of transfer learning to the competition task:
+given 40,000 rows of business-A data and 4,000 rows of business-B data, build a credit-scoring model for business B. Business A is unsecured credit lending: the borrower provides no collateral and obtains the loan on reputation alone, with creditworthiness serving as the repayment guarantee. Business B is cash lending, i.e. payday loans; compared with typical consumer-finance products, cash loans have five traits matching their low barrier to borrowing: small amounts, short terms, no collateral, fast approval, and high interest rates. Since businesses A and B are related, the core question of the competition is **how to transfer knowledge from business A to business B** to strengthen B's credit-scoring model.
+
+### Modeling
+The modeling itself is fairly simple and breaks into three steps: ① train a model on the target domain and keep it as main model 1; ② train a model on the source domain, transfer it to the target domain, and fine-tune it on target-domain data to get main model 2; ③ blend the two with a sensible weighting to produce the final result, which accounts for the relationship between the transferred model and the target-domain model.
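+
+A minimal sketch of the three steps (toy data and weights, not the competition settings; assumes XGBoost's `xgb_model` argument for continued training):
+
+```python
+import numpy as np
+import xgboost as xgb
+
+params = {"objective": "binary:logistic", "eta": 0.05, "max_depth": 4}
+rng = np.random.RandomState(0)
+dtrain_A = xgb.DMatrix(rng.rand(400, 10), label=rng.randint(0, 2, 400))  # source domain (A)
+dtrain_B = xgb.DMatrix(rng.rand(40, 10), label=rng.randint(0, 2, 40))    # target domain (B)
+dtest_B = xgb.DMatrix(rng.rand(20, 10))
+
+# Step 1: main model 1, trained on the target domain alone
+model_1 = xgb.train(params, dtrain_B, num_boost_round=130)
+
+# Step 2: main model 2, trained on the source domain, then fine-tuned on the target domain
+model_a = xgb.train(params, dtrain_A, num_boost_round=300)
+model_2 = xgb.train(params, dtrain_B, num_boost_round=50, xgb_model=model_a)
+
+# Step 3: weighted blend; in practice w is chosen on a validation split
+w = 0.6
+pred = w * model_1.predict(dtest_B) + (1 - w) * model_2.predict(dtest_B)
+```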
+
+### Challenges
+① Misunderstanding the user ids across the A and B data sets cost a lot of time; it turned out that identical ids are not the same user. ② Getting pushed down the leaderboard by strong competitors on the final day; exhausting, ha.
+
+### Highlights
+The biggest strength is simplicity and practicality: transfer is done by fine-tuning the source-domain model on target-domain data (see the Modeling sketch above), the algorithm
+is just XGBoost plus simple bagging (sketched below), and the way the two main models are related is easy to follow.
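+
+The bagging piece, sketched as seed-averaged XGBoost boosters (an assumption; the exact bagging setup is not recorded in this repo):
+
+```python
+import numpy as np
+import xgboost as xgb
+
+rng = np.random.RandomState(0)
+dtrain = xgb.DMatrix(rng.rand(200, 10), label=rng.randint(0, 2, 200))
+dtest = xgb.DMatrix(rng.rand(50, 10))
+
+# Average predictions from boosters trained with different seeds and row subsampling
+preds = []
+for seed in range(5):
+    params = {"objective": "binary:logistic", "eta": 0.05, "max_depth": 4,
+              "subsample": 0.8, "seed": seed}
+    preds.append(xgb.train(params, dtrain, num_boost_round=100).predict(dtest))
+bagged = np.mean(preds, axis=0)
+```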
+
+### Lessons learned
+1. Practical feature-engineering tricks (variance, mean, and other classic moves);
+2. A practical method (from dandange): use a weak classifier to pick out the source-domain rows most similar to the target domain, then merge them into the training data (see the sketch after this list);
+3. The ensembling tricks shared by the organizers;
+4. Always read the forum announcements during a competition!
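+
+A sketch of the sample-selection trick from point 2, using logistic regression as the weak classifier (toy data; the original choice of classifier and threshold is not recorded here):
+
+```python
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+rng = np.random.RandomState(0)
+X_a = rng.rand(400, 10)  # source-domain rows (business A)
+X_b = rng.rand(40, 10)   # target-domain rows (business B)
+
+# Weak classifier that separates A rows from B rows
+X = np.vstack([X_a, X_b])
+domain = np.r_[np.zeros(len(X_a)), np.ones(len(X_b))]
+clf = LogisticRegression(max_iter=200).fit(X, domain)
+
+# Keep the A rows the classifier finds most B-like, then merge them with B for training
+b_likeness = clf.predict_proba(X_a)[:, 1]
+similar_A = X_a[b_likeness > np.quantile(b_likeness, 0.8)]
+```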
+
+## Transfer learning reference documents and code:
+
+Link: https://pan.baidu.com/s/1jHIwb1G  Password: wbaf
+
+Download the data archive (前海征信数据 4-14 .zip) into the data directory

File diff suppressed because it is too large
+ 1951 - 0
src/R/.ipynb_checkpoints/haoxin-checkpoint.ipynb


+ 9 - 0
src/R/README.md

@@ -0,0 +1,9 @@
+# Testing
+importData.R   load the data
+testResult.R   write a baseline (constant 0.5) test result
+
+# Running
+importData.R   load the data
+
+preMain.R   data preprocessing
+main.R      data processing / modeling

File diff suppressed because it is too large
+ 1951 - 0
src/R/haoxin.ipynb


+ 59 - 0
src/R/importData.R

@@ -0,0 +1,59 @@
+
+# Set the working directory
+setwd("D:/liuyuqi/数学建模项目/kseci比赛/好信杯")
+getwd()
+filenames=dir()
+
+####################################
+# Read the data
+A_train=read.csv('data/A_train.csv',header = TRUE)
+B_train=read.csv('data/B_train.csv',header = TRUE)
+B_test=read.csv('data/B_test.csv',header = TRUE)
+#dataA<-read.table('A_train.csv')
+
+####################################
+# Inspect the data types
+mode(A_train)
+mode(B_train)
+mode(B_test)
+
+# Number of rows and columns
+dim(A_train)
+# Column names
+names(A_train)
+# Type of a single column
+typeof(A_train$UserInfo_1)
+
+####################################
+#summary(A_train)
+# Preview the first n rows of a CSV file
+readfile<-function(file, n=1000, header=T){
+  pt <- file(file, "r")
+  name <- NULL
+  if(header){
+    name <- strsplit(readLines(pt, 1), split=',')[[1]]  # read the header line
+    f1 <- readLines(pt, n)
+    data <- read.table(text=f1, sep=',', col.names=name)
+  }else{
+    f1 <- readLines(pt, n)  # no header row: read the first n lines directly
+    data <- read.table(text=f1, sep=',')
+  }
+  close(pt)
+  data
+}
+
+
+#A_train.csv
+#(data1 <- readfile(file="data/A_train.csv", n=5, header=T))
+#B_train.csv
+#(data2<- readfile(file="data/B_train.csv", n=5, header=T))
+#B_test.csv
+#(data3 <- readfile(file="data/B_test.csv", n=5, header=T))
+
+# loandata=data.frame(read.csv('loan_data.csv',header = 1))
+#result_table.dt <- data.frame(no = data3$no)
+# dim(B_test)
+
+# Edit interactively
+#fix(A_train)
+#head(A_train,n=10)
+#tail(A_train,n=10)

+ 63 - 0
src/R/main.r

@@ -0,0 +1,63 @@
+library(RSNNS)
+
+library(ggplot2)
+library(plyr)
+library(nnet)
+library(rpart)
+library(tree)
+library(AMORE)
+#library(neuralnet)
+#svm
+library(e1071)
+
+# Define the network input
+# Index of the flag column
+flag=which(names(B_train3)=="flag")
+row=c(2:length(B_train3))
+row=row[row!=flag]  # drop the result column
+B_train3_data= B_train3[,row]
+B_train3_result=B_train3[,flag]
+
+## caret/kernlab, following the spam example in testCaret.R
+library(caret)
+library(kernlab)
+
+# Split B_train3 into training and test sets (the original draft mistakenly
+# reused `spam` and `inTrain` from testCaret.R here)
+B_train3$flag <- factor(B_train3$flag)  # caret needs a factor outcome for classification
+inTrain <- createDataPartition(y=B_train3$flag,p=0.8,list=FALSE)
+training <- B_train3[inTrain, ]
+testing <- B_train3[-inTrain, ]
+
+# Train the model
+modelFit <- train(flag~.,data=training,method="glm")
+# train() is the trainer: flag~. is the model formula, data=training names the data set,
+# and method="glm" picks the model family; SVM, nnet, and other methods work the same way
+modelFit$finalModel
+
+# Validate the model
+predictions <- predict(modelFit,newdata=testing)
+head(predictions)  # preview the predictions
+
+## Confusion matrix: compare the predicted and true labels to get the error rate
+confusionMatrix(predictions,testing$flag)
+
+# Define the network output and convert the labels
+B_trainTargets = decodeClassLabels(B_train3_result)
+# Split into training and test samples (note: the name `iris` is reused for the B data here)
+iris = splitForTrainingAndTest(B_train3_data, B_trainTargets, ratio=0.15)
+# Normalize the data
+iris = normTrainingAndTestSet(iris)
+# Train a feed-forward backprop network (multi-layer perceptron) with mlp
+model = mlp(iris$inputsTrain, iris$targetsTrain, size=5, learnFunc="Quickprop", learnFuncParams=c(0.1, 2.0, 0.0001, 0.1),maxit=100, inputsTest=iris$inputsTest, targetsTest=iris$targetsTest)
+# Predict with the trained model
+predictions = predict(model,iris$inputsTest)
+# Confusion matrix to check prediction accuracy
+confusionMatrix(iris$targetsTest,predictions)

+ 8 - 0
src/R/plot.R

@@ -0,0 +1,8 @@
+
+library(ggplot2)
+library(plyr)
+library(nnet)
+library(rpart)
+library(tree)
+#svm
+library(e1071)

+ 53 - 0
src/R/preMain.R

@@ -0,0 +1,53 @@
+# Data preprocessing: missing and erroneous data
+
+## Identify missing data: the vast majority of values are missing (n=10207769),
+## and the first column is almost all zeros, so run a correlation analysis on the remaining data
+#is.na(A_train)  # flag missing entries
+n=sum(is.na(A_train))  # count of missing values
+summary(A_train$UserInfo_4)   # 39896 NAs
+length(A_train$UserInfo_4)
+
+## Back up and restore the data
+A_train_bak=A_train
+#A_train=A_train_bak
+
+# Fraction of missing values per column, with a plot
+fn.persent=function(data){
+  persent=NULL
+  for (i in 1:length(data)) {
+    temp=data[,i]
+    persent[i]=sum(is.na(temp))/length(temp)
+  }
+  plot(persent)
+  return(persent)
+}
+
+persentAtr=fn.persent(A_train)
+## Missing-rate statistics: the plot shows heavy missingness; across the 279 variables it is mostly above 40%
+
+
+# Identify and handle abnormal variables
+# The plot shows many variables (196) with more than 50% missing; imputing them would be unreliable, so drop them.
+tempAtr=which(persentAtr>0.39)
+A_train2=A_train[,-tempAtr]
+length(A_train2)
+fn.persent(A_train2)
+
+#######################################################################
+## B_train: likewise, drop the variables with too much missing data
+presentBtr=fn.persent(B_train)
+
+tempBtr=which(presentBtr>0.65)
+B_train2=B_train[,-tempBtr]
+length(B_train2)
+fn.persent(B_train2)
+
+#######################################################################
+## B_test: likewise, drop the variables with too much missing data
+presentBte=fn.persent(B_test)
+tempBte=which(presentBte>0.65)
+B_test2=B_test[,-tempBte]
+fn.persent(B_test2)

+ 49 - 0
src/R/preMain2.R

@@ -0,0 +1,49 @@
+# preMain.R produces the following values (persentAtr is the missing-rate vector for A_train;
+# tempAtr holds the indices of the variables whose missing rate exceeds the threshold):
+#persentAtr
+#presentBtr
+#presentBte
+#tempAtr
+#tempBtr
+#tempBte
+
+# intersect would give the variables passing the threshold in both A_train and B_train
+#tempABtr=intersect(tempBtr,tempAtr)
+#tempB=intersect(tempBtr,tempBte)
+#tempAB=intersect(tempABtr,tempBte)
+# union of the two sets
+tempABtr=union(tempBtr,tempAtr)
+
+# Identify and handle abnormal variables
+# Many variables have too many missing values to impute reliably, so drop them.
+# Remove the heavily missing variables from A_train and B_train
+A_train3=A_train[,-tempABtr]
+B_train3=B_train[,-tempABtr]
+length(A_train3)
+fn.persent(A_train3)
+length(B_train3)
+fn.persent(B_train3)
+
+# Save
+# Write the results
+#write.csv(A_train3, "data/A_train3.csv", row.names = F, fileEncoding = "utf-8" )
+#write.csv(B_train3, "data/B_train3.csv", row.names = F, fileEncoding = "utf-8" )
+
+
+# Handle missing values
+# With missing rates this high, dropping incomplete rows would shrink the data drastically,
+# so the missing values need to be imputed instead.
+
+
+
+# The Hmisc package makes this simple: impute the mean, the median, or a fixed value.
+library(Hmisc)
+#impute(newdata$Dream,mean)
+#impute(newdata$Dream,median)
+#impute(newdata$Dream,2)
+A_train3[[2]]=impute(A_train3[[2]],mean)  # single-column example; the loops below cover every column
+for (i in 1:length(A_train3)) {
+  A_train3[[i]]=impute(A_train3[[i]],mean)
+}
+for (i in 1:length(B_train3)) {
+  B_train3[[i]]=impute(B_train3[[i]],mean)
+}
+##########################stop###############################

+ 99 - 0
src/R/preMain2.bak.R

@@ -0,0 +1,99 @@
+# preMain.R produces the following values (persentAtr is the missing-rate vector for A_train;
+# tempAtr holds the indices of the variables whose missing rate exceeds the threshold):
+#persentAtr
+#presentBtr
+#presentBte
+#tempAtr
+#tempBtr
+#tempBte
+
+# intersect would give the variables passing the threshold in both A_train and B_train
+#tempABtr=intersect(tempBtr,tempAtr)
+#tempB=intersect(tempBtr,tempBte)
+#tempAB=intersect(tempABtr,tempBte)
+# union of the two sets
+tempABtr=union(tempBtr,tempAtr)
+
+# Identify and handle abnormal variables
+# Many variables have too many missing values to impute reliably, so drop them.
+# Remove the heavily missing variables from A_train and B_train
+A_train3=A_train[,-tempABtr]
+B_train3=B_train[,-tempABtr]
+length(A_train3)
+fn.persent(A_train3)
+length(B_train3)
+fn.persent(B_train3)
+
+# Save
+# Write the results
+#write.csv(A_train3, "data/A_train3.csv", row.names = F, fileEncoding = "utf-8" )
+#write.csv(B_train3, "data/B_train3.csv", row.names = F, fileEncoding = "utf-8" )
+
+
+# Handle missing values
+# With missing rates this high, dropping incomplete rows would shrink the data drastically,
+# so the missing values need to be imputed instead.
+
+
+
+# The Hmisc package makes this simple: impute the mean, the median, or a fixed value.
+library(Hmisc)
+#impute(newdata$Dream,mean)
+#impute(newdata$Dream,median)
+#impute(newdata$Dream,2)
+A_train3[[2]]=impute(A_train3[[2]],mean)  # single-column example; the loops below cover every column
+for (i in 1:length(A_train3)) {
+  A_train3[[i]]=impute(A_train3[[i]],mean)
+}
+for (i in 1:length(B_train3)) {
+  B_train3[[i]]=impute(B_train3[[i]],mean)
+}
+##########################stop###############################
+
+# Multiple imputation for the missing values (unfinished)
+library(lattice)  # the next three packages are dependencies of mice
+library(MASS)
+library(nnet)
+library(mice)
+#help(mice)
+#library(VIM)
+
+# Show the missing-value pattern
+head(md.pattern(A_train3))
+
+
+tmp1=md.pattern(A_train3)  # 0/1 missing-value pattern
+# Note: the lm/pool lines below were copied from a mice tutorial and still reference
+# tutorial variables (sales, date); they will not run on this data as written.
+fit=with(tmp1,lm(sales~date,data=A_train3))
+pooled=pool(fit)
+summary(pooled)
+resultsA=complete(tmp1,action=3)
+
+# m sets the number of imputed data sets; meth='pmm' uses predictive mean matching.
+# Other methods: methods(mice). mice builds the imputations with chained equations.
+imp=mice(A_train,m=3,maxit=50,meth='pmm',seed=500)
+result4=complete(imp,action=3)  # take the third imputed data set as the result
+
+# Inspect the imputation (xyplot example from the mice tutorial; tempData, Ozone, etc. are tutorial names)
+library(lattice)
+xyplot(tempData,Ozone ~ Wind+Temp+Solar.R,pch=18,cex=1)
+
+# Assemble the training set (unfinished)
+#A_train_x=A_train[,-]
+A_train_y=A_train$flag
+write.csv(A_train, "data/pre_A_train.csv", row.names = F, fileEncoding = "utf-8" )
+
+#######################################################################
+## B_train: likewise, drop the variables with too much missing data
+persentA=fn.persent(B_train)
+temp=which(persentA>0.75)
+B_train2=B_train[,-temp]
+length(B_train2)
+persentA=fn.persent(B_train2)
+B_train2$flag=1
+B_train3=B_train2
+
+#######################################################################
+## B_test: likewise, drop the variables with too much missing data
+persentA=fn.persent(B_test)
+temp=which(persentA>0.75)
+B_test2=B_test[,-temp]
+persentA=fn.persent(B_test2)

+ 23 - 0
src/R/testCaret.R

@@ -0,0 +1,23 @@
+## The spam data set from kernlab: an email data set with 4601 observations and 58 variables
+library(caret)
+library(kernlab)
+data(spam)
+head(spam)
+
+## Partition the data with createDataPartition
+inTrain <- createDataPartition(y=spam$type,p=0.8,list=FALSE)
+training <- spam[inTrain, ]
+testing <- spam[-inTrain, ]
+
+
+# Train the model
+modelFit <- train(type~.,data=training,method="glm")
+# train() is the trainer: type~. is the model formula, data=training names the data set,
+# and method="glm" picks the model family; SVM, nnet, and other methods work the same way
+modelFit$finalModel
+
+# Validate the model
+predictions <- predict(modelFit,newdata=testing)
+head(predictions)  # preview the predictions
+
+## Confusion matrix: compare the predicted and true labels; accuracy comes out around 0.9184
+confusionMatrix(predictions,testing$type)

+ 4 - 0
src/R/testResult.R

@@ -0,0 +1,4 @@
+# Write a baseline result: set every prediction to 0.5
+result <- data.frame(no = B_test$no, pred=rep(0.5,dim(B_test)[1]))
+write.csv(result, "submit.csv", row.names = F, fileEncoding = "utf-8" )
+(data3 <- readfile(file="submit.csv", n=5, header=T))

+ 20 - 0
src/R/trainIris.R

@@ -0,0 +1,20 @@
+# Load the packages and data
+library(Rcpp)
+library(RSNNS)
+data(iris)
+# Shuffle the rows
+iris = iris[sample(1:nrow(iris),length(1:nrow(iris))),1:ncol(iris)]
+# Define the network input
+irisValues= iris[,1:4]
+# Define the network output and convert the labels
+irisTargets = decodeClassLabels(iris[,5])
+# Split into training and test samples
+iris = splitForTrainingAndTest(irisValues, irisTargets, ratio=0.15)
+# Normalize the data
+iris = normTrainingAndTestSet(iris)
+# Train a feed-forward backprop network (multi-layer perceptron) with mlp
+model = mlp(iris$inputsTrain, iris$targetsTrain, size=5, learnFunc="Quickprop", learnFuncParams=c(0.1, 2.0, 0.0001, 0.1),maxit=100, inputsTest=iris$inputsTest, targetsTest=iris$targetsTest)
+# Predict with the trained model
+predictions = predict(model,iris$inputsTest)
+# Confusion matrix to check prediction accuracy
+confusionMatrix(iris$targetsTest,predictions)

+ 4 - 0
src/R/writeResult.R

@@ -0,0 +1,4 @@
+# Write the result
+result <- data.frame(no = B_test$no, pred=rep(0.5,dim(B_test)[1]))
+write.csv(result, "submit.csv", row.names = F, fileEncoding = "utf-8" )
+(data3 <- readfile(file="submit.csv", n=5, header=T))

+ 23 - 0
src/matlab/Transfer.m

@@ -0,0 +1,23 @@
+clc,clear all
+% Read the images
+allImages = imageDatastore('TrainingData', 'IncludeSubfolders', true, 'LabelSource', 'foldernames');
+% Split into training and test data, 80% for training
+[trainingImages, testImages] = splitEachLabel(allImages, 0.8, 'randomize');
+
+% Load pretrained AlexNet and swap the last layers for a 5-class problem
+alex = alexnet;
+layers = alex.Layers
+layers(23) = fullyConnectedLayer(5);
+layers(25) = classificationLayer
+
+% Fine-tune on the new images and measure accuracy on the held-out set
+opts = trainingOptions('sgdm', 'InitialLearnRate', 0.001, 'MaxEpochs', 20, 'MiniBatchSize', 64);
+trainingImages.ReadFcn = @readFunctionTrain;
+myNet = trainNetwork(trainingImages, layers, opts);
+testImages.ReadFcn = @readFunctionTrain;
+predictedLabels = classify(myNet, testImages);
+accuracy = mean(predictedLabels == testImages.Labels)

+ 3 - 0
src/matlab/importData.m

@@ -0,0 +1,3 @@
+% Set the path
+addpath(genpath(pwd))
+M = csvread('data/A_train3.csv',1);  % skip the header row

+ 1 - 0
src/matlab/set_paths.m

@@ -0,0 +1 @@
+addpath(genpath(pwd))

+ 178 - 0
src/python/.ipynb_checkpoints/Feature_engineering-checkpoint.ipynb

@@ -0,0 +1,178 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 特征工程"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from sklearn.ensemble import GradientBoostingClassifier\n",
+    "from sklearn.model_selection import train_test_split,StratifiedKFold\n",
+    "from sklearn.metrics import auc,roc_auc_score\n",
+    "import xgboost as xgb  \n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 切换项目目录\n",
+    "os.getcwd()\n",
+    "os.chdir(\"/media/sf_share/linux/haoxin\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 把特征个数少的列全部丢弃,使用和不适用效果影响不大,建议使用,省时间内存\n",
+    "def null_ratio(df):\n",
+    "    features=df.columns\n",
+    "    feature_selected=[]\n",
+    "    drop_index=[]\n",
+    "    sz=df.size\n",
+    "    for feat in features:\n",
+    "        sz_null=df[df[feat].isnull()].size\n",
+    "        ratio=float(sz_null)/sz\n",
+    "        if ratio > 0.9:\n",
+    "            feature_selected.append((feat,ratio))\n",
+    "            drop_index.append(feat) \n",
+    "    return feature_selected,drop_index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def dumyuserfeature(train):\n",
+    "    pre_feat = 'UserInfo_'\n",
+    "    # 检查离散变量和连续变量\n",
+    "    asd = 0\n",
+    "    index=[]\n",
+    "    train_copy = train.copy()\n",
+    "    for i ,col in enumerate(train.columns):\n",
+    "        print(i)\n",
+    "        cofe = len(train.groupby(col).count())\n",
+    "        if cofe < 20: #10,15都一样\n",
+    "            feikong = np.sum([train[col] != -999] )\n",
+    "            if feikong < len(train) * 0.1:\n",
+    "                continue\n",
+    "                 \n",
+    "            train_copy = train_copy.join(pd.get_dummies(train[col], prefix=col+'_'))\n",
+    "            index.append(col)\n",
+    "            asd += 1 \n",
+    "    return train_copy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 导入数据\n",
+    "a_train = pd.read_csv('data/A_train.csv')\n",
+    "# a_train = pd.read_csv('data/A_train.csv',nrows =5)\n",
+    "b_train = pd.read_csv('data/B_train.csv')\n",
+    "b_test  =  pd.read_csv('data/B_test.csv')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 用字符串-999代替缺失值\n",
+    "b_train = b_train.fillna(-999)\n",
+    "b_test = b_test.fillna(-999)\n",
+    "flags = b_train['flag']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0\n",
+      "1\n"
+     ]
+    },
+    {
+     "ename": "ValueError",
+     "evalue": "columns overlap but no suffix specified: Index([u'UserInfo_1__2054.47'], dtype='object')",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-26-81e6aa71203c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mb_train\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdumyuserfeature\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mb_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0mb_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdumyuserfeature\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mb_test\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0mb_test\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m<ipython-input-8-f92e64ceb144>\u001b[0m in \u001b[0;36mdumyuserfeature\u001b[0;34m(train)\u001b[0m\n\u001b[1;32m     13\u001b[0m                 \u001b[0;32mcontinue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 15\u001b[0;31m             \u001b[0mtrain_copy\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrain_copy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprefix\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;34m'_'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     16\u001b[0m             \u001b[0mindex\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     17\u001b[0m             \u001b[0masd\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc\u001b[0m in \u001b[0;36mjoin\u001b[0;34m(self, other, on, how, lsuffix, rsuffix, sort)\u001b[0m\n\u001b[1;32m   4667\u001b[0m         \u001b[0;31m# For SparseDataFrame's benefit\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4668\u001b[0m         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,\n\u001b[0;32m-> 4669\u001b[0;31m                                  rsuffix=rsuffix, sort=sort)\n\u001b[0m\u001b[1;32m   4670\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4671\u001b[0m     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc\u001b[0m in \u001b[0;36m_join_compat\u001b[0;34m(self, other, on, how, lsuffix, rsuffix, sort)\u001b[0m\n\u001b[1;32m   4682\u001b[0m             return merge(self, other, left_on=on, how=how,\n\u001b[1;32m   4683\u001b[0m                          \u001b[0mleft_index\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mon\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mright_index\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 4684\u001b[0;31m                          suffixes=(lsuffix, rsuffix), sort=sort)\n\u001b[0m\u001b[1;32m   4685\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4686\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mon\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc\u001b[0m in \u001b[0;36mmerge\u001b[0;34m(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)\u001b[0m\n\u001b[1;32m     52\u001b[0m                          \u001b[0mright_index\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mright_index\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msort\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msort\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msuffixes\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msuffixes\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     53\u001b[0m                          copy=copy, indicator=indicator)\n\u001b[0;32m---> 54\u001b[0;31m     \u001b[0;32mreturn\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     55\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     56\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc\u001b[0m in \u001b[0;36mget_result\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m    573\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    574\u001b[0m         llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,\n\u001b[0;32m--> 575\u001b[0;31m                                                      rdata.items, rsuf)\n\u001b[0m\u001b[1;32m    576\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    577\u001b[0m         \u001b[0mlindexers\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mleft_indexer\u001b[0m\u001b[0;34m}\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mleft_indexer\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mNone\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc\u001b[0m in \u001b[0;36mitems_overlap_with_suffix\u001b[0;34m(left, lsuffix, right, rsuffix)\u001b[0m\n\u001b[1;32m   4699\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mlsuffix\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mrsuffix\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4700\u001b[0m             raise ValueError('columns overlap but no suffix specified: %s' %\n\u001b[0;32m-> 4701\u001b[0;31m                              to_rename)\n\u001b[0m\u001b[1;32m   4702\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4703\u001b[0m         \u001b[0;32mdef\u001b[0m \u001b[0mlrenamer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mValueError\u001b[0m: columns overlap but no suffix specified: Index([u'UserInfo_1__2054.47'], dtype='object')"
+     ]
+    }
+   ],
+   "source": [
+    "b_train = dumyuserfeature(b_train)\n",
+    "b_test = dumyuserfeature(b_test)\n",
+    "b_test"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 导出预处理数据的结果,删除了很多缺失值。\n",
+    "b_train.to_csv('output/B_train_dummy.csv',index=False)\n",
+    "b_test.to_csv('output/B_test_dummy.csv',index=False)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}

+ 448 - 0
src/python/.ipynb_checkpoints/Target_model-checkpoint.ipynb

@@ -0,0 +1,448 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from sklearn.ensemble import GradientBoostingClassifier\n",
+    "from sklearn.model_selection import train_test_split,StratifiedKFold\n",
+    "from sklearn.metrics import auc,roc_auc_score\n",
+    "import xgboost as xgb  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 切换项目目录\n",
+    "os.getcwd()\n",
+    "os.chdir(\"/media/sf_share/linux/haoxin\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "IOError",
+     "evalue": "File output/B_train_dummy.csv does not exist",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mIOError\u001b[0m                                   Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-4-dd8f3729a2bf>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mb_train\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'output/B_train_dummy.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0mb_test\u001b[0m \u001b[0;34m=\u001b[0m  \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'output/B_test_dummy.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36mparser_f\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)\u001b[0m\n\u001b[1;32m    653\u001b[0m                     skip_blank_lines=skip_blank_lines)\n\u001b[1;32m    654\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 655\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    656\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    657\u001b[0m     \u001b[0mparser_f\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m    403\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    404\u001b[0m     \u001b[0;31m# Create the parser.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 405\u001b[0;31m     \u001b[0mparser\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mTextFileReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    406\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    407\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mchunksize\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0miterator\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m    762\u001b[0m             \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'has_index_names'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'has_index_names'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    763\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 764\u001b[0;31m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    765\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    766\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m_make_engine\u001b[0;34m(self, engine)\u001b[0m\n\u001b[1;32m    983\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'c'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    984\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'c'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 985\u001b[0;31m             \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mCParserWrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    986\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    987\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'python'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, src, **kwds)\u001b[0m\n\u001b[1;32m   1603\u001b[0m         \u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'allow_leading_cols'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex_col\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mFalse\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1604\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1605\u001b[0;31m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reader\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mparsers\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTextReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msrc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1606\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1607\u001b[0m         \u001b[0;31m# XXX\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)\u001b[0;34m()\u001b[0m\n",
+      "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8873)\u001b[0;34m()\u001b[0m\n",
+      "\u001b[0;31mIOError\u001b[0m: File output/B_train_dummy.csv does not exist"
+     ]
+    }
+   ],
+   "source": [
+    "b_train = pd.read_csv('output/B_train_dummy.csv')\n",
+    "b_test =  pd.read_csv('output/B_test_dummy.csv') "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# flags = b_train['flag']\n",
+    "# b_train['flag']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "col = [x for x in b_train.columns if x in b_test.columns] \n",
+    "col = [x for x in col if x not in ['no','flag']]  \n",
+    "# col_1 = []\n",
+    "# 可加可不加,效果影响不大,删除缺省值的列\n",
+    "# for i in col:\n",
+    "#     if '999' not in i:\n",
+    "#         col_1.append(i)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_X,test_X,train_Y,test_Y = train_test_split(b_train[col],b_train['flag'],test_size=0.2,random_state  = 0) \n",
+    "watchlist=[(xgb.DMatrix(train_X,label=train_Y),'train'),(xgb.DMatrix(test_X,label=test_Y),'eval')]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dtrain_B = xgb.DMatrix(b_train[col],b_train['flag'])\n",
+    "# 线上效果为0.600018的参数\n",
+    "Trate=0.25 \n",
+    "params = {'booster':'gbtree',\n",
+    "              'eta': 0.05, \n",
+    "              'max_depth': 4,                  \n",
+    "              'max_delta_step': 0,\n",
+    "              'subsample':1,              \n",
+    "              'colsample_bytree': 0.9,      \n",
+    "              'base_score': Trate, \n",
+    "              'objective': 'binary:logistic', \n",
+    "              'lambda':3,\n",
+    "              'alpha':5\n",
+    "              }\n",
+    "params['eval_metric'] = 'auc' \n",
+    "model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=130,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "#0.599155\n",
+    "# Trate=0.15\n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 3,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "# #               'objective': 'binary:logitraw', \n",
+    "#               'objective': 'binary:logistic',\n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc'\n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=130,maximize=True,verbose_eval=True) \n",
+    "# model_phase_1_cla = xgb.train(params,xgb.DMatrix(train_X,label=train_Y),num_boost_round=1000,evals=watchlist,early_stopping_rounds=50,maximize=True,verbose_eval=True)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# 0.594276\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':2,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=150,maximize=True,verbose_eval=True) \n",
+    "# model_phase_1_cla = xgb.train(params,xgb.DMatrix(train_X,label=train_Y),num_boost_round=1000,evals=watchlist,early_stopping_rounds=50,maximize=True,verbose_eval=True)\n",
+    "\n",
+    "# 0.595855\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=150,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.594632\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=200,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "#0.596701\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=138,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.596326\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=120,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.598221\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=150,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.599235\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=138,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "\n",
+    "# 0.598832\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=120,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.600018\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=130,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "#0.595537\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 3,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=150,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.593465\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 3,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=130,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "\n",
+    "#0.594750\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 3,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=200,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# #0.599226\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=135,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "\n",
+    "#0.598256\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=125,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.599600\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=132,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "\n",
+    "# 0.599844\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=132,maximize=True,verbose_eval=True) "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pred = model_phase_1_cla_2.predict(xgb.DMatrix(b_test[col_1]))\n",
+    "result1 = pd.DataFrame()\n",
+    "result1['no'] = b_test['no']\n",
+    "result1['pred'] = pred[:]  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result1.to_csv('subimit_target.csv',index=False)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}

+ 101 - 0
src/python/.ipynb_checkpoints/weight_submit-checkpoint.ipynb

@@ -0,0 +1,101 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from os import chdir, getcwd"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "NameError",
+     "evalue": "name 'os' is not defined",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-16-608806573d92>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# 切换项目目录\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetcwd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mchdir\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"/media/sf_share/linux/haoxin\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mNameError\u001b[0m: name 'os' is not defined"
+     ]
+    }
+   ],
+   "source": [
+    "# 切换项目目录\n",
+    "os.getcwd()\n",
+    "os.chdir(\"/media/sf_share/linux/haoxin\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "IOError",
+     "evalue": "File data/subimit_target.csv does not exist",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mIOError\u001b[0m                                   Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-20-e49e85594227>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0ma1\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'data/subimit_target.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0ma2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'data/transfer_submit.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36mparser_f\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)\u001b[0m\n\u001b[1;32m    653\u001b[0m                     skip_blank_lines=skip_blank_lines)\n\u001b[1;32m    654\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 655\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    656\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    657\u001b[0m     \u001b[0mparser_f\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m    403\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    404\u001b[0m     \u001b[0;31m# Create the parser.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 405\u001b[0;31m     \u001b[0mparser\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mTextFileReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    406\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    407\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mchunksize\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0miterator\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m    762\u001b[0m             \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'has_index_names'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'has_index_names'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    763\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 764\u001b[0;31m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    765\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    766\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m_make_engine\u001b[0;34m(self, engine)\u001b[0m\n\u001b[1;32m    983\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'c'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    984\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'c'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 985\u001b[0;31m             \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mCParserWrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    986\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    987\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'python'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, src, **kwds)\u001b[0m\n\u001b[1;32m   1603\u001b[0m         \u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'allow_leading_cols'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex_col\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mFalse\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1604\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1605\u001b[0;31m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reader\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mparsers\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTextReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msrc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1606\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1607\u001b[0m         \u001b[0;31m# XXX\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)\u001b[0;34m()\u001b[0m\n",
+      "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8873)\u001b[0;34m()\u001b[0m\n",
+      "\u001b[0;31mIOError\u001b[0m: File data/subimit_target.csv does not exist"
+     ]
+    }
+   ],
+   "source": [
+    "a1 = pd.read_csv('data/subimit_target.csv')\n",
+    "a2 = pd.read_csv('data/transfer_submit.csv')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a1['pred'] = a1['pred'] * 0.85+ a2['pred'] * 0.15\n",
+    "a1.to_csv('submit_online.csv',index=False)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}

+ 178 - 0
src/python/Feature_engineering.ipynb

@@ -0,0 +1,178 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 特征工程"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from sklearn.ensemble import GradientBoostingClassifier\n",
+    "from sklearn.model_selection import train_test_split,StratifiedKFold\n",
+    "from sklearn.metrics import auc,roc_auc_score\n",
+    "import xgboost as xgb  \n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 切换项目目录\n",
+    "os.getcwd()\n",
+    "os.chdir(\"/media/sf_share/linux/haoxin\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 把特征个数少的列全部丢弃,使用和不适用效果影响不大,建议使用,省时间内存\n",
+    "def null_ratio(df):\n",
+    "    features=df.columns\n",
+    "    feature_selected=[]\n",
+    "    drop_index=[]\n",
+    "    sz=df.size\n",
+    "    for feat in features:\n",
+    "        sz_null=df[df[feat].isnull()].size\n",
+    "        ratio=float(sz_null)/sz\n",
+    "        if ratio > 0.9:\n",
+    "            feature_selected.append((feat,ratio))\n",
+    "            drop_index.append(feat) \n",
+    "    return feature_selected,drop_index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def dumyuserfeature(train):\n",
+    "    pre_feat = 'UserInfo_'\n",
+    "    # 检查离散变量和连续变量\n",
+    "    asd = 0\n",
+    "    index=[]\n",
+    "    train_copy = train.copy()\n",
+    "    for i ,col in enumerate(train.columns):\n",
+    "        print(i)\n",
+    "        cofe = len(train.groupby(col).count())\n",
+    "        if cofe < 20: #10,15都一样\n",
+    "            feikong = np.sum([train[col] != -999] )\n",
+    "            if feikong < len(train) * 0.1:\n",
+    "                continue\n",
+    "                 \n",
+    "            train_copy = train_copy.join(pd.get_dummies(train[col], prefix=col+'_'))\n",
+    "            index.append(col)\n",
+    "            asd += 1 \n",
+    "    return train_copy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 导入数据\n",
+    "a_train = pd.read_csv('data/A_train.csv')\n",
+    "# a_train = pd.read_csv('data/A_train.csv',nrows =5)\n",
+    "b_train = pd.read_csv('data/B_train.csv')\n",
+    "b_test  =  pd.read_csv('data/B_test.csv')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 用字符串-999代替缺失值\n",
+    "b_train = b_train.fillna(-999)\n",
+    "b_test = b_test.fillna(-999)\n",
+    "flags = b_train['flag']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0\n",
+      "1\n"
+     ]
+    },
+    {
+     "ename": "ValueError",
+     "evalue": "columns overlap but no suffix specified: Index([u'UserInfo_1__2054.47'], dtype='object')",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-26-81e6aa71203c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mb_train\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdumyuserfeature\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mb_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0mb_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdumyuserfeature\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mb_test\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0mb_test\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m<ipython-input-8-f92e64ceb144>\u001b[0m in \u001b[0;36mdumyuserfeature\u001b[0;34m(train)\u001b[0m\n\u001b[1;32m     13\u001b[0m                 \u001b[0;32mcontinue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 15\u001b[0;31m             \u001b[0mtrain_copy\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrain_copy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprefix\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;34m'_'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     16\u001b[0m             \u001b[0mindex\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     17\u001b[0m             \u001b[0masd\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc\u001b[0m in \u001b[0;36mjoin\u001b[0;34m(self, other, on, how, lsuffix, rsuffix, sort)\u001b[0m\n\u001b[1;32m   4667\u001b[0m         \u001b[0;31m# For SparseDataFrame's benefit\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4668\u001b[0m         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,\n\u001b[0;32m-> 4669\u001b[0;31m                                  rsuffix=rsuffix, sort=sort)\n\u001b[0m\u001b[1;32m   4670\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4671\u001b[0m     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc\u001b[0m in \u001b[0;36m_join_compat\u001b[0;34m(self, other, on, how, lsuffix, rsuffix, sort)\u001b[0m\n\u001b[1;32m   4682\u001b[0m             return merge(self, other, left_on=on, how=how,\n\u001b[1;32m   4683\u001b[0m                          \u001b[0mleft_index\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mon\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mright_index\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 4684\u001b[0;31m                          suffixes=(lsuffix, rsuffix), sort=sort)\n\u001b[0m\u001b[1;32m   4685\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4686\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mon\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc\u001b[0m in \u001b[0;36mmerge\u001b[0;34m(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)\u001b[0m\n\u001b[1;32m     52\u001b[0m                          \u001b[0mright_index\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mright_index\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msort\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msort\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msuffixes\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msuffixes\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     53\u001b[0m                          copy=copy, indicator=indicator)\n\u001b[0;32m---> 54\u001b[0;31m     \u001b[0;32mreturn\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     55\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     56\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc\u001b[0m in \u001b[0;36mget_result\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m    573\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    574\u001b[0m         llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,\n\u001b[0;32m--> 575\u001b[0;31m                                                      rdata.items, rsuf)\n\u001b[0m\u001b[1;32m    576\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    577\u001b[0m         \u001b[0mlindexers\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mleft_indexer\u001b[0m\u001b[0;34m}\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mleft_indexer\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mNone\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc\u001b[0m in \u001b[0;36mitems_overlap_with_suffix\u001b[0;34m(left, lsuffix, right, rsuffix)\u001b[0m\n\u001b[1;32m   4699\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mlsuffix\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mrsuffix\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4700\u001b[0m             raise ValueError('columns overlap but no suffix specified: %s' %\n\u001b[0;32m-> 4701\u001b[0;31m                              to_rename)\n\u001b[0m\u001b[1;32m   4702\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4703\u001b[0m         \u001b[0;32mdef\u001b[0m \u001b[0mlrenamer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mValueError\u001b[0m: columns overlap but no suffix specified: Index([u'UserInfo_1__2054.47'], dtype='object')"
+     ]
+    }
+   ],
+   "source": [
+    "b_train = dumyuserfeature(b_train)\n",
+    "b_test = dumyuserfeature(b_test)\n",
+    "b_test"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 导出预处理数据的结果,删除了很多缺失值。\n",
+    "b_train.to_csv('output/B_train_dummy.csv',index=False)\n",
+    "b_test.to_csv('output/B_test_dummy.csv',index=False)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}

+ 420 - 0
src/python/Target_model.ipynb

@@ -0,0 +1,420 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from sklearn.ensemble import GradientBoostingClassifier\n",
+    "from sklearn.model_selection import train_test_split,StratifiedKFold\n",
+    "from sklearn.metrics import auc,roc_auc_score\n",
+    "import xgboost as xgb  \n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 切换项目目录\n",
+    "os.getcwd()\n",
+    "os.chdir(\"/media/sf_share/linux/haoxin\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "b_train = pd.read_csv('output/B_train_dummy.csv')\n",
+    "b_test =  pd.read_csv('output/B_test_dummy.csv') "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "col = [x for x in b_train.columns if x in b_test.columns] \n",
+    "col = [x for x in col if x not in ['no','flag']]  \n",
+    "# col_1 = []\n",
+    "# 可加可不加,效果影响不大,删除缺省值的列\n",
+    "# for i in col:\n",
+    "#     if '999' not in i:\n",
+    "#         col_1.append(i)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_X,test_X,train_Y,test_Y = train_test_split(b_train[col],b_train['flag'],test_size=0.2,random_state  = 0) \n",
+    "watchlist=[(xgb.DMatrix(train_X,label=train_Y),'train'),(xgb.DMatrix(test_X,label=test_Y),'eval')]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dtrain_B = xgb.DMatrix(b_train[col],b_train['flag'])\n",
+    "# 线上效果为0.600018的参数\n",
+    "Trate=0.25 \n",
+    "params = {'booster':'gbtree',\n",
+    "              'eta': 0.05, \n",
+    "              'max_depth': 4,                  \n",
+    "              'max_delta_step': 0,\n",
+    "              'subsample':1,              \n",
+    "              'colsample_bytree': 0.9,      \n",
+    "              'base_score': Trate, \n",
+    "              'objective': 'binary:logistic', \n",
+    "              'lambda':3,\n",
+    "              'alpha':5\n",
+    "              }\n",
+    "params['eval_metric'] = 'auc' \n",
+    "model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=130,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "#0.599155\n",
+    "# Trate=0.15\n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 3,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "# #               'objective': 'binary:logitraw', \n",
+    "#               'objective': 'binary:logistic',\n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc'\n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=130,maximize=True,verbose_eval=True) \n",
+    "# model_phase_1_cla = xgb.train(params,xgb.DMatrix(train_X,label=train_Y),num_boost_round=1000,evals=watchlist,early_stopping_rounds=50,maximize=True,verbose_eval=True)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# 0.594276\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':2,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=150,maximize=True,verbose_eval=True) \n",
+    "# model_phase_1_cla = xgb.train(params,xgb.DMatrix(train_X,label=train_Y),num_boost_round=1000,evals=watchlist,early_stopping_rounds=50,maximize=True,verbose_eval=True)\n",
+    "\n",
+    "# 0.595855\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=150,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.594632\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=200,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "#0.596701\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=138,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.596326\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 5,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=120,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.598221\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=150,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.599235\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=138,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "\n",
+    "# 0.598832\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=120,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.600018\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=130,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "#0.595537\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 3,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=150,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.593465\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 3,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=130,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "\n",
+    "#0.594750\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 3,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=200,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# #0.599226\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=135,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "\n",
+    "#0.598256\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=125,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "# 0.599600\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=132,maximize=True,verbose_eval=True) \n",
+    "\n",
+    "\n",
+    "# 0.599844\n",
+    "# Trate=0.25 \n",
+    "# params = {'booster':'gbtree',\n",
+    "#               'eta': 0.05, \n",
+    "#               'max_depth': 4,                  \n",
+    "#               'max_delta_step': 0,\n",
+    "#               'subsample':1,              \n",
+    "#               'colsample_bytree': 0.9,      \n",
+    "#               'base_score': Trate, \n",
+    "#               'objective': 'binary:logistic', \n",
+    "#               'lambda':3,\n",
+    "#               'alpha':5\n",
+    "#               }\n",
+    "# params['eval_metric'] = 'auc' \n",
+    "# model_phase_1_cla_2 = xgb.train(params,dtrain_B,num_boost_round=132,maximize=True,verbose_eval=True) "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pred = model_phase_1_cla_2.predict(xgb.DMatrix(b_test[col_1]))\n",
+    "result1 = pd.DataFrame()\n",
+    "result1['no'] = b_test['no']\n",
+    "result1['pred'] = pred[:]  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result1.to_csv('subimit_target.csv',index=False)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}

+ 343 - 0
src/python/Transfer_model.ipynb

@@ -0,0 +1,343 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from sklearn.ensemble import GradientBoostingClassifier\n",
+    "from sklearn.metrics import auc,roc_auc_score\n",
+    "import xgboost as xgb\n",
+    "from sklearn.cross_validation import train_test_split "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true
+   },
+   "source": [
+    "此处的特征工程和之前的一样,加了一个非空元素的统计,其他一样"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "a_train = pd.read_csv('A_train.csv')\n",
+    "b_train = pd.read_csv('B_train.csv')\n",
+    "b_test =  pd.read_csv('B_test.csv')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a_labels = a_train['flag']\n",
+    "b_labels = b_train['flag']\n",
+    "\n",
+    "a_train.drop('no',axis=1,inplace=True) \n",
+    "b_train.drop('no',axis=1,inplace=True) \n",
+    "a_train.drop('flag',axis=1,inplace=True) \n",
+    "b_train.drop('flag',axis=1,inplace=True) \n",
+    "\n",
+    "submit = pd.DataFrame(b_test['no'])\n",
+    "b_test.drop('no',axis=1,inplace=True) "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "a_train['num_NAN'] = a_train.isnull().sum(axis=1)\n",
+    "b_train['num_NAN'] = b_train.isnull().sum(axis=1) \n",
+    "b_test['num_NAN'] = b_train.isnull().sum(axis=1) "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a_train = a_train.fillna(-999)\n",
+    "b_train = b_train.fillna(-999)\n",
+    "b_test = b_test.fillna(-999)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def dumyuserfeature(train):\n",
+    "    pre_feat = 'UserInfo_'\n",
+    "    # 检查离散变量和连续变量\n",
+    "    asd = 0\n",
+    "    index=[]\n",
+    "    train_copy = train.copy()\n",
+    "    for i ,col in enumerate(train.columns):\n",
+    "        print(i)\n",
+    "        cofe = len(train.groupby(col).count())\n",
+    "        if cofe < 20:\n",
+    "            feikong = np.sum([train[col] != -999] )\n",
+    "            if feikong < len(train) * 0.1:\n",
+    "                continue \n",
+    "            train_copy = train_copy.join(pd.get_dummies(train[col], prefix=col+'_'))\n",
+    "            index.append(col)\n",
+    "            asd += 1\n",
+    "    print(asd,'个离散化的特征')\n",
+    "    return train_copy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "a_train_dummy = dumyuserfeature(a_train)\n",
+    "b_train_dummy = dumyuserfeature(b_train)\n",
+    "b_test_dummy = dumyuserfeature(b_test)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "col = [x for x in b_train_dummy.columns if x in b_test_dummy.columns]  \n",
+    "col = [x for x in col if x in a_train_dummy.columns] \n",
+    "col = [x for x in col if x not in ['no','flag']]  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "a_train_dummy_final = a_train_dummy[col]\n",
+    "b_train_dummy_final = b_train_dummy[col]\n",
+    "b_test_dummy_final = b_test_dummy[col]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# a_train_dummy_final = pd.read_csv('a_train_dummy_final.csv')\n",
+    "# b_train_dummy_final = pd.read_csv('b_train_dummy_final.csv')\n",
+    "# b_test_dummy_final = pd.read_csv('b_test_dummy_final.csv')\n",
+    "# b_train_dummy_final.drop('Unnamed: 0',axis=1,inplace=True)\n",
+    "# b_test_dummy_final.drop('Unnamed: 0',axis=1,inplace=True)\n",
+    "# a_train_dummy_final.drop('Unnamed: 0',axis=1,inplace=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "watchlist=[(xgb.DMatrix(a_train_dummy_final,label=a_labels),'train'),(xgb.DMatrix(b_train_dummy_final,label=b_labels),'eval')]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# 训练获得迁移使用的源模型\n",
+    "Trate=0.15 \n",
+    "params = {'booster':'gbtree',\n",
+    "              'eta': 0.5, \n",
+    "              'max_depth': 5,                  \n",
+    "              'max_delta_step': 0,\n",
+    "              'subsample':0.8,              \n",
+    "              'colsample_bytree': 0.8,      \n",
+    "              'base_score': Trate, \n",
+    "              'objective': 'binary:logistic', \n",
+    "              'lambda':3,\n",
+    "              'alpha':8\n",
+    "              }\n",
+    "params['eval_metric'] = 'auc'  \n",
+    "\n",
+    "# model_1 = xgb.train(params,xgb.DMatrix(b_train_dummy[col],b_train['flag']),num_boost_round=150,maximize=True,verbose_eval=True)\n",
+    "model_phase_1 = xgb.train(params,xgb.DMatrix(a_train_dummy_final,label=a_labels),num_boost_round=1000,evals=watchlist,early_stopping_rounds=100,maximize=True,verbose_eval=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# # 存储模型\n",
+    "# import pickle\n",
+    "# from sklearn.externals import joblib\n",
+    "# joblib.dump(model_phase_1, 'model_transfer.pkl')\n",
+    "# model_phase_1 = joblib.load('model_transfer.pkl')  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# train_X,test_X,train_Y,test_Y = train_test_split(b_train_dummy_final,b_labels,test_size=0.2,random_state  = 2) \n",
+    "# watchlist=[(xgb.DMatrix(train_X,label=train_Y),'train'),(xgb.DMatrix(test_X,label=test_Y),'eval')] "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 在源数据上进行微调,这边没有时间就没有细调,线上的结果是单模型随机试了一个(迭代20次)的结果\n",
+    "Trate=0.2 \n",
+    "params = {'booster':'gbtree',\n",
+    "              'eta': 0.05, \n",
+    "              'max_depth': 4,                  \n",
+    "              'max_delta_step': 0,\n",
+    "              'subsample':1,              \n",
+    "              'colsample_bytree': 0.9,      \n",
+    "              'base_score': Trate, \n",
+    "              'objective': 'binary:logistic', \n",
+    "              'lambda':3,\n",
+    "              'alpha':5\n",
+    "              }\n",
+    "params['eval_metric'] = 'auc' \n",
+    "model_phase_1_cla_1 = xgb.train(params,xgb.DMatrix(b_train_dummy_final,b_labels),num_boost_round=25,xgb_model =model_phase_1,maximize=True,verbose_eval=True)\n",
+    "\n",
+    "\n",
+    "Trate=0.2 \n",
+    "params = {'booster':'gbtree',\n",
+    "              'eta': 0.05, \n",
+    "              'max_depth': 5,                  \n",
+    "              'max_delta_step': 0,\n",
+    "              'subsample':0.85,              \n",
+    "              'colsample_bytree': 0.9,      \n",
+    "              'base_score': Trate, \n",
+    "              'objective': 'binary:logistic', \n",
+    "              'lambda':3,\n",
+    "              'alpha':5\n",
+    "              }\n",
+    "params['eval_metric'] = 'auc' \n",
+    "model_phase_1_cla_2 = xgb.train(params,xgb.DMatrix(b_train_dummy_final,b_labels),num_boost_round=40,xgb_model =model_phase_1,maximize=True,verbose_eval=True)\n",
+    "\n",
+    "\n",
+    "Trate=0.2 \n",
+    "params = {'booster':'gbtree',\n",
+    "              'eta': 0.05, \n",
+    "              'max_depth': 4,                  \n",
+    "              'max_delta_step': 0,\n",
+    "              'subsample':1,              \n",
+    "              'colsample_bytree': 0.9,      \n",
+    "              'base_score': Trate, \n",
+    "              'objective': 'binary:logistic', \n",
+    "              'lambda':3,\n",
+    "              'alpha':5\n",
+    "              }\n",
+    "params['eval_metric'] = 'auc' \n",
+    "model_phase_1_cla_3 = xgb.train(params,xgb.DMatrix(b_train_dummy_final,b_labels),num_boost_round=28,xgb_model =model_phase_1,maximize=True,verbose_eval=True)\n",
+    "\n",
+    "Trate=0.25 \n",
+    "params = {'booster':'gbtree',\n",
+    "              'eta': 0.05, \n",
+    "              'max_depth': 5,                  \n",
+    "              'max_delta_step': 0,\n",
+    "              'subsample':1,              \n",
+    "              'colsample_bytree': 0.9,      \n",
+    "              'base_score': Trate, \n",
+    "              'objective': 'binary:logistic', \n",
+    "              'lambda':3,\n",
+    "              'alpha':6\n",
+    "              }\n",
+    "params['eval_metric'] = 'auc' \n",
+    "model_phase_1_cla_4 = xgb.train(params,xgb.DMatrix(b_train_dummy_final,b_labels),num_boost_round=30,xgb_model =model_phase_1,maximize=True,verbose_eval=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ensemble取均值,这边随机选了四个\n",
+    "pred = model_phase_1_cla_1.predict(xgb.DMatrix(b_test_dummy_final))\n",
+    "pred1 = model_phase_1_cla_2.predict(xgb.DMatrix(b_test_dummy_final))\n",
+    "pred2 = model_phase_1_cla_3.predict(xgb.DMatrix(b_test_dummy_final))\n",
+    "pred3 = model_phase_1_cla_4.predict(xgb.DMatrix(b_test_dummy_final))\n",
+    "submit['pred'] =(pred+pred1+pred2+pred3)/4 "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "submit.to_csv('transfer_submit.csv',index=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}

+ 116 - 0
src/python/main.py

@@ -0,0 +1,116 @@
+#coding=utf-8
+'''
+Created on 2017-06-19
+@version: python 3.6
+@author: liuyuqi
+'''
+from os import chdir, getcwd
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+import seaborn as sns
+from xgboost.sklearn import XGBClassifier as XGBC
+from sklearn.metrics import roc_auc_score as AUC
+
+
+# Load the data
+getcwd()
+chdir(r"D:\liuyuqi\数学建模项目\kseci比赛\好信杯")
+# mkdir("qhzx")
+data_a = pd.read_csv(r'data/A_train.csv')
+data_b = pd.read_csv(r'data/B_train.csv')
+test = pd.read_csv(r'data/B_test.csv')
+
+# train_data visualization
+# Count missing values per column (feature); features with too many missing values get discarded.
+fea = np.sum(data_a.isnull(),axis=0)
+feb = np.sum(data_b.isnull(),axis=0)
+plt.subplot(211).plot(fea.values)
+plt.subplot(212).plot(feb.values)
+# Visualize the missingness; hard to read anything from it, the pattern is noisy.
+plt.show()
+
+# sort_values
+# Sort ascending so the least-missing features come first
+plt.subplot(211).plot(np.sort(fea))
+plt.subplot(212).plot(np.sort(feb))
+# The plot shows the first ~330 features have stable, low missingness; the rest lose too much data and are dropped. Keep only these 330 features.
+plt.show()
+
+# test_data visualization
+fet = np.sum(test.isnull(),axis=0)
+plt.plot(fet.values)
+plt.plot(np.sort(fet))
+plt.show()
+
+
+# keep the features whose null count matches that of the 330th column (the low-missingness group)
+a = fea[fea==fea[330]]
+b = feb[feb==feb[330]]
+t = fet[fet==fet[330]]
+
+# save common features in a,b,t
+# intersection of a and b: a \ (a \ b) equals a ∩ b
+c = a.index.difference(a.index.difference(b.index))
+# intersection of c and t, i.e. the features common to a, b and t
+c = c.difference(c.difference(t.index))
+c = pd.DataFrame(c)
+c = c.append(['ProductInfo_89'])
+c = c[c[0]!='no']
+c.to_csv(r'../work/qhzx/index.csv',index=False)
+
+# train on these features, then keep those with importance > 0
+clf1 = XGBC(max_depth=8,seed=999,n_estimators=100)
+train = data_b.sample(3000,random_state=999)
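+# hold out the B rows not sampled into train as a quick validation set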
+v = data_b[~data_b['no'].isin(train['no'])]
+clf1.fit(train[train.columns.difference(['flag','no'])],train['flag'])
+p = clf1.predict_proba(v[v.columns.difference(['flag','no'])])[:,1]
+print(AUC(v['flag'],p))
+# print clf1.feature_importances_
+a  = pd.Series(clf1.feature_importances_)
+ind = v.columns.difference(['flag','no'])
+a.index=ind
+select_a = a[a>0]
+
+# log smoothing
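+# log then int-cast buckets the heavy-tailed ProductInfo_89 into coarse bins; the -1 fill value maps to int(log(0.1)) = -2, its own bin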
+data_a['ProductInfo_89'] = (np.log(data_a['ProductInfo_89'].fillna(-1)+1.1)).astype(np.int)
+data_b['ProductInfo_89'] = (np.log(data_b['ProductInfo_89'].fillna(-1)+1.1)).astype(np.int)
+test['ProductInfo_89'] = (np.log(test['ProductInfo_89'].fillna(-1)+1.1)).astype(np.int)
+
+
+# features grouped by name
+cross = select_a[select_a>0]
+cross.shape
+cu=[];cp=[];cw=[]
+for i in cross.index:
+    if(i[0]=='P'):
+        cp.append(i)
+    if(i[0]=='U'):
+        cu.append(i)
+    if(i[0]=='W'):
+        cw.append(i)
+# print len(cu),len(cp),len(cw)
+
+# cross-features using web_info and product_info
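+# each ('W' feature, 'P' feature) pair becomes one multiplicative interaction column, adding len(cw)*len(cp) new features to each frame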
+for i in cw:
+    for j in cp:
+        data_a[i+j] = data_a[i]*data_a[j]
+        data_b[i+j] = data_b[i]*data_b[j]
+        test[i+j] = test[i]*test[j]
+
+
+# save the new data sets
+print(data_a.shape,data_b.shape,test.shape)
+data_a.to_csv(r'../work/qhzx/A_train.csv',index=False)
+data_b.to_csv(r'../work/qhzx/B_train_new.csv',index=False)
+test.to_csv(r'../work/qhzx/B_test_new.csv',index=False)
+
+
+
+
+

+ 101 - 0
src/python/weight_submit.ipynb

@@ -0,0 +1,101 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from os import chdir, getcwd"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "NameError",
+     "evalue": "name 'os' is not defined",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-2-608806573d92>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# 切换项目目录\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetcwd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mchdir\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"/media/sf_share/linux/haoxin\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mNameError\u001b[0m: name 'os' is not defined"
+     ]
+    }
+   ],
+   "source": [
+    "# 切换项目目录\n",
+    "os.getcwd()\n",
+    "os.chdir(\"/media/sf_share/linux/haoxin\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "IOError",
+     "evalue": "File data/subimit_target.csv does not exist",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mIOError\u001b[0m                                   Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-20-e49e85594227>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0ma1\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'data/subimit_target.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0ma2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'data/transfer_submit.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36mparser_f\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)\u001b[0m\n\u001b[1;32m    653\u001b[0m                     skip_blank_lines=skip_blank_lines)\n\u001b[1;32m    654\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 655\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    656\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    657\u001b[0m     \u001b[0mparser_f\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m    403\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    404\u001b[0m     \u001b[0;31m# Create the parser.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 405\u001b[0;31m     \u001b[0mparser\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mTextFileReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    406\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    407\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mchunksize\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0miterator\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m    762\u001b[0m             \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'has_index_names'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'has_index_names'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    763\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 764\u001b[0;31m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    765\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    766\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m_make_engine\u001b[0;34m(self, engine)\u001b[0m\n\u001b[1;32m    983\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'c'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    984\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'c'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 985\u001b[0;31m             \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mCParserWrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    986\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    987\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'python'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, src, **kwds)\u001b[0m\n\u001b[1;32m   1603\u001b[0m         \u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'allow_leading_cols'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex_col\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mFalse\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1604\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1605\u001b[0;31m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reader\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mparsers\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTextReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msrc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1606\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1607\u001b[0m         \u001b[0;31m# XXX\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)\u001b[0;34m()\u001b[0m\n",
+      "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8873)\u001b[0;34m()\u001b[0m\n",
+      "\u001b[0;31mIOError\u001b[0m: File data/subimit_target.csv does not exist"
+     ]
+    }
+   ],
+   "source": [
+    "a1 = pd.read_csv('data/subimit_target.csv')\n",
+    "a2 = pd.read_csv('data/transfer_submit.csv')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a1['pred'] = a1['pred'] * 0.85+ a2['pred'] * 0.15\n",
+    "a1.to_csv('submit_online.csv',index=False)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}

Some files were not shown because too many files changed in this diff