kaggle入门第一题-Titanic

Posted on 2018-08-03

题目

训练和测试数据

认识数据

导入训练集和训练集，使用SparkSQL探索数据集，发现数据的大致分布和规律

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.appName("example").getOrCreate()

val df = sparkSession.read.option("header", "true").option("inferSchema", "true").csv("/YourPath/.../train.csv")

df.printSchema()

df.describe("Survived","Pclass","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked").show(false)

df.createGlobalTempView("people")

1
2

// 舱位对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Pclass FROM global_temp.people group by Survived,Pclass order by Pclass) select Pclass,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by Pclass").show()

1
2

// 年龄对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Floor(Mod(Age,10)) age_rank FROM global_temp.people group by Survived,age_rank order by age_rank) select age_rank,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by age_rank").show()

1
2

// 性别对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Sex FROM global_temp.people group by Survived,Sex order by Sex) select Sex,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by Sex").show()

1
2

// 兄弟姐妹/配偶对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,SibSp FROM global_temp.people group by Survived,SibSp order by SibSp) select SibSp,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by SibSp").show()

1
2

// 父母/孩子的数量对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Parch FROM global_temp.people group by Survived,Parch order by Parch) select Parch,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by Parch").show()

1
2

// 登船港口对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,embarked FROM global_temp.people group by Survived,embarked order by embarked) select embarked,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by embarked").show()

1
2

// 年龄对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Floor(Mod(Fare,10)) fare_rank FROM global_temp.people group by Survived,fare_rank order by fare_rank) select fare_rank,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by fare_rank").show()

1
2

// 姓名对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,regexp_extract(Name, '(,)([^\\.]+)', 2) name_fix FROM global_temp.people group by Survived,name_fix order by name_fix) select name_fix,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by name_fix").show()

特征工程

根据之前对数据的探索，我们可以发现性别、年龄、家庭人数对存活率的影响，我们将其他特征做以下处理：

特征	描述	处理方式
survival	生存	当作label
pclass	票类别	onehot
Name	姓名	第一次不使用用，第二次用中间名做onehot
sex	性别	onehot
Age	年龄	用均值补全
sibsp	兄弟姐妹/配偶	直接使用
parch	父母/孩子的数量	直接使用
ticket	票号	不使用
fare	乘客票价	直接使用
cabin	客舱号码	不使用
embarked	登船港口	onehot

模型训练

由于特征较少，我们可以使用随机森林用作训练模型，代码可参考：github

提交预测数据

第一次没有使用姓名这个特征，第二使用以后排名提升了不少：