kaggle入门第一题-Titanic

题目

题目链接

训练和测试数据

认识数据

导入训练集和训练集,使用SparkSQL探索数据集,发现数据的大致分布和规律
1
2
3
4
5
6
7
8
9
10
11
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.appName("example").getOrCreate()

val df = sparkSession.read.option("header", "true").option("inferSchema", "true").csv("/YourPath/.../train.csv")

df.printSchema()

df.describe("Survived","Pclass","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked").show(false)

df.createGlobalTempView("people")

1-pic

1
2
// 舱位对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Pclass FROM global_temp.people group by Survived,Pclass order by Pclass) select Pclass,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by Pclass").show()

1
2
// 年龄对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Floor(Mod(Age,10)) age_rank FROM global_temp.people group by Survived,age_rank order by age_rank) select age_rank,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by age_rank").show()
1
2
// 性别对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Sex FROM global_temp.people group by Survived,Sex order by Sex) select Sex,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by Sex").show()

2-pic

1
2
// 兄弟姐妹/配偶对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,SibSp FROM global_temp.people group by Survived,SibSp order by SibSp) select SibSp,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by SibSp").show()
1
2
// 父母/孩子的数量对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Parch FROM global_temp.people group by Survived,Parch order by Parch) select Parch,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by Parch").show()
1
2
// 登船港口对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,embarked FROM global_temp.people group by Survived,embarked order by embarked) select embarked,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by embarked").show()
1
2
// 年龄对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,Floor(Mod(Fare,10)) fare_rank FROM global_temp.people group by Survived,fare_rank order by fare_rank) select fare_rank,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by fare_rank").show()
1
2
// 姓名对存活率的影响
spark.sql("WITH t AS(SELECT count(*) as c,Survived,regexp_extract(Name, '(,)([^\\.]+)', 2) name_fix FROM global_temp.people group by Survived,name_fix order by name_fix) select name_fix,sum(case Survived when 1 then c else 0 end) survive_num, sum(c) total_num, round((sum(case Survived when 1 then c else 0 end)/sum(c)),4) survive_ratio from t group by name_fix").show()

特征工程

根据之前对数据的探索,我们可以发现性别、年龄、家庭人数对存活率的影响,我们将其他特征做以下处理:
特征 描述 处理方式
survival 生存 当作label
pclass 票类别 onehot
Name 姓名 第一次不使用用,第二次用中间名做onehot
sex 性别 onehot
Age 年龄 用均值补全
sibsp 兄弟姐妹/配偶 直接使用
parch 父母/孩子的数量 直接使用
ticket 票号 不使用
fare 乘客票价 直接使用
cabin 客舱号码 不使用
embarked 登船港口 onehot

模型训练

由于特征较少,我们可以使用随机森林用作训练模型,代码可参考:github

提交预测数据

第一次没有使用姓名这个特征,第二使用以后排名提升了不少:3-pic