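A dataset for trying things out: Titanic
Finding data to analyze is hard
When I want to try something out, finding a suitable dataset is surprisingly hard.
Switching between several datasets depending on the purpose is also tedious.
I want a single dataset that is general-purpose enough to reuse.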
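iris?
The famous iris dataset.
- I am not interested in plants, so the data itself does not interest me. Still, I have used it many times, so it is familiar.
- It has only 150 rows, which is small.
- Species has three classes, so it cannot be used for binary classification as is.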
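To make the last point concrete, here is a minimal sketch of two common workarounds for getting a two-class problem out of iris. The object names `iris_ovr` and `iris_two` and the column `is_setosa` are made up for illustration. Either way the small row count remains an issue.

```r
# Option 1: one-vs-rest label (setosa vs. everything else); keeps all 150 rows
iris_ovr <- iris
iris_ovr$is_setosa <- factor(iris_ovr$Species == "setosa",
                             levels = c(FALSE, TRUE),
                             labels = c("other", "setosa"))

# Option 2: keep only two of the three species; leaves 100 rows
iris_two <- droplevels(subset(iris, Species != "setosa"))

table(iris_ovr$is_setosa)
table(iris_two$Species)
```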
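A candidate dataset: Titanic
The titanic dataset provided on Kaggle.
It has plenty of rows and a good variety of variables. I have also seen the movie, so it feels familiar.
It is close to what I am looking for.
There is just one problem: I could not find the dataset in R.
The Titanic object in the datasets package has been aggregated into a cross-tabulation; it is not the raw data.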
Titanic
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
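As an aside, the cross-tabulation can be expanded back into one row per person. The sketch below uses only base R; the names `titanic_counts` and `titanic_rows` are made up for illustration. The result still has only the four categorical variables of the table (Class, Sex, Age, Survived), so it is no substitute for the Kaggle raw data.

```r
# Turn the 4-way contingency table into a long data frame:
# one row per Class/Sex/Age/Survived combination, with the count in Freq
titanic_counts <- as.data.frame(Titanic)

# Repeat each combination Freq times to get one row per person
idx <- rep(seq_len(nrow(titanic_counts)), times = titanic_counts$Freq)
titanic_rows <- titanic_counts[idx, c("Class", "Sex", "Age", "Survived")]
rownames(titanic_rows) <- NULL

nrow(titanic_rows)            # equals sum(Titanic)
table(titanic_rows$Survived)
```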
The Kaggle Titanic dataset
Someone has already made it available as a package on GitHub: https://github.com/paulhendricks/titanic
Installing the package
Install the package from GitHub.
### Install the package (uncomment and run once)
## library(devtools)
## install_github("paulhendricks/titanic")
library(titanic)
Data contents: titanic_train
library(knitr)
kable(head(titanic_train))
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 |  | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 |  | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 |  | S |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 |  | Q |
library(Hmisc)
describe(titanic_train)
## titanic_train
##
## 12 Variables 891 Observations
## ---------------------------------------------------------------------------
## PassengerId
## n missing unique Info Mean .05 .10 .25 .50
## 891 0 891 1 446 45.5 90.0 223.5 446.0
## .75 .90 .95
## 668.5 802.0 846.5
##
## lowest : 1 2 3 4 5, highest: 887 888 889 890 891
## ---------------------------------------------------------------------------
## Survived
## n missing unique Info Sum Mean
## 891 0 2 0.71 342 0.3838
## ---------------------------------------------------------------------------
## Pclass
## n missing unique Info Mean
## 891 0 3 0.81 2.309
##
## 1 (216, 24%), 2 (184, 21%), 3 (491, 55%)
## ---------------------------------------------------------------------------
## Name
## n missing unique
## 891 0 891
##
## lowest : Abbing, Mr. Anthony Abbott, Mr. Rossmore Edward Abbott, Mrs. Stanton (Rosa Hunt) Abelson, Mr. Samuel Abelson, Mrs. Samuel (Hannah Wizosky)
## highest: de Mulder, Mr. Theodore de Pelsmaeker, Mr. Alfons del Carlo, Mr. Sebastiano van Billiard, Mr. Austin Blyler van Melkebeke, Mr. Philemon
## ---------------------------------------------------------------------------
## Sex
## n missing unique
## 891 0 2
##
## female (314, 35%), male (577, 65%)
## ---------------------------------------------------------------------------
## Age
## n missing unique Info Mean .05 .10 .25 .50
## 714 177 88 1 29.7 4.00 14.00 20.12 28.00
## .75 .90 .95
## 38.00 50.00 56.00
##
## lowest : 0.42 0.67 0.75 0.83 0.92
## highest: 70.00 70.50 71.00 74.00 80.00
## ---------------------------------------------------------------------------
## SibSp
## n missing unique Info Mean
## 891 0 7 0.67 0.523
##
## 0 1 2 3 4 5 8
## Frequency 608 209 28 16 18 5 7
## % 68 23 3 2 2 1 1
## ---------------------------------------------------------------------------
## Parch
## n missing unique Info Mean
## 891 0 7 0.56 0.3816
##
## 0 1 2 3 4 5 6
## Frequency 678 118 80 5 4 5 1
## % 76 13 9 1 0 1 0
## ---------------------------------------------------------------------------
## Ticket
## n missing unique
## 891 0 681
##
## lowest : 110152 110413 110465 110564 110813
## highest: W./C. 6608 W./C. 6609 W.E.P. 5734 W/C 14208 WE/P 5735
## ---------------------------------------------------------------------------
## Fare
## n missing unique Info Mean .05 .10 .25 .50
## 891 0 248 1 32.2 7.225 7.550 7.910 14.454
## .75 .90 .95
## 31.000 77.958 112.079
##
## lowest : 0.000 4.013 5.000 6.237 6.438
## highest: 227.525 247.521 262.375 263.000 512.329
## ---------------------------------------------------------------------------
## Cabin
## n missing unique
## 204 687 147
##
## lowest : A10 A14 A16 A19 A20, highest: F33 F38 F4 G6 T
## ---------------------------------------------------------------------------
## Embarked
## n missing unique
## 889 2 3
##
## C (168, 19%), Q (77, 9%), S (644, 72%)
## ---------------------------------------------------------------------------
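The summary above reports missing values for Age, Cabin and Embarked. Before imputing anything it may be worth checking whether those values are stored as NA or as empty strings, since the two need different handling. The helper `count_missing()` below is a hypothetical name introduced for this sketch, not part of any package.

```r
# Count NA values and empty strings in every column of a data frame
count_missing <- function(df) {
  data.frame(
    variable = names(df),
    n_na     = vapply(df, function(x) sum(is.na(x)), integer(1)),
    n_blank  = vapply(df, function(x) sum(!is.na(x) & as.character(x) == ""),
                      integer(1)),
    row.names = NULL
  )
}

# Runs only when the titanic package is installed
if (requireNamespace("titanic", quietly = TRUE)) {
  print(count_missing(titanic::titanic_train))
}
```

The same call works on titanic_test, where Fare also has one missing value.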
Data contents: titanic_test
kable(head(titanic_test))
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 |  | Q |
893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 |  | S |
894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 |  | Q |
895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 |  | S |
896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 |  | S |
897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 |  | S |
describe(titanic_test)
## titanic_test
##
## 11 Variables 418 Observations
## ---------------------------------------------------------------------------
## PassengerId
## n missing unique Info Mean .05 .10 .25 .50
## 418 0 418 1 1100 912.9 933.7 996.2 1100.5
## .75 .90 .95
## 1204.8 1267.3 1288.2
##
## lowest : 892 893 894 895 896, highest: 1305 1306 1307 1308 1309
## ---------------------------------------------------------------------------
## Pclass
## n missing unique Info Mean
## 418 0 3 0.83 2.266
##
## 1 (107, 26%), 2 (93, 22%), 3 (218, 52%)
## ---------------------------------------------------------------------------
## Name
## n missing unique
## 418 0 418
##
## lowest : Abbott, Master. Eugene Joseph Abelseth, Miss. Karen Marie Abelseth, Mr. Olaus Jorgensen Abrahamsson, Mr. Abraham August Johannes Abrahim, Mrs. Joseph (Sophie Halaut Easu)
## highest: de Brito, Mr. Jose Joaquim de Messemaeker, Mr. Guillaume Joseph del Carlo, Mrs. Sebastiano (Argenia Genovesi) van Billiard, Master. James William van Billiard, Master. Walter John
## ---------------------------------------------------------------------------
## Sex
## n missing unique
## 418 0 2
##
## female (152, 36%), male (266, 64%)
## ---------------------------------------------------------------------------
## Age
## n missing unique Info Mean .05 .10 .25 .50
## 332 86 79 1 30.27 8.0 16.1 21.0 27.0
## .75 .90 .95
## 39.0 50.0 57.0
##
## lowest : 0.17 0.33 0.75 0.83 0.92
## highest: 62.00 63.00 64.00 67.00 76.00
## ---------------------------------------------------------------------------
## SibSp
## n missing unique Info Mean
## 418 0 7 0.67 0.4474
##
## 0 1 2 3 4 5 8
## Frequency 283 110 14 4 4 1 2
## % 68 26 3 1 1 0 0
## ---------------------------------------------------------------------------
## Parch
## n missing unique Info Mean
## 418 0 8 0.53 0.3923
##
## 0 1 2 3 4 5 6 9
## Frequency 324 52 33 3 2 1 1 2
## % 78 12 8 1 0 0 0 0
## ---------------------------------------------------------------------------
## Ticket
## n missing unique
## 418 0 363
##
## lowest : 110469 110489 110813 111163 112051
## highest: W./C. 14260 W./C. 14266 W./C. 6607 W./C. 6608 W.E.P. 5734
## ---------------------------------------------------------------------------
## Fare
## n missing unique Info Mean .05 .10 .25 .50
## 417 1 169 1 35.63 7.229 7.642 7.896 14.454
## .75 .90 .95
## 31.500 79.200 151.550
##
## lowest : 0.000 3.171 6.438 6.496 6.950
## highest: 227.525 247.521 262.375 263.000 512.329
## ---------------------------------------------------------------------------
## Cabin
## n missing unique
## 91 327 76
##
## lowest : A11 A18 A21 A29 A34
## highest: F G63 F2 F33 F4 G6
## ---------------------------------------------------------------------------
## Embarked
## n missing unique
## 418 0 3
##
## C (102, 24%), Q (46, 11%), S (270, 65%)
## ---------------------------------------------------------------------------
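Comparing the two summaries, titanic_test has the same columns as titanic_train except Survived (11 variables versus 12). For preprocessing that should be applied identically to both, one option is to stack them into a single data frame first. The sketch below is one way to do that; `stack_train_test()` and the `dataset` marker column are hypothetical names introduced here.

```r
# Stack a labelled training set and an unlabelled test set into one data frame.
# `target` is the column that exists only in the training set.
stack_train_test <- function(train, test, target = "Survived") {
  # The target must be the only column missing from the test set
  stopifnot(setequal(setdiff(names(train), names(test)), target))
  test[[target]] <- NA            # placeholder for the unknown label
  train$dataset <- "train"        # remember where each row came from
  test$dataset  <- "test"
  rbind(train, test[names(train)])
}

# Runs only when the titanic package is installed
if (requireNamespace("titanic", quietly = TRUE)) {
  full <- stack_train_test(titanic::titanic_train, titanic::titanic_test)
  print(table(full$dataset, is.na(full$Survived)))
}
```

With the two data frames shown above, the stacked result should have 891 + 418 = 1309 rows, and the `dataset` column makes it easy to split them apart again.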