R Cheat Sheet (12): Looking at Data
Author: nex3z
2015-05-10
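When you get a new dataset, the first thing to do is inspect it. What format is the data in? How many dimensions does it have? What variables are there, and how are they stored? Is any data missing? Are there flaws in the data? This section uses R's built-in functions to answer these questions.
The dataset used in this section comes from the United States Department of Agriculture's PLANTS Database (http://plants.usda.gov/adv_search.html).
With plants loaded, check its class: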
> class(plants)
[1] "data.frame"
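The post does not show how plants was loaded. As a minimal sketch, assuming the data arrives as CSV, the snippet below reads a few made-up rows from inline text (a real file path would normally go in place of `text =`) and checks the resulting classes. The object name `toy` and the sample values are illustrative only.

```r
# Hypothetical CSV content standing in for a real file; empty fields become NA.
csv_text <- "Scientific_Name,Duration,pH_Min
Abies balsamea,Perennial,4
Abutilon,,
Zizia aurea,Perennial,"

# stringsAsFactors = TRUE reproduces the factor columns seen in this post
# (since R 4.0.0 the default is FALSE, which yields character columns).
toy <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

class(toy)          # the container type
sapply(toy, class)  # the type of each column
```

Checking the per-column classes right after loading is a quick way to catch numbers that were read as text.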
Check its dimensions:
> dim(plants)
[1] 5166 10
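Here 5166 is the number of rows (observations) and 10 is the number of columns (variables). nrow() and ncol() return the row and column counts individually: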
> nrow(plants)
[1] 5166
> ncol(plants)
[1] 10
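The relationship between dim(), nrow() and ncol() can be checked on a small example. This is a sketch on a made-up data frame named `toy`, not the plants data; it also shows the capitalised NROW()/NCOL() variants, which are handy when the input might be a plain vector.

```r
# A toy data frame (made-up values) standing in for `plants`.
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  pH_min = c(4.0, NA, 5.5, 6.0),
  precip = c(13L, NA, 20L, 35L)
)

shape <- dim(toy)      # integer vector: c(rows, columns)
shape[1] == nrow(toy)  # first element is the row count
shape[2] == ncol(toy)  # second element is the column count

# dim() returns NULL for a plain vector, while NROW()/NCOL()
# treat the vector as a one-column matrix.
dim(1:5)
NROW(1:5)
NCOL(1:5)
```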
object.size() reports how much memory its argument occupies:
> object.size(plants)
644232 bytes
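Raw byte counts get hard to read for larger objects. As a small sketch on a made-up numeric vector, the result of object.size() can be passed to format() with a `units` argument to get a friendlier figure. Note that the reported size is an estimate.

```r
x <- numeric(1e5)             # 100,000 doubles, 8 bytes each plus a small header
size <- object.size(x)

print(size)                   # raw byte count
format(size, units = "Kb")    # the same value in kilobytes
format(size, units = "auto")  # let R pick a readable unit
```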
names() lists the names of the variables in the dataset:
> names(plants)
[1] "Scientific_Name" "Duration"
[3] "Active_Growth_Period" "Foliage_Color"
[5] "pH_Min" "pH_Max"
[7] "Precip_Min" "Precip_Max"
[9] "Shade_Tolerance" "Temp_Min_F"
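Because names() returns an ordinary character vector, it can be filtered and tested like any other vector. The sketch below uses a made-up three-column data frame; `numeric_cols` is just an illustrative variable name.

```r
toy <- data.frame(
  Scientific_Name = c("Abies", "Zizia"),
  pH_Min          = c(4, NA),
  Temp_Min_F      = c(-43L, NA)
)

names(toy)                 # all column names

# Keep only the names of numeric (double or integer) columns.
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
numeric_cols

"pH_Max" %in% names(toy)   # check whether a column exists
```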
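The next step is to look at the actual data. The dataset contains more than 5,000 observations, so we cannot read through all of it at once. head() previews the first few rows: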
> head(plants)
Scientific_Name Duration Active_Growth_Period Foliage_Color pH_Min pH_Max Precip_Min
1 Abelmoschus <NA> <NA> <NA> NA NA NA
2 Abelmoschus esculentus Annual, Perennial <NA> <NA> NA NA NA
3 Abies <NA> <NA> <NA> NA NA NA
4 Abies balsamea Perennial Spring and Summer Green 4 6 13
5 Abies balsamea var. balsamea Perennial <NA> <NA> NA NA NA
6 Abutilon <NA> <NA> <NA> NA NA NA
Precip_Max Shade_Tolerance Temp_Min_F
1 NA <NA> NA
2 NA <NA> NA
3 NA <NA> NA
4 60 Tolerant -43
5 NA <NA> NA
6 NA <NA> NA
head() shows the first 6 rows by default, but the number of rows can be given as an argument; for example, head(plants, 10) returns the first 10 rows.
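The row-count argument can also be negative, in which case head() returns everything except that many rows at the end. A quick sketch on a made-up ten-row data frame:

```r
toy <- data.frame(id = 1:10, value = (1:10)^2)

head(toy, 3)     # first 3 rows
head(toy, -8)    # everything except the last 8 rows, i.e. rows 1 and 2
head(toy$value)  # also works on vectors: first 6 elements
```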
The counterpart of head() is tail(), which shows the last few rows of the dataset:
> tail(plants, 7)
Scientific_Name Duration Active_Growth_Period Foliage_Color pH_Min pH_Max Precip_Min Precip_Max
5160 Zizia aptera Perennial <NA> <NA> NA NA NA NA
5161 Zizia aurea Perennial <NA> <NA> NA NA NA NA
5162 Zizia trifoliata Perennial <NA> <NA> NA NA NA NA
5163 Zostera <NA> <NA> <NA> NA NA NA NA
5164 Zostera marina Perennial <NA> <NA> NA NA NA NA
5165 Zoysia <NA> <NA> <NA> NA NA NA NA
5166 Zoysia japonica Perennial <NA> <NA> NA NA NA NA
Shade_Tolerance Temp_Min_F
5160 <NA> NA
5161 <NA> NA
5162 <NA> NA
5163 <NA> NA
5164 <NA> NA
5165 <NA> NA
5166 <NA> NA
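As the output above shows, tail() keeps the original row numbers (5160 to 5166) rather than renumbering from 1. The same behaviour can be seen in a small sketch on a made-up data frame:

```r
toy <- data.frame(id = 1:10, value = (1:10)^2)

last3 <- tail(toy, 3)
last3            # rows 8, 9 and 10
rownames(last3)  # original row labels are kept
tail(toy, -8)    # everything except the first 8 rows, i.e. rows 9 and 10
```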
summary() gives an overview of the dataset:
> summary(plants)
Scientific_Name Duration Active_Growth_Period
Abelmoschus : 1 Perennial :3031 Spring and Summer : 447
Abelmoschus esculentus : 1 Annual : 682 Spring : 144
Abies : 1 Annual, Perennial: 179 Spring, Summer, Fall: 95
Abies balsamea : 1 Annual, Biennial : 95 Summer : 92
Abies balsamea var. balsamea: 1 Biennial : 57 Summer and Fall : 24
Abutilon : 1 (Other) : 92 (Other) : 30
(Other) :5160 NA's :1030 NA's :4334
Foliage_Color pH_Min pH_Max Precip_Min Precip_Max Shade_Tolerance
Dark Green : 82 Min. :3.000 Min. : 5.100 Min. : 4.00 Min. : 16.00 Intermediate: 242
Gray-Green : 25 1st Qu.:4.500 1st Qu.: 7.000 1st Qu.:16.75 1st Qu.: 55.00 Intolerant : 349
Green : 692 Median :5.000 Median : 7.300 Median :28.00 Median : 60.00 Tolerant : 246
Red : 4 Mean :4.997 Mean : 7.344 Mean :25.57 Mean : 58.73 NA's :4329
White-Gray : 9 3rd Qu.:5.500 3rd Qu.: 7.800 3rd Qu.:32.00 3rd Qu.: 60.00
Yellow-Green: 20 Max. :7.000 Max. :10.000 Max. :60.00 Max. :200.00
NA's :4334 NA's :4327 NA's :4327 NA's :4338 NA's :4338
Temp_Min_F
Min. :-79.00
1st Qu.:-38.00
Median :-33.00
Mean :-22.53
3rd Qu.:-18.00
Max. : 52.00
NA's :4328
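summary() reports different things for different data types. For a numeric variable such as Precip_Min, it gives Min. (minimum), 1st Qu. (1st quartile), Median, Mean, 3rd Qu. (3rd quartile) and Max. (maximum), which together describe the distribution of the data. For a factor, it gives the number of times each value occurs; every value of Scientific_Name occurs exactly once, because each plant's scientific name is unique. In both cases the NA's entry counts the missing values. Note that the summary is truncated for variables with too many distinct values: the counts that are not shown are lumped together under (Other).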
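The difference between numeric and factor summaries, and the (Other) truncation, can be reproduced on a small scale. This is a sketch with made-up values; `maxsum` is the argument of summary() that limits how many factor levels are listed, and colSums(is.na(...)) is one common way to count missing values per column.

```r
toy <- data.frame(
  precip = c(13, NA, 20, 35, 60),
  shade  = factor(c("Tolerant", NA, "Intolerant", "Tolerant", "Tolerant"))
)

summary(toy$precip)  # quartiles, mean and the NA count
summary(toy$shade)   # count per level and the NA count

# With more levels than maxsum, the least frequent ones collapse into (Other).
colors <- factor(c("Green", "Green", "Red", "Dark Green",
                   "Green", "Red", "White-Gray"))
truncated <- summary(colors, maxsum = 3)
truncated

colSums(is.na(toy))  # missing values per column
```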
table() shows how many times each value of a variable occurs:
> table(plants$Active_Growth_Period)
Fall, Winter and Spring Spring Spring and Fall Spring and Summer
15 144 10 447
Spring, Summer, Fall Summer Summer and Fall Year Round
95 92 24 5
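The counts above add up to 832, not 5166, because table() leaves out missing values by default; the remaining 4334 rows are the NA's reported by summary(). A minimal sketch on a made-up factor shows how the `useNA` argument brings them back, along with two common follow-ups, sorting and proportions:

```r
period <- factor(c("Spring", NA, "Summer", "Spring", NA, NA))

table(period)                           # NA values are dropped by default
table(period, useNA = "ifany")          # adds an <NA> entry when values are missing
sort(table(period), decreasing = TRUE)  # most frequent values first
prop.table(table(period))               # proportions among non-missing values
```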
str() is probably the most useful and concise way to look at a dataset; it presents many of the dataset's features in a compact, readable form:
> str(plants)
'data.frame': 5166 obs. of 10 variables:
$ Scientific_Name : Factor w/ 5166 levels "Abelmoschus",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Duration : Factor w/ 8 levels "Annual","Annual, Biennial",..: NA 4 NA 7 7 NA 1 NA 7 7 ...
$ Active_Growth_Period: Factor w/ 8 levels "Fall, Winter and Spring",..: NA NA NA 4 NA NA NA NA 4 NA ...
$ Foliage_Color : Factor w/ 6 levels "Dark Green","Gray-Green",..: NA NA NA 3 NA NA NA NA 3 NA ...
$ pH_Min : num NA NA NA 4 NA NA NA NA 7 NA ...
$ pH_Max : num NA NA NA 6 NA NA NA NA 8.5 NA ...
$ Precip_Min : int NA NA NA 13 NA NA NA NA 4 NA ...
$ Precip_Max : int NA NA NA 60 NA NA NA NA 20 NA ...
$ Shade_Tolerance : Factor w/ 3 levels "Intermediate",..: NA NA NA 3 NA NA NA NA 2 NA ...
$ Temp_Min_F : int NA NA NA -43 NA NA NA NA -13 NA ...
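To see how str() labels each storage type, here is a sketch on a made-up data frame with one character, one factor, one double and one integer column. `vec.len` is an argument of str() that controls how many leading values are printed. Keep in mind that since R 4.0.0 text columns are read as character by default, so a fresh import of the same data may show chr where this post shows Factor.

```r
toy <- data.frame(
  name  = c("Abies", "Zizia", "Zostera"),
  shade = factor(c("Tolerant", NA, "Intolerant")),
  pH    = c(4, NA, 7),
  temp  = c(-43L, NA, -13L)
)

str(toy)               # one line per column: type and leading values
str(toy, vec.len = 1)  # show fewer leading values per column
str(toy$shade)         # also works on a single column
```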