R Cheat Sheet (12): Looking at Data

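When you get a new dataset, the first thing to do is inspect it. What format is the data in? What are its dimensions? Which variables does it contain, and how are they stored? Is any data missing? Are there flaws in the data? This section uses R's built-in functions to answer these questions.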

The dataset used in this section comes from the United States Department of Agriculture's PLANTS Database (http://plants.usda.gov/adv_search.html).

With the data read into plants, check its class:

> class(plants)
[1] "data.frame"
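
As a rough sketch of the same check on something you can run without the full dataset, the block below builds a tiny stand-in data frame. The name toy is hypothetical, and the three rows only borrow a few values from the PLANTS data. stringsAsFactors = TRUE is set explicitly so that text columns become factors; since R 4.0.0 the default is FALSE.

```r
# Hypothetical three-row stand-in for plants (not the real dataset).
toy <- data.frame(
  Scientific_Name = c("Abies", "Abies balsamea", "Abutilon"),
  Duration        = c(NA, "Perennial", NA),
  pH_Min          = c(NA, 4, NA),
  stringsAsFactors = TRUE   # assumption: mimic factor storage of text columns
)

class(toy)                          # class of the whole object

# class() works on single columns too; sapply() applies it to every column.
col_classes <- sapply(toy, class)
col_classes
```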

Check its dimensions:

> dim(plants)
[1] 5166   10

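Here 5166 is the number of rows (observations) and 10 is the number of columns (variables). You can also use nrow() and ncol() to get the row and column counts separately: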

> nrow(plants)
[1] 5166
> ncol(plants)
[1] 10
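
The relationship between the three functions can be checked on a small example. This is only a sketch on a made-up data frame (toy is a hypothetical name), not on plants itself:

```r
# Made-up data: 4 observations of 2 variables.
toy <- data.frame(id = 1:4, value = c(2.5, NA, 3.1, 4.8))

dims <- dim(toy)   # rows first, then columns
dims
nrow(toy)
ncol(toy)

# dim() is simply the row count followed by the column count.
identical(dims, c(nrow(toy), ncol(toy)))
```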

object.size() reports how much memory the object passed to it occupies:

> object.size(plants)
644232 bytes
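
A raw byte count gets hard to read for larger objects. As a small sketch, the result of object.size() can be printed or formatted in other units; the vector x below is a made-up example object, not part of the plants data:

```r
x <- numeric(1e5)              # hypothetical object: 100,000 doubles

size <- object.size(x)
print(size)                    # default: bytes
print(size, units = "Kb")      # the same size in kilobytes
size_text <- format(size, units = "auto")   # lets R pick a readable unit
size_text
```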

names() lists the names of the variables in the dataset:

> names(plants)
 [1] "Scientific_Name"      "Duration"            
 [3] "Active_Growth_Period" "Foliage_Color"       
 [5] "pH_Min"               "pH_Max"              
 [7] "Precip_Min"           "Precip_Max"          
 [9] "Shade_Tolerance"      "Temp_Min_F"
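
Because names() returns an ordinary character vector, it can be searched like any other. The sketch below copies the ten variable names into a vector (vars is a hypothetical name) and picks out those containing "_Min"; with the real data you would pass names(plants) instead.

```r
# The ten variable names of plants, typed in by hand.
vars <- c("Scientific_Name", "Duration", "Active_Growth_Period",
          "Foliage_Color", "pH_Min", "pH_Max", "Precip_Min",
          "Precip_Max", "Shade_Tolerance", "Temp_Min_F")

# value = TRUE returns the matching names rather than their positions.
min_vars <- grep("_Min", vars, value = TRUE)
min_vars
```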

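The next step is to look at the actual data. The dataset holds more than 5,000 observations, so we cannot read through all of it at once. head() returns the first few rows as a preview: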

> head(plants)
               Scientific_Name          Duration Active_Growth_Period Foliage_Color pH_Min pH_Max Precip_Min
1                  Abelmoschus              <NA>                 <NA>          <NA>     NA     NA         NA
2       Abelmoschus esculentus Annual, Perennial                 <NA>          <NA>     NA     NA         NA
3                        Abies              <NA>                 <NA>          <NA>     NA     NA         NA
4               Abies balsamea         Perennial    Spring and Summer         Green      4      6         13
5 Abies balsamea var. balsamea         Perennial                 <NA>          <NA>     NA     NA         NA
6                     Abutilon              <NA>                 <NA>          <NA>     NA     NA         NA
  Precip_Max Shade_Tolerance Temp_Min_F
1         NA            <NA>         NA
2         NA            <NA>         NA
3         NA            <NA>         NA
4         60        Tolerant        -43
5         NA            <NA>         NA
6         NA            <NA>         NA

head() shows the first 6 rows by default, but a second argument sets the number of rows; head(plants, 10), for example, returns the first 10 rows.
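
The row-count argument can be tried on a small made-up data frame (toy is a hypothetical name). A negative value is also accepted and means "all rows except the last n":

```r
# Made-up data: 10 rows.
toy <- data.frame(id = 1:10, value = (1:10)^2)

nrow(head(toy))          # default preview: 6 rows
first3 <- head(toy, 3)   # first 3 rows
first3
all_but7 <- head(toy, -7)   # drop the last 7 rows, which leaves rows 1 to 3
all_but7
```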

The counterpart of head() is tail(), which shows the last few rows of the dataset:

> tail(plants, 7)
      Scientific_Name  Duration Active_Growth_Period Foliage_Color pH_Min pH_Max Precip_Min Precip_Max
5160     Zizia aptera Perennial                 <NA>          <NA>     NA     NA         NA         NA
5161      Zizia aurea Perennial                 <NA>          <NA>     NA     NA         NA         NA
5162 Zizia trifoliata Perennial                 <NA>          <NA>     NA     NA         NA         NA
5163          Zostera      <NA>                 <NA>          <NA>     NA     NA         NA         NA
5164   Zostera marina Perennial                 <NA>          <NA>     NA     NA         NA         NA
5165           Zoysia      <NA>                 <NA>          <NA>     NA     NA         NA         NA
5166  Zoysia japonica Perennial                 <NA>          <NA>     NA     NA         NA         NA
     Shade_Tolerance Temp_Min_F
5160            <NA>         NA
5161            <NA>         NA
5162            <NA>         NA
5163            <NA>         NA
5164            <NA>         NA
5165            <NA>         NA
5166            <NA>         NA
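
Note that the rows above keep their original numbers (5160 to 5166) instead of being renumbered from 1. A minimal sketch on a made-up data frame (toy is a hypothetical name) shows the same behaviour:

```r
# Made-up data: 10 rows.
toy <- data.frame(id = 1:10, value = (1:10)^2)

last3 <- tail(toy, 3)
last3
rownames(last3)           # original row numbers are retained

# A negative n drops rows from the start instead.
after8 <- tail(toy, -8)   # everything except the first 8 rows
after8
```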

summary() gives an overview of the dataset:

> summary(plants)
                     Scientific_Name              Duration              Active_Growth_Period
 Abelmoschus                 :   1   Perennial        :3031   Spring and Summer   : 447     
 Abelmoschus esculentus      :   1   Annual           : 682   Spring              : 144     
 Abies                       :   1   Annual, Perennial: 179   Spring, Summer, Fall:  95     
 Abies balsamea              :   1   Annual, Biennial :  95   Summer              :  92     
 Abies balsamea var. balsamea:   1   Biennial         :  57   Summer and Fall     :  24     
 Abutilon                    :   1   (Other)          :  92   (Other)             :  30     
 (Other)                     :5160   NA's             :1030   NA's                :4334     
      Foliage_Color      pH_Min          pH_Max         Precip_Min      Precip_Max         Shade_Tolerance
 Dark Green  :  82   Min.   :3.000   Min.   : 5.100   Min.   : 4.00   Min.   : 16.00   Intermediate: 242  
 Gray-Green  :  25   1st Qu.:4.500   1st Qu.: 7.000   1st Qu.:16.75   1st Qu.: 55.00   Intolerant  : 349  
 Green       : 692   Median :5.000   Median : 7.300   Median :28.00   Median : 60.00   Tolerant    : 246  
 Red         :   4   Mean   :4.997   Mean   : 7.344   Mean   :25.57   Mean   : 58.73   NA's        :4329  
 White-Gray  :   9   3rd Qu.:5.500   3rd Qu.: 7.800   3rd Qu.:32.00   3rd Qu.: 60.00                      
 Yellow-Green:  20   Max.   :7.000   Max.   :10.000   Max.   :60.00   Max.   :200.00                      
 NA's        :4334   NA's   :4327    NA's   :4327     NA's   :4338    NA's   :4338                        
   Temp_Min_F    
 Min.   :-79.00  
 1st Qu.:-38.00  
 Median :-33.00  
 Mean   :-22.53  
 3rd Qu.:-18.00  
 Max.   : 52.00  
 NA's   :4328

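summary() produces different output for different data types. For a numeric variable such as Precip_Min it gives Min. (minimum), 1st Qu. (1st quartile), Median, Mean, 3rd Qu. (3rd quartile), and Max. (maximum), which helps us understand how the data is distributed. For a factor, summary() gives the number of times each value occurs; every value of Scientific_Name, for example, occurs exactly once, because a scientific name is unique to a particular plant. Note that the summary of some variables is truncated because they have too many distinct values; the values not shown are grouped under (Other).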
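
The difference between the two kinds of summary can be seen on single vectors. The values below are made up for illustration (precip and shade are hypothetical names); both contain an NA so that the NA's entry appears as well:

```r
# Numeric vector: summary() returns the five-number summary plus the mean.
precip <- c(13, 4, NA, 28, 60)
num_summary <- summary(precip)
num_summary

# Factor: summary() returns a count per level.
shade <- factor(c("Tolerant", "Intolerant", NA, "Tolerant"))
fac_summary <- summary(shade)
fac_summary
```

In both cases the missing values are counted separately under NA's rather than silently dropped.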

table() shows how many times each value of a variable occurs:

> table(plants$Active_Growth_Period)

Fall, Winter and Spring                  Spring         Spring and Fall       Spring and Summer 
                     15                     144                      10                     447 
   Spring, Summer, Fall                  Summer         Summer and Fall              Year Round 
                     95                      92                      24                       5
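
The counts above add up to 832, which is 5166 minus the 4334 NA's that summary() reported for this variable: by default table() leaves missing values out. A small sketch with made-up values (growth is a hypothetical name) shows how the useNA argument brings them back:

```r
# Made-up values, two of them missing.
growth <- factor(c("Spring", "Summer", NA, "Spring", NA))

counts <- table(growth)                         # NA is dropped by default
counts
counts_na <- table(growth, useNA = "ifany")     # adds a count for NA
counts_na

sort(counts, decreasing = TRUE)                 # most frequent value first
```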

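str() is probably the most useful and concise way to look at a dataset. It presents many of the dataset's characteristics in a compact, readable form: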

> str(plants)
'data.frame':   5166 obs. of  10 variables:
 $ Scientific_Name     : Factor w/ 5166 levels "Abelmoschus",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Duration            : Factor w/ 8 levels "Annual","Annual, Biennial",..: NA 4 NA 7 7 NA 1 NA 7 7 ...
 $ Active_Growth_Period: Factor w/ 8 levels "Fall, Winter and Spring",..: NA NA NA 4 NA NA NA NA 4 NA ...
 $ Foliage_Color       : Factor w/ 6 levels "Dark Green","Gray-Green",..: NA NA NA 3 NA NA NA NA 3 NA ...
 $ pH_Min              : num  NA NA NA 4 NA NA NA NA 7 NA ...
 $ pH_Max              : num  NA NA NA 6 NA NA NA NA 8.5 NA ...
 $ Precip_Min          : int  NA NA NA 13 NA NA NA NA 4 NA ...
 $ Precip_Max          : int  NA NA NA 60 NA NA NA NA 20 NA ...
 $ Shade_Tolerance     : Factor w/ 3 levels "Intermediate",..: NA NA NA 3 NA NA NA NA 2 NA ...
 $ Temp_Min_F          : int  NA NA NA -43 NA NA NA NA -13 NA ...
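
The str() output shows many NA values, so a natural follow-up is to count them per column. The sketch below does this on a three-row stand-in (toy is a hypothetical name, using a few values from the output above); colSums(is.na(...)) is one common way to do it, not something the original analysis prescribes. stringsAsFactors = TRUE is an assumption made so the text column is stored as a factor, as in the str() output.

```r
# Hypothetical three-row stand-in for plants.
toy <- data.frame(
  Scientific_Name = c("Abies", "Abies balsamea", "Abutilon"),
  pH_Min          = c(NA, 4, NA),
  Precip_Min      = c(NA, 13L, NA),
  stringsAsFactors = TRUE
)

str(toy)   # one line per variable: type plus the first few values

# is.na() gives a logical matrix; summing its columns counts the NAs.
na_counts <- colSums(is.na(toy))
na_counts
```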