Descriptive Statistics

Md Zulquar Nain

Importing the Data File

# importing data from `csv` file
datai <- read.csv("hsbraw.csv")
  • datai: name of the data frame that holds the imported data in R

  • hsbraw.csv: name of the csv file being imported
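
A quick way to confirm that an import worked is to look at the first rows and the column names. The sketch below is self-contained: it writes a small illustrative CSV to a temporary file (the rows are invented and do not come from hsbraw.csv; `demo_df` is a hypothetical name) and reads it back with the same call.

```r
# Illustrative only: a tiny CSV written to a temporary file,
# standing in for hsbraw.csv (values are invented for the demo)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,gender,read",
             "1,male,50",
             "2,female,60"), tmp)

# Import with read.csv() as above, then inspect the result
demo_df <- read.csv(tmp)
head(demo_df)    # first rows of the data frame
names(demo_df)   # column names taken from the header line
```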

Exploring the Dataset I

  • Class, structure and dimension of the dataset
# Structure of the data
str(datai)
'data.frame':   189 obs. of  9 variables:
 $ id     : int  3 4 5 6 7 8 9 10 11 12 ...
 $ gender : chr  "male" "female" "male" "female" ...
 $ schtyp : chr  "public" "public" "public" "public" ...
 $ prog   : chr  "academic" "academic" "academic" "academic" ...
 $ read   : int  63 44 47 47 57 39 48 47 34 37 ...
 $ write  : int  65 50 40 41 54 44 49 54 46 44 ...
 $ math   : int  48 41 43 46 59 52 52 49 45 45 ...
 $ science: int  63 39 45 40 47 44 -99 53 39 39 ...
 $ socst  : int  56 51 31 41 51 48 -99 61 36 46 ...
# Class of the data
class(datai)
[1] "data.frame"
# Dimension of the data
dim(datai)
[1] 189   9
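
The two numbers returned by dim() can also be obtained separately, and the class of every column can be listed in one call. A minimal sketch on a hypothetical stand-in for datai (the `toy` data frame and its values are invented for illustration):

```r
# Hypothetical stand-in for datai (values invented for illustration)
toy <- data.frame(id = 1:3,
                  gender = c("male", "female", "male"),
                  read = c(50L, 60L, 55L),
                  stringsAsFactors = FALSE)

nrow(toy)            # number of observations (first element of dim())
ncol(toy)            # number of variables (second element of dim())
sapply(toy, class)   # class of each column, one entry per variable
```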

Descriptive Statistics

Measures of Central Tendency and Dispersion

  • For continuous variables
    • central tendency: mean and median
    • dispersion: variance and standard deviation
  • Functions in R
    • mean()
    • median()
    • var()
    • sd() for standard deviation
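
As a small worked example, the four functions can be applied to the first ten reading scores shown in the str() output above. This is only a sketch on ten values, not on the full column (`read10` is a name introduced here).

```r
# First ten reading scores, as shown in the str() output above
read10 <- c(63, 44, 47, 47, 57, 39, 48, 47, 34, 37)

mean(read10)     # arithmetic mean: 46.3
median(read10)   # middle value of the sorted scores: 47
var(read10)      # sample variance (divides by n - 1)
sd(read10)       # standard deviation, the square root of var()
```

On the full data the same calls take a column, for example mean(datai$read).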

Descriptive Statistics

  • summary() function available with base R
  • mean, median, 1st and 3rd quartiles (25th and 75th percentiles)
  • min and max
summary(datai)
       id           gender             schtyp              prog          
 Min.   :  3.0   Length:189         Length:189         Length:189        
 1st Qu.: 52.0   Class :character   Class :character   Class :character  
 Median :101.0   Mode  :character   Mode  :character   Mode  :character  
 Mean   :101.7                                                           
 3rd Qu.:152.0                                                           
 Max.   :200.0                                                           
      read           write            math          science     
 Min.   :28.00   Min.   :31.00   Min.   :35.00   Min.   :-99.0  
 1st Qu.:47.00   1st Qu.:46.00   1st Qu.:46.00   1st Qu.: 44.0  
 Median :52.00   Median :54.00   Median :53.00   Median : 53.0  
 Mean   :52.99   Mean   :53.67   Mean   :53.35   Mean   : 47.7  
 3rd Qu.:60.00   3rd Qu.:61.00   3rd Qu.:60.00   3rd Qu.: 58.0  
 Max.   :76.00   Max.   :67.00   Max.   :75.00   Max.   : 74.0  
     socst      
 Min.   :-99.0  
 1st Qu.: 46.0  
 Median : 52.0  
 Mean   : 48.1  
 3rd Qu.: 61.0  
 Max.   : 71.0  
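
The minimum of -99 for science and socst lies far outside the range of the other scores. This suggests (an assumption, not something the data file states) that -99 is a code for a missing value. If so, it pulls the mean down and should be recoded to NA before summarising. A minimal sketch on the first ten science values from the str() output (`science10` is a name introduced here):

```r
# First ten science scores from the str() output above
science10 <- c(63, 39, 45, 40, 47, 44, -99, 53, 39, 39)

# Assumption: -99 marks a missing score, so recode it to NA
science10[science10 == -99] <- NA

summary(science10)              # the NA count is reported separately
mean(science10, na.rm = TRUE)   # mean of the non-missing scores only
```

Under the same assumption, the full column could be recoded with datai$science[datai$science == -99] <- NA, and likewise for socst.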

Descriptive Statistics

  • Selecting a specific column
summary(datai$read)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  28.00   47.00   52.00   52.99   60.00   76.00 
  • More than one column
# create a subset with the three score columns
subdata <- datai[, c("read", "write", "math")]
# summary of the subset
summary(subdata)
      read           write            math      
 Min.   :28.00   Min.   :31.00   Min.   :35.00  
 1st Qu.:47.00   1st Qu.:46.00   1st Qu.:46.00  
 Median :52.00   Median :54.00   Median :53.00  
 Mean   :52.99   Mean   :53.67   Mean   :53.35  
 3rd Qu.:60.00   3rd Qu.:61.00   3rd Qu.:60.00  
 Max.   :76.00   Max.   :67.00   Max.   :75.00  
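
summary() does not report the standard deviation. One base R option for applying a single function to every column of a subset is sapply(). The sketch below uses only the first five rows of read, write and math from the str() output (`sub5` is a name introduced here), so the numbers differ from the full-data results.

```r
# First five rows of read, write and math from the str() output above
sub5 <- data.frame(read  = c(63, 44, 47, 47, 57),
                   write = c(65, 50, 40, 41, 54),
                   math  = c(48, 41, 43, 46, 59))

sapply(sub5, mean)   # column means as a named vector
sapply(sub5, sd)     # column standard deviations
```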

Descriptive Statistics

  • Using the describe() function from the psych package

  • Gives more control over which statistics are reported

library(psych)

describe(subdata)
      vars   n  mean   sd median trimmed   mad min max range  skew kurtosis
read     1 189 52.99 9.94     52   52.78 11.86  28  76    48  0.20    -0.65
write    2 189 53.67 8.90     54   54.22 10.38  31  67    36 -0.52    -0.64
math     3 189 53.35 9.10     53   52.99 10.38  35  75    40  0.27    -0.67
        se
read  0.72
write 0.65
math  0.66
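
The se column is the standard error of the mean, sd / sqrt(n). The short check below recomputes it in base R from the sd and n values printed above; the variable names are introduced here for illustration.

```r
# sd and n copied from the describe() output above
n   <- 189
sds <- c(read = 9.94, write = 8.90, math = 9.10)

# Standard error of the mean: sd / sqrt(n)
round(sds / sqrt(n), 2)   # read 0.72, write 0.65, math 0.66
```

The rounded values agree with the se column in the output.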

Descriptive Statistics

  • Without skewness and kurtosis (the output below also omits the trimmed mean and mad)
# without skewness and kurtosis
describe(subdata, skew=FALSE) 
      vars   n  mean   sd median min max range   se
read     1 189 52.99 9.94     52  28  76    48 0.72
write    2 189 53.67 8.90     54  31  67    36 0.65
math     3 189 53.35 9.10     53  35  75    40 0.66
  • Without range statistics (the output below also omits median, trimmed mean, mad, min and max)
# without range statistics
describe(subdata, ranges = FALSE)
      vars   n  mean   sd  skew kurtosis   se
read     1 189 52.99 9.94  0.20    -0.65 0.72
write    2 189 53.67 8.90 -0.52    -0.64 0.65
math     3 189 53.35 9.10  0.27    -0.67 0.66

Descriptive Statistics

  • Summary statistics computed separately for each level of a grouping variable (here, school type)
# generating summary statistics by a grouping variable
describeBy(subdata, datai$schtyp)

 Descriptive statistics by group 
group: private
      vars  n  mean   sd median trimmed  mad min max range  skew kurtosis   se
read     1 32 54.25 9.20   52.0   53.85 7.41  36  73    37  0.32    -0.91 1.63
write    2 32 55.53 7.18   57.0   56.12 6.67  38  67    29 -0.70    -0.26 1.27
math     3 32 54.75 8.88   53.5   54.27 8.90  41  75    34  0.45    -0.69 1.57
------------------------------------------------------------ 
group: public
      vars   n  mean    sd median trimmed   mad min max range  skew kurtosis
read     1 157 52.73 10.09     52   52.53 11.86  28  76    48  0.19    -0.66
write    2 157 53.29  9.18     54   53.81 11.86  31  67    36 -0.46    -0.77
math     3 157 53.07  9.15     53   52.72 10.38  35  75    40  0.25    -0.73
        se
read  0.81
write 0.73
math  0.73
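
When only one statistic per group is needed, base R can do the grouping without the psych package. The sketch below uses tapply() and aggregate() on a hypothetical four-row data frame (values invented, not taken from hsbraw.csv).

```r
# Hypothetical scores for illustration (not taken from hsbraw.csv)
toy <- data.frame(schtyp = c("public", "public", "private", "private"),
                  read   = c(50, 60, 55, 65),
                  stringsAsFactors = FALSE)

# Mean reading score within each school type (named vector)
group_means <- tapply(toy$read, toy$schtyp, mean)
group_means

# Same idea with a formula interface; returns a data frame
aggregate(read ~ schtyp, data = toy, FUN = mean)
```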

Frequencies and Cross Tabulation

Frequency Table

  • Generating Frequency Tables
  • The table() function counts the observations in each category
table(datai$gender)

female   male 
   104     85 
table(datai$schtyp)

private  public 
     32     157 

Frequency Tables

  • tables of proportions using the prop.table( ) function
  • for proportions, use output of table() as input to prop.table()
# saving the frequency table to an object
tableg <- table(datai$gender)
prop.table(tableg)

   female      male 
0.5502646 0.4497354 
# OR
prop.table(table(datai$gender))

   female      male 
0.5502646 0.4497354 
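
Proportions are often easier to read as percentages. A minimal sketch, rebuilding the gender table from the counts printed above so that it runs without the data file (`counts` and `pct` are names introduced here):

```r
# Gender counts copied from the table() output above
counts <- as.table(c(female = 104, male = 85))

# Proportions scaled to percentages, rounded to one decimal place
pct <- round(100 * prop.table(counts), 1)
pct
```

With the data loaded, the same result comes from round(100 * prop.table(table(datai$gender)), 1).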

Cross Tabulation

  • Two Way Tabulation

  • counts in each crossing of gender and school type

tab2way <- table(datai$gender, datai$schtyp)
tab2way
        
         private public
  female      18     86
  male        14     71
  • Marginal frequencies using margin.table(): margin = 1 gives row totals, margin = 2 gives column totals
margin.table(tab2way,margin = 1)

female   male 
   104     85 
margin.table(tab2way,margin = 2)

private  public 
     32     157 
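
The row and column totals can also be attached to the table itself with addmargins() from base R. The sketch below rebuilds the two-way table from the counts printed above so that it runs on its own (`tab` and `tab_sum` are names introduced here).

```r
# Two-way counts copied from the tab2way output above
# (matrix() fills column by column: private first, then public)
tab <- as.table(matrix(c(18, 14, 86, 71), nrow = 2,
                       dimnames = list(c("female", "male"),
                                       c("private", "public"))))

# Append a "Sum" row and a "Sum" column holding the marginal totals
tab_sum <- addmargins(tab)
tab_sum
```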

Proportion Table

  • Row proportions
  • Within each gender, the proportion in each school type (each row sums to 1)
prop.table(tab2way, margin = 1)
        
           private    public
  female 0.1730769 0.8269231
  male   0.1647059 0.8352941
  • Column proportions
  • Within each school type, the proportion of each gender (each column sums to 1)
prop.table(tab2way, margin = 2)
        
           private    public
  female 0.5625000 0.5477707
  male   0.4375000 0.4522293
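
A simple way to check which margin was used is to sum the proportions. The sketch below rebuilds the two-way table from the counts shown earlier (`tab` is a name introduced here) and confirms that margin = 1 normalises rows and margin = 2 normalises columns.

```r
# Two-way counts copied from the tab2way output
tab <- as.table(matrix(c(18, 14, 86, 71), nrow = 2,
                       dimnames = list(c("female", "male"),
                                       c("private", "public"))))

rowSums(prop.table(tab, margin = 1))   # each row sums to 1
colSums(prop.table(tab, margin = 2))   # each column sums to 1
prop.table(tab)                        # no margin: shares of the grand total
```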

Correlation

  • The cor() function computes correlations

  • General form: cor(x, use = , method = )

    • x: matrix or data frame
    • use: how missing data are handled
    • method: type of correlation ("pearson", "spearman" or "kendall")
cordata <- datai[,c("read","write","math")]
cor(cordata,use="all.obs",method="pearson")
           read     write      math
read  1.0000000 0.5613371 0.6373328
write 0.5613371 1.0000000 0.5789356
math  0.6373328 0.5789356 1.0000000
cor(cordata,use="all.obs",method="spearman")
           read     write      math
read  1.0000000 0.5825882 0.6355307
write 0.5825882 1.0000000 0.6117342
math  0.6355307 0.6117342 1.0000000
cor(cordata,use="all.obs",method="kendall")
           read     write      math
read  1.0000000 0.4264602 0.4719542
write 0.4264602 1.0000000 0.4492563
math  0.4719542 0.4492563 1.0000000
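  • all.obs: assumes no missing data; any missing value produces an error
  • complete.obs: listwise deletion (rows with any missing value are dropped)
  • pairwise.complete.obs: pairwise deletion (each correlation uses the rows complete for that pair)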
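
The three options only differ when the data contain missing values. The sketch below illustrates this on a hypothetical five-row data frame with two NA values (values invented, not from hsbraw.csv; `toy` and `res` are names introduced here).

```r
# Hypothetical scores with missing values (not from hsbraw.csv)
toy <- data.frame(read  = c(50, 60, 55, 65, 70),
                  write = c(52, NA, 58, 60, 68),
                  math  = c(48, 59, 54, NA, 72))

# all.obs stops with an error because NAs are present;
# tryCatch() turns the error into a message so the script continues
res <- tryCatch(cor(toy, use = "all.obs"),
                error = function(e) "error: missing observations")
res

# Listwise deletion: only rows 1, 3 and 5 are complete on all variables
cor(toy, use = "complete.obs")

# Pairwise deletion: read-write uses rows 1, 3, 4, 5; read-math uses rows 1, 2, 3, 5
cor(toy, use = "pairwise.complete.obs")
```

Pairwise deletion keeps more data, but each entry of the matrix may then be based on a different set of rows.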

THANKS