Descriptive Statistics

Md Zulquar Nain

Importing the Data File

# importing data from `csv` file
datai <- read.csv("hsbraw.csv")

datai: name of the R object that holds the imported data
hsbraw.csv: name of the CSV file being imported

Exploring the Dataset I

Class, structure and dimension of the dataset

# Structure of the data
str(datai)

'data.frame':   189 obs. of  9 variables:
 $ id     : int  3 4 5 6 7 8 9 10 11 12 ...
 $ gender : chr  "male" "female" "male" "female" ...
 $ schtyp : chr  "public" "public" "public" "public" ...
 $ prog   : chr  "academic" "academic" "academic" "academic" ...
 $ read   : int  63 44 47 47 57 39 48 47 34 37 ...
 $ write  : int  65 50 40 41 54 44 49 54 46 44 ...
 $ math   : int  48 41 43 46 59 52 52 49 45 45 ...
 $ science: int  63 39 45 40 47 44 -99 53 39 39 ...
 $ socst  : int  56 51 31 41 51 48 -99 61 36 46 ...

# Class of the data
class(datai)

[1] "data.frame"

# Dimension of the data
dim(datai)

[1] 189   9
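
A few other base R functions are commonly used at this stage. The sketch below runs on a small hypothetical data frame (`toy`, not the hsbraw data) so that it works without the CSV file; the same calls would apply to datai.

```r
# Hypothetical toy data frame standing in for datai (values are illustrative)
toy <- data.frame(
  id     = 1:4,
  gender = c("male", "female", "male", "female"),
  read   = c(63L, 44L, 47L, 57L),
  stringsAsFactors = FALSE
)

head(toy, 2)          # first two rows
names(toy)            # column names
nrow(toy); ncol(toy)  # number of rows and columns, the two parts of dim(toy)
sapply(toy, class)    # class of every column at once
```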

Descriptive Statistics

Measures of Central Tendency and Dispersion

For continuous variables, the usual measures are
- mean and median (central tendency)
- variance and standard deviation (dispersion)
Functions in R
- mean()
- median()
- var() for variance
- sd() for standard deviation
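
As a minimal sketch, these functions can be applied to a plain numeric vector. The vector `x` below holds the first ten values of the read column as printed in the str() output; the name `x` is only illustrative.

```r
# First ten values of the read column, copied from the str() output
x <- c(63, 44, 47, 47, 57, 39, 48, 47, 34, 37)

mean(x)    # arithmetic mean
median(x)  # middle value of the sorted data
var(x)     # sample variance (divides by n - 1)
sd(x)      # sample standard deviation, the square root of var(x)
```

If a vector contains NA values, each of these functions returns NA unless na.rm = TRUE is supplied.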

Descriptive statistics

The summary() function is available in base R and reports
- mean, median, 1st and 3rd quartiles (25th and 75th percentiles)
- min and max

summary(datai)

       id           gender             schtyp              prog          
 Min.   :  3.0   Length:189         Length:189         Length:189        
 1st Qu.: 52.0   Class :character   Class :character   Class :character  
 Median :101.0   Mode  :character   Mode  :character   Mode  :character  
 Mean   :101.7                                                           
 3rd Qu.:152.0                                                           
 Max.   :200.0                                                           
      read           write            math          science     
 Min.   :28.00   Min.   :31.00   Min.   :35.00   Min.   :-99.0  
 1st Qu.:47.00   1st Qu.:46.00   1st Qu.:46.00   1st Qu.: 44.0  
 Median :52.00   Median :54.00   Median :53.00   Median : 53.0  
 Mean   :52.99   Mean   :53.67   Mean   :53.35   Mean   : 47.7  
 3rd Qu.:60.00   3rd Qu.:61.00   3rd Qu.:60.00   3rd Qu.: 58.0  
 Max.   :76.00   Max.   :67.00   Max.   :75.00   Max.   : 74.0  
     socst      
 Min.   :-99.0  
 1st Qu.: 46.0  
 Median : 52.0  
 Mean   : 48.1  
 3rd Qu.: 61.0  
 Max.   : 71.0
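
The minimum of -99 for science and socst is not a plausible test score and looks like a placeholder code for missing values. This is an assumption, since the slides do not say so. If it holds, the code would need to be recoded to NA before computing statistics. A minimal sketch, using the first ten science values from the str() output:

```r
# First ten science values from the str() output;
# -99 is ASSUMED here to be a missing-value code
science <- c(63, 39, 45, 40, 47, 44, -99, 53, 39, 39)

science[science == -99] <- NA   # recode the assumed code to NA
sum(is.na(science))             # number of missing values
mean(science, na.rm = TRUE)     # mean over the observed values only
```

Under the same assumption, the recoding could also be done at import time with the na.strings argument of read.csv(), for example na.strings = c("NA", "-99").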

Descriptive statistics

Selecting a specific column

summary(datai$read)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  28.00   47.00   52.00   52.99   60.00   76.00

More than one column

# create a subset of the data
subdata <- datai[,c("read","write","math")]
# summary of the subset
summary(subdata)

      read           write            math      
 Min.   :28.00   Min.   :31.00   Min.   :35.00  
 1st Qu.:47.00   1st Qu.:46.00   1st Qu.:46.00  
 Median :52.00   Median :54.00   Median :53.00  
 Mean   :52.99   Mean   :53.67   Mean   :53.35  
 3rd Qu.:60.00   3rd Qu.:61.00   3rd Qu.:60.00  
 Max.   :76.00   Max.   :67.00   Max.   :75.00
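
When only one statistic is needed for several columns, a function can be applied column by column. The sketch below uses a hypothetical data frame `toy` built from the first five rows shown in the str() output; with the full data, subdata would take its place.

```r
# First five rows of read, write and math from the str() output
toy <- data.frame(
  read  = c(63, 44, 47, 47, 57),
  write = c(65, 50, 40, 41, 54),
  math  = c(48, 41, 43, 46, 59)
)

colMeans(toy)        # mean of each column
sapply(toy, sd)      # standard deviation of each column
sapply(toy, median)  # median of each column
```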

Descriptive Statistics

Using the describe() function from the psych package
Gives more control over which statistics are reported

library(psych)

describe(subdata)

      vars   n  mean   sd median trimmed   mad min max range  skew kurtosis   se
read     1 189 52.99 9.94     52   52.78 11.86  28  76    48  0.20    -0.65 0.72
write    2 189 53.67 8.90     54   54.22 10.38  31  67    36 -0.52    -0.64 0.65
math     3 189 53.35 9.10     53   52.99 10.38  35  75    40  0.27    -0.67 0.66

Descriptive Statistics

Without Skewness and Kurtosis

# without skewness and kurtosis
describe(subdata, skew=FALSE)

      vars   n  mean   sd median min max range   se
read     1 189 52.99 9.94     52  28  76    48 0.72
write    2 189 53.67 8.90     54  31  67    36 0.65
math     3 189 53.35 9.10     53  35  75    40 0.66

Without range

# without range
describe(subdata, ranges = FALSE)

      vars   n  mean   sd  skew kurtosis   se
read     1 189 52.99 9.94  0.20    -0.65 0.72
write    2 189 53.67 8.90 -0.52    -0.64 0.65
math     3 189 53.35 9.10  0.27    -0.67 0.66

Descriptive Statistics

Summary statistics for each group defined by a grouping variable

# generating summary statistics by grouping variable
describeBy(subdata, datai$schtyp)


 Descriptive statistics by group 
group: private
      vars  n  mean   sd median trimmed  mad min max range  skew kurtosis   se
read     1 32 54.25 9.20   52.0   53.85 7.41  36  73    37  0.32    -0.91 1.63
write    2 32 55.53 7.18   57.0   56.12 6.67  38  67    29 -0.70    -0.26 1.27
math     3 32 54.75 8.88   53.5   54.27 8.90  41  75    34  0.45    -0.69 1.57
------------------------------------------------------------ 
group: public
      vars   n  mean    sd median trimmed   mad min max range  skew kurtosis   se
read     1 157 52.73 10.09     52   52.53 11.86  28  76    48  0.19    -0.66 0.81
write    2 157 53.29  9.18     54   53.81 11.86  31  67    36 -0.46    -0.77 0.73
math     3 157 53.07  9.15     53   52.72 10.38  35  75    40  0.25    -0.73 0.73
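
Grouped statistics can also be obtained with base R alone, which may be convenient when the psych package is not installed. This is an alternative sketch, not the approach used in the slides; the data frame `toy` and its values are hypothetical.

```r
# Hypothetical toy data; values are illustrative, not from hsbraw.csv
toy <- data.frame(
  schtyp = c("public", "public", "private", "private", "public"),
  read   = c(63, 44, 47, 57, 39),
  write  = c(65, 50, 40, 54, 44)
)

# Mean of read and write within each school type
aggregate(cbind(read, write) ~ schtyp, data = toy, FUN = mean)

# Same idea for a single column with tapply()
tapply(toy$read, toy$schtyp, mean)
```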

Frequencies and Cross Tabulation

Frequency Table

Generating frequency tables
Frequency tables are produced with the table() function

table(datai$gender)


female   male 
   104     85

table(datai$schtyp)


private  public 
     32     157

Frequency Tables

Tables of proportions are produced with the prop.table() function
For proportions, pass the output of table() to prop.table()

# saving the frequency table to an object
tableg <- table(datai$gender)
prop.table(tableg)


   female      male 
0.5502646 0.4497354

# OR
prop.table(table(datai$gender))


   female      male 
0.5502646 0.4497354
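
Proportions are often easier to read as percentages. A minimal sketch, rebuilding the gender variable from the counts in the frequency table above (the vector `gender` is a stand-in for datai$gender):

```r
# Gender counts taken from the frequency table above
gender <- rep(c("female", "male"), times = c(104, 85))

tableg <- table(gender)
round(100 * prop.table(tableg), 1)   # percentages, rounded to one decimal place
```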

Cross Tabulation

Two Way Tabulation
Counts for each combination of gender and school type

tab2way <- table(datai$gender, datai$schtyp)
tab2way

        
         private public
  female      18     86
  male        14     71

Marginal frequencies using margin.table(): margin = 1 gives row totals, margin = 2 gives column totals

margin.table(tab2way,margin = 1)


female   male 
   104     85

margin.table(tab2way,margin = 2)


private  public 
     32     157

Proportion Table

Row proportions
Proportion of each gender that falls into each school type (each row sums to 1)

prop.table(tab2way, margin = 1)

        
           private    public
  female 0.1730769 0.8269231
  male   0.1647059 0.8352941

Column proportions
Proportion of each school type that falls into each gender (each column sums to 1)

prop.table(tab2way, margin = 2)

        
           private    public
  female 0.5625000 0.5477707
  male   0.4375000 0.4522293
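
The meaning of the margin argument can be checked directly. The sketch below rebuilds the two-way table from the counts shown earlier, so it runs without the CSV file, and confirms which direction sums to 1.

```r
# Rebuild the two-way table from the counts shown above
tab2way <- as.table(matrix(
  c(18, 14, 86, 71), nrow = 2,
  dimnames = list(c("female", "male"), c("private", "public"))
))

addmargins(tab2way)                       # counts with row and column totals
rowSums(prop.table(tab2way, margin = 1))  # row proportions: each row sums to 1
colSums(prop.table(tab2way, margin = 2))  # column proportions: each column sums to 1
prop.table(tab2way)                       # no margin: each cell as a share of the grand total
```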

Correlation

The cor() function produces correlations
General form: cor(x, use = , method = )
- x: matrix or data frame
- use: specifies the handling of missing data
- method: specifies the type of correlation ("pearson", "spearman", or "kendall")

cordata <- datai[,c("read","write","math")]
cor(cordata,use="all.obs",method="pearson")

           read     write      math
read  1.0000000 0.5613371 0.6373328
write 0.5613371 1.0000000 0.5789356
math  0.6373328 0.5789356 1.0000000

cor(cordata,use="all.obs",method="spearman")

           read     write      math
read  1.0000000 0.5825882 0.6355307
write 0.5825882 1.0000000 0.6117342
math  0.6355307 0.6117342 1.0000000

cor(cordata,use="all.obs",method="kendall")

           read     write      math
read  1.0000000 0.4264602 0.4719542
write 0.4264602 1.0000000 0.4492563
math  0.4719542 0.4492563 1.0000000

Options for use
- all.obs: assumes no missing data; missing values produce an error
- complete.obs: listwise deletion
- pairwise.complete.obs: pairwise deletion
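
The difference between these options only shows up when values are missing. The sketch below uses a hypothetical data frame `toy` with two NA values inserted on purpose; the hsbraw columns used above have no NA values, which is why all.obs ran without error.

```r
# Hypothetical toy data with one missing value in each of two columns
toy <- data.frame(
  read  = c(63, 44, 47, 47, 57, 39),
  write = c(65, 50, NA, 41, 54, 44),
  math  = c(48, 41, 43, NA, 59, 52)
)

# all.obs stops with an error because NA values are present
res <- tryCatch(cor(toy, use = "all.obs"), error = function(e) "error")
res

# Listwise deletion: only rows complete on every column are used
cor(toy, use = "complete.obs")

# Pairwise deletion: each coefficient uses all rows complete for that pair
cor(toy, use = "pairwise.complete.obs")
```

With pairwise deletion, different coefficients in the same matrix can be based on different numbers of observations.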

THANKS