Saturday, 18 April 2020

Chi-Square Test: still living in the childhood home and daily activities such as hobbies

Introduction

As the source cited below describes it:
The National Longitudinal Study of Adolescent Health (AddHealth) is a representative school-based survey of adolescents in grades 7-12 in the United States. The Wave 1 survey focuses on factors that may influence adolescents’ health and risk behaviors, including personal traits, families, friendships, romantic relationships, peer groups, schools, neighborhoods, and communities.

Source: Data Analysis Tools, Wesleyan University
NB: This study is part of the Coursera course Data Analysis Tools and uses Python.

Data

H1GI2 (General Introductory) 

The question was:
Think about the house or apartment building in which you lived in January 1990, when you were {AGE IN JANUARY 1990} years old. Do you still live there? 
Frequency  Code  Response
3046       0     no
3447       1     yes
3          6     refused
8          8     don't know


=> 4 categories

H1DA2 (Daily Activities): 

The question was:
During the past week, how many times did the adolescent do hobbies? This is a categorical variable with the responses: not at all, 1 or 2 times, 3 or 4 times, 5 or more times, refused, don't know.
Frequency  Code  Response
1416       0     not at all
2163       1     1 or 2 times
1439       2     3 or 4 times
1479       3     5 or more times
2          6     refused
5          8     don't know
=> 6 categories
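
The two codebooks above can be kept as plain dictionaries, which makes the printed tables easier to read. The following is only a minimal sketch: the names H1GI2_LABELS and H1DA2_LABELS and the five-row sample are hypothetical and are not part of the original script.

```python
import pandas

# Code-to-label mappings copied from the two codebook tables above.
H1GI2_LABELS = {0: "no", 1: "yes", 6: "refused", 8: "don't know"}
H1DA2_LABELS = {
    0: "not at all",
    1: "1 or 2 times",
    2: "3 or 4 times",
    3: "5 or more times",
    6: "refused",
    8: "don't know",
}

# Hypothetical handful of responses, used only to illustrate the mapping.
sample = pandas.DataFrame({"H1GI2": [0, 1, 1, 6, 8], "H1DA2": [3, 0, 1, 6, 8]})

labelled = pandas.DataFrame({
    "H1GI2": sample["H1GI2"].map(H1GI2_LABELS),
    "H1DA2": sample["H1DA2"].map(H1DA2_LABELS),
})
print(labelled)

# Codes below 6 are substantive answers; 6 and 8 are non-responses.
is_substantive = sample["H1DA2"] < 6
print(is_substantive.tolist())
```

The same dictionaries could be applied to the real columns after loading addhealth_pds.csv.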

Objective:

Study the relationship between still living in the childhood home (H1GI2) and daily activities such as hobbies (H1DA2).
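
For reference, the objective can be written as a pair of hypotheses together with the standard Pearson chi-square statistic. The symbols O_ij, E_ij, R_i, C_j and n are notation introduced here, not taken from the course material; r and c are the numbers of categories of H1GI2 and H1DA2 listed above.

```latex
H_0:\ \text{H1GI2 and H1DA2 are independent}
\qquad
H_1:\ \text{H1GI2 and H1DA2 are associated}

\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
df = (r-1)(c-1) = (4-1)(6-1) = 15
```

Here O_ij is the observed count in row i and column j, R_i and C_j are the row and column totals, and n is the total number of respondents.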

Code:

# -*- coding: utf-8 -*-
"""
Created on Sat Apr 18 13:11:16 2020

@author: PC HP
"""
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
# load the AddHealth Wave 1 data
mydata = pandas.read_csv('addhealth_pds.csv', low_memory=False)
# convert both variables to numeric; values that cannot be parsed become NaN
mydata['H1GI2'] = pandas.to_numeric(mydata['H1GI2'], errors='coerce')
mydata['H1DA2'] = pandas.to_numeric(mydata['H1DA2'], errors='coerce')
# setting missing data
# NOTE: neither variable uses code 200, so these two lines leave the data unchanged;
# codes 6 (refused) and 8 (don't know) stay in the data and are analysed below
mydata['H1GI2'] = mydata['H1GI2'].replace(200, numpy.nan)
mydata['H1DA2'] = mydata['H1DA2'].replace(200, numpy.nan)
# contingency table of observed counts
print('observed counts table:')
ct1=pandas.crosstab(mydata['H1GI2'], mydata['H1DA2'])
print (ct1)
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print('chi percentages table:')
print(colpct)
# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
# set variable types 
mydata["H1DA2"] = mydata["H1DA2"].astype('category')
# new code for setting variables to numeric:
mydata['H1GI2'] = pandas.to_numeric(mydata['H1GI2'], errors='coerce')
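# graph: mean of H1GI2 for each hobby-frequency category
# (codes 6 and 8 are still present in H1GI2, so the bar height is only approximately a proportion)
# ci=None turns off the error bars (newer seaborn versions use errorbar=None instead)
seaborn.catplot(x="H1DA2", y="H1GI2", data=mydata, kind="bar", ci=None)
plt.xlabel('Number of times the adolescent did hobbies last week')
plt.ylabel('Proportion still living in the childhood residence')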

# post hoc analysis: repeat the test on subgroups of H1DA2 response codes
subgroups = [
    [0],
    [0, 1],
    [0, 1, 2],
    [0, 1, 2, 3],
    [0, 1, 2, 3, 6],
    [0, 1, 2, 3, 8],
    [6, 8],
]
for codes in subgroups:
    # keep only the listed codes; every other response becomes NaN
    recode = {code: code for code in codes}
    mydata['subH1DA2'] = mydata['H1DA2'].map(recode)
    # contingency table of observed counts
    ct2 = pandas.crosstab(mydata['H1GI2'], mydata['subH1DA2'])
    print(ct2)
    # column percentages
    colsum = ct2.sum(axis=0)
    colpct = ct2 / colsum
    print(colpct)
    print('chi-square value, p value, expected counts')
    cs2 = scipy.stats.chi2_contingency(ct2)
    print(cs2)


Results:


According to the figure, the responses in S1 = {x in [0..3]} are related to H1GI2, while for S2 = {x in {6, 8}} the relationship is less probable. The two subgroups therefore do not behave in the same way. This is also shown by the following:
  • S1:

subH1DA2       0.0       1.0       2.0       3.0
H1GI2                                          
0         0.496469  0.478502  0.451703  0.444219
1         0.500706  0.521036  0.548297  0.555105
6         0.000706  0.000000  0.000000  0.000000
8         0.002119  0.000462  0.000000  0.000676
chi-square value, p value, expected counts
(19.160401671234695, 0.023863055784254898, 9, array([[6.63647837e+02, 1.01375019e+03, 6.74427428e+02, 6.93174542e+02],
       [7.51044482e+02, 1.14725227e+03, 7.63243651e+02, 7.84459597e+02],
       [2.17946745e-01, 3.32922887e-01, 2.21486840e-01, 2.27643528e-01],
       [1.08973372e+00, 1.66461444e+00, 1.10743420e+00, 1.13821764e+00]]))
  • S2:

subH1DA2  6.0  8.0
H1GI2            
0           0    1
1           0    1
6           2    0
8           0    3
subH1DA2  6.0  8.0
H1GI2            
0         0.0  0.2
1         0.0  0.2
6         1.0  0.0
8         0.0  0.6
chi-square value, p value, expected counts
(7.0, 0.07189777249646509, 3, array([[0.28571429, 0.71428571],
       [0.28571429, 0.71428571],
       [0.57142857, 1.42857143],
       [0.85714286, 2.14285714]]))

Indeed, for S1 the chi-square value is large (19.16) and the p-value is small (p = 0.024 < 0.05), so H0 (the two variables are independent) is rejected. For S2, the chi-square value is smaller (7.0) and the p-value is higher (p = 0.072 > 0.05), so the evidence of dependence is weaker and H0 is not rejected at the 0.05 level.
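
To make the S2 figures concrete, the statistic can be recomputed by hand from the S2 counts shown above. This is a minimal sketch rather than the original script: the variable names are illustrative, and the counts are copied from the S2 table.

```python
import numpy
import scipy.stats

# Observed counts for S2, copied from the S2 table above
# (rows: H1GI2 = 0, 1, 6, 8; columns: H1DA2 = 6, 8).
observed = numpy.array([[0, 1],
                        [0, 1],
                        [2, 0],
                        [0, 3]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count under independence: row total * column total / n
expected = row_totals * col_totals / n

# Pearson chi-square statistic, degrees of freedom and p-value
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = scipy.stats.chi2.sf(chi2, dof)
print(chi2, dof, p_value)

# Cross-check against the library routine used in the post
chi2_lib, p_lib, dof_lib, expected_lib = scipy.stats.chi2_contingency(observed)
```

The hand computation and scipy.stats.chi2_contingency should agree on the statistic, the degrees of freedom and the expected counts reported for S2.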
More details are available in the following results (including post hoc analysis):
observed counts table:
H1DA2    0     1    2    3  6  8
H1GI2                          
0      703  1035  650  657  0  1
1      709  1127  789  821  0  1
6        1     0    0    0  2  0
8        3     1    0    1  0  3
chi percentages table:
H1DA2         0         1         2         3    6    8
H1GI2                                                 
0      0.496469  0.478502  0.451703  0.444219  0.0  0.2
1      0.500706  0.521036  0.548297  0.555105  0.0  0.2
6      0.000706  0.000000  0.000000  0.000000  1.0  0.0
8      0.002119  0.000462  0.000000  0.000676  0.0  0.6
chi-square value, p value, expected counts
(5810.662666395812, 0.0, 15, array([[6.63151292e+02, 1.01299170e+03, 6.73922817e+02, 6.92655904e+02,
        9.36654367e-01, 2.34163592e+00],
       [7.50453875e+02, 1.14635009e+03, 7.62643450e+02, 7.83842712e+02,
        1.05996310e+00, 2.64990775e+00],
       [6.53136531e-01, 9.97693727e-01, 6.63745387e-01, 6.82195572e-01,
        9.22509225e-04, 2.30627306e-03],
       [1.74169742e+00, 2.66051661e+00, 1.76998770e+00, 1.81918819e+00,
        2.46002460e-03, 6.15006150e-03]]))
subH1DA2  0.0
H1GI2       
0         703
1         709
6           1
8           3
subH1DA2       0.0
H1GI2            
0         0.496469
1         0.500706
6         0.000706
8         0.002119
chi-square value, p value, expected counts
(0.0, 1.0, 0, array([[703.],
       [709.],
       [  1.],
       [  3.]]))
subH1DA2  0.0   1.0
H1GI2             
0         703  1035
1         709  1127
6           1     0
8           3     1
subH1DA2       0.0       1.0
H1GI2                      
0         0.496469  0.478502
1         0.500706  0.521036
6         0.000706  0.000000
8         0.002119  0.000462
chi-square value, p value, expected counts
(4.886483669772731, 0.18030059764588033, 3, array([[6.87624476e+02, 1.05037552e+03],
       [7.26397318e+02, 1.10960268e+03],
       [3.95641241e-01, 6.04358759e-01],
       [1.58256496e+00, 2.41743504e+00]]))
subH1DA2  0.0   1.0  2.0
H1GI2                  
0         703  1035  650
1         709  1127  789
6           1     0    0
8           3     1    0
subH1DA2       0.0       1.0       2.0
H1GI2                                
0         0.496469  0.478502  0.451703
1         0.500706  0.521036  0.548297
6         0.000706  0.000000  0.000000
8         0.002119  0.000462  0.000000
chi-square value, p value, expected counts
(13.279010760752726, 0.03881309634250806, 6, array([[6.73855719e+02, 1.02934316e+03, 6.84801116e+02],
       [7.40733360e+02, 1.13150159e+03, 7.52765046e+02],
       [2.82184137e-01, 4.31048226e-01, 2.86767637e-01],
       [1.12873655e+00, 1.72419291e+00, 1.14707055e+00]]))
subH1DA2  0.0   1.0  2.0  3.0
H1GI2                       
0         703  1035  650  657
1         709  1127  789  821
6           1     0    0    0
8           3     1    0    1
subH1DA2       0.0       1.0       2.0       3.0
H1GI2                                           
0         0.496469  0.478502  0.451703  0.444219
1         0.500706  0.521036  0.548297  0.555105
6         0.000706  0.000000  0.000000  0.000000
8         0.002119  0.000462  0.000000  0.000676
chi-square value, p value, expected counts
(19.160401671234695, 0.023863055784254898, 9, array([[6.63647837e+02, 1.01375019e+03, 6.74427428e+02, 6.93174542e+02],
       [7.51044482e+02, 1.14725227e+03, 7.63243651e+02, 7.84459597e+02],
       [2.17946745e-01, 3.32922887e-01, 2.21486840e-01, 2.27643528e-01],
       [1.08973372e+00, 1.66461444e+00, 1.10743420e+00, 1.13821764e+00]]))
subH1DA2  0.0   1.0  2.0  3.0  6.0
H1GI2                            
0         703  1035  650  657    0
1         709  1127  789  821    0
6           1     0    0    0    2
8           3     1    0    1    0
subH1DA2       0.0       1.0       2.0       3.0  6.0
H1GI2                                               
0         0.496469  0.478502  0.451703  0.444219  0.0
1         0.500706  0.521036  0.548297  0.555105  0.0
6         0.000706  0.000000  0.000000  0.000000  1.0
8         0.002119  0.000462  0.000000  0.000676  0.0
chi-square value, p value, expected counts
(4348.773173724676, 0.0, 12, array([[6.63443607e+02, 1.01343822e+03, 6.74219880e+02, 6.92961225e+02,
        9.37067241e-01],
       [7.50813356e+02, 1.14689922e+03, 7.63008771e+02, 7.84218187e+02,
        1.06047084e+00],
       [6.53639021e-01, 9.98461302e-01, 6.64256039e-01, 6.82720419e-01,
        9.23218957e-04],
       [1.08939837e+00, 1.66410217e+00, 1.10709340e+00, 1.13786736e+00,
        1.53869826e-03]]))
subH1DA2  0.0   1.0  2.0  3.0  8.0
H1GI2                            
0         703  1035  650  657    1
1         709  1127  789  821    1
6           1     0    0    0    0
8           3     1    0    1    3
subH1DA2       0.0       1.0       2.0       3.0  8.0
H1GI2                                               
0         0.496469  0.478502  0.451703  0.444219  0.2
1         0.500706  0.521036  0.548297  0.555105  0.2
6         0.000706  0.000000  0.000000  0.000000  0.0
8         0.002119  0.000462  0.000000  0.000676  0.6
chi-square value, p value, expected counts
(1477.2704083643332, 3.0249713664075e-309, 12, array([[6.63355275e+02, 1.01330329e+03, 6.74130114e+02, 6.92868963e+02,
        2.34235620e+00],
       [7.50684712e+02, 1.14670271e+03, 7.62878038e+02, 7.84083820e+02,
        2.65072285e+00],
       [2.17779145e-01, 3.32666872e-01, 2.21316518e-01, 2.27468471e-01,
        7.68994156e-04],
       [1.74223316e+00, 2.66133497e+00, 1.77053214e+00, 1.81974777e+00,
        6.15195325e-03]]))
subH1DA2  6.0  8.0
H1GI2            
0           0    1
1           0    1
6           2    0
8           0    3
subH1DA2  6.0  8.0
H1GI2            
0         0.0  0.2
1         0.0  0.2
6         1.0  0.0
8         0.0  0.6
chi-square value, p value, expected counts
(7.0, 0.07189777249646509, 3, array([[0.28571429, 0.71428571],
       [0.28571429, 0.71428571],
       [0.57142857, 1.42857143],
       [0.85714286, 2.14285714]]))
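According to the results just above, when x (subH1DA2, derived from H1DA2) does not contain the codes 6 or 8, the chi-square value stays small (19.16 at most) and the p-value stays comparatively high (between 0.024 and 1.0). When code 6, code 8, or both are added to the codes 0 to 3, the chi-square value becomes much larger (between 1477 and 5811) and p converges to zero.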
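
One way to see why the statistic jumps when the codes 6 and 8 are included is to look at the expected counts and at how much each cell contributes to the chi-square value. The sketch below is an illustration added to the original analysis: the counts are copied from the observed counts table above, and the threshold of 5 is a common rule of thumb, assumed here.

```python
import numpy
import scipy.stats

# Full observed table, copied from the output above
# (rows: H1GI2 = 0, 1, 6, 8; columns: H1DA2 = 0, 1, 2, 3, 6, 8).
row_codes = [0, 1, 6, 8]
col_codes = [0, 1, 2, 3, 6, 8]
observed = numpy.array([[703, 1035, 650, 657, 0, 1],
                        [709, 1127, 789, 821, 0, 1],
                        [1, 0, 0, 0, 2, 0],
                        [3, 1, 0, 1, 0, 3]], dtype=float)

chi2, p_value, dof, expected = scipy.stats.chi2_contingency(observed)

# Rule of thumb (an assumption, not from the post): the chi-square
# approximation is unreliable when expected counts fall below 5.
small_cells = int((expected < 5).sum())
print(f"{small_cells} of {expected.size} cells have an expected count below 5")

# Contribution of each cell to the chi-square statistic
contributions = (observed - expected) ** 2 / expected
order = numpy.argsort(contributions, axis=None)[::-1]
for flat_index in order[:2]:
    i, j = numpy.unravel_index(flat_index, contributions.shape)
    share = contributions[i, j] / chi2
    print(f"H1GI2={row_codes[i]}, H1DA2={col_codes[j]}: "
          f"{contributions[i, j]:.1f} ({share:.1%} of chi-square)")

top_two_share = contributions.flatten()[order[:2]].sum() / chi2
```

With these counts, the two cells where both answers are "refused" or both are "don't know" carry almost the whole statistic, so the near-zero p-value reflects a handful of respondents and not the 0 to 3 answers.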

So we reject H0 and conclude that there is a strong dependence between still living in the childhood home and daily activities such as hobbies. Also, according to the post hoc comparisons, the respondents fall into two distinct subgroups with respect to their recent hobbies: those who give a substantive response x in [0 = not at all, ..., 3 = 5 or more times] and those who do not (refused, don't know).
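
A post hoc comparison is often run pairwise with a Bonferroni-adjusted significance level. The sketch below shows that variant under stated assumptions; it is not the cumulative-subgroup procedure used in this post. It keeps only the "no" and "yes" rows of H1GI2 and the substantive hobby codes 0 to 3, with counts copied from the output above.

```python
import itertools
import numpy
import scipy.stats

# Observed counts for the substantive answers only, copied from the output above
# (rows: H1GI2 = 0 "no", 1 "yes"; columns: H1DA2 = 0, 1, 2, 3).
hobby_codes = [0, 1, 2, 3]
observed = numpy.array([[703, 1035, 650, 657],
                        [709, 1127, 789, 821]])

pairs = list(itertools.combinations(range(len(hobby_codes)), 2))

# Bonferroni adjustment: divide the significance level by the number of comparisons
alpha = 0.05
adjusted_alpha = alpha / len(pairs)

results = []
for a, b in pairs:
    # 2x2 table for one pair of hobby categories
    sub_table = observed[:, [a, b]]
    chi2, p_value, dof, _ = scipy.stats.chi2_contingency(sub_table)
    results.append((hobby_codes[a], hobby_codes[b], chi2, p_value,
                    p_value < adjusted_alpha))

print(f"adjusted alpha = {adjusted_alpha:.4f}")
for code_a, code_b, chi2, p_value, significant in results:
    print(f"H1DA2 {code_a} vs {code_b}: chi2={chi2:.2f}, p={p_value:.4f}, "
          f"significant={significant}")
```

Note that scipy.stats.chi2_contingency applies the Yates continuity correction by default on 2x2 tables; passing correction=False turns it off.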
