From 2c07be95dbb7caa839acb9d695f434b963193151 Mon Sep 17 00:00:00 2001
From: MarcoRianiUNIPR
Date: Thu, 31 Oct 2024 18:43:22 +0100
Subject: [PATCH] Added 2x2 contingency table indexes to toutput of corrNominal

---
 toolbox/helpfiles/FSDA/corrNominal.html | 702 ++++++++++++++----------
 toolbox/multivariate/corrNominal.m | 120 +++-
 2 files changed, 499 insertions(+), 323 deletions(-)

diff --git a/toolbox/helpfiles/FSDA/corrNominal.html b/toolbox/helpfiles/FSDA/corrNominal.html
index b20ebb34d..51181de89 100644
--- a/toolbox/helpfiles/FSDA/corrNominal.html
+++ b/toolbox/helpfiles/FSDA/corrNominal.html
@@ -1,16 +1,21 @@
- corrNominal

corrNominal

corrNominal measures strength of association between two unordered (nominal) categorical variables.

Syntax

Description

corrNominal computes $\chi2$, $\Phi$, Cramer's $V$, Goodman-Kruskal's - $\lambda_{y|x}$, Goodman-Kruskal's $\tau_{y|x}$, and Theil's $H_{y|x}$ - (uncertainty coefficient).

- All these indexes measure the association among two unordered qualitative - variables.

- Additional details about these indexes can be found in the "More About" - section or in the "Output section" of this document.

example

out =corrNominal(N) corrNominal with all the default options.

example

out =corrNominal(N, Name, Value) Example of option conflev.

Examples

expand all

  • corrNominal with all the default options.
  • - Rows of N indicate type of Bachelor degree: - 'Economics' 'Law' 'Literature' - Columns of N indicate employment type: - 'Private_firm' 'Public_firm' 'Freelance' 'Unemployed'

    N=[150	80	20	50
    -80	250	30	140
    -30	50	0	120];
    + 
    
    
    

    corrNominal

    corrNominal measures strength of association between two unordered (nominal) categorical variables.

    Syntax

    Description

    corrNominal computes $\chi^2$, $\Phi$, Cramer's $V$, Goodman-Kruskal's + $\lambda_{y|x}$, Goodman-Kruskal's $\tau_{y|x}$, and Theil's $H_{y|x}$ + (uncertainty coefficient).

    + All these indexes measure the association between two unordered qualitative + variables.

    + If the input table is 2-by-2, the indexes theta (cross product ratio), + Q=(theta-1)/(theta+1) and U=(sqrt(theta)-1)/(sqrt(theta)+1) + are also computed. + Additional details about these indexes can be found in the "More About" + section or in the "Output Arguments" section of this document.

    example

    out =corrNominal(N) corrNominal with all the default options.

    example

    out =corrNominal(N, Name, Value) Example of option conflev.

    Examples

    expand all

  • corrNominal with all the default options.
  • + Rows of N indicate type of Bachelor degree: + 'Economics' 'Law' 'Literature' + Columns of N indicate employment type: + 'Private_firm' 'Public_firm' 'Freelance' 'Unemployed'

    N=[150	80	20	50
    +80	250	30	140
    +30	50	0	120];
     out=corrNominal(N);
    Chi2 index
       221.2405
     
    @@ -42,10 +47,10 @@
         tauyx         0.091674      0.013524       0.065168    0.11818 
         Hyx            0.08716      0.011265       0.065082    0.10924 
     
    -

  • Example of option conflev.
  • - Use data from Goodman Kruskal (1954).

    N=[1768   807    189 47
    -946   1387    746 53
    -115    438    288 16];
    +

  • Example of option conflev.
  • + Use data from Goodman Kruskal (1954).

    N=[1768   807    189 47
    +946   1387    746 53
    +115    438    288 16];
     out=corrNominal(N,'conflev',0.99);
    Chi2 index
        1.0735e+03
     
    @@ -77,65 +82,65 @@
         tauyx         0.080883      0.0046282      0.068962    0.092805
         Hyx           0.075341      0.0041619      0.064621    0.086061
     
    -

    Related Examples

    expand all

  • corrNominal with option dispresults.
  • N=[ 6 14 17 9;
    -30 32 17 3];
    -out=corrNominal(N,'dispresults',false);

  • Example which starts from the original data matrix.
  • N=[26 26 23 18 9;
    -6  7  9 14 23];
    -% From the contingency table reconstruct the original data matrix.
    -n11=N(1,1); n12=N(1,2); n13=N(1,3); n14=N(1,4); n15=N(1,5);
    -n21=N(2,1); n22=N(2,2); n23=N(2,3); n24=N(2,4); n25=N(2,5);
    -x11=[1*ones(n11,1) 1*ones(n11,1)];
    -x12=[1*ones(n12,1) 2*ones(n12,1)];
    -x13=[1*ones(n13,1) 3*ones(n13,1)];
    -x14=[1*ones(n14,1) 4*ones(n14,1)];
    -x15=[1*ones(n15,1) 5*ones(n15,1)];
    -x21=[2*ones(n21,1) 1*ones(n21,1)];
    -x22=[2*ones(n22,1) 2*ones(n22,1)];
    -x23=[2*ones(n23,1) 3*ones(n23,1)];
    -x24=[2*ones(n24,1) 4*ones(n24,1)];
    -x25=[2*ones(n25,1) 5*ones(n25,1)];
    -% X original data matrix (in this case an array)
    -X=[x11; x12; x13; x14; x15; x21; x22; x23; x24; x25];
    -out=corrNominal(X,'datamatrix',true);

  • Example of option datamatrix combined with X defined as table.
  • - Initial contingency matrix (2D array).

    N=[75   126
    -76   203
    -40   129
    -36   125
    -24   110
    -41   222
    -19   141];
    -% Labels of the contingency matrix
    -Party={'ACTIVIST DEMOCRATIC', 'DEMOCRATIC', ...
    -'SIMPATIZING DEMOCRATIC', 'INDEPENDENT', ...
    -'LIKING REPUBLICAN', 'REPUBLICAN', ...
    -'ACTIVIST REPUBLICAN'};
    -DeathPenalty={'AGAINST' 'FAVORABLE'};
    -Ntable=array2table(N,'RowNames',Party,'VariableNames',DeathPenalty);
    -% From the contingency table reconstruct the original data matrix now
    -% using FSDA function
    -% The output is a cell arrary
    -Xcell=crosstab2datamatrix(Ntable);
    -Xtable=cell2table(Xcell);
    -% call function corrNominal using first argument as input data matrix
    -% in table format and option datamatrix set to true
    -out=corrNominal(Xtable,'datamatrix',true);

  • Example: compare confidence interval for Cramer V.
  • - Use the 4 possible methods

    method={'ncchisq', 'ncchisqadj', 'fisher' 'fisheradj'};
    -% Use a contingency table referred to type of job vs wine delivery
    -rownam={'Butcher' 'Carpenter' 'Carter' 'Farmer' 'Hunter' 'Miller' 'Taylor'};
    -colnam={'Wine not delivered' 'Wine delivered'};
    -N=[85 9
    -214  56
    -212  19
    -100  17
    -139  15
    -109  16
    -172  29];
    -Ntable=array2table(N,'RowNames',rownam,'VariableNames',colnam);
    -ConfintV=zeros(4,2);
    -for i=1:4
    -out=corrNominal(Ntable,'conflimMethodCramerV',method{i});
    -ConfintV(i,:)=out.ConfLimtable{'CramerV',3:4};
    -end
    +

    Related Examples

    expand all

  • corrNominal with option dispresults.
  • N=[ 6 14 17 9;
    +30 32 17 3];
    +out=corrNominal(N,'dispresults',false);

  • Example which starts from the original data matrix.
  • N=[26 26 23 18 9;
    +6  7  9 14 23];
    +% From the contingency table reconstruct the original data matrix.
    +n11=N(1,1); n12=N(1,2); n13=N(1,3); n14=N(1,4); n15=N(1,5);
    +n21=N(2,1); n22=N(2,2); n23=N(2,3); n24=N(2,4); n25=N(2,5);
    +x11=[1*ones(n11,1) 1*ones(n11,1)];
    +x12=[1*ones(n12,1) 2*ones(n12,1)];
    +x13=[1*ones(n13,1) 3*ones(n13,1)];
    +x14=[1*ones(n14,1) 4*ones(n14,1)];
    +x15=[1*ones(n15,1) 5*ones(n15,1)];
    +x21=[2*ones(n21,1) 1*ones(n21,1)];
    +x22=[2*ones(n22,1) 2*ones(n22,1)];
    +x23=[2*ones(n23,1) 3*ones(n23,1)];
    +x24=[2*ones(n24,1) 4*ones(n24,1)];
    +x25=[2*ones(n25,1) 5*ones(n25,1)];
    +% X original data matrix (in this case an array)
    +X=[x11; x12; x13; x14; x15; x21; x22; x23; x24; x25];
    +out=corrNominal(X,'datamatrix',true);

  • Example of option datamatrix combined with X defined as table.
  • + Initial contingency matrix (2D array).

    N=[75   126
    +76   203
    +40   129
    +36   125
    +24   110
    +41   222
    +19   141];
    +% Labels of the contingency matrix
    +Party={'ACTIVIST DEMOCRATIC', 'DEMOCRATIC', ...
    +'SIMPATIZING DEMOCRATIC', 'INDEPENDENT', ...
    +'LIKING REPUBLICAN', 'REPUBLICAN', ...
    +'ACTIVIST REPUBLICAN'};
    +DeathPenalty={'AGAINST' 'FAVORABLE'};
    +Ntable=array2table(N,'RowNames',Party,'VariableNames',DeathPenalty);
    +% From the contingency table reconstruct the original data matrix now
    +% using FSDA function
    +% The output is a cell arrary
    +Xcell=crosstab2datamatrix(Ntable);
    +Xtable=cell2table(Xcell);
    +% call function corrNominal using first argument as input data matrix
    +% in table format and option datamatrix set to true
    +out=corrNominal(Xtable,'datamatrix',true);

  • Example: compare confidence interval for Cramer V.
  • + Use the 4 possible methods

    method={'ncchisq', 'ncchisqadj', 'fisher' 'fisheradj'};
    +% Use a contingency table referred to type of job vs wine delivery
    +rownam={'Butcher' 'Carpenter' 'Carter' 'Farmer' 'Hunter' 'Miller' 'Taylor'};
    +colnam={'Wine not delivered' 'Wine delivered'};
    +N=[85 9
    +214  56
    +212  19
    +100  17
    +139  15
    +109  16
    +172  29];
    +Ntable=array2table(N,'RowNames',rownam,'VariableNames',colnam);
    +ConfintV=zeros(4,2);
    +for i=1:4
    +out=corrNominal(Ntable,'conflimMethodCramerV',method{i});
    +ConfintV(i,:)=out.ConfLimtable{'CramerV',3:4};
    +end
     disp(array2table(ConfintV,'RowNames',method,'VariableNames',{'Lower' 'Upper'}))
    Chi2 index
        21.0290
     
    @@ -268,223 +273,328 @@
         fisher        0.076621    0.18818
         fisheradj     0.076676    0.18824
     
    -

    Input Arguments

    expand all

    N — Contingency table (default) or n-by-2 input dataset. Matrix or Table.

    Matrix or table which contains the input contingency - table (say of size I-by-J) or the original data matrix.

    - In this last case N=crosstab(N(:,1),N(:,2)). As default - procedure assumes that the input is a contingency table.

    - If N is a data matrix (supplied as a a n-by-2 cell array - of strings, or n-by-2 array or n-by-2 table) optional - input datamatrix must be set to true.

    Data Types: single| double

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: 'NoStandardErrors',true -, 'dispresults',false -, 'Lr',{'a' 'b' 'c'} -, 'Lc',{'c1' c2' 'c3' 'c4'} -, 'datamatrix',true -, 'conflev',0.99 -, 'conflimMethodCramerV','fisheradj' -

    NoStandardErrors —Just indexes without standard errors and p-values.boolean.

    if NoStandardErrors is true just the indexes are computed - without standard errors and p-values. That is no - inferential measure is given. The default value of - NoStandardErrors is false.

    -

    Example: 'NoStandardErrors',true -

    Data Types: Boolean

    dispresults —Display results on the screen.boolean.

    If dispresults is true (default) it is possible to see on - the screen all the summary results of the analysis.

    -

    Example: 'dispresults',false -

    Data Types: Boolean

    Lr —Vector of row labels.cell.

    Cell containing the labels of the rows of the input - contingency matrix N. This option is unnecessary if N is a - table, because in this case Lr=N.Properties.RowNames;

    -

    Example: 'Lr',{'a' 'b' 'c'} -

    Data Types: cell array of strings

    Lc —Vector of column labels.cell.

    Cell containing the labels of the columns of the input - contingency matrix N. This option is unnecessary if N is a - table, because in this case Lc=N.Properties.VariableNames;

    -

    Example: 'Lc',{'c1' c2' 'c3' 'c4'} -

    Data Types: cell array of strings

    datamatrix —Data matrix or contingency table.boolean.

    If datamatrix is true the first input argument N is forced - to be interpreted as a data matrix, else if the input - argument is false N is treated as a contingency table. The - default value of datamatrix is false, that is the procedure - automatically considers N as a contingency table. In case - datamatrix is true N can be a cell of size n-by-2 - containing the two grouping variables or a numeric array of - size n-by-2 or a table of size n-by-2.

    -

    Example: 'datamatrix',true -

    Data Types: logical

    conflev —Confidence levels to be used to - compute confidence intervals.scalar.

    The default value of conflev is 0.95, that - is 95 per cent confidence intervals - are computed for all the indexes (note that this option is - ignored if NoStandardErrors=true).

    -

    Example: 'conflev',0.99 -

    Data Types: double

    conflimMethodCramerV —method to compute confidence interval for CramerV.character.

    Character which identifies the method to use to compute the - confidence interval for Cramer index. Default value is - 'ncchisq'. Possible values are 'ncchisq', 'ncchisqadj', - 'fisher' or 'fisheradj'; 'ncchisq' uses the non central - chi2. 'ncchisq' uses the non central chi2 adjusted for the - degrees of fredom. 'fisher' uses the Fisher - z-transformation and 'fisheradj' uses the fisher - z-transformation and bias correction.

    -

    Example: 'conflimMethodCramerV','fisheradj' -

    Data Types: character

    Output Arguments

    expand all

    out — description Structure

    Structure which contains the following fields:
    Value Description
    N

    $I$-by-$J$-array containing contingency table - referred to active rows (i.e. referred to the rows which - participated to the fit).

    - The $(i,j)$-th element is equal to $n_{ij}$, - $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The - sum of the elements of out.N is $n$ (the grand - total).

    Ntable

    same as out.N but in table format (with row and - column names).

    - This output is present just if your MATLAB - version is not<2013b.

    Chi2

    scalar containing $\chi^2$ index.

    Chi2pval

    scalar containing pvalue of the $\chi^2$ index.

    Phi

    $\Phi$ index. Phi is a chi-square-based measure of - association that involves dividing the chi-square - statistic by the sample size and taking the square - root of the result. More precisely - \[ - \Phi= \sqrt{ \frac{\chi^2}{n} } - \] - This index lies in the interval $[0 , \sqrt{\min[(I-1),(J-1)]}$.

    CramerV

    1 x 4 vector which contains Cramer's V index, - standard error, z test, and p-value. Cramer'V index - is index $\Phi$ divided by its maximum. More precisely - \[ - V= \sqrt{\frac{\Phi}{\min[(I-1),(J-1)]}}=\sqrt{\frac{\chi^2}{n \min[(I-1),(J-1)]}} - \]

    - The range of Cramer index is [0, 1]. A Cramer's V in - the range of [0, 0.3] is considered as weak, - [0.3,0.7] as medium and > 0.7 as strong.

    - The way in which the confidence interval for - this index is specified in input option conflimMethodCramerV.

    - If conflimMethodCramerV is 'ncchisq', 'ncchisqadj' - we first find a confidence interval for the non - centrality parameter $\Delta$ of the - $\chi^2$ distribution with $df=(I-1)(J-1)$ degrees of - freedom. (see Smithson (2003); pp. 39-41) $[\Delta_L - \Delta_U]$. If input option conflimMethodCramerV is - 'ncchisq', confidence interval for $\Delta$ is - transformed into one for $V$ by the following - transformation -

    \[ - V_L=\sqrt{\frac{\Delta_L }{n \min[(I-1),(J-1)]}} - \] - and - \[ - V_U=\sqrt{\frac{\Delta_U }{n \min[(I-1),(J-1)]}} - \] - If input option conflimMethodCramerV is - 'ncchisqadj', confidence interval for $\Delta$ is - transformed into one for $V$ by the following - transformation - \[ - V_L=\sqrt{\frac{\Delta_L+ df }{n \min[(I-1),(J-1)]}} - \] - and - \[ - V_U=\sqrt{\frac{\Delta_U+ df }{n \min[(I-1),(J-1)]}} - \]

    GKlambdayx

    1 x 4 vector which contains index $\lambda_{y|x}$ - of Goodman and Kruskal standard error, z test, and p-value.

    -

    \[ - \lambda_{y|x} = \sum_{i=1}^I \frac{r_i- r}{n-r} - \] - \[ - r_i =\max(n_{ij}) - \] - \[ - r =\max(n_{.j}) - \]

    tauyx

    1 x 4 vector which contains tau index $\tau_{y|x}$, - standard error, ztest and p-value.

    -

    \[ - \tau_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij}^2/f_{i.} -\sum_{j=1}^J f_{.j}^2 }{1-\sum_{j=1}^J f_{.j}^2 } - \]

    Hyx

    1 x 4 vector which contains the uncertainty - coefficient index (proposed by Theil) $H_{y|x}$, - standard error, ztest and p-value.

    -

    \[ - H_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij} \log( f_{ij}/ (f_{i.}f_{.j}))}{\sum_{j=1}^J f_{.j} \log f_{.j} } - \]

    TestInd

    4-by-4 array containing index values (first column), - standard errors (second column), zscores (third column), - p-values (fourth column).

    TestIndtable

    4-by-4 table containing index values (first column), - standard errors (second column), zscores (third column), - p-values (fourth column).

    - This output is present just if your MATLAB - version is not<2013b.

    ConfLim

    4-by-4 array containing index values (first column), - standard errors (second column), lower confidence limit - (third column), upper confidence limit (fourth column).

    ConfLimtable

    4-by-4 table containing index values (first column), - standard errors (second column), lower confidence limit - (third column), upper confidence limit (fourth column).

    - This output is present just if your MATLAB - version is not<2013b.

    More About

    expand all

    Additional Details

    - $\lambda_{y|x}$ is a measure of association that - reflects the proportional reduction in error when - values of the independent variable (variable in the - rows of the contingency table) are used to predict - values of the dependent variable (variable in the - columns of the contingency table). The range of - $\lambda_{y|x}$ is [0, 1]. A value of 1 - means that the independent variable perfectly - predicts the dependent variable. On the other hand, - a value of 0 means that the independent variable - does not help in predicting the dependent variable.

    - More generally, let $V(y)$ a measure of variation - for the marginal distribution $(f_{.1}=n_{.1}/n, - ..., f_{.J}=n_{.J}/n)$ of the response $y$ and let - $V(y|i)$ denote the same measure computed for the - conditional distribution $(f_{1|i}=n_{1|i}/n, ..., - f_{J|i}=n_{J|i}/n)$ of $y$ at the $i$-th setting of - the explanatory variable $x$. A proportional - reduction in variation measure has the form.

    -

    \[ - \frac{V(y) - E[V(y|x)]}{V(y|x)} - \] - where $E[V(y|x)]$ is the expectation of the - conditional variation taken with respect to the - distribution of $x$. When $x$ is a categorical - variable having marginal distribution, - $(f_{1.}, \ldots, f_{I.})$, - \[ - E[V(y|x)]= \sum_{i=1}^I (n_{i.}/n) V(y|i) = \sum_{i=1}^I f_{i.} V(y|i) - \] - If we take as measure of variation $V(y)$ the Gini coefficient - \[ - V(y)=1 -\sum_{j=1}^J f_{.j} \qquad V(y|i)=1 -\sum_{j=1}^J f_{j|i} - \]

    - we obtain the index of proportional reduction in - variation $\tau_{y|x}$ of Goodman and Kruskal.

    -

    \[ - \tau_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij}^2/f_{i.} -\sum_{j=1}^J f_{.j}^2 }{1-\sum_{j=1}^J f_{.j}^2 } - \] - If, on the other hand, we take as measure of - variation $V(y)$ the entropy index - \[ - V(y)=-\sum_{j=1}^J f_{.j} \log f_{.j} \qquad V(y|i) -\sum_{j=1}^J f_{j|i} \log f_{j|i} - \]

    - we obtain the index $H_{y|x}$, (uncertainty - coefficient of Theil).

    -

    \[ - H_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij} \log( f_{ij}/ (f_{i.}f_{.j}))}{\sum_{j=1}^J f_{.j} \log f_{.j} } - \]

    - The range of $\tau_{y|x}$ and $H_{y|x}$ is [0 1].

    - A large value of - of the index represents a strong association, in - the sense that we can guess $y$ much better when we - know x than when we do not.

    - In other words, $\tau_{y|x}=H_{y|x} =1$ is equivalent to no - conditional variation in the sense that for each - $i$, $n_{j|i}=1$. For example, a value of: - $\tau_{y|x}=0.85$ indicates that knowledge of x - reduces error in predicting values of y by 85 per - cent (when the variation measure which is used is - the Gini's index).

    - $H_{y|x}=0.85$ indicates that - knowledge of x reduces error in predicting values - of y by 85 per cent (when variation measure which - is used is the entropy index) - -

    References

    Agresti, A. (2002), "Categorical Data Analysis", John Wiley & Sons. [pp.

    - 23-26]

    Goodman, L.A. and Kruskal, W.H. (1959), Measures of association for - cross classifications II: Further Discussion and References, - "Journal of the American Statistical Association", Vol. 54, pp. 123-163.

    Goodman, L.A. and Kruskal, W.H. (1963), Measures of association for - cross classifications III: Approximate Sampling Theory, - "Journal of the American Statistical Association", Vol. 58, pp. 310-364.

    Goodman, L.A. and Kruskal, W.H. (1972), Measures of association for - cross classifications IV: Simplification of Asymptotic - Variances, "Journal of the American Statistical Association", Vol. 67, - pp. 415-421.

    Liebetrau, A.M. (1983), "Measures of Association", Sage University Papers - Series on Quantitative Applications in the Social Sciences, 07-004, - Newbury Park, CA: Sage. [pp. 49-56]

    Smithson, M.J. (2003), "Confidence Intervals", Quantitative Applications - in the Social Sciences Series, No. 140. Thousand Oaks, CA: Sage. [pp.

    - 39-41]

    Acknowledgements

    - - -

    This page has been automatically generated by our routine publishFS
    \ No newline at end of file +

  • corrNominal when the input is a 2-by-2 table. + The indexes theta (cross product ratio), + Q and U are also computed.
  • % X=advertisment memory (rows)
    +% Y=product purchase (columns)
    +N= [87 188;
    +42 406];
    +nam=["Yes" "No"];
    +Ntable=array2table(N,"RowNames",nam,"VariableNames",nam);
    +disp('Input 2x2 contingency table')
    +table(Ntable,RowNames=["X=advertisment memory" "advertisment memory "],VariableNames="Y=Product purchase")
    +out=corrNominal(Ntable)
    Input 2x2 contingency table
    +
    +ans =
    +
    +  2×1 table
    +
    +                             Y=Product purchase
    +                             __________________
    +
    +                                    Yes    No  
    +                                    ___    ___ 
    +                                               
    +    X=advertisment memory    Yes    87     188 
    +    advertisment memory      No     42     406 
    +
    +Chi2 index
    +   57.6071
    +
    +pvalue Chi2 index
    +   3.2006e-14
    +
    +Phi index
    +    0.2823
    +
    +Cramer's V 
    +    0.2823
    +
    +-------------------------------
    +2x2 contingency table indexes
    +th=cross product ratio
    +    4.4734
    +
    +Cross product ratio in the interval [-1 1]. Index Q=(th-1)/(th+1)
    +    0.6346
    +
    +Cross product ratio in the interval [-1 1]. Index U=(sqrt(th)-1)/(sqrt(th)+1)
    +    0.3580
    +
    +-------------------------------
    +Test of H_0: independence between rows and columns
    +                   Coeff         se       zscore       pval   
    +                  ________    ________    ______    __________
    +
    +    CramerV        0.28227    0.037189    7.5902    3.1974e-14
    +    GKlambdayx           0           0       NaN           NaN
    +    tauyx         0.079678    0.020787    3.8331    0.00012653
    +    Hyx           0.082782    0.021327    3.8816    0.00010376
    +
    +-----------------------------------------
    +Indexes and 95% confidence limits
    +                   Value      StandardError    ConflimL    ConflimU
    +                  ________    _____________    ________    ________
    +
    +    CramerV        0.28227      0.037189        0.20938    0.35516 
    +    GKlambdayx           0             0              0          0 
    +    tauyx         0.079678      0.020787       0.038937    0.12042 
    +    Hyx           0.082782      0.021327       0.040983    0.12458 
    +
    +
    +out = 
    +
    +  struct with fields:
    +
    +               N: [2×2 double]
    +          Ntable: [2×2 table]
    +            Chi2: 57.6071
    +        Chi2pval: 3.2006e-14
    +             Phi: 0.2823
    +         CramerV: [0.2823 0.0372 7.5902 3.1974e-14]
    +      GKlambdayx: [0 0 NaN NaN]
    +           tauyx: [0.0797 0.0208 3.8331 1.2653e-04]
    +             Hyx: [0.0828 0.0213 3.8816 1.0376e-04]
    +         ConfLim: [4×4 double]
    +    ConfLimtable: [4×4 table]
    +         TestInd: [4×4 double]
    +    TestIndtable: [4×4 table]
    +           theta: 4.4734
    +               Q: 0.6346
    +               U: 0.3580
    +
    +
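The three 2x2 indexes printed above can be checked by hand. The following is a minimal sketch, assuming that the cross product ratio is the usual odds ratio n11*n22/(n12*n21); the variable names are illustrative and are not part of corrNominal.

```matlab
% Contingency table used in the 2-by-2 example
N = [87 188;
     42 406];

% Cross product ratio (assumed definition: n11*n22/(n12*n21))
theta = (N(1,1)*N(2,2)) / (N(1,2)*N(2,1));

% Rescalings of theta to the interval [-1 1]
Q = (theta-1) / (theta+1);
U = (sqrt(theta)-1) / (sqrt(theta)+1);

fprintf('theta=%.4f  Q=%.4f  U=%.4f\n', theta, Q, U);
```

For this table the sketch gives theta=4.4734, Q=0.6346 and U=0.3580, in agreement with the fields out.theta, out.Q and out.U shown in the example output.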

    Input Arguments

    expand all

    N — Contingency table (default) or n-by-2 input dataset. Matrix or Table.

    Matrix or table which contains the input contingency + table (say of size I-by-J) or the original data matrix.

    + In this last case N=crosstab(N(:,1),N(:,2)). By default the + procedure assumes that the input is a contingency table.

    + If N is a data matrix (supplied as an n-by-2 cell array + of strings, an n-by-2 array or an n-by-2 table), optional + input datamatrix must be set to true.

    Data Types: single| double

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: 'NoStandardErrors',true +, 'dispresults',false +, 'Lr',{'a' 'b' 'c'} +, 'Lc',{'c1' 'c2' 'c3' 'c4'} +, 'datamatrix',true +, 'conflev',0.99 +, 'conflimMethodCramerV','fisheradj' +

    NoStandardErrors — Just indexes without standard errors and p-values. Boolean.

    If NoStandardErrors is true, only the indexes are computed, + without standard errors and p-values; that is, no + inferential measure is given. The default value of + NoStandardErrors is false.

    +

    Example: 'NoStandardErrors',true +

    Data Types: Boolean

    dispresults — Display results on the screen. Boolean.

    If dispresults is true (default), all the summary results of the analysis + are displayed on the screen.

    +

    Example: 'dispresults',false +

    Data Types: Boolean

    Lr — Vector of row labels. Cell.

    Cell containing the labels of the rows of the input + contingency matrix N. This option is unnecessary if N is a + table, because in this case Lr=N.Properties.RowNames.

    +

    Example: 'Lr',{'a' 'b' 'c'} +

    Data Types: cell array of strings

    Lc — Vector of column labels. Cell.

    Cell containing the labels of the columns of the input + contingency matrix N. This option is unnecessary if N is a + table, because in this case Lc=N.Properties.VariableNames.

    +

    Example: 'Lc',{'c1' 'c2' 'c3' 'c4'} +

    Data Types: cell array of strings

    datamatrix — Data matrix or contingency table. Boolean.

    If datamatrix is true, the first input argument N is forced + to be interpreted as a data matrix; if it + is false, N is treated as a contingency table. The + default value of datamatrix is false, that is, the procedure + automatically considers N as a contingency table. If + datamatrix is true, N can be a cell of size n-by-2 + containing the two grouping variables, a numeric array of + size n-by-2 or a table of size n-by-2.

    +

    Example: 'datamatrix',true +

    Data Types: logical

    conflev — Confidence level to be used to + compute confidence intervals. Scalar.

    The default value of conflev is 0.95, that + is, 95 per cent confidence intervals + are computed for all the indexes (note that this option is + ignored if NoStandardErrors=true).

    +

    Example: 'conflev',0.99 +

    Data Types: double

    conflimMethodCramerV — Method to compute confidence interval for CramerV. Character.

    Character which identifies the method to use to compute the + confidence interval for Cramer's V index. Default value is + 'ncchisq'. Possible values are 'ncchisq', 'ncchisqadj', + 'fisher' or 'fisheradj'. 'ncchisq' uses the non central + chi2. 'ncchisqadj' uses the non central chi2 adjusted for the + degrees of freedom. 'fisher' uses the Fisher + z-transformation and 'fisheradj' uses the Fisher + z-transformation and bias correction.

    +

    Example: 'conflimMethodCramerV','fisheradj' +

    Data Types: character

    Output Arguments

    expand all

    out — description Structure

    Structure which contains the following fields:
    Value Description
    N

    $I$-by-$J$ array containing the contingency table + referred to active rows (i.e. referred to the rows which + took part in the fit).

    + The $(i,j)$-th element is equal to $n_{ij}$, + $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The + sum of the elements of out.N is $n$ (the grand + total).

    Ntable

    same as out.N but in table format (with row and + column names).

    + This output is present only if your MATLAB + version is R2013b or later.

    Chi2

    scalar containing $\chi^2$ index.

    Chi2pval

    scalar containing p-value of the $\chi^2$ index.

    Phi

    $\Phi$ index. Phi is a chi-square-based measure of + association that involves dividing the chi-square + statistic by the sample size and taking the square + root of the result. More precisely + \[ + \Phi= \sqrt{ \frac{\chi^2}{n} } + \] + This index lies in the interval $[0 , \sqrt{\min[(I-1),(J-1)]}]$.

    CramerV

    1 x 4 vector which contains Cramer's V index, + standard error, z test, and p-value. Cramer's V index + is index $\Phi$ divided by its maximum. More precisely + \[ + V= \sqrt{\frac{\Phi^2}{\min[(I-1),(J-1)]}}=\sqrt{\frac{\chi^2}{n \min[(I-1),(J-1)]}} + \]

    + The range of Cramer index is [0, 1]. A Cramer's V in + the range of [0, 0.3] is considered as weak, + [0.3,0.7] as medium and > 0.7 as strong.
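As a numerical check of the definitions of $\Phi$ and $V$, the sketch below recomputes both from the 2-by-2 table of the last example. It assumes the Pearson chi-square statistic without continuity correction; the variable names are illustrative and are not taken from corrNominal.

```matlab
% 2-by-2 table of the last example
N = [87 188;
     42 406];
n = sum(N(:));                       % grand total

% Expected counts under independence: (row totals)*(column totals)/n
E = sum(N,2) * sum(N,1) / n;

% Pearson chi-square (no continuity correction assumed)
chi2 = sum(sum((N - E).^2 ./ E));

% Phi and Cramer's V
Phi = sqrt(chi2 / n);
V   = sqrt(chi2 / (n * min(size(N) - 1)));

fprintf('chi2=%.4f  Phi=%.4f  V=%.4f\n', chi2, Phi, V);
```

For this table min[(I-1),(J-1)]=1, so Phi and V coincide; the values obtained (chi2=57.6071, Phi=V=0.2823) agree with those shown in the example output.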

    + The way in which the confidence interval for + this index is computed is specified in input option conflimMethodCramerV.

    + If conflimMethodCramerV is 'ncchisq' or 'ncchisqadj', + we first find a confidence interval $[\Delta_L, \Delta_U]$ for the non + centrality parameter $\Delta$ of the + $\chi^2$ distribution with $df=(I-1)(J-1)$ degrees of + freedom (see Smithson (2003), pp. 39-41). If input option conflimMethodCramerV is + 'ncchisq', the confidence interval for $\Delta$ is + transformed into one for $V$ by the following + transformation +

    \[ + V_L=\sqrt{\frac{\Delta_L }{n \min[(I-1),(J-1)]}} + \] + and + \[ + V_U=\sqrt{\frac{\Delta_U }{n \min[(I-1),(J-1)]}} + \] + If input option conflimMethodCramerV is + 'ncchisqadj', confidence interval for $\Delta$ is + transformed into one for $V$ by the following + transformation + \[ + V_L=\sqrt{\frac{\Delta_L+ df }{n \min[(I-1),(J-1)]}} + \] + and + \[ + V_U=\sqrt{\frac{\Delta_U+ df }{n \min[(I-1),(J-1)]}} + \]
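The two transformations can be sketched as follows. This is only an illustration of the idea and not necessarily the way corrNominal implements it: it assumes that the limits for $\Delta$ are found by inverting the noncentral chi-square distribution function with fzero, it relies on chi2cdf and ncx2cdf from the Statistics and Machine Learning Toolbox, and the upper end of the search bracket is a heuristic choice. The input values are those of the 2-by-2 example.

```matlab
% Inputs taken from the 2-by-2 example (chi2 and n); conflev is the default
chi2 = 57.6071;  n = 723;  I = 2;  J = 2;
conflev = 0.95;

df    = (I-1)*(J-1);
k     = min(I-1, J-1);
alpha = 1 - conflev;
hi    = 10*chi2 + 100;   % heuristic upper end of the search bracket (assumption)

% Probability of a value <= chi2 under the central chi-square (Delta = 0)
p0 = chi2cdf(chi2, df);

% Lower limit: Delta_L solves P(X <= chi2 | Delta_L) = 1 - alpha/2 (0 if no root)
DeltaL = 0;
if p0 > 1 - alpha/2
    DeltaL = fzero(@(d) ncx2cdf(chi2, df, d) - (1 - alpha/2), [0 hi]);
end

% Upper limit: Delta_U solves P(X <= chi2 | Delta_U) = alpha/2 (0 if no root)
DeltaU = 0;
if p0 > alpha/2
    DeltaU = fzero(@(d) ncx2cdf(chi2, df, d) - alpha/2, [0 hi]);
end

% 'ncchisq' transformation
VL = sqrt(DeltaL / (n*k));
VU = sqrt(DeltaU / (n*k));

% 'ncchisqadj' transformation (adds the degrees of freedom)
VLadj = sqrt((DeltaL + df) / (n*k));
VUadj = sqrt((DeltaU + df) / (n*k));
```

The adjusted limits are always at least as large as the unadjusted ones, because df is added to both $\Delta_L$ and $\Delta_U$ before taking the square root.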

    GKlambdayx

    1 x 4 vector which contains index $\lambda_{y|x}$ + of Goodman and Kruskal, standard error, z test, and p-value.

    +

    \[ + \lambda_{y|x} = \frac{\sum_{i=1}^I r_i- r}{n-r} + \] + \[ + r_i =\max_j(n_{ij}) + \] + \[ + r =\max_j(n_{.j}) + \]
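A small sketch of the computation of $\lambda_{y|x}$ on the 2-by-2 table of the last example is given below. Here $r_i$ is taken as the largest count in row $i$ and $r$ as the largest column total; the variable names are illustrative.

```matlab
% 2-by-2 table of the last example
N = [87 188;
     42 406];
n = sum(N(:));

ri = max(N, [], 2);        % largest count in each row
r  = max(sum(N, 1));       % largest column total

% Proportional reduction in prediction error
lambda = (sum(ri) - r) / (n - r);
```

In this table the modal column is the same in both rows, so sum(ri) equals r and lambda is 0, which is the value of GKlambdayx reported in the example output.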

    tauyx

    1 x 4 vector which contains tau index $\tau_{y|x}$, + standard error, z test and p-value.

    +

    \[ + \tau_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij}^2/f_{i.} -\sum_{j=1}^J f_{.j}^2 }{1-\sum_{j=1}^J f_{.j}^2 } + \]

    Hyx

    1 x 4 vector which contains the uncertainty + coefficient index (proposed by Theil) $H_{y|x}$, + standard error, z test and p-value.

    +

    \[ + H_{y|x}= -\frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij} \log( f_{ij}/ (f_{i.}f_{.j}))}{\sum_{j=1}^J f_{.j} \log f_{.j} } + \]
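The indexes $\tau_{y|x}$ and $H_{y|x}$ can be recomputed directly from the relative frequencies. The sketch below uses the 2-by-2 table of the last example; it takes $H_{y|x}$ as the ratio between the mutual information and the entropy of the column margin (so that the index is non-negative) and skips empty cells, which is an assumption about how zero counts are handled. The variable names are illustrative.

```matlab
% 2-by-2 table of the last example
N = [87 188;
     42 406];
n  = sum(N(:));
F  = N / n;               % relative frequencies f_ij
fi = sum(F, 2);           % row margins f_i.    (I-by-1)
fj = sum(F, 1);           % column margins f_.j (1-by-J)

% Goodman-Kruskal tau: reduction in Gini variation of y given x
tau = (sum(sum(F.^2, 2) ./ fi) - sum(fj.^2)) / (1 - sum(fj.^2));

% Theil uncertainty coefficient: reduction in entropy of y given x
E    = fi * fj;           % products f_i.*f_.j
mask = F > 0;             % skip empty cells (assumed convention 0*log(0)=0)
H    = -sum(F(mask) .* log(F(mask) ./ E(mask))) / sum(fj .* log(fj));

fprintf('tau=%.4f  H=%.4f\n', tau, H);
```

For this table the sketch gives tau=0.0797 and H=0.0828, matching the first elements of out.tauyx and out.Hyx in the example output.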

    TestInd

    4-by-4 array containing index values (first column), + standard errors (second column), zscores (third column), + p-values (fourth column).

    TestIndtable

    4-by-4 table containing index values (first column), + standard errors (second column), zscores (third column), + p-values (fourth column).

    + This output is present only if your MATLAB + version is R2013b or later.

    ConfLim

    4-by-4 array containing index values (first column), + standard errors (second column), lower confidence limit + (third column), upper confidence limit (fourth column).

    ConfLimtable

    4-by-4 table containing index values (first column), standard errors (second column), lower confidence limit (third column), upper confidence limit (fourth column).

    + This output is present only if your MATLAB version is R2013b or later.

    theta

    Cross product ratio $\theta=n_{11}n_{22}/(n_{12}n_{21})$. This index is computed only if the input table is 2-by-2.

    Q

    Cross product ratio rescaled to the interval [-1 1] using the Q rescaling Q=(th-1)/(th+1), where th is the cross product ratio theta. This index is computed only if the input table is 2-by-2.

    U

    Cross product ratio rescaled to the interval [-1 1] using the U rescaling U=(sqrt(th)-1)/(sqrt(th)+1), where th is the cross product ratio theta. This index is computed only if the input table is 2-by-2.
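
The three 2-by-2 indexes can be reproduced by hand with a few lines. The sketch below uses the advertisement memory by product purchase table from the 2-by-2 example of corrNominal; it is an illustration of the formulas, not a replacement for calling the function.

```matlab
% Rows: advertisement memory (Yes/No); columns: product purchase (Yes/No)
N  = [ 87 188
       42 406];
th = N(1,1)*N(2,2) / (N(1,2)*N(2,1));     % cross product ratio
Q  = (th - 1) / (th + 1);                 % Q rescaling
U  = (sqrt(th) - 1) / (sqrt(th) + 1);     % U rescaling
disp([th Q U])
```

Both rescalings map th=1 (independence) to 0; values of th above 1 give positive indexes and values below 1 give negative ones. When th>1, as in this table, U is smaller than Q.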

    More About

    expand all

    Additional Details

    + $\lambda_{y|x}$ is a measure of association that reflects the proportional reduction in error when values of the independent variable (variable in the rows of the contingency table) are used to predict values of the dependent variable (variable in the columns of the contingency table). The range of $\lambda_{y|x}$ is [0, 1]. A value of 1 means that the independent variable perfectly predicts the dependent variable. On the other hand, a value of 0 means that the independent variable does not help in predicting the dependent variable.

    + More generally, let $V(y)$ be a measure of variation for the marginal distribution $(f_{.1}=n_{.1}/n, ..., f_{.J}=n_{.J}/n)$ of the response $y$ and let $V(y|i)$ denote the same measure computed for the conditional distribution $(f_{1|i}=n_{i1}/n_{i.}, ..., f_{J|i}=n_{iJ}/n_{i.})$ of $y$ at the $i$-th setting of the explanatory variable $x$. A proportional reduction in variation measure has the form

    +

    \[ \frac{V(y) - E[V(y|x)]}{V(y)} \] where $E[V(y|x)]$ is the expectation of the conditional variation taken with respect to the distribution of $x$. When $x$ is a categorical variable having marginal distribution $(f_{1.}, \ldots, f_{I.})$, \[ E[V(y|x)]= \sum_{i=1}^I (n_{i.}/n) V(y|i) = \sum_{i=1}^I f_{i.} V(y|i) \] If we take as measure of variation $V(y)$ the Gini index \[ V(y)=1 -\sum_{j=1}^J f_{.j}^2 \qquad V(y|i)=1 -\sum_{j=1}^J f_{j|i}^2 \]

    + we obtain the index of proportional reduction in variation $\tau_{y|x}$ of Goodman and Kruskal.

    +

    \[ \tau_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij}^2/f_{i.} -\sum_{j=1}^J f_{.j}^2 }{1-\sum_{j=1}^J f_{.j}^2 } \] If, on the other hand, we take as measure of variation $V(y)$ the entropy index \[ V(y)=-\sum_{j=1}^J f_{.j} \log f_{.j} \qquad V(y|i) = -\sum_{j=1}^J f_{j|i} \log f_{j|i} \]

    + we obtain the index $H_{y|x}$ (uncertainty coefficient of Theil).

    +

    \[ H_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij} \log( f_{ij}/ (f_{i.}f_{.j}))}{-\sum_{j=1}^J f_{.j} \log f_{.j} } \]

    + The range of $\tau_{y|x}$ and $H_{y|x}$ is [0 1].
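
The link between the general proportional reduction in variation form and the two indexes can be checked numerically. In the sketch below the variation measure is passed as a function handle: the Gini index gives $\tau_{y|x}$ and the entropy index gives $H_{y|x}$. The helper names gini, shannon and prv are hypothetical and are not part of FSDA.

```matlab
N = [150  80 20  50
      80 250 30 140
      30  50  0 120];
F     = N / sum(N(:));
fi    = sum(F, 2);                     % row marginals f_{i.}
fj    = sum(F, 1);                     % column marginals f_{.j}
Fcond = bsxfun(@rdivide, F, fi);       % row i holds f_{j|i}, j=1,...,J

% Two measures of variation for a probability vector p
gini    = @(p) 1 - sum(p.^2);
shannon = @(p) -sum(p(p > 0) .* log(p(p > 0)));

% (V(y) - E[V(y|x)]) / V(y), with the expectation weighted by f_{i.}
prv = @(V) (V(fj) - sum(fi' .* arrayfun(@(i) V(Fcond(i,:)), 1:size(F,1)))) / V(fj);

tauPRV = prv(gini);
HPRV   = prv(shannon);

% Closed form of tau for comparison
tauClosed = (sum(sum(bsxfun(@rdivide, F.^2, fi))) - sum(fj.^2)) / (1 - sum(fj.^2));
disp([tauPRV tauClosed HPRV])
```

The first two displayed values agree up to rounding error, which is the algebraic identity behind the closed-form expression of $\tau_{y|x}$.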

    + A large value of the index represents a strong association, in the sense that we can guess $y$ much better when we know $x$ than when we do not.

    + In other words, $\tau_{y|x}=H_{y|x} =1$ is equivalent to no conditional variation, in the sense that for each $i$, $f_{j|i}=1$ for some $j$. For example, a value of $\tau_{y|x}=0.85$ indicates that knowledge of $x$ reduces error in predicting values of $y$ by 85 per cent (when the variation measure used is the Gini index).

    + Similarly, $H_{y|x}=0.85$ indicates that knowledge of $x$ reduces error in predicting values of $y$ by 85 per cent (when the variation measure used is the entropy index). Remark: if the contingency table is of size 2x2, the following indexes are also computed: $\theta$ (cross product ratio), index $Q$

    \[ Q= \frac{\theta-1}{\theta+1} \] and $U$ \[ U= \frac{\sqrt{\theta}-1}{\sqrt{\theta}+1} \]

    References

    Agresti, A. (2002), "Categorical Data Analysis", John Wiley & Sons. [pp. 23-26]

    Goodman, L.A. and Kruskal, W.H. (1959), Measures of association for cross classifications II: Further Discussion and References, "Journal of the American Statistical Association", Vol. 54, pp. 123-163.

    Goodman, L.A. and Kruskal, W.H. (1963), Measures of association for cross classifications III: Approximate Sampling Theory, "Journal of the American Statistical Association", Vol. 58, pp. 310-364.

    Goodman, L.A. and Kruskal, W.H. (1972), Measures of association for cross classifications IV: Simplification of Asymptotic Variances, "Journal of the American Statistical Association", Vol. 67, pp. 415-421.

    Liebetrau, A.M. (1983), "Measures of Association", Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-004, Newbury Park, CA: Sage. [pp. 49-56]

    Smithson, M.J. (2003), "Confidence Intervals", Quantitative Applications in the Social Sciences Series, No. 140. Thousand Oaks, CA: Sage. [pp. 39-41]

    Acknowledgements

    + + +

    This page has been automatically generated by our routine publishFS
    \ No newline at end of file diff --git a/toolbox/multivariate/corrNominal.m b/toolbox/multivariate/corrNominal.m index a8ec1ba09..94334c77f 100644 --- a/toolbox/multivariate/corrNominal.m +++ b/toolbox/multivariate/corrNominal.m @@ -8,6 +8,9 @@ % (uncertainty coefficient). % All these indexes measure the association among two unordered qualitative % variables. +% If the input table is 2-by-2 indexes theta (cross product ratio), +% Q=(theta-1)/(theta+1) and U=Q=(sqrt(theta)-1)/(sqrt(theta)+1) +% are also computed % Additional details about these indexes can be found in the "More About" % section or in the "Output section" of this document. % @@ -76,7 +79,7 @@ % Data Types - double % % conflimMethodCramerV: method to compute confidence interval for CramerV. -% Character. +% Character. % Character which identifies the method to use to compute the % confidence interval for Cramer index. Default value is % 'ncchisq'. Possible values are 'ncchisq', 'ncchisqadj', @@ -129,7 +132,7 @@ % centrality parameter $\Delta$ of the % $\chi^2$ distribution with $df=(I-1)(J-1)$ degrees of % freedom. (see Smithson (2003); pp. 39-41) $[\Delta_L -% \Delta_U]$. If input option conflimMethodCramerV is +% \Delta_U]$. If input option conflimMethodCramerV is % 'ncchisq', confidence interval for $\Delta$ is % transformed into one for $V$ by the following % transformation @@ -140,7 +143,7 @@ % \[ % V_U=\sqrt{\frac{\Delta_U }{n \min[(I-1),(J-1)]}} % \] -% If input option conflimMethodCramerV is +% If input option conflimMethodCramerV is % 'ncchisqadj', confidence interval for $\Delta$ is % transformed into one for $V$ by the following % transformation @@ -189,6 +192,14 @@ % (third column), upper confidence limit (fourth column). % This output is present just if your MATLAB % version is not<2013b. +% out.theta = cross product ratio. This index is computed just +% if the input table is 2-by-2 +% out.Q = cross product ratio in the interval [-1 1] using +% the Q rescaling Q=(th-1)/(th+1). 
This index is computed just +% if the input table is 2-by-2 +% out.U = cross product ratio in the interval [-1 1] using +% the U rescaling U=(sqrt(th)-1)/(sqrt(th)+1). This index is computed just +% if the input table is 2-by-2 % % More About: % @@ -257,7 +268,17 @@ % knowledge of x reduces error in predicting values % of y by 85 per cent (when variation measure which % is used is the entropy index) -% +% Remark: if the contingency table is of size 2x2 the +% following indexes are also computed theta=cross +% product ratio, index $Q$ +% \[ +% Q= \frac{\theta-1}{\theta+1} +% \] +% and $U$ +% \[ +% U= \frac{\sqrt{\theta}-1}{\sqrt{\theta}+1} +% \] +% % See also crosstab, rcontFS, CressieRead, corr, corrOrdinal % % References: @@ -402,6 +423,21 @@ disp(array2table(ConfintV,'RowNames',method,'VariableNames',{'Lower' 'Upper'})) %} +%{ + %% CorrNominal when input is 2 by 2 + % Indexes theta=cross product ratio, + % Q and U are also computed. + % X=advertisment memory (rows) + % Y=product purchase (columns) + N= [87 188; + 42 406]; + nam=["Yes" "No"]; + Ntable=array2table(N,"RowNames",nam,"VariableNames",nam); + disp('Input 2x2 contingency table') + table(Ntable,RowNames=["X=advertisment memory" "advertisment memory "],VariableNames="Y=Product purchase") + out=corrNominal(Ntable) +%} + %% Beginning of code % Check MATLAB version. If it is not smaller than 2013b than output is @@ -449,7 +485,7 @@ dispresults=true; NoStandardErrors=false; conflev=0.95; - conflimMethodCramerV='ncchisq'; +conflimMethodCramerV='ncchisq'; options=struct('Lr',{Lr},'Lc',{Lc},'datamatrix',false,... 'dispresults',dispresults,'NoStandardErrors',NoStandardErrors,... 
@@ -467,7 +503,7 @@ % Check if user options are valid options aux.chkoptions(options,UserOptions) end - + % Write in structure 'options' the options chosen by the user if nargin > 2 for i=1:2:length(varargin) @@ -483,7 +519,7 @@ end % Extract labels for rows and columns -if verMatlab ==0 && (istable(N) || istimetable(N)) +if verMatlab ==0 && (istable(N) || istimetable(N)) Ntable=N; N=table2array(N); else @@ -495,7 +531,7 @@ error('FSDA:CorrNominal:WrongInputOpt','Wrong length of row labels'); end end - + if isempty(Lc) Lc=cellstr(num2str((1:J)')); else @@ -535,6 +571,14 @@ % Cramer index CramerV=Phi/sqrt(min([I-1 J-1])); +if I==2 && J==2 + % theta= cross product ratio + th=N(1,1)*N(2,2)/(N(1,2)*N(2,1)); + % theta nell'intervallo [-1 1], indexes Q and U + Q=(th-1)/(th+1); + U=(sqrt(th)-1)/(sqrt(th)+1); +end + % Goodman and Kruskal lambda ndotjmax=max(ndotj); nidotmax=max(nidot); @@ -576,10 +620,10 @@ setauyx=NaN; ztauyx=NaN; pvaltauyx=NaN; seHyx=NaN; zHyx=NaN; pvalHyx=NaN; else - + % n * sum(N(i, )^2/sum.row[i]) nerrunconditional=n^2- n*sum(N(:).^2 ./ nidotmat(:)); - + nerrconditional= n^2- sum(ndotj.^2); errunconditional = nerrunconditional/(n^2); errconditional = nerrconditional/(n^2); @@ -591,18 +635,18 @@ % vartauCR =vartauCR + N(i,j)*(-2*errunconditional*ndotj(j)/n +errconditional*(2*N(i,j)/nidot(i)-sum( (N(i,:)/nidot(i)).^2)) - f)^2/(n^2 * errconditional^4); % end % end - + Ndivnidot=repmat(sum((N./nidotmat).^2,2),1,J); vartauyx=sum( N(:).* (-2*errunconditional*ndotjmat(:)/n +errconditional*(2*N(:)./nidotmat(:)-Ndivnidot(:) ) - f).^2 )/(n^2 * errconditional^4); setauyx=sqrt(vartauyx); - + % Find standard error for Cramer V % use external routine ncci to find confidence interval for non % centrality parameter of the chi2 distribution df=(I-1)*(J-1); k=min(I,J); - - + + if strcmp(conflimMethodCramerV,'ncchisq') [LoC]=lochi(Chi2,df,conflev); [HoC]=hichi(Chi2,df,conflev); @@ -625,15 +669,15 @@ end ConfIntCramerV(1)=max([0 ConfIntCramerV(1)]); 
ConfIntCramerV(2)=min([1 ConfIntCramerV(2)]); - + % use external routine ncpci to find confidence interval % ncpConfInt=ncpci(Chi2,'X2',df,'confLevel',conflev); % ConfIntCramerV=sqrt((ncpConfInt+df)/(n*(k-1))); - + % Store confidence intervals seCramerV=(CramerV-ConfIntCramerV(1))/talpha; - - + + % The asymptotic variance of Gk index is given by % % \[ @@ -650,7 +694,7 @@ % seqJ=1:J; seqI=1:I; - + % column index associated to maximal column frequency %rmax = max_j n_ij nijmax=max(N,[],2); @@ -667,14 +711,14 @@ Lcol(i) = min(seqJ(N(i,:) == nijmax(i))); end end - + varGKlambdayx= (n - sum(nijmax)) * (sum(nijmax) + ndotjmax -2*(sum(nijmax(seqI(Lcol == Lcolmax)))))/(n-ndotjmax)^3; seGKlambdayx=sqrt(varGKlambdayx); - + % variance of uncertainty coefficient of Theil varHyx=sum(N(boo).*( Hy*log(N(boo)./nidotmat(boo)) +(Hx-Hxy)*log(ndotjmat(boo)/n) ).^2)/(n^2*Hy^4); seHyx=sqrt(varHyx); - + % Compute zscores and p-values zCramerV = CramerV/seCramerV; % z-score pvalCramerV = 2*(1 - normcdf(abs(zCramerV))); %p-value (two-sided) @@ -684,7 +728,7 @@ pvaltauyx = 2*(1 - normcdf(abs(ztauyx))); %p-value (two-sided) zHyx = Hyx/seHyx; % z-score pvalHyx = 2*(1 - normcdf(abs(zHyx))); %p-value (two-sided) - + end % Store results in output structure out @@ -735,8 +779,15 @@ out.TestIndtable=TestIndtable; end +% Store 2x2 contingency table indexes +if I==2 && J==2 +out.theta=th; +out.Q=Q; +out.U=U; +end + if dispresults == true - + disp('Chi2 index') disp(Chi2) disp('pvalue Chi2 index') @@ -745,10 +796,25 @@ disp(Phi) disp('Cramer''s V ') disp(CramerV) - + + + if I==2 && J==2 + disp('-------------------------------') + disp('2x2 contingency table indexes') + disp('th=cross product ratio') + disp(th) + disp('Cross product ratio in the interval [-1 1]. Index Q=(th-1)/(th+1)') + disp(Q) + disp('Cross product ratio in the interval [-1 1]. 
Index U=(sqrt(th)-1)/(sqrt(th)+1)') + disp(U) + disp('-------------------------------') + end + + + if NoStandardErrors == false if verMatlab ==0 - + % Test H_0 % Test of independence disp('Test of H_0: independence between rows and columns') @@ -756,7 +822,7 @@ disp('-----------------------------------------') disp(['Indexes and ' num2str(conflev*100) '% confidence limits']) disp(ConfLimtable); - + else % Test H_0 % Test of independence