From af7b6245390c63af379686b8822e6e2d9330d63e Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Wed, 24 Apr 2013 22:23:20 +0530 Subject: [PATCH 01/22] Added Gemspec with new gems listing --- README.txt | 174 --------------------------------------------- statsample.gemspec | 98 +++++++++++++++++++++++++ 2 files changed, 98 insertions(+), 174 deletions(-) delete mode 100644 README.txt create mode 100644 statsample.gemspec diff --git a/README.txt b/README.txt deleted file mode 100644 index 0c515b4..0000000 --- a/README.txt +++ /dev/null @@ -1,174 +0,0 @@ -= Statsample - -http://ruby-statsample.rubyforge.org/ - - -== DESCRIPTION: - -A suite for basic and advanced statistics on Ruby. Tested on Ruby 1.8.7, 1.9.1, 1.9.2 (April, 2010), ruby-head(June, 2011) and JRuby 1.4 (Ruby 1.8.7 compatible). - -Include: -* Descriptive statistics: frequencies, median, mean, standard error, skew, kurtosis (and many others). -* Imports and exports datasets from and to Excel, CSV and plain text files. -* Correlations: Pearson's r, Spearman's rank correlation (rho), point biserial, tau a, tau b and gamma. Tetrachoric and Polychoric correlation provides by +statsample-bivariate-extension+ gem. -* Intra-class correlation -* Anova: generic and vector-based One-way ANOVA and Two-way ANOVA, with contrasts for One-way ANOVA. -* Tests: F, T, Levene, U-Mannwhitney. -* Regression: Simple, Multiple (OLS), Probit and Logit -* Factorial Analysis: Extraction (PCA and Principal Axis), Rotation (Varimax, Equimax, Quartimax) and Parallel Analysis and Velicer's MAP test, for estimation of number of factors. -* Reliability analysis for simple scale and a DSL to easily analyze multiple scales using factor analysis and correlations, if you want it. -* Basic time series support -* Dominance Analysis, with multivariate dependent and bootstrap (Azen & Budescu) -* Sample calculation related formulas -* Structural Equation Modeling (SEM), using R libraries +sem+ and +OpenMx+ -* Creates reports on text, html and rtf, using ReportBuilder gem -* Graphics: Histogram, Boxplot and Scatterplot - -== PRINCIPLES - -* Software Design: - * One module/class for each type of analysis - * Options can be set as hash on initialize() or as setters methods - * Clean API for interactive sessions - * summary() returns all necessary informacion for interactive sessions - * All statistical data available though methods on objects - * All (important) methods should be tested. Better with random data. -* Statistical Design - * Results are tested against text results, SPSS and R outputs. - * Go beyond Null Hiphotesis Testing, using confidence intervals and effect sizes when possible - * (When possible) All references for methods are documented, providing sensible information on documentation - -== FEATURES: - -* Classes for manipulation and storage of data: - * Statsample::Vector: An extension of an array, with statistical methods like sum, mean and standard deviation - * Statsample::Dataset: a group of Statsample::Vector, analog to a excel spreadsheet or a dataframe on R. The base of almost all operations on statsample. - * Statsample::Multiset: multiple datasets with same fields and type of vectors -* Anova module provides generic Statsample::Anova::OneWay and vector based Statsample::Anova::OneWayWithVectors. Also you can create contrast using Statsample::Anova::Contrast -* Module Statsample::Bivariate provides covariance and pearson, spearman, point biserial, tau a, tau b, gamma, tetrachoric (see Bivariate::Tetrachoric) and polychoric (see Bivariate::Polychoric) correlations. Include methods to create correlation and covariance matrices -* Multiple types of regression. - * Simple Regression : Statsample::Regression::Simple - * Multiple Regression: Statsample::Regression::Multiple - * Logit Regression: Statsample::Regression::Binomial::Logit - * Probit Regression: Statsample::Regression::Binomial::Probit -* Factorial Analysis algorithms on Statsample::Factor module. - * Classes for Extraction of factors: - * Statsample::Factor::PCA - * Statsample::Factor::PrincipalAxis - * Classes for Rotation of factors: - * Statsample::Factor::Varimax - * Statsample::Factor::Equimax - * Statsample::Factor::Quartimax - * Classes for calculation of factors to retain - * Statsample::Factor::ParallelAnalysis performs Horn's 'parallel analysis' to a principal components analysis to adjust for sample bias in the retention of components. - * Statsample::Factor::MAP performs Velicer's Minimum Average Partial (MAP) test, which retain components as long as the variance in the correlation matrix represents systematic variance. -* Dominance Analysis. Based on Budescu and Azen papers, dominance analysis is a method to analyze the relative importance of one predictor relative to another on multiple regression - * Statsample::DominanceAnalysis class can report dominance analysis for a sample, using uni or multivariate dependent variables - * Statsample::DominanceAnalysis::Bootstrap can execute bootstrap analysis to determine dominance stability, as recomended by Azen & Budescu (2003) link[http://psycnet.apa.org/journals/met/8/2/129/]. -* Module Statsample::Codification, to help to codify open questions -* Converters to import and export data: - * Statsample::Database : Can create sql to create tables, read and insert data - * Statsample::CSV : Read and write CSV files - * Statsample::Excel : Read and write Excel files - * Statsample::Mx : Write Mx Files - * Statsample::GGobi : Write Ggobi files -* Module Statsample::Crosstab provides function to create crosstab for categorical data -* Module Statsample::Reliability provides functions to analyze scales with psychometric methods. - * Class Statsample::Reliability::ScaleAnalysis provides statistics like mean, standard deviation for a scale, Cronbach's alpha and standarized Cronbach's alpha, and for each item: mean, correlation with total scale, mean if deleted, Cronbach's alpha is deleted. - * Class Statsample::Reliability::MultiScaleAnalysis provides a DSL to easily analyze reliability of multiple scales and retrieve correlation matrix and factor analysis of them. - * Class Statsample::Reliability::ICC provides intra-class correlation, using Shrout & Fleiss(1979) and McGraw & Wong (1996) formulations. -* Module Statsample::SRS (Simple Random Sampling) provides a lot of functions to estimate standard error for several type of samples -* Module Statsample::Test provides several methods and classes to perform inferencial statistics - * Statsample::Test::BartlettSphericity - * Statsample::Test::ChiSquare - * Statsample::Test::F - * Statsample::Test::KolmogorovSmirnov (only D value) - * Statsample::Test::Levene - * Statsample::Test::UMannWhitney - * Statsample::Test::T -* Module Graph provides several classes to create beautiful graphs using rubyvis - * Statsample::Graph::Boxplot - * Statsample::Graph::Histogram - * Statsample::Graph::Scatterplot -* Module Statsample::TimeSeries provides basic support for time series. -* Gem +statsample-sem+ provides a DSL to R libraries +sem+ and +OpenMx+ -* Close integration with gem reportbuilder, to easily create reports on text, html and rtf formats. - -== Examples of use: - -See multiples examples of use on [http://github.com/clbustos/statsample/tree/master/examples/] - -=== Boxplot - - require 'statsample' - ss_analysis(Statsample::Graph::Boxplot) do - n=30 - a=rnorm(n-1,50,10) - b=rnorm(n, 30,5) - c=rnorm(n,5,1) - a.push(2) - boxplot(:vectors=>[a,b,c], :width=>300, :height=>300, :groups=>%w{first first second}, :minimum=>0) - end - Statsample::Analysis.run # Open svg file on *nix application defined - -=== Correlation matrix - - require 'statsample' - # Note R like generation of random gaussian variable - # and correlation matrix - - ss_analysis("Statsample::Bivariate.correlation_matrix") do - samples=1000 - ds=data_frame( - 'a'=>rnorm(samples), - 'b'=>rnorm(samples), - 'c'=>rnorm(samples), - 'd'=>rnorm(samples)) - cm=cor(ds) - summary(cm) - end - - Statsample::Analysis.run_batch # Echo output to console - - -== REQUIREMENTS: - -Optional: - -* Plotting: gnuplot and rbgnuplot, SVG::Graph -* Factorial analysis and polychorical correlation(joint estimate and polychoric series): gsl library and rb-gsl (http://rb-gsl.rubyforge.org/). You should install it using gem install gsl. - -Note: Use gsl 1.12.109 or later. - -== RESOURCES - -* Source code on github: http://github.com/clbustos/statsample -* API: http://ruby-statsample.rubyforge.org/statsample/ -* Bug report and feature request: http://github.com/clbustos/statsample/issues -* E-mailing list: http://groups.google.com/group/statsample - -== INSTALL: - - $ sudo gem install statsample - -On *nix, you should install statsample-optimization to retrieve gems gsl, statistics2 and a C extension to speed some methods. - -There are available precompiled version for Ruby 1.9 on x86, x86_64 and mingw32 archs. - - $ sudo gem install statsample-optimization - -If you use Ruby 1.8, you should compile statsample-optimization, usign parameter --platform ruby - - $ sudo gem install statsample-optimization --platform ruby - -If you need to work on Structural Equation Modeling, you could see +statsample-sem+. You need R with +sem+ or +OpenMx+ [http://openmx.psyc.virginia.edu/] libraries installed - - $ sudo gem install statsample-sem - -Available setup.rb file - - sudo gem ruby setup.rb - -== LICENSE: - -GPL-2 (See LICENSE.txt) diff --git a/statsample.gemspec b/statsample.gemspec new file mode 100644 index 0000000..205f0e2 --- /dev/null +++ b/statsample.gemspec @@ -0,0 +1,98 @@ +# -*- encoding: utf-8 -*- + +Gem::Specification.new do |s| + s.name = "statsample" + s.version = "1.1.0.20110912141756" + + s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version= + s.authors = ["Claudio Bustos"] + s.date = "2013-19-04" + s.description = "A suite for basic and advanced statistics on Ruby. Tested on Ruby 1.8.7, 1.9.1, 1.9.2, 1.9.3 and JRuby 1.4 (Ruby 1.8.7 compatible).\n\nInclude:\n* Descriptive statistics: frequencies, median, mean, standard error, skew, kurtosis (and many others).\n* Imports and exports datasets from and to Excel, CSV and plain text files.\n* Correlations: Pearson's r, Spearman's rank correlation (rho), point biserial, tau a, tau b and gamma. Tetrachoric and Polychoric correlation provides by +statsample-bivariate-extension+ gem.\n* Intra-class correlation\n* Anova: generic and vector-based One-way ANOVA and Two-way ANOVA, with contrasts for One-way ANOVA.\n* Tests: F, T, Levene, U-Mannwhitney.\n* Regression: Simple, Multiple (OLS), Probit and Logit\n* Factorial Analysis: Extraction (PCA and Principal Axis), Rotation (Varimax, Equimax, Quartimax) and Parallel Analysis and Velicer's MAP test, for estimation of number of factors.\n* Reliability analysis for simple scale and a DSL to easily analyze multiple scales using factor analysis and correlations, if you want it.\n* Dominance Analysis, with multivariate dependent and bootstrap (Azen & Budescu)\n* Sample calculation related formulas\n* Structural Equation Modeling (SEM), using R libraries +sem+ and +OpenMx+\n* Creates reports on text, html and rtf, using ReportBuilder gem\n* Graphics: Histogram, Boxplot and Scatterplot" + s.email = ["clbustos@gmail.com"] + s.executables = ["statsample"] + s.extra_rdoc_files = ["History.txt", "LICENSE.txt", "Manifest.txt", "README.txt", "references.txt"] + s.files = ["History.txt", "LICENSE.txt", "Manifest.txt", "README.txt", "Rakefile", "benchmarks/correlation_matrix_15_variables.rb", "benchmarks/correlation_matrix_5_variables.rb", "benchmarks/correlation_matrix_methods/correlation_matrix.ds", "benchmarks/correlation_matrix_methods/correlation_matrix.html", "benchmarks/correlation_matrix_methods/correlation_matrix.rb", "benchmarks/correlation_matrix_methods/correlation_matrix.xls", "benchmarks/correlation_matrix_methods/correlation_matrix_gsl_ruby.ods", "benchmarks/correlation_matrix_methods/correlation_matrix_with_graphics.ods", "benchmarks/correlation_matrix_methods/results.ds", "benchmarks/factor_map.rb", "benchmarks/helpers_benchmark.rb", "bin/statsample", "data/locale/es/LC_MESSAGES/statsample.mo", "doc_latex/manual/equations.tex", "examples/boxplot.rb", "examples/correlation_matrix.rb", "examples/dataset.rb", "examples/dominance_analysis.rb", "examples/dominance_analysis_bootstrap.rb", "examples/histogram.rb", "examples/icc.rb", "examples/levene.rb", "examples/multiple_regression.rb", "examples/multivariate_correlation.rb", "examples/parallel_analysis.rb", "examples/polychoric.rb", "examples/principal_axis.rb", "examples/reliability.rb", "examples/scatterplot.rb", "examples/t_test.rb", "examples/tetrachoric.rb", "examples/u_test.rb", "examples/vector.rb", "examples/velicer_map_test.rb", "grab_references.rb", "lib/spss.rb", "lib/statsample.rb", "lib/statsample/analysis.rb", "lib/statsample/anova.rb", "lib/statsample/anova/contrast.rb", "lib/statsample/anova/oneway.rb", "lib/statsample/anova/twoway.rb", "lib/statsample/bivariate.rb", "lib/statsample/bivariate/pearson.rb", "lib/statsample/codification.rb", "lib/statsample/converter/csv.rb", "lib/statsample/converter/spss.rb", "lib/statsample/converters.rb", "lib/statsample/crosstab.rb", "lib/statsample/dataset.rb", "lib/statsample/dominanceanalysis.rb", "lib/statsample/dominanceanalysis/bootstrap.rb", "lib/statsample/factor.rb", "lib/statsample/factor/map.rb", "lib/statsample/factor/parallelanalysis.rb", "lib/statsample/factor/pca.rb", "lib/statsample/factor/principalaxis.rb", "lib/statsample/factor/rotation.rb", "lib/statsample/graph.rb", "lib/statsample/graph/boxplot.rb", "lib/statsample/graph/histogram.rb", "lib/statsample/graph/scatterplot.rb", "lib/statsample/histogram.rb", "lib/statsample/matrix.rb", "lib/statsample/mle.rb", "lib/statsample/mle/logit.rb", "lib/statsample/mle/normal.rb", "lib/statsample/mle/probit.rb", "lib/statsample/multiset.rb", "lib/statsample/regression.rb", "lib/statsample/regression/binomial.rb", "lib/statsample/regression/binomial/logit.rb", "lib/statsample/regression/binomial/probit.rb", "lib/statsample/regression/multiple.rb", "lib/statsample/regression/multiple/alglibengine.rb", "lib/statsample/regression/multiple/baseengine.rb", "lib/statsample/regression/multiple/gslengine.rb", "lib/statsample/regression/multiple/matrixengine.rb", "lib/statsample/regression/multiple/rubyengine.rb", "lib/statsample/regression/simple.rb", "lib/statsample/reliability.rb", "lib/statsample/reliability/icc.rb", "lib/statsample/reliability/multiscaleanalysis.rb", "lib/statsample/reliability/scaleanalysis.rb", "lib/statsample/reliability/skillscaleanalysis.rb", "lib/statsample/resample.rb", "lib/statsample/rserve_extension.rb", "lib/statsample/shorthand.rb", "lib/statsample/srs.rb", "lib/statsample/test.rb", "lib/statsample/test/bartlettsphericity.rb", "lib/statsample/test/chisquare.rb", "lib/statsample/test/f.rb", "lib/statsample/test/kolmogorovsmirnov.rb", "lib/statsample/test/levene.rb", "lib/statsample/test/t.rb", "lib/statsample/test/umannwhitney.rb", "lib/statsample/vector.rb", "lib/statsample/vector/gsl.rb", "po/es/statsample.mo", "po/es/statsample.po", "po/statsample.pot", "references.txt", "setup.rb", "test/fixtures/bank2.dat", "test/fixtures/correlation_matrix.rb", "test/fixtures/crime.txt", "test/fixtures/hartman_23.matrix", "test/fixtures/repeated_fields.csv", "test/fixtures/test_binomial.csv", "test/fixtures/test_csv.csv", "test/fixtures/test_xls.xls", "test/fixtures/tetmat_matrix.txt", "test/fixtures/tetmat_test.txt", "test/helpers_tests.rb", "test/test_analysis.rb", "test/test_anova_contrast.rb", "test/test_anovaoneway.rb", "test/test_anovatwoway.rb", "test/test_anovatwowaywithdataset.rb", "test/test_anovawithvectors.rb", "test/test_bartlettsphericity.rb", "test/test_bivariate.rb", "test/test_codification.rb", "test/test_crosstab.rb", "test/test_csv.rb", "test/test_dataset.rb", "test/test_dominance_analysis.rb", "test/test_factor.rb", "test/test_factor_map.rb", "test/test_factor_pa.rb", "test/test_ggobi.rb", "test/test_gsl.rb", "test/test_histogram.rb", "test/test_logit.rb", "test/test_matrix.rb", "test/test_mle.rb", "test/test_multiset.rb", "test/test_regression.rb", "test/test_reliability.rb", "test/test_reliability_icc.rb", "test/test_reliability_skillscale.rb", "test/test_resample.rb", "test/test_rserve_extension.rb", "test/test_srs.rb", "test/test_statistics.rb", "test/test_stest.rb", "test/test_stratified.rb", "test/test_test_f.rb", "test/test_test_kolmogorovsmirnov.rb", "test/test_test_t.rb", "test/test_umannwhitney.rb", "test/test_vector.rb", "test/test_xls.rb", "web/Rakefile"] + s.homepage = "http://ruby-statsample.rubyforge.org/" + s.post_install_message = "***************************************************\nThanks for installing statsample.\n\nOn *nix, you could install statsample-optimization\nto retrieve gems gsl, statistics2 and a C extension\nto speed some methods.\n\n $ sudo gem install statsample-optimization\n\nOn Ubuntu, install build-essential and libgsl0-dev \nusing apt-get. Compile ruby 1.9x from \nsource code first.\n\n $ sudo apt-get install build-essential libgsl0-dev\n\n\n*****************************************************\n" + s.rdoc_options = ["--main", "README.txt"] + s.require_paths = ["lib"] + s.rubyforge_project = "ruby-statsample" + s.rubygems_version = "1.8.10" + s.summary = "A suite for basic and advanced statistics on Ruby" + s.test_files = ["test/test_ggobi.rb", "test/test_vector.rb", "test/test_crosstab.rb", "test/test_bivariate.rb", "test/test_dominance_analysis.rb", "test/test_analysis.rb", "test/test_dataset.rb", "test/test_test_t.rb", "test/test_regression.rb", "test/test_reliability_icc.rb", "test/test_histogram.rb", "test/test_multiset.rb", "test/test_logit.rb", "test/test_xls.rb", "test/test_test_kolmogorovsmirnov.rb", "test/test_factor_map.rb", "test/test_stratified.rb", "test/test_rserve_extension.rb", "test/test_reliability_skillscale.rb", "test/test_anova_contrast.rb", "test/test_anovaoneway.rb", "test/test_matrix.rb", "test/test_reliability.rb", "test/test_anovatwowaywithdataset.rb", "test/test_factor_pa.rb", "test/test_factor.rb", "test/test_resample.rb", "test/test_anovawithvectors.rb", "test/test_bartlettsphericity.rb", "test/test_statistics.rb", "test/test_stest.rb", "test/test_srs.rb", "test/test_test_f.rb", "test/test_csv.rb", "test/test_anovatwoway.rb", "test/test_gsl.rb", "test/test_mle.rb", "test/test_umannwhitney.rb", "test/test_codification.rb"] + + if s.respond_to? :specification_version then + s.specification_version = 3 + + if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then + s.add_runtime_dependency(%q, ["~> 0.6.5"]) + s.add_runtime_dependency(%q, ["~> 1.4"]) + s.add_runtime_dependency(%q, ["~> 0.2.0"]) + s.add_runtime_dependency(%q, ["> 0"]) + s.add_runtime_dependency(%q, ["~> 0.0"]) + s.add_runtime_dependency(%q, ["~> 0.3.1"]) + s.add_runtime_dependency(%q, ["> 0"]) + s.add_runtime_dependency(%q, ["~> 0.2.5"]) + s.add_runtime_dependency(%q, ["~> 0.5.0"]) + s.add_runtime_dependency(%q, ["~> 0.3"]) + s.add_development_dependency(%q, ["~> 0"]) + s.add_development_dependency(%q, ["~> 0"]) + s.add_development_dependency(%q, ["~> 2.0"]) + s.add_development_dependency(%q, ["~> 0"]) + s.add_development_dependency(%q, ["~> 2.3.8"]) + s.add_development_dependency(%q, ["~> 0"]) + s.add_development_dependency(%q, ["~> 1.5.0"]) + s.add_development_dependency(%q, ["~> 2.12"]) + s.add_development_dependency(%q, ["~> 2.13.0"]) + s.add_development_dependency(%q, ["~> 2.13.1"]) + s.add_development_dependency(%q, ["~> 2.13.0"]) + s.add_development_dependency(%q, ["~> 2.13.0"]) + else + s.add_dependency(%q, ["~> 0.6.5"]) + s.add_dependency(%q, ["~> 1.4"]) + s.add_dependency(%q, ["~> 0.2.0"]) + s.add_dependency(%q, ["> 0"]) + s.add_dependency(%q, ["~> 0.0"]) + s.add_dependency(%q, ["~> 0.3.1"]) + s.add_dependency(%q, ["> 0"]) + s.add_dependency(%q, ["~> 0.2.5"]) + s.add_dependency(%q, ["~> 0.5.0"]) + s.add_dependency(%q, ["~> 0.3"]) + s.add_dependency(%q, ["~> 0"]) + s.add_dependency(%q, ["~> 0"]) + s.add_dependency(%q, ["~> 2.0"]) + s.add_dependency(%q, ["~> 0"]) + s.add_dependency(%q, ["~> 2.3.8"]) + s.add_dependency(%q, ["~> 0"]) + s.add_dependency(%q, ["~> 1.5.0"]) + s.add_dependency(%q, ["~> 2.12"]) + s.add_dependency(%q, ["~> 2.13.0"]) + s.add_dependency(%q, ["~> 2.13.1"]) + s.add_dependency(%q, ["~> 2.13.0"]) + s.add_dependency(%q, ["~> 2.13.0"]) + end + else + s.add_dependency(%q, ["~> 0.6.5"]) + s.add_dependency(%q, ["~> 1.4"]) + s.add_dependency(%q, ["~> 0.2.0"]) + s.add_dependency(%q, ["> 0"]) + s.add_dependency(%q, ["~> 0.0"]) + s.add_dependency(%q, ["~> 0.3.1"]) + s.add_dependency(%q, ["> 0"]) + s.add_dependency(%q, ["~> 0.2.5"]) + s.add_dependency(%q, ["~> 0.5.0"]) + s.add_dependency(%q, ["~> 0.3"]) + s.add_dependency(%q, ["~> 0"]) + s.add_dependency(%q, ["~> 0"]) + s.add_dependency(%q, ["~> 2.0"]) + s.add_dependency(%q, ["~> 0"]) + s.add_dependency(%q, ["~> 2.3.8"]) + s.add_dependency(%q, ["~> 0"]) + s.add_dependency(%q, ["~> 1.5.0"]) + s.add_dependency(%q, ["~> 2.12"]) + s.add_dependency(%q, ["~> 2.13.0"]) + s.add_dependency(%q, ["~> 2.13.1"]) + s.add_dependency(%q, ["~> 2.13.0"]) + s.add_dependency(%q, ["~> 2.13.0"]) + end +end From 2efd9fb5511070bd07517fc5b8dc71cdeafd85f6 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Wed, 24 Apr 2013 22:25:11 +0530 Subject: [PATCH 02/22] Converted README to proper markdown. Easily readable now. --- README.md | 185 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 185 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..d389977 --- /dev/null +++ b/README.md @@ -0,0 +1,185 @@ +Statsample +========== + +[http://ruby-statsample.rubyforge.org/](http://ruby-statsample.rubyforge.org/) + + +DESCRIPTION: +------------ + +A suite for basic and advanced statistics on Ruby. Tested on Ruby 1.8.7, 1.9.1, 1.9.2 (April, 2010), ruby-head(June, 2011) and JRuby 1.4 (Ruby 1.8.7 compatible). + +Include: + +* Descriptive statistics: frequencies, median, mean, standard error, skew, kurtosis (and many others). +* Imports and exports datasets from and to Excel, CSV and plain text files. +* Correlations: Pearson's r, Spearman's rank correlation (rho), point biserial, tau a, tau b and gamma. Tetrachoric and Polychoric correlation provides by +statsample-bivariate-extension+ gem. +* Intra-class correlation +* Anova: generic and vector-based One-way ANOVA and Two-way ANOVA, with contrasts for One-way ANOVA. +* Tests: F, T, Levene, U-Mannwhitney. +* Regression: Simple, Multiple (OLS), Probit and Logit +* Factorial Analysis: Extraction (PCA and Principal Axis), Rotation (Varimax, Equimax, Quartimax) and Parallel Analysis and Velicer's MAP test, for estimation of number of factors. +* Reliability analysis for simple scale and a DSL to easily analyze multiple scales using factor analysis and correlations, if you want it. +* Basic time series support +* Dominance Analysis, with multivariate dependent and bootstrap (Azen & Budescu) +* Sample calculation related formulas +* Structural Equation Modeling (SEM), using R libraries +sem+ and +OpenMx+ +* Creates reports on text, html and rtf, using ReportBuilder gem +* Graphics: Histogram, Boxplot and Scatterplot + +PRINCIPLES +---------- + +* Software Design: + * One module/class for each type of analysis + * Options can be set as hash on initialize() or as setters methods + * Clean API for interactive sessions + * summary() returns all necessary informacion for interactive sessions + * All statistical data available though methods on objects + * All (important) methods should be tested. Better with random data. +* Statistical Design + * Results are tested against text results, SPSS and R outputs. + * Go beyond Null Hiphotesis Testing, using confidence intervals and effect sizes when possible + * (When possible) All references for methods are documented, providing sensible information on documentation + +FEATURES: +-------- + +* Classes for manipulation and storage of data: + * Statsample::Vector: An extension of an array, with statistical methods like sum, mean and standard deviation + * Statsample::Dataset: a group of Statsample::Vector, analog to a excel spreadsheet or a dataframe on R. The base of almost all operations on statsample. + * Statsample::Multiset: multiple datasets with same fields and type of vectors +* Anova module provides generic Statsample::Anova::OneWay and vector based Statsample::Anova::OneWayWithVectors. Also you can create contrast using Statsample::Anova::Contrast +* Module Statsample::Bivariate provides covariance and pearson, spearman, point biserial, tau a, tau b, gamma, tetrachoric (see Bivariate::Tetrachoric) and polychoric (see Bivariate::Polychoric) correlations. Include methods to create correlation and covariance matrices +* Multiple types of regression. + * Simple Regression : Statsample::Regression::Simple + * Multiple Regression: Statsample::Regression::Multiple + * Logit Regression: Statsample::Regression::Binomial::Logit + * Probit Regression: Statsample::Regression::Binomial::Probit +* Factorial Analysis algorithms on Statsample::Factor module. + * Classes for Extraction of factors: + * Statsample::Factor::PCA + * Statsample::Factor::PrincipalAxis + * Classes for Rotation of factors: + * Statsample::Factor::Varimax + * Statsample::Factor::Equimax + * Statsample::Factor::Quartimax + * Classes for calculation of factors to retain + * Statsample::Factor::ParallelAnalysis performs Horn's 'parallel analysis' to a principal components analysis to adjust for sample bias in the retention of components. + * Statsample::Factor::MAP performs Velicer's Minimum Average Partial (MAP) test, which retain components as long as the variance in the correlation matrix represents systematic variance. +* Dominance Analysis. Based on Budescu and Azen papers, dominance analysis is a method to analyze the relative importance of one predictor relative to another on multiple regression + * Statsample::DominanceAnalysis class can report dominance analysis for a sample, using uni or multivariate dependent variables + * Statsample::DominanceAnalysis::Bootstrap can execute bootstrap analysis to determine dominance stability, as recomended by Azen & Budescu (2003) link[http://psycnet.apa.org/journals/met/8/2/129/]. +* Module Statsample::Codification, to help to codify open questions +* Converters to import and export data: + * Statsample::Database : Can create sql to create tables, read and insert data + * Statsample::CSV : Read and write CSV files + * Statsample::Excel : Read and write Excel files + * Statsample::Mx : Write Mx Files + * Statsample::GGobi : Write Ggobi files +* Module Statsample::Crosstab provides function to create crosstab for categorical data +* Module Statsample::Reliability provides functions to analyze scales with psychometric methods. + * Class Statsample::Reliability::ScaleAnalysis provides statistics like mean, standard deviation for a scale, Cronbach's alpha and standarized Cronbach's alpha, and for each item: mean, correlation with total scale, mean if deleted, Cronbach's alpha is deleted. + * Class Statsample::Reliability::MultiScaleAnalysis provides a DSL to easily analyze reliability of multiple scales and retrieve correlation matrix and factor analysis of them. + * Class Statsample::Reliability::ICC provides intra-class correlation, using Shrout & Fleiss(1979) and McGraw & Wong (1996) formulations. +* Module Statsample::SRS (Simple Random Sampling) provides a lot of functions to estimate standard error for several type of samples +* Module Statsample::Test provides several methods and classes to perform inferencial statistics + * Statsample::Test::BartlettSphericity + * Statsample::Test::ChiSquare + * Statsample::Test::F + * Statsample::Test::KolmogorovSmirnov (only D value) + * Statsample::Test::Levene + * Statsample::Test::UMannWhitney + * Statsample::Test::T +* Module Graph provides several classes to create beautiful graphs using rubyvis + * Statsample::Graph::Boxplot + * Statsample::Graph::Histogram + * Statsample::Graph::Scatterplot +* Module Statsample::TimeSeries provides basic support for time series. +* Gem +statsample-sem+ provides a DSL to R libraries +sem+ and +OpenMx+ +* Close integration with gem `reportbuilder`, to easily create reports on text, html and rtf formats. + +Examples of use: +-------------- + +See multiples examples of use on [http://github.com/clbustos/statsample/tree/master/examples/](http://github.com/clbustos/statsample/tree/master/examples/) + +Boxplot +------- +```ruby + require 'statsample' + ss_analysis(Statsample::Graph::Boxplot) do + n=30 + a=rnorm(n-1,50,10) + b=rnorm(n, 30,5) + c=rnorm(n,5,1) + a.push(2) + boxplot(:vectors=>[a,b,c], :width=>300, :height=>300, :groups=>%w{first first second}, :minimum=>0) + end + Statsample::Analysis.run # Open svg file on *nix application defined +``` +Correlation matrix +------------------ +```ruby + require 'statsample' + # Note R like generation of random gaussian variable + # and correlation matrix + + ss_analysis("Statsample::Bivariate.correlation_matrix") do + samples=1000 + ds=data_frame( + 'a'=>rnorm(samples), + 'b'=>rnorm(samples), + 'c'=>rnorm(samples), + 'd'=>rnorm(samples)) + cm=cor(ds) + summary(cm) + end + + Statsample::Analysis.run_batch # Echo output to console +``` + +REQUIREMENTS: +------------- + +Optional: + +* Plotting: gnuplot and rbgnuplot, SVG::Graph +* Factorial analysis and polychorical correlation(joint estimate and polychoric series): gsl library and rb-gsl [http://rb-gsl.rubyforge.org/](http://rb-gsl.rubyforge.org/). You should install it using `gem install gsl`. + +**Note**: Use gsl 1.12.109 or later. + +RESOURCES: +---------- + +* Source code on github: [http://github.com/clbustos/statsample](http://github.com/clbustos/statsample) +* API: [http://ruby-statsample.rubyforge.org/statsample/](http://ruby-statsample.rubyforge.org/statsample/) +* Bug report and feature request: [http://github.com/clbustos/statsample/issues](http://github.com/clbustos/statsample/issues) +* E-mailing list: [http://groups.google.com/group/statsample](http://groups.google.com/group/statsample) + +INSTALL: +--------- + `$ sudo gem install statsample` + +On \*nix, you should install statsample-optimization to retrieve gems gsl, statistics2 and a C extension to speed some methods. + +There are available precompiled version for Ruby 1.9 on x86, x86\_64 and mingw32 archs. + + `$ sudo gem install statsample-optimization` + +If you use Ruby 1.8, you should compile statsample-optimization, usign parameter `--platform ruby` + + `$ sudo gem install statsample-optimization --platform ruby` + +If you need to work on Structural Equation Modeling, you could see _statsample-sem_. You need R with _sem_ or _OpenMx_ [http://openmx.psyc.virginia.edu/](http://openmx.psyc.virginia.edu/) libraries installed + + `$ sudo gem install statsample-sem` + +Available setup.rb file + + `sudo gem ruby setup.rb` + +LICENSE: +------- + +GPL-2 (See LICENSE.txt) From 80f7a73be015d2ac7f531fb2125363a6617f0fab Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Wed, 24 Apr 2013 22:26:14 +0530 Subject: [PATCH 03/22] Added specs for Simple Random Sampling(SRS) and Time Series Analysis. Removed depracated Shoulda includes from helper_tests Modified param list in srs.rb for whitespaces. --- lib/statsample/srs.rb | 2 +- test/helpers_tests.rb | 6 ------ test/test_srs.rb | 46 +++++++++++++++++++++++++++++++++++++++++++ test/test_tseries.rb | 13 ++++++++++-- 4 files changed, 58 insertions(+), 9 deletions(-) diff --git a/lib/statsample/srs.rb b/lib/statsample/srs.rb index 6dcc7d2..a372431 100644 --- a/lib/statsample/srs.rb +++ b/lib/statsample/srs.rb @@ -61,7 +61,7 @@ def proportion_confidence_interval_z(p, n_sample, n_population, margin=0.95) # Proportion confidence interval with x value # Uses estimated proportion, sample without replacement - def proportion_confidence_interval(p, sam,pop , x) + def proportion_confidence_interval(p, sam, pop, x) #f=sam.quo(pop) one_range=x * Math::sqrt((qf(sam, pop) * p * (1-p)).quo(sam-1)) + (1.quo(sam * 2.0)) [p-one_range, p+one_range] diff --git a/test/helpers_tests.rb b/test/helpers_tests.rb index 82c6b2e..419b24f 100644 --- a/test/helpers_tests.rb +++ b/test/helpers_tests.rb @@ -14,10 +14,6 @@ module MiniTest class Unit class TestCase - include Shoulda::InstanceMethods - extend Shoulda::ClassMethods - include Shoulda::Assertions - def self.should_with_gsl(name,&block) should(name) do if Statsample.has_gsl? @@ -25,10 +21,8 @@ def self.should_with_gsl(name,&block) else skip("Requires GSL") end - end end - end end diff --git a/test/test_srs.rb b/test/test_srs.rb index 1d18cf9..a486423 100644 --- a/test/test_srs.rb +++ b/test/test_srs.rb @@ -6,4 +6,50 @@ def test_std_error assert_equal(108,Statsample::SRS.estimation_n(0.05,0.5,150,0.95).to_i) assert_in_delta(0.0289,Statsample::SRS.proportion_sd_kp_wor(0.5,100,150),0.001) end + + def test_fpc + sample_size = 80 + population_size = 10 + + assert_equal -7, Statsample::SRS.fpc_var(sample_size, population_size).to_i + end + + def test_qf + sample_size = 80 + population_size = 10 + + assert_equal -7, Statsample::SRS.qf(sample_size, population_size).to_i + #non sample fraction = 1 - sample fraction + assert_equal -7, 1 - sample_size.quo(population_size) + end + + def test_proportion_confidence_interval_t + proportion = 59 + n_sample, n_population = 15, 10 + confidence_interval = Statsample::SRS.proportion_confidence_interval_t(proportion, n_sample, n_population).map(&:to_f) + + assert_in_delta 35.2559, confidence_interval[0], 0.001 + assert_in_delta 82.7440, confidence_interval[1], 0.001 + end + + def test_proportion_confidence_interval_z + #estimated proportion with normal distribution. + proportion = 59 + n_sample, n_population = 15, 10 + confidence_interval = Statsample::SRS.proportion_confidence_interval_z(proportion, n_sample, n_population).map(&:to_f) + + assert_in_delta 37.2991, confidence_interval[0], 0.001 + assert_in_delta 80.7008, confidence_interval[1], 0.001 + end + + def test_proportion_sd_kp_wor + #standard deviation without replacement strategy for known proportion + assert_in_delta 11.6979, Statsample::SRS.proportion_sd_kp_wor(40, -30, 20), 0.001 + end + + def test_proportion_sd_ep_wor + #standard deviation with replacement for estimated proportion + assert_in_delta 35.5527, Statsample::SRS.proportion_sd_ep_wr(80, -4), 0.001 + assert_in_delta 38.9700, Statsample::SRS.proportion_sd_ep_wr(68, -2), 0.001 + end end diff --git a/test/test_tseries.rb b/test/test_tseries.rb index 35f2cc9..9306ae0 100644 --- a/test/test_tseries.rb +++ b/test/test_tseries.rb @@ -24,8 +24,17 @@ def test_acf end def test_lag - assert_in_delta 16.66, @xiu.lag[@xiu.size - 1], 0.001 - assert_in_delta 16.36, @xiu.lag(2)[@xiu.size - 1], 0.001 + #test of default lag + lag1 = @xiu.lag + + assert_in_delta 16.66, lag1[lag1.size - 1], 0.001 + assert_in_delta 16.36, lag1[lag1.size - 2], 0.001 + + #test with different lagging unit + lag2 = @xiu.lag(2) + + assert_in_delta 16.36, lag2[lag2.size - 1], 0.001 + assert_in_delta 16.56, lag2[lag2.size - 2], 0.001 end def test_delta From b95ca730c2f9715ed9def88c08d2d872b84f76ab Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Wed, 1 May 2013 21:17:27 +0530 Subject: [PATCH 04/22] Added tests for StratifiedSample with proper comments on their significance Edited lib/statsample/multiset.rb for proper indentation, spacing and ruby patterns. --- lib/statsample/multiset.rb | 324 +++++++++++++++++++------------------ lib/statsample/srs.rb | 2 +- test/test_stratified.rb | 70 ++++++++ 3 files changed, 240 insertions(+), 156 deletions(-) diff --git a/lib/statsample/multiset.rb b/lib/statsample/multiset.rb index 88fe595..230fe3c 100644 --- a/lib/statsample/multiset.rb +++ b/lib/statsample/multiset.rb @@ -1,212 +1,225 @@ module Statsample # Multiset joins multiple dataset with the same fields and vectors - # but with different number of cases. + # but with different number of cases. # This is the base class for stratified and cluster sampling estimation class Multiset # Name of fields attr_reader :fields # Array with Statsample::Dataset attr_reader :datasets + # To create a multiset # * Multiset.new(%w{f1 f2 f3}) # define only fields def initialize(fields) - @fields=fields - @datasets={} + @fields = fields + @datasets = {} end - def self.new_empty_vectors(fields,ds_names) - ms=Multiset.new(fields) - ds_names.each{|d| - ms.add_dataset(d,Dataset.new(fields)) - } - ms + + def self.new_empty_vectors(fields, ds_names) + ms = Multiset.new(fields) + ds_names.each do |d| + ms.add_dataset(d, Dataset.new(fields)) + end + ms end + # Generate a new dataset as a union of partial dataset # If block given, this is applied to each dataset before union def union(&block) - union_field={} - types={} - names={} - labels={} - each do |k,ds| + union_field = {} + types = {} + names = {} + labels = {} + each do |k, ds| if block - ds=ds.dup - yield k,ds + ds = ds.dup + yield k, ds end @fields.each do |f| - union_field[f]||=Array.new + union_field[f] ||= Array.new union_field[f].concat(ds[f].data) - types[f]||=ds[f].type - names[f]||=ds[f].name - labels[f]||=ds[f].labels + types[f] ||= ds[f].type + names[f] ||= ds[f].name + labels[f] ||= ds[f].labels end end - + @fields.each do |f| - union_field[f]=union_field[f].to_vector(types[f]) - union_field[f].name=names[f] - union_field[f].labels=labels[f] + union_field[f] = union_field[f].to_vector(types[f]) + union_field[f].name = names[f] + union_field[f].labels = labels[f] end - ds_union=union_field.to_dataset - ds_union.fields=@fields + ds_union = union_field.to_dataset + ds_union.fields = @fields ds_union end + def datasets_names - @datasets.keys.sort + @datasets.keys.sort end + def n_datasets - @datasets.size + @datasets.size end - def add_dataset(key,ds) - if(ds.fields!=@fields) - raise ArgumentError, "Dataset(#{ds.fields.to_s})must have the same fields of the Multiset(#{@fields})" + + def add_dataset(key, ds) + if(ds.fields != @fields) + raise ArgumentError, "Dataset(#{ds.fields.to_s})must have the same fields of the Multiset(#{@fields})" else - @datasets[key]=ds + @datasets[key]=ds end end + def sum_field(field) - @datasets.inject(0) {|a,da| - stratum_name=da[0] - vector=da[1][field] - val=yield stratum_name,vector + @datasets.inject(0) do |a,da| + stratum_name = da[0] + vector = da[1][field] + val = yield stratum_name,vector a+val - } + end end + def collect_vector(field) - @datasets.collect {|k,v| + @datasets.collect {|k, v| yield k, v[field] } end - + def each_vector(field) - @datasets.each {|k,v| + @datasets.each {|k, v| yield k, v[field] } end + def[](i) @datasets[i] end + def each(&block) - @datasets.each {|k,ds| - next if ds.cases==0 - block.call(k,ds) - } + @datasets.each do |k, ds| + next if ds.cases == 0 + block.call(k, ds) + end end end + class StratifiedSample class << self # mean for an array of vectors def mean(*vectors) - n_total=0 - means=vectors.inject(0){|a,v| - n_total+=v.size - a+v.sum - } - means.to_f/n_total + n_total = 0 + means=vectors.inject(0) do |a,v| + n_total += v.size + a + v.sum + end + means.to_f / n_total end - + def standard_error_ksd_wr(es) - n_total=0 - sum=es.inject(0){|a,h| - n_total+=h['N'] - a+((h['N']**2 * h['s']**2) / h['n'].to_f) - } + n_total = 0 + sum=es.inject(0) do |a,h| + n_total += h['N'] + a + ((h['N'] ** 2 * h['s'] ** 2) / h['n'].to_f) + end (1.to_f / n_total)*Math::sqrt(sum) end - - + + def variance_ksd_wr(es) standard_error_ksd_wr(es)**2 end def calculate_n_total(es) - es.inject(0) {|a,h| a+h['N'] } + es.inject(0) {|a, h| a+h['N'] } end # Source : Cochran (1972) - + def variance_ksd_wor(es) - n_total=calculate_n_total(es) - es.inject(0){|a,h| - val=((h['N'].to_f / n_total)**2) * (h['s']**2 / h['n'].to_f) * (1 - (h['n'].to_f / h['N'])) - a+val - } + n_total = calculate_n_total(es) + es.inject(0) do |a, h| + val = ((h['N'].to_f / n_total)**2) * (h['s']**2 / h['n'].to_f) * (1 - (h['n'].to_f / h['N'])) + a+val + end end + def standard_error_ksd_wor(es) Math::sqrt(variance_ksd_wor(es)) end - - - + def variance_esd_wor(es) - n_total=calculate_n_total(es) - sum=es.inject(0){|a,h| - val=h['N']*(h['N']-h['n'])*(h['s']**2 / h['n'].to_f) - a+val - } - (1.0/(n_total**2))*sum + n_total = calculate_n_total(es) + sum=es.inject(0) do |a,h| + val = h['N'] * (h['N'] - h['n']) * (h['s']**2 / h['n'].to_f) + a + val + end + (1.0 / (n_total**2)) * sum end - - + def standard_error_esd_wor(es) Math::sqrt(variance_ksd_wor(es)) end # Based on http://stattrek.com/Lesson6/STRAnalysis.aspx def variance_esd_wr(es) - n_total=calculate_n_total(es) - sum=es.inject(0){|a,h| - val= ((h['s']**2 * h['N']**2) / h['n'].to_f) - a+val - } - (1.0/(n_total**2))*sum + n_total = calculate_n_total(es) + sum = es.inject(0) do |a,h| + val = ((h['s']**2 * h['N']**2) / h['n'].to_f) + a + val + end + (1.0 / (n_total**2)) * sum end + def standard_error_esd_wr(es) Math::sqrt(variance_esd_wr(es)) end - + def proportion_variance_ksd_wor(es) n_total=calculate_n_total(es) - es.inject(0){|a,h| - val= (((h['N'].to_f / n_total)**2 * h['p']*(1-h['p'])) / (h['n'])) * (1- (h['n'].to_f / h['N'])) - a+val - } + es.inject(0) do |a,h| + val = (((h['N'].to_f / n_total)**2 * h['p']*(1-h['p'])) / (h['n'])) * (1- (h['n'].to_f / h['N'])) + a + val + end end + def proportion_sd_ksd_wor(es) - Math::sqrt(proportion_variance_ksd_wor(es)) + Math::sqrt(proportion_variance_ksd_wor(es)) end - - + def proportion_sd_ksd_wr(es) - n_total=calculate_n_total(es) - sum=es.inject(0){|a,h| - val= (h['N']**2 * h['p']*(1-h['p'])) / h['n'].to_f - a+val - } - Math::sqrt(sum) * (1.0/n_total) + n_total = calculate_n_total(es) + sum = es.inject(0) do |a,h| + val = (h['N']**2 * h['p'] * (1 - h['p'])) / h['n'].to_f + a + val + end + Math::sqrt(sum) * (1.0 / n_total) end + def proportion_variance_ksd_wr(es) - proportion_variance_ksd_wor(es)**2 + proportion_variance_ksd_wor(es)**2 end - + def proportion_variance_esd_wor(es) n_total=n_total=calculate_n_total(es) - - sum=es.inject(0){|a,h| - a=(h['N']**2 * (h['N']-h['n']) * h['p']*(1.0-h['p'])) / ((h['n']-1)*(h['N']-1)) - a+val - } + + sum = es.inject(0) do |a,h| + a = (h['N']**2 * (h['N'] - h['n']) * h['p'] * (1.0-h['p'])) / ((h['n'] - 1) * (h['N'] - 1)) + a + val + end Math::sqrt(sum) * (1.0/n_total**2) end + def proportion_sd_esd_wor(es) - Math::sqrt(proportion_variance_ksd_wor(es)) + Math::sqrt(proportion_variance_ksd_wor(es)) end end - - def initialize(ms,strata_sizes) - raise TypeError,"ms should be a Multiset" unless ms.is_a? Statsample::Multiset - @ms=ms - raise ArgumentError,"You should put a strata size for each dataset" if strata_sizes.keys.sort!=ms.datasets_names - @strata_sizes=strata_sizes - @population_size=@strata_sizes.inject(0) {|a,x| a+x[1]} - @strata_number=@ms.n_datasets - @sample_size=@ms.datasets.inject(0) {|a,x| a+x[1].cases} + #class singleton methods end above. + + def initialize(ms, strata_sizes) + raise TypeError, "ms should be a Multiset" unless ms.is_a? Statsample::Multiset + @ms = ms + raise ArgumentError, "You should put a strata size for each dataset" if strata_sizes.keys.sort!=ms.datasets_names + @strata_sizes = strata_sizes + @population_size = @strata_sizes.inject(0) { |a, x| a + x[1] } + @strata_number = @ms.n_datasets + @sample_size = @ms.datasets.inject(0) { |a,x| a + x[1].cases } end # Number of strata def strata_number @@ -226,14 +239,14 @@ def stratum_size(h) @strata_sizes[h] end def vectors_by_field(field) - @ms.datasets.collect{|k,ds| + @ms.datasets.collect{|k, ds| ds[field] } end # Population proportion based on strata - def proportion(field, v=1) - @ms.sum_field(field) {|s_name,vector| - stratum_ponderation(s_name)*vector.proportion(v) + def proportion(field, v = 1) + @ms.sum_field(field) {|s_name, vector| + stratum_ponderation(s_name) * vector.proportion(v) } end # Stratum ponderation. @@ -242,69 +255,70 @@ def stratum_ponderation(h) @strata_sizes[h].to_f / @population_size end alias_method :wh, :stratum_ponderation - + # Population mean based on strata def mean(field) - @ms.sum_field(field) {|s_name,vector| - stratum_ponderation(s_name)*vector.mean + @ms.sum_field(field) {|s_name, vector| + stratum_ponderation(s_name) * vector.mean } end # Standard error with estimated population variance and without replacement. # Source: Cochran (1972) def standard_error_wor(field) - es=@ms.collect_vector(field) {|s_n, vector| - {'N'=>@strata_sizes[s_n],'n'=>vector.size, 's'=>vector.sds} + es = @ms.collect_vector(field) {|s_n, vector| + {'N' => @strata_sizes[s_n], 'n' => vector.size, 's' => vector.sds} } - + StratifiedSample.standard_error_esd_wor(es) end - + # Standard error with estimated population variance and without replacement. # Source: http://stattrek.com/Lesson6/STRAnalysis.aspx - + def standard_error_wor_2(field) - sum=@ms.sum_field(field) {|s_name,vector| - s_size=@strata_sizes[s_name] - (s_size**2 * (1-(vector.size.to_f / s_size)) * vector.variance_sample / vector.size.to_f) - } - (1/@population_size.to_f)*Math::sqrt(sum) + sum = @ms.sum_field(field) do |s_name, vector| + s_size = @strata_sizes[s_name] + (s_size**2 * (1 - (vector.size.to_f / s_size)) * vector.variance_sample / vector.size.to_f) + end + (1 / @population_size.to_f) * Math::sqrt(sum) end - + def standard_error_wr(field) - es=@ms.collect_vector(field) {|s_n, vector| - {'N'=>@strata_sizes[s_n],'n'=>vector.size, 's'=>vector.sds} + es = @ms.collect_vector(field) {|s_n, vector| + {'N' => @strata_sizes[s_n], 'n' => vector.size, 's' => vector.sds} } - + StratifiedSample.standard_error_esd_wr(es) end - def proportion_sd_esd_wor(field,v=1) - es=@ms.collect_vector(field) {|s_n, vector| - {'N'=>@strata_sizes[s_n],'n'=>vector.size, 'p'=>vector.proportion(v)} + + def proportion_sd_esd_wor(field, v = 1) + es = @ms.collect_vector(field) {|s_n, vector| + {'N' => @strata_sizes[s_n], 'n' => vector.size, 'p' => vector.proportion(v)} } - + StratifiedSample.proportion_sd_esd_wor(es) end - - def proportion_standard_error(field,v=1) - prop=proportion(field,v) - sum=@ms.sum_field(field) {|s_name,vector| - nh=vector.size - s_size=@strata_sizes[s_name] - (s_size**2 * (1-(nh/s_size)) * prop * (1-prop) / (nh -1 )) + + def proportion_standard_error(field, v = 1) + prop=proportion(field, v) + sum = @ms.sum_field(field) {|s_name, vector| + nh = vector.size + s_size = @strata_sizes[s_name] + (s_size**2 * (1 - (nh / s_size)) * prop * (1 - prop) / (nh -1)) } (1.quo(@population_size)) * Math::sqrt(sum) end # Cochran(1971), p. 150 - def variance_pst(field,v=1) - sum=@ms.datasets.inject(0) {|a,da| - stratum_name=da[0] - ds=da[1] - nh=ds.cases.to_f - s_size=@strata_sizes[stratum_name] - prop=ds[field].proportion(v) - a + (((s_size**2 * (s_size-nh)) / (s_size-1))*(prop*(1-prop) / (nh-1))) - } - (1/@population_size.to_f ** 2)*sum + def variance_pst(field, v = 1) + sum=@ms.datasets.inject(0) do |a, da| + stratum_name = da[0] + ds = da[1] + nh = ds.cases.to_f + s_size = @strata_sizes[stratum_name] + prop = ds[field].proportion(v) + a + (((s_size**2 * (s_size - nh)) / (s_size - 1)) * (prop * (1 - prop) / (nh - 1))) + end + (1 / @population_size.to_f ** 2) * sum end end end diff --git a/lib/statsample/srs.rb b/lib/statsample/srs.rb index a372431..752cba5 100644 --- a/lib/statsample/srs.rb +++ b/lib/statsample/srs.rb @@ -112,7 +112,7 @@ def proportion_total_sd_ep_wor(prop, sam, pop) ######################## # - # :SECTION: Mean stimation + # :SECTION: Mean estimation # ######################## diff --git a/test/test_stratified.rb b/test/test_stratified.rb index eb8ef45..7f0ef8d 100644 --- a/test/test_stratified.rb +++ b/test/test_stratified.rb @@ -3,6 +3,11 @@ class StatsampleStratifiedTestCase < MiniTest::Unit::TestCase def initialize(*args) + @es = [ + {"N" => 5, "s" => 3, "n" => 2}, + {"N" => 10, "s" => 5, "n" => 2}, + {"N" => 30, "s" => 32, "n" => 12} + ] super end def test_mean @@ -14,4 +19,69 @@ def test_mean popv=pop.to_vector(:scale) assert_equal(popv.mean,Statsample::StratifiedSample.mean(av,bv)) end + + def test_standard_error_ksd_wr + # Standard error for known sample design with replacement + es = [ + {"N" => 5, "s" => 3, "n" => 2}, + {"N" => 10, "s" => 5, "n" => 2}, + {"N" => 30, "s" => 32, "n" => 12} + ] + assert_in_delta 6.21279, Statsample::StratifiedSample.standard_error_ksd_wr(es), 0.001 + end + + def test_variance_ksd_wr + # variance of known sample desgin with replacement + result = Statsample::StratifiedSample.variance_ksd_wr(@es) + + assert_in_delta 38.5987, result, 0.001 + assert_in_delta Statsample::StratifiedSample.standard_error_ksd_wr(@es), Math::sqrt(result), 0.001 + end + + def test_calculate_n_total + # calculates total 'n' from the hash + + assert_equal 45, Statsample::StratifiedSample.calculate_n_total(@es) + assert_equal 15, Statsample::StratifiedSample.calculate_n_total(@es.first(2)) + assert_equal 40, Statsample::StratifiedSample.calculate_n_total(@es.last(2)) + end + + def test_variance_ksd_wor + # variance of known sample desgin without replacement + + assert_in_delta 23.2827, Statsample::StratifiedSample.variance_ksd_wor(@es), 0.001 + assert_in_delta 4.7444, Statsample::StratifiedSample.variance_ksd_wor(@es.first(2)), 0.001 + end + + def test_standard_error_ksd_wor + # standard error for known sample design without replacement + # + assert_in_delta 4.8252, Statsample::StratifiedSample.standard_error_ksd_wor(@es), 0.001 + + #for fragment sample design + assert_in_delta 5.4244, Statsample::StratifiedSample.standard_error_ksd_wor(@es.last 2), 0.001 + end + + def test_variance_esd_wor + # variance of estimated sample design without replacement strategry + assert_in_delta 23.2827, Statsample::StratifiedSample.variance_esd_wor(@es), 0.001 + end + + def test_standard_error_esd_wor + # Standard error for estimated sample design without replacement + assert_in_delta 4.82521, Statsample::StratifiedSample.standard_error_esd_wor(@es), 0.001 + end + + def test_standard_error_esd_wr + res = Statsample::StratifiedSample.standard_error_esd_wr(@es) + assert_in_delta 6.21279, res, 0.001 + assert_in_delta Statsample::StratifiedSample.variance_esd_wr(@es), res ** 2, 0.001 + end + + def test_variance_esd_wr + # variance for estimated sample dsign with replacement + + assert_in_delta 38.5987, Statsample::StratifiedSample.variance_esd_wr(@es), 0.001 + end + end From fcfa01ec367ebd66e358d865f5af2ed40b2d87b5 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sat, 22 Jun 2013 09:59:54 +0530 Subject: [PATCH 05/22] corrected doc typo --- lib/statsample/analysis.rb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/statsample/analysis.rb b/lib/statsample/analysis.rb index bf0e7b1..b0036a4 100644 --- a/lib/statsample/analysis.rb +++ b/lib/statsample/analysis.rb @@ -3,7 +3,7 @@ module Statsample # DSL to create analysis without hazzle. - # * Shortcuts methods to avoid use complete namescapes, many based on R + # * Shortcuts methods to avoid use complete namespaces, many based on R # * Attach/detach vectors to workspace, like R # == Example # an1=Statsample::Analysis.store(:first) do From 47ff467f9866798e0917cd75ccb21ad046c3b0d3 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sat, 22 Jun 2013 10:27:31 +0530 Subject: [PATCH 06/22] Regression tests --- test/test_regression2.rb | 235 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 235 insertions(+) create mode 100644 test/test_regression2.rb diff --git a/test/test_regression2.rb b/test/test_regression2.rb new file mode 100644 index 0000000..2c6b9ad --- /dev/null +++ b/test/test_regression2.rb @@ -0,0 +1,235 @@ +require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) + +class StatsampleRegressionTestCase < MiniTest::Unit::TestCase + include Statsample::Shorthand + # context "Example with missing data" do + + def setup + @x=[0.285714285714286, 0.114285714285714, 0.314285714285714, 0.2, 0.2, 0.228571428571429, 0.2, 0.4, 0.714285714285714, 0.285714285714286, 0.285714285714286, 0.228571428571429, 0.485714285714286, 0.457142857142857, 0.257142857142857, 0.228571428571429, 0.285714285714286, 0.285714285714286, 0.285714285714286, 0.142857142857143, 0.285714285714286, 0.514285714285714, 0.485714285714286, 0.228571428571429, 0.285714285714286, 0.342857142857143, 0.285714285714286, 0.0857142857142857].to_scale + + @y=[nil, 0.233333333333333, nil, 0.266666666666667, 0.366666666666667, nil, 0.333333333333333, 0.3, 0.666666666666667, 0.0333333333333333, 0.333333333333333, nil, nil, 0.533333333333333, 0.433333333333333, 0.4, 0.4, 0.5, 0.4, 0.266666666666667, 0.166666666666667, 0.666666666666667, 0.433333333333333, 0.166666666666667, nil, 0.4, 0.366666666666667, nil].to_scale + @ds={'x'=>@x,'y'=>@y}.to_dataset + @lr=Statsample::Regression::Multiple::RubyEngine.new(@ds,'y') + end + + def test_correct_values #should "have correct values" do + assert_in_delta(0.455,@lr.r2,0.001) + assert_in_delta(0.427,@lr.r2_adjusted, 0.001) + assert_in_delta(0.1165,@lr.se_estimate,0.001) + assert_in_delta(15.925,@lr.f,0.0001) + assert_in_delta(0.675, @lr.standarized_coeffs['x'],0.001) + assert_in_delta(0.778, @lr.coeffs['x'],0.001, "coeff x") + assert_in_delta(0.132, @lr.constant,0.001,"constant") + assert_in_delta(0.195, @lr.coeffs_se['x'],0.001,"coeff x se") + assert_in_delta(0.064, @lr.constant_se,0.001,"constant se") + end + #end + def test_error# "return an error if data is linearly dependent" do + samples=100 + + a,b=rand,rand + + x1=samples.times.map { rand}.to_scale + x2=samples.times.map {rand}.to_scale + x3=samples.times.map {|i| x1[i]*(1+a)+x2[i]*(1+b)}.to_scale + y=samples.times.map {|i| x1[i]+x2[i]+x3[i]+rand}.to_scale + + ds={'x1'=>x1,'x2'=>x2,'x3'=>x3,'y'=>y}.to_dataset + + assert_raise(Statsample::Regression::LinearDependency) { + Statsample::Regression::Multiple::RubyEngine.new(ds,'y') + } + end + def test_parameters + @x=[13,20,10,33,15].to_vector(:scale) + @y=[23,18,35,10,27].to_vector(:scale) + reg=Statsample::Regression::Simple.new_from_vectors(@x,@y) + _test_simple_regression(reg) + ds={'x'=>@x,'y'=>@y}.to_dataset + reg=Statsample::Regression::Simple.new_from_dataset(ds,'x','y') + _test_simple_regression(reg) + reg=Statsample::Regression.simple(@x,@y) + _test_simple_regression(reg) + + end + def _test_simple_regression(reg) + + assert_in_delta(40.009, reg.a,0.001) + assert_in_delta(-0.957, reg.b,0.001) + assert_in_delta(4.248, reg.standard_error,0.002) + assert(reg.summary) + end + + def test_summaries + a=10.times.map{rand(100)}.to_scale + b=10.times.map{rand(100)}.to_scale + y=10.times.map{rand(100)}.to_scale + ds={'a'=>a,'b'=>b,'y'=>y}.to_dataset + lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y') + assert(lr.summary.size>0) + end + def test_multiple_dependent + complete=Matrix[ + [1,0.53,0.62,0.19,-0.09,0.08,0.02,-0.12,0.08], + [0.53,1,0.61,0.23,0.1,0.18,0.02,-0.1,0.15], + [0.62,0.61,1,0.03,0.1,0.12,0.03,-0.06,0.12], + [0.19,0.23,0.03,1,-0.02,0.02,0,-0.02,-0.02], + [-0.09,0.1,0.1,-0.02,1,0.05,0.06,0.18,0.02], + [0.08,0.18,0.12,0.02,0.05,1,0.22,-0.07,0.36], + [0.02,0.02,0.03,0,0.06,0.22,1,-0.01,-0.05], + [-0.12,-0.1,-0.06,-0.02,0.18,-0.07,-0.01,1,-0.03], + [0.08,0.15,0.12,-0.02,0.02,0.36,-0.05,-0.03,1]] + complete.extend Statsample::CovariateMatrix + complete.fields=%w{adhd cd odd sex age monly mwork mage poverty} + + lr=Statsample::Regression::Multiple::MultipleDependent.new(complete, %w{adhd cd odd}) + + + assert_in_delta(0.197, lr.r2yx,0.001) + assert_in_delta(0.197, lr.r2yx_covariance,0.001) + assert_in_delta(0.07, lr.p2yx,0.001) + + end + + def test_multiple_regression_pairwise_2 + @a=[1,3,2,4,3,5,4,6,5,7,3,nil,3,nil,3].to_vector(:scale) + @b=[3,3,4,4,5,5,6,6,4,4,2,2,nil,6,2].to_vector(:scale) + @c=[11,22,30,40,50,65,78,79,99,100,nil,3,7,nil,7].to_vector(:scale) + @y=[3,4,5,6,7,8,9,10,20,30,30,40,nil,50,nil].to_vector(:scale) + ds={'a'=>@a,'b'=>@b,'c'=>@c,'y'=>@y}.to_dataset + lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y') + assert_in_delta(2407.436,lr.sst,0.001) + assert_in_delta(0.752,lr.r,0.001, "pairwise r") + assert_in_delta(0.565,lr.r2,0.001) + assert_in_delta(1361.130,lr.ssr,0.001) + assert_in_delta(1046.306,lr.sse,0.001) + assert_in_delta(3.035,lr.f,0.001) + end + + + def test_multiple_regression_gsl + if Statsample.has_gsl? + @a=[1,3,2,4,3,5,4,6,5,7].to_vector(:scale) + @b=[3,3,4,4,5,5,6,6,4,4].to_vector(:scale) + @c=[11,22,30,40,50,65,78,79,99,100].to_vector(:scale) + @y=[3,4,5,6,7,8,9,10,20,30].to_vector(:scale) + ds={'a'=>@a,'b'=>@b,'c'=>@c,'y'=>@y}.to_dataset + lr=Statsample::Regression::Multiple::GslEngine.new(ds,'y') + assert(lr.summary.size>0) + model_test(lr,'gsl') + predicted=[1.7857, 6.0989, 3.2433, 7.2908, 4.9667, 10.3428, 8.8158, 10.4717, 23.6639, 25.3198] + c_predicted=lr.predicted + predicted.each_index{|i| + assert_in_delta(predicted[i],c_predicted[i],0.001) + } + residuals=[1.2142, -2.0989, 1.7566, -1.29085, 2.033, -2.3428, 0.18414, -0.47177, -3.66395, 4.6801] + c_residuals=lr.residuals + residuals.each_index{|i| + assert_in_delta(residuals[i],c_residuals[i],0.001) + } + else + skip "Regression::Multiple::GslEngine not tested (no Gsl)" + end + end + + + + def model_test_matrix(lr,name='undefined') + + stan_coeffs={'a'=>0.151,'b'=>-0.547,'c'=>0.997} + unstan_coeffs={'a'=>0.695, 'b'=>-4.286, 'c'=>0.266} + + unstan_coeffs.each_key{|k| + assert_in_delta(unstan_coeffs[k], lr.coeffs[k],0.001,"b coeffs - #{name}") + } + + stan_coeffs.each_key{|k| + assert_in_delta(stan_coeffs[k], lr.standarized_coeffs[k],0.001, "beta coeffs - #{name}") + } + + assert_in_delta(11.027,lr.constant,0.001) + + assert_in_delta(0.955,lr.r,0.001) + assert_in_delta(0.913,lr.r2,0.001) + + assert_in_delta(20.908, lr.f,0.001) + assert_in_delta(0.001, lr.probability, 0.001) + assert_in_delta(0.226,lr.tolerance("a"),0.001) + + coeffs_se={"a"=>1.171,"b"=>1.129,"c"=>0.072} + + + + ccoeffs_se=lr.coeffs_se + coeffs_se.each_key{|k| + assert_in_delta(coeffs_se[k],ccoeffs_se[k],0.001) + } + coeffs_t={"a"=>0.594,"b"=>-3.796,"c"=>3.703} + ccoeffs_t=lr.coeffs_t + coeffs_t.each_key{|k| + assert_in_delta(coeffs_t[k], ccoeffs_t[k],0.001) + } + + assert_in_delta(639.6,lr.sst,0.001) + assert_in_delta(583.76,lr.ssr,0.001) + assert_in_delta(55.840,lr.sse,0.001) + assert(lr.summary.size>0, "#{name} without summary") + end + def model_test(lr,name='undefined') + model_test_matrix(lr,name) + assert_in_delta(4.559, lr.constant_se,0.001) + assert_in_delta(2.419, lr.constant_t,0.001) + + assert_in_delta(1.785,lr.process([1,3,11]),0.001) + end + def test_regression_matrix + @a=[1,3,2,4,3,5,4,6,5,7].to_vector(:scale) + @b=[3,3,4,4,5,5,6,6,4,4].to_vector(:scale) + @c=[11,22,30,40,50,65,78,79,99,100].to_vector(:scale) + @y=[3,4,5,6,7,8,9,10,20,30].to_vector(:scale) + ds={'a'=>@a,'b'=>@b,'c'=>@c,'y'=>@y}.to_dataset + cor=Statsample::Bivariate.correlation_matrix(ds) + + lr=Statsample::Regression::Multiple::MatrixEngine.new(cor,'y', :y_mean=>@y.mean, :x_mean=>{'a'=>ds['a'].mean, 'b'=>ds['b'].mean, 'c'=>ds['c'].mean}, :cases=>@a.size, :y_sd=>@y.sd , :x_sd=>{'a' => @a.sd, 'b' => @b.sd, 'c' => @c.sd}) + assert_nil(lr.constant_se) + assert_nil(lr.constant_t) + model_test_matrix(lr, "correlation matrix") + + covariance=Statsample::Bivariate.covariance_matrix(ds) + lr=Statsample::Regression::Multiple::MatrixEngine.new(covariance,'y', :y_mean=>@y.mean, :x_mean=>{'a'=>ds['a'].mean, 'b'=>ds['b'].mean, 'c'=>ds['c'].mean}, :cases=>@a.size) + assert(lr.summary.size>0) + + model_test(lr , "covariance matrix") + end + def test_regression_rubyengine + @a=[nil,1,3,2,4,3,5,4,6,5,7].to_vector(:scale) + @b=[nil,3,3,4,4,5,5,6,6,4,4].to_vector(:scale) + @c=[nil,11,22,30,40,50,65,78,79,99,100].to_vector(:scale) + @y=[nil,3,4,5,6,7,8,9,10,20,30].to_vector(:scale) + ds={'a'=>@a,'b'=>@b,'c'=>@c,'y'=>@y}.to_dataset + lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y') + assert_equal(11, lr.total_cases) + assert_equal(10, lr.valid_cases) + model_test(lr, 'rubyengine with missing data') + + predicted=[nil,1.7857, 6.0989, 3.2433, 7.2908, 4.9667, 10.3428, 8.8158, 10.4717, 23.6639, 25.3198] + c_predicted = lr.predicted + predicted.each_index do |i| + if c_predicted[i].nil? + assert(predicted[i].nil?, "Actual #{i} is nil, but expected #{predicted[i]}") + else + assert_in_delta(predicted[i], c_predicted[i], 0.001) + end + end + residuals=[nil,1.2142, -2.0989, 1.7566, -1.29085, 2.033, -2.3428, 0.18414, -0.47177, -3.66395, 4.6801] + c_residuals=lr.residuals + residuals.each_index do |i| + if c_residuals[i].nil? + assert(residuals[i].nil?) + else + assert_in_delta(residuals[i],c_residuals[i],0.001) + end + end + + end +end From 734f3be565090b7168b286fec90a48c07089fdba Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sun, 23 Jun 2013 23:24:46 +0530 Subject: [PATCH 07/22] Repaired F tests, removed shoulda references --- test/test_regression2.rb | 4 +-- test/test_test_f.rb | 60 +++++++++++++++++++++------------------- 2 files changed, 33 insertions(+), 31 deletions(-) diff --git a/test/test_regression2.rb b/test/test_regression2.rb index 2c6b9ad..099c340 100644 --- a/test/test_regression2.rb +++ b/test/test_regression2.rb @@ -12,7 +12,7 @@ def setup @lr=Statsample::Regression::Multiple::RubyEngine.new(@ds,'y') end - def test_correct_values #should "have correct values" do + def test_correct_values assert_in_delta(0.455,@lr.r2,0.001) assert_in_delta(0.427,@lr.r2_adjusted, 0.001) assert_in_delta(0.1165,@lr.se_estimate,0.001) @@ -24,7 +24,7 @@ def test_correct_values #should "have correct values" do assert_in_delta(0.064, @lr.constant_se,0.001,"constant se") end #end - def test_error# "return an error if data is linearly dependent" do + def test_error samples=100 a,b=rand,rand diff --git a/test/test_test_f.rb b/test/test_test_f.rb index b7cc4a8..9145d7a 100644 --- a/test/test_test_f.rb +++ b/test/test_test_f.rb @@ -1,33 +1,35 @@ require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) class StatsampleTestFTestCase < MiniTest::Unit::TestCase - context(Statsample::Test::F) do - setup do - @ssb=84 - @ssw=68 - @df_num=2 - @df_den=15 - @f=Statsample::Test::F.new(@ssb.quo(@df_num),@ssw.quo(@df_den), @df_num, @df_den) - end - should "have #f equal to msb/msw" do - assert_equal((@ssb.quo(@df_num)).quo(@ssw.quo(@df_den)), @f.f) - end - should "have df total equal to df_num+df_den" do - assert_equal(@df_num + @df_den, @f.df_total) - end - should "have probability near 0.002" do - assert_in_delta(0.002, @f.probability, 0.0005) - end - should "be coerced into float" do - assert_equal(@f.to_f, @f.f) - end - - context("method summary") do - setup do - @summary=@f.summary - end - should "have size > 0" do - assert(@summary.size>0) - end - end + def setup + @ssb=84 + @ssw=68 + @df_num=2 + @df_den=15 + @f=Statsample::Test::F.new(@ssb.quo(@df_num),@ssw.quo(@df_den), @df_num, @df_den) + @summary=@f.summary + end + + def test_f_value + #should have #f equal to msb/msw + assert_equal((@ssb.quo(@df_num)).quo(@ssw.quo(@df_den)), @f.f) + end + + def test_df_with_num_and_den + # df total should be equal to df_num+df_den + assert_equal(@df_num + @df_den, @f.df_total) + end + + def test_probability + #should have probability near 0.002 + assert_in_delta(0.002, @f.probability, 0.0005) + end + + def test_float_coercion + #should be coerced into float + assert_equal(@f.to_f, @f.f) + end + + def test_summary_size + assert(@summary.size>0) end end From acc618ef5fd26577e65484024d4086ede4f226d6 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sun, 23 Jun 2013 23:25:38 +0530 Subject: [PATCH 08/22] Prepared gemspec with updated version of dependencies for proper build and devs. To build: gem build statsample.gemspec To install: gem install ./statsample*.gem --- statsample.gemspec | 117 ++++++++++----------------------------------- 1 file changed, 24 insertions(+), 93 deletions(-) diff --git a/statsample.gemspec b/statsample.gemspec index 205f0e2..92bed6b 100644 --- a/statsample.gemspec +++ b/statsample.gemspec @@ -1,98 +1,29 @@ -# -*- encoding: utf-8 -*- - Gem::Specification.new do |s| - s.name = "statsample" - s.version = "1.1.0.20110912141756" - - s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version= - s.authors = ["Claudio Bustos"] - s.date = "2013-19-04" - s.description = "A suite for basic and advanced statistics on Ruby. Tested on Ruby 1.8.7, 1.9.1, 1.9.2, 1.9.3 and JRuby 1.4 (Ruby 1.8.7 compatible).\n\nInclude:\n* Descriptive statistics: frequencies, median, mean, standard error, skew, kurtosis (and many others).\n* Imports and exports datasets from and to Excel, CSV and plain text files.\n* Correlations: Pearson's r, Spearman's rank correlation (rho), point biserial, tau a, tau b and gamma. Tetrachoric and Polychoric correlation provides by +statsample-bivariate-extension+ gem.\n* Intra-class correlation\n* Anova: generic and vector-based One-way ANOVA and Two-way ANOVA, with contrasts for One-way ANOVA.\n* Tests: F, T, Levene, U-Mannwhitney.\n* Regression: Simple, Multiple (OLS), Probit and Logit\n* Factorial Analysis: Extraction (PCA and Principal Axis), Rotation (Varimax, Equimax, Quartimax) and Parallel Analysis and Velicer's MAP test, for estimation of number of factors.\n* Reliability analysis for simple scale and a DSL to easily analyze multiple scales using factor analysis and correlations, if you want it.\n* Dominance Analysis, with multivariate dependent and bootstrap (Azen & Budescu)\n* Sample calculation related formulas\n* Structural Equation Modeling (SEM), using R libraries +sem+ and +OpenMx+\n* Creates reports on text, html and rtf, using ReportBuilder gem\n* Graphics: Histogram, Boxplot and Scatterplot" - s.email = ["clbustos@gmail.com"] - s.executables = ["statsample"] - s.extra_rdoc_files = ["History.txt", "LICENSE.txt", "Manifest.txt", "README.txt", "references.txt"] - s.files = ["History.txt", "LICENSE.txt", "Manifest.txt", "README.txt", "Rakefile", "benchmarks/correlation_matrix_15_variables.rb", "benchmarks/correlation_matrix_5_variables.rb", "benchmarks/correlation_matrix_methods/correlation_matrix.ds", "benchmarks/correlation_matrix_methods/correlation_matrix.html", "benchmarks/correlation_matrix_methods/correlation_matrix.rb", "benchmarks/correlation_matrix_methods/correlation_matrix.xls", "benchmarks/correlation_matrix_methods/correlation_matrix_gsl_ruby.ods", "benchmarks/correlation_matrix_methods/correlation_matrix_with_graphics.ods", "benchmarks/correlation_matrix_methods/results.ds", "benchmarks/factor_map.rb", "benchmarks/helpers_benchmark.rb", "bin/statsample", "data/locale/es/LC_MESSAGES/statsample.mo", "doc_latex/manual/equations.tex", "examples/boxplot.rb", "examples/correlation_matrix.rb", "examples/dataset.rb", "examples/dominance_analysis.rb", "examples/dominance_analysis_bootstrap.rb", "examples/histogram.rb", "examples/icc.rb", "examples/levene.rb", "examples/multiple_regression.rb", "examples/multivariate_correlation.rb", "examples/parallel_analysis.rb", "examples/polychoric.rb", "examples/principal_axis.rb", "examples/reliability.rb", "examples/scatterplot.rb", "examples/t_test.rb", "examples/tetrachoric.rb", "examples/u_test.rb", "examples/vector.rb", "examples/velicer_map_test.rb", "grab_references.rb", "lib/spss.rb", "lib/statsample.rb", "lib/statsample/analysis.rb", "lib/statsample/anova.rb", "lib/statsample/anova/contrast.rb", "lib/statsample/anova/oneway.rb", "lib/statsample/anova/twoway.rb", "lib/statsample/bivariate.rb", "lib/statsample/bivariate/pearson.rb", "lib/statsample/codification.rb", "lib/statsample/converter/csv.rb", "lib/statsample/converter/spss.rb", "lib/statsample/converters.rb", "lib/statsample/crosstab.rb", "lib/statsample/dataset.rb", "lib/statsample/dominanceanalysis.rb", "lib/statsample/dominanceanalysis/bootstrap.rb", "lib/statsample/factor.rb", "lib/statsample/factor/map.rb", "lib/statsample/factor/parallelanalysis.rb", "lib/statsample/factor/pca.rb", "lib/statsample/factor/principalaxis.rb", "lib/statsample/factor/rotation.rb", "lib/statsample/graph.rb", "lib/statsample/graph/boxplot.rb", "lib/statsample/graph/histogram.rb", "lib/statsample/graph/scatterplot.rb", "lib/statsample/histogram.rb", "lib/statsample/matrix.rb", "lib/statsample/mle.rb", "lib/statsample/mle/logit.rb", "lib/statsample/mle/normal.rb", "lib/statsample/mle/probit.rb", "lib/statsample/multiset.rb", "lib/statsample/regression.rb", "lib/statsample/regression/binomial.rb", "lib/statsample/regression/binomial/logit.rb", "lib/statsample/regression/binomial/probit.rb", "lib/statsample/regression/multiple.rb", "lib/statsample/regression/multiple/alglibengine.rb", "lib/statsample/regression/multiple/baseengine.rb", "lib/statsample/regression/multiple/gslengine.rb", "lib/statsample/regression/multiple/matrixengine.rb", "lib/statsample/regression/multiple/rubyengine.rb", "lib/statsample/regression/simple.rb", "lib/statsample/reliability.rb", "lib/statsample/reliability/icc.rb", "lib/statsample/reliability/multiscaleanalysis.rb", "lib/statsample/reliability/scaleanalysis.rb", "lib/statsample/reliability/skillscaleanalysis.rb", "lib/statsample/resample.rb", "lib/statsample/rserve_extension.rb", "lib/statsample/shorthand.rb", "lib/statsample/srs.rb", "lib/statsample/test.rb", "lib/statsample/test/bartlettsphericity.rb", "lib/statsample/test/chisquare.rb", "lib/statsample/test/f.rb", "lib/statsample/test/kolmogorovsmirnov.rb", "lib/statsample/test/levene.rb", "lib/statsample/test/t.rb", "lib/statsample/test/umannwhitney.rb", "lib/statsample/vector.rb", "lib/statsample/vector/gsl.rb", "po/es/statsample.mo", "po/es/statsample.po", "po/statsample.pot", "references.txt", "setup.rb", "test/fixtures/bank2.dat", "test/fixtures/correlation_matrix.rb", "test/fixtures/crime.txt", "test/fixtures/hartman_23.matrix", "test/fixtures/repeated_fields.csv", "test/fixtures/test_binomial.csv", "test/fixtures/test_csv.csv", "test/fixtures/test_xls.xls", "test/fixtures/tetmat_matrix.txt", "test/fixtures/tetmat_test.txt", "test/helpers_tests.rb", "test/test_analysis.rb", "test/test_anova_contrast.rb", "test/test_anovaoneway.rb", "test/test_anovatwoway.rb", "test/test_anovatwowaywithdataset.rb", "test/test_anovawithvectors.rb", "test/test_bartlettsphericity.rb", "test/test_bivariate.rb", "test/test_codification.rb", "test/test_crosstab.rb", "test/test_csv.rb", "test/test_dataset.rb", "test/test_dominance_analysis.rb", "test/test_factor.rb", "test/test_factor_map.rb", "test/test_factor_pa.rb", "test/test_ggobi.rb", "test/test_gsl.rb", "test/test_histogram.rb", "test/test_logit.rb", "test/test_matrix.rb", "test/test_mle.rb", "test/test_multiset.rb", "test/test_regression.rb", "test/test_reliability.rb", "test/test_reliability_icc.rb", "test/test_reliability_skillscale.rb", "test/test_resample.rb", "test/test_rserve_extension.rb", "test/test_srs.rb", "test/test_statistics.rb", "test/test_stest.rb", "test/test_stratified.rb", "test/test_test_f.rb", "test/test_test_kolmogorovsmirnov.rb", "test/test_test_t.rb", "test/test_umannwhitney.rb", "test/test_vector.rb", "test/test_xls.rb", "web/Rakefile"] - s.homepage = "http://ruby-statsample.rubyforge.org/" - s.post_install_message = "***************************************************\nThanks for installing statsample.\n\nOn *nix, you could install statsample-optimization\nto retrieve gems gsl, statistics2 and a C extension\nto speed some methods.\n\n $ sudo gem install statsample-optimization\n\nOn Ubuntu, install build-essential and libgsl0-dev \nusing apt-get. Compile ruby 1.9x from \nsource code first.\n\n $ sudo apt-get install build-essential libgsl0-dev\n\n\n*****************************************************\n" - s.rdoc_options = ["--main", "README.txt"] - s.require_paths = ["lib"] - s.rubyforge_project = "ruby-statsample" - s.rubygems_version = "1.8.10" - s.summary = "A suite for basic and advanced statistics on Ruby" - s.test_files = ["test/test_ggobi.rb", "test/test_vector.rb", "test/test_crosstab.rb", "test/test_bivariate.rb", "test/test_dominance_analysis.rb", "test/test_analysis.rb", "test/test_dataset.rb", "test/test_test_t.rb", "test/test_regression.rb", "test/test_reliability_icc.rb", "test/test_histogram.rb", "test/test_multiset.rb", "test/test_logit.rb", "test/test_xls.rb", "test/test_test_kolmogorovsmirnov.rb", "test/test_factor_map.rb", "test/test_stratified.rb", "test/test_rserve_extension.rb", "test/test_reliability_skillscale.rb", "test/test_anova_contrast.rb", "test/test_anovaoneway.rb", "test/test_matrix.rb", "test/test_reliability.rb", "test/test_anovatwowaywithdataset.rb", "test/test_factor_pa.rb", "test/test_factor.rb", "test/test_resample.rb", "test/test_anovawithvectors.rb", "test/test_bartlettsphericity.rb", "test/test_statistics.rb", "test/test_stest.rb", "test/test_srs.rb", "test/test_test_f.rb", "test/test_csv.rb", "test/test_anovatwoway.rb", "test/test_gsl.rb", "test/test_mle.rb", "test/test_umannwhitney.rb", "test/test_codification.rb"] + s.name = 'statsample' + s.version = '1.1.0.2013' + s.date = '2013-04-28' + s.summary = 'StatSample - Ruby Statistical library' + s.description = "A suite for basic and advanced statistics on Ruby. Tested on Ruby 1.8.7 1.9.1, 1.9.2 and 1.9.3, JRuby 1.4(Ruby 1.8.7 compatible)." + s.authors = ["Claudio Bustos"] + s.email = ["clbustos@gmail.com"] + s.executables = `git ls-files -- bin/*`.split("\n").map { |f| File.basename(f) } + s.files = `git ls-files -- lib/*`.split("\n") + s.extra_rdoc_files += ['History.txt', 'Manifest.txt', 'README.md', 'references.txt'] - if s.respond_to? :specification_version then - s.specification_version = 3 + s.homepage = "http://github.com/clbustos/statsample" + s.require_path = "lib" + s.rubyforge_project = "ruby-statsample" + s.test_files = `git ls-files -- {test}/*`.split("\n") - if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then - s.add_runtime_dependency(%q, ["~> 0.6.5"]) - s.add_runtime_dependency(%q, ["~> 1.4"]) - s.add_runtime_dependency(%q, ["~> 0.2.0"]) - s.add_runtime_dependency(%q, ["> 0"]) - s.add_runtime_dependency(%q, ["~> 0.0"]) - s.add_runtime_dependency(%q, ["~> 0.3.1"]) - s.add_runtime_dependency(%q, ["> 0"]) - s.add_runtime_dependency(%q, ["~> 0.2.5"]) - s.add_runtime_dependency(%q, ["~> 0.5.0"]) - s.add_runtime_dependency(%q, ["~> 0.3"]) - s.add_development_dependency(%q, ["~> 0"]) - s.add_development_dependency(%q, ["~> 0"]) - s.add_development_dependency(%q, ["~> 2.0"]) - s.add_development_dependency(%q, ["~> 0"]) - s.add_development_dependency(%q, ["~> 2.3.8"]) - s.add_development_dependency(%q, ["~> 0"]) - s.add_development_dependency(%q, ["~> 1.5.0"]) - s.add_development_dependency(%q, ["~> 2.12"]) - s.add_development_dependency(%q, ["~> 2.13.0"]) - s.add_development_dependency(%q, ["~> 2.13.1"]) - s.add_development_dependency(%q, ["~> 2.13.0"]) - s.add_development_dependency(%q, ["~> 2.13.0"]) - else - s.add_dependency(%q, ["~> 0.6.5"]) - s.add_dependency(%q, ["~> 1.4"]) - s.add_dependency(%q, ["~> 0.2.0"]) - s.add_dependency(%q, ["> 0"]) - s.add_dependency(%q, ["~> 0.0"]) - s.add_dependency(%q, ["~> 0.3.1"]) - s.add_dependency(%q, ["> 0"]) - s.add_dependency(%q, ["~> 0.2.5"]) - s.add_dependency(%q, ["~> 0.5.0"]) - s.add_dependency(%q, ["~> 0.3"]) - s.add_dependency(%q, ["~> 0"]) - s.add_dependency(%q, ["~> 0"]) - s.add_dependency(%q, ["~> 2.0"]) - s.add_dependency(%q, ["~> 0"]) - s.add_dependency(%q, ["~> 2.3.8"]) - s.add_dependency(%q, ["~> 0"]) - s.add_dependency(%q, ["~> 1.5.0"]) - s.add_dependency(%q, ["~> 2.12"]) - s.add_dependency(%q, ["~> 2.13.0"]) - s.add_dependency(%q, ["~> 2.13.1"]) - s.add_dependency(%q, ["~> 2.13.0"]) - s.add_dependency(%q, ["~> 2.13.0"]) - end - else - s.add_dependency(%q, ["~> 0.6.5"]) - s.add_dependency(%q, ["~> 1.4"]) - s.add_dependency(%q, ["~> 0.2.0"]) - s.add_dependency(%q, ["> 0"]) - s.add_dependency(%q, ["~> 0.0"]) - s.add_dependency(%q, ["~> 0.3.1"]) - s.add_dependency(%q, ["> 0"]) - s.add_dependency(%q, ["~> 0.2.5"]) - s.add_dependency(%q, ["~> 0.5.0"]) - s.add_dependency(%q, ["~> 0.3"]) - s.add_dependency(%q, ["~> 0"]) - s.add_dependency(%q, ["~> 0"]) - s.add_dependency(%q, ["~> 2.0"]) - s.add_dependency(%q, ["~> 0"]) - s.add_dependency(%q, ["~> 2.3.8"]) - s.add_dependency(%q, ["~> 0"]) - s.add_dependency(%q, ["~> 1.5.0"]) - s.add_dependency(%q, ["~> 2.12"]) - s.add_dependency(%q, ["~> 2.13.0"]) - s.add_dependency(%q, ["~> 2.13.1"]) - s.add_dependency(%q, ["~> 2.13.0"]) - s.add_dependency(%q, ["~> 2.13.0"]) + DEPENDENCIES = [{:gem => "spreadsheet", :version => "~> 0.8.5"}, {:gem => "reportbuilder", :version => "~> 1.4.2"}, + {:gem => "minimization", :version => "~> 0.2.1"}, {:gem => "fastercsv", :version => "~> 1.5.5"}, + {:gem => "dirty-memoize", :version => "~> 0.0.4"}, {:gem => "extendmatrix", :version => "~> 0.3.1"}, + {:gem => "statsample-bivariate-extension", :version => "~> 1.1.0"}, {:gem => "rserve-client", :version => "~> 0.3.0"}, + {:gem => "rubyvis", :version => "~> 0.5.2"}, {:gem => "gettext", :version => "~> 2.3.9"}, + {:gem => "mocha", :version => "~> 0.14.0"}, {:gem => "hoe-git", :version => "~> 1.5.0"}, + {:gem => "minitest", :version => "~> 5.0.5"}, {:gem => "shoulda", :version => "~> 3.5.0"}, + ] + DEPENDENCIES.each do |dependency| + s.add_dependency(dependency[:gem], dependency[:version]) end end From fd06759440604eb64f3a47a628d7d3693f6f25fb Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sun, 30 Jun 2013 05:52:23 +0530 Subject: [PATCH 09/22] Wald test implemented --- lib/statsample/timeseries.rb | 4 +-- test/test_wald.rb | 53 ++++++++++++++++++++++++++++++++++++ 2 files changed, 55 insertions(+), 2 deletions(-) create mode 100644 test/test_wald.rb diff --git a/lib/statsample/timeseries.rb b/lib/statsample/timeseries.rb index d9e7aad..eada4a5 100644 --- a/lib/statsample/timeseries.rb +++ b/lib/statsample/timeseries.rb @@ -2,8 +2,8 @@ module Statsample::TimeSeriesShorthands # Creates a new Statsample::TimeSeries object # Argument should be equal to TimeSeries.new def to_time_series(*args) - Statsample::TimeSeries::TimeSeries.new(self, :scale, *args) - end + Statsample::TimeSeries::TimeSeries.new(self, :scale, *args) + end alias :to_ts :to_time_series end diff --git a/test/test_wald.rb b/test/test_wald.rb new file mode 100644 index 0000000..73a3459 --- /dev/null +++ b/test/test_wald.rb @@ -0,0 +1,53 @@ +require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) + +class StatsampleTestWaldTest < MiniTest::Unit::TestCase + include Statsample::Shorthand + include Statsample::TimeSeries + def setup + @wald = 100.times.map do + rand(100) + end.to_ts + end + + def generate_acf_series(lags) + acf_series = @wald.acf(lags) + acf_series.map do |x| + x ** 2 + end.inject(:+) + end + + def compute_chi_df(lags) + observed = Matrix[@wald.acf(lags)] + Statsample::Test.chi_square(observed).df + end + + def test_with_5_lags + assert_equal (generate_acf_series(5)/5).to_i, compute_chi_df(5) + assert_equal generate_acf_series(5), @wald.acf(5).map { |x| x ** 2 }.inject(:+) + end + + def test_with_10_lags + assert_equal (generate_acf_series(5)/5).to_i, compute_chi_df(5) + end + + def test_with_15_lags + assert_equal (generate_acf_series(5)/5).to_i, compute_chi_df(5) + end + + def test_with_20_lags + assert_equal (generate_acf_series(5)/5).to_i, compute_chi_df(5) + end + + def test_with_different_series + @wald = 200.times.map { rand }.to_ts + acf_series = @wald.acf(10) + lhs = acf_series.each { |x| x ** 2 }.inject(:+) + rhs = Statsample::Test.chi_square(Matrix[acf_series]).df + assert_equal (lhs/10).to_i, rhs + end + + def test_pacf + #TODO + end +end + From 67aa8c98d8c6c8a6fd5ae9eec88332e4c9a1673b Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sun, 30 Jun 2013 06:17:27 +0530 Subject: [PATCH 10/22] Implemented wald tests for acf mean, variance and distributed chi-square --- test/test_wald.rb | 41 ++++++++++++++++++++++++++++++++++++----- 1 file changed, 36 insertions(+), 5 deletions(-) diff --git a/test/test_wald.rb b/test/test_wald.rb index 73a3459..30fb87b 100644 --- a/test/test_wald.rb +++ b/test/test_wald.rb @@ -1,6 +1,8 @@ require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) class StatsampleTestWaldTest < MiniTest::Unit::TestCase + + # Wald include Statsample::Shorthand include Statsample::TimeSeries def setup @@ -10,6 +12,19 @@ def setup end def generate_acf_series(lags) + @wald.acf(lags) + end + + def compute_mean(acf) + acf.inject(:+) / acf.size + end + + def compute_variance(acf) + result = acf.map { |x| x ** 2 }.inject(:+) / acf.size + return result.to_f + end + + def generate_acf_summation(lags) acf_series = @wald.acf(lags) acf_series.map do |x| x ** 2 @@ -22,20 +37,36 @@ def compute_chi_df(lags) end def test_with_5_lags - assert_equal (generate_acf_series(5)/5).to_i, compute_chi_df(5) - assert_equal generate_acf_series(5), @wald.acf(5).map { |x| x ** 2 }.inject(:+) + acf = generate_acf_series(5) + assert_equal generate_acf_summation(5), @wald.acf(5).map { |x| x ** 2 }.inject(:+) + assert_in_delta compute_mean(acf).to_i, 0 + assert_in_delta compute_variance(acf), 1.0/acf.size, 0.1 + assert_equal (generate_acf_summation(5)/5).to_i, compute_chi_df(5) + end def test_with_10_lags - assert_equal (generate_acf_series(5)/5).to_i, compute_chi_df(5) + acf = generate_acf_series(10) + assert_equal generate_acf_summation(10), @wald.acf(10).map { |x| x ** 2 }.inject(:+) + assert_in_delta compute_mean(acf).to_i, 0 + assert_in_delta compute_variance(acf), 1.0/acf.size, 0.1 + assert_equal (generate_acf_summation(10)/10).to_i, compute_chi_df(10) end def test_with_15_lags - assert_equal (generate_acf_series(5)/5).to_i, compute_chi_df(5) + acf = generate_acf_series(15) + assert_equal generate_acf_summation(15), @wald.acf(15).map { |x| x ** 2 }.inject(:+) + assert_in_delta compute_mean(acf).to_i, 0 + assert_in_delta compute_variance(acf), 1.0/acf.size, 0.1 + assert_equal (generate_acf_summation(15)/15).to_i, compute_chi_df(15) end def test_with_20_lags - assert_equal (generate_acf_series(5)/5).to_i, compute_chi_df(5) + acf = generate_acf_series(20) + assert_equal generate_acf_summation(20), @wald.acf(20).map { |x| x ** 2 }.inject(:+) + assert_in_delta compute_mean(acf).to_i, 0 + assert_in_delta compute_variance(acf), 1.0/acf.size, 0.1 + assert_equal (generate_acf_summation(20)/20).to_i, compute_chi_df(20) end def test_with_different_series From ac99e293c0d89e21a638780db75abca818fc55d0 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Mon, 1 Jul 2013 20:38:36 +0530 Subject: [PATCH 11/22] Wald test. To test acf and fit on an ARIMA model. --- test/test_wald2.rb | 66 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) create mode 100644 test/test_wald2.rb diff --git a/test/test_wald2.rb b/test/test_wald2.rb new file mode 100644 index 0000000..7f59e2c --- /dev/null +++ b/test/test_wald2.rb @@ -0,0 +1,66 @@ +require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) + +class StatsampleWaldTest < MiniTest::Unit::TestCase + # Wald test is useful to test a series of n acf with Chi-square + # degree of freedom. It is extremely useful to test fit the fit of + # an ARIMA model to test the residuals. + + include Statsample::TimeSeries + include Statsample::Shorthand + include Distribution + + def setup + #create time series to evaluate later + @wald = 100.times.map { rand(100) }.to_ts + end + + def sum_of_squares_of_acf_series(lags) + #perform sum of squares for a series of acf with specified lags + acf = @wald.acf(lags) + return acf.map { |x| x ** 2 }.inject(:+) + end + + def chisquare_cdf(sum_of_squares, lags) + 1 - ChiSquare.cdf(sum_of_squares, lags) + end + + + def test_wald_with_5_lags + #number of lags for acf = 5 + lags = 5 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + end + + + def test_wald_with_10_lags + #number of lags for acf = 10 + lags = 10 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + end + + + def test_wald_with_15_lags + #number of lags for acf = 15 + lags = 15 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + end + + + def test_wald_with_20_lags + #number of lags for acf = 20 + lags = 20 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + end + + + def test_wald_with_25_lags + #number of lags for acf = 25 + lags = 25 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + end +end From 99b31c5ae2cf2835922c9b013fbefca00b450324 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Wed, 3 Jul 2013 12:39:52 +0530 Subject: [PATCH 12/22] Implementing pacf with yule walker for unbiased and mle method. --- lib/statsample/timeseries.rb | 90 ++++++++++++++++++++++++++++++++++++ test/test_wald2.rb | 5 ++ 2 files changed, 95 insertions(+) diff --git a/lib/statsample/timeseries.rb b/lib/statsample/timeseries.rb index eada4a5..0ed56eb 100644 --- a/lib/statsample/timeseries.rb +++ b/lib/statsample/timeseries.rb @@ -45,6 +45,96 @@ def acf maxlags = nil end end + + def yule_walker(k = 1, method='yw') + #From the series, estimates AR(p)(autoregressive) parameter + #using Yule-Waler equation. See - + #http://en.wikipedia.org/wiki/Autoregressive_moving_average_model + + #parameters: + #ts = series + #k = order, default = 1 + #method = can be 'yw' or 'mle'. If 'yw' then it is unbiased, denominator + #is (n - k) + + #returns: + #rho => autoregressive coefficients + ts = self #timeseries + ts -= ts.mean + n = ts.size + if method.downcase.eql? 'yw' + #unbiased => denominator = (n - k) + denom =->(k) { n - k } + else + #mle + #denominator => (n) + denom =->(k) { n } + end + r = Array.new(k + 1) { 0.0 } + r[0] = ts.map { |x| x ** 2 }.inject(:+) / denom.call(0) + + for k in (1..(k+1)) + r[k] = (ts[0...-k] * ts[k...ts.size]).inject(:+) / denom.call(k) + end + + R = toeplitz(r[0...-1]) + + #TODO: return the solved matrix equation of (R, r[1..r.size]) + #pending + end + + + def toeplitz(arr) + #Generates Toeplitz matrix - http://en.wikipedia.org/wiki/Toeplitz_matrix + #=> arr = [0, 1, 2, 3] + #=> result: + #[[0, 1, 2, 3], + # [1, 0, 1, 2], + # [2, 1, 0, 1], + # [3, 2, 1, 0]] + eplitz_matrix = Array.new(arr.size) { Array.new(arr.size) } + + 0.upto(arr.size - 1) do |i| + j = 0 + index = i + while i >= 0 do + eplitz_matrix[index][j] = arr[i] + j += 1 + i -= 1 + end + i = index + 1; k = 1 + while i < arr.size do + eplitz_matrix[index][j] = arr[k] + i += 1; j += 1; k += 1 + end + end + + return eplitz_matrix + end + + + def pacf_yw(maxlags, method = 'yw') + #partial autocorrelation by yule walker equations. + #Inspiration: StatsModels + xm = self - self.mean + pacf = [1] + (1..maxlags).map do |i| + pacf << yule_walker(i, method)[0][-1] + end + return pacf + end + + + def pacf(maxlags = nil, method = 'yw') + #parameters: + #maxlags => maximum number of lags for pcf + #method => for autocovariance in yule_walker: + #'yw' for 'yule-walker unbaised', 'mle' for biased maximum likelihood + + maxlags ||= (10 * Math.log10(size)).to_i + return pacf_yw(maxlags, method) + end + # Lags the series by k periods. # # The convention is to set the oldest observations (the first ones diff --git a/test/test_wald2.rb b/test/test_wald2.rb index 7f59e2c..1a3eecd 100644 --- a/test/test_wald2.rb +++ b/test/test_wald2.rb @@ -30,6 +30,7 @@ def test_wald_with_5_lags lags = 5 sum_of_squares = sum_of_squares_of_acf_series(lags) assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 end @@ -38,6 +39,7 @@ def test_wald_with_10_lags lags = 10 sum_of_squares = sum_of_squares_of_acf_series(lags) assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 end @@ -46,6 +48,7 @@ def test_wald_with_15_lags lags = 15 sum_of_squares = sum_of_squares_of_acf_series(lags) assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 end @@ -54,6 +57,7 @@ def test_wald_with_20_lags lags = 20 sum_of_squares = sum_of_squares_of_acf_series(lags) assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 end @@ -62,5 +66,6 @@ def test_wald_with_25_lags lags = 25 sum_of_squares = sum_of_squares_of_acf_series(lags) assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 end end From 2baf8a094a02cfc338246d019d673831967d939e Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Fri, 5 Jul 2013 22:21:54 +0530 Subject: [PATCH 13/22] Implemented pacf with yule-walker. --- lib/statsample/timeseries.rb | 43 +++++++++++++++++++++++++----------- statsample.gemspec | 2 +- 2 files changed, 31 insertions(+), 14 deletions(-) diff --git a/lib/statsample/timeseries.rb b/lib/statsample/timeseries.rb index 0ed56eb..11fa6d2 100644 --- a/lib/statsample/timeseries.rb +++ b/lib/statsample/timeseries.rb @@ -46,7 +46,7 @@ def acf maxlags = nil end - def yule_walker(k = 1, method='yw') + def yule_walker(series, k = 1, method='yw') #From the series, estimates AR(p)(autoregressive) parameter #using Yule-Waler equation. See - #http://en.wikipedia.org/wiki/Autoregressive_moving_average_model @@ -59,8 +59,8 @@ def yule_walker(k = 1, method='yw') #returns: #rho => autoregressive coefficients - ts = self #timeseries - ts -= ts.mean + ts = series #timeseries + ts = ts - ts.mean n = ts.size if method.downcase.eql? 'yw' #unbiased => denominator = (n - k) @@ -71,21 +71,39 @@ def yule_walker(k = 1, method='yw') denom =->(k) { n } end r = Array.new(k + 1) { 0.0 } - r[0] = ts.map { |x| x ** 2 }.inject(:+) / denom.call(0) + r[0] = ts.map { |x| x ** 2 }.inject(:+).to_f / denom.call(0).to_f - for k in (1..(k+1)) - r[k] = (ts[0...-k] * ts[k...ts.size]).inject(:+) / denom.call(k) + for l in (1..k) + r[k] = (ts[0...-l].zip(ts[l...ts.size])).map do |x| + x.inject(:*) + end.inject(:+).to_f / denom.call(l).to_f end - R = toeplitz(r[0...-1]) + r_R = toeplitz(r[0...-1]) - #TODO: return the solved matrix equation of (R, r[1..r.size]) - #pending + mat = Matrix.columns(r_R).inverse() + solve_matrix(mat, r[1..r.size]) + end + + def solve_matrix(matrix, out_vector) + solution_vector = Array.new(out_vector.size,0 ) + matrix = matrix.to_a + k = 0 + matrix.each do |row| + row.each_with_index do |element, i| + solution_vector[k] += element * 1.0 * out_vector[i] + end + k += 1 + end + return solution_vector end def toeplitz(arr) - #Generates Toeplitz matrix - http://en.wikipedia.org/wiki/Toeplitz_matrix + #Generates Toeplitz matrix - + #http://en.wikipedia.org/wiki/Toeplitz_matrix + #Toeplitz matrix are equal when they are stored in row & + #column major #=> arr = [0, 1, 2, 3] #=> result: #[[0, 1, 2, 3], @@ -116,10 +134,9 @@ def toeplitz(arr) def pacf_yw(maxlags, method = 'yw') #partial autocorrelation by yule walker equations. #Inspiration: StatsModels - xm = self - self.mean - pacf = [1] + pacf = [1.0] (1..maxlags).map do |i| - pacf << yule_walker(i, method)[0][-1] + pacf << yule_walker(self, i, method)[-1] end return pacf end diff --git a/statsample.gemspec b/statsample.gemspec index 92bed6b..de16f4d 100644 --- a/statsample.gemspec +++ b/statsample.gemspec @@ -21,7 +21,7 @@ Gem::Specification.new do |s| {:gem => "statsample-bivariate-extension", :version => "~> 1.1.0"}, {:gem => "rserve-client", :version => "~> 0.3.0"}, {:gem => "rubyvis", :version => "~> 0.5.2"}, {:gem => "gettext", :version => "~> 2.3.9"}, {:gem => "mocha", :version => "~> 0.14.0"}, {:gem => "hoe-git", :version => "~> 1.5.0"}, - {:gem => "minitest", :version => "~> 5.0.5"}, {:gem => "shoulda", :version => "~> 3.5.0"}, + {:gem => "minitest", :version => "~> 4.2"}, {:gem => "shoulda", :version => "~> 3.5.0"}, ] DEPENDENCIES.each do |dependency| s.add_dependency(dependency[:gem], dependency[:version]) From 935e107763d039e1989a14b5e37bd485dd0b045a Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sat, 6 Jul 2013 02:31:56 +0530 Subject: [PATCH 14/22] Corrected a loop mistake in pacf implementation. Completed and verified working on pacf - yule-walker unbiased and mle. :) Next : tests --- lib/statsample/timeseries.rb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/statsample/timeseries.rb b/lib/statsample/timeseries.rb index 11fa6d2..770c308 100644 --- a/lib/statsample/timeseries.rb +++ b/lib/statsample/timeseries.rb @@ -74,7 +74,7 @@ def yule_walker(series, k = 1, method='yw') r[0] = ts.map { |x| x ** 2 }.inject(:+).to_f / denom.call(0).to_f for l in (1..k) - r[k] = (ts[0...-l].zip(ts[l...ts.size])).map do |x| + r[l] = (ts[0...-l].zip(ts[l...ts.size])).map do |x| x.inject(:*) end.inject(:+).to_f / denom.call(l).to_f end From d83c19bc83107cf57b759f165e264b84820caad4 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sat, 6 Jul 2013 13:44:39 +0530 Subject: [PATCH 15/22] Additional comments on pacf usage documented --- lib/statsample/timeseries.rb | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/lib/statsample/timeseries.rb b/lib/statsample/timeseries.rb index 770c308..5804efe 100644 --- a/lib/statsample/timeseries.rb +++ b/lib/statsample/timeseries.rb @@ -141,7 +141,13 @@ def pacf_yw(maxlags, method = 'yw') return pacf end - + ## USAGE ## + ## series = (1..100).map { rand(100) }.to_ts + # series.pacf(10, 'yw') + # => yule-walker unbiased partial-autocorrelation + # series.pacf(10, 'mle') + # => yule-walker mle pacf + # def pacf(maxlags = nil, method = 'yw') #parameters: #maxlags => maximum number of lags for pcf From 1efd7fcbd188dd230299b670af81d66fcb100ba2 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sat, 6 Jul 2013 21:01:44 +0530 Subject: [PATCH 16/22] Initial tests for pacf --- test/test_pacf.rb | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 test/test_pacf.rb diff --git a/test/test_pacf.rb b/test/test_pacf.rb new file mode 100644 index 0000000..981a545 --- /dev/null +++ b/test/test_pacf.rb @@ -0,0 +1,42 @@ +require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) +class StatsampleTimeSeriesPacfTestCase < MiniTest::Unit::TestCase + context(Statsample::TimeSeries) do + include Statsample::TimeSeries + setup do + @ts = (1..20).map { |x| x * 10 }.to_ts + #setting up a proc to get a closure for pacf calling with variable lags and methods + @pacf_proc =->(k, method) { @ts.pacf(k, method) } + end + + should "return correct correct pacf size for lags = 5" do + assert_equal @pacf_proc.call(5, 'yw').size, 6 + assert_equal @pacf_proc.call(5, 'mle').size, 6 + #first element is 1.0 + end + + should "return correct correct pacf size for lags = 10" do + assert_equal @pacf_proc.call(10, 'yw').size, 11 + assert_equal @pacf_proc.call(10, 'mle').size, 11 + #first element is 1.0 + end + + should "have first element as 1.0" do + assert_equal @pacf_proc.call(10, 'yw')[0], 1.0 + assert_equal @pacf_proc.call(10, 'mle')[0], 1.0 + end + + should "give correct pacf results for unbiased yule-walker" do + result_10 = [1.0, 0.8947368421052632, -0.10582010582010604, -0.11350188273265083, -0.12357534824820737, -0.13686534216335522, -0.15470588235294147, -0.17938011883732036, -0.2151192288178601, -0.2707082833133261, -0.3678160919540221] + result_5 = [1.0, 0.8947368421052632, -0.10582010582010604, -0.11350188273265083, -0.12357534824820737, -0.13686534216335522] + assert_equal @pacf_proc.call(10, 'yw'), result_10 + assert_equal @pacf_proc.call(5, 'yw'), result_5 + end + + should "give correct pacf results for mle yule-walker" do + result_10 = [1.0, 0.85, -0.07566212829370711, -0.07635069706072706, -0.07698628638512295, -0.07747034005560738, -0.0776780981161499, -0.07744984679625189, -0.0765803323191094, -0.07480650005932366, -0.07179435184923755] + result_5 = [1.0, 0.85, -0.07566212829370711, -0.07635069706072706, -0.07698628638512295, -0.07747034005560738] + assert_equal @pacf_proc.call(10, 'mle'), result_10 + assert_equal @pacf_proc.call(5, 'mle'), result_5 + end + end +end From 05dad29d1bf043016176059c17aba3b2674239cf Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sun, 7 Jul 2013 21:05:15 +0530 Subject: [PATCH 17/22] Tested assertions for pacf in range (1..10) --- test/test_pacf.rb | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/test/test_pacf.rb b/test/test_pacf.rb index 981a545..278214b 100644 --- a/test/test_pacf.rb +++ b/test/test_pacf.rb @@ -30,6 +30,11 @@ class StatsampleTimeSeriesPacfTestCase < MiniTest::Unit::TestCase result_5 = [1.0, 0.8947368421052632, -0.10582010582010604, -0.11350188273265083, -0.12357534824820737, -0.13686534216335522] assert_equal @pacf_proc.call(10, 'yw'), result_10 assert_equal @pacf_proc.call(5, 'yw'), result_5 + + #Checking for lag = (1..10) + 1.upto(10) do |i| + assert_equal @pacf_proc.call(i, 'yw'), result_10[0..i] + end end should "give correct pacf results for mle yule-walker" do @@ -37,6 +42,11 @@ class StatsampleTimeSeriesPacfTestCase < MiniTest::Unit::TestCase result_5 = [1.0, 0.85, -0.07566212829370711, -0.07635069706072706, -0.07698628638512295, -0.07747034005560738] assert_equal @pacf_proc.call(10, 'mle'), result_10 assert_equal @pacf_proc.call(5, 'mle'), result_5 + + #Checking for lag = (1..10) + 1.upto(10) do |i| + assert_equal @pacf_proc.call(i, 'mle'), result_10[0..i] + end end end end From 623acbf0a39fa3ddedbffd5db728e9d543c5398b Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Mon, 8 Jul 2013 22:34:22 +0530 Subject: [PATCH 18/22] Abstracted code for Pacf in a new Statsample::TimeSeries::Pacf module. Addressed few concerns by Monica and Claudio. :) --- lib/statsample/timeseries.rb | 121 +++--------------------------- lib/statsample/timeseries/pacf.rb | 100 ++++++++++++++++++++++++ 2 files changed, 110 insertions(+), 111 deletions(-) create mode 100644 lib/statsample/timeseries/pacf.rb diff --git a/lib/statsample/timeseries.rb b/lib/statsample/timeseries.rb index 5804efe..3a236b3 100644 --- a/lib/statsample/timeseries.rb +++ b/lib/statsample/timeseries.rb @@ -1,3 +1,4 @@ +require 'statsample/timeseries/pacf' module Statsample::TimeSeriesShorthands # Creates a new Statsample::TimeSeries object # Argument should be equal to TimeSeries.new @@ -17,6 +18,7 @@ module TimeSeries # Collection of data indexed by time. # The order goes from earliest to latest. class TimeSeries < Statsample::Vector + include Statsample::TimeSeries::Pacf # Calculates the autocorrelation coefficients of the series. # # The first element is always 1, since that is the correlation @@ -29,10 +31,10 @@ class TimeSeries < Statsample::Vector # ts.acf # => array with first 21 autocorrelations # ts.acf 3 # => array with first 3 autocorrelations # - def acf maxlags = nil - maxlags ||= (10 * Math.log10(size)).to_i + def acf max_lags = nil + max_lags ||= (10 * Math.log10(size)).to_i - (0..maxlags).map do |i| + (0..max_lags).map do |i| if i == 0 1.0 else @@ -45,117 +47,14 @@ def acf maxlags = nil end end - - def yule_walker(series, k = 1, method='yw') - #From the series, estimates AR(p)(autoregressive) parameter - #using Yule-Waler equation. See - - #http://en.wikipedia.org/wiki/Autoregressive_moving_average_model - + def pacf(max_lags = nil, method = 'yw') #parameters: - #ts = series - #k = order, default = 1 - #method = can be 'yw' or 'mle'. If 'yw' then it is unbiased, denominator - #is (n - k) - - #returns: - #rho => autoregressive coefficients - ts = series #timeseries - ts = ts - ts.mean - n = ts.size - if method.downcase.eql? 'yw' - #unbiased => denominator = (n - k) - denom =->(k) { n - k } - else - #mle - #denominator => (n) - denom =->(k) { n } - end - r = Array.new(k + 1) { 0.0 } - r[0] = ts.map { |x| x ** 2 }.inject(:+).to_f / denom.call(0).to_f - - for l in (1..k) - r[l] = (ts[0...-l].zip(ts[l...ts.size])).map do |x| - x.inject(:*) - end.inject(:+).to_f / denom.call(l).to_f - end - - r_R = toeplitz(r[0...-1]) - - mat = Matrix.columns(r_R).inverse() - solve_matrix(mat, r[1..r.size]) - end - - def solve_matrix(matrix, out_vector) - solution_vector = Array.new(out_vector.size,0 ) - matrix = matrix.to_a - k = 0 - matrix.each do |row| - row.each_with_index do |element, i| - solution_vector[k] += element * 1.0 * out_vector[i] - end - k += 1 - end - return solution_vector - end - - - def toeplitz(arr) - #Generates Toeplitz matrix - - #http://en.wikipedia.org/wiki/Toeplitz_matrix - #Toeplitz matrix are equal when they are stored in row & - #column major - #=> arr = [0, 1, 2, 3] - #=> result: - #[[0, 1, 2, 3], - # [1, 0, 1, 2], - # [2, 1, 0, 1], - # [3, 2, 1, 0]] - eplitz_matrix = Array.new(arr.size) { Array.new(arr.size) } - - 0.upto(arr.size - 1) do |i| - j = 0 - index = i - while i >= 0 do - eplitz_matrix[index][j] = arr[i] - j += 1 - i -= 1 - end - i = index + 1; k = 1 - while i < arr.size do - eplitz_matrix[index][j] = arr[k] - i += 1; j += 1; k += 1 - end - end - - return eplitz_matrix - end - - - def pacf_yw(maxlags, method = 'yw') - #partial autocorrelation by yule walker equations. - #Inspiration: StatsModels - pacf = [1.0] - (1..maxlags).map do |i| - pacf << yule_walker(self, i, method)[-1] - end - return pacf - end - - ## USAGE ## - ## series = (1..100).map { rand(100) }.to_ts - # series.pacf(10, 'yw') - # => yule-walker unbiased partial-autocorrelation - # series.pacf(10, 'mle') - # => yule-walker mle pacf - # - def pacf(maxlags = nil, method = 'yw') - #parameters: - #maxlags => maximum number of lags for pcf + #max_lags => maximum number of lags for pcf #method => for autocovariance in yule_walker: #'yw' for 'yule-walker unbaised', 'mle' for biased maximum likelihood - maxlags ||= (10 * Math.log10(size)).to_i - return pacf_yw(maxlags, method) + max_lags ||= (10 * Math.log10(size)).to_i + Pacf::Pacf.pacf_yw(self, max_lags, method) end # Lags the series by k periods. @@ -199,7 +98,7 @@ def lag k = 1 # # => [0.69, 0.23, 0.44, 0.71, ...] # # ts.diff # => [nil, -0.46, 0.21, 0.27, ...] - # + # def diff self - self.lag end diff --git a/lib/statsample/timeseries/pacf.rb b/lib/statsample/timeseries/pacf.rb new file mode 100644 index 0000000..09836b9 --- /dev/null +++ b/lib/statsample/timeseries/pacf.rb @@ -0,0 +1,100 @@ +module Statsample + module TimeSeries + module Pacf + class Pacf + + def self.pacf_yw(timeseries, max_lags, method = 'yw') + #partial autocorrelation by yule walker equations. + #Inspiration: StatsModels + pacf = [1.0] + (1..max_lags).map do |i| + pacf << yule_walker(timeseries, i, method)[-1] + end + pacf + end + + def self.yule_walker(ts, k = 1, method='yw') + #From the series, estimates AR(p)(autoregressive) parameter + #using Yule-Waler equation. See - + #http://en.wikipedia.org/wiki/Autoregressive_moving_average_model + + #parameters: + #ts = series + #k = order, default = 1 + #method = can be 'yw' or 'mle'. If 'yw' then it is unbiased, denominator + #is (n - k) + + #returns: + #rho => autoregressive coefficients + ts = ts - ts.mean + n = ts.size + if method.downcase.eql? 'yw' + #unbiased => denominator = (n - k) + denom =->(k) { n - k } + else + #mle + #denominator => (n) + denom =->(k) { n } + end + r = Array.new(k + 1) { 0.0 } + r[0] = ts.map { |x| x ** 2 }.inject(:+).to_f / denom.call(0).to_f + + 1.upto(k) do |l| + r[l] = (ts[0...-l].zip(ts[l...ts.size])).map do |x| + x.inject(:*) + end.inject(:+).to_f / denom.call(l).to_f + end + + r_R = toeplitz(r[0...-1]) + + mat = Matrix.columns(r_R).inverse() + solve_matrix(mat, r[1..r.size]) + end + + def self.toeplitz(arr) + #Generates Toeplitz matrix - + #http://en.wikipedia.org/wiki/Toeplitz_matrix + #Toeplitz matrix are equal when they are stored in row & + #column major + #=> arr = [0, 1, 2, 3] + #=> result: + #[[0, 1, 2, 3], + # [1, 0, 1, 2], + # [2, 1, 0, 1], + # [3, 2, 1, 0]] + eplitz_matrix = Array.new(arr.size) { Array.new(arr.size) } + + 0.upto(arr.size - 1) do |i| + j = 0 + index = i + while i >= 0 do + eplitz_matrix[index][j] = arr[i] + j += 1 + i -= 1 + end + i = index + 1; k = 1 + while i < arr.size do + eplitz_matrix[index][j] = arr[k] + i += 1; j += 1; k += 1 + end + end + eplitz_matrix + end + + def self.solve_matrix(matrix, out_vector) + solution_vector = Array.new(out_vector.size, 0) + matrix = matrix.to_a + k = 0 + matrix.each do |row| + row.each_with_index do |element, i| + solution_vector[k] += element * 1.0 * out_vector[i] + end + k += 1 + end + solution_vector + end + + end + end + end +end From 8849e9e1e50503eb0ea5e28821c4aa6b1597c480 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Thu, 11 Jul 2013 15:43:19 +0530 Subject: [PATCH 19/22] Listed cucumber gem in gemspec, added 'features' directory in test_files --- features/pacf.feature | 3 +++ statsample.gemspec | 3 ++- 2 files changed, 5 insertions(+), 1 deletion(-) create mode 100644 features/pacf.feature diff --git a/features/pacf.feature b/features/pacf.feature new file mode 100644 index 0000000..75d7025 --- /dev/null +++ b/features/pacf.feature @@ -0,0 +1,3 @@ +Feature: PACF + + diff --git a/statsample.gemspec b/statsample.gemspec index de16f4d..34798d9 100644 --- a/statsample.gemspec +++ b/statsample.gemspec @@ -13,7 +13,7 @@ Gem::Specification.new do |s| s.homepage = "http://github.com/clbustos/statsample" s.require_path = "lib" s.rubyforge_project = "ruby-statsample" - s.test_files = `git ls-files -- {test}/*`.split("\n") + s.test_files = `git ls-files -- test features`.split("\n") DEPENDENCIES = [{:gem => "spreadsheet", :version => "~> 0.8.5"}, {:gem => "reportbuilder", :version => "~> 1.4.2"}, {:gem => "minimization", :version => "~> 0.2.1"}, {:gem => "fastercsv", :version => "~> 1.5.5"}, @@ -22,6 +22,7 @@ Gem::Specification.new do |s| {:gem => "rubyvis", :version => "~> 0.5.2"}, {:gem => "gettext", :version => "~> 2.3.9"}, {:gem => "mocha", :version => "~> 0.14.0"}, {:gem => "hoe-git", :version => "~> 1.5.0"}, {:gem => "minitest", :version => "~> 4.2"}, {:gem => "shoulda", :version => "~> 3.5.0"}, + {:gem => "cucumber", :version => "~>1.3.3"} ] DEPENDENCIES.each do |dependency| s.add_dependency(dependency[:gem], dependency[:version]) From 3616c9ecb30e58dacf3d8340867e8e8afb33955f Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Thu, 11 Jul 2013 15:44:11 +0530 Subject: [PATCH 20/22] Full featured cucumber tests for pacf module. --- features/pacf.feature | 39 ++++++++++++++++++++++++++++++++++++ features/step_definitions.rb | 39 ++++++++++++++++++++++++++++++++++++ 2 files changed, 78 insertions(+) create mode 100644 features/step_definitions.rb diff --git a/features/pacf.feature b/features/pacf.feature index 75d7025..ff13afd 100644 --- a/features/pacf.feature +++ b/features/pacf.feature @@ -1,3 +1,42 @@ Feature: PACF + As a statistician + So that I can quickly evaluate partial autocorrelation of a series + I want to evaluate pacf +Background: a timeseries + + Given the following values in a timeseries: + | timeseries | + | 10 20 30 40 50 60 70 80 90 100 | + | 110 120 130 140 150 160 170 180 190 200 | + +Scenario: check pacf for 10 lags with unbiased + When I provide 10 lags for pacf + When I provide yw yule walker as method + Then I should get Array as resultant output + Then I should get 11 values in resultant pacf + +Scenario: check pacf for 5 lags with mle + When I provide 5 lags for pacf + When I provide mle yule walker as method + Then I should get Array as resultant output + Then I should get 6 values in resultant pacf + +Scenario: check first value of pacf + When I provide 5 lags for pacf + When I provide yw yule walker as method + Then I should get Array as resultant output + And I should see 1.0 as first value + +Scenario: check all values in pacf for 5 lags with mle + When I provide 5 lags for pacf + When I provide mle yule walker as method + Then I should get Array as resultant output + And I should see "1.0, 0.85, -0.07566212829370711, -0.07635069706072706, -0.07698628638512295, -0.07747034005560738" as complete series + +Scenario: check all values in pacf for 5 lags with unbiased + When I provide 5 lags for pacf + When I provide yw yule walker as method + Then I should get Array as resultant output + And I should see "1.0, 0.8947368421052632, -0.10582010582010604, -0.11350188273265083, -0.12357534824820737, -0.13686534216335522" as complete series diff --git a/features/step_definitions.rb b/features/step_definitions.rb new file mode 100644 index 0000000..2c0539b --- /dev/null +++ b/features/step_definitions.rb @@ -0,0 +1,39 @@ +require 'statsample' +require 'debugger' +include Statsample::TimeSeries + +Given /^the following values in a timeseries:$/ do |series| + arr = [] + series.hashes.each do |sequence| + arr += sequence['timeseries'].split(' ').map(&:to_i).to_ts + end + @timeseries = arr.to_ts +end + +When /^I provide (\d+) lags for pacf$/ do |lags| + @lags = lags.to_i +end + +When /^I provide (\w+) yule walker as method$/ do |method| + @method = method +end + +Then /^I should get (\w+) as resultant output$/ do |klass| + @result = @timeseries.pacf(@lags, @method) + assert_equal @result.class.to_s, klass +end + +Then /^I should get (\w+) values in resultant pacf$/ do |values_count| + assert_equal @result.size, values_count.to_i + @timeseries +end + +And /^I should see (\d+\.\d) as first value$/ do |first_value| + assert_equal @result.first, first_value.to_f +end + +And /^I should see \"(.+)\" as complete series$/ do |series| + series = series.split(',').map(&:to_f) + assert_equal @result, series +end + From 46287c0501b6ac2452ee9d4dbeb17f485292a8b6 Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Thu, 11 Jul 2013 16:17:30 +0530 Subject: [PATCH 21/22] Adding cucumber tests for autocorrelation. --- features/acf.feature | 31 +++++++++++++++++++++++++++++++ features/step_definitions.rb | 6 ++---- features/step_definitions_acf.rb | 9 +++++++++ 3 files changed, 42 insertions(+), 4 deletions(-) create mode 100644 features/acf.feature create mode 100644 features/step_definitions_acf.rb diff --git a/features/acf.feature b/features/acf.feature new file mode 100644 index 0000000..da31ed1 --- /dev/null +++ b/features/acf.feature @@ -0,0 +1,31 @@ +Feature: ACF + + As a statistician + So that I can evaluate autocorrelation of a series + I want to evaluate acf + +Background: a timeseries + + Given the following values in a timeseries: + | timeseries | + | 10 20 30 40 50 60 70 80 90 100 | + | 110 120 130 140 150 160 170 180 190 200 | + +Scenario: cross-check acf for 10 lags + When I provide 10 lags for acf + And I calculate acf + Then I should get 11 values in resultant acf + And I should see "1.0, 0.85, 0.7015037593984963, 0.556015037593985, 0.4150375939849624, 0.2800751879699248, 0.15263157894736842, 0.034210526315789476, -0.07368421052631578, -0.16954887218045114, -0.2518796992481203" as complete series + +Scenario: cross-check acf for 5 lags + When I provide 5 lags for acf + And I calculate acf + Then I should get 6 values in resultant acf + And I should see "1.0, 0.85, 0.7015037593984963, 0.556015037593985, 0.4150375939849624, 0.2800751879699248" as complete series + +Scenario: first value should be 1.0 + When I provide 2 lags for acf + And I calculate acf + Then I should get 3 values in resultant acf + And I should see 1.0 as first value + diff --git a/features/step_definitions.rb b/features/step_definitions.rb index 2c0539b..aefc354 100644 --- a/features/step_definitions.rb +++ b/features/step_definitions.rb @@ -1,5 +1,4 @@ require 'statsample' -require 'debugger' include Statsample::TimeSeries Given /^the following values in a timeseries:$/ do |series| @@ -10,7 +9,7 @@ @timeseries = arr.to_ts end -When /^I provide (\d+) lags for pacf$/ do |lags| +When /^I provide (\d+) lags for p?acf$/ do |lags| @lags = lags.to_i end @@ -23,9 +22,8 @@ assert_equal @result.class.to_s, klass end -Then /^I should get (\w+) values in resultant pacf$/ do |values_count| +Then /^I should get (\w+) values in resultant p?acf$/ do |values_count| assert_equal @result.size, values_count.to_i - @timeseries end And /^I should see (\d+\.\d) as first value$/ do |first_value| diff --git a/features/step_definitions_acf.rb b/features/step_definitions_acf.rb new file mode 100644 index 0000000..ca5479c --- /dev/null +++ b/features/step_definitions_acf.rb @@ -0,0 +1,9 @@ +require 'statsample' +require 'debugger' +include Statsample::TimeSeries + +#all instance variable and cucumber DSL s DRYed up in step_definitions.rb +And /^I calculate acf$/ do + @result = @timeseries.acf(@lags) +end + From e74ec7e35baf4e0c3752837b58b454e9085ff5de Mon Sep 17 00:00:00 2001 From: Ankur Goel Date: Sun, 14 Jul 2013 17:49:36 +0530 Subject: [PATCH 22/22] basic prototype of arima module. currently ongoing: general simulation for AR(p) --- lib/statsample/arima.rb | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) create mode 100644 lib/statsample/arima.rb diff --git a/lib/statsample/arima.rb b/lib/statsample/arima.rb new file mode 100644 index 0000000..70a5dc8 --- /dev/null +++ b/lib/statsample/arima.rb @@ -0,0 +1,28 @@ +require 'debugger' +module Statsample + module ARIMA + class ARIMA < Statsample::TimeSeries + + def arima(ds, p, i, q) + if q.zero? + self.ar(p) + elsif p.zero? + self.ma(p) + end + end + + def ar(p) + #AutoRegressive part of model + #http://en.wikipedia.org/wiki/Autoregressive_model#Definition + #For finding parameters(to fit), we will use either Yule-walker + #or Burg's algorithm(more efficient) + + degugger + + end + + def yule_walker() + end + end + end +end