diff --git a/README.txt b/README.md similarity index 77% rename from README.txt rename to README.md index 0c515b4..d389977 100644 --- a/README.txt +++ b/README.md @@ -1,13 +1,16 @@ -= Statsample +Statsample +========== -http://ruby-statsample.rubyforge.org/ +[http://ruby-statsample.rubyforge.org/](http://ruby-statsample.rubyforge.org/) -== DESCRIPTION: +DESCRIPTION: +------------ A suite for basic and advanced statistics on Ruby. Tested on Ruby 1.8.7, 1.9.1, 1.9.2 (April, 2010), ruby-head(June, 2011) and JRuby 1.4 (Ruby 1.8.7 compatible). Include: + * Descriptive statistics: frequencies, median, mean, standard error, skew, kurtosis (and many others). * Imports and exports datasets from and to Excel, CSV and plain text files. * Correlations: Pearson's r, Spearman's rank correlation (rho), point biserial, tau a, tau b and gamma. Tetrachoric and Polychoric correlation provides by +statsample-bivariate-extension+ gem. @@ -24,9 +27,10 @@ Include: * Creates reports on text, html and rtf, using ReportBuilder gem * Graphics: Histogram, Boxplot and Scatterplot -== PRINCIPLES +PRINCIPLES +---------- -* Software Design: +* Software Design: * One module/class for each type of analysis * Options can be set as hash on initialize() or as setters methods * Clean API for interactive sessions @@ -36,13 +40,14 @@ Include: * Statistical Design * Results are tested against text results, SPSS and R outputs. * Go beyond Null Hiphotesis Testing, using confidence intervals and effect sizes when possible - * (When possible) All references for methods are documented, providing sensible information on documentation + * (When possible) All references for methods are documented, providing sensible information on documentation -== FEATURES: +FEATURES: +-------- * Classes for manipulation and storage of data: * Statsample::Vector: An extension of an array, with statistical methods like sum, mean and standard deviation - * Statsample::Dataset: a group of Statsample::Vector, analog to a excel spreadsheet or a dataframe on R. The base of almost all operations on statsample. + * Statsample::Dataset: a group of Statsample::Vector, analog to a excel spreadsheet or a dataframe on R. The base of almost all operations on statsample. * Statsample::Multiset: multiple datasets with same fields and type of vectors * Anova module provides generic Statsample::Anova::OneWay and vector based Statsample::Anova::OneWayWithVectors. Also you can create contrast using Statsample::Anova::Contrast * Module Statsample::Bivariate provides covariance and pearson, spearman, point biserial, tau a, tau b, gamma, tetrachoric (see Bivariate::Tetrachoric) and polychoric (see Bivariate::Polychoric) correlations. Include methods to create correlation and covariance matrices @@ -52,10 +57,10 @@ Include: * Logit Regression: Statsample::Regression::Binomial::Logit * Probit Regression: Statsample::Regression::Binomial::Probit * Factorial Analysis algorithms on Statsample::Factor module. - * Classes for Extraction of factors: + * Classes for Extraction of factors: * Statsample::Factor::PCA * Statsample::Factor::PrincipalAxis - * Classes for Rotation of factors: + * Classes for Rotation of factors: * Statsample::Factor::Varimax * Statsample::Factor::Equimax * Statsample::Factor::Quartimax @@ -64,7 +69,7 @@ Include: * Statsample::Factor::MAP performs Velicer's Minimum Average Partial (MAP) test, which retain components as long as the variance in the correlation matrix represents systematic variance. * Dominance Analysis. 
Based on Budescu and Azen papers, dominance analysis is a method to analyze the relative importance of one predictor relative to another on multiple regression * Statsample::DominanceAnalysis class can report dominance analysis for a sample, using uni or multivariate dependent variables - * Statsample::DominanceAnalysis::Bootstrap can execute bootstrap analysis to determine dominance stability, as recomended by Azen & Budescu (2003) link[http://psycnet.apa.org/journals/met/8/2/129/]. + * Statsample::DominanceAnalysis::Bootstrap can execute bootstrap analysis to determine dominance stability, as recomended by Azen & Budescu (2003) link[http://psycnet.apa.org/journals/met/8/2/129/]. * Module Statsample::Codification, to help to codify open questions * Converters to import and export data: * Statsample::Database : Can create sql to create tables, read and insert data @@ -73,7 +78,7 @@ Include: * Statsample::Mx : Write Mx Files * Statsample::GGobi : Write Ggobi files * Module Statsample::Crosstab provides function to create crosstab for categorical data -* Module Statsample::Reliability provides functions to analyze scales with psychometric methods. +* Module Statsample::Reliability provides functions to analyze scales with psychometric methods. * Class Statsample::Reliability::ScaleAnalysis provides statistics like mean, standard deviation for a scale, Cronbach's alpha and standarized Cronbach's alpha, and for each item: mean, correlation with total scale, mean if deleted, Cronbach's alpha is deleted. * Class Statsample::Reliability::MultiScaleAnalysis provides a DSL to easily analyze reliability of multiple scales and retrieve correlation matrix and factor analysis of them. * Class Statsample::Reliability::ICC provides intra-class correlation, using Shrout & Fleiss(1979) and McGraw & Wong (1996) formulations. @@ -81,7 +86,7 @@ Include: * Module Statsample::Test provides several methods and classes to perform inferencial statistics * Statsample::Test::BartlettSphericity * Statsample::Test::ChiSquare - * Statsample::Test::F + * Statsample::Test::F * Statsample::Test::KolmogorovSmirnov (only D value) * Statsample::Test::Levene * Statsample::Test::UMannWhitney @@ -90,85 +95,91 @@ Include: * Statsample::Graph::Boxplot * Statsample::Graph::Histogram * Statsample::Graph::Scatterplot -* Module Statsample::TimeSeries provides basic support for time series. +* Module Statsample::TimeSeries provides basic support for time series. * Gem +statsample-sem+ provides a DSL to R libraries +sem+ and +OpenMx+ -* Close integration with gem reportbuilder, to easily create reports on text, html and rtf formats. - -== Examples of use: +* Close integration with gem `reportbuilder`, to easily create reports on text, html and rtf formats. 
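A quick taste of the `Statsample::Vector` and `Statsample::Dataset` classes listed above. This is a minimal, illustrative sketch (the numbers are made up and the gem must be installed); it only uses methods that appear elsewhere in this library's own tests:

```ruby
require 'statsample'

# Two scale-level vectors (Statsample::Vector)
a = [1, 3, 2, 4, 3, 5, 4, 6, 5, 7].to_scale
b = [3, 3, 4, 4, 5, 5, 6, 6, 4, 4].to_scale

puts a.mean   # => 4.0
puts a.sd     # sample standard deviation

# A Dataset groups named vectors, analogous to a data frame
ds = { 'a' => a, 'b' => b }.to_dataset
puts Statsample::Bivariate.pearson(a, b)  # Pearson's r between the two fields
```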
-See multiples examples of use on [http://github.com/clbustos/statsample/tree/master/examples/] +Examples of use: +-------------- -=== Boxplot +See multiples examples of use on [http://github.com/clbustos/statsample/tree/master/examples/](http://github.com/clbustos/statsample/tree/master/examples/) +Boxplot +------- +```ruby require 'statsample' - ss_analysis(Statsample::Graph::Boxplot) do + ss_analysis(Statsample::Graph::Boxplot) do n=30 a=rnorm(n-1,50,10) b=rnorm(n, 30,5) c=rnorm(n,5,1) a.push(2) boxplot(:vectors=>[a,b,c], :width=>300, :height=>300, :groups=>%w{first first second}, :minimum=>0) - end + end Statsample::Analysis.run # Open svg file on *nix application defined - -=== Correlation matrix - +``` +Correlation matrix +------------------ +```ruby require 'statsample' # Note R like generation of random gaussian variable # and correlation matrix - + ss_analysis("Statsample::Bivariate.correlation_matrix") do samples=1000 ds=data_frame( - 'a'=>rnorm(samples), + 'a'=>rnorm(samples), 'b'=>rnorm(samples), 'c'=>rnorm(samples), 'd'=>rnorm(samples)) - cm=cor(ds) + cm=cor(ds) summary(cm) end - - Statsample::Analysis.run_batch # Echo output to console + Statsample::Analysis.run_batch # Echo output to console +``` -== REQUIREMENTS: +REQUIREMENTS: +------------- -Optional: +Optional: * Plotting: gnuplot and rbgnuplot, SVG::Graph -* Factorial analysis and polychorical correlation(joint estimate and polychoric series): gsl library and rb-gsl (http://rb-gsl.rubyforge.org/). You should install it using gem install gsl. - -Note: Use gsl 1.12.109 or later. +* Factorial analysis and polychorical correlation(joint estimate and polychoric series): gsl library and rb-gsl [http://rb-gsl.rubyforge.org/](http://rb-gsl.rubyforge.org/). You should install it using `gem install gsl`. -== RESOURCES +**Note**: Use gsl 1.12.109 or later. -* Source code on github: http://github.com/clbustos/statsample -* API: http://ruby-statsample.rubyforge.org/statsample/ -* Bug report and feature request: http://github.com/clbustos/statsample/issues -* E-mailing list: http://groups.google.com/group/statsample +RESOURCES: +---------- -== INSTALL: +* Source code on github: [http://github.com/clbustos/statsample](http://github.com/clbustos/statsample) +* API: [http://ruby-statsample.rubyforge.org/statsample/](http://ruby-statsample.rubyforge.org/statsample/) +* Bug report and feature request: [http://github.com/clbustos/statsample/issues](http://github.com/clbustos/statsample/issues) +* E-mailing list: [http://groups.google.com/group/statsample](http://groups.google.com/group/statsample) - $ sudo gem install statsample +INSTALL: +--------- + `$ sudo gem install statsample` -On *nix, you should install statsample-optimization to retrieve gems gsl, statistics2 and a C extension to speed some methods. +On \*nix, you should install statsample-optimization to retrieve gems gsl, statistics2 and a C extension to speed some methods. -There are available precompiled version for Ruby 1.9 on x86, x86_64 and mingw32 archs. +There are available precompiled version for Ruby 1.9 on x86, x86\_64 and mingw32 archs. 
- $ sudo gem install statsample-optimization + `$ sudo gem install statsample-optimization` -If you use Ruby 1.8, you should compile statsample-optimization, usign parameter --platform ruby +If you use Ruby 1.8, you should compile statsample-optimization, usign parameter `--platform ruby` - $ sudo gem install statsample-optimization --platform ruby + `$ sudo gem install statsample-optimization --platform ruby` -If you need to work on Structural Equation Modeling, you could see +statsample-sem+. You need R with +sem+ or +OpenMx+ [http://openmx.psyc.virginia.edu/] libraries installed +If you need to work on Structural Equation Modeling, you could see _statsample-sem_. You need R with _sem_ or _OpenMx_ [http://openmx.psyc.virginia.edu/](http://openmx.psyc.virginia.edu/) libraries installed - $ sudo gem install statsample-sem + `$ sudo gem install statsample-sem` Available setup.rb file - sudo gem ruby setup.rb + `sudo gem ruby setup.rb` -== LICENSE: +LICENSE: +------- GPL-2 (See LICENSE.txt) diff --git a/features/acf.feature b/features/acf.feature new file mode 100644 index 0000000..da31ed1 --- /dev/null +++ b/features/acf.feature @@ -0,0 +1,31 @@ +Feature: ACF + + As a statistician + So that I can evaluate autocorrelation of a series + I want to evaluate acf + +Background: a timeseries + + Given the following values in a timeseries: + | timeseries | + | 10 20 30 40 50 60 70 80 90 100 | + | 110 120 130 140 150 160 170 180 190 200 | + +Scenario: cross-check acf for 10 lags + When I provide 10 lags for acf + And I calculate acf + Then I should get 11 values in resultant acf + And I should see "1.0, 0.85, 0.7015037593984963, 0.556015037593985, 0.4150375939849624, 0.2800751879699248, 0.15263157894736842, 0.034210526315789476, -0.07368421052631578, -0.16954887218045114, -0.2518796992481203" as complete series + +Scenario: cross-check acf for 5 lags + When I provide 5 lags for acf + And I calculate acf + Then I should get 6 values in resultant acf + And I should see "1.0, 0.85, 0.7015037593984963, 0.556015037593985, 0.4150375939849624, 0.2800751879699248" as complete series + +Scenario: first value should be 1.0 + When I provide 2 lags for acf + And I calculate acf + Then I should get 3 values in resultant acf + And I should see 1.0 as first value + diff --git a/features/pacf.feature b/features/pacf.feature new file mode 100644 index 0000000..ff13afd --- /dev/null +++ b/features/pacf.feature @@ -0,0 +1,42 @@ +Feature: PACF + + As a statistician + So that I can quickly evaluate partial autocorrelation of a series + I want to evaluate pacf + +Background: a timeseries + + Given the following values in a timeseries: + | timeseries | + | 10 20 30 40 50 60 70 80 90 100 | + | 110 120 130 140 150 160 170 180 190 200 | + +Scenario: check pacf for 10 lags with unbiased + When I provide 10 lags for pacf + When I provide yw yule walker as method + Then I should get Array as resultant output + Then I should get 11 values in resultant pacf + +Scenario: check pacf for 5 lags with mle + When I provide 5 lags for pacf + When I provide mle yule walker as method + Then I should get Array as resultant output + Then I should get 6 values in resultant pacf + +Scenario: check first value of pacf + When I provide 5 lags for pacf + When I provide yw yule walker as method + Then I should get Array as resultant output + And I should see 1.0 as first value + +Scenario: check all values in pacf for 5 lags with mle + When I provide 5 lags for pacf + When I provide mle yule walker as method + Then I should get Array as resultant 
output + And I should see "1.0, 0.85, -0.07566212829370711, -0.07635069706072706, -0.07698628638512295, -0.07747034005560738" as complete series + +Scenario: check all values in pacf for 5 lags with unbiased + When I provide 5 lags for pacf + When I provide yw yule walker as method + Then I should get Array as resultant output + And I should see "1.0, 0.8947368421052632, -0.10582010582010604, -0.11350188273265083, -0.12357534824820737, -0.13686534216335522" as complete series diff --git a/features/step_definitions.rb b/features/step_definitions.rb new file mode 100644 index 0000000..aefc354 --- /dev/null +++ b/features/step_definitions.rb @@ -0,0 +1,37 @@ +require 'statsample' +include Statsample::TimeSeries + +Given /^the following values in a timeseries:$/ do |series| + arr = [] + series.hashes.each do |sequence| + arr += sequence['timeseries'].split(' ').map(&:to_i).to_ts + end + @timeseries = arr.to_ts +end + +When /^I provide (\d+) lags for p?acf$/ do |lags| + @lags = lags.to_i +end + +When /^I provide (\w+) yule walker as method$/ do |method| + @method = method +end + +Then /^I should get (\w+) as resultant output$/ do |klass| + @result = @timeseries.pacf(@lags, @method) + assert_equal @result.class.to_s, klass +end + +Then /^I should get (\w+) values in resultant p?acf$/ do |values_count| + assert_equal @result.size, values_count.to_i +end + +And /^I should see (\d+\.\d) as first value$/ do |first_value| + assert_equal @result.first, first_value.to_f +end + +And /^I should see \"(.+)\" as complete series$/ do |series| + series = series.split(',').map(&:to_f) + assert_equal @result, series +end + diff --git a/features/step_definitions_acf.rb b/features/step_definitions_acf.rb new file mode 100644 index 0000000..ca5479c --- /dev/null +++ b/features/step_definitions_acf.rb @@ -0,0 +1,9 @@ +require 'statsample' +require 'debugger' +include Statsample::TimeSeries + +#all instance variable and cucumber DSL s DRYed up in step_definitions.rb +And /^I calculate acf$/ do + @result = @timeseries.acf(@lags) +end + diff --git a/lib/statsample/analysis.rb b/lib/statsample/analysis.rb index bf0e7b1..b0036a4 100644 --- a/lib/statsample/analysis.rb +++ b/lib/statsample/analysis.rb @@ -3,7 +3,7 @@ module Statsample # DSL to create analysis without hazzle. - # * Shortcuts methods to avoid use complete namescapes, many based on R + # * Shortcuts methods to avoid use complete namespaces, many based on R # * Attach/detach vectors to workspace, like R # == Example # an1=Statsample::Analysis.store(:first) do diff --git a/lib/statsample/arima.rb b/lib/statsample/arima.rb new file mode 100644 index 0000000..70a5dc8 --- /dev/null +++ b/lib/statsample/arima.rb @@ -0,0 +1,28 @@ +require 'debugger' +module Statsample + module ARIMA + class ARIMA < Statsample::TimeSeries + + def arima(ds, p, i, q) + if q.zero? + self.ar(p) + elsif p.zero? + self.ma(p) + end + end + + def ar(p) + #AutoRegressive part of model + #http://en.wikipedia.org/wiki/Autoregressive_model#Definition + #For finding parameters(to fit), we will use either Yule-walker + #or Burg's algorithm(more efficient) + + degugger + + end + + def yule_walker() + end + end + end +end diff --git a/lib/statsample/multiset.rb b/lib/statsample/multiset.rb index 88fe595..230fe3c 100644 --- a/lib/statsample/multiset.rb +++ b/lib/statsample/multiset.rb @@ -1,212 +1,225 @@ module Statsample # Multiset joins multiple dataset with the same fields and vectors - # but with different number of cases. + # but with different number of cases. 
# This is the base class for stratified and cluster sampling estimation class Multiset # Name of fields attr_reader :fields # Array with Statsample::Dataset attr_reader :datasets + # To create a multiset # * Multiset.new(%w{f1 f2 f3}) # define only fields def initialize(fields) - @fields=fields - @datasets={} + @fields = fields + @datasets = {} end - def self.new_empty_vectors(fields,ds_names) - ms=Multiset.new(fields) - ds_names.each{|d| - ms.add_dataset(d,Dataset.new(fields)) - } - ms + + def self.new_empty_vectors(fields, ds_names) + ms = Multiset.new(fields) + ds_names.each do |d| + ms.add_dataset(d, Dataset.new(fields)) + end + ms end + # Generate a new dataset as a union of partial dataset # If block given, this is applied to each dataset before union def union(&block) - union_field={} - types={} - names={} - labels={} - each do |k,ds| + union_field = {} + types = {} + names = {} + labels = {} + each do |k, ds| if block - ds=ds.dup - yield k,ds + ds = ds.dup + yield k, ds end @fields.each do |f| - union_field[f]||=Array.new + union_field[f] ||= Array.new union_field[f].concat(ds[f].data) - types[f]||=ds[f].type - names[f]||=ds[f].name - labels[f]||=ds[f].labels + types[f] ||= ds[f].type + names[f] ||= ds[f].name + labels[f] ||= ds[f].labels end end - + @fields.each do |f| - union_field[f]=union_field[f].to_vector(types[f]) - union_field[f].name=names[f] - union_field[f].labels=labels[f] + union_field[f] = union_field[f].to_vector(types[f]) + union_field[f].name = names[f] + union_field[f].labels = labels[f] end - ds_union=union_field.to_dataset - ds_union.fields=@fields + ds_union = union_field.to_dataset + ds_union.fields = @fields ds_union end + def datasets_names - @datasets.keys.sort + @datasets.keys.sort end + def n_datasets - @datasets.size + @datasets.size end - def add_dataset(key,ds) - if(ds.fields!=@fields) - raise ArgumentError, "Dataset(#{ds.fields.to_s})must have the same fields of the Multiset(#{@fields})" + + def add_dataset(key, ds) + if(ds.fields != @fields) + raise ArgumentError, "Dataset(#{ds.fields.to_s})must have the same fields of the Multiset(#{@fields})" else - @datasets[key]=ds + @datasets[key]=ds end end + def sum_field(field) - @datasets.inject(0) {|a,da| - stratum_name=da[0] - vector=da[1][field] - val=yield stratum_name,vector + @datasets.inject(0) do |a,da| + stratum_name = da[0] + vector = da[1][field] + val = yield stratum_name,vector a+val - } + end end + def collect_vector(field) - @datasets.collect {|k,v| + @datasets.collect {|k, v| yield k, v[field] } end - + def each_vector(field) - @datasets.each {|k,v| + @datasets.each {|k, v| yield k, v[field] } end + def[](i) @datasets[i] end + def each(&block) - @datasets.each {|k,ds| - next if ds.cases==0 - block.call(k,ds) - } + @datasets.each do |k, ds| + next if ds.cases == 0 + block.call(k, ds) + end end end + class StratifiedSample class << self # mean for an array of vectors def mean(*vectors) - n_total=0 - means=vectors.inject(0){|a,v| - n_total+=v.size - a+v.sum - } - means.to_f/n_total + n_total = 0 + means=vectors.inject(0) do |a,v| + n_total += v.size + a + v.sum + end + means.to_f / n_total end - + def standard_error_ksd_wr(es) - n_total=0 - sum=es.inject(0){|a,h| - n_total+=h['N'] - a+((h['N']**2 * h['s']**2) / h['n'].to_f) - } + n_total = 0 + sum=es.inject(0) do |a,h| + n_total += h['N'] + a + ((h['N'] ** 2 * h['s'] ** 2) / h['n'].to_f) + end (1.to_f / n_total)*Math::sqrt(sum) end - - + + def variance_ksd_wr(es) standard_error_ksd_wr(es)**2 end def calculate_n_total(es) - es.inject(0) {|a,h| 
a+h['N'] } + es.inject(0) {|a, h| a+h['N'] } end # Source : Cochran (1972) - + def variance_ksd_wor(es) - n_total=calculate_n_total(es) - es.inject(0){|a,h| - val=((h['N'].to_f / n_total)**2) * (h['s']**2 / h['n'].to_f) * (1 - (h['n'].to_f / h['N'])) - a+val - } + n_total = calculate_n_total(es) + es.inject(0) do |a, h| + val = ((h['N'].to_f / n_total)**2) * (h['s']**2 / h['n'].to_f) * (1 - (h['n'].to_f / h['N'])) + a+val + end end + def standard_error_ksd_wor(es) Math::sqrt(variance_ksd_wor(es)) end - - - + def variance_esd_wor(es) - n_total=calculate_n_total(es) - sum=es.inject(0){|a,h| - val=h['N']*(h['N']-h['n'])*(h['s']**2 / h['n'].to_f) - a+val - } - (1.0/(n_total**2))*sum + n_total = calculate_n_total(es) + sum=es.inject(0) do |a,h| + val = h['N'] * (h['N'] - h['n']) * (h['s']**2 / h['n'].to_f) + a + val + end + (1.0 / (n_total**2)) * sum end - - + def standard_error_esd_wor(es) Math::sqrt(variance_ksd_wor(es)) end # Based on http://stattrek.com/Lesson6/STRAnalysis.aspx def variance_esd_wr(es) - n_total=calculate_n_total(es) - sum=es.inject(0){|a,h| - val= ((h['s']**2 * h['N']**2) / h['n'].to_f) - a+val - } - (1.0/(n_total**2))*sum + n_total = calculate_n_total(es) + sum = es.inject(0) do |a,h| + val = ((h['s']**2 * h['N']**2) / h['n'].to_f) + a + val + end + (1.0 / (n_total**2)) * sum end + def standard_error_esd_wr(es) Math::sqrt(variance_esd_wr(es)) end - + def proportion_variance_ksd_wor(es) n_total=calculate_n_total(es) - es.inject(0){|a,h| - val= (((h['N'].to_f / n_total)**2 * h['p']*(1-h['p'])) / (h['n'])) * (1- (h['n'].to_f / h['N'])) - a+val - } + es.inject(0) do |a,h| + val = (((h['N'].to_f / n_total)**2 * h['p']*(1-h['p'])) / (h['n'])) * (1- (h['n'].to_f / h['N'])) + a + val + end end + def proportion_sd_ksd_wor(es) - Math::sqrt(proportion_variance_ksd_wor(es)) + Math::sqrt(proportion_variance_ksd_wor(es)) end - - + def proportion_sd_ksd_wr(es) - n_total=calculate_n_total(es) - sum=es.inject(0){|a,h| - val= (h['N']**2 * h['p']*(1-h['p'])) / h['n'].to_f - a+val - } - Math::sqrt(sum) * (1.0/n_total) + n_total = calculate_n_total(es) + sum = es.inject(0) do |a,h| + val = (h['N']**2 * h['p'] * (1 - h['p'])) / h['n'].to_f + a + val + end + Math::sqrt(sum) * (1.0 / n_total) end + def proportion_variance_ksd_wr(es) - proportion_variance_ksd_wor(es)**2 + proportion_variance_ksd_wor(es)**2 end - + def proportion_variance_esd_wor(es) n_total=n_total=calculate_n_total(es) - - sum=es.inject(0){|a,h| - a=(h['N']**2 * (h['N']-h['n']) * h['p']*(1.0-h['p'])) / ((h['n']-1)*(h['N']-1)) - a+val - } + + sum = es.inject(0) do |a,h| + a = (h['N']**2 * (h['N'] - h['n']) * h['p'] * (1.0-h['p'])) / ((h['n'] - 1) * (h['N'] - 1)) + a + val + end Math::sqrt(sum) * (1.0/n_total**2) end + def proportion_sd_esd_wor(es) - Math::sqrt(proportion_variance_ksd_wor(es)) + Math::sqrt(proportion_variance_ksd_wor(es)) end end - - def initialize(ms,strata_sizes) - raise TypeError,"ms should be a Multiset" unless ms.is_a? Statsample::Multiset - @ms=ms - raise ArgumentError,"You should put a strata size for each dataset" if strata_sizes.keys.sort!=ms.datasets_names - @strata_sizes=strata_sizes - @population_size=@strata_sizes.inject(0) {|a,x| a+x[1]} - @strata_number=@ms.n_datasets - @sample_size=@ms.datasets.inject(0) {|a,x| a+x[1].cases} + #class singleton methods end above. + + def initialize(ms, strata_sizes) + raise TypeError, "ms should be a Multiset" unless ms.is_a? 
Statsample::Multiset + @ms = ms + raise ArgumentError, "You should put a strata size for each dataset" if strata_sizes.keys.sort!=ms.datasets_names + @strata_sizes = strata_sizes + @population_size = @strata_sizes.inject(0) { |a, x| a + x[1] } + @strata_number = @ms.n_datasets + @sample_size = @ms.datasets.inject(0) { |a,x| a + x[1].cases } end # Number of strata def strata_number @@ -226,14 +239,14 @@ def stratum_size(h) @strata_sizes[h] end def vectors_by_field(field) - @ms.datasets.collect{|k,ds| + @ms.datasets.collect{|k, ds| ds[field] } end # Population proportion based on strata - def proportion(field, v=1) - @ms.sum_field(field) {|s_name,vector| - stratum_ponderation(s_name)*vector.proportion(v) + def proportion(field, v = 1) + @ms.sum_field(field) {|s_name, vector| + stratum_ponderation(s_name) * vector.proportion(v) } end # Stratum ponderation. @@ -242,69 +255,70 @@ def stratum_ponderation(h) @strata_sizes[h].to_f / @population_size end alias_method :wh, :stratum_ponderation - + # Population mean based on strata def mean(field) - @ms.sum_field(field) {|s_name,vector| - stratum_ponderation(s_name)*vector.mean + @ms.sum_field(field) {|s_name, vector| + stratum_ponderation(s_name) * vector.mean } end # Standard error with estimated population variance and without replacement. # Source: Cochran (1972) def standard_error_wor(field) - es=@ms.collect_vector(field) {|s_n, vector| - {'N'=>@strata_sizes[s_n],'n'=>vector.size, 's'=>vector.sds} + es = @ms.collect_vector(field) {|s_n, vector| + {'N' => @strata_sizes[s_n], 'n' => vector.size, 's' => vector.sds} } - + StratifiedSample.standard_error_esd_wor(es) end - + # Standard error with estimated population variance and without replacement. # Source: http://stattrek.com/Lesson6/STRAnalysis.aspx - + def standard_error_wor_2(field) - sum=@ms.sum_field(field) {|s_name,vector| - s_size=@strata_sizes[s_name] - (s_size**2 * (1-(vector.size.to_f / s_size)) * vector.variance_sample / vector.size.to_f) - } - (1/@population_size.to_f)*Math::sqrt(sum) + sum = @ms.sum_field(field) do |s_name, vector| + s_size = @strata_sizes[s_name] + (s_size**2 * (1 - (vector.size.to_f / s_size)) * vector.variance_sample / vector.size.to_f) + end + (1 / @population_size.to_f) * Math::sqrt(sum) end - + def standard_error_wr(field) - es=@ms.collect_vector(field) {|s_n, vector| - {'N'=>@strata_sizes[s_n],'n'=>vector.size, 's'=>vector.sds} + es = @ms.collect_vector(field) {|s_n, vector| + {'N' => @strata_sizes[s_n], 'n' => vector.size, 's' => vector.sds} } - + StratifiedSample.standard_error_esd_wr(es) end - def proportion_sd_esd_wor(field,v=1) - es=@ms.collect_vector(field) {|s_n, vector| - {'N'=>@strata_sizes[s_n],'n'=>vector.size, 'p'=>vector.proportion(v)} + + def proportion_sd_esd_wor(field, v = 1) + es = @ms.collect_vector(field) {|s_n, vector| + {'N' => @strata_sizes[s_n], 'n' => vector.size, 'p' => vector.proportion(v)} } - + StratifiedSample.proportion_sd_esd_wor(es) end - - def proportion_standard_error(field,v=1) - prop=proportion(field,v) - sum=@ms.sum_field(field) {|s_name,vector| - nh=vector.size - s_size=@strata_sizes[s_name] - (s_size**2 * (1-(nh/s_size)) * prop * (1-prop) / (nh -1 )) + + def proportion_standard_error(field, v = 1) + prop=proportion(field, v) + sum = @ms.sum_field(field) {|s_name, vector| + nh = vector.size + s_size = @strata_sizes[s_name] + (s_size**2 * (1 - (nh / s_size)) * prop * (1 - prop) / (nh -1)) } (1.quo(@population_size)) * Math::sqrt(sum) end # Cochran(1971), p. 
150 - def variance_pst(field,v=1) - sum=@ms.datasets.inject(0) {|a,da| - stratum_name=da[0] - ds=da[1] - nh=ds.cases.to_f - s_size=@strata_sizes[stratum_name] - prop=ds[field].proportion(v) - a + (((s_size**2 * (s_size-nh)) / (s_size-1))*(prop*(1-prop) / (nh-1))) - } - (1/@population_size.to_f ** 2)*sum + def variance_pst(field, v = 1) + sum=@ms.datasets.inject(0) do |a, da| + stratum_name = da[0] + ds = da[1] + nh = ds.cases.to_f + s_size = @strata_sizes[stratum_name] + prop = ds[field].proportion(v) + a + (((s_size**2 * (s_size - nh)) / (s_size - 1)) * (prop * (1 - prop) / (nh - 1))) + end + (1 / @population_size.to_f ** 2) * sum end end end diff --git a/lib/statsample/srs.rb b/lib/statsample/srs.rb index 6dcc7d2..752cba5 100644 --- a/lib/statsample/srs.rb +++ b/lib/statsample/srs.rb @@ -61,7 +61,7 @@ def proportion_confidence_interval_z(p, n_sample, n_population, margin=0.95) # Proportion confidence interval with x value # Uses estimated proportion, sample without replacement - def proportion_confidence_interval(p, sam,pop , x) + def proportion_confidence_interval(p, sam, pop, x) #f=sam.quo(pop) one_range=x * Math::sqrt((qf(sam, pop) * p * (1-p)).quo(sam-1)) + (1.quo(sam * 2.0)) [p-one_range, p+one_range] @@ -112,7 +112,7 @@ def proportion_total_sd_ep_wor(prop, sam, pop) ######################## # - # :SECTION: Mean stimation + # :SECTION: Mean estimation # ######################## diff --git a/lib/statsample/timeseries.rb b/lib/statsample/timeseries.rb index d9e7aad..3a236b3 100644 --- a/lib/statsample/timeseries.rb +++ b/lib/statsample/timeseries.rb @@ -1,9 +1,10 @@ +require 'statsample/timeseries/pacf' module Statsample::TimeSeriesShorthands # Creates a new Statsample::TimeSeries object # Argument should be equal to TimeSeries.new def to_time_series(*args) - Statsample::TimeSeries::TimeSeries.new(self, :scale, *args) - end + Statsample::TimeSeries::TimeSeries.new(self, :scale, *args) + end alias :to_ts :to_time_series end @@ -17,6 +18,7 @@ module TimeSeries # Collection of data indexed by time. # The order goes from earliest to latest. class TimeSeries < Statsample::Vector + include Statsample::TimeSeries::Pacf # Calculates the autocorrelation coefficients of the series. # # The first element is always 1, since that is the correlation @@ -29,10 +31,10 @@ class TimeSeries < Statsample::Vector # ts.acf # => array with first 21 autocorrelations # ts.acf 3 # => array with first 3 autocorrelations # - def acf maxlags = nil - maxlags ||= (10 * Math.log10(size)).to_i + def acf max_lags = nil + max_lags ||= (10 * Math.log10(size)).to_i - (0..maxlags).map do |i| + (0..max_lags).map do |i| if i == 0 1.0 else @@ -45,6 +47,16 @@ def acf maxlags = nil end end + def pacf(max_lags = nil, method = 'yw') + #parameters: + #max_lags => maximum number of lags for pcf + #method => for autocovariance in yule_walker: + #'yw' for 'yule-walker unbaised', 'mle' for biased maximum likelihood + + max_lags ||= (10 * Math.log10(size)).to_i + Pacf::Pacf.pacf_yw(self, max_lags, method) + end + # Lags the series by k periods. # # The convention is to set the oldest observations (the first ones @@ -86,7 +98,7 @@ def lag k = 1 # # => [0.69, 0.23, 0.44, 0.71, ...] # # ts.diff # => [nil, -0.46, 0.21, 0.27, ...] 
- # + # def diff self - self.lag end diff --git a/lib/statsample/timeseries/pacf.rb b/lib/statsample/timeseries/pacf.rb new file mode 100644 index 0000000..09836b9 --- /dev/null +++ b/lib/statsample/timeseries/pacf.rb @@ -0,0 +1,100 @@ +module Statsample + module TimeSeries + module Pacf + class Pacf + + def self.pacf_yw(timeseries, max_lags, method = 'yw') + #partial autocorrelation by yule walker equations. + #Inspiration: StatsModels + pacf = [1.0] + (1..max_lags).map do |i| + pacf << yule_walker(timeseries, i, method)[-1] + end + pacf + end + + def self.yule_walker(ts, k = 1, method='yw') + #From the series, estimates AR(p)(autoregressive) parameter + #using Yule-Waler equation. See - + #http://en.wikipedia.org/wiki/Autoregressive_moving_average_model + + #parameters: + #ts = series + #k = order, default = 1 + #method = can be 'yw' or 'mle'. If 'yw' then it is unbiased, denominator + #is (n - k) + + #returns: + #rho => autoregressive coefficients + ts = ts - ts.mean + n = ts.size + if method.downcase.eql? 'yw' + #unbiased => denominator = (n - k) + denom =->(k) { n - k } + else + #mle + #denominator => (n) + denom =->(k) { n } + end + r = Array.new(k + 1) { 0.0 } + r[0] = ts.map { |x| x ** 2 }.inject(:+).to_f / denom.call(0).to_f + + 1.upto(k) do |l| + r[l] = (ts[0...-l].zip(ts[l...ts.size])).map do |x| + x.inject(:*) + end.inject(:+).to_f / denom.call(l).to_f + end + + r_R = toeplitz(r[0...-1]) + + mat = Matrix.columns(r_R).inverse() + solve_matrix(mat, r[1..r.size]) + end + + def self.toeplitz(arr) + #Generates Toeplitz matrix - + #http://en.wikipedia.org/wiki/Toeplitz_matrix + #Toeplitz matrix are equal when they are stored in row & + #column major + #=> arr = [0, 1, 2, 3] + #=> result: + #[[0, 1, 2, 3], + # [1, 0, 1, 2], + # [2, 1, 0, 1], + # [3, 2, 1, 0]] + eplitz_matrix = Array.new(arr.size) { Array.new(arr.size) } + + 0.upto(arr.size - 1) do |i| + j = 0 + index = i + while i >= 0 do + eplitz_matrix[index][j] = arr[i] + j += 1 + i -= 1 + end + i = index + 1; k = 1 + while i < arr.size do + eplitz_matrix[index][j] = arr[k] + i += 1; j += 1; k += 1 + end + end + eplitz_matrix + end + + def self.solve_matrix(matrix, out_vector) + solution_vector = Array.new(out_vector.size, 0) + matrix = matrix.to_a + k = 0 + matrix.each do |row| + row.each_with_index do |element, i| + solution_vector[k] += element * 1.0 * out_vector[i] + end + k += 1 + end + solution_vector + end + + end + end + end +end diff --git a/statsample.gemspec b/statsample.gemspec new file mode 100644 index 0000000..34798d9 --- /dev/null +++ b/statsample.gemspec @@ -0,0 +1,30 @@ +Gem::Specification.new do |s| + s.name = 'statsample' + s.version = '1.1.0.2013' + s.date = '2013-04-28' + s.summary = 'StatSample - Ruby Statistical library' + s.description = "A suite for basic and advanced statistics on Ruby. Tested on Ruby 1.8.7 1.9.1, 1.9.2 and 1.9.3, JRuby 1.4(Ruby 1.8.7 compatible)." 
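To make the new `Statsample::TimeSeries::Pacf` helpers shown above easier to follow, here is a small usage sketch. It is illustrative only and assumes this patch is installed; the expected output shapes and the Toeplitz layout are taken from `test_pacf.rb` and the comments in `pacf.rb`:

```ruby
require 'statsample'

# The same simple series used in test_pacf.rb
ts = (1..20).map { |x| x * 10 }.to_ts

# Partial autocorrelations via the Yule-Walker equations:
# 'yw' uses the unbiased denominator (n - k), 'mle' uses n.
p ts.pacf(5, 'yw')   # 6 values; the first is always 1.0
p ts.pacf(5, 'mle')

# The helper that builds the symmetric Toeplitz matrix of autocovariances
p Statsample::TimeSeries::Pacf::Pacf.toeplitz([0, 1, 2, 3])
# => [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
```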
+ s.authors = ["Claudio Bustos"] + s.email = ["clbustos@gmail.com"] + s.executables = `git ls-files -- bin/*`.split("\n").map { |f| File.basename(f) } + s.files = `git ls-files -- lib/*`.split("\n") + s.extra_rdoc_files += ['History.txt', 'Manifest.txt', 'README.md', 'references.txt'] + + s.homepage = "http://github.com/clbustos/statsample" + s.require_path = "lib" + s.rubyforge_project = "ruby-statsample" + s.test_files = `git ls-files -- test features`.split("\n") + + DEPENDENCIES = [{:gem => "spreadsheet", :version => "~> 0.8.5"}, {:gem => "reportbuilder", :version => "~> 1.4.2"}, + {:gem => "minimization", :version => "~> 0.2.1"}, {:gem => "fastercsv", :version => "~> 1.5.5"}, + {:gem => "dirty-memoize", :version => "~> 0.0.4"}, {:gem => "extendmatrix", :version => "~> 0.3.1"}, + {:gem => "statsample-bivariate-extension", :version => "~> 1.1.0"}, {:gem => "rserve-client", :version => "~> 0.3.0"}, + {:gem => "rubyvis", :version => "~> 0.5.2"}, {:gem => "gettext", :version => "~> 2.3.9"}, + {:gem => "mocha", :version => "~> 0.14.0"}, {:gem => "hoe-git", :version => "~> 1.5.0"}, + {:gem => "minitest", :version => "~> 4.2"}, {:gem => "shoulda", :version => "~> 3.5.0"}, + {:gem => "cucumber", :version => "~>1.3.3"} + ] + DEPENDENCIES.each do |dependency| + s.add_dependency(dependency[:gem], dependency[:version]) + end +end diff --git a/test/helpers_tests.rb b/test/helpers_tests.rb index eca4b75..8eba53b 100644 --- a/test/helpers_tests.rb +++ b/test/helpers_tests.rb @@ -25,10 +25,8 @@ def self.should_with_gsl(name,&block) else skip("Requires GSL") end - end end - end end diff --git a/test/test_pacf.rb b/test/test_pacf.rb new file mode 100644 index 0000000..278214b --- /dev/null +++ b/test/test_pacf.rb @@ -0,0 +1,52 @@ +require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) +class StatsampleTimeSeriesPacfTestCase < MiniTest::Unit::TestCase + context(Statsample::TimeSeries) do + include Statsample::TimeSeries + setup do + @ts = (1..20).map { |x| x * 10 }.to_ts + #setting up a proc to get a closure for pacf calling with variable lags and methods + @pacf_proc =->(k, method) { @ts.pacf(k, method) } + end + + should "return correct correct pacf size for lags = 5" do + assert_equal @pacf_proc.call(5, 'yw').size, 6 + assert_equal @pacf_proc.call(5, 'mle').size, 6 + #first element is 1.0 + end + + should "return correct correct pacf size for lags = 10" do + assert_equal @pacf_proc.call(10, 'yw').size, 11 + assert_equal @pacf_proc.call(10, 'mle').size, 11 + #first element is 1.0 + end + + should "have first element as 1.0" do + assert_equal @pacf_proc.call(10, 'yw')[0], 1.0 + assert_equal @pacf_proc.call(10, 'mle')[0], 1.0 + end + + should "give correct pacf results for unbiased yule-walker" do + result_10 = [1.0, 0.8947368421052632, -0.10582010582010604, -0.11350188273265083, -0.12357534824820737, -0.13686534216335522, -0.15470588235294147, -0.17938011883732036, -0.2151192288178601, -0.2707082833133261, -0.3678160919540221] + result_5 = [1.0, 0.8947368421052632, -0.10582010582010604, -0.11350188273265083, -0.12357534824820737, -0.13686534216335522] + assert_equal @pacf_proc.call(10, 'yw'), result_10 + assert_equal @pacf_proc.call(5, 'yw'), result_5 + + #Checking for lag = (1..10) + 1.upto(10) do |i| + assert_equal @pacf_proc.call(i, 'yw'), result_10[0..i] + end + end + + should "give correct pacf results for mle yule-walker" do + result_10 = [1.0, 0.85, -0.07566212829370711, -0.07635069706072706, -0.07698628638512295, -0.07747034005560738, -0.0776780981161499, -0.07744984679625189, 
-0.0765803323191094, -0.07480650005932366, -0.07179435184923755] + result_5 = [1.0, 0.85, -0.07566212829370711, -0.07635069706072706, -0.07698628638512295, -0.07747034005560738] + assert_equal @pacf_proc.call(10, 'mle'), result_10 + assert_equal @pacf_proc.call(5, 'mle'), result_5 + + #Checking for lag = (1..10) + 1.upto(10) do |i| + assert_equal @pacf_proc.call(i, 'mle'), result_10[0..i] + end + end + end +end diff --git a/test/test_regression2.rb b/test/test_regression2.rb new file mode 100644 index 0000000..099c340 --- /dev/null +++ b/test/test_regression2.rb @@ -0,0 +1,235 @@ +require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) + +class StatsampleRegressionTestCase < MiniTest::Unit::TestCase + include Statsample::Shorthand + # context "Example with missing data" do + + def setup + @x=[0.285714285714286, 0.114285714285714, 0.314285714285714, 0.2, 0.2, 0.228571428571429, 0.2, 0.4, 0.714285714285714, 0.285714285714286, 0.285714285714286, 0.228571428571429, 0.485714285714286, 0.457142857142857, 0.257142857142857, 0.228571428571429, 0.285714285714286, 0.285714285714286, 0.285714285714286, 0.142857142857143, 0.285714285714286, 0.514285714285714, 0.485714285714286, 0.228571428571429, 0.285714285714286, 0.342857142857143, 0.285714285714286, 0.0857142857142857].to_scale + + @y=[nil, 0.233333333333333, nil, 0.266666666666667, 0.366666666666667, nil, 0.333333333333333, 0.3, 0.666666666666667, 0.0333333333333333, 0.333333333333333, nil, nil, 0.533333333333333, 0.433333333333333, 0.4, 0.4, 0.5, 0.4, 0.266666666666667, 0.166666666666667, 0.666666666666667, 0.433333333333333, 0.166666666666667, nil, 0.4, 0.366666666666667, nil].to_scale + @ds={'x'=>@x,'y'=>@y}.to_dataset + @lr=Statsample::Regression::Multiple::RubyEngine.new(@ds,'y') + end + + def test_correct_values + assert_in_delta(0.455,@lr.r2,0.001) + assert_in_delta(0.427,@lr.r2_adjusted, 0.001) + assert_in_delta(0.1165,@lr.se_estimate,0.001) + assert_in_delta(15.925,@lr.f,0.0001) + assert_in_delta(0.675, @lr.standarized_coeffs['x'],0.001) + assert_in_delta(0.778, @lr.coeffs['x'],0.001, "coeff x") + assert_in_delta(0.132, @lr.constant,0.001,"constant") + assert_in_delta(0.195, @lr.coeffs_se['x'],0.001,"coeff x se") + assert_in_delta(0.064, @lr.constant_se,0.001,"constant se") + end + #end + def test_error + samples=100 + + a,b=rand,rand + + x1=samples.times.map { rand}.to_scale + x2=samples.times.map {rand}.to_scale + x3=samples.times.map {|i| x1[i]*(1+a)+x2[i]*(1+b)}.to_scale + y=samples.times.map {|i| x1[i]+x2[i]+x3[i]+rand}.to_scale + + ds={'x1'=>x1,'x2'=>x2,'x3'=>x3,'y'=>y}.to_dataset + + assert_raise(Statsample::Regression::LinearDependency) { + Statsample::Regression::Multiple::RubyEngine.new(ds,'y') + } + end + def test_parameters + @x=[13,20,10,33,15].to_vector(:scale) + @y=[23,18,35,10,27].to_vector(:scale) + reg=Statsample::Regression::Simple.new_from_vectors(@x,@y) + _test_simple_regression(reg) + ds={'x'=>@x,'y'=>@y}.to_dataset + reg=Statsample::Regression::Simple.new_from_dataset(ds,'x','y') + _test_simple_regression(reg) + reg=Statsample::Regression.simple(@x,@y) + _test_simple_regression(reg) + + end + def _test_simple_regression(reg) + + assert_in_delta(40.009, reg.a,0.001) + assert_in_delta(-0.957, reg.b,0.001) + assert_in_delta(4.248, reg.standard_error,0.002) + assert(reg.summary) + end + + def test_summaries + a=10.times.map{rand(100)}.to_scale + b=10.times.map{rand(100)}.to_scale + y=10.times.map{rand(100)}.to_scale + ds={'a'=>a,'b'=>b,'y'=>y}.to_dataset + 
lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y') + assert(lr.summary.size>0) + end + def test_multiple_dependent + complete=Matrix[ + [1,0.53,0.62,0.19,-0.09,0.08,0.02,-0.12,0.08], + [0.53,1,0.61,0.23,0.1,0.18,0.02,-0.1,0.15], + [0.62,0.61,1,0.03,0.1,0.12,0.03,-0.06,0.12], + [0.19,0.23,0.03,1,-0.02,0.02,0,-0.02,-0.02], + [-0.09,0.1,0.1,-0.02,1,0.05,0.06,0.18,0.02], + [0.08,0.18,0.12,0.02,0.05,1,0.22,-0.07,0.36], + [0.02,0.02,0.03,0,0.06,0.22,1,-0.01,-0.05], + [-0.12,-0.1,-0.06,-0.02,0.18,-0.07,-0.01,1,-0.03], + [0.08,0.15,0.12,-0.02,0.02,0.36,-0.05,-0.03,1]] + complete.extend Statsample::CovariateMatrix + complete.fields=%w{adhd cd odd sex age monly mwork mage poverty} + + lr=Statsample::Regression::Multiple::MultipleDependent.new(complete, %w{adhd cd odd}) + + + assert_in_delta(0.197, lr.r2yx,0.001) + assert_in_delta(0.197, lr.r2yx_covariance,0.001) + assert_in_delta(0.07, lr.p2yx,0.001) + + end + + def test_multiple_regression_pairwise_2 + @a=[1,3,2,4,3,5,4,6,5,7,3,nil,3,nil,3].to_vector(:scale) + @b=[3,3,4,4,5,5,6,6,4,4,2,2,nil,6,2].to_vector(:scale) + @c=[11,22,30,40,50,65,78,79,99,100,nil,3,7,nil,7].to_vector(:scale) + @y=[3,4,5,6,7,8,9,10,20,30,30,40,nil,50,nil].to_vector(:scale) + ds={'a'=>@a,'b'=>@b,'c'=>@c,'y'=>@y}.to_dataset + lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y') + assert_in_delta(2407.436,lr.sst,0.001) + assert_in_delta(0.752,lr.r,0.001, "pairwise r") + assert_in_delta(0.565,lr.r2,0.001) + assert_in_delta(1361.130,lr.ssr,0.001) + assert_in_delta(1046.306,lr.sse,0.001) + assert_in_delta(3.035,lr.f,0.001) + end + + + def test_multiple_regression_gsl + if Statsample.has_gsl? + @a=[1,3,2,4,3,5,4,6,5,7].to_vector(:scale) + @b=[3,3,4,4,5,5,6,6,4,4].to_vector(:scale) + @c=[11,22,30,40,50,65,78,79,99,100].to_vector(:scale) + @y=[3,4,5,6,7,8,9,10,20,30].to_vector(:scale) + ds={'a'=>@a,'b'=>@b,'c'=>@c,'y'=>@y}.to_dataset + lr=Statsample::Regression::Multiple::GslEngine.new(ds,'y') + assert(lr.summary.size>0) + model_test(lr,'gsl') + predicted=[1.7857, 6.0989, 3.2433, 7.2908, 4.9667, 10.3428, 8.8158, 10.4717, 23.6639, 25.3198] + c_predicted=lr.predicted + predicted.each_index{|i| + assert_in_delta(predicted[i],c_predicted[i],0.001) + } + residuals=[1.2142, -2.0989, 1.7566, -1.29085, 2.033, -2.3428, 0.18414, -0.47177, -3.66395, 4.6801] + c_residuals=lr.residuals + residuals.each_index{|i| + assert_in_delta(residuals[i],c_residuals[i],0.001) + } + else + skip "Regression::Multiple::GslEngine not tested (no Gsl)" + end + end + + + + def model_test_matrix(lr,name='undefined') + + stan_coeffs={'a'=>0.151,'b'=>-0.547,'c'=>0.997} + unstan_coeffs={'a'=>0.695, 'b'=>-4.286, 'c'=>0.266} + + unstan_coeffs.each_key{|k| + assert_in_delta(unstan_coeffs[k], lr.coeffs[k],0.001,"b coeffs - #{name}") + } + + stan_coeffs.each_key{|k| + assert_in_delta(stan_coeffs[k], lr.standarized_coeffs[k],0.001, "beta coeffs - #{name}") + } + + assert_in_delta(11.027,lr.constant,0.001) + + assert_in_delta(0.955,lr.r,0.001) + assert_in_delta(0.913,lr.r2,0.001) + + assert_in_delta(20.908, lr.f,0.001) + assert_in_delta(0.001, lr.probability, 0.001) + assert_in_delta(0.226,lr.tolerance("a"),0.001) + + coeffs_se={"a"=>1.171,"b"=>1.129,"c"=>0.072} + + + + ccoeffs_se=lr.coeffs_se + coeffs_se.each_key{|k| + assert_in_delta(coeffs_se[k],ccoeffs_se[k],0.001) + } + coeffs_t={"a"=>0.594,"b"=>-3.796,"c"=>3.703} + ccoeffs_t=lr.coeffs_t + coeffs_t.each_key{|k| + assert_in_delta(coeffs_t[k], ccoeffs_t[k],0.001) + } + + assert_in_delta(639.6,lr.sst,0.001) + assert_in_delta(583.76,lr.ssr,0.001) + 
assert_in_delta(55.840,lr.sse,0.001) + assert(lr.summary.size>0, "#{name} without summary") + end + def model_test(lr,name='undefined') + model_test_matrix(lr,name) + assert_in_delta(4.559, lr.constant_se,0.001) + assert_in_delta(2.419, lr.constant_t,0.001) + + assert_in_delta(1.785,lr.process([1,3,11]),0.001) + end + def test_regression_matrix + @a=[1,3,2,4,3,5,4,6,5,7].to_vector(:scale) + @b=[3,3,4,4,5,5,6,6,4,4].to_vector(:scale) + @c=[11,22,30,40,50,65,78,79,99,100].to_vector(:scale) + @y=[3,4,5,6,7,8,9,10,20,30].to_vector(:scale) + ds={'a'=>@a,'b'=>@b,'c'=>@c,'y'=>@y}.to_dataset + cor=Statsample::Bivariate.correlation_matrix(ds) + + lr=Statsample::Regression::Multiple::MatrixEngine.new(cor,'y', :y_mean=>@y.mean, :x_mean=>{'a'=>ds['a'].mean, 'b'=>ds['b'].mean, 'c'=>ds['c'].mean}, :cases=>@a.size, :y_sd=>@y.sd , :x_sd=>{'a' => @a.sd, 'b' => @b.sd, 'c' => @c.sd}) + assert_nil(lr.constant_se) + assert_nil(lr.constant_t) + model_test_matrix(lr, "correlation matrix") + + covariance=Statsample::Bivariate.covariance_matrix(ds) + lr=Statsample::Regression::Multiple::MatrixEngine.new(covariance,'y', :y_mean=>@y.mean, :x_mean=>{'a'=>ds['a'].mean, 'b'=>ds['b'].mean, 'c'=>ds['c'].mean}, :cases=>@a.size) + assert(lr.summary.size>0) + + model_test(lr , "covariance matrix") + end + def test_regression_rubyengine + @a=[nil,1,3,2,4,3,5,4,6,5,7].to_vector(:scale) + @b=[nil,3,3,4,4,5,5,6,6,4,4].to_vector(:scale) + @c=[nil,11,22,30,40,50,65,78,79,99,100].to_vector(:scale) + @y=[nil,3,4,5,6,7,8,9,10,20,30].to_vector(:scale) + ds={'a'=>@a,'b'=>@b,'c'=>@c,'y'=>@y}.to_dataset + lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y') + assert_equal(11, lr.total_cases) + assert_equal(10, lr.valid_cases) + model_test(lr, 'rubyengine with missing data') + + predicted=[nil,1.7857, 6.0989, 3.2433, 7.2908, 4.9667, 10.3428, 8.8158, 10.4717, 23.6639, 25.3198] + c_predicted = lr.predicted + predicted.each_index do |i| + if c_predicted[i].nil? + assert(predicted[i].nil?, "Actual #{i} is nil, but expected #{predicted[i]}") + else + assert_in_delta(predicted[i], c_predicted[i], 0.001) + end + end + residuals=[nil,1.2142, -2.0989, 1.7566, -1.29085, 2.033, -2.3428, 0.18414, -0.47177, -3.66395, 4.6801] + c_residuals=lr.residuals + residuals.each_index do |i| + if c_residuals[i].nil? + assert(residuals[i].nil?) + else + assert_in_delta(residuals[i],c_residuals[i],0.001) + end + end + + end +end diff --git a/test/test_srs.rb b/test/test_srs.rb index 1d18cf9..a486423 100644 --- a/test/test_srs.rb +++ b/test/test_srs.rb @@ -6,4 +6,50 @@ def test_std_error assert_equal(108,Statsample::SRS.estimation_n(0.05,0.5,150,0.95).to_i) assert_in_delta(0.0289,Statsample::SRS.proportion_sd_kp_wor(0.5,100,150),0.001) end + + def test_fpc + sample_size = 80 + population_size = 10 + + assert_equal -7, Statsample::SRS.fpc_var(sample_size, population_size).to_i + end + + def test_qf + sample_size = 80 + population_size = 10 + + assert_equal -7, Statsample::SRS.qf(sample_size, population_size).to_i + #non sample fraction = 1 - sample fraction + assert_equal -7, 1 - sample_size.quo(population_size) + end + + def test_proportion_confidence_interval_t + proportion = 59 + n_sample, n_population = 15, 10 + confidence_interval = Statsample::SRS.proportion_confidence_interval_t(proportion, n_sample, n_population).map(&:to_f) + + assert_in_delta 35.2559, confidence_interval[0], 0.001 + assert_in_delta 82.7440, confidence_interval[1], 0.001 + end + + def test_proportion_confidence_interval_z + #estimated proportion with normal distribution. 
+ proportion = 59 + n_sample, n_population = 15, 10 + confidence_interval = Statsample::SRS.proportion_confidence_interval_z(proportion, n_sample, n_population).map(&:to_f) + + assert_in_delta 37.2991, confidence_interval[0], 0.001 + assert_in_delta 80.7008, confidence_interval[1], 0.001 + end + + def test_proportion_sd_kp_wor + #standard deviation without replacement strategy for known proportion + assert_in_delta 11.6979, Statsample::SRS.proportion_sd_kp_wor(40, -30, 20), 0.001 + end + + def test_proportion_sd_ep_wor + #standard deviation with replacement for estimated proportion + assert_in_delta 35.5527, Statsample::SRS.proportion_sd_ep_wr(80, -4), 0.001 + assert_in_delta 38.9700, Statsample::SRS.proportion_sd_ep_wr(68, -2), 0.001 + end end diff --git a/test/test_stratified.rb b/test/test_stratified.rb index eb8ef45..7f0ef8d 100644 --- a/test/test_stratified.rb +++ b/test/test_stratified.rb @@ -3,6 +3,11 @@ class StatsampleStratifiedTestCase < MiniTest::Unit::TestCase def initialize(*args) + @es = [ + {"N" => 5, "s" => 3, "n" => 2}, + {"N" => 10, "s" => 5, "n" => 2}, + {"N" => 30, "s" => 32, "n" => 12} + ] super end def test_mean @@ -14,4 +19,69 @@ def test_mean popv=pop.to_vector(:scale) assert_equal(popv.mean,Statsample::StratifiedSample.mean(av,bv)) end + + def test_standard_error_ksd_wr + # Standard error for known sample design with replacement + es = [ + {"N" => 5, "s" => 3, "n" => 2}, + {"N" => 10, "s" => 5, "n" => 2}, + {"N" => 30, "s" => 32, "n" => 12} + ] + assert_in_delta 6.21279, Statsample::StratifiedSample.standard_error_ksd_wr(es), 0.001 + end + + def test_variance_ksd_wr + # variance of known sample desgin with replacement + result = Statsample::StratifiedSample.variance_ksd_wr(@es) + + assert_in_delta 38.5987, result, 0.001 + assert_in_delta Statsample::StratifiedSample.standard_error_ksd_wr(@es), Math::sqrt(result), 0.001 + end + + def test_calculate_n_total + # calculates total 'n' from the hash + + assert_equal 45, Statsample::StratifiedSample.calculate_n_total(@es) + assert_equal 15, Statsample::StratifiedSample.calculate_n_total(@es.first(2)) + assert_equal 40, Statsample::StratifiedSample.calculate_n_total(@es.last(2)) + end + + def test_variance_ksd_wor + # variance of known sample desgin without replacement + + assert_in_delta 23.2827, Statsample::StratifiedSample.variance_ksd_wor(@es), 0.001 + assert_in_delta 4.7444, Statsample::StratifiedSample.variance_ksd_wor(@es.first(2)), 0.001 + end + + def test_standard_error_ksd_wor + # standard error for known sample design without replacement + # + assert_in_delta 4.8252, Statsample::StratifiedSample.standard_error_ksd_wor(@es), 0.001 + + #for fragment sample design + assert_in_delta 5.4244, Statsample::StratifiedSample.standard_error_ksd_wor(@es.last 2), 0.001 + end + + def test_variance_esd_wor + # variance of estimated sample design without replacement strategry + assert_in_delta 23.2827, Statsample::StratifiedSample.variance_esd_wor(@es), 0.001 + end + + def test_standard_error_esd_wor + # Standard error for estimated sample design without replacement + assert_in_delta 4.82521, Statsample::StratifiedSample.standard_error_esd_wor(@es), 0.001 + end + + def test_standard_error_esd_wr + res = Statsample::StratifiedSample.standard_error_esd_wr(@es) + assert_in_delta 6.21279, res, 0.001 + assert_in_delta Statsample::StratifiedSample.variance_esd_wr(@es), res ** 2, 0.001 + end + + def test_variance_esd_wr + # variance for estimated sample dsign with replacement + + assert_in_delta 38.5987, 
Statsample::StratifiedSample.variance_esd_wr(@es), 0.001 + end + end diff --git a/test/test_test_f.rb b/test/test_test_f.rb index b7cc4a8..9145d7a 100644 --- a/test/test_test_f.rb +++ b/test/test_test_f.rb @@ -1,33 +1,35 @@ require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) class StatsampleTestFTestCase < MiniTest::Unit::TestCase - context(Statsample::Test::F) do - setup do - @ssb=84 - @ssw=68 - @df_num=2 - @df_den=15 - @f=Statsample::Test::F.new(@ssb.quo(@df_num),@ssw.quo(@df_den), @df_num, @df_den) - end - should "have #f equal to msb/msw" do - assert_equal((@ssb.quo(@df_num)).quo(@ssw.quo(@df_den)), @f.f) - end - should "have df total equal to df_num+df_den" do - assert_equal(@df_num + @df_den, @f.df_total) - end - should "have probability near 0.002" do - assert_in_delta(0.002, @f.probability, 0.0005) - end - should "be coerced into float" do - assert_equal(@f.to_f, @f.f) - end - - context("method summary") do - setup do - @summary=@f.summary - end - should "have size > 0" do - assert(@summary.size>0) - end - end + def setup + @ssb=84 + @ssw=68 + @df_num=2 + @df_den=15 + @f=Statsample::Test::F.new(@ssb.quo(@df_num),@ssw.quo(@df_den), @df_num, @df_den) + @summary=@f.summary + end + + def test_f_value + #should have #f equal to msb/msw + assert_equal((@ssb.quo(@df_num)).quo(@ssw.quo(@df_den)), @f.f) + end + + def test_df_with_num_and_den + # df total should be equal to df_num+df_den + assert_equal(@df_num + @df_den, @f.df_total) + end + + def test_probability + #should have probability near 0.002 + assert_in_delta(0.002, @f.probability, 0.0005) + end + + def test_float_coercion + #should be coerced into float + assert_equal(@f.to_f, @f.f) + end + + def test_summary_size + assert(@summary.size>0) end end diff --git a/test/test_tseries.rb b/test/test_tseries.rb index 35f2cc9..9306ae0 100644 --- a/test/test_tseries.rb +++ b/test/test_tseries.rb @@ -24,8 +24,17 @@ def test_acf end def test_lag - assert_in_delta 16.66, @xiu.lag[@xiu.size - 1], 0.001 - assert_in_delta 16.36, @xiu.lag(2)[@xiu.size - 1], 0.001 + #test of default lag + lag1 = @xiu.lag + + assert_in_delta 16.66, lag1[lag1.size - 1], 0.001 + assert_in_delta 16.36, lag1[lag1.size - 2], 0.001 + + #test with different lagging unit + lag2 = @xiu.lag(2) + + assert_in_delta 16.36, lag2[lag2.size - 1], 0.001 + assert_in_delta 16.56, lag2[lag2.size - 2], 0.001 end def test_delta diff --git a/test/test_wald.rb b/test/test_wald.rb new file mode 100644 index 0000000..30fb87b --- /dev/null +++ b/test/test_wald.rb @@ -0,0 +1,84 @@ +require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) + +class StatsampleTestWaldTest < MiniTest::Unit::TestCase + + # Wald + include Statsample::Shorthand + include Statsample::TimeSeries + def setup + @wald = 100.times.map do + rand(100) + end.to_ts + end + + def generate_acf_series(lags) + @wald.acf(lags) + end + + def compute_mean(acf) + acf.inject(:+) / acf.size + end + + def compute_variance(acf) + result = acf.map { |x| x ** 2 }.inject(:+) / acf.size + return result.to_f + end + + def generate_acf_summation(lags) + acf_series = @wald.acf(lags) + acf_series.map do |x| + x ** 2 + end.inject(:+) + end + + def compute_chi_df(lags) + observed = Matrix[@wald.acf(lags)] + Statsample::Test.chi_square(observed).df + end + + def test_with_5_lags + acf = generate_acf_series(5) + assert_equal generate_acf_summation(5), @wald.acf(5).map { |x| x ** 2 }.inject(:+) + assert_in_delta compute_mean(acf).to_i, 0 + assert_in_delta compute_variance(acf), 1.0/acf.size, 0.1 + assert_equal 
(generate_acf_summation(5)/5).to_i, compute_chi_df(5) + + end + + def test_with_10_lags + acf = generate_acf_series(10) + assert_equal generate_acf_summation(10), @wald.acf(10).map { |x| x ** 2 }.inject(:+) + assert_in_delta compute_mean(acf).to_i, 0 + assert_in_delta compute_variance(acf), 1.0/acf.size, 0.1 + assert_equal (generate_acf_summation(10)/10).to_i, compute_chi_df(10) + end + + def test_with_15_lags + acf = generate_acf_series(15) + assert_equal generate_acf_summation(15), @wald.acf(15).map { |x| x ** 2 }.inject(:+) + assert_in_delta compute_mean(acf).to_i, 0 + assert_in_delta compute_variance(acf), 1.0/acf.size, 0.1 + assert_equal (generate_acf_summation(15)/15).to_i, compute_chi_df(15) + end + + def test_with_20_lags + acf = generate_acf_series(20) + assert_equal generate_acf_summation(20), @wald.acf(20).map { |x| x ** 2 }.inject(:+) + assert_in_delta compute_mean(acf).to_i, 0 + assert_in_delta compute_variance(acf), 1.0/acf.size, 0.1 + assert_equal (generate_acf_summation(20)/20).to_i, compute_chi_df(20) + end + + def test_with_different_series + @wald = 200.times.map { rand }.to_ts + acf_series = @wald.acf(10) + lhs = acf_series.each { |x| x ** 2 }.inject(:+) + rhs = Statsample::Test.chi_square(Matrix[acf_series]).df + assert_equal (lhs/10).to_i, rhs + end + + def test_pacf + #TODO + end +end + diff --git a/test/test_wald2.rb b/test/test_wald2.rb new file mode 100644 index 0000000..1a3eecd --- /dev/null +++ b/test/test_wald2.rb @@ -0,0 +1,71 @@ +require(File.expand_path(File.dirname(__FILE__)+'/helpers_tests.rb')) + +class StatsampleWaldTest < MiniTest::Unit::TestCase + # Wald test is useful to test a series of n acf with Chi-square + # degree of freedom. It is extremely useful to test fit the fit of + # an ARIMA model to test the residuals. + + include Statsample::TimeSeries + include Statsample::Shorthand + include Distribution + + def setup + #create time series to evaluate later + @wald = 100.times.map { rand(100) }.to_ts + end + + def sum_of_squares_of_acf_series(lags) + #perform sum of squares for a series of acf with specified lags + acf = @wald.acf(lags) + return acf.map { |x| x ** 2 }.inject(:+) + end + + def chisquare_cdf(sum_of_squares, lags) + 1 - ChiSquare.cdf(sum_of_squares, lags) + end + + + def test_wald_with_5_lags + #number of lags for acf = 5 + lags = 5 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 + end + + + def test_wald_with_10_lags + #number of lags for acf = 10 + lags = 10 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 + end + + + def test_wald_with_15_lags + #number of lags for acf = 15 + lags = 15 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 + end + + + def test_wald_with_20_lags + #number of lags for acf = 20 + lags = 20 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 + end + + + def test_wald_with_25_lags + #number of lags for acf = 25 + lags = 25 + sum_of_squares = sum_of_squares_of_acf_series(lags) + assert_in_delta chisquare_cdf(sum_of_squares, lags), 1, 0.05 + assert_equal @wald.acf(lags).size, lags + 1 + end +end
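The two Wald-style test suites above boil down to the same recipe: square and sum a series of autocorrelations, then compare the statistic against a chi-square distribution with as many degrees of freedom as lags. A compact standalone sketch, assuming statsample from this patch plus a `Distribution` module exposing `ChiSquare.cdf` (exactly as `test_wald2.rb` assumes):

```ruby
require 'statsample'
include Statsample::TimeSeries
include Distribution   # provides ChiSquare.cdf, as used in test_wald2.rb

# A random series: there should be no meaningful autocorrelation
series = 100.times.map { rand(100) }.to_ts

lags = 10
acf  = series.acf(lags)                        # lags + 1 values, acf[0] == 1.0
sum_of_squares = acf.map { |x| x**2 }.inject(:+)

# Upper-tail probability of the statistic; test_wald2.rb expects this to stay
# close to 1 for a series with no real autocorrelation.
p 1 - ChiSquare.cdf(sum_of_squares, lags)
```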