Fit of continuous data to discrete distributions should return an error #120

Closed
piklprado opened this issue Feb 9, 2016 · 4 comments

@piklprado
Contributor

Most of the functions that fit discrete sads proceed with the fit even when there are non-integer values in the data. I got fits to data with continuous values from fitls, fitpower and fitmzsm. Fits from the other discrete sads did not converge in the tests I've done, but they return an error from mle2, which shows that they also proceed to fitting. Only fitpoilog stops when there is any continuous value in the data, thanks to the error checking inherited from poilog::fitpoilog:

> distr("poilog")
[1] "discrete"
> x1 <- c(rpoilog(1000, 1.5, 1), 1.1)
> x1 <- x1[x1>0]
> fitpoilog(x1) ## error: "all n must be integers"
Error in dpoilog(un, z[1], exp(z[2])) (from fitpoilog.R#7) : all n must be integers

Which makes sense to me.

@piklprado added this to the sads 0.3.0 milestone Feb 9, 2016
@piklprado
Contributor Author

Some details about the tests I ran (script here):

fitls and fitpower

The density functions for these sads models correctly return zero for non-integer values, which makes the log-likelihood equal to -Inf. Still, the Brent method used in these two functions does return a fit:

> x1 <- c(rls(100, N=1000, 10), 1.18)
> fitls(x1) ## fit with LogLik=-Inf and issues warnings about non-integer values
Maximum likelihood estimation
Type: discrete  species abundance distribution
Species: 101 individuals: 2368.18 

Call:
mle2(minuslogl = function (N, alpha) 
-sum(dls(x, N, alpha, log = TRUE)), start = list(alpha = 21.4238153469672), 
    method = "Brent", fixed = list(N = 2368.18), data = list(
        x = list(1, 75, 3, 4, 1, "etc")), lower = 0, upper = 101L)

Coefficients:
      N   alpha 
2368.18  101.00 

Log-likelihood: -Inf 
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()[1]
Warning message:
In dls(x, N, alpha, log = TRUE) : non integer values in x
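The same mechanism can be reproduced with base R alone. The sketch below uses dpois as a stand-in for dls (an analogy, not the sads code): a single non-integer value gets probability zero, so the summed log-likelihood is -Inf no matter which parameter value the optimizer tries.

```r
# Illustration with base R's Poisson density standing in for dls().
# The value 1.18 is not an integer, so its probability is 0 and its log is -Inf.
x <- c(1, 3, 2, 1.18)

# dpois() warns about the non-integer value; the warning is silenced here
# only to keep the output short.
ll <- suppressWarnings(sum(dpois(x, lambda = 2, log = TRUE)))

ll                 # -Inf
is.infinite(ll)    # TRUE
```

Because the objective is -Inf everywhere, the parameter value the optimizer reports carries no information about the data.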

fitmzsm

The density function incorrectly returns non-zero values for continuous data (#119), and thus a fit to data with continuous values returns a finite log-likelihood:

> x1 <- c(rmzsm(999, 1000, 20), 1.18)
> fitmzsm(x1) ## fit with LogLik!=-Inf and issues warnings about non-integer values.
Maximum likelihood estimation
Type: discrete  species abundance distribution
Species: 1000 individuals: 14280.18 

Call:
mle2(minuslogl = function (J, theta) 
-sum(dmzsm(x, J = J, theta = theta, log = TRUE)), start = list(
    theta = 1000L), method = "Brent", fixed = list(J = 14280.18), 
    data = list(x = list(24, 2, 10, 2, 35, "etc")), lower = 0.001, 
    upper = 1000L)

Coefficients:
         J      theta 
14280.1800   242.8042 

Log-likelihood: -3380.22 
> warnings()[1]
Warning message:
In dls(x, N, alpha, log = TRUE) : non integer values in x

fitgeom, fitpowbend, fitnbinom, fitvolkov

At least in my tests, these functions did not return a fit because of convergence problems.

@andrechalom
Member

OK, what we need to decide is how to deal with this in a coherent fashion. I believe that all fitting procedures should return an error if invalid data is entered, but the problem is: what is invalid data? Non-integer numbers are invalid for discrete fits, that's fine, but negative numbers are also invalid for all distributions, and the functions still fit them (with ll=-Inf):

> fitls(x = c(-1, moths))
(...)
Coefficients:
    N alpha 
15608   241 

Log-likelihood: -Inf 

This is particularly troubling for rad fits: because the data are converted to ranks, no checking at all is done and the fit seems valid:

> fitzipf(x = c(-1, moths))
(...)
Coefficients:
         N          s 
240.000000   1.034841 

Log-likelihood: -65008.43 
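A minimal base R sketch of why a negative abundance can go unnoticed in a rank-abundance fit. It assumes that the likelihood weights the log-probability of each rank by the abundance at that rank; that is an assumption about the internals, and dzipf_log is a hypothetical stand-in, not the sads function.

```r
# Hypothetical stand-in for a Zipf rank density (log scale), N ranks, exponent s.
dzipf_log <- function(r, N, s) -s * log(r) - log(sum((1:N)^(-s)))

x <- c(-1, 10, 5, 2)                    # one invalid (negative) abundance
abund <- sort(x, decreasing = TRUE)     # 10, 5, 2, -1
ranks <- seq_along(abund)               # 1, 2, 3, 4: always valid ranks

# Assumed form of the log-likelihood: abundances act only as weights.
ll <- sum(abund * dzipf_log(ranks, N = length(abund), s = 1))

is.finite(ll)   # TRUE: the negative value just becomes a negative weight
```

Under this assumption the ranks are valid whatever the abundances are, so nothing in the density itself can flag the bad value; the check has to happen on x before the conversion.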

We can add a check to all fitting functions to make sure x is positive; also integer if the distribution is discrete. Are we overlooking some other case of invalid data?
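A sketch of what such a check could look like, written in base R. The name check_sad_data, its arguments and the error messages are hypothetical and not part of the sads API; the tolerance for the integer test is also an assumption.

```r
# Hypothetical validation helper, to be called at the top of each fitting function.
# x        : vector of abundances
# discrete : TRUE if the distribution being fitted is discrete
# tol      : tolerance used to decide whether a value counts as an integer
check_sad_data <- function(x, discrete = FALSE,
                           tol = sqrt(.Machine$double.eps)) {
  if (!is.numeric(x) || length(x) == 0L)
    stop("x must be a non-empty numeric vector")
  if (any(!is.finite(x)))
    stop("x must not contain NA, NaN or infinite values")
  if (any(x <= 0))
    stop("all values of x must be positive")
  if (discrete && any(abs(x - round(x)) > tol))
    stop("all values of x must be integers for discrete distributions")
  invisible(x)
}

# Valid discrete data pass silently.
check_sad_data(c(1, 75, 3, 4), discrete = TRUE)

# A non-integer value stops a discrete fit before any optimization starts.
msg <- tryCatch(check_sad_data(c(1, 75, 1.18), discrete = TRUE),
                error = function(e) conditionMessage(e))
msg
```

Running the check before the data reach mle2 or the rank conversion would give the same error for sad and rad fits alike. Note that this sketch also rejects zeros, which is a separate decision.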

@andrechalom
Member

[Related: #101 and #18. Can we find some way to automatically discard zeros? How do zero counts relate to parametric diversity indexes?]

@piklprado
Contributor Author

Checking that x>0 for all models and that x is integer for discrete models seems enough for version 0.3. We leave the zero issue (#101 and #18) for version 1.0.0.


2 participants