Weird results with trueLL #5

Open
piklprado opened this issue Jul 16, 2014 · 11 comments

@piklprado
Contributor

trueLL is supposed to provide a fair comparison between the likelihoods of discrete and continuous distributions. For Fisher's moth data, however, I got a weird result with trueLL=TRUE: a truncated lognormal model has a lower AIC than a Poisson-lognormal model, but diagnostic plots suggest that poilog fits better:

moths.ls <- fitsad(moths, "ls")
moths.pln <- fitsad(moths, "poilog")
moths.ln1 <- fitsad(moths, "lnorm", trunc=0.5)
moths.ln2 <- fitsad(moths, "lnorm", trunc=0.5, trueLL=FALSE)
AICctab(moths.ls, moths.pln, moths.ln1, moths.ln2,
        nobs=length(moths), base=TRUE, weights=TRUE)

The model selection table:

          AICc   dAICc  df weight
moths.ln1 2175.0    0.0 2  0.44  
moths.pln 2176.2    1.2 2  0.24  
moths.ln2 2176.8    1.8 2  0.18  
moths.ls  2177.4    2.5 1  0.13  

and the plot:

plot(octav(moths))
lines(octavpred(moths.pln), col="blue")
lines(octavpred(moths.ln1), col="red")
legend("topright", c("Poilog", "Lognormal"), 
       lty=1, pch=1, col=c("blue", "red"))

[Figure: octave plot of the moths data with the fitted poilog and truncated lognormal curves]
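For readers who want to check the table by hand, here is a minimal sketch of the AICc and Akaike-weight arithmetic in base R. It assumes the trueLL value of -1085.468 reported further down in this thread is the log-likelihood behind moths.ln1, and that the moths data has 240 species; neither is stated in this comment. `aicc()` is a helper written for this note, not a sads or bbmle function.

```r
# AICc from a log-likelihood; aicc() is a hypothetical helper, not part of sads.
aicc <- function(ll, k, n) -2 * ll + 2 * k + 2 * k * (k + 1) / (n - k - 1)

ll_ln1 <- -1085.468  # trueLL value quoted later in the thread (assumed to match moths.ln1)
n_obs  <- 240        # ASSUMPTION: number of species in the moths data
aicc_ln1 <- aicc(ll_ln1, k = 2, n = n_obs)
print(round(aicc_ln1, 1))

# Akaike weights recomputed from the dAICc column of the table
d <- c(ln1 = 0, pln = 1.2, ln2 = 1.8, ls = 2.5)
w <- exp(-d / 2) / sum(exp(-d / 2))
print(round(w, 2))
```

Because the dAICc inputs are already rounded to one decimal, the recomputed weights may differ from the table in the last digit.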

@piklprado
Contributor Author

I conservatively removed the trueLL argument from the fitting functions. It will be available in the AIC methods, but not as the default until this issue is solved.

@andrechalom
Member

Not exactly related to this issue, but the man page of the trueLL function has a marker for a merge conflict:

<<<<<<< HEAD
> trueLL(x, "lnorm", coef=list(meanlog=mean(log(x)), sdlog=sd(log(x))),
> dec.places=1, )
> Data in classes
> xoc <- octav(x)
> xc <- as.numeric(as.character(xoc$octave))
> xb <- 2^(c(min(xc)-1, xc))
> xh <- hist(x, breaks=xb, plot=FALSE)
> xll <- trueLL(x, dens="lnorm", breaks = xb, counts = xoc$Freq,
>    coef = list(meanlog=mean(log(x)), sd=sd(log(x))))
> xp <- diff(plnorm(xh$breaks, mean(log(x)), sd(log(x))))
> xll2 <- sum( rep(log(xp), xh$counts))
> all.equal(xll, xll2) # should be TRUE
> =======
> trueLL(x, "lnorm", coef=list(meanlog=mean(log(x)), sdlog=sd(log(x))), dec.places=1)
> >>>>>>> provisorio

Also: the examples in fitsad.Rd use the "dec.places" argument in the current version. I'm commenting them out in the development branch until dec.places is supported.

@piklprado
Contributor Author

Conflict resolved, and I have removed the commented lines for now.

@piklprado
Contributor Author

Before implementing the method in AIC, we need to understand the weird result itself.

@andrechalom andrechalom added this to the sads 0.2.0 milestone May 22, 2015
@andrechalom
Member

The decision of where to cut the underlying distribution is very problematic. I don't see any theoretical reason to choose among three alternatives:

  1. at the midpoints between the integers (so that "5" individuals, for example, are accounted for in the interval [4.5, 5.5])
  2. at the integers, from the left ("5" is accounted for in [5, 6])
  3. at the integers, from the right ("5" is accounted for in [4, 5])

The first seems a bit more natural, but the choice is perfectly arbitrary. However, the impact of choosing each alternative is huge:

> l <- fitlnorm(moths, trunc=0.5)
> trueLL(moths, "lnorm", as.list(coef(l)), trunc=0.5) # alternative 1
[1] -1085.468
> trueLL(moths+0.5, "lnorm", as.list(coef(l)), trunc=0.5) # alternative 2
[1] -1102.392
> trueLL(moths-0.5, "lnorm", as.list(coef(l)), trunc=0.5) # alternative 3
[1] -1095.544

In this light, I don't think it's advisable to use trueLL with count data unless we can find a firmer theoretical grounding.
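The three alternatives above can be sketched in base R without the sads package. This is only an illustration under stated assumptions: `interval_ll()` and its arguments are hypothetical names, the data are simulated rather than the moths data, and clipping the lower limit at the truncation point is a choice made for this sketch, not necessarily what trueLL does.

```r
# Log-likelihood of integer counts under a truncated lognormal, integrating the
# density over a unit-width interval placed around each count.
# interval_ll() is a hypothetical helper, not the sads trueLL() implementation.
interval_ll <- function(x, meanlog, sdlog, trunc, lower, upper) {
  # probability mass above the truncation point, used to renormalise
  tail_mass <- plnorm(trunc, meanlog, sdlog, lower.tail = FALSE)
  # lower limits are clipped at trunc so no mass is taken from below it (assumption)
  p <- (plnorm(x + upper, meanlog, sdlog) -
        plnorm(pmax(x + lower, trunc), meanlog, sdlog)) / tail_mass
  sum(log(p))
}

set.seed(42)
x  <- ceiling(rlnorm(200, meanlog = 2, sdlog = 1.5))  # simulated abundances, all >= 1
ml <- mean(log(x))
sl <- sd(log(x))

ll_mid   <- interval_ll(x, ml, sl, trunc = 0.5, lower = -0.5, upper = 0.5)  # [x-0.5, x+0.5]
ll_left  <- interval_ll(x, ml, sl, trunc = 0.5, lower = 0,    upper = 1)    # [x, x+1]
ll_right <- interval_ll(x, ml, sl, trunc = 0.5, lower = -1,   upper = 0)    # [x-1, x]
print(c(mid = ll_mid, left = ll_left, right = ll_right))
```

The same parameters and the same data give three different log-likelihoods, which is the sensitivity to break points described in this comment.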

@andrechalom
Member

I rewrote the code on trueLL, along with the man pages, on branch https://github.com/andrechalom/sads/tree/trueLL. I believe the weird results are caused by the extreme sensitivity of the trueLL to the break points used. Sometimes, the breaks get "lucky" and end up providing a smaller nLL, but most of the time they provide a very large increase in nLL.

In my opinion, what we should do is:

  • Keep trueLL as it is, with a large warning in the man page and maybe in the vignette,
  • Do not include it in AIC, AICtab or fitsad methods, even as non-default,
  • Maybe write an AICt function to give AIC based on trueLL.

@andrechalom andrechalom mentioned this issue May 26, 2015
@piklprado
Contributor Author

I agree that we need firmer theoretical grounding to use trueLL. So I'd rather be more conservative: move trueLL and its man page to a branch from dev, and remove this issue from the milestone.

@andrechalom
Member

So, trueLL should be kept just on a branch? For the released package, should we remove trueLL methods?

@piklprado
Contributor Author

Yes. Thinking that a package encapsulates an analytical workflow, I can't
see how trueLL fits in the package in the current state. What do you think?

@andrechalom andrechalom removed this from the sads 0.2.0 milestone May 29, 2015
@andrechalom andrechalom added this to the sads 0.3.0 milestone Jun 13, 2015
@piklprado piklprado modified the milestones: sads 1.0.0, sads 0.3.0 Nov 3, 2015
@andrechalom
Member

I have updated the branch trueLL to incorporate all the changes done so far in the package.

@andrechalom
Member

A couple of updates:

1- The weird behaviour of trueLL might be related to truncated continuous distributions. Compare the weird results above with this fit with no truncation:

> fit <- fitlnorm(moths)
> logLik(fit)
'log Lik.' -1097.723 (df=2)
> trueLL(fit)
[1] -1097.779

The difference here is around 0.05; in contrast, the version truncated at 0.5 diverges by almost 1 point. Other data sets show the same behavior.

Also, the "third alternative" mentioned in the 22 May 2015 comment is actually meaningless for this data, as it involves calculating probabilities below the truncation point: since D/2 = 0.5 for dec.places=0 (the default), moths-0.5 contains some data points equal to 0.5. The trueLL for these points is the integral from 0 to 1, but the distribution is truncated at 0.5, which is larger than the lower integration limit of 0. The following graph shows how bizarre the trueLL is for truncated moth fits in which the truncation point is larger than D/2=0.5:
[Figure "truell": trueLL for truncated fits to the moths data with a truncation point larger than D/2 = 0.5]

I am adding a check on trueLL to guarantee that the smallest x-D/2 is larger than the truncation point.
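A minimal sketch of such a check, in base R. `check_trunc()` is a hypothetical name, not the function added to the branch, and allowing the lower limit to equal the truncation point is an assumption made here so that the usual case (minimum count 1, D/2 = 0.5, trunc = 0.5) still passes.

```r
# Hypothetical guard: refuse to integrate below the truncation point.
check_trunc <- function(x, trunc, dec.places = 0) {
  D <- 10^(-dec.places)  # measurement resolution implied by dec.places
  # the lowest integration limit is min(x) - D/2; equality with trunc is allowed (assumption)
  if (min(x) - D / 2 < trunc)
    stop("lowest integration limit (min(x) - D/2) is below the truncation point")
  invisible(TRUE)
}

counts <- c(1, 2, 5, 12)
ok  <- check_trunc(counts, trunc = 0.5)                           # passes: 1 - 0.5 == 0.5
bad <- try(check_trunc(counts - 0.5, trunc = 0.5), silent = TRUE) # fails: 0.5 - 0.5 < 0.5
print(inherits(bad, "try-error"))
```

The second call corresponds to the "third alternative": shifting the data by -0.5 pushes the lowest interval below the truncation point, so the guard stops.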

2- The results for trueLL, irrespective of truncation, are extremely sensitive to dec.places. The values of logLik for fitting moths range from around -1085 for sensible distributions to -1150 for fitgamma/fitweibull. The value of trueLL drops astonishingly fast as dec.places increases, reaching -2200 for dec.places=2 (or D/2 = 5e-3). Notably, this drop forms a straight line when plotted against dec.places:

[Figure "truell2": trueLL for the moths fit plotted against dec.places]
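One possible reading of the straight line noted above, offered as a sketch rather than a settled explanation: with interval width D = 10^(-dec.places), the probability of a narrow interval around each observation is roughly density times D, so every extra decimal place subtracts about log(10) per observation from the log-likelihood. The code below checks this on simulated counts; `narrow_ll()` is a hypothetical helper, and the sample size of 240 is an assumption chosen to mirror the moths data.

```r
# Interval log-likelihood for an untruncated lognormal with interval width
# D = 10^(-dec.places); narrow_ll() is a hypothetical helper, not sads::trueLL().
narrow_ll <- function(x, meanlog, sdlog, dec.places) {
  D <- 10^(-dec.places)
  sum(log(plnorm(x + D / 2, meanlog, sdlog) - plnorm(x - D / 2, meanlog, sdlog)))
}

set.seed(1)
x  <- ceiling(rlnorm(240, meanlog = 2, sdlog = 1.5))  # simulated counts (ASSUMPTION: n = 240)
ml <- mean(log(x))
sl <- sd(log(x))

dp    <- 0:3
ll    <- sapply(dp, function(d) narrow_ll(x, ml, sl, d))
steps <- diff(ll)                     # change in log-likelihood per extra decimal place
expected_step <- -length(x) * log(10) # about -552.6 for 240 observations

print(steps)
print(expected_step)
```

If moths does have 240 species, two decimal places would cost about 1105 log-likelihood units, which is close to the drop from roughly -1098 to -2200 reported in this comment. Under this reading the slope reflects only the sample size and the interval width, not the quality of the fit.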
