Add option to specify format of regression statistics in front end #161

Open. Wants to merge 2 commits into base: master.
junder873 (Collaborator) commented May 31, 2024

Following #160, allowing more formatting options on the front end would be useful and likely more intuitive. This request attempts to do this by creating anonymous functions in these cases. The result would be:

regtable(rr1, rr2, rr3, rr4; regression_statistics=[
    Nobs,
    format(R2; precision=6),
    format(AIC; precision=1, commas=true),
    format(AdjR2; precision=3) => "My R2",
    cfmt("%0.2f", BIC),
])

-------------------------------------------------------------
                                   Sales
              -----------------------------------------------
                     (1)          (2)         (3)         (4)
-------------------------------------------------------------
(Intercept)   138.480***   133.068***    0.007***    0.007***
                 (1.427)      (2.868)     (0.000)     (0.000)
NDI             0.007***     0.007***   -0.000***   -0.000***
                 (0.000)      (0.001)     (0.000)     (0.000)
Price          -0.938***    -0.813***    0.000***    0.000***
                 (0.054)      (0.079)     (0.000)     (0.000)
NDI & Price                   -0.000*                 0.000**
                              (0.000)                 (0.000)
-------------------------------------------------------------
Estimator            OLS          OLS       Gamma       Gamma
-------------------------------------------------------------
N                  1,380        1,380       1,380       1,380
R2              0.208806     0.211517
AIC             13,077.1     13,074.3    12,662.8    12,656.1
My R2              0.208        0.210
BIC             13097.98     13100.48    12683.70    12682.26
-------------------------------------------------------------

@junder873
Collaborator Author

In reply to the comment on #161

  • There are/would be two ways to specify a statistic and its format? I see format() clauses and a cfmt() clause. That in itself is confusing, as an example. That said, having additional ways to do the same thing should be OK, as long as one is emphasized in documentation and is easy to understand.
    • Both of these are methods exported by Format.jl, so the hope is that it is easy since it is consistent with the rest of the Julia ecosystem.
  • I find it confusing that, if I understand right, the syntax in both cases has a formatting function wrapping a statistic. Psychologically, for me, this makes formatting primary and what's formatted secondary. But I think of the statistic as primary, so I want a syntax that centers the statistic and makes formatting secondary.
    • How best to do this comes up again later.
  • In Stata, and in Julia if it becomes popular for econometrics, there are lots of estimation packages that produce lots of different return values. So a regression tables package should make it easy and intuitive to report any return value. Neither the user nor the author of the regression package should have to do anything special to make the package work with regtable(): that's the point of standards. I think in a Julia context, we can probably stipulate that the standard is that any return value must be made available through an exported function that takes the estimation result object as its sole argument. That function might be part of StatsAPI, or it might be unique to the estimation package. The user hoping to display the statistic using regtable() should not need to define any functions or types to report it and control its formatting. That's too confusing. Maybe that is already the case. My impression is that it's not, but I'm not sure because of my struggles to understand the package and its documentation.
    • This package does implement StatsAPI in its entirety. (One exception: instead of coefnames, this package relies on formula from StatsModels, which is still a widely used package. It should be possible to add some flexibility so that StatsAPI.coefnames is used when StatsModels is not implemented.) This package almost already works with arbitrary functions. All that is needed is a try-catch block to handle the case where an arbitrary function is not defined for a particular regression model (e.g., your regression model might implement a function mystat, but calling mystat on a GLM will throw an error).
  • I don't understand where in the example the anonymous function is. And I think a regular user should probably not have to deal with that clever a concept.
    • Under the hood, the format(...) and cfmt(...) calls in the example create an anonymous function; the user does not need to write one.
  • It is confusing to me (as a general user) to specify a statistic label with the pairs syntax (format(AdjR2; precision=3) => "My R2"). Statistics have several attributes. Why is the format specified one way and the label a quite different way?
    • Also partially discussed next. However, what other attributes besides label and format might be necessary? Those are the only two I can think of, but if others exist, it would be worthwhile to implement them now.
  • Stata's options syntax is flexible and gives users a good way to think about and invoke program features. The most analogous Julia constructs are keyword arguments and named tuples, which are of course closely related. As a user I want to translate Stata's stat(ll R2, fmt(a2 a3) label("Log likelihood" "R2")) to statistics=(items=[ll R2], fmt=["a2" "a3"], label=["Log likelihood" "R2"]) or statistics=(items=(ll, R2), fmt=("a2", "a3"), label=("Log likelihood", "R2")) or statistics=(ll=>(fmt="a2", label="Log likelihood"), R2=>(fmt="a3", label="R2")) or statistics=(ll=(fmt="a2", label="Log likelihood"), R2=(fmt="a3", label="R2")). I find such syntaxes less confusing than what I'm encountering in regtable(). All but the last would require ll and R2 to already be defined objects. They would probably be functions ll(::<:RegressionModel) -> <:Number and R2(::<:RegressionModel) -> <:Number, which it is reasonable to expect estimation packages to define and export.
    • Of these options, my favorite would be regression_statistics=(ll=>(fmt="a2", label="Log likelihood"), R2=>(fmt="a3", label="R2")). This package already defines LogLikelihood and R2, and this form is more consistent with the pair syntax already in the package (running R2 => "My R2" already works). The downside is that it is inconsistent when the user only wants to adjust the format: R2 => "a3" would be impossible to distinguish from a label. It might be possible to support R2 => fmt(...), since Format.jl already exports a fmt function.
  • One could use macros to offer the user an even cleaner syntax, like @opts stat[ll R2; fmt[a2 a3] label["Log likelihood" "R2"]]. Meta.parse() can parse that line!
    • I am not sure that macros are actually necessary to do this. Also, label["Log likelihood" "R2"] is not very consistent with the rest of Julia.
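The two implementation points in the replies above (a statistic paired with a format becomes an anonymous function, and a try-catch guards against statistics that a given model does not define) can be sketched roughly as follows. This is only an illustration under stated assumptions: ToyOLS, ToyGamma, toy_r2 and with_format are hypothetical names invented here, and the sketch uses the Printf standard library rather than Format.jl or the package's actual internals.

```julia
using Printf

# Hypothetical stand-ins for two fitted-model types (not from any package).
struct ToyOLS
    r2::Float64
end
struct ToyGamma end

# A statistic is just a function of the model. It has no method for ToyGamma.
toy_r2(m::ToyOLS) = m.r2

# Hypothetical helper: pair a statistic with a precision and return an
# anonymous function that maps a model to a formatted table cell.
function with_format(stat; precision::Int=3)
    fmt = Printf.Format("%.$(precision)f")
    return model -> begin
        try
            Printf.format(fmt, stat(model))
        catch err
            # Only swallow "statistic not defined for this model" errors.
            err isa MethodError || rethrow()
            ""  # blank cell, as in the R2 row for the Gamma columns
        end
    end
end

r2_cell = with_format(toy_r2; precision=6)
println(r2_cell(ToyOLS(0.208806)))  # formatted value
println(repr(r2_cell(ToyGamma())))  # empty string
```

One caveat of this design: catching MethodError also hides a MethodError raised deeper inside the statistic itself, so a real implementation might prefer an explicit check such as applicable(stat, model).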

Overall, it seems like the feedback is to not go with format(Stat; ...) => "label" since that seems inconsistent (format is done differently from label). Some other options exist:

  • Probably the most specific way to do this is if the user provides a named tuple Stat => (fmt="format", label="label").
    • This package already implements Stat => "label", so ideally, that would not need to change. However, if that is the case then users might expect Stat => "format" to work, but it would not since that is used for label. Perhaps using the Julia type system and using Stat => fmt("format") would clarify.
  • Another option that is consistent with Julia syntax would be two pairs: Stat => format => "label" (similar to DataFrames.jl combine syntax). The challenge here is that order would matter, and I do not know whether there is a logical order (Stat => "label" => format also makes some sense).
  • Another option is to implement this as a single named tuple: regression_statistics=(statistics=[LogLikelihood, R2], format=["format1", "format2"], labels=["Log Likelihood", "R2"]). This seems challenging since it is very different from what is currently implemented (so it is a bigger breaking change). It also makes things harder for the user when some of the default labels/formats are acceptable, since it would probably require the user to write out all formats/labels.
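To make the first option concrete, here is a minimal sketch of how multiple dispatch could tell Stat => "label", Stat => fmt-like wrapper, and Stat => (fmt=..., label=...) apart. Everything in it is an assumption for illustration: Fmt, ToyStat, ToyR2, ToyLL, default_label, default_fmt and statspec are hypothetical names, not the package's or Format.jl's API.

```julia
# Hypothetical wrapper type that marks a string as a format rather than a label.
struct Fmt
    spec::String
end

# Hypothetical statistic markers standing in for the package's statistic types.
abstract type ToyStat end
struct ToyR2 <: ToyStat end
struct ToyLL <: ToyStat end

default_label(::Type{ToyR2}) = "R2"
default_label(::Type{ToyLL}) = "Log Likelihood"
default_fmt(::Type{<:ToyStat}) = "%0.3f"

# Normalise every accepted spelling to one named tuple (stat, fmt, label).
statspec(s::Type{<:ToyStat}) = (stat=s, fmt=default_fmt(s), label=default_label(s))
# A bare string stays a label, so the existing Stat => "label" keeps working.
statspec(p::Pair{<:Type,<:AbstractString}) = merge(statspec(p.first), (label=String(p.second),))
# A wrapped string is unambiguously a format.
statspec(p::Pair{<:Type,Fmt}) = merge(statspec(p.first), (fmt=p.second.spec,))
# A named tuple overrides whichever of fmt/label it mentions.
statspec(p::Pair{<:Type,<:NamedTuple}) = merge(statspec(p.first), p.second)

specs = map(statspec, [
    ToyLL,
    ToyR2 => "My R2",
    ToyR2 => Fmt("%0.6f"),
    ToyLL => (fmt="%0.1f", label="LL"),
])
foreach(println, specs)
```

Because each spelling has its own method, unspecified attributes fall back to the defaults, and a further attribute could later be added as one more named-tuple field.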

@droodman

droodman commented Jun 4, 2024

  • On the use of formula to get coefficients. As I think I mentioned separately, I think the package shouldn't assume the presence of a formula. That is one way of expressing a model, which works great for many models, but not all. For example, in an ordered probit or logit model, one estimates the cut points that divide the continuum into chunks corresponding to the discretized outcomes. How do the cut points fit into a formula? Put another way, if a formula is required, then I think it should somehow accept multiple formulas--which may get into extending StatsModels? In Stata, solo parameters such as cut points are equivalent to equations that have nothing but a constant term. [cut1]_cons and /cut_1 mean the same thing.
  • "what other attributes besides label and format might be necessary?" Good question. Maybe a scheme for adornment with stars or other symbols? Some added statistics could be test results or estimations of quantities that are functions of the estimated parameters. Or something to specify putting parentheses or brackets around a stat?
  • "Of these options, my favorite would be regression_statistics=(ll=>(fmt="a2", label="Log likelihood"), R2=>(fmt="a3", label="R2"))" I think that would be OK. On balance though I think I'd favor "=" instead of "=>". In the example above, I see a three-level hierarchy for expressing options: regression_statistics=, ll=>, and fmt=... I think it would be better not to have to think about using different symbols at different levels. I understand that ll is not the name of an option the way regression_statistics and fmt are. But in general an options specification could get nested to more levels, and having different symbols would become confusing.
  • "Another option that is consistent with Julia syntax would be two pairs: Stat => format => "label"". Eek no! I think that would be very confusing. To me it looks like a flow chart, which isn't a helpful metaphor here. As you say, there isn't a logical order.
  • "Another option is to implement this in one whole named tuple:..." I like that because it is more parsimonious, as in Stata. That serves the user. Maybe it would be better to use only parentheses rather than mixing parentheses and square brackets, again because it can be confusing to have different syntaxes at different levels. Or it can accept both, along with other iterables.

Probably you can add a lot of these options in a non-breaking way? Currently regression_statistics takes a vector of types or type-string pairs? The new syntaxes would just be introducing new types for that option? Or you can create differently named options with different syntaxes. Or a whole new function, analogous to esttab vs estout in Stata.

Separately, the options could accept missing and/or "" to signal acceptance of defaults.
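As a rough illustration of that suggestion, the sketch below treats missing or "" as "use the default". The helper name pick and the sample labels are hypothetical.

```julia
# Hypothetical helper: keep the user's value unless it is `missing` or "".
pick(user, default) = (ismissing(user) || user == "") ? default : user

defaults = ["Log Likelihood", "R2", "N"]
labels   = [missing, "My R2", ""]   # only the second label is overridden

resolved = map(pick, labels, defaults)
println(resolved)
```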

@droodman

droodman commented Jun 4, 2024

Honestly, right now I think the ideal would be:

regression_statistics=(LogLikelihood, R2; format = ["format1", "format2"], labels = ["Log Likelihood", "R2"])

or

regression_statistics=(LogLikelihood, R2; format = ("format1", "format2"), labels = ("Log Likelihood", "R2"))

That is the most parsimonious, which is good for the user. It matches the Julia function call syntax. And it is parseable. But I guess it would require a macro for implementation.

@junder873
Collaborator Author

Unfortunately,

    regression_statistics=(LogLikelihood, R2; format = ["format1", "format2"], labels = ["Log Likelihood", "R2"])

does not work with the Julia parser. For that to work, it would have to be written as a function:

regtable(
    ...;
    regression_statistics=statistics(LogLikelihood, R2; format = ["format1", "format2"], labels = ["Log Likelihood", "R2"])
)

That is certainly possible, but it adds yet another inconsistent syntax, since none of the other keyword arguments take a function call.
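A minimal sketch of what such a function could look like, assuming it simply zips the positional statistics with the optional format and labels collections. The name statistics comes from the call above, but this body, the ToyLL/ToyR2 stand-ins, and the use of nothing for "default" are assumptions made for illustration only.

```julia
# Hypothetical stand-ins for statistic types.
struct ToyLL end
struct ToyR2 end

# Hypothetical constructor: positional statistics, keyword collections.
# `nothing` in a slot means "use the default for that statistic".
function statistics(stats...; format=nothing, labels=nothing)
    n = length(stats)
    format = something(format, fill(nothing, n))
    labels = something(labels, fill(nothing, n))
    length(format) == n == length(labels) ||
        throw(ArgumentError("format and labels must have one entry per statistic"))
    # Any iterable works for format/labels: vectors and tuples both zip fine.
    return [(stat=s, fmt=f, label=l) for (s, f, l) in zip(stats, format, labels)]
end

out = statistics(ToyLL, ToyR2;
                 format=("format1", "format2"),
                 labels=["Log Likelihood", "R2"])
foreach(println, out)
```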

  • "Maybe a scheme for adornment with stars or other symbols? Some added statistics could be test results or estimations of quantities that are functions of the estimated parameters. Or something to specify putting parentheses or brackets around a stat?"
    • To me, these make sense as simply part of the format. For example, mirroring Julia's built-in Printf, the user would specify format="(%0.2f***)" for a number with two decimal places, surrounded by parentheses and followed by three stars.
  • "Or you can create differently named options with different syntaxes. Or a whole new function, analogous to esttab vs estout in Stata."
    • I want to avoid creating new named options. To me, this would add to the confusion you have experienced if option A works with regtable and option B works with a different regtable function. However, I think you are right that it is possible to implement a few different ways of doing it and rely on the Julia type system to sort them out. So I can create a pair syntax, [LogLikelihood => (format="format1", label="label1"), ...], a named tuple syntax, (statistics=[LogLikelihood], format=["format1"], label=["label1"]), and even a function. It should be possible to make any iterable work, so either a vector ([...]) or a tuple ((...)) would do.
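For reference, the Printf-style format string mentioned above behaves like this with Julia's Printf standard library (the value is just the BIC from the first column of the table, used as sample input). Text outside the % conversion is copied through literally, so the parentheses and stars here are fixed decoration rather than something computed from a test.

```julia
using Printf

# Literal characters around the %0.2f conversion are emitted unchanged.
fmt = Printf.Format("(%0.2f***)")
cell = Printf.format(fmt, 13097.9812)
println(cell)
```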

Finally, a note on the formula and StatsModels. I do plan on making StatsModels optional. I don't know whether you were implementing an ordered probit/logit, but there is a Julia package, OrdinalMultinomialModels.jl, that implements those using StatsModels (the R implementation uses a formula as well). StatsModels is very extensible, as seen in FixedEffectModels.jl's implementation of IVs. StatsModels also deals with cuts (categorical variables) and interactions, which is why so many packages use it.

@droodman

droodman commented Jun 7, 2024

Well, that syntax does parse in the sense that if you pass it as a string to Meta.parse(), you won't get an error. You'll get a proper Julia AST. But you would need to precede any use of such a syntax with a call to a macro, and the macro would need to be written. Maybe the graceful way to do it would indeed be to have a function, which could be called opt(args...; kwargs...) = args, kwargs, that would wrap each option using that sort of syntax. Then no macro would be needed.
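Both halves of this comment can be tried directly. The sketch below first parses the string form with Meta.parse, then defines the opt helper; the only liberty taken is converting the keyword arguments to a named tuple so they are easy to index. The symbols :LogLikelihood and :R2 are placeholders for real statistic objects.

```julia
# Parsing the text succeeds and yields an expression object,
# even though the same text is not usable as a plain tuple literal.
ex = Meta.parse("(LogLikelihood, R2; format = [\"format1\", \"format2\"])")
println(typeof(ex))

# The helper from the comment: positional arguments before `;`, keywords after.
opt(args...; kwargs...) = (args, (; kwargs...))

args, kw = opt(:LogLikelihood, :R2;
               format=["format1", "format2"],
               labels=["Log Likelihood", "R2"])
println(args)
println(kw.labels)
```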

I think regtable() should be able to accept adornment rules, if it doesn't already. I might want code that asks regtable() to figure out how many stars to put on a summary statistic. So then it wouldn't be simply a matter of putting the stars in the formatting string. Also, I can't tell whether there is currently a way to put in the stars and parentheses manually, because the documentation doesn't, as far as I know, define what can go in a format string. At any rate, I think it's best to develop a syntax that makes it easier to add attributes down the road, beyond labelling and formatting (and adornment rules).

I tried to run regtable() on an ordered logit model but got an error (see below). Will regtable() automatically display the cuts? I see now that the ordered probit example is not a good one for the point I was making, since the user does not need to explicitly add the cut parameters to the model specification. Just providing a formula suffices. In my own current application, I am modeling a single outcome variable with a distribution that has 5-10 parameters, none of which are coefficients. I can make a formula to represent it, like z ~ p1 + p2 + tau1 +... but it's just a fake. And it may not work because there are no variables called p1, p2, .... Stata handles this situation by in effect making each parameter its own equation, with only a constant term.

julia> using OrdinalMultinomialModels, RegressionTables, RDatasets

julia> housing = dataset("MASS", "housing");

julia> house_po = polr(@formula(Sat ~ Infl + Type + Cont), housing)
StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

Sat ~ Infl + Type + Cont

Coefficients:
──────────────────────────────────────────────────────────────────────
                           Estimate  Std.Error       t value  Pr(>|t|)
──────────────────────────────────────────────────────────────────────
intercept Low|Medium   -0.693193      0.586303  -1.18231        0.2415
intercept Medium|High   0.693013      0.5863     1.18201        0.2416
Infl: Medium           -0.000158234   0.53033   -0.000298369    0.9998
Infl: High             -0.000158234   0.53033   -0.000298369    0.9998
Type: Apartment         6.84978e-5    0.612372   0.000111856    0.9999
Type: Atrium            6.84978e-5    0.612372   0.000111856    0.9999
Type: Terrace           6.84978e-5    0.612372   0.000111856    0.9999
Cont: High             -5.66775e-5    0.433013  -0.000130891    0.9999
──────────────────────────────────────────────────────────────────────

julia> regtable(house_po)
ERROR: MethodError: no method matching islinear(::StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}})
Stacktrace:
  [1] RegressionType(x::StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}})
    @ RegressionTables C:\Users\drood\.julia\packages\RegressionTables\iz9ba\src\regressionResults.jl:142
  [2] _broadcast_getindex_evalf
    @ .\broadcast.jl:709 [inlined]
  [3] _broadcast_getindex
    @ .\broadcast.jl:682 [inlined]
  [4] (::Base.Broadcast.var"#31#32"{Base.Broadcast.Broadcasted{}})(k::Int64)
    @ Base.Broadcast .\broadcast.jl:1118
  [5] ntuple
    @ .\ntuple.jl:48 [inlined]
  [6] copy
    @ .\broadcast.jl:1118 [inlined]
  [7] materialize
    @ .\broadcast.jl:903 [inlined]
  [8] default_print_estimator(render::AsciiTable, rrs::Tuple{StatsModels.TableRegressionModel{…}})
    @ RegressionTables C:\Users\drood\.julia\packages\RegressionTables\iz9ba\src\regtable.jl:291
  [9] regtable(rrs::StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}})
    @ RegressionTables C:\Users\drood\.julia\packages\RegressionTables\iz9ba\src\regtable.jl:373
 [10] top-level scope
    @ REPL[19]:1
Some type information was truncated. Use `show(err)` to see complete types.
