Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add news.md support #33

Merged
merged 6 commits into from
Oct 6, 2023
Merged

Conversation

maksymiuks
Copy link
Collaborator

This was a wild journey through white space management. After adding the subsection tag (5min) I discovered that sometimes we might get excessive spaces from the Rd parsing. In most cases, it is not a problem, but sometimes it could get appended to the section definition and this breaks the rendering. And then I went down that rabbit hole (countless hours) trying to fix it. My answer to that is general and I believe it will allow us to avoid many other bugs in the future. If there are TEXT tags with standalone spaces (" ") (not white spaces, just spaces, every other sign is fine) I merge them to the nearest meaningful TEXT tags. This way if they are excessive, they are properly pruned further by the clean_text_whitespace function, which upon creation didn't account for possible standalone spaces, instead of being pasted to the output. I believe it solves this and many issues that might occur in the future spanning from these excessive spaces.

Additionally, it is totally possible I overdone it and there is a simple solution, so please be critical, I really want a second pair of eyes on it. It was a good lesson anyway. I appreciate it might be hard to get a grasp of these changes without a proper context so wd can hop in on a call once you are bag @dgkf

@maksymiuks maksymiuks linked an issue Sep 18, 2023 that may be closed by this pull request
Copy link
Collaborator

@dgkf dgkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Suggested a refactor of merge_text_whitespaces that hopefully cuts down on the different conditional paths through the

R/rd2markdown.R Outdated
Comment on lines 157 to 160
#' @param level optional level parameter. 2L by default
#' @param child_level optional parameter that will be set as level to
#' "child" tags parsed using `map_rd2markdown`. It allows proper rendering
#' of sections and subsections.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both a level and a child_level? I would probably lean toward getting rid of child_level and just calling map_rd2markdown with level = level + 1 in the one place it's used.

R/rd2markdown.R Outdated
Comment on lines 164 to 165
rd2markdown.description <- function(x, fragments = c(), ..., title = NULL, level = 2L, child_level = level) {
out <- map_rd2markdown(x, ..., collapse = "", level = child_level)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rd2markdown.description <- function(x, fragments = c(), ..., title = NULL, level = 2L, child_level = level) {
out <- map_rd2markdown(x, ..., collapse = "", level = child_level)
rd2markdown.description <- function(x, fragments = c(), ..., title = NULL, level = 2L) {
out <- map_rd2markdown(x, ..., collapse = "", level = level + 1L)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now when I think about it I had not idea why I introduced this param in the first place. Probably was thinking about something different and overdone it. Removed

Comment on lines 126 to 183
merge_text_whitespaces <- function(x) {
ws <- vapply(x, function(y) grepl("^ *$", y), FUN.VALUE = logical(1))
ib <- vapply(x, function(y) is_block(y), FUN.VALUE = logical(1))
ws <- ws & !ib

seqs <- rle(ws)
values <- seqs$values
lengths <- seqs$lengths

merged <- list()
i <- 1
while (i <= length(values)) {

# Current value was already included
if (lengths[i] == 0) {
i <- i + 1
next()
}
# Get index corresponding to end of current's value subsequence
ws_inds <- sum(lengths[seq(i)])
# Get subvector of the same value of ws
x_sub <- x[seq(ws_inds - lengths[i] + 1, ws_inds)]

if (!values[i]) {
merged <- append(merged, x_sub)
} else {
merged_ws <- paste0(x_sub, collapse = "")
if (isFALSE(values[i + 1]) && !is_block(x[[ws_inds + 1]])) {
merged_ws <- paste0(merged_ws, x[[ws_inds + 1]], collapse = "")
# We already include one of the next FALSE, therefore we reduce
# corresponding lengths value by 1 and increment current one
lengths[i + 1] <- lengths[i + 1] - 1
lengths[i] <- lengths[i] + 1
merged <- append(merged, list(merged_ws))
# If next element is a block, we will try to append to the previous.
} else if (isFALSE(values[i - 1])) {
# Previous value was FALSE, therefore we can use last value in
# merged to append to unless it is block.
if (!is_block(merged[[length(merged)]])) {
merged_ws <- paste0(merged[[length(merged)]], merged_ws, collapse = "")
merged[[length(merged)]] <- merged_ws
} else {
# Previous value was a block. Merge standalone space TEXT
merged_ws <- paste0(x_sub, collapse = "")
merged <- append(merged, list(merged_ws))
}
} else {
# Entire x consists of ws only
merged_ws <- paste0(x_sub, collapse = "")
merged <- append(merged, merged_ws)
}
}

i <- i + 1
}

merged
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels more complicated than I think it needs to be. Instead, we can figure out which bits need to be merged, then paste them together instead of scanning through with many different conditional behaviors.

Note: This isn't a drop-in replacement. Instead of [" ", "a", " ", "b", " "] => [" a", " b "] as it is now, this will produce [" a ", "b "]. This only affects the merge_text_whitespaces tests, though.

Suggested change
merge_text_whitespaces <- function(x) {
ws <- vapply(x, function(y) grepl("^ *$", y), FUN.VALUE = logical(1))
ib <- vapply(x, function(y) is_block(y), FUN.VALUE = logical(1))
ws <- ws & !ib
seqs <- rle(ws)
values <- seqs$values
lengths <- seqs$lengths
merged <- list()
i <- 1
while (i <= length(values)) {
# Current value was already included
if (lengths[i] == 0) {
i <- i + 1
next()
}
# Get index corresponding to end of current's value subsequence
ws_inds <- sum(lengths[seq(i)])
# Get subvector of the same value of ws
x_sub <- x[seq(ws_inds - lengths[i] + 1, ws_inds)]
if (!values[i]) {
merged <- append(merged, x_sub)
} else {
merged_ws <- paste0(x_sub, collapse = "")
if (isFALSE(values[i + 1]) && !is_block(x[[ws_inds + 1]])) {
merged_ws <- paste0(merged_ws, x[[ws_inds + 1]], collapse = "")
# We already include one of the next FALSE, therefore we reduce
# corresponding lengths value by 1 and increment current one
lengths[i + 1] <- lengths[i + 1] - 1
lengths[i] <- lengths[i] + 1
merged <- append(merged, list(merged_ws))
# If next element is a block, we will try to append to the previous.
} else if (isFALSE(values[i - 1])) {
# Previous value was FALSE, therefore we can use last value in
# merged to append to unless it is block.
if (!is_block(merged[[length(merged)]])) {
merged_ws <- paste0(merged[[length(merged)]], merged_ws, collapse = "")
merged[[length(merged)]] <- merged_ws
} else {
# Previous value was a block. Merge standalone space TEXT
merged_ws <- paste0(x_sub, collapse = "")
merged <- append(merged, list(merged_ws))
}
} else {
# Entire x consists of ws only
merged_ws <- paste0(x_sub, collapse = "")
merged <- append(merged, merged_ws)
}
}
i <- i + 1
}
merged
}
merge_text_whitespaces <- function(x) {
if (length(x) == 0) return(x)
# assumes length x > 0
spaces <- grepl("^ *$", x, perl = TRUE)
is_trailing <- cumsum(!spaces) > 1 # after at least 1 non-space character
blocks <- vlapply(x, is_block)
groups <- cumsum(!spaces & is_trailing | blocks)
# collapse non-block groups, maintain first-in-group's attributes
unname(lapply(split(x, groups), function(group) {
if (is_block(group[[1]])) {
group[[1]]
} else {
atts <- attributes(group[[1]])
result <- paste(group, collapse = "")
attributes(result) <- atts
result
}
}))
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is much simpler! I feel a little bit miserable now :P

There are a couple of bugs, however (for instance if the string starts with a block or pearl matching). I will fix them in your version and prompt re-review.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have sligthly refactored your approach, by adding another split logic. The problem with blocks was that they are not only non-space characters but also characters to which we cannot append. That's why it would be really hard to follow your strategy to append always after non-block group. It was leading to skipping first non-space character after block and all space character after block resulting in edge cases like list(block(), " ", " ", block()) -> list(block(), block()) which is something we want to avoid. That's why there were so many edge cases in my original solution. Therefore I simply combined both solutions and removed the problem by doing additional split on each block. Now each sub list is treated (using your brilliant idea) as there were no blocks and everything is merged as expected. Then the blocks and list are pasted back do get the desired result. I think this is a good compromise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. This might keep the simplicity but resolve these cases:

merge_text_spaces <- function(x) {
  if (length(x) == 0) return(x)

  # assumes length x > 0
  spaces <- grepl("^ *$", x)
  blocks <- vlapply(x, is_block)
  lag_blocks <- c(FALSE, blocks[-length(x)])
  is_trailing_after_block <- (cumsum(!spaces) - cumsum(cumsum(!spaces) * blocks)) > 1
  groups <- cumsum(!spaces & is_trailing_after_block | blocks | lag_blocks)

  # collapse non-block groups, maintain first-in-group's attributes
  unname(lapply(split(x, groups), function(group) {
    if (is_block(group[[1]])) {
      group[[1]]
    } else  {
      atts <- attributes(group[[1]])
      result <- paste(group, collapse = "")
      attributes(result) <- atts
      result
    }
  }))
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_trailing_after_block <- (cumsum(!spaces) - cumsum(cumsum(!spaces) * blocks)) > 1 although it's a really smart neat solution, is it simpler? There is a lot of happening in this line and might be hard to unpack for someone reading this code in the future (and this is an open source). To be honest I'm leaning towards just splitting the list on each block and treating strings as if there were no blocks and merging afterwards. I think might be simpler to work around this in the future. What's your opinion on this one?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. I still prefer this style, but maybe we can break it down a bit so it's more clear. What do you think about this as a more annotated form of the same code:

  # assumes length x > 0
  spaces <- grepl("^ *$", x)
  is_text <- !spaces
  blocks <- vlapply(x, is_block)
  lag_blocks <- c(FALSE, blocks[-length(x)])

  # find cells within each block that follow at least one non-space element
  block_text_start <- cumsum(is_text) * blocks
  block_text_index <- cumsum(is_text) - cumsum(block_text_start)
  is_block_first_text <- block_text_index <= 1

  # determine which spans of elements need to be flattened
  groups <- cumsum(is_text & !is_block_first_text | blocks | lag_blocks)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks so much better. Let me test it against some of the edge-cases but just by the look of it it is promising.

@dgkf
Copy link
Collaborator

dgkf commented Oct 4, 2023

@maksymiuks I tried looking into this issue with the excessive newlines between block regions, but I think it's an issue in the current code and not related to this PR.

From what I can tell, when a block is created in a subsection, newlines are automatically appended, then the same block will get more newlines around it when it's incorporated into the parent, so we end up with twice the usual number of surrounding newlines (4 instead of 2).

This might just be more visible now that the subsections are handled.

In the spirit of keeping this PR minimal and to get us unstuck for the \subsection rendering, I'd say we put that formatting issue aside for now. I think solving this properly will take a bit more inspection of the block concatenation, and unfortunately I don't have the bandwidth to really dig into this at the moment.

@maksymiuks maksymiuks merged commit f08d3f9 into main Oct 6, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sections and subsections support
2 participants