Add news.md support #33

maksymiuks · 2023-09-18T18:39:20Z

This was a wild journey through white space management. After adding the subsection tag (5min) I discovered that sometimes we might get excessive spaces from the Rd parsing. In most cases, it is not a problem, but sometimes it could get appended to the section definition and this breaks the rendering. And then I went down that rabbit hole (countless hours) trying to fix it. My answer to that is general and I believe it will allow us to avoid many other bugs in the future. If there are TEXT tags with standalone spaces (" ") (not white spaces, just spaces, every other sign is fine) I merge them to the nearest meaningful TEXT tags. This way if they are excessive, they are properly pruned further by the clean_text_whitespace function, which upon creation didn't account for possible standalone spaces, instead of being pasted to the output. I believe it solves this and many issues that might occur in the future spanning from these excessive spaces.

Additionally, it is totally possible I overdone it and there is a simple solution, so please be critical, I really want a second pair of eyes on it. It was a good lesson anyway. I appreciate it might be hard to get a grasp of these changes without a proper context so wd can hop in on a call once you are bag @dgkf

dgkf

Looks good. Suggested a refactor of merge_text_whitespaces that hopefully cuts down on the different conditional paths through the

dgkf · 2023-09-23T17:27:46Z

R/rd2markdown.R

+#' @param level optional level parameter. 2L by default
+#' @param child_level optional parameter that will be set as level to
+#' "child" tags parsed using `map_rd2markdown`. It allows proper rendering
+#' of sections and subsections.


Do we need both a level and a child_level? I would probably lean toward getting rid of child_level and just calling map_rd2markdown with level = level + 1 in the one place it's used.

dgkf · 2023-09-23T17:28:00Z

R/rd2markdown.R

+rd2markdown.description <- function(x, fragments = c(), ..., title = NULL, level = 2L, child_level = level) {
+  out <- map_rd2markdown(x, ..., collapse = "", level = child_level)


Suggested change

rd2markdown.description <- function(x, fragments = c(), ..., title = NULL, level = 2L, child_level = level) {

out <- map_rd2markdown(x, ..., collapse = "", level = child_level)

rd2markdown.description <- function(x, fragments = c(), ..., title = NULL, level = 2L) {

out <- map_rd2markdown(x, ..., collapse = "", level = level + 1L)

Now when I think about it I had not idea why I introduced this param in the first place. Probably was thinking about something different and overdone it. Removed

dgkf · 2023-09-23T19:07:40Z

R/map_rd2markdown.R

+merge_text_whitespaces <- function(x) {
+  ws <- vapply(x, function(y) grepl("^ *$", y), FUN.VALUE = logical(1))
+  ib <- vapply(x, function(y) is_block(y),  FUN.VALUE = logical(1))
+  ws <- ws & !ib
+
+  seqs <- rle(ws)
+  values <- seqs$values
+  lengths <- seqs$lengths
+
+  merged <- list()
+  i <- 1
+  while (i <= length(values)) {
+
+    # Current value was already included
+    if (lengths[i] == 0) {
+      i <- i + 1
+      next()
+    }
+    # Get index corresponding to end of current's value subsequence
+    ws_inds <- sum(lengths[seq(i)])
+    # Get subvector of the same value of ws
+    x_sub <- x[seq(ws_inds - lengths[i] + 1, ws_inds)]
+
+    if (!values[i]) {
+      merged <- append(merged, x_sub)
+    } else {
+      merged_ws <- paste0(x_sub, collapse = "")
+      if (isFALSE(values[i + 1]) && !is_block(x[[ws_inds + 1]])) {
+        merged_ws <- paste0(merged_ws, x[[ws_inds + 1]], collapse = "")
+        # We already include one of the next FALSE, therefore we reduce
+        # corresponding lengths value by 1 and increment current one
+        lengths[i + 1] <- lengths[i + 1] - 1
+        lengths[i] <- lengths[i] + 1
+        merged <- append(merged, list(merged_ws))
+        # If next element is a block, we will try to append to the previous.
+      } else if (isFALSE(values[i - 1])) {
+        # Previous value was FALSE, therefore we can use last value in
+        # merged to append to unless it is block.
+        if (!is_block(merged[[length(merged)]])) {
+          merged_ws <- paste0(merged[[length(merged)]], merged_ws, collapse = "")
+          merged[[length(merged)]] <- merged_ws
+        } else {
+        # Previous value was a block. Merge standalone space TEXT
+          merged_ws <- paste0(x_sub, collapse = "")
+          merged <- append(merged, list(merged_ws))
+        }
+      } else {
+        # Entire x consists of ws only
+        merged_ws <- paste0(x_sub, collapse = "")
+        merged <- append(merged, merged_ws)
+      }
+    }
+
+    i <- i + 1
+  }
+
+  merged
+}


This feels more complicated than I think it needs to be. Instead, we can figure out which bits need to be merged, then paste them together instead of scanning through with many different conditional behaviors.

Note: This isn't a drop-in replacement. Instead of [" ", "a", " ", "b", " "] => [" a", " b "] as it is now, this will produce [" a ", "b "]. This only affects the merge_text_whitespaces tests, though.

Suggested change

merge_text_whitespaces <- function(x) {

ws <- vapply(x, function(y) grepl("^ *$", y), FUN.VALUE = logical(1))

ib <- vapply(x, function(y) is_block(y), FUN.VALUE = logical(1))

ws <- ws & !ib

seqs <- rle(ws)

values <- seqs$values

lengths <- seqs$lengths

merged <- list()

i <- 1

while (i <= length(values)) {

# Current value was already included

if (lengths[i] == 0) {

i <- i + 1

next()

}

# Get index corresponding to end of current's value subsequence

ws_inds <- sum(lengths[seq(i)])

# Get subvector of the same value of ws

x_sub <- x[seq(ws_inds - lengths[i] + 1, ws_inds)]

if (!values[i]) {

merged <- append(merged, x_sub)

} else {

merged_ws <- paste0(x_sub, collapse = "")

if (isFALSE(values[i + 1]) && !is_block(x[[ws_inds + 1]])) {

merged_ws <- paste0(merged_ws, x[[ws_inds + 1]], collapse = "")

# We already include one of the next FALSE, therefore we reduce

# corresponding lengths value by 1 and increment current one

lengths[i + 1] <- lengths[i + 1] - 1

lengths[i] <- lengths[i] + 1

merged <- append(merged, list(merged_ws))

# If next element is a block, we will try to append to the previous.

} else if (isFALSE(values[i - 1])) {

# Previous value was FALSE, therefore we can use last value in

# merged to append to unless it is block.

if (!is_block(merged[[length(merged)]])) {

merged_ws <- paste0(merged[[length(merged)]], merged_ws, collapse = "")

merged[[length(merged)]] <- merged_ws

} else {

# Previous value was a block. Merge standalone space TEXT

merged_ws <- paste0(x_sub, collapse = "")

merged <- append(merged, list(merged_ws))

}

} else {

# Entire x consists of ws only

merged_ws <- paste0(x_sub, collapse = "")

merged <- append(merged, merged_ws)

}

}

i <- i + 1

}

merged

}

merge_text_whitespaces <- function(x) {

if (length(x) == 0) return(x)

# assumes length x > 0

spaces <- grepl("^ *$", x, perl = TRUE)

is_trailing <- cumsum(!spaces) > 1 # after at least 1 non-space character

blocks <- vlapply(x, is_block)

groups <- cumsum(!spaces & is_trailing | blocks)

# collapse non-block groups, maintain first-in-group's attributes

unname(lapply(split(x, groups), function(group) {

if (is_block(group[[1]])) {

group[[1]]

} else {

atts <- attributes(group[[1]])

result <- paste(group, collapse = "")

attributes(result) <- atts

result

}

}))

}

Wow, this is much simpler! I feel a little bit miserable now :P

There are a couple of bugs, however (for instance if the string starts with a block or pearl matching). I will fix them in your version and prompt re-review.

I have sligthly refactored your approach, by adding another split logic. The problem with blocks was that they are not only non-space characters but also characters to which we cannot append. That's why it would be really hard to follow your strategy to append always after non-block group. It was leading to skipping first non-space character after block and all space character after block resulting in edge cases like list(block(), " ", " ", block()) -> list(block(), block()) which is something we want to avoid. That's why there were so many edge cases in my original solution. Therefore I simply combined both solutions and removed the problem by doing additional split on each block. Now each sub list is treated (using your brilliant idea) as there were no blocks and everything is merged as expected. Then the blocks and list are pasted back do get the desired result. I think this is a good compromise.

Ah, I see. This might keep the simplicity but resolve these cases:

merge_text_spaces <- function(x) { if (length(x) == 0) return(x) # assumes length x > 0 spaces <- grepl("^ *$", x) blocks <- vlapply(x, is_block) lag_blocks <- c(FALSE, blocks[-length(x)]) is_trailing_after_block <- (cumsum(!spaces) - cumsum(cumsum(!spaces) * blocks)) > 1 groups <- cumsum(!spaces & is_trailing_after_block | blocks | lag_blocks) # collapse non-block groups, maintain first-in-group's attributes unname(lapply(split(x, groups), function(group) { if (is_block(group[[1]])) { group[[1]] } else { atts <- attributes(group[[1]]) result <- paste(group, collapse = "") attributes(result) <- atts result } })) }

is_trailing_after_block <- (cumsum(!spaces) - cumsum(cumsum(!spaces) * blocks)) > 1 although it's a really smart neat solution, is it simpler? There is a lot of happening in this line and might be hard to unpack for someone reading this code in the future (and this is an open source). To be honest I'm leaning towards just splitting the list on each block and treating strings as if there were no blocks and merging afterwards. I think might be simpler to work around this in the future. What's your opinion on this one?

I see what you mean. I still prefer this style, but maybe we can break it down a bit so it's more clear. What do you think about this as a more annotated form of the same code:

# assumes length x > 0 spaces <- grepl("^ *$", x) is_text <- !spaces blocks <- vlapply(x, is_block) lag_blocks <- c(FALSE, blocks[-length(x)]) # find cells within each block that follow at least one non-space element block_text_start <- cumsum(is_text) * blocks block_text_index <- cumsum(is_text) - cumsum(block_text_start) is_block_first_text <- block_text_index <= 1 # determine which spans of elements need to be flattened groups <- cumsum(is_text & !is_block_first_text | blocks | lag_blocks)

This looks so much better. Let me test it against some of the edge-cases but just by the look of it it is promising.

dgkf · 2023-10-04T20:38:03Z

@maksymiuks I tried looking into this issue with the excessive newlines between block regions, but I think it's an issue in the current code and not related to this PR.

From what I can tell, when a block is created in a subsection, newlines are automatically appended, then the same block will get more newlines around it when it's incorporated into the parent, so we end up with twice the usual number of surrounding newlines (4 instead of 2).

This might just be more visible now that the subsections are handled.

In the spirit of keeping this PR minimal and to get us unstuck for the \subsection rendering, I'd say we put that formatting issue aside for now. I think solving this properly will take a bit more inspection of the block concatenation, and unfortunately I don't have the bandwidth to really dig into this at the moment.

Add news.md support

2704ff9

maksymiuks requested a review from dgkf September 18, 2023 18:39

maksymiuks linked an issue Sep 18, 2023 that may be closed by this pull request

Sections and subsections support #32

Closed

maksymiuks added 2 commits September 18, 2023 20:40

Polish NEWS

8ca8494

Answer r cmd check issues

ffc5389

dgkf requested changes Sep 23, 2023

View reviewed changes

maksymiuks added 2 commits September 28, 2023 15:56

Refactor merge_text_spaces

63430b9

Adjust unit tests

7d6656e

Refactor merge_text_spaces

006c76d

maksymiuks merged commit f08d3f9 into main Oct 6, 2023
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add news.md support #33

Add news.md support #33

maksymiuks commented Sep 18, 2023

dgkf left a comment

dgkf Sep 23, 2023

dgkf Sep 23, 2023

maksymiuks Sep 28, 2023

dgkf Sep 23, 2023

maksymiuks Sep 28, 2023

maksymiuks Sep 28, 2023

dgkf Sep 29, 2023

maksymiuks Sep 29, 2023

dgkf Oct 2, 2023

maksymiuks Oct 2, 2023

dgkf commented Oct 4, 2023

		rd2markdown.description <- function(x, fragments = c(), ..., title = NULL, level = 2L, child_level = level) {
		out <- map_rd2markdown(x, ..., collapse = "", level = child_level)

Add news.md support #33

Add news.md support #33

Conversation

maksymiuks commented Sep 18, 2023

dgkf left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dgkf commented Oct 4, 2023