Add automatically merging cells in tables #970

stevenlis · 2023-10-19T01:15:23Z

Please explain your intent
It is common to encounter merged cells in a table. Currently, you need to manually create a merged cell using colspan in row.cell(). It would be helpful if merged cells could be handled automatically, eliminating the need to loop over your data and add multiple nested if-else conditions.

Describe the solution you'd like
Assuming our table data appears as follows.

table_data = [
    ['0', '1', '2', '3'],
    ['A1', 'A2', '-', 'A4'],
    ['B1', '-', 'B3', 'B4'],
]

Here, we use '-' (or any other user-defined placeholder) to indicate an empty cell.

Empty cells can be merged with the cell before:

with pdf.table(width=120, auto_merging='before', empty_cell='-') as table

Or they can be merged with the cell after:

with pdf.table(width=120, auto_merging='after', empty_cell='-') as table

The text was updated successfully, but these errors were encountered:

gmischler · 2023-10-19T11:26:06Z

Welcome to fpdf2, @stevenlis !

Your suggestion is interesting, but raises a few tricky questions:

Is a specific text string really the optimal trigger? The same string might appear in the data as legitimate cell content, after all.

I expect that we'll eventually support row spans as well. How should we distinguish between the two cases in auto merging?

More generally speaking: Column and row spans are essentially metadata of the table. Is it really wise trying to derive those from the cell contents? While it might be possible to get it to work in a very specific set of circumstances, a system like that seems very prone to suffering from unintended consequences.

stevenlis · 2023-10-19T18:05:43Z

@gmischler Thank you for response.

Indeed, there are definitely cases where you may want to leave an empty cell as it is, which is why I mention "any other user-defined placeholder" to provide users with flexibility.

Regarding row spans, perhaps we can use two different characters/strings to define the rules. For example, "-" for colspan and "|" for rowspan, or any other characters that users prefer.

I'm new to fpdf2 and my main challenge is that it requires looping over the table instead of taking the data as a whole. This approach provides flexibility in defining the style of each cell, but it would be easier if we had a function like write_table accompanied by a styler object for styling purposes only. An example of such API design can be found in pandas.

As for the table layout, I took inspiration from matplotlib's mosaic, which allows you to easily define the layout of a figure.

Maybe we can have a function that returns table_data and colspan mapping for each cell.

from fpdf import FPDF

pdf = FPDF()
pdf.set_font(family='helvetica')
pdf.add_page()

def remove_empty_map_colspan(table_data: list, empty_cell: str) -> tuple[list, list]:
    """
    Remove the table cells with the specified value for empty_cell,
    and maps the cleaned data based on the following rules:
    - If the item/cell is not empty, it is mapped to 1.
    - If the item/cell is empty, it is mapped to 2, and the next item in the row is skipped.

    Args:
        table_data (list): Nested list representing the table data.
        empty_cell (string): Value representing an empty cell.

    Returns:
        tuple: A tuple containing the mapped colspan data and the cleaned data.

    Example:
        table_data = [
            ['0', '1', '2', '3'],
            ['A1', 'A2', '', 'A4'],
            ['B1', '', 'B3', 'B4'],
        ]
        empty_removed_table, colspans = remove_empty_map_colspan(table_data)
        print(empty_removed_table)
        # Output: [['0', '1', '2', '3'], ['A1', 'A2', 'A4'], ['B1', 'B3', 'B4']]
        print(colspans)
        # Output: [[1, 1, 1, 1], [1, 1, 2], [1, 2, 1]]
    """

    colspans = []
    skip_next = False

    for row in table_data:
        new_row = []
        for item in row:
            if skip_next:
                skip_next = False
                continue
            if item != empty_cell:
                new_row.append(1)
            else:
                new_row.append(2)
                skip_next = True
        colspans.append(new_row)

    empty_removed_table = [
        [item for item in row if item != empty_cell] for row in table_data
    ]

    return empty_removed_table, colspans

table_data = [
    ['0', '1', '2', '3'],
    ['A1', 'A2', '-', 'A4'],
    ['B1', '-', 'B3', 'B4'],
]
table_data, colpans = remove_empty_map_colspan(table_data, empty_cell='-')

with pdf.table(width=120) as table:
    for data_row, colspan_row in zip(table_data, colpans):
        row = table.row()
        for cell, colspan in zip(data_row, colspan_row):
            row.cell(cell, colspan=colspan)

pdf.output('test-file.pdf')

gmischler · 2023-10-19T20:00:04Z

The longer I think about it, the less comfortable I am with the idea of encoding structural information about a table in individual data items. This just looks loke bad coding and data management practise to me. Where do you encounter such data?

When fpdf2 started, it described itself as "simplistic". It has gained quite some additional functionality since then, so we have now arrived at "simple and straightforward", essentially following the Python motto of "make simple tasks easy, and complex tasks possible".

Tables are a very recent addition, and there's still room for improvement. But for simple data (without spans), you already can feed them in all at once. What you'd like to have though, is not simple. It's still possible, but you'll have to do some of the legwork yourself.

If there is no way to acquire your data in a more resonable form, then the best workaround is probably to write a wrapper for FPDF.table() which does more or less what your example above illustrates. This will make it straightforward for you to use, while keeping the data/metadata confusion away from the fpdf2 codebase

stevenlis · 2023-10-19T20:22:12Z

Where do you encounter such data?

If you perform data analysis to generate reports or check simple statistics using crosstabs, it's a part of your daily routine. I understand that fpdf2 is not specifically designed for report generation but has a more general purpose. Nonetheless...

Thank you for the explanation. I don't think overwriting a class every time for simple things is very pythonic. Also, there's no need to overcomplicate things by using a context manager to create a table when it's already in a table structure. Anyway, that's just my opinion.

gmischler · 2023-10-19T21:04:22Z

Where do you encounter such data?

If you perform data analysis to generate reports or check simple statistics using crosstabs, it's a part of your daily routine.

Are you saying that data where column spans are defined through special cell contents are an inherent and necessary result of statistical analysis? Or are we really talking about different things here?

stevenlis · 2023-10-19T22:06:08Z

To be honest, I'm not quite sure, but here's an example where I would like to generate summary statistics for different group breakdowns.

import seaborn as sns
df = sns.load_dataset('tips')
df_result = (
    df.groupby(['sex', 'smoker', 'time'])[['tip', 'size', 'total_bill']]
    .mean().round(1).reset_index()
    .pivot(index=['smoker', 'time'], columns=['sex'])
)
df_result

The one above is perhaps the most readable format, but I think we can also accept the following if rowspan is a challenge.

df_result.reset_index(inplace=True)
df_result.columns.names = [None, None]
df_result

Then the issue is that how do we represent this structure, I think we one way is:

table_data = [
    ['smoker', 'time' , 'tip', '-', 'size', '-', 'total_bill', '-'],
    [' ', ' ', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'], 
    ['Yes', 'Lunch', 2.8, 2.9, 2.2, 2.3, 17.4, 17.4],
    ['Yes', 'Dinner', 3.1, 2.9, 2.6, 2.2, 23.6, 18.2],
    ['No', 'Lunch', 2.9, 2.5, 2.5, 2.5, 18.5, 15.9],
    ['No', 'Dinner', 3.2, 3.0, 2.8, 2.7, 20.1, 20.0]]

]

Each row represents a list with 8 items. An empty cell is denoted by a ' ' and a merging to the next cell is indicated by a '-'.

Of course, you can always convert it into a standard table structure. However, when including it in a PDF report, it becomes less readable.

df_result.columns = ['_'.join(col) for col in df_result.columns.values]
df_result

I closed this because I don't want you to feel like I'm requesting something just for myself. However, it's a very common use case in my experience when you create any kind of report that includes a table.

gmischler · 2023-10-20T08:56:41Z

I guess I should have made the connection to #969 myself, eh? 🙄

So we're not talking about detecting spans in the actual grid data, but rather about how to get the labelling formatted correctly. That makes it a bit less weird.

Actually, in the context of a jupyter notebook, that data format makes perfect sense. The trick there is that jupyter simply doesn't draw any border lines. That way it doesn't have to do anything special with the empty cells, and they'll still give the visual appearance of a column or row span. What you're looking at there is essentially ASCII art...

When you try to put the same data in a PDF for print (and with border lines), the requirements change quite a bit. In an ideal world, pandas would offer an option to give you the data in a more explicitly structured form (maybe HTML?). I've never looked at it in any detail, so I have no idea if it does. If it doesn't though, then the next necessary step is "data cleanup", to transform the ASCII art hack into something more palatable. You were hoping for fpdf2 to handle this cleanup for you. I don't think that is the right place for it (I may not have the last word on it, so that is just my take).

From a software architecture perspective, I think this sort of problem would be most reasonably solved with a system of "data adapters". This could be a subclass as previously suggested, but a less tightly coupled approach might be even more flexible. Let's say we have a class that knows how to talk to a fpdf2 table. Then you create a subclass that can process and cleanup pandas output. Other subclasses can be added to handle any weird data produced by other packages. This keeps fpdf2 itself simple, while still enabling data exchange with pretty much anything out there.

Don't worry about bringing this up. The topic raises some interesting questions that are clearly of general interest. While I personally don't agree with your initial approach, we do have a strong interest in enabling data transfer from other software (there's a whole section about that in the docs). I'll reopen this for now to see if someone else has a better idea than what I've come up with so far.

afriedman412 · 2023-10-23T18:19:10Z

This is less about generalized formatting and more specifically formatting multi-level indexes, and that feels like a pretty widely applicable use case!

I'm going to take a crack at building out a more robust header/index functionality for tables, with an eye towards fixing this problem specifically.

gmischler · 2023-10-23T20:25:13Z

I'm going to take a crack at building out a more robust header/index functionality for tables, with an eye towards fixing this problem specifically.

Both row span and row labels are features that we could use in any case, as they are necessary to create more complex tables than is currently possible. If you can come up with an implementation that offers those and is still reasonably easy to use, more power to you! 👍

The question of allowing special data items (ie. empty strings) to control the table formatting probably needs some more input and consideration. I'd recomment to postpone that aspect to a later phase.

afriedman412 · 2023-10-25T16:09:09Z

I added code that automatically formats a multi-index/multi-header dataframe so it looks like it does in pandas.

So this...

import seaborn as sns
from fpdf import FPDF
from fpdf.table import format_dataframe

data = sns.load_dataset('tips')
df = (
    data.groupby(['sex', 'smoker', 'time'])[['tip', 'size', 'total_bill']]
    .mean().round(1).reset_index()
    .pivot(index=['smoker', 'time'], columns=['sex'])
)

# new function
data = format_dataframe(df) # converts data to string and formats column and index labels

pdf = FPDF()
pdf.add_page()
pdf.set_font("Times", size=10)
with pdf.table() as table:
    for data_row in data:
        table.row(data_row)
pdf.output("table_from_pandas.pdf")

Now renders into this...

Also if you pass include_index=False to format_dataframe() it will ignore the index.

The update also adds multi-index formatting that matches the current implementation of multi-header formatting.

Probably worth noting that I'm using "index" as the label for rows and "header" as label for the columns, and using "column" the way everyone seems to be using "span" to align with Pandas terminology. It's whatever but worth keeping an eye on.

Anyways lmk how this all looks and I'll add tests and better documentation...

(oh also I guess this got lumped in with the last pull request -- dunno if that matters though?)

Lucas-C · 2024-10-14T09:20:18Z

Hi @afriedman412,

I'm sorry that your last, detailed comment was left unanswered for a year 😢

Thank you very much for the thoughts & efforts you put in this, and in PR #1046 👍

Are you still willing to continue this work, almost a year later?
Or do you prefer that we simply close this PR?

I'm going to review it now 🙂

stevenlis added the enhancement label Oct 19, 2023

stevenlis closed this as completed Oct 19, 2023

gmischler reopened this Oct 20, 2023

gmischler assigned afriedman412 Oct 23, 2023

afriedman412 mentioned this issue Oct 25, 2023

Adds pandas-friendly formatting for multi-column and multi-index tables #980

Closed

4 tasks

This was referenced Oct 26, 2023

updating table label stuff #985

Closed

Adds pandas-friendly formatting for multi-column and multi-index tables #986

Closed

Lucas-C mentioned this issue Oct 30, 2023

Images flowing within TextColumns #973

Merged

5 tasks

This was referenced Nov 7, 2023

support for pandas multi-index and multi-heading tables #1007

Closed

Adding support for pandas dataframes, multindex formatting #1008

Closed

This was referenced Dec 5, 2023

Tables3 #1045

Closed

Adding support for pandas dataframes, multindex formatting #1046

Open

mjasperse mentioned this issue Jan 3, 2024

Implementation of rowspan in tables #1088

Merged

5 tasks

Lucas-C added the pending-review label Aug 20, 2024

Lucas-C added pending-answer and removed pending-review labels Oct 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add automatically merging cells in tables #970

Add automatically merging cells in tables #970

stevenlis commented Oct 19, 2023

gmischler commented Oct 19, 2023

stevenlis commented Oct 19, 2023

gmischler commented Oct 19, 2023

stevenlis commented Oct 19, 2023

gmischler commented Oct 19, 2023

stevenlis commented Oct 19, 2023

gmischler commented Oct 20, 2023 •

edited

Loading

afriedman412 commented Oct 23, 2023

gmischler commented Oct 23, 2023

afriedman412 commented Oct 25, 2023 •

edited by Lucas-C

Loading

Lucas-C commented Oct 14, 2024 •

edited

Loading

Add automatically merging cells in tables #970

Add automatically merging cells in tables #970

Comments

stevenlis commented Oct 19, 2023

gmischler commented Oct 19, 2023

stevenlis commented Oct 19, 2023

gmischler commented Oct 19, 2023

stevenlis commented Oct 19, 2023

gmischler commented Oct 19, 2023

stevenlis commented Oct 19, 2023

gmischler commented Oct 20, 2023 • edited Loading

afriedman412 commented Oct 23, 2023

gmischler commented Oct 23, 2023

afriedman412 commented Oct 25, 2023 • edited by Lucas-C Loading

Lucas-C commented Oct 14, 2024 • edited Loading

gmischler commented Oct 20, 2023 •

edited

Loading

afriedman412 commented Oct 25, 2023 •

edited by Lucas-C

Loading

Lucas-C commented Oct 14, 2024 •

edited

Loading