[BUG] %%pretty with Chinese character error #767
Comments
I experience the same issue.
I experience the same issue too. When I want to display Chinese characters, it returns an error. Thanks for any help; it really confuses me. TAT
This issue should be fixed by this PR, which I just released as part of the 0.20.4 release. I'm marking this as resolved for now, but please let me know if this is not the case after you upgrade.
Thanks for your reply! I found it is correct in the Livy server, but it returns an error in the notebook display. Maybe you can use this dataframe to reproduce the problem:
df = spark.createDataFrame([("a", "你好"), ("b", "你好")], ("key", "value"))
df.show(5)
and these are my versions
Thanks for the code snippet @baixinzxl. I will investigate once I have bandwidth in the coming weeks. Contributions are welcome if you want to dive into the code!
Thanks, take your time~
Sorry for disturbing, but I wonder if there are any findings about the problem?
Hey @baixinzxl I haven't forgotten about this. I've been stretched for time and have tried to tackle this twice without success. The relevant code is in this file if you want to take a stab at it!
I am using sparkmagic 0.21.0. Same issue when showing a dataframe with a Japanese column.
@dosu how would you update https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/utils/dataframe_parser.py to handle characters with varying widths like Chinese and Japanese? |
Hey @devstein! 👋 I'm here to help with any bugs, questions, or contributions you have for the repository. I'm a bot designed to assist you while you wait for a human maintainer. Let me know how I can assist you! To update dataframe_parser.py to handle characters with varying display widths, here is how you can modify the file:
import re
from collections import OrderedDict
from functools import partial

from wcwidth import wcswidth
def extractors(header_top, header_content):
"""Creates functions to pull column values out of Spark DF rows.
Based on the top of a Dataframe header, identifies start and end index for
each column value.
012345678901
+---+------+
| id|animal|
+---+------+
| 1| cat|
| 2| mouse|
| 3| horse|
+---+------+
For example, the `id` column is row[0:4] and `animal` is row[4:11]
:param header_top The header border top, comprised of `+` and `-`
characters demarcating the different columns,
eg `+---+------+`
:param header_content The string following the header_top, containing the
actual column names
eg `| id|animal|`
:return A dict of column: function that can be applied to the string-row
representation of a Dataframe, eg `| 1| cat|`
In our example:
{'id': lambda row: row[0:4], 'animal': lambda row: row[4:11]}
"""
header_pluses = list(re.finditer(r"\+", header_top))
# Display-column span of each column on the ASCII ruler, including the
# leading `|`, eg (0, 4) for `id` and (4, 11) for `animal`.
column_spans = [
    (start.start(), end.start())
    for start, end in zip(header_pluses, header_pluses[1:])
]
def _extract(l, r, row, offset=0):
    # Slice on display-width boundaries so wide CJK characters, which
    # occupy two terminal columns, stay aligned with the ruler.
    pos, chars = 0, []
    for ch in row[offset:]:
        if pos >= r:
            break
        if pos >= l:
            chars.append(ch)
        pos += max(wcswidth(ch), 0)
    return "".join(chars).strip("| ")
def _extractor_iter():
    for l, r in column_spans:
        yield partial(_extract, l, r)
return OrderedDict((x(header_content), x) for x in _extractor_iter())
class DataframeHtmlParser:
"""Parses a Spark Dataframe and presents it as a HTML table."""
header_top_r = re.compile(header_top_pattern)
header_content_r = re.compile(header_content_pattern)
def __init__(self, cell, start=0, end=None):
"""Creates a Dataframe parser for a single dataframe.
:param cell The evaluated output of a cell.
Cell can contain more than one dataframe, but a single
DataframeHtmlParser can only parse table headers/rows for
a single dataframe in the substring cell[start:end]
"""
self.cell_contents = cell
end = end or len(self.cell_contents)
header_spans = DataframeHtmlParser.header_top_r.finditer(
self.cell_contents, start, end
)
parts = {
"header_top": next(header_spans).span(),
"header_content": DataframeHtmlParser.header_content_r.search(
self.cell_contents, start, end
).span(),
"header_bottom": next(header_spans).span(),
"footer": next(header_spans).span(),
}
self.header_content_span = parts["header_content"]
header_content = self._cell_span(self.header_content_span)
self.expected_width = wcswidth(header_content.strip())
header_top = self._cell_span(parts["header_top"])
self.extractors = extractors(header_top.strip(), header_content.strip())
# The content is between the header-bottom and the footer
self.content_span = (parts["header_bottom"][1], parts["footer"][0])
def _cell_span(self, span):
s, e = span
return self.cell_contents[s:e]
def _rowspan_iter(self):
"""Extract each row from the contents of a Dataframe."""
row_delimiters = re.compile(r"\n").finditer(
self.cell_contents, self.content_span[0], self.content_span[1]
)
start = self.content_span[0]
for row_delimiter in row_delimiters:
end, next_start = row_delimiter.span()[0], row_delimiter.span()[1]
yield (start, end)
start = next_start
def row_iter(self, transform=None):
"""Extract and transform each row from a Dataframe.
Defaults to converting a row to a dict {colName: value}
"""
_transform = transform or (
lambda r: {col: x(r) for col, x in self.extractors.items()}
)
for rowspan in self._rowspan_iter():
row = self._cell_span(rowspan).strip()
if wcswidth(row) != self.expected_width:
raise ValueError(
"""Expected DF rows to be uniform width (%d)
but found %s (%d)"""
% (self.expected_width, row, wcswidth(row))
)
yield _transform(row)
def to_table(self):
"""Converts the contents of a notebook cell to a HTML table."""
header_content = self._cell_span(self.header_content_span)
table_header_html = self._to_tr(header_content.strip(), is_header=True)
table_row_iter = self.row_iter(transform=self._to_tr)
table_body = "".join([r for r in table_row_iter])
return "<table>%s%s</table>" % (table_header_html, table_body)
def _to_tr(self, row, is_header=False):
"""Converts a spark dataframe row to a HTML row."""
tag = "th" if is_header else "td"
row_content = [x(row) for x in self.extractors.values()]
row_html = "".join(
[
'<%s><pre style="word-break: unset; background-color: unset;">%s</pre></%s>'
% (tag, rc, tag)
for rc in row_content
]
)
        return "<tr>%s</tr>" % row_html
These changes ensure that the parser correctly handles characters with varying widths, such as Chinese and Japanese characters, by using the `wcswidth` display width rather than the character count.
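The mismatch behind the error is easy to demonstrate. Spark's `show()` aligns columns by display width, counting each CJK character as two columns, so a row containing Chinese text can have the same display width as an ASCII row but a different `len()`. A minimal stdlib sketch (using `unicodedata.east_asian_width` as a rough stand-in for `wcswidth`, since `wcwidth` is a third-party package):

```python
from unicodedata import east_asian_width

def display_width(s):
    # Treat 'W'ide and 'F'ullwidth East Asian characters as two terminal
    # columns and everything else as one -- an approximation of wcswidth().
    return sum(2 if east_asian_width(ch) in ("W", "F") else 1 for ch in s)

# Two rows as Spark's show() would pad them: the CJK row has fewer
# characters but the same display width, because 你 and 好 are two columns wide.
ascii_row = "|  a|hello|"
cjk_row = "|  b| 你好|"

print(len(ascii_row), len(cjk_row))                      # 11 9
print(display_width(ascii_row), display_width(cjk_row))  # 11 11
```

A parser that checks row uniformity with `len()` therefore rejects the CJK row, which is exactly the "Expected DF rows to be uniform width" error reported in this issue.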
Describe the bug
When using the %%pretty magic, if there is a Chinese character in the table, the table cannot be shown properly and this error is returned:
An internal error was encountered.
Please file an issue at https://github.com/jupyter-incubator/sparkmagic
Error:
Expected DF rows to be uniform width (581)
To Reproduce
%%pretty
df.show()
Versions: