[BUG] %%pretty with Chinese character error #767
Comments
I experience the same issue.
I experience the same issue too. When I want to display Chinese characters, it returns an error. Thanks for any help; it really confuses me. TAT
This issue should be fixed by this PR, which I just released as part of the 0.20.4 release. I'm marking this as resolved for now, but please let me know if this is not the case after you upgrade.
Thanks for your reply! I found it is correct in the Livy server, but it returns an error in the notebook display. Maybe you can use this dataframe to reproduce the problem:
df = spark.createDataFrame([("a", "你好"), ("b", "你好")], ("key", "value"))
df.show(5)
and these are my versions
Thanks for the code snippet @baixinzxl. I will investigate once I have bandwidth in the coming weeks. Contributions are welcome if you want to dive into the code!
Thanks, take your time~
Sorry for disturbing, but I wonder if there are any findings about the problem?
Hey @baixinzxl I haven't forgotten about this. I've been stretched for time and have tried to tackle this twice without success. The relevant code is in this file if you want to take a stab at it!
I am using sparkmagic 0.21.0. Same issue when showing a dataframe with a Japanese column.
@dosu how would you update https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/utils/dataframe_parser.py to handle characters with varying widths like Chinese and Japanese? |
Hey @devstein! 👋 I'm here to help with any bugs, questions, or contributions you have for the repository. I'm a bot designed to assist you while you wait for a human maintainer. Let me know how I can assist you! To update dataframe_parser.py to handle characters with varying display widths, here is how you can modify the file:
import re
from collections import OrderedDict
from functools import partial

from wcwidth import wcswidth
def extractors(header_top, header_content):
"""Creates functions to pull column values out of Spark DF rows.
Based on the top of a Dataframe header, identifies start and end index for
each column value.
012345678901
+---+------+
| id|animal|
+---+------+
| 1| cat|
| 2| mouse|
| 3| horse|
+---+------+
For example, the `id` column is row[0:4] and `animal` is row[4:11]
:param header_top The header border top, comprised of `+` and `-`
characters demarcating the different columns,
eg `+---+------+`
:param header_content The string following the header_top, containing the
actual column names
eg `| id|animal|`
:return A dict of column: function that can be applied to the string-row
representation of a Dataframe, eg `| 1| cat|`
In our example:
{'id': lambda row: row[0:4], 'animal': lambda row: row[4:11]}
"""
header_pluses = list(re.finditer(r"\+", header_top))
# Display-column span of each column on the ASCII ruler, including the
# leading `|`, eg (0, 4) for `id` and (4, 11) for `animal`.
column_spans = [
    (start.start(), end.start())
    for start, end in zip(header_pluses, header_pluses[1:])
]
def _extract(l, r, row, offset=0):
    # Slice on display-width boundaries so wide CJK characters, which
    # occupy two terminal columns, stay aligned with the ruler.
    pos, chars = 0, []
    for ch in row[offset:]:
        if pos >= r:
            break
        if pos >= l:
            chars.append(ch)
        pos += max(wcswidth(ch), 0)
    return "".join(chars).strip("| ")
def _extractor_iter():
    for l, r in column_spans:
        yield partial(_extract, l, r)
return OrderedDict((x(header_content), x) for x in _extractor_iter())
class DataframeHtmlParser:
"""Parses a Spark Dataframe and presents it as a HTML table."""
header_top_r = re.compile(header_top_pattern)
header_content_r = re.compile(header_content_pattern)
def __init__(self, cell, start=0, end=None):
"""Creates a Dataframe parser for a single dataframe.
:param cell The evaluated output of a cell.
Cell can contain more than one dataframe, but a single
DataframeHtmlParser can only parse table headers/rows for
a single dataframe in the substring cell[start:end]
"""
self.cell_contents = cell
end = end or len(self.cell_contents)
header_spans = DataframeHtmlParser.header_top_r.finditer(
self.cell_contents, start, end
)
parts = {
"header_top": next(header_spans).span(),
"header_content": DataframeHtmlParser.header_content_r.search(
self.cell_contents, start, end
).span(),
"header_bottom": next(header_spans).span(),
"footer": next(header_spans).span(),
}
self.header_content_span = parts["header_content"]
header_content = self._cell_span(self.header_content_span)
self.expected_width = wcswidth(header_content.strip())
header_top = self._cell_span(parts["header_top"])
self.extractors = extractors(header_top.strip(), header_content.strip())
# The content is between the header-bottom and the footer
self.content_span = (parts["header_bottom"][1], parts["footer"][0])
def _cell_span(self, span):
s, e = span
return self.cell_contents[s:e]
def _rowspan_iter(self):
"""Extract each row from the contents of a Dataframe."""
row_delimiters = re.compile(r"\n").finditer(
self.cell_contents, self.content_span[0], self.content_span[1]
)
start = self.content_span[0]
for row_delimiter in row_delimiters:
end, next_start = row_delimiter.span()[0], row_delimiter.span()[1]
yield (start, end)
start = next_start
def row_iter(self, transform=None):
"""Extract and transform each row from a Dataframe.
Defaults to converting a row to a dict {colName: value}
"""
_transform = transform or (
lambda r: {col: x(r) for col, x in self.extractors.items()}
)
for rowspan in self._rowspan_iter():
row = self._cell_span(rowspan).strip()
if wcswidth(row) != self.expected_width:
raise ValueError(
"""Expected DF rows to be uniform width (%d)
but found %s (%d)"""
% (self.expected_width, row, wcswidth(row))
)
yield _transform(row)
def to_table(self):
"""Converts the contents of a notebook cell to a HTML table."""
header_content = self._cell_span(self.header_content_span)
table_header_html = self._to_tr(header_content.strip(), is_header=True)
table_row_iter = self.row_iter(transform=self._to_tr)
table_body = "".join([r for r in table_row_iter])
return "<table>%s%s</table>" % (table_header_html, table_body)
def _to_tr(self, row, is_header=False):
"""Converts a spark dataframe row to a HTML row."""
tag = "th" if is_header else "td"
row_content = [x(row) for x in self.extractors.values()]
row_html = "".join(
[
'<%s><pre style="word-break: unset; background-color: unset;">%s</pre></%s>'
% (tag, rc, tag)
for rc in row_content
]
)
        return "<tr>%s</tr>" % row_html
These changes ensure that the parser correctly handles characters with varying widths, such as Chinese and Japanese characters, by using the `wcswidth` display width rather than the character count.
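The mismatch behind the error is easy to demonstrate. Spark's `show()` aligns columns by display width, counting each CJK character as two columns, so a row containing Chinese text can have the same display width as an ASCII row but a different `len()`. A minimal stdlib sketch (using `unicodedata.east_asian_width` as a rough stand-in for `wcswidth`, since `wcwidth` is a third-party package):

```python
from unicodedata import east_asian_width

def display_width(s):
    # Treat 'W'ide and 'F'ullwidth East Asian characters as two terminal
    # columns and everything else as one -- an approximation of wcswidth().
    return sum(2 if east_asian_width(ch) in ("W", "F") else 1 for ch in s)

# Two rows as Spark's show() would pad them: the CJK row has fewer
# characters but the same display width, because 你 and 好 are two columns wide.
ascii_row = "|  a|hello|"
cjk_row = "|  b| 你好|"

print(len(ascii_row), len(cjk_row))                      # 11 9
print(display_width(ascii_row), display_width(cjk_row))  # 11 11
```

A parser that checks row uniformity with `len()` therefore rejects the CJK row, which is exactly the "Expected DF rows to be uniform width" error reported in this issue.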
Describe the bug
When using the %%pretty magic, if there is a Chinese character in the table, the table cannot be shown properly and this error is returned:
An internal error was encountered.
Please file an issue at https://github.com/jupyter-incubator/sparkmagic
Error:
Expected DF rows to be uniform width (581)
To Reproduce
%%pretty
df.show()
Versions: