Tweaks for Excel to Markdown conversion #3022
Signed-off-by: Jared Van Bortel <[email protected]>
The only one I'm worried about is the extra spaces. This increases the context length of the document, so I'd like to see a test case that this fixes...
Apparently QTextDocument cannot display this markdown as a markdown table. The issue seems to be the missing headers. Moreover, this is the raw markdown you're producing in your test case:
Which apparently doesn't have the spaces you intended to add?
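For illustration, GitHub-style Markdown tables need a header row followed by a delimiter row before any body rows. A minimal sketch of emitting a blank header row when the sheet has no header, assuming plain `std::string` cells rather than the Qt types used in `xlsxtomd.cpp` (the helper name `toMarkdownTable` is hypothetical and is not taken from the PR):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not the PR's code: render rows as a pipe table with a
// blank header row, so that renderers which require a header still see a table.
std::string toMarkdownTable(const std::vector<std::vector<std::string>> &rows,
                            std::size_t columns)
{
    std::string md = "|";
    // Blank header row: one space per column.
    for (std::size_t c = 0; c < columns; ++c)
        md += " |";
    md += "\n|";
    // Delimiter row separating the header from the body.
    for (std::size_t c = 0; c < columns; ++c)
        md += "---|";
    md += "\n";
    // Body rows: missing or empty cells are rendered as a single space.
    for (const auto &row : rows) {
        md += "|";
        for (std::size_t c = 0; c < columns; ++c) {
            md += (c < row.size() && !row[c].empty()) ? row[c] : " ";
            md += "|";
        }
        md += "\n";
    }
    return md;
}
```

Calling `toMarkdownTable({{"a", "b"}, {"c"}}, 2)` yields a blank header line, a `|---|---|` delimiter line, and two body rows, with no padding around cell text.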
gpt4all-chat/src/xlsxtomd.cpp
Outdated
markdown += headerRowMarkdown + "\n";

// Create Markdown separator row
// Separator row (no header)
This seems to cause problems and I don't think it is actually helping with the password test case as the resulting markdown doesn't have extra spaces.
gpt4all-chat/src/xlsxtomd.cpp
Outdated
// Escape special characters
static QRegularExpression special(uR"([\\`*_{}[\]()#+-.!])"_s);
cellText.replace(special, uR"(\\1)"_s);
The escaping is worse. This is the kind of markdown I am getting from the Walt Disney xlsx example:
| | | | | | | |
|---|---|---|---|---|---|---|
|~~***_Walt Disney Co\\1_***~~| | | | | | |
|~~***_Consolidated Income Statement_***~~| | | | | | |
| | | | | | | |
|US$ in millions| | | | | | |
|~~***_12 months ended:_***~~|~~***_45199_***~~|~~***_44835_***~~|~~***_44471_***~~|~~***_44107_***~~|~~***_43736_***~~|~~***_43372_***~~|
|Services|79562\\10|74200\\10|61768\\10|59265\\10|60542\\10|50869\\10|
|Products|9336\\10|8522\\10|5650\\10|6123\\10|9028\\10|8565\\10|
|~~***_Revenues_***~~|~~***_88898\\10_***~~|~~***_82722\\10_***~~|~~***_67418\\10_***~~|~~***_65388\\10_***~~|~~***_69570\\10_***~~|~~***_59434\\10_***~~|
|Cost of services\\1 exclusive of depreciation and amortization|\\153139\\10|\\148962\\10|\\141129\\10|\\139406\\10|\\136450\\10|\\127528\\10|
Which looks like this:
Notice the strikeout as well...
There was a mistake in my regex and a bug in QXlsx. I have fixed both of them.
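The thread does not show the corrected code. For illustration, two things stand out in the snippet quoted above: inside the character class, `+-.` is a range that spans `+`, `,`, `-`, and `.`, so commas are matched too, and the pattern has no capture group for the replacement to refer to. A minimal sketch of one way to avoid both, assuming `std::regex` in place of `QRegularExpression` (the helper name `escapeMarkdown` is hypothetical, and this is not the fix that was committed):

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not the PR's code: backslash-escape Markdown
// punctuation in a cell's text.
std::string escapeMarkdown(const std::string &cellText)
{
    // '-' is placed last in the class so it is a literal, not a range
    // operator; '[', ']' and '\' are escaped inside the class.
    static const std::regex special(R"([\\`*_{}\[\]()#+.!-])");
    // "$&" inserts the whole match, so no capture group is needed.
    return std::regex_replace(cellText, special, R"(\$&)");
}
```

With this pattern, `escapeMarkdown("Walt Disney Co.")` escapes only the trailing period, and a comma as in `"Cost of services, exclusive"` passes through unchanged.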
There are a few changes here that we didn't have time to discuss in the previous PR:
_Underlines_ seem to be recognized by Llama 3, so use them

Needs a changelog entry once we decide which of these changes to keep.

Follow-up to #3007