Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweaks for Excel to Markdown conversion #3022

Merged
merged 14 commits into from
Oct 4, 2024
Merged

Tweaks for Excel to Markdown conversion #3022

merged 14 commits into from
Oct 4, 2024

Conversation

cebtenzzre
Copy link
Member

@cebtenzzre cebtenzzre commented Oct 2, 2024

There are a few changes here that we didn't have time to discuss in the previous PR:

  • _Underlines_ seem to be recognized by Llama 3, so use them
  • Not all spreadsheets use the first row as a header. Llama 3 seems to generalize better with no header row.
  • Empty cells seem to not be well supported by Llama 3 unless there is a space in them (this matters more for spreadsheets with lots of horizontal whitespace)
  • Long filenames were exceeding the width and height of the attached file cards. Probably we should be using a RowLayout and not a Row so it's easier to automatically constrain the width of the elements, but this change works too.

Needs a changelog entry once we decide which of these changes to keep.

Follow-up to #3007

@manyoso
Copy link
Collaborator

manyoso commented Oct 2, 2024

The only one I'm worried about is the extra spaces. This increases context length of the document so I'd like to see a test cast this fixes...

@manyoso
Copy link
Collaborator

manyoso commented Oct 2, 2024

This markdown is not displayable by QTextDocument apparently as a markdown table. The issue seems to be the missing headers.

image

Moreover, this is the raw markdown you're producing in your test case:

|---|---|---|---|---|---|---|---|
|||||||||
|The password is not:|||Formula 1:|Formula 2:||||
|Apples|||4|6||||
|||||||||
|||||||||
||||**Bold**|||||
||||_Underline_|||||
|The password is not:|||*Italics*|||||
|Pears|||***Bold Italics***|||||
||||||||The password is:|
||||||||Stairs|

Which apparently doesn't have the spaces you intended to add?

markdown += headerRowMarkdown + "\n";

// Create Markdown separator row
// Separator row (no header)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to cause problems and I don't think it is actually helping with the password test case as the resulting markdown doesn't have extra spaces.

gpt4all-chat/src/xlsxtomd.cpp Outdated Show resolved Hide resolved

// Escape special characters
static QRegularExpression special(uR"([\\`*_{}[\]()#+-.!])"_s);
cellText.replace(special, uR"(\\1)"_s);
Copy link
Collaborator

@manyoso manyoso Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escaping is worse. I am getting this kind of md on the walt disney xlsx example for produced markdown:

| | | | | | | |
|---|---|---|---|---|---|---|
|~~***_Walt Disney Co\\1_***~~| | | | | | |
|~~***_Consolidated Income Statement_***~~| | | | | | |
| | | | | | | |
|US$ in millions| | | | | | |
|~~***_12 months ended:_***~~|~~***_45199_***~~|~~***_44835_***~~|~~***_44471_***~~|~~***_44107_***~~|~~***_43736_***~~|~~***_43372_***~~|
|Services|79562\\10|74200\\10|61768\\10|59265\\10|60542\\10|50869\\10|
|Products|9336\\10|8522\\10|5650\\10|6123\\10|9028\\10|8565\\10|
|~~***_Revenues_***~~|~~***_88898\\10_***~~|~~***_82722\\10_***~~|~~***_67418\\10_***~~|~~***_65388\\10_***~~|~~***_69570\\10_***~~|~~***_59434\\10_***~~|
|Cost of services\\1 exclusive of depreciation and amortization|\\153139\\10|\\148962\\10|\\141129\\10|\\139406\\10|\\136450\\10|\\127528\\10|

Which looks like this:

image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice the strikeout as well...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a mistake in my regex and a bug in QXlsx. I have fixed both of them.

@manyoso manyoso merged commit b850e7c into main Oct 4, 2024
4 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants