New GetText() option: NegativeGapAsWhitespace
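When parsing PDF files with tables that contain multi-line or "merged" cells, consecutive words can appear out of horizontal order, so the gap between one letter and the next can be negative. With this option enabled, GetText() uses the absolute value of that gap when deciding whether to insert a space, which can better predict the spaces between the words.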
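The effect of the option can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `GapSketch`, `InsertsSpace`, and the fixed `threshold` are hypothetical names, and the real `WhitespaceSizeStatistics.IsProbablyWhitespace` check is assumed here to behave like a simple "gap exceeds a threshold" test.

```csharp
using System;

public static class GapSketch
{
    // Hypothetical stand-in for WhitespaceSizeStatistics.IsProbablyWhitespace:
    // a gap counts as whitespace when it exceeds a fixed threshold (assumption).
    public static bool IsProbablyWhitespace(double gap, double threshold) => gap > threshold;

    // Mirrors the changed lines in GetText(): the gap is the start X of the
    // current letter minus the end X of the previous one, optionally made
    // non-negative before the whitespace check.
    public static bool InsertsSpace(
        double previousEndX,
        double nextStartX,
        bool negativeGapAsWhitespace,
        double threshold = 2.0)
    {
        var gap = nextStartX - previousEndX;

        if (negativeGapAsWhitespace)
        {
            gap = Math.Abs(gap);
        }

        return IsProbablyWhitespace(gap, threshold);
    }
}
```

For example, if the previous letter ends at X = 120 and the next letter starts at X = 40 (a word wrapped onto a new line inside the same cell), the gap is -80. Under the assumptions above, `InsertsSpace(120, 40, false)` returns `false` and the two words run together, while `InsertsSpace(120, 40, true)` returns `true` and a space is inserted.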
Kizaemon authored and BobLd committed Dec 9, 2024
1 parent 97ae62c commit a2ae1f1
Showing 1 changed file with 12 additions and 0 deletions.
@@ -110,6 +110,11 @@ public static string GetText(Page page, Options options)
{
var gap = letter.StartBaseLine.X - previous.EndBaseLine.X;

if (options.NegativeGapAsWhitespace)
{
gap = Math.Abs(gap);
}

if (WhitespaceSizeStatistics.IsProbablyWhitespace(gap, previous))
{
sb.Append(" ");
@@ -178,6 +183,13 @@ public class Options
/// character. Default <see langword="false"/>.
/// </summary>
public bool ReplaceWhitespaceWithSpace { get; set; }

/// <summary>
/// When parsing PDF files with tables containing multiple lines in a cell or "merged" cells,
/// the separate words can appear out of horizontal order. This option can better predict the
/// spaces between the words. Default <see langword="false"/>.
/// </summary>
public bool NegativeGapAsWhitespace { get; set; }
}
}
}
