Index error on Hybrid Parser #252

bosd · 2024-11-01T12:15:25Z

Describe the bug

In some cases, an index error occurs when using the Hybrid parser on a multipage PDF.
It is described and tested in #251.
What was merged there is a workaround rather than a fix, as it now only fails gracefully.

Steps to reproduce the bug

See

pypdf_table_extraction/tests/test_network.py

Lines 145 to 157 in 35d8d20

    
    def test_network_no_infinite_execution(testdir):
        """Test for not infinite execution.
        This test used to fail, because the network parse was'nt able to process the tables on this pages.
        After a refactor it stops infinite execution. But parsing result could be improved.
        Hence this is no qualitative test.
        """
        filename = os.path.join(testdir, "tabula/schools.pdf")
        tables = camelot.read_pdf(
            filename, flavor="network", backend="ghostscript", pages="4"
        )
        assert len(tables) >= 1

Expected behavior

A potentially better fix would be to re-assemble the parts of the table detected by the network parser into the hybrid parser.
That part of the code also contained a TODO note from the original author.

pypdf_table_extraction/camelot/parsers/network.py

Lines 935 to 957 in 35d8d20

    
    def _generate_columns_and_rows(self, bbox, user_cols):
        # select elements which lie within table_bbox
        self.t_bbox = text_in_bbox_per_axis(
            bbox, self.horizontal_text, self.vertical_text
        )
        all_tls = list(
            sorted(
                filter(
                    lambda textline: len(textline.get_text().strip()) > 0,
                    self.t_bbox["horizontal"] + self.t_bbox["vertical"],
                ),
                key=lambda textline: (-textline.y0, textline.x0),
            )
        )
        text_x_min, text_y_min, text_x_max, text_y_max = bbox_from_textlines(all_tls)
        # FRHTODO:
        # This algorithm takes the horizontal textlines in the bbox, and groups
        # them into rows based on their bottom y0.
        # That's wrong: it misses the vertical items, and misses out on all
        # the alignment identification work we've done earlier.
        rows_grouped = self._group_rows(all_tls, row_tol=self.row_tol)
        rows = self._join_rows(rows_grouped, text_y_max, text_y_min)
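
To make the TODO's complaint concrete, here is a minimal sketch of grouping text lines into rows by their bottom y0. It is not the library's `_group_rows`: the `TextLine` class, the `group_rows_by_bottom` function, the coordinates and the `row_tol` value are all hypothetical and chosen only for illustration. It shows how a vertical text line that spans two rows ends up attached to whichever row shares its bottom edge.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """Hypothetical stand-in for a text line; not the parser's real type."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge
    y1: float  # top edge


def group_rows_by_bottom(textlines, row_tol=2.0):
    """Group text lines whose bottom y0 lies within row_tol of the row's first line."""
    rows = []
    # Same ordering as the sort key in the quoted code: top of page first, then left to right.
    for tl in sorted(textlines, key=lambda t: (-t.y0, t.x0)):
        if rows and abs(rows[-1][0].y0 - tl.y0) <= row_tol:
            rows[-1].append(tl)
        else:
            rows.append([tl])
    # Order each row left to right for readability.
    return [sorted(row, key=lambda t: t.x0) for row in rows]


lines = [
    TextLine("Name", 10, 700, 712),
    TextLine("City", 120, 701, 713),
    TextLine("Alice", 10, 680, 692),
    TextLine("Paris", 120, 680, 692),
    # Assumed vertical label covering both rows (y from 680 to 712).
    TextLine("2024", 200, 680, 712),
]

rows = group_rows_by_bottom(lines)
for row in rows:
    print([tl.text for tl in row])
```

In this toy input the vertical label "2024" overlaps both rows, yet grouping by y0 alone places it only in the bottom row. Using the alignment information computed earlier by the parser, as the TODO suggests, would be one way to avoid that.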

PDF

tabula/schools.pdf

Screenshots

bosd added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Nov 1, 2024
