Fix empty batches & update data API #28

Merged
merged 4 commits into from
Nov 18, 2024

@percevalw percevalw (Member) commented Jun 14, 2024

Description

Changed

  • Default to fp16 when running inference on a GPU
  • Support inputs parameter in TrainablePipe.postprocess(...) method (as in edsnlp)
  • Raise an error when the user tries to write a single file in a split fashion (i.e. when write_in_worker is True or num_rows_per_file is not None)
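
A check of this kind could look roughly like the sketch below. It is not the edspdf implementation: the function name `check_single_file_write`, the rule that a path with a file extension denotes a single file, and the error message are assumptions made for illustration. Only the `write_in_worker` and `num_rows_per_file` parameter names come from the description above.

```python
import os
from typing import Optional


def check_single_file_write(
    path: str,
    write_in_worker: bool = False,
    num_rows_per_file: Optional[int] = None,
) -> None:
    """Hypothetical helper: reject split writes that target a single file."""
    # Assumption: a path with an extension (e.g. "out.parquet") is a single
    # file, while a path without one (e.g. "out_dir") is a directory of parts.
    is_single_file = os.path.splitext(path)[1] != ""
    split_requested = write_in_worker or num_rows_per_file is not None
    if is_single_file and split_requested:
        raise ValueError(
            f"Cannot write to the single file {path!r} in a split fashion: "
            "pass a directory, or disable write_in_worker and num_rows_per_file."
        )
```

Failing early, before any worker starts writing, avoids leaving a half-written output behind.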

Fixed

  • Batches full of empty content boxes no longer crash the huggingface-embedding component
  • Models are now always loaded in non-training mode
  • Improved performance of edspdf.data methods over a filesystem (fs parameter)
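
For the empty-batch fix, one common pattern is to short-circuit before calling the model when no page in the batch contains a content box. The sketch below illustrates that pattern in plain Python. `embed_batch`, `embed_fn` and the list-of-lists batch layout are hypothetical and are not taken from the huggingface-embedding component.

```python
from typing import Callable, List, Sequence


def embed_batch(
    pages: Sequence[Sequence[str]],
    embed_fn: Callable[[List[str]], List[List[float]]],
) -> List[List[List[float]]]:
    """Hypothetical wrapper: embed the boxes of each page, tolerating empty pages."""
    # Flatten all boxes so the model is called once per batch.
    flat = [box for page in pages for box in page]
    # Guard: a batch made only of empty pages never reaches the model.
    if not flat:
        return [[] for _ in pages]
    vectors = embed_fn(flat)
    # Split the flat result back into one list per page.
    out, start = [], 0
    for page in pages:
        out.append(vectors[start:start + len(page)])
        start += len(page)
    return out
```

The output keeps one entry per input page, so downstream code can rely on the same shape whether or not the batch was empty.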

Checklist

  • If this PR is a bug fix, the bug is documented in the test suite.
  • Changes were documented in the changelog (pending section).
  • If necessary, changes were made to the documentation.

@percevalw percevalw force-pushed the fix-empty-batches branch 2 times, most recently from f85cd4c to 19c74eb Compare June 14, 2024 19:06
@percevalw percevalw force-pushed the fix-empty-batches branch 2 times, most recently from 3b2aade to e510744 Compare November 18, 2024 15:39
sonarcloud bot commented Nov 18, 2024

Coverage Report

Name | Stmts | Miss | Cover
edspdf/processing/multiprocessing.py

New missing coverage at lines 226-230 !

 if os.environ.get("TORCH_SHARING_STRATEGY"):
-     try:
-         torch.multiprocessing.set_sharing_strategy(os.environ["TORCH_SHARING_STRATEGY"])
-     except NameError:
-         pass
New missing coverage at line 248 !
         def save_align_devices_hook(pickler: Any, obj: Any):
-             pickler.save_reduce(load_align_devices_hook, (obj.__dict__,), obj=obj)
New missing coverage at lines 251-258 !
         def load_align_devices_hook(state):
-             state["execution_device"] = MAP_LOCATION
-             new_obj = AlignDevicesHook.__new__(AlignDevicesHook)
-             new_obj.__dict__.update(state)
-             return new_obj
- 
-     except ImportError:
-         AlignDevicesHook = None
New missing coverage at line 416 !
                 if lc.sort_chunks:
-                     docs.sort(
                         key=doc_size_fns.get(
New missing coverage at line 456 !
-             new_batch_iterator = None
New missing coverage at line 495 !
                 if task is None and stage == self.exchanger.num_stages + 1:
-                     return
                 # Non prioritized STOP signal: there are no more tasks to process
New missing coverage at lines 577-579 !
                     else:
-                         batch = gpu_pipe.prepare_batch(docs, device=device)
-                         inputs = None
                     active_batches[batch_id] = (docs, task_id, inputs)
New missing coverage at line 1016 !
                 if v is not None:
-                     os.environ[k] = v

Stmts: 406, Miss: 17, Cover: 95.81%
edspdf/processing/utils.py

New missing coverage at line 15 !

                 if isinstance(res, types.GeneratorType):
-                     results.extend(res)
                 else:
New missing coverage at lines 66-78 !
 ) -> Iterable[List[T]]:
-     batch = []
-     total = 0
-     for item in iterable:
-         count = len(item.pages)
-         if len(batch) > 0 and total + count > batch_size:
-             yield batch
-             batch = []
-             total = 0
-         batch.append(item)
-         total += count
-     if len(batch) > 0 and not drop_last:
-         yield batch

Stmts: 55, Miss: 13, Cover: 76.36%
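
The uncovered page-count batching lines reported for this file could be exercised with a small test along the following lines. This is only a sketch: the function name `batchify_by_pages`, its signature and the `FakeDoc` stand-in are assumptions, since the report shows the function body but not its name or parameters.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class FakeDoc:
    """Stand-in for a document: only the `pages` attribute is needed."""

    pages: List[int] = field(default_factory=list)


def batchify_by_pages(
    iterable: Iterable[FakeDoc],
    batch_size: int,
    drop_last: bool = False,
) -> Iterator[List[FakeDoc]]:
    # Body mirrors the uncovered lines in the report; name and signature are assumed.
    batch: List[FakeDoc] = []
    total = 0
    for item in iterable:
        count = len(item.pages)
        # Close the current batch if adding this doc would exceed the page budget.
        if len(batch) > 0 and total + count > batch_size:
            yield batch
            batch = []
            total = 0
        batch.append(item)
        total += count
    # The trailing batch is emitted unless drop_last is set.
    if len(batch) > 0 and not drop_last:
        yield batch


docs = [FakeDoc([0, 1]), FakeDoc([0]), FakeDoc([0, 1, 2]), FakeDoc([0])]
sizes = [[len(d.pages) for d in b] for b in batchify_by_pages(docs, batch_size=3)]
# sizes == [[2, 1], [3], [1]]
```

Note that the trailing batch is dropped whenever `drop_last` is set, even if it happens to fill the page budget exactly.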
edspdf/trainable_pipe.py

New missing coverage at line 76 !

         if cache_key in cache:
-             return cache[cache_key]
         res = fn(self, doc)
New missing coverage at lines 349-357 !
         """
-         batch = [
-             (self.preprocess_supervised(doc) if supervision else self.preprocess(doc))
-             for doc in docs
-         ]
-         batch = decompress_dict(list(batch_compress_dict(batch)))
-         batch = self.collate(batch)
-         batch = self.batch_to_device(batch, device=device)
-         return batch
New missing coverage at line 380 !
             if hasattr(self, "compiled"):
-                 res = self.compiled(batch)
             else:
New missing coverage at line 459 !
                 if pipe_overrides:
-                     overrides[name] = pipe_overrides
         tensor_dict = {

Stmts: 202, Miss: 8, Cover: 96.04%
edspdf/layers/relative_attention.py

New missing coverage at lines 157-159 !

         if head_size is None and key_size is not None:
-             assert key_size % n_heads == 0
-             head_size = key_size // n_heads
         value_head_size = None
New missing coverage at line 172 !
         ):
-             self.register_buffer("position_embedding", position_embedding)
         else:
New missing coverage at line 177 !
         if same_key_query_proj:
-             self.content_query_proj = self.content_key_proj
         else:
New missing coverage at line 196 !
             if same_key_query_proj or same_positional_key_query_proj:
-                 self.position_query_proj = self.position_key_proj
             else:
New missing coverage at line 355 !
             if mask.ndim == 3:
-                 mask = mask[:, :, :, None]
New missing coverage at line 366 !
-         return attn

Stmts: 117, Miss: 7, Cover: 94.02%
edspdf/utils/collections.py

New missing coverage at line 149 !

     def __getstate__(self):
-         return {"seq": self.seq}
New missing coverage at lines 152-154 !
     def __setstate__(self, state):
-         self.seq = state["seq"]
-         self.flatten = None
New missing coverage at lines 252-254 !
         base[attr] = val
-     except (KeyError, TypeError):
-         setattr(base, attr, val)
     return base

Stmts: 166, Miss: 5, Cover: 96.99%
edspdf/structures.py

New missing coverage at line 186 !

     def page(self):
-         return next(p for p in self.doc.pages if p.page_num == self.page_num)
New missing coverage at line 192 !
         if self_page_num < other_page_num:
-             return True
         if self_page_num > other_page_num:
New missing coverage at line 194 !
         if self_page_num > other_page_num:
-             return False
New missing coverage at line 222 !
-         return ((self.y0 + self.y1) / 2, (self.x0 + self.x1) / 2) < (
             (other.y0 + other.y1) / 2,
New missing coverage at line 250 !
     def __str__(self):
-         return self.text

Stmts: 97, Miss: 5, Cover: 94.85%
edspdf/utils/optimization.py

New missing coverage at line 29 !

     def param_groups(self, value):
-         self.optim.param_groups = value
New missing coverage at line 33 !
     def state(self):
-         return self.optim.state
New missing coverage at line 37 !
     def state(self, value):
-         self.optim.state = value

Stmts: 68, Miss: 3, Cover: 95.59%
edspdf/lazy_collection.py

New missing coverage at lines 323-326 !

         """Moves the pipeline to a given device"""
-         for name, pipe, *_ in self.torch_components():
-             pipe.to(device)
-         return self

Stmts: 120, Miss: 3, Cover: 97.50%
edspdf/registry.py

New missing coverage at lines 112-114 !

                     raise
-                 except ConfitValidationError as e:
-                     errors.append(e.raw_errors)
             if not errors:

Stmts: 90, Miss: 2, Cover: 97.78%
edspdf/processing/simple.py

New missing coverage at lines 27-29 !

         no_grad = sys.modules["torch"].no_grad
-     except (KeyError, AttributeError):
-         no_grad = nullcontext
     reader = lc.reader

Stmts: 51, Miss: 2, Cover: 96.08%
edspdf/pipes/extractors/pdfminer.py

New missing coverage at line 161 !

                     if len(text) == 0:
-                         continue
                     content_boxes.append(
New missing coverage at line 222 !
             else:
-                 fontname, italic, bold = (None, None, None)
         else:

Stmts: 88, Miss: 2, Cover: 97.73%
edspdf/data/files.py

New missing coverage at lines 100-102 !

             if self.load_annotations and self.filesystem.exists(json_path):
-                 with self.filesystem.open(json_path) as f:
-                     record["annotations"] = json.load(f)

Stmts: 102, Miss: 2, Cover: 98.04%
edspdf/visualization/annotations.py

New missing coverage at line 67 !

     elif isinstance(colors, list):
-         colors = {label: color for label, color in zip(unique_labels, colors)}

Stmts: 32, Miss: 1, Cover: 96.88%
edspdf/utils/lazy_module.py

New missing coverage at line 92 !

         """
-         return __all__

Stmts: 31, Miss: 1, Cover: 96.77%
edspdf/utils/file_system.py

New missing coverage at line 26 !

 ) -> list:
-     return [
         os.path.join(dirpath, f)

Stmts: 24, Miss: 1, Cover: 95.83%
edspdf/utils/alignment.py

New missing coverage at line 19 !

     if len(src_boxes) == 0 or len(dst_boxes) == 0:
-         return []

Stmts: 30, Miss: 1, Cover: 96.67%
edspdf/pipeline.py

New missing coverage at line 992 !

             if overrides:
-                 config = config.merge(overrides)
             pwd = os.getcwd()

Stmts: 335, Miss: 1, Cover: 99.70%
edspdf/data/parquet.py

New missing coverage at line 49 !

             # read in worker -> each task is a non yet parsed line
-             return (
                 (line, 1)

Stmts: 107, Miss: 1, Cover: 99.07%
edspdf/data/pandas.py

New missing coverage at line 101 !

             if isinstance(rec, dict):
-                 rec.pop(FILENAME, None)
         return records, len(records)

Stmts: 44, Miss: 1, Cover: 97.73%
TOTAL | Stmts: 3134, Miss: 76, Cover: 97.57%

35 files skipped due to complete coverage.

@percevalw percevalw merged commit 3deff32 into main Nov 18, 2024
13 checks passed