Fix empty batches & update data API #28

percevalw · 2024-06-14T15:00:39Z

Description

Changed

Default to fp16 when running inference on a GPU
Support the inputs parameter in the TrainablePipe.postprocess(...) method (as in edsnlp)
Raise an error when the user tries to write a single file in a split fashion (when write_in_worker is True or num_rows_per_file is not None)
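
The split-write check can be pictured with a minimal sketch. This is not the edspdf implementation: the helper name `check_single_file_write` is hypothetical, and treating a path with a file extension as "a single file" is an assumption. Only the `write_in_worker` and `num_rows_per_file` parameter names come from this PR.

```python
import os
from typing import Optional


def check_single_file_write(
    path: str,
    write_in_worker: bool = False,
    num_rows_per_file: Optional[int] = None,
) -> None:
    """Hypothetical helper: refuse to split the output across one file."""
    # Assumption: a path with an extension designates a single output file,
    # while a path without one designates a directory of files.
    is_single_file = os.path.splitext(path)[1] != ""
    # Both options imply that several files will be written.
    split_requested = write_in_worker or num_rows_per_file is not None
    if is_single_file and split_requested:
        raise ValueError(
            f"Cannot write to the single file {path!r} in a split fashion: "
            "pass a directory, or disable write_in_worker / num_rows_per_file."
        )


# A directory target accepts split writing, a single file does not.
check_single_file_write("out/dataset", write_in_worker=True)
check_single_file_write("out/data.parquet")
```

Failing early, before any worker starts writing, avoids several workers silently overwriting the same file.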

Fixed

Batches full of empty content boxes no longer crash the huggingface-embedding component
Models are now always loaded in non-training (eval) mode
Improved performance of edspdf.data methods over a filesystem (fs parameter)
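
The empty-batch fix can be illustrated with a simplified, framework-free sketch. It is not the code of the huggingface-embedding component: `embed_boxes`, `strict_encode`, and the list-of-floats vectors are hypothetical stand-ins, and the assumption is that the encoder fails when given no input at all.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_boxes(
    texts_per_doc: Sequence[Sequence[str]],
    encode: Callable[[List[str]], List[Vector]],
) -> List[List[Vector]]:
    """Embed every box text, skipping the encoder when there is nothing to encode."""
    flat = [text for texts in texts_per_doc for text in texts]
    if not flat:
        # Batch made only of documents without boxes:
        # return one empty list per document and never call the encoder.
        return [[] for _ in texts_per_doc]
    vectors = iter(encode(flat))
    # Redistribute the flat list of vectors to the documents they came from.
    return [[next(vectors) for _ in texts] for texts in texts_per_doc]


def strict_encode(texts: List[str]) -> List[Vector]:
    """Stand-in encoder (hypothetical) that rejects empty input."""
    if not texts:
        raise ValueError("cannot encode an empty batch")
    return [[float(len(text))] for text in texts]


empty_result = embed_boxes([[], []], strict_encode)
mixed_result = embed_boxes([["ab"], [], ["c", "def"]], strict_encode)
```

The guard keeps the output shape aligned with the input (one entry per document), so downstream components do not need a special case for empty batches.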

Checklist

If this PR is a bug fix, the bug is documented in the test suite.
Changes were documented in the changelog (pending section).
If necessary, changes were made to the documentation.

sonarcloud · 2024-11-18T15:48:32Z

Quality Gate passed

Issues
8 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2024-11-18T15:54:16Z

Coverage Report

Name	Stmts	Miss	Cover	New missing coverage (lines)
edspdf/processing/multiprocessing.py	406	17	95.81%	226-230, 248, 251-258, 416, 456, 495, 577-579, 1016
edspdf/processing/utils.py	55	13	76.36%	15, 66-78
edspdf/trainable_pipe.py	202	8	96.04%	76, 349-357, 380, 459
edspdf/layers/relative_attention.py	117	7	94.02%	157-159, 172, 177, 196, 355, 366
edspdf/utils/collections.py	166	5	96.99%	149, 152-154, 252-254
edspdf/structures.py	97	5	94.85%	186, 192, 194, 222, 250
edspdf/utils/optimization.py	68	3	95.59%	29, 33, 37
edspdf/lazy_collection.py	120	3	97.50%	323-326
edspdf/registry.py	90	2	97.78%	112-114
edspdf/processing/simple.py	51	2	96.08%	27-29
edspdf/pipes/extractors/pdfminer.py	88	2	97.73%	161, 222
edspdf/data/files.py	102	2	98.04%	100-102
edspdf/visualization/annotations.py	32	1	96.88%	67
edspdf/utils/lazy_module.py	31	1	96.77%	92
edspdf/utils/file_system.py	24	1	95.83%	26
edspdf/utils/alignment.py	30	1	96.67%	19
edspdf/pipeline.py	335	1	99.70%	992
edspdf/data/parquet.py	107	1	99.07%	49
edspdf/data/pandas.py	44	1	97.73%	101
TOTAL	3134	76	97.57%

35 files skipped due to complete coverage.

percevalw force-pushed the fix-empty-batches branch 2 times, most recently from f85cd4c to 19c74eb · June 14, 2024 19:06

percevalw added 3 commits November 18, 2024 16:23

fix: support batch full of empty boxes in huggingface_embeddings

d982085

fix: ensure that models are loaded in eval mode

cad3317

refacto: update trainable components and data api

cb6062f

percevalw force-pushed the fix-empty-batches branch 2 times, most recently from 3b2aade to e510744 · November 18, 2024 15:39

ci: drop codecov

965326d

percevalw force-pushed the fix-empty-batches branch from e510744 to 965326d · November 18, 2024 15:47

percevalw merged commit 3deff32 into main Nov 18, 2024
13 checks passed
