Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added feature to save pdfs to ram or files dynamically #1788

Closed
wants to merge 60 commits into from

Conversation

HimaGirija99
Copy link
Contributor

Description

  1. Added Memory Limit Configuration: Integrated memory limit settings into the custom_settings.yml file within the configs directory.
  2. Configured Memory Settings Loader: Set up and loaded memory settings in the memoryConfig file using a YAML parser
  3. Implemented Logic in MemoryUtils: Developed logic within the memoryUtils class to determine whether to use file-based storage or in-memory processing based on the configured memory thresholds and available system resources.
  4. Utilized MemoryUtils in PDF Processing: Applied the memoryUtils logic within a Java class responsible for saving PDFs, ensuring efficient memory usage during file processing.
  5. Enhanced Exception Handling: Added comprehensive exception handling in the image extraction method to manage errors more effectively during PDF processing.

I tried implementing the feature for single Java file - ExtractImagesController

Closes #(1775)

Checklist:

  • [x ] I have read the Contribution Guidelines
  • [x ] I have performed a self-review of my own code
  • [x ] I have commented my code, particularly in hard-to-understand areas
  • [x ] My changes generate no new warnings

@github-actions github-actions bot added Java Pull requests that update Java code API API-related issues or pull requests labels Sep 2, 2024

This comment was marked as outdated.

@@ -135,4 +139,20 @@ public Predicate<Path> processOnlyFiles() {
}
};
}

@Bean
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to make a new settings file requirement here.
This code also Requires the entry
currently the PR fails with

      Assertion Failed: Expected application/octet-stream but got application/json. Response content: b'{"timestamp":"2024-09-03T09:09:03.423+00:00","status":500,"error":"Internal Server Error","exception":"java.lang.NullPointerException","trace":"java.lang.NullPointerException: Cannot invoke \\"stirling.software.SPDF.config.memoryConfig$MemorySettings.getRamThresholdGB()\\" because the return value of \\"stirling.software.SPDF.config.memoryConfig.getMemory()\\" is null\\n\\tat stirling.software.SPDF.utils.memoryUtils.shouldUseFileBasedStorage(memoryUtils.java:17)\\n\\tat 

Please add this to the actual settings.yml instead
https://github.com/Stirling-Tools/Stirling-PDF/blob/main/src/main/resources/settings.yml.template
and
https://github.com/Stirling-Tools/Stirling-PDF/blob/main/src/main/java/stirling/software/SPDF/model/ApplicationProperties.java

@HimaGirija99
Copy link
Contributor Author

sure thank you

@@ -63,7 +60,7 @@ public ResponseEntity<byte[]> extractImages(@ModelAttribute PDFExtractImagesRequ
throws IOException, InterruptedException, ExecutionException {
MultipartFile file = request.getFileInput();
String format = request.getFormat();
boolean allowDuplicates = request.isAllowDuplicates();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please dont remove!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multithreading approach for image extraction was revised to improve thread safety, especially after errors were encountered during testing after adding a new feature(saving PDFs to temp files. Despite these improvements, the output still shows inconsistency, with a variable number of images being extracted (between 60 and 68) from the same test PDF.

Copy link
Member

@Frooodle Frooodle Sep 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at code one issue is imageIndex needs to be atomicInteger to be thread safe

also instead of sync block maybe just ConcurrentHashSet ? (ConcurrentHashMap.newKeySet();)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have included atomic integer- image Index. I will include the second one too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! lemme know if it helps at all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the addition of atomic locks and using a ConcurrentHashSet for image extraction, the issue of inconsistent image output persists. The number of extracted images continues to vary, ranging from 65 to 68. I am not aware if it is really thread safety issue or a formatting issue
I am seeing the following output for skipped images : 14:13:57.276 [pool-3-thread-2] ERROR s.s.S.c.a.m.ExtractImagesController - Error processing image from page 2
java.io.IOException: ICCBased colorspace array must have a stream as the second element


@RestController
@RequestMapping("/api/v1/misc")
@Tag(name = "Misc", description = "Miscellaneous APIs")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also been removed

if (processedImages.stream()
.anyMatch(hash -> Arrays.equals(hash, imageHash))) {
try {
if (!allowDuplicates) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duplicate logic doesnt seem to work anymore looking at the github PR runs
shows that the extract test extracts 5 images instead of 2 (when 3 are dups and should be removed)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. I will check, test and update

@Frooodle
Copy link
Member

This still doesn't pass the tests, please either resolve the issue or close PR, remerging is just taking up CPU cycles

@HimaGirija99 HimaGirija99 deleted the svMem branch September 23, 2024 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API API-related issues or pull requests Back End Issues related to back-end development Java Pull requests that update Java code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants