This Python script, DICOMetaExtractor_v32.py, extracts metadata from DICOM files across a directory tree, using data processing libraries such as polars to handle large datasets. It writes the extracted metadata to a CSV file, giving an overview of every DICOM file processed.
- Python 3.7 or newer
- pip (Python package installer)
Before running the script, ensure you have the required libraries installed. The primary libraries used in this version are pydicom, polars, tqdm, pandas, and portalocker. You can install these libraries using pip:
pip install pydicom polars tqdm pandas portalocker
Ensure all dependencies are installed successfully before proceeding to use the script.
The DICOMetaExtractor_v32.py script is designed to be run from the command line. The basic usage pattern is:
python DICOMetaExtractor_v32.py <path_to_dicom_directory> -o <output_csv_file_path>
- <path_to_dicom_directory>: Path to the root directory containing your DICOM files. The script recursively searches this directory and its subdirectories for DICOM files to process.
- -o <output_csv_file_path>: (Optional) Path where the extracted metadata CSV file will be saved. If not specified, the script defaults to dicom_data.csv in the current directory.
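As an illustration of the two arguments described above, here is a minimal argparse sketch. It is a hypothetical reconstruction rather than the script's actual parser: the function name build_parser, the attribute name dicom_directory, and the long option --output are assumptions; only the positional path, the -o flag, and the dicom_data.csv default come from the usage description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented usage (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        description="Extract DICOM metadata from a directory tree into a CSV file."
    )
    # Positional argument: root directory that is searched recursively.
    parser.add_argument("dicom_directory", help="Root directory containing DICOM files")
    # Optional argument: output CSV path. The --output alias is an assumption.
    parser.add_argument(
        "-o",
        "--output",
        default="dicom_data.csv",
        help="Path of the CSV file to write (default: dicom_data.csv)",
    )
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["/path/to/dicom/files"])
print(args.dicom_directory, args.output)
```

Because -o is omitted in the sample call, args.output falls back to dicom_data.csv, which matches the documented default.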
To extract metadata from DICOM files located in /path/to/dicom/files and save the output to extracted_metadata.csv in the current working directory, run:
python DICOMetaExtractor_v32.py /path/to/dicom/files -o extracted_metadata.csv
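The recursive search performed on the input directory can be pictured with the sketch below. It is not the script's own code: the helper name find_folders_with_dicom is hypothetical, and matching files by a .dcm suffix is an assumption (DICOM files do not always carry that extension).

```python
import os
from pathlib import Path


def find_folders_with_dicom(root, suffix=".dcm"):
    """Return the sorted folders under `root` holding at least one file ending in `suffix`."""
    folders = []
    # os.walk visits `root` and every subdirectory beneath it.
    for current, _subdirs, filenames in os.walk(root):
        # Assumption: a file is treated as DICOM if its name ends in `suffix`.
        if any(name.lower().endswith(suffix) for name in filenames):
            folders.append(Path(current))
    return sorted(folders)
```

Calling find_folders_with_dicom("/path/to/dicom/files") would return one Path per folder that contains matching files, which is a natural unit of work to hand to parallel workers.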
- The script handles large datasets by processing files in parallel and managing memory usage efficiently. It creates temporary files during processing, which are automatically cleaned up upon completion.
- An internet connection is required for the initial installation of dependencies but not for running the script on local DICOM files.
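The pattern of writing intermediate results to temporary files and cleaning them up afterwards can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: fake_extract stands in for real metadata extraction, all function names are hypothetical, and a ThreadPoolExecutor is used in place of worker processes to keep the example short.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

FIELDS = ["file", "bytes"]


def fake_extract(path):
    """Stand-in for real metadata extraction: returns only the file name and size."""
    return {"file": path.name, "bytes": path.stat().st_size}


def write_batch(batch, tmp_dir, index):
    """Process one batch of files and write its rows to a temporary CSV file."""
    part = tmp_dir / f"batch_{index}.csv"
    with part.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for path in batch:
            writer.writerow(fake_extract(path))
    return part


def extract_to_csv(files, output, batch_size=2, max_workers=4):
    """Process `files` in parallel batches, then merge the partial CSVs into `output`."""
    batches = [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
    # TemporaryDirectory deletes the intermediate files when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # executor.map returns results in the same order as the input batches.
            parts = list(
                executor.map(write_batch, batches, [tmp_dir] * len(batches), range(len(batches)))
            )
        with output.open("w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=FIELDS)
            writer.writeheader()
            for part in parts:
                with part.open(newline="") as src:
                    writer.writerows(csv.DictReader(src))
```

Keeping each batch in its own file bounds how much data is held in memory at once, at the cost of one extra merge pass at the end.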
The DICOMetaExtractor_v32.py script uses parallel processing to improve efficiency, especially when working with large datasets. By default, the script allocates a number of worker processes based on your system's capabilities. You may still need to adjust the number of workers manually to match your system's resources or to tune performance for your specific dataset.
To customize the number of worker processes used by the script, you will need to modify the source code slightly. This involves changing the max_workers parameter in the ProcessPoolExecutor and potentially the ThreadPoolExecutor, depending on where you want to adjust the parallelism.
- Open the script in your preferred text editor or Integrated Development Environment (IDE).
- Find the ProcessPoolExecutor instantiation. Look for the following line in the collect_and_process_dicom_data function:
  with ProcessPoolExecutor(max_workers=12) as executor:
- Modify the max_workers parameter to reflect the number of worker processes you wish to use. For example, to use 8 workers, change the line to:
  with ProcessPoolExecutor(max_workers=8) as executor:
- (Optional) Adjust ThreadPoolExecutor: If you also wish to change the number of threads used for directory scanning, find the ThreadPoolExecutor instantiation in the find_dcm_folders function and adjust the max_workers parameter similarly:
  with ThreadPoolExecutor(max_workers=4) as executor:
- Save your changes and close the file.
- CPU Resources: The optimal number of worker processes usually correlates with the number of CPU cores available on your system. Setting max_workers to the number of cores or logical processors can maximize your CPU usage.
- Memory Constraints: Be mindful of your system's memory (RAM). Increasing the number of workers increases memory usage. Monitor your system's memory usage and adjust the number of workers to prevent exhausting system resources.
- Disk I/O: For disk-bound tasks, such as reading DICOM files from a slow disk, increasing the number of workers might not lead to performance improvements. In such cases, disk speed is the limiting factor.
- Trial and Error: Finding the optimal setting may require some experimentation. Start with a number close to your system's CPU core count and adjust based on observed performance and system resource usage.
After adjusting the number of workers, run the script as usual to process your DICOM files with the new configuration. This customization allows you to tailor the script's performance to your specific system and dataset characteristics, optimizing efficiency and resource utilization.
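As a starting point for the experimentation described above, the following sketch derives a worker count from the number of logical processors. The function suggest_max_workers and its reserve parameter are hypothetical helpers, not part of the script; the result is simply a value you could type into the max_workers argument.

```python
import os


def suggest_max_workers(requested=None, reserve=0):
    """Return `requested` if given, otherwise the logical CPU count minus `reserve`."""
    if requested is not None:
        if requested < 1:
            raise ValueError("max_workers must be at least 1")
        return requested
    # os.cpu_count() can return None when the count cannot be determined.
    cpu_total = os.cpu_count() or 1
    # Never go below one worker, even if `reserve` exceeds the CPU count.
    return max(1, cpu_total - reserve)


print(suggest_max_workers())           # one worker per logical processor
print(suggest_max_workers(reserve=2))  # leave two processors free for other work
```

Reserving a core or two can keep the machine responsive during long runs; if memory or disk speed is the bottleneck, a value well below the CPU count may perform just as well.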
Contributions to enhance the script or address issues are welcome.
This script is released under the MIT License. Please refer to the LICENSE file for more details.
This tool leverages several open-source libraries, and we are grateful to the maintainers and contributors of these projects:
- Pydicom for DICOM file handling.
- Polars for efficient data processing.
- Pandas for data manipulation.
- Tqdm for progress bar functionality.
- Portalocker for file locking.
For any questions or issues, please open an issue on the GitHub repository page.