Thank you for developing DiG! I am excited about its potential to provide insight into the conformational states of my protein of interest (PDB ID: 8io4), since it allows exploring different metastable states.
To apply DiG to this protein, I used OpenFold to predict the structure of 8io4 (with the proper alignment .a3m, FASTA, and CIF files) and generated a .pkl file containing the features DiG requires. After verifying that the OpenFold prediction agrees with the cryo-EM structure, I ran DiG's inference pipeline with the following command:
$ python run_inference.py -c ${CKPT_PATH} -i ${FEATURE_PATH} -s ${FASTA_PATH} -o ${PDBID} --output-prefix ${OUTDIR} -n 1 --use-tqdm
Upon running the command, I encountered the following error:
/home/mdmm/dig/Graphormer/Distributional-Graphormer/protein/run_inference.py:80: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu"))
/home/mdmm/dig/Graphormer/Distributional-Graphormer/protein/run_inference.py:278: DeprecationWarning: numpy.core.numeric is deprecated and has been renamed to numpy._core.numeric. The numpy._core namespace contains private NumPy internals and its use is discouraged, as NumPy internals can change without warning in any release. In practice, most real-world usage of numpy.core is to access functionality in the public NumPy API. If that is the case, use the public NumPy API. If not, you are using NumPy internals. If you would still like to access an internal attribute, use numpy._core.numeric._frombuffer.
pkl_data = pickle.load(open(pkl, "rb"))
Error processing ./dataset/8io4.pkl, ./dataset/8io4.fasta, 8io4
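Before debugging inside DiG, it may help to confirm that the .pkl file loads on its own and to inspect its top-level layout, independent of DiG's processing code. A minimal sketch (the demo file name and array shapes below are made up for illustration, not the real 1864-residue sizes):

```python
import pickle
import numpy as np

def summarize_pkl(path):
    """Load a pickle and report its top-level keys and array shapes,
    to check what DiG will actually receive."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    summary = {}
    for key, value in data.items():
        if isinstance(value, dict):
            summary[key] = {k: getattr(v, "shape", None) for k, v in value.items()}
        else:
            summary[key] = getattr(value, "shape", None)
    return summary

# Synthetic stand-in for a converted feature file (shapes are illustrative).
demo = {"representations": {"single": np.zeros((16, 384), dtype=np.float32),
                            "pair": np.zeros((16, 16, 128), dtype=np.float32)}}
with open("demo.pkl", "wb") as f:
    pickle.dump(demo, f)

print(summarize_pkl("demo.pkl"))
```

If this standalone load fails for 8io4.pkl, the problem is in the file itself (e.g., size or protocol); if it succeeds, the mismatch is more likely in the keys or shapes DiG expects.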
Since OpenFold produces only the "result_model_*.pkl" files and no "features.pkl", I used code like the following to extract the single and pair representations:
import pickle

# Extract the single and pair representations from the OpenFold output
# and repackage them under the "representations" key that DiG expects.
with open("model_1_multimer_v3_output_dict.pkl", "rb") as f:
    custom_data = pickle.load(f)

converted_data = {
    "representations": {
        "pair": custom_data["pair"],
        "single": custom_data["single"],
    }
}

with open("model_convert.pkl", "wb") as f:
    pickle.dump(converted_data, f)
Update: As a test, I predicted PDB ID 1ake and extracted 1ake.pkl from the OpenFold output in the same way (using the AlphaFold pre-trained parameters). DiG runs on this input, but the output looks problematic: in the predicted structure "1ake_0.pdb", the protein is not well constructed. I then compared the numerical values of my converted representations against the official ones in Python, plotting histograms and densities and checking summary statistics. The densities differ somewhat, but the raw numerical values differ greatly:
single_diff = np.abs(official_single - converted_single)
pair_diff = np.abs(official_pair - converted_pair)

Max difference in 'single': 7869.0894
Mean difference in 'single': 219.26212
Max difference in 'pair': 1169.8732
Mean difference in 'pair': 16.570103
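Raw absolute differences are hard to interpret without knowing the scale of the representations. A scale-aware comparison (a sketch with illustrative arrays; the function name and metrics are my own, not part of DiG) could look like:

```python
import numpy as np

def compare_reps(a, b):
    """Compare two representation arrays with scale-aware metrics:
    mean element-wise relative difference and overall cosine similarity."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    rel = np.abs(a - b) / (np.abs(a) + np.abs(b) + 1e-8)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return {"mean_rel_diff": float(rel.mean()), "cosine": float(cos)}

# Illustrative check: an array compared against a scaled copy of itself
# has cosine similarity 1.0 even though absolute differences are large.
x = np.random.default_rng(0).normal(size=(8, 4))
print(compare_reps(x, 2.0 * x))
```

A cosine near 1.0 despite large absolute differences would suggest a scaling or normalization mismatch rather than entirely different representations.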
Additional Details:
The protein has 1864 residues in total, and the generated .pkl file is approximately 5.4 GB.
I used the same .fasta and .pkl files that worked with OpenFold.
The error occurs during the loading of the .pkl file.
The GPU I am using is an NVIDIA A800 80GB.
Questions:
Is the error potentially related to the size and complexity of the protein (given the large number of residues and the size of the .pkl file)?
Could this be a format or compatibility issue with the .pkl file generated by OpenFold? (It seems someone has used the OpenFold approach before; see issue #202.) Or could it be due to the pre-trained parameters? (I used the AlphaFold2 pre-trained weights.)
Is there any known limitation on the size of the protein or the size of the .pkl file that DiG can handle?
Could the warning about torch.load and weights_only=False be relevant to this issue?
Best,