Alexandria database #107

JonathanSchmidt1 · 2024-01-24T13:34:13Z

Hi,
As I previously mentioned, I would like to add the Alexandria database (https://alexandria.icams.rub.de/) to the repository.
The implementation seems quite straightforward after following @laserkelvin's tutorials, and I already have some machine learning running with the data.
The dataset includes ~400k PBEsol and 400k SCAN calculations of relaxed crystal structures and ~4.4M PBE relaxed crystal structures, as well as 130k 2D and some 1D crystal structures. I am thinking of providing separate download options for the different datasets.

While one could set up a download through the OPTIMADE API, that is painfully slow, so we will probably just provide a link to some JSON files on Materials Cloud or Zenodo once we have something out on arXiv for the database, which should be rather soon. I will have to see if we can host the dataset at one of our universities until then.
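To illustrate why a bulk download through OPTIMADE tends to be slow, here is a minimal sketch of how a client pages through a `structures` endpoint: one HTTP request per page, following `links.next` until it is empty. The base URL, the `fetch` callable, and both function names are hypothetical and not part of the Alexandria implementation. Only the `/v1/structures` path, the `page_limit` and `filter` query parameters, and the `links.next` field come from the OPTIMADE specification.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode


def first_page_url(base_url: str, page_limit: int = 100, filter_expr: str = "") -> str:
    """Build the URL of the first results page (base_url is a placeholder)."""
    params = {"page_limit": page_limit}
    if filter_expr:
        params["filter"] = filter_expr
    return f"{base_url.rstrip('/')}/v1/structures?{urlencode(params)}"


def iter_structures(url: str, fetch: Callable[[str], str]) -> Iterator[dict]:
    """Yield entries page by page; `fetch` maps a URL to a JSON string."""
    while url:
        page = json.loads(fetch(url))
        yield from page.get("data", [])
        # `links.next` may be a URL string, a link object with "href", or null.
        nxt = (page.get("links") or {}).get("next")
        url = nxt.get("href") if isinstance(nxt, dict) else nxt
```

Passing `fetch` in as an argument keeps the sketch independent of any particular HTTP library. With millions of entries and a bounded page size, the number of sequential requests grows quickly, which is the motivation for shipping JSON files instead.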
The implementation of the dataset is rather similar to pymatgen, e.g. using parse_structure and parse_symmetry with a pymatgen Structure object as input to the functions.
I started an implementation at https://github.com/JonathanSchmidt1/matsciml_alexandria.
I have two questions:

1. How are you dealing with structures with a single atom right now? Should we remove them during downloading? At the moment they produce an error when trying to create a PyG graph, while the DGL graph is created with no edges. (Interestingly, you can then transform that DGL graph into a PyG graph.)
2. Do you have a good set of input parameters to build the network for E(n)-GNN and FAENet? Perhaps the input parameters from your benchmark paper for the former, and parameters that correspond to the original FAENet paper for the latter. I would just like to confirm that I get reasonable training results, to check that everything works when using the dataset in the pipeline.

laserkelvin · 2024-01-25T22:26:24Z

laserkelvin
Jan 25, 2024
Maintainer

> How are you dealing with structures with a single atom right now, should we remove them during downloading? (At the moment they produce an error when trying to create a pyg graph while the dgl graph is created with no edges. Interestingly you can then transform that dgl graph to a pyg graph.)

I think we can keep them, and for the modeling, just add a self-loop to include them as "graphs".
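As a minimal sketch of the self-loop idea, the helper below works on plain source/destination index lists and appends an `i -> i` edge for every node that has no edges, so a single-atom structure ends up with exactly one edge. The function name and the list-based edge representation are assumptions for illustration and are not taken from the repository. In practice the graph libraries' own utilities (for example `torch_geometric.utils.add_self_loops` or `dgl.add_self_loop`) would likely be used instead, although those add a loop to every node rather than only to isolated ones.

```python
from typing import List, Tuple


def add_missing_self_loops(
    num_nodes: int, src: List[int], dst: List[int]
) -> Tuple[List[int], List[int]]:
    """Return edge lists where every isolated node gets a self-loop.

    Hypothetical helper: nodes that already appear in an edge are left alone.
    """
    connected = set(src) | set(dst)
    new_src, new_dst = list(src), list(dst)
    for node in range(num_nodes):
        if node not in connected:
            new_src.append(node)
            new_dst.append(node)
    return new_src, new_dst


# A single-atom structure: one node, no neighbours within the cutoff.
single_atom = add_missing_self_loops(1, [], [])
print(single_atom)  # ([0], [0])
```

Adding loops only where needed keeps the edge set of multi-atom structures unchanged, so existing results would not be affected by the workaround.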

Regarding hosting, Zenodo is great. Have you also looked into ColabFit as a place to host the dataset?

> Do you have a good set of input parameters to build the network for maybe e(n)gnn and faenet? Maybe respectively the input parameters from your benchmark paper and parameters that correspond to the original faenet paper. I would just like to confirm that I get reasonable training results to confirm that everything is fine when using the dataset in the pipeline.

For my most recent paper on E(n)-GNN, this is the configuration I used:

model_class: PLEGNNBackbone
model_args:
    embed_in_dim: 256
    embed_hidden_dim: 1024
    embed_out_dim: 256
    embed_depth: 3
    embed_feat_dims: [256, 256, 256]
    embed_message_dims: [256, 256, 256]
    embed_position_dims: [64, 64]
    embed_edge_attributes_dim: 0
    embed_activation: silu
    embed_residual: True
    embed_normalize: True
    embed_tanh: True
    embed_activate_last: False
    embed_k_linears: 1
    embed_use_attention: False
    embed_attention_norm: sigmoid
    readout: sum
    node_projection_depth: 3
    node_projection_hidden_dim: 256
    node_projection_activation: silu
    prediction_out_dim: 1
    prediction_depth: 3
    prediction_hidden_dim: 128
    prediction_activation: relu
    num_atom_embedding: 201
    encoder_only: true

Will have to get back to you for FAENet :)

1 reply

melo-gonzo Jan 25, 2024
Maintainer

Here is a default set of FAENet arguments used for the benchmark paper:

encoder_class=FAENet,
encoder_kwargs={
    "pred_as_dict": False,
    "hidden_dim": 128,
    "output_dim": 64,
    "tag_hidden_channels": 0,
},
output_kwargs={
    "norm": LayerNorm(64),
    "hidden_dim": 64,
    "activation": SiLU,
    "lazy": False,
    "input_dim": 64,
},
lr=1e-4

JonathanSchmidt1 · 2024-01-30T16:43:04Z

JonathanSchmidt1
Jan 30, 2024
Author

Thank you for the input parameters. So far the machine learning results look sensible.
It's still a bit rough, but this branch should now have the most important dataset features implemented. Once we have a paper published, we can switch the link to a FAIR repository.
https://github.com/JonathanSchmidt1/matsciml_alexandria/tree/alexandria_api

JonathanSchmidt1 · 2024-01-30T16:46:40Z

JonathanSchmidt1
Jan 30, 2024
Author

> Regarding hosting, Zenodo is great. Have you also looked into ColabFit as a place to host the dataset?

This looks interesting. Right now the dataset contains just the relaxed structures, but we also have ten to a hundred times as many geometry optimization steps. When we get to publishing them, that might be an option.

1 reply

laserkelvin Jan 30, 2024
Maintainer

That sounds incredibly exciting - can't wait to read about it and use the data as well!
