
BonDNet #12

Merged

186 commits merged on Nov 6, 2023
4871a84
attempting to resurrect atom mapping
samblau Apr 6, 2023
6fb1d42
Merge branch 'sam_dev' into atom_mapping_2023
samblau Apr 6, 2023
350c179
new atom mapping testing
samblau Apr 7, 2023
4300e54
small tweak
samblau Apr 7, 2023
a4f07b0
Debugging old atom mapping now possible
samblau Apr 8, 2023
1aefe9b
Debugging old atom mapping
samblau Apr 10, 2023
c0e6e0d
Merge branch 'sam_dev' into atom_mapping_2023
samblau May 7, 2023
ea87b5e
Minor progress
samblau May 8, 2023
adac528
noH_graph just the start
samblau May 16, 2023
6157475
compressed graph
samblau May 23, 2023
873b2c2
mapping nearly working
samblau May 24, 2023
2f59a51
Atom mapping working (without symmetry)
samblau May 24, 2023
494ccf4
address last (?) corner case
samblau May 25, 2023
0eb3511
bug fix A+A -> AA case
samblau May 25, 2023
9d7bd1f
starting to deal with small mol symmetry
samblau May 25, 2023
095bda0
co3 dealt with?
samblau May 26, 2023
d129033
report issues
samblau May 26, 2023
65db489
fix report issues
samblau May 26, 2023
83833be
Symmetry solved?!?!
samblau Jun 3, 2023
3c8f839
Phase 1 atom mapping fixes
samblau Jun 6, 2023
c3aa765
add mapping for redox and charge transfer
samblau Jun 6, 2023
4018648
Mapping fully working for all tests!
samblau Jun 6, 2023
0d0b02d
Tiny tweak
samblau Jun 6, 2023
6a3a2e7
test_1
Jun 29, 2023
a2ad798
test_2
Jun 29, 2023
3defee5
test for creating dgl molecule graphs
Jun 29, 2023
ae02fd7
test for creating dgl molecule graphs
Jun 29, 2023
bebf00e
test
Jun 29, 2023
00e7495
test
Jun 29, 2023
de51473
bug_fix
Jun 29, 2023
0b7081b
bug_fix
Jun 29, 2023
d75bda3
bug_fix
Jun 29, 2023
372a43d
non_metal_bonds
Jun 29, 2023
a618276
normalization trial 1
Jun 29, 2023
864cd44
normalization trial 1
Jun 29, 2023
02089fd
reaction_filter_trial
Jun 29, 2023
57ae06d
reaction_filter_trial
Jun 29, 2023
f6e1f53
reaction_filter_trial
Jun 29, 2023
17f67ca
bug fix
Jun 29, 2023
a4b9d8e
bug_fix
Jun 29, 2023
2a11766
update files
Jun 29, 2023
4a60501
bug fix
Jun 29, 2023
757335d
test
Jun 29, 2023
e98d920
transform atom mapping test
Jul 6, 2023
bdb8d23
bug fix
Jul 6, 2023
a07c468
bug fix
Jul 6, 2023
ca22091
print netwrokx_graph
Jul 6, 2023
79dca7d
printing
Jul 6, 2023
292db4d
printing
Jul 6, 2023
3fb7549
printing
Jul 6, 2023
8860b3e
printing
Jul 6, 2023
914968c
check total_bonds
Jul 6, 2023
386c4a8
check total_bonds
Jul 6, 2023
7f294b8
check total_bonds
Jul 6, 2023
c44b560
check total_bonds
Jul 6, 2023
75b1c90
check total_bonds
Jul 6, 2023
b866b17
check total_bonds
Jul 6, 2023
29ac7b0
check total_bonds
Jul 6, 2023
45a482b
check total_bonds
Jul 6, 2023
3c31d62
check total_bonds
Jul 6, 2023
14422fd
print bonds broken
Jul 6, 2023
7201a50
fix assert
Jul 6, 2023
be5cc0a
bonds
Jul 6, 2023
3a9b9d1
bonds
Jul 6, 2023
f339820
bonds
Jul 6, 2023
2379a56
edge_case
Jul 6, 2023
fdeb032
edge_case
Jul 6, 2023
a4d057c
edge_case
Jul 6, 2023
6e69bed
fix
Jul 6, 2023
c0f34a9
fix
Jul 6, 2023
b652242
trial
Jul 6, 2023
049c272
trial2
Jul 6, 2023
d2000ed
fix trial2
Jul 6, 2023
4f1f43e
fix trial3
Jul 6, 2023
2ee6d90
investigate error
Jul 6, 2023
49ef7ac
investigate error
Jul 6, 2023
7fd1c71
trial
Jul 10, 2023
bf0c575
trial
Jul 10, 2023
3c3df7b
print
Jul 10, 2023
80c2470
trial
Jul 10, 2023
c3025f3
trial
Jul 10, 2023
25d48de
print rxn_grphs
Jul 10, 2023
3a66ac9
trial
Jul 10, 2023
1ca4b70
trial2
Jul 10, 2023
5af73df
trial3
Jul 10, 2023
61fd480
trial
Jul 10, 2023
e51851f
trial
Jul 10, 2023
620f144
trial
Jul 10, 2023
6e207fb
trial
Jul 10, 2023
1fe5ed2
trial
Jul 10, 2023
4eb9f8b
fix_error
Jul 10, 2023
a341e96
fix_error
Jul 10, 2023
d9f4ef9
fix_error
Jul 10, 2023
3e9040b
fix_error
Jul 10, 2023
dd467e3
fix_error
Jul 10, 2023
f8ef6f6
create json data
Jul 11, 2023
02523f5
create json data
Jul 11, 2023
d44f2eb
create json data
Jul 11, 2023
0fb60c1
cleaning up
Jul 11, 2023
bc953b7
clean codes
Jul 11, 2023
919aae4
clean codes
Jul 11, 2023
0feb99e
clean codes
Jul 11, 2023
5bc7271
clean codes
Jul 11, 2023
b96f2de
clean codes
Jul 11, 2023
47d51ed
clean codes
Jul 11, 2023
cfa7148
fix assert
Jul 11, 2023
b58ac35
clean code
Jul 11, 2023
f42a3bb
change some words
Jul 11, 2023
c4d1b67
testing
Jul 11, 2023
cc3109f
add mol_wrapper pickle
Jul 13, 2023
e6195e2
fix error
Jul 13, 2023
f482234
trial
Jul 13, 2023
44b3ee8
add
Jul 13, 2023
219d9a9
successfully grab molwrapper
Jul 20, 2023
dda7913
include lmdb
Jul 24, 2023
fad4dbe
fix error
Jul 25, 2023
bcb3371
fix error
Jul 25, 2023
ba88bab
lmdb error fix trial
Jul 25, 2023
84935bc
trial
Jul 25, 2023
7734c12
trial
Jul 25, 2023
933744f
trial
Jul 25, 2023
f94e3ea
trial
Jul 25, 2023
c1d1ce3
trial
Jul 25, 2023
d82a728
trial
Jul 25, 2023
644f2ca
trial
Jul 25, 2023
4f70212
trial
Jul 25, 2023
eca83a5
trial
Jul 25, 2023
af2468f
trial
Jul 25, 2023
d625afa
small change
Jul 25, 2023
908f043
trial
Jul 25, 2023
c7ab1a7
trial
Jul 25, 2023
228cf70
trial
Jul 25, 2023
f666ed4
trial
Jul 25, 2023
132e382
trial
Jul 25, 2023
4196bed
trial
Jul 25, 2023
dfd8958
remove printing
Jul 25, 2023
ae0aa18
remove printing
Jul 25, 2023
05ee81c
add
Jul 26, 2023
874a240
add printing
Jul 26, 2023
f4296c3
trial2
Jul 26, 2023
d25076c
trial2
Jul 26, 2023
13ee61b
trial
Jul 26, 2023
588d20f
trial2
Jul 26, 2023
1d67df6
trial2
Jul 26, 2023
ded6b04
trial3
Jul 27, 2023
5bd470d
trial3
Jul 27, 2023
42413f7
trial4
Jul 27, 2023
61d9089
time it
Jul 27, 2023
5002136
trial5
Jul 27, 2023
8992d98
trial5
Jul 27, 2023
ed69385
trial
Jul 27, 2023
fcdeb83
trial
Jul 27, 2023
c68711f
fix
Jul 27, 2023
c4bee31
trial
Jul 27, 2023
0bb71cc
print
Jul 27, 2023
59298d6
save grapher features
Jul 27, 2023
4329323
change
Jul 27, 2023
cd0a41b
fix error
Jul 27, 2023
335faa8
trial
Aug 1, 2023
c21933b
trial
Aug 1, 2023
4b503e1
trial
Aug 1, 2023
0b94533
trial
Aug 1, 2023
9779e71
trial
Aug 1, 2023
3ad4b1d
trial
Aug 1, 2023
5a1b8f8
trial
Aug 1, 2023
39fe073
trial
Aug 1, 2023
8648f14
trial
Aug 1, 2023
da2e980
trial
Aug 1, 2023
74d9ce6
trial
Aug 1, 2023
167eaf9
print
Aug 3, 2023
f3114ec
only redox
Aug 3, 2023
59321f4
printing
Aug 3, 2023
ed0f18a
printing
Aug 3, 2023
e6a4c75
printing
Aug 3, 2023
ae0aea1
check atom_map
Aug 3, 2023
6e2b418
Merge branch 'sam_dev' into BonDNet
samblau Nov 2, 2023
19b9df9
Merge branch 'sam_dev' into atom_mapping_2023
samblau Nov 2, 2023
d558563
Merge branch 'atom_mapping_2023' into BonDNet
samblau Nov 2, 2023
3bdeb29
Merge branch 'sam_dev' into atom_mapping_2023
samblau Nov 3, 2023
365ae1a
Merge branch 'atom_mapping_2023' into BonDNet
samblau Nov 3, 2023
0c7f299
Merge branch 'sam_dev' into atom_mapping_2023
samblau Nov 3, 2023
8ac73d2
Merge branch 'atom_mapping_2023' into BonDNet
samblau Nov 3, 2023
594fcd2
Merge branch 'sam_dev' into atom_mapping_2023
samblau Nov 6, 2023
cd87c92
Merge branch 'atom_mapping_2023' into BonDNet
samblau Nov 6, 2023
d4c3bc7
Merge branch 'sam_dev' into atom_mapping_2023
samblau Nov 6, 2023
64c7d40
Merge branch 'atom_mapping_2023' into BonDNet
samblau Nov 6, 2023
329 changes: 329 additions & 0 deletions HiPRGen/lmdb_dataset.py
@@ -0,0 +1,329 @@
# Given DGL graphs, reaction features, and metadata, write them into an LMDB file.
# 1. check that the LMDB is expanded reasonably

from torch.utils.data import Dataset
from pathlib import Path
import numpy as np
import pickle
import lmdb
from torch.utils.data import random_split
import multiprocessing as mp
import os
from tqdm import tqdm
import glob


class LmdbDataset(Dataset):
    """
    Dataset class to
    1. write reaction network objects to LMDB
    2. load LMDB files
    """

    def __init__(self, config, transform=None):
        super(LmdbDataset, self).__init__()

        self.config = config
        self.path = Path(self.config["src"])

        # self.metadata_path = self.path.parent / "metadata.npz"
        self.env = self.connect_db(self.path)

        # If a "length" key (ascii-encoded) is present, use it; whenever
        # property entries are stored alongside samples, "length" must exist.
        length_entry = self.env.begin().get("length".encode("ascii"))
        if length_entry is not None:
            num_entries = pickle.loads(length_entry)
        else:
            # Otherwise fall back to the raw number of entries in the LMDB.
            num_entries = self.env.stat()["entries"]

        self._keys = list(range(num_entries))
        self.num_samples = num_entries

        # Optionally restrict this process to a portion of the dataset.
        self.sharded = False
        if "shard" in self.config and "total_shards" in self.config:
            self.sharded = True
            self.indices = range(self.num_samples)
            # split all available indices into 'total_shards' bins
            self.shards = np.array_split(
                self.indices, self.config.get("total_shards", 1)
            )
            # limit each process to the subset of data for its shard
            self.available_indices = self.shards[self.config.get("shard", 0)]
            self.num_samples = len(self.available_indices)

        # TODO
        self.transform = transform

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # if sharding, remap idx to the appropriate index of the sharded set
        if self.sharded:
            idx = self.available_indices[idx]

        # NOTE: len(self._keys) is less than the total number of LMDB keys,
        # since property entries are stored alongside the samples.
        datapoint_pickled = self.env.begin().get(
            f"{self._keys[idx]}".encode("ascii")
        )

        data_object = pickle.loads(datapoint_pickled)

        # TODO
        if self.transform is not None:
            data_object = self.transform(data_object)

        return data_object
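The sharding logic above can be sketched without any LMDB involved. The shard counts below are made-up values for illustration, not from the PR:

```python
import numpy as np

# Mimic the sharding in LmdbDataset.__init__/__getitem__:
# 10 samples split across 3 shards; this worker owns shard 1.
num_samples = 10
total_shards = 3
shard = 1

shards = np.array_split(range(num_samples), total_shards)
available_indices = shards[shard]  # global indices owned by this shard

# Local index 0 of shard 1 maps to global index 4
# (shard 0 holds indices 0-3, shard 1 holds 4-6, shard 2 holds 7-9).
local_idx = 0
global_idx = available_indices[local_idx]
print(global_idx)  # 4
```

Note that `np.array_split` (unlike `np.split`) tolerates bins of unequal size, which is why the first shard ends up one sample larger here.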

    def connect_db(self, lmdb_path=None):
        env = lmdb.open(
            str(lmdb_path),
            subdir=False,
            readonly=True,
            lock=False,
            readahead=True,
            meminit=False,
            max_readers=1,
        )
        return env

    def close_db(self):
        # __init__ always opens a single environment, so close that one.
        self.env.close()

    def get_metadata(self, num_samples=100):
        pass

    @property
    def dtype(self):
        dtype = self.env.begin().get("dtype".encode("ascii"))
        return pickle.loads(dtype)

    @property
    def feature_size(self):
        feature_size = self.env.begin().get("feature_size".encode("ascii"))
        return pickle.loads(feature_size)

    @property
    def feature_name(self):
        feature_name = self.env.begin().get("feature_name".encode("ascii"))
        return pickle.loads(feature_name)

    @property
    def mean(self):
        mean = self.env.begin().get("mean".encode("ascii"))
        return pickle.loads(mean)

    @property
    def std(self):
        std = self.env.begin().get("std".encode("ascii"))
        return pickle.loads(std)


def divide_to_list(a, b):
    quotient = a // b
    remainder = a % b

    result = []
    for i in range(b):
        increment = 1 if i < remainder else 0
        result.append(quotient + increment)

    return result
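A standalone check of the chunking helper, restated here so the snippet runs on its own: `a` items are split into `b` near-equal chunk sizes, with the first `a % b` chunks each taking one extra item.

```python
def divide_to_list(a, b):
    # a items into b chunks: the first (a % b) chunks get one extra item.
    quotient, remainder = divmod(a, b)
    return [quotient + (1 if i < remainder else 0) for i in range(b)]

sizes = divide_to_list(10, 3)
print(sizes)  # [4, 3, 3]
```

The chunk sizes always sum back to `a`, which is what lets the caller hand out contiguous, non-overlapping slices of the sample list.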

def cleanup_lmdb_files(directory, pattern):
    """
    Clean up files matching the given pattern in the specified directory.
    """
    file_list = glob.glob(os.path.join(directory, pattern))

    for file_path in file_list:
        try:
            os.remove(file_path)
            print(f"Deleted file: {file_path}")
        except OSError as e:
            print(f"Error deleting file: {file_path}. {str(e)}")
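The glob-then-remove pattern used by `cleanup_lmdb_files` can be exercised against a throwaway directory. The file names below are illustrative, not the PR's real shard files:

```python
import glob
import os
import tempfile

# Stand-in for a directory of per-worker temporary LMDBs plus the merged DB.
tmpdir = tempfile.mkdtemp()
for name in ["_tmp_data.0000.lmdb", "_tmp_data.0001.lmdb", "data.lmdb"]:
    open(os.path.join(tmpdir, name), "w").close()

# Same pattern as cleanup_lmdb_files(tmpdir, "_tmp_data*").
for path in glob.glob(os.path.join(tmpdir, "_tmp_data*")):
    os.remove(path)

# Only the merged database file survives.
print(sorted(os.listdir(tmpdir)))  # ['data.lmdb']
```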

def CRNs2lmdb(
    samples,
    dtype,
    feature_size,
    feature_name,
    mean,
    std,
    lmdb_dir,
    lmdb_name,
    num_workers=1,
):
    os.makedirs(lmdb_dir, exist_ok=True)

    # One temporary LMDB per worker; merged into lmdb_name afterwards.
    db_paths = [
        os.path.join(lmdb_dir, "_tmp_data.%04d.lmdb" % i)
        for i in range(num_workers)
    ]

    meta_keys = {
        "dtype": dtype,
        "feature_size": feature_size,
        "feature_name": feature_name,
        "mean": mean,
        "std": std,
    }

    # Split the samples across workers and write each chunk into its own
    # temporary LMDB via write_crns_to_lmdb.
    chunk_sizes = divide_to_list(len(samples), num_workers)
    mp_args = []
    start = 0
    for pid, size in enumerate(chunk_sizes):
        mp_args.append((db_paths[pid], samples[start : start + size], pid, meta_keys))
        start += size

    with mp.Pool(num_workers) as pool:
        pool.map(write_crns_to_lmdb, mp_args)

    # Merge the per-worker LMDB files, then remove the temporaries.
    merge_lmdbs(db_paths, lmdb_dir, lmdb_name)
    cleanup_lmdb_files(lmdb_dir, "_tmp_data*")



def write_crns_to_lmdb(mp_args):
    # pid is the index of the worker.
    db_path, samples, pid, meta_keys = mp_args

    db = lmdb.open(
        db_path,
        map_size=1099511627776 * 2,
        subdir=False,
        meminit=False,
        map_async=True,
    )

    pbar = tqdm(
        total=len(samples),
        position=pid,
        desc=f"Worker {pid}: Writing CRNs Objects into LMDBs",
    )

    # write indexed samples
    idx = 0
    for sample in samples:
        txn = db.begin(write=True)
        txn.put(
            f"{idx}".encode("ascii"),
            pickle.dumps(sample, protocol=-1),
        )
        idx += 1
        pbar.update(1)
        txn.commit()

    # write properties
    txn = db.begin(write=True)
    txn.put("length".encode("ascii"), pickle.dumps(len(samples), protocol=-1))
    txn.commit()

    for key, value in meta_keys.items():
        txn = db.begin(write=True)
        txn.put(key.encode("ascii"), pickle.dumps(value, protocol=-1))
        txn.commit()

    pbar.close()
    db.sync()
    db.close()

def merge_lmdbs(db_paths, out_path, output_file):
    """
    Merge LMDB files, re-indexing the samples as they are copied.
    """
    env_out = lmdb.open(
        os.path.join(out_path, output_file),
        map_size=1099511627776 * 2,
        subdir=False,
        meminit=False,
        map_async=True,
    )

    idx = 0
    for db_path in db_paths:
        env_in = lmdb.open(
            str(db_path),
            subdir=False,
            readonly=True,
            lock=False,
            readahead=True,
            meminit=False,
        )

        # Re-index the sample entries so that samples from later shards do
        # not overwrite earlier ones or the property entries.
        with env_out.begin(write=True) as txn_out, env_in.begin(write=False) as txn_in:
            cursor = txn_in.cursor()
            for key, value in cursor:
                try:
                    # integer keys are indexed samples: copy under a new index
                    int(key.decode("ascii"))
                    txn_out.put(
                        f"{idx}".encode("ascii"),
                        value,
                    )
                    idx += 1
                # non-integer keys are property entries: copy as-is
                except ValueError:
                    txn_out.put(
                        key,
                        value,
                    )
        env_in.close()

    # update the total length
    txn_out = env_out.begin(write=True)
    txn_out.put("length".encode("ascii"), pickle.dumps(idx, protocol=-1))
    txn_out.commit()

    env_out.sync()
    env_out.close()
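The `try`/`except ValueError` dispatch in `merge_lmdbs` (integer keys are samples to re-index, anything else is a property to copy) can be sketched with plain dicts standing in for the LMDB environments:

```python
import pickle

# Two fake shard databases: integer keys hold samples, string keys hold
# properties, matching the layout written by write_crns_to_lmdb.
shard_a = {b"0": b"s0", b"1": b"s1", b"length": pickle.dumps(2)}
shard_b = {b"0": b"s2", b"length": pickle.dumps(1)}

merged = {}
idx = 0
for shard in (shard_a, shard_b):
    for key, value in shard.items():
        try:
            int(key.decode("ascii"))           # sample entry: re-index
            merged[f"{idx}".encode("ascii")] = value
            idx += 1
        except ValueError:                      # property entry: copy as-is
            merged[key] = value

merged[b"length"] = pickle.dumps(idx)           # update the merged length
print(idx)  # 3
```

Without the re-indexing, shard_b's sample `b"0"` would clobber shard_a's, which is exactly the collision the merge step is guarding against.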



def write_to_lmdb(new_samples, current_length, lmdb_update, db_path):
    db = lmdb.open(
        db_path,
        map_size=1099511627776 * 2,
        subdir=False,
        meminit=False,
        map_async=True,
    )

    pbar = tqdm(
        total=len(new_samples),
        desc="Adding new samples into LMDBs",
    )

    # write indexed samples, continuing from the current length
    idx = current_length
    for sample in new_samples:
        txn = db.begin(write=True)
        txn.put(
            f"{idx}".encode("ascii"),
            pickle.dumps(sample, protocol=-1),
        )
        idx += 1
        pbar.update(1)
        txn.commit()

    # write properties
    total_length = current_length + len(new_samples)

    txn = db.begin(write=True)
    txn.put("length".encode("ascii"), pickle.dumps(total_length, protocol=-1))
    txn.commit()

    # write mean, std, feature_size, feature_name, dtype, etc.
    for key, value in lmdb_update.items():
        txn = db.begin(write=True)
        txn.put(key.encode("ascii"), pickle.dumps(value, protocol=-1))
        txn.commit()

    pbar.close()
    db.sync()
    db.close()