I think there is an error in the correspondence between edge ids and edge features.
More specifically, the edge ids run from 1 to #edges, but the edge feature matrix has shape [#edges, dim] (dim = 172 in your code).
In TGB/train_tgb_lpp.py, you use the edge ids directly to look up features in the edge feature matrix (I think you should either look up with ids - 1, or prepend one all-zero row to the edge feature matrix).
When id = #edges, the program raises an array-index-out-of-bounds error in:
```python
# in GraphMixer.py
# self.edge_raw_features has shape [#edges, 172], but neighbor_edge_ids is in [0, #edges]:
# 0 appears only because the id arrays are initialized to 0 and should not actually occur;
# 1 to #edges are the values returned by querying edge ids in
# TGB/utils.py NeighborSampler.find_neighbors_before
nodes_edge_raw_features = self.edge_raw_features[torch.from_numpy(neighbor_edge_ids)]
```
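The off-by-one is easy to reproduce with toy sizes (hypothetical small numbers standing in for #edges = 157474, dim = 172):

```python
import numpy as np

num_edges, dim = 5, 172                          # toy stand-ins for the real dataset sizes
edge_raw_features = np.zeros((num_edges, dim))   # shape [#edges, dim], as built by TGB

neighbor_edge_ids = np.array([1, 2, num_edges])  # edge ids range over [1, #edges]

try:
    _ = edge_raw_features[neighbor_edge_ids]     # id == #edges exceeds the last valid index (#edges - 1)
except IndexError as e:
    print('out of bounds:', e)
```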
in find_neighbors_before
```python
def find_neighbors_before(self, node_id: int, interact_time: float, return_sampled_probabilities: bool = False):
    """
    extracts all the interactions happening before interact_time (less than interact_time) for node_id
    in the overall interaction graph; the returned interactions are sorted by time.
    :param node_id: int, node id
    :param interact_time: float, interaction time
    :param return_sampled_probabilities: boolean, whether to return the sampled probabilities of neighbors
    :return: neighbors, edge_ids, timestamps and sampled_probabilities (if return_sampled_probabilities is True)
             with shape (historical_nodes_num, )
    """
    # return index i, which satisfies list[i - 1] < v <= list[i]
    # return 0 for the first position in self.nodes_neighbor_times since the value at the first position is empty
    i = np.searchsorted(self.nodes_neighbor_times[node_id], interact_time)

    if return_sampled_probabilities:
        return self.nodes_neighbor_ids[node_id][:i], self.nodes_edge_ids[node_id][:i], \
               self.nodes_neighbor_times[node_id][:i], self.nodes_neighbor_sampled_probabilities[node_id][:i]
    else:
        # default
        return self.nodes_neighbor_ids[node_id][:i], self.nodes_edge_ids[node_id][:i], \
               self.nodes_neighbor_times[node_id][:i], None
```
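As an aside, the searchsorted call above uses the default side='left', so it returns the first index i with times[i] >= interact_time, and the [:i] slices keep only interactions strictly before interact_time. A small illustration:

```python
import numpy as np

times = np.array([1.0, 2.0, 4.0, 4.0, 6.0])  # one node's neighbor interaction times, sorted
i = np.searchsorted(times, 4.0)              # first index with times[i] >= 4.0
print(i)          # 2
print(times[:i])  # [1. 2.] -- only interactions strictly before t = 4.0
```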
The edge ids in self.nodes_edge_ids range over [1, #edges].
```python
class NeighborSampler:
    def __init__(self, adj_list: list, sample_neighbor_strategy: str = 'uniform',
                 time_scaling_factor: float = 0.0, seed: int = None):
        """
        Neighbor sampler.
        :param adj_list: list, list of list, where each element is a list of triple tuple (node_id, edge_id, timestamp)
        :param sample_neighbor_strategy: str, how to sample historical neighbors, 'uniform', 'recent', or 'time_interval_aware'
        :param time_scaling_factor: float, a hyper-parameter that controls the sampling preference with time interval,
               a large time_scaling_factor tends to sample more on recent links; this parameter works when
               sample_neighbor_strategy == 'time_interval_aware'
        :param seed: int, random seed
        """
        self.sample_neighbor_strategy = sample_neighbor_strategy
        self.seed = seed

        # list of each node's neighbor ids, edge ids and interaction times, which are sorted by interaction times
        self.nodes_neighbor_ids = []
        self.nodes_edge_ids = []
        self.nodes_neighbor_times = []

        if self.sample_neighbor_strategy == 'time_interval_aware':
            self.nodes_neighbor_sampled_probabilities = []
            self.time_scaling_factor = time_scaling_factor

        # the list at the first position in adj_list is empty, hence, sorted() will return an empty list for the first position
        # its corresponding value in self.nodes_neighbor_ids, self.nodes_edge_ids, self.nodes_neighbor_times will also be empty with length 0
        for node_idx, per_node_neighbors in enumerate(adj_list):
            # per_node_neighbors is a list of tuples (neighbor_id, edge_id, timestamp)
            # sort the list based on timestamps, sorted() function is stable
            # Note that sorting the list based on edge id is also correct, as the original data file ensures the interactions are chronological
            sorted_per_node_neighbors = sorted(per_node_neighbors, key=lambda x: x[2])
            self.nodes_neighbor_ids.append(np.array([x[0] for x in sorted_per_node_neighbors]))
            self.nodes_edge_ids.append(np.array([x[1] for x in sorted_per_node_neighbors]))  # range is [1, #edges]
            self.nodes_neighbor_times.append(np.array([x[2] for x in sorted_per_node_neighbors]))

            # additional for time interval aware sampling strategy (proposed in CAWN paper)
            if self.sample_neighbor_strategy == 'time_interval_aware':
                self.nodes_neighbor_sampled_probabilities.append(
                    self.compute_sampled_probabilities(np.array([x[2] for x in sorted_per_node_neighbors])))

        if self.seed is not None:
            self.random_state = np.random.RandomState(self.seed)

        # ++++++++++++ add here ++++++++++++
        print('max ', np.max(np.concatenate(self.nodes_edge_ids)))  # prints #edges
        print('min ', np.min(np.concatenate(self.nodes_edge_ids)))  # prints 1.0
```
Your program skips the table header when preprocessing the dataset; take tgbl-wiki as an example. That means the edge feature matrix (msg in tgbl-wiki) has shape [#edges, dim], where #edges is the number of edges.
The problem lies in the function inside TGB/process_data.py where **the feature matrix is first initialized to all zeros** and is then filled in, according to the different strategies, using the return value of the function above. But that function returns edge ids in [1, #edges], so when the id #edges appears, the lookup into the *[#edges, dim=172]* matrix goes out of bounds.
In DyGLib's own preprocessing, the edge ids are all 1 less than the number of rows in the feature matrix (so the lookup never goes out of bounds; the row at idx = 0 is effectively unused and stays all zeros).
In the case of wikipedia, the matrix has 157475 rows, but there are actually only 157,474 edges.
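For reference, the JODIE-style preprocessing that produces that extra row stacks an all-zero row on top of the raw features; a sketch of the convention (not the exact DyGLib code):

```python
import numpy as np

num_edges, dim = 157474, 172                            # wikipedia: 157474 edges, 172-dim features
raw_feat = np.ones((num_edges, dim), dtype=np.float32)  # one row per real edge
empty = np.zeros((1, dim), dtype=np.float32)            # padding row for the unused id 0
edge_feat = np.vstack([empty, raw_feat])

# now edge ids in [1, #edges] are valid row indices, and id 0 maps to the zero row
print(edge_feat.shape)  # (157475, 172)
```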
But for tgbl-wiki, TGB's handler leaves out that first all-zero row of the feature matrix. Printing both out shows that the first row of tgbl-wiki matches row 0 of the raw wikipedia data, i.e. the padding row was never inserted.
The problem ultimately arises because, when TGB generates its own npy file from the original csv file, it forgets to insert a first row of all zeros.
In the following processing of the csv file, msg is the feature matrix. Because of skiprows=1 the header is dropped, but msg is not padded with a zero row, so the lookup will go out of bounds:
msg.shape is (157474, 172)
```python
# in tgb/linkproppred/dataset.py
import numpy as np
import pandas as pd

def load_edgelist_wiki(fname: str) -> pd.DataFrame:
    """
    loading wikipedia dataset into pandas dataframe
    similar processing to https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/datasets/jodie.html
    Parameters:
        fname: str, name of the input file
    Returns:
        df: a pandas dataframe containing the edgelist data
    """
    df = pd.read_csv(fname, skiprows=1, header=None)
    src = df.iloc[:, 0].values
    dst = df.iloc[:, 1].values
    dst += int(src.max()) + 1
    t = df.iloc[:, 2].values
    msg = df.iloc[:, 4:].values
    print(msg.shape)
    idx = np.arange(t.shape[0])
    w = np.ones(t.shape[0])
    return pd.DataFrame({"u": src, "i": dst, "ts": t, "idx": idx, "w": w}), msg, None

df, edge_feat, node_ids = load_edgelist_wiki('data/tgb/tgbl-wiki_edgelist_v2.csv')
```
The code in TGB/utils, on the other hand, queries the feature matrix with edge ids in the range [1, #edges], which ultimately leads to an error when edge id = #edges.
So, the easiest solution, I think, is to prepend one all-zero row to the edge feature matrix when TGB builds it (giving shape [#edges + 1, dim]), so that edge ids in [1, #edges] index it directly; alternatively, subtract 1 from the edge ids at lookup time.
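Concretely, either of these one-liners would restore the correspondence (a sketch; names like edge_feat and neighbor_edge_ids follow the snippets above and may differ in the actual code):

```python
import numpy as np

num_edges, dim = 1000, 172                        # toy stand-ins for #edges = 157474, dim = 172
edge_feat = np.random.rand(num_edges, dim)        # msg as currently produced by TGB: no zero row
neighbor_edge_ids = np.array([1, 42, num_edges])  # edge ids from NeighborSampler, in [1, #edges]

# Option 1: prepend an all-zero row once, when the dataset is built
edge_feat_padded = np.vstack([np.zeros((1, dim)), edge_feat])  # shape (#edges + 1, dim)
feats_1 = edge_feat_padded[neighbor_edge_ids]

# Option 2: shift the ids at lookup time instead
feats_2 = edge_feat[neighbor_edge_ids - 1]

print(np.allclose(feats_1, feats_2))  # True
```

Both options return the same features; option 1 only has to be applied once when the npy file is generated.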