Errors with the correspondence between edge ids and edge features #1

Open
weichow23 opened this issue Sep 19, 2023 · 0 comments

I think there is an error in the correspondence between edge ids and edge features.
More specifically, the edge ids run from 1 to #edges, but the edge feature matrix has shape [#edges, dim] (dim = 172 in your code).
In TGB/train_tgb_lpp.py, you use the edge ids directly to index the edge feature matrix (I think you should either index with id - 1, or prepend a row of all zeros to the edge feature matrix).

When id = #edges, the program raises an array-index-out-of-bounds error in:

```python
# in GraphMixer.py
# self.edge_raw_features has shape [#edges, 172], but neighbor_edge_ids is in [0, #edges]:
# 0 only appears because the arrays are initialized to 0 and should not actually occur;
# 1 to #edges are the values returned by NeighborSampler.find_neighbors_before in TGB/utils.py
nodes_edge_raw_features = self.edge_raw_features[torch.from_numpy(neighbor_edge_ids)]
```
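The off-by-one can be reproduced with a minimal NumPy sketch (toy shapes standing in for the real dataset, not TGB code):

```python
import numpy as np

num_edges, dim = 5, 172                        # toy stand-ins for #edges and the feature dim
edge_raw_features = np.zeros((num_edges, dim))  # shape [#edges, dim], so row k holds edge k+1
edge_ids = np.array([1, 3, num_edges])          # ids are 1-based, so num_edges is a valid id

try:
    _ = edge_raw_features[edge_ids]             # id == num_edges is out of bounds
    failed = False
except IndexError:
    failed = True
assert failed

# prepending an all-zero row makes the 1-based ids line up with the row indices
padded = np.vstack((np.zeros((1, dim)), edge_raw_features))
assert padded[edge_ids].shape == (3, dim)
```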

In find_neighbors_before:

```python
    def find_neighbors_before(self, node_id: int, interact_time: float, return_sampled_probabilities: bool = False):
        """
        extracts all the interactions happening before interact_time (less than interact_time) for node_id in the overall interaction graph
        the returned interactions are sorted by time.
        :param node_id: int, node id
        :param interact_time: float, interaction time
        :param return_sampled_probabilities: boolean, whether return the sampled probabilities of neighbors
        :return: neighbors, edge_ids, timestamps and sampled_probabilities (if return_sampled_probabilities is True) with shape (historical_nodes_num, )
        """
        # return index i, which satisfies list[i - 1] < v <= list[i]
        # return 0 for the first position in self.nodes_neighbor_times since the value at the first position is empty
        i = np.searchsorted(self.nodes_neighbor_times[node_id], interact_time)
        if return_sampled_probabilities:
            return self.nodes_neighbor_ids[node_id][:i], self.nodes_edge_ids[node_id][:i], self.nodes_neighbor_times[node_id][:i], \
                   self.nodes_neighbor_sampled_probabilities[node_id][:i]
        else:  # default
            return self.nodes_neighbor_ids[node_id][:i], self.nodes_edge_ids[node_id][:i], self.nodes_neighbor_times[node_id][:i], None
```
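The slicing relies on `np.searchsorted` returning the insertion index i with list[i - 1] < v <= list[i]; a quick standalone check with toy timestamps (my own example, not TGB code):

```python
import numpy as np

times = np.array([1.0, 2.0, 2.0, 5.0])  # a node's neighbor interaction times, sorted
# default side='left': only strictly earlier interactions end up in the [:i] slice
i = np.searchsorted(times, 2.0)
assert i == 1
assert np.all(times[:i] < 2.0)          # times[:1] == [1.0], all before interact_time
```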

self.nodes_edge_ids ranges over [1, #edges]:

```python
class NeighborSampler:

    def __init__(self, adj_list: list, sample_neighbor_strategy: str = 'uniform', time_scaling_factor: float = 0.0, seed: int = None):
        """
        Neighbor sampler.
        :param adj_list: list, list of list, where each element is a list of triple tuple (node_id, edge_id, timestamp)
        :param sample_neighbor_strategy: str, how to sample historical neighbors, 'uniform', 'recent', or 'time_interval_aware'
        :param time_scaling_factor: float, a hyper-parameter that controls the sampling preference with time interval,
        a large time_scaling_factor tends to sample more on recent links, this parameter works when sample_neighbor_strategy == 'time_interval_aware'
        :param seed: int, random seed
        """
        self.sample_neighbor_strategy = sample_neighbor_strategy
        self.seed = seed

        # list of each node's neighbor ids, edge ids and interaction times, which are sorted by interaction times
        self.nodes_neighbor_ids = []
        self.nodes_edge_ids = []
        self.nodes_neighbor_times = []

        if self.sample_neighbor_strategy == 'time_interval_aware':
            self.nodes_neighbor_sampled_probabilities = []
            self.time_scaling_factor = time_scaling_factor

        # the list at the first position in adj_list is empty, hence, sorted() will return an empty list for the first position
        # its corresponding value in self.nodes_neighbor_ids, self.nodes_edge_ids, self.nodes_neighbor_times will also be empty with length 0
        for node_idx, per_node_neighbors in enumerate(adj_list):
            # per_node_neighbors is a list of tuples (neighbor_id, edge_id, timestamp)
            # sort the list based on timestamps, sorted() function is stable
            # Note that sort the list based on edge id is also correct, as the original data file ensures the interactions are chronological
            sorted_per_node_neighbors = sorted(per_node_neighbors, key=lambda x: x[2])
            self.nodes_neighbor_ids.append(np.array([x[0] for x in sorted_per_node_neighbors]))
            self.nodes_edge_ids.append(np.array([x[1] for x in sorted_per_node_neighbors]))  # range is [1, #edges]
            self.nodes_neighbor_times.append(np.array([x[2] for x in sorted_per_node_neighbors]))

            # additional for time interval aware sampling strategy (proposed in CAWN paper)
            if self.sample_neighbor_strategy == 'time_interval_aware':
                self.nodes_neighbor_sampled_probabilities.append(self.compute_sampled_probabilities(np.array([x[2] for x in sorted_per_node_neighbors])))

        if self.seed is not None:
            self.random_state = np.random.RandomState(self.seed)

        # ++++++++++++ add here +++++++++++
        print('max ', np.max(np.concatenate(self.nodes_edge_ids)))  # prints #edges
        print('min ', np.min(np.concatenate(self.nodes_edge_ids)))  # prints 1.0
```

Your program skips the table header when preprocessing the dataset; take tgbl-wiki as an example:

```python
    df = pd.read_csv(fname, skiprows=1, header=None)
    src = df.iloc[:, 0].values
    dst = df.iloc[:, 1].values
    dst += int(src.max()) + 1
    t = df.iloc[:, 2].values
    msg = df.iloc[:, 4:].values
    idx = np.arange(t.shape[0])
    w = np.ones(t.shape[0])

    return pd.DataFrame({"u": src, "i": dst, "ts": t, "idx": idx, "w": w}), msg, None
```

which means the edge feature matrix (msg for tgbl-wiki) has shape [#edges, dim], where #edges is the number of edges.
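The two fixes suggested at the top (indexing with id - 1, or padding a zero row) select exactly the same features; a toy comparison with made-up shapes:

```python
import numpy as np

num_edges, dim = 4, 3
msg = np.arange(num_edges * dim, dtype=float).reshape(num_edges, dim)  # toy feature matrix
ids = np.array([1, 2, num_edges])                                      # 1-based edge ids

fix_a = msg[ids - 1]                                  # option 1: shift the ids down by one
padded = np.vstack((np.zeros((1, dim)), msg))
fix_b = padded[ids]                                   # option 2: pad an all-zero row 0
assert np.array_equal(fix_a, fix_b)                   # both map id k to the k-th edge's row
```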

The problem lies in this function: the feature matrix is first initialized to a zero matrix.

The return value of the function above is then used to assign features according to the different strategies, but the ids it returns are in [1, #edges], so when id = #edges appears, indexing the [#edges, dim=172] matrix goes out of bounds.

inside TGB/process_data.py


In the original data, the edge ids are all 1 less than the number of rows of the feature matrix (so queries don't go out of bounds; the row at idx = 0 is effectively unused, and all of these matrices keep it as all zeros).

In the case of the original Wikipedia data, the feature matrix has 157,475 rows, but there are actually only 157,474 edges.

But for tgbl-wiki, the feature matrix has only 157,474 rows.

Apparently, TGB's preprocessing drops the leading all-zero row of the feature matrix.

Printing them out shows that the first row of tgbl-wiki exactly matches Wikipedia's 0th row.

The problem ultimately arises because TGB generates its own npy file from the original csv file and forgets to prepend a first row of all zeros.

In the following processing of the csv file, msg is the feature matrix; because of skiprows=1, msg is never padded with a zero row, so the out-of-bounds access occurs:

msg.shape (157474, 172)

```python
# in tgb/linkproppred/dataset.py
import numpy as np
import pandas as pd

def load_edgelist_wiki(fname: str) -> pd.DataFrame:
    """
    loading wikipedia dataset into pandas dataframe
    similar processing to
    https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/datasets/jodie.html

    Parameters:
        fname: str, name of the input file
    Returns:
        df: a pandas dataframe containing the edgelist data
    """
    df = pd.read_csv(fname, skiprows=1, header=None)
    src = df.iloc[:, 0].values
    dst = df.iloc[:, 1].values
    dst += int(src.max()) + 1
    t = df.iloc[:, 2].values
    msg = df.iloc[:, 4:].values
    print(msg.shape)
    idx = np.arange(t.shape[0])
    w = np.ones(t.shape[0])

    return pd.DataFrame({"u": src, "i": dst, "ts": t, "idx": idx, "w": w}), msg, None

df, edge_feat, node_ids = load_edgelist_wiki('data/tgb/tgbl-wiki_edgelist_v2.csv')
```
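An alternative to patching the training script would be to pad inside the loader itself. A hedged sketch of that idea on a tiny in-memory csv with the same column layout (the `np.vstack` line is my addition, not TGB code):

```python
import io
import numpy as np
import pandas as pd

# a tiny csv in the same layout: header row, then src,dst,ts,label,features...
csv = "user,item,ts,label,f0,f1\n0,0,1.0,0,0.1,0.2\n1,1,2.0,0,0.3,0.4\n"
df = pd.read_csv(io.StringIO(csv), skiprows=1, header=None)
msg = df.iloc[:, 4:].values                          # shape [#edges, dim] -- no zero row
msg = np.vstack((np.zeros((1, msg.shape[1])), msg))  # pad so that row k matches edge id k
assert msg.shape == (3, 2)                           # 2 edges + 1 padding row
assert np.allclose(msg[1], [0.1, 0.2])               # edge id 1 now maps to the first edge
```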

The NeighborSampler code in TGB/utils.py shown above, on the other hand, returns edge ids in the range [1, #edges] for querying the features, which ultimately causes an error when edge id = #edges.


So I think the easiest solution is:

```python
# in TGB/train_tgb_lpp.py

    node_raw_features, edge_raw_features, full_data, train_data, val_data, test_data, dataset = \
        get_link_pred_data_TRANS_TGB(dataset_name=args.dataset_name)

    # + ------------------------------------------------------
    # add this: prepend an all-zero row so that the 1-based ids map to the right rows
    node_raw_features = np.vstack((np.zeros((1, node_raw_features.shape[1])), node_raw_features))
    edge_raw_features = np.vstack((np.zeros((1, edge_raw_features.shape[1])), edge_raw_features))
    # + ------------------------------------------------------
```
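A quick sanity check of the proposed fix with toy shapes (172 is the wiki feature dim from above; the random features just stand in for what the loader returns):

```python
import numpy as np

num_edges = 3
edge_raw_features = np.random.rand(num_edges, 172)   # stand-in for the loader's output
edge_raw_features = np.vstack((np.zeros((1, 172)), edge_raw_features))

neighbor_edge_ids = np.array([0, 1, num_edges])      # 0 = padding, 1..#edges are real ids
feats = edge_raw_features[neighbor_edge_ids]         # no out-of-bounds any more
assert feats.shape == (3, 172)
assert np.allclose(feats[0], 0.0)                    # id 0 maps to the all-zero row
```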