Merging dataframes for GloVe embeddings #160

Open
baubrey opened this issue Apr 20, 2023 · 2 comments

baubrey commented Apr 20, 2023

df = pd.merge(
    base_df, emb_df, left_index=True, right_index=True
) 

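This merge doesn't seem to work correctly for GloVe embeddings, because base_df and emb_df do not share the same index. Both have 69152 rows, but base_df keeps its original, non-contiguous labels (1, 4, 5, ..., 79651) while emb_df is indexed 0 to 69151: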

(Pdb) base_df
           word       onset      offset  accuracy  ... in_bert-base-cased  in_bert-large-cased  in_roberta-base in_roberta-large
1          okay  13132.0000  13388.0000         1  ...               True                 True             True             True
4         feels  13862.0000  14076.0000         1  ...               True                 True             True             True
5         great  14033.0000  14217.0000         1  ...               True                 True             True             True
6          yeah  14217.0000  14345.0000         1  ...               True                 True             True             True
7          Good  13877.0000  14066.0000         1  ...               True                 True             True             True
...         ...         ...         ...       ...  ...                ...                  ...              ...              ...
79647      that  73547.6624  73609.1024         1  ...               True                 True             True             True
79648        do  73609.1024  73660.3024         1  ...               True                 True             True             True
79649  anything  73660.3024  73798.5424         1  ...               True                 True             True             True
79650       Not  73967.6048  74033.4798         1  ...               True                 True             True             True
79651    really  74084.6798  74248.4385         1  ...               True                 True             True             True

[69152 rows x 44 columns]
(Pdb) emb_df
                                              embeddings
0      [0.19901, -0.77517, -0.11574, -0.35179, 0.4122...
1      [-0.086751, -0.10439, -0.48462, -0.27358, 1.01...
2      [-0.026567, 1.3357, -1.028, -0.3729, 0.52012, ...
3      [-0.80924, -0.030977, 0.5102, -0.75298, 0.4904...
4      [-0.35586, 0.5213, -0.6107, -0.30131, 0.94862,...
...                                                  ...
69147  [0.88387, -0.14199, 0.13566, 0.098682, 0.51218...
69148  [0.29605, -0.13841, 0.043774, -0.38744, 0.1226...
69149  [0.12032, -0.14806, 0.0059001, -0.1513, 0.7347...
69150  [0.55025, -0.24942, -0.0009386, -0.264, 0.5932...
69151  [0.0016675, -0.16376, -0.092648, -0.33466, 0.7...

[69152 rows x 1 columns]
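
A minimal sketch of the symptom on toy data, and one possible workaround. It assumes the rows of emb_df are in the same order as the rows of base_df, which the equal row counts suggest but the output above does not prove; the toy values and the `emb_fixed` name are illustrative only.

```python
import pandas as pd

# Toy stand-ins: base_df keeps sparse labels, emb_df has a fresh 0..n-1 index.
base_df = pd.DataFrame({"word": ["okay", "feels", "great"]}, index=[1, 4, 5])
emb_df = pd.DataFrame({"embeddings": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]})

# Joining on index labels keeps only labels present in both frames (here: 1)
# and pairs "okay" with the embedding stored at label 1, i.e. the second row.
naive = pd.merge(base_df, emb_df, left_index=True, right_index=True)

# Workaround (assumes identical row order): align by position, not by label.
assert len(base_df) == len(emb_df), "row counts differ; positional join is unsafe"
emb_fixed = emb_df.copy()
emb_fixed.index = base_df.index  # reuse base_df's labels
aligned = pd.merge(base_df, emb_fixed, left_index=True, right_index=True)
```

If the row order is not guaranteed, a positional join like this would silently mispair words and embeddings, so the index would need to be preserved when emb_df is created instead.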

zkokaja commented May 4, 2023

to_dict() might be dropping the index. There are multiple save_pkl functions. This may also be an issue for whisper and base, where we remove rows from base_df before generating emb_df, which causes a mismatch between the two indexes.
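
A small sketch of how a to_dict() round trip can lose the index. It assumes the call uses `orient="records"`, which this thread does not confirm; how the save_pkl functions actually serialize the frame is not shown here.

```python
import pandas as pd

df = pd.DataFrame({"word": ["okay", "feels"]}, index=[1, 4])

# orient="records" yields a list of row dicts with no index information,
# so rebuilding the frame produces a fresh 0..n-1 index.
restored = pd.DataFrame(df.to_dict("records"))

# orient="index" keys each row by its label, so the labels survive.
kept = pd.DataFrame.from_dict(df.to_dict("index"), orient="index")
```

If the pickled data must stay in records form, calling `reset_index()` before `to_dict("records")` would carry the labels along as an ordinary column.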

@zkokaja zkokaja transferred this issue from hassonlab/247-encoding May 4, 2023

zkokaja commented May 4, 2023

See #153
