Clarity on data description #6

Open
YiqiJ opened this issue Dec 19, 2024 · 4 comments

Comments

@YiqiJ

YiqiJ commented Dec 19, 2024

Hi,
Thank you for sharing the dataset. I am processing the data but encountered several issues:

  1. In Center Out Reaching 001057 (NWB), the file sub-Dataset-3-Animals-1-2-3-&-4 has no Target_ID field, only a target_dir field. There are 15 unique target_dir values. Converting with target_dir / (2*np.pi) * 360 gives the unique values [0, 45, 90, 135, 180, 225, 270, 325, 1, -1, 2, -2, 3, -3, nan]. Specifically, Animal 3 has the correct radian values, while Animals 1, 3, 4 all have non-radian values. I initialized a map from the index to radians. However, if you plot plt.plot(hand_vel_x, hand_vel_y), the index labels (-3 to 3) seem to be incorrect. The screenshot here shows trials with trial_dir equal to -2, which presumably corresponds to 225 degrees.
[Screenshot: Screen Shot 2024-12-18 at 10 44 46 PM]
  2. In Center Out Reaching 001057 (NWB), the file sub-Dataset-5-Animal-1 has a Target_ID field, but with values array([ 6., 3., 2., 7., 4., 5., 0., 1., 12., nan]), which looks confusing.

  3. Currently, I am using df_raw, bin = get_dataframe(data, filter_result=[b'R']). But after applying df = rebin(df_raw, prev_bin_size=bin, new_bin_size=30) and df = align_event(df, start_event='EventTarget_Onset', bin_size=30, offset_min=-50, offset_max=400), the trial length decreases from ~120 time bins to fewer than 10 time bins. Additionally, after calling these two functions, each trial appears to include several different target_dir values, which should not be the case. I thought an align function usually first finds the key point (e.g., target onset) and then includes 50 time bins before and 400 time bins after it.

  4. The Kaggle dataset ending in .parquet looks very different from the NWB files. If you plot the hand velocity
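
One way to sanity-check the target_dir values described in item 1 is to convert them to degrees and flag anything that does not fall on the expected 45-degree grid. This is only a minimal sketch, assuming target_dir is meant to be in radians; the helper name `classify_target_dir` and the sample array are hypothetical and not part of the NeuroTask code.

```python
import numpy as np

def classify_target_dir(target_dir, step_deg=45.0, tol=1e-3):
    """Convert radians to degrees in [0, 360) and flag values off the step_deg grid."""
    target_dir = np.asarray(target_dir, dtype=float)
    deg = np.degrees(target_dir) % 360.0  # NaN stays NaN
    # Distance from each value to the nearest multiple of step_deg
    remainder = np.abs(deg - step_deg * np.round(deg / step_deg))
    on_grid = remainder < tol  # comparisons with NaN evaluate to False
    return deg, on_grid

# Hypothetical sample: three valid radian angles, one index-like value
# (equal to -2 after the radians-to-degrees conversion), and one NaN.
sample = np.array([0.0, np.pi / 4, 3 * np.pi / 2, -2 * 2 * np.pi / 360, np.nan])
deg, on_grid = classify_target_dir(sample)
```

Grouping `on_grid` by animal and session would show which recordings carry index-like labels instead of angles.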

Can you please specify

  1. How should the trial type information be interpreted? This includes trial_dir = {-3, -2, -1, 1, 2, 3} and Target_ID = array([ 7., 2., 3., 1., 0., 6., 5., 4., nan]).

  2. How do I get 50 ms before the key point and 400 ms after it? This should result in 45 time bins if the bin width is 10 ms.

  3. Is the original data with timestamps and hand position available? (It looks like dataset 4 has cursor_pos information available, but datasets 3 and 5 do not.) I feel like this could be extremely useful.

Thank you!

@acarolinafilipe
Collaborator

Hi,

Thank you for your detailed feedback and observations. Let me address each point:

  1. Target_ID in Dataset 3
    You are correct that the original data for Dataset 3 does not include a Target_ID field but rather a target_dir field. I have double-checked the data, and the values in target_dir are the same as in the original dataset. According to their documentation, the target_dir values should represent angles in radians for all animals. I did not apply any further processing to this variable. However, I agree that the values you pointed out do not look consistent or interpretable. I will contact the original authors of this dataset to clarify and verify the correctness of these values.

  2. Target_ID in Dataset 5
    The Target_ID values in Dataset 5 indicate that the experiment involved 13 unique targets across all experiments. However, not all targets were used for every animal or task, as this dataset includes both a center-out task and a random-target task. The nan values represent incomplete or aborted trials. I will update the labeling in the data to ensure these cases are explicitly identified. Thank you for catching this issue. (note: cursor position is available in this dataset.)

  3. Alignment and Rebinning
    I re-tested the alignment function and confirmed that it works as expected. The offsets specified in the function are in milliseconds (ms), not time bins. Please check if you are filtering by dataset, animal, session, and trial when verifying single-trial lengths. Filtering inconsistently or across datasets can lead to discrepancies.
    Your intuition is correct; it should work as you described. If the trial length decreases significantly after alignment, this could be caused by misaligned event timestamps or inconsistencies in the way trials are filtered. Can you tell me in which dataset you observed this problem? That will help me replicate the problem and investigate it further.
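
To illustrate what offsets in milliseconds mean for the resulting trial length, here is a minimal sketch of event alignment on toy data. It assumes a hypothetical long-format table with columns `trial`, `time_ms` (bin start time), and a boolean event column; `align_to_event` is an illustrative helper, not the implementation of NeuroTask's `align_event`.

```python
import numpy as np
import pandas as pd

def align_to_event(df, event_col, offset_min_ms, offset_max_ms):
    """Keep, per trial, the bins whose time relative to the event lies in
    [offset_min_ms, offset_max_ms). Assumes at most one event per trial."""
    out = []
    for _, trial in df.groupby("trial"):
        event_rows = trial.loc[trial[event_col]]
        if event_rows.empty:
            continue  # skip trials that never reach the event
        t0 = event_rows["time_ms"].iloc[0]
        rel = trial["time_ms"] - t0
        window = trial.loc[(rel >= offset_min_ms) & (rel < offset_max_ms)].copy()
        window["rel_time_ms"] = rel[window.index]
        out.append(window)
    return pd.concat(out, ignore_index=True)

# Toy data: 2 trials of 1000 ms in 10 ms bins, event at 300 ms and 500 ms
bin_ms = 10
frames = []
for trial_id, onset in [(0, 300), (1, 500)]:
    t = np.arange(0, 1000, bin_ms)
    frames.append(pd.DataFrame({"trial": trial_id, "time_ms": t,
                                "target_onset": t == onset}))
toy = pd.concat(frames, ignore_index=True)

aligned = align_to_event(toy, "target_onset", offset_min_ms=-50, offset_max_ms=400)
bins_per_trial = aligned.groupby("trial").size()
```

With offsets of -50 ms and 400 ms and 10 ms bins, each trial keeps (400 - (-50)) / 10 = 45 bins; with 30 ms bins the same window would hold 15 bins.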

Thank you for using our dataset and for your detailed observations. We're excited to see people working with the data, and your feedback is invaluable for improving it. Please don't hesitate to contact us if you have any other problems and share details about the datasets or specific problems so that we can provide further assistance.

@YiqiJ
Author

YiqiJ commented Dec 19, 2024

Hi Carolina,

Thank you for the quick response! These are very helpful. A few more questions:

  1. These are some sample behavioral trajectories from Dataset-3-5. Note that these are plots of [cursor_vel_x, cursor_vel_y]. The colors are assigned based on Target_ID. I ignored trials whose Target_ID is either 12 or nan.

Q1 (Dataset 5): It looks like a fraction of trials are labeled correctly, while a fraction of trials have wrong Target_ID labels (some random colors in each target direction). I wonder if this is because Dataset-5 includes two tasks? Two follow-up questions: 1) Do the IDs in the table in Q4 of NeuroTask_datasheet.pdf refer to the dataset ID or to some other ID? 2) If Dataset-5 includes both the CenterOut and RandomTarget tasks, how do we differentiate between the two?

[Screenshot: Screen Shot 2024-12-19 at 1 10 29 PM]

Q2 (Dataset 4): Here are the results I got when comparing with and without rebinning and alignment. It looks like the rebinning and alignment are not giving me the correct results. I'm running:

fpath = "data/NeuroTask/CenterOutReaching/001057/sub-Dataset-3-Animals-1-2-3-&-4/sub-Dataset-3-Animals-1-2-3-&-4.nwb"
data = nap.load_file(fpath)
df_raw , bin = get_dataframe(data,filter_result=[b'R'])
print(bin)
df = rebin(df_raw,prev_bin_size = bin ,new_bin_size = 10)
df = align_event(df_raw, start_event='EventTarget_Onset', bin_size=10,offset_min=-50,offset_max=400)
[Screenshot: Screen Shot 2024-12-19 at 1 49 26 PM]

Q3 (Dataset 3): For Animal-1, it seems like the target_dir might be incorrect. The colors are consistent across sessions, which might indicate that some directions are consistently mislabeled. (It is hard to believe that the monkey is performing incorrectly, because we are keeping only the rewarded trials by calling df_raw, bin = get_dataframe(data, filter_result=[b'R']).) Animal-2 looks correct. For Animal-3, all sessions appear to have only 4 directions, but the direction labels are correct. For Animal-4, even though there are 12 sessions, it seems like sessions 1 and 2 are identical, sessions 3 and 4 are identical, etc., so there are only 6 distinct sessions in total. Furthermore, the target_dir labels look incorrect for this animal.

Note that the animals with false target_dir labels (Animal-1 and Animal-4) are those whose target_dir has values [-3, -2, -1, 0, 1, 2, 3] after division by 2 pi and multiplication by 360, while Animal-2 and Animal-3, which have correct target_dir values, are those whose target_dir has values [0, 45, 90, 135, 180, 225, 270, 325] after the same conversion.

[Screenshots: Screen Shot 2024-12-19 at 1 55 09 PM, Screen Shot 2024-12-19 at 1 55 15 PM]
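
A label check like the one described in Q3 can also be done numerically rather than by eye, by comparing each trial's labeled direction with the angle of its mean velocity vector. The following is a rough sketch on synthetic data, assuming labels in radians and per-trial velocity arrays; the function names are hypothetical.

```python
import numpy as np

def movement_angle(vel_x, vel_y):
    """Angle (radians, in [0, 2*pi)) of the mean velocity vector of one trial."""
    return np.arctan2(np.mean(vel_y), np.mean(vel_x)) % (2 * np.pi)

def angular_error_deg(label_rad, vel_x, vel_y):
    """Smallest absolute difference (degrees) between label and movement angle."""
    diff = movement_angle(vel_x, vel_y) - label_rad
    wrapped = (diff + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    return abs(np.degrees(wrapped))

# Synthetic trial: a reach toward 225 degrees with a bell-shaped speed profile
t = np.linspace(0, 1, 50)
speed = np.sin(np.pi * t)
vx = speed * np.cos(np.radians(225))
vy = speed * np.sin(np.radians(225))

err_ok = angular_error_deg(np.radians(225), vx, vy)   # label matches movement
err_bad = angular_error_deg(np.radians(90), vx, vy)   # label disagrees
```

Trials whose error is well above the spacing between targets (45 degrees here) are candidates for mislabeled directions.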
  2. For Dataset-3, I obtain an original bin size of 30 ms. I'm confused about this BIN value from df, BIN = get_dataframe(data, filter_result=[b'R']). Is that the sampling rate? If so, 30 Hz sounds too slow for electrophysiology data. From the original paper, it seems like the sampling rate is 30 kHz. If the original data is already binned, shouldn't the value in each bin represent a spike rate, and thus be real-valued rather than binary?

  3. For datasets that have multiple tasks, how can we differentiate between them?
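
For reference, the difference between an acquisition rate and a bin width can be sketched as follows. This is an illustration on synthetic spike times only, assuming that BIN is a bin width in milliseconds (as the prev_bin_size argument suggests); `bin_spikes` is a hypothetical helper, and how NeuroTask stores the binned values is not settled in this thread.

```python
import numpy as np

def bin_spikes(spike_times_s, duration_s, bin_ms):
    """Count spikes in consecutive bins of width bin_ms."""
    n_bins = int(round(duration_s * 1000 / bin_ms))
    edges = np.arange(n_bins + 1) * (bin_ms / 1000.0)
    counts, _ = np.histogram(spike_times_s, bins=edges)
    return counts

# Synthetic unit: 150 spikes time-stamped on a 30 kHz clock over 3 s
rng = np.random.default_rng(0)
fs = 30_000
duration_s = 3.0
samples = np.sort(rng.choice(int(fs * duration_s), size=150, replace=False))
spike_times = samples / fs

counts_1ms = bin_spikes(spike_times, duration_s, bin_ms=1)
counts_30ms = bin_spikes(spike_times, duration_s, bin_ms=30)
rate_hz_30ms = counts_30ms / 0.030  # spike counts converted to firing rate in Hz
```

In this toy case the 30 ms bins hold integer counts that can exceed 1, and dividing by the bin width gives a rate; 30 kHz only describes how finely the spike times were stamped.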

Thanks!

@acarolinafilipe
Collaborator

I wanted to let you know that I’ve already contacted the authors of the original datasets to investigate the issue with the target ID and dir. I plan to upload the revised dataset after the Christmas break, in early January.

Regarding the align_event function, please ensure the dataframe used is consistent with the re-binning process. In your code, when calling this function, use the dataframe obtained after re-binning rather than the original one to maintain consistency.
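
As a rough illustration of why the order matters, re-binning changes both the number of rows and the time base, so any later step has to operate on the re-binned table. The sketch below sums counts over consecutive bins on toy data, assuming a hypothetical layout with `trial` and `time_ms` columns; `rebin_counts` is illustrative and is not the NeuroTask `rebin` function.

```python
import numpy as np
import pandas as pd

def rebin_counts(df, prev_bin_ms, new_bin_ms, value_cols):
    """Sum counts over groups of consecutive bins within each trial.
    Only coarsening by an integer factor is handled in this sketch."""
    if new_bin_ms % prev_bin_ms != 0:
        raise ValueError("new_bin_ms must be an integer multiple of prev_bin_ms")
    factor = new_bin_ms // prev_bin_ms
    df = df.copy()
    # Index of the coarser bin that each row falls into, counted per trial
    df["new_bin"] = df.groupby("trial").cumcount() // factor
    out = df.groupby(["trial", "new_bin"], as_index=False)[value_cols].sum()
    out["time_ms"] = out["new_bin"] * new_bin_ms
    return out.drop(columns="new_bin")

# Toy data: 2 trials, 12 bins of 10 ms each, one neuron with 1 spike per bin
toy = pd.DataFrame({
    "trial": np.repeat([0, 1], 12),
    "time_ms": np.tile(np.arange(12) * 10, 2),
    "neuron_0": 1,
})
coarse = rebin_counts(toy, prev_bin_ms=10, new_bin_ms=30, value_cols=["neuron_0"])
```

Any alignment step would then take `coarse` as input. In the snippet from this thread, that corresponds to passing df (the output of rebin) rather than df_raw to align_event.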

On DANDI, each dandiset corresponds to a task, while on Kaggle, this information is indicated in the file name. Currently, RTT (for dataset 5) is only available in the Parquet format and not in NWB, but I will include it in the upcoming version.

Yes, the ID corresponds to datasetID.
If there’s any other field or feature you’d like to see in the dataset, let me know, and I’ll include it in the new version.

Thank you again for your feedback, and happy holidays!

@acarolinafilipe
Collaborator

I've uploaded dataset 3 with the corrected targets, both on Kaggle and on DANDI. The numbering of animal 4's sessions has also been changed: this animal has 2 brain areas recorded at the same time, so the previous numbering could have been a little misleading.
