fix: Correctly map deduplicated chunk indices when setting content #54

Status: Open — wants to merge 5 commits into base: master
Conversation


@cooper667 cooper667 commented Jun 2, 2022

Previously, if a chunk was removed as a duplicate, the stored indices would still point at positions in the original chunks array, and getContent wouldn't be able to piece the content back together.

This change keeps the same logic, but ensures that the indices correctly point at the elements in the new, deduplicated chunks array. It also adds tests to illustrate the issue.

#51
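The mismatch can be sketched in a few lines (hypothetical uids and chunk shapes for illustration, not the library's actual code):

```typescript
// Sketch of the bug: dedup drops a chunk, but the old indices still
// point at positions in the pre-dedup array.
type Chunk = [string, Uint8Array];

const chunks: Chunk[] = [
  ["uidA", new Uint8Array([1])],
  ["uidB", new Uint8Array([2])],
  ["uidA", new Uint8Array([1])], // duplicate of chunk 0
];
const indices = [0, 1, 2]; // content order -> chunk position

// Naive dedup: keep only the first occurrence of each uid
const deduped = chunks.filter(
  (chunk, i) => chunks.findIndex(([uid]) => uid === chunk[0]) === i
);

// indices[2] is still 2, but deduped only has positions 0 and 1,
// so reassembling content from deduped[indices[2]] fails.
console.log(deduped.length, indices[2]); // 2 2
```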

@tasn (Member) commented Jun 4, 2022

I'm a bit worried that the implementations now deviate between Rust and JS, and I was trying to understand whether this was happening in Rust too. As in: is this a small difference between the two that should be fixed, or did the algorithm have a flaw that you've uncovered?

I haven't managed to replicate it in Rust by doing the following (which is slightly different from what you did):

    let content = randombytes_deterministic(10 * 1024, &[0; 32]); // 10kb of pseudorandom data
    // Concatenate four copies into 40kb of repeating 10kb chunks
    let content = vec![content.clone(), content.clone(), content.clone(), content].concat();

And setting and getting this content works in a test that I thought was similar to yours.

I also adjusted your tests to not need the stringifying, just to keep them simpler (will make an inline comment in a sec), and your test still fails. So I suspect the bug is in JS only.

Now I wonder: what's the difference between the Rust impl and the JS one, and can we keep them mostly identical for ease of maintenance?

Review comment on src/Etebase.test.ts (outdated, resolved)
@cooper667 (Author) commented
I added a test in Rust similar to yours and it seems like the chunking that happens above 'never' results in duplicate chunks, so there isn't anything to deduplicate?

@tasn (Member) left a comment:

> I added a test in Rust similar to yours and it seems like the chunking that happens above 'never' results in duplicate chunks, so there isn't anything to deduplicate?

Oh, interesting. I wonder why it happens here but not there. I also tried your exact test in Rust, but I neglected to verify that things actually get deduplicated.

Anyhow, I'll take a better look here. Thanks!

Review comment on package.json (outdated, resolved)
@tasn (Member) left a comment:

Apologies again for the delay; it's just a very dangerous piece of code to get wrong, and I wanted to take multiple passes at it.

Comment on lines +324 to +332
const chunkKeys = [...uniqueChunksMap.keys()];

// Change the original (shuffled) indices to point at the deduplicated chunks
const newIndices = indices.map((i) => {
const [id] = chunks[i];
return chunkKeys.indexOf(id);
});

chunks = [...uniqueChunksMap.values()];

Sorry it took me so long to review this. It's a very intrusive change in a very sensitive place.

I think the change is overall correct, though the one thing I'm very concerned about is the highlighted code I'm responding to. I'm not sure that, per spec, keys() and values() are necessarily iterated in the same order, or that the order is deterministic. It could very well be, though I'm unsure. I'd much rather we created both in one go.

I'm also not sure what uniqueChunksMap actually does to the ordering.
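(Editor's note, for reference: the ECMAScript spec defines Map iteration as insertion-ordered, and keys(), values(), and entries() all traverse entries in that same order, so the pairing does hold. A quick illustration:)

```typescript
// Per the ECMAScript spec, Map iteration is insertion-ordered,
// and keys()/values() traverse the entries in the same order.
const m = new Map<string, number>([["b", 1], ["a", 2]]);
console.log([...m.keys()]);   // ["b", "a"]
console.log([...m.values()]); // [1, 2]
```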

Going back to the Rust implementation, this is how it's done there; do you think that makes sense?

            // Filter duplicates and construct the index list.
            let mut uid_indices: HashMap<String, usize> = HashMap::new();
            chunks = chunks
                .into_iter()
                .enumerate()
                .filter_map(|(i, chunk)| {
                    let uid = &chunk.0;
                    match uid_indices.get(uid) {
                        Some(previous_index) => {
                            indices[i] = *previous_index;
                            None
                        }
                        None => {
                            uid_indices.insert(uid.to_string(), i);
                            Some(chunk)
                        }
                    }
                })
                .collect();

            // If we have more than one chunk we need to encode the mapping header in the last chunk
            if indices.len() > 1 {
                // We encode it in an array so we can extend it later on if needed
                let buf = rmp_serde::to_vec_named(&(indices,))?;
                let hash = to_base64(&crypto_manager.0.calculate_mac(&buf)?)?;
                chunks.push(ChunkArrayItem(hash, Some(buf)));
            }
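For comparison, a near line-for-line TypeScript port of the single-pass Rust dedup above might look like this (a sketch: the chunk shape and the indices array are assumed from the highlighted JS code, and the names are hypothetical):

```typescript
type Chunk = [string, Uint8Array?];

// Sketch: single-pass dedup mirroring the Rust filter_map above.
// Duplicates are dropped, and indices[i] is remapped to the index
// of the first chunk that had the same uid.
function dedupChunks(chunks: Chunk[], indices: number[]): Chunk[] {
  const uidIndices = new Map<string, number>();
  return chunks.filter((chunk, i) => {
    const previousIndex = uidIndices.get(chunk[0]);
    if (previousIndex !== undefined) {
      indices[i] = previousIndex; // point at the kept chunk
      return false;               // drop the duplicate
    }
    uidIndices.set(chunk[0], i);
    return true;
  });
}
```

Because the map and the filtered array are built in the same pass, there is no separate keys()/values() pairing to worry about.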
