HDDS-11650. ContainerId list to track all containers created in a datanode #7402

Open

wants to merge 35 commits into master

Conversation

swamirishi
Contributor

@swamirishi swamirishi commented Nov 7, 2024

What changes were proposed in this pull request?

We need to persist the list of containers created on a datanode to ensure that volume failures don't lead to recreation of containers on write chunk, which would leave partial, inconsistent data on datanodes.

For Ratis containers this is taken care of by persisting the container-to-BCSID map in the Ratis snapshot file, but such a list won't be present for EC containers.
Details are in the design doc below.

https://docs.google.com/document/d/1k4Tc7P-Jszdwn4dooRQZiKctlcGWNlnjZO1fh_TBneM/edit?usp=sharing

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11650

How was this patch tested?

Adding more unit tests & integration tests

@swamirishi swamirishi marked this pull request as ready for review November 9, 2024 15:06
StorageContainerException {
Preconditions.checkNotNull(container, "container cannot be null");

long containerId = container.getContainerData().getContainerID();
State containerState = container.getContainerData().getState();
if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
Contributor

I don't understand why we cannot just check if the container is in the ContainerIDTable? Why do we need to use this missingContainerSet instead?

I traced the creation of the missing set to a ratis snapshot (I know nothing about it) and a diff against the containers found in memory. What about EC containers? Do they all become part of this missing set?

Prior to this change, the missingSet doesn't seem to be used for anything inside this class, but now it is used for these checks. What was the missingSet used for before?

I would have thought the ContainerID table starts initially empty after this change is committed. On first start, it gets populated with all containers in the ContainerSet (scanned from disk). On later startups, it can also have all containers added, but they should already all be there. Then we can just track previously created containers in the containerID table?

Contributor Author

@swamirishi swamirishi Nov 12, 2024

The container set is first initialised with all the containers present on disk by performing an ls on the data volumes. Then we iterate through the containerIDs table in RocksDB, and any container present in this table but not present on disk is added to the missing container set. This includes both Ratis and EC containers.
Now on container creation we just check the in-memory set rather than going to RocksDB every time, which would incur an IO cost; however small that cost is, it isn't needed in my opinion when we have everything in memory.
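
For illustration, a minimal sketch of this startup reconciliation; the names (idsFromContainerIdsTable, containersScannedFromDisk) are illustrative, not the PR's actual fields:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class MissingContainerReconciler {

  /**
   * Containers recorded in the containerIds table but not found on disk are
   * considered missing; this covers both Ratis and EC containers.
   */
  static Set<Long> computeMissingContainers(Set<Long> idsFromContainerIdsTable,
      Map<Long, ?> containersScannedFromDisk) {
    Set<Long> missing = new HashSet<>(idsFromContainerIdsTable);
    missing.removeAll(containersScannedFromDisk.keySet());
    return missing;
  }

  public static void main(String[] args) {
    Set<Long> persistedIds = Set.of(1L, 2L, 3L);
    Map<Long, Object> onDisk = Map.of(1L, new Object(), 3L, new Object());
    // Container 2 is in the table but has no data on disk, so it is missing.
    System.out.println(computeMissingContainers(persistedIds, onDisk)); // [2]
  }
}
```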

Contributor

Ah ok, I misread the Javadoc and thought it was a Ratis snapshot. However my question still stands - why not simply use the ContainerIDTable to check for previous container creation, rather than augmenting this missing list?

Contributor Author

@swamirishi swamirishi Nov 12, 2024

I am just reusing the data structure that was already there. This missing container set was initially used only for Ratis containers; I am now making it track EC containers as well. I didn't want to change a lot of code, so I reused most of what is already there.
Moreover, an in-memory get is faster than fetching from RocksDB.

Contributor

Prior to this change, the missingContainerSet variable in the ContainerSet was not used for anything within that class. Now, we are potentially adding to it from the containerList - what impact does that have more widely? What was the missingContainerSet used for before?

We also have to open the new containerList table - it would be clearer if we just loaded that into memory and used the new "seenContainerList" for previously-existing checks.

However backing up a bit - how can containers get created?

One way, is via the write path - that is the problem we are trying to solve.

Another is via EC reconstruction - there is already logic there to deal with a duplicate container, I think, though it may not cover volume failure. That said, EC reconstruction effectively uses the same write path as normal writes, so it is basically the same as the write path.

Then there is replication, which is a very different path into the system.

Maybe this problem is better not solved in ContainerSet, but at the higher level in the dispatcher where the writes are being handled?

Contributor Author

I feel we should track all the containers getting added to the system. This flow also accommodates replication for both Ratis and EC. Barring the exception where we allow the replication job to overwrite missing containers, we should prevent overwriting any container coming through any flow. If we have this check in a central place, any future code change will be foolproof; at the very least, things would break loudly instead of silently writing something wrong.

@sodonnel
Contributor

For me, this change is too big, and it's going to cause conflicts and be difficult to backport. It's also hard to review, as the number of changed files is massive, although many are test files and simple one-line changes.

It feels like this change could be largely contained within the ContainerSet class, without changing a lot of generic classes and causing a ripple effect into the wider system. All it has to do is create a new overloaded constructor and not change the existing one. Make the existing one call the new one with the correct runtime requirements. Then we just need to check whether a container has previously existed or not, and add new containers to the "previously exists" list. This shouldn't require changing 90-something files in a single PR.

@swamirishi
Contributor Author

swamirishi commented Nov 15, 2024

> For me, this change is too big, and it's going to cause conflicts and be difficult to backport. It's also hard to review, as the number of changed files is massive, although many are test files and simple one-line changes.
>
> It feels like this change could be largely contained within the ContainerSet class, without changing a lot of generic classes and causing a ripple effect into the wider system. All it has to do is create a new overloaded constructor and not change the existing one. Make the existing one call the new one with the correct runtime requirements. Then we just need to check whether a container has previously existed or not, and add new containers to the "previously exists" list. This shouldn't require changing 90-something files in a single PR.

All of the tests use the containerSet. If we pass a null table, the container set becomes a read-only data structure (I have put guardrails in place so that we don't end up inadvertently writing containers without persisting the data). Since we have made the table a must-have argument, we have to pass a mocked table in all the existing tests.

Another big change: we didn't have a pre-existing RocksDB on the master volume. DatanodeStore was only defined for data volumes, and reusing that code to create the master volume store ended up introducing generics and a base interface in several places.

StorageContainerException {
Preconditions.checkNotNull(container, "container cannot be null");

long containerId = container.getContainerData().getContainerID();
State containerState = container.getContainerData().getState();
if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
throw new StorageContainerException(String.format("Container with container Id %d is missing in the DN " +
Contributor

KeyValueHandler has similar logic - why is it needed in both places? It feels like the containerSet (i.e. here) may not need to check this, if it's checked in the handler that creates the containers on the write path?

Contributor Author

@swamirishi swamirishi Nov 15, 2024

KeyValueHandler catches the exception and wraps it in an error response.

Contributor Author

Yeah, you are right, this is redundant. Let me shift this logic into a single function so that we don't have redundant code. But we should still perform the missing-container check on every container add.

Contributor Author

There is a function that doesn't overwrite containers. We should perform this check on every container add. In KeyValueHandler we have to do it there as well, because creating the KeyValueContainerData also creates the required directory structure on the data volume. But whenever we add to the data structure from any other path, we have to ensure the container is not missing by default, unless we explicitly intend to overwrite the missing container.
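
For illustration, a sketch of this centralized guard; the field names mirror the discussion, but this is not the PR's exact ContainerSet code, and CONTAINER_MISSING is the result code introduced by this PR:

```java
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos;
import org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException;
import org.apache.hadoop.hdds.utils.db.Table;
import org.apache.hadoop.ozone.container.common.interfaces.Container;

class ContainerSetSketch {
  private final ConcurrentSkipListMap<Long, Container<?>> containerMap = new ConcurrentSkipListMap<>();
  private final Set<Long> missingContainerSet = ConcurrentHashMap.newKeySet();
  private final Table<Long, Long> containerIdsTable;   // witnessed-container ledger; value type is illustrative

  ContainerSetSketch(Table<Long, Long> containerIdsTable) {
    this.containerIdsTable = containerIdsTable;
  }

  boolean addContainer(Container<?> container, boolean overwriteMissingContainers)
      throws StorageContainerException {
    long containerId = container.getContainerData().getContainerID();
    // Every add path (write chunk, EC reconstruction, replication) funnels through
    // here, so a container known to be missing cannot be silently recreated.
    if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
      throw new StorageContainerException("Container " + containerId
          + " is missing and must not be recreated",
          ContainerProtos.Result.CONTAINER_MISSING);
    }
    if (containerMap.putIfAbsent(containerId, container) != null) {
      return false;                                       // already tracked in memory
    }
    try {
      containerIdsTable.put(containerId, containerId);    // persist the witnessed id
    } catch (IOException e) {
      containerMap.remove(containerId);
      throw new StorageContainerException("Failed to persist container id "
          + containerId + ": " + e.getMessage(), ContainerProtos.Result.IO_EXCEPTION);
    }
    missingContainerSet.remove(containerId);
    return true;
  }
}
```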

/**
* Add Container to container map.
* @param container container to be added
* @return If container is added to containerMap returns true, otherwise
* false
*/
-  public boolean addContainer(Container<?> container) throws
+  public boolean addContainer(Container<?> container, boolean overwriteMissingContainers) throws
Contributor

I would suggest adding a new method here - addContainerIgnoringMissing (or something similar), rather than changing the existing interface. That way, it avoids changes in existing calling code, and only code that explicitly needs to pass the boolean can be changed to call the new method. I suspect this change would reduce the number of files impacted by this PR.

Contributor Author

There is already a method for that.

public boolean addContainer(Container<?> container) throws StorageContainerException {
return addContainer(container, false);
}


import java.io.Closeable;

/**
* DB handle abstract class.
*/
-public abstract class DBHandle implements Closeable {
+public abstract class DBHandle<STORE extends AbstractStore> implements Closeable {
Contributor

Can we avoid making changes to this class and get the desired functionality another way (extending another base class, something else)? Changing this has a knock-on effect on many other classes, and it feels like refactoring that perhaps isn't needed.

Contributor Author

Yeah, I have limited the changes to only the AbstractStore changes. I don't see a point in reducing the changes further, and I don't think there is going to be a lot of conflict when backporting the AbstractStore changes.

@sodonnel
Contributor

Thanks for reducing the changed files. It's down to 47 now from 97, which is a big improvement.

I still think its confusing that we have two structures (missingContainers map and ContainerIDsTable) to track what is going on.

If we want to block creating a container that has been created before, can we just check for existence in the ContainerIDsTable? Right now we check the missingTable and then add an entry to the containerIDTable.

If we remove a container, there are two boolean parameters - markMissing and removeFromDB - is it ever valid to pass "true, true" so it marks it missing and removes it from the DB?

Provided we always add to the containerIDsTable on new containerCreation, then we should be able to check for previous creation from it too.

It feels like when we remove the container for a failed volume, we don't touch the containerIDs table. Something that tries to recreate it will get blocked by the existence check against the table.

Or we remove it "safely" via replication or the balancer, and then we also remove it from the DB.

If we just use the containerID table, then you don't need to iterate it at startup and check off all the containers. You also don't need to hold anything in memory, as RocksDB will cache things itself. However, we may need to populate the table from all existing containers on disk.

I also suspect the code would be more understandable if we didn't overload add/removeContainer in ContainerSet with the booleans, and instead created new add???/remove??? methods named more in line with what they are doing, e.g. remove, removeSafely, removeUnexpectedly, or something along those lines.
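
For illustration, a sketch of such intent-revealing names, written as thin wrappers over the PR's private three-argument removeContainer; the wrapper names themselves are illustrative:

```java
// Assumes the enclosing ContainerSet already has:
//   private boolean removeContainer(long id, boolean markMissing, boolean removeFromDB)
public boolean removeContainer(long containerId) throws StorageContainerException {
  // Regular delete flow: container data already removed from disk, drop it everywhere.
  return removeContainer(containerId, false, true);
}

public boolean removeMissingContainer(long containerId) throws StorageContainerException {
  // Volume failure: keep the id in the containerIds table so the container cannot
  // be recreated, and only mark it missing in memory.
  return removeContainer(containerId, true, false);
}

public boolean removeContainerOnlyFromMemory(long containerId) throws StorageContainerException {
  // Startup duplicate handling: swap the in-memory entry, touch neither the
  // missing set nor the DB.
  return removeContainer(containerId, false, false);
}
```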

@swamirishi
Contributor Author

swamirishi commented Nov 18, 2024

> Thanks for reducing the changed files. It's down to 47 now from 97, which is a big improvement.
>
> I still think it's confusing that we have two structures (missingContainers map and ContainerIDsTable) to track what is going on.
>
> If we want to block creating a container that has been created before, can we just check for existence in the ContainerIDsTable? Right now we check the missingTable and then add an entry to the containerIDTable.

The missingContainerSet is always in memory. We add the entry to the containerIdTable and remove it from the missingContainerSet.

> If we remove a container, there are two boolean parameters - markMissing and removeFromDB - is it ever valid to pass "true, true" so it marks it missing and removes it from the DB?

removeFromDB is only false when the physical container directory has not been deleted from the node, so "true, true" should not be possible.
false, false: This happens on finding a duplicate container on startup, where the container info in memory gets swapped by removing and re-adding the updated container info.
true, false: This happens on volume failure, when we have to mark the container as missing. It is a plain in-memory operation in which we remove the container from the containerMap and add the containerId to the missingContainerSet.
false, true: This is the regular delete flow, which happens when SCM sends a delete container command. It only happens after the entire container data has been deleted from disk.
true, true: This is an invalid case; it doesn't happen anywhere.

> Provided we always add to the containerIDsTable on new container creation, then we should be able to check for previous creation from it too.

Since we already have the containerInfo in the containerMap in memory, there is no point in going to RocksDB.

> It feels like when we remove the container for a failed volume, we don't touch the containerIDs table. Something that tries to recreate it will get blocked by the existence check against the table.
>
> Or we remove it "safely" via replication or the balancer, and then we also remove it from the DB.

On replication we always have the overwrite flag overwriteMissingContainers set to true.

> If we just use the containerID table, then you don't need to iterate it at startup and check off all the containers. You also don't need to hold anything in memory, as RocksDB will cache things itself. However, we may need to populate the table from all existing containers on disk.

This is not an expensive operation. We keep the entire containerMap in memory anyway; there is no point in adding the overhead of a RocksDB get when we have all the data in memory.

> I also suspect the code would be more understandable if we didn't overload add/removeContainer in ContainerSet with the booleans, and instead created new add???/remove??? methods named more in line with what they are doing, e.g. remove, removeSafely, removeUnexpectedly, or something along those lines.

This can be done, but we would go back to the same problem of changing all the test files in that case. We can take this up as a follow-up refactoring JIRA.

@sodonnel
Contributor

> This is not an expensive operation. We keep the entire containerMap in memory anyway; there is no point in adding the overhead of a RocksDB get when we have all the data in memory.

The point is - either use RocksDB, or don't use it - don't keep half in memory and half in RocksDB. We are managing two structures (a map and a table) when one can do it.

You must read the entire small table from RocksDB on startup anyway - so it will be pulled into RocksDB memory naturally. After that, writes get buffered in RocksDB memory, so an existence check should be very fast. Removing the reliance on the missing containers map simplifies the code, making it cleaner overall.
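
For illustration, the table-only check being argued for might look like this; a sketch that assumes the containerIds table exposes the Table#isExist lookup and reuses the PR's CONTAINER_MISSING result code:

```java
// Sketch: containerIdsTable and containerMap are the ContainerSet fields
// discussed above; no separate missingContainerSet is consulted here.
boolean witnessedBefore = containerIdsTable.isExist(containerId);
if (witnessedBefore && !containerMap.containsKey(containerId)) {
  // Witnessed earlier but no replica on disk (e.g. failed volume):
  // refuse to recreate it through the write path.
  throw new StorageContainerException("Container " + containerId
      + " was created before and must not be recreated",
      ContainerProtos.Result.CONTAINER_MISSING);
}
```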

> This can be done, but we would go back to the same problem of changing all the test files in that case.

Not if we keep the original add/remove for the common case and then rename the overloaded versions to better specify their intent.

@swamirishi
Contributor Author

> This is not an expensive operation. We keep the entire containerMap in memory anyway; there is no point in adding the overhead of a RocksDB get when we have all the data in memory.
>
> The point is - either use RocksDB, or don't use it - don't keep half in memory and half in RocksDB. We are managing two structures (a map and a table) when one can do it.

We are just using RocksDB as a ledger and nothing more than that. I didn't want to change the existing data structure. All of this was already working when things were completely in memory; I am just adding an operation to also write to RocksDB and to rebuild the entire data structure from it on startup.

> You must read the entire small table from RocksDB on startup anyway - so it will be pulled into RocksDB memory naturally. After that, writes get buffered in RocksDB memory, so an existence check should be very fast. Removing the reliance on the missing containers map simplifies the code, making it cleaner overall.
>
> This can be done, but we would go back to the same problem of changing all the test files.
>
> Not if we keep the original add/remove for the common case and then rename the overloaded versions to better specify their intent.

@swamirishi
Contributor Author

swamirishi commented Nov 18, 2024

> This is not an expensive operation. We keep the entire containerMap in memory anyway; there is no point in adding the overhead of a RocksDB get when we have all the data in memory.
>
> The point is - either use RocksDB, or don't use it - don't keep half in memory and half in RocksDB. We are managing two structures (a map and a table) when one can do it.
>
> You must read the entire small table from RocksDB on startup anyway - so it will be pulled into RocksDB memory naturally. After that, writes get buffered in RocksDB memory, so an existence check should be very fast. Removing the reliance on the missing containers map simplifies the code, making it cleaner overall.
>
> This can be done, but we would go back to the same problem of changing all the test files.
>
> Not if we keep the original add/remove for the common case and then rename the overloaded versions to better specify their intent.

Done

@sodonnel
Contributor

The latest change looks OK to me, but I want someone else to review it too.

In the future, please structure large changes like this as a series of smaller logical commits and push them to the PR unsquashed. It makes it so much easier to work with, and if someone disagrees with some part, easier to revert and change too.

Also, as we found, we halved the number of changed files with a few simple changes. Perhaps these changes are not perfect from an OO perspective, or not what you would do if you were writing this from scratch, but minimizing the wider impact of a change and reducing code change elsewhere is worth some small compromises on the code structure.

@swamirishi
Contributor Author

> The latest change looks OK to me, but I want someone else to review it too.
>
> In the future, please structure large changes like this as a series of smaller logical commits and push them to the PR unsquashed. It makes it so much easier to work with, and if someone disagrees with some part, easier to revert and change too.
>
> Also, as we found, we halved the number of changed files with a few simple changes. Perhaps these changes are not perfect from an OO perspective, or not what you would do if you were writing this from scratch, but minimizing the wider impact of a change and reducing code change elsewhere is worth some small compromises on the code structure.

Yeah I will keep this in mind for future patches.

Comment on lines 91 to 98
  @Override
  public Void call() throws Exception {

-   OzoneConfiguration conf = createOzoneConfiguration();
+   try {
+     OzoneConfiguration conf = createOzoneConfiguration();

-   final Collection<String> datanodeStorageDirs =
-       HddsServerUtil.getDatanodeStorageDirs(conf);
+     final Collection<String> datanodeStorageDirs =
+         HddsServerUtil.getDatanodeStorageDirs(conf);
Contributor

This change is just wrapping existing code in try-finally, with new code added only in finally. You can reduce the change by:

  • rename existing call() to something else (using oldCall() here as example)
  • add new call() as:
  public Void call() throws Exception {
    try {
      oldCall();
    } finally {
      ...
    }
  }

This way the indentation is unchanged and the diff is reduced.
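
A fuller sketch of that wrapper shape, with hypothetical names (oldCall, closeWitnessedContainerDb) standing in for the original body and the new cleanup:

```java
@Override
public Void call() throws Exception {
  try {
    return oldCall();                  // original body, unchanged apart from the rename
  } finally {
    closeWitnessedContainerDb();       // hypothetical name for the new cleanup added in this PR
  }
}

private Void oldCall() throws Exception {
  OzoneConfiguration conf = createOzoneConfiguration();
  final Collection<String> datanodeStorageDirs =
      HddsServerUtil.getDatanodeStorageDirs(conf);
  // ... rest of the original method body, indentation untouched ...
  return null;
}
```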

Contributor Author

done

@@ -21,7 +21,7 @@ OZONE-SITE.XML_ozone.scm.datanode.ratis.volume.free-space.min=10MB
OZONE-SITE.XML_ozone.scm.pipeline.creation.interval=30s
OZONE-SITE.XML_ozone.scm.pipeline.owner.container.count=1
OZONE-SITE.XML_ozone.scm.names=scm
-OZONE-SITE.XML_ozone.scm.datanode.id.dir=/data
+OZONE-SITE.XML_ozone.scm.datanode.id.dir=/data/metadata
Contributor

Can you please explain this part of the change?

Contributor Author

Creation of the RocksDB under /data fails because of a permission issue, so I had to create it under a metadata directory.

Contributor

So why do we need to create DB in the ozone.scm.datanode.id.dir, which is for the datanode ID file? Can we use ozone.metadata.dirs, which is already configured to /data/metadata?

Contributor Author

We want the datanode.id file and the RocksDB to be on the same volume, and keeping them in the same directory is the logical choice in that case.

Comment on lines 39 to 41
public static <KEY, VALUE> Table<KEY, VALUE> getInMemoryTableForTest() {
return new Table<KEY, VALUE>() {
private final Map<KEY, VALUE> map = new ConcurrentHashMap<>();
Contributor

Please create a non-anonymous class for the Table implementation, and drop DBTestUtils, which has no other code.
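
For illustration, the direction being asked for could look roughly like this; it is a deliberately simplified, map-backed stand-in and does not implement the full org.apache.hadoop.hdds.utils.db.Table interface:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified in-memory table stand-in for tests; only the operations sketched here are modelled. */
public final class InMemoryTestTable<K, V> {
  private final Map<K, V> map = new ConcurrentHashMap<>();

  public void put(K key, V value) {
    map.put(key, value);
  }

  public V get(K key) {
    return map.get(key);
  }

  public void delete(K key) {
    map.remove(key);
  }

  public boolean isExist(K key) {
    return map.containsKey(key);
  }
}
```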

Contributor Author

done

Comment on lines 21 to 25
* Defines a functional interface to call void returning function.
*/
@FunctionalInterface
public interface VoidCallable<EXCEPTION_TYPE extends Exception> {
void call() throws EXCEPTION_TYPE;
Contributor

Please use CheckedRunnable.

Contributor Author

done

import java.util.Map;

/**
* Class for defining the schema for master volume in a datanode.
Contributor

What is a master volume?

Contributor Author

The master volume is defined in this doc:
https://docs.google.com/document/d/1k4Tc7P-Jszdwn4dooRQZiKctlcGWNlnjZO1fh_TBneM/edit?usp=sharing . It is the same volume where we have the datanode.id file.

Comment on lines 29 to 31
* Abstract Interface for interacting with datanode databases.
*/
public interface AbstractStore extends Closeable {
Contributor

Interfaces are already abstract, so both the doc and the interface name are very vague. Please try to find a better name. As far as I can see, it's mostly for managing the lifecycle of a DBStore. Also, this parent interface does not seem to be specific to datanodes. Should it be moved to hadoop-hdds/framework?

Contributor Author

@swamirishi swamirishi Nov 18, 2024

We need a base interface for initialising a RocksDB in the datanode alone, covering things like loading the datanode profile and everything that comes with it. I believe there is a duplicate interface for OM & SCM as well. We should pick this up as a follow-up task and unify these things in one place.

Comment on lines 178 to 179
case UNSUPPORTED_REQUEST:
case CONTAINER_MISSING:// Blame client for sending unsupported request.
Contributor

The comment about an unsupported request belongs to the UNSUPPORTED_REQUEST case...
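
For illustration, the requested placement, wrapped in a hypothetical helper (the surrounding method name is illustrative, not the class this diff actually touches):

```java
static boolean isClientError(ContainerProtos.Result result) {
  switch (result) {
  case UNSUPPORTED_REQUEST: // Blame client for sending unsupported request.
  case CONTAINER_MISSING:
    return true;
  default:
    return false;
  }
}
```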

Contributor Author

done

@swamirishi
Contributor Author

@adoroszlai I have addressed all the comments. Do you have any more review comments?

Contributor

@kerneltime kerneltime left a comment

Why is the logic embedded into hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java? A cleaner approach would be to extend the container set logic via aggregation:

  1. Create a new Java class (WitnessedContainers.java) that reads and writes to the DB and keeps a concurrent copy in memory to handle concurrency (the current code does this).
  2. Do not bother with missing container logic at the level of WitnessedContainers.java. When the replication manager tries to create a container, it checks if it is in the witnessed list and then adds it. So we only need two methods for addition: AddContainer and AddContainerIfNotPresent (see the sketch after this comment).

Do we need reference-counted code to be introduced? Can we get away with just embedding it within the ContainerSet?

I think this code change can go through more than one round of simplification for what it is trying to solve.
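
For illustration, a sketch of the aggregation idea; the class name follows the suggestion above, while the Table-backed ledger and method shapes are assumptions rather than the PR's actual code:

```java
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.hdds.utils.db.Table;

public final class WitnessedContainers {
  private final Set<Long> inMemoryIds = ConcurrentHashMap.newKeySet();
  private final Table<Long, Long> containerIdsTable;   // persisted ledger; value type is illustrative

  public WitnessedContainers(Table<Long, Long> containerIdsTable) {
    this.containerIdsTable = containerIdsTable;
    // In the real code the in-memory view would be seeded from the table at startup.
  }

  /** Record the id unconditionally (e.g. replication re-creating a replica). */
  public void addContainer(long containerId) throws IOException {
    containerIdsTable.put(containerId, containerId);
    inMemoryIds.add(containerId);
  }

  /** Record the id only if it was never witnessed before; return false otherwise. */
  public boolean addContainerIfNotPresent(long containerId) throws IOException {
    if (!inMemoryIds.add(containerId)) {
      return false;
    }
    containerIdsTable.put(containerId, containerId);
    return true;
  }
}
```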

@swamirishi
Contributor Author

swamirishi commented Nov 19, 2024

> Why is the logic embedded into hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java? A cleaner approach would be to extend the container set logic via aggregation:
>
>   1. Create a new Java class (WitnessedContainers.java) that reads and writes to the DB and keeps a concurrent copy in memory to handle concurrency (the current code does this).
>   2. Do not bother with missing container logic at the level of WitnessedContainers.java. When the replication manager tries to create a container, it checks if it is in the witnessed list and then adds it. So we only need two methods for addition: AddContainer and AddContainerIfNotPresent.

Then what is the purpose of the missing container set? We shouldn't have two data structures solving the same problem. We should remove the missing container set and rely only on the witnessed container list.
IMO we only need two of these three data structures:

  1. the witnessed container set
  2. the containers with data on disk
  3. the missing container list

If we have two of the three, we can compute the third. With the proposed approach we would have logic spread across all three, which would complicate things further in the future. We shouldn't add redundant data structures if they are not required.
I took the approach of reusing the missing container set since its usage is spread across multiple places in the code, and I didn't want to change the logic there.

> Do we need reference-counted code to be introduced? Can we get away with just embedding it within the ContainerSet?

I had to do the reference-counted DB implementation just because of the test cases; fixing the tests otherwise would have made the PR much bigger. The simpler approach was to use reference counting, so that we can have only one instance of a writable DB.

> I think this code change can go through more than one round of simplification for what it is trying to solve.

@swamirishi swamirishi marked this pull request as draft November 19, 2024 23:41
@swamirishi
Contributor Author

swamirishi commented Nov 20, 2024

The missingContainerSet in the ContainerSet was added as part of HDDS-935, which tried to solve the same problem we are tackling here (avoid recreating an already-created container on a datanode in case of disk removal followed by datanode restart). Unfortunately, that solution only accounted for Ratis containers, because that was the only container type back then. It marks as missing all the containers that are present in the Ratis snapshot's containerBCSIdMap but not found on disk after scanning all the data volumes in the node.
HDDS-8448 caused a regression where containers on a failed volume are removed instead of being marked as missing.
This patch does something along the same lines by extending the mechanism to EC containers, and it fixes the issues in the implementation of volume-failure handling. I believe this implementation was working fine for Ratis containers prior to the HDDS-8448 regression. Creating a duplicate data structure with the same functionality is redundant and could lead to future bugs, because we would have to update two structures instead of one. I also don't see any value in implementing this all over again, which would also require thorough testing of that flow. IMO we should reuse the existing frameworks in the system rather than add more flows that complicate an already complicated system further.
If you still strongly feel we should move this logic out of the ContainerSet class, then my opinion would be to get rid of the missingContainerSet entirely from all of our code flows and have only one of the two, but not both. @sodonnel @kerneltime @adoroszlai, what do you think we should do here? Adding @ashishkumar50 since he was involved in HDDS-8448.
I have addressed the other review comments on the patch (getting rid of the Proto2EnumCodec, removing ReferenceCountedDB for the WitnessedContainerMetadataStore).

Contributor

@kerneltime kerneltime left a comment

This looks better now. I want to go over the DB code refactoring and test code.

/**
* Add Container to container map.
* @param container container to be added
* @return If container is added to containerMap returns true, otherwise
* false
*/
-  public boolean addContainer(Container<?> container) throws
+  private boolean addContainer(Container<?> container, boolean overwriteMissingContainers) throws
Contributor

Suggested change:
-  private boolean addContainer(Container<?> container, boolean overwriteMissingContainers) throws
+  private boolean addContainer(Container<?> container, boolean overwrite) throws

/**
* Removes the Container matching with specified containerId.
* @param containerId ID of the container to remove
* @return If container is removed from containerMap returns true, otherwise
* false
*/
-  public boolean removeContainer(long containerId) {
+  private boolean removeContainer(long containerId, boolean markMissing, boolean removeFromDB)
Contributor

removeFromDB - do we expect it to be true outside of testing?

Contributor Author

false, false: This happens on finding a duplicate container on startup, where the container info in memory gets swapped by removing and re-adding the updated container info.
true, false: This happens on volume failure, when we have to mark the container as missing. It is a plain in-memory operation in which we remove the container from the containerMap and add the containerId to the missingContainerSet.
false, true: This is the regular delete flow, which happens when SCM sends a delete container command. It only happens after the entire container data has been deleted from disk.
true, true: This is an invalid case; it doesn't happen anywhere.
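
For illustration, a sketch that maps these combinations to their effects; the class itself and the plain map standing in for the RocksDB table are illustrative, not the PR's actual code:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

final class ContainerRemovalSketch {
  private final Map<Long, Object> containerMap = new ConcurrentSkipListMap<>();
  private final Set<Long> missingContainerSet = ConcurrentHashMap.newKeySet();
  private final Map<Long, Long> containerIdsTable = new ConcurrentHashMap<>(); // stand-in for the RocksDB table

  boolean removeContainer(long containerId, boolean markMissing, boolean removeFromDB) {
    Object removed = containerMap.remove(containerId);
    if (markMissing) {
      // (true, false) volume failure: in-memory only; the id stays in the DB so
      // the container can never be silently recreated.
      missingContainerSet.add(containerId);
    }
    if (removeFromDB) {
      // (false, true) SCM-driven delete: data is already gone from disk, so the
      // id is dropped from the ledger as well.
      containerIdsTable.remove(containerId);
    }
    // (false, false) is the in-memory swap when a duplicate container is found
    // on startup; (true, true) is never used.
    return removed != null;
  }
}
```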

Contributor

I was under the impression that once a container is created, it will remain in the DB forever.

Contributor Author

We don't have any use of that container information. Why keep it if it is not required?

@swamirishi swamirishi marked this pull request as ready for review November 20, 2024 03:11
}
}, 1000, 30000);

// Read Chunk
Contributor

Should we add a second write chunk as well?

@kerneltime
Contributor

The change seems reasonable now. Please ensure we are testing the APIs we care about for a container after the container goes missing. I think we can generate more test cases via fault injection. We can discuss if it needs to be part of this PR or a separate PR (to keep this PR small).
