HDDS-11650. ContainerId list to track all containers created in a datanode #7402

Open

wants to merge 35 commits into master

Conversation

swamirishi
Contributor

@swamirishi swamirishi commented Nov 7, 2024

What changes were proposed in this pull request?

We need to persist the list of containers created on a datanode to ensure that volume failures don't lead to recreation of containers on write chunk, which would leave partial, inconsistent data on datanodes.

For Ratis containers this is taken care of by persisting the container-to-BCSID map in the Ratis snapshot file, but such a list won't be present for EC containers.
Details are in the design doc below.

https://docs.google.com/document/d/1k4Tc7P-Jszdwn4dooRQZiKctlcGWNlnjZO1fh_TBneM/edit?usp=sharing

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11650

How was this patch tested?

Adding more unit tests & integration tests

@swamirishi swamirishi marked this pull request as ready for review November 9, 2024 15:06
StorageContainerException {
Preconditions.checkNotNull(container, "container cannot be null");

long containerId = container.getContainerData().getContainerID();
State containerState = container.getContainerData().getState();
if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
Contributor

I don't understand why we cannot just check if the container is in the ContainerIDTable? Why do we need to use this missingContainerSet instead?

I traced the creation of the missing set to a ratis snapshot (I know nothing about it) and a diff against the containers found in memory. What about EC containers? Do they all become part of this missing set?

Prior to this change, the missingSet doesn't seem to be used for anything inside this class, but now it is used for these checks. What was the missingSet used for before?

I would have thought the ContainerID table starts initially empty after this change is committed. On first start, it gets populated with all containers in the ContainerSet (scanned from disk). On later startups, it can also have all containers added, but they should already all be there. Then we can just track previously created containers in the containerID table?

Contributor Author

@swamirishi swamirishi Nov 12, 2024

The container set is first initialised with all the containers present on disk by performing an ls on the data volumes. Then we iterate through the containerIDs table in RocksDB, and any container present in this table but not present on disk is added to the missing container set. This includes both Ratis and EC containers.
Now on container creation we just check the in-memory set rather than going to RocksDB every time, which would incur an IO cost; however small that cost is, it isn't needed in my opinion when we have everything in memory.
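
For illustration, a minimal sketch of this startup reconciliation; the names (idsFromContainerIdsTable, containersScannedFromDisk) are illustrative, not the PR's actual fields:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class MissingContainerReconciler {

  /**
   * Containers recorded in the containerIds table but not found on disk are
   * considered missing; this covers both Ratis and EC containers.
   */
  static Set<Long> computeMissingContainers(Set<Long> idsFromContainerIdsTable,
      Map<Long, ?> containersScannedFromDisk) {
    Set<Long> missing = new HashSet<>(idsFromContainerIdsTable);
    missing.removeAll(containersScannedFromDisk.keySet());
    return missing;
  }

  public static void main(String[] args) {
    Set<Long> persistedIds = Set.of(1L, 2L, 3L);
    Map<Long, Object> onDisk = Map.of(1L, new Object(), 3L, new Object());
    // Container 2 is in the table but has no data on disk, so it is missing.
    System.out.println(computeMissingContainers(persistedIds, onDisk)); // [2]
  }
}
```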

Contributor

Ah ok, I misread the Javadoc and thought it was a Ratis snapshot. However my question still stands - why not simply use the ContainerIDTable to check for previous container creation, rather than augmenting this missing list?

Contributor Author

@swamirishi swamirishi Nov 12, 2024

I am just reusing the data structure that was already there. This missing container set was initially used only for Ratis containers; I am now making it track EC containers as well. I didn't want to change a lot of code, so I reused most of what is already there.
Moreover, an in-memory get is faster than fetching from RocksDB.

Contributor

Prior to this change, the missingContainerSet variable in the ContainerSet was not used for anything within that class. Now, we are potentially adding to it from the containerList - what impact does that have more widely? What was the missingContainerSet used for before?

We also have to open the new containerList table - it would be clearer if we just loaded that into memory and used the new "seenContainerList" for previously-existing checks.

However backing up a bit - how can containers get created?

One way, is via the write path - that is the problem we are trying to solve.

Another is via EC reconstruction - there is already logic there to deal with a duplicate container, I think, though it may not cover volume failure. That said, EC reconstruction effectively uses the same write path as normal writes, so it is basically the same as the write path.

Then there is replication, which is a very different path into the system.

Maybe this problem is better not solved in ContainerSet, but at the higher level in the dispatcher where the writes are being handled?

Contributor Author

I feel we should track all the containers getting added to the system. This flow also accommodates replication for both Ratis and EC. Barring the exception where we allow the replication job to overwrite missing containers, we should prevent overwriting any container coming through any flow. If we have this check in a central place, any future code change will be foolproof; at the very least, things would break loudly instead of silently writing something wrong.

@sodonnel
Contributor

For me, this change is too big, and it's going to cause conflicts and be difficult to backport. It's also hard to review, as the number of changed files is massive, although many are test files and simple one-line changes.

It feels like this change could be largely contained within the ContainerSet class, without changing a lot of generic classes and causing a ripple effect into the wider system. All it has to do is create a new overloaded constructor and not change the existing one. Make the existing one call the new one with the correct runtime requirements. Then we just need to check whether a container has previously existed or not, and add new containers to the "previously exists" list. This shouldn't require changing 90-something files in a single PR.

@swamirishi
Contributor Author

swamirishi commented Nov 15, 2024

> For me, this change is too big, and it's going to cause conflicts and be difficult to backport. It's also hard to review, as the number of changed files is massive, although many are test files and simple one-line changes.
>
> It feels like this change could be largely contained within the ContainerSet class, without changing a lot of generic classes and causing a ripple effect into the wider system. All it has to do is create a new overloaded constructor and not change the existing one. Make the existing one call the new one with the correct runtime requirements. Then we just need to check whether a container has previously existed or not, and add new containers to the "previously exists" list. This shouldn't require changing 90-something files in a single PR.

All of the tests use the containerSet. If we pass a null table, the container set becomes a read-only data structure (I have put guardrails in place so that we don't end up inadvertently writing containers without persisting the data). Since we have made the table a must-have argument, we have to pass a mocked table in all the existing tests.

Another big change: we didn't have a pre-existing RocksDB on the master volume. DatanodeStore was only defined for data volumes, and reusing that code to create the master volume store ended up introducing generics and a base interface in several places.

StorageContainerException {
Preconditions.checkNotNull(container, "container cannot be null");

long containerId = container.getContainerData().getContainerID();
State containerState = container.getContainerData().getState();
if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
throw new StorageContainerException(String.format("Container with container Id %d is missing in the DN " +
Contributor

KeyValueHandler has similar logic - why is it needed in both places? It feels like the containerSet (i.e. here) may not need to check this, if it's checked in the handler that creates the containers on the write path?

Contributor Author

@swamirishi swamirishi Nov 15, 2024

KeyValueHandler catches the exception and wraps it in an error response.

Contributor Author

Yeah, you are right, this is redundant. Let me shift this logic into a single function so that we don't have redundant code. But we should still perform the missing-container check on every container add.

Contributor Author

There is a function that doesn't overwrite containers. We should perform this check on every container add. In KeyValueHandler we have to do it there as well, because creating the KeyValueContainerData also creates the required directory structure on the data volume. But whenever we add to the data structure from any other path, we have to ensure the container is not missing by default, unless we explicitly intend to overwrite the missing container.
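
For illustration, a sketch of this centralized guard; the field names mirror the discussion, but this is not the PR's exact ContainerSet code, and CONTAINER_MISSING is the result code introduced by this PR:

```java
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos;
import org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException;
import org.apache.hadoop.hdds.utils.db.Table;
import org.apache.hadoop.ozone.container.common.interfaces.Container;

class ContainerSetSketch {
  private final ConcurrentSkipListMap<Long, Container<?>> containerMap = new ConcurrentSkipListMap<>();
  private final Set<Long> missingContainerSet = ConcurrentHashMap.newKeySet();
  private final Table<Long, Long> containerIdsTable;   // witnessed-container ledger; value type is illustrative

  ContainerSetSketch(Table<Long, Long> containerIdsTable) {
    this.containerIdsTable = containerIdsTable;
  }

  boolean addContainer(Container<?> container, boolean overwriteMissingContainers)
      throws StorageContainerException {
    long containerId = container.getContainerData().getContainerID();
    // Every add path (write chunk, EC reconstruction, replication) funnels through
    // here, so a container known to be missing cannot be silently recreated.
    if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
      throw new StorageContainerException("Container " + containerId
          + " is missing and must not be recreated",
          ContainerProtos.Result.CONTAINER_MISSING);
    }
    if (containerMap.putIfAbsent(containerId, container) != null) {
      return false;                                       // already tracked in memory
    }
    try {
      containerIdsTable.put(containerId, containerId);    // persist the witnessed id
    } catch (IOException e) {
      containerMap.remove(containerId);
      throw new StorageContainerException("Failed to persist container id "
          + containerId + ": " + e.getMessage(), ContainerProtos.Result.IO_EXCEPTION);
    }
    missingContainerSet.remove(containerId);
    return true;
  }
}
```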

/**
* Add Container to container map.
* @param container container to be added
* @return If container is added to containerMap returns true, otherwise
* false
*/
-  public boolean addContainer(Container<?> container) throws
+  public boolean addContainer(Container<?> container, boolean overwriteMissingContainers) throws
Contributor

I would suggest adding a new method here - addContainerIgnoringMissing (or something similar), rather than changing the existing interface. That way, it avoids changes in existing calling code, and only code that explicitly needs to pass the boolean can be changed to call the new method. I suspect this change would reduce the number of files impacted by this PR.

Contributor Author

There is already a method for that.

public boolean addContainer(Container<?> container) throws StorageContainerException {
return addContainer(container, false);
}


import java.io.Closeable;

/**
* DB handle abstract class.
*/
-public abstract class DBHandle implements Closeable {
+public abstract class DBHandle<STORE extends AbstractStore> implements Closeable {
Contributor

Can we avoid making changes to this class and get the desired functionality another way (extending another base class, something else)? Changing this has a knock-on effect on many other classes, and it feels like refactoring that perhaps isn't needed.

Contributor Author

Yeah, I have limited the changes to only the AbstractStore changes. I don't see a point in reducing the changes further, and I don't think there is going to be a lot of conflict when backporting the AbstractStore changes.

@sodonnel
Contributor

Thanks for reducing the changed files. It's down to 47 now from 97, which is a big improvement.

I still think its confusing that we have two structures (missingContainers map and ContainerIDsTable) to track what is going on.

If we want to block creating a container that has been created before, can we just check for existence in the ContainerIDsTable? Right now we check the missingTable and then add an entry to the containerIDTable.

If we remove a container, there are two boolean parameters - markMissing and removeFromDB - is it ever valid to pass "true, true" so it marks it missing and removes it from the DB?

Provided we always add to the containerIDsTable on new containerCreation, then we should be able to check for previous creation from it too.

It feels like when we remove the container for a failed volume, we don't touch the containerIDs table. Something that tries to recreate it will get blocked by the existence check against the table.

Or we remove it "safely" via replication or the balancer, and then we also remove it from the DB.

If we just use the containerID table, then you don't need to iterate it at startup and check off all the containers. You also don't need to hold anything in memory, as RocksDB will cache things itself. However, we may need to populate the table from all existing containers on disk.

I also suspect the code would be more understandable if we didn't overload add/removeContainer in ContainerSet with the booleans, and instead created new add???/remove??? methods named more in line with what they are doing, e.g. remove, removeSafely, removeUnexpectedly, or something along those lines.
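
For illustration, a sketch of such intent-revealing names, written as thin wrappers over the PR's private three-argument removeContainer; the wrapper names themselves are illustrative:

```java
// Assumes the enclosing ContainerSet already has:
//   private boolean removeContainer(long id, boolean markMissing, boolean removeFromDB)
public boolean removeContainer(long containerId) throws StorageContainerException {
  // Regular delete flow: container data already removed from disk, drop it everywhere.
  return removeContainer(containerId, false, true);
}

public boolean removeMissingContainer(long containerId) throws StorageContainerException {
  // Volume failure: keep the id in the containerIds table so the container cannot
  // be recreated, and only mark it missing in memory.
  return removeContainer(containerId, true, false);
}

public boolean removeContainerOnlyFromMemory(long containerId) throws StorageContainerException {
  // Startup duplicate handling: swap the in-memory entry, touch neither the
  // missing set nor the DB.
  return removeContainer(containerId, false, false);
}
```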

@swamirishi
Contributor Author

swamirishi commented Nov 18, 2024

> Thanks for reducing the changed files. It's down to 47 now from 97, which is a big improvement.
>
> I still think it's confusing that we have two structures (missingContainers map and ContainerIDsTable) to track what is going on.
>
> If we want to block creating a container that has been created before, can we just check for existence in the ContainerIDsTable? Right now we check the missingTable and then add an entry to the containerIDTable.

The missingContainerSet is always in memory. We add the entry to the containerIdTable and remove it from the missingContainerSet.

> If we remove a container, there are two boolean parameters - markMissing and removeFromDB - is it ever valid to pass "true, true" so it marks it missing and removes it from the DB?

removeFromDB is only false when the physical container directory has not been deleted from the node, so "true, true" should not be possible.
false, false: This happens on finding a duplicate container on startup, where the container info in memory gets swapped by removing and re-adding the updated container info.
true, false: This happens on volume failure, when we have to mark the container as missing. It is a plain in-memory operation in which we remove the container from the containerMap and add the containerId to the missingContainerSet.
false, true: This is the regular delete flow, which happens when SCM sends a delete container command. It only happens after the entire container data has been deleted from disk.
true, true: This is an invalid case; it doesn't happen anywhere.

> Provided we always add to the containerIDsTable on new container creation, then we should be able to check for previous creation from it too.

Since we already have the containerInfo in the containerMap in memory, there is no point in going to RocksDB.

> It feels like when we remove the container for a failed volume, we don't touch the containerIDs table. Something that tries to recreate it will get blocked by the existence check against the table.
>
> Or we remove it "safely" via replication or the balancer, and then we also remove it from the DB.

On replication we always have the overwrite flag overwriteMissingContainers set to true.

> If we just use the containerID table, then you don't need to iterate it at startup and check off all the containers. You also don't need to hold anything in memory, as RocksDB will cache things itself. However, we may need to populate the table from all existing containers on disk.

This is not an expensive operation. We keep the entire containerMap in memory anyway; there is no point in adding the overhead of a RocksDB get when we have all the data in memory.

> I also suspect the code would be more understandable if we didn't overload add/removeContainer in ContainerSet with the booleans, and instead created new add???/remove??? methods named more in line with what they are doing, e.g. remove, removeSafely, removeUnexpectedly, or something along those lines.

This can be done, but we would go back to the same problem of changing all the test files in that case. We can take this up as a follow-up refactoring JIRA.

@sodonnel
Contributor

> This is not an expensive operation. We keep the entire containerMap in memory anyway; there is no point in adding the overhead of a RocksDB get when we have all the data in memory.

The point is - either use RocksDB, or don't use it - don't keep half in memory and half in RocksDB. We are managing two structures (a map and a table) when one can do it.

You must read the entire small table from RocksDB on startup anyway - so it will be pulled into RocksDB memory naturally. After that, writes get buffered in RocksDB memory, so an existence check should be very fast. Removing the reliance on the missing containers map simplifies the code, making it cleaner overall.
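
For illustration, the table-only check being argued for might look like this; a sketch that assumes the containerIds table exposes the Table#isExist lookup and reuses the PR's CONTAINER_MISSING result code:

```java
// Sketch: containerIdsTable and containerMap are the ContainerSet fields
// discussed above; no separate missingContainerSet is consulted here.
boolean witnessedBefore = containerIdsTable.isExist(containerId);
if (witnessedBefore && !containerMap.containsKey(containerId)) {
  // Witnessed earlier but no replica on disk (e.g. failed volume):
  // refuse to recreate it through the write path.
  throw new StorageContainerException("Container " + containerId
      + " was created before and must not be recreated",
      ContainerProtos.Result.CONTAINER_MISSING);
}
```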

> This can be done, but we would go back to the same problem of changing all the test files in that case.

Not if we keep the original add/remove for the common case and then rename the overloaded versions to better specify their intent.

@swamirishi
Contributor Author

> This is not an expensive operation. We keep the entire containerMap in memory anyway; there is no point in adding the overhead of a RocksDB get when we have all the data in memory.
>
> The point is - either use RocksDB, or don't use it - don't keep half in memory and half in RocksDB. We are managing two structures (a map and a table) when one can do it.

We are just using RocksDB as a ledger and nothing more than that. I didn't want to change the existing data structure. All of this was already working when things were completely in memory; I am just adding an operation to also write to RocksDB and to rebuild the entire data structure from it on startup.

> You must read the entire small table from RocksDB on startup anyway - so it will be pulled into RocksDB memory naturally. After that, writes get buffered in RocksDB memory, so an existence check should be very fast. Removing the reliance on the missing containers map simplifies the code, making it cleaner overall.
>
> This can be done, but we would go back to the same problem of changing all the test files.
>
> Not if we keep the original add/remove for the common case and then rename the overloaded versions to better specify their intent.

@swamirishi
Contributor Author

swamirishi commented Nov 18, 2024

> This is not an expensive operation. We keep the entire containerMap in memory anyway; there is no point in adding the overhead of a RocksDB get when we have all the data in memory.
>
> The point is - either use RocksDB, or don't use it - don't keep half in memory and half in RocksDB. We are managing two structures (a map and a table) when one can do it.
>
> You must read the entire small table from RocksDB on startup anyway - so it will be pulled into RocksDB memory naturally. After that, writes get buffered in RocksDB memory, so an existence check should be very fast. Removing the reliance on the missing containers map simplifies the code, making it cleaner overall.
>
> This can be done, but we would go back to the same problem of changing all the test files.
>
> Not if we keep the original add/remove for the common case and then rename the overloaded versions to better specify their intent.

Done

@sodonnel
Contributor

The latest change looks OK to me, but I want someone else to review it too.

In the future, please structure large changes like this as a series of smaller logical commits and push them to the PR unsquashed. It makes it so much easier to work with, and if someone disagrees with some part, easier to revert and change too.

Also, as we found, we halved the number of changed files with a few simple changes. Perhaps these changes are not perfect from an OO perspective, or not what you would do if you were writing this from scratch, but minimizing the wider impact of a change and reducing code change elsewhere is worth some small compromises on the code structure.

@swamirishi
Contributor Author

> The latest change looks OK to me, but I want someone else to review it too.
>
> In the future, please structure large changes like this as a series of smaller logical commits and push them to the PR unsquashed. It makes it so much easier to work with, and if someone disagrees with some part, easier to revert and change too.
>
> Also, as we found, we halved the number of changed files with a few simple changes. Perhaps these changes are not perfect from an OO perspective, or not what you would do if you were writing this from scratch, but minimizing the wider impact of a change and reducing code change elsewhere is worth some small compromises on the code structure.

Yeah I will keep this in mind for future patches.

Comment on lines 91 to 98
  @Override
  public Void call() throws Exception {

-   OzoneConfiguration conf = createOzoneConfiguration();
+   try {
+     OzoneConfiguration conf = createOzoneConfiguration();

-   final Collection<String> datanodeStorageDirs =
-       HddsServerUtil.getDatanodeStorageDirs(conf);
+     final Collection<String> datanodeStorageDirs =
+         HddsServerUtil.getDatanodeStorageDirs(conf);
Contributor

This change is just wrapping existing code in try-finally, with new code added only in finally. You can reduce the change by:

  • rename existing call() to something else (using oldCall() here as example)
  • add new call() as:
  public Void call() throws Exception {
    try {
      oldCall();
    } finally {
      ...
    }
  }

This way the indentation is unchanged and the diff is reduced.
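
A fuller sketch of that wrapper shape, with hypothetical names (oldCall, closeWitnessedContainerDb) standing in for the original body and the new cleanup:

```java
@Override
public Void call() throws Exception {
  try {
    return oldCall();                  // original body, unchanged apart from the rename
  } finally {
    closeWitnessedContainerDb();       // hypothetical name for the new cleanup added in this PR
  }
}

private Void oldCall() throws Exception {
  OzoneConfiguration conf = createOzoneConfiguration();
  final Collection<String> datanodeStorageDirs =
      HddsServerUtil.getDatanodeStorageDirs(conf);
  // ... rest of the original method body, indentation untouched ...
  return null;
}
```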

Contributor Author

done

@@ -21,7 +21,7 @@ OZONE-SITE.XML_ozone.scm.datanode.ratis.volume.free-space.min=10MB
OZONE-SITE.XML_ozone.scm.pipeline.creation.interval=30s
OZONE-SITE.XML_ozone.scm.pipeline.owner.container.count=1
OZONE-SITE.XML_ozone.scm.names=scm
-OZONE-SITE.XML_ozone.scm.datanode.id.dir=/data
+OZONE-SITE.XML_ozone.scm.datanode.id.dir=/data/metadata
Contributor

Can you please explain this part of the change?

Contributor Author

Creation of the RocksDB under /data fails because of a permission issue, so I had to create it under a metadata directory.

Contributor

So why do we need to create DB in the ozone.scm.datanode.id.dir, which is for the datanode ID file? Can we use ozone.metadata.dirs, which is already configured to /data/metadata?

Contributor Author

We want the datanode.id file and the RocksDB to be on the same volume, and keeping them in the same directory is the logical choice in that case.

Comment on lines 39 to 41
public static <KEY, VALUE> Table<KEY, VALUE> getInMemoryTableForTest() {
return new Table<KEY, VALUE>() {
private final Map<KEY, VALUE> map = new ConcurrentHashMap<>();
Contributor

Please create a non-anonymous class for the Table implementation, and drop DBTestUtils, which has no other code.
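
For illustration, the direction being asked for could look roughly like this; it is a deliberately simplified, map-backed stand-in and does not implement the full org.apache.hadoop.hdds.utils.db.Table interface:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified in-memory table stand-in for tests; only the operations sketched here are modelled. */
public final class InMemoryTestTable<K, V> {
  private final Map<K, V> map = new ConcurrentHashMap<>();

  public void put(K key, V value) {
    map.put(key, value);
  }

  public V get(K key) {
    return map.get(key);
  }

  public void delete(K key) {
    map.remove(key);
  }

  public boolean isExist(K key) {
    return map.containsKey(key);
  }
}
```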

Contributor Author

done

Comment on lines 21 to 25
* Defines a functional interface to call void returning function.
*/
@FunctionalInterface
public interface VoidCallable<EXCEPTION_TYPE extends Exception> {
void call() throws EXCEPTION_TYPE;
Contributor

Please use CheckedRunnable.

Contributor Author

done

import java.util.Map;

/**
* Class for defining the schema for master volume in a datanode.
Contributor

What is a master volume?

Contributor Author

The master volume is defined in this doc:
https://docs.google.com/document/d/1k4Tc7P-Jszdwn4dooRQZiKctlcGWNlnjZO1fh_TBneM/edit?usp=sharing . It is the same volume where we have the datanode.id file.

Comment on lines 29 to 31
* Abstract Interface for interacting with datanode databases.
*/
public interface AbstractStore extends Closeable {
Contributor

Interfaces are already abstract, so both the doc and the interface name are very vague. Please try to find a better name. As far as I can see, it's mostly for managing the lifecycle of a DBStore. Also, this parent interface does not seem to be specific to datanodes. Should it be moved to hadoop-hdds/framework?

Contributor Author

@swamirishi swamirishi Nov 18, 2024

We need a base interface for initialising a RocksDB in the datanode alone, covering things like loading the datanode profile and everything that comes with it. I believe there is a duplicate interface for OM & SCM as well. We should pick this up as a follow-up task and unify these things in one place.

Comment on lines 178 to 179
case UNSUPPORTED_REQUEST:
case CONTAINER_MISSING:// Blame client for sending unsupported request.
Contributor

The comment about an unsupported request belongs to the UNSUPPORTED_REQUEST case...
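
For illustration, the requested placement, wrapped in a hypothetical helper (the surrounding method name is illustrative, not the class this diff actually touches):

```java
static boolean isClientError(ContainerProtos.Result result) {
  switch (result) {
  case UNSUPPORTED_REQUEST: // Blame client for sending unsupported request.
  case CONTAINER_MISSING:
    return true;
  default:
    return false;
  }
}
```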

Contributor Author

done

@swamirishi
Contributor Author

@adoroszlai I have addressed all the comments. Do you have any more review comments?

Contributor

@kerneltime kerneltime left a comment

Why is the logic embedded into hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java? A cleaner approach would be to extend the container set logic via aggregation:

  1. Create a new Java class (WitnessedContainers.java) that reads and writes to the DB and keeps a concurrent copy in memory to handle concurrency (the current code does this).
  2. Do not bother with missing container logic at the level of WitnessedContainers.java. When the replication manager tries to create a container, it checks if it is in the witnessed list and then adds it. So we only need two methods for addition: AddContainer and AddContainerIfNotPresent (see the sketch after this comment).

Do we need reference-counted code to be introduced? Can we get away with just embedding it within the ContainerSet?

I think this code change can go through more than one round of simplification for what it is trying to solve.
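
For illustration, a sketch of the aggregation idea; the class name follows the suggestion above, while the Table-backed ledger and method shapes are assumptions rather than the PR's actual code:

```java
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.hdds.utils.db.Table;

public final class WitnessedContainers {
  private final Set<Long> inMemoryIds = ConcurrentHashMap.newKeySet();
  private final Table<Long, Long> containerIdsTable;   // persisted ledger; value type is illustrative

  public WitnessedContainers(Table<Long, Long> containerIdsTable) {
    this.containerIdsTable = containerIdsTable;
    // In the real code the in-memory view would be seeded from the table at startup.
  }

  /** Record the id unconditionally (e.g. replication re-creating a replica). */
  public void addContainer(long containerId) throws IOException {
    containerIdsTable.put(containerId, containerId);
    inMemoryIds.add(containerId);
  }

  /** Record the id only if it was never witnessed before; return false otherwise. */
  public boolean addContainerIfNotPresent(long containerId) throws IOException {
    if (!inMemoryIds.add(containerId)) {
      return false;
    }
    containerIdsTable.put(containerId, containerId);
    return true;
  }
}
```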

@swamirishi
Contributor Author

swamirishi commented Nov 19, 2024

> Why is the logic embedded into hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java? A cleaner approach would be to extend the container set logic via aggregation:
>
>   1. Create a new Java class (WitnessedContainers.java) that reads and writes to the DB and keeps a concurrent copy in memory to handle concurrency (the current code does this).
>   2. Do not bother with missing container logic at the level of WitnessedContainers.java. When the replication manager tries to create a container, it checks if it is in the witnessed list and then adds it. So we only need two methods for addition: AddContainer and AddContainerIfNotPresent.

Then what is the purpose of the missing container set? We shouldn't have two data structures solving the same problem. We should remove the missing container set and rely only on the witnessed container list.
IMO we only need two of these three data structures:

  1. the witnessed container set
  2. the containers with data on disk
  3. the missing container list

If we have two of the three, we can compute the third. With the proposed approach we would have logic spread across all three, which would complicate things further in the future. We shouldn't add redundant data structures if they are not required.
I took the approach of reusing the missing container set since its usage is spread across multiple places in the code, and I didn't want to change the logic there.

> Do we need reference-counted code to be introduced? Can we get away with just embedding it within the ContainerSet?

I had to do the reference-counted DB implementation just because of the test cases; fixing the tests otherwise would have made the PR much bigger. The simpler approach was to use reference counting, so that we can have only one instance of a writable DB.

> I think this code change can go through more than one round of simplification for what it is trying to solve.

@swamirishi swamirishi marked this pull request as draft November 19, 2024 23:41
@swamirishi
Contributor Author

swamirishi commented Nov 20, 2024

The missingContainerSet in the ContainerSet was added as part of HDDS-935, which tried to solve the same problem we are tackling here (avoid recreating an already-created container on a datanode in case of disk removal followed by datanode restart). Unfortunately, that solution only accounted for Ratis containers, because that was the only container type back then. It marks as missing all the containers that are present in the Ratis snapshot's containerBCSIdMap but not found on disk after scanning all the data volumes in the node.
HDDS-8448 caused a regression where containers on a failed volume are removed instead of being marked as missing.
This patch does something along the same lines by extending the mechanism to EC containers, and it fixes the issues in the implementation of volume-failure handling. I believe this implementation was working fine for Ratis containers prior to the HDDS-8448 regression. Creating a duplicate data structure with the same functionality is redundant and could lead to future bugs, because we would have to update two structures instead of one. I also don't see any value in implementing this all over again, which would also require thorough testing of that flow. IMO we should reuse the existing frameworks in the system rather than add more flows that complicate an already complicated system further.
If you still strongly feel we should move this logic out of the ContainerSet class, then my opinion would be to get rid of the missingContainerSet entirely from all of our code flows and have only one of the two, but not both. @sodonnel @kerneltime @adoroszlai, what do you think we should do here? Adding @ashishkumar50 since he was involved in HDDS-8448.
I have addressed the other review comments on the patch (getting rid of the Proto2EnumCodec, removing ReferenceCountedDB for the WitnessedContainerMetadataStore).

Contributor

@kerneltime kerneltime left a comment

This looks better now. I want to go over the DB code refactoring and test code.

/**
* Add Container to container map.
* @param container container to be added
* @return If container is added to containerMap returns true, otherwise
* false
*/
-  public boolean addContainer(Container<?> container) throws
+  private boolean addContainer(Container<?> container, boolean overwriteMissingContainers) throws
Contributor

Suggested change:
-  private boolean addContainer(Container<?> container, boolean overwriteMissingContainers) throws
+  private boolean addContainer(Container<?> container, boolean overwrite) throws

/**
* Removes the Container matching with specified containerId.
* @param containerId ID of the container to remove
* @return If container is removed from containerMap returns true, otherwise
* false
*/
-  public boolean removeContainer(long containerId) {
+  private boolean removeContainer(long containerId, boolean markMissing, boolean removeFromDB)
Contributor

removeFromDB - do we expect it to be true outside of testing?

Contributor Author

false, false: This happens on finding a duplicate container on startup, where the container info in memory gets swapped by removing and re-adding the updated container info.
true, false: This happens on volume failure, when we have to mark the container as missing. It is a plain in-memory operation in which we remove the container from the containerMap and add the containerId to the missingContainerSet.
false, true: This is the regular delete flow, which happens when SCM sends a delete container command. It only happens after the entire container data has been deleted from disk.
true, true: This is an invalid case; it doesn't happen anywhere.
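
For illustration, a sketch that maps these combinations to their effects; the class itself and the plain map standing in for the RocksDB table are illustrative, not the PR's actual code:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

final class ContainerRemovalSketch {
  private final Map<Long, Object> containerMap = new ConcurrentSkipListMap<>();
  private final Set<Long> missingContainerSet = ConcurrentHashMap.newKeySet();
  private final Map<Long, Long> containerIdsTable = new ConcurrentHashMap<>(); // stand-in for the RocksDB table

  boolean removeContainer(long containerId, boolean markMissing, boolean removeFromDB) {
    Object removed = containerMap.remove(containerId);
    if (markMissing) {
      // (true, false) volume failure: in-memory only; the id stays in the DB so
      // the container can never be silently recreated.
      missingContainerSet.add(containerId);
    }
    if (removeFromDB) {
      // (false, true) SCM-driven delete: data is already gone from disk, so the
      // id is dropped from the ledger as well.
      containerIdsTable.remove(containerId);
    }
    // (false, false) is the in-memory swap when a duplicate container is found
    // on startup; (true, true) is never used.
    return removed != null;
  }
}
```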

Contributor

I was under the impression that once a container is created, it will remain in the DB forever.

Contributor Author

We don't have any use of that container information. Why keep it if it is not required?

@swamirishi swamirishi marked this pull request as ready for review November 20, 2024 03:11
}
}, 1000, 30000);

// Read Chunk
Contributor

Should we add a second write chunk as well?

@kerneltime
Contributor

The change seems reasonable now. Please ensure we are testing the APIs we care about for a container after the container goes missing. I think we can generate more test cases via fault injection. We can discuss if it needs to be part of this PR or a separate PR (to keep this PR small).
