Skip to content

Commit

Permalink
Merge pull request ceph#50487 from zdover23/wip-doc-2023-03-12-rados-…
Browse files Browse the repository at this point in the history
…bluestore-migration

doc/rados: edit operations/bs-migration (1 of x)

Reviewed-by: Anthony D'Atri <[email protected]>
  • Loading branch information
zdover23 authored Mar 20, 2023
2 parents 3faae31 + 96166b4 commit 7f61cd8
Showing 1 changed file with 67 additions and 67 deletions.
134 changes: 67 additions & 67 deletions doc/rados/operations/bluestore-migration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,68 +2,67 @@
BlueStore Migration
=====================

Each OSD can run either BlueStore or Filestore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
Filestore OSDs should transition to BlueStore in order to
take advantage of the improved performance and robustness. Moreover,
Ceph releases beginning with Reef do not support Filestore. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place;
BlueStore and Filestore are simply too different for that to be
feasible. The conversion process uses either the cluster's normal
replication and healing support or tools and strategies that copy OSD
content from an old (Filestore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

New OSDs (e.g., when the cluster is expanded) should be deployed
using BlueStore. This is the default behavior so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
should use BlueStore.

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to ensure that the cluster is healthy,
then mark ``out`` each device in turn, wait for
data to replicate across the cluster, reprovision the OSD, and mark
it back ``in`` again. Proceed to the next OSD when recovery is complete.
This is easy to automate but results in more data migration than
is strictly necessary, which in turn presents additional wear to SSDs and takes
longer to complete.
Each OSD must be formatted as either a Filestore OSD or a BlueStore OSD.
However, an individual Ceph cluster can operate with a mixture of both
Filestore OSDs and BlueStore OSDs. Because BlueStore is superior to Filestore
in performance and robustness, and because Filestore is not supported by Ceph
releases beginning with Reef, users deploying Filestore OSDs should transition
to BlueStore. There are several strategies for making the transition to
BlueStore.

BlueStore is so different from Filestore that an individual OSD cannot
be converted in place. Instead, the conversion process must use either
(1) the cluster's normal replication and healing support, or (2) tools
and strategies that copy OSD content from an old (Filestore) device to
a new (BlueStore) one.

Deploying new OSDs with BlueStore
=================================

Use BlueStore when deploying new OSDs (for example, when the cluster is
expanded). Because this is the default behavior, no specific change is
needed.

Similarly, use BlueStore for any OSDs that have been reprovisioned after
a failed drive was replaced.

Converting existing OSDs
========================

Mark-``out`` replacement
------------------------

The simplest approach is to verify that the cluster is healthy and
then follow these steps for each Filestore OSD in succession: mark the OSD
``out``, wait for the data to replicate across the cluster, reprovision the OSD,
mark the OSD back ``in``, and wait for recovery to complete before proceeding
to the next OSD. This approach is easy to automate, but it entails unnecessary
data migration that carries costs in time and SSD wear.

#. Identify a Filestore OSD to replace::

ID=<osd-id-number>
DEVICE=<disk-device>

You can tell whether a given OSD is Filestore or BlueStore with:
#. Determine whether a given OSD is Filestore or BlueStore:

.. prompt:: bash $
.. prompt:: bash $

ceph osd metadata $ID | grep osd_objectstore
ceph osd metadata $ID | grep osd_objectstore

You can get a current count of Filestore and BlueStore OSDs with:
#. Get a current count of Filestore and BlueStore OSDs:

.. prompt:: bash $
.. prompt:: bash $

ceph osd count-metadata osd_objectstore
ceph osd count-metadata osd_objectstore

#. Mark the Filestore OSD ``out``:
#. Mark a Filestore OSD ``out``:

.. prompt:: bash $

ceph osd out $ID

#. Wait for the data to migrate off the OSD in question:
#. Wait for the data to migrate off this OSD:

.. prompt:: bash $

Expand All @@ -75,7 +74,7 @@ longer to complete.

systemctl kill ceph-osd@$ID

#. Note which device this OSD is using:
#. Note which device the OSD is using:

.. prompt:: bash $

Expand All @@ -87,41 +86,43 @@ longer to complete.

umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy
the contents of the device; be certain the data on the device is
not needed (i.e., that the cluster is healthy) before proceeding:
#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! These commands will destroy
the contents of the device; you must be certain that the data on the device is
not needed (in other words, that the cluster is healthy) before proceeding:

.. prompt:: bash $

ceph-volume lvm zap $DEVICE

#. Tell the cluster the OSD has been destroyed (and a new OSD can be
reprovisioned with the same ID):
#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
reprovisioned with the same OSD ID):

.. prompt:: bash $

ceph osd destroy $ID --yes-i-really-mean-it

#. Provision a BlueStore OSD in its place with the same OSD ID.
This requires you do identify which device to wipe based on what you saw
mounted above. BE CAREFUL! Also note that hybrid OSDs may require
adjustments to these commands:
#. Provision a BlueStore OSD in place by using the same OSD ID. This requires
you to identify which device to wipe, and to make certain that you target
the correct and intended device, using the information that was retrieved
when we directed you to "[N]ote which device the OSD is using". BE CAREFUL!
Note that you may need to modify these commands when dealing with hybrid
OSDs:

.. prompt:: bash $

ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.

You can allow balancing of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to
only zap / destroy OSDs within a single CRUSH failure domain, e.g. ``host`` or
``rack``. Failure to do so will reduce the redundancy and availability of
your data and increase the risk of (or even cause) data loss.

You may opt to (1) have the balancing of the replacement BlueStore OSD take
place concurrently with the draining of the next Filestore OSD, or instead
(2) follow the same procedure for multiple OSDs in parallel. In either case,
however, you must ensure that the cluster is fully clean (in other words, that
all data has all replicas) before destroying any OSDs. If you opt to reprovision
multiple OSDs in parallel, be **very** careful to destroy OSDs only within a
single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to
satisfy this requirement will reduce the redundancy and availability of your
data and increase the risk of data loss (or even guarantee data loss).

Advantages:

Expand All @@ -131,11 +132,10 @@ Advantages:

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
cluster (to maintain the desired number of replicas), and then again
* Data is copied over the network twice: once to another OSD in the
cluster (to maintain the specified number of replicas), and again
back to the reprovisioned BlueStore OSD.


Whole host replacement
----------------------

Expand Down

0 comments on commit 7f61cd8

Please sign in to comment.