DTube Post Incident Report: 15-7-2018

Introduction

In light of the operational issues DTube has experienced, I felt it important to begin logging events, causes and resolutions, to give users an understanding of what occurred and why.

Background

At DTube we use Ceph, which has some terminology I'm going to briefly cover to get you up to speed.

Ceph is a distributed object & file store. We use it to store the videos which are served to users. Ceph's storage is made up of "virtual disks" called object storage devices (OSDs). In our case these are LVM volumes, but you can imagine them as standard hard drives. A file system sits on top of these OSDs and is made up of two pools: "metadata" & "data". These pools are in turn split into placement groups (PGs), and it's in these that the actual data is stored. Data is typically replicated across multiple OSDs. The system as a whole is referred to as the Ceph cluster.
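To make the terminology a little more concrete, here is a minimal sketch in Python of how an object ends up in a PG and how a PG ends up on OSDs. It is a toy model only: real Ceph uses the CRUSH algorithm for placement, whereas this uses a plain hash and a round-robin pick. The function names, the file name and the OSD count of 6 are assumptions made for illustration and are not taken from our cluster.

```python
import hashlib


def pg_for_object(name: str, pg_count: int) -> int:
    """Map an object name to a placement group using a stable hash."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % pg_count


def osds_for_pg(pg: int, osd_count: int, replicas: int) -> list[int]:
    """Pick `replicas` distinct OSDs for a PG (round-robin, NOT CRUSH)."""
    if replicas > osd_count:
        raise ValueError("cannot place more replicas than there are OSDs")
    return [(pg + i) % osd_count for i in range(replicas)]


if __name__ == "__main__":
    # Hypothetical object name and OSD count, for illustration only.
    pg = pg_for_object("video-0001.mp4", pg_count=384)
    print(pg, osds_for_pg(pg, osd_count=6, replicas=2))
```

The point to take away is that every object belongs to exactly one PG, and each PG lives on as many OSDs as the replica count asks for.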

With that brief introduction, let's talk about the setting for the incident.

The events of the 8th

Last week we encountered a similar issue where an OSD became inaccessible due to a lack of proper authorisation which was ultimately the result of a mis-typed command.

Following this, the OSD was removed, formatted and reintroduced to the cluster. Ceph then attempted to rebuild the OSD from the replicas of its previous data stored on other OSDs.

This process, along with Ceph attempting to place all of the data on the correct OSDs, led to an inability to read from and write to the cluster, & as such replication was disabled.

Ceph was allowed to re-disperse the data as it saw fit (this process took 3 days). Replication was then restored & over the following 4 days Ceph attempted to duplicate all of its data.

The events of the 16th

DTube was reported down, with uploads stalling and "white screens" appearing as downloads were unable to process.

It was discovered that one of the OSDs had failed. Files in the /var/lib/ceph/osd/ceph-N directory (which holds the authorisation key & critical filesystem information) were found to be missing. This left the OSD unable to function or communicate with the cluster.

Reintroduction

Work was performed to create fresh authentication for the failing OSD and restore it to the cluster. This took the form of removing the OSD from the cluster, deleting the known authentication for it, and then performing an install with the current data.

This proved unsuccessful as the OSD would not reintegrate while holding data.

Repair

During this time 18 (of 384) PGs were listed as being "incomplete", meaning that they didn't hold all of the data intended for them.

These PGs were marked as in need of repair and Ceph was allowed to attempt the repair while we began to look at additional solutions.

Extraction and Import

Using a list of the incomplete PGs, we attempted to use ceph-objectstore-tool to extract them from the removed OSD. An outline of the process we intended to follow is detailed here:
https://ceph.com/geen-categorie/incomplete-pgs-oh-my/

In this process the incomplete PGs are extracted from the failed OSD and then inserted directly into the OSDs that Ceph expects them to exist within.
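As a rough illustration of that process, the sketch below builds (but does not run) the export and import command lines for one PG. The flags are the ones I understand ceph-objectstore-tool to accept (--data-path, --pgid, --op, --file); treat them as an assumption and check them against your own Ceph version. The PG id "1.2f", the output directory and the helper names are hypothetical, older FileStore OSDs may additionally need --journal-path, and the OSD has to be stopped while the tool works on it.

```python
def export_cmd(osd_path: str, pgid: str, out_dir: str) -> list[str]:
    """Build the command that exports one PG from a (stopped) OSD."""
    return [
        "ceph-objectstore-tool",
        "--data-path", osd_path,
        "--pgid", pgid,
        "--op", "export",
        "--file", f"{out_dir}/{pgid}.export",
    ]


def import_cmd(osd_path: str, export_file: str) -> list[str]:
    """Build the command that imports an exported PG into another OSD."""
    return [
        "ceph-objectstore-tool",
        "--data-path", osd_path,
        "--op", "import",
        "--file", export_file,
    ]


if __name__ == "__main__":
    # Hypothetical PG id and output directory; the OSD number is left as N.
    print(" ".join(export_cmd("/var/lib/ceph/osd/ceph-N", "1.2f", "/tmp/pg-exports")))
```

Building the argument list separately from running it makes it easy to review every command for each of the incomplete PGs before anything touches an OSD.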

We were unable to extract the PGs and as a result this method proved unsuccessful.

As the repair had also failed to produce a result, we began to look at making the cluster accessible without those PGs.

Lost OSD

The missing OSD was marked "lost" and the PGs changed state from incomplete to down.

Read and write access was temporarily restored. However, upon requests for a number of vital files which were stored in Ceph (including SSL keys, scripts used to launch containers & other regularly accessed files), the cluster became stuck waiting for those requests to return. This then blocked all following requests, resulting in a halt to read and write operations.

Wipe

As this data was confirmed to be lost, restoration seemed impossible & running the cluster without that data appeared to be unfeasible in both the short and long term, we opted to completely remove the filesystem and start anew.

Ceph does allow the use of multiple filesystems, however the feature is experimental, and given the issues with Ceph we felt it unwise to continue to maintain a failing system in the hope that we might recover the data at a later date.

As such the pools were wiped and the filesystem destroyed.

New pools were created and a new filesystem set up.

We then restored a copy of the files used by the uploader and video player services from backup, which resulted in users being able to upload and watch videos again.

Summary

Due to events on the 8th the filesystem we use was left without any replicas of its data. Replication was in the process of being restored on the 16th when a storage device failed. These two events combined resulted in the loss of the only copy of some data.

Without this data the filesystem was unstable and could not be recovered.

The filesystem was destroyed and rebuilt.
The data within the filesystem could not be saved.
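The combination described in this summary can be illustrated with a small Python sketch. Given a map of which OSDs hold each PG, it returns the PGs that have no surviving copy after a set of OSDs fails. The placement maps below are invented for illustration and are far smaller than our real cluster.

```python
def lost_pgs(placement: dict[int, set[int]], failed_osds: set[int]) -> list[int]:
    """Return the PGs whose every copy lived on a failed OSD."""
    return sorted(
        pg for pg, osds in placement.items()
        if osds and osds <= failed_osds
    )


if __name__ == "__main__":
    # Hypothetical 3-PG, 3-OSD layouts; OSD 1 is the one that fails.
    fully_replicated = {0: {0, 1}, 1: {1, 2}, 2: {2, 0}}
    # Replication only partly restored: PG 1 still has a single copy.
    partly_replicated = {0: {0, 1}, 1: {1}, 2: {2, 0}}

    print(lost_pgs(fully_replicated, {1}))   # no PG loses all of its copies
    print(lost_pgs(partly_replicated, {1}))  # PG 1 had its only copy on OSD 1
```

With two copies of every PG a single OSD failure loses nothing. With replication only partly rebuilt, any PG whose sole copy sat on the failed OSD is gone, which is the situation we found ourselves in.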

The Future

Changes implemented

A duplicate copy of the keys and other data used by the OSDs has been taken and is now kept in two locations on each server in the cluster and on a number of my own personal devices.

Files in the storage cluster have been replicated to ensure that two copies are kept at all times. I will be looking to add additional storage and ensure all files are replicated three times.

We've been unable to locate the reason that the OSD authentication and filesystem data went missing. At this time it remains a mystery. Investigation is ongoing.
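For anyone wanting to do something similar, a backup of this kind could be sketched in Python as below. This is not the exact procedure used on our servers; the function names and paths are assumptions for illustration. It copies a file to several directories and verifies each copy against a SHA-256 checksum of the original.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def back_up(source: Path, destinations: list[Path]) -> str:
    """Copy `source` into each destination directory and verify every copy."""
    expected = sha256_of(source)
    for dest_dir in destinations:
        dest_dir.mkdir(parents=True, exist_ok=True)
        copy = dest_dir / source.name
        shutil.copy2(source, copy)  # copy2 also preserves file metadata
        if sha256_of(copy) != expected:
            raise IOError(f"checksum mismatch for {copy}")
    return expected
```

Returning the checksum means it can be recorded alongside the backups and used later to confirm that a copy is still intact before it is restored.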

How can you ensure files remain accessible even if DTube does not hold them?

Files uploaded to DTube are served from two sources: initially as a flat file from our Ceph storage cluster, then, should we not have a copy of the file, from IPFS.

All files uploaded to DTube are added to an IPFS node and can be requested from ipfs.io.

While DTube may not always hold a copy of your file, if it's accessible in the IPFS network it will still be played. As such it remains a wise choice to ensure that your content is stored both on DTube via standard upload & (as protection against total catastrophic failure) in your own IPFS node.
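The fallback order described in this section can be sketched in Python as follows. This is an illustration and not the player's actual code: the storage base URL "https://storage.example/ipfs/" and the hash are made up, the function names are hypothetical, and the `fetch` function is passed in by the caller so that the sketch stays independent of any particular HTTP library. The ipfs.io gateway path format (/ipfs/ followed by the hash) is the only real address used.

```python
from typing import Callable, Optional

IPFS_GATEWAY = "https://ipfs.io/ipfs/"


def candidate_urls(ipfs_hash: str, primary_base: str) -> list[str]:
    """List where to look for a file: our own storage first, then IPFS."""
    return [
        f"{primary_base.rstrip('/')}/{ipfs_hash}",
        f"{IPFS_GATEWAY}{ipfs_hash}",
    ]


def first_available(
    urls: list[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Optional[tuple[str, bytes]]:
    """Try each URL in order; return the first (url, data) that succeeds."""
    for url in urls:
        data = fetch(url)  # `fetch` returns None when the file is missing
        if data is not None:
            return url, data
    return None
```

If the first source has lost the file, the same hash is simply requested from the gateway, which is why keeping your content pinned on your own IPFS node keeps it playable.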

How can you restore access to your video?

To restore access, simply re-upload the old video, then ensure that the advanced tab shows the hashes and that they match the hashes from your previously uploaded file. You can then close the tab; DTube will have received the video and will display it on your old post.

Final words

These events, while entirely undesirable, highlight the need for distributed & decentralised systems. During the time frame in which the issues occurred, DTube lacked a form of decentralisation and this ultimately proved fatal for our data storage. Had the cluster been able to regain its distributed, replicated state, this situation would not have occurred.

I apologise that this occurred and hope that you feel comfortable with the mitigations put in place which should ensure this does not happen again.

I regret the loss of files and the inconvenience presented to creators and hope that this post serves to provide some insight into both the actions performed and the intent behind those actions.
