Cloud Storage: API Updates


Big News: We’ve rewritten our Cloud Storage API! Thanks to the new platform, Hummingbird, the system is much more stable and faster. Hummingbird is essentially a reimplementation of several OpenStack Swift components in Go. In this article, we’ll look at how we implemented Hummingbird and what problems it helped us solve.

OpenStack Swift Object Storage Model

The OpenStack Swift object storage model includes several integrated elements:

  • Proxy server. The proxy server receives requests and data from the end user, sends service requests to the appropriate storage components, and then formulates and returns the response.
  • Account and container servers. These work on similar principles: each stores metadata together with a list (of containers for account servers, of files for container servers) in separate sqlite databases and serves this data on request. Because they are so similar, we’ve combined the two services.
  • Object servers. These store user files along with file-level metadata kept in extended attributes. This is the most basic abstraction level: here we only have objects (as files) that can be written to specific partitions.

Each of these components is a separate daemon. All of the daemons in a cluster are connected by rings: hash tables used to determine the location of objects in the cluster. Rings are built separately (by the ring builder) and are delivered to every node in the cluster. Rings define how many replicas of an object should be created to ensure maximum data durability (the standard recommendation is three copies of every object, each saved on a separate server). They also set the number of partitions (Swift's internal structure) to create and allocate devices to zones and regions. The ring also provides a list of handoff devices that data will be sent to if the primary devices are ever unavailable.
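To make the ring a little more concrete, here is a heavily simplified sketch (in Go) of the lookup it performs: hash the object's path, take the top bits of the hash as the partition number, and map each replica of that partition to a device. The structure and field names below are purely illustrative; the real ring is built offline by the ring builder, mixes a secret hash prefix and suffix into the path, and also tracks zones, regions, and handoff devices.

```go
// A much-simplified, illustrative sketch of a Swift-style ring lookup.
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

type Device struct {
	ID     int
	Zone   int
	Region int
	Addr   string
}

type Ring struct {
	PartShift     uint    // 32 minus the partition power
	Devices       []Device
	ReplicaToPart [][]int // for each replica: partition -> device index
}

// GetNodes returns the primary devices holding the replicas of an object.
func (r *Ring) GetNodes(account, container, object string) []Device {
	h := md5.Sum([]byte(account + "/" + container + "/" + object))
	part := binary.BigEndian.Uint32(h[:4]) >> r.PartShift
	nodes := make([]Device, 0, len(r.ReplicaToPart))
	for _, part2dev := range r.ReplicaToPart {
		nodes = append(nodes, r.Devices[part2dev[part]])
	}
	return nodes
}

func main() {
	// Toy ring: partition power 2 (4 partitions), 3 replicas, 3 devices.
	r := &Ring{
		PartShift: 30,
		Devices: []Device{
			{ID: 0, Zone: 1, Region: 1, Addr: "10.0.0.1:6000"},
			{ID: 1, Zone: 2, Region: 1, Addr: "10.0.0.2:6000"},
			{ID: 2, Zone: 3, Region: 1, Addr: "10.0.0.3:6000"},
		},
		ReplicaToPart: [][]int{
			{0, 1, 2, 0},
			{1, 2, 0, 1},
			{2, 0, 1, 2},
		},
	}
	fmt.Println(r.GetNodes("AUTH_test", "photos", "cat.jpg"))
}
```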

Let’s take a closer look at all of these components.

Proxy Server

We originally used the standard swift-proxy, but as workloads and our own code grew, we switched over to gevent and gunicorn. We later replaced gunicorn with uwsgi because the latter handles heavier loads better. Even so, the setup was not particularly effective: wait times for proxy connections were fairly long and, since Python itself is rather slow, we ended up needing more and more servers to process authorized traffic. In the end, it took 12 machines to process all of our traffic (all traffic, both public and private, is now processed on 3).

In the end, we rewrote the proxy server in Go. We took the prototype from Hummingbird and then wrote the middleware that implements all of our user-facing functionality: authorization, quotas, layering for static sites, symbolic links, object segmentation (dynamic and static), custom domains, versioning, and so on. Separate endpoints were also implemented for some of our special functions, such as configuring domains and user SSL certificates. For middleware chaining we use justinas/alice, and for saving global variables in the request context we use gorilla/context.
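To show the pattern, here is a minimal sketch of how such a middleware chain can be wired up with justinas/alice and gorilla/context. The authorize and quota layers below are hypothetical stand-ins for our real middleware, and the token handling is deliberately naive.

```go
// A minimal, illustrative middleware chain using justinas/alice and gorilla/context.
package main

import (
	"net/http"

	"github.com/gorilla/context"
	"github.com/justinas/alice"
)

// authorize is a stand-in auth layer: it checks for a token and stores the
// account name in the request context for the layers further down the chain.
func authorize(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := r.Header.Get("X-Auth-Token")
		if token == "" {
			http.Error(w, "Unauthorized", http.StatusUnauthorized)
			return
		}
		context.Set(r, "account", "AUTH_"+token) // real code validates the token first
		next.ServeHTTP(w, r)
	})
}

// quota is a placeholder for a layer that would reject requests over quota,
// using cached account metadata.
func quota(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ... look up the account's quota here ...
		next.ServeHTTP(w, r)
	})
}

func main() {
	final := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		account, _ := context.Get(r, "account").(string)
		w.Write([]byte("hello, " + account))
	})

	chain := alice.New(context.ClearHandler, authorize, quota).Then(final)
	http.ListenAndServe(":8080", chain)
}
```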

Directclient is used to send requests to the OpenStack Swift services and has full access to all storage components. We also actively cache account-, container-, and object-level metadata. This metadata accompanies each request and is needed to decide how the request should be handled further. To avoid flooding storage with service requests, we keep this data in memcache. So our proxy server receives a request, builds its context, and passes it through the various layers (middleware) until one of them says, “This request is for me!” That layer processes the request and returns a response to the user.
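As a rough illustration of this caching, here is a sketch of the check-cache-then-fall-back pattern using the bradfitz/gomemcache client. The cache key format, the TTL, and the fetchAccountMetaDirect helper are assumptions made for the example, not our actual code.

```go
// Illustrative sketch: account metadata caching in memcache with a direct-client fallback.
package proxy

import (
	"encoding/json"
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

type AccountMeta struct {
	BytesUsed      int64             `json:"bytes_used"`
	ContainerCount int64             `json:"container_count"`
	Metadata       map[string]string `json:"metadata"`
}

var mc = memcache.New("127.0.0.1:11211")

// getAccountMeta returns cached account metadata, querying storage only on a cache miss.
func getAccountMeta(account string) (*AccountMeta, error) {
	key := "account/" + account
	if it, err := mc.Get(key); err == nil {
		var meta AccountMeta
		if json.Unmarshal(it.Value, &meta) == nil {
			return &meta, nil
		}
	}
	// Cache miss (or an unreadable entry): ask the account server directly.
	meta, err := fetchAccountMetaDirect(account)
	if err != nil {
		return nil, err
	}
	if raw, err := json.Marshal(meta); err == nil {
		mc.Set(&memcache.Item{Key: key, Value: raw, Expiration: int32((5 * time.Minute).Seconds())})
	}
	return meta, nil
}

// fetchAccountMetaDirect is a hypothetical stand-in for the direct-client call
// that reads metadata straight from the account server.
func fetchAccountMetaDirect(account string) (*AccountMeta, error) {
	return &AccountMeta{Metadata: map[string]string{}}, nil
}
```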

All unauthorized requests to storage are first passed through Apache Traffic Server, which we’ve chosen as our caching proxy.

Since standard caching policies allow objects to stay in the cache for quite a long time (otherwise there’d be no point in having one), we wrote a separate daemon for purging the cache. It receives PUT-request events from the proxy and purges the cache for every name the modified object is reachable under (each object in storage has at least two names, userId.selcdn.ru and userId.selcdn.com; users may also attach their own domains to containers, and those caches have to be cleared as well).
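In outline, the purge daemon does something like the sketch below: for each PUT event it issues a PURGE request to Traffic Server for every public name of the object. The event structure shown here is illustrative; the real daemon's interface differs.

```go
// Illustrative sketch of a cache-purging daemon for Apache Traffic Server.
package purge

import (
	"log"
	"net/http"
)

// PutEvent describes an object that was just modified through the proxy.
type PutEvent struct {
	Path    string   // e.g. "/images/cat.jpg"
	Domains []string // userId.selcdn.ru, userId.selcdn.com, plus any custom domains
}

// Run consumes PUT events and invalidates every cached copy of each object.
func Run(events <-chan PutEvent, client *http.Client) {
	for ev := range events {
		for _, domain := range ev.Domains {
			url := "http://" + domain + ev.Path
			req, err := http.NewRequest("PURGE", url, nil)
			if err != nil {
				continue
			}
			if resp, err := client.Do(req); err != nil {
				log.Printf("purge %s: %v", url, err)
			} else {
				resp.Body.Close()
			}
		}
	}
}
```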

Accounts and Containers

The accounts layer in storage looks like this: a separate sqlite database is created for each user, storing a set of global metadata (account quota, TempURL keys, etc.) along with a list of that user’s containers. Since the number of containers a user has doesn’t run into the billions, these databases stay small and replicate quickly.

Overall, accounts work great and don’t cause any problems.

Containers are another story. Technically, they’re the same kind of sqlite databases; however, instead of a tiny list of a user’s containers, they hold the metadata for a specific container and the list of files it contains. Some of our clients store up to 100 million files in a single container. Requests to these big sqlite databases are processed slowly, and replication (taking asynchronous writes into account) becomes much more complicated.

Naturally, sqlite-based container storage had to be replaced, but with what? We tested account and container servers backed by MongoDB and Cassandra, but tying everything to a central database like that doesn’t work out in terms of horizontal scaling. Since the number of clients and files grows over time, keeping data in many small databases is preferable to one giant database with billions of records.

Automatic container sharding wouldn’t hurt either: if we could split huge containers into several sqlite databases, that would be wonderful.

Sharding can be read about here; as far as we can tell, it’s still a work in progress.

Another function that’s directly tied to container servers is object expiration. Using the X-Delete-At or X-Delete-After header, we can set a time after which an object will be deleted from object storage. These headers can be sent either when an object is created (in a PUT request) or when its metadata is modified (in a POST request); a minimal client example follows the list below. However, this hasn’t been implemented as well as we’d like. The feature was originally added with as few changes to the existing OpenStack Swift infrastructure as possible, and here the easy way was taken: the addresses of all objects with limited storage times are stored in a special account, “.expiring_objects”, which is periodically scanned by another daemon called object-expirer. This creates two additional problems:

  • The first is the service account. Whenever we make a PUT or POST request to an object with one of these headers, containers are created on this account. Each container’s name is a unix timestamp, and pseudo objects are created inside them with names that include a timestamp and the full path to the corresponding real object. A special function handles this: it contacts the container server and creates a record in its database, and the object server isn’t involved at all. As a result, heavy use of object expiration multiplies the load on the container servers.
  • The second has to do with the object-expirer daemon. It periodically walks a huge list of pseudo objects, checks their timestamps, and sends requests to delete expired files. Its main drawback is that it’s incredibly slow. Because of this, it’s not uncommon for an object to have already been deleted but to still appear in the container listing, because the corresponding record in the container database hasn’t been deleted yet.
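For reference, here is a minimal client example of setting an expiry time on an object. The endpoint, token, and object names are made up; only the X-Delete-At and X-Delete-After headers are the actual Swift API.

```go
// Minimal example: upload an object that storage will delete one hour from now.
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

func main() {
	deleteAt := time.Now().Add(time.Hour).Unix()

	req, _ := http.NewRequest("PUT",
		"https://storage.example.com/v1/AUTH_user/container/report.csv",
		strings.NewReader("object body"))
	req.Header.Set("X-Auth-Token", "TOKEN")
	req.Header.Set("X-Delete-At", strconv.FormatInt(deleteAt, 10))
	// Equivalent alternative: req.Header.Set("X-Delete-After", "3600")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```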

In our experience, it’s not uncommon to find over 200 million files queued for deletion, and object-expirer simply cannot keep up with that workload. This is why we had to write our own daemon in Go.

There is a solution that has long been in the works, and we hope that it will soon be released.

What does it do? Additional fields are added to the container database schema that let the container replicator delete the records for expired files, which solves the problem of deleted objects lingering in container listings, while the object auditor deletes the expired objects’ files themselves. This also lets us drop object-expirer entirely, which is so antiquated it belongs in a museum.

Objects

The object level is the most basic part of OpenStack Swift. At this level, there are only sets of object bytes (files) and a defined set of operations on those files (write, read, delete). There is a standard set of daemons for working with these objects:

  • Object-server — accepts requests from the proxy server and stores objects in, or returns them from, the real file system;
  • Object-replicator — implements replication logic on top of rsync;
  • Object-auditor — checks the integrity of objects and their attributes, and also puts damaged objects in quarantine so the replicator can restore the proper version from another source;
  • Object-updater — executes delayed operations that update account and container databases. These operations arise when the original updates time out, for example because of locked sqlite databases.

It seems simple enough, right? But this layer has several significant problems which still appear in every new release.

  1. Slow (very slow) object replication over rsync. If this already seems rough in a small cluster, it becomes downright awful once you’re dealing with a few billion objects. Rsync implies push-model replication: the object-replicator service walks all of the files on its node and tries to push them to every other node the file should be on. This is no longer a simple process, so hashes of partitions are now used to speed it up. You can find more information about this here.
  2. Periodic problems with object servers, which can unceremoniously lock up when an input-output operation hangs on a disk. This shows up as spikes in request response times. Why does it happen? In big clusters, there’s no guarantee that absolutely all servers are up, all disks mounted, and all file systems available. Disks on object servers periodically hang (these are ordinary HDDs with a limited number of IOPS), and since standard OpenStack Swift object servers have no mechanism for isolating operations to a single disk, this often makes entire object nodes temporarily unavailable and causes a large number of timeouts.

Luckily, an alternative to Swift was released not long ago, which effectively solves all of these problems. This is Hummingbird, which we mentioned above. We’ll take a closer look at this in the following section.

Hummingbird as an Attempt to Fix Swift

Not long ago, Rackspace started reworking OpenStack Swift and rewriting it in Go. The object layer is practically ready, including the server, replicator, and auditor. There isn’t any storage policy support at the moment, which we need for our Cloud Storage, and the only object daemon still missing is object-updater.

Overall, Hummingbird is a feature branch in the official OpenStack repository. It is under active development and may soon be merged into master. To take part in the development, see https://github.com/openstack/swift/tree/feature/hummingbird.

How is Hummingbird Better than Swift?

Firstly, the replication logic in Hummingbird has changed. The replicator walks through all of the files in the local file system and asks every node that should hold a copy, “Do you need this?” (to make routing these requests easier, it uses the REPCONN method). If the response says there’s a newer version of the file somewhere, the local file is deleted. If a file is missing on another node, a copy is sent to fill the gap. If the object was already deleted and a tombstone file with a newer timestamp is found, the local file is deleted.

Here we should explain what a tombstone file is. This is an empty file that takes an object’s place when it’s deleted.

Why do we need them? In big distributed storage systems, we can’t guarantee that absolutely every copy of an object is deleted when a DELETE request is sent. That’s because we return a “delete succeeded” response to the user after receiving n responses, where n is the quorum (with 3 copies, we need two responses). This is intentional, so that some devices can be unavailable for one reason or another (for maintenance, say); naturally, copies on those devices will not be deleted.

Moreover, once such a device returns to the cluster, the file would become available again. This is why, during deletion, the object is replaced by an empty file with a timestamp in its filename. If, while scanning the server, the replicator finds tombstone files with later timestamps alongside a copy with an older last-modified time, it means the object was deleted and the remaining copy is also subject to deletion.
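A rough sketch of that decision logic is shown below. The types and helper names are illustrative and are not Hummingbird's actual code; the real replicator also handles handoffs, quarantines, and hash comparisons.

```go
// Illustrative sketch of a pull-style replication decision for one local file.
package replicator

// FileState is one node's view of an object: the timestamp taken from the
// filename and whether that file is a tombstone (a deletion marker).
type FileState struct {
	Timestamp int64
	Tombstone bool
}

// Action is what the replicator should do with the local copy.
type Action int

const (
	Keep        Action = iota
	DeleteLocal        // the local copy is obsolete or the object was deleted
	SendCopy           // the remote node is missing the object and needs our copy
)

// Decide compares the local file with what a remote primary node reported.
func Decide(local FileState, remote *FileState) Action {
	switch {
	case remote == nil:
		// The remote node has nothing: push our copy to it.
		return SendCopy
	case remote.Timestamp > local.Timestamp && remote.Tombstone:
		// A newer tombstone exists: the object was deleted, drop the local file.
		return DeleteLocal
	case remote.Timestamp > local.Timestamp:
		// A newer version of the data exists elsewhere: the local file is obsolete.
		return DeleteLocal
	default:
		return Keep
	}
}
```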

It’s the same with object servers: disks are isolated, and semaphores limit the number of concurrent operations per disk. If a disk “hangs” for any reason or becomes overloaded, requests can simply be sent to another node.
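The idea can be sketched with a per-disk counting semaphore built from a buffered channel: an operation either gets a slot quickly or fails fast so the request can be retried elsewhere. The names and timeout handling below are illustrative, not Hummingbird's actual implementation.

```go
// Illustrative sketch: per-disk isolation with counting semaphores.
package objectserver

import (
	"errors"
	"time"
)

var errDiskBusy = errors.New("disk is busy or hung")

// DiskLimiter caps the number of in-flight I/O operations per disk.
type DiskLimiter struct {
	slots map[string]chan struct{} // device name -> semaphore
}

func NewDiskLimiter(devices []string, perDisk int) *DiskLimiter {
	l := &DiskLimiter{slots: make(map[string]chan struct{})}
	for _, d := range devices {
		l.slots[d] = make(chan struct{}, perDisk)
	}
	return l
}

// Do runs op on the given device, failing fast if the disk is saturated so the
// request can be answered with an error instead of hanging the whole server.
func (l *DiskLimiter) Do(device string, wait time.Duration, op func() error) error {
	sem, ok := l.slots[device]
	if !ok {
		return errors.New("unknown device: " + device)
	}
	select {
	case sem <- struct{}{}:
		defer func() { <-sem }()
		return op()
	case <-time.After(wait):
		return errDiskBusy
	}
}
```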

Hummingbird is making good progress; we hope it will soon be officially included in OpenStack.

We’ve transferred our entire Cloud Storage cluster to Hummingbird. This has lowered average response times from object servers, and the number of errors has dropped significantly. As we’ve already mentioned, we use our own proxy server built on the Hummingbird prototype, and we’ve replaced the object layer with the daemons from Hummingbird.

Of the standard OpenStack Swift components, only the account and container layer daemons are still in use, as well as object-updater.

Conclusion

In this article, we looked at Hummingbird and the problems we’ve been able to solve with it. If you have any questions, we’d be happy to answer them in the comments below.

Thanks to the new platform, we’ve managed to significantly expand the capabilities of our Cloud Storage API and add new functions: user management, domain management, SSL certificate management, and more. In our next article, we’ll look at these new functions in more depth.