Jump to contentJump to page navigation: previous page [access key p]/next page [access key n]
Applies to SUSE OpenStack Cloud 8

6 Object Storage

6.1 Introduction to Object Storage

OpenStack Object Storage (swift) is used for redundant, scalable data storage using clusters of standardized servers to store petabytes of accessible data. It is a long-term storage system for large amounts of static data which can be retrieved and updated. Object Storage uses a distributed architecture with no central point of control, providing greater scalability, redundancy, and permanence. Objects are written to multiple hardware devices, with the OpenStack software responsible for ensuring data replication and integrity across the cluster. Storage clusters scale horizontally by adding new nodes. Should a node fail, OpenStack works to replicate its content from other active nodes. Because OpenStack uses software logic to ensure data replication and distribution across different devices, inexpensive commodity hard drives and servers can be used in lieu of more expensive equipment.

Object Storage is ideal for cost effective, scale-out storage. It provides a fully distributed, API-accessible storage platform that can be integrated directly into applications or used for backup, archiving, and data retention.

6.2 Features and benefits



Leverages commodity hardware

No lock-in, lower price/GB.

HDD/node failure agnostic

Self-healing, reliable, data redundancy protects from failures.

Unlimited storage

Large and flat namespace, highly scalable read/write access, able to serve content directly from storage system.

Multi-dimensional scalability

Scale-out architecture: Scale vertically and horizontally-distributed storage. Backs up and archives large amounts of data with linear performance.

Account/container/object structure

No nesting, not a traditional file system: Optimized for scale, it scales to multiple petabytes and billions of objects.

Built-in replication 3✕ + data redundancy (compared with 2✕ on RAID)

A configurable number of accounts, containers and object copies for high availability.

Easily add capacity (unlike RAID resize)

Elastic data scaling with ease.

No central database

Higher performance, no bottlenecks.

RAID not required

Handle many small, random reads and writes efficiently.

Built-in management utilities

Account management: Create, add, verify, and delete users; Container management: Upload, download, and verify; Monitoring: Capacity, host, network, log trawling, and cluster health.

Drive auditing

Detect drive failures preempting data corruption.

Expiring objects

Users can set an expiration time or a TTL on an object to control access.

Direct object access

Enable direct browser access to content, such as for a control panel.

Realtime visibility into client requests

Know what users are requesting.

Supports S3 API

Utilize tools that were designed for the popular S3 API.

Restrict containers per account

Limit access to control usage by user.

Support for NetApp, Nexenta, Solidfire

Unified support for block volumes using a variety of storage systems.

Snapshot and backup API for block volumes.

Data protection and recovery for VM data.

Standalone volume API available

Separate endpoint and API for integration with other compute systems.

Integration with Compute

Fully integrated with Compute for attaching block volumes and reporting on usage.

6.3 Object Storage characteristics

The key characteristics of Object Storage are that:

  • All objects stored in Object Storage have a URL.

  • All objects stored are replicated 3✕ in as-unique-as-possible zones, which can be defined as a group of drives, a node, a rack, and so on.

  • All objects have their own metadata.

  • Developers interact with the object storage system through a RESTful HTTP API.

  • Object data can be located anywhere in the cluster.

  • The cluster scales by adding additional nodes without sacrificing performance, which allows a more cost-effective linear storage expansion than fork-lift upgrades.

  • Data does not have to be migrated to an entirely new storage system.

  • New nodes can be added to the cluster without downtime.

  • Failed nodes and disks can be swapped out without downtime.

  • It runs on industry-standard hardware, such as Dell, HP, and Supermicro.

Object Storage (swift)

Figure 6.1:

Developers can either write directly to the Swift API or use one of the many client libraries that exist for all of the popular programming languages, such as Java, Python, Ruby, and C#. Amazon S3 and RackSpace Cloud Files users should be very familiar with Object Storage. Users new to object storage systems will have to adjust to a different approach and mindset than those required for a traditional filesystem.

6.4 Components

Object Storage uses the following components to deliver high availability, high durability, and high concurrency:

  • Proxy servers - Handle all of the incoming API requests.

  • Rings - Map logical names of data to locations on particular disks.

  • Zones - Isolate data from other zones. A failure in one zone does not impact the rest of the cluster as data replicates across zones.

  • Accounts and containers - Each account and container are individual databases that are distributed across the cluster. An account database contains the list of containers in that account. A container database contains the list of objects in that container.

  • Objects - The data itself.

  • Partitions - A partition stores objects, account databases, and container databases and helps manage locations where data lives in the cluster.

Object Storage building blocks

Figure 6.2:

6.4.1 Proxy servers

Proxy servers are the public face of Object Storage and handle all of the incoming API requests. Once a proxy server receives a request, it determines the storage node based on the object's URL, for example: https://swift.example.com/v1/account/container/object. Proxy servers also coordinate responses, handle failures, and coordinate timestamps.

Proxy servers use a shared-nothing architecture and can be scaled as needed based on projected workloads. A minimum of two proxy servers should be deployed for redundancy. If one proxy server fails, the others take over.

For more information concerning proxy server configuration, see Configuration Reference.

6.4.2 Rings

A ring represents a mapping between the names of entities stored on disks and their physical locations. There are separate rings for accounts, containers, and objects. When other components need to perform any operation on an object, container, or account, they need to interact with the appropriate ring to determine their location in the cluster.

The ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the ring is replicated, by default, three times across the cluster, and partition locations are stored in the mapping maintained by the ring. The ring is also responsible for determining which devices are used for handoff in failure scenarios.

Data can be isolated into zones in the ring. Each partition replica is guaranteed to reside in a different zone. A zone could represent a drive, a server, a cabinet, a switch, or even a data center.

The partitions of the ring are equally divided among all of the devices in the Object Storage installation. When partitions need to be moved around (for example, if a device is added to the cluster), the ring ensures that a minimum number of partitions are moved at a time, and only one replica of a partition is moved at a time.

You can use weights to balance the distribution of partitions on drives across the cluster. This can be useful, for example, when differently sized drives are used in a cluster.

The ring is used by the proxy server and several background processes (like replication).

The ring

Figure 6.3:

These rings are externally managed. The server processes themselves do not modify the rings, they are instead given new rings modified by other tools.

The ring uses a configurable number of bits from an MD5 hash for a path as a partition index that designates a device. The number of bits kept from the hash is known as the partition power, and 2 to the partition power indicates the partition count. Partitioning the full MD5 hash ring allows other parts of the cluster to work in batches of items at once which ends up either more efficient or at least less complex than working with each item separately or the entire cluster all at once.

Another configurable value is the replica count, which indicates how many of the partition-device assignments make up a single ring. For a given partition number, each replica's device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that would improve the availability of multiple replicas at the same time.

6.4.3 Zones

Object Storage allows configuring zones in order to isolate failure boundaries. If possible, each data replica resides in a separate zone. At the smallest level, a zone could be a single drive or a grouping of a few drives. If there were five object storage servers, then each server would represent its own zone. Larger deployments would have an entire rack (or multiple racks) of object servers, each representing a zone. The goal of zones is to allow the cluster to tolerate significant outages of storage servers without losing all replicas of the data.


Figure 6.4:

6.4.4 Accounts and containers

Each account and container is an individual SQLite database that is distributed across the cluster. An account database contains the list of containers in that account. A container database contains the list of objects in that container.

Accounts and containers

Figure 6.5:

To keep track of object data locations, each account in the system has a database that references all of its containers, and each container database references each object.

6.4.5 Partitions

A partition is a collection of stored data. This includes account databases, container databases, and objects. Partitions are core to the replication system.

Think of a partition as a bin moving throughout a fulfillment center warehouse. Individual orders get thrown into the bin. The system treats that bin as a cohesive entity as it moves throughout the system. A bin is easier to deal with than many little things. It makes for fewer moving parts throughout the system.

System replicators and object uploads/downloads operate on partitions. As the system scales up, its behavior continues to be predictable because the number of partitions is a fixed number.

Implementing a partition is conceptually simple, a partition is just a directory sitting on a disk with a corresponding hash table of what it contains.


Figure 6.6:

6.4.6 Replicators

In order to ensure that there are three copies of the data everywhere, replicators continuously examine each partition. For each local partition, the replicator compares it against the replicated copies in the other zones to see if there are any differences.

The replicator knows if replication needs to take place by examining hashes. A hash file is created for each partition, which contains hashes of each directory in the partition. Each of the three hash files is compared. For a given partition, the hash files for each of the partition's copies are compared. If the hashes are different, then it is time to replicate, and the directory that needs to be replicated is copied over.

This is where partitions come in handy. With fewer things in the system, larger chunks of data are transferred around (rather than lots of little TCP connections, which is inefficient) and there is a consistent number of hashes to compare.

The cluster eventually has a consistent behavior where the newest data has a priority.


Figure 6.7:

If a zone goes down, one of the nodes containing a replica notices and proactively copies data to a handoff location.

6.4.7 Use cases

The following sections show use cases for object uploads and downloads and introduce the components. Upload

A client uses the REST API to make a HTTP request to PUT an object into an existing container. The cluster receives the request. First, the system must figure out where the data is going to go. To do this, the account name, container name, and object name are all used to determine the partition where this object should live.

Then a lookup in the ring figures out which storage nodes contain the partitions in question.

The data is then sent to each storage node where it is placed in the appropriate partition. At least two of the three writes must be successful before the client is notified that the upload was successful.

Next, the container database is updated asynchronously to reflect that there is a new object in it.

Object Storage in use

Figure 6.8: Download

A request comes in for an account/container/object. Using the same consistent hashing, the partition name is generated. A lookup in the ring reveals which storage nodes contain that partition. A request is made to one of the storage nodes to fetch the object and, if that fails, requests are made to the other nodes.

6.5 Ring-builder

Use the swift-ring-builder utility to build and manage rings. This utility assigns partitions to devices and writes an optimized Python structure to a gzipped, serialized file on disk for transmission to the servers. The server processes occasionally check the modification time of the file and reload in-memory copies of the ring structure as needed. If you use a slightly older version of the ring, one of the three replicas for a partition subset will be incorrect because of the way the ring-builder manages changes to the ring. You can work around this issue.

The ring-builder also keeps its own builder file with the ring information and additional data required to build future rings. It is very important to keep multiple backup copies of these builder files. One option is to copy the builder files out to every server while copying the ring files themselves. Another is to upload the builder files into the cluster itself. If you lose the builder file, you have to create a new ring from scratch. Nearly all partitions would be assigned to different devices and, therefore, nearly all of the stored data would have to be replicated to new locations. So, recovery from a builder file loss is possible, but data would be unreachable for an extended time.

6.5.1 Ring data structure

The ring data structure consists of three top level fields: a list of devices in the cluster, a list of lists of device ids indicating partition to device assignments, and an integer indicating the number of bits to shift an MD5 hash to calculate the partition for the hash.

6.5.2 Partition assignment list

This is a list of array('H') of devices ids. The outermost list contains an array('H') for each replica. Each array('H') has a length equal to the partition count for the ring. Each integer in the array('H') is an index into the above list of devices. The partition list is known internally to the Ring class as _replica2part2dev_id.

So, to create a list of device dictionaries assigned to a partition, the Python code would look like:

devices = [self.devs[part2dev_id[partition]] for
part2dev_id in self._replica2part2dev_id]

That code is a little simplistic because it does not account for the removal of duplicate devices. If a ring has more replicas than devices, a partition will have more than one replica on a device.

array('H') is used for memory conservation as there may be millions of partitions.

6.5.3 Overload

The ring builder tries to keep replicas as far apart as possible while still respecting device weights. When it can not do both, the overload factor determines what happens. Each device takes an extra fraction of its desired partitions to allow for replica dispersion; after that extra fraction is exhausted, replicas are placed closer together than optimal.

The overload factor lets the operator trade off replica dispersion (durability) against data dispersion (uniform disk usage).

The default overload factor is 0, so device weights are strictly followed.

With an overload factor of 0.1, each device accepts 10% more partitions than it otherwise would, but only if it needs to maintain partition dispersion.

For example, consider a 3-node cluster of machines with equal-size disks; node A has 12 disks, node B has 12 disks, and node C has 11 disks. The ring has an overload factor of 0.1 (10%).

Without the overload, some partitions would end up with replicas only on nodes A and B. However, with the overload, every device can accept up to 10% more partitions for the sake of dispersion. The missing disk in C means there is one disk's worth of partitions to spread across the remaining 11 disks, which gives each disk in C an extra 9.09% load. Since this is less than the 10% overload, there is one replica of each partition on each node.

However, this does mean that the disks in node C have more data than the disks in nodes A and B. If 80% full is the warning threshold for the cluster, node C's disks reach 80% full while A and B's disks are only 72.7% full.

6.5.4 Replica counts

To support the gradual change in replica counts, a ring can have a real number of replicas and is not restricted to an integer number of replicas.

A fractional replica count is for the whole ring and not for individual partitions. It indicates the average number of replicas for each partition. For example, a replica count of 3.2 means that 20 percent of partitions have four replicas and 80 percent have three replicas.

The replica count is adjustable. For example:

$ swift-ring-builder account.builder set_replicas 4
$ swift-ring-builder account.builder rebalance

Removing unneeded replicas saves on the cost of disks.

You can gradually increase the replica count at a rate that does not adversely affect cluster performance. For example:

$ swift-ring-builder object.builder set_replicas 3.01
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...

$ swift-ring-builder object.builder set_replicas 3.02
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...

Changes take effect after the ring is rebalanced. Therefore, if you intend to change from 3 replicas to 3.01 but you accidentally type 2.01, no data is lost.

Additionally, the swift-ring-builder X.builder create command can now take a decimal argument for the number of replicas.

6.5.5 Partition shift value

The partition shift value is known internally to the Ring class as _part_shift. This value is used to shift an MD5 hash to calculate the partition where the data for that hash should reside. Only the top four bytes of the hash is used in this process. For example, to compute the partition for the /account/container/object path using Python:

partition = unpack_from('>I',
md5('/account/container/object').digest())[0] >>

For a ring generated with part_power P, the partition shift value is 32 - P.

6.5.6 Build the ring

The ring builder process includes these high-level steps:

  1. The utility calculates the number of partitions to assign to each device based on the weight of the device. For example, for a partition at the power of 20, the ring has 1,048,576 partitions. One thousand devices of equal weight each want 1,048.576 partitions. The devices are sorted by the number of partitions they desire and kept in order throughout the initialization process.


    Each device is also assigned a random tiebreaker value that is used when two devices desire the same number of partitions. This tiebreaker is not stored on disk anywhere, and so two different rings created with the same parameters will have different partition assignments. For repeatable partition assignments, RingBuilder.rebalance() takes an optional seed value that seeds the Python pseudo-random number generator.

  2. The ring builder assigns each partition replica to the device that requires most partitions at that point while keeping it as far away as possible from other replicas. The ring builder searches for a device in a different zone, or on a different server. If it does not find one, it looks for a device with no replicas. Finally, if all options are exhausted, the ring builder assigns the replica to the device that has the fewest replicas already assigned.


    The ring builder assigns multiple replicas to one device only if the ring has fewer devices than it has replicas.

  3. When building a new ring from an old ring, the ring builder recalculates the desired number of partitions that each device wants.

  4. The ring builder unassigns partitions and gathers these partitions for reassignment, as follows:

    • The ring builder unassigns any assigned partitions from any removed devices and adds these partitions to the gathered list.

    • The ring builder unassigns any partition replicas that can be spread out for better durability and adds these partitions to the gathered list.

    • The ring builder unassigns random partitions from any devices that have more partitions than they need and adds these partitions to the gathered list.

  5. The ring builder reassigns the gathered partitions to devices by using a similar method to the one described previously.

  6. When the ring builder reassigns a replica to a partition, the ring builder records the time of the reassignment. The ring builder uses this value when it gathers partitions for reassignment so that no partition is moved twice in a configurable amount of time. The RingBuilder class knows this configurable amount of time as min_part_hours. The ring builder ignores this restriction for replicas of partitions on removed devices because removal of a device happens on device failure only, and reassignment is the only choice.

These steps do not always perfectly rebalance a ring due to the random nature of gathering partitions for reassignment. To help reach a more balanced ring, the rebalance process is repeated until near perfect (less than 1 percent off) or when the balance does not improve by at least 1 percent (indicating we probably cannot get perfect balance due to wildly imbalanced zones or too many partitions recently moved).

6.6 Cluster architecture

6.6.1 Access tier

Large-scale deployments segment off an access tier, which is considered the Object Storage system's central hub. The access tier fields the incoming API requests from clients and moves data in and out of the system. This tier consists of front-end load balancers, ssl-terminators, and authentication services. It runs the (distributed) brain of the Object Storage system: the proxy server processes.


If you want to use OpenStack Identity API v3 for authentication, you have the following options available in /etc/swift/dispersion.conf: auth_version, user_domain_name, project_domain_name, and project_name.

Object Storage architecture

Figure 6.9:

Because access servers are collocated in their own tier, you can scale out read/write access regardless of the storage capacity. For example, if a cluster is on the public Internet, requires SSL termination, and has a high demand for data access, you can provision many access servers. However, if the cluster is on a private network and used primarily for archival purposes, you need fewer access servers.

Since this is an HTTP addressable storage service, you may incorporate a load balancer into the access tier.

Typically, the tier consists of a collection of 1U servers. These machines use a moderate amount of RAM and are network I/O intensive. Since these systems field each incoming API request, you should provision them with two high-throughput (10GbE) interfaces - one for the incoming front-end requests and the other for the back-end access to the object storage nodes to put and fetch data. Factors to consider

For most publicly facing deployments as well as private deployments available across a wide-reaching corporate network, you use SSL to encrypt traffic to the client. SSL adds significant processing load to establish sessions between clients, which is why you have to provision more capacity in the access layer. SSL may not be required for private deployments on trusted networks.

6.6.2 Storage nodes

In most configurations, each of the five zones should have an equal amount of storage capacity. Storage nodes use a reasonable amount of memory and CPU. Metadata needs to be readily available to return objects quickly. The object stores run services not only to field incoming requests from the access tier, but to also run replicators, auditors, and reapers. You can provision object stores provisioned with single gigabit or 10 gigabit network interface depending on the expected workload and desired performance.

Object Storage (swift)

Figure 6.10:

Currently, a 2 TB or 3 TB SATA disk delivers good performance for the price. You can use desktop-grade drives if you have responsive remote hands in the datacenter and enterprise-grade drives if you don't. Factors to consider

You should keep in mind the desired I/O performance for single-threaded requests. This system does not use RAID, so a single disk handles each request for an object. Disk performance impacts single-threaded response rates.

To achieve apparent higher throughput, the object storage system is designed to handle concurrent uploads/downloads. The network I/O capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired concurrent throughput needs for reads and writes.

6.7 Replication

Because each replica in Object Storage functions independently and clients generally require only a simple majority of nodes to respond to consider an operation successful, transient failures like network partitions can quickly cause replicas to diverge. These differences are eventually reconciled by asynchronous, peer-to-peer replicator processes. The replicator processes traverse their local file systems and concurrently perform operations in a manner that balances load across physical disks.

Replication uses a push model, with records and files generally only being copied from local to remote replicas. This is important because data on the node might not belong there (as in the case of hand offs and ring changes), and a replicator cannot know which data it should pull in from elsewhere in the cluster. Any node that contains data must ensure that data gets to where it belongs. The ring handles replica placement.

To replicate deletions in addition to creations, every deleted record or file in the system is marked by a tombstone. The replication process cleans up tombstones after a time period known as the consistency window. This window defines the duration of the replication and how long transient failure can remove a node from the cluster. Tombstone cleanup must be tied to replication to reach replica convergence.

If a replicator detects that a remote drive has failed, the replicator uses the get_more_nodes interface for the ring to choose an alternate node with which to synchronize. The replicator can maintain desired levels of replication during disk failures, though some replicas might not be in an immediately usable location.


The replicator does not maintain desired levels of replication when failures such as entire node failures occur; most failures are transient.

The main replication types are:

  • Database replication

    Replicates containers and objects.

  • Object replication

    Replicates object data.

6.7.1 Database replication

Database replication completes a low-cost hash comparison to determine whether two replicas already match. Normally, this check can quickly verify that most databases in the system are already synchronized. If the hashes differ, the replicator synchronizes the databases by sharing records added since the last synchronization point.

This synchronization point is a high water mark that notes the last record at which two databases were known to be synchronized, and is stored in each database as a tuple of the remote database ID and record ID. Database IDs are unique across all replicas of the database, and record IDs are monotonically increasing integers. After all new records are pushed to the remote database, the entire synchronization table of the local database is pushed, so the remote database can guarantee that it is synchronized with everything with which the local database was previously synchronized.

If a replica is missing, the whole local database file is transmitted to the peer by using rsync(1) and is assigned a new unique ID.

In practice, database replication can process hundreds of databases per concurrency setting per second (up to the number of available CPUs or disks) and is bound by the number of database transactions that must be performed.

6.7.2 Object replication

The initial implementation of object replication performed an rsync to push data from a local partition to all remote servers where it was expected to reside. While this worked at small scale, replication times skyrocketed once directory structures could no longer be held in RAM. This scheme was modified to save a hash of the contents for each suffix directory to a per-partition hashes file. The hash for a suffix directory is no longer valid when the contents of that suffix directory is modified.

The object replication process reads in hash files and calculates any invalidated hashes. Then, it transmits the hashes to each remote server that should hold the partition, and only suffix directories with differing hashes on the remote server are rsynced. After pushing files to the remote server, the replication process notifies it to recalculate hashes for the rsynced suffix directories.

The number of uncached directories that object replication must traverse, usually as a result of invalidated suffix directory hashes, impedes performance. To provide acceptable replication speeds, object replication is designed to invalidate around 2 percent of the hash space on a normal node each day.

6.8 Large object support

Object Storage (swift) uses segmentation to support the upload of large objects. By default, Object Storage limits the download size of a single object to 5GB. Using segmentation, uploading a single object is virtually unlimited. The segmentation process works by fragmenting the object, and automatically creating a file that sends the segments together as a single object. This option offers greater upload speed with the possibility of parallel uploads.

6.8.1 Large objects

The large object is comprised of two types of objects:

  • Segment objects store the object content. You can divide your content into segments, and upload each segment into its own segment object. Segment objects do not have any special features. You create, update, download, and delete segment objects just as you would normal objects.

  • A manifest object links the segment objects into one logical large object. When you download a manifest object, Object Storage concatenates and returns the contents of the segment objects in the response body of the request. The manifest object types are:

    • Static large objects

    • Dynamic large objects

To find out more information on large object support, see Large objects in the OpenStack End User Guide, or Large Object Support in the developer documentation.

6.9 Object Auditor

On system failures, the XFS file system can sometimes truncate files it is trying to write and produce zero-byte files. The object-auditor will catch these problems but in the case of a system crash it is advisable to run an extra, less rate limited sweep, to check for these specific files. You can run this command as follows:

$ swift-object-auditor /path/to/object-server/config/file.conf once -z 1000

"-z" means to only check for zero-byte files at 1000 files per second.

It is useful to run the object auditor on a specific device or set of devices. You can run the object-auditor once as follows:

$ swift-object-auditor /path/to/object-server/config/file.conf once \

This will run the object auditor on only the sda and sdb devices. This parameter accepts a comma-separated list of values.

6.10 Erasure coding

Erasure coding is a set of algorithms that allows the reconstruction of missing data from a set of original data. In theory, erasure coding uses less capacity with similar durability characteristics as replicas. From an application perspective, erasure coding support is transparent. Object Storage (swift) implements erasure coding as a Storage Policy. See Storage Policies for more details.

There is no external API related to erasure coding. Create a container using a Storage Policy; the interaction with the cluster is the same as any other durability policy. Because support implements as a Storage Policy, you can isolate all storage devices that associate with your cluster's erasure coding capability. It is entirely possible to share devices between storage policies, but for erasure coding it may make more sense to use not only separate devices but possibly even entire nodes dedicated for erasure coding.


The erasure code support in Object Storage is considered beta in Kilo. Most major functionality is included, but it has not been tested or validated at large scale. This feature relies on ssync for durability. We recommend deployers do extensive testing and not deploy production data using an erasure code storage policy. If any bugs are found during testing, please report them to https://bugs.launchpad.net/swift

6.11 Account reaper

The purpose of the account reaper is to remove data from the deleted accounts.

A reseller marks an account for deletion by issuing a DELETE request on the account's storage URL. This action sets the status column of the account_stat table in the account database and replicas to DELETED, marking the account's data for deletion.

Typically, a specific retention time or undelete are not provided. However, you can set a delay_reaping value in the [account-reaper] section of the account-server.conf file to delay the actual deletion of data. At this time, to undelete you have to update the account database replicas directly, set the status column to an empty string and update the put_timestamp to be greater than the delete_timestamp.


It is on the development to-do list to write a utility that performs this task, preferably through a REST call.

The account reaper runs on each account server and scans the server occasionally for account databases marked for deletion. It only fires up on the accounts for which the server is the primary node, so that multiple account servers aren't trying to do it simultaneously. Using multiple servers to delete one account might improve the deletion speed but requires coordination to avoid duplication. Speed really is not a big concern with data deletion, and large accounts aren't deleted often.

Deleting an account is simple. For each account container, all objects are deleted and then the container is deleted. Deletion requests that fail will not stop the overall process but will cause the overall process to fail eventually (for example, if an object delete times out, you will not be able to delete the container or the account). The account reaper keeps trying to delete an account until it is empty, at which point the database reclaim process within the db_replicator will remove the database files.

A persistent error state may prevent the deletion of an object or container. If this happens, you will see a message in the log, for example:

Account <name> has not been reaped since <date>

You can control when this is logged with the reap_warn_after value in the [account-reaper] section of the account-server.conf file. The default value is 30 days.

6.12 Configure project-specific image locations with Object Storage

For some deployers, it is not ideal to store all images in one place to enable all projects and users to access them. You can configure the Image service to store image data in project-specific image locations. Then, only the following projects can use the Image service to access the created image:

  • The project who owns the image

  • Projects that are defined in swift_store_admin_tenants and that have admin-level accounts

To configure project-specific image locations

  1. Configure swift as your default_store in the glance-api.conf file.

  2. Set these configuration options in the glance-api.conf file:

    • swift_store_multi_tenant

      Set to True to enable tenant-specific storage locations. Default is False.

    • swift_store_admin_tenants

      Specify a list of tenant IDs that can grant read and write access to all Object Storage containers that are created by the Image service.

With this configuration, images are stored in an Object Storage service (swift) endpoint that is pulled from the service catalog for the authenticated user.

6.13 Object Storage monitoring


This section was excerpted from a blog post by Darrell Bishop and has since been edited.

An OpenStack Object Storage cluster is a collection of many daemons that work together across many nodes. With so many different components, you must be able to tell what is going on inside the cluster. Tracking server-level meters like CPU utilization, load, memory consumption, disk usage and utilization, and so on is necessary, but not sufficient.

6.13.1 Swift Recon

The Swift Recon middleware (see Defining Storage Policies) provides general machine statistics, such as load average, socket statistics, /proc/meminfo contents, as well as Swift-specific meters:

  • The MD5 sum of each ring file.

  • The most recent object replication time.

  • Count of each type of quarantined file: Account, container, or object.

  • Count of "async_pendings" (deferred container updates) on disk.

Swift Recon is middleware that is installed in the object servers pipeline and takes one required option: A local cache directory. To track async_pendings, you must set up an additional cron job for each object server. You access data by either sending HTTP requests directly to the object server or using the swift-recon command-line client.

There are Object Storage cluster statistics but the typical server meters overlap with existing server monitoring systems. To get the Swift-specific meters into a monitoring system, they must be polled. Swift Recon acts as a middleware meters collector. The process that feeds meters to your statistics system, such as collectd and gmond, should already run on the storage node. You can choose to either talk to Swift Recon or collect the meters directly.

6.13.2 Swift-Informant

Swift-Informant middleware (see swift-informant) has real-time visibility into Object Storage client requests. It sits in the pipeline for the proxy server, and after each request to the proxy server it sends three meters to a StatsD server:

  • A counter increment for a meter like obj.GET.200 or cont.PUT.404.

  • Timing data for a meter like acct.GET.200 or obj.GET.200. [The README says the meters look like duration.acct.GET.200, but I do not see the duration in the code. I am not sure what the Etsy server does but our StatsD server turns timing meters into five derivative meters with new segments appended, so it probably works as coded. The first meter turns into acct.GET.200.lower, acct.GET.200.upper, acct.GET.200.mean, acct.GET.200.upper_90, and acct.GET.200.count].

  • A counter increase by the bytes transferred for a meter like tfer.obj.PUT.201.

This is used for receiving information on the quality of service clients experience with the timing meters, as well as sensing the volume of the various modifications of a request server type, command, and response code. Swift-Informant requires no change to core Object Storage code because it is implemented as middleware. However, it gives no insight into the workings of the cluster past the proxy server. If the responsiveness of one storage node degrades, you can only see that some of the requests are bad, either as high latency or error status codes.

6.13.3 Statsdlog

The Statsdlog project increments StatsD counters based on logged events. Like Swift-Informant, it is also non-intrusive, however statsdlog can track events from all Object Storage daemons, not just proxy-server. The daemon listens to a UDP stream of syslog messages, and StatsD counters are incremented when a log line matches a regular expression. Meter names are mapped to regex match patterns in a JSON file, allowing flexible configuration of what meters are extracted from the log stream.

Currently, only the first matching regex triggers a StatsD counter increment, and the counter is always incremented by one. There is no way to increment a counter by more than one or send timing data to StatsD based on the log line content. The tool could be extended to handle more meters for each line and data extraction, including timing data. But a coupling would still exist between the log textual format and the log parsing regexes, which would themselves be more complex to support multiple matches for each line and data extraction. Also, log processing introduces a delay between the triggering event and sending the data to StatsD. It would be preferable to increment error counters where they occur and send timing data as soon as it is known to avoid coupling between a log string and a parsing regex and prevent a time delay between events and sending data to StatsD.

The next section describes another method for gathering Object Storage operational meters.

6.13.4 Swift StatsD logging

StatsD (see Measure Anything, Measure Everything) was designed for application code to be deeply instrumented. Meters are sent in real-time by the code that just noticed or did something. The overhead of sending a meter is extremely low: a sendto of one UDP packet. If that overhead is still too high, the StatsD client library can send only a random portion of samples and StatsD approximates the actual number when flushing meters upstream.

To avoid the problems inherent with middleware-based monitoring and after-the-fact log processing, the sending of StatsD meters is integrated into Object Storage itself. The submitted change set (see https://review.openstack.org/#change,6058) currently reports 124 meters across 15 Object Storage daemons and the tempauth middleware. Details of the meters tracked are in the Administrator's Guide.

The sending of meters is integrated with the logging framework. To enable, configure log_statsd_host in the relevant config file. You can also specify the port and a default sample rate. The specified default sample rate is used unless a specific call to a statsd logging method (see the list below) overrides it. Currently, no logging calls override the sample rate, but it is conceivable that some meters may require accuracy (sample_rate=1) while others may not.

log_statsd_host =
log_statsd_port = 8125
log_statsd_default_sample_rate = 1

Then the LogAdapter object returned by get_logger(), usually stored in self.logger, has these new methods:

  • set_statsd_prefix(self, prefix) Sets the client library stat prefix value which gets prefixed to every meter. The default prefix is the name of the logger such as object-server, container-auditor, and so on. This is currently used to turn proxy-server into one of proxy-server.Account, proxy-server.Container, or proxy-server.Object as soon as the Controller object is determined and instantiated for the request.

  • update_stats(self, metric, amount, sample_rate=1) Increments the supplied meter by the given amount. This is used when you need to add or subtract more that one from a counter, like incrementing suffix.hashes by the number of computed hashes in the object replicator.

  • increment(self, metric, sample_rate=1) Increments the given counter meter by one.

  • decrement(self, metric, sample_rate=1) Lowers the given counter meter by one.

  • timing(self, metric, timing_ms, sample_rate=1) Record that the given meter took the supplied number of milliseconds.

  • timing_since(self, metric, orig_time, sample_rate=1) Convenience method to record a timing meter whose value is "now" minus an existing timestamp.


These logging methods may safely be called anywhere you have a logger object. If StatsD logging has not been configured, the methods are no-ops. This avoids messy conditional logic each place a meter is recorded. These example usages show the new logging methods:

# swift/obj/replicator.py
def update(self, job):
     # ...
    begin = time.time()
        hashed, local_hash = tpool.execute(tpooled_get_hashes, job['path'],
                do_listdir=(self.replication_count % 10) == 0,
        # See tpooled_get_hashes "Hack".
        if isinstance(hashed, BaseException):
            raise hashed
        self.suffix_hash += hashed
        self.logger.update_stats('suffix.hashes', hashed)
        # ...
        self.partition_times.append(time.time() - begin)
        self.logger.timing_since('partition.update.timing', begin)
# swift/container/updater.py
def process_container(self, dbfile):
    # ...
    start_time = time.time()
    # ...
        for event in events:
            if 200 <= event.wait() < 300:
                successes += 1
                failures += 1
        if successes > failures:
            # ...
            # ...
        # Only track timing data for attempted updates:
        self.logger.timing_since('timing', start_time)
        self.no_changes += 1

6.14 System administration for Object Storage

By understanding Object Storage concepts, you can better monitor and administer your storage solution. The majority of the administration information is maintained in developer documentation at docs.openstack.org/developer/swift/.

See the OpenStack Configuration Reference for a list of configuration options for Object Storage.

6.15 Troubleshoot Object Storage

For Object Storage, everything is logged in /var/log/syslog (or messages on some distros). Several settings enable further customization of logging, such as log_name, log_facility, and log_level, within the object server configuration files.

6.15.1 Drive failure Problem

Drive failure can prevent Object Storage performing replication. Solution

In the event that a drive has failed, the first step is to make sure the drive is unmounted. This will make it easier for Object Storage to work around the failure until it has been resolved. If the drive is going to be replaced immediately, then it is just best to replace the drive, format it, remount it, and let replication fill it up.

If you cannot replace the drive immediately, then it is best to leave it unmounted, and remove the drive from the ring. This will allow all the replicas that were on that drive to be replicated elsewhere until the drive is replaced. Once the drive is replaced, it can be re-added to the ring.

You can look at error messages in the /var/log/kern.log file for hints of drive failure.

6.15.2 Server failure Problem

The server is potentially offline, and may have failed, or require a reboot. Solution

If a server is having hardware issues, it is a good idea to make sure the Object Storage services are not running. This will allow Object Storage to work around the failure while you troubleshoot.

If the server just needs a reboot, or a small amount of work that should only last a couple of hours, then it is probably best to let Object Storage work around the failure and get the machine fixed and back online. When the machine comes back online, replication will make sure that anything that is missing during the downtime will get updated.

If the server has more serious issues, then it is probably best to remove all of the server's devices from the ring. Once the server has been repaired and is back online, the server's devices can be added back into the ring. It is important that the devices are reformatted before putting them back into the ring as it is likely to be responsible for a different set of partitions than before.

6.15.3 Detect failed drives Problem

When drives fail, it can be difficult to detect that a drive has failed, and the details of the failure. Solution

It has been our experience that when a drive is about to fail, error messages appear in the /var/log/kern.log file. There is a script called swift-drive-audit that can be run via cron to watch for bad drives. If errors are detected, it will unmount the bad drive, so that Object Storage can work around it. The script takes a configuration file with the following settings:

Table 6.1: Description of configuration options for [drive-audit] in drive-audit.conf

Configuration option = Default value


device_dir = /srv/node

Directory devices are mounted under

error_limit = 1

Number of errors to find before a device is unmounted

log_address = /dev/log

Location where syslog sends the logs to

log_facility = LOG_LOCAL0

Syslog log facility

log_file_pattern = /var/log/kern.*[!.][!g][!z]

Location of the log file with globbing pattern to check against device errors locate device blocks with errors in the log file

log_level = INFO

Logging level

log_max_line_length = 0

Caps the length of log lines to the value given; no limit if set to 0, the default.

log_to_console = False

No help text available for this option.

minutes = 60

Number of minutes to look back in /var/log/kern.log

recon_cache_path = /var/cache/swift

Directory where stats for a few items will be stored

regex_pattern_1 = \berror\b.*\b(dm-[0-9]{1,2}\d?)\b

No help text available for this option.

unmount_failed_device = True

No help text available for this option.


This script has only been tested on Ubuntu 10.04; use with caution on other operating systems in production.

6.15.4 Emergency recovery of ring builder files Problem

An emergency might prevent a successful backup from restoring the cluster to operational status. Solution

You should always keep a backup of swift ring builder files. However, if an emergency occurs, this procedure may assist in returning your cluster to an operational state.

Using existing swift tools, there is no way to recover a builder file from a ring.gz file. However, if you have a knowledge of Python, it is possible to construct a builder file that is pretty close to the one you have lost.


This procedure is a last-resort for emergency circumstances. It requires knowledge of the swift python code and may not succeed.

  1. Load the ring and a new ringbuilder object in a Python REPL:

    >>> from swift.common.ring import RingData, RingBuilder
    >>> ring = RingData.load('/path/to/account.ring.gz')
  2. Start copying the data we have in the ring into the builder:

    >>> import math
    >>> partitions = len(ring._replica2part2dev_id[0])
    >>> replicas = len(ring._replica2part2dev_id)
    >>> builder = RingBuilder(int(math.log(partitions, 2)), replicas, 1)
    >>> builder.devs = ring.devs
    >>> builder._replica2part2dev = ring._replica2part2dev_id
    >>> builder._last_part_moves_epoch = 0
    >>> from array import array
    >>> builder._last_part_moves = array('B', (0 for _ in xrange(partitions)))
    >>> builder._set_parts_wanted()
    >>> for d in builder._iter_devs():
                d['parts'] = 0
    >>> for p2d in builder._replica2part2dev:
                for dev_id in p2d:
                    builder.devs[dev_id]['parts'] += 1
    This is the extent of the recoverable fields.
  3. For min_part_hours you either have to remember what the value you used was, or just make up a new one:

    >>> builder.change_min_part_hours(24) # or whatever you want it to be
  4. Validate the builder. If this raises an exception, check your previous code:

    >>> builder.validate()
  5. After it validates, save the builder and create a new account.builder:

    >>> import pickle
    >>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)
    >>> exit ()
  6. You should now have a file called account.builder in the current working directory. Run swift-ring-builder account.builder write_ring and compare the new account.ring.gz to the account.ring.gz that you started from. They probably are not byte-for-byte identical, but if you load them in a REPL and their _replica2part2dev_id and devs attributes are the same (or nearly so), then you are in good shape.

  7. Repeat the procedure for container.ring.gz and object.ring.gz, and you might get usable builder files.

Print this page