Podservice/Storage Needs and Utilisation

From Steeple

1 Storage Needs and Utilisation

Six data areas, each with different requirements, are needed to support operations: Dropbox, Processing, Nearline, Publish, Development and Workspace. A seventh volume, Metadata, supports the SAN itself.

  • Dropbox – is used as a fairly open (public), minimally-restricted input (write only) area to accommodate large file deposits. This is in lieu of any other equivalent service available.
  • Processing – is the shared file store accessed by all systems within the XGrid. It handles the working files and temporary outputs of the Podcast Producer system.
  • Publish – is the centrally maintained web-accessible filestore hosting post-processing outputs. Currently this function is handled outside of the cluster by SysDev on their AFS system (media.podcasts.ox.ac.uk).
  • Nearline – is the intermediate temporary store of source material needing to be archived. This space is necessary to facilitate reuse and reprocessing of material without making unreasonable demands on our final archiving solution. It is worth noting that material is frequently submitted for processing, modified and then resubmitted several times over. Material for final archiving is manually sorted, and then taken from this volume.
  • Development – is a volume of space within the SAN that is dedicated to the independent development system.
  • Workspace – is a shared file store for OUCS-based works and project data, or the primary storage area in media production terminology. In local parlance, this is the replacement for the invaluable LTGShare network store.
  • Metadata – a background function not illustrated, but a necessary requirement to operate an XSAN. A volume will exist to contain the Metadata Catalogue for the XSAN. This is used primarily by the Metadata Controllers.

The above diagram illustrates the dataflow and tasks between these storage volumes. Data typically enters the system via the Dropbox, either by web upload or file transfer (sftp, scp, etc). Material that needs further work is brought to the Workspace. Material ready to be processed is then ingested by Podcast Producer and passes through the Processing store. The outputs of that process are written to the Nearline volume and (dependent on workflow) to a publish point, such as the Publish volume, or exported to yet another system. Development handles many of these individual tasks within the sandbox of its own volume and independent Podcast Producer system.
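
The flow described above can be modelled as a small directed graph. This is an illustrative sketch only: the volume names come from the text, but the `FLOWS` mapping and the `reachable` helper are hypothetical, and the direct Dropbox-to-Processing edge is an assumption (the text does not say whether ready material must pass through the Workspace first).

```python
# Volumes and the flows between them, as described in the paragraph above.
# Assumption: ready material may go straight from Dropbox to Processing.
# Development is omitted because it works within its own sandboxed volume.
FLOWS = {
    "Dropbox": ["Workspace", "Processing"],
    "Workspace": ["Processing"],
    "Processing": ["Nearline", "Publish"],
    "Nearline": [],
    "Publish": [],
}


def reachable(start, flows=FLOWS):
    """Return every volume that data entering at `start` can end up in."""
    seen = set()
    stack = [start]
    while stack:
        volume = stack.pop()
        for target in flows.get(volume, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


print(sorted(reachable("Dropbox")))
```

Under these assumptions, material deposited in the Dropbox can reach every other production volume, while Nearline and Publish are end points.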

1.1 Backup and archiving

As a full OUCS service, due care needs to be given to the storage of the digital assets being handled by the system. Each storage volume has different usage characteristics and subtly different risks associated with its content. The following table outlines the anticipated data resilience and management plan.


Volume | Capacity and Weekly Changes | Backup | Archive
------ | --------------------------- | ------ | -------
Metadata | Capacity will be less than 1 TB, and anticipated size will be less than 250 GB. This data is critical to the running of the SAN, is likely to change frequently, and is unsuitable for backup in anything other than a synchronised storage solution. | No | No
Dropbox | 1200 GB capacity. Anticipate 90% change per week. | No (6) | No
Processing | 400 GB capacity. Transient store. | No | No
Nearline | Ideally should hold 18 months' worth of material. Size to be determined by available space in SAN (7). Growth of 67.9 GB per week. | No (8) | Yes
Publish | Growth of 14.3 GB per week. | Daily (9) | No
Workspace | To support 4 video workstations and project usage, anticipate capacity to be 600 GB. Rate of data change in the region of 10% per day. | Daily | No
Development | Anticipate provisioning 1.5 TB for development usage. | No | No

(6) This area is considered unsuitable for backup, due to the open nature of the volume. We do not want to be backing up all data without reviewing its content and value.

(7) Within a 16-disk RAID solution, this is projected to be approximately 7 TB, or enough for 12 months of data under projected usage figures.

(8) This area may be set up for immediate archive, thus negating the need for a separate backup. Policies are still to be determined, depending on the funding available.

(9) This area is handled by the Systems Development Team using their own processes.


The one volume whose sizing remains uncertain is the Nearline space. Based on initial calculations, this would need to be approximately 5 TB. Total usable capacity would need to be 7.7 TB.
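
As a rough check on the Nearline figure, the weekly growth rate from the table can be multiplied out over the intended retention period. The sketch below is only an estimate under stated assumptions: a 52-week year and decimal units (1 TB = 1000 GB) are assumed, and the function name is hypothetical.

```python
NEARLINE_GROWTH_GB_PER_WEEK = 67.9  # growth figure from the table above
WEEKS_PER_YEAR = 52                 # assumption: 52-week year
GB_PER_TB = 1000                    # assumption: decimal units


def nearline_size_tb(months):
    """Estimate the space needed to retain `months` of Nearline material."""
    weeks = months * WEEKS_PER_YEAR / 12
    return NEARLINE_GROWTH_GB_PER_WEEK * weeks / GB_PER_TB


# 18 months is the ideal retention period given in the table.
print(round(nearline_size_tb(18), 1))
```

Under these assumptions, 18 months of material comes to roughly 5.3 TB, in line with the approximate 5 TB quoted above.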

Backup can be done via the IBM-based TSM system to the HFS (backup) service, though final arrangements have yet to be confirmed. Archiving destinations are also undecided, but two options are being discussed: HFS Archiving, or archiving to a potential MPU-based Digital Media Store.

Initial conversations with HFS suggest that:

  • Archiving for Nearline – can be handled by the HFS Archive project, with a customised setup that would potentially archive immediately any content placed in this volume (this needs reviewing in light of actual workflow plans and usage). The cost is £400 per TB of allocated space per year, paid up front for a 5-year period. We estimate that approximately 2 TB of data would be archived in the first year, thus requiring an initial spend of £4000.
  • Backup of the Workspace volume is understood to be necessary. HFS provides storage for two revisions of every file, as well as a 90-day restore policy for deleted content. This would reduce the need to save some revisions of files within the workspace. However, we anticipate the data churn for this volume to be large, creating problems for both backup windows and processes. It is anticipated that 7 TB of storage would need to be allocated to back up this area, based on pilot usage. The exact arrangement of this with HFS will have to be determined as a special project once the cluster is up and running.
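
The initial archive spend quoted in the first point above follows directly from the HFS pricing. A minimal sketch of that arithmetic is shown here; the rate, term and 2 TB estimate come from the text, while the constant and function names are hypothetical.

```python
COST_PER_TB_PER_YEAR_GBP = 400  # HFS archive rate quoted above
PREPAID_YEARS = 5               # paid up front for a 5-year period


def upfront_archive_cost(allocated_tb):
    """Up-front cost in GBP of allocating `allocated_tb` of archive space."""
    return allocated_tb * COST_PER_TB_PER_YEAR_GBP * PREPAID_YEARS


# First-year estimate of 2 TB of archived data.
print(upfront_archive_cost(2))
```

Each further terabyte allocated would add £2000 up front on the same terms, assuming the pricing does not change.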

Other potential solutions to consider are the use of third-party commercial storage systems (e.g. Amazon’s S3) and the purchase of backup (tape) hardware and consumables so that the service can perform archiving manually. These will need further investigation.