Podservice/Risk analysis

From Steeple

Risk analysis

This section will look at the risks to the podcasting processing service (part of the OUCS Podcasting Service and related to the Steeple Development Project), primarily from a hardware point of view.

Severity is rated as:

  • Very Low – No disruption to services
  • Low – Minor disruption to services amounting to minutes of downtime
  • Medium – Disruption to services requiring 30 minutes or more of downtime
  • High – Services offline for 24 hours or more
  • Very High – Services offline for 1 week or more
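
As an illustration only, the severity scale can be written as a small lookup from outage length to rating. This is a minimal sketch: the numeric ranks 1 to 5 and the reading of "minutes of downtime" as "under 30 minutes" are assumptions made for the example, not part of the original scale.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric ranks are an assumption, used only so ratings can be compared.
    VERY_LOW = 1   # no disruption to services
    LOW = 2        # minutes of downtime
    MEDIUM = 3     # 30 minutes or more of downtime
    HIGH = 4       # offline for 24 hours or more
    VERY_HIGH = 5  # offline for 1 week or more


def classify_severity(downtime_minutes: float) -> Severity:
    """Map an outage length in minutes onto the severity scale."""
    if downtime_minutes <= 0:
        return Severity.VERY_LOW
    if downtime_minutes < 30:            # assumed upper bound for "minutes"
        return Severity.LOW
    if downtime_minutes < 24 * 60:
        return Severity.MEDIUM
    if downtime_minutes < 7 * 24 * 60:
        return Severity.HIGH
    return Severity.VERY_HIGH
```

For example, `classify_severity(45)` gives `Severity.MEDIUM`, because 45 minutes is more than 30 minutes but less than 24 hours.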

Likelihood is rated as:

  • Very Low – Unlikely to occur more than once in five years
  • Low – Unlikely to occur more than once in a year
  • Medium – Unlikely to occur more than once a month
  • High – Likely to occur more than once a month
  • Very High – Likely to occur more than once a week
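
The likelihood scale can be sketched in the same way. The original wording ("unlikely to occur more than once in...") is read here as a threshold on the expected number of events per year; that reading, the numeric ranks and the function name are assumptions for illustration.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    # Numeric ranks are an assumption, used only so ratings can be compared.
    VERY_LOW = 1   # at most about once in five years
    LOW = 2        # at most about once a year
    MEDIUM = 3     # at most about once a month
    HIGH = 4       # more than once a month
    VERY_HIGH = 5  # more than once a week


def classify_likelihood(events_per_year: float) -> Likelihood:
    """Map an expected yearly event count onto the likelihood scale."""
    if events_per_year <= 1 / 5:
        return Likelihood.VERY_LOW
    if events_per_year <= 1:
        return Likelihood.LOW
    if events_per_year <= 12:
        return Likelihood.MEDIUM
    if events_per_year <= 52:
        return Likelihood.HIGH
    return Likelihood.VERY_HIGH
```

Under this reading, an event expected about twice a year is rated `Likelihood.MEDIUM`.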

Impact gives a brief outline of the disruption to services and systems. Because limited funding necessitates a minimal setup, very little redundancy or high availability is built into the solution. In most cases, therefore, the SAN being offline means that the Podcast Processing Service and Development systems are also offline (though the published content is not).


Each risk below is recorded under five headings: Risk, Severity, Likelihood, Impact and Mitigation.
Risk: Single disk failure in RAID box
Severity: Very Low
Likelihood: Low
Impact: Negligible impact in any volume due to the use of RAID 1 or 5 within LUNs.
Mitigation: A hot spare is available for the metadata LUN; a disk failure will automatically trigger a rebuild (RAID 1) and alert administrators. A hot spare is available for the extra LUNs. RAID 5 systems are able to tolerate a single disk failure without data compromise or loss of service.
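
The tolerance claims in this and the following entry can be summarised as a small check. This is a sketch under stated assumptions: RAID 1 is taken to be a two-disk mirror, hot-spare rebuilds are not modelled, and the names `TOLERATED_FAILURES` and `lun_survives` are hypothetical.

```python
# Concurrent member-disk failures each RAID level survives without data loss.
# Assumption: RAID 1 here means a two-disk mirror.
TOLERATED_FAILURES = {"RAID 1": 1, "RAID 5": 1, "RAID 6": 2}


def lun_survives(raid_level: str, failed_disks: int) -> bool:
    """True if a LUN at this RAID level keeps its data with this many failed disks."""
    return failed_disks <= TOLERATED_FAILURES[raid_level]
```

A single failure is survivable at every level listed, while two concurrent failures in the same LUN are survivable only with RAID 6.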

Risk: Two disk failures in RAID box:
  a) Same LUN, same time
  b) Same LUN, within 12 hours
  c) Different LUNs
Severity:
  a) High – Very High
  b) Medium – Very High
  c) Very Low
Likelihood:
  a) Very Low (11)
  b) Low
  c) Low
Impact:
  a) If metadata LUN, this would temporarily cripple the XSAN and all services reliant upon it. If data LUNs, this could result in catastrophic data loss within the XSAN.
  b) If metadata LUN, the damage will depend on the time taken to rebuild onto the spare disk. If the second failure hits the functional disk before the rebuild is complete, the result is the same as (a) above. If it occurs after the rebuild is complete, there is no impact on operation. If within data LUNs, this could result in catastrophic data loss, as a data rebuild to a spare disk is likely to exceed 12 hours.
  c) This should be the same as the single disk failure scenario.
Mitigation:
  a) For the metadata LUN, assuming the two drives are lost, the hot spare would be initialised and data may be restorable from backup (12), and/or the MDC will attempt to rebuild the cache from a scan of the entire cluster. For data LUNs, backups can be restored to working volumes once replacement drives are fitted. Some data loss will occur. Alternatively, capacity permitting, a RAID 6 (or 60) solution will allow for two disk failures without compromising data or service.
  b) Same as for (a).
  c) Same as for single disk failure.
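
To give a feel for scenario (b), the chance of a second failure during a rebuild window can be estimated as follows. This is a rough sketch only: it assumes independent failures at a constant rate, which footnote (11) suggests is optimistic for disks from the same batch, and the 3% annual failure rate and six surviving disks used below are hypothetical figures, not measurements from this system.

```python
import math


def second_failure_probability(annual_failure_rate: float,
                               surviving_disks: int,
                               window_hours: float) -> float:
    """Chance that at least one surviving disk fails within the window.

    Assumes independent failures with a constant hazard rate per disk.
    """
    hours_per_year = 24 * 365
    # Per-hour hazard rate implied by the annual failure rate.
    rate = -math.log(1.0 - annual_failure_rate) / hours_per_year
    p_one_disk = 1.0 - math.exp(-rate * window_hours)
    return 1.0 - (1.0 - p_one_disk) ** surviving_disks


# Hypothetical inputs: 3% annual failure rate, 6 surviving disks, 12 hour window.
p = second_failure_probability(0.03, 6, 12)
```

With these inputs the result is well under one in a thousand per incident, consistent with the Low likelihood rating. A longer rebuild window or correlated failures would raise it.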

Risk: Single power supply failure or loss of one power channel:
  a) In RAID Box
  b) In MDC
  c) In Head Node
  d) In XGrid Node
  e) In Fibre Channel Switch
  f) In Metadata Ethernet Switch
  g) In Public Ethernet Switch
Severity:
  a) Very Low
  b) Very Low
  c) Very Low
  d) Very Low
  e) Very High
  f) Low
  g) Low – High
Likelihood:
  a) Low
  b) Low
  c) Low
  d) Low
  e) Low
  f) Low
  g) Low
Impact:
  a) No impact due to dual supply.
  b) No impact due to dual supply.
  c) No impact due to dual supply.
  d) No impact due to dual supply.
  e) XSAN crippled, with minor data loss possible.
  f) XSAN performance degraded as metadata switches to the public LAN.
  g) External access to the podcasting system cut, minor data loss possible.
Mitigation:
  a) Dual PSUs in the unit, connected to two different feeds, should prevent any problems caused by a single PSU failure. Replacement of the faulty component should be possible without disruption to services.
  b) As (a).
  c) As (a).
  d) As (a).
  e) With only a single path solution in place, loss of the FC Switch will stop all data transfers. The solution is to have a dual path setup with a second FC Switch on a separate power circuit, and/or to use FC Switches with dual PSUs.
  f) This should be a minor inconvenience, but could be removed by using a more expensive Ethernet switch containing dual PSUs.
  g) Dual PSUs in the Ethernet switch will mitigate this failure. (13)
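
The power supply entry above amounts to a search for single points of failure. A minimal sketch follows, with PSU counts inferred from the impacts and mitigations listed; the single PSU assumed for the two switches and the "unknown" value for the public switch (see footnote 13) are readings of the text, not an inventory.

```python
# PSU counts inferred from the entry above; None means unknown (footnote 13).
PSU_COUNT = {
    "RAID Box": 2,
    "MDC": 2,
    "Head Node": 2,
    "XGrid Node": 2,
    "Fibre Channel Switch": 1,        # assumed single PSU
    "Metadata Ethernet Switch": 1,    # assumed single PSU
    "Public Ethernet Switch": None,
}


def at_risk_from_single_psu_failure(psu_count: dict) -> list:
    """List devices that may go down when one PSU or one power feed is lost."""
    return [name for name, count in psu_count.items()
            if count is None or count < 2]
```

Calling `at_risk_from_single_psu_failure(PSU_COUNT)` returns the three switches, which matches the items rated above Very Low.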

Risk: Loss of two power channels, or concurrent failure of dual PSUs
Severity: Medium – Very High
Likelihood: Very Low
Impact: Loss of service until power is restored. Minor risk of some data loss (i.e. data in mid transit).
Mitigation: The OUCS machine room UPS should provide a limited amount of protection, allowing the systems to be shut down in an orderly fashion and thus avoiding data loss. It is unclear whether this is an automatic procedure, or whether automatic shutdown is limited to systems with their own UPSes attached.

Risk: Failure of FC controller in:
  a) RAID Box
  b) MDC
  c) Head Node
  d) XGrid Node
  e) Fibre Channel Switch
Severity:
  a) Very Low
  b) Very Low
  c) Very High (14)
  d) Medium – Very High
  e) Very High
Likelihood:
  a) Low
  b) Low
  c) Low
  d) Low
  e) Low
Impact:
  a) No impact due to dual controllers.
  b) Minimal performance impact due to dual MDCs.
  c) Podcasting processing service would be offline pending part replacement.
  d) Podcasting processing degraded until the part is replaced.
  e) XSAN crippled. No Podcasting service available.
Mitigation:
  a) No further options.
  b) Possibility to install secondary FC interface cards, but at £380 a machine.
  c) Same as (b).
  d) Same as (b). This also assumes that the processing workflow is not dependent upon a specific machine (such as for encoding licences). If it is, then a secondary option is to buy extra licences and have a spare processing node available.
  e) Dual path Fibre Channel infrastructure would negate the loss of all SAN connectivity resulting from the switch failing.
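
The difference between single path and dual path Fibre Channel in item (e) can be illustrated with a simple reachability check. This is a sketch, not a model of the real cabling: the device names are simplified and "FC switch B" is a hypothetical second switch.

```python
from collections import deque


def can_reach(links, source, target, failed=frozenset()):
    """Breadth-first search over undirected links, skipping failed devices."""
    if source in failed or target in failed:
        return False
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append(nxt)
    return False


single_path = [("Head Node", "FC switch A"), ("FC switch A", "RAID Box")]
# Hypothetical second switch giving every device an alternative route.
dual_path = single_path + [("Head Node", "FC switch B"),
                           ("FC switch B", "RAID Box")]
```

With `failed={"FC switch A"}`, the Head Node can no longer reach the RAID Box over `single_path`, but still can over `dual_path`.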


Risk: Fibre Channel cable failure between switch and:
  a) RAID Box
  b) MDC
  c) Head Node
  d) XGrid Node
Severity:
  a) Very Low
  b) Very Low
  c) High (15)
  d) High
Likelihood:
  a) Low
  b) Low
  c) Low
  d) Low
Impact:
  a) No impact due to dual controllers being connected from RAID to switch.
  b) Minimal performance drop in SAN as secondary MDC takes over.
  c) Podcasting processing service would be offline pending a cable replacement.
  d) Podcasting processing degraded until cable is replaced.
Mitigation:
  a) Dual controllers already provide a form of dual path redundancy, at least to the switch.
  b) Dual MDCs already provide a form of dual path redundancy for the SAN.
  c) Dual path FC infrastructure would negate this loss of service. Another possibility is to investigate some form of temporary local store queuing that might act as a buffer, allowing jobs to be submitted and then queued until the shared file store is available. The service would thus appear online, with outputs delayed.
  d) Same as (c).
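
The local queuing idea in item (c) might look something like the sketch below. It is purely illustrative: `BufferedSubmitter` and its methods are hypothetical names and do not describe how Podcast Producer handles submissions.

```python
from collections import deque


class BufferedSubmitter:
    """Accept jobs at any time; process them only while the shared store is up."""

    def __init__(self, process, store_available):
        self._process = process                  # callable doing the real work
        self._store_available = store_available  # callable returning True/False
        self._pending = deque()                  # local first-in, first-out buffer

    def submit(self, job):
        """Queue the job locally, then try to work through the queue."""
        self._pending.append(job)
        self.drain()

    def drain(self):
        """Process queued jobs in order for as long as the store is reachable."""
        while self._pending and self._store_available():
            self._process(self._pending.popleft())

    @property
    def queued(self):
        return len(self._pending)
```

Jobs submitted while the store is down stay in the local buffer; calling `drain()` once the store is back processes them in submission order.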


Risk: Failure of one Metadata Controller (MDC) – e.g. motherboard failure.
Severity: Very Low
Likelihood: Low
Impact: Loss of one MDC will only cause a slight performance dip in the XSAN.
Mitigation: The dual MDC setup negates this critical problem. The dual system disks and PSUs in the MDCs negate the two most common causes for XServes to fail.

Risk: Failure of the Head Node – e.g. motherboard failure.
Severity: High
Likelihood: Low
Impact: Podcasting processing service will be offline.
Mitigation: There is currently no provision for failover to an alternative machine if the defined Podcast Producer Head Node fails. Service can, however, resume at two alternative levels:
  1. The Development server can be manually tasked with suitable workflows, assuming it is in a position to be so deployed.
  2. Processing can potentially be done using manual techniques on desktop machines, similar to the early iTunes U setup days (assuming compatible workflows).


Risk: Failure of an XGrid Node – e.g. PcP engine
Severity: Low – High
Likelihood: Low
Impact: Reduced performance of the podcasting processing service, or (dependent on licensing and workflow) loss of particular workflows for processing (e.g. service offline for some).
Mitigation: One option is to ensure that whatever codecs and software are used in all workflows are available on all machines in the XGrid; however, this may involve licensing costs, or may cause software conflicts with services running on particular machines. Adding additional XGrid nodes will increase capacity and decrease the impact of a node being offline for any period (such as for an upgrade). Podcast Producer should allow a job to be put on hold automatically until the necessary node is available again.

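
The dependence of workflows on per-node software can be checked mechanically. The following is a minimal sketch; the node names, codec names and workflow names are all hypothetical.

```python
def runnable_workflows(workflow_needs, node_software, offline=frozenset()):
    """Return workflows for which some online node has all required software."""
    online = [software for node, software in node_software.items()
              if node not in offline]
    return sorted(
        workflow for workflow, needs in workflow_needs.items()
        if any(needs <= software for software in online)
    )


# Hypothetical inventory: only node-1 holds the licensed codec-b.
node_software = {"node-1": {"codec-a", "codec-b"}, "node-2": {"codec-a"}}
workflow_needs = {"audio": {"codec-a"}, "video": {"codec-a", "codec-b"}}
```

Here, losing node-2 only reduces capacity, whereas losing node-1 removes the "video" workflow entirely, which is the range the Low – High severity rating reflects.
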
Risk: Data corruption/loss in XSAN LUN
Severity: High
Likelihood: Low
Impact: If in the metadata LUN, this could potentially take the XSAN offline whilst the cache is rebuilt. If in another data LUN, the damage would depend on what was corrupted; however, loss of data or service is likely to be low and limited.
Mitigation: For the metadata LUN, rebuilding the cache is the only known solution at present. For data LUNs, after identifying the corruption, it may be possible to restore the file from backups (where existing) or, alternatively, to overwrite it by resubmitting inputs from Archive.

Risk: Critical software update for:
  a) OS X
  b) XSAN
  c) Podcast Producer
  d) Secondary software
Severity:
  a) Medium
  b) Medium
  c) Medium
  d) Medium
Likelihood:
  a) Medium
  b) Low
  c) Low
  d) Medium
Impact:
  a) Likely to require the machine to be offline for the duration of installation and testing. See failure of particular machines for impact.
  b) Same as (a).
  c) Podcasting processing service will be offline whilst the Head Node is updated.
  d) Podcasting processing service may suffer disruption whilst software critical for workflows is updated.
Mitigation:
  a) The SAN can be maintained by doing a rolling upgrade on the MDCs. Without secondary machines for other services, those services will be impacted as described.
  b) Same as (a).
  c) Presenting the service independently of the cluster it is hosted on, and having a dual cluster setup, would be one option for negating downtime here. Testing updates on the development machine, and potentially on another experimental box, could help minimise downtime caused by unforeseen interaction issues.
  d) Multiple instances of the software and a rolling upgrade should minimise any impact of extra software updates.
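
The rolling upgrade mentioned for the MDCs follows a simple rule: take one machine out at a time and never the last one in service. A minimal sketch, where the `upgrade` callable stands in for whatever the real update procedure is:

```python
def rolling_upgrade(nodes, upgrade):
    """Upgrade nodes one at a time, keeping at least one other node in service."""
    in_service = set(nodes)
    completed = []
    for node in nodes:
        if len(in_service) < 2:
            raise RuntimeError("no other node in service to cover the upgrade")
        in_service.discard(node)   # take this node out of service
        upgrade(node)              # hypothetical update step
        in_service.add(node)       # return it to service before moving on
        completed.append(node)
    return completed
```

With two MDCs the loop completes without the SAN losing its controller; with a single machine (as for the Head Node) it refuses to start, which is why those services are impacted as described.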

Risk: Unforeseen / undocumented issues caused by updates or user configuration
Severity: Low – Very High (16)
Likelihood: This is highly dependent upon the configuration of the development server in relation to the production server.
Impact: This is similar to the Critical Software Update issue.
Mitigation: Having a development system set up in a near pre-production configuration will help identify these issues before updates are applied to the production system. However, this is only true if the development and production systems are closely aligned, and this may not often be the case. Having another single-box PcP setup to act as a pre-production test bed (and as a service backup) would reduce this risk further and increase service uptime.

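
How closely the development and production systems are aligned can be checked by comparing their recorded component versions. This is a sketch only; the version strings below are hypothetical placeholders, not the versions actually deployed.

```python
def config_drift(development, production):
    """Return {component: (dev_version, prod_version)} for every mismatch."""
    drift = {}
    for component in sorted(set(development) | set(production)):
        dev = development.get(component)    # None if missing on development
        prod = production.get(component)    # None if missing on production
        if dev != prod:
            drift[component] = (dev, prod)
    return drift


# Hypothetical version records for the two systems.
development = {"OS X": "v2", "XSAN": "v1", "Podcast Producer": "v1"}
production = {"OS X": "v1", "XSAN": "v1", "Podcast Producer": "v1"}
```

An empty result suggests that testing on the development system is representative; any entry marks a component where an update could behave differently in production.
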
Risk: A failure (or series of failures) that takes the entire Apple cluster offline
Severity: Very High
Likelihood: Very Low
Impact: Podcasting processing service would be offline.
Mitigation: Hosting of published content on a separate system (which will need to be assessed in another document) mitigates a catastrophic system failure by ensuring the public view of podcasting content remains available. The likelihood of this happening is dependent upon the exact configuration (e.g. single/dual path Fibre Channel, workflows and licensing).

Risk: Fibre Channel firmware needs upgrade
Severity: Medium
Likelihood: Low
Impact: If single path installation, then the SAN is offline, and therefore the Podcasting processing service also. If dual path, no interruption to service.
Mitigation: Dual path FC will allow a rolling upgrade to be performed on the FC Switches without disrupting SAN usage.

Risk: Fibre Channel needs reconfiguring
Severity: Medium
Likelihood: Low – Medium
Impact: If single path installation, then the SAN is offline, and therefore the Podcasting processing service also. If dual path, negligible interruption to service.
Mitigation: Dual path FC will allow a rolling upgrade to be performed on the FC Switches without disrupting SAN usage.

Risk: XSAN is misconfigured and requires correction
Severity: Medium – Very High
Likelihood: Low – High
Impact: Misconfiguration can lead to catastrophic data loss, performance issues, corruption of data and more.
Mitigation: Expertise in XSAN setup is essential from the outset to avoid costly problems. Having an experienced consultant available to manage system setup, and providing adequate training for the support staff, should reduce the likelihood to low (or very low) and reduce any changes due to improvements/upgrades to medium severity issues. Good backup and archive practices will help reduce the risk of permanent data loss. Having the published content available in an independent system will minimise public awareness of a service disruption.

Risk: No qualified support staff available to rectify issues
Severity: Medium – Very High
Likelihood: Low
Impact: Issues could range from XSAN to OS X to services to third party applications, with impacts ranging from the service being offline to catastrophic data loss.
Mitigation: Training of the named support staff in the first instance will decrease response times and reduce the risk that no suitably qualified person is available to address the issue. Low service level agreements will provide more time to resolve matters within users' expectations (though this is not anticipated to be a long term allowance). Whilst many aspects of the cluster and software can be understood with generic or similar knowledge (e.g. UNIX), specific awareness of XSAN 2, Podcast Producer, XGrid and Open Directory is essential.

(11) Two disk failures are considered a little more likely than combining the odds of a single disk failure. This is because most RAID solutions that ship with disks feature disks from the same vendor and often, the same manufacturing batch. This would suggest that multiple disks could fail around the same time due to sharing so many characteristics. We have discussed this issue with our preferred SAN supplier and they confirmed that they liaise very closely with the disk manufacturers and put disk sets together from different batches to reduce this problem.

(12) This aspect of system management needs to be confirmed via XSAN training or with Apple.

(13) This switch is part of existing infrastructure. Uncertain at time of writing whether device already has dual PSUs.

(14) No spare FC controller cards are planned to be in stock, therefore will require ordering of parts upon failure.

(15) Assumes no spare cable and that there is only a single path FC solution in place.

(16) This is highly dependent upon the configuration of the development server in relation to the production server.