
3/23/2014

Enterprise Storage Advisory

Digital Edge uses a proprietary methodology when analyzing any type of enterprise storage, and we want to make it, along with feedback from industry professionals, available to the public. It can serve as a useful guide for IT experts who are not deeply involved in storage but would like to gain a high-level understanding of storage health, capacity and performance conditions. The proposed methodology is solely Digital Edge's approach to assessing enterprise storage and is not tied to any particular manufacturer or storage brand.

Before we begin, we want to remind you that storage is not just a capacity technology, but a capacity AND performance technology, and the two must be evaluated together. While capacity is very easy to analyze, performance parameters can be confusing and far less obvious.

Areas to be analyzed:

1. Capacity allocation and expected IOPS. 

2. Expected IOPS and load from servers. 

3. Network expected performance and network load from servers

4. System errors and warnings

5. Patch levels and recommendations

6. Conclusion. 

Here is a brief description of the information collected and analyzed for each item. This description also explains why we believe our methodology is both valid and convenient for a high-level assessment. This methodology may not produce completely accurate, troubleshooting-ready statistics; instead, it assesses conditions and indicators for further tuning and troubleshooting.

Some fundamental statements to simplify our analysis:

  • Enterprise storage could be SAN, NAS or a unified platform playing the role of SAN and NAS at the same time. 

  • Enterprise-class NAS is a SAN with servers attached to the SAN infrastructure that expose SAN storage to clients over NAS protocols. In EMC terminology, those servers are called "data movers." They are attached to the SAN through fiber interfaces. From the SAN's perspective, data movers are the same clients as any other servers connected to it. 

  • It is relatively easy for clients to build such servers without purchasing them from hardware manufacturers. However, servers pre-configured by manufacturers with high availability and a management interface may be beneficial. 

  • A SAN consists of controllers that are connected to the Storage Area Network through multiple Fibre Channel and/or iSCSI interfaces on the frontend and to disk trays on the backend. 

  • Capacity is provided by disks. 

  • Performance is a function of the performance parameters of the disks themselves, the controllers and the network. 

  • Each disk has pre-defined performance parameters. The faster the disk, the faster it can perform an I/O operation. 

  • The more disks participating in the I/O load, the better the performance of the system. 

  • Disks are aggregated into RAID groups. Performance of the SAN disks is a function of the configuration of the RAID groups. The performance of a RAID group depends on the number of disks included in the group, their speed and type, and the RAID penalty.

  • RAID groups are carved into LUNs. LUNs are exposed to servers. 

  • Because storage performance depends on RAID group configuration, LUNs on the same RAID group will affect each other, while LUNs on separate RAID groups will not. This holds as long as network I/O is not a bottleneck. 

  • Network performance is a function of the types and number of links to the Storage Area Network and of the processing power of the controllers. 

Logical View of SAN

 

 
    

1. Capacity Allocation and Expected IOPS. 

Capacity analysis can easily be presented in a capacity report. Capacity is shown by RAID group, along with how each RAID group is carved into LUNs. The total expected I/O performance is displayed per RAID group.

RAID Group 0 (RAID 5, Drive Type: FC, Capacity: 286GB, Percent Full: 99%, Expected IOPS: 900)
  Disks: 0/0, 0/1, 0/2, 0/3, 0/4
  LUN 61 - PROD-ORACLE-Data: Size: 286GB; Host: NYORAN1/2; Type: Oracle ASM; Used: 192GB (51%); Free: 94GB

RAID Group 3 (RAID 5, Drive Type: SATA, Capacity: 11005.93GB, Percent Full: 99%, Free: 0.928GB, Expected IOPS: 630)
  Disks: 3/0, 3/1, 3/2, 3/3, 3/4, 3/5, 5/1
  LUN 16 - PROD-VMStore1: Size: 2048GB; Host: ESXi1/2/3/4/5; Type: VM Datastore; Used: 1.4TB; Free: 614GB
  LUN 29 - PROD-VMStore5: Size: 1TB; Host: ESXi1/2/3/4/5; Type: VM Datastore; Used: 969GB; Free: 55GB
  LUN 30 - PROD-SQLCLUSTER_DATA: Size: 500GB; Host: NYSQL1/2; Type: Windows; Used: 299GB; Free: 201GB
  LUN 0 - Place Holder: Size: 1GB; Host: None

RAID Group 9 (RAID 5, Drive Type: SATA, Capacity: 5502GB, Percent Full: 83%, Free: 927GB, Expected IOPS: 360)
  Disks: 5/2, 5/3, 5/4, 5/5
  LUN 41 - PROD-ORACLE-LOGS: Size: 500GB; Host: NYORA1/2; Type: Oracle ASM; Used: 47GB; Free: 453GB
  LUN 42 - PROD-SQLCLUSTER_LOG: Size: 500GB; Host: NYSQL1/2; Type: Windows; Used: 136GB; Free: 453GB
  LUN 45 - PROD-EXCH-DATA: Size: 500GB; Host: EX1/2; Type: Windows; Used: 284GB; Free: 216GB
  LUN 46 - PROD-EXCH_LOG: Size: 1.4TB; Host: EX1/2; Type: Windows; Used: 699GB; Free: 1.3TB
  LUN 49 - QA-VMDATASTORE: Size: 325GB; Host: ESXi6/7/8; Type: VM Datastore; Used: 123GB; Free: 202GB
  LUN 58 - PROD-EXCH-DATA-II: Size: 500GB; Host: EX1/2; Type: Windows; Used: 166GB; Free: 334GB
  LUN 68 - QA-SQL-DATA-2: Size: 300GB; Host: QASQL1/2; Type: Windows; Used: Unmounted; Free: Unmounted; IOPS avg: 0; IOPS max: 0
 

Each entry lists the member disks and their positions in the disk trays, followed by the RAID group information: RAID type, disk type, total capacity, free space and expected IOPS. Expected IOPS are calculated based on the number of disks in the group, the disk speed and the RAID type.

LUN information includes the total capacity, the host(s) that mount the LUN, and the space used by the host(s).
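To make the calculation concrete, here is a minimal sketch of how such an expected-IOPS figure can be estimated. The per-disk IOPS values and the read/write mix are rule-of-thumb assumptions rather than numbers taken from the report, although they happen to line up with the figures shown above (for example, five 15K FC disks yielding 900 IOPS).

```python
# Rule-of-thumb per-disk IOPS values (assumptions, not measured figures).
DISK_IOPS = {"15K_FC": 180, "7.2K_SATA": 90}

# Backend I/Os generated by a single host write for common RAID levels.
RAID_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

def expected_raw_iops(num_disks: int, disk_type: str) -> int:
    """Aggregate backend IOPS of a RAID group (the 'expected IOPS' figure used above)."""
    return num_disks * DISK_IOPS[disk_type]

def expected_host_iops(num_disks: int, disk_type: str, raid_type: str,
                       read_ratio: float = 0.7) -> float:
    """Optional refinement: host-visible IOPS once the RAID write penalty is applied."""
    raw = expected_raw_iops(num_disks, disk_type)
    penalty = RAID_WRITE_PENALTY[raid_type]
    return raw / (read_ratio + (1.0 - read_ratio) * penalty)

if __name__ == "__main__":
    print(expected_raw_iops(5, "15K_FC"))                   # 900, like RAID Group 0 above
    print(round(expected_host_iops(5, "15K_FC", "RAID5")))  # ~474 at a 70/30 read/write mix
```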

2. Expected IOPS And Load From Servers. 

In contrast to capacity, performance is difficult to assess. Therefore, we offer a method that allows assessing SAN performance and illuminates potential problem spots. IT professionals can then use different techniques to go deeper into actual performance tuning and troubleshooting.

We often see a mistaken view of how SANs behave. People tend to think that the more load you put on a SAN, the slower the SAN will work. That is wrong! A SAN will work per the parameters it was built to. If you configure a RAID group to provide 900 IOPS, it will deliver those expected IOPS. However, the applications on servers pushing I/O to the SAN may slow down when the SAN cannot satisfy all of the requests. In such a case, requests will be queued on the server and the end user will begin to feel the SAN performing slower. In actuality, the SAN is working at the same speed; it just has more requests waiting for each other to finish.

SAN baseline performance can easily be tested with tools like Iometer. After the Storage Area Network connectivity and RAID groups are set up, the performance of the SAN itself should remain constant. Performance might be affected by a degraded RAID group, a mismatched hot spare disk taking over, or a RAID rebuild. Under normal circumstances, however, the SAN will not slow down.

To assess the SAN performance, we evaluate the expected IOPS provided by the RAID groups. Then we compare this value with the aggregated average and maximum IOPS pushed by servers to all LUNs of the analyzed RAID group. Here is what it may look like:

RAID Group 1 (RAID 10, Drive Type: FC, Capacity: 1,073GB, Percent Full: 99%, Expected IOPS: 720)
  Disks: 1/0, 1/1, 2/0, 2/1
  LUN 1 - UAT-ORACLE-FILES: Size: 200GB; Host: NYORAUAT1/2; Type: Oracle; Used: 181GB; Free: 19GB; IOPS (avg/max): 45/354
  LUN 8 - PROD-ORACLE-FILES: Size: 200GB; Host: NYORAPROD1/2; Type: Oracle; Used: 150GB; Free: 50GB; IOPS (avg/max): 52/643
  LUN 43 - PROD-SQL-SERVER-DB: Size: 260GB; Host: NYSQLPROD; Type: Windows; Used: 184GB; Free: 76GB; IOPS (avg/max): 198/2077
  LUN 44 - PROD-SQLUAT-SERVER-DB: Size: 260GB; Host: none; Free: 260GB

  TOTAL EXPECTED IOPS: 720
  TOTAL PUSHED IOPS (avg/max): 295/3074


In this example, RAID Group 1 is a RAID 10 group built on 15,000 RPM Fibre Channel disks. The expected performance of such a configuration is 720 IOPS. The I/O is measured for LUN 1, LUN 8 and LUN 43 from the server side using built-in host tools like PerfMon or iostat. Average and maximum values are recorded and then the totals are compared.
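A minimal sketch of this aggregation step is shown below. The data structures are assumptions made for illustration; in practice the per-LUN averages and maximums come from PerfMon or iostat exports.

```python
from collections import defaultdict

# Per-LUN measurements collected on the hosts (values from the RAID Group 1 example above).
# Each entry: (raid_group, lun, avg_iops, max_iops)
measurements = [
    ("RAID Group 1", "LUN 1",  45,  354),
    ("RAID Group 1", "LUN 8",  52,  643),
    ("RAID Group 1", "LUN 43", 198, 2077),
]

expected = {"RAID Group 1": 720}  # expected IOPS per RAID group

totals = defaultdict(lambda: [0, 0])
for group, _lun, avg, peak in measurements:
    totals[group][0] += avg
    totals[group][1] += peak

for group, (avg, peak) in totals.items():
    if avg > expected[group]:
        status = "oversubscribed on average"
    elif peak > expected[group]:
        status = "spikes above expected; review timing"
    else:
        status = "ok"
    print(f"{group}: expected {expected[group]}, pushed {avg}/{peak} (avg/max) -> {status}")
```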

In the end, a follow-up report is created for the entire SAN:

RAID GROUP #     IOPS EXPECTED   AGGREGATED IOPS PUSHED (avg/max)   AGGREGATED WAITS (avg/max) ms
RAID Group 0     900             233/376                            1.25/23
RAID Group 1     720             295/3074                           7.21/865.06
RAID Group 2     360             2/134                              315.04/29329.85
RAID Group 3     630             1233/6103                          4302.15/27239
RAID Group 4     270             106/3160                           580.97/14342.94
RAID Group 5     180             4.26/250                           2546.45/51500
RAID Group 6     720             31.38/602                          6462.16/224913.33
RAID Group 9     360             3145/29233                         885.42/23764.33
RAID Group 10    720             6.6/305                            4838.33/126350
RAID Group 11    720             45/2875                            4958.44/160240.66
RAID Group 14    630             264/2696                           1320/4030
RAID Group 15    900             164/1903                           837.21/2990
RAID Group 16    720             23/2262                            2.258/50
RAID Group 17    720             2.371/377                          1.154/49
RAID Group 18    360             147/1394                           1510/9170
RAID Group 19    360             35/2571                            4.21/69.95
RAID Group 20    180             6.9/80                             6.88/135.75
RAID Group 21    180             150/790                            3.905/38.09
RAID Group 22    180             9/224                              9.91/131.6
RAID Group 23    180             0                                  0/0
RAID Group 24    720             1335/15037                         0.92/26.22
RAID Group 25    720             2290/10973                         2.08/29.625
RAID Group 26    180             55/239                             1970/6280
META             -               -                                  1660.17/12966.29

The RAID groups flagged in red in the report are oversubscribed on average (for example, RAID Groups 3, 9, 24 and 25 above, where the aggregated average pushed IOPS exceeds the expected IOPS): hosts are trying to push far more I/O requests than the RAID group can handle. This shows up in the average and maximum waits (the last column). These are assessment indicators that tell the storage admin to take a closer look at the LUNs. The reason for oversubscription could be constant load, when applications are "frying" the disks and desperately need more I/O. In that case, more I/O can be gained by spreading the load across more physical spindles, introducing flash disks, adding caching, and so on.

High load indicators could also be the result of spikes. In that case, the timing, nature and duration of the spikes should be reviewed and analyzed. For example, RAID Group X with an expected 900 IOPS may carry LUNs A, B and C, and the report aggregated across all LUNs may show 800 IOPS on average with a maximum of about 1,000. In such cases the load can still be well balanced between the LUNs and the RAID group: the maximum I/O may be produced in different time frames, and the overall average does not demand more load than the RAID group was provisioned for.

A deeper analysis of the READ/WRITE, WAITS and DISK QUEUE graphs should show whether the spikes correlate, that is, occur at the same time. Sometimes spikes are caused by backups and can be ignored entirely if they do not occur during production hours.
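Below is a minimal sketch of such a correlation check, assuming the counters have been exported to a CSV file. The file name, column names, thresholds and production window are illustrative assumptions.

```python
import csv
from datetime import datetime

PRODUCTION_HOURS = range(8, 20)   # assumed 08:00-20:00 production window
WAIT_SPIKE_MS = 50                # illustrative spike thresholds
QUEUE_SPIKE = 8

spikes = []
with open("raid_group_counters.csv", newline="") as f:   # hypothetical export with
    for row in csv.DictReader(f):                        # timestamp, wait_ms, queue_depth columns
        ts = datetime.fromisoformat(row["timestamp"])
        if float(row["wait_ms"]) > WAIT_SPIKE_MS and float(row["queue_depth"]) > QUEUE_SPIKE:
            spikes.append(ts)   # waits and queue depth spike together

# Spikes outside production hours (for example, backup windows) can usually be ignored.
in_prod = [t for t in spikes if t.hour in PRODUCTION_HOURS]
print(f"{len(spikes)} correlated spikes in total, {len(in_prod)} during production hours")
```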

Storage pools could be analyzed using the same logic.

3. Network Expected Performance and Network Load from Servers

Next, an assessment report is created based on network statistics collected on the SAN, the switches and the hosts.

 

 

 

Backend (SAN iSCSI ports, Expected Speed: 10,000 Mbps each):
  SPA Nic1 - Actual avg/max load on switch: 86/213 Mbps
  SPA Nic2 - Actual avg/max load on switch: 87/209 Mbps
  SPB Nic1 - Actual avg/max load on switch: 67/209 Mbps
  SPB Nic2 - Actual avg/max load on switch: 45/196 Mbps
  Aggregated load from servers (Mbps, avg/max): SPA 173/422; SPB 112/405
  Aggregated waits from servers (ms, avg/max): 1342/28938
In this situation we have a SAN with 4 x 10G iSCSI connections. Based on the average and maximum load from the switches, we see that we are far from saturation. The large waits are a function of IOPS: they accumulate while hosts are waiting for read/write operations from the SAN. The network does not contribute to the waits in this case.
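As a quick sanity check of that headroom claim, here is a minimal sketch using the peak figures from the table above (the port names are shortened for illustration):

```python
# Peak observed load per 10 Gbps iSCSI port, in Mbps, taken from the table above.
port_max_mbps = {"SPA Nic1": 213, "SPA Nic2": 209, "SPB Nic1": 209, "SPB Nic2": 196}
LINK_SPEED_MBPS = 10_000

for port, peak in port_max_mbps.items():
    print(f"{port}: peak {peak} Mbps = {peak / LINK_SPEED_MBPS:.1%} of link capacity")
# Every port peaks at roughly 2% of capacity, so the network is nowhere near saturation.
```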

4. System Errors and Warnings

System errors and warnings are collected on the SPA and SPB controllers. In most cases we learn about errors through our Enterprise Storage Monitoring system. However, for a complete report, we assemble all the logs and load them into our database. Next, we group them by type and determine whether anything should be reported or taken under closer consideration.

Any database engine can be used to semi-automate the analysis of large amounts of log data.

Over time, we have accumulated many SQL stored procedures and statements through our log analyses. These procedures and statements help us complete the analysis faster.
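A minimal sketch of this load-and-group step, using SQLite for illustration: the table layout, event types and messages are assumptions, and the real workflow relies on our own database and the accumulated SQL procedures mentioned above.

```python
import sqlite3

# Hypothetical parsed log entries: (controller, severity, event_type, message).
entries = [
    ("SPA", "WARNING", "soft_media_error", "Soft media error on disk 3/4"),
    ("SPB", "ERROR",   "verify_aborted",   "Background verify aborted"),
    ("SPA", "WARNING", "soft_media_error", "Soft media error on disk 3/4"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE san_log (
    controller TEXT, severity TEXT, event_type TEXT, message TEXT)""")
conn.executemany("INSERT INTO san_log VALUES (?, ?, ?, ?)", entries)

# Group by severity and type to decide what deserves a closer look.
query = ("SELECT severity, event_type, COUNT(*) FROM san_log "
         "GROUP BY severity, event_type ORDER BY COUNT(*) DESC")
for severity, event_type, count in conn.execute(query):
    print(severity, event_type, count)
```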

5. Patch Levels and Recommendations

We review the level of the management software in comparison to the current versions. Then we classify each patch using the following classification:

  1. Critical – Data Loss or Downtime

  2. Critical – Security

  3. Non critical

     

We also check EOL (End of Life) or EOW (End of Warranty) dates and provide recommendations for our clients.

6. Conclusion. 

Digital Edge believes that the preceding methodology should be practiced to analyze storage devices at least once every quarter or six months. We believe that "even hardware should go for a blood test from time to time."

This gives enterprise IT groups assurance that everything is working as it is supposed to, that nothing is oversubscribed, and that applications are not "frying" the HDDs.

We understand that enterprise IT groups have their own expert methods of using and configuring storage. Our methodology can be used by any of them, or the Digital Edge Enterprise Storage team can be engaged to provide an independent audit and assessment.

Michael Petrov
Founder, Chief Executive Officer

Michael brings 20 years of experience as an information architect, optimization specialist and operations advisor. His experience includes extensive high-profile project expertise, such as mainframe and client-server integration for Mellon Bank, extranet systems for Sumitomo Bank, and architecture and processing workflow for the alternative investment division of US Bank.
