Archive for the ‘Business Continuity’ Category

Hi folks,

This time I have decided to share our best practices material for SharePoint. Yes, I know, some of you SharePoint-savvy readers would question its value, as SharePoint is SOOO broad and infrastructure is just one pillar. TRUE! However, it’s the foundation, and as such you have to invest there first before building your service out of web applications, sites, web parts, workflows, dashboards and… you get the idea 🙂

So here’s a summary of the attached presentation, which James Baldwin and I presented at EMC World 2011 in Vegas.

I’ve also added a lot of links and references to help you find the technical material needed for planning and deployment activities.

Please feel free to comment; I would love to get your feedback on what works (or doesn’t) in your environment.

Servers and Virtualization

Virtualizing SharePoint servers has the same benefits as virtualizing any other application and/or database server in your datacenter, and then some:

  • Consolidation – Achieve 2-10x consolidation ratio, especially for larger deployments
  • Performance – Improved front-end performance with more, smaller WFEs rather than fewer, larger WFEs.
  • Maintenance – Live migration of virtual machines (VMware vMotion, Hyper-V Quick/Live Migration)
  • Load Balancing – Maximized overall performance with balanced HW utilization across the farm (VMware DRS, SCVMM PRO)

Aside from all those obvious benefits, there are a couple that stand out in distributed configurations like SharePoint:

  • Availability – VM-based protection for SharePoint provides HOMOGENEOUS availability (VMware HA, WSFC)
  • Business Continuity – Simplified DR management (vCenter Site Recovery Manager, Cluster Enabler)

Virtualization has been fully supported by Microsoft since the launch of SVVP (Server Virtualization Validation Program) in 2008.

SharePoint is a perfect candidate for horizontal scaling, meaning scaling out server roles as more resources are required. You actually get some performance benefit when web servers (front/back end) are broken into multiple instances by scaling out rather than up on the same hardware resources. In many cases I find the same holds for the application and SQL Server roles too.

Don’t be intimidated by the hypervisor overhead (<10%); you can always distribute content databases across multiple SQL instances and partition the index across multiple crawl/index servers. All server roles in SharePoint 2010 can scale out!
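To make the scale-out argument concrete, here is a minimal Python sketch of sizing one server role, assuming you already know (from your own load tests) how much load a single VM of that role can serve. The function name `vms_needed`, the N+1 spare VM and the example inputs are illustrative assumptions, not Microsoft or EMC sizing guidance; the overhead default simply uses the <10% figure above as a worst case.

```python
import math


def vms_needed(required_load: float, per_vm_load: float,
               hypervisor_overhead: float = 0.10, spare_vms: int = 1) -> int:
    """Estimate how many VMs of one SharePoint role to deploy (illustrative only).

    required_load / per_vm_load: same unit, e.g. requests per second, taken
    from your own load tests (hypothetical inputs).
    hypervisor_overhead: fraction of a VM's capacity assumed lost to virtualization.
    spare_vms: extra VMs for N+1 headroom (an assumption, adjust to taste).
    """
    if per_vm_load <= 0:
        raise ValueError("per_vm_load must be positive")
    if not 0.0 <= hypervisor_overhead < 1.0:
        raise ValueError("hypervisor_overhead must be in [0, 1)")
    effective_per_vm = per_vm_load * (1.0 - hypervisor_overhead)
    return math.ceil(required_load / effective_per_vm) + spare_vms


# Hypothetical: 1,000 RPS of peak load, 200 RPS per WFE VM measured on bare metal
print(vms_needed(1000, 200))  # 7 (6 VMs after overhead, plus 1 spare)
```

A side effect of scaling out this way is that losing one VM costs you a smaller slice of farm capacity than losing one big server would.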

Some would recommend leaving the index and/or SQL servers physical, but we have proved many times that ALL server roles can be virtualized without any problem, as long as you balance processing (CPU) pressure across multiple virtual machines (I wouldn’t be that worried about I/O throughput). Microsoft’s general recommendation to dedicate at least 8 physical cores for a medium-sized farm is very generic, and I don’t see it as a showstopper. If you take that guidance literally, I would suggest waiting for the next revision of vSphere (very soon) and Hyper-V (later this year or next), which will support up to 32 vCPUs.

  • Plan for USER LOAD peaks, not system-generated peaks. From what we have observed in lab tests and actual customer data, SharePoint’s regular timer jobs are responsible for most of the I/O and CPU peaks.
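One way to follow that advice when reviewing performance counters is to drop the samples that fall inside known timer-job windows before picking a peak. Below is a rough Python sketch; the sample format, the window list and the nearest-rank percentile are assumptions for illustration, and the actual windows would come from your own farm’s timer job schedules.

```python
import math
from datetime import datetime, time
from typing import Iterable, List, Tuple

Sample = Tuple[datetime, float]   # (timestamp, counter value, e.g. IOPS or %CPU)
Window = Tuple[time, time]        # daily [start, end) window; must not cross midnight


def in_timer_job_window(ts: datetime, windows: List[Window]) -> bool:
    """True if the sample's time of day falls inside any timer-job window."""
    t = ts.time()
    return any(start <= t < end for start, end in windows)


def user_load_peak(samples: Iterable[Sample], windows: List[Window],
                   percentile: float = 0.95) -> float:
    """Nearest-rank percentile of the samples outside the timer-job windows."""
    values = sorted(v for ts, v in samples if not in_timer_job_window(ts, windows))
    if not values:
        raise ValueError("no samples left outside the timer-job windows")
    rank = max(1, math.ceil(percentile * len(values)))
    return values[rank - 1]


samples = [
    (datetime(2011, 5, 9, 2, 15), 900.0),   # overnight timer-job spike (hypothetical)
    (datetime(2011, 5, 9, 9, 0), 100.0),
    (datetime(2011, 5, 9, 10, 0), 120.0),
]
nightly_jobs = [(time(1, 0), time(3, 0))]  # hypothetical timer-job window
print(user_load_peak(samples, nightly_jobs, percentile=1.0))  # 120.0
```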

    Virtual SharePoint farm - Reference Architecture

The configuration above supported more than 20,000 users at 10% concurrency using an ESX cluster of only three Dell R910 servers.
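For readers new to the notation: “10% concurrency” means roughly one in ten of those users is active at the same time. The quick arithmetic below (a trivial Python helper; the function name is mine) turns the figure above into concurrent users and users per ESX host, assuming the load is spread evenly across the three hosts.

```python
def farm_load(total_users: int, concurrency: float, hosts: int) -> dict:
    """Concurrent users and per-host figures, assuming an even spread across hosts."""
    concurrent = round(total_users * concurrency)
    return {
        "concurrent_users": concurrent,
        "users_per_host": round(total_users / hosts),
        "concurrent_per_host": round(concurrent / hosts),
    }


print(farm_load(20_000, 0.10, 3))
# {'concurrent_users': 2000, 'users_per_host': 6667, 'concurrent_per_host': 667}
```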

Storage Planning

In most cases you have to plan for performance rather than capacity, and definitely so for SharePoint farms in production environments. Before getting into details, here’s a good picture of where SharePoint data is located:

For a complete list of all databases installed with SharePoint, see Database types and descriptions (SharePoint Server 2010).


To point out the “hottest” areas in terms of I/O, I would rank them in the following order of importance (IOPS and latency):

  1. Search databases (Crawl and Property) data and log files
  2. Query Server/s – Query component/s
  3. tempdb data and log files
  4. Database log files
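When turning these hot spots into disk counts, remember that the array sees more I/O than the host because of the RAID write penalty. The Python sketch below uses the commonly quoted penalties (2 for RAID 1/0, 4 for RAID 5, 6 for RAID 6); the workload numbers and the per-drive IOPS figure are hypothetical inputs you would replace with your own measurements and your vendor’s drive guidance, and the model ignores cache effects such as FAST Cache.

```python
import math

# Back-end disk operations generated by one host write, per RAID type
RAID_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}


def backend_iops(host_iops: float, read_ratio: float, raid: str) -> float:
    """Host IOPS translated into back-end disk IOPS (no cache hits assumed)."""
    reads = host_iops * read_ratio
    writes = host_iops - reads
    return reads + writes * RAID_WRITE_PENALTY[raid]


def drives_needed(host_iops: float, read_ratio: float, raid: str,
                  iops_per_drive: float) -> int:
    """Minimum drive count for performance (capacity is a separate check)."""
    return math.ceil(backend_iops(host_iops, read_ratio, raid) / iops_per_drive)


# Hypothetical: 1,000 host IOPS at 70% reads on RAID 5, 180 IOPS per drive
print(backend_iops(1000, 0.7, "RAID5"))        # 1900.0
print(drives_needed(1000, 0.7, "RAID5", 180))  # 11
```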

From our lab tests, this is what we found in terms of I/O sizes and R/W ratios. As you can see, this is not your typical OLTP or OLAP profile; SharePoint workloads vary, but here’s some data you might find useful when planning storage resources:

Microsoft suggests really high IOPS figures for the search components of the farm. While those components are definitely I/O-intensive, I would question whether the suggested IOPS requirements apply to most environments. Here’s what we have observed vs. the TechNet recommendations:


For sizing, look no further than the following TechNet articles:

Here’s a summary with our recommendations:

SQL Configuration

  • Use a 64 KB allocation unit (cluster) size when formatting a database volume (MSDN)
  • Plan database file sizes accordingly
    • Don’t rely on autogrowth – file growth can cause locking and has performance implications. Pre-size files and set autogrowth increments appropriately
  • When using thin/virtual provisioning
    • Use the “Quick Format” option
    • Enable Instant File Initialization, as it speeds up data file creation, restores and data file growth
    • Grant the SQL Server service account the “Perform Volume Maintenance Tasks” right (required for Instant File Initialization)
    • Just bear in mind that log files (.ldf) are fully allocated and zeroed upon creation or expansion
  • Standard storage response time guidelines apply; on a well-tuned storage system, ideal values would be:
    • 1–5 ms for logs
    • 4–20 ms for data on OLTP systems (ideally 10 ms or less)
    • 30 ms or less for data on DSS (decision support system) workloads
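To check a running instance against these targets, you can derive average latency per database file from SQL Server’s `sys.dm_io_virtual_file_stats` DMV (stall time divided by I/O count, e.g. `io_stall_read_ms / num_of_reads`) and compare it with the ranges above. Here’s a small Python sketch of the comparison step only; the category keys and function names are my own, and the thresholds are just the guideline values listed above.

```python
# (ideal upper bound, acceptable upper bound) in ms, from the guidelines above
LATENCY_TARGETS_MS = {
    "log": (5.0, 5.0),          # 1-5 ms for log files
    "data_oltp": (10.0, 20.0),  # 4-20 ms, ideally 10 ms or less
    "data_dss": (30.0, 30.0),   # 30 ms or less
}


def avg_latency_ms(io_stall_ms: int, num_ios: int) -> float:
    """Average latency per I/O, e.g. io_stall_read_ms / num_of_reads from the DMV."""
    return io_stall_ms / num_ios if num_ios else 0.0


def classify_latency(file_type: str, observed_ms: float) -> str:
    """Compare an observed average latency with the guideline for that file type."""
    ideal_max, acceptable_max = LATENCY_TARGETS_MS[file_type]
    if observed_ms <= ideal_max:
        return "ideal"
    if observed_ms <= acceptable_max:
        return "acceptable"
    return "too slow"


# Hypothetical DMV numbers: 8 ms average on a log file, 15 ms on an OLTP data file
print(classify_latency("log", avg_latency_ms(24_000, 3_000)))        # too slow
print(classify_latency("data_oltp", avg_latency_ms(45_000, 3_000)))  # acceptable
```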


In general, FAST Cache provides great value, but for maximum efficiency it depends on the storage role:

  • Search Index component – No (Highly changing, throw-away data)
  • Search Query component –  Yes (Highly-read data with small burst write changes)
  • TempDB – Yes (the same blocks are reused on disk, and tempdb performance directly affects SharePoint request performance – tempdb is used in every SharePoint request)
  • Content databases/BLOB Store – Maybe (depending on how diverse the workload is, e.g. if some site collections tend to be busier than others)

CX/VNX Thin Provisioning and LUN Compression

BLOB Storage (RBS)

We work with Metalogix StoragePoint for BLOB externalization. Currently, all EMC storage solutions are supported with StoragePoint, as it has connectors for file, block and object storage. Some of the possible BLOB stores:

  • Symmetrix VMAX – Block
  • VNX – Block and/or File (NFS/CIFS)
  • Atmos/VE – object (REST)
  • Centera – Centera API
  • Isilon – File (CIFS/NFS)
  • Data Domain – File (CIFS/NFS)

While each may make sense for our customers depending on their overall storage design and requirements, please consider the following guidelines:

  • Latency – TTFB (Time to First Byte) should be less than 20 ms
  • Recommended maximum content database size remains 200 GB (Microsoft may change this guidance)
  • The SQL RBS FILESTREAM provider works, but it doesn’t scale as well as other RBS providers like StoragePoint
  • Performance improvements are most noticeable when
    • Externalizing larger objects (>1MB)
    • Read-intensive access
  • File size limit remains 2GB even with RBS
  • Backup and Replication considerations:
    • Native/Item level backup (stsadm based) would include BLOBs
    • SQL based backup would only protect the content database metadata
      • To maintain consistency:
        • Backup – First Content Databases then BLOB Store
        • Restore – First BLOB Store then Content Databases
    • For DR purposes, always replicate RBS volumes together with the SQL Server volumes
    • For faster recovery, consider longer garbage collection intervals (this keeps previous BLOB versions)
    • Here’s a reference architecture based on:
      • EMC VNX5300, SQL Database: 15K SAS, BLOB Store: 7.2K NL-SAS (CIFS Share for RBS)
      • Max user capacity – 8,630 (10% concurrency)
      • BLOBs consumed 92% of content databases
      • Full crawl duration – 34 hours (4.4 documents)
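The backup and restore ordering above is easy to get wrong under pressure, so it’s worth scripting. Below is a minimal Python sketch of the sequencing only; the callables are hypothetical placeholders for whatever your backup tooling actually runs (SQL backups, snapshot jobs, file copies), not calls into any EMC or Metalogix API.

```python
from typing import Callable

Step = Callable[[], None]


def backup_content(backup_content_dbs: Step, backup_blob_store: Step) -> None:
    # Databases first: the BLOB store copy taken afterwards then holds every
    # BLOB the database backup points to (assuming RBS garbage collection has
    # not purged any of them in between).
    backup_content_dbs()
    backup_blob_store()


def restore_content(restore_blob_store: Step, restore_content_dbs: Step) -> None:
    # BLOB store first, so BLOB references resolve as soon as the content
    # databases come back online.
    restore_blob_store()
    restore_content_dbs()


steps = []
backup_content(lambda: steps.append("content DBs"), lambda: steps.append("BLOB store"))
restore_content(lambda: steps.append("BLOB store"), lambda: steps.append("content DBs"))
print(steps)  # ['content DBs', 'BLOB store', 'BLOB store', 'content DBs']
```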

Disaster Recovery

While Microsoft has some guidance for SharePoint availability, covered in Plan for availability (SharePoint Server 2010), there’s a lot more involved in obtaining true and complete SharePoint availability across multiple sites. Storage-based replication can accelerate the failover process and can scale as your farm’s storage needs grow. The most basic availability methods can be achieved with SQL Server log shipping and/or database mirroring, but while effective for smaller configurations, they still lack complete farm protection. The only components that can be continuously protected with database mirroring/log shipping are SQL databases, and not all of them! What about the index? WFEs? App servers?

SharePoint DR involves a lot, but it can be significantly simplified by virtualizing all server roles, which provides end-to-end mobility of SharePoint farm services without worrying about the BLOB file system, individual databases, index partitions etc. When choosing storage-based replication, the first thing to consider is placing all SharePoint volumes in a single consistency group. This is of great value, as it guarantees end-to-end (I like that term 🙂) consistency at any point in time.
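A simple sanity check before relying on storage replication is to confirm that every farm volume is replicated and that they all sit in the same consistency group. The Python sketch below assumes you have exported a volume-to-group mapping from your replication tool; the volume and group names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def check_consistency_grouping(volume_groups: Dict[str, Optional[str]],
                               farm_volumes: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the farm is grouped correctly."""
    problems = []
    for vol in sorted(farm_volumes):
        if volume_groups.get(vol) is None:
            problems.append(f"{vol}: not in any consistency group")
    groups = {volume_groups[v] for v in farm_volumes if volume_groups.get(v)}
    if len(groups) > 1:
        problems.append(f"farm volumes split across groups: {sorted(groups)}")
    return problems


# Hypothetical farm: SQL data/log, RBS BLOB store, index and VM datastores
farm = {"sql_data", "sql_log", "rbs_blob", "index", "vm_datastore"}
replication = {v: "CG_SharePoint" for v in farm}
replication["rbs_blob"] = "CG_FileShares"   # BLOB store replicated on its own

print(check_consistency_grouping(replication, farm))
# ["farm volumes split across groups: ['CG_FileShares', 'CG_SharePoint']"]
```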

Here’s a table to help you understand what is related to what, and how to go about the consistency grouping available with almost all EMC replication solutions (SRDF, RecoverPoint, MirrorView etc.).

While this is a great solution, this type of DR still involves manual failover, restart and reconfiguration. That’s why I always recommend virtualizing all server roles: it enables you to leverage automation solutions for the virtual infrastructure, namely vCenter Site Recovery Manager (SRM) or multi-site Hyper-V clustering enabled by EMC Cluster Enabler (SRDF, RecoverPoint, MirrorView). With VPLEX deployed, you won’t even need Cluster Enabler; you can simply rely on Windows Server Failover Clustering (of the VMs) to achieve that.

We have successfully tested and published several reference architectures, available on emc.com:

I’ll keep updating this post based on feedback, new findings and updated guidance from our friends at Microsoft.





Hello again, my virtual friends.

Now that EMC World is behind us, I’ve found the time for some tech updates. This time it’s about another solution we have tested leveraging EMC VPLEX Geo.

VPLEX is a solution for federating EMC and third-party storage. It sits between your servers and your storage, adding a virtualization layer, but it can do much more than that: it provides a sophisticated SDRAM cache that can be distributed to a remote site while maintaining coherence. The EMC VPLEX family consists of three configurations/offerings:

  • VPLEX Local – For managing data mobility and access within the walls of your data center using a single VPLEX cluster
  • VPLEX Metro – For mobility across two sites separated by up to 5 ms of inter-site latency (round trip). We tested that solution last year: vMotion over distance for Microsoft, Oracle and SAP.
  • VPLEX Geo – For access between two sites over extended asynchronous distances with up to 50 ms latency (RTT).
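To get a feel for what these latency limits mean in distance, a common rule of thumb is that light travels through fiber at roughly 200,000 km/s, i.e. about 1 ms of round trip per 100 km. The Python sketch below applies that rule; it counts propagation delay only (switches, routers and WAN gear add more), and the function names are mine. For the 2,000 km simulated distance in the test described below, it gives about 20 ms, within Geo’s 50 ms limit but well beyond Metro’s 5 ms.

```python
FIBER_KM_PER_MS = 200.0   # ~200,000 km/s in fiber => ~200 km per ms, one way


def propagation_rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay only; real links add equipment latency."""
    return 2 * distance_km / FIBER_KM_PER_MS


def vplex_option(rtt_ms: float) -> str:
    """Map an inter-site RTT to the VPLEX offering whose stated limit it fits."""
    if rtt_ms <= 5.0:
        return "VPLEX Metro"
    if rtt_ms <= 50.0:
        return "VPLEX Geo"
    return "beyond VPLEX Geo limits"


for km in (100, 2000, 6000):
    rtt = propagation_rtt_ms(km)
    print(km, rtt, vplex_option(rtt))
# 100 1.0 VPLEX Metro
# 2000 20.0 VPLEX Geo
# 6000 60.0 beyond VPLEX Geo limits
```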

A couple of months ago, the virtualization team in Hopkinton tested application mobility on VPLEX Geo, this time with Microsoft Hyper-V clustering (sounds strange?! Yes, we work with both VMware and Hyper-V in our labs, but don’t expect us to add a NetApp FAS or anything crazy like that 🙂).

It’s an interesting white paper that covers SAP, Oracle and SharePoint mobility with a heterogeneous VMAX and VNX configuration. The WAN link was optimized using Silver Peak, an EMC Select partner. Let me highlight the SharePoint part of that solution; for further reading, please download Long distance application mobility – Enabled by VPLEX Geo.

Physical Architecture Diagram

The SharePoint farm had a few site collections and a total of 400 GB of user content. Seven VMs made up the server farm: 3 WFE, 2 Index/Crawl, 1 App and 1 SQL. The configuration supported more than 12,000 users at 10% concurrency with a sub-3-second user response time for all operations (browse, search, modify). Hyper-V clustering was configured using CSVs (Cluster Shared Volumes).

The highlight of the test was migrating the ENTIRE farm from site A to site B over a simulated distance of 2,000 km (~1,200 miles) WITHOUT service disruption, using Live Migration. While the migration took place, the farm’s user load capacity degraded somewhat but was still able to sustain more than 9,000 users (-23% load).

Live migration times – VM breakdown

Using Silver Peak WAN optimization in this solution reduced the data transferred between the two sites by almost 70%. Impressive!
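To put a ~70% reduction in perspective, here’s a back-of-the-envelope Python sketch of how it shrinks the time needed to push a given amount of data over the inter-site link. The payload size and link speed are hypothetical, and the model ignores live migration’s repeated memory-copy passes, protocol overhead and the optimizer’s own latency, so treat the result as a lower bound rather than a migration time estimate.

```python
def transfer_seconds(payload_gb: float, link_mbps: float,
                     reduction: float = 0.0) -> float:
    """Idealized time to send payload_gb (decimal GB) over link_mbps (megabits/s)
    after WAN optimization removes `reduction` of the bytes."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    megabits = payload_gb * (1.0 - reduction) * 8_000  # 1 GB = 8,000 megabits
    return megabits / link_mbps


# Hypothetical: 16 GB of VM memory over a 1 Gb/s link
print(round(transfer_seconds(16, 1000)))                 # 128
print(round(transfer_seconds(16, 1000, reduction=0.7)))  # 38
```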

To summarize: you get a virtual infrastructure stretched across two sites, with ZERO downtime for migration operations and very low downtime in outage/disaster scenarios, while the server/hypervisor infrastructure addresses the storage as a SINGLE entity.

Stay tuned… best practices for SharePoint are coming in my next post.



Hello everybody,

I have to apologize again for going off the radar; I probably don’t have the dual talent some of my peers at EMC have: blogging and working at the same time…

I have a good idea for a startup 🙂 A keylogger-type program that constantly runs on your laptop/PC/tablet, captures important achievements, interesting topics, emails etc., and automatically publishes them as blog posts and Twitter feeds! Who wants to pick it up? I guess WikiLeaks already did that to some extent…

Today, I wanted to share some recent SharePoint 2010 infrastructure testing done by our Shanghai team, led by Frances Hu. We call these “Solutions” because they are real SOLUTIONS and not just a couple of products proven to work on our storage arrays. Before I get to the details, here’s a short checklist of what’s included in this solution:

  • Server virtualization – Yes, VMware 4.1
  • High availability – Sure, VMware HA cluster based
  • Storage tiering – Yes, EMC CX4 (we didn’t get the VNXs on time) with Flash (FAST Cache), FC (System, RecoverPoint) and SATA (everything else)
  • Disaster Recovery – Oh yes, vCenter SRM using RecoverPoint/SE
  • Remote BLOB storage – Of course, that’s actually part of tiering, we used the native SQL RBS FILESTREAM for that.
  • Efficiency – Yes, the BLOB store was compressed using LUN compression
  • Backup/Replication – Check, we used Replication Manager and Metalogix Selective Restore for item level recovery.

All that in one solution which we are going to demo at EMC World 2011 (I hope you’re coming to Vegas)…

Some of the key results/figures:

  • The SharePoint farm in the solution (including the SQL Server) was virtualized, representing a highly available midsize SharePoint environment
  • The sustained simulated maximum user capacity was 13,080 at 10% concurrency
  • Search crawl performance was improved by 91% and search response time was reduced by 27% when using EMC FAST Cache on those LUNs
  • 30% disk space savings from EMC LUN compression on the SharePoint BLOB store LUNs
  • 91.2% of SQL database data file storage space was freed after enabling SQL RBS FILESTREAM
  • A full-site disaster caused only 15 minutes of downtime while all of the farm’s VMs failed over to the DR site using SRM and RecoverPoint
  • Using Replication Manager 5.3.1, it took only 6 minutes to restore a 100 GB content database from a SnapView replica

The storage layout reasoning is based mainly on cost and is somewhat controversial, but I like it (mostly SATA!!!) because it proved to work!

Here’s the environment architecture

So yes, this is common in our labs: only 3–4 physical servers supporting tens of thousands of users.

For more details you can download the whitepaper from: http://www.emc.com/collateral/software/white-papers/h8139-protection-virtualized-sharepoint-wp.pdf

If you have any questions let me know…

Until next time (which I believe will be dedicated to a SharePoint DR discussion).

Happy Passover/April vacation.

