THE SQL Server Blog Spot on the Web


Linchi Shea

Checking out SQL Server via empirical data points

Clustering every SQL Server instance

You may disagree, but I believe it is a good practice to cluster every SQL Server instance. That is, even when you plan to run a SQL Server instance on a single machine, you should install it in a single-node cluster.

The primary advantage is that you only need a single standard SQL Server build instead of one for stand-alone instances and one for clustered instances. This simplifies configuration, for example when you set up network aliases, Kerberos, and multiple instances.

If you need to add a second node later, that change will be totally transparent to your client. You may question how often one would need to add a second node. True, that may not happen very often. But when it does happen, you’d be thankful that you already have the cluster in place. Moreover, consider a DR scenario where you don’t want to have a two-node cluster doing nothing just in case there is a disaster. If it turns out that you do need to run your production on DR for an extended time period, you probably want to protect your production with a cluster. In that case, adding another node would be painless if your DR is already a single-node cluster.

When it comes to using network aliases, I strongly prefer to expose and manage all the network aliases of my SQL Server instances explicitly as network resources. Network resources introduce additional constraints that make the network aliases less prone to mistakes and help reduce chaos when we need to move them among the hosts.

Kerberos configurations can sometimes be finicky and the procedures for a clustered instance and those for a stand-alone instance are different. Anything you can do to reduce the Kerberos configuration complexity is a plus for robustness of your environment.

When multiple instances are configured in a cluster, they must have separate IP addresses, and this makes it trivially easy to create network aliases for a named instance, completely eliminating the need to reference any two-part name on any client and making the ‘server location’ completely transparent to all the clients. By 100% server location transparency, I mean the ability to move a SQL Server instance to a different machine or a different cluster without changing any configuration at all on the client side. This offers huge convenience, and in many cases is an absolute necessity.
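To make the idea of server location transparency concrete, here is a minimal Python sketch. It is an illustration only: the alias name, the host names, and the `ALIAS_TABLE` dictionary are hypothetical stand-ins for whatever DNS entry or cluster network name resource actually holds the alias, and the connection-string format is a generic example rather than a recommendation.

```python
from typing import Dict, Tuple

# Hypothetical alias table. In a real environment this role is played by
# DNS and the cluster's network name / IP address resources.
ALIAS_TABLE: Dict[str, Tuple[str, int]] = {
    "sales-db": ("nodeA.example.com", 1433),
}


def client_connection_string(alias: str, database: str) -> str:
    """What the client stores: only the alias, never a two-part name."""
    return f"Server={alias};Database={database};Trusted_Connection=yes;"


def resolve(alias: str) -> Tuple[str, int]:
    """Where the alias currently points (host, port)."""
    return ALIAS_TABLE[alias]


def move_instance(alias: str, new_host: str, new_port: int = 1433) -> None:
    """Relocate the instance by repointing the alias on the server side."""
    ALIAS_TABLE[alias] = (new_host, new_port)
```

Calling `move_instance("sales-db", "nodeB.example.com")` changes what `resolve("sales-db")` returns, while the string produced by `client_connection_string("sales-db", "Orders")` stays exactly the same. That unchanged client string is the transparency described above.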

From my personal experience, I have not seen any serious downside to this approach. Hope that’s your experience as well. As always, I’d like to hear your different views/opinions on this.

Published Friday, February 10, 2012 11:32 PM by Linchi Shea


Comments

 

Jason said:

I love this idea, and will think about doing it at my job.  I've just got management convinced that the hardware costs for a second node are not all that much when you run Active / Passive for resilience.  So now, I've got the approval that any new installs (save perhaps for unique instances like SCCM at a remote location) will be clustered.  But I like your thinking, as it would make things much easier for me to go back with this approval and just build up and attach a second node.

February 11, 2012 3:15 PM
 

GrumpyOldDBA said:

It does assume that your storage model is sharable. If your server is using only internal disks then it's not something you can do, and even with a DAS you may have some issues.

Other than that I sort of agree with you, notwithstanding the changes in SQL 2012 with mirroring and availability groups.

I hasten to add that I've worked with clustered production systems for as long as I can remember; I guess it would probably be SQL 7 when I last had stand-alone production servers.

I do like the point about the DR server being a single node cluster, however my DR site does not have the benefit of a SAN or other shared storage.

February 13, 2012 8:24 AM
 

noeldr said:

Well, I have to throw a wrench here:

Clustering components have had a *lot* of bad reputation in the past, not to mention a larger failure surface.

In the latest editions that has changed but I can't forget all those nights that "ping-pong" issues affected me in the past.

Barring that, it sounds like something to think about.

February 13, 2012 5:47 PM
 

Greg Linwood said:

There are two huge problems with clustering - (a) requirement for external storage wipes out PCIE ioDrives (internal storage) which means you can't use the fastest, lowest cost storage available and (b) management costs are significantly increased with clustering, due to the issue noel has alluded to.

I am generally an advocate for achieving simplicity through quality process & simplified use of technology rather than complicating High Availability technology just to achieve automatic fail over (which usually results in increased incidence of fail over, therefore down time &, even worse, more calamity when things go truly wrong).

Your idea is certainly creative but I think the costs outweigh the benefits

February 14, 2012 1:02 AM
 

Greg Linwood said:

I meant to say that I advocate achieving High Availability through simplified use of technology & implementing quality processes rather than using complicated High Availability merely to achieve automation of fail over (what clustering essentially is).

In our experience, there is a glaring story of DBAs having to spend significantly more time supporting clustered SQL instances than non-clustered ones. One of the most common problems is that storage engineers constantly screw you over with promises that patch upgrades to their firmware won't require down time. There's also the inherent extra complexity involved with clustering technology, which has more cogs & generally breaks more often on its own.

February 14, 2012 1:18 AM
 

alen said:

only problem i see with that is if you have something ridiculously expensive like EMC SANs then you are spending a lot more on storage than you need to

February 14, 2012 3:02 PM
 

K. Brian Kelley said:

We've seen the issue with the complexity of clusters doing us no favors. Some other things to consider.

With clustering, you lose shared memory as a network library. Now, if all of your processes are connecting from other systems, this isn't a loss. But if you have anything running local, it is. And it can be potentially affected by anything that plays with the network stack. We've been burned by that, and even with MS Premier Support were caught where we couldn't diagnose the real cause of a general network error with SSIS packages where SQL Server was up and responsive but the SSIS package stopped being able to talk to it. Move the packages over to a non-clustered server (where SQL Server was installed locally and the packages were talking to it) and everything was fine.

Also, if you're going the virtualization route, such as with VMware, the HA provided by such solutions may be more viable. We did a recent test where we intentionally moved the VM over from one physical host to another to simulate the scenario where we had to work on a physical host or the vSphere installation itself. While it took a good bit of time to get over, about 20 minutes for 192 GB of RAM on the new HP DL 580 G7s, the downtime visible to the clients was 600 ms, which is significantly smaller than the seconds one would see in a cluster fail-over. Now, we still want to see POST times for spinning up because we doubt that would be faster than a fail-over situation where the host unexpectedly died on you, but we know it'll be quicker than a cold boot since the server would have already undergone its POST cycle during power on and the POST is only in the VM.

February 14, 2012 9:54 PM
 

Greg Linwood said:

Virtualisation has its own benefits & drawbacks.

One of the biggest drawbacks is that it adds an extra layer of poorly co-ordinated multi-tasking which impacts performance all of the time, which can be much worse in many cases than the overhead associated with communicating over a network protocol vs shared memory (compute intensive vs network intensive workloads).

Both clustering & virtualisation are technologies which provide small benefits to a few support staff in the rare event of a failure, while hurting performance for all users all of the time.

I much prefer giving users good performance all of the time & having to do a little extra work as a DBA to achieve excellent High Availability through excellent process rather than just throwing a few very complicated, expensive technologies around

February 15, 2012 12:32 AM
 

Prakash Heda said:

Major benefit of having a cluster is that in case SQL Server is hung for whatever reason, the cluster will try to restart it... it helps a lot when you don't have to get up at night, and you are able to meet SLAs better.

But on the downside it's the licensing cost which matters: even though SQL Server supports a 2-node cluster with Standard edition, Windows does not.

Regarding shared storage, since iSCSI is supported it's less of a struggle... we are using the iSCSI protocol for clusters in VMs and it's working fine (small customers do not need top notch performance).

Once 2012 is available I plan to run everything on PCI2.0 ssd storage and have an availability group configured for a perfect HA solution... (though I prefer a VM any day, as for 90% of SQL prod servers I don't need to provide top notch performance or HA)

Prakash

February 15, 2012 10:13 PM
 

Greg Linwood said:

Prakash, your first paragraph represents a common approach taken by many DBAs - automation of fail-over in the hope that "whatever reason" will go away with a simple restart (which clustering automates).

But this is NOT how to go about High Availability, for two major reasons:

a) it implies acceptance of restarts (involving downtime) for "whatever reason". What this usually means is more frequent downtime than is necessary as "whatever reason" becomes a reason to restart when the whole objective of HA is to AVOID restarts & downtime. You might not want to be woken in the middle of the night, but unless you take the time to look into the problem (usually as it's occurring) you are far less likely to diagnose it properly & therefore less likely to avoid it in future.

and

b) you expect a restart to solve problems, which just doesn't work & is hardly what DBAs are paid to do.

You did say that you don't need top notch performance or HA, and you certainly won't get it with this approach. The path toward getting good performance and HA out of SQL Server is to LIMIT the incidence of downtime & restarts by AVOIDing the problems in the first place, not simply automating the process of restarts.

It's a relatively easy thing to roll out complex technology to automate the process of failover. Being proactive about avoiding problems in the first place is a much bigger task, requiring more skill & experience as it requires knowing more about what NOT to do than what TO do. Clustering falls into the bracket of what NOT to do in most cases when striving for high performance and availability, though you did state that these weren't your goals I guess..

February 16, 2012 1:12 AM
 

Ewan said:

Some interesting points raised here around HA in general.

Linchi - you wrote about this before and I think it's a great idea.

But then all of my data storage is provided by a SAN at the backend anyway, so no biggie for me to throw a cluster at a single-node.

My experience must be atypical since I've had no problems with clustering at all. But we do have a standard build so it all seems to just work.

SANs are expensive and not particularly fast when shared, but that's what we've got so the shared-storage prerequisite doesn't bother me.

Greg - Not sure that PCIio is the 'lowest cost' storage, but it is nice kit.

Also, on the general use of clustering: I use it to protect against hardware failure, so the automation of failover in that scenario is actually quite valuable to me (relatively rare I admit, but then so is loss of a datacenter and we still provide DR). It's handy to minimise downtime when the server team need to do OS patching as well. Greg - in the absence of clustering and virtualisation, what excellent processes do you use to provide HA?

I had high hopes for availability groups to overhaul the whole clustering/log-shipping/seamless-failover thing (the main benefit of the indirection provided by Linchi's post here), but if I thought the current crop was expensive, the new EE licensing model is mental and blows any possibility of me implementing that. So it's back to the drawing board to figure out the 2012 HA picture...

Ewan

February 16, 2012 10:12 AM
 

Greg Linwood said:

Is there a problem with this site at the moment? Whenever I try to post responses to this thread longer than a single line it bombs with a http 403 - Forbidden error..

February 16, 2012 9:14 PM
 

SQLKID said:

There is nothing glorious about SQL clustering, at least as of SQL 2008.

A failover mechanism which is at best primitive and very rudimentary. The IsAlive check failing for 180 secs and there we go - restart.

If a DB had too many open transactions (think of a batch job at night) and a failover happened, it's likely that before the DB recovers the IsAlive check starts failing again, causing another failover. I have personally looked at a server where this ping-pong nonsense continued for hours before someone knew how to stop it.

Call MSFT support if you have ANY unexplained outage on a SQL cluster - their answer is the same. Not sufficient data! Here is PSSDIAG and sqldumper (if sqldumper is so important, not sure why it's not enabled by default) and let's collect the data on the next outage. On the next outage you provide the logs and the answer is 'SQL was healthy', possibly something in the OS/network stack, let me involve the xyz team. An option for low-level tracing of the IsAlive check is something that clustering needs. I have had some very bad experiences.

If my DBA recommends clustering for all the SQL servers - I will have him fired!

February 18, 2012 6:32 PM
 

EZ-one said:

It is a bad idea to post some ideas like this one on this website - just to post something (it looks like that was the main goal, right?)

<I believe it is a good practice to cluster all the SQL Server instances>

It does not make much sense ..

- It can confuse some inexperienced DBAs for just one reason: it was posted on sqlblog.com.

I agree with SQLKID

February 19, 2012 8:41 PM
 

Linchi Shea said:

SQLKID;

Clustering is not perfect and is far from being perfect. But having too many open transactions is certainly not the root cause for an instance to bounce between the nodes. If you have actually checked how the IsAlive check works, you should know that many open transactions in the user databases will not cause the IsAlive check to fail. The IsAlive check is rather simple, and doesn't really touch the user databases. It doesn't matter how long it may take to recover a user database, how many transactions it may have to recover in a user database, or whether it can recover a user database at all.

Before you discount clustering SQL Server, you may want to be more precise on exactly why its problems may outweigh its benefits.

February 20, 2012 10:48 PM
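
The point made above about the IsAlive check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual code of the cluster resource DLL: it assumes, as is commonly described for SQL Server 2008-era failover clustering, that the check amounts to running a trivial instance-level query such as `SELECT @@SERVERNAME`. The `FakeInstance` class, its fields, and the instance name are hypothetical.

```python
class FakeInstance:
    """Hypothetical stand-in for a SQL Server instance."""

    def __init__(self, name, accepting_connections=True):
        self.name = name
        self.accepting_connections = accepting_connections
        # State of user databases. Deliberately never read by is_alive().
        self.user_db_state = {}

    def execute(self, sql):
        if not self.accepting_connections:
            raise ConnectionError("instance is not responding")
        if sql.strip().upper() == "SELECT @@SERVERNAME":
            return self.name
        raise ValueError("only the health query is modeled in this sketch")


def is_alive(instance):
    """Assumed shape of the check: can a trivial instance-level query run?"""
    try:
        instance.execute("SELECT @@SERVERNAME")
        return True
    except ConnectionError:
        return False
```

In this model, setting `user_db_state["Orders"] = "RECOVERING"` has no effect on the result of `is_alive`, because the health query never touches a user database. Only an instance that cannot answer the trivial query is reported as not alive.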
 

Greg Linwood said:

It is easy to present the case that the problems outweigh the benefits with clustering.

It's as simple as pointing out that users are forced to experience degraded performance due to clustering's external storage dependency, purely so that a few maintenance staff don't have to be woken up at night in the rare event of a failure.

As users suffer degraded performance all of the time from clustering, it's hard to build a case that the automated fail-over capabilities in clustering are actually a benefit, as failures should be very rare and there are options which don't affect performance yet merely pose some maintenance inconvenience for support staff (eg Log Shipping).

Then, there's the whole wider picture of increased complexity with clustering & the added faults this brings.

Worst of all, however, are SAN engineers who routinely bring down clusters, with no option of failover, undermining the entire idea.

February 21, 2012 12:03 AM
 

Linchi Shea said:

Greg;

I think in most places whether or not you have an external SAN storage is orthogonal to SQL clustering. And in most of these places, SAN is a given and SQL clustering comes later.

True, mainstream SAN storage systems trade performance for manageability and so on in many respects. But it's my experience that a properly maintained SAN storage system should offer sufficient performance capacity for the vast majority of the apps. Where your performance requirements exceed what a SAN storage system can offer, sure you need to consider a different solution.

February 21, 2012 8:51 AM
 

Greg Linwood said:

>Where your performance requirements exceed what a SAN storage system can offer, sure you need to consider a different solution.<

This doesn't make sense at all - why would you start with the most expensive, slowest solution (SAN) & then move to the fastest & cheapest solution when you could always have started with the fastest & cheapest solution (SSD)?

It is absurd to deliberately choose the most expensive & slowest solution, especially if the objective is only to implement clustering, which should really be seen as the High Availability inhibitor that it is.. All it does is automate failover, which is not the same thing as achieving high levels of availability.

February 21, 2012 5:35 PM
 

Linchi Shea said:

You never pick an enterprise storage system on performance alone. Nor do you pick a storage system just because it offers the highest performance. SAN storage systems offer many features to simplify managing your storage across the entire enterprise. If you consider these features important, you have to pay for them. If you don't consider them important, SAN storage systems are not for you and would be a waste.

Nobody chooses an expensive SAN storage solution just to implement clustering. If your environment doesn't support failover clustering, you probably don't want to incur a huge expense just to be able to put in clustering. But a lot of environments these days already have some sort of SAN storage systems in place. At least, all the places I have worked at have SAN storage systems.

I agree that automated failover is not the same thing as achieving HA, but it's one aspect of HA.

I'm not sure why we are even discussing this. If you don't like clustering or had bad experience with clustering, stay away from it and find a different solution to cover what it is supposed to cover. I'm telling my own experience, not rehashing any second-hand experience. And my experience is that for what it is supposed to cover, it's simple and effective, and I have not had any problems that would make me consider not clustering any important SQL instance.

I understand there are cases, for instance, where the availability requirement of a user database (not the instance) is such that failover clustering may not recover the user database fast enough. In those cases, failover clustering may not be the solution and a technology such as database mirroring may be better.

February 21, 2012 9:52 PM
 

Greg Linwood said:

You said yourself that SAN trades performance for manageability.

I say that any DBA who makes that trade off is making decisions in their own selfish interests rather than that of their users, who have to suffer the consequences of poorly performing, expensive SAN storage solutions all day every day, just so the DBA might get a little extra sleep.

Users interests should come first but they rarely get a say in storage decisions, as DBAs & storage engineers often are the only ones at the table & tend to only represent their interests.

As for the "enterprise" aspect in this discussion - I prefer not to make a distinction between "enterprise" & "non-enterprise" as I consider this elitist talk that has no merit. All users deserve the best performance possible from whatever budget is available for infrastructure & SANs are the worst choice in nearly all circumstances.

Your last comments about failover clustering not recovering the database fast enough have really missed the point. Failover clustering causes MORE outage time due to its inherent design. The increased complexity causes greater amounts of down time when faults occur as most faults occur in storage (which clustering can do nothing about) and when problems occur, they tend to be more complex & time consuming to resolve.

High Availability has little to do with recovery time - the main game is avoidance of problems in the first place and increasing complexity (eg clustering) works against the basic objective of keeping systems up in the first place.

The performance implications of external storage are a separate but very commonly overlooked problem & frankly, should be enough of an issue on their own to sway almost any decision about infrastructure away from clustering.

February 21, 2012 10:56 PM
 

Linchi Shea said:

Greg;

It looks like we just have had very different experience with clustering and external storage.

"Users interests should come first but they rarely get a say in storage decisions, as DBAs & storage engineers often are the only ones at the table & tend to only represent their interests."

You know this is simply not my experience, not at all. If DBAs are at the table at all, they would be the last one. In my own experience, they are never at the table for the enterprise storage decision. It's simply not a decision that would involve DBAs. Like it or not, they are too far down the food chain to have any real influence. Sure, DBAs may get invited to participate in the ritual afterwards, or in the process of determining how to best use it.

"High Availability has little to do with recovery time - the main game is avoidance of problems in the first place"

Not sure about the systems you have worked with. But I don't know how to avoid such failures as CPU or memory module failure or any failure inside the host computer. You talk about performance. There is actually a case for performance. And that is, sometimes memory modules go bad without outright failure. The end result is that performance is degraded. Having a cluster allows you to quickly move your SQL instance to another node without having to live with the degraded hardware performance.

You brought up the complexity of clustering being an issue. This may have been the case many years ago. But these days, clustering is rather simple and stable. You build it and forget about it. Again that's just my own experience. Yours may differ, and obviously does.

When it comes to external storage, the fact of life is that it is there in all major enterprises and it is not going away any time soon. I don't think we have any choice but to live with it. As I mentioned before, to us database folks, it is more or less a given. So why not take full advantage of it?

BTW, without external central storage, how do you achieve such things as storage on demand and automatic tiering? Or do you consider those features not really worth having?

February 22, 2012 12:02 AM
 

Greg Linwood said:

Linchi, correct me if I'm wrong but my understanding of your background is that you've worked predominantly on Wall St & perhaps things are done in a certain way there, but this isn't representative of most of the business world.

Talking about "enterprise" practises is elitist & an un-necessarily intimidating terminology.

Whilst I understand that the larger an organisation is, the more likely it is to homogenise its approach to storage, there is also the other reality with such big businesses - that many of their internal customers suffer from terrible experiences from centralised IT services & end up "silo"ing their systems onto their own infrastructure to get away from such homogenised approaches.

After working with SQL Server systems for nearly 20 years & running a dedicated SQL Server support business for the past decade with customers all over the globe, in all major industries & of all sizes I put it to you that most businesses don't operate the way Wall St businesses do & to the vast majority of businesses, end user performance is extremely important, usually more than centralisation of storage.

In my opinion, your idea of every server being clustered would bring much damage to anybody who followed it. You indicated in your opening sentence that you expected others to disagree so this probably doesn't even surprise you.

February 22, 2012 2:19 AM
 

Greg Linwood said:

(part II of a previous post which SQLBlog.com wouldn't let me post as a single post for whatever reason)..

As for your last question about storage on demand & automatic tiering, first - these are phenomena that mainly apply to the super expensive SANs which you seem to work with. They are so expensive, storage needs to be provisioned only on demand & tiered so that investments can be recovered on equipment only as it ages.

With dedicated storage, it's usually feasible to invest in sufficient storage for the life expectancy of the system & it is also usually expandable sufficiently when needed. Individual PCIE ioDrives are currently available in 5TB format & capacity is growing regularly. It is feasible to have many of these in a single machine so internal expansion is a definite reality. Tiering isn't so crucial as it is with SANs as there's not such a huge requirement to recover capital investment in the equipment (as it's so much cheaper).

February 22, 2012 2:20 AM
 

Ewan said:

Hi Greg

Well, if you don't like SANs, why don't you just say so!

It's all very well calling 'Enterprise' practices 'elitist', but you know what, that's the actual world that lots of DBAs in big organisations live in. 'Enterprise' exists as real stuff in the real world. Once you reach a certain scale your technology options change.

Yes, I'd love to throw FusionIO into everything, but I am forbidden to use local storage. End of story. It's all SAN. And that's a decision taken way over my head and is justified on the basis of management of storage infrastructure in general, not database storage in particular. If I can *prove* that the SAN cannot provide the IOPS for a specific database then I might have a chance of having a niche exception case. But actually, for practically all of my systems, the SAN is absolutely fine. And while we're at it, Virtualisation is perfectly adequate for the most part too.

I actually have some sympathy with your position. SANs are expensive and slow. Although to be fair, they are generally reliable and our SAN team doesn't randomly fail my systems - which seems to be your experience.

Finally, the suggestion that I'm both 'lazy' and 'selfish' because of how I choose to implement my HA is frankly astonishing. I have an entire Risk department looking over my shoulder that I have to satisfy on system availability. They report to a government agency. I prefer to have my recovery automated.

I asked earlier what 'excellent' procedures you use to guarantee availability in the event of failure - I'm still curious about that. If a better option exists then I would love to hear about it.

A properly designed system on properly managed infrastructure will perform appropriately. Your specific experience doesn't define the general rule.

Ewan

February 22, 2012 6:31 AM
 

Greg Linwood said:

I actually put together a response to your earlier question about achieving high availability through process but SQLBlog.com wouldn't let me post it. I received HTTP 403 errors numerous times, despite trying to break the content down to smaller posts etc. I'm not sure why this happened but after trying for some time I gave up.

I'm not sure why you would draw a personal inference from my comments about laziness, when you've expressed that you had no say in the storage decision where you work. I wasn't referring to individuals specifically other than perhaps Prakash who said he wanted to avoid having to get up to meet an SLA.

It was intended as a general comment about infrastructure design which I think holds true - High Availability is often designed more for the convenience of support staff without due consideration for impact on user experience.

I also think there's a nerdy curiosity with technologies such as clustering which are perceived as "advanced" and IT engineers want to deploy the technologies sometimes for no better reason than to gain experience or be able to put on their resume that they've implemented a complex technology. Again, without due consideration for end user experience..

Anyway, this post was primarily about clustering being used for EVERY instance of SQL Server & my position on it is that the storage performance trade off is not worth the benefits Linchi has outlined.

Sure, not everyone can switch from SAN to SSD, but my point is simply that clustering requires external storage and therefore rules out the best performing & value for money storage available. When it comes to DBMS servers, this is a huge trade off & not worth the benefits Linchi has outlined.

February 22, 2012 7:15 AM
 

Linchi Shea said:

Greg;

I might have misunderstood you, but you seem to see a dichotomy between "convenience of support staff" vs. "user experience". I happen to think making the life of the support staff easier enhances the user experience. In addition, if we don't consider the impact on user experience, the convenience of the support staff will be greatly hurt.

You also seem to view SAN and SSD as mutually exclusive. They are not. Modern disk arrays from all major vendors can have tons of SSD in them, plus tons of cache. Centralized storage behind a SAN may be slower (compared to what you can do with the storage devices when you are totally unconstrained). But unless you are doing some fancy stuff in there (e.g. synchronous storage replication between two sites at a distance), it's actually hard to find an app that can't live with its performance. If you disagree, please provide an example.

You know one of the hardest things in IT is to predict demand accurately and consistently. You may have success in predicting how much storage is needed by an app and translate that into how much storage to allocate to a server. But I have not had such success and I don't know anyone who consistently has such success. The truth is that you can't because there are always unexpected events that demand storage right away. You could have a sudden corruption in a large database and you want 4TB of extra storage somewhere quick just to be able to check the database out. You could have a sudden court order to retrieve some old stuff from an archived database that you need storage to restore from the backup. A centralized storage system gives you a very large pool so that you don't have to make any prediction at a very granular level. You get what you want and you don't even have to mess with the server itself. And if your prediction turns out to be too excessive, just give what you don't need back to the pool. This is exactly like having a centralized power grid. I'm sure people argued about pros and cons of having a centralized power grid, and some may still be arguing. But for the vast majority of us, we just enjoy being able to hit the light switch and have our houses lit without having to worry about if we need to add another power generator. That's excellent user experience, if you ask me.

February 22, 2012 11:03 AM
 

Ewan said:

Hi Greg

Fair enough response, based on the fact that I didn't choose the storage. I didn't make the distinction when I first read through your posts.

On the other hand, if I was given the option today of throwing away the SAN and replacing it all with PCIio, would I? Probably not. Not exclusively anyway - maybe for one or two systems. For the simple reason that the current system works fine. Any system is a collection of compromises. If a query runs in 100ms vs 50ms, is that important? Depends on concurrency, frequency of use, etc etc. Balance that against the fact that I survive a hardware failure. Probably the SAN is fine.

I'm not using clustering as a CV++ tool. I have challenging RTOs during the working week - particularly overnight. Yes, I could use mirroring perhaps, or AGs in 2012. To my mind they're more complex than clustering.

Anyway, I guess it's a matter of perspective. If I had a small greenfield site, would I buy a SAN for my databases, with my own money? Maybe not.

If I'm in a company where I have to use a SAN for all of my database storage, am I bothered? As long as the disk performance is appropriate (and that doesn't mean as fast as it possibly can be) then probably not.

Getting back to the original topic - Clustering is not, in my experience, flaky or unreliable or overly complex. If I have SAN storage sitting behind all of my servers anyway, then I suffer no performance penalty for making each a cluster, and I get the benefits that Linchi talks about. And I think this is where Linchi is coming from.

Ewan

February 22, 2012 12:20 PM
 

Greg Linwood said:

@Linchi, I never said SAN & SSD are mutually exclusive & neither did I say that support effort vs user experience are a dichotomy.

What I did say is that clusters eliminate the best cost / performance storage option, which is PCIE attached ioDrives for most systems. This is not the same as saying SAN & SSD are mutually exclusive so I'm not sure why you read further into my comments beyond what I actually stated.

As for the support effort vs user experience discussion, it is perfectly possible to achieve both (this is the core of what my company MyDBA does for its clients). The question is HOW you go about doing this & through extensive experience, I have learned that automation of failover & recovery via clustering generally does more harm than good.

Sure, clustering can automate recovery, but it also generates more down time than alternatives as it is overly sensitive to conditions that cause failovers - which other readers of this blog have pointed out can cost you transactions.

I have not said that clustering or SAN should NEVER be used. My point was that clustering should not ALWAYS be used (the topic of this blog).

As for your comments about provisioning storage on demand through SAN, most readers of your blog will recognise the extreme cost in this approach. Perhaps Wall St companies with too much money can keep expensive SAN capacity sitting idle in case somebody might want it someday, but most companies would approach this problem more practically.

For example, one of our clients is a UK based law firm (one of the largest law firms in the world) & had a corrupt multi-TB evidence database resulting from a SAN fault. The firm was involved in an international class action & had to respond to discovery procedure requirements quickly & their backups were also corrupt (same SAN). The practical response was to simply provision another server & storage to get on with this time critical task, which we completed without delay.

I bring this example up to show that even in extreme legal, time-critical cases, SAN provisioning isn't necessary as hardware can be provisioned quickly enough when needed.

In my experience, very often SAN engineers are actually SLOWER at provisioning SAN storage than storage can be bought. This is usually because SAN technology is SO EXPENSIVE that very few firms are prepared to waste money keeping sufficient excess capacity available for on-demand storage. Most SAN storage teams put up a fight just to provision SAN storage, yet you describe it as if on-demand provisioning through SAN is easy. I put it to you that your experience is highly unusual.

Most companies are more practical & budget conscious. We are a long way off topic however & I am not totally opposed to either clustering or SANs.

My main response to your blog is that clustering should not ALWAYS be used, which is the extreme position you have presented in this blog.

I would go further to say that the cases for clustering SQL Server systems are actually reasonably rare & with the improvements to Live Migration coming in Windows 8 Hyper-V, I think the reasons to cluster systems will be reduced even further.

February 22, 2012 6:10 PM
 

Greg Linwood said:

@Ewan, I think you're presenting a reasonably balanced position & I think we actually agree. I don't object to technologies being used where they are appropriate & I also don't think that performance extremes always need to be achieved. Where 100ms is good enough, fair enough but it's important to recognise that there are MANY systems out there that need extreme performance from whatever budget they have & clustering is an inhibitor to this.

My point is that clustering should not ALWAYS be used.

Do you agree with Linchi that clustering should ALWAYS be used?

February 22, 2012 6:14 PM
 

Ewan said:

Hi Greg

I'll refer back to my previous post: where you already have SANs backing your servers, my view is that the benefits provided by clustering justify its use. Always. Namely, indirection and aliasing provided by the network name resource. Maybe if I had admin rights over DNS and AD I could automate this easily (I don't know), but clustering gives me this out of the box. Even (and in this context especially) in a single-node configuration.

I accept that there will be many systems that require massive performance and therefore any shared storage model is inappropriate. But I have only ever come across one (and I work in a large Financial Services firm in the UK, so I expect I work with reasonably high-end requirements). Which means the 700 systems I currently support are entitled to define the general rule, which I'm happy to reject on an exception basis.

Re-reading your posts, I think your issue is with SANs in general as opposed to clustering. Clustering, to you, is merely problematic because there's a SAN involved. Would that be fair?

I reckon that most companies that have spent 500k on a SAN have reasonable capacity to flex. The disks, although expensive, are the cheapest part of the thing. But my expectation of logic may be unreasonable - I've plenty of experience of idiocy after all. For my part, I have a pool of database storage that I 'own'. I can carve stuff out of it as I need, so perhaps it's ownership of the storage that causes issues.

And I suspect Live Migration won't be much of a panacea - I certainly don't expect it to replace any clustering implementation I have. VMware have had it for a while and it has never suggested itself to me as a solution.

Good discussion this, btw. Thanks for sticking with it.

Ewan

February 23, 2012 9:02 AM
 

Robert van den Berg said:

This is a very interesting article. Personally, I've had some mixed experiences with SQL clustering, starting with SQL 7. Clustering works very well when done correctly, and it has gotten a lot easier over the years to do correctly. But for all versions of SQL, I've seen major issues with clustering caused by lack of experience on the part of the people installing and maintaining the cluster. So that is just one reason I would not advise clustering in every situation.

Just a small point about what SQLKID said: a clustered instance can fail because of user connections. SQL only allows roughly 32 thousand connections, and if all connections are used, the IsAlive can no longer connect to perform its check. This will definitely lead to a failover.

October 31, 2012 3:51 AM
 

jerryhung said:

Just came across this post and wanted to add

- It's certainly a very interesting idea I want to try - cluster every SQL "physical" instance - that makes moving much easier when you move

We are all SAN for physical anyway (but NFS for VM)

- I was told by a MSFT PFE recently that some companies even P2V their 2-node cluster into a VM, and took out the physical node = 1-node cluster VM

Although in some environments, it may just be down to

Tier 1 - cluster on physical - b/c of HA requirement

Tier 2+ - VM - b/c it's not important enough

then we don't have to worry about 1-node cluster :)

November 29, 2012 10:54 AM
 

Ricky said:

I don't think SQL Server Cluster is great enough.

The clusterware may be a weak point.

January 14, 2014 8:23 AM
