|
|
|
|
Carpe Datum!
-
Windows Azure is part of the Microsoft "stack" - the suite of software and services we offer. Because we have so many products in almost every part of technology, it's hard to know everything about all parts of what we do - even for those of us who work here. So it's no surprise that some folks are not as familiar with Windows and SQL Azure as they are, say Windows Server or XBox.
As I chat with folks about a solution for a business or organization need, I put Windows Azure into the mix. I always start off with "What do you already know about Windows Azure?" so that I don't bore folks with information they already have. I some cases they've checked out the product ahead of time and have specific questions, in others they aren't as familiar, and in still others there is a fair amount of mis-information. Sometimes that's because of a marketing failure, sometimes it's hearsay, and somtetimes it's active misinformation.
I thought I might lay out a few of these misconceptions. As always - do your fact-checking! Never take anyone's word alone (including mine) as gospel. Make sure you educate yourself on your options. Your company or your clients depend on you to have the right information on IT, so make sure you live up to that.
Myth 1: Nobody uses Windows Azure
It's true that we don't give out numbers on the amount of clients on Windows and SQL Azure. But lots of folks are here - companies you may have heard of like Boeing, NASA, Fujitsu, The City of London, Nuedesic, and many others. I deal with firms small and large that use Windows Azure for mission-critical applications, sometimes totally on Windows and/or SQL Azure, sometimes in conjunction with an on-premises system, sometimes for only a specific component in Windows Azure like storage.
The interesting thing is that many sites you visit have a Windows Azure component, or are running on Windows Azure. They just don't announce it. Just like the other cloud providers, the companies have asked to be completely branded themselves - they don't want you to be aware or care that they are on Windows Azure. Sometimes that's for security, other times it's for different reasons. It's just like the web sites you visit. For the most part, they don't advertise which OS or Web Server they use. It really just shouldn't matter. The point is that they just use what works to solve a given problem.
Check out a few public case studies here: https://www.windowsazure.com/en-us/home/case-studies/
Myth 2: It's only for Microsoft stuff - can't use Open Source
This is the one I face the most, and am the most dismayed by. We work just fine with many open source products, including Java, NodeJS, PHP, Ruby, Python, Hadoop, and many other languages and applications. You can quickly deploy a Wordpress, Umbraco and other "kits". We have software development kits (SDK's) for iPhones, iPads, Android, Windows phones and more. We have an SDK to work with FaceBook and other social networks.
In short, we play well with others.
More on the languages and runtimes we support here: https://www.windowsazure.com/en-us/develop/overview/
More on the SDK's here: http://www.wadewegner.com/2011/05/windows-azure-toolkit-for-ios/, http://www.wadewegner.com/2011/08/windows-azure-toolkits-for-devices-now-with-android/, http://azuretoolkit.codeplex.com/
Myth 3: Microsoft expects me to switch everything to "the cloud"
No, we don't. That would be disasterous, unless the only things you run in your company uses works perfectly in Azure. Use Windows Azure - or any cloud for that matter - where it works.
Whenever I talk to companies, I focus on two things:
- Something that is broken and needs to be re-architected
- Something you want to do that is new
If something is broken, and you need new tools to scale, extend, add capacity dynamically and so on, then you can consider using Windows or SQL Azure. It can help solve problems that you have, or it may include a component you don't want to write or architect yourself.
Sometimes you want to do something new, like extend your company's offerings to mobile phones, to the web, or to a social network.
More info on where it works here: http://blogs.msdn.com/b/buckwoody/archive/2011/01/18/windows-azure-and-sql-azure-use-cases.aspx
Myth 4: I have to write code to use Windows and SQL Azure
If Windows Azure is a PaaS - a Platform as a Service - then don't you have to write code to use it?
Nope.
Windows and SQL Azure are made up of various components. Some of those components allow you to write and deploy code (like Compute) and others don't. We have lots of customers using Windows Azure storage as a backup, to securely share files instead of using DropBox, to distribute videos or code or firmware, and more. Others use our High Performance Computing (HPC) offering to rent a supercomputer when they need one. You can even throw workloads at that using Excel!
In addition there are lots of other components in Windows Azure you can use, from the Windows Azure Media Services to others. More here: https://www.windowsazure.com/en-us/home/scenarios/saas/
Myth 5: Windows Azure is just another form of "vendor lock-in"
Windows Azure uses .NET, OSS languages and standard interfaces for the code. Sure, you're not going to take the code line-for-line and run it on a mainframe, but it's standard code that you write, and can port to something else.
And the data is yours - you can bring it back whever you want. It's either in text or binary form, that you have complete control over.
There are no licenses - you can "pay as you go", and when you're done, you can leave the service and take all your code, data and IP with you.
So go out there, read up, try it. Use it where it works. And don't believe everything you hear - sometimes the Internet doesn't get it all correct. :)
|
-
This is a continuation of the books I challenged myself to read to help my career - one a month, for year. You can read my first book review here, and the entire list is here. The book I chose for April 2012 was: Applied Architecture Patterns on the Microsoft Platform. I was traveling at the end of last month so I’m a bit late posting this review here.
Why I chose this book:
I actually know a few of the authors on this book, so when they told me about it I wanted to check it out. The premise of the book is exactly as it states in the title - to learn how to solve a problem using products from Microsoft.
What I learned:
I liked the book - a lot. They've arranged the content in a "Solution Decision Framework", that presents a few elements to help you identify a need and then propose alternate solutions to solve them, and then the rationale for the choice. But the payoff is that the authors then walk through the solution they implement and what they ran into doing it.
I really liked this approach. It's not a huge book, but one I've referred to again since I've read it. It's fairly comprehensive, and includes server-oriented products, not things like Microsoft Office or other client-side tools. In fact, I would LOVE to have a work like this for Open Source and other vendors as well - would make for a great library for a Systems Architect. This one is unashamedly aimed at the Microsoft products, and even if I didn't work here, I'd be fine with that. As I said, it would be interesting to see some books on other platforms like this, but I haven't run across something that presents other systems in quite this way.
And that brings up an interesting point - This book is aimed at folks who create solutions within an organization. It's not aimed at Administrators, DBA's, Developers or the like, although I think all of those audiences could benefit from reading it. The solutions are made up, and not to a huge level of depth - nor should they be. It's a great exercise in thinking these kinds of things through in a structured way.
The information is a bit dated, especially for Windows and SQL Azure. While the general concepts hold, the cloud platform from Microsoft is evolving so quickly that any printed book finds it hard to keep up with the improvements.
I do have one quibble with the text - the chapters are a bit uneven. This is always a danger with multiple authors, but it shows up in a couple of chapters. I winced at one of the chapters that tried to take a more conversational, humorous style. This kind of academic work doesn't lend itself to that style.
I recommend you get the book - and use it. I hope they keep it updated - I'll be a frequent customer. :)
|
-
We’ve all done it. We have a file we need to transfer to someone - but it’s big. Really big. “Big”, in this case, is defined as “bigger than my corporate e-mail will let me send.” So of course we scout around, and find a free file share location. We plop it in, and send a link to the other person. Problem solved! No, problem created. You see, many of these file-sharing sites have some…interesting…language in their terms. Sure, for a picture of aunt Sally riding that llama on the vacation, who cares if someone else steals it and puts it on an advertisement for a shipping company from Elbonia? But most of these offerings were not designed for a corporation with private or secure information - sometimes your private or secure information. So what’s a company to do? You have to let people send files - they are just going to work around IT if you don’t - but you need a way to keep it secure, and maintain a chain of custody in some cases. There is a solution. There’s a free Codeplex project called “BlobShare” that we have for you here. You pull it down, follow the instructions, and deploy it to Windows and SQL Azure. From there, you have a clean, safe, secure, controlled interface to let your folks share files.  There’s a full post on the inner-workings and the design of this code - and it’s just code. You can change it, use your own logo, restrict it further, loosen it up, whatever you like. That’s the point of Platform as a Service - control and ease. http://blogs.msdn.com/b/vbertocci/archive/2011/10/31/blobshare-sample-acs-protected-file-sharing.aspx
|
-
If you want to be wise, watch the actions and outcomes of others. Emulate the successful actions, and avoid the actions that cause failure. That’s true in life in general - and in technology projects in specific. I’ve worked with several clients who have created or migrated an application to “the cloud” - meaning using Microsoft Windows Azure or another provider. Although the statement in the title of this post is trite, I cannot over-emphasize how accurate it is. In every case of those who had a great experience with a distributed computing environment (which is thankfully the vast majority of my projects), What kind of preparation do you need to do? Here are some tips I’ve learned in the successful (and not-so-successful) deployments I’ve seen: Follow standard recommendations for successful projects in general You and your organization have probably done a few projects before - this one should have the same general attributes: a well-defined goal, a small, motivated team, a realistic timeline, and an adequate budget. I know, I know, you *never* seem to get those things - but if you don’t, you’ll fail. Simple as that. Educate yourself Computing technology started out on a single set of hardware for a single purpose - and realizing the limits of the hardware at hand, systems designers quickly realized that scale-out and virtualization was key. No, that’s not new - mainframes almost always worked on the concept of scale-out and virtual machines. But we switched in the 1980’s to single-user systems again, and we’ve been there ever since. By that I mean you install an OS on the things you work on. Now we move back to distributed system concepts, and there are some real differences. You’ll need to learn how those work, and do things a new way. Hey, we’re IT - we LOVE learning new things, right? Get a partner if needed There are a few of us white-haired Gandalf’s around that remember how to work in a distributed system, but if it’s new to you, that’s completely OK. You can save yourself a world of trouble by working with someone who’s done this before - a partner you hire, someone from Microsoft Consulting, whatever. And don’t forget support - who will handle each issue, what is the escalation model, who are your contacts at Microsoft, and what is your “light’s out” strategy? “A new broom sweeps clean”, the old adage goes, but the old brooms know where the dirt is. Build a model Take some time to do a Proof of Concept on your local system and using your Azure hours from your MSDN account if you have one. Going through this build - and being willing to throw it away and try it a different way - is invaluable. Test your theories Three statisticians are walking in a field. They see a rabbit - the first guy raises his gun, firing far in front of the rabbit. The second guy simultaneously raises his gun and fires far behind the rabbit. The third guy yells “We got him!” Not every theory is correct - not every attempt is the right one. Build in your success tests while you’re building your model. Then check them - don’t leave this step out. Rinse, lather, repeat This is advice from a shampoo bottle - which I’ve never used (I don’t really have that much hair - especially now). But in a “Cloud” project, it’s important. It’s an evolving system, that gains new improvements at an amazing rate. As soon as you deploy and stabilize you need to start the process over again. If you created your system in a Services model, with contracts for the APIs and abstracted code, this is far easier. It’s not hard to do a cloud project right. But it’s really simple to do it wrong. Follow these guidelines and you’ll learn from the successes - and mistakes - of others.
|
-
<This is a non-technical post, at least mostly> Back in January I wrote a post on switching to a stand-up desk arrangement. Since then folks have asked me if I stuck with it, how it worked out, and will I go back to sitting down to work. I thought I would post a few thoughts here on what I’ve done and what I’ve learned. I’m still at the stand-up desk - but not working it quite the same way. You can see the setup I created in that link above. I didn’t spend a lot of money, and I didn’t have to do that much re-arranging to make it work. The first day was fine - no problems at all. The second day…not so much. My feet and my back hurt. That continued for the first week after I started…. But then things got better. It wasn’t so bad. In fact, I went to a couple of meetings, and I found sitting down for a long time was uncomfortable. But the standing thing wasn’t perfect, so I added a pad to stand on as I mentioned in the post. That made a world of difference. But it still wasn’t perfect, so…. I added a barstool. I had a wooden one first, but just didn’t use it because it didn’t adjust correctly and it wasn’t very comfortable. So I bought this one:  And that’s been WAY better. Now, every hour or so, I sit down. If I get tired, I sit down. If I can’t move around because I’m trapped on a phone call, I sit down. It’s never for more than a few minutes or so, but I do take those breaks. With one more exception - some days I’m physically tired, like when I’ve hiked on the weekends or something like that. In that case, I take more breaks. I bought some Ethernet-over-power adaptors to run my network out to the yard, and I sit out on the porch swing and work for a while. I need the connected wire since I use VoIP and need a fast connection.  I’ve also moved my office around a bit, and have a Windows Media PC (that and OneNote are Microsoft’s best-kept secrets) so I can watch Ted Talks, listen to Windows Media Player Internet Radio, watch the University of Washington when they have Computer Science lessons and so on in the background. In fact, I hooked up my computer’s speakers to the pass-through of that Windows Media Center PC, with it’s good speakers. I have the video camera set up for Lync calls, and I use Mouse Without Borders to move the keyboard and screen on the other PC.  So I’m pretty happy with the setup. I’ve now heard three separate reports on the radio about how sitting is bad for you, so I feel vindicated in the choice. And I think I’ll work this way from now on, if I have the choice.
|
-
For reasons I don't completely understand, I'm not allowed to call the following advice "Best Practices" - apparently there is some liability or something there. So let's say these are "really good ideas" for developing applications for Windows Azure. (Did you see how I worked it into the title anyway so the search engines find it? Always thinking. :) ) And of course these are by no means comprehensive - just a few notes I've made as I've been involved with projects. So here they are, in no particular order: It may sound pedantic, but it's just true. You need to learn how the platform works. The most issues I run into are things that are documented - most of the time very clearly. - Follow standard best practices for your language
Windows Azure code does have some differences, but in large part it's just code. If you write Java to run on Azure, write good Java. Learn the best practices for your codeset and apply those - it will not only make your code faster and less error-prone, it will be cheaper to run. - Keep a backup of your deployments
Things happen. Make sure you have your latest deployment checked into source code. If someone were to perhaps stop paying the bill and the Role is deleted, Microsoft does not have a secret backup of your application. Do not ask me how I know this. - Batch up calls - fewer calls are usually more efficient (and cheaper) than lots of smaller calls.
This is typical for distributed computing environments - include the retry logic tip as well to ensure the call is successful. - Use Cache and max out the instance wherever you can - you're paying for it, use it!
In a distributed computing environment, you may have a lag in time between operations. Make your code less brittle to failures for a given call, and ensure that you include retry logic. This article is a must-read. Microsoft keeps three copies of your data at all times in a datacenter, and backs up those three copies to another for safety. But that's at the latest moment - if you need point-in-time restore, make sure you code that into your architecture. Affinity groups are a way of keeping your data and code together. This is usually a great idea from a performance standpoint, although there are applications that can break this rule for centralization or seldom-accessed data. - Have a lights-out strategy
Anything that a human makes can fail. That includes entire datacenters, routers, the Internet. So make sure you know what to do if everything is unavailable. Perhaps that's just a notification system, a full fallback to an on-premises system, something. But plan.
|
-
Distributed Computing - and more importantly “-as-a-Service” models of computing have a different cost model. This is something that sounds obvious on the surface but it’s often forgotten during the design and coding phase of a project. In on-premises computing, we’re used to purchasing a server and all of the hardware infrastructure and software licenses needed not only for one project, but several. This is an up-front or “sunk” cost that we consume by running code the organization needs to perform its function. Using a direct connection over wires you’ve already paid for, we don’t often have to think about bandwidth, hits on the data store or the amount of compute we use - we just know more is better. In a pay-as-you-go model, however, each of these architecture decisions has a potential cost impact. The amount of data you store, the number of times you access it, and the amount you send back all come with a charge. The offset is that you don’t buy anything at all up-front, so that sunk cost is freed up. And financial professionals know that money now is worth more than money later. Saving that up-front cost allows you to invest it in other things. It’s not just that you’re using things that now cost money - it’s that the design itself in distributed computing has a cost impact. That can be a really good thing, such as when you dynamically add capacity for paying customers. If you can tie back the cost of a series of clicks to what a user will pay to do so, you can set a profit margin that is easy to track. Here’s a case in point: Assume you are using a large instance in Windows Azure to compute some data that you retrieve from a SQL Azure database. If you don’t monitor the path of the application, you may not know what you are really using. Since you’re paying by the size of the instance, it’s best to maximize it all the time. Recently I evaluated just this situation, and found that downsizing the instance and adding another one where needed, adding a caching function to the application, moving part of the data into Windows Azure tables not only increased the speed of the application, but reduced the cost and more closely tied the cost to the profit. The key is this: from the very outset - the design - make sure you include metrics to measure for the cost/performance (sometimes these are the same) for your application. Windows Azure opens up awesome new ways of doing things, so make sure you study distributed systems architecture before you try and force in the application design you have on premises into your new application structure.
|
-
This is a continuation of the books I challenged myself to read to help my career - one a month, for year. You can read my first book review here, and the entire list is here. The book I chose for March 2012 was: The Information: A History, a Theory, a Flood by James Gleick. I was traveling at the end of last month so I’m a bit late posting this review here.
Why I chose this book:
My personal belief about computing is this: All computing technology is simply re-arranging data. We take data in, we manipulate it, and we send it back out. That’s computing. I had heard from some folks about this book and it’s treatment of data. I heard that it dealt with the basics of data - and the semantics of data, information and so on.
It also deals with the earliest forms of history of information, which fascinates me. It’s similar I was told, to GEB which a favorite book of mine as well, so that was a bonus.
Some folks I talked to liked it, some didn’t - so I thought I would check it out.
What I learned:
I liked the book. It was longer than I thought - took quite a while to read, even though I tend to read quickly. This is the kind of book you take your time with. It does in fact deal with the earliest forms of human interaction and the basics of data.
I learned, for instance, that the genesis of the binary communication system is based in the invention of telegraph (far-writing) codes, and that the earliest forms of communication were expensive. In fact, many ciphers were invented not to hide military secrets, but to compress information. A sort of early “lol-speak” to keep the cost of transmitting data low!
I think the comparison with GEB is a bit over-reaching. GEB is far more specific, fanciful and so on. In fact, this book felt more like something fro Richard Dawkins, and tended to wander around the subject quite a bit. I imagine the author doing his research and writing each chapter as a book that followed on from the last one. This is what possibly bothered those who tended not to like it, I think.
Towards the middle of the book, I think the author tended to be a bit too fragmented even for me. He began to delve into memes, biology and more - I think he might have been better off breaking that off into another work. The existentialism just seemed jarring.
All in all, I liked the book. I recommend it to any technical professional, specifically ones involved with data technology in specific. And isn’t that all of us? :)
|
-
Windows Azure allows you to write code in languages within the .NET stack, you can use Java, C++, PHP, NodeJS and others. Code is code - other than keeping things stateless, using a Web or Worker Role in Azure is not all that different from working with an on-premises system. However…. Working in a scalable, component-based stateless architecture that can use federated security is not all that common for many developers. Some are used to owning the server, scaling up, and state-full paradigms that have a single security domain. Making the transition whilst trying to create a new software application or even port a previous one can be daunting. Sure, we have absolutely tons of free training, kits, videos, online books and more to learn on your own, but some things like architecture can be pivotal as you move along. So the question is, should you just strike out on your own for a Cloud project, or get Microsoft Consulting Services or another partner to work with you on your first one? I use a few decision points to help guide the projects I assist in. Note: I’m a huge fan of having help that ends up giving you training and leaves you in charge. If you do engage with someone to help you, make sure you keep this clear and take more and more ownership yourself as the project progresses. How much time do you have? Usually the first thing I ask is about the timeline for the project. It doesn’t matter how skilled you are, if you have a short window to get things done it’s better to get help - especially if this is your first cloud project. Having someone that knows the platform well can save you amazing amounts of time. If you have longer, then start with the training in the link above and once you feel confident, jump in. How complex is the project? If there are a lot of moving parts, it’s best to engage a partner. The reason is that certain interactions - particularly things like Service Bus or Data Integration - can be quite different than what you may have encountered before. How many people do you have? I have a “pizza rule” about projects I’ve used in my career - if it takes over two pizzas to feed everyone on the project, it’s too big and will fail. That being said, one developer and a one-week deadline does not a good project make, usually. It’s best to have at least one architect (or someone in that role) guiding the project along, and at least two developers to work on a cloud project. That’s a generalization of course, since I’ve seen great software on Azure with one developer writing code all by herself, but for more complex projects, more (to a point) is better. The nice thing about bringing on a partner is that you don’t have to hire them full time - they help you and then they go away. How critical is the project? There’s no shame in using some help. If the platform is new, if the project is large and complex, and if it is critical to the business, you should engage a partner. That’s regardless of Cloud or anything else - get some help. You don’t want to hit your company’s bottom line in a negative way, but you have to innovate and get them a competitive advantage. Do your research, make sure the partner is qualified to help you, and get it done. Don’t let these questions scare you off. There are lots of projects you can implement on Windows and SQL Azure with nothing other than the Software Development Kit (SDK) that you get for free with Windows Azure. And assistance comes in many forms - sometimes just phone support, a friend you can ask. Microsoft Consulting Services or any of our great partners. You can get help on just the architecture piece or have them show you how to write the code. They’ll get involved as little or as much as you like.
|
-
Microsoft is working on a new Windows Azure service called “Trust Services”. Trust Services takes a certificate you upload and uses it to encrypt and decrypt sensitive data in the cloud. Of course, like any security service, there’s a bit more to it than that. I’ll give you a quick overview of how you can use this product to protect data you send to SQL Azure.
The primary issue with storing data in the cloud is that you are in an environment that isn’t under your control – in fact, that’s the benefit of being in a distributed computing environment in the first place. On premises you’re able to encrypt data you don’t want anyone else to see, using various methods such as passwords (not very strong) or certificates (stronger). When you use a certificate, it’s vital that you create (or procure) and protect it yourself.
When you store data remotely, regardless of IaaS, PaaS or SaaS, you don’t own the machines where the data lives. That means if you use a certificate from the cloud vendor to encrypt the data, you have to trust that the data won’t be accessed by the vendor. In some cases having a signed agreement with the vendor that they won’t access your data is sufficient, in other cases that doesn’t meet the requirements your system has for security.
With the new Trust Services service, the basic process is that you use a Portal to create a Trust Server using policies and other controls. You place a X.509 Certificate you create or procure in that server. Using the Software development Kit (SDK), the developer has access to an Application Layer Encryption Framework to set fields of data they want to encrypt. From there, the data can be stored in SQL Azure as a standard field – only it is encrypted before it ever arrives. The portion of the client software that decrypts the data uses the same service, so the authenticated user sees the data if they are allowed to do so. The data remains encrypted “at rest”.

You can learn more about this product and check it out in the SQL Azure labs at Microsoft Codename "Trust Services"
|
-
Windows Azure as a Platform as a Service (PaaS) means that there are various components you can use in it to solve a problem:
- Compute “Roles” - Computers running an OS and optionally IIS - you can have more than one "Instance" of a given Role
- Storage - Blobs, Tables and Queues for Storage
- Other Services - Things like the Service Bus, Azure Connection Services, SQL Azure and Caching
It’s important to understand that some of these services are Stateless and others maintain State. Stateless means (at least in this case) that a system might disappear from one physical location and appear elsewhere. You can think of this as a cashier at the front of a store. If you’re in line, a cashier might take his break, and another person might replace him. As long as the order proceeds, you as the customer aren’t really affected except for the few seconds it takes to change them out. The cashier function in this example is stateless.
The Compute Role Instances in Windows Azure are Stateless. To upgrade hardware, because of a fault or many other reasons, a Compute Role's Instance might stop on one physical server, and another will pick it up. This is done through the controlling fabric that Windows Azure uses to manage the systems.
It’s important to note that storage in Azure does maintain State. Your data will not simply disappear - it is maintained - in fact, it’s maintained three times in a single datacenter and all those copies are replicated to another for safety. Going back to our example, storage is similar to the cash register itself. Even though a cashier leaves, the record of your payment is maintained.
So if a Compute Role Instance can disappear and re-appear, the things running on that first Instance would stop working. If you wrote your code in a Stateless way, then another Role Instance simply re-starts that transaction and keeps working, just like the other cashier in the example.
But if you only have one Instance of a Role, then when the Role Instance is re-started, or when you need to upgrade your own code, you can face downtime, since there’s only one. That means you should deploy at least two of each Role Instance not only for scale to handle load, but so that the first “cashier” has someone to replace them when they disappear. It’s not just a good idea - to gain the Service Level Agreement (SLA) for our uptime in Azure it’s a requirement. We point this out right in the Management Portal when you deploy the application:

(Click to enlarge)
When you deploy a Role Instance you can also set the “Upgrade Domain”. Placing Roles on separate Upgrade Domains means that you have a continuous service whenever you upgrade (more on upgrades in another post) - the process looks like this for two Roles. This example covers the scenario for upgrade, so you have four roles total - One Web and one Worker running the "older" code, and one of each running the new code. In all those Roles you want at least two instances, and this example shows that you're covered for High Availability and upgrade paths:

The take-away is this - always plan for forward-facing Roles to have at least two copies. For Worker Roles that do background processing, there are ways to architect around this number, but it does affect the SLA if you have only one.
|
-
Windows Azure is a Platform as a Service – a PaaS – that runs code you write. That code doesn’t just mean the languages on the .NET platform – you can run code from multiple languages, including Java. In fact, you can develop for Windows and SQL Azure using not only Visual Studio but the Eclipse Integrated Development Environment (IDE) as well.
Although not an exhaustive list, here are several links that deal with Java and Windows Azure:
|
-
(Many thanks to Peter Gvozdjak and Dan Benediktson here at Microsoft who worked with me on this issue and provided the bulk of information for this post) Recently I had a customer inquire about some performance tuning he wanted to do for SQL Azure, and as part of that he found that it was possible to remove the “Encrypt=True” setting on the ADO.NET connection to SQL Azure. We have always stated that the connections to SQL Azure are encrypted, so being able to remove this string surprised him. (More on that reference here: http://msdn.microsoft.com/en-us/library/windowsazure/ff394108.aspx) It is true that all connections to SQL Azure are encrypted - whether you use the Encrypt=True string or not. We’ll force the connection to encrypt even if you don’t, or we won’t route it. However, you do want to use that string, for a couple of reasons. Whenever you include the Encrypt=True string, the connection will require that your client validate the Certificate that SQL Azure presents, to ensure that key is the one used by Microsoft. If you don’t include that string, it’s possible - not probable, but possible - that someone could set up a false DNS to cause your certificate to be validated elsewhere. So don’t give the bad guys a way in - there is no performance gain (other than perhaps if the bad DNS is in your own building!) by leaving it off. Follow the best practice of using Encrypt=True. There’s more on connection management for things like retries and so on here: http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-connection-management.aspx
|
-
(As with all of these types of posts, check the date of the latest update I’ve made here. Anything older than 6 months is probably out of date, given the speed with which we release new features into Windows and SQL Azure)
I don’t normally like to discuss things in terms of tools. I find that whenever you start with a given tool (or even a tool stack) it’s too easy to fit the problem to the tool(s), rather than the other way around as it should be.
That being said, it’s often useful to have an example to work through to better understand a concept. But like many ideas in Computer Science, “Big Data” is too broad a term in use to show a single example that brings out the multiple processes, use-cases and patterns you can use it for.
So we turn to a description of the tools you can use to analyze large data sets. “Big Data” is a term used lately to describe data sets that have the “Four V’s” as a characteristic, but I have a simpler definition I like to use:
Big Data involves a data set too large to process in a reasonable period of time
I realize that’s a bit broad, but in my mind it answers the question and is fairly future-proof. The general idea is that you want to analyze some data, and using whatever current methods, storage, compute and so on that you have at hand it doesn’t allow you to finish processing it in a time period that you are comfortable with. I’ll explain some new tools you can use for this processing.
Yes, this post is Microsoft-centric. There are probably posts from other vendors and open-source that cover this process in the way they best see fit. And of course you can always “mix and match”, meaning using Microsoft for one or more parts of the process and other vendors or open-source for another. I never advise that you use any one vendor blindly - educate yourself, examine the facts, perform some tests and choose whatever mix of technologies best solves your problem.
At the risk of being vendor-specific, and probably incomplete, I use the following short list of tools Microsoft has for working with “Big Data”. There is no single package that performs all phases of analysis. These tools are what I use; they should not be taken as a Microsoft authoritative testament to the toolset we’ll finalize for a given problem-space. In fact, that’s the key: find the problem and then fit the tools to that.
Process Types
I break up the analysis of the data into two process types. The first is examining and processing the data in-line, meaning as the data passes through some process. The second is a store-analyze-present process.
Processing Data In-Line
Processing data in-line means that the data doesn’t have a destination - it remains in the source system. But as it moves from an input or is routed to storage within the source system, various methods are available to examine the data as it passes, and either trigger some action or create some analysis.
You might not think of this as “Big Data”, but in fact it can be. Organizations have huge amounts of data stored in multiple systems. Many times the data from these systems do not end up in a database for evaluation. There are options, however, to evaluate that data real-time and either act on the data or perhaps copy or stream it to another process for evaluation.
The advantage of an in-stream data analysis is that you don’t necessarily have to store the data again to work with it. That’s also a disadvantage - depending on how you architect the solution, you might not retain a historical record. One method of dealing with this requirement is to trigger a rollup collection or a more detailed collection based on the event.
StreamInsight - StreamInsight is Microsoft’s “Complex Event Processing” or CEP engine. This product, hooked into SQL Server 2008R2, has multiple ways of interacting with a data flow. You can create adapters to talk with systems, and then examine the data mid-stream and create triggers to do something with it. You can read more about StreamInsight here: http://msdn.microsoft.com/en-us/library/ee391416(v=sql.110).aspx
BizTalk - When there is more latency available between the initiation of the data and its processing, you can use Microsoft BizTalk. This is a message-passing and Service Bus oriented tool, and it can also be used to join system’s data together than normally does not have a direct link, for instance a Mainframe system to SQL Server. You can learn more about BizTalk here: http://www.microsoft.com/biztalk/en/us/overview.aspx
.NET and the Windows Azure Service Bus - Along the same lines as BizTalk but with a more programming-oriented design are the Windows and Windows Azure Service Bus tools. The Service Bus allows you to pass messages as well, and opens up web interactions and even inter-company routing. BizTalk can do this as well, but the Service Bus tools use an API approach for designing the flow and interfaces you want. The Service Bus offerings are also intended as near real-time, not as a streaming interface. You can learn more about the Windows Azure Service Bus here: http://www.windowsazure.com/en-us/home/tour/service-bus/ and more about the Event Processing side here: http://msdn.microsoft.com/en-us/magazine/dd569756.aspx
Store-Analyze-Present
A more traditional approach with an organization’s data is to store the data and analyze it out-of-band. This began with simply running code over a data store, but as locking and blocking became an issue on a file system, Relational Database Management Systems (RDBMs) were created. Over time a distinction was made between data used in an online processing system, meant to be highly available for writing data (OLTP) and systems designed for analytical and reporting purposes (OLAP).
Later the data grew larger than these systems were designed for, primarily due to consistency requirements. In analysis, however, consistency isn’t always a requirement, and so file-based systems for that analysis were re-introduced from the Mainframe concepts, with new technology layered in for speed and size.
I normally break up the process of analyzing large data sets into four phases:
- Source and Transfer - Obtaining the data at its source and transferring or loading it into the storage; optionally transforming it along the way
- Store and Process - Data is stored on some sort of persistence, and in some cases an engine handles the acquisition and placement on persistent storage, as well as retrieval through an interface.
- Analysis - A new layer introduced with “Big Data” is a separate analysis step. This is dependent on the engine or storage methodology, is often programming language or script based, and sometimes re-introduces the analysis back into the data. Some engines and processes combine this function into the previous phase.
- Presentation - In most cases, the data wants a graphical representation to comprehend, especially in a series or trend analysis. In other cases a simple symbolic representation, similar to the “dashboard” elements in a Business Intelligence suite. Presentation tools may also have an analysis or refinement capability to allow end-users to work with the data sets. As in the Analysis phase, some methodologies bundle in the Analysis and Presentation phases into one toolset.
Source and Transfer
You’ll notice in this area, along with those that follow, Microsoft is adopting not only its own technologies but those within open-source. This is a positive sign, and means that you will have a best-of-breed, supported set of tools to move the data from one location to another. Traditional file-copy, File Transfer Protocol and more are certainly options, but do not normally deal with moving datasets.
I’ve already mentioned the ability of a streaming tool to push data into a store-analyze-present model, so I’ll follow up that discussion with the tools that can extract data from one source and place it in another.
SQL Server Integration Services (SSIS)/SQL Server Bulk Copy Program (BCP) - SSIS is a SQL Server tool used to move data from one location to another, and optionally perform transform or other processes as it does so. You are not limited to working with SQL Server data - in fact, almost any modern source of data from text to various database platforms is available to move to various systems. It is also extremely fast and has a rich development environment. You can learn more about SSIS here: http://msdn.microsoft.com/en-us/library/ms141026.aspx BCP is a tool that has been used with SQL Server data since the first releases; it has multiple sources and destinations as well. It is a command-line utility,and has some limited transform capabilities. You can learn more about BCP here: http://msdn.microsoft.com/en-us/library/ms162802.aspx
Sqoop - Tied to Microsoft’s latest announcements with Hadoop on Windows and Windows Azure, Sqoop is a tool that is used to move data between SQL Server 2008R2 (and higher) and Hadoop, quickly and efficiently. You can read more about that in the Readme file here: http://www.microsoft.com/download/en/details.aspx?id=27584
Application Programming Interfaces - API’s exist in most every major language that can connect to one data source, access data, optionally transforming it and storing it in another system. Most every dialect of the .NET-based languages contain methods to perform this task.
Store and Process
Data at rest is normally used for historical analysis. In some cases this analysis is performed near real-time, and in others historical data is analyzed periodically. Systems that handle data at rest range from simple storage to active management engines.
SQL Server - Microsoft’s flagship RDBMS can indeed store massive amounts of complex data. I am familiar with a two systems in excess of 300 Terabytes of federated data, and the Pan-Starrs project is designed to handle 1+ Petabyte of data. The theoretical limit of SQL Server DataCenter edition is 540 Petabytes. SQL Server is an engine, so the data access and storage is handled in an abstract layer that also handles concurrency for ACID properties. You can learn more about SQL Server here: http://www.microsoft.com/sqlserver/en/us/product-info/compare.aspx
SQL Azure Federations - SQL Azure is a database service from Microsoft associated with the Windows Azure platform. Database Servers are multi-tenant, but are shared across a “fabric” that moves active databases for redundancy and performance. Copies of all databases are kept triple-redundant with a consistent commitment model. Databases are (at this writing - check http://WindowsAzure.com for the latest) capped at a 150 GB size limit per database. However, Microsoft released a “Federation” technology, allowing you to query a head node and have the data federated out to multiple databases. This improves both size and performance. You can read more about SQL Azure Federations here: http://social.technet.microsoft.com/wiki/contents/articles/2281.federations-building-scalable-elastic-and-multi-tenant-database-solutions-with-sql-azure.aspx
Analysis Services - The Business Intelligence engine within SQL Server, called Analysis Services, can also handle extremely large data systems. In addition to traditional BI data store layouts (ROLAP, MOLAP and HOLAP), the latest version of SQL Server introduces the Vertipaq column-storage technology allowing more direct access to data and a different level of compression. You can read more about Analysis Services here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/analysis-services.aspx and more about Vertipaq here: http://msdn.microsoft.com/en-us/library/hh212945(v=SQL.110).aspx
Parallel Data Warehouse - The Parallel Data Warehouse (PDW) offering from Microsoft is largely described by the title. Accessed in multiple ways including using Transact-SQL (the Microsoft dialect of the Structured Query Language), This is an MPP appliance scaling in parallel to extremely large datasets. It is a hardware and software offering - you can learn more about it here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx
HPC Server - Microsoft’s High-Performance Computing version of Windows Server deals not only with large data sets, but with extremely complicated computing requirements. A scale-out architecture and inter-operation with Linux systems, as well as dozens of applications pre-written to work with this server make this a capable “Big Data” system. It is a mature offering, with a long track record of success in scientific, financial and other areas of data processing. It is available both on premises and in Windows Azure, and also in a hybrid of both models, allowing you to “rent” a super-computer when needed. You can read more about it here: http://www.microsoft.com/hpc/en/us/product/cluster-computing.aspx
Hadoop - Pairing up with Hortonworks, Microsoft has released the Hadoop Open-Source system - including HDFS and a Map/Reduce standardized software, Hive and Pig - on Windows and the Windows Azure platform. This is not a customized version; off-the-shelf concepts and queries work well here. You can read more about Hadoop here: http://hadoop.apache.org/common/docs/current/ and you can read more about Microsoft’s offerings here: http://hortonworks.com/partners/microsoft/ and here: http://social.technet.microsoft.com/wiki/contents/articles/6204.hadoop-based-services-for-windows.aspx
Windows and Azure Storage - Although not an engine - other than a triple-redundant, immediately consistent commit - Windows Azure can hold terabytes of information and make it available to everything from the R programming language to the Hadoop offering. Binary storage (Blobs) and Table storage (Key-Value Pair) data can be queried across a distributed environment. You can learn more about Windows Azure storage here: http://msdn.microsoft.com/en-us/library/windowsazure/gg433040.aspx
Analysis
In a “Big Data” environment, it’s not unusual to have a specialized set of tasks for analyzing and even interpreting the data. This is a new field called “data Science”, with a requirement not only for computing, but also a heavy emphasis on math.
Transact-SQL - T-SQL is the dialect of the Structured Query Language used by Microsoft. It includes not only robust selection, updating and manipulating of data, but also analytical and domain-level interrogation as well. It can be used on SQL Server, PDW and ODBC data sources. You can read more about T-SQL here: http://msdn.microsoft.com/en-us/library/bb510741.aspx
Multidimensional Expressions and Data Analysis Expressions - The MDX and DAX languages allow you to query multidimensional data models that do not fit well with typical two-plane query languages. Pivots, aggregations and more are available within these constructs to query and work with data in Analysis Services. You can read more about MDX here: http://msdn.microsoft.com/en-us/library/ms145506(v=sql.110).aspx and more about DAX here: http://www.microsoft.com/download/en/details.aspx?id=28572
HPC Jobs and Tasks - Work submitted to the Windows HPC Server has a particular job - essentially a reservation request for resources. Within a job you can submit tasks, such as parametric sweeps and more. You can learn more about Jobs and Tasks here: http://technet.microsoft.com/en-us/library/cc719020(v=ws.10).aspx
HiveQL - HiveQL is the language used to query a Hive object running on Hadoop. You can see a tutorial on that process here: http://social.technet.microsoft.com/wiki/contents/articles/6628.aspx
Piglatin - Piglatin is the submission language for the Pig implementation on Hadoop. An example of that process is here: http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/10/running-apache-pig-pig-latin-at-apache-hadoop-on-windows-azure.aspx
Application Programming Interfaces - Almost all of the analysis offerings have associated API’s - of special note is Microsoft Research’s Infer.NET, a new language construct for framework for running Bayesian inference in graphical models, as well as probabilistic programming. You can read more about Infer.NET here: http://research.microsoft.com/en-us/um/cambridge/projects/infernet/
Presentation
Lots of tools work in presenting the data once you have done the primary analysis. In fact, there’s a great video of a comparison of various tools here: http://msbiacademy.com/Lesson.aspx?id=73 Primarily focused on Business Intelligence. That term itself is now not as completely defined, but the tools I’ll show below can be used in multiple ways - not just traditional Business Intelligence scenarios. Application Programming Interfaces (API’s) can also be used for presentation; but I’ll focus here on “out of the box” tools.
Excel - Microsoft’s Excel can be used not only for single-desk analysis of data sets, but with larger datasets as well. It has interfaces into SQL Server, Analysis Services, can be connected to the PDW, and is a first-class job submission system for the Windows HPC Server. You can watch a video about Excel and big data here: http://www.microsoft.com/en-us/showcase/details.aspx?uuid=e20b7482-11c9-4965-b8f0-7fb6ac7a769f and you can also connect Excel to Hadoop: http://social.technet.microsoft.com/wiki/contents/articles/how-to-connect-excel-to-hadoop-on-azure-via-hiveodbc.aspx
Reporting Services - Reporting Services is a SQL Server tool that can query and show data from multiple sources, all at once. It can also be used with Analysis Services. You can read more about Reporting Services here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/reporting-services.aspx
Power View - Power View is a “Self-Service” Business Intelligence reporting tool, which can work with on-premises data in addition to SQL Azure and other data. You can read more about it and see videos of Power View in action here: http://www.microsoft.com/sqlserver/en/us/future-editions/business-intelligence/SQL-Server-2012-reporting-services.aspx
SharePoint Services - Microsoft has rolled several capable tools in SharePoint as “Services”. This has the advantage of being able to integrate into the working environment of many companies. You can read more about lots of these reporting and analytic presentation tools here: http://technet.microsoft.com/en-us/sharepoint/ee692578
This is by no means an exhaustive list - more capabilities are added all the time to Microsoft’s products, and things will surely shift and merge as time goes on. Expect today’s “Big Data” to be tomorrow’s “Laptop Environment”.
|
-
There are multiple ways to store data in a cloud provider, specifically around Windows and SQL Azure. As part of a “Data First” architecture design, one decision vector – assuming you’ve already done a data classification of the elements you want to store – is to decide the transaction level you need for that datum. Once you’ve decided on what level of transactional commitment you need, you can make intelligent decisions about the storage engine, method of access and storage, speed and other requirements.
Although the list below is neither original nor exhaustive, these are the general considerations I use for a given data set. It’s important to note that in many on premises systems the engine choice at hand overrides these concerns. If you have a large Relational Database Management System (RDBMS) for instance, you might simply place all data there without further consideration. In a Platform as a Service (PaaS) like Windows and SQL Azure, however, selection of the proper engine for a particular dataset has implications ranging from cost to performance, and selecting the right engine is critical when you want to leverage the data across “Bid Data” analysis like Hadoop or other constructs.
Monolithic Consistent Transactional The first selection is analogous to a local RDBMS system. The dataset is retrieved in a functionally single, monolithic transaction, i.e. kept together with ACID properties in mind. This is the most reliable type of data design for datasets that require a high degree of safety in the read/write pattern. As an example, a bank ATM transaction should be modeled in a monolithic way. If I make a transfer of funds from one account to another, I want the money to be subtracted from one account if and only if it is successfully added to the other. The bank, on the other hand, wants the money added to the second account if and only if it is subtracted from the first. This is a prime example of a monolithic (single atomic transaction), Consistent (if and only if) and Transactional (as a unit, with provision for roll-back and reporting if unsuccessful) data requirement.
The primary engine used for this type of data is often SQL Azure – an RDMBS in the same datacenters as Windows Azure. Placing both the calling application, whether that is a Data Access Layer-based code widget or a direct call from a Web or Worker Role, means that data is retrieved quickly and in a monolithic way. The costs for this method is based on overall database size. A consideration is how much data you can store this way. Database sizes have limits, although there are ways of overcoming size issues using technologies such as Sharding or SQL Azure Federations. There is also the consideration of performance. In an RDBMs that conforms to ACID properties, locking and other overhead for safety is at conflict with the highest possible read performance. But in some cases the ACID properties are worth the cost, as in the banking example.
You are not limited to SQL Azure in this model. Windows Azure Table storage, while similar to NoSQL offerings is different in that it is immediately consistent across all three replicated copies of data, offering a higher degree of safety. And while Table storage does not offer built-in support for transactions, there are ways to achieve certain transaction levels.
Monolithic Realtime If consistency can be relaxed – meaning that a guaranteed read/write patter is not essential – then more options arise in Windows and SQL Azure. You can still use SQL Azure for this type of storage, with either automatic or programmatic hints allowing for “dirty reads”. Windows Azure Table storage is still consistent, but the selection of the method for querying the data such as separate copies of read and write data can be employed. Because of the relaxed transaction nature, higher speeds are possible by querying cached or separate datasets.
An example here is that same transaction from the bank, but a statement inquiry. Just after the money is deposited, the user wishes to query the current balance. The current balance – minus the transaction that just occurred – is retrieved and shown to the user, perhaps even combining the amount with the latest transaction, perhaps saved as a local cached object, with a caveat to the user.
Distributed Realtime At some point, the data becomes too large to fit inside a single processing session, and parallelism is used. In this case, either separate databases in SQL Azure or Windows Azure Tables, local data storage on the Web or Worker Role, or a combination of all with Caches is the right approach for the data design.
The biggest implication in this type of system is speed – a higher degree of data separation is essential, and so the dataset selection must fit the pattern. It is unacceptable to force an ACID-properties type workload into this environment. Typical examples here are the actual data asset payload for streaming video or music, read-only documents and so on. This pattern is often separated from the meta-data, which is kept in more of a transactional model.
As an example, assume you log on to a website to watch a movie or listen to music. The provider needs to verify your identity and account balance, which are transactional data loads. After that process is complete, the workload shifts to a copy – perhaps one of several – of the asset to stream to your location.
In this case, Windows Azure Blob storage, along with the Content Delivery Network (CDN – a series of servers closer to the user) is employed along with the transactional realtime requirements for the metadata.
Distributed Eventual At the furthest end of the data scale are large datasets that need deeper analysis, but not necessarily in realtime. Examples here are terrabytes of data requiring a Business Intelligence view, but with a tolerance of a few seconds to minutes or hours. In this case, Storage, Processing and Query methods, such as the Hadoop offering in Windows Azure, or perhaps the High Performance Computing (HPC) Windows Server in Windows Azure fit well. Here, the design of the data is often dictated by the source, and more emphasis is placed on the algorithms around processing and re-assembling the data.
There are, of course, other patterns. In many cases a single dataset may have needs in one or more of these categories – in fact, sitting at 30,000 feet typing this entry, I’m having that very design discussion with a gentleman sitting next to me. The key is to design data-first, and fit the technology to the requirement for each datum. Allow each function and engine to handle the data in the most efficient, effective way for cost, performance and utility.
|
|
|
|
|
|