<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://www2.sqlblog.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Search results matching tag 'Data Science'</title><link>http://www2.sqlblog.com/search/SearchResults.aspx?o=DateDescending&amp;tag=Data+Science&amp;orTags=0</link><description>Search results matching tag 'Data Science'</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP2 (Build: 61129.1)</generator><item><title>Using Hadoop (HDInsight) with Microsoft - Two (OK, Three) Options</title><link>http://www2.sqlblog.com/blogs/buck_woody/archive/2012/12/04/using-hadooop-hdinsight-with-microsoft-two-ok-three-options.aspx</link><pubDate>Tue, 04 Dec 2012 15:28:23 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:46509</guid><dc:creator>BuckWoody</dc:creator><description>&lt;p&gt;Microsoft has many tools for &amp;ldquo;Big Data&amp;rdquo;. In fact, you need many tools &amp;ndash; there&amp;rsquo;s no product called &amp;ldquo;Big Data Solution&amp;rdquo; in a shrink-wrapped box, and if you find one, you probably shouldn&amp;rsquo;t buy it. It&amp;rsquo;s tempting to want a single tool that handles everything in a problem domain, but with large, complex data, that isn&amp;rsquo;t realistic. You&amp;rsquo;ll mix and match several systems, open and closed source, to solve a given problem.&lt;/p&gt;
&lt;p&gt;But there are tools that help with handling data at large, complex scales. Normally the best way to do this is to break up the data into parts, and then put the calculation engines for that chunk of data right on the node where the data is stored. These systems are in a family called &amp;ldquo;Distributed File and Compute&amp;rdquo;. Microsoft has a couple of these, including the &lt;a href="http://www.microsoft.com/hpc/en/us/default.aspx"&gt;High Performance Computing edition of Windows Server&lt;/a&gt;. Recently we partnered with &lt;a href="http://hortonworks.com/"&gt;Hortonworks&lt;/a&gt; to bring the &lt;a href="http://hadoop.apache.org/"&gt;Apache Foundation&amp;rsquo;s release of Hadoop&lt;/a&gt; to Windows. And as it turns out, there are actually two (technically three) ways you can use it.&lt;/p&gt;
&lt;p style="padding-left:30px;"&gt;&lt;span style="color:#993300;"&gt;&lt;em&gt;(There&amp;rsquo;s a more detailed set of information here: &lt;a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx"&gt;&lt;span style="color:#993300;"&gt;http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx&lt;/span&gt;&lt;/a&gt;, I&amp;rsquo;ll cover the options at a general level below)&amp;nbsp; &lt;/em&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h1&gt;First Option: Windows Azure HDInsight Service&lt;/h1&gt;
&lt;p&gt;Your first option is to simply log on to a Hadoop control node and begin to run Pig or Hive statements against data that you have stored in Windows Azure. There&amp;rsquo;s nothing to set up (although you can configure things where needed), and you can send the commands, get the output of the job(s), and stop using the service when you are done &amp;ndash; and repeat the process later if you wish.&lt;/p&gt;
&lt;p&gt;(There are also connectors to run jobs from Microsoft Excel, but that&amp;rsquo;s another post)&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;a href="http://sqlblog.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/0572.option_2D00_1.png"&gt;&lt;img src="http://sqlblog.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/0572.option_2D00_1.png" alt="" width="367" height="212" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This option is useful when you have a periodic burst of work for a Hadoop workload, or when the data collection has been happening in Windows Azure storage anyway. That data might come from a web application, the logs from a web application, &lt;a href="http://en.wikipedia.org/wiki/Telemetry"&gt;telemetry&lt;/a&gt; (remote sensor input), or other modes of constant collection.&lt;/p&gt;
&lt;p&gt;You can read more about this option here: &amp;nbsp;&lt;a href="http://sqlblog.com/b/windowsazure/archive/2012/10/24/getting-started-with-windows-azure-hdinsight-service.aspx"&gt;http://blogs.msdn.com/b/windowsazure/archive/2012/10/24/getting-started-with-windows-azure-hdinsight-service.aspx&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;Second Option: Microsoft HDInsight Server&lt;/h1&gt;
&lt;p&gt;Your second option is to use the Hadoop distribution for on-premises Windows, called Microsoft HDInsight Server. You set up the Name Node(s), Job Tracker(s), and Data Node(s), among other components, and you have control over the entire ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://sqlblog.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/7041.option_2D00_2.png"&gt;&lt;img src="http://sqlblog.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/7041.option_2D00_2.png" alt="" width="152" height="179" border="0" /&gt;&lt;/a&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;This option is useful if you want to have complete control over the system, leave it running all the time, or you have a huge quantity of data that you have to bulk-load constantly &amp;ndash; something that isn&amp;rsquo;t going to be practical with a network transfer or disk-mailing scheme.&lt;/p&gt;
&lt;p&gt;You can read more about this option here: &lt;a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx"&gt;http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;Third Option (unsupported): Installation on Windows Azure Virtual Machines&lt;/h1&gt;
&lt;p&gt;Although it&amp;rsquo;s unsupported, you could simply use a Windows Azure Virtual Machine (we support both Windows and Linux servers) and install Hadoop yourself &amp;ndash; it&amp;rsquo;s open-source, so there&amp;rsquo;s nothing preventing you from doing that.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;a href="http://sqlblog.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/0121.option_2D00_3.png"&gt;&lt;img src="http://sqlblog.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/0121.option_2D00_3.png" alt="" width="326" height="188" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Aside from being unsupported, there are other issues you&amp;rsquo;ll run into with this approach &amp;ndash; primarily involving performance and the amount of configuration you&amp;rsquo;ll need to do to access the data nodes properly. But for a single-node installation (where all components run on one system) used for learning, demos, training and the like, this isn&amp;rsquo;t a bad option.&lt;/p&gt;
&lt;p&gt;Did I mention that it&amp;rsquo;s unsupported? :)&lt;/p&gt;
&lt;p&gt;You can learn more about Windows Azure Virtual Machines here: &lt;a href="http://www.windowsazure.com/en-us/home/scenarios/virtual-machines/"&gt;http://www.windowsazure.com/en-us/home/scenarios/virtual-machines/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And more about Hadoop and the installation/configuration (on Linux) here: &lt;a href="http://en.wikipedia.org/wiki/Apache_Hadoop"&gt;http://en.wikipedia.org/wiki/Apache_Hadoop&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And more about the HDInsight installation here: &lt;a href="http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW"&gt;http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;Choosing the right option&lt;/h1&gt;
&lt;p&gt;Since you have two or three routes you can take, the best thing to do is evaluate the need you have, and place the workload where it makes the most sense. My suggestion is to install the HDInsight Server locally on a test system, and play around with it. Read up on the best ways to use Hadoop for a given workload, understand the parts, write a little Pig and Hive, and get your feet wet. Then sign up for a test account on HDInsight Service, and see how that leverages what you know. If you&amp;rsquo;re a true tinkerer, go ahead and try the VM route as well.&lt;/p&gt;
&lt;p&gt;Oh - there&amp;rsquo;s another great reference on the Windows Azure HDInsight Service that just came out, here: &lt;a href="http://sqlblog.com/b/brunoterkaly/archive/2012/11/16/hadoop-on-azure-introduction.aspx"&gt;http://blogs.msdn.com/b/brunoterkaly/archive/2012/11/16/hadoop-on-azure-introduction.aspx&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Is Data Science “Science”?</title><link>http://www2.sqlblog.com/blogs/buck_woody/archive/2012/10/16/is-data-science-science.aspx</link><pubDate>Tue, 16 Oct 2012 13:29:03 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:45600</guid><dc:creator>BuckWoody</dc:creator><description>&lt;p&gt;I hold the term &amp;ldquo;science&amp;rdquo; in very high esteem. I grew up on the Space Coast in Florida, and eventually worked at the Kennedy Space Center, surrounded by very intelligent people who worked in various scientific fields.&lt;/p&gt;
&lt;p&gt;Recently a new term has entered the computing dialog &amp;ndash; &amp;ldquo;Data Scientist&amp;rdquo;. Since it&amp;rsquo;s not a standard term, it has a lot of definitions, and in fact has been disputed as a correct term. After all, the reasoning goes, if there&amp;rsquo;s no such thing as &amp;ldquo;Data Science&amp;rdquo; then how can there be a Data Scientist?&lt;/p&gt;
&lt;p&gt;This argument has been made before, albeit with a different term &amp;ndash; &amp;ldquo;Computer Science&amp;rdquo;. In Peter Denning&amp;rsquo;s excellent article &amp;ldquo;Is Computer Science Science?&amp;rdquo; (Communications of the ACM, April 2005, Vol. 48, No. 4) there are many points that separate &amp;ldquo;science&amp;rdquo; from &amp;ldquo;engineering&amp;rdquo; and even &amp;ldquo;art&amp;rdquo;. I won&amp;rsquo;t repeat the content of that article here (I recommend you read it on your own) but will draw on the points he makes there.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Definition of Science&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;To ask the question &amp;ldquo;is data science &amp;lsquo;science&amp;rsquo;?&amp;rdquo; we need to start with a definition of terms. Various references put the definition into the same basic areas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Study of the physical world&lt;/li&gt;
&lt;li&gt;Systematic and/or disciplined study of a subject area&lt;/li&gt;
&lt;li&gt;...and then they include the things studied, the bodies of knowledge and so on.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The word itself comes from Latin, and means merely &amp;ldquo;to know&amp;rdquo; or &amp;ldquo;to study to know&amp;rdquo;. Greek divides knowledge further into &amp;ldquo;truth&amp;rdquo; (&lt;em&gt;episteme&lt;/em&gt;), and practical use or effects (&lt;em&gt;tekhne&lt;/em&gt;). Normally computing falls into the second realm.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Definition of Data Science&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;And now a more controversial definition: Data Science. This term is so new and perhaps so niche that the major dictionaries haven&amp;rsquo;t yet picked it up (my OED reference is older &amp;ndash; can&amp;rsquo;t afford to pop for the online registration at present).&lt;/p&gt;
&lt;p&gt;Researching the term's general use I created an amalgam of the definitions this way:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;span style="color:#0000ff;"&gt;&amp;ldquo;Studying and applying mathematical and other techniques to derive information from complex data sets.&amp;rdquo;&lt;/span&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Using this definition, data science certainly seems to be science - it&amp;rsquo;s learning about and studying some object or area using systematic methods. But implicit within the definition is the word &amp;ldquo;application&amp;rdquo;, which makes the process more akin to engineering or even technology than science. In fact, I find that using these techniques &amp;ndash; and data itself &amp;ndash; is part of science, not science itself.&lt;/p&gt;
&lt;p&gt;I leave out the concept of studying data patterns or algorithms as part of this discipline. That is actually a domain I see within research, mathematics or computer science. That of course is a type of science, but it does not seek practical applications.&lt;/p&gt;
&lt;p&gt;As part of the argument against calling it &amp;ldquo;Data Science&amp;rdquo;, some point to the scientific method of creating a hypothesis, testing with controls, testing results against the hypothesis, and documenting for repeatability. These are not steps that we often take in working with data. We normally start with a question, and fit patterns and algorithms to predict outcomes and find correlations. In this way the process of Data Science is more akin to statistics (and in fact makes heavy use of it) than to starting with an assumption and following through on it.&lt;/p&gt;
&lt;p&gt;So, is Data Science &amp;ldquo;Science&amp;rdquo;? I&amp;rsquo;m uncertain &amp;ndash; and I&amp;rsquo;m uncertain it matters. Even if we are facing rampant &amp;ldquo;title inflation&amp;rdquo; these days (does anyone introduce themselves as a secretary or supervisor anymore?) I can tolerate the term, at least for its intent: that we use data to study problems across a wide spectrum, rather than restricting it to a single domain. And I also understand those who have worked hard to achieve the very honorable title of &amp;ldquo;scientist&amp;rdquo; and who have issues with those who borrow the term without asking.&lt;/p&gt;
&lt;p&gt;What do you think? Science, or not? Does it matter?&lt;/p&gt;</description></item></channel></rss>