


Decision Trees is a directed technique. Your target variable is the one that holds information about a particular decision, divided into a few discrete and broad categories (yes / no; liked / partially liked / disliked, etc.). You are trying to explain this decision using other gleaned information saved in other variables (demographic data, purchasing habits, etc.). With limited statistical significance, you are going to predict the target variable for a new case using its known values of the input variables based on results of your trained model. Recursive partitioning is used to build the tree. The data is split into partitions using a certain value of one of the explaining variables. The partitions are then split again and again. Initially the data is in one big box. The algorithm tries all possible breaks of both input (explaining) variables for the initial split. The goal is to get purer partitions considering the classes of the target variable. You know intuitively that purity is related to the percentage of the cases in each class of the target variable. There are many better, but more complicated measures of the purity, for example entropy or information gain. The tree continues to grow using the two new partitions as separate starting points and splitting them more. You have to stop the process somewhere. Otherwise, you could get a completely fitted tree that has only one case in each class. The class would be, of course, absolutely pure. This would not make any sense. The results could not be used for any meaningful prediction. This phenomenon is called “overfitting”. There are two basic approaches to solve this problem: prepruning (bonsai) and postpruning techniques. The prepruning (bonsai) methods prevent growth of the tree in advance by applying tests at each node to determine whether a further split is going to be useful; the tests can be simple (number of cases) or complicated (complexity penalty). The postpruning methods allow the tree to grow and then prune off the useless branches. The postpruning methods tend to give more accurate results, but they require more computation than prepruning methods. Imagine the following example. You have the answers to a simple question: Did you like the famous Woodstock movie? You also have some demographic data: age (20 to 60) and education (ranged in 7 classes from the lowest to the highest). In all, 55% of the interviewees liked the movie and 45% did not like it. Can you discover factors that have an influence on whether they liked the movie? Starting point: 55% of the interviewees liked the movie and 45% did not like it. After checking all possible splits, you find the best initial split made at the age of 35. With further splitting, you finish with a fullgrown tree. Note that not all branches lead to purer classes. Some of them are not useful at all and should be pruned. Decision trees are used for classification and prediction. Typical usage scenarios include:  Predicting which customers will leave
 Targeting the audience for mailings and promotional campaigns
 Explain reasons for a decision
 Answering questions such as “What movies do young female customers buy?”
Decision Trees is the most popular data mining algorithm. This is because the results are very understandable and simple to interpret, and the quality of the predictions is usually very high.


It is alive! It was really hard to make the choices. Nevertheless, the schedule is public now. To everybody that submitted proposals – thank you very much again! We are sorry we cannot accommodate everybody. Please note that even if you were not selected, we would be happy to see you in Ljubljana.



I am continuing with my data mining and machine learning algorithms series. Naive Bayes is a nice algorithm for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent. Starting with a single dependent and a single independent variable, the algorithm is not too complex to understand (I am using an example from my old book about statistics  Thomas H. Wonnacott, Ronald J. Wonnacot: Introductory Statistics, Wiley 1990). Let’s say I am buying a used car. In an auto magazine I find that 30% of secondhand cars are faulty. I take with me a mechanic who can make a shrewd guess on a basis of a quick drive. Of course, he isn’t always right. Of the faulty cars he examined in the past he correctly pronounced 90 % faulty and wrongly pronounced 10% ok. When judging good cars, he correctly pronounced 80% of them as good, and wrongly 20% as faulted. In the graph, we can see that 27% (90% of 30%) of all cars are actually faulty and then correctly identified as such. 14% (20% of 70%) are judged faulty, although they are good. Altogether, 41% (27% + 14%) of cars are judged faulty. Of these cars, 67% (27% / 41%) are actually faulty. To sum up: Once the car has been pronounced faulty by the mechanic, the chance that it is actually faulty rises from the original 30% up to 67%. The following figure shows this process. The calculations in the previous slide can be summarized in another tree, the reverse tree. You can start branching with opinion of the mechanic (59% ok and 41% faulty). Moving to the right, the second branching shows the actual conditions of the cars, and this is the valuable information for you. For example, the top branch says: Once the car is judged faulty, the chance that it actually turns faulty is 67%. The third branch from the top displays clearly: Once the car is judged good, the chance that it is actually faulty is just 5%. You can see the reverse tree in the following figure. As mentioned, Naïve Bayes treats all of the input attributes as independent of each other with respect to the target variable. While this could be a wrong assumption, it allows multiplying the probabilities to determine the likelihood of each state of the target variable based on states of input variables. For example, let’s say that you need to analyze whether there is any association between NULLs in different columns of your Products table. You realize that if Color is missing, 80% of Weight values are missing as well; and if Class is missing, 60% of Weight values are missing as well. You can multiply these probabilities. If Weight is missing, you can calculate the product: 0.8 (Color missing for Weight missing) * 0.6 (Class missing for Weight missing) = 0.48 You can also check what happens to the not missing state of the target variable, the Weight: 0.2 (Color missing for Weight not missing) * 0.4 (Class missing for Weight not missing) = 0.08 You can easily see that the likelihood that Weight is missing is much higher than the likelihood it is not missing when Color and Class are unknown. You can convert the likelihoods to probabilities by normalizing their sum to 1: P (Weight missing if Color and Class are missing) = 0.48 / (0.48 + 0.08) = 0.857 P (Weight not missing if Color and Class are missing) = 0.08 / (0.48 + 0.08) = 0.143 Now when you know that the Color value is NULL and the Class value is null, then you have nearly 86% chances that you get NULL also in the Weight attribute. This might lead you to some conclusions where to start improving your data quality. In general, you use the Naive Bayes algorithm for classification. You want to extract models describing important data classes and then assign new cases to predefined classes. Some typical usage scenarios include:  Categorizing bank loan applications (safe or risky) based on previous experience
 Determining which home telephone lines are used for Internet access
 Assigning customers to predefined segments.
 Quickly obtaining a basic comprehension of the data by checking the correlation between input variables.


I am finishing my list of conferences and seminars I am attending in the second half of the year 2015. Here is my list.  Kulendayz 2015 – September 4th5th. Although I will have huge problems to get there on time, I would never like to miss it. I have one talk there.
 SQL Saturday #413 Denmark – September 17th19th. You can join me already on Friday for the Data Mining Algorithms seminar.
 SQL Saturday #434 Holland – September 25th26th. If you miss the Denmark Data Mining Algorithms seminar, I am repeating it in Utrecht.
 SQL Server Days 2015 Belgium – September 28th29th. This will be my first talk at SQL Server Days.
 SQL Saturday #454 Turin – October 10th. I was not confirmed as a speaker yet, but I still plan to go there, to combine the SQL part with the Expo in Milan part.
 PASS Summit 2015 Seattle – October 27th30th. I still continue to be present at every single summit:) This year I have one presentation.
 SharePoint Days 2015 Slovenia – November 17th18th. No, I don’t like SPS. I will just have one small BI presentation there.
 SQL Saturday #475 Belgrade – November 28th. First SQL Saturday in Serbia. I simply must be there.
 SQL Saturday #460 Slovenia – December 11th12th. I am coorganizing this event. Everybody is welcome, this will be fully Englishspeaking event. Don’t miss beautiful, relaxed and friendly Ljubljana!
That’s it. For now:)


This is a bit different post in the series about the data mining and machine learning algorithms. This time I am honored and humbled to announce that my fourth Pluralsight course is alive. This is the Data Mining Algorithms in SSAS, Excel, and R course. besides explaining the algorithms, I also show demos in different products. This gives you even better understanding than just reading the blog posts. Of course, I will continue with describing the algorithms here as well.


So we are back. PASS SQL Saturday is coming to Slovenia again on December 12th, 2015. Remember last two years? We had two great events. According to feedback, everybody was satisfied and happy. Let's make another outstanding event! How can you help? First of all, these events are free for attendees. Of course, this is possible only because sponsors pay the bills, and speakers come on their own expenses, using their own free time. For a start, please respect these facts. We always need more sponsors. Maybe your company is interested, maybe you know a company that would be interested for the sponsorship. Please spread the word about the event. This event is second to none in Slovenia. No other event in Slovenia attracts so many top speakers from all around the world as our SQL Saturday. Last year there were around two hundred attendees, with more than one third coming abroad. Therefore, as a sponsor, you can reach quite a broad audience, no matter that the event is in a small country. The second pillar of the event are the speakers. Please consider speaking at the event. We are not limited to top speakers only; one of the goals of the event is also to raise new speakers. We are successful with this task. Every year we get somebody new. Therefore, don't hesitate; if you would like to join the SQL Server speakers community, this is a good time and place! Organization of the event takes a lot of time as well. Maybe you can help here? When you register, you can also volunteer for a help. Or you can contact Matija Lah, Mladen Prajdić, or me directly. There is never enough advertisement. Again, please spread the word about the event. Maybe your coworkers and friends didn't hear about the event yet. Please tell them. If you attended SQL Saturday Slovenia before, please share your experience. Finally, please register and then attend the event. Last year we reached less than 5% dropoff rate, which is, as far as we know, a world record. Let's keep up the good work!


Support vector machines are both, unsupervised and supervised learning models for classification and regression analysis (supervised) and for anomaly detection (unsupervised). Given a set of training examples, each marked as belonging to one of categories, an SVM training algorithm builds a model that assigns new examples into one category. An SVM model is a representation of the cases as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. A support vector machine constructs a hyperplane or set of hyperplanes in a highdimensional space defined by the input variables, which can be used for classification, regression, or other tasks. A SVM is a discrete linear classifier. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (socalled functional margin). The larger the margin the lower the generalization error. Let me show you this on a graphical example. Of course, I am showing a twodimensional space defined by only two input variables, and therefore my separating hyperplane is just a line. The first figure shows the twodimensional space with cases and a possible singledimensional hyperplane (line). Of course, you can see that this line cannot be a separator at all, because there are some cases on the line, or said differently, on both sides of the line. The next try is better. The line is separating the cases in the space. However, this is not the best possible separation. Some cases are pretty close to the line. The third picture shows the best possible separation. The hyperplane that separates the cases the best is found, and the model is trained. Support Vector Machines are powerful for some specific classifications:  Text and hypertext categorization
 Images classification
 Classifications in medicine
 Handwritten characters recognition
Oneclass SVM can be used for anomaly detection, like detection of dirty data, or fraud detection. It uses a classification function without parameters, the one selected for the separation without regard to a target variable. Cases that are close to the separation hyperplane are the suspicious cases. Therefore, the result is dichotomous: 1=regular case, 0=outlier.


Principal component analysis (PCA) is a technique used to emphasize the majority of the variation and bring out strong patterns in a dataset. It is often used to make data easy to explore and visualize. It is closely connected to eigenvectors and eigenvalues. A short definition of the algorithm: PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. Initially, variables used in the analysis form a multidimensional space, or matrix, of dimensionality m, if you use m variables. The following picture shows a twodimensional space. Values of the variables v1 and v2 define cases in this 2D space. Variability of the cases is spread across both source variables approximately equally. Finding principal components means finding new m axes, where m is exactly equal to the number of the source variables. However, these new axes are selected in such way that the most of the variability of the cases is spread over a single new variable, or over a principal component, like shown in the following picture. We can deconstruct the data points matrix into eigenvectors and eigenvalues. Every eigenvector has a corresponding eigenvalue. A eigenvector is a direction of the line and a eigenvalue is a number, telling how much variance there is in the data in that direction, or how spread out the data is on the line. he eigenvector with the highest eigenvalue is therefore the principal component. Here is an example of calculation of eigenvectors and eigenvalues for a simple twodimensional matrix. The interpretation of the principal components is up to you and might be pretty complex. This fact might limit PCA usability for businessoriented problems. PCA is used more in machine learning and statistics than in data mining, which is more end user oriented, and the results thus should be easy understandable. You use the PCA to:  Explore the data to explain the variability;
 Reduce the dimensionality – replace the m variables with n principal components, where n < m, in such a way that preserves the most of the variability;
 Use the residual variability not explained by the PCs for anomaly detection.


With the KMeans algorithm, each object is assigned to exactly one cluster. It is assigned to this cluster with a probability equal to 1.0. It is assigned to all other clusters with a probability equal to 0.0. This is hard clustering. Instead of distance, you can use a probabilistic measure to determine cluster membership. For example, you can cover the objects with bell curves for each dimension with a specific mean and standard deviation. A case is assigned to every cluster with a certain probability. Because clusters can overlap, this is called soft clustering. The ExpectationMaximization (EM) method changes the parameters of the bell curve to improve covering in each iteration. The Expectation  Maximization (EM) Clustering algorithm extends the KMeans paradigm in a different way. Instead of assigning each object to a dedicated cluster, it assigns each object to a cluster according to a weight representing the probability of the membership. In other words, there are no strict boundaries between clusters. Therefore, new means are computed based on weighted measures. The EM algorithm iterates between two steps. In the first step—the "expectation" step—the algorithm calculates the cluster membership of each case (i.e., the probability that a case belongs to a given cluster from the initially defined k number of clusters). In the second step—the "maximization" step—the algorithm uses these cluster memberships to reestimate the models' parameters, such as the location and scale parameters of Gaussian distribution. The algorithm assumes that the data is drawn from a mixture of Gaussian distributions (bell curves). Take a look at the graphics. In the first row, the algorithm initializes the mixture distribution, which is the mixture of several bell curves here. In the second and third rows, the algorithm modifies the mixture distribution based on the data. The iteration stops when it meets the specified stopping criteria—for example, when it reaches a certain likelihoodofimprovement rate between iterations. Step 1: Initializing the mixture distribution Step 2: Modifying the mixture distribution Step 3: Final modification You use the EM Clustering for the same purposes as the KMeans Clustering. In addition, you can search for outliers based on combinations of values of all input variables with the EM algorithm. You check the highest probability of cases over all clusters. The cases where the highest probability is still low do not fit well into any cluster. Said differently, they are not like other cases, and therefore you can assume that they are outliers. See the last figure in this blog post bellow. The green case belongs to the cluster D with probability 0.95, to the cluster C with probability 0.002, to the cluster E with probability 0.0003, and so on. The red case belongs to the cluster C with probability 0.03, to the cluster B with probability 0.02, to the cluster D with probability 0.003, and so on. The highest probability for the red case is still a low value; therefore, this case does not fit well to any of the clusters and thus might represent an outlier. Outliers can also represent potentially fraudulent transactions. EM Clustering is therefore useful also for fraud detection. Finally, you can use EM Clustering for advanced data profiling to find the rows with suspicious combinations of column values.


Hierarchical clustering could be very useful because it is easy to see the optimal number of clusters in a dendrogram and because the dendrogram visualizes the clusters and the process of building of that clusters. However, hierarchical methods don’t scale well. Just imagine how cluttered a dendrogram would be if 10,000 cases would be shown on it. KMeans is a distancebased partitioning algorithm that divides data set in predetermined (“k”) number of clusters around the average location (“mean”). In your mind, you intuitively know how to group people or any other cases. Groups do not need to have an equal number of members. You can do grouping according to one or more attributes. The algorithm comes from geometry. Imagine record space with attributes as dimensions. Each record (case) is uniquely located in space with values of the attributes (variables). The algorithm initially creates k fictitious members and defines them at the means of the clusters. These fictitious cases are also called centroids. The values of the input variables for these centroids could be selected randomly. Some algorithms use also a bit of heuristics and use the marginal distributions of the attributes as a starting point and randomly perturb from there. The algorithm then assigns each record to nearest centroid. This way, you get the initial clusters. When the clusters are defined, the algorithm can calculate the actual centroids of clusters and get new centroids. After the new centroids are calculated, the algorithm reassigns each record to the nearest centroid. Some records jump from cluster to cluster. Now the algorithm can calculate new centroids and then new clusters. The algorithm iterates last two steps until cluster boundaries stop changing. You can stop the iterations when there is less than the minimum number of cases defined as a parameter that can jump from cluster to cluster. Here is a graphical representation of the process. You can see the cases in a twodimensional space. The dark brown cases are the fictitious centroids. The green case is the one that will jump between clusters. After the centroids were selected, the algorithm assigns each case to the nearest centroid. Now we have our three initial clusters. The algorithm can calculate the real centroids of those three clusters. This means that the centroids move. The algorithm has to recalculate the cluster membership. The green case jumps from the middle cluster to the bottom left cluster. In the next iteration, no case jumps from a cluster to a cluster. Therefore, the algorithm can stop. Kmeans clustering scales much better than hierarchical methods. However, it has drawbacks as well. First of all, what is the optimal number of clusters? You can’t know in advance. Therefore, you need to create different models with different number of clusters and than select the one that fits your data the best. The next problem is the meaning of the clusters. There are no labels for the clusters that would be known in advance. Once the model is built, you need to check the distributions of the input variables in each cluster to understand what kind of cases constitute each cluster. Only after this step you can label the clusters.


Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. There are a large number of clustering algorithms. The major methods can be classified into the following categories.  Partitioning methods. A partitioning method constructs K partitions of the data, which satisfy the following requirements: (1) each group must contain at least one object and (2) each object must belong to exactly one group. Given the initial K number of partitions to construct, the method creates initial partitions. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. There are various kinds of criteria for judging the quality of the partitions. Some most popular include the kmeans algorithm, where each cluster is represented by the mean value of the objects in the cluster, and the kmedoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.
 Hierarchical methods. A hierarchical method creates a hierarchical decomposition of the given set of data objects. These methods are agglomerative or divisive. The agglomerative (bottomup) approach starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all groups are merged into one. The divisive (topdown) approach starts with all the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster or until a termination condition holds.
 Densitybased methods. Methods based on the distance between objects can find only sphericalshaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. So other methods have been developed based on the notion of density. The general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
 Modelbased methods. Modelbased methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A modelbased technique might locate clusters by constructing a density function that reflects the spatial distribution of the data points. Unlike conventional clustering, which primarily identifies groups of like objects, this conceptual clustering goes one step further by also finding characteristic descriptions for each group, where each group represents a concept or a class.
A hierarchical clustering model training typically starts by calculating a distance matrix – a matrix with distances between data points in a multidimensional hyperspace, where each input variable defines one dimension of that hyperspace. Distance measure can be a geometrical distance or some other, more complex measure. A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are also often used in computational biology to illustrate the clustering of genes or samples. The following set of pictures shows the process of building an agglomerative hierarchical clustering dendrogram. Cluster analysis segments a heterogeneous population into a number of more homogenous subgroups or clusters. Typical usage scenarios include:  Discovering distinct groups of customers
 Identifying groups of houses in a city
 In biology, deriving animal and plant taxonomies
 Can even make predictions once the clusters are built and distribution of a target variable in the clusters is calculated.


The Association Rules algorithm is specifically designed for use in market basket analyses. This knowledge can additionally help in identifying crossselling opportunities and in arranging attractive packages of products. This is the most popular algorithm used in web sales. You can even include additional discrete input variables and predict purchases over classes of input variables. Association Rules Basics The algorithm considers each attribute/value pair (such as product/bicycle) as an item. An itemset is a combination of items in a single transaction. The algorithm scans through the dataset trying to find itemsets that tend to appear in many transactions. Then it expresses the combinations of the items as rules (such as “if customers purchase potato chips, they will purchase cola as well”). Often association models work against datasets containing nested tables, such as a customer list followed by a nested purchases table. If a nested table exists in the dataset, each nested key (such as a product in the purchases table) is considered an item. Understanding Measures Besides the itemsets and the rules, the algorithm also return some measures for the itemsets and the rules. Imagine the following transactions:  Transaction 1: Frozen pizza, cola, milk
 Transaction 2: Milk, potato chips
 Transaction 3: Cola, frozen pizza
 Transaction 4: Milk, pretzels
 Transaction 5: Cola, pretzels
The Association Rules measures include:  Support, or frequency, means the number of cases that contain the targeted item or combination of items. Therefore, support is a measure for the itemsets.
 Probability, also known as confidence, is a measure for the rules. The probability of an association rule is the support for the combination divided by the support for the condition. For example, the rule "If a customer purchases cola, then they will purchase potato chips" has a probability of 33%. The support for the combination (potato chips + cola) is 20%, occurring in one of each five transactions. However, the support for the condition (cola) is 60%, occurring in three out of each five transactions. This gives a confidence of 0.2 / 0.6 = 0.33 or 33%.
 Importance is a measure for both, itemsets and rules. When importance is calculated for an itemset, then when importance equals one, the items in the itemset are independent. If importance is greater than one, then the items are positively correlated. If importance is lower than one, then the items are negatively correlated. When importance is calculated for a rule “If {A} then {B}, then the value zero means there is no association between the items. Positive importance means that the probability for the item {B} goes up when the item {A} is in the basket, and negative importance of the rule means that the probability for the item {B} goes down when the tem {A} is in the basket.
Common Business Use Cases You use the Association Rules algorithm for market basket analyses. You can identify crossselling opportunities or arrange attractive packages. This is the most popular algorithm used in web sales. You can even include additional input variables and predict purchases over classes of input variables.


In two days, I am starting my first conference trip for this year. Therefore, it seems to me it is high time to write down my plan for the first semester of this year. Of course, I m adding my food plan for each event:)  SQL Saturday #374 Vienna. On Friday, February 27th, I am having a fullday seminar “Advanced Data Modeling Topics” in Vienna. On Saturday, I am also giving a presentation “Identity Mapping and DeDuplicating”. I am looking forward to the Käsekrainer. When I met it for the first time, I thought it was a bad joke. Our Kranjska sausage, or how Austrians say, the Krainer sausage, is probably the most controlled sausage, the sausage with the best ingredients possible. I was wondering what kind of barbarians would put cheese in it. In addition, Käsekrainer is much cheaper. It consists of pork, veal, cheese, and a fistful of unidentifiable ingredients. The Käsekrainer is probably the food that not acceptable for the highest number of religions and personal beliefs. However, over time, I started to love it:) Käsekrainer mit semmel, senf und kren with my name on it is already waiting for me in Vienna!
 Of course, I don’t want to miss the SQLBits. On Saturday, March 7th, I am giving the presentation “Analysing text with SQL Server 2014”. British cuisine might not be the most famous in the world. Nevertheless, there is a dish I never had an opportunity to taste yet. This is one of three conferences in England I am doing this semester. At least once I want to get the spotted dick.
 In the week of March 23rd, I am returning to London for the DevWeek conference. I am having a seminar “BI with Microsoft Tools: from Enterprise to a Personal Level” and five presentations there (Data Extraction and Transformation with Power Query and M; Data Mining Algorithms Part 1; Data Mining Algorithms Part 2; Introducing R and Azure ML; Visualising Geographic and Temporal Data with Power Map). If I don’t get the spotted dick during SQLBits, I should have enough time during the DevWeek.
 SQL Saturday #376 Budapest. The schedule is not public yet, but I am giving a presentation there. I would not like to miss the halászlé, the Hungarian fish soup.
 And of course, England again – SQL Saturday #372 Exeter. My presentation there is “Analysing Text with SQL Server 2014 and R”. And my last chance for the spotted dick this semester.
 SQL Saturday #369 Lisbon. I am having a seminar “Data Mining Algorithms in SQL Server, Excer, R, and Azure ML” on Thursday, May 14th, and two presentations on Saturday (again, the schedule is not public yet, so I am not revealing the titles of the sessions). Of course, visiting Portugal and not having the pastel de nata is not acceptable.
 May 18th – 20th – NTK, Portorož, Slovenia. One presentation, the title is still a secret, but not my copresenter Milica Medić. Feel free to envy me:) And yes, I am taking her for the horse steak, would not miss this at home.
 SQL Saturday #384 Varna. Again, the schedule is not public yet, but I am speaking there. Love to return to Bulgaria. In addition, this will by my first time in Varna and on the Black Sea coast. Oh, there are so many Bulgarian dishes I want to have again! Definitely баница (greasy pastry deliciousness) and шкембе чорба (tripe soup)!
 SQL Saturday #409 Rheinland. The schedule is still a secret. However, the Rheinischer Sauerbraten is in my plan. With a lot of beer.
 SQL Saturday 419 Bratislava. Finally, the Northern Slovenia, aka Slovakia, is getting its own SQL Saturday. Of course, I cannot miss it. There are not many dishes in Slovakia that would not be available in Southern Slovakia, aka Slovenia. However, the bryndzove halusky (small dumplings made of potato dough with sheep cheese and topped with scrambled bacon) dish is not well known in Ljubljana, so I am having one in Bratislava. But I am definitely refusing to eat any dish with word “kuraci” in its name. Search for the translation of this word from Slovak and from Slovenian, and you will understand why.
Which events are you visiting? Hope we meet at some.


Data mining is the most advanced part of business intelligence. With statistical and other mathematical algorithms, you can automatically discover patterns and rules in your data that are hard to notice with online analytical processing and reporting. However, you need to thoroughly understand how the data mining algorithms work in order to interpret the results correctly. In this blog I am introducing the data mining, and in the following blogs I am unveiling the black box of data mining and explaining how the most popular algorithms work. Data Mining Definition Data mining is a process of exploration and analysis, by automatic or semiautomatic means, of historical data in order to discover patterns and rules, which can be used later on new data for predictions and forecasting. With data mining, you deduce some hidden knowledge by examining, or training, the data. The unit of examination is called a case, which can be interpreted as one appearance of an entity, or a row, in a table. The knowledge is patterns and rules. In the process, you use attributes of a case, which are called variables in data mining terminology. For better understanding, you can compare data mining to OnLine Analytical Processing (OLAP), which is a modeldriven analysis where you build the model in advance. Data mining is a datadriven analysis, where you search for the model. You examine the data with data mining algorithms. There are many alternative names for data mining, such as knowledge discovery in databases (KDD) and predictive analytics. Originally, data mining was not the same as machine learning in that it gives business users insights for actionable decisions; machine learning determines which algorithm performs the best for a specific task. However, nowadays data mining and machine learning are in many cases used as synonyms. The Two Types of Data Mining Data mining techniques are divided into two main classes:  The directed, or supervised approach: You use known examples and apply gleaned information to unknown examples to predict selected target variable(s).
 The undirected, or unsupervised approach: You discover new patterns inside the dataset as a whole.
Some of the most important directed techniques include classification, estimation, and forecasting. Classification means to examine a new case and assign it to a predefined discrete class. Examples are assigning keywords to articles and assigning customers to known segments. Very similar is estimation, where you are trying to estimate a value of a variable of a new case in a continuously defined pool of values. You can, for example, estimate the number of children or the family income. Forecasting is somewhat similar to classification and estimation. The main difference is that you can’t check the forecasted value at the time of the forecast. Of course, you can evaluate it if you just wait long enough. Examples include forecasting which customers will leave in the future, which customers will order additional services, and the sales amount in a specific region at a specific time in the future. The most common undirected techniques are clustering and affinity grouping. An example of clustering is looking through a large number of initially undifferentiated customers and trying to see if they fall into natural groupings. This is a pure example of "undirected data mining" where the user has no preordained agenda and hopes that the data mining tool will reveal some meaningful structure. Affinity grouping is a special kind of clustering that identifies events or transactions that occur simultaneously. A wellknown example of affinity grouping is market basket analysis. Market basket analysis attempts to understand what items are sold together at the same time. Common Business Use Cases Some of the most common business questions that you can answer with data mining include:  What’s the credit risk of this customer?
 Are there any groups of my customers?
 What products do customers tend to buy together?
 How much of a specific product can I sell in the next time period?
 What is the potential number of customers shopping in this store?
 What are the major groups of my webclick customers?
 Is this a spam email?
However, the actual questions you might want to answer with data mining could be by far broader and depend on your imagination only. For an unconventional example, you might use data mining to try to lower the mortality rate in a hospital. Data mining is already widely used in many different applications. Some of the typical usages, along with the most commonly used algorithms for a specific task, include the following:  Crossselling: Widely used for web sales with the Association Rules and Decision Trees algorithms.
 Fraud detection: An important task for banks and credit card issuers, who want to limit the damage that fraud creates, including that experienced by customers and companies. The Clustering and Decision Trees algorithms are commonly used for fraud detection.
 Churn detection: Service providers, including telecommunications, banking, and insurance companies, perform this to detect which of their subscribers are about to leave them in an attempt to prevent it. Any of the directed methods, including the Naive Bayes, Decision Trees, or Neural Network algorithm, is suitable for this task.
 Customer Relationship Management (CRM) applications: Based on knowledge about customers, which you can extract with segmentation using, for example, the Clustering or Decision Trees algorithm.
 Website optimization: To do this, you should know how your website is used. Microsoft developed a special algorithm, the Sequence Clustering algorithm, for this task.
 Forecasting: Nearly any business would like to have some forecasting, in order to prepare better plans and budgets. The Time Series algorithm is specially designed for this task.
A Quick Introduction to the Most Popular Algorithms In order to raise the expectations for the upcoming blogs, I am adding a brief introduction to the most popular data mining algorithms in a condensed way, in a table. Algorithm  Usage  Association Rules  The algorithm used for market basket analysis, this defines an itemset as a combination of items in a single transaction. It then scans the data and counts the number of times the itemsets appear together in transactions. Market basket analysis is useful to detect crossselling opportunities.  Clustering  This groups cases from a dataset into clusters containing similar characteristics. You can use the Clustering method to group your customers for your CRM application to find distinguishable groups of your customers. In addition, you can use it for finding anomalies in your data. If a case does not fit well to any cluster, it is kind of an exception. For example, this might be a fraudulent transaction.  Naïve Bayes  This calculates probabilities for each possible state of the input attribute for every single state of predictable variable. Those probabilities predict the target attribute based on the known input attributes of new cases. The Naïve Bayes algorithm is quite simple; it builds the models quickly. Therefore, it is very suitable as a starting point in your predictive analytics project.  Decision Trees  The most popular DM algorithm, it predicts discrete and continuous variables. It uses the discrete input variables to split the tree into nodes in such a way that each node is more pure in terms of target variable, i.e. each split leads to nodes where a single state of a target variable is represented better than other states.  Regression Trees  For continuous predictable variables, you get a piecemeal multiple linear regression formula with a separate formula in each node of a tree. Discrete input variables are used to split the tree into nodes. A tree that predicts continuous variables is a Regression Tree. Use Regression Trees for estimation of a continuous variable; for example, a bank might use this technique to estimate the family income for a loan applicant.  Linear Regression  Predicts continuous variables, using a single multiple linear regression formula. The input variables must be continuous as well. Linear Regression is a simple case of a Regression Tree, a tree with no splits. Use it for the same purpose as Regression Trees.  Neural Network  This algorithm is from artificial intelligence, but you can use it for predictions as well. Neural networks search for nonlinear functional dependencies by performing nonlinear transformations on the data in layers, from the input layer through hidden layers to the output layer. Because of the multiple nonlinear transformations, neural networks are harder to interpret compared to Decision Trees.  Logistic Regression  As Linear Regression is a simple Regression Tree, a Logistic Regression is a Neural Network without any hidden layers.  Support Vector Machines  Support Vector Machines are supervised learning models with associated learning algorithms that analyse data and recognize patterns, used for classification. A support vector machine constructs a hyper plane or set of hyper planes in a highdimensional space where the input variables define the dimensions. The hyper planes split the data points into discrete groups of the target variable. Support Vector Machines are powerful for some specific classifications, like text and images classifications and handwritten characters recognition.  Sequence Clustering  This searches for clusters based on a model, and not on similarity of cases as Clustering does. The models are defined on sequences of events by using Markov Chains. Typical usage of the Sequence Clustering would be an analysis of your company’s Web site usage, although you can use this algorithm on any sequential data.  Time Series  You can use this algorithm to forecast continuous variables. Time Series many times denotes two different internal algorithms. For shortterm forecasting, AutoRegression Trees (ART) algorithm is used. For longterm prediction, AutoRegressive Integrated Moving Average (ARIMA) algorithm is used.  Conclusion This brief introduction to data mining should give you the idea what you could use it for and an overview which algorithms are appropriate for the business problem you are trying to solve. I guess you also noticed I am not talking about any specific technology here. These most popular data mining algorithms are available in many different products. For example, you can find them in SQL Server Analysis Services, Excel with Data Mining Addins, R, Azure ML, and more. Please learn how to use them with your specific product using the documentation of the product, by reading books that deal with your product, or by visiting a course about the product. I hope you got excited enough to read the upcoming blogs and visit some of my presentations on various conferences.




