A hands-on guide to making valuable decisions from data using advanced data mining methods and techniques
This second installment in the Making Sense of Data series continues to explore a diverse range of commonly used approaches to making and communicating decisions from data. Delving into more technical topics, this book equips readers with advanced data mining methods that are needed to successfully translate raw data into smart decisions across various fields of research including business, engineering, finance, and the social sciences.
Following a comprehensive introduction that details how to define a problem, perform an analysis, and deploy the results, Making Sense of Data II addresses the following key techniques for advanced data analysis:
Data Visualization reviews principles and methods for understanding and communicating data through the use of visualization including single variables, the relationship between two or more variables, groupings in data, and dynamic approaches to interacting with data through graphical user interfaces.
Clustering outlines common approaches to clustering data sets and provides detailed explanations of methods for determining the distance between observations and procedures for clustering observations. Agglomerative hierarchical clustering, partition-based clustering, and fuzzy clustering are also discussed.
Predictive Analytics presents a discussion on how to build and assess models, along with a series of predictive analytics methods that can be used in a variety of situations, including principal component analysis, multiple linear regression, discriminant analysis, logistic regression, and Naïve Bayes.
Applications demonstrates the current uses of data mining across a wide range of industries and features case studies that illustrate the related applications in real-world scenarios.
Each method is discussed within the context of a data mining process including defining the problem and deploying the results, and readers are provided with guidance on when and how each method should be used. The related Web site for the series (www.makingsenseofdata.com) provides a hands-on data analysis and data mining experience. Readers wishing to gain more practical experience will benefit from the tutorial section of the book in conjunction with the Traceis™ software, which is freely available online.
With its comprehensive collection of advanced data mining methods coupled with tutorials for applications in a range of fields, Making Sense of Data II is an indispensable book for courses on data analysis and data mining at the upper-undergraduate and graduate levels. It also serves as a valuable reference for researchers and professionals who are interested in learning how to accomplish effective decision making from data and understanding if data analysis and data mining methods could help their organization.
"synopsis" may belong to another edition of this title.
Glenn J. Myatt, PhD, is cofounder of Leadscope, Inc. and a Partner of Myatt & Johnson, Inc., a consulting company that focuses on business intelligence application development delivered through the Internet. Dr. Myatt is the author of Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining, also published by Wiley. Wayne P. Johnson, MSc, is cofounder of Leadscope, Inc. and a Partner of Myatt & Johnson, Inc. Mr. Johnson has over two decades of experience in the design and development of large software systems, and his current professional interests include human–computer interaction, information visualization, and methodologies for contextual inquiry.
1.1 OVERVIEW
A growing number of fields, in particular business and science, are turning to data mining to make sense of large volumes of data. Financial institutions, manufacturing companies, and government agencies are just a few of the types of organizations using data mining. Data mining is also being used to address a wide range of problems, such as managing financial portfolios, optimizing marketing campaigns, and identifying insurance fraud. The adoption of data mining techniques is driven by a combination of competitive pressure, the availability of large amounts of data, and ever-increasing computing power. Organizations that apply data mining to critical operations can achieve significant returns. The use of a defined process helps ensure that the results from data mining projects translate into actionable and profitable business decisions. This chapter summarizes the four steps necessary to complete a data mining project: (1) definition, (2) preparation, (3) analysis, and (4) deployment. The methods discussed in this book are reviewed within this context. The chapter concludes with an outline of the book's content and suggestions for further reading.
1.2 DEFINITION
The first step in any data mining process is to define and plan the project. The following summarizes issues to consider when defining a project:
Objectives: Articulating the overriding business or scientific objective of the data mining project is an important first step. Based on this objective, it is also important to specify the success criteria to be measured upon delivery. The project should be divided into a series of goals that can be achieved using available data or data acquired from other sources. These objectives and goals should be understood by everyone working on the project or having an interest in the project's results.
Deliverables: Specifying exactly what is going to be delivered sets the correct expectation for the project. Examples of deliverables include a report outlining the results of the analysis or a predictive model (a mathematical model that estimates critical data) integrated within an operational system. Deliverables also identify who will use the results of the analysis and how they will be delivered. Consider criteria such as the accuracy of the predictive model, the time required to compute, or whether the predictions must be explained.
Roles and Responsibilities: Most data mining projects involve a cross-disciplinary team that includes (1) experts in data analysis and data mining, (2) experts in the subject matter, (3) information technology professionals, and (4) representatives from the community who will make use of the analysis. Including interested parties will help overcome any potential difficulties associated with user acceptance or deployment.
Project Plan: An assessment should be made of the current situation, including the source and quality of the data, any other assumptions relating to the data (such as licensing restrictions or a need to protect its confidentiality), any constraints connected to the project (such as software, hardware, or budget limitations), and any other issues that may be important to the final deliverables. A timetable should be established, covering the different stages of the project along with the deliverables at each stage. The plan should allot time for cross-team education and progress reviews. Contingencies should be built into the plan in case unexpected events arise. The timetable can be used to generate a budget for the project. This budget, in conjunction with any anticipated financial benefits, can form the basis for a cost-benefit analysis.
1.3 PREPARATION
1.3.1 Overview
Preparing the data for a data mining exercise can be one of the most time-consuming activities; however, it is critical to the project's success. The quality of the data accumulated and prepared will be the single most influential factor in determining the quality of the analysis results, and understanding the contents of the data set in detail will be invaluable when it comes to mining the data. The following section outlines issues to consider when accessing and preparing a data set. The formats of different sources are reviewed, including data tables and nontabular information (such as text documents). Methods to categorize and describe the variables are outlined, including a discussion of the scales on which the data is measured. A variety of descriptive statistics are discussed for use in understanding the data, and approaches to handling inconsistent or problematic data values are reviewed. As part of the preparation, methods to reduce the number of variables in the data set should be considered, along with transformations that match the data more closely to the problem or to the requirements of the analysis methods; these methods are also reviewed. Finally, only a sample of the data set may be required for the analysis, and techniques for segmenting the data are outlined.
1.3.2 Accessing Tabular Data
Tabular information is often used directly in the data mining project. This data can be taken directly from an operational database system, such as an ERP (enterprise resource planning) system, a CRM (customer relationship management) system, an SCM (supply chain management) system, or databases containing various transactions. Other common sources of data include surveys, results from experiments, or data collected directly from devices. Where internal data is not sufficient for the objective of the data mining exercise, data from other sources may need to be acquired and carefully integrated with existing data. In all of these situations, the data should be formatted as a table of observations with information on the different variables of interest. If it is not, it should be processed into a tabular format.
Preparing the data may include joining separate relational tables, or concatenating data sources; for example, combining tables that cover different periods in time. In addition, each row in the table should relate to the entity of the project, such as a customer. Where multiple rows relate to this entity of interest, generating a summary table may help in the data mining exercise. Generating this table may involve calculating summarized data from the original data, using computations such as sum, mode (most common value), average, or counts (number of observations). For example, a table may comprise individual customer transactions, yet the focus of the data mining exercise is the customer, as opposed to the individual transactions. Each row in the table should refer to a customer, and additional columns should be generated by summarizing the rows from the original table, such as total sales per product. This summary table will now replace the original table in the data mining exercise.
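As a minimal sketch of this roll-up (not taken from the book; the table and the column names customer_id, product, and sale_amount are invented for illustration), pandas can group transaction-level rows into one summary row per customer:

```python
# Illustrative only: transaction-level data summarized to one row per customer.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [101, 101, 102, 102, 102, 103],
    "product":     ["A", "B", "A", "A", "C", "B"],
    "sale_amount": [25.0, 40.0, 15.0, 30.0, 22.5, 60.0],
})

summary = transactions.groupby("customer_id").agg(
    total_sales=("sale_amount", "sum"),            # sum per customer
    num_transactions=("sale_amount", "count"),     # number of observations
    most_common_product=("product", lambda s: s.mode().iloc[0]),  # mode
)
print(summary)
```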
Many organizations have invested heavily in creating a high-quality, consolidated repository of information necessary for supporting decision-making. These repositories make use of data from operational systems or other sources. Data warehouses are an example of an integrated and central corporate-wide repository of decision-support information that is regularly updated. Data marts are generally smaller in scope than data warehouses and usually contain information related to a single business unit. An important accompanying component is a metadata repository, which contains information about the data. Examples of metadata include where the data came from and what units of measurements were used.
1.3.3 Accessing Unstructured Data
In many situations, the data to be used in the data mining project may not be represented as a table. For example, the data to analyze may be a collection of documents or a sequence of page clicks on a particular web site. Converting this type of data into a tabular format will be necessary in order to utilize many of the data mining approaches described later in this book. Chapter 5 describes the use of nontabular data in more detail.
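One common way to bring a document collection into tabular form is a term-frequency matrix in which each row is a document and each column is a term. The sketch below uses scikit-learn's CountVectorizer and is an illustrative assumption rather than the specific approach taken in Chapter 5.

```python
# Illustrative only: documents converted into a document-by-term count table.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "data mining turns raw data into decisions",
    "clustering groups similar observations",
    "predictive models estimate a response variable",
]

vectorizer = CountVectorizer()
term_counts = vectorizer.fit_transform(documents)   # sparse matrix: rows = documents

print(vectorizer.get_feature_names_out())           # the columns (terms)
print(term_counts.toarray())                         # the tabular counts
```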
1.3.4 Understanding the Variables and Observations
Once the project has been defined and the data acquired, the next step is usually to understand the content in more detail. Consulting with experts who have knowledge about how the data was collected, as well as what the data means, is invaluable. Certain assumptions may have been built into the data; for example, specific values may have particular meanings, or certain variables may have been derived from others, in which case it is important to understand how they were derived. Having a thorough understanding of the subject matter pertaining to the data set helps to explain why specific relationships are present and what these relationships mean.
(As an aside, throughout this book variables are presented in italics.)
An important initial categorization of the variables is the scale on which they are measured. Nominal and ordinal scales refer to variables that are categorical, that is, they have a limited number of possible values. The difference is that ordinal variables are ordered. The variable color which could take values black, white, red, and so on, would be an example of a nominal variable. The variable sales, whose values are low, medium, and high, would be an example of an ordinal scale, since there is an order to the values. Interval and ratio scales refer to variables that can take any continuous numeric value; however, ratio scales have a natural zero value, allowing for a calculation of a ratio. Temperature measured in Fahrenheit or Celsius is an example of an interval scale, as it can take any continuous value within a range. Since a zero value does not represent the absence of temperature, it is classified as an interval scale. However, temperatures measured in degrees Kelvin would be an example of a ratio scale, since zero is the lowest temperature. In addition, a bank balance would be an example of a ratio scale, since zero means no value.
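In practice these scales can be made explicit when the data is loaded. The short sketch below (a pandas-based illustration, not from the book) marks the color variable as nominal and the sales variable as ordinal, mirroring the examples above.

```python
# Illustrative only: encoding nominal and ordinal variables explicitly.
import pandas as pd

df = pd.DataFrame({
    "color": ["black", "white", "red", "black"],     # nominal
    "sales": ["low", "high", "medium", "low"],       # ordinal
    "temperature_f": [32.0, 68.5, 75.2, 60.0],       # interval
    "balance": [0.0, 150.25, 1200.0, 75.5],          # ratio
})

df["color"] = pd.Categorical(df["color"])            # categories, no order
df["sales"] = pd.Categorical(df["sales"],
                             categories=["low", "medium", "high"],
                             ordered=True)           # ordered categories

print(df.dtypes)
print(df["sales"].min())   # ordering makes comparisons meaningful: "low"
```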
In addition to describing the scale on which the individual variables were measured, it is also important to understand the frequency distribution of the variable (in the case of interval or ratio scaled variables) or the various categories that a nominal or ordinal scaled variable may take. Variables are usually examined to understand the following:
Central Tendency: A number of measures of the central tendency of a variable can be calculated, including the mean or average value, the median or middle value based on an ordering of all values, and the mode or most common value. Since the mean is sensitive to outliers, the trimmed mean, which is calculated after excluding extreme values, may also be considered. In addition, median values are often used to best represent a central value in situations involving outliers or skewed data.
Variation: Several statistics describe the spread of a variable's distribution. The minimum and maximum values describe the entire range of the variable. Calculating the quartiles is also helpful; these are the points below which 25% (Q1), 50% (Q2), and 75% (Q3) of the ordered values fall. The variance and standard deviation are usually calculated to quantify the spread of the distribution. Assuming a normal distribution, approximately 68% of all observations fall within one standard deviation of the mean, and approximately 95% fall within two standard deviations of the mean.
Shape: There are a number of metrics that define the shape and symmetry of the frequency distribution, including skewness, a measure of whether a variable is skewed to the left or right, and kurtosis, a measure of whether a variable has a flat or pointed central peak.
Graphs help to visualize the central tendency, the distribution, and the shape of the frequency distribution, as well as to identify any outliers. Graphs that are useful in summarizing variables include frequency histograms, bar charts, frequency polygons, and box plots. These visualizations are covered in detail in the section on univariate visualizations in Chapter 2.
Figure 1.1 illustrates a series of statistics calculated for a particular variable (percentage body fat). In this example, the variable contains 251 observations; the most commonly occurring value is 20.4 (the mode), the median is 19.2, and the average or mean value is 19.1. The variable ranges from 0 to 47.5, with 25% of the ordered values falling at or below 12.4 (Q1), 50% at or below 19.2 (the median), and 75% at or below 25.3 (Q3). The variance is 69.6 and the standard deviation is 8.34; that is, approximately 68% of observations fall within 8.34 of the mean (10.76 to 27.44), and approximately 95% fall within 16.68 of the mean (2.42 to 35.78).
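The statistics shown in Figure 1.1 can be reproduced for any numeric variable. The sketch below (the body fat values themselves are not reproduced here, so a small invented series stands in) computes the same summaries with pandas and SciPy.

```python
# Illustrative only: descriptive statistics for a numeric variable.
import pandas as pd
from scipy.stats import trim_mean

values = pd.Series([12.3, 19.2, 20.4, 25.3, 8.1, 30.5, 20.4, 17.8, 22.9, 14.6])

mean = values.mean()
median = values.median()
mode = values.mode().iloc[0]
trimmed = trim_mean(values, 0.1)            # mean after trimming 10% from each tail
q1, q2, q3 = values.quantile([0.25, 0.50, 0.75])
variance = values.var()                     # sample variance
std_dev = values.std()                      # sample standard deviation

# Assuming a roughly normal distribution, ~68% of observations fall within
# one standard deviation of the mean and ~95% within two.
one_sd_range = (mean - std_dev, mean + std_dev)
two_sd_range = (mean - 2 * std_dev, mean + 2 * std_dev)

print(mean, median, mode, trimmed, (q1, q2, q3), variance, std_dev)
print(one_sd_range, two_sd_range)
```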
At this point it is worthwhile taking a short digression to explain the terms used for the different roles variables play in building a prediction model. The response variable, also referred to as the dependent variable, the outcome, or the y-variable, is the variable any model will attempt to predict. The independent variables, also referred to as descriptors, predictors, or x-variables, are the fields that will be used in building the model. A label, also referred to as a record identifier or primary key, is a unique value corresponding to each individual row in the table. Other variables may be present in the table that will not be used in any model but can still be used in explanations.
During this stage it is also helpful to begin exploring the data to better understand its features. Summary tables, matrices of different graphs, along with interactive techniques such as brushing, are critical data exploration tools. These tools are described in Chapter 2 on data visualization. Grouping the data is also helpful to understand the general categories of observations present in the set. The visualization of groups is presented in Chapter 2, and an in-depth discussion of clustering and grouping methods is provided in Chapter 3.
1.3.5 Data Cleaning
Having extracted a table containing observations (represented as rows) and variables (represented as columns), the next step is to clean the data table, which often takes a considerable amount of time. Some common cleaning operations include identifying (1) errors, (2) entries with no data, and (3) entries with missing data. Errors and missing values may be attributable to the original collection, the transmission of the information, or the result of the preparation process.
Values are often missing from the data table, and many data mining approaches cannot proceed until this issue is resolved. There are five options: (1) remove the entire observation from the data table; (2) remove the variable (containing the missing values) from the data table; (3) replace the missing value manually; (4) replace the value with a computed value, for example, the variable's mean or mode; and (5) replace the entry with a predicted value based on a model generated from the other fields in the data table. Different approaches for generating predictions are described in Chapter 4 on Predictive Analytics. The choice depends on the data set and the problem being addressed. For example, if most of the missing values are found in a single variable, then removing this variable may be a better option than removing the individual observations.
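The five options might look like the following in practice. This is a minimal sketch using pandas and scikit-learn; the small table and its column names are invented for illustration.

```python
# Illustrative only: the five options for handling missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "age": [34.0, 41.0, np.nan, 29.0],
})

# (1) Remove observations (rows) containing missing values.
drop_rows = df.dropna(axis=0)

# (2) Remove the variable (column) containing the missing values.
drop_column = df.drop(columns=["income"])

# (3) Replace a missing value manually.
manual = df.copy()
manual.loc[1, "income"] = 55000.0

# (4) Replace missing values with a computed value such as the mean (or mode).
mean_imputed = df.copy()
mean_imputed[["income", "age"]] = SimpleImputer(strategy="mean").fit_transform(
    df[["income", "age"]])

# (5) Replace missing values with predictions from a model built on the other
#     fields (see Chapter 4); sklearn.impute.KNNImputer is one such option.
print(drop_rows, drop_column, manual, mean_imputed, sep="\n\n")
```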
A situation similar to missing values occurs when a variable that is intended to be treated as numeric contains text values, or specific numbers that carry special meanings. Again, the five options outlined above may be used; however, the text or the specific number may itself suggest a numeric value to replace it with. One example is a numeric variable in which values below a detection threshold are recorded as a text string such as "<10**-9". A solution in this case might be to replace the string with the number 0.000000001.
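A minimal sketch of this kind of replacement, assuming the values arrive as a pandas Series of strings:

```python
# Illustrative only: replacing a below-threshold text string with a number.
import pandas as pd

concentration = pd.Series(["0.002", "<10**-9", "0.15", "<10**-9", "0.03"])

cleaned = concentration.replace("<10**-9", "0.000000001").astype(float)
print(cleaned)
```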
Another problem occurs when values within the data table are incorrect, perhaps as a result of an equipment malfunction or a data entry error. There are a number of ways to help identify errors in the data. Outliers may be errors, and they can be found using a variety of methods, for example, calculating a z-score for each value, which represents the number of standard deviations the value lies from the mean; values with a z-score greater than plus or minus three may be considered outliers. In addition, plotting the data using a box plot or a frequency histogram can often identify values that deviate significantly from the rest of the data. For variables that are particularly noisy, that is, they contain some degree of error, it may be necessary to replace the variable with a binned version that more faithfully represents the overall variation in the data; this process is called data smoothing. Other methods, such as data visualization, clustering, and regression models (described in Chapters 2-4), can also be useful for identifying anomalous observations that do not look similar to other observations or do not fit a trend observed for the majority of the variable's observations.
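A minimal sketch of the z-score check described above, using invented data with one deliberately extreme value:

```python
# Illustrative only: flagging values more than three standard deviations
# from the mean as potential outliers.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(loc=19.0, scale=2.0, size=200), 95.0))

z_scores = (values - values.mean()) / values.std()
outliers = values[z_scores.abs() > 3]
print(outliers)   # the extreme 95.0 entry is typically flagged
```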
(Continues...)
Excerpted from Making Sense of Data II by Glenn J. Myatt and Wayne P. Johnson. Copyright © 2009 by John Wiley & Sons, Inc. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
"About this title" may belong to another edition of this title.