Data mining, also called knowledge discovery in databases, is the process in computer science of discovering interesting and useful patterns and relationships in large volumes of data. The field combines tools from statistics and artificial intelligence (such as neural networks and machine learning) with database management to analyze large digital collections, known as data sets. Data mining is widely used in business (insurance, banking, retail), scientific research (astronomy, medicine), and government security (detection of criminals and terrorists).
The proliferation of numerous large, and sometimes connected, government and private databases has led to regulations to ensure that individual records are accurate and secure from unauthorized viewing or tampering. Most types of data mining are targeted toward ascertaining general knowledge about a group rather than knowledge about specific individuals—a supermarket is less concerned about selling one more item to one person than about selling many items to many people—though pattern analysis also may be used to discern anomalous individual behaviour such as fraud or other criminal activity.
As computer storage capacities increased during the 1980s, many companies began to store more transactional data. The resulting record collections, often called data warehouses, were too large to be analyzed with traditional statistical approaches. Several computer science conferences and workshops were held to consider how recent advances in the field of artificial intelligence (AI)—such as discoveries from expert systems, genetic algorithms, machine learning, and neural networks—could be adapted for knowledge discovery (the preferred term in the computer science community). The process led in 1995 to the First International Conference on Knowledge Discovery and Data Mining, held in Montreal, and the launch in 1997 of the journal Data Mining and Knowledge Discovery. This was also the period when many early data-mining companies were formed and products were introduced.
One of the earliest successful applications of data mining, perhaps second only to marketing research, was credit-card-fraud detection. Studying a consumer’s purchasing behaviour usually reveals a typical pattern; purchases made outside this pattern can then be flagged for later investigation or used to deny a transaction. However, the wide variety of normal behaviours makes this challenging; no single distinction between normal and fraudulent behaviour works for everyone or all the time. Every individual is likely to make some purchases that differ from the types they have made before, so relying on what is normal for a single individual is likely to give too many false alarms. One approach to improving reliability is first to group individuals that have similar purchasing patterns, since group models are less sensitive to minor anomalies. For example, a “frequent business travelers” group will likely have a pattern that includes unprecedented purchases in diverse locations, but members of this group might be flagged for other transactions, such as catalog purchases, that do not fit that group’s profile.
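As a rough illustration of flagging purchases that fall outside a group's pattern, the sketch below compares a new purchase amount with the pooled history of a group of similar customers. The function name `flag_unusual`, the three-standard-deviation threshold, and the sample amounts are all assumptions made for illustration; real fraud systems use many more attributes than the amount alone.

```python
from statistics import mean, stdev

def flag_unusual(history, new_amount, threshold=3.0):
    """Return True if new_amount lies more than `threshold` standard
    deviations from the mean of the group's past purchase amounts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past purchases were identical: anything different stands out.
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold

# Hypothetical purchase amounts pooled across one group of similar customers.
group_history = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1, 58.6]

print(flag_unusual(group_history, 50.0))   # close to the group's usual range
print(flag_unusual(group_history, 400.0))  # far outside the group's usual range
```

Pooling the history over a group rather than one person is what makes the check less sensitive to a single individual's minor anomalies.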
The complete data-mining process involves multiple steps, from understanding the goals of a project and what data are available to implementing process changes based on the final analysis. The three key computational steps are the model-learning process, model evaluation, and use of the model. This division is clearest with classification of data. Model learning occurs when one algorithm is applied to data about which the group (or class) attribute is known in order to produce a classifier, or an algorithm learned from the data. The classifier is then tested with an independent evaluation set that contains data with known attributes. The extent to which the model’s classifications agree with the known class for the target attribute can then be used to determine the expected accuracy of the model. If the model is sufficiently accurate, it can be used to classify data for which the target attribute is unknown.
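The three computational steps can be sketched with a deliberately simple classifier. The example below assumes a nearest-centroid model, made-up two-attribute records, and a hypothetical 0.9 accuracy cut-off; none of these come from the text, and any learning algorithm could stand in for `learn`.

```python
def learn(training):
    """Model learning: compute one centroid per known class."""
    sums, counts = {}, {}
    for features, label in training:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(model, features):
    """Use of the model: assign the class whose centroid is nearest."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist2(model[label]))

def accuracy(model, evaluation):
    """Model evaluation: share of held-out records classified correctly."""
    hits = sum(classify(model, f) == label for f, label in evaluation)
    return hits / len(evaluation)

# Hypothetical records: (attributes, known class).
training = [((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((1.1, 0.9), "A"),
            ((4.0, 4.2), "B"), ((3.9, 3.8), "B"), ((4.2, 4.1), "B")]
# Independent evaluation set, not used during learning.
evaluation = [((0.9, 1.1), "A"), ((4.1, 4.0), "B"),
              ((1.2, 1.0), "A"), ((3.8, 4.1), "B")]

model = learn(training)
score = accuracy(model, evaluation)
print("accuracy:", score)
if score >= 0.9:  # assumed threshold for "sufficiently accurate"
    print("new record classified as:", classify(model, (4.0, 3.9)))
```

Keeping the evaluation records out of `learn` is the point of the exercise: accuracy measured on the training data itself would overstate how the model behaves on new data.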
There are many types of data mining, typically divided by the kind of information (attributes) known and the type of knowledge sought from the data-mining model.
Predictive modeling is used when the goal is to estimate the value of a particular target attribute and there exist sample training data for which values of that attribute are known. An example is classification, which takes a set of data already divided into predefined groups and searches for patterns in the data that differentiate those groups. These discovered patterns then can be used to classify other data where the right group designation for the target attribute is unknown (though other attributes may be known). For instance, a manufacturer could develop a predictive model that distinguishes parts that fail under extreme heat, extreme cold, or other conditions based on their manufacturing environment, and this model may then be used to determine appropriate applications for each part. Another technique employed in predictive modeling is regression analysis, which can be used when the target attribute is a numeric value and the goal is to predict that value for new data.
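For the regression case, a minimal sketch is an ordinary least-squares line fitted to one numeric attribute. The data below are synthetic and exactly linear so the result is easy to check by hand; the names `fit_line`, `xs`, and `ys` are illustrative only.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic training data: the target attribute (ys) is known for each x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)

# Predict the numeric target for a new record whose target is unknown.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```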
Descriptive modeling, or clustering, also divides data into groups. With clustering, however, the proper groups are not known in advance; the patterns discovered by analyzing the data are used to determine the groups. For example, an advertiser could analyze a general population in order to group potential customers into different clusters and then develop separate advertising campaigns targeted to each group. Fraud detection also makes use of clustering to identify groups of individuals with similar purchasing patterns.
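One common way to discover groups that are not known in advance is k-means clustering, sketched below. The choice of k-means, the (age, monthly spend) attributes, and the customer values are assumptions for illustration; the centroids are seeded from the first k records to keep the run deterministic, which a production implementation would not do.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))

def kmeans(points, k, iterations=20):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [list(p) for p in points[:k]]  # simplistic deterministic seed
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical customers described by (age, monthly spend).
customers = [(22, 30), (25, 35), (24, 28), (58, 120), (61, 130), (60, 125)]

centroids, labels = kmeans(customers, k=2)
print(labels)
```

No class labels are supplied anywhere; the two groups emerge only from how close the records are to one another.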
Data mining is the process of understanding data through cleaning raw data, finding patterns, creating models, and testing those models. It includes statistics, machine learning, and database systems. Data mining often includes multiple data projects, so it’s easy to confuse it with analytics, data governance, and other data processes. This guide will define data mining, share its benefits and challenges, and review how data mining works.
Data mining has a long history. It emerged with computing in the 1960s through the 1980s. Historically, data mining was an intensive manual coding process, and today it still requires coding ability and knowledgeable specialists to clean, process, and interpret the results. Data specialists need statistical knowledge and some programming language knowledge to apply data mining techniques accurately. For instance, many companies have used the R language to answer their data questions. However, some of the manual processes can now be automated with repeatable flows, machine learning (ML), and artificial intelligence (AI) systems.
Data mining isn’t precisely data analytics
As discussed, data mining may be confused with other data projects. The data mining process includes projects such as data cleaning and exploratory analysis, but it is not just those practices. Data mining specialists clean and prepare the data, create models, test those models against hypotheses, and publish those models for analytics or business intelligence projects. In other words, analytics and data cleaning are parts of data mining, but they are only parts of the whole.
Benefits of data mining
Data mining is most effective when deployed strategically to serve a business goal, answer business or research questions, or be a part of a solution to a problem. Data mining helps make accurate predictions, recognize patterns and outliers, and inform forecasting. Further, data mining helps organizations identify gaps and errors in processes, like bottlenecks in supply chains or improper data entry.
How data mining works
The first step in data mining is almost always data collection. Today’s organizations can collect records, logs, website visitors’ data, application data, sales data, and more every day. Collecting and mapping data is a good first step in understanding the limits of what can be done with and asked of the data in question. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is an excellent guideline for starting the data mining process. The standard was developed in the late 1990s and is still a popular paradigm for organizations that are just starting.
The 6 CRISP-DM phases
CRISP-DM is a six-phase workflow. It was designed to be flexible; data teams are allowed and encouraged to move back to a previous stage if needed. The model also provides opportunities for software platforms that help perform or augment some of these tasks.
1. Business understanding
Comprehensive data mining projects start by identifying project objectives and scope. The business stakeholders will ask a question or state a problem that data mining can answer or solve.
2. Data understanding
Once the business problem is understood, it is time to collect the data relevant to the question and get a feel for the data set. This data often comes from multiple sources, including structured data and unstructured data. This stage may include some exploratory analysis to uncover some preliminary patterns. At the end of this phase, the data mining team has selected the subset of data for analysis and modeling.
3. Data preparation
This phase is where the more intensive work begins. Data preparation produces the final data set, which includes all the relevant data needed to answer the business question. Stakeholders will identify the dimensions and variables to explore, and the team will prepare the final data set for model creation.
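A small part of data preparation can be sketched as converting raw text fields to numbers and setting aside records that cannot be used. The field names, the sample records, and the decision to drop incomplete rows rather than fill them in are all assumptions made for this example.

```python
# Hypothetical raw records as they might arrive from a CSV export.
raw = [
    {"age": "34", "region": "west", "spend": "120.50"},
    {"age": "",   "region": "east", "spend": "80.00"},   # missing age
    {"age": "41", "region": "east", "spend": "n/a"},     # unusable spend
    {"age": "29", "region": "west", "spend": "64.25"},
]

def prepare(records, numeric_fields):
    """Return copies of the records whose numeric fields all parse as numbers."""
    prepared = []
    for record in records:
        row = dict(record)
        try:
            for field in numeric_fields:
                row[field] = float(record[field])
        except (KeyError, ValueError):
            continue  # skip records with missing or malformed values
        prepared.append(row)
    return prepared

final_data = prepare(raw, ["age", "spend"])
print(len(final_data), "of", len(raw), "records kept")
```

Whether to drop, correct, or impute bad records depends on the business question, so a step like this is usually revisited when the modeling technique changes.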
4. Modeling
In this phase, you’ll select the appropriate modeling techniques for the given data. These techniques can include clustering, predictive models, classification, estimation, or a combination. Front Health used statistical modeling and predictive analytics to decide whether to expand healthcare programs to other populations. You may have to return to the data preparation phase if you select a modeling technique that requires selecting other variables or preparing some different sources.
5. Evaluation
After creating the models, you need to test them and measure how well they answer the question identified in the first phase. The model may answer aspects of the question that were not anticipated, and you may need to revise the model or the question. This phase lets you review the progress so far and confirm it is on track to meet the business goals. If it is not, you may need to move back to earlier steps before the project is ready for the deployment phase.
6. Deployment
Finally, once the model is accurate and reliable, it is time to deploy it in the real world. The deployment can take place within the organization, be shared with customers, or be used to generate a report for stakeholders to prove its reliability. The work doesn’t end when the last line of code is complete; deployment requires careful thought, a roll-out plan, and a way to make sure the right people are appropriately informed. The data mining team is responsible for the audience’s understanding of the project.
Types of data mining techniques
Data mining includes multiple techniques for answering the business question or helping solve a problem. This section introduces two data mining techniques and is not comprehensive.
The most common technique is classification. To do this, identify a target variable and then divide that variable into categories at an appropriate level of detail. For example, the variable ‘occupation level’ might be split into ‘entry-level’, ‘associate’, and ‘senior’. With other fields such as age and education level, you can train your data model to predict what occupation level a person is more likely to have. You may add an entry for a recent 22-year-old graduate, and the data model could automatically classify that person in an ‘entry-level’ position. Insurance and financial institutions such as PEMCO Insurance have used classification to train their algorithms to flag fraud and to monitor claims.
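The occupation-level example can be sketched with a k-nearest-neighbours classifier. The training records, the use of (age, years of education) as the only fields, and the choice of k = 3 are assumptions for illustration, not a description of how any named company built its models.

```python
from collections import Counter

# Hypothetical training records: ((age, years of education), occupation level).
training = [
    ((22, 16), "entry-level"), ((24, 16), "entry-level"), ((23, 14), "entry-level"),
    ((30, 16), "associate"),   ((33, 18), "associate"),   ((35, 16), "associate"),
    ((45, 18), "senior"),      ((50, 16), "senior"),      ((48, 20), "senior"),
]

def predict(training, person, k=3):
    """Classify `person` by majority vote among the k most similar records."""
    by_distance = sorted(
        training,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], person)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# A recent 22-year-old graduate with 16 years of education.
print(predict(training, (22, 16)))
```

In practice the fields would be scaled before measuring distance, since age and years of education are on different ranges; that step is omitted here to keep the sketch short.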
Clustering is another common technique, grouping records, observations, or cases by similarity. Unlike classification, clustering has no target variable. Instead, clustering separates the data set into subgroups. This method can include grouping records of users by geographic area or age group. Typically, clustering the data into subgroups is preparation for analysis: the subgroups become inputs for a different technique.
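The idea of subgroups feeding a later technique can be sketched by grouping user records into age bands and then summarizing each band. The fixed bands, field names, and values are assumptions for illustration; fixed bands are a simplification, whereas a similarity-based algorithm would derive the groups from the data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical user records.
users = [
    {"age": 19, "spend": 20.0},
    {"age": 24, "spend": 30.0},
    {"age": 37, "spend": 80.0},
    {"age": 42, "spend": 100.0},
    {"age": 68, "spend": 50.0},
]

def age_group(age):
    """Map an age to one of three assumed bands."""
    if age < 30:
        return "under 30"
    if age < 60:
        return "30 to 59"
    return "60 and over"

# Step 1: separate the data set into subgroups.
subgroups = defaultdict(list)
for user in users:
    subgroups[age_group(user["age"])].append(user)

# Step 2: the subgroups become inputs to a different analysis,
# here a simple per-group average.
average_spend = {group: mean(u["spend"] for u in members)
                 for group, members in subgroups.items()}
print(average_spend)
```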