Data Mining

“Imagination is more important than knowledge.” (Albert Einstein)

Introduction

Data Mining makes present-day research methods obsolete! Well, almost... read on to get my message.

Present-day analysis is generally limited to measuring the impact that interventions have on the desired outcomes of a program or project. Although impact analysis produces evidence that the interventions caused the change that occurred, it is static. In an environment of complexity and change, it may be of little use to management if not used correctly.

We will define data mining as:

the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from large databases

It is a form of knowledge discovery in which we rely on the computer to exhaustively search through the data looking for patterns and trends.

Data mining activities can be described in one of two ways:

  • Directed Learning: To prove or disprove a theory or hypothesis
  • Knowledge Discovery: discovering relationships and trends.

Sometimes hunches are made about the effect of a particular variable on a pattern; other times there is no obvious link. Using directed learning, an analyst specifies which variables to watch. Undirected learning (knowledge discovery) is used when there are no preconceptions of what the results may be.

Data mining reports the non-trivial and previously unknown relationships between variables. No prior assumptions need be made. If the databases were not large, then no new techniques would be needed. (See the note on the use of the terms Data Mining vs. Knowledge Discovery.)

The ability to collect data easily and inexpensively has made it more difficult to gather "knowledge" from this vast amount of collected data. Where does one start? Looking for solutions or trying to find the correct hypothesis can be perceived as a search problem. "An important element in search problems is the establishment of the complexity of the search space, how many hypotheses there are and how they are related" (Adriaans and Zantinge, 1997). Techniques that were used in the past to analyze a limited amount of data, with a limited number of variables and a limited number of relationships, will no longer be suitable. It is fairly easy to develop a hypothesis when there are only a limited number of possibilities. Today, due to the complexity of the search space, it is more difficult.
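To give a rough feel for how quickly the search space grows, the sketch below counts candidate hypotheses for a given number of variables. The counting scheme is an assumption made for illustration: each hypothesis is a simple conjunction in which every variable is either fixed to one of its values or ignored. The function names are hypothetical.

```python
from math import comb


def count_conjunctive_hypotheses(n_variables: int, values_per_variable: int = 2) -> int:
    """Count hypotheses of the form 'var1 = a AND var2 = b AND ...'.

    Assumption: each variable is either fixed to one of its values or
    left out ("don't care"), giving (values + 1) choices per variable.
    """
    return (values_per_variable + 1) ** n_variables


def count_variable_pairs(n_variables: int) -> int:
    """Count the distinct pairs of variables whose relationship could be tested."""
    return comb(n_variables, 2)


# Print how both counts grow as variables are added.
for n in (2, 5, 10, 20):
    print(n, count_variable_pairs(n), count_conjunctive_hypotheses(n))
```

Under these assumptions, two yes/no variables allow only 9 conjunctive hypotheses, while ten allow 59,049. Guessing the right one by hand quickly stops being practical.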

The ability to read and understand the results of projects with a "few key indicators" will not be enough to survive as a researcher or as a professional manager. To survive, the researcher or manager must be able to take large quantities of integrated data and extract from it knowledge that helps explain and predict. It will not be enough to report "how many did this and how many did that," "averages," and "standard deviations." On their own, these values have little meaning and are of little use to management.

The ability to collect all data easily and inexpensively changes the emphasis of research from the "collecting of data" through surveys, questionnaires, or other time-consuming and limited methods to how to filter, select, and interpret the massive amount of data already collected through some mechanical process (such as a clinical computer system). This process of filtering, selecting, and interpreting is what we call Data Mining.

Although SQL and Data Mining are complementary, they are different. The difference between using SQL to query relational databases and Data Mining is that Data Mining attempts to discover new or hidden information that cannot be easily traced. It is that 20% that makes the difference between success and failure. SQL is only a query language that must look for data using relationships that are already known, i.e. the primary key and the secondary keys. Data Mining attempts to discover what these keys are. Normalization of SQL tables requires that relationships be known. Data Mining attempts to discover these relationships. SQL answers the question "what has been going on" and Data Mining answers the question "what is going to happen." It can help answer the following types of questions:

  • Discovering unknown associations. Such associations can be found when one event can be correlated to another event that seems completely unrelated.

  • Sequence, where one event leads to another later event. What risk factors will this client have? Based on where this client is in her life, what will be the next product or service she will want?

  • Recognizing patterns that lead to classification, or new organization of data. What products should be offered to which clients?

  • Finding groups of facts not previously known. This process is known as event clustering.

  • Forecasting, or simply discovering patterns in the data that can lead to predictions about the future. What is the trend in the usage of contraceptives? Where should the next clinic be located?
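As an illustration of the first item in the list, discovering associations, the sketch below counts how often pairs of services appear together in the same client record and reports simple "if A then B" rules with their support and confidence. The records, service names, thresholds, and the function `pair_rules` are all invented for the example. This is one common, simplified way to look for associations, not a description of any particular product.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: the set of services each client has used.
visits = [
    {"family_planning", "immunization"},
    {"family_planning", "immunization", "nutrition"},
    {"prenatal", "nutrition"},
    {"family_planning", "immunization"},
    {"prenatal", "nutrition", "immunization"},
]


def pair_rules(records, min_support=0.4, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) for pairs of items.

    support    = share of records containing both items
    confidence = share of records with lhs that also contain rhs
    """
    n = len(records)
    item_counts = Counter(item for record in records for item in record)
    pair_counts = Counter(
        pair for record in records for pair in combinations(sorted(record), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


for lhs, rhs, support, confidence in pair_rules(visits):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

No relationship between services is specified in advance. The rules come out of the counts, which is the sense in which such associations are "discovered" rather than queried.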

Instead of just counting the number of products offered and the number of products distributed, it is far more effective to be able to offer a service to a client who wants that type of service. It is better to be able to predict inventory and resource needs than to count the number of times that a clinic is out of stock. Better care is given to the client if clinics are located where they are needed. It is less expensive to target those groups that really need the services instead of those that do not. It is easier to offer the client something she wants than something she does not want.

The ability to move data over a network and store it in extremely large databases has made it possible to consolidate data that in the past was impossible to integrate. Thus, connecting Family Planning data with Mother and Child Health and with EPI data may lead to unexpected views on the behavioral patterns of certain population groups. This information is impossible to obtain by looking at any individual health care unit. People move from service to service, and this data must be analyzed. It is impossible to determine the cost-benefit of combining Health Care departments unless behavioral patterns are known.

The task of a researcher is to explain and predict. In order to explain and make these predictions, he must follow a certain procedure:

  • ...develops a hypothesis from his preconceived ideas or hunches concerning relationships in the data.

  • ...develops a test to either prove or disprove this hypothesis.

  • ...collects data.

  • ...analyzes the data to find patterns that relate the variables in this data.

  • ...either proves or disproves his hypothesis.

  • If he proves his hypothesis with this data, then he makes a prediction about future data results (if he actually gets this far).

If the prediction is correct this time, then we assume the hypothesis is correct, but if the prediction is incorrect, the researcher must start over again with the cycle. He must gather more data, and make a new hypothesis. 

Even if the prediction is correct this time, it does not mean that the hypothesis will be true the next time that it is used to make a prediction. People change, and in reality the process of discovery must go on forever.

The philosophy of science states that we can formulate hypotheses, but we can never prove that a hypothesis is true. Everything that science discovers is only temporary (Adriaans and Zantinge, 1996)...and is ever changing.

Now, we have held off this long in telling you how Data Mining works its magic...and the answer is...it learns as it goes about its business of analyzing data. First the application develops a hypothesis, an algorithm, based on the historical information that you have accumulated in the database. Next it begins by looking at both positive and negative examples of a concept. It then adjusts its algorithm based on any new data that it reviews. It is a "learning algorithm." It performs the tasks of the researcher automatically and continuously.
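The idea of a hypothesis that is adjusted whenever a new positive or negative example disagrees with it can be sketched with a perceptron, one of the simplest learning algorithms. It is chosen here purely as an illustration; the text does not say which algorithm a given Data Mining application uses. The class name `OnlinePerceptron` and the tiny data set are assumptions made for the example.

```python
class OnlinePerceptron:
    """A linear hypothesis that is corrected one example at a time."""

    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        """Return 1 (positive example) or 0 (negative example)."""
        score = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score > 0 else 0

    def update(self, x, label):
        """Adjust the hypothesis only if it misclassifies this example."""
        error = label - self.predict(x)  # 0 when the prediction was right
        if error:
            step = self.learning_rate * error
            self.weights = [w + step * xi for w, xi in zip(self.weights, x)]
            self.bias += step
        return error


# Hypothetical examples: (features, label); here the label follows feature 0.
examples = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1), ([0, 0], 0)]

model = OnlinePerceptron(n_features=2)
for _ in range(5):  # revisit the data a few times, as new batches would arrive
    for features, label in examples:
        model.update(features, label)

print([model.predict(features) for features, _ in examples])
```

The model starts out knowing nothing and is corrected only when a prediction turns out wrong, which mirrors the researcher's cycle of hypothesis, test, and revision described above.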

The difference between a researcher and the Data Mining application is that the Data Mining application can go through the entire database and find even the smallest pattern or relationship between variables. 

The difference between Data Mining and statistics is generally one of degree. Data Mining uses several statistical methods to derive its results. Whereas both can be used to create predictive models, Data Mining is generally faster and its reports are more "practical." Data Mining is not considered "scientific." The idea that you just go into a large database and run statistical programs to see what you get is not well accepted by the academic community. It is argued that in many cases, Data Mining techniques such as neural networks will increase the predictive accuracy beyond what can be attained via standard statistical techniques (Berson and Smith, 1997).

An Example:
In order to calculate the risks of complications in pregnancy, a "learning algorithm" first takes all the attributes of women who have had a complication of pregnancy and puts these in the "risk" group. It then takes all the attributes of women who have not had a complication of pregnancy and puts them in a "non-risk" group. Based on these groupings it develops a hypothesis. Each time that the algorithm receives new data, it automatically adjusts its hypothesis in order to have a better predictive value. It "learns."
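A minimal sketch of this example follows. It keeps attribute counts for a "risk" and a "non-risk" group and compares a new client against both, in the style of a simplified naive Bayes classifier. The class `RiskProfile`, the attribute names, and the records are hypothetical and have no clinical meaning. The scoring rule is one possible choice, not the method of any specific system.

```python
import math
from collections import Counter


class RiskProfile:
    """Attribute counts per group, updated as each new record arrives."""

    def __init__(self):
        self.group_sizes = Counter()
        self.attribute_counts = {"risk": Counter(), "non_risk": Counter()}

    def learn(self, attributes, had_complication):
        """Add one record to the 'risk' or 'non_risk' group."""
        group = "risk" if had_complication else "non_risk"
        self.group_sizes[group] += 1
        self.attribute_counts[group].update(attributes)

    def score(self, attributes, group):
        """Log-score of a group given the attributes (add-one smoothing)."""
        total = sum(self.group_sizes.values())
        size = self.group_sizes[group]
        log_score = math.log((size + 1) / (total + 2))
        for attribute in attributes:
            seen = self.attribute_counts[group][attribute]
            log_score += math.log((seen + 1) / (size + 2))
        return log_score

    def predict(self, attributes):
        """Return the group whose profile fits better; ties go to 'non_risk'."""
        risk = self.score(attributes, "risk")
        non_risk = self.score(attributes, "non_risk")
        return "risk" if risk > non_risk else "non_risk"


# Hypothetical historical records: (attributes, had a complication?).
profile = RiskProfile()
for attributes, outcome in [
    ({"anemia", "prior_complication"}, True),
    ({"anemia"}, True),
    ({"regular_checkups"}, False),
    ({"regular_checkups"}, False),
]:
    profile.learn(attributes, outcome)

print(profile.predict({"anemia"}))
print(profile.predict({"regular_checkups"}))
```

Each call to `learn` changes the counts, so the hypothesis behind `predict` shifts as new data arrives without anyone restating it by hand.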

An Example:
Suppose we have a large file containing all the data from the "Health Department" for the last five years. There is a wealth of potentially useful knowledge in such a file, and it can be queried to answer such questions as "who visited which clinic and used what services in what time period." However, there is knowledge hidden in the database that cannot be obtained from straight SQL queries. An example would be the answers to questions such as "what is the profile of those clients with the highest risk of complications of pregnancy," or "what are the most important trends in the use of contraceptives." Now of course these questions can be answered by present-day research, but you would "have to guess" the correct hypothesis and then "test" the hypothesis to either prove or disprove your guess. Of course, in this process of trial and error you may eventually arrive at the right combination of attributes. This process of guessing the correct hypothesis is time-consuming and expensive, and by the time you have the data, it would be "old." Data mining is set up to find that information automatically, in a few minutes, and in real time.
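The first kind of question, "who visited which clinic and used what services in what time period," is a straight SQL query, as the sketch below shows using Python's built-in `sqlite3` module. The table name `visits`, its columns, and the rows are invented for the example.

```python
import sqlite3

# An in-memory database with a hypothetical table of clinic visits.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE visits (
        client_id  INTEGER,
        clinic     TEXT,
        service    TEXT,
        visit_date TEXT  -- ISO dates compare correctly as text
    );
    INSERT INTO visits VALUES
        (1, 'North', 'family_planning', '2023-01-15'),
        (2, 'North', 'immunization',    '2023-02-03'),
        (1, 'South', 'prenatal',        '2023-06-20'),
        (3, 'South', 'immunization',    '2023-07-11');
    """
)

# "Who visited which clinic and used what services in what time period?"
rows = conn.execute(
    """
    SELECT client_id, clinic, service
    FROM visits
    WHERE visit_date BETWEEN ? AND ?
    ORDER BY client_id, visit_date
    """,
    ("2023-01-01", "2023-03-31"),
).fetchall()

for row in rows:
    print(row)
```

The query can only return what it was asked for: the analyst already had to know that clinic, service, and date were the columns of interest. A question such as "what is the profile of the highest-risk clients" has no such ready-made WHERE clause.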

If a project is limited in scope, limited in concept space, and only a few variables are measured, then it is fairly easy to enumerate the possible results. An example would be if only nutrition and a history of diabetes were used to measure the risks of complications of pregnancy. In this case, it is fairly easy to find the correct hypothesis. The "experiment" only needs to look at the data that was obtained in the area of the project (limited population) and the indicators that are measured (limited variables), and a hypothesis can be easily determined. By defining indicators in advance, a researcher has limited his ability to improve the process.

What this implies is that the more data that is available the better the results. The more data that is collected, the more variables that are introduced, and the wider the area of data collection, the better the theories become in explaining the data. 

The question of how large the area of data collection is, part of its concept space, is also important in finding the optimal solution. If data is collected from only one small area (and maybe another to be used as a control group), only one method is analyzed. If data is collected from many different points in the environment, the chances of finding the best solution are enhanced. Not only will additional data be available to give positive examples, but equally important, it will give negative examples. If you have many managers working to discover the best way to manage, then by analyzing all of their data, the optimal solution will be found. This certainly applies to the recommendations of InHCc.

Although Data Mining sounds like the solution to all problems, it is still in its infancy and involves many problems in its implementation. The good news is that 80% of these problems occur because of a poorly designed database. The rule "garbage in, garbage out" applies but since we are implementing the best of the databases, we will have no problems [just kidding]!

Data Mining Uses

Data mining is used for both

  • Descriptive Information

  • Predictions

Descriptive information draws conclusions about past events. It is to help understand the cause and effect relationships between elements of data. There are

Data Mining Types of Models

Data Mining Models may be classified into the following types:

  • Classification Models

  • Clustering

  • Descriptive Models

  • Predictive Models

Data Mining Model Algorithms

Algorithms may fall into more than one category, which may include:

  • Artificial Neural Networks

  • Associations

  • Clustering and Nearest Neighbor

  • Classification

  • Decision Trees

  • Factoring

  • Fuzzy Logic and Genetic Algorithms

  • Rule Induction

(See Model Algorithms)
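To make one of the listed algorithms concrete, here is a minimal nearest-neighbor sketch: a new record is given the label that is most common among the most similar records already in the data. The function name, the two features (age and visits per year), and the labels are hypothetical. Real applications would normally scale the features and choose the number of neighbors with care.

```python
import math
from collections import Counter


def k_nearest_neighbors(training, query, k=3):
    """Label `query` by majority vote among its k closest training points.

    `training` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical records: (age, visits per year) -> level of service use.
training = [
    ((20, 1), "low"),
    ((22, 2), "low"),
    ((25, 1), "low"),
    ((40, 6), "high"),
    ((42, 5), "high"),
    ((38, 7), "high"),
]

print(k_nearest_neighbors(training, (39, 6)))
print(k_nearest_neighbors(training, (21, 1)))
```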

Rewards

If data becomes freely available, the pay-offs are enormous: the ability to decrease inventory, improve the client's overall health, increase the client's compliance, and so on by as little as 1 percent represents a truly staggering amount.

(See Examples)
