The Challenges Faced During Data Mining
Data mining is the process of extracting interesting knowledge, such as patterns, associations, changes, anomalies and significant structures, from large data sources such as databases, logs and data warehouses (Galloway and Sminoff 3). To some researchers, data mining is synonymous with knowledge discovery in databases; to others, it is one essential step within the knowledge discovery process. The focus of this paper is an evaluation of the challenges that data miners face.
In practice, the main goals of data mining are prediction and description (Gianella, Bhargawa and Kargupta 4). Description involves finding human-interpretable patterns that describe the data. Prediction, on the other hand, involves using certain variables or fields in the database to predict unknown or future values of other variables of interest. These high-level goals are achieved through several primary data mining tasks. The first task is classification: learning a function that maps a data item into one of several predefined classes. The second task is regression: learning a function that maps a data item to a real-valued prediction variable.
The third task is clustering. This is a common descriptive task in which one seeks to identify a finite set of clusters that describe the data. The fourth task is summarization, which covers techniques for finding a compact description of a subset of the data. The fifth task is dependency modeling, which consists of finding a model that captures the significant dependencies between variables. Such a model has a structural level, which specifies which variables depend locally on one another, and a quantitative level, which specifies the strength of those dependencies on some numerical scale. Change and deviation detection is the last data mining task; it focuses on discovering the most significant changes in the data relative to previously measured values (Gianella, Bhargawa and Kargupta 4).
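Two of these tasks can be sketched in a few lines of Python. The toy data, the 1-nearest-neighbour classification rule, and the helper names (`nn_classify`, `mean_summary`) are invented for this sketch, not a prescribed method:

```python
# Classification: predict a discrete class for a new data item.
# Summarization: describe a data subset compactly (here, by its mean).

def nn_classify(train, query):
    """Classify `query` by the label of its nearest training point."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
    point, label = min(train, key=lambda pl: dist(pl[0], query))
    return label

def mean_summary(points):
    """Summarize a data subset by its per-dimension mean."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

train = [((0.0, 0.0), "low"), ((0.1, 0.2), "low"),
         ((5.0, 5.1), "high"), ((4.9, 5.3), "high")]

print(nn_classify(train, (4.8, 5.0)))           # nearest neighbours are "high"
print(mean_summary([p for p, _ in train]))      # compact description of the set
```

The 1-nearest-neighbour rule is about the simplest possible instance of the classification task; real systems would use learned models, but the input/output shape of the task is the same.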
Theories proposed as a unified formulation of data mining range from shallow to detailed. In science, however, a good theory explains the essential phenomena of its discipline. Physicists hold that several fundamental forces govern matter: the strong force, the electromagnetic force, the weak force and the gravitational force (Gianella, Bhargawa and Kargupta 45). A unified theory should be comprehensive, precise, fundamental, consistent and correct. To understand fully what unification means, it helps to consider several landmark unified theories and their originators. Sir Isaac Newton formulated a unified mechanics in 1687, explaining both earthly and heavenly motion. John Dalton revived the atomic theory, which explains the physical nature of matter (Yang 46).
Theodor Schwann and Matthias Schleiden developed the cell theory to explain the basic structure of living things (Yang 48). Charles Darwin proposed the unified theory of evolution in 1859, which explains the variability of living organisms (Yang 50). James Clerk Maxwell presented his electromagnetic theory in the 1860s, explaining the relation between electric and magnetic fields. In the 1880s, Hertz demonstrated that radio and light waves are electromagnetic waves, just as Maxwell's theory had predicted. At the beginning of the twentieth century, Albert Einstein formulated the general theory of relativity, another unified theory, which deals with gravity. Einstein coined the term unified theory in his later efforts to show that gravity and electromagnetism are related; however, he never achieved this ultimate goal.
A theory that joins the strong, weak and electromagnetic forces is known as a Grand Unified Theory (GUT). Together with gravity, this leaves four fundamental forces, and the final theory that would unite all of them is called a theory of everything (TOE). In a different field, Allen Newell proposed a unified theory of cognition that explains and predicts problem solving in human beings. Johnson argues that unified theories carry the history of science written within them, because they are powerful organizing concepts in scientific thought. Each force has a specific theory that explains how it functions. The theory of everything is closely linked to the field of unified theories, though it goes further in attempting to explain every fundamental aspect of physical nature.
The data stream problem has become important over the past few years owing to advances in hardware technology, which have made it easy to record and store the many transactions of daily life automatically. The central problem in the data stream domain, however, is clustering, because of its role in data summarization and outlier detection. The clustering problem involves partitioning data into one or more groups of similar items, with similarity established through a distance function. In the data stream domain, the clustering problem requires a procedure that continuously determines the clusters in the stream without being dominated by the older history of the stream.
Clustering algorithms exhibit a distinctive difficulty in high-dimensional cases. Traditional distance-based methods break down because data becomes sparse in high-dimensional settings: all pairs of points tend to lie almost equidistant from one another, which makes distance-based cluster definitions ambiguous (Yang 64). Recent work on high-dimensional data therefore uses techniques such as projected clustering, which searches for clusters in specific subsets of the dimensions. In these methods, each cluster is described together with its own group of dimensions, sidestepping the sparsity problem of the full high-dimensional space. Although a cluster may not be well defined over all dimensions because of data sparsity, some subsets of dimensions can still yield significant, high-quality clusters for some subsets of the points (Galloway and Sminoff 20). Experts call these dimensional subsets projected clusters, and they can vary from cluster to cluster.
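The one-pass constraint described above can be sketched with sequential k-means, a minimal stream-clustering scheme: each arriving point updates only the nearest centroid, so old history never has to be revisited. The seeds, the toy stream, and the 1/n learning rate are illustrative assumptions:

```python
# Sequential (one-pass) k-means over a data stream.

def stream_kmeans(stream, seeds):
    centroids = [list(s) for s in seeds]
    counts = [0] * len(seeds)
    for point in stream:
        # find the centroid nearest to the arriving point
        d2 = lambda c: sum((a - b) ** 2 for a, b in zip(c, point))
        j = min(range(len(centroids)), key=lambda i: d2(centroids[i]))
        counts[j] += 1
        # move it a step of 1/n towards the point (incremental mean)
        eta = 1.0 / counts[j]
        centroids[j] = [c + eta * (p - c) for c, p in zip(centroids[j], point)]
    return [tuple(c) for c in centroids]

stream = [(0.1, 0.0), (9.9, 10.2), (0.0, 0.2), (10.1, 9.8), (0.2, 0.1)]
cents = stream_kmeans(stream, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(cents)   # one centroid settles near (0.1, 0.1), the other near (10, 10)
```

Because each update is O(k) in the number of clusters and touches no stored history, the procedure meets the stream requirement, though at the cost of sensitivity to seed choice.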
A time-series database consists of sequences of values or events obtained through repeated measurements over equal intervals of time, for example hourly, daily or weekly. Time-series databases are important in many applications, including stock market analysis, economic forecasting, sales forecasting, budget analysis, utility studies, workload projections, yield projections, observation of natural phenomena and quality control. A time series is a database ordered by regular timestamps, whereas a sequence database consists of ordered sequences of events recorded with or without a concrete notion of time. Examples of sequence databases are customer shopping sequences and web page traversal sequences: the data are ordered, but they lack the regular timestamps of a time series (Galloway and Sminoff 27).
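The distinction can be made concrete with a short sketch: a time series supports window operations that assume equal spacing, while a sequence only carries an order. The hourly readings and the window size are invented for illustration:

```python
# Smooth an equally spaced time series with a sliding window average,
# a basic building block of utility studies and workload projections.

def moving_average(series, window):
    """Average each run of `window` consecutive, equally spaced values."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

hourly_load = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]   # one reading per hour
print(moving_average(hourly_load, window=3))          # [11.0, 12.0, 13.0, 14.0]

# A sequence database entry, by contrast: ordered, but with no timestamps,
# so a "window of three hours" has no meaning here.
shopping_sequence = ["milk", "bread", "milk", "eggs"]
```

The moving average is meaningful only because consecutive positions in `hourly_load` are exactly one hour apart; applying it to `shopping_sequence` would conflate order with elapsed time.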
Complex forms of knowledge can arise in data mining because the objects of interest are related to one another in various ways and span different domains (Galloway and Sminoff 27). For example:
- Hyperlinks form a graph structure (pages linked to pages)
- Text forms a structured list (an ordered sequence of words)
- HTML documents form structured graphs (structured text plus linked pages)
Data mining is rich in relational structure: mining interlinked web pages, social networks and cellular metabolic networks all require handling relations, whereas the main body of mining algorithms was designed for non-relational data. When learning concept approximations, one must choose a description language, because this choice limits the domains in which the chosen algorithm can be applied. Broadly, two kinds of objects exist: unstructured and structured. Unstructured objects are described by attribute-value pairs. For objects with internal structure, first-order logic languages are used.
Attribute-value languages have only the expressive power of propositional logic. Such languages sometimes fail to represent fully the structure of complex objects and the relations among their components. Relations stored in other database tables can then be brought into the discovery process, since background knowledge, examples and induced patterns can all be represented as formulas of a first-order language.
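The contrast between the two representations can be sketched as follows; the pages, the links and the pattern `linked_by_at_least` are an invented toy example:

```python
# Attribute-value (propositional) view: one flat record per page.
pages = {
    "home":  {"words": 120, "has_images": True},
    "about": {"words": 80,  "has_images": False},
    "blog":  {"words": 450, "has_images": True},
}

# Relational view: the link structure is a relation between objects,
# something the flat records above cannot express on their own.
links = [("home", "about"), ("home", "blog"), ("blog", "about")]

def linked_by_at_least(relation, target, k):
    """A first-order-style pattern: 'at least k pages link to target'."""
    return sum(1 for src, dst in relation if dst == target) >= k

print(linked_by_at_least(links, "about", 2))   # two pages link to "about"
```

The pattern quantifies over other objects ("there exist k pages such that..."), which is exactly what an attribute-value record cannot say without first flattening the relation into a precomputed feature.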
Mining network data is the process of exploring relationships and patterns in linked data. It combines automated discovery with the human capacity to spot meaningful patterns: computers provide thorough coverage of the data, while human interpretation keeps the discoveries relevant. As a human-centered technique, it offers good solutions. Nevertheless, for its potential to be realized fully, the exploration phase must run continuously at regular intervals, leaving room for irregularities, and identified deviations from old patterns must be fed into an exception-detection phase. The mining process here calls for techniques that can surface unexpected outcomes, which can redirect the data analysis and make it an evolving path. Network data can also be sliced into several operational views, each revealing different phenomena.
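A minimal sketch of the exception-detection phase, assuming a simple degree-based notion of deviation; the graph and the threshold factor are illustrative choices:

```python
# Re-scan the network at each interval and flag nodes whose degree
# deviates sharply from the average, feeding them to exception handling.

def degree_exceptions(edges, factor=2.0):
    """Return nodes whose degree exceeds `factor` times the mean degree."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    mean = sum(degree.values()) / len(degree)
    return sorted(n for n, d in degree.items() if d > factor * mean)

edges = [("hub", x) for x in ("a", "b", "c", "d", "e", "f")] + [("a", "b")]
print(degree_exceptions(edges))   # the highly connected hub stands out
```

Running this at fixed intervals and comparing the flagged set against the previous run is one concrete way to detect variations from old patterns, as described above.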
Multi-Agent Data (MAD) mining typically involves applications that require distributed problem solving. The data observed from distributed sources depends heavily on agent behavior and on the specific application. Mining data recorded in such distributed environments is normally a non-trivial problem because of many constraints, including the privacy of sensitive data and limited bandwidth, as in wireless networks. The field of Distributed Data Mining (DDM) addresses this problem of analyzing distributed data and offers many algorithmic solutions, each paying careful attention to how computation and communication are distributed under resource constraints. Multi-agent systems also serve as distribution channels, so MAD and DDM can be combined in appealing data-intensive applications.
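The bandwidth-saving principle behind DDM can be sketched as follows: each site ships only a local summary (here, item counts) rather than its raw transactions. The sites, items and helper names are invented for illustration:

```python
# Each site computes a local summary; only summaries cross the network.

def local_counts(transactions):
    """Count item occurrences in one site's local transactions."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return counts

def merge(*site_counts):
    """Combine the per-site summaries into a global count."""
    total = {}
    for counts in site_counts:
        for item, n in counts.items():
            total[item] = total.get(item, 0) + n
    return total

site_a = [["milk", "bread"], ["milk"]]
site_b = [["bread"], ["milk", "eggs"]]
print(merge(local_counts(site_a), local_counts(site_b)))
```

Counts merge exactly, so nothing is lost; privacy-sensitive statistics would additionally require perturbing or encrypting the summaries before shipping them.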
Several researchers concur that mining biological data is a major challenge for biomedical science and for data mining research alike. One illustration is the application of data mining to HIV vaccine research. Molecular biology produces vast amounts of data, including DNA sequences, 3D structures, and chemical and functional properties, that cannot be handled by standard mining algorithms. Data mining researchers should likewise turn their attention to ecology and environmental informatics.
Serious discussion has been devoted to how data mining tools should be improved. One major proposal is to automate the composition of mining methods within the mining system (Galloway and Sminoff 28), which can help prevent errors in data mining. Automating the various operations of data mining can also reduce human labor significantly. The cleaning and documentation of structured data should be addressed as well. Combining interactive visualization with the automated mining process is another issue, because visualization helps the analyst explore the data in more detail and refine the mining tasks. A theory of how analysts interact with large databases is also very important (Scott 30).
According to several researchers, privacy protection is a vital topic in data mining (Scott 40). When privacy-preserving techniques are applied, the privacy of users remains safe during mining. Some respondents argue that unless the privacy issue is solved, data mining will remain a public concern. Evaluating the integrity of the discovered knowledge, both over the data collection as a whole and for individual patterns, is among the challenges researchers face. Assessing knowledge integrity poses two key challenges:
- Developing efficient algorithms to compare the knowledge content of two versions of the data. This requires efficient algorithms and data structures for examining knowledge integrity across data versions.
- Developing algorithms to estimate the effect of modifications to the data. This requires algorithms that measure how particular data modifications affect the discovered patterns.
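The first of these challenges can be sketched as a comparison of the patterns (here, frequent items) mined from two versions of the data; the support threshold and the transactions are illustrative assumptions:

```python
# Compare the knowledge mined from two data versions by diffing
# their sets of frequent items.

def frequent_items(transactions, min_count):
    """Return items appearing in at least `min_count` transactions."""
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    return {i for i, n in counts.items() if n >= min_count}

def knowledge_diff(old_data, new_data, min_count):
    """Report patterns gained and lost between two data versions."""
    old = frequent_items(old_data, min_count)
    new = frequent_items(new_data, min_count)
    return {"gained": new - old, "lost": old - new}

v1 = [["milk", "bread"], ["milk"], ["bread"]]
v2 = [["milk", "eggs"], ["eggs"], ["milk", "eggs"]]
print(knowledge_diff(v1, v2, min_count=2))
```

This re-mines both versions from scratch; the efficiency challenge named above is precisely to avoid that, updating the pattern set incrementally from the data changes alone.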
Data is not static; it keeps changing in various domains. Consequently, it is highly important to update the learned models over time while avoiding bias (Scott 53). Cost-sensitive learning from unbalanced data is a related research issue. Studies reveal that the UCI datasets are small and poorly balanced, whereas real-world databases are far larger in both features and examples and often lack a class label for the target to be forecast (Scott 56).
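One simple way to keep a learned estimate current as the data drifts is exponential forgetting; the smoothing factor and the two-regime stream below are illustrative choices, not a prescribed method:

```python
# An exponentially weighted moving average discounts old observations,
# so the estimate tracks the recent regime instead of the stale one.

def ewma(values, alpha=0.3):
    """Exponentially weighted moving average over a value stream."""
    est = values[0]
    for v in values[1:]:
        est = alpha * v + (1 - alpha) * est
    return est

old_regime = [1.0] * 20      # the process used to hover around 1.0
new_regime = [5.0] * 20      # ... then drifted to around 5.0
print(ewma(old_regime + new_regime))   # tracks the new level, near 5.0
```

A plain running mean over the same stream would report about 3.0, biased by the obsolete regime; the forgetting factor alpha trades responsiveness to drift against noise sensitivity.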
In a nutshell, data mining has seen continuous success since the beginning of the 1980s. Many problems have arisen, and mining experts have solved them. Nevertheless, the timely exchange of important ideas has not yet been fully realized. The main problems facing data mining, as discussed in this paper, can be summarized in the order in which they appear:
- Clarifying data mining goals
- Developing a unifying theory of data mining
- Scaling up to high-speed data streams and high-dimensional data
- Mining sequence data and time-series data
- Mining complex knowledge from complex data
- Data mining in a network setting
- Distributed data mining and mining multi-agent data
- Data mining for biological and environmental problems
- Problems related to the data mining process
- Security, privacy and data integrity
- Dealing with non-static, unbalanced and cost-sensitive data
It ought to be stressed again that data is not static and changes constantly across domains. Consequently, updating the learned models over time while avoiding bias is very important. A major step toward solving many of these problems is automating the composition of methods within the data mining system.
Galloway, John, and Simon Sminoff. Network Data Mining: Methods and Techniques for Discovering Deep Linkage between Attributes. Sydney: University of Technology Sydney, 2007. Print.
Gianella, Chris, Ruchita Bhargawa, and Hillol Kargupta. "Multi-Agent Systems and Distributed Data Mining." Lecture Notes in Computer Science 3191 (2004): 1-15. Web.
Scott, Jim. Social Network Analysis: A Handbook. London: Sage, 2000. Print.
Yang, Qiang. "10 Challenging Problems in Data Mining Research." International Journal of Information Technology & Decision Making 5.4 (2006): 587-597. Web.