Understanding the Knowledge Discovery in Databases (KDD) Process

Explain the Knowledge Discovery in Databases (KDD) process.

It is an interactive and iterative sequence comprising seven phases. Teams commonly learn new things in one phase that cause them to go back and refine work done in earlier phases in light of the insights and information uncovered. The description below depicts this iterative movement between phases, which continues until the team members have sufficient information to move to the next phase. The process begins with defining the KDD goals and ends with the successful implementation of the discovered knowledge.

1. Domain Understanding

In this preliminary phase, the team needs to understand and define the goals of the end user and the environment in which the KDD process will take place.

2. Selection & Addition

In this phase the team determines the dataset that will be used for the KDD process. The first task is to find the relevant data that is accessible; data from multiple sources can be integrated at this point. Note that this is the data that will lead to knowledge, so if some attributes are missing, the result will be incomplete knowledge. The objective of this phase is therefore to assemble a suitable and complete dataset on which the discovery will be performed.
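
A minimal sketch of this phase in Python with pandas, assuming two hypothetical sources (a CRM extract and a sales extract, invented for illustration) that share a customer_id key; an outer join exposes the missing attributes that would otherwise lead to incomplete knowledge:

    import pandas as pd

    # Hypothetical source extracts; in practice these would be selected
    # from the accessible data sources relevant to the KDD goals.
    crm = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 45, 29]})
    sales = pd.DataFrame({"customer_id": [1, 2, 4], "total_spend": [120.0, 80.5, 45.0]})

    # Integrate the sources into one dataset; an outer join keeps every
    # record and marks missing attributes as NaN for inspection.
    dataset = crm.merge(sales, on="customer_id", how="outer")
    print(dataset)
    print("rows with missing attributes:", dataset.isna().any(axis=1).sum())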

3. Pre-processing & Cleansing

The data received from the earlier phase is like a rough diamond; in this phase it must be polished so that its value becomes apparent. The main task is to sanitize and prepare the data for use. Data cleansing is a subprocess that focuses on removing errors so that the data becomes accurate and consistent. Sanity checks are performed to confirm that the data does not contain physically or theoretically impossible values, such as people taller than 3 meters or someone with an age of 299 years.
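
A minimal cleansing sketch with pandas, reusing the impossible values mentioned above (a height over 3 meters, an age of 299 years); the records themselves are invented for illustration:

    import pandas as pd

    # Hypothetical raw records containing impossible values.
    raw = pd.DataFrame({
        "height_m": [1.75, 3.40, 1.62],
        "age": [41, 299, 27],
    })

    # Sanity checks: keep only physically plausible values.
    clean = raw[(raw["height_m"] <= 3.0) & (raw["age"].between(0, 120))]
    print(clean)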

4. Data Transformation

Once the team has cleansed and integrated the data, it may need to transform the data so that it becomes suitable for the data mining phase. Here the data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
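
A short sketch of such a summary/aggregation transformation with pandas, on hypothetical transaction data:

    import pandas as pd

    # Hypothetical cleansed transactions.
    tx = pd.DataFrame({
        "region": ["north", "north", "south", "south"],
        "amount": [100.0, 150.0, 80.0, 60.0],
    })

    # Consolidate the data into a form suitable for mining:
    # one summary row per region.
    summary = tx.groupby("region")["amount"].agg(["sum", "mean", "count"])
    print(summary)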

5. Data Mining

In this phase, methods such as association, classification, clustering, and/or regression are applied to extract patterns. The data mining algorithm may need to be applied several times until the desired output is obtained.
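
As an illustrative sketch (one possible method among those listed, not prescribed by the KDD process itself), the following applies a scikit-learn decision-tree classifier several times with different settings until an acceptable result is found:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Apply the mining algorithm repeatedly, varying a parameter,
    # until the output is good enough.
    X, y = load_iris(return_X_y=True)
    for depth in (1, 2, 3, 4):
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"max_depth={depth}: mean accuracy={score:.3f}")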

6. Evaluation

In this phase the mined patterns and rules are evaluated and interpreted, and their reliability is assessed against the goals set in the first phase. The pre-processing steps are also assessed for their impact on the outcomes of the data mining algorithm.
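
A small sketch of assessing a pre-processing step's impact on the mining outcome, assuming scikit-learn: the same algorithm (k-nearest neighbours) is evaluated with and without feature scaling:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Compare the mining outcome with and without a pre-processing step.
    X, y = load_wine(return_X_y=True)
    plain = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
    scaled = cross_val_score(
        make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5
    ).mean()
    print(f"accuracy without scaling: {plain:.3f}")
    print(f"accuracy with scaling:    {scaled:.3f}")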

7. Discovered Knowledge Presentation

The last phase concerns the use of, and overall feedback on, the discovery results acquired by data mining. The interesting discovered patterns are presented to the end user and may be stored as new knowledge in the knowledge base. The success of this phase determines the effectiveness of the entire KDD process.

Data Warehouse Architecture

As organizations grow, they usually have multiple data sources that store different kinds of information. For reporting purposes, however, the organization needs a single view of the data from these different sources. This is where a Data Warehouse comes in: it helps connect and analyze data stored in various heterogeneous sources. The process by which this data is collected, processed, loaded, and analyzed to derive business insights is called Data Warehousing. The data warehouse architecture defines the way in which information is processed, transformed, and loaded, and then presented to end users for the purpose of generating business insights.
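
A toy end-to-end sketch of this flow using Python's built-in sqlite3 module; the source tables, channel tags, and warehouse table are hypothetical:

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    # Two heterogeneous operational sources and one warehouse table.
    cur.executescript("""
        CREATE TABLE pos_sales(product TEXT, amount REAL);
        CREATE TABLE web_sales(product TEXT, amount REAL);
        INSERT INTO pos_sales VALUES ('widget', 120.0), ('gadget', 75.0);
        INSERT INTO web_sales VALUES ('widget', 60.0);
        CREATE TABLE dw_sales(product TEXT, channel TEXT, amount REAL);
    """)
    # Extract from each source, transform (tag the channel), and load.
    cur.execute("INSERT INTO dw_sales SELECT product, 'pos', amount FROM pos_sales")
    cur.execute("INSERT INTO dw_sales SELECT product, 'web', amount FROM web_sales")
    # A single view of the data for reporting.
    for row in cur.execute("SELECT product, SUM(amount) FROM dw_sales GROUP BY product"):
        print(row)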

K-Means Clustering

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of clusters to be created in the process: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on. It is an iterative algorithm that divides the unlabeled dataset into K clusters in such a way that each data point belongs to exactly one group, and points in the same group have similar properties.
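
A minimal K-Means sketch with scikit-learn on a few invented 2-D points, using K=2:

    import numpy as np
    from sklearn.cluster import KMeans

    # Four unlabeled points; with K=2 the algorithm forms two clusters.
    X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("labels:   ", kmeans.labels_)          # one cluster label per point
    print("centroids:", kmeans.cluster_centers_)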

Parallel Query Evaluation and Query Optimization

Parallel query evaluation and query optimization are two important concepts in database management systems that are used to improve query performance. Parallel query evaluation executes independent parts of a query concurrently across multiple processors or disks, reducing the response time of large queries. Query optimization is the process of selecting the best possible execution plan for a given query; its goal is to minimize execution time by choosing the most efficient way to access and process the data.
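
A small sketch of inspecting the optimizer's chosen plan, using SQLite's EXPLAIN QUERY PLAN through Python's built-in sqlite3 module; the table and index names are hypothetical:

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("CREATE TABLE orders(id INTEGER, customer_id INTEGER)")
    query = "SELECT * FROM orders WHERE customer_id = 42"
    # Without an index the optimizer can only choose a full table scan.
    print(cur.execute("EXPLAIN QUERY PLAN " + query).fetchall())
    cur.execute("CREATE INDEX idx_customer ON orders(customer_id)")
    # With the index available, the chosen plan switches to an index search.
    print(cur.execute("EXPLAIN QUERY PLAN " + query).fetchall())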

Data Cleaning and Data Transformation

Data cleaning and data transformation are two important steps in preparing data for analysis in database management systems. Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in the data. This can include removing duplicates, correcting misspellings or formatting errors, filling in missing values, and identifying outliers. The goal of data cleaning is to ensure that the data is accurate and complete, and that errors or inconsistencies do not distort the results of the analysis. Data transformation then converts the cleaned data into forms suitable for analysis, for example by normalizing values, aggregating records, or encoding categorical fields.
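
A minimal cleaning sketch with pandas on invented data, covering duplicates, formatting errors, and missing values:

    import pandas as pd

    df = pd.DataFrame({
        "city": ["Pune", "pune ", "Mumbai", "Mumbai"],
        "sales": [100.0, 100.0, None, 250.0],
    })
    df["city"] = df["city"].str.strip().str.title()         # fix formatting errors
    df = df.drop_duplicates()                               # remove duplicate rows
    df["sales"] = df["sales"].fillna(df["sales"].median())  # fill missing values
    print(df)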

Apriori Algorithm

The Apriori algorithm is a classic algorithm for frequent itemset mining and association rule learning in database management systems. It discovers itemsets that occur frequently in a large transaction dataset and derives association rules from them.
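
A compact, illustrative implementation of Apriori's frequent-itemset phase in pure Python; the toy transactions and the minimum support count of 2 are invented for the example:

    from itertools import combinations

    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    min_support = 2  # minimum number of transactions an itemset must appear in

    def frequent_itemsets(transactions, min_support):
        # Start from candidate 1-itemsets.
        items = {item for t in transactions for item in t}
        candidates = [frozenset([i]) for i in items]
        result, k = {}, 1
        while candidates:
            # Count support and keep the frequent candidates.
            frequent = {}
            for cand in candidates:
                count = sum(1 for t in transactions if cand <= t)
                if count >= min_support:
                    frequent[cand] = count
            result.update(frequent)
            # Join frequent k-itemsets into (k+1)-candidates, then prune any
            # candidate with an infrequent subset (the Apriori property).
            joined = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
            candidates = [
                c for c in joined
                if all(frozenset(s) in frequent for s in combinations(c, k))
            ]
            k += 1
        return result

    found = frequent_itemsets(transactions, min_support)
    for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
        print(set(itemset), found[itemset])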