Data Mining: Concepts and Techniques, 3rd ed., by Jiawei Han, Micheline Kamber, and Jian Pei. Morgan Kaufmann Publishers.
Differences in representation, scaling, or encoding may cause the same real-world entity's attribute values to differ across the data sources being integrated. Are these two variables positively or negatively correlated? The mean of the variable age is shown in Figure 2. The correlation coefficient is greater than 0, so the variables are positively correlated. What are the value ranges of the following normalization methods? Use the two methods below to normalize the following group of data. For readability, let A be the attribute age.
Using Equation 2. Given the data, one may prefer decimal scaling for normalization because such a transformation maintains the data distribution and is intuitive to interpret, while still allowing mining on specific age groups. Because out-of-range values may be present in future data, min-max normalization is less appropriate.
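The three normalization methods under discussion can be sketched as follows. This is a minimal illustration, not the book's code; the sample ages are hypothetical.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly map [min, max] onto [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the std deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by the smallest power of 10 that makes all |v| < 1."""
    j = 0
    while max(abs(v) for v in values) >= 10 ** j:
        j += 1
    return [v / 10 ** j for v in values]
```

Note that min-max output is confined to the chosen target range, which is exactly why a future out-of-range value is a problem for it, while decimal scaling only rescales by a constant power of 10.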
Z-score normalization, by contrast, may not be as intuitive to the user as decimal scaling. Use a flow chart to summarize the following procedures for attribute subset selection. Suppose a group of 12 sales price records has been sorted as follows; partition them into three bins by each of the following methods.
Stepwise forward selection. Stepwise backward elimination.
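For the binning question above, a minimal sketch of equal-frequency and equal-width partitioning with smoothing by bin means; the 12 price values below are illustrative, since the original list is not reproduced in this copy.

```python
# hypothetical sorted sales-price records
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

def equal_frequency_bins(sorted_values, n_bins):
    """Equal-frequency (equal-depth) partitioning: each bin gets the same count."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def equal_width_bins(sorted_values, n_bins):
    """Equal-width partitioning: split the value range into equal-size intervals."""
    lo, hi = sorted_values[0], sorted_values[-1]
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted_values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the max value
        bins[idx].append(v)
    return bins

def smooth_by_bin_means(bins):
    """Smoothing by bin means: replace each value with its bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]
```

Smoothing by bin medians or bin modes follows the same pattern, with the replacement value changed.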
Propose several methods for median approximation. Analyze their respective complexity under different parameter settings and decide to what extent the real value can be approximated. Moreover, suggest a heuristic strategy to balance between accuracy and complexity, and then apply it to all the methods you have given. This question can be dealt with either theoretically or empirically, but doing some experiments to get a result is perhaps more interesting. Given are some data sets sampled from different distributions (e.g., two symmetric and two skewed distributions).
The former two distributions are symmetric, whereas the latter two are skewed. For example, if using Equation 2. Obviously, the error incurred will decrease as k becomes larger; however, the time used in the whole procedure will also increase. The product of the error made and the time used is a good optimality measure. A combination of forward selection and backward elimination. In practice, this parameter value can be chosen to improve system performance.
There are also other approaches for median approximation. The student may suggest a few, analyze the best trade-off point, and compare the results from the different approaches.
A possible approach is as follows: hierarchically divide the whole data set into intervals, and at each step keep only the interval that contains the median. This iterates until the width of the subregion reaches a predefined threshold, and then the median approximation formula stated above is applied. In this way, we can confine the median to a smaller area without globally partitioning all of the data into shorter intervals, which would be expensive.
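The hierarchical narrowing just described can be sketched as below. This is my own minimal rendering of the idea, not the book's algorithm; it assumes the initial range is wider than the threshold and applies the interpolation formula in the final interval.

```python
def approx_median(values, lo, hi, threshold, n_intervals=10):
    """Approximate the median by repeatedly narrowing the interval containing it.

    Assumes hi - lo > threshold on entry and lo <= min(values) <= max(values) <= hi.
    """
    total = len(values)
    freq_below, freq_med = 0, total
    while hi - lo > threshold:
        width = (hi - lo) / n_intervals
        counts = [0] * n_intervals
        below = 0                              # values below the current range
        for v in values:
            if v < lo:
                below += 1
            elif v <= hi:
                idx = min(int((v - lo) / width), n_intervals - 1)
                counts[idx] += 1
        cum = below
        for i, c in enumerate(counts):
            if cum + c >= total / 2:           # median falls in interval i
                freq_below, freq_med = cum, c
                lo, hi = lo + i * width, lo + (i + 1) * width
                break
            cum += c
    # interpolation within the final, narrow interval
    return lo + (total / 2 - freq_below) / freq_med * (hi - lo)
```

Only the interval containing the median is ever subdivided, so each pass scans the data once and the cost grows with the number of passes rather than with a global fine-grained partitioning.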
The cost is proportional to the number of intervals. However, there is no commonly accepted subjective similarity measure. Using different similarity measures may lead to different results. Nonetheless, some apparently different similarity measures may be equivalent after some transformation.
Suppose we have the following two-dimensional data set, with attributes A1 and A2 and points x1 through x5. Use Euclidean distance on the transformed data to rank the data points. An equiwidth histogram of width 10 for age. Using these definitions, we obtain the distance from each point to the query point.
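The two rankings can be sketched as follows. The query point and data points here are hypothetical stand-ins, since the original table is not reproduced in this copy; the key fact, that ranking by cosine similarity and ranking by Euclidean distance on unit-length vectors always agree (||u - v||^2 = 2 - 2 cos θ for unit vectors), holds for any values.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def unit(v):
    """Scale a vector to unit length (its Euclidean norm becomes 1)."""
    length = math.hypot(*v)
    return [x / length for x in v]

query = [1.4, 1.6]                                      # hypothetical query point
points = {"x1": [1.5, 1.7], "x2": [2.0, 1.9], "x3": [1.6, 1.8],
          "x4": [1.2, 1.5], "x5": [1.5, 1.0]}           # hypothetical data points

by_cosine = sorted(points, key=lambda p: cosine_sim(points[p], query), reverse=True)
by_euclid = sorted(points, key=lambda p: math.dist(unit(points[p]), unit(query)))
```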
Based on the cosine similarity, the order is x1, x3, x4, x2, x5. After normalizing the data we have: Conceptually, it is the length of the vector. Based on the Euclidean distance of the normalized points, the order is x1, x3, x4, x2, x5, which is the same as the cosine similarity order. ChiMerge [Ker92] is a supervised, bottom-up (i.e., merge-based) data discretization method. Perform data discretization for each of the four numerical attributes using the ChiMerge method.
Let the stopping criteria be: You need to write a small program to do this to avoid clumsy numerical computation. Submit your simple analysis and your test results. The basic algorithm of ChiMerge is: The final intervals are: Sepal length: Sepal width: Petal length: Petal width: The split points are: Propose an algorithm, in pseudocode or in your favorite programming language, for the following. Also, an alternative binning method could be implemented, such as smoothing by bin modes.
The user can again specify more meaningful names for the concept hierarchy levels generated by reviewing the maximum and minimum values of the bins with respect to background knowledge about the data.
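The core of the ChiMerge method mentioned above is the χ² test on the class distributions of two adjacent intervals; ChiMerge repeatedly merges the adjacent pair with the lowest statistic until every remaining pair exceeds a significance threshold. A minimal sketch of that statistic (my own rendering, with intervals represented as class-count dicts):

```python
def chi2_adjacent(interval_a, interval_b):
    """Chi-square statistic for two adjacent intervals.

    Each interval is a dict mapping class label -> count of tuples in that
    interval. A low value means the intervals have similar class distributions
    and are good candidates for merging.
    """
    classes = sorted(set(interval_a) | set(interval_b))
    rows = [interval_a, interval_b]
    n = sum(sum(r.values()) for r in rows)
    stat = 0.0
    for r in rows:
        row_total = sum(r.values())
        for c in classes:
            col_total = interval_a.get(c, 0) + interval_b.get(c, 0)
            expected = row_total * col_total / n   # E_ij = R_i * C_j / N
            if expected:
                stat += (r.get(c, 0) - expected) ** 2 / expected
    return stat
```

Identical class distributions give a statistic of 0 (merge freely); disjoint distributions give the maximum for the given counts (keep the split point).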
Robust data loading poses a challenge in database systems because the input data are often dirty. In many cases, an input record may have several missing values, and some records could be contaminated (i.e., contain values of the wrong data type or range). Work out an automated data cleaning and loading algorithm so that the erroneous data will be marked and contaminated data will not be mistakenly inserted into the database during data loading. We can, for example, use the data in the database to construct a decision tree to induce missing values for a given attribute, and at the same time have human-entered rules on how to correct wrong data types.
State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses) rather than the query-driven approach (which applies wrappers and integrators).
Describe situations where the query-driven approach is preferable over the update-driven approach. For decision-making queries and frequently asked queries, the update-driven approach is preferable. This is because expensive data integration and aggregate computation are done before query-processing time. For the data collected in multiple heterogeneous databases to be used in decision-making processes, any semantic heterogeneity problems among the databases must be analyzed and solved so that the data can be integrated and summarized.
If the query-driven approach is employed, these queries will be translated into multiple (often complex) queries for each individual database. The translated queries will compete for resources with the activities at the local sites, thus degrading their performance. In addition, these queries will generate a complex answer set, which will require further filtering and integration. Thus, the query-driven approach is, in general, inefficient and expensive.
The update-driven approach employed in data warehousing is faster and more efficient, since most of the queries needed can be answered off-line. The query-driven approach is preferable, however, when the queries rely on the most current data, because data warehouses do not contain the most current information. Briefly compare the following concepts.
You may use an example to explain your point(s). The snowflake schema and fact constellation are both variants of the star schema model, which consists of a fact table and a set of dimension tables; the snowflake schema contains some normalized dimension tables, whereas the fact constellation contains a set of fact tables that share some common dimension tables.
A starnet query model is a query model (not a schema model), which consists of a set of radial lines emanating from a central point. Each step away from the center represents stepping down a level in the concept hierarchy of the dimension. The starnet query model, as suggested by its name, is used for querying and provides users with a global view of OLAP operations. Data transformation is the process of converting data from heterogeneous sources to a unified data warehouse format or semantics.
Refresh is the function propagating the updates from the data sources to the warehouse. An enterprise warehouse provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope, whereas the data mart is confined to specific selected subjects such as customer, item, and sales for a marketing data mart.
An enterprise warehouse typically contains detailed data as well as summarized data, whereas the data in a data mart tend to be summarized. The implementation cycle of an enterprise warehouse may take months or years, whereas that of a data mart is more likely to be measured in weeks. A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database servers. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowflake schema, and the fact constellation schema. A star schema is shown in Figure 3. The operations to be performed are: A star schema for the data warehouse of Exercise 3.
Suppose that a data warehouse for Big University consists of the following four dimensions. At the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), avg grade stores the actual grade; at higher conceptual levels, avg grade stores the average grade for the given combination. A snowflake schema is shown in Figure 3. A snowflake schema for the data warehouse of Exercise 3. The specific OLAP operations to be performed are: Suppose that a data warehouse consists of the four dimensions date, spectator, location, and game, and the two measures count and charge, where charge is the fare that a spectator pays when watching a game on a given date.
Spectators may be students, adults, or seniors, with each category having its own charge rate. Taking this cube as an example, briefly discuss the advantages and problems of using a bitmap index structure.
Bitmap indexing is advantageous for low-cardinality domains. For example, in this cube, if the dimension location is bitmap indexed, comparison, join, and aggregation operations over location are reduced to bit arithmetic, which substantially reduces the processing time.
For dimensions with high cardinality, such as date in this example, the vector used to represent the bitmap index could be very long. For example, a multi-year collection of data could result in thousands of distinct date values, meaning that every tuple in the fact table would require a correspondingly long bit vector to hold its bitmap index.
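The mechanics can be sketched as follows. This is a minimal illustration with a hypothetical location column: one integer serves as the bit vector per distinct value, and selection plus counting become pure bit arithmetic.

```python
def build_bitmap_index(column):
    """One bit vector per distinct value; bit r is set when row r holds the value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

# hypothetical low-cardinality location column
location = ["NY", "LA", "NY", "SF", "LA", "NY"]
idx = build_bitmap_index(location)

# selection and aggregation become bit arithmetic
ny_or_la = idx["NY"] | idx["LA"]          # rows located in NY or LA
count = bin(ny_or_la).count("1")          # COUNT(*) over that selection
```

With a high-cardinality dimension the same scheme needs one such vector per distinct value, each as long as the fact table, which is exactly the drawback noted above.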
Briefly describe the similarities and the differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful, and state the reasons behind your answer. They are similar in the sense that they both have a fact table as well as some dimension tables. The major difference is that some dimension tables in the snowflake schema are normalized, thereby further splitting the data into additional tables.
The advantage of the star schema is its simplicity, which enables efficiency, but it requires more space. The snowflake schema reduces some redundancy by normalizing and sharing common dimension tables; however, it is less efficient, and the space saving is negligible in comparison with the typical magnitude of the fact table. Therefore, empirically, the star schema is better, simply because efficiency typically has higher priority over space, as long as the space requirement is not too large.
Another option is to use a snowflake schema to maintain dimensions, and then present users with the same data collapsed into a star. References for the answer to this question include: Understand the difference between star and snowflake schemas in OLAP; Snowflake Schemas. Design a data warehouse for a regional weather bureau.
The weather bureau has about 1,000 probes, which are scattered throughout various land and ocean locations in the region to collect basic weather data, including air pressure, temperature, and precipitation, at each hour. All data are sent to the central station, which has collected such data for over 10 years. Your design should facilitate efficient querying and on-line analytical processing, and derive general weather patterns in multidimensional space.
Since the weather bureau has about 1,000 probes scattered throughout various land and ocean locations, we need to construct a spatial data warehouse so that a user can view weather patterns on a map by month, by region, and by different combinations of temperature and precipitation, and can dynamically drill down or roll up along any dimension to explore desired patterns.
The star schema of this weather spatial data warehouse can be constructed as shown in Figure 3. A star schema for a weather spatial data warehouse of Exercise 3. To construct this spatial data warehouse, we may need to integrate spatial data from heterogeneous sources and systems.
Fast and flexible on-line analytical processing in spatial data warehouses is an important factor. There are three types of dimensions in a spatial data cube: nonspatial dimensions, spatial-to-nonspatial dimensions, and spatial-to-spatial dimensions. We distinguish two types of measures in a spatial data cube: numerical measures and spatial measures. A nonspatial data cube contains only nonspatial dimensions and numerical measures.
If a spatial data cube contains spatial dimensions but no spatial measures, then its OLAP operations such as drilling or pivoting can be implemented in a manner similar to that of nonspatial data cubes.
If a user needs to use spatial measures in a spatial data cube, we can selectively precompute some spatial measures in the spatial data cube.
Which portion of the cube should be selected for materialization depends on the utility (such as access frequency or access priority), the sharability of merged regions, and the balanced overall cost of space and on-line computation. A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge, yet very sparse multidimensional matrix. Present an example illustrating such a huge and sparse data cube.
For the telephone company, it would be very expensive to keep detailed call records for every customer for longer than three months. Therefore, it would be beneficial to remove that information from the database, keeping only the total number of calls made, the total minutes billed, and the amount billed, for example. The resulting computed data cube for the billing database would have large amounts of missing or removed data, resulting in a huge and sparse data cube.
Regarding the computation of measures in a data cube: describe how to compute the variance if the cube is partitioned into many chunks. Hint: the variance of N observations x_1, . . . , x_N is s^2 = (1/N) sum_{i=1}^{N} (x_i - mean)^2 = (1/N) [ sum_{i=1}^{N} x_i^2 - (1/N) (sum_{i=1}^{N} x_i)^2 ]. The three categories of measures are distributive, algebraic, and holistic.
Hint: the variance function is algebraic. If the cube is partitioned into many chunks, the variance can be computed as follows: read in the chunks one by one, keeping track of the accumulated (1) number of tuples, (2) sum of x_i^2, and (3) sum of x_i. Use the formula as shown in the hint to obtain the variance. For each cuboid, use 10 units to register the top 10 sales found so far. Read the data in each cuboid once. If the sales amount in a tuple is greater than an existing one in the top-10 list, insert the new sales amount from the new tuple into the list, and discard the smallest one in the list.
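The chunk-by-chunk variance computation can be sketched as below; this is a minimal illustration of the algebraic-measure idea (per-chunk count, sum, and sum of squares), not the book's code.

```python
def chunk_summary(chunk):
    """Distributive components kept per chunk: count, sum, and sum of squares."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def variance_from_chunks(chunks):
    """Combine per-chunk summaries; no chunk has to stay in memory afterwards."""
    n = s = ss = 0
    for chunk in chunks:
        cn, cs, css = chunk_summary(chunk)
        n, s, ss = n + cn, s + cs, ss + css
    mean = s / n
    return ss / n - mean * mean            # E[x^2] - (E[x])^2
```

Because only three accumulators survive each chunk, the variance qualifies as algebraic: a fixed-size summary per partition suffices to produce the exact global result.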
The computation of a higher level cuboid can be performed similarly by propagation of the top cells of its corresponding lower level cuboids. Suppose that we need to record three measures in a data cube: Design an efficient computation and storage method for each measure given that the cube allows data to be deleted incrementally i.
For min, keep the ⟨min val, count⟩ pair for each cuboid to register the smallest value and its count. For each deleted tuple, if its value is greater than min val, do nothing; otherwise, decrement the count of the corresponding node. If a count goes down to zero, recalculate the structure.
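The ⟨min val, count⟩ bookkeeping can be sketched as follows (a minimal illustration; the class name and the full-rescan fallback are my own framing).

```python
class MinMeasure:
    """Maintain min under incremental deletion via a (min_val, count) pair."""

    def __init__(self, values):
        self.values = list(values)        # base data, kept for the rare rebuilds
        self._recompute()

    def _recompute(self):
        self.min_val = min(self.values)
        self.count = self.values.count(self.min_val)

    def delete(self, v):
        self.values.remove(v)
        if v == self.min_val:
            self.count -= 1
            if self.count == 0:           # last copy of the minimum is gone
                self._recompute()         # only now do we rescan
        # deletions above the minimum touch nothing
```

A full rescan happens only when the last copy of the current minimum is deleted; all other deletions cost O(1) bookkeeping (plus the base-data removal).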
For each deleted node N, decrement the count and subtract value N from the sum. For median, keep a small number, p, of centered values (e.g., values around the current median) in memory. Each removal may change the count or remove a centered value. If the median no longer falls among these centered values, recalculate the set.
Otherwise, the median can easily be calculated from the above set. The operations to consider are: (i) the generation of a data warehouse (including aggregation); (ii) roll-up; (iii) drill-down; and (iv) incremental updating. Which implementation techniques do you prefer, and why? A ROLAP technique for implementing a multiple-dimensional view consists of intermediate servers that stand between a relational back-end server and client front-end tools, thereby using a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. A MOLAP implementation technique consists of servers, which support multidimensional views of data through array-based multidimensional storage engines that map multidimensional views directly to data cube array structures.
The fact tables can store aggregated data and the data at the abstraction levels indicated by the join keys in the schema for the given data cube. In generating a data warehouse, the MOLAP technique uses multidimensional array structures to store data and multiway array aggregation to compute the data cubes.
To roll-up on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to roll-up the date dimension from day to month, select the record for which the day field contains the special value all.
The value of the measure field, dollars sold, for example, given in this record will contain the subtotal for the desired roll-up. To perform a roll-up in a data cube, simply climb up the concept hierarchy for the desired dimension. For example, one could roll-up on the location dimension from city to country, which is more general.
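The summary-fact-table roll-up with the special value "all" can be sketched as below. The table layout and values are hypothetical, chosen only to illustrate the lookup.

```python
# hypothetical summary fact table rows: (day, month, city, country, dollars_sold);
# the special value "all" marks a field that has been aggregated away
fact = [
    ("2023-01-05", "Jan", "Toronto", "Canada", 400.0),
    ("all", "Jan", "all", "Canada", 1000.0),   # day and city rolled up
    ("all", "Feb", "all", "Canada", 1500.0),
    ("all", "all", "all", "Canada", 2500.0),   # fully rolled up over time
]

def roll_up_day_to_month(table, month, country):
    """Pick the precomputed record whose day field holds the special value 'all'."""
    return [r for r in table
            if r[0] == "all" and r[1] == month and r[3] == country]
```

The roll-up is thus a lookup of an already-aggregated row rather than a recomputation over the base records.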
To drill-down on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to drill-down on the location dimension from country to province or state, select the record for which only the next lowest field in the concept hierarchy for location contains the special value all. In this case, the city field should contain the value all. The value of the measure field, dollars sold, for example, given in this record will contain the subtotal for the desired drill-down.
To perform a drill-down in a data cube, simply step down the concept hierarchy for the desired dimension. For example, one could drill-down on the date dimension from month to day in order to group the data by day rather than by month. Incremental updating (ROLAP): check whether the corresponding tuple is in the summary fact table; if not, insert it into the summary table and propagate the result up; otherwise, update the value and propagate the result up. Incremental updating (MOLAP): check whether the corresponding cell is in the cuboid; if not, insert it into the cuboid and propagate the result up; otherwise, update the value and propagate the result up.
If the data are sparse and the dimensionality is high, there will be too many cells due to exponential growth and, in this case, it is often desirable to compute iceberg cubes instead of materializing the complete cubes. Suppose that a data warehouse contains 20 dimensions, each with about five levels of granularity. How would you design a data cube structure to efficiently support this preference?
How would you support this feature? An efficient data cube structure to support this preference would be to use partial materialization, or selected computation of cuboids.
By computing only the proper subset of the whole set of possible cuboids, the total amount of storage space required would be minimized while maintaining a fast response time and avoiding redundant computation. Since the user may want to drill through the cube for only one or two dimensions, this feature could be supported by computing the required cuboids on the fly. Since the user may only need this feature infrequently, the time required for computing aggregates on those one or two dimensions on the fly should be acceptable.
A data cube, C, has n dimensions, and each dimension has exactly p distinct values in the base cuboid. Assume that there are no concept hierarchies associated with the dimensions.
This is the maximum number of distinct tuples that you can form with p distinct values per dimension. You need at least p tuples to contain p distinct values per dimension.
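The bounds under discussion can be checked numerically with a small sketch (function names are mine): the maximum base-cuboid size, the full-cube total when every base cell is populated, and the minimum total when p tuples share no values.

```python
def max_base_tuples(n, p):
    """Every combination of p distinct values across n dimensions."""
    return p ** n

def max_total_cells(n, p):
    """Full cube when every base cell is populated: each dimension is one of
    p values or '*' (aggregated), so (p + 1)^n cells in total."""
    return (p + 1) ** n

def min_total_cells(n, p):
    """p base tuples sharing no value on any dimension: each of the 2^n - 1
    non-apex cuboids holds exactly p cells, plus the single apex cell."""
    return p * (2 ** n - 1) + 1
```

For n = 2 and p = 2, for instance, two base tuples such as (a1, b1) and (a2, b2) give 2 base cells, 2 cells in each 1-D cuboid, and 1 apex cell, for 7 cells in total.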
In this case, no tuple shares any value on any dimension. The minimum number of cells occurs when each cuboid contains only p cells, except for the apex cuboid, which contains a single cell. What are the differences between the three main types of data warehouse usage? Information processing involves using queries to find and report useful information using crosstabs, tables, charts, or graphs.
Analytical processing uses basic OLAP operations such as slice-and-dice, drill-down, roll-up, and pivoting on historical data in order to provide multidimensional analysis of data warehouse data.
Data mining uses knowledge discovery to find hidden patterns and associations, construct analytical models, perform classification and prediction, and present the mining results using visualization tools. The motivations behind OLAP mining include the following: the high quality of data (i.e., cleaned and integrated data) in data warehouses; and the available information processing infrastructure surrounding data warehouses, which means that comprehensive information processing and data analysis infrastructures will not need to be constructed from scratch.
On-line selection of data mining functions allows users who may not know what kinds of knowledge they would like to mine the flexibility to select desired data mining functions and dynamically swap data mining tasks. Assume a base cuboid of 10 dimensions contains only three base cells: The measure of the cube is count.
A closed cube is a data cube consisting of only closed cells. How many closed cells are in the full cube? Briefly describe these three methods (i.e., multiway array aggregation, BUC, and Star-Cubing). Note that the textbook adopts the application worldview of a data cube as a lattice of cuboids, where a drill-down moves from the apex (all) cuboid downward in the lattice. Star-Cubing works better than BUC for highly skewed data sets.
The closed-cube and shell-fragment approaches should be explored. Here, we have two cases, which represent two possible extremes. 1. The k tuples are organized like the following: However, this scheme is not effective if we keep dimension A and instead drop B, because obviously there would still be k tuples remaining, which is not desirable. It seems that case 2 is always better. A heuristic way to think this over is as follows: Obviously, this can generate the greatest number of cells. We assume that we can always do placement as proposed, disregarding the fact that the dimensionality D and the cardinality ci of each dimension i may place some constraints.
The same assumption is kept throughout for this question. If we fail to do so (e.g., if the cardinalities do not permit such a placement), the bound cannot be reached. The question does not mention how the cardinalities of the dimensions are set. To answer this question, we have a core observation. Minimum case: the distinct condition no longer holds here, since c tuples have to be in one identical base cell now. Thus, we can put all k tuples in one base cell, which results in 2^D cells in all.
Maximum case: we will replace k with ⌊k/c⌋ and follow the procedure in part (b), since we can get at most that many base cells in all. From the analysis in part (c), we will not consider the threshold, c, as long as k can be replaced by a new value. Considering the number of closed cells, 1 is the minimum if we put all k tuples together in one base cell. How can we reach this bound? We assume that this is the case. We also assume that cardinalities cannot be increased as in part (b) to satisfy the condition.
Suppose that a base cuboid has three dimensions A, B, C, with the following number of cells. Suppose that each dimension is evenly partitioned into 10 portions for chunking. The complete lattice is shown in Figure 4. A complete lattice for the cube of Exercise 4. The total size of the computed cube is as follows. The total amount of main memory space required for computing the 2-D planes is: Often, the aggregate measure value of many cells in a large data cuboid is zero, resulting in a huge, yet sparse, multidimensional matrix.
Note that you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve data from your structures. Give the reasoning behind your new design. A way to overcome the sparse matrix problem is to use multiway array aggregation. The first step consists of partitioning the array-based cube into chunks or subcubes that are small enough to fit into the memory available for cube computation. Each of these chunks is first compressed to remove cells that do not contain any valid data, and is then stored as an object on disk.
The second step involves computing the aggregates by visiting cube cells in an order that minimizes the number of times that each cell must be revisited, thereby reducing memory access and storage costs. By first sorting and computing the planes of the data cube according to their size in ascending order, a smaller plane can be kept in main memory while fetching and computing only one chunk at a time for a larger plane.
In order to handle incremental data updates, the data cube is first computed as described in a. Subsequently, only the chunk that contains the cells with the new data is recomputed, without needing to recompute the entire cube.
This is because, with incremental updates, only one chunk at a time can be affected. The recomputed value needs to be propagated to its corresponding higher-level cuboids. Thus, incremental data updates can be performed efficiently. When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality problem: Compute the number of nonempty aggregate cells.
Comment on the storage space and time required to compute these cells. If the minimum support count in the iceberg condition is two, how many aggregate cells will there be in the iceberg cube? Show the cells. However, even with iceberg cubes, we could still end up having to compute a large number of trivial uninteresting cells i. Suppose that a database has 20 tuples that map to or cover the two following base cells in a dimensional base cuboid, each with a cell count of Let the minimum support be How many distinct aggregate cells will there be like the following: What are the cells?
We subtract 1 because, for example, (a1, a2, a3, . . .) itself should not be counted. These four cells are: They are 4: There are only three distinct cells left: Propose an algorithm that computes closed iceberg cubes efficiently. We base our answer on the algorithm presented in the paper: Let the cover of a cell be the set of base tuples that are aggregated in the cell. Cells with the same cover can be grouped in the same class if they share the same measure.
Each class will have an upper bound, which consists of the most specific cells in the class, and a lower bound, which consists of the most general cells in the class. The set of closed cells corresponds to the upper bounds of all of the distinct classes that compose the cube. We can compute the classes by following a depth-first search strategy. Let the cells making up this bound be u1, u2, . . . Finding the upper bounds would depend on the measure. Incorporating iceberg conditions is not difficult.
Show the BUC processing tree which shows the order in which the BUC algorithm explores the lattice of a data cube, starting from all for the construction of the above iceberg cube. We know that dimensions should be processed in the order of decreasing cardinality, that is, use the most discriminating dimensions first in the hope that we can prune the search space as quickly as possible.
In this case we should then compute the cube in the order D, C, B, A. The order in which the lattice is traversed is presented in Figure 4. BUC processing order for Exercise 4. Discuss how you might extend the Star-Cubing algorithm to compute iceberg cubes where the iceberg condition tests for avg that is no bigger than some value, v. Instead of using average we can use the bottom-k average of each cell, which is antimonotonic.
To reduce the amount of space required to check the bottom-k average condition, we can store a few statistics, such as count and sum, for the base tuples that fall within certain ranges of v. This is analogous to the optimization presented in Section 4. A flight data warehouse for a travel agent consists of six dimensions. Starting with the base cuboid [traveller, departure, departure time, arrival, arrival time, flight], what specific OLAP operations (e.g., roll-up, drill-down) should be performed?
Outline an efficient cube computation method based on common sense about flight data distribution. The OLAP operations are: There are two constraints: Use an iceberg cubing algorithm, such as BUC.
Use binning plus min sup to prune the computation of the cube. Implementation project: There are four typical data cube computation methods. Find another student who has implemented a different algorithm on the same platform (e.g., the same language and machine).
An iceberg condition: Output: (i) the set of computed cuboids that satisfy the iceberg condition, in the order of your output generation; (ii) output that is used to quickly check the correctness of your results. What challenging computation problems are encountered as the number of dimensions grows large? How can iceberg cubing solve the problems of part (a) for some data sets, and how would you characterize such data sets? Give one simple example to show that sometimes iceberg cubes cannot provide a good solution.
For example, for a high-dimensional data cube, we may compute only the 5-dimensional cuboids for every possible 5-dimensional combination. The resulting cuboids form a shell cube.
Discuss how easy or hard it is to modify your cube computation algorithm to facilitate such computation. This is to be evaluated on an individual basis. The number of cuboids for a cube grows exponentially with the number of dimensions. If the number of dimensions grows large, then huge amounts of memory and time are required to compute all of the cuboids.
Iceberg cubes, by eliminating statistically insignificant aggregated cells, can substantially reduce the number of aggregate cells and therefore greatly reduce the computation. Benefits from iceberg cubing can be maximized if the data sets are sparse but not skewed. This is because in these data sets, there is a relatively low chance that cells will collapse into the same aggregated cell, except for cuboids consisting of a small number of dimensions. Thus, many cells may have values that are less than the threshold and therefore will be pruned.
Consider, for example, an OLAP database consisting of many dimensions. Let a(i,j) be the jth value of dimension i. Assume that there are 10 cells in the base cuboid, all of which aggregate to the cell (a(1,1), a(2,1), . . .). If the support threshold is at most 10, then all descendant cells of this cell satisfy the threshold.
In this case, iceberg cubing cannot benefit from the pruning effect. It is easy to modify the algorithms if they adopt a top-down approach. Consider BUC as an example. We can modify the algorithm to generate a shell cube for a specific number of dimension combinations because it proceeds downward from the apex (all) cuboid. The process can be stopped when it reaches the maximum number of dimensions. H-Cubing and Star-Cubing can be modified in a similar manner.
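As a concrete illustration, the following is a minimal sketch of such a depth-limited, iceberg-pruned cube computation in the BUC style. The data layout (tuples as plain Python tuples, dimensions as column indices) and the function name are assumptions made for this example, not the book's implementation.

```python
from collections import defaultdict

def buc(tuples, dims, min_sup, max_dims, prefix=(), out=None):
    """Simplified BUC sketch: compute aggregate (count) cells, pruning
    partitions below min_sup (iceberg condition) and stopping when a
    cell already fixes max_dims dimensions (shell-cube cutoff)."""
    if out is None:
        out = {}
    out[prefix] = len(tuples)            # aggregate cell for the current group
    if len(prefix) >= max_dims:          # shell-cube depth limit reached
        return out
    for d in dims:
        groups = defaultdict(list)
        for t in tuples:                 # partition on dimension d
            groups[t[d]].append(t)
        for val, part in groups.items():
            if len(part) >= min_sup:     # iceberg pruning: skip sparse partitions
                rest = [e for e in dims if e > d]  # expand later dims only
                buc(part, rest, min_sup, max_dims, prefix + ((d, val),), out)
    return out
```

Cells are keyed by the (dimension, value) pairs they fix; anything pruned by the iceberg condition never spawns descendant cells, which is exactly the effect discussed above.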
Consider the following multifeature cube query: ... Why or why not? ... R1 such that R1. ... For class characterization, what are the major differences between a data cube-based implementation and a relational implementation such as attribute-oriented induction? Discuss which method is most efficient and under what conditions this is so. For class characterization, the major differences between a data cube-based implementation and a relational implementation such as attribute-oriented induction include the following. Under a data cube-based approach, the process is user-controlled at every step.
This includes the selection of the relevant dimensions to be used, as well as the application of OLAP operations such as roll-up, drill-down, slicing, and dicing. A relational approach does not require user interaction at every step, however, as attribute relevance analysis and ranking are performed automatically. The relational approach also supports complex data types and measures, which restrictions in current OLAP technology do not allow.
Thus, OLAP implementations are limited to a more simplified model for data analysis. An OLAP-based implementation allows for the precomputation of measures at different levels of aggregation (materialization of subdata cubes), which is not supported under a relational approach.
Based upon these differences, it is clear that a relational approach is more efficient when complex data types and measures are being used, as well as when a very large number of attributes must be considered. This is due to the advantage that automation provides over the effort that would be required of a user to perform the same tasks. However, when the data set being mined consists of regular data types and measures that are well supported by OLAP technology, the OLAP-based implementation provides an advantage in efficiency.
This results from the time saved by using precomputed measures, as well as the flexibility in investigating mining results provided by OLAP functions. Suppose that the following table is derived by attribute-oriented induction. See Table 4.
A crosstab for the birth place of Programmers and DBAs. Discuss why relevance analysis is beneficial and how it can be performed and integrated into the characterization process. Compare the results of the two induction methods: ... Incremental class comparison.
Data-cube-based incremental algorithm for mining class comparisons with dimension relevance analysis. P: a prime generalized relation used to build the data cube. The method is outlined as follows. To build the initial data cube for mining: the incremental part of the data is identified to produce a target class and contrasting class(es) from the set of task-relevant data to generate the initial working relations.
This is performed on the initial working relation for the target class in order to determine which attributes should be retained (attribute relevance). An attribute will have to be added that indicates the class of each data entry. The desired level of generalization is determined to form the prime target class and prime contrasting class cuboid(s). This generalization will be synchronous among all of the classes, as the contrasting class relation(s) will be generalized to the same level.
To process revisions to the relevant data set and thus make the algorithm incremental, perform the following. Rather than recomputing the cube from scratch, only the changes to the relevant data are processed and added to the prime relation as held in the data cube.
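A toy sketch of this incremental step, assuming the prime relation is stored as counts of generalized tuples and using a made-up one-level concept hierarchy (AGE_HIERARCHY and the attribute layout are illustrative assumptions, not the book's data):

```python
from collections import Counter

# Hypothetical concept hierarchy: raw attribute value -> generalized value.
AGE_HIERARCHY = {23: '20-29', 27: '20-29', 34: '30-39', 38: '30-39'}

def generalize(raw_tuple):
    """Generalize one raw (age, class_label) tuple to the prime-relation level."""
    age, label = raw_tuple
    return (AGE_HIERARCHY[age], label)

def incremental_update(prime, increment):
    """Fold newly inserted raw tuples into the existing prime relation by
    generalizing them and adding their counts -- no recomputation of the
    full cube from scratch. `prime` is a Counter of generalized tuples."""
    for t in increment:
        prime[generalize(t)] += 1
    return prime

prime = Counter({('20-29', 'target'): 5, ('30-39', 'contrast'): 2})
incremental_update(prime, [(23, 'target'), (34, 'contrast'), (27, 'target')])
```

Deletions would work symmetrically, decrementing the counts of the affected generalized tuples.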
Figure 4. A data-cube-based algorithm for incremental class comparison. Outline an incremental updating procedure for applying the necessary deletions to R. Outline a data cube-based incremental algorithm for mining class comparisons. A data-cube-based algorithm for incremental class comparison is given in Figure 4. The Apriori algorithm uses prior knowledge of subset support properties. Prove that any itemset that is frequent in D must be frequent in at least one partition of D.
Let s be a frequent itemset. Let min_sup be the minimum support threshold. Let D be the task-relevant data, a set of database transactions.
Let |D| be the number of transactions in D. Let s′ be any nonempty subset of s. Any transaction containing itemset s will also contain itemset s′, so support(s′) ≥ support(s) ≥ min_sup. Thus, s′ is also a frequent itemset. This proves that the support of any nonempty subset s′ of itemset s must be at least as great as the support of s. Any itemset that is frequent in D must be frequent in at least one partition of D.
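In symbols, the subset-support argument amounts to the following chain:

```latex
\{\,T \in D : s \subseteq T\,\} \subseteq \{\,T \in D : s' \subseteq T\,\}
\;\Longrightarrow\;
\mathrm{support}(s') = \frac{|\{T : s' \subseteq T\}|}{|D|}
\;\ge\; \frac{|\{T : s \subseteq T\}|}{|D|} = \mathrm{support}(s)
\;\ge\; \mathit{min\_sup}.
```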
Proof by contradiction: Let F be any itemset that is frequent in D. Assume that F is not frequent in any of the partitions of D.
Let C be the total number of transactions in D. Let A be the total number of transactions in D containing the itemset F. Let us partition D into n nonoverlapping partitions d1, d2, d3, . . ., dn. Because of the assumption at the start of the proof, F is not frequent in any of the partitions d1, d2, d3, . . ., dn; that is, with min_sup expressed as a fraction, the count of F in each partition di is less than min_sup × |di|. Summing over all partitions gives A < min_sup × (|d1| + · · · + |dn|) = min_sup × C, so F is not frequent in D. But this is a contradiction, since F was defined as a frequent itemset at the beginning of the proof. This proves that any itemset that is frequent in D must be frequent in at least one partition of D.
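Written out as a single chain of inequalities (min_sup as a fraction):

```latex
A \;=\; \sum_{i=1}^{n} \mathrm{count}(F, d_i)
\;<\; \sum_{i=1}^{n} \mathit{min\_sup} \times |d_i|
\;=\; \mathit{min\_sup} \times C
\;\Longrightarrow\; \frac{A}{C} < \mathit{min\_sup},
```

contradicting the assumption that F is frequent in D.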
Section 5. ... Propose a more efficient method. Explain why it is more efficient than the one proposed in Section 5. (Consider incorporating the properties of Exercise 5. ...) An algorithm for generating strong rules from frequent itemsets is given in Figure 5. It is more efficient than the method proposed in Section 5. for the following reason: if a subset x of length k does not meet the minimum confidence as a rule antecedent, then there is no point in generating any of its nonempty subsets, because their respective confidences will never be greater than the confidence of x (see Exercise 5. ).
The method in Section 5. ... This is inefficient because it may generate and test many unnecessary subsets (i.e., subsets whose rules are already known not to be strong). Consider the following worst-case scenario: the method of Section 5. ... A database has five transactions. ... Rule Generator: given a set of frequent itemsets, output all of its strong rules.
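A sketch of such a rule generator using the pruning property above. The data structures are illustrative assumptions (frequent itemsets as a dict from frozenset to support count; by the Apriori property every subset of a frequent itemset is assumed present in the dict):

```python
from itertools import combinations

def gen_rules(freq, min_conf):
    """Generate strong rules from frequent itemsets `freq`
    (dict: frozenset -> support count). Antecedents are explored from
    largest to smallest; once an antecedent x fails min_conf, none of
    its subsets is expanded, because sup(x') >= sup(x) for x' subset
    of x implies conf(x' => l - x') <= conf(x => l - x)."""
    rules = []
    for l, sup_l in freq.items():
        if len(l) < 2:
            continue
        frontier = [frozenset(c) for c in combinations(l, len(l) - 1)]
        seen = set()
        while frontier:
            nxt = []
            for x in frontier:
                if not x or x in seen:
                    continue
                seen.add(x)
                conf = sup_l / freq[x]
                if conf >= min_conf:
                    rules.append((x, l - x, conf))
                    # only a passing antecedent spawns smaller antecedents
                    nxt.extend(frozenset(c) for c in combinations(x, len(x) - 1))
                # else: pruned -- no subset of x can reach min_conf
            frontier = nxt
    return rules
```

The pruning shows up in the `else` branch: a failing antecedent contributes nothing to the next frontier, so its entire subset lattice is skipped.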
Strong rules of itemsets in l. An algorithm for generating strong rules from frequent itemsets. Compare the efficiency of the two mining processes.
See Figure 5. (the FP-tree for Exercise 5. ). Apriori has to do multiple scans of the database, while FP-growth builds the FP-tree with only two scans (one to count item frequencies, one to construct the tree).
Candidate generation in Apriori is expensive (owing to the self-join), while FP-growth does not generate any candidates. (Implementation project) Implement the three frequent itemset mining algorithms introduced in this chapter: ... Compare the performance of each algorithm with various kinds of large data sets.
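For reference, a compact, non-optimized Apriori sketch that makes the costs discussed above visible: one full database scan per level, plus the self-join and subset-pruning steps of candidate generation. All names are illustrative; transactions are assumed to be Python sets.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori: level-wise candidate generation (self-join)
    with one database scan per level -- the cost FP-growth avoids."""
    items = {frozenset([i]) for t in transactions for i in t}
    freq, level = {}, set()
    for c in items:                            # scan 1: frequent 1-itemsets
        n = sum(c <= t for t in transactions)
        if n >= min_sup:
            freq[c] = n
            level.add(c)
    k = 2
    while level:
        # self-join: union pairs of (k-1)-itemsets yielding k-itemsets
        cands = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        level = set()
        for c in cands:                        # one full scan per level
            n = sum(c <= t for t in transactions)
            if n >= min_sup:
                freq[c] = n
                level.add(c)
        k += 1
    return freq
```

Instrumenting the scan loop and the candidate-set sizes is an easy way to produce the comparison data the report above asks for.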
Write a report to analyze the situations such as data size, data distribution, minimal support threshold setting, and pattern density where one algorithm may perform better than the others, and state why. This is to be evaluated on an individual basis as there is no standard answer. A database has four transactions. Suppose that a large store has a transaction database that is distributed among four locations.
Transactions in each component database have the same format, namely Tj: Propose an efficient algorithm to mine global association rules without considering multilevel associations. You may present your algorithm in the form of an outline. Your algorithm should not require shipping all of the data to one site and should not cause excessive network communication overhead.
An algorithm to mine global association rules is as follows. Let CF be the union of all of the local frequent itemsets in the four stores. Ship CF to each store and obtain each itemset's global support; this can be done by summing up, for each itemset, the local support of that itemset in the four stores.
Doing this for each itemset in CF will give us their global supports. Itemsets whose global supports pass the support threshold are globally frequent itemsets.
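The two-round exchange can be sketched as follows. The helper names and the brute-force local step are assumptions for illustration; a real site would mine its local frequent itemsets with Apriori or FP-growth, and min_sup would be the same fraction at every site so that the partition property guarantees CF misses no globally frequent itemset.

```python
def count(itemset, transactions):
    """Local support count of `itemset` at one site."""
    return sum(itemset <= t for t in transactions)

def mine_global(sites, candidates, min_sup_local, min_sup_global):
    """Round 1: each site ships only its locally frequent itemsets
    (found here by brute force over `candidates`); CF is their union.
    Round 2: every site reports its local count for each itemset in CF;
    the sums are the global supports. Only itemsets and counts cross
    the network -- never the raw transactions."""
    cf = set()
    for tx in sites:
        cf |= {c for c in candidates if count(c, tx) >= min_sup_local}
    totals = {c: sum(count(c, tx) for tx in sites) for c in cf}
    return {c: n for c, n in totals.items() if n >= min_sup_global}
```

Communication is proportional to |CF| per site, which is what keeps the network overhead low compared with shipping all transactions to one location.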
Suppose that frequent itemsets are saved for a large transaction database, DB. However, multiple occurrences of an item in the same shopping basket, such as four cakes and three jugs of milk, can be important in transaction data analysis.
How can one mine frequent itemsets efficiently considering multiple occurrences of items? Propose modifications to the well-known algorithms, such as Apriori and FP-growth, to adapt to such a situation. Consider an item and its occurrence count as a combined item in a transaction.
For example, we can consider (jug, 3) as one item. For instance, (jug, 3) may be a frequent item. For (i, max_count), try to find k-itemsets for each count from 1 to max_count. This can be done either by Apriori or FP-growth.
In FP-growth, one can create a node for each (i, count) combination; however, for efficient implementation, such nodes can be combined into one using combined counters (i.e., ...).
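A minimal sketch of the combined-item transformation (names hypothetical): a basket with three jugs supports (jug, 1), (jug, 2), and (jug, 3), and after this expansion any standard frequent-itemset miner applies unchanged.

```python
from collections import Counter

def expand(basket):
    """Turn {'jug': 3, 'cake': 1} into combined items (i, c) for
    c = 1..count: a basket with 3 jugs supports ('jug', 1),
    ('jug', 2), and ('jug', 3)."""
    return {(item, c) for item, n in basket.items() for c in range(1, n + 1)}

def frequent_combined(baskets, min_sup):
    """Count combined 1-itemsets over the expanded transactions; for
    k > 1, feed the same expanded transactions to Apriori or FP-growth."""
    counts = Counter()
    for b in baskets:
        counts.update(expand(b))
    return {ic: n for ic, n in counts.items() if n >= min_sup}
```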
Compare their performance with various kinds of large data sets.
Write a report to answer the following questions: (a) Mining closed frequent itemsets leads to a much more compact answer set but has the same expressive power as mining the complete set of frequent itemsets. Moreover, it leads to more efficient mining algorithms if one explores the optimization methods discussed in the above research papers.
(b) Analyze in which situations (such as data size, data distribution, minimal support threshold setting, and pattern density) one algorithm performs better than the others, and why. Please check IlliMine (http: ...). The discussion of these algorithms is in the three papers listed above. Suppose that a data relation describing students at Big University has been generalized to the generalized relation R in Table 5. Table 5. Generalized relation for Exercise 5.
A over 30 Canada 2. S over 30 Canada 3. D over 30 Canada 2. S over 30 Latin America 3. That is, if a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search. Students can easily sketch the corresponding concept hierarchies.
The following set of rules is mined in addition to those mined above. Propose and outline a level-shared mining approach to mining multilevel association rules in which each item is encoded by its level position, and an initial scan of the database collects the count for each item at each concept level, identifying frequent and subfrequent items. Comment on the processing cost of mining multilevel associations with this method in comparison to mining single-level associations.
A level-shared mining approach is presented here, using the taxonomy of Figure 5. (a taxonomy of data items). Scan the original transactional database. During the scan, create a hierarchy-information-encoded transaction table T of the database by encoding each item by its level position. Also during the scan, accumulate the support counts of each item at each concept level by examining the encoded representation of each item.
By doing so, we will discover the frequent items 1-itemsets at all levels. Note that each level has its own minimum support threshold.
The initial database scan in Step 1 finds the frequent 1-itemsets, L1 , at all levels. Join these frequent 1-itemsets to generate the candidate 2-itemsets at all levels. Scan T once to determine which of the candidate 2-itemsets are frequent.
Join the frequent 2-itemsets to generate the candidate 3-itemsets at all levels. Scan T once to determine which of the candidate 3-itemsets are frequent. Continue in this manner until none of the levels generate any new candidate itemsets.
As we can see, this is basically the Apriori algorithm. Once we have found all of the frequent itemsets of all the levels, generate the corresponding strong multilevel association rules.
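A sketch of the encoding idea behind the single-scan, all-levels count in Step 1. The item codes form a made-up taxonomy, where each item's code is its path in the concept hierarchy, so every prefix of the code names an ancestor:

```python
from collections import Counter

# Hypothetical taxonomy encoding (illustrative only): prefix '1' = milk,
# '11' = 2% milk, '111' = one specific brand of 2% milk, and so on.
ENCODE = {'dairyland_2%_milk': '111', 'foremost_2%_milk': '112',
          'dairyland_chocolate_milk': '121'}

def level_counts(transactions):
    """Single scan: for each encoded item, bump the count of every
    ancestor (each code prefix), giving 1-itemset counts at all levels."""
    counts = Counter()
    for t in transactions:
        # collect distinct ancestor codes per transaction before counting,
        # so two 2%-milk brands in one basket count '11' only once
        codes = {ENCODE[i][:k] for i in t for k in range(1, len(ENCODE[i]) + 1)}
        counts.update(codes)
    return counts
```

Each level's frequent 1-itemsets are then read off by comparing these counts against that level's minimum support threshold, exactly as in Step 1 above.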
A similarity is that the cost of database scans for this method is equal to the cost of database scans for single-level associations. Therefore, to determine the largest frequent k-itemsets at all levels, we only need to scan the encoded transaction table k times.
Mining single-level associations, on the other hand, only involves generating candidate sets for one level. Therefore, the cost of generating candidates for this method is much greater than the cost of generating candidates for single-level associations.
(Implementation project) Many techniques have been proposed to further improve the performance of frequent-itemset mining algorithms. Taking FP-tree-based frequent pattern-growth algorithms, such as FP-growth, as an example, implement one of the following optimization techniques, and compare the performance of your new implementation with one that does not incorporate such optimization. However, one can develop a top-down projection technique (i.e., ...).
Design and implement such a top-down FP-tree mining method and compare your performance with the bottom-up projection method. However, such a structure may consume a lot of space when the data are sparse. One possible alternative design is to explore an array- and pointer-based hybrid implementation, where a node may store multiple items when it contains no splitting point to multiple subbranches. Develop such an implementation and compare it with the original one. One interesting alternative is to push right the branches that have been mined for a particular item p, that is, to push them to the remaining branch es of the FP-tree.
This is done so that fewer conditional pattern bases have to be generated and additional sharing can be explored when mining the remaining branches of the FP-tree.
Design and implement such a method and conduct a performance study on it. There is no standard answer for an implementation project. However, several papers discuss such imple- mentations in depth and can serve as good references. Searching for the best strategies for mining frequent closed itemsets, Proc.
Give a short example to show that items in a strong association rule may actually be negatively correlated. Consider the following table: ... The following contingency table summarizes supermarket transaction data, where hotdogs refers to the transactions containing hot dogs, ¬hotdogs refers to the transactions that do not contain hot dogs, hamburgers refers to the transactions containing hamburgers, and ¬hamburgers refers to the transactions that do not contain hamburgers.
If not, what kind of correlation relationship exists between the two? ... Therefore, the association rule is strong. ... So, the purchase of hot dogs is not independent of the purchase of hamburgers; there exists a positive correlation between the two. In multidimensional data analysis, it is interesting to extract pairs of similar cell characteristics associated with substantial changes in measure in a data cube, where cells are considered similar if they are related by roll-up (i.e., ...).
Such an analysis is called cube gradient analysis. Suppose the cube measure is average. A user poses a set of probe cells and would like to find their corresponding sets of gradient cells, each of which satisfies a certain gradient threshold.
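The independence check used in the hot dog/hamburger exercise above is the lift measure; a sketch with hypothetical counts, since the exercise's actual contingency-table values are not reproduced in this text:

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)); > 1 means positively
    correlated, < 1 negatively correlated, = 1 independent."""
    p_ab = n_ab / n_total
    return p_ab / ((n_a / n_total) * (n_b / n_total))

# Hypothetical counts (illustrative only, not the exercise's table):
n_total = 5000
n_hotdogs, n_hamburgers, n_both = 3000, 2500, 2000
l = lift(n_both, n_hotdogs, n_hamburgers, n_total)   # > 1: positive correlation
```

With these illustrative counts, P(both) = 0.4 exceeds P(hotdogs) × P(hamburgers) = 0.3, so lift > 1 and the two purchases are positively correlated; a strong rule can nevertheless have lift < 1, which is the point of the negative-correlation exercise above.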
Han et al. also describe the preparation performed before applying data mining algorithms: data cleaning, data integration, and related transformations. All these techniques are explained in the book without focusing too much on implementation details. Additional extensions to the basic association rule framework are explored; for example, quantitative and distance-based association rules are somewhat artificially separated into two categories, when both of them work with quantitative attributes.

According to their final goal, data mining techniques can be considered descriptive or predictive. Descriptive techniques summarize data by applying attribute-oriented induction using characteristic rules and generalized relations. Analytical characterization is used to perform attribute relevance measurements to identify irrelevant attributes: the lower the number of attributes, the more efficient the mining process. Generalization techniques can also be extended to discriminate among different classes. The discussion of descriptive techniques is completed with a brief study of statistical measures, i.e., dispersion measures and their insightful graphical display.

Association rules are midway between descriptive and predictive data mining, maybe closer to descriptive techniques. Several classification and regression techniques are introduced, and the authors also discuss some classification methods based on concepts from association rule mining. Furthermore, the chapter on classification mentions alternative models based on instance-based learning and unsupervised learning; we believe that this part of the book would deserve a more detailed treatment (even a whole volume on its own). Regression (called prediction by the authors) appears as an extension of classification: the former deals with continuous values, while the latter works with discrete classes. Linear regression is clearly explained; multiple, nonlinear, generalized linear, and log-linear regression models are only referenced in the text.

A taxonomy of clustering methods is proposed, including examples for each category. These techniques are as appealing as the previous ones; unfortunately, they are only briefly described in this book. Space constraints also limit the discussion of data mining in complex types of data, such as object-oriented, spatial, multimedia, and text databases. Web mining, for instance, is only overviewed in its three flavors. Some buzzwordism about the role of data mining and its social impact can also be found.

Why to Read This Book. This book constitutes a superb example of how to write a technical textbook. It is written in a direct style, with questions and answers scattered throughout the text that keep the reader involved and explain the reasons behind every decision. The presence of examples makes concepts easy to grasp, and the chapters are mostly self-contained, so they can be used separately; in fact, you may even use the book artwork. Moreover, the bibliographical discussions presented at the end of every chapter describe related work and may prove invaluable for those interested in further reading.

Practical Issues. The book also describes some interesting examples of the use of data mining in the real world, i.e., biomedical research, financial data analysis, the retail industry, and telecommunication utilities. It also offers some practical tips on how to choose a particular data mining system. A must-have for data miners!