-
Notifications
You must be signed in to change notification settings - Fork 17
MarketBasketAnalysis
Harish Butani edited this page May 28, 2012
·
3 revisions
The goal is to provide a FrequentItemSets function that can be invoked as a table function. Its signature is something like this:
Signature: FrequentItemSets(int support, String txnColumn, String itemColumn)
Returns rows that contain 2 columns: (Array<String>, Integer).The first column represents an ItemSet and the second column represents ist support. The input arguments are:
- Support: a number representing the support threshold.
- TxnColumn: the column in the input dataset that represents the BasketId
- ItemColumn: the column in the input dataset that represents the ItemId.
Additionally we may support other arguments that aide in Item/ItemSet selection, like in Oracle’ implementation:
- minCount: minimum number of items in the candidate itemsets output
- maxCount: maximum number of items in the candidate itemsets output
- includeList: an expression used to check inclusion criteria. At least one item from the list must appear in candidate itemsets that will be returned.
- excludeList: an expression used to check exclusion criteria. No item from the list must appear in candidate itemsets that will be returned.
The use of this function should be a simple table function invocation:
from frequentItemsets( <select l_orderKey, p_name from lineitem join part on lineitem.p_partkey = part.p_partkey order by l_orderkey > partition by l_orderkey) into path='/tmp/wout'
- the e.g. is based on the tpch schema: each order is a Basket
- the driving query is shown as being ordered on the OrderKey, more on this below.
- the Output will be a 2 column table: (Itemset, count)
We plan to use the SON algorithm as described in the Mining Massive Datasets book. To summarize the 2 pass algorithm works in the following way:
-
Phase 1
- assume input data is already stored bucketed by basket. In our case it assumes that the rows are ordered on the basketId and itemId.
-
Map phase
- read the entire partition into memory
- apply the A-priori (or one of the variation) on the partition; the support is proportionally scaled down based on the ratio of the number of baskets in this parition to the total number of baskets.
- The Output of the Map phase is (ItemSet, NullWritable)
- Reduce phase output a row for each candidate ItemSet. Since the number of candidate ItemSets will be small, the number of reducers can be small, even 1.
-
Phase 2
- scans the input Basket dataset again.
-
Map phase
- each mapper reads the entire output of Phase 1 into a data struct. For each Basket outputs the list of candidate ItemSets that it contains.
- The Output is (Itemset, 1)
-
Reduce phase
- sum up all rows for an ItemSet
- Output the ItemSet if the count > support argument
- Output from Phae is a table of the form (Itemset)
- Define FrequentItemSet function as having Map-side processing.
- Introduce the notion of Iterative Functions
- an Iterative table function operates on the input multiple times until a condition is met. The function is responsible for providing a callabck that is invoked for termination.
- The Output of the function in Phase i is available to Phase i + 1 through the Distributed Cache mechanism.