MarketBasketAnalysis

Market Basket Analysis

The goal is to provide a FrequentItemSets function that can be invoked as a table function. Its signature is something like this:

Signature: FrequentItemSets(int support, String txnColumn, String itemColumn)

Returns rows that contain 2 columns: (Array<String>, Integer).The first column represents an ItemSet and the second column represents ist support. The input arguments are:

Support: a number representing the support threshold.
TxnColumn: the column in the input dataset that represents the BasketId
ItemColumn: the column in the input dataset that represents the ItemId.

Additionally we may support other arguments that aide in Item/ItemSet selection, like in Oracle’ implementation:

minCount: minimum number of items in the candidate itemsets output
maxCount: maximum number of items in the candidate itemsets output
includeList: an expression used to check inclusion criteria. At least one item from the list must appear in candidate itemsets that will be returned.
excludeList: an expression used to check exclusion criteria. No item from the list must appear in candidate itemsets that will be returned.

The use of this function should be a simple table function invocation:

	  from frequentItemsets(
             <select l_orderKey, p_name
              from lineitem join part 
               on lineitem.p_partkey = part.p_partkey
              order by l_orderkey
             >
             partition by l_orderkey)
	  into path='/tmp/wout'

the e.g. is based on the tpch schema: each order is a Basket
the driving query is shown as being ordered on the OrderKey, more on this below.
the Output will be a 2 column table: (Itemset, count)

Implementation

We plan to use the SON algorithm as described in the Mining Massive Datasets book. To summarize the 2 pass algorithm works in the following way:

Phase 1
- assume input data is already stored bucketed by basket. In our case it assumes that the rows are ordered on the basketId and itemId.
- Map phase
  - read the entire partition into memory
  - apply the A-priori (or one of the variation) on the partition; the support is proportionally scaled down based on the ratio of the number of baskets in this parition to the total number of baskets.
- The Output of the Map phase is (ItemSet, NullWritable)
- Reduce phase output a row for each candidate ItemSet. Since the number of candidate ItemSets will be small, the number of reducers can be small, even 1.
Phase 2
- scans the input Basket dataset again.
- Map phase
  - each mapper reads the entire output of Phase 1 into a data struct. For each Basket outputs the list of candidate ItemSets that it contains.
  - The Output is (Itemset, 1)
- Reduce phase
  - sum up all rows for an ItemSet
  - Output the ItemSet if the count > support argument
  - Output from Phae is a table of the form (Itemset)

Fitting algorithm into Table Function framework

Define FrequentItemSet function as having Map-side processing.
Introduce the notion of Iterative Functions
- an Iterative table function operates on the input multiple times until a condition is met. The function is responsible for providing a callabck that is invoked for termination.
- The Output of the function in Phase i is available to Phase i + 1 through the Distributed Cache mechanism.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MarketBasketAnalysis

Market Basket Analysis

Implementation

Fitting algorithm into Table Function framework

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally