
MarketBasketAnalysis

Harish Butani edited this page May 28, 2012 · 3 revisions

Market Basket Analysis

The goal is to provide a FrequentItemSets function that can be invoked as a table function. Its signature is something like this:

Signature: FrequentItemSets(int support, String txnColumn, String itemColumn)

Returns rows that contain 2 columns: (Array<String>, Integer). The first column represents an ItemSet and the second column represents its support. The input arguments are:

  • Support: a number representing the support threshold.
  • TxnColumn: the column in the input dataset that represents the BasketId
  • ItemColumn: the column in the input dataset that represents the ItemId.

Additionally, we may support other arguments that aid in Item/ItemSet selection, as in Oracle's implementation:

  • minCount: minimum number of items in the candidate itemsets output
  • maxCount: maximum number of items in the candidate itemsets output
  • includeList: an expression used to check inclusion criteria. At least one item from the list must appear in candidate itemsets that will be returned.
  • excludeList: an expression used to check exclusion criteria. No item from the list may appear in candidate itemsets that will be returned.
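To illustrate how these selection arguments interact, the following is a minimal Python sketch of candidate filtering. The function name and parameter names mirror the arguments above but are purely hypothetical; they are not the actual implementation.

```python
def select_itemsets(itemsets, min_count=1, max_count=None,
                    include_list=None, exclude_list=None):
    """Hypothetical filter applying the optional selection arguments
    to a list of candidate itemsets (each itemset is a set of items)."""
    selected = []
    for itemset in itemsets:
        # size bounds: minCount / maxCount
        if len(itemset) < min_count:
            continue
        if max_count is not None and len(itemset) > max_count:
            continue
        # includeList: at least one listed item must appear
        if include_list and not any(i in itemset for i in include_list):
            continue
        # excludeList: no listed item may appear
        if exclude_list and any(i in itemset for i in exclude_list):
            continue
        selected.append(itemset)
    return selected
```

For example, `select_itemsets([{"milk", "bread"}, {"beer"}], min_count=2)` would keep only the two-item set.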

The use of this function should be a simple table function invocation:

	  from frequentItemsets(
             <select l_orderKey, p_name
              from lineitem join part 
                on lineitem.l_partkey = part.p_partkey
              order by l_orderkey
             >
             partition by l_orderkey)
	  into path='/tmp/wout'
  • the example is based on the TPC-H schema: each order is a Basket
  • the driving query is shown as being ordered on the OrderKey; more on this below.
  • the Output will be a 2-column table: (ItemSet, count)

Implementation

We plan to use the SON algorithm as described in the Mining of Massive Datasets book. To summarize, the 2-pass algorithm works in the following way:

  • Phase 1
    • assume the input data is already stored bucketed by basket; in our case this means the rows are ordered on the basketId and itemId.
    • Map phase
      • read the entire partition into memory
      • apply the A-priori algorithm (or one of its variations) to the partition; the support is proportionally scaled down based on the ratio of the number of baskets in this partition to the total number of baskets.
    • The Output of the Map phase is (ItemSet, NullWritable)
    • Reduce phase: output a row for each candidate ItemSet. Since the number of candidate ItemSets will be small, the number of reducers can be small, even 1.
  • Phase 2
    • scans the input Basket dataset again.
    • Map phase
      • each mapper reads the entire output of Phase 1 into a data structure. For each Basket it outputs the list of candidate ItemSets that the Basket contains.
      • The Output is (Itemset, 1)
    • Reduce phase
      • sum up all rows for an ItemSet
      • Output the ItemSet if the count > support argument
      • Output from Phase 2 is a table of the form (ItemSet, count)
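The two phases above can be sketched on a single machine as follows. This is only an illustration of the SON structure under simplifying assumptions: partitions are plain Python lists standing in for mapper inputs, the map/reduce plumbing is collapsed into loops, and the A-priori step is a naive version capped at small itemset sizes.

```python
from itertools import combinations
from collections import Counter

def local_frequent(baskets, scaled_support, max_size=3):
    """A-priori on one partition: itemsets meeting the scaled-down support."""
    frequent = set()
    counts = Counter(item for b in baskets for item in set(b))
    current = {frozenset([i]) for i, c in counts.items() if c >= scaled_support}
    frequent |= current
    size = 2
    while current and size <= max_size:
        candidates = Counter()
        for b in baskets:
            for combo in combinations(sorted(set(b)), size):
                fs = frozenset(combo)
                # A-priori pruning: every (size-1)-subset must be frequent
                if all(fs - {i} in frequent for i in fs):
                    candidates[fs] += 1
        current = {fs for fs, c in candidates.items() if c >= scaled_support}
        frequent |= current
        size += 1
    return frequent

def son(partitions, support):
    total = sum(len(p) for p in partitions)
    # Phase 1: union of locally frequent itemsets, each partition using
    # support scaled by its share of the total number of baskets
    candidates = set()
    for part in partitions:
        candidates |= local_frequent(part, support * len(part) / total)
    # Phase 2: count every candidate over the full dataset,
    # keeping those that meet the support threshold
    counts = Counter()
    for part in partitions:
        for basket in part:
            items = set(basket)
            for cand in candidates:
                if cand <= items:
                    counts[cand] += 1
    return {c: n for c, n in counts.items() if n >= support}
```

The sketch keeps candidates with count >= support; whether the threshold is strict or inclusive is a detail the real implementation would pin down.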

Fitting algorithm into Table Function framework

  • Define FrequentItemSet function as having Map-side processing.
  • Introduce the notion of Iterative Functions
    • an Iterative table function operates on the input multiple times until a condition is met. The function is responsible for providing a callback that is invoked to check for termination.
    • The Output of the function in Phase i is available to Phase i + 1 through the Distributed Cache mechanism.
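One way to picture the Iterative Function contract is the sketch below. The class and method names are hypothetical, and an in-memory variable stands in for the Distributed Cache that hands Phase i's output to Phase i + 1.

```python
class IterativeTableFunction:
    """Hypothetical interface for an iterative table function."""

    def run_phase(self, input_rows, prev_output, phase):
        """One pass over the input; prev_output plays the
        Distributed Cache role, carrying Phase i's output into Phase i+1."""
        raise NotImplementedError

    def terminate(self, phase, output):
        """Termination callback: return True to stop iterating."""
        raise NotImplementedError

def execute(func, input_rows):
    """Drive the function over the input until its callback says stop."""
    phase, output = 0, None
    while True:
        output = func.run_phase(input_rows, output, phase)
        if func.terminate(phase, output):
            return output
        phase += 1
```

FrequentItemSets would fit this shape as a two-phase function: Phase 0 emits the candidate ItemSets, Phase 1 counts them, and the callback terminates after Phase 1.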