Deep in the trenches: Scripting a recommendation engine pipeline using Mahout

Mahout comes with a large number of machine learning algorithms implemented to run on top of Hadoop. For item-similarity-based recommendations, Mahout has built-in algorithms that compute item similarity using several similarity measures, such as cosine similarity and Pearson correlation.

Item-similarity-based recommendation has two pieces:

1. Compute the item similarities.
2. Generate recommendations based on the latest user-item interaction data.

Computing item similarities is a complex and lengthy job: imagine computing the pairwise similarity for 100k or so products. Once the similarities are computed, though, generating recommendations from the user-item interaction data is far less demanding. Luckily, item-to-item similarity does not change quickly, so you only need to recompute it once or twice a month (depending on how often you introduce new products).
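To get a feel for the scale, the number of unordered item pairs in a catalogue of n items is n(n-1)/2. The arithmetic below is only a rough upper bound: it assumes every pair is compared, which a job that only looks at items sharing at least one user would not do. It also assumes a shell with 64-bit integer arithmetic.

```shell
# Upper bound on unordered item pairs for n items: n * (n - 1) / 2.
n=100000
pairs=$(( n * (n - 1) / 2 ))
echo "$pairs"   # prints 4999950000
```

That is close to 5 billion candidate pairs, which is why the similarity step is worth running on its own, less frequent schedule.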

Mahout's command for generating recommendations is 'recommenditembased'. The built-in code first computes the item similarities and then proceeds to generate recommendations.

A mahout recommenditembased command looks like this:

mahout recommenditembased --startPhase 0 --endPhase 10 -i /input -o /output -s SIMILARITY_COSINE -mp 15 -m 300 --numRecommendations 1000 --tempDir /temDir

You can set startPhase and endPhase to control how far Mahout proceeds in its processing.
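Since the rest of the pipeline is the same command run over different phase ranges, a small wrapper can keep the scripts readable. This is a minimal sketch: the function name mahout_phases and the DRY_RUN variable are hypothetical, and the only flags it knows about are --startPhase and --endPhase from the command above. Everything else is passed through unchanged.

```shell
# Hypothetical helper: run recommenditembased for one phase range.
# Usage: mahout_phases <startPhase> <endPhase> [other recommenditembased options...]
mahout_phases() {
  local start="$1" end="$2"
  shift 2
  # Rebuild the argument list as the full command line.
  set -- mahout recommenditembased --startPhase "$start" --endPhase "$end" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it, so the flags can be checked first.
    echo "$*"
  else
    "$@"
  fi
}
```

For example, mahout_phases 0 0 -i <inputDir> -o <outputDir> -s <similarityClass> --tempDir <tempDir> would run phase 0 only, and setting DRY_RUN=1 would just print the command.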

See the link below for more on Mahout's recommendation phases.

http://www.slideshare.net/vangjee/a-quick-tutorial-on-mahouts-recommendation-engine-v-04

Generating similarities - (frequency - run every week or so)

The first step is to compute the preference matrix, which gathers all the user-item interactions. This is phase 0.

step 1 - delete tempDir if it already exists
hadoop fs -rm -r <tempDir>

step 2 - generate the preference matrix
mahout recommenditembased --startPhase 0 --endPhase 0 -i <inputDir> -o <outputDir> -s <similarityClass> --tempDir <tempDir>

The similarity computation ends at phase 1. The next step is to compute the similarities.

step 3 - compute similarities
mahout recommenditembased --startPhase 1 --endPhase 1 -i <inputDir> -o <outputDir> -s <similarityClass> --tempDir <tempDir>

The job that computes similarities needs steps 1 and 2 to have run first.
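The three steps above could be wrapped in one function so the similarity job can be scheduled as a unit. This is a sketch under a few assumptions: the names run, build_similarities and DRY_RUN are made up for illustration, and the existence check uses hadoop fs -test -d so that the delete is only attempted when tempDir is already there.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Usage: build_similarities <inputDir> <outputDir> <similarityClass> <tempDir>
build_similarities() {
  local input="$1" output="$2" similarity="$3" tmp="$4"

  # step 1 - delete tempDir if it already exists
  if run hadoop fs -test -d "$tmp"; then
    run hadoop fs -rm -r "$tmp" || return 1
  fi

  # step 2 - generate the preference matrix (phase 0)
  run mahout recommenditembased --startPhase 0 --endPhase 0 \
    -i "$input" -o "$output" -s "$similarity" --tempDir "$tmp" || return 1

  # step 3 - compute similarities (phase 1); only reached if step 2 succeeded
  run mahout recommenditembased --startPhase 1 --endPhase 1 \
    -i "$input" -o "$output" -s "$similarity" --tempDir "$tmp" || return 1
}
```

Stopping at the first failure matters here, because phase 1 reads what phase 0 wrote into tempDir.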

Generating recommendations - (frequency - run every 5-6 hours or daily)

Start by deleting the preference matrix folder in Hadoop if it already exists. The preference matrix has to be recomputed so that it reflects the near-real-time user-item interactions.

step 1 - delete <tempDir>/preparePreferenceMatrix
hadoop fs -rm -r <tempDir>/preparePreferenceMatrix

Do not delete the entire tempDir: it contains the item similarities and other data required for generating recommendations.

step 2 - generate the preference matrix
mahout recommenditembased --startPhase 0 --endPhase 0 -i <inputDir> -o <outputDir> -s <similarityClass> --tempDir <tempDir>

step 3 - delete the partialMultiply directory if it already exists
hadoop fs -rm -r <tempDir>/partialMultiply

step 4 - generate recommendations
mahout recommenditembased --startPhase 2 --endPhase 30 -i <inputDir> -o <outputDir> -s <similarityClass> -mp 15 -m 300 --numRecommendations 1000 <similarityOut>/current --tempDir <tempDir>
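The four steps of the recommendation job can be scripted the same way. The sketch below makes some assumptions: run, remove_if_present, refresh_recommendations and DRY_RUN are hypothetical names, and the similarity output path fragment from the step 4 command is left out because the flag it belongs to is not shown. Any such option can be passed as extra arguments after tempDir and is appended to the step 4 command unchanged.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Delete an HDFS directory only if it is already there.
remove_if_present() {
  if run hadoop fs -test -d "$1"; then run hadoop fs -rm -r "$1"; fi
}

# Usage: refresh_recommendations <inputDir> <outputDir> <similarityClass> <tempDir> [extra options...]
refresh_recommendations() {
  local input="$1" output="$2" similarity="$3" tmp="$4"
  shift 4

  # step 1 - delete only the preference matrix, never the whole tempDir
  remove_if_present "$tmp/preparePreferenceMatrix" || return 1

  # step 2 - regenerate the preference matrix from the latest interactions
  run mahout recommenditembased --startPhase 0 --endPhase 0 \
    -i "$input" -o "$output" -s "$similarity" --tempDir "$tmp" || return 1

  # step 3 - delete the partialMultiply directory
  remove_if_present "$tmp/partialMultiply" || return 1

  # step 4 - generate recommendations, reusing the similarities kept in tempDir
  run mahout recommenditembased --startPhase 2 --endPhase 30 \
    -i "$input" -o "$output" -s "$similarity" \
    -mp 15 -m 300 --numRecommendations 1000 --tempDir "$tmp" "$@" || return 1
}
```

Running it once with DRY_RUN=1 is a cheap way to confirm that only the two subdirectories are removed before letting it loose on a real tempDir.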


Friday, July 11, 2014
