Seminar on Recommender systems using Hadoop

Hiral Patel

Satish Bhat, senior software engineer for machine learning at Adknowledge, lectured Oct. 24 on “Large-Scale Recommender Systems using Hadoop and Machine Learning.” The seminar was hosted by Dr. Yugyung Lee, associate professor in the School of Computing and Engineering.

The talk served as an introduction to the core Hadoop infrastructure and to the technologies that form an integral part of a data platform, namely Apache Flume, Apache Pig, Apache Sqoop, Apache Hive and Apache Mahout.

“Recommender systems help you decide what you need to buy in a particular scenario,” Bhat said.

Recommender systems transform the way people find products, information and other people. They study patterns of behavior so a person can learn what he or she will enjoy without experiencing the situation or object firsthand. The technology has evolved over the last 20 years into a rich collection of tools that enable practitioners and researchers to build effective recommenders.

Recommender systems fall into three categories: collaborative, content-based and hybrid. Collaborative filtering is the basic approach; it recommends items that similar users have liked, making it a user-to-user recommendation system. Netflix uses this type of recommender system.
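The user-to-user idea can be sketched in a few lines: find the user whose ratings most resemble yours, then suggest what that user liked. This is a minimal illustration with made-up data, not Netflix's or Adknowledge's actual method.

```python
from math import sqrt

# Toy user-item rating matrix (hypothetical data for illustration only).
ratings = {
    "alice": {"MovieA": 5, "MovieB": 3, "MovieC": 4},
    "bob":   {"MovieA": 5, "MovieB": 2, "MovieC": 4, "MovieD": 5},
    "carol": {"MovieA": 1, "MovieB": 5, "MovieD": 2},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Suggest items the most similar other user rated highly but `user` has not seen."""
    others = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    _, nearest = max(others)
    seen = set(ratings[user])
    return [item for item, score in ratings[nearest].items()
            if item not in seen and score >= 4]

print(recommend("alice"))  # bob is most similar to alice, so his MovieD is suggested
```

Real systems replace the toy dictionary with millions of sparse rating vectors, which is where Hadoop and Mahout come in.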

In content-based recommender systems, similarity to items the user liked in the past is the criterion for recommending a particular item. This is known as an item-to-item recommendation system. Pandora and Rotten Tomatoes use content-based recommender systems.
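An item-to-item recommender compares item attributes rather than user histories. A minimal sketch, using hypothetical tag sets and Jaccard overlap (one of several similarity measures such systems can use):

```python
# Hypothetical item tags (e.g. genres); the names are illustrative only.
item_tags = {
    "SongA": {"rock", "guitar", "90s"},
    "SongB": {"rock", "guitar", "live"},
    "SongC": {"jazz", "piano"},
}

def jaccard(a, b):
    """Overlap of two tag sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def similar_items(liked, k=1):
    """Return the k items most similar to one the user liked in the past."""
    scores = {item: jaccard(item_tags[liked], tags)
              for item, tags in item_tags.items() if item != liked}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(similar_items("SongA"))  # SongB shares the most tags with SongA
```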

A hybrid recommender system combines the collaborative and content-based approaches. Amazon uses a dynamic recommendation system that changes its recommendations based on each item a visitor views, a technique also known as real-time targeting.
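One common way to hybridize the two approaches is a weighted blend of their scores. The sketch below assumes hypothetical per-item scores from each sub-system; the weight is a tunable parameter, not a value from the talk.

```python
# Hypothetical per-item scores (0..1) from the two sub-systems.
collab_score  = {"ItemA": 0.9, "ItemB": 0.2, "ItemC": 0.6}
content_score = {"ItemA": 0.4, "ItemB": 0.8, "ItemC": 0.7}

def hybrid(weight_collab=0.6):
    """Rank items by a weighted blend of collaborative and content scores."""
    items = collab_score.keys() & content_score.keys()
    blended = {i: weight_collab * collab_score[i]
                  + (1 - weight_collab) * content_score[i]
               for i in items}
    return sorted(blended, key=blended.get, reverse=True)

print(hybrid())  # ItemA blends highest, then ItemC, then ItemB
```

In a real-time targeting setting, the scores (and possibly the weight itself) would be updated as the visitor browses.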

“We build for desktops and mobile devices, but now we are using mobile devices to take statistics, gauge user traffic and focus our efforts,” Bhat said. “At Adknowledge, we show users ads based on their demographic and behavioral information.”

Clustering is based on the patterns observed in the clicks of many users; users with similar patterns are grouped into a cluster. Likes and clicks are the major basis for this clustering.

“Presently, we have four to five clustering algorithms running at any point in time at Adknowledge, and the last three clicks are used as the clustering metric. We have 440 million users and 500,000 clusters, and sampling usually starts at two percent of the total users, which is around 3 million users,” Bhat said.
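To make the "last three clicks as the clustering metric" idea concrete, here is a toy grouping that buckets users whose recent click categories coincide. This is an illustrative simplification with invented data, not Adknowledge's production algorithm.

```python
from collections import defaultdict

# Hypothetical logs: last three clicked ad categories per user.
last_clicks = {
    "u1": ("sports", "travel", "sports"),
    "u2": ("sports", "travel", "sports"),
    "u3": ("finance", "finance", "travel"),
    "u4": ("finance", "travel", "finance"),
}

def cluster_by_clicks(logs):
    """Group users whose last-three-click categories match (order ignored)."""
    clusters = defaultdict(list)
    for user, clicks in logs.items():
        key = tuple(sorted(clicks))  # multiset of the last three clicks
        clusters[key].append(user)
    return dict(clusters)

for key, users in cluster_by_clicks(last_clicks).items():
    print(key, users)
```

At the scale Bhat describes, the grouping would be a distributed clustering job (e.g. via Mahout on Hadoop) rather than an in-memory dictionary.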

A/B testing is used to check the impact on each cluster. It is commonly used in web development, marketing and advertising. Usually, backtesting is done before A/B testing.
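In advertising, an A/B test typically compares the click-through rate (CTR) of two ad variants shown to different user groups. A minimal sketch with made-up impression and click counts (a real test would also check statistical significance):

```python
def ab_ctr(clicks_a, views_a, clicks_b, views_b):
    """Compare click-through rates of variants A and B and report B's lift over A."""
    ctr_a = clicks_a / views_a
    ctr_b = clicks_b / views_b
    lift = (ctr_b - ctr_a) / ctr_a
    return ctr_a, ctr_b, lift

# Hypothetical counts: each variant served 10,000 impressions.
ctr_a, ctr_b, lift = ab_ctr(clicks_a=120, views_a=10_000,
                            clicks_b=150, views_b=10_000)
print(f"A: {ctr_a:.2%}  B: {ctr_b:.2%}  lift: {lift:+.1%}")
```

Backtesting would replay the same comparison on historical click logs before exposing live traffic to variant B.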

Hadoop is designed for batch processing, so on its own it cannot be used for real-time processing. But when Apache Storm is used on top of Hadoop, it works well for processing big data in real time. With SQL through Apache Hive, Hadoop can be used without writing Java. Apache Sqoop is designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases, while Apache Flume collects log files from web servers and dumps them into Hadoop.

“The good thing about Hadoop is it has such a big community of programmers giving input, and one day it may be transformed to become real time. But the bad thing is that it depends on how well we code the program; otherwise it may crash or not do what we expect it to do,” Bhat said. “Machine learning is one of the best fields to be in and also one of the hardest, because we need to keep learning, reading and updating. But it is rewarding in terms of career and money.”

He explained how students can become data scientists by concentrating on areas like data mining and the semantic web. He also encouraged students to participate in coding competitions such as those on Kaggle, an open competition platform that helps participants learn new skills and improve their coding.

“You can build your own recommender system and get data from government websites on the census, or from free data providers as well,” Bhat said.

Bhat made it to the final round of a Google interview. He shared some of his experiences from that time with students and advised them to learn programming in as much depth as they can.

“If you know programming, you are the best, and as a software engineer, programming is what any company looks for. A basic understanding of algorithms, the memory algorithms occupy and time complexity are the topics covered in interviews,” Bhat said.