
CS224W: Social and Information Network Analysis
Assignment: Final Project Report
Submission time and date: 11:00 PM, 10th Dec 2013

Fill in and include this cover sheet with each of your assignments. It is an honor code violation to write down the wrong time. Assignments are due at 9:30 am, either handed in at the beginning of class or left in the submission box on the 1st floor of the Gates building, near the east entrance.

Each student will have a total of two free late periods. One late period expires at the start of each class. (Homeworks are usually due on Thursdays, which means the first late period expires on the following Tuesday at 9:30 am.) Once these late periods are exhausted, any assignments turned in late will be penalized 50% per late period. However, no assignment will be accepted more than one late period after its due date.

Your name: Kapil Jaisinghani, Ravi Todi, Zhengyi Liu (Group No 31)
Email:
SUID:
Collaborators:

I acknowledge and accept the Honor Code.
(Signed) KJ, RT, ZL
CS224W Project Report - Group 31
Modeling Growth and Decline of Businesses in Yelp Network
Kapil Jaisinghani (kjaising), Ravi Todi (rtodi), Zhengyi Liu (zhengyil)
December 10, 2013

1. Introduction

Yelp kicked off in 2005 as a way for users to rate and review local businesses. Businesses organize their own listings while users rate the business from 1-5 stars and write their own text reviews. Within this system there is a meta review system where users can vote on helpful or funny reviews. Yelp has become the most well-known local search and user review site and has grown so much in size and influence that it now plays a significant role in the success or failure of local businesses.

Yelp has amassed an enormous amount of raw data on businesses. While businesses are able to see their current ratings and the raw text of their reviews, there is no information about projected popularity trends for their businesses. For any business, we would like to predict its growth or decline in Yelp performance, which we assume is highly correlated with real-world performance. For this project we wish to predict the volume of future reviews and the change in star rating. We plan to analyze which Yelp graph features serve as good predictors in the model, evaluate different approaches to mine the temporally evolving Yelp graph, and provide empirical results comparing the different approaches.

2. Prior Work

The topic related to our project is time series prediction. We review below the papers relevant to our work, primarily in the areas of link prediction in graphs and time series analysis using machine learning techniques.

In previous papers such as [1] and [3], the authors used multiple features to construct popularity score prediction formulas on a temporal bipartite network based on preferential attachment. The limitation of these papers is that they focused mainly on the number of reviews and used very few features. In our work we therefore extract a richer feature set and run machine learning algorithms on these features to get a more comprehensive result.

In [2] the authors formulate the link prediction problem in bipartite and unimodal graphs as a supervised learning problem, with the goal of discriminating between positive and negative examples, i.e. the linked and non-linked classes. The paper introduces new metrics that can be calculated from direct topological attributes of unimodal graphs and studies their influence on link prediction accuracy. We also employ this methodology on our dataset and extract some of the same topological features, such as the maximal number of shared neighbors.

To find the optimal set of features, [4] describes a series of techniques used to infer the future attention of businesses using numerous kinds of features generated from business statistics and raw review text. It walks through a feature selection process to select the most influential features and then uses Support Vector Regression (SVR) for prediction. We generate data features similar to those discussed in that paper for growth prediction and also generate similar graph features over the bipartite network.

3. Modeling

Modeling Bipartite Graph

The Yelp network we are using consists of two kinds of objects: businesses and users. The interactions (reviews) between businesses and users are modeled as edges. To have a complete representation of this structure, we model it as a bipartite graph with two sets of nodes and a set of edges connecting the two sides, as shown in Figure 1. The disadvantages of doing analysis on the compressed unimodal projection of this bipartite graph are explained in [1]. Therefore we modeled the Yelp data in this bipartite structure.
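To make this modeling step concrete, here is a minimal sketch of how such a user-business bipartite graph could be built with Python and networkx. This is our illustration rather than the project's actual code; the file name and the business_id/user_id/date fields are assumed to follow the Yelp academic dataset layout.

    import json
    import networkx as nx

    # Build a bipartite graph: one node set for businesses, one for users,
    # and an edge for every review linking a user to a business.
    G = nx.Graph()
    with open("yelp_academic_dataset_review.json") as f:  # assumed file name
        for line in f:
            review = json.loads(line)
            biz, user = review["business_id"], review["user_id"]
            G.add_node(biz, side="business")
            G.add_node(user, side="user")
            # Keep the review month so the graph can later be replayed over time.
            # Repeat reviews by the same user collapse to a single edge in this sketch.
            G.add_edge(user, biz, month=review["date"][:7])  # e.g. "2012-05"

    businesses = {n for n, d in G.nodes(data=True) if d["side"] == "business"}
    print(len(businesses), G.number_of_nodes() - len(businesses), G.number_of_edges())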
Modeling Time Evolution

Our Yelp data timestamps (reviews) range from 2005-03 to 2013-01. The data graph evolves as new businesses and users join and new reviews are added. With this in mind, we modeled our data as a sequence of temporal graphs. For each business/user, we take the first time it appears in a review as the time it joins the network. Then for each month (we use one month as the window size) we aggregate all data in that period and update the graph accordingly.

Modeling for Prediction

A. Star Rating and Number of Reviews Prediction: The general formulation of a time series can be written in the form below, where the missing information is lumped into a noise term w(t):

    Y_t = f(Y_{t-1}, Y_{t-2}, ..., Y_{t-n}) + w(t)

This formulation suggests that the problem of one-step forecasting, in which previous values of the series are available, can be cast as a generic supervised learning regression problem. In the forecasting setting, the training set is derived from the historical series by creating a data matrix whose rows take the form

    [Y_{t-1}, ..., Y_{t-n}, X1_{t-1}, ..., Xi_{t-n}] -> Y_t

We have generalized the approach used in [5] by including all the additional data features in the same model. In this data matrix, Y_t represents the value of the target variable at time t; we build the matrix for two separate target variables, the number of reviews and the star rating of a given Yelp business. X1-Xi represent the features provided in the data and generated through data manipulation and graph analysis, which is covered in detail in section 4.3. Here 'n' is the number of previous observations included; in our model the time granularity is months and we use a rolling window of 6 months. 'N' signifies how many such observations are created from the data for a given business. In the data, the time period runs from 2005-03 to 2013-01, resulting in 95 observations for every business, though for most businesses the reviews are sparse, leading to missing values.
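To make the construction of this data matrix concrete, the following is a minimal sketch assuming a per-business monthly feature table already exists. The helper make_lag_matrix and the toy numbers are hypothetical, not the project's code.

    import pandas as pd

    def make_lag_matrix(monthly: pd.DataFrame, target: str = "review_count", n_lags: int = 6):
        """Turn a per-month feature table into (X, y) for one-step forecasting.

        Each row of X holds the previous n_lags values of every column, and y
        holds the target value for the current month, i.e. Y_t = f(lags) + w(t).
        """
        frames = []
        for lag in range(1, n_lags + 1):
            frames.append(monthly.shift(lag).add_suffix(f"_t-{lag}"))
        X = pd.concat(frames, axis=1)
        y = monthly[target]
        # Drop the first n_lags rows, which have incomplete history.
        return X.iloc[n_lags:], y.iloc[n_lags:]

    # Toy monthly series for one business (hypothetical numbers):
    monthly = pd.DataFrame(
        {"review_count": [3, 5, 4, 7, 6, 8, 9, 11],
         "stars": [3.5, 3.6, 3.6, 3.8, 3.7, 3.9, 4.0, 4.0]},
        index=pd.period_range("2010-01", periods=8, freq="M"),
    )
    X, y = make_lag_matrix(monthly)
    print(X.shape, y.shape)  # (2, 12) (2,)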
03’ to ‘2013-­‐01’ resulting 95 observations for every business, though for most of the business the reviews are sparse and leading to missing values. B. Link Prediction with Supervised ML: We formulate the link prediction as a supervised learning problem. The goal is to discriminate between examples of the linked class (positive examples) against examples of the not-­‐linked class (negative examples). Learning such a supervised classification model requires building a training data that describes examples of both classes. We explain hereafter how to generate training data for the link prediction problem in the case of a time evolving bipartite graph Let Gobs be a graph that summarizes the temporal sequence of networks, G = < G1 , G2 , … … Gt > and let us refer to Gt!1 , as the labeling graph. In this model Gobs is computed for a time span ‘n’ similar to model above for time series prediction. An example is generated for each couple of nodes (u, v) such that: ▪ u and v belong to both Gobs and Gt!1 ▪ < u, v > is not an edge of Gobs 4. Implementation 4.1 Dataset We have used the sample dataset released by Yelp for greater Phoenix, AZ area for academic research. The company released a set of data on the following page: yelp.com/dataset_challenge. This dataset includes information of 11537 businesses, 43873 users, 229907 reviews and 8282 check-­‐ins. Some of the data features which are provided as part of the dataset are ▪ For a business: a) review count b) average stars c) votes ▪ For a user: a) review count b) average stars ▪ Number of check-­‐in’s for a business per hour per day of the week 3 4.2 Initial Data Analysis and Preparation For some business categories there is not enough data to build and validate algorithm as can be seen in Figure 2. Majority of the reviews are for restaurants businesses. We have restricted our analysis to restaurants businesses with at least 80 reviews. Figure 3a is the heat map depicting number of reviews in Phoenix area. Also most of the reviews are for the year 2010-­‐2012 (Figure 3b), for the other years data is sparse. 4.3 Feature Generation Data Features: We have modelled 13 features for the time series modelling, but except for 4 max/min features all other vary with time. We have taken rolling 6 months as the time span which results in total 45 additional features. Below are the descriptions of the core 13 data features which we are leveraging: 4 ▪
4. Implementation

4.1 Dataset

We have used the sample dataset released by Yelp for the greater Phoenix, AZ area for academic research. The company released the data at the following page: yelp.com/dataset_challenge. This dataset includes information on 11537 businesses, 43873 users, 229907 reviews and 8282 check-ins. Some of the data features provided as part of the dataset are:

▪ For a business: a) review count b) average stars c) votes
▪ For a user: a) review count b) average stars
▪ Number of check-ins for a business per hour per day of the week

4.2 Initial Data Analysis and Preparation

For some business categories there is not enough data to build and validate an algorithm, as can be seen in Figure 2. The majority of the reviews are for restaurant businesses, so we have restricted our analysis to restaurant businesses with at least 80 reviews. Figure 3a is a heat map depicting the number of reviews in the Phoenix area. Also, most of the reviews are from the years 2010-2012 (Figure 3b); for the other years the data is sparse.

4.3 Feature Generation

Data Features: We have modelled 13 features for the time series modelling; except for the 4 max/min features, all of them vary with time. We have taken a rolling 6 months as the time span, which results in a total of 45 additional lagged features. Below are descriptions of the core 13 data features we are leveraging (a sketch of the monthly aggregation follows the list):

▪ Number of Reviews: The total number of reviews for a given business is given in the data, but time series modelling requires the review count in each month; we derive it by counting review ids in a given period in the reviews dataset.
▪ Average number of Stars: Similar to the number of reviews, this is provided in the dataset for each business, but we derive it from the reviews dataset for each time period.
▪ Maximum/Minimum number of stars: The max/min features do not vary with time; they are derived from the reviews dataset.
▪ Maximum/Minimum date of review: The max/min date features do not vary with time; they are derived from the reviews dataset.
▪ Number of reviews voted as 'cool': This is provided for each review; we derive it for each time period for a given business by aggregating the data.
▪ Number of reviews voted as 'funny': As above, this is provided for each review; we derive it by aggregating the data.
▪ Number of reviews voted as 'useful': As above, this is provided for each review; we derive it by aggregating the data.
▪ Number of unique users logging these reviews: This is derived by counting user ids in the review data for a given business for each time period.
▪ Number of days since the first review: This is derived using the first review date for a given business and the last date in each time period.
▪ Number of days since the last review: This is derived using the last review date for a given business before the last date of each time period.
▪ Number of days between first and last review: This is derived using the first review date and the last review date before the last date of each time period for a given business.
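As referenced above, a minimal sketch of how these per-month data features could be aggregated with pandas; the file name and the review fields (business_id, user_id, stars, date, votes) are assumed to follow the Yelp dataset, while everything else is our own illustration.

    import pandas as pd

    # One row per review with business_id, user_id, stars, date and a nested votes dict.
    reviews = pd.read_json("yelp_academic_dataset_review.json", lines=True)  # assumed file name
    reviews["month"] = pd.to_datetime(reviews["date"]).dt.to_period("M")
    # The votes field is a dict like {"cool": .., "funny": .., "useful": ..}; flatten it.
    votes = pd.DataFrame(reviews["votes"].tolist())
    reviews = pd.concat([reviews.drop(columns=["votes"]), votes], axis=1)

    monthly = reviews.groupby(["business_id", "month"]).agg(
        review_count=("review_id", "count"),
        avg_stars=("stars", "mean"),
        cool_votes=("cool", "sum"),
        funny_votes=("funny", "sum"),
        useful_votes=("useful", "sum"),
        unique_users=("user_id", "nunique"),
    )
    print(monthly.head())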
Graph Properties and Node Features: From the plots in Figure 5 we can conclude that both the businesses and the users follow a power-law degree distribution, which supports predicting future reviews using preferential attachment. With the temporal modeling and other graph features such as average/node/projected-graph clustering coefficients, we can get more useful and insightful results that help identify the most influential signals for our prediction. To predict links on the temporal graphs, we extract the following graph topological features and train the classifier on them (a sketch of how they can be computed follows the list):

▪ Product of Degrees: For a business and a user, we use the product of their degrees. This captures the preferential attachment nature of the temporal graphs.
▪ Common neighbors: In this bipartite graph there are no common neighbors between a business and a user, so we use the following strategy: for a business M and a user N, we take M's neighbors NB(M). Among all users in NB(M), we search for the one that shares the most common neighbors with user N and record the size of that common neighborhood. Then we switch the roles of M and N and repeat.
▪ Jaccard Coefficients: For business M and user N, using the same logic as in the common-neighbors feature, we take the maximum Jaccard coefficient between M and N's neighbors, and then between N and M's neighbors.
▪ Category interest: Each business has some categories; for example, Chipotle might have the category "Restaurant, Mexican Grill". For a business M and a user N, we take all neighbors of user N and count the common categories between N's neighbors and M. A large count means user N has a strong interest in M's categories, so N is more likely to review M in the future.
▪ Location interest: Each business has three location-related data entries: zip code, longitude and latitude. For a business M and a user N, we take all neighbors of N and compute the average distance between N's neighbors and M. A smaller value means the user's activity area is closer to business M, so the user is more likely to review M.
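As referenced above, a rough sketch of how the first three topological features could be computed on the bipartite networkx graph; the helper names and the exact aggregation (taking the maximum over both directions) are our simplification of the description above, not the project's code.

    import networkx as nx

    def best_common_neighbors(G: nx.Graph, m, n):
        """Max number of shared neighbors between n and any neighbor-of-m (same side as n)."""
        best = 0
        for w in G[m]:
            best = max(best, len(set(G[w]) & set(G[n])))
        return best

    def jaccard(G: nx.Graph, a, b):
        na, nb = set(G[a]), set(G[b])
        union = na | nb
        return len(na & nb) / len(union) if union else 0.0

    def pair_features(G: nx.Graph, business, user):
        return {
            "degree_product": G.degree(business) * G.degree(user),
            "common_neighbors": max(
                best_common_neighbors(G, business, user),
                best_common_neighbors(G, user, business),
            ),
            "max_jaccard": max(
                max((jaccard(G, w, user) for w in G[business]), default=0.0),
                max((jaccard(G, w, business) for w in G[user]), default=0.0),
            ),
        }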
Data Preparation

To make the data fully suitable for our project's purpose and scale, we need to prune and clean the extra noise and flaws out of the dataset. Here are the steps we applied (a sketch of the imputation step follows the list):

▪ There is an intersection between business IDs and user IDs. When constructing the bipartite graph, this would break the bipartite topology, so we re-assigned IDs to businesses and users: all business IDs are less than 100000 and all user IDs are larger than 100000.
▪ As we can see from Figure 3b, the data is too sparse for our analysis in the early years, so we chose the data from 2010 to 2012 as the target dataset.
▪ In order to avoid making predictions for inactive businesses, we removed all businesses with a review count of less than 80 (over the time range 2005 to 2012) from the dataset.
▪ The time span in the data is from 2005-03 to 2013-01, and for many businesses there are months with no reviews, creating missing data. We handled missing data imputation for time-varying features like review count by filling zeros and then taking cumulative sums; for non-time-varying features like max/min stars we chose to fill forward.
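As referenced above, a minimal sketch of the imputation step with pandas, assuming a single business's monthly frame (with a PeriodIndex) like the one built in section 4.3; the column names are hypothetical.

    import pandas as pd

    def impute_business(monthly: pd.DataFrame) -> pd.DataFrame:
        """Fill months with no reviews for one business.

        Time-varying count features: missing months become 0, then a cumulative
        sum turns them into running totals. Non-time-varying features (e.g.
        max/min stars) are forward-filled from the last observed month.
        """
        # Reindex to a complete monthly range so gaps become explicit NaNs.
        full_index = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
        monthly = monthly.reindex(full_index)

        count_cols = ["review_count", "cool_votes", "funny_votes", "useful_votes", "unique_users"]
        monthly[count_cols] = monthly[count_cols].fillna(0).cumsum()

        static_cols = ["max_stars", "min_stars"]
        monthly[static_cols] = monthly[static_cols].ffill()
        return monthly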
4.4 Feature Selection

Feature selection identifies the key important features and removes irrelevant, redundant or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy and comprehensibility of the models. We have used these techniques only for the time series ML modelling, to reduce the dimensionality from 58 features to 5-10 representative features; for link prediction, as there were only 5 features, we have not used any of these methods. The following are the methodologies we used.

Principal Components Analysis: PCA is a tool that reduces the dimensionality of a feature space. It is particularly useful when the feature space is too large for a regression or classification to be computationally feasible, or when the data contains large amounts of noise. It builds a set of features by selecting the axes which maximize data variance. We used PCA to reduce the dimensionality to 10 components. Using the PCA components for prediction improved the regression metrics of SVR: the explained variance and R2 score went up from 0.8 to 0.94, and MAE was reduced from 79.7 to 41.7. For Random Forests and Gaussian Processes the improvements were negligible.

Univariate Feature Analysis: This is a simple and efficient way of selecting features based on their individual prediction quality. We selected the 10 best features with f_regression as the scoring function; refer to Figure 6(a) for the scores and top features. Using the top 10 features, the results were similar to PCA: SVR scores improved by the same margin, and it had negligible impact on RF and GP.

Recursive Feature Elimination: Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each of them. Then, the features whose absolute weights are smallest are pruned from the current set. This procedure is repeated recursively on the pruned set until the desired number of features is reached. We used SVR with a linear kernel as the estimator; the resulting feature ranks can be seen in Figure 6(b). Using the top 10 features improved the regression metrics of SVR: the explained variance and R2 score went up from 0.8 to 0.92, and MAE was reduced from 79.7 to 42.3. For Random Forests and Gaussian Processes the improvements were negligible.
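A minimal sketch of these three selection techniques with scikit-learn; the synthetic X and y stand in for the 58 lagged features and the target, and the code is illustrative rather than the project's implementation.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import RFE, SelectKBest, f_regression
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 58))            # stand-in for the 58 lagged features
    y = X[:, 0] * 3 + rng.normal(size=500)    # stand-in target

    # Principal Components Analysis: project the features onto 10 components.
    X_pca = PCA(n_components=10).fit_transform(X)

    # Univariate selection: keep the 10 features scoring highest under f_regression.
    kbest = SelectKBest(score_func=f_regression, k=10).fit(X, y)
    X_kbest = kbest.transform(X)

    # Recursive feature elimination with a linear-kernel SVR as the estimator.
    rfe = RFE(estimator=SVR(kernel="linear"), n_features_to_select=10).fit(X, y)
    X_rfe = rfe.transform(X)
    print(rfe.ranking_)  # feature ranks, as in Figure 6(b)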
Figure 6 (a): Scores from the Select K Best method.

Feature Name       Feature Rank
review_t-1         1
review_t-2         1
cool_votes_t-4     1
cool_votes_t-5     1
useful_votes_t-5   1
useful_votes_t-4   2
review_t-6         3
funny_votes_t-5    4
funny_votes_t-6    5

Figure 6 (b): Feature rank from RFE.

4.5 Prediction

4.5.1 Time Series Prediction with Supervised ML: We have used the following ML algorithms for predicting the number of reviews and star ratings, modeling it as a supervised learning regression problem as covered in section 3 above.

A. Gaussian Processes (GP): GP is a generic supervised learning method primarily designed to solve regression problems. One advantage of using GP is that the prediction is probabilistic (Gaussian), so one can compute empirical confidence intervals and exceedance probabilities that might be used to refit the model (online fitting, adaptive fitting). It is a nonparametric method that models the observed responses of the different training data points (function values) as a multivariate normal random variable. In our results GP was the best performing model for predicting the number of reviews; it scores better than RF and SVR on all the regression metrics, though the difference is very small. The parameters used for the model were (theta0 = 1e-2, thetaL = 1e-4, thetaU = 1e-1).

B. Support Vector Regression (SVR): One reason to choose SVR as one of our regression models is that it provides very fast prediction performance, which will likely hold up when running numerous predictions once the datasets become bigger. It aims to find a function f(x) that does not deviate more than some ϵ from the training data. By having this margin, the training process can determine a subset of training points as having the most influence on the model parameters. The points with the most influence end up defining the margin and are called support vectors. Thus, the final model depends only on a subset of the training data, and predictions run only over this smaller number of support vectors. In our results SVR scored lowest compared to RF and GP, though by a small margin. SVR also showed the maximum improvement in scores with the feature reduction methods. The default parameter values resulted in very bad scores, so we used grid search to find the optimal values for the model (C = 1000, gamma = 0.0001, kernel = 'RBF').

C. Random Forests (RF): The reason to use an ensemble method as one of our evaluation algorithms is that it combines the predictions of several models built with a given learning algorithm in order to improve generalizability/robustness over a single model. An RF model is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. In our results RF gave the best results for predicting star rating; it showed minimal change in its scores with the feature selection methods, as it has built-in feature pruning based on feature importance.

4.5.2 Link Prediction with Supervised ML: We have used the classification versions of the ML algorithms stated above, namely Support Vector Classifier (SVC) and Random Forest (RF), for predicting links, modeling it as a supervised learning classification problem as covered in section 3 above. In our results RF scores better on the classification metrics than SVC. For SVC we also used grid search for parameter tuning, similar to SVR as mentioned above.
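A condensed sketch of how these regressors could be set up with scikit-learn. The SVR grid includes the optimum reported above (C = 1000, gamma = 0.0001), but the grid itself and the synthetic data are ours; note that current scikit-learn exposes Gaussian processes as GaussianProcessRegressor, whereas the theta parameters quoted above belong to the older GaussianProcess interface, so the GP line here is only a stand-in.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 10))            # stand-in for the selected features
    y = X[:, 0] * 5 + rng.normal(size=300)    # stand-in target (e.g. next-month reviews)

    # SVR: tune C and gamma by grid search, since the default values performed poorly.
    svr = GridSearchCV(
        SVR(kernel="rbf"),
        param_grid={"C": [1, 100, 1000], "gamma": [1e-4, 1e-3, 1e-2]},
        cv=10,
    )
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    gp = GaussianProcessRegressor()           # modern stand-in for the old GaussianProcess class

    for name, model in [("SVR", svr), ("RF", rf), ("GP", gp)]:
        model.fit(X, y)
        print(name, round(model.score(X, y), 3))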
5. Evaluation

Model Performance Comparison

We have compared results from the different algorithms within link prediction and time series prediction for "number of reviews" using the metrics explained below. The comparison for "change in star rating" is evaluated only with time series prediction. Link prediction is the more constrained model for the given problem, as it predicts not just how many reviews (new edges) a given business (business node) gets but also from whom (user nodes), whereas the time series regression model does not have this constraint. Because of this, directly comparing results across the two may not give a fair idea of which one performs better.

Model Validation

Learning the parameters of a prediction function and testing it on the same data over-fits the model: a model that simply repeats the labels of the samples it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. To avoid this, it is common to hold out part of the available data as a "test set". Similarly, to avoid over-fitting the hyper-parameter values on the test data, yet another part of the dataset, the "validation set", needs to be held out. We have used 10-fold cross validation to create validation and test/training sets from the data. Link prediction is modeled as a binary classification problem and time series prediction as a regression problem, so we evaluate the results of the two methods using the metrics defined in the following sections.

Evaluation & Results: Link Prediction

As it has been modeled as a supervised classification problem, we have used the standard classification evaluation metrics to assess our results: Accuracy, Precision, Recall and F1 score.

Model   Accuracy   Precision   Recall   F1
RF      0.9985     0.9976      0.9994   0.9985
SVC     0.9775     0.9694      0.9859   0.9776

The classification results came out good on all metrics; the Random Forests algorithm performed better than Support Vector Classification in this model.

Evaluation & Results: Time Series Prediction

As it has been modeled as a supervised regression problem, we used the following standard regression evaluation metrics to assess our results: Explained Variance (Exp Var), Mean Absolute Error (MAE), Mean Squared Error (MSE) and R-Square (R2).

Result statistics across all businesses for the prediction of the number of reviews for the coming month, using past months' review counts and the other data features discussed before:

SVR        Exp Var   MAE        MSE          R2
mean       0.9794    4.8684     351.382      0.9747
std dev    0.0499    11.4612    2545.9293    0.0684
min        0.5558    0.6263     0.7937       0.2439
max        0.9995    136.6689   42081.555    0.9994

RF         Exp Var   MAE        MSE          R2
mean       0.9865    3.4015     31.4047      0.9835
std dev    0.0252    2.2174     59.5594      0.0350
min        0.6903    0.6721     0.9331       0.5691
max        0.9989    18.4654    732.7712     0.9989

GP         Exp Var   MAE        MSE          R2
mean       0.9896    2.8342     19.0326      0.9861
std dev    0.0368    1.4446     37.2644      0.0647
min        0.2810    0.9963     1.6094       -0.2149
max        0.9997    14.6420    458.4087     0.9997
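The statistics above, and the star-rating statistics that follow, are based on scikit-learn's standard regression metrics; a minimal sketch of computing them for one business with 10-fold cross-validation (synthetic data, our illustration, not the project's code):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                                 mean_squared_error, r2_score)
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(1)
    X = rng.normal(size=(95, 10))          # 95 monthly observations for one business
    y = X[:, 0] * 4 + rng.normal(size=95)

    # Out-of-fold predictions from 10-fold cross-validation.
    y_pred = cross_val_predict(RandomForestRegressor(random_state=0), X, y, cv=10)

    print("Exp Var:", explained_variance_score(y, y_pred))
    print("MAE:", mean_absolute_error(y, y_pred))
    print("MSE:", mean_squared_error(y, y_pred))
    print("R2:", r2_score(y, y_pred))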
Result statistics across all businesses for the prediction of the star rating for the coming month, using past months' ratings and the other data features discussed before:

SVR        Exp Var    MAE       MSE       R2
mean       0.6257     0.0749    0.0125    0.4557
std dev    0.5233     0.0467    0.0253    0.8485
min        -8.6964    0.0000    0.0000    -12.2335
max        1.0000     0.4077    0.2344    1.0000

RF         Exp Var    MAE       MSE       R2
mean       0.7722     0.0350    0.0070    0.7526
std dev    0.5394     0.0290    0.0194    0.5867
min        -9.2431    0.0000    0.0000    -10.1487
max        1.0000     0.2889    0.2178    1.0000

GP         Exp Var    MAE       MSE       R2
mean       0.5169     0.0397    0.0121    0.4709
std dev    4.0113     0.0470    0.0810    4.4823
min        -81.1059   0.0000    0.0000    -90.7601
max        1.0000     0.5682    1.5806    1.0000

The regression metrics show good performance on both prediction tasks. GP comes out on top for the number-of-reviews prediction and RF for the star-rating prediction.

6. Conclusion and Future Work

Conclusion: In this paper, we have described two different techniques, time series modelling using machine learning and link prediction in a bipartite graph, to infer future business growth using review volume and change in star rating. We have used numerous features from business statistics and from the relationships between businesses and users in the modelled graph. In our experiments, we have attempted to infer the number of additional reviews and the star ratings businesses will receive in the next month by formulating it as one-step forecasting. We found that of all the features, the temporal features of previous reviews and ratings, along with the votes for reviews, provided the best prediction results. Also, the evaluation metrics show that GP achieved the top performance compared to RF and SVR in predicting the number of reviews, while RF took the top spot for star rating prediction.

Future work: Some of the additional data which we did not consider for the modelling can be looked at further, especially the review text and check-in times. Apart from the number of reviews and star ratings, the number of check-ins can also be used to model the future growth of a business. One can also look at modelling the data as a weighted or signed graph by using the rating or the sentiment of the review text. Another aspect to consider is a comparison with additional algorithms, such as statistical time series models (ARIMA/GARCH) and link prediction using supervised random walks.

7. References

[1] Zeng, A., Gualdi, S., Medo, M., & Zhang, Y. C. (2013). Trend prediction in temporal bipartite networks: the case of MovieLens, Netflix, and Digg. arXiv preprint arXiv:1302.3101.
[2] Benchettara, N., Kanawati, R., & Rouveirol, C. (2010, August). Supervised machine learning applied to link prediction in bipartite social networks. In Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on (pp. 326-330). IEEE.
[3] Beguerisse Díaz, M., Porter, M. A., & Onnela, J. P. (2010). Competition for popularity in bipartite networks. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(4), 043101.
[4] Hood, B., Hwang, V., & King, J. Inferring Future Business Attention. Retrieved from http://www.yelp.com/html/pdf/YelpDatasetChallengeWinner_InferringFuture.pdf.
[5] Bontempi, G., Ben Taieb, S., & Le Borgne, Y.-A. (2013). Machine Learning Strategies for Time Series Forecasting. In Lecture Notes in Business Information Processing. Springer.