1. Presence of duplicate hotels
While parsing the data, we found there were a large number of duplicate hotels in the dataset that were assigned to unique hotel ids. The presence of such duplicate hotels made it difficult to analyse the exact sentiment of the reviews towards the hotels. We were required to clean such duplicate data in the pre-processing stage.
2. Large number of hotel reviews
The TripAdvisor dataset we used had many reviews for each hotel in a single file. It was necessary to parse the data of each hotel in order to retrieve the content of each review belonging to that hotel.
3. Reviews conflicting with the rating
A customer in a hotel may give a rating of 4 stars out of 5 stars but the customer’s review might mention that they were disappointed with the restaurant or some other feature of the hotel. This makes it difficult to gauge the true sentiment of the customer.
4. Coming up with the idea
When we were coming up with an idea for the project, we decided to implement something that doesn’t already exist as a tool. This proved to be a huge challenge, as most domains already use big data these days to perform some sort of analysis. We had to brainstorm quite a bit to come up with something new.
We decided to go through certain publicly available data sets and came up with an idea for an analysis tool for hotel inspection. The task of inspecting hotels may seem routine, and you’d probably wonder what kind of analysis can be done to help this profession. Surprisingly enough, we found that because hotel inspection indirectly affects the travel and tourism industry of a place, such analysis is useful for improving people’s travel experience.
5. Getting the data
It was equally difficult to obtain data that matched our idea and was also large enough to support meaningful analysis. Since our domain was new, we had to improvise with the data we had: hotel reviews.
6. Data Preprocessing
In any big data project, the most time-consuming and challenging part is pre-processing the data to match the requirements of the system. The data we used consisted of a few hundred reviews for hotels across the world, spread across different files and written in JSON format (a further description of the data is given below). The challenge was to parse this data into separate elements and store them in a database for easy extraction.
Our reviews were long free-text passages written by users, and they were not always in perfect, simple English. We had to find a way of classifying the reviews as positive or negative and also find out what exactly about the hotel the users considered good or bad. Accurately analysing the sentiments of people was therefore a challenge. In addition, the data we had was unlabelled, so we had to build a semi-supervised learning model for sentiment analysis.
Once we had the data, we had to think from the perspective of the hotel inspector who would be using this tool. To fulfill the requirements of a hotel inspector, we had to come up with our own ideas of what he or she might expect from such a tool, which resulted in us creating an overall analysis as well as a breakdown analysis of the hotel in question.
Since we knew very little of this domain to begin with, we resorted to a very reliable hotel inspection checklist provided by the AAA. It gave us a detailed list of what the hotel inspector checks for in a hotel.
The following is a link to the checklist we followed: http://www.ncdsv.org/images/hotelinspectionchecklist.pdf
From this checklist, we extracted a set of features that the users have rated in the reviews. Based on the checklist, we assigned scores to each feature to predict what users look for most in a hotel they plan to stay at.
The top 5 criteria are as follows:
4. Internet and other facilities
We used an existing approach (based on a research paper) for opinion mining and visualization of the reviews.
The main reasons for choosing this approach are as follows:
1. This approach made better utilization of the data as compared to other approaches.
2. Sentiment analysis results from Naive Bayesian classification generated better results as per the paper.
3. We managed to generate better ideas for visualization of our results.
The following steps are involved in our approach:
1. Data cleaning and loading (MapReduce)
The first step in our approach involves cleaning the data and loading it into a database. Each JSON file was parsed using MapReduce. This step consists of two parts: data cleaning and data extraction.
a. Data Cleaning
Data preprocessing involves cleaning and transforming the obtained data as per our requirements. The preprocessing techniques we used are as follows:
1. Removal of HTML tags: HTML tags (e.g. <a></a>) need to be removed as they do not contribute to classification.
2. Replacement of ‘,’ and ‘…’ with white spaces
3. Removal of punctuation: Punctuation should be removed from the data. (e.g. ?, !)
4. Removal of additional white spaces: Any extra or trailing white spaces that are present in the data need to be removed.
5. Conversion to lowercase: Conversion of the data to lowercase will make the reviews uniform.
6. Removal of numbers: e.g. in XYZ27, the number 27 makes no contribution to sentiment analysis and can be removed.
7. Removal of stopwords: The commonly occurring words (stopwords) should be removed. Pronouns, conjunctions and prepositions are examples of commonly occurring words. (e.g. a, is, the, etc.)
8. Removal of duplicates: The duplicate hotels in the data should be removed as they distort the results.
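Steps 1 to 7 of the list above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the MapReduce code used in the project: the function name `clean_review` and the tiny `STOPWORDS` set are hypothetical, and step 8 (duplicate hotels) works at the hotel level and is not covered here.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "and", "or", "of", "to", "in", "it"}


def clean_review(text: str) -> str:
    """Apply cleaning steps 1-7 to a single review string."""
    # 1. Remove HTML tags such as <a>...</a>.
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Replace commas and ellipses with white space.
    text = text.replace(",", " ").replace("…", " ").replace("...", " ")
    # 3. Remove the remaining ASCII punctuation (?, !, etc.).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 5. Convert to lowercase so that reviews are uniform.
    text = text.lower()
    # 6. Remove digits, so "xyz27" becomes "xyz".
    text = re.sub(r"\d+", "", text)
    # 7. Drop stopwords; split() also collapses extra white space (step 4).
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    print(clean_review("The room was <b>GREAT</b>, but room 27 is noisy...!"))
```

Because punctuation is deleted rather than replaced by a space, words joined only by punctuation would be merged; replacing commas and ellipses with white space first (step 2) avoids the most common cases.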
b. Data Extraction
For each file (i.e. each hotel), we extract the review data and the hotel information. The review data contains the individual ratings given by a customer, the overall rating, the date the review was published and the actual content of the review. The hotel information contains the hotel name, the URL of the hotel’s website, the address (which contains street and city data) and the hotel id.
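The extraction described above might look roughly like the sketch below for one hotel file. The JSON key names (`HotelInfo`, `Reviews`, `Ratings`, `Overall`, `Content`, `Date`, `Name`, `HotelURL`, `Address`, `HotelID`) and the function name `extract_hotel` are assumptions made for illustration; the actual layout of the dataset files should be checked before reusing this.

```python
import json


def extract_hotel(raw_json: str) -> dict:
    """Split one hotel document into hotel information and a list of reviews.

    All key names used here are assumed, not taken from the project code.
    """
    doc = json.loads(raw_json)
    info = doc.get("HotelInfo", {})
    hotel = {
        "hotel_id": info.get("HotelID"),
        "name": info.get("Name"),
        "url": info.get("HotelURL"),
        "address": info.get("Address"),
    }
    reviews = []
    for review in doc.get("Reviews", []):
        ratings = dict(review.get("Ratings", {}))
        # Separate the overall rating from the per-feature ratings.
        overall = ratings.pop("Overall", None)
        reviews.append({
            "hotel_id": hotel["hotel_id"],
            "overall": overall,
            "feature_ratings": ratings,
            "date": review.get("Date"),
            "content": review.get("Content", ""),
        })
    return {"hotel": hotel, "reviews": reviews}
```

Keeping the hotel id on every review record makes it straightforward to store hotels and reviews separately in a database and join them again later.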
2. Sentiment classification of the reviews (Mahout)
The data we obtained did not have labels for classification. Along with manually labelling the reviews, we also used the ratings in order to label the reviews. Thus, we used a semi-supervised approach for learning. The model we used for classification is the Naive-Bayesian model. The classifier was built using these labelled reviews in Mahout.
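A rating-based labelling rule of the kind described above could be sketched as follows. The thresholds (4 and above is positive, 2 and below is negative) and the function name `label_from_rating` are assumptions for illustration; the source does not state which cut-offs were used. Reviews that fall between the thresholds are left unlabelled here, which is where manual labelling would come in.

```python
from typing import Optional


def label_from_rating(overall: Optional[float],
                      positive_min: float = 4.0,
                      negative_max: float = 2.0) -> Optional[str]:
    """Derive a sentiment label from the overall star rating.

    Returns "positive", "negative", or None when the rating is missing
    or too ambiguous to label automatically. Thresholds are assumed.
    """
    if overall is None:
        return None
    if overall >= positive_min:
        return "positive"
    if overall <= negative_max:
        return "negative"
    return None
```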
3. Analyze the overall sentiment towards the hotel
Using the reviews labelled manually and from the overall ratings, we trained the model by converting the training data to sequence files and then to sparse TF-IDF vectors. When tested, the model gave an accuracy of 80%, which was good enough for our classifier.
Using this classifier, we calculated the sentiment of the reviews for all the hotels and found the total number of positive and negative reviews for each hotel, which we stored in a CSV file to use for visualization. The results were not stored in MongoDB, as visualization with d3.js worked better with CSV files than with values from MongoDB.
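To make the classification idea concrete, here is a minimal multinomial Naive Bayes classifier in plain Python. It is not Mahout and does not reproduce the project's pipeline: it works on raw word counts with add-one smoothing instead of TF-IDF vectors, and the class name `NaiveBayes`, the helper `accuracy`, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated tokens."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in doc.split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the given label."""
    correct = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


if __name__ == "__main__":
    train_docs = ["clean room friendly staff", "great location clean",
                  "dirty room rude staff", "noisy dirty bathroom"]
    train_labels = ["positive", "positive", "negative", "negative"]
    model = NaiveBayes().fit(train_docs, train_labels)
    print(model.predict("clean friendly"), model.predict("dirty noisy"))
```

Accuracy on a held-out set of labelled reviews, computed with a helper like `accuracy`, is the kind of figure the 80% above refers to.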
This overall sentiment analysis of the hotels helped in building the priority list for hotel inspection. We did face a few challenges here to build the priority list as the hotels differed from each other in the total number of reviews. For example, there were hotels which had around 500-800 reviews of which 50% were negative while there were a few which had fewer than 10 reviews and all were negative. This second case gave 100% negativity while we should prioritize the first case.
To handle this case, we assigned weights to the sentiment of the hotels, giving a higher score to hotels that had more than the average number of reviews and a larger number of negative reviews.
Weighted Percentage of Negativity = (no. of negative reviews / total no. of reviews) * (total no. of reviews / avg. no. of reviews) * 100
= (no. of negative reviews / avg. no. of reviews) * 100
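The weighting can be checked with a small worked example. The sketch below assumes the average is taken over the review counts of all hotels being ranked; the function names and the two hotels with their counts are made up for illustration.

```python
def weighted_negativity(negative: int, total: int, mean_reviews: float) -> float:
    """Weighted percentage of negativity for one hotel."""
    if total == 0 or mean_reviews == 0:
        return 0.0
    return (negative / total) * (total / mean_reviews) * 100


def priority_list(hotels: dict) -> list:
    """Rank hotels by weighted negativity, highest first.

    `hotels` maps a hotel name to a (negative_reviews, total_reviews) pair.
    """
    mean_reviews = sum(total for _, total in hotels.values()) / len(hotels)
    scores = {
        name: weighted_negativity(negative, total, mean_reviews)
        for name, (negative, total) in hotels.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counts: a large hotel with 50% negative reviews and
    # a small hotel whose 8 reviews are all negative.
    example = {"Hotel A": (300, 600), "Hotel B": (8, 8)}
    for name, score in priority_list(example):
        print(f"{name}: {score:.2f}")
```

With these made-up numbers the average is 304 reviews per hotel, so Hotel A scores about 98.7 and Hotel B about 2.6. The large hotel is ranked first even though the small one is 100% negative, which is the behaviour the weighting is meant to produce.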
4. Analyze the sentiment towards individual features of the hotel
We extracted the sentiment towards individual features using the ratings that users gave for the hotel features. Users rated many features of the hotels. We calculated the sentiment towards each feature of a hotel as the average rating for that feature. We plan to use the review text for feature extraction in future work.
The last step of our approach was building a simple and effective visualization on the front-end where a hotel inspector can immediately obtain the information they need at first glance. The front-end visualizations include the priority list of hotels to be inspected and a world map that depicts the count of hotels on the priority list at a particular location. You can further drill down on the hotels in the priority list in order to visualize the overall ratings, overall sentiment, changing trends, and the reviews of the hotel that is selected in the priority list.
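Averaging the per-feature ratings for one hotel can be sketched as below. The function name `average_feature_ratings` and the feature names in the example are hypothetical, and missing ratings are assumed to be represented as `None`.

```python
from collections import defaultdict


def average_feature_ratings(reviews: list) -> dict:
    """Average each feature's rating over all reviews of one hotel.

    `reviews` is a list of dicts mapping a feature name to its rating.
    Features that a review does not rate (absent or None) are skipped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ratings in reviews:
        for feature, value in ratings.items():
            if value is None:
                continue
            totals[feature] += value
            counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


if __name__ == "__main__":
    sample = [{"Cleanliness": 4, "Service": 2}, {"Cleanliness": 2, "Service": None}]
    print(average_feature_ratings(sample))
```

Each feature is averaged only over the reviews that actually rated it, so a feature rated by few guests is not pulled down by reviews that left it blank.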
1. Apache Mahout:
Apache Mahout is a scalable machine learning library that supports big data sets. We used the Naive Bayes algorithm as a classification model for sentiment analysis from this library. This algorithm in Mahout uses a medium sized dataset which contains a lot of text for classification.
JSON is an industry standard data interchange format. In order to parse the data, we had to feed the JSON files as input to the mappers. We then transform this data to extract the values we require for sentiment analysis. In a typical MapReduce job, the transformed data would then be sorted, merged and presented to the reducer. In our case there is no common key, so there is no reducer step and the mapper output is written directly to the file.
The system we built is an analysis tool that is based on customer reviews and ratings. We focussed more on negative review classification in order to determine which hotels require inspection. It provides a hotel inspector with a priority list of hotels for immediate hotel inspection. Using the world map, it is easy to identify the locations in the world that have the highest number of hotels that require inspection immediately. The system also provides an analysis of the notable features of a hotel and the customers’ sentiment towards these features. The inspector also has access to all the reviews on the dashboard. The system also includes a trend chart, which the inspector can use to see the fluctuations between positive and negative ratings over a span of one year.