The early adoption rate for chatbots is far higher than that of mobile apps at a comparable stage of growth. By the end of April 2018, developers had created 300,000 bots on the Facebook Messenger platform alone, which shows how quickly businesses are adopting chatbots. According to Business Insider, “80% of businesses want to deploy chatbots by 2020.”
Given this growth and the number of bots available in the market, you might be wondering: Why are only a few bots popular? What is the unique selling point of these chatbots? What happened to the rest of them?
The answer to all of these questions is the way the bots are trained. As the saying goes, chatbots are only as good as the training they are given. However, most chatbots are hand-crafted and aren’t smart enough to respond to customer queries. Any user would be dissatisfied if a bot failed to answer a query after two or three interactions. This happens because bots are trained with either a limited set of questions or no questions at all for a given intent. In real-world use, these bots fail to make an impact on users and lose their purpose. Audience expectations of chatbots are rising, and there are numerous reasons a chatbot can fail to deliver on its purpose.
Training a bot sufficiently can address many of these user-experience issues. To illustrate with an example, let's consider the hotel domain for this model.
Here, “check-in” is the intent and the entire statement is referred to as the user expression. The intent and the user expression are the basic components of a chatbot.
The scope of the project is to identify the intent from a user expression and to generate similar user expressions, so that the bot can be trained on many variations without manual intervention. After training, the bot should be able to respond to such queries reliably.
The objective is to build a Variant Generator model that lets the bot train itself on a large set of domain-specific questions. A data pipeline is designed to collect, clean, and pre-process the input data. Using this pipeline, qualified domain data can be generated to train the bot efficiently. The model is then served through an API endpoint that returns variants for any intent.
Building the model requires a sufficient amount of qualified domain data. Around 1,500 data sources were queried, and around 20,000 questions were collected and considered for modeling. Because the collected data is raw, it is converted to a normalized form to capture the actual intent of each question.
Pre-Processing: Organisation, Location, and Person entities are extracted from the noun chunks, and equivalent word vectors are substituted for these entities. This helps in understanding the contextual meaning of a sentence and in eliminating duplicate questions. Numerous other pre-processing steps are performed to clean and normalize the data and to make the text easier to classify by intent.
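
The entity substitution and de-duplication step can be sketched roughly as follows. This is a minimal illustration and not the project's actual pipeline: the `ENTITIES` lookup is a hypothetical stand-in for a real named-entity recognizer, and entities are replaced with plain type tokens (ORG, LOC, PERSON) rather than word vectors, purely to keep the example self-contained.

```python
import re

# Hypothetical entity lookup (an assumption for this sketch).
# A real pipeline would obtain these from an NER model.
ENTITIES = {
    "hilton": "ORG",
    "marriott": "ORG",
    "paris": "LOC",
    "london": "LOC",
    "john": "PERSON",
}


def normalize(question):
    """Lowercase, drop punctuation, and replace known entities with type tokens."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return " ".join(ENTITIES.get(tok, tok) for tok in tokens)


def deduplicate(questions):
    """Keep only the first question seen for each normalized form."""
    seen = set()
    unique = []
    for question in questions:
        key = normalize(question)
        if key not in seen:
            seen.add(key)
            unique.append(question)
    return unique
```

With this normalization, "Is the Hilton in Paris pet-friendly?" and "Is the Marriott in London pet friendly?" reduce to the same form, so only one of them is kept.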
One of the most important challenges in text classification is labeling the training dataset. Since the collected questions come with no labels, there is no direct way to annotate the training data other than an unsupervised learning approach.
A novel approach was used to annotate the training data: an unsupervised algorithm groups the questions into clusters, and the similarity of word vectors within each cluster is used to remove outliers. The clusters are first given proxy labels. Later, actual labels are assigned based on a domain ontology.
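
A rough sketch of the outlier removal and relabeling idea is shown below. It assumes the questions have already been clustered and embedded as vectors; the centroid-based cosine check, the `threshold` value, and the function names are assumptions made for illustration, since the text does not specify the exact algorithm.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def drop_outliers(clusters, threshold=0.8):
    """Remove members whose similarity to their cluster centroid is below threshold.

    clusters maps a proxy label to a list of (text, vector) pairs.
    """
    cleaned = {}
    for label, members in clusters.items():
        center = centroid([vec for _, vec in members])
        cleaned[label] = [
            (text, vec) for text, vec in members if cosine(vec, center) >= threshold
        ]
    return cleaned


def apply_ontology(clusters, label_map):
    """Swap proxy labels for ontology labels; unmapped labels are kept as-is."""
    return {label_map.get(label, label): members for label, members in clusters.items()}
```

In this sketch, a cluster keyed by a proxy label such as "cluster_0" is first cleaned with `drop_outliers` and then renamed with `apply_ontology`, for example to "check-in".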
To implement the domain-based labeling, the Schema.org schema for the hotel domain (shown below) was taken as a reference, and some of the ontology labels were customized for this project. Text documents are then labeled according to the new schema.
The high-level model design is shown below, and each block is explained in the later sections. Throughout these sections, “question” and “user expression” mean the same thing, as do “intent” and “category”.
The model is designed to retrieve a set of similar questions for a given question. It consists of the following components:
1. Intent Classification Model:
Suppose a customer asks the bot, “What are the tariff details including breakfast?” The intent classifier would identify the intents “price” and “food” from the question.
The intent classifier identifies the category of a question based on lexical and semantic similarity. At most two categories are identified for a given question. For the example above, questions related to price and food are considered for the similarity calculation.
2. Similarity Model:
Internally, the system compares the asked question with the question set of each predicted category. It retrieves the five to ten most similar questions, ranked by their similarity score against the asked question.
For example, if the identified categories are price and food, the question sets of those categories are retrieved and compared with the asked question using statistical similarity measures. Finally, the top five to ten similar questions are returned.
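
The two components can be sketched together as shown below. This is only an illustration under stated assumptions: `QUESTION_SETS` is a tiny made-up sample, the classifier is approximated by bag-of-words cosine similarity against each category's questions, and the `min_score` cut-off is arbitrary. The actual model also uses semantic similarity, which a bag-of-words score does not capture.

```python
import math
import re
from collections import Counter

# Hypothetical labeled questions (an assumption for this sketch).
QUESTION_SETS = {
    "price": ["What is the room tariff per night?", "Is the tariff inclusive of taxes?"],
    "food": ["Is breakfast included with the room?", "What time is breakfast served?"],
    "check-in": ["What time is check-in?", "Can I check in early?"],
}


def bow(text):
    """Bag-of-words token counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict_intents(question, max_intents=2, min_score=0.1):
    """Return at most max_intents categories, best first."""
    query = bow(question)
    scores = {
        intent: max(cosine(query, bow(candidate)) for candidate in candidates)
        for intent, candidates in QUESTION_SETS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [intent for intent, score in ranked[:max_intents] if score >= min_score]


def similar_questions(question, top_k=5):
    """Rank the questions of the predicted categories by similarity to the query."""
    query = bow(question)
    pool = [c for intent in predict_intents(question) for c in QUESTION_SETS[intent]]
    return sorted(pool, key=lambda c: cosine(query, bow(c)), reverse=True)[:top_k]
```

Restricting the similarity search to the one or two predicted categories keeps the candidate pool small, which is the main reason for running the intent classifier first.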
The full model is then served through a Flask microservice API so that clients can consume its output. With this model, if we wish to train for the check-in intent, the following similar questions are generated from the given question.
At present, the model makes bot training both smarter and faster. It can also capture the context of the user expression.
Last but not least, the bots are now trained at least 100 times faster than expected. With this model design, we can scale to many clients across verticals and domains.