Conversational AI Chatbot using Deep Learning: How Bi-directional LSTM, Machine Reading…

By Kunal Bhashkar

keywords: NLU, NLG, Word Embedding, Tensorflow, RNN, Bi-directional LSTM, Generative Adversarial Network, Machine Reading Comprehension, Transfer Learning, Sequence to Sequence Model with multi-headed attention mechanism, Deep Reinforcement Learning, Self-learning based on Sentiment Analysis, Knowledge base, Recurrent Embedding Dialogue policy, Dual Encoder LSTM, Encoder-Decoder

In this article, I explain how to build a conversational AI based on deep learning. A chatbot is a software program designed to simulate human conversation via text or audio messages. Today’s AI systems can interact with users, understand their needs, map their preferences, and recommend an appropriate course of action with minimal or no human intervention. Many popular conversational agents are available today, such as Apple’s Siri, Microsoft’s Cortana, Google Assistant, and Amazon’s Alexa.

The foundation of a chatbot is providing the best response to any query it receives. The best response answers the sender’s questions, provides relevant information, asks follow-up questions, and carries on the conversation in a realistic way.

The figure below illustrates the conceptual map of a chatbot built with deep learning.

The chatbot needs to be able to understand the intentions of the sender’s message, determine what type of response message (a follow-up question, direct response, etc.) is required, and follow correct grammatical and lexical rules while forming the response. Some models may use additional meta information from the data, such as speaker ID, gender, or emotion. Sometimes sentiment analysis is used to allow the chatbot to ‘understand’ the mood of the user by analysing verbal and sentence-structure cues.

The following figure shows how a deep-learning-based chatbot works internally.

2. Role of NLU, NLG and Dialogue Management in Conversational AI

Natural Language Understanding

The NLU unit is responsible for transforming the user utterance into a predefined semantic frame according to the system’s conventions, i.e. into a format the system can understand. This involves two tasks: slot filling and intent detection. For example, the intent could be a greeting, like Hello, Hi, or Hey, or it could have an inform nature, for example I like Indian food, where the user is giving some additional information. Depending on the domain, the slots can be very diverse, such as actor name, price, start time, or destination city. As we can see, the intents and the slots define the closed-domain nature of the chatbot. Slot filling and intent detection are treated as a sequence tagging problem. For this reason, the NLU component is usually implemented as an LSTM-based recurrent neural network with a Conditional Random Field (CRF) layer on top of it. One such model is a sequence-to-sequence model using a bidirectional LSTM network, which fills the slots and predicts the intent at the same time. Another model does the same using an attention-based RNN. To achieve this, the dataset labels consist of concatenated B–I–O (Begin, Inside, Outside) slot tags, the intent tag, and an additional end-of-string (EOS) tag. As an example, in a restaurant reservation scenario, given the sentence Are there any French restaurants in Toronto downtown?, the task is to correctly output, or fill, the following slots: {cuisine: French} and {location: Toronto downtown}.
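To make the B-I-O labelling concrete, here is a minimal sketch of how a tagged sentence can be turned back into slots. It assumes the tagger has already produced one tag per token; the function name `extract_slots` and the tag names `B-cuisine`, `B-location`, and `I-location` are illustrative choices, not part of any particular library.

```python
def extract_slots(tokens, tags):
    """Collect slot values from parallel lists of tokens and B-I-O tags."""
    slots = {}
    name, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new slot begins; store the previous one if it is still open.
            if name:
                slots[name] = " ".join(words)
            name, words = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            # The current slot continues.
            words.append(token)
        else:
            # An "O" tag (or an inconsistent "I-" tag) closes the open slot.
            if name:
                slots[name] = " ".join(words)
            name, words = None, []
    if name:
        slots[name] = " ".join(words)
    return slots


tokens = "Are there any French restaurants in Toronto downtown ?".split()
tags = ["O", "O", "O", "B-cuisine", "O", "O", "B-location", "I-location", "O"]
slots = extract_slots(tokens, tags)
```

Here `slots` holds the same two slots as the example sentence above: cuisine and location.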

The following figure shows the process of intent classification using a neural network.

Natural Language Generator (NLG)

Natural Language Generation (NLG) is the process of generating text from a meaning representation. It can be seen as the reverse of natural language understanding. NLG systems play a critical role in text summarization, machine translation, and dialog systems. In a dialog system, the NLG component takes the system response, expressed as a semantic frame, and maps it back to a natural language sentence that the end user can understand. The NLG component can be rule-based, model-based, or a hybrid of both. A rule-based NLG outputs predefined template sentences for a given semantic frame, so it is very limited and has no generalisation power. While several general-purpose rule-based generation systems have been developed, they are often quite difficult to adapt to small, task-oriented applications because of their generality. Machine-learning-based (trainable) NLG systems are more common in today’s dialog systems. Such systems use several sources as input: a content plan, which is the meaning representation of what to communicate to the user; a knowledge base, a structured database that returns domain-specific entities; a user model, which imposes constraints on the output utterance; and the dialog history, the information from previous turns that helps with avoiding repetitions, referring expressions, and so on.

Trainable NLG systems can produce various candidate utterances (e.g., stochastically or rule-based) and use a statistical model to rank them. The statistical model assigns a score to each utterance and is learnt from textual data. Most of these systems use bigram and trigram language models to generate utterances.
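As a rough illustration of ranking candidates with a statistical model, the sketch below trains a tiny bigram language model and picks the candidate with the best score. The three-sentence corpus, the add-one smoothing, and the function names are assumptions made for this example; a real system would train on far more text.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over a small corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
        vocab.update(words[1:])
    return contexts, bigrams, len(vocab)


def score(sentence, contexts, bigrams, vocab_size):
    """Average log-probability per bigram, with add-one smoothing."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total / (len(words) - 1)


corpus = [
    "the restaurant serves french food",
    "the restaurant is in toronto",
    "french food is served downtown",
]
model = train_bigram(corpus)
candidates = [
    "the restaurant serves french food",
    "food the french serves restaurant",
]
best = max(candidates, key=lambda c: score(c, *model))
```

The fluent candidate wins because all of its bigrams were seen in the corpus, while the scrambled one relies entirely on smoothed counts.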

On the other hand, an NLG based on a semantically controlled Long Short-Term Memory (LSTM) recurrent network can learn from unaligned data by jointly optimising its sentence planning and surface realisation components using a simple cross-entropy training criterion without any heuristics. Good-quality language variation is obtained simply by randomly sampling the network outputs.

The following figure shows how the semantically controlled LSTM cell works.

Dialogue Management (DM)

The DM can be connected to an external Knowledge Base (KB) or database (DB) so that it can produce more meaningful answers. The Dialogue Manager consists of two components: the Dialogue State Tracker (DST) and Policy Learning, which is the Reinforcement Learning (RL) agent. The DST is a complex and essential component that should correctly infer the belief about the state of the dialogue, given all the history up to that turn. Policy Learning is responsible for selecting the best action, i.e. the system response to the user utterance, which should lead the user towards achieving the goal in a minimal number of dialogue turns.

The following figure shows how the Dialogue State Tracker and the RL agent work together.

Types of dialog management

I will discuss the different types of dialog management and how they handle these principles.

Finite state machine

A Finite State Machine (FSM) is quite powerful. Most conversations can be implemented by an FSM. FSMs are especially good when the number of things a user can say is limited. Most tools for building a conversational bot also provide a tool for drawing a decision diagram, so most bots have an FSM under the hood.
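A minimal sketch of an FSM-driven dialog follows. The state names, intent names, and prompts are hypothetical and stand in for whatever a decision diagram would define; the machine simply stays where it is when it receives an intent it has no transition for.

```python
# Hypothetical states and intents for a restaurant-search dialog.
TRANSITIONS = {
    ("start", "greet"): "ask_cuisine",
    ("ask_cuisine", "inform_cuisine"): "ask_location",
    ("ask_location", "inform_location"): "done",
}

PROMPTS = {
    "start": "Hello!",
    "ask_cuisine": "What kind of food would you like?",
    "ask_location": "Where are you looking?",
    "done": "Thanks, searching now.",
}


def step(state, intent):
    """Return the next state; stay in the current state on unexpected input."""
    return TRANSITIONS.get((state, intent), state)


def run(intents, state="start"):
    """Feed a sequence of intents through the machine and collect the prompts."""
    prompts = []
    for intent in intents:
        state = step(state, intent)
        prompts.append(PROMPTS[state])
    return state, prompts


final_state, prompts = run(["greet", "inform_cuisine", "inform_location"])
```

Keeping the transitions in a table rather than in nested conditionals makes it easy to see, and to limit, what the user can say in each state.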

A network with distributed terminals can sometimes be modelled as a finite state machine with several ports. This leads to the concept of multi-port finite state machines, a generalisation of finite state machines with two ports, as shown in the following figure.

Switch statement

The most basic type of dialog management is a large switch statement. Every intent triggers a different response, e.g. “Hallo” → “Hi!”, “What’s your name?” → “My name is chatbot”, “What does NLU mean?” → “Natural Language Understanding”, “How are you?” → “I’m doing great!”, etc.
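In Python, such a switch is often written as a dictionary lookup. The sketch below assumes an NLU step has already mapped each utterance to an intent name; the intent names and the fallback sentence are made up for illustration.

```python
# Hypothetical intent names; an NLU step is assumed to produce them.
RESPONSES = {
    "greet": "Hi!",
    "ask_name": "My name is chatbot",
    "ask_nlu_meaning": "Natural Language Understanding",
    "ask_mood": "I'm doing great!",
}


def respond(intent):
    """Return the canned response for an intent, or a fallback if it is unknown."""
    return RESPONSES.get(intent, "Sorry, I did not understand that.")
```

For example, `respond("greet")` returns the greeting, and any intent missing from the table falls through to the fallback.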

Goal based

In a complex conversation you cannot think about dialogs as a set of states, because the number of states can quickly become unmanageable. You need to approach conversations differently. A popular way is to think about them in terms of goals.

Say that your user asks for the location of a restaurant without giving its name.

i. Your system will receive a “looking_for_restaurant”-intent and start a new goal “finding_restaurant”.

ii. It will notice that to finish this goal it needs to know the name of the restaurant. It therefore will ask the user for the name.

iii. When the user answers it will first analyze this response to see if it contains the name of the restaurant. If it does, it will save the name in its context.

iv. Finally the system will see if it can now finish the “finding_restaurant”-goal. Since the name of the restaurant is now known, it can look up the restaurant’s location and tell it to the user.

This type of dialog management works based on behaviours instead of states. It’s easier to manage different ways of asking the same question, context switching or making decisions based on what you know about the user.
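The four steps above can be sketched as a small goal object that keeps asking until it has what it needs. The class name, the `restaurant_name` entity, and the `LOCATIONS` table are hypothetical; entity extraction is assumed to happen elsewhere.

```python
# Hypothetical data: restaurant names mapped to locations.
LOCATIONS = {"chez paul": "12 King Street, Toronto"}


class FindRestaurantGoal:
    """Goal that needs a restaurant name before it can be finished."""

    required = ("restaurant_name",)

    def __init__(self):
        self.context = {}

    def handle(self, entities):
        """Store entities from the user's turn, then act on what is still missing."""
        self.context.update(entities)
        missing = [slot for slot in self.required if slot not in self.context]
        if missing:
            return "Which restaurant are you looking for?"
        name = self.context["restaurant_name"]
        location = LOCATIONS.get(name.lower())
        if location is None:
            return f"Sorry, I could not find {name}."
        return f"{name} is at {location}."


goal = FindRestaurantGoal()
first = goal.handle({})  # no name yet, so the goal asks for it
second = goal.handle({"restaurant_name": "Chez Paul"})  # goal can now finish
```

Because the goal only looks at what is in its context, the user can supply the name in the first message or in a later one and the same code handles both.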

Belief based

Most NLU components classify intents and entities with a certain degree of uncertainty. This means that the dialog manager can only assume what the user said; it cannot work with discrete rules and needs to work with beliefs instead.
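One simple way to work with beliefs is to keep a probability for each intent, update it with the NLU confidence on every turn, and only act once the belief is strong enough. The sketch below is an illustration under assumptions: the multiplicative update, the 0.7 threshold, and the intent names are choices made for this example, not a prescribed method.

```python
def update_belief(belief, nlu_scores):
    """Multiply the prior belief by the NLU confidence per intent and renormalise."""
    combined = {i: p * nlu_scores.get(i, 0.0) for i, p in belief.items()}
    total = sum(combined.values())
    if total == 0:
        return dict(belief)  # no usable evidence, keep the prior
    return {i: p / total for i, p in combined.items()}


def choose_action(belief, threshold=0.7):
    """Execute the most likely intent if confident enough, otherwise ask to confirm."""
    intent, prob = max(belief.items(), key=lambda item: item[1])
    return f"execute:{intent}" if prob >= threshold else f"confirm:{intent}"


prior = {"find_restaurant": 0.5, "book_table": 0.5}
turn1 = update_belief(prior, {"find_restaurant": 0.6, "book_table": 0.4})
turn2 = update_belief(turn1, {"find_restaurant": 0.7, "book_table": 0.3})
```

After the first turn the belief is still too weak, so the bot asks for confirmation; after the second turn the evidence has accumulated and it can act.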

3. Types of Conversational AI

Rule Based Chatbot

In a rule-based approach, a bot answers questions based on the rules it has been given. The rules can range from very simple to very complex. Creating such a bot is relatively straightforward, but the bot cannot answer questions whose patterns or keywords do not match its existing rules. One language for writing such rules is AIML (Artificial Intelligence Markup Language). The purpose of AIML is to make dialog modelling easy, following a stimulus-response approach. It is an XML-based, tag-based markup language. Tags are identifiers that mark up code snippets and insert commands into the chatterbot. AIML defines a class of data objects called AIML objects, which are responsible for modelling patterns of conversation.

Example of AIML Code,

Basic Tags:

  1. <aiml>: Defines the beginning and end of an AIML document
  2. <category>: Defines the knowledge in a knowledge base.
  3. <pattern>: Defines the pattern to match what a user may input.
  4. <template>: Defines the response of an Alicebot to user’s input.
<?xml version="1.0" encoding="UTF-8"?>
<aiml version="1.0.1">
  <category>
    <pattern>HELLO BOT</pattern>
    <template>Hello my new friend!</template>
  </category>
</aiml>

The following figure shows the decision tree of a rule-based conversational AI.

Retrieval Based Conversational AI

When given user input, the system uses heuristics to locate the best response from its database of pre-defined responses. Dialogue selection is essentially a prediction problem: identifying the most appropriate response template may involve simple algorithms like keyword matching, or it may require more complex processing with machine learning or deep learning. Regardless of the heuristic used, these systems only regurgitate pre-defined responses and do not generate new output.

With massive data available and information retrieval techniques developing fast, it is intuitive to build a retrieval-based conversational system. Given a user input utterance as the query, the system searches for candidate responses using matching metrics. The core of a retrieval-based conversational system is therefore a matching problem between the query utterance and the candidate responses. A typical way of matching is to measure the inner product of two feature vectors, one representing the query and one the candidate response, in a transformed Hilbert space. The modelling effort boils down to finding the mapping from the original inputs to the feature vectors, which is known as representation learning. One approach uses a two-step retrieval technique to find appropriate responses in a massive data repository: a fast ranking by standard TF-IDF measurement, followed by re-ranking using conversation-oriented features designed with human expertise. Other systems select the most suitable response to the query from question-answer pairs using a statistical language model, as in cross-lingual information retrieval. These methods are based on shallow representations, which basically use one-hot representations of words.

Most strong retrieval systems learn representations with deep neural networks (DNNs). DNNs are highly automated learning machines; they can extract underlying abstract features of data automatically by exploring multiple layers of non-linear transformation. Prevailing DNNs for sentence-level modelling include convolutional neural networks (CNNs) and recurrent neural networks (RNNs). A series of matching methods can be applied to short-text conversations for retrieval-based systems. Basically, these methods model sentences using convolutional or recurrent networks to construct abstract representations. Although not all of these methods were originally designed for conversation, they are effective for short-text matching tasks and are included as strong baselines in retrieval-based conversational studies.
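The shallow TF-IDF ranking step can be sketched with the standard library alone. This is only an illustration: the stored message-response pairs are invented, the IDF formula is one common variant, and matching is done with a normalised inner product (cosine similarity) between sparse vectors.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(messages):
    """Compute IDF weights and TF-IDF vectors for the stored messages."""
    docs = [tokenize(m) for m in messages]
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(len(docs) / count) + 1.0 for word, count in df.items()}
    vectors = [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]
    return idf, vectors


def cosine(a, b):
    """Inner product of two sparse vectors, normalised by their lengths."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pairs):
    """Return the stored response whose message best matches the query."""
    idf, vectors = build_index([message for message, _ in pairs])
    q = {w: tf * idf.get(w, 0.0) for w, tf in Counter(tokenize(query)).items()}
    scores = [cosine(q, v) for v in vectors]
    best = max(range(len(pairs)), key=lambda i: scores[i])
    return pairs[best][1]


# Hypothetical repository of (message, response) pairs.
PAIRS = [
    ("what time do you open", "We open at 9 am."),
    ("where are you located", "We are in downtown Toronto."),
    ("do you serve french food", "Yes, we serve French cuisine."),
]
reply = retrieve("where is the restaurant located", PAIRS)
```

Note that a query sharing no words with any stored message scores zero everywhere, so a real system would add a threshold or a fallback response; the deep models described above exist precisely to match beyond exact word overlap.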


Response Selection with Topic Clues for Retrieval-based Chatbots

Incorporating topic information into message-response matching can boost responses with rich content in retrieval-based chatbots.

Topic Word Generation

The Twitter LDA model, a topic model designed for short texts, can be used to generate topic words for messages and responses. Twitter LDA assumes that each piece of text (a message or a response) corresponds to one topic, and that each word in the text is either a background word or a topic word under the topic of the text.

Topic-aware Convolutional Neural Tensor Network

A topic-aware convolutional neural tensor network (TACNTN) leverages the topic words obtained from the topic model in message-response matching.

Generative Based

A generative model chatbot doesn’t use any predefined repository. This kind of chatbot is more advanced, because it learns from scratch using a process called “Deep Learning.” Generative models are typically based on Machine Translation techniques, but instead of translating from one language to another, we “translate” from an input to an output (response).

Another way to build a conversational system is to use language generation techniques. We can combine language template generation with search-based methods. With deep learning techniques applied, generation-based systems have advanced greatly.

The sequence-to-sequence (seq2seq) framework emerged in the neural machine translation field and was successfully adapted to dialogue problems. The architecture consists of two recurrent neural networks with different sets of parameters: one, called the encoder, encodes the source sequence, and a second, called the decoder, decodes the encoded source sequence into the target sequence. Although it was originally developed for machine translation, it has proven successful at related sequence-to-sequence prediction problems such as text summarization and question answering.

Encoder: The encoder takes the input data and is trained on it; it then passes the last state of its recurrent layer as the initial state to the first recurrent layer of the decoder.

Working of Encoder

The encoder RNN reads a sequence of context tokens one at a time and updates its hidden state. After processing the whole context sequence, it produces a final hidden state, which incorporates the sense of the context and is used for generating the answer.

Decoder: The decoder takes the last state of the encoder’s last recurrent layer and uses it as the initial state of its first recurrent layer. The input of the decoder is the target sequences we want to produce (the responses for a chatbot, or for example French sentences in a translation task).

How Does the Decoder Work?

The goal of the decoder is to take context representation from the encoder and generate an answer. For this purpose, a softmax layer over vocabulary is maintained in the decoder RNN. At each time step, this layer takes the decoder hidden state and outputs a probability distribution over all words in its vocabulary.
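The softmax step can be illustrated without any deep learning library. In the sketch below the decoder hidden state, the projection weights, and the three-word vocabulary are made-up numbers; in a trained model they would be learned parameters.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_distribution(hidden, weights, bias):
    """Project the decoder hidden state onto the vocabulary and normalise."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy values: one weight row per vocabulary word, hidden state of size 2.
vocab = ["<eos>", "hello", "there"]
hidden = [0.5, -1.0]
weights = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.3]]
bias = [0.0, 0.0, 0.0]

probs = output_distribution(hidden, weights, bias)
next_word = vocab[probs.index(max(probs))]  # greedy choice of the next token
```

Picking the most probable word at each step is greedy decoding; sampling from `probs` or using beam search are common alternatives.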


Ensemble of Retrieval- and Generation-Based Dialog Systems

Typically, a recurrent neural network (RNN) captures the query’s semantics with one or a few distributed, real-valued vectors (also known as embeddings); another RNN decodes the query embedding into a reply. Deep neural networks allow complicated interactions through multiple non-linear transformations; RNNs are also well suited to modelling time-series data (e.g., a sequence of words), especially when enhanced with long short-term memory (LSTM) or gated recurrent units (GRUs). Despite this, RNNs have their own weakness when applied to dialog systems: the generated sentence tends to be short, universal, and meaningless, for example “I don’t know” or “something”. This is probably because chatbot-like dialogs are highly diversified and a query may not convey sufficient information for the reply. Even though such universal utterances may suit certain dialog contexts, they bore users and make them lose interest, and are thus undesirable in real applications.

In an ensemble of retrieval and generative dialog systems, given a user-issued query, we first obtain a candidate reply by information retrieval from a large database. The query, along with the candidate reply, is then fed to an utterance generator based on the “bi-sequence to sequence” (biseq2seq) model. Such a sequence generator takes into consideration the information contained not only in the query but also in the retrieved reply; hence, it alleviates the low-substance problem and can synthesize replies that are more meaningful. After that, we use the scorer in the retrieval system again for post-reranking. This step can filter out less relevant retrieved replies or meaningless generated ones. The highest-ranked candidate (either retrieved or generated) is returned to the user as the reply. Basically, the retrieval and generative systems are integrated by two mechanisms:

(1) The retrieved candidate is fed to the sequence generator to mitigate the “low-substance” problem; (2) The post-reranker can make better use of both the retrieved candidate and the generated utterance.

The following figure depicts the overall framework of the ensemble of retrieval and generative dialog systems.

AIML Knowledge base (KB) Conversational AI

A KB that stores facts as (entity, relation, entity) triples is often called a Knowledge Graph (KG) because of its graphical representation: the entities are nodes, and the relations are the directed edges that link the nodes.

The basic concept of a knowledge base is shown in the following figure.

Most state-of-the-art symbolic approaches to KB-QA are based on semantic parsing, where a question is mapped to its formal meaning representation (e.g., logical form) and then translated to a KB query. The answers to the question can then be obtained by finding a set of paths in the KB that match the query and retrieving the end nodes of these paths. Knowledge based systems have been helping humans to solve problems which are intellectually difficult, but easy for machines. These problems typically are easily represented with a set of formal rules.
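To illustrate answering a question by following paths in a KB, the sketch below stores a few facts as triples and walks a chain of relations to the end nodes. The triples, relation names, and function names are hypothetical, and the semantic parsing step that would turn a question into the relation chain is assumed to have happened already.

```python
# Hypothetical facts stored as (subject, relation, object) triples.
TRIPLES = [
    ("Chez Paul", "serves", "French"),
    ("Chez Paul", "located_in", "Toronto"),
    ("Taj Palace", "serves", "Indian"),
    ("Toronto", "part_of", "Canada"),
]


def query(triples, subject=None, relation=None, obj=None):
    """Return all triples matching the given fields; None acts as a wildcard."""
    return [
        (s, r, o)
        for s, r, o in triples
        if subject in (None, s) and relation in (None, r) and obj in (None, o)
    ]


def follow_path(triples, start, relations):
    """Follow a chain of relations from a start node and return the end nodes."""
    nodes = {start}
    for relation in relations:
        nodes = {o for s, r, o in triples if s in nodes and r == relation}
    return nodes


french_places = query(TRIPLES, relation="serves", obj="French")
country = follow_path(TRIPLES, "Chez Paul", ["located_in", "part_of"])
```

A question such as "In which country is Chez Paul?" would map to the two-step path `located_in` then `part_of`, and the answer is the set of end nodes.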

The following figure shows a KB representation graph centred on the question Q1. In the graph, nodes are patterns (P) and templates (T), and edges are P-T associations and T-P semantic recursions.

Knowledge bases (KB) are powerful tools that can be used to augment conversational models. Since knowledge bases usually entail some kind of domain specific information, these techniques are mainly used for task-oriented dialog systems. In a KB, information related to the task at hand can be stored, for example information about nearby restaurants or about public transportation routes. Simple dictionaries or look-up-tables can be used to match an entity with information about it. Since KBs store information discretely, their integration with neural network based encoder-decoder models is not trivial.

The following figure shows how KB searching happens.

In one example of a KB mechanism, the encoder-decoder model produces a response that uses general tokens for locations and times, and a special placeholder token for the KB result. Finally, the general tokens are transformed back to actual words using the stored table; a KB uses these general tokens to search for a route between the two places, and its output is incorporated in the response. A similar KB-augmented encoder-decoder model is used for the task of recommending restaurants. Here, besides a standard encoder RNN, the source utterance is also processed by a belief tracker, implemented as a convolutional neural network (CNN). Belief tracking is an important part of task-oriented spoken dialog systems. The belief tracker network produces a query for a database containing information about restaurants. The final input to the decoder RNN is a weighted sum of the last state of the encoder RNN and a categorical probability vector from the belief tracker. The decoder then outputs a response in the same way as in the previous example, with lexicalised general tokens. These tokens are then replaced with the actual information that they point to in the KB.

Self Learning: Recurrent Embedding Dialogue policy (REDP)

A self-learning bot combines Natural Language Processing with a training model, which enables the bot to ‘learn’ to understand a sentence; context, to be able to carry on a conversation; and history, to learn from previous conversations. A grand challenge in this field is to create software which is capable of holding extended conversations, carrying out tasks, keeping track of conversation history, and coherently responding to new information. The aim is to learn vector embeddings for dialogue states and system actions in a supervised setting.

The following figure shows how the chatbot’s response machinery is connected to all the components.

When we ask a user “what price range are you looking for?”, they might respond with:

  • “Why do you need to know that?” (narrow context)
  • “Can you show me some restaurants yet?” (broad context)
  • “Actually no I want Chinese food” (correction)
  • “I should probably cook for myself more” (chitchat)

We call all of this uncooperative behaviour. There are many other ways a user might respond. Here’s an example conversation:


At inference time, the current state of the dialogue is compared to all possible system actions, and the one with the highest cosine similarity is selected.
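The embed-and-rank selection step can be sketched as follows. The three-dimensional vectors and action names are invented for illustration; in REDP both the dialogue-state embedding and the action embeddings would come from the trained network.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_action(state_vec, action_vecs):
    """Pick the action whose embedding is most similar to the dialogue state."""
    return max(action_vecs, key=lambda name: cosine(state_vec, action_vecs[name]))


# Toy embeddings; real ones would be learned during training.
ACTIONS = {
    "ask_price_range": [1.0, 0.0, 0.0],
    "show_restaurants": [0.0, 1.0, 0.0],
    "respond_chitchat": [0.0, 0.0, 1.0],
}
state = [0.9, 0.1, 0.0]
chosen = select_action(state, ACTIONS)
```

Because actions are compared in a shared embedding space rather than scored by a fixed classifier layer, new actions can be added by giving them an embedding.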

REDP, a new dialogue policy, has two benefits: (1) it’s much better at learning how to deal with uncooperative behaviour, and (2) it can re-use this information when learning a new task.

It uses the same idea to deal with uncooperative users. After responding correctly to a user’s uncooperative message, the assistant should return to the original task and be able to continue as though the deviation never happened. REDP achieves this by adding an attention mechanism to the neural network, allowing it to ignore the irrelevant parts of the dialogue history. The image below is an illustration of the REDP architecture (a full description is in the paper). The attention mechanism is based on a modified version of the Neural Turing Machine, and instead of a classifier we use an embed-and-rank approach.

The following figure shows how REDP works.

Attention has been used in dialogue research before, but the embedding policy is the first model which uses attention specifically for dealing with uncooperative behaviour, and also to reuse that knowledge in a different task. One advantage of this approach is that target labels can be represented as a bag of multiple features, allowing us to represent system actions as a composition of features. In general, the features describing a particular action can come from a number of sources, including the class hierarchy, the name of the action, and even features derived from the code itself (such as which functions are called). Whatever the source of the features, similar actions should have more features in common than dissimilar actions, and ideally reflect the structure of the domains. In our experiments we only derive features from the action name, either taking the whole name as a single feature, or splitting the name into tokens and representing it as a bag of words.

4. Intent Identification and Information Extraction

The machine learning algorithm for intent identification can be either supervised or unsupervised. With the supervised approach, we need to manually label hundreds of training examples, which is tiring and boring; with the unsupervised approach, there were several critical knowledge gaps that we could not cover in just 3 weeks, especially regarding the design of the training process. Therefore, even though we needed to label our data manually, we chose the supervised approach. The figure below illustrates how a text classifier using supervised ML works.

At this point, we had several machine learning algorithms to choose from for supervised learning: Naive Bayes, LDA, SVM, and neural networks. But before choosing the algorithm, we needed a method to translate words into an array (word embedding), since all the algorithms mentioned above need input in the form of an array, or at least numbers. We had two options: a one-hot encoded bag of words (BOW) or word2vec (CBOW). If we had had more time, we would definitely have chosen word2vec to embed the input, since the array would be significantly smaller than with BOW. But we had limited time, and to implement word2vec we would have needed Java (deeplearning4j) or Python (gensim), and none of us had any experience building an API in these languages. It was actually possible for us to create the classifier in Python, but the problem would have come when making an API out of it, especially in deployment. Deploying Python on the live server requires several configuration changes, and we did not have the courage to play around with our company server, since everyone else was also using it for other projects. So, for the sake of familiarity, we decided to use BOW, for which we managed to find a node package called mimir.

In deep learning, intent identification can work as three layers of processing: an encoder network, an intention network, and a decoder network. The encoder network takes the current source-side input. Because the source side in the current turn also depends on the previous turn, the source-side encoder network is linked with the output from the previous target side. The encoder network creates a representation of the source side in the current turn. The intention network depends on its past state, so that it memorises the history of intentions. It is therefore a recurrent network, taking a representation of the source side in the current turn and updating its hidden state. The decoder is a recurrent network for language modelling that outputs a symbol at each time step. This output depends on the current intention from the intention network. It also pays attention to particular words on the source side. In NLU, the functions, or dialogue acts (DAs), are often domain specific. In other words, instead of asking whether the function of the user’s utterance is a question or an answer, we ask whether the function is to, for example, find flights or cancel a reservation in a flight reservation program. Domain-specific dialogue acts are called intents. Intent identification has been most prominently used by call centre bots, which ask the user “how can I help you?” and subsequently use intent identification to redirect the user to one of N pre-defined redirection options. Many of the same machine learning algorithms used for DA classification are used for intent identification.

Regarding information extraction, the primary responsibility of the NLU is not just to understand phrase function, but to understand the meaning of the text itself. To extract meaning from text, we convert unstructured text (text typed into a text-only chatbot) into structured grammatical data objects, which are further processed by the Dialogue Manager. The first step in this process is breaking a sentence down into tokens that represent each of its component parts: words, punctuation marks, numbers, etc. Tokenization is difficult because of the frequency of ambiguous or malformed inputs, including phrases, contractions, abbreviations, and periods. These tokens can be analyzed using a number of techniques, described below, to create a number of different data structures that can be processed by the dialogue manager.

There are a few approaches that can be used for information retrieval, as described below.

Bag of Words: We ignore sentence structure, order, and syntax, and count the number of occurrences of each word. We use this to form a vector space model, in which stop words are removed and morphological variants go through a process called lemmatization and are stored as instances of the basic lemma. In the dialogue manager phase, assuming a rule-based bot, the resulting words are matched against documents stored in the bot’s knowledge database to find the documents with inputs containing similar keywords. The bag of words approach is simple because it does not require knowledge of syntax, but, for this same reason, it is not precise enough to solve more complex problems.
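A minimal bag-of-words sketch is shown below. The stop-word set and the tiny lemma table are placeholders made up for this example; a real system would use a proper lemmatizer and a fuller stop-word list.

```python
from collections import Counter

# Toy resources, assumed for illustration only.
STOP_WORDS = {"the", "a", "is", "in", "to", "i"}
LEMMAS = {"restaurants": "restaurant", "serves": "serve", "served": "serve"}


def bag_of_words(text):
    """Count lemmas in a text, ignoring word order, punctuation, and stop words."""
    tokens = [token.strip(".,?!").lower() for token in text.split()]
    return Counter(
        LEMMAS.get(token, token)
        for token in tokens
        if token and token not in STOP_WORDS
    )


bow = bag_of_words("The restaurant serves food. Restaurants served food!")
```

Both sentences collapse onto the same three lemmas, which is exactly why the representation is easy to match but blind to word order.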

Latent Semantic Analysis: This approach is similar to the bag of words. However, meanings or concepts, not words, are the basic unit of comparison parsed from a given sentence or utterance, and groups of words that co-occur frequently are grouped together. In LSA, we create a matrix where each row represents a unique word, each column represents a document, and the value of each cell is the frequency of the word in the document. We compute the distance between the vector representing each utterance and document, using singular value decomposition to reduce the dimensionality of the matrix, and determine the closest document.

Regular Expressions: Sentences / utterances can be treated as regular expressions, and can be pattern matched against the documents in the bot’s knowledge database. For example, imagine that one of the documents in the bot’s knowledge database handles the case where the user inputs the phrase: “my name is *”. “*” is the wildcard character, and indicates that this regular expression should be triggered whenever the bot hears the phrase “my name is” followed by anything. If the user says “my name is Jack”, this phrase will be matched against a number of regular expressions, including “my name is *”, and will trigger the retrieval of that document.
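The wildcard pattern above can be translated into a Python regular expression with the standard `re` module. The helper name `compile_pattern` is an illustrative choice; the sketch escapes the literal text and turns each `*` into a capturing group.

```python
import re


def compile_pattern(pattern):
    """Turn a wildcard pattern such as "my name is *" into a compiled regex."""
    escaped = re.escape(pattern).replace(r"\*", "(.+)")
    return re.compile(f"^{escaped}$", re.IGNORECASE)


def match_wildcard(pattern, utterance):
    """Return the text captured by the first wildcard, or None if there is no match."""
    found = compile_pattern(pattern).match(utterance.strip())
    return found.group(1) if found else None


name = match_wildcard("my name is *", "My name is Jack")
```

The captured text can then be stored in the dialogue context, for example to greet the user by name later in the conversation.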

Part of Speech (POS) Tagging: POS tagging labels each word in the input string with its part of speech (e.g. noun, verb, adjective, etc.). These labels can be rule-based (a manually created set of rules specifies the part of speech for ambiguous words given their context). They can also be created using stochastic models which train on sentences labeled with the correct POS. In the dialogue manager, POS can be used to store relevant information in the dialogue history. POS is also used in response generation to indicate the POS object type of the desired response.

Named/Relation Entity Recognition: In named entity recognition (NER), the names of people, places, groups, and locations are extracted and labeled accordingly. NER-name pairs can be stored by the dialogue manager in the dialogue history to keep track of the context of the bot’s conversation. Relation extraction goes one step further to identify relations (e.g. “who did what to whom”) and label each word in these phrases.

Semantic Role Labelling: The arguments of a verb are labelled based on their semantic role (e.g. subject, theme, etc.). In this process, the predicate is labelled first followed by its arguments. Prominent classifiers for semantic role labelling have been trained on FrameNet and PropBank, databases with sentences already labelled with their semantic roles. These semantic role-word pairs can be stored by the dialogue manager in the dialogue history to keep track of context.

Creation of Grammatical Data Structures: Sentences and utterances can be stored in a structured way using grammar formalisms such as context-free grammars (CFGs) and dependency grammars (DGs). Context-free grammars yield tree-like data structures that represent sentences as containing noun phrases and verb phrases, each of which contains nouns, verbs, subjects, and other grammatical constructs. Dependency grammars, by contrast, focus on the relationships between words.

Statistical Methods for Information Extraction

Hidden Vector State (HVS) Model: The goal of statistical hidden vector state models is to automatically produce an accurate structured meaning. Consider the example “I want to return to Dallas on Thursday.” The parse tree below shows one way of representing the structured meaning of the sentence. SS represents the initial node sent_start, and SE represents the end node sent_end. We view each leaf node as a vector state, described by its parent nodes: the vector state of Dallas is [CITY, TOLOC, RETURN, SS].

source

The whole parse tree can then be thought of as a sequence of vector states, represented by the sequence of squares above. If each vector state is thought of as a hidden variable, then the sequence of vector states (e.g. the squares above) can be thought of as a Hidden Markov Model: we start at SS, and have certain probabilities of reaching a number of possible hidden states as the next state. Each vector state can be thought of as a “push-down automaton” or stack.

Support Vector Machine (SVM) Model: Support Vector Machines are a supervised machine learning tool. Given a set of labeled training data, the algorithm generates the optimal hyperplane that divides the samples according to their labels. Traditionally, SVMs are thought of as solving binary classification problems; however, multiple hyperplanes can be used to divide the data into more than two label categories. The optimal hyperplane is defined as the hyperplane that creates the maximum margin, or distance, between sets of data points with different labels.

Conditional Random Field Models: CRFs are log-linear statistical models often applied to structured prediction. Unlike an ordinary classifier, which predicts a label for a single object and ignores context, CRFs take into account previous features of the input sequence through the use of conditional probabilities. A number of different features can be used to train the model, including lexical information, prefixes and suffixes, capitalization and other features.

Deep Learning: The most recent advancement in the use of statistical models for concept structure prediction is deep learning for natural language processing, or deep NLP. Deep learning neural network architectures differ from traditional neural networks in that they use more hidden layers, with each layer handling increasingly complex features. As a result, the networks can learn from patterns and unlabelled data, and deep learning can be used for unsupervised learning. Deep learning methods have been used to generate POS tags of sentences (chunk text into noun phrases, verb phrases, etc.) and for named-entity recognition and semantic role labelling.

The picture below illustrates the working of a deep learning based statistical model:

source

5. Self-learning Based on Sentiment Analysis

Initially, the development of a bot was based on two fundamental components :

Natural Language Understanding module, used by the Dialogue Manager, that processes the user input to search for keywords through which to understand the action to be taken.

Natural Language Generation module that generates answers from the information gathered by the Dialogue Manager.

Over time, we have faced a real evolution in the development of task-oriented conversational agents because of the availability of deep learning techniques.

The picture below illustrates the process of sentiment analysis in user-generated content:

source

The training process for sentiment analysis provides for automatic labeling of new instances. In this method, each sentence is analyzed by two classification sub-systems: one identifies the class of the answer and the other assesses the sentiment of the sentence. At the end of the processing of each sentence, the learning model is updated according to the detected sentiment. The model is based on a data structure formed by intents (an intent is a semantic label representing an intention of the end user). For each intent, there is a set of sentences that represent it. Each sentence that describes an intent contains entities (the parameters of the intent that help define the specific user request), which are attributes specific to the given intent.

An example taken from the dataset is shown below.

Request: “I need days off from tomorrow to the day after tomorrow”.

Intent: LEAVE REQUEST

Entities:

– start date: tomorrow.

– end date: day after tomorrow.

In this example, the scenario of an employee requesting holidays is represented by the LEAVE REQUEST intent and by the start date and end date entities.

An example of how the method works in case of positive sentiment is shown below.

example: Time off request

Request: “Hi” {intent: [HELLO] detected: [Hi] sentiment: [neutral]}

Bot: “Hi, Dave”

Request: “I’m stuck in traffic, I’ll be there soon” {intent: [TIMEOFF REQUEST] detected: [there, soon] new words: [stuck, traffic] sentiment: [neutral]}

Bot: “Ok, do you want to create a time off request?”

Request: “Yes, thank you!” {intent: [CONFIRM] detected: [Yes, thank, you] sentiment: [positive]}

In this example the agent detects the correct intent from the words there and soon and enriches the dictionary with stuck and traffic. If these words are often used for a time off request, they will become characteristic of this intent (e.g. in the future Dave will be able to write “I’m stuck in traffic” or “There is traffic” to request time off).

An example of how the proposed method works in case of negative sentiment is shown below.

example: Time off request

Request: “Hi” {intent: [HELLO] detected: [Hi] sentiment: [neutral]}

Bot: “Hi, Dave”

Request: “Tomorrow I’ll be busy” {intent: [LEAVE REQUEST] detected: [tomorrow, busy] sentiment: [neutral]}

Bot: “Ok, do you want to create a leave request?”

Request: “No, that’s not what I want!” {intent: [NOT CONFIRM] detected: [No, that’s not, what, I, want] sentiment: [negative]}

In this example the agent detects an incorrect intent from the words tomorrow and busy. In the future, if the bot always receives a negative response to the request it proposes, the words found will no longer be characteristic of that intent and can be eliminated entirely.
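The two dialogues above suggest a simple update rule, sketched here under assumptions. The article gives no numeric weights, so the reward values, the initial weight of 2.0 and the IntentDictionary class are hypothetical choices made only to show how positive feedback can reinforce words and repeated negative feedback can remove them.

```python
from collections import defaultdict

# Hypothetical reward per detected sentiment; the article gives no numbers.
SENTIMENT_REWARD = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


class IntentDictionary:
    """Keeps a weight for every (intent, word) pair seen so far."""

    def __init__(self):
        self.weights = defaultdict(dict)  # intent -> {word: weight}

    def add_words(self, intent, words, initial_weight=2.0):
        for word in words:
            self.weights[intent].setdefault(word, initial_weight)

    def update(self, intent, words, sentiment):
        """Reinforce or weaken the words that led to the proposed intent."""
        reward = SENTIMENT_REWARD[sentiment]
        for word in words:
            new_weight = self.weights[intent].get(word, 0.0) + reward
            if new_weight <= 0.0:
                # The word is no longer characteristic of this intent.
                self.weights[intent].pop(word, None)
            else:
                self.weights[intent][word] = new_weight


kb = IntentDictionary()
kb.add_words("TIMEOFF REQUEST", ["there", "soon"])
kb.add_words("LEAVE REQUEST", ["tomorrow", "busy"])

# Positive dialogue: the user confirms, so the new words are learned too.
kb.update("TIMEOFF REQUEST", ["there", "soon", "stuck", "traffic"], "positive")

# Negative dialogue: one rejection only weakens the words that were used.
kb.update("LEAVE REQUEST", ["tomorrow", "busy"], "negative")
```

In this sketch a single rejection lowers the weight of tomorrow and busy, and a second rejection removes them from the LEAVE REQUEST intent, which mirrors the idea that only consistently negative feedback eliminates a word.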

We can also define a dictionary that the bot uses to translate the type of some words. For example, the terms “tomorrow” and “day after tomorrow” are assigned to the type date.

Classification of intents

The classification problem considered in our method is determining the intent and the entities associated with a given user sentence. User sentences are represented as a bag of words, without considering the order of the words. To improve classification accuracy, we also use a vocabulary of word n-grams built with an N-gram model. The classification algorithm is based on the Naive Bayes text classifier, a statistical technique able to estimate the probability of an element belonging to a certain class. The Naive Bayes technique estimates the conditional probability of each word given the classification category, associating with every word of an intent a numerical value that we will consider as a weight. The words that characterize an intent will have greater weight because they are found only within that intent, whereas non-characterizing words appear in numerous intents.
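A bag-of-words Naive Bayes intent classifier of the kind described above could look roughly like this. It is a sketch only: the training sentences and intent names are invented for illustration, and Laplace (add-one) smoothing is an assumption, since the article does not say how unseen words are handled.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training sentences; the article's dataset is not available.
TRAINING = [
    ("LEAVE", "i want to go on holidays"),
    ("LEAVE", "i am tired i need to rest"),
    ("LEAVE", "i want holidays for this month"),
    ("HELLO", "hi there"),
    ("HELLO", "hello good morning"),
]


def train(examples):
    """Count words per intent and sentences per intent."""
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    for intent, sentence in examples:
        intent_counts[intent] += 1
        word_counts[intent].update(sentence.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, intent_counts, vocabulary


def classify(sentence, word_counts, intent_counts, vocabulary):
    """Return the intent with the highest log posterior (Laplace smoothing)."""
    total = sum(intent_counts.values())
    best_intent, best_score = None, -math.inf
    for intent in sorted(intent_counts):
        score = math.log(intent_counts[intent] / total)  # log prior
        n_words = sum(word_counts[intent].values())
        for word in sentence.split():
            likelihood = (word_counts[intent][word] + 1) / (n_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


MODEL = train(TRAINING)
```

Calling classify("i need holidays", *MODEL) selects the LEAVE intent because the words need and holidays occur only in its training sentences, which is the "greater weight" effect described in the text.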

For example, given an intent Leave representing requests from a user regarding leave, we would like sentences such as “I want to go on holidays”, “I’m tired, I need to rest”, “I want holidays for this month”, etc. to be classified as Leave. The main idea is to provide the agent with the ability to automatically collect feedback about its answers in order to improve its knowledge base. To this end, we experimented with the use of sentiment analysis. To detect the sentiment of user sentences, we defined another classification problem from user input to three classes (Positive, Negative and Neutral) and again used a Naive Bayes approach to train this classifier on a specific dataset. For every user sentence, we keep track of a local and a global sentiment score: the local score is about the last sentence, while the global score is an average value across the dialogue. Furthermore, to improve the idea, we can define some particular intents that act as modifiers.

For example, when the user corrects the bot with phrases like “I’m sorry I did not mean this”, this is considered negative feedback, while phrases containing specific thanks, such as “Thank you! I was trying to do exactly this!”, provide positive feedback.
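The local and global sentiment scores mentioned above can be tracked with a few lines of Python. The numeric values assigned to the three classes and the SentimentTracker name are assumptions made for this sketch.

```python
class SentimentTracker:
    """Tracks the local (last sentence) and global (dialogue average) score."""

    # Hypothetical numeric values for the three sentiment classes.
    SCORES = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

    def __init__(self):
        self.history = []

    def add(self, label):
        """Record the sentiment label detected for the latest sentence."""
        self.history.append(self.SCORES[label])

    @property
    def local_score(self):
        return self.history[-1] if self.history else 0.0

    @property
    def global_score(self):
        return sum(self.history) / len(self.history) if self.history else 0.0


tracker = SentimentTracker()
for label in ["neutral", "neutral", "negative", "positive"]:
    tracker.add(label)
```

After these four sentences the local score reflects only the last (positive) sentence, while the global score averages the whole dialogue.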

6. Sequence2Sequence Model with Multihead Attention Mechanism

Recurrent Neural Network

Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks. The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea. If you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps. Here is what a typical RNN looks like:

source

Let’s consider the following sequence — Bangalore is the largest city of ______. It is easy to fill the blank with India. This means that there is information about the last word encoded in the previous elements of the sequence.

The idea behind this architecture is to exploit the sequential structure of the data. The name of these neural networks comes from the fact that they operate in a recurrent way. This means that the same operation is performed for every element of a sequence, with its output depending on the current input and the previous operations.
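The recurrence just described can be sketched as a vanilla RNN forward pass in NumPy. The layer sizes, the random initialisation and the names W_xh, W_hh and rnn_forward are assumptions for illustration, not the setup of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # hypothetical sizes

# The same parameters are reused at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)


def rnn_forward(inputs):
    """Run a vanilla RNN over a sequence and return all hidden states."""
    h = np.zeros(hidden_size)  # the "memory" starts empty
    states = []
    for x_t in inputs:
        # The new state depends on the current input and the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)


sequence = rng.normal(size=(5, input_size))  # 5 time steps
hidden_states = rnn_forward(sequence)
```

Each row of hidden_states summarises the sequence up to that time step, so the state at step t never depends on inputs that come after t.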

The following picture shows the working of RNN for language modeling,

source

Recurrent Neural Networks can be used in a variety of scenarios depending on how the inputs are fed and the outputs are interpreted. These scenarios can be divided into three main classes:

Sequential input to sequential output

Machine translation, part-of-speech tagging and language modeling tasks lie within this class.

Sequential input to single output

One task with this property is sentiment analysis, in which we feed a sentence and want to classify it as positive, neutral or negative.

Single input to sequential output

This is, for example, the case of image captioning, where we feed a picture to the RNN and want to generate a description of it.

Deep RNN with Multilayer Perceptron

Deep architectures of neural networks can represent a function exponentially more efficiently than shallow architectures. While recurrent networks are inherently deep in time, given that each hidden state is a function of all previous hidden states, it has been shown that the internal computation is in fact quite shallow. It is argued that adding one or more nonlinear layers in the transition stages of an RNN can improve overall performance by better disentangling the underlying variations of the original input. The deep structures in RNNs with perceptron layers can fall under three categories: input to hidden, hidden to hidden, and hidden to output.

source

Bi-Directional Recurrent Neural Network

The idea of a BRNN is to split the state neurons of a regular RNN into a part that is responsible for the positive time direction (forward states) and a part for the negative time direction (backward states). Outputs from forward states are not connected to inputs of backward states, and vice versa. This matters because you might have to learn representations from future time steps to better understand the context and eliminate ambiguity. Take the following examples: “He said, Teddy bears are on sale” and “He said, Teddy Roosevelt was a great President”. In these two sentences, when we are looking at the word “Teddy” and the previous two words “He said”, we might not be able to tell whether the sentence refers to the President or to teddy bears. Therefore, to resolve this ambiguity, we need to look ahead. This is what bidirectional RNNs accomplish.
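A bidirectional RNN can be sketched as two independent recurrent passes whose states are concatenated per time step. The sizes, the random weights and the function names below are hypothetical; the point is only that the backward half of each output vector has already "seen" the later words.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 4, 3  # hypothetical sizes


def make_params():
    """Independent parameters for one direction."""
    return (
        rng.normal(scale=0.5, size=(hidden_size, input_size)),
        rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
    )


def run_direction(inputs, params):
    """Plain recurrent pass over the inputs in the order given."""
    W_xh, W_hh = params
    h = np.zeros(hidden_size)
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return np.stack(states)


forward_params, backward_params = make_params(), make_params()


def birnn(inputs):
    """Concatenate forward states with (re-aligned) backward states."""
    forward = run_direction(inputs, forward_params)
    backward = run_direction(inputs[::-1], backward_params)[::-1]
    return np.concatenate([forward, backward], axis=1)
```

In the “Teddy” example, the forward half of the vector for “Teddy” only encodes “He said, Teddy”, while the backward half encodes the words that follow, which is what lets a tagger separate the two readings.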

The following picture illustrates the general structure of the bidirectional recurrent neural network:

source

Multidimensional Recurrent Neural Network

The basic idea of multidimensional recurrent neural networks (MDRNNs) is to replace the single recurrent connection found in standard recurrent networks with as many connections as there are spatio-temporal dimensions in the data. These connections allow the network to create a flexible internal representation of surrounding context, which is robust to localised distortions. An MDRNN hidden layer scans through the input in 1D strips, storing its activations in a buffer. The strips are ordered in such a way that at every point the layer has already visited the points one step back along every dimension. The hidden activations at these previous points are fed to the current point through recurrent connections, along with the input.

RNN architectures used so far have been explicitly one-dimensional, meaning that in order to use them for multi-dimensional tasks, the data must be preprocessed to one dimension, for example by presenting one vertical line of an image at a time to the network. The most successful use of neural networks for multi-dimensional data has been the application of convolutional networks to image processing tasks such as digit recognition. One disadvantage of convolutional nets is that, because they are not recurrent, they rely on hand-specified kernel sizes to introduce context. Another disadvantage is that they don’t scale well to large images. For example, sequences of handwritten digits must be pre-segmented.

source

Long Short-Term Memory or LSTM Network

An LSTM network is a recurrent neural network that has LSTM cell blocks in place of the standard neural network layers. These cells have various components called the input gate, the forget gate and the output gate. RNNs are good at handling sequential data, but they run into problems when the relevant context is far away.

Example: I live in France and I know ____. The answer must be ‘French’ here, but if there are many more words between ‘I live in France’ and ‘I know ____’, it will be difficult for an RNN to predict ‘French’. This is the problem of long-term dependencies. Hence we come to LSTMs.

source

LSTMs are explicitly designed to avoid the long-term dependency problem. LSTMs also help with the vanishing/exploding gradient problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn! All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.
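In an LSTM the repeating module is richer than a single tanh layer. One common formulation of a single LSTM step, with the input, forget and output gates named earlier, can be sketched in NumPy as follows. The stacked weight layout, the sizes and the name lstm_step are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, input + hidden) and holds the stacked weights
    of the input gate, forget gate, output gate and candidate cell state.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate: what to write
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate: what to keep
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: what to expose
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell content
    c_t = f * c_prev + i * g                # updated long-term memory
    h_t = o * np.tanh(c_t)                  # new hidden state / output
    return h_t, c_t


rng = np.random.default_rng(2)
input_size, hidden_size = 4, 3  # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
```

When the forget gate is fully open and the input gate is closed, the cell state is copied unchanged from one step to the next, which is why remembering over long spans is the default behavior.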

The picture below illustrates how the input gate, forget gate and output gate work together:

source

Bidirectional LSTM

The basic idea of bidirectional recurrent neural nets (BRNNs) is to present each training sequence forwards and backwards to two separate recurrent nets, both of which are connected to the same output layer. (In some cases a third network is used in place of the output layer, but here we have used the simpler model.) This means that for every point in a given sequence, the BRNN has complete, sequential information about all points before and after it. Also, because the net is free to use as much or as little of this context as necessary, there is no need to find a (task-dependent) time window or target delay size. BRNNs have given improved results in sequence learning tasks. It is possible to increase the capacity of BRNNs by stacking hidden layers of LSTM cells in space, which is called a deep bidirectional LSTM (BLSTM).

BLSTM networks are more powerful than unidirectional LSTM networks. These networks theoretically involve all information of the input sequence during computation. The distributed representation feature of BLSTM is crucial for different applications such as language understanding. In a BLSTM, the forward and backward passes over the network unfolded in time are carried out in a similar way to regular network forward and backward passes, except that we need to unfold the hidden states for all time steps. We also need special treatment at the beginning and the end of the data points.

The picture below illustrates a BLSTM for tagging named entities, in which multiple tables look up word-level feature vectors:

source

The long short-term memory (LSTM) unit with the forget gate allows highly non-trivial long-distance dependencies to be easily learned. For sequential labelling tasks such as NER and speech recognition, a bi-directional LSTM model can take into account an effectively infinite amount of context on both sides of a word and eliminates the problem of limited context that applies to any feed-forward model. While LSTMs have been studied in the past for the NER task by Hammerton, the lack of computational power (which led to the use of very small models) and of quality word embeddings limited their effectiveness.

The picture below illustrates how a fully connected LSTM works:

source

LSTM-CRF networks

In CRF networks there are two different ways to make use of neighbor tag information in predicting current tags. The first is to predict a distribution of tags for each time step and then use beam-like decoding to find optimal tag sequences. The maximum entropy classifier and maximum entropy Markov models fall into this category. The second is to focus on the sentence level instead of individual positions, thus leading to Conditional Random Field (CRF) models. Note that the inputs and outputs are directly connected, as opposed to LSTM and bidirectional LSTM networks, where memory cells/recurrent components are employed. CRFs can produce higher tagging accuracy in general. It is interesting that the relation between these two ways of using tag information bears resemblance to the two ways of using input features, and the results in this paper confirm the superiority of BI-LSTM compared to LSTM.

The picture below illustrates the working of the CRF model:

source

An LSTM-CRF network can efficiently use past input features via an LSTM layer and sentence-level tag information via a CRF layer. A CRF layer is represented by lines which connect consecutive output layers. A CRF layer has a state transition matrix as parameters. With such a layer, we can efficiently use past and future tags to predict the current tag.

The picture below illustrates the working of the LSTM-CRF model:

source

The sequence of word representations is used as input to a bi-directional LSTM, whose output combines the left and right context for each word in a sentence. The output representation from the bi-directional LSTM is fed into a CRF layer; the size of the representation and the number of labels are equivalent. In order to take neighboring labels into account, we chose a CRF instead of the softmax as the decision function to yield the final label sequence.
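At prediction time, a CRF layer on top of BLSTM scores is commonly decoded with the Viterbi algorithm. The sketch below assumes hypothetical emission scores (standing in for the BLSTM output), a hand-written transition matrix and a three-tag label set; none of these numbers come from a trained model.

```python
import numpy as np


def viterbi_decode(emissions, transitions):
    """Find the best tag sequence for a sentence.

    emissions:   (n_words, n_tags) scores, e.g. the BLSTM output per word.
    transitions: (n_tags, n_tags) matrix, transitions[i, j] is the score of
                 moving from tag i to tag j (the CRF layer's parameters).
    """
    n_words, n_tags = emissions.shape
    score = emissions[0].copy()           # best score ending in each tag
    backpointers = np.zeros((n_words, n_tags), dtype=int)
    for t in range(1, n_words):
        # candidate[i, j]: best path ending in tag i, then moving to tag j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow the back-pointers from the best final tag.
    best = [int(score.argmax())]
    for t in range(n_words - 1, 0, -1):
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1]


# Hypothetical tags and scores for a three-word sentence.
TAGS = ["O", "B-PER", "I-PER"]
emissions = np.array([
    [0.1, 2.0, 0.3],   # "Teddy"
    [1.0, 0.2, 0.9],   # "Roosevelt"
    [2.0, 0.1, 0.1],   # "spoke"
])
transitions = np.array([
    [0.5, 0.0, -5.0],  # O -> I-PER is strongly penalised
    [0.0, -1.0, 1.5],  # B-PER -> I-PER is encouraged
    [0.5, 0.0, 0.5],
])
path = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
```

A per-word softmax would label “Roosevelt” as O because that is its highest individual score, whereas the transition scores make the sequence B-PER, I-PER, O the best overall path. This is the benefit of deciding at the sentence level.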

The picture below illustrates a BLSTM-CRF model that uses character-level vectors concatenated with word embeddings as the word representation:

source

Gated Recurrent Unit

A GRU has two gates, a reset gate and an update gate. Intuitively, the reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep around. If we set the reset gate to all 1’s and the update gate to all 0’s we again arrive at our plain RNN model. The basic idea of using a gating mechanism to learn long-term dependencies is the same as in an LSTM.

In a GRU, the RNN cell can be seen as a computation in which we update the memory vector, deciding at each timestep which information we want to keep, which information is no longer relevant and should be forgotten, and which information to add from the new input. The RNN cell also creates an output vector which is tightly related to the current hidden state (or memory vector).
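One GRU step can be sketched in NumPy as follows. The gate convention used here (the update gate multiplies the previous memory) is chosen so that a reset gate of all 1’s and an update gate of all 0’s reduce the cell to a plain RNN, as stated above; the sizes, random weights and parameter names are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, p):
    """One GRU time step; p is a dict of weight matrices and biases."""
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    # Candidate memory: the reset gate decides how much old memory to mix in.
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev))
    # The update gate decides how much of the previous memory to keep.
    return z * h_prev + (1.0 - z) * h_tilde


rng = np.random.default_rng(3)
input_size, hidden_size = 4, 3  # hypothetical sizes
params = {}
for gate in ("r", "z", "h"):
    params["W_" + gate] = rng.normal(scale=0.5, size=(hidden_size, input_size))
    params["U_" + gate] = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
params["b_r"] = np.zeros(hidden_size)
params["b_z"] = np.zeros(hidden_size)

x = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)
h_new = gru_step(x, h_prev, params)
```

Unlike the LSTM, the GRU keeps a single state vector, so the memory and the output of the cell are the same quantity.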

The picture below illustrates the working of a GRU:

source

Comparison between LSTM and GRU:

source

In the example of emotion classification from noisy speech, we simulate noisy speech by superimposing various environmental noises on clean speech. Features are extracted from the noisy speech and fed to the GRU for emotion classification.

The picture below illustrates an example of emotion classification from noisy speech using LSTM-GRU:

source

Character based convolutional gated recurrent encoder with word based gated recurrent decoder with attention (CCEAD)

This model has an underlying architecture similar to that of sequence-to-sequence models. It is a character based sequence-to-sequence architecture with a convolutional neural network (CNN)-gated recurrent unit (GRU) encoder that captures error representations in noisy text. The decoder of this model is a word based gated recurrent unit (GRU) that gets its initial state from the character encoder and implicitly behaves like a language model.

The following is the architectural diagram of our character based convolutional gated recurrent encoder with word based gated recurrent decoder with attention (CCEAD):

source

The following picture illustrates the CNN module comprising the encoder of our CCEAD model, used for capturing hidden representations in data:

source

Word Embedding

Word embedding is a technique for learning dense representations of words in a low-dimensional vector space. Each word can be seen as a point in this space, represented by a fixed-length vector. Semantic relations between words are captured by this technique. Word embedding is typically done in the first layer of the network, the embedding layer, which maps a word (its index in the vocabulary) to a dense vector of a given size.

Word2Vec

Word2Vec is a method to construct word embeddings. They can be obtained using two methods (both involving neural networks): Skip-gram and Continuous Bag of Words (CBOW).

We can perform some amazing tasks from word embeddings of Word2Vec.

  1. Finding the degree of similarity between two words.
    model.similarity('woman','man')
    0.73723527
  2. Finding odd one out.
    model.doesnt_match('breakfast cereal dinner lunch'.split())
    'cereal'
  3. Amazing things like woman+king-man =queen
    model.most_similar(positive=['woman','king'],negative=['man'],topn=1)
    queen: 0.508
  4. Probability of a text under the model
    model.score(['The fox jumped over the lazy dog'.split()])
    0.21

GloVe

GloVe, short for Global Vectors, is a model for word representation, so named because the global corpus statistics are captured directly by the model. It is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications, such as information retrieval, document classification, question answering and named entity recognition. For example, the analogy “king is to queen as man is to woman” should be encoded in the vector space by the vector equation king − queen = man − woman. This evaluation scheme favours models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations.

FastText

FastText is an extension to Word2Vec proposed by Facebook in 2016. Instead of feeding individual words into the neural network, FastText breaks words into several n-grams (sub-words). For instance, the tri-grams for the word apple are app, ppl, and ple (ignoring the start and end boundaries of the word). The word embedding vector for apple will be the sum of the vectors of all these n-grams. After training the neural network, we will have word embeddings for all the n-grams in the training dataset. Rare words can now be properly represented, since it is highly likely that some of their n-grams also appear in other words.

Word embeddings fall into two broad types:
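The sub-word idea can be sketched in plain Python. This is an illustration only, not the FastText library: the two-dimensional n-gram vectors are invented, and, like the example in the text, the sketch ignores word-boundary markers.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, ignoring word boundaries."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def word_vector(word, ngram_vectors, n=3):
    """Sum the vectors of all known n-grams of the word."""
    dim = len(next(iter(ngram_vectors.values())))
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        for k, value in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            total[k] += value
    return total


# Hypothetical 2-dimensional n-gram vectors, as if learned during training.
NGRAM_VECTORS = {"app": [1.0, 0.0], "ppl": [0.5, 0.5], "ple": [0.0, 1.0]}

apple = word_vector("apple", NGRAM_VECTORS)   # uses app, ppl, ple
apply_ = word_vector("apply", NGRAM_VECTORS)  # shares app and ppl with apple
```

Even if apply never occurred in the training data, it still receives a meaningful vector because two of its tri-grams were learned from apple.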

i. Frequency based embedding

ii. Prediction based embedding

Frequency based embedding

Count vector

A count vector model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears.

TF-IDF vectorization

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency.

In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier, it is very common to use the tf–idf transform. This method takes into account not just the occurrence of a word in a single document but in the entire corpus. Let’s take a business article: it will contain more business-related terms such as stock market, prices and shares than any other article, but terms like “a”, “an” and “the” will appear in every article with high frequency. This method therefore penalizes such high-frequency words.
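Count vectors and the tf–idf re-weighting can be sketched with the standard library alone. The three-document corpus is hypothetical, and the plain idf = log(N / df) used here is an assumption; libraries usually apply smoothed variants of this formula.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of three "articles".
corpus = [
    "the stock market prices rose",
    "the shares of the company fell",
    "the cat sat on the mat",
]
tokenized = [doc.split() for doc in corpus]

# Count vectors: how often each word appears in each document.
count_vectors = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each word appears.
document_frequency = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(word, doc_index):
    """Plain tf * idf with idf = log(N / df); libraries often smooth this."""
    df = document_frequency[word]
    if df == 0:  # the word never occurs in the corpus
        return 0.0
    tf = count_vectors[doc_index][word]
    return tf * math.log(len(corpus) / df)
```

The word “the” occurs in every document, so its idf is log(1) = 0 and its weight vanishes regardless of how often it is repeated, while “stock” keeps a positive weight in the first document.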

Co-Occurrence Matrix with a fixed context window

A word co-occurrence matrix describes how words occur together, which in turn captures the relationships between words. It is computed simply by counting how often two or more words occur together in a given corpus.
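Counting co-occurrences within a fixed context window can be sketched as below. The two-sentence corpus and the function name are hypothetical, and the counts are stored sparsely in a Counter rather than in a dense matrix.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` positions of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Only look to the right; store each pair in both orders (symmetric).
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts


# Hypothetical two-sentence corpus.
counts = cooccurrence_counts(["I like deep learning", "I like NLP"], window=1)
```

With a window of one word, “i” and “like” co-occur twice, while “i” and “deep” never co-occur because they are two positions apart.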

Prediction based embedding

Continuous Bag of Words(CBOW)

CBOW learns to predict a word from its context. The context may be a single word or multiple words for a given target word.

Let’s see this with an example: “The cat jumped over the puddle.”

So one approach is to treat {“The”, “cat”, “over”, “the”, “puddle”} as a context and, from these words, be able to predict or generate the centre word “jumped”. We call this type of model a Continuous Bag of Words (CBOW) model.
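The (context, centre word) training pairs for CBOW can be generated as in the sketch below. The function name and the window of three words per side are assumptions, chosen so that the whole example sentence serves as context for “jumped”.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, centre_word) training pairs."""
    pairs = []
    for i, centre in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, centre))
    return pairs


tokens = "The cat jumped over the puddle".split()
pairs = cbow_pairs(tokens, window=3)
context, centre = pairs[2]  # the pair built around "jumped"
```

A CBOW network would then be trained to output the centre word given the (averaged or summed) vectors of the context words.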

The picture below illustrates the representation of CBOW:

source

Skip-gram

For skip-gram, the input is the target word, while the outputs are the words surrounding the target word. For instance, in the sentence “I have a cute dog”, the input would be “a”, whereas the outputs are “I”, “have”, “cute”, and “dog”, assuming the window size is 5. All the input and output data are of the same dimension and one-hot encoded. The network contains one hidden layer whose dimension is equal to the embedding size, which is smaller than the input/output vector size. At the end of the output layer, a softmax activation function is applied so that each element of the output vector describes how likely it is that a specific word appears in the context. The graph below visualizes the network structure.

Given the sentence “The fluffy dog barked as it chased a cat” as input, a run of the model would look like this:

Here’s the architecture of the neural network of the Skip-gram model:

source
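The skip-gram training pairs and their one-hot encoding can be sketched as follows. The function names are hypothetical, and window=2 here means two words on each side of the input word, which corresponds to the total window size of 5 used in the “I have a cute dog” example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs; `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def one_hot(word, vocabulary):
    """Encode a word as a one-hot list over the vocabulary."""
    return [1 if word == v else 0 for v in vocabulary]


tokens = "I have a cute dog".split()
pairs = skipgram_pairs(tokens, window=2)
context_of_a = [context for target, context in pairs if target == "a"]
encoded_a = one_hot("a", tokens)
```

Each pair becomes one training example: the one-hot vector of the input word goes in, and the softmax output is trained towards the one-hot vector of the context word.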

Tensorflow implementation of word Embedding

To create word embeddings in TensorFlow, we first split the text into words and then assign an integer to every word in the vocabulary. For example, the sentence “I have a cat.” could be split into [“I”, “have”, “a”, “cat”, “.”] and then the corresponding word_ids tensor would have shape [5] and consist of 5 integers. To map these word ids to vectors, we need to create the embedding variable and use the tf.nn.embedding_lookup function as follows:

import tensorflow as tf  # TensorFlow 1.x API

word_embeddings = tf.get_variable("word_embeddings", [vocabulary_size, embedding_size])

embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)

After this, the tensor embedded_word_ids will have shape [5,embedding_size] in our example and contain the embeddings (dense vectors) for each of the 5 words. At the end of training, word_embeddings will contain the embeddings for all words in the vocabulary.

To visualize the vector representation of words on TensorBoard, we use the following code:

# Merge all the summaries and write them out to /tmp/logs (by default)
merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train', sess.graph)
test_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/test')

tf.global_variables_initializer().run()

Launching TensorBoard

To run TensorBoard, use the following command (alternatively python -m tensorboard.main)

tensorboard --logdir=path/to/log-directory

The picture below shows the vector representation of words on TensorBoard:

screenshot

Word representation is central to natural language processing. The default approach of representing words as discrete and distinct symbols is insufficient for many tasks, and suffers from poor generalization. For example, the symbolic representation of the words “pizza” and “hamburger” are completely unrelated: even if we know that the word “pizza” is a good argument for the verb “eat”, we cannot infer that “hamburger” is also a good argument. We thus seek a representation that captures semantic and syntactic similarities between words.

The picture below illustrates the vector representation of words:

source

7. Topic aware Sequence to Sequence Model with Multihead Attention Mechanism

The Sequence to Sequence model (seq2seq) consists of two RNNs — an encoder and a decoder. The encoder reads the input sequence, word by word and emits a context (a function of final hidden state of encoder), which would ideally capture the essence (semantic summary) of the input sequence. Based on this context, the decoder generates the output sequence, one word at a time while looking at the context and the previous word during each timestep. This is a ridiculous over simplification, but it gives you an idea of what happens in seq2seq.

Sequence To Sequence model introduced in Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation has since then, become the Go-To model for Dialogue Systems and Machine Translation. It consists of two RNNs (Recurrent Neural Network) : An Encoder and a Decoder. The encoder takes a sequence(sentence) as input and processes one symbol(word) at each timestep. Its objective is to convert a sequence of symbols into a fixed size feature vector that encodes only the important information in the sequence while losing the unnecessary information. You can visualise data flow in the encoder along the time axis, as the flow of local information from one end of the sequence to another.

The topic aware sequence-to-sequence (TA-Seq2Seq) model leverages topic information as prior knowledge in response generation. TA-Seq2Seq is built on the sequence-to-sequence framework. In encoding, the model represents an input message as hidden vectors using a message encoder, and acquires embeddings of the topic words of the message from a pre-trained LDA model. The topic words are used as a simulation of topical concepts in people’s minds, and are obtained from an LDA model that is pre-trained on large-scale social media data outside the conversation data. In decoding, each word is generated according to both the message and the topics through a joint attention mechanism. In joint attention, hidden vectors of the message are summarized as context vectors by message attention, which follows the existing attention techniques, and embeddings of topic words are synthesized as topic vectors by topic attention. Unlike existing attention, topic attention calculates the weights of the topic words by taking the final state of the message as an extra input, in order to strengthen the effect of the topic words relevant to the message. The joint attention lets the context vectors and the topic vectors jointly affect response generation, and makes words in responses relevant not only to the input message but also to the correlated topic information of the message. To model the behavior of people using topical concepts as “building blocks” of their responses, we modify the generation probability of a topic word by adding another probability item which biases the overall distribution and further increases the possibility of the topic word appearing in the response. The results on both automatic evaluation metrics and human annotations show that TA-Seq2Seq can generate more informative, diverse, and topic-relevant responses, and that it significantly outperforms the state-of-the-art response generation methods it was compared with.

Seq2Seq Attention mechanism

The traditional Seq2Seq model assumes that every word is generated from the same context vector. In practice, however, different words in the response Y could be semantically related to different parts of the message X. To tackle this issue, an attention mechanism is introduced into Seq2Seq.

The picture below illustrates the seq2seq attention mechanism:

source
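The attention idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the score is a simple dot product between the decoder state and each encoder state (other scoring functions exist), and the function names and toy vectors are made up for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_context(decoder_state, encoder_states):
    """Compute attention weights and the context vector for one decoding step.

    Assumption: dot-product scoring between the decoder state and each
    encoder hidden state.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted sum of the encoder hidden states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Three toy encoder hidden states (one per input word) and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], encoder_states)
```

Because the context vector is recomputed at every decoding step, each output word can draw on a different part of the input.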

Sequence-to-sequence model (Seq2Seq) was first proposed in machine translation. The idea was to translate one sequence to another sequence through an encoder-decoder neural architecture. Recently, dialog generation has been treated as sequence translation from a query to a reply.

Multi-head Attention Mechanism

The context vector obtained by the traditional attention mechanism focuses on a specific representation subspace of the input sequence. Such a context vector is expected to reflect one aspect of the semantics in the input. However, a sentence usually involves multiple semantic spaces, especially a long sentence. A multi-head attention mechanism for the Seq2Seq model allows the decoder RNN to jointly attend to information from different representation subspaces of the encoder hidden states during decoding. The idea of multiple heads has also been applied to learning sentence representations with self-attention.

The picture below illustrates the working of the multi-head encoder-decoder attention mechanism:

source
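A rough sketch of multi-head attention follows. It assumes the simplest possible way of forming subspaces, slicing each vector into equal parts, whereas real models usually learn a separate projection per head. All names and numbers are illustrative.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention(query, keys):
    """Scaled dot-product attention for one head; keys double as values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys))
            for i in range(len(query))]


def multi_head_context(decoder_state, encoder_states, num_heads):
    """Attend separately in each subspace and concatenate the head outputs.

    Assumption: a subspace is a contiguous slice of the vector, not a
    learned projection.
    """
    dim = len(decoder_state)
    assert dim % num_heads == 0, "dimension must divide evenly into heads"
    size = dim // num_heads
    context = []
    for head in range(num_heads):
        start, stop = head * size, (head + 1) * size
        query = decoder_state[start:stop]
        keys = [state[start:stop] for state in encoder_states]
        context.extend(head_attention(query, keys))
    return context


encoder_states = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 2.0, 0.0]]
context = multi_head_context([3.0, 0.0, 3.0, 0.0], encoder_states, num_heads=2)
```

In this toy input the first head puts most of its weight on the first encoder state and the second head on the second, which is the behaviour the multi-head design is meant to allow.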

Dual Encoder LSTM (DE)

The DE model consists of two RNNs which respectively compute the vector representation of an input context and response.

source

The Dual Encoder LSTM network is just one of many we could apply to this problem, and it’s not necessarily the best one. You can come up with all kinds of Deep Learning architectures that haven’t been tried yet; it’s an active research area. For example, the seq2seq model often used in Machine Translation would probably do well on this task. We are going for the Dual Encoder because it has been reported to give decent performance on this data set. This means we know what to expect and can check that our implementation is correct.

The Dual Encoder works as follows:

i. Both the context and the response text are split by words, and each word is embedded into a vector. The word embeddings are initialized with Stanford’s GloVe vectors and are fine-tuned during training.

ii. Both the embedded context and response are fed into the same Recurrent Neural Network word-by-word. The RNN generates a vector representation that, loosely speaking, captures the “meaning” of the context and response.

iii. We multiply the context vector c by a learned matrix M to “predict” a response r'. We then measure the similarity of the predicted response r' and the actual response r by taking the dot product of these two vectors. A large dot product means the vectors are similar and that the response should receive a high score. Finally, we apply a sigmoid function to convert that score into a probability.
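The scoring step can be sketched as follows. This is a minimal illustration, assuming the context and response have already been encoded into vectors c and r, and assuming a learned matrix M maps the context vector to a predicted response r'. The vectors, the identity matrix and the function names are placeholders, not trained values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_response(c, M):
    """r' = c^T M: project the context vector into response space."""
    return [sum(c[i] * M[i][j] for i in range(len(c)))
            for j in range(len(M[0]))]


def response_probability(c, r, M):
    """Probability that response r matches context c: sigmoid(r' . r)."""
    predicted = predict_response(c, M)
    score = sum(p * x for p, x in zip(predicted, r))
    return sigmoid(score)


# Placeholder encodings; in the real model these come from the RNN.
context_vec = [1.0, 0.0]
M = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical "learned" matrix
p_good = response_probability(context_vec, [2.0, 0.0], M)
p_bad = response_probability(context_vec, [-2.0, 0.0], M)
```

A response vector that points the same way as the predicted response gets a probability above 0.5, and one that points the opposite way gets a probability below 0.5.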

8. Neural Response Generation via Generative Adversarial Network

Generative Adversarial Nets (GANs) offer an effective architecture for jointly training a generative model and a discriminative classifier to generate sharp and realistic images. This architecture could also potentially be applied to conversational response generation to relieve the safe response problem: the generative part can be a Seq2Seq-based model that generates response utterances for given queries, and the discriminative part can evaluate the quality of the generated utterances from diverse dimensions according to human-produced responses. However, unlike in image generation, training such a GAN for text generation is not straightforward. The decoding phase of the Seq2Seq model usually involves sampling discrete words from the predicted distributions, and these words are fed into the training of the discriminator. The sampling procedure is non-differentiable and therefore breaks back-propagation. Recent approaches are inspired by advances in Neural Machine Translation (NMT): whereas earlier work focused on paired word sequences only, the comprehensibility of the generated responses can benefit from multi-view training with respect to words, coarse tokens and utterances.

source

Generative model

The generative model G defines the policy that generates a response y given dialogue history x. It takes a form similar to seq2seq models, which first map the source input to a vector representation using a recurrent net and then compute the probability of generating each token in the target using a softmax function.

Discriminative model

The discriminative model D is a binary classifier that takes as input a sequence of dialogue utterances {x, y} and outputs a label indicating whether the input is generated by humans or machines. The input dialogue is encoded into a vector representation using a hierarchical encoder, which is then fed to a 2-class softmax function, returning the probability of the input dialogue episode being a machine-generated dialogue (denoted Q−({x, y})) or a human-generated dialogue (denoted Q+({x, y})).

Policy Gradient Training

The key idea of the system is to encourage the generator to generate utterances that are indistinguishable from human generated dialogues. We use policy gradient methods to achieve such a goal, in which the score of current utterances being human-generated ones assigned by the discriminator is used as a reward for the generator, which is trained to maximize the expected reward of generated utterance(s) using the REINFORCE algorithm.
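To make the policy gradient idea concrete, here is a toy REINFORCE loop. It is a sketch under heavy assumptions: the "generator" is just a softmax over two fixed responses, the discriminator is a hard-coded stand-in, and a constant baseline of 0.5 is subtracted from the reward. None of these choices come from the original system.

```python
import math
import random

RESPONSES = ["i don't know", "i am john"]


def discriminator_reward(response):
    """Hypothetical stand-in for D: probability the response is human-written."""
    return 0.9 if response == "i am john" else 0.1


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, rng, lr=0.5, baseline=0.5):
    """Sample a response, score it with D, and take one REINFORCE step."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    advantage = discriminator_reward(RESPONSES[action]) - baseline
    # Gradient of log pi(action) w.r.t. logit k is (1[k == action] - probs[k]).
    return [logit + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
            for k, logit in enumerate(logits)]


rng = random.Random(0)
logits = [0.0, 0.0]
for _ in range(200):
    logits = reinforce_step(logits, rng)
final_probs = softmax(logits)
```

After training, the policy prefers the response that the stand-in discriminator scores as more human-like.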

Reward for Every Generation Step

Suppose, for example, the input history is “what’s your name”, the human-generated response is “I am John”, and the machine-generated response is “I don’t know”. The vanilla REINFORCE model assigns the same negative reward to all tokens within the machine-generated response (i.e., I, don’t, know), whereas proper credit assignment in training would give separate rewards: most likely a neutral reward for the token I, and negative rewards for don’t and know. We call this reward for every generation step, abbreviated REGS. Rewards for intermediate steps or partially decoded sequences are thus necessary. Unfortunately, the discriminator is trained to assign scores to fully generated sequences, not partially decoded ones. We propose two strategies for computing intermediate step rewards: (1) using Monte Carlo (MC) search and (2) training a discriminator that is able to assign rewards to partially decoded sequences.
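The Monte Carlo search strategy can be sketched as below. The assumptions are substantial: the generator used for rollouts is a uniform random sampler over a four-word vocabulary, and the discriminator is a hand-written scoring function. Both are hypothetical stand-ins chosen only to show how a partially decoded sequence gets a reward.

```python
import random

VOCAB = ["am", "john", "don't", "know"]
MAX_LEN = 3


def discriminator_score(tokens):
    """Hypothetical stand-in for D: share of tokens that are not 'don't' or 'know'."""
    bad = {"don't", "know"}
    return sum(t not in bad for t in tokens) / len(tokens)


def rollout(prefix, rng):
    """Complete a partial response by sampling the remaining tokens."""
    tokens = list(prefix)
    while len(tokens) < MAX_LEN:
        tokens.append(rng.choice(VOCAB))
    return tokens


def step_rewards(response, num_rollouts=200, seed=0):
    """Reward for each prefix = average D score over sampled completions."""
    rng = random.Random(seed)
    rewards = []
    for t in range(1, len(response) + 1):
        prefix = response[:t]
        scores = [discriminator_score(rollout(prefix, rng))
                  for _ in range(num_rollouts)]
        rewards.append(sum(scores) / num_rollouts)
    return rewards


rewards = step_rewards(["i", "don't", "know"])
```

The reward drops as soon as “don’t” is generated and drops again at “know”, so each token receives its own signal instead of one shared value for the whole response.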

9. Machine Reading for Question Answering

Machine Reading Comprehension (MRC) is a challenging task: the goal is to have machines read a (set of) text passage(s) and then answer any question about the passage(s). The MRC model is the core component of text-QA agents.

Consider an example. Given the question “will I qualify for OSAP if I’m new in Canada”, one might first locate the relevant passage, which includes: “you must be a (1) Canadian citizen; (2) permanent resident; or (3) protected person…”, then reason that being new to the country is usually the opposite of being a citizen, permanent resident, etc., and thus determine the correct answer: “no, you won’t qualify”.

source

Neural MRC Models

Despite the variety of model structures and attention types, a typical neural MRC model performs reading comprehension in three steps: (1) encoding the symbolic representation of the questions and passages into a set of vectors in a neural space; (2) reasoning in the neural space to identify the answer vector (e.g., in SQuAD, this is equivalent to ranking and re-ranking the embedded vectors of all possible text spans in the passage P); and (3) decoding the answer vector into a natural language output in the symbolic space (e.g., this is equivalent to mapping the answer vector to its text span in P).

The picture below illustrates the working of machine reading comprehension:

source

Encoding in MRC

Most MRC models encode questions and passages through three layers: a lexicon embedding layer, a contextual embedding layer and an attention layer.

Lexicon Embedding Layer

It extracts information from the question Q and the passage P at the word level and normalizes for lexical variants. It typically maps each word to a vector space using a pre-trained word embedding model, such as word2vec or GloVe, so that semantically similar words are mapped to vectors that are close to each other in the neural space. Word embedding can be enhanced by concatenating each word embedding vector with other linguistic embeddings, such as those derived from characters, Part-Of-Speech (POS) tags and named entities.

Contextual Embedding Layer

It utilizes contextual cues from surrounding words to refine the embedding of the words. As a result, the same word might map to different vectors in a neural space depending on its context, such as “bank of a river” vs. “bank of America”. This is typically achieved by using a Bi-directional Long Short-Term Memory (BiLSTM) network.

Attention Layer

It couples the question and passage vectors and produces a set of query-aware feature vectors for each word in the passage, and generates the working memory M over which reasoning is performed.

Reasoning

MRC models can be grouped into different categories based on how they perform reasoning to generate the answer: single-step and multi-step models.

Single-Step Reasoning

A single-step reasoning model matches the question and document only once and produces the final answer.

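Multi-Step Reasoning

Multi-step reasoning models match the question and document over several steps. Some use a fixed number of steps, while dynamic multi-step reasoning models decide the number of steps at run time. The dynamic models have to be trained using RL methods, e.g., policy gradient, which are tricky to implement due to the instability issue. SAN (Stochastic Answer Network) combines the strengths of both types of multi-step reasoning models.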

source

10. Goal-Oriented Dialog Management for Conversational AI with Transfer Learning

Transfer Learning

The main goal of this work is to study the impact of a widely used technique, transfer learning, on goal-oriented bots. As the name suggests, transfer learning transfers knowledge from one neural network to another. The former is known as the source, while the latter is the target. The goal of the transfer is to achieve better performance on the target domain with a limited amount of training data, while benefiting from additional information from the source domain. In the case of dialogue systems, the input space for both source and target nets are their respective dialogue spaces.

Goal-oriented bots contain an initial natural language understanding (NLU) component that is tasked with determining the user’s intent and its parameters, also known as slots. The usual practice in RL-based goal-oriented chatbots is to define the user-bot interactions as semantic frames: the entire dialogue can be reduced to a set of slot-value pairs, called semantic frames. Consequently, the conversation can be executed on two distinct levels:

Semantic level: At this level the user sends and receives only semantic frames as messages.

Natural language level: At this level the user sends and receives natural language sentences, which are reduced to, or derived from, a semantic frame by the Natural Language Understanding (NLU) and Natural Language Generation (NLG) units respectively.

The system consists of two independent units: the User Simulator and the Dialogue Manager (DM). We operate on the semantic level, removing the noise introduced by the NLU and NLG units.

The main components used in goal-oriented dialog management with transfer learning are explained below.

User Simulator

The User Simulator creates a user-bot conversation, given the semantic frames. Because the model is based on Reinforcement Learning, a dialogue simulation is necessary to successfully train the model. The user goal consists of two different sets of slots: inform slots and request slots. Inform slots are the slots for which the user knows the value, i.e. they represent the user constraints (e.g. {movie name: “avengers”, number of people: “3”, date: “tomorrow”}). Request slots are the ones for which the user is looking for an answer (e.g. {city, theater, start time}).
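A user goal and the semantic frames exchanged at the semantic level could be represented roughly like this. The class, the function names, and the intent labels "inform" and "request" are assumptions made for illustration; the slot names follow the movie-booking example above.

```python
from dataclasses import dataclass


@dataclass
class UserGoal:
    inform_slots: dict   # slots whose values the user already knows
    request_slots: list  # slots the user wants the bot to fill


def semantic_frame(intent, inform_slots=None, request_slots=None):
    """Build one semantic-level message as a plain dictionary."""
    return {
        "intent": intent,
        "inform_slots": dict(inform_slots or {}),
        "request_slots": list(request_slots or []),
    }


def respond(goal, bot_frame):
    """Very small rule-based user simulator turn (illustrative only)."""
    asked = [s for s in bot_frame["request_slots"] if s in goal.inform_slots]
    if asked:
        # The bot asked for something the user knows: inform it.
        return semantic_frame(
            "inform", inform_slots={s: goal.inform_slots[s] for s in asked})
    # Otherwise ask for the first thing the user still wants to know.
    return semantic_frame("request", request_slots=goal.request_slots[:1])


goal = UserGoal(
    inform_slots={"movie_name": "avengers", "number_of_people": "3",
                  "date": "tomorrow"},
    request_slots=["city", "theater", "start_time"],
)
reply = respond(goal, semantic_frame("request", request_slots=["date"]))
opening = respond(goal, semantic_frame("greeting"))
```

Here `reply` informs the bot that the date is “tomorrow”, while `opening` requests the city, since the bot did not ask for anything.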

Dialogue Manager

The Dialogue Manager (DM), as its name suggests, manages the dialogue flow in order to conduct a proper dialogue with the user. The DM is composed of two trainable sub-components: the Dialogue State Tracker (DST) and the Policy Learning module, i.e. the agent. Additionally, the Dialogue Manager exploits an external Knowledge Base (KB) to find and suggest values for the user requests. It therefore plays a central role in the entire Dialogue System.

Dialogue State Tracker

The responsibility of the Dialogue State Tracker (DST) is to build a reliable and robust representation of the current state of the dialogue. All system actions are based on the current dialogue state. It keeps track of the history of the user utterances, system actions and the querying results from the Knowledge Base. It extracts features and creates a vector embedding of the current dialogue state, which is exposed and used by the Policy Learning module later on. In order to produce the embeddings, the Dialogue State Tracker must know the type of all slots and intents that might occur during the dialogue. Since we operate on a semantic level (i.e. not introducing any additional noise), we employ a rule-based state tracker.

Policy Learning

The Policy Learning module selects the next system actions to drive the user towards the goal in the smallest number of steps. It does that by using deep reinforcement learning networks called Deep Q-Networks (DQN).
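The core of Q-learning, which DQN builds on, can be sketched without a neural network. In the snippet below a small table stands in for the Q-network, and the state names, action names and rewards are invented for the example. A real DQN would replace the table with a network and typically add experience replay and a target network.

```python
import random

ACTIONS = ["request_date", "inform_theater", "close_dialogue"]


def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_target(reward, next_q_values, done, gamma=0.9):
    """Q-learning target: r if the dialogue ended, else r + gamma * max_a' Q(s', a')."""
    return reward if done else reward + gamma * max(next_q_values)


def q_update(q_table, state, action, reward, next_state, done, lr=0.5):
    """Move Q(state, action) towards the TD target (tabular stand-in for DQN)."""
    target = td_target(reward, q_table[next_state], done)
    q_table[state][action] += lr * (target - q_table[state][action])


q_table = {"start": [0.0, 0.0, 0.0], "date_known": [0.0, 0.0, 0.0]}
# The final turn succeeds, so the agent receives a positive reward.
q_update(q_table, "date_known", 1, reward=1.0, next_state="date_known", done=True)
# An earlier turn costs a small per-step penalty but leads to a good state.
q_update(q_table, "start", 0, reward=-0.1, next_state="date_known", done=False)
best = epsilon_greedy(q_table["start"], epsilon=0.0, rng=random.Random(0))
```

The per-step penalty is one simple way to encode the preference for reaching the goal in as few steps as possible.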

The picture below illustrates the difference with and without the transfer learning technique:

source

11. Deep Reinforcement Learning Chatbot Model

The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network models and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data. Each response model takes a dialogue as input and outputs a response in natural language text. In addition, the response models may also output one or several scalar values indicating their internal confidence. The response models have been engineered to generate responses on a diverse set of topics using a variety of strategies.

The picture below illustrates the workflow of response generation and evaluation in the deep reinforcement learning algorithm:

source

The dialogue manager is responsible for combining the response models together. As input, the dialogue manager expects to be given a dialogue history (i.e. all utterances recorded in the dialogue so far, including the current user utterance) and confidence values of the automatic speech recognition system (ASR confidences) or text based generated response. To generate a response, the dialogue manager follows a three-step procedure. First, it uses all response models to generate a set of candidate responses. Second, if there exists a priority response in the set of candidate responses (i.e. a response which takes precedence over other responses), this response will be returned by the system. For example, for the question “What is your name?”, the response “I am an Alexa Prize socialbot” is a priority response. Third, if there are no priority responses, the response is selected by the model selection policy. For example, the model selection policy may select a response by scoring all candidate responses and picking the highest-scored response.
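The three-step procedure above can be sketched as follows. The three response models and the length-based scoring function are toy stand-ins invented for this example; in the described system the scoring is done by a learned model selection policy.

```python
def name_template(history):
    """Template model: emits a priority response for questions about the name."""
    if "your name" in history[-1].lower():
        return [{"text": "I am an Alexa Prize socialbot", "priority": True}]
    return []


def generic_model(history):
    """Fallback model: always offers a generic, non-priority response."""
    return [{"text": "Tell me more.", "priority": False}]


def movie_model(history):
    """Topic model: offers a follow-up question when movies are mentioned."""
    if "movie" in history[-1].lower():
        return [{"text": "Which movie did you watch last?", "priority": False}]
    return []


def length_score(candidate):
    """Hypothetical scoring function; a learned policy would go here."""
    return len(candidate["text"])


def select_response(history, response_models, score_fn):
    # Step 1: collect candidate responses from every response model.
    candidates = []
    for model in response_models:
        candidates.extend(model(history))
    # Step 2: a priority response, if present, is returned immediately.
    for candidate in candidates:
        if candidate["priority"]:
            return candidate["text"]
    # Step 3: otherwise return the highest-scored candidate.
    return max(candidates, key=score_fn)["text"]


MODELS = [name_template, generic_model, movie_model]
first = select_response(["What is your name?"], MODELS, length_score)
second = select_response(["I like movies"], MODELS, length_score)
```

The fallback model guarantees that the candidate set is never empty, so step 3 always has something to choose from.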

Response Models

There are 22 response models in the system, including retrieval-based neural networks, generation-based neural networks, knowledge base question answering systems and template-based systems.

Template-based Models

Template-based models use templates to produce a response given the dialogue history and user utterance. By default all templates generate non-priority responses, so we configure templates related to the socialbot’s name, age and location to output priority responses. We modify a few templates further to make them consistent with the challenge (e.g. to avoid obscene language and to encourage the user to discuss certain topics, such as news, politics and movies). The majority of templates remain unchanged.

Knowledge Base-based Question Answering

RL-based methods use a policy-based agent with continuous states based on KB embeddings to traverse the knowledge graph and identify the answer node (entity) for an input query. The RL-based methods are as robust as the neural methods due to the use of continuous vectors for state representation, and are as interpretable as symbolic methods because the agents explicitly traverse the paths in the graph.

source

Retrieval-based Neural Networks

VHRED models: The system contains several VHRED (Latent Variable Hierarchical Recurrent Encoder-Decoder) models, sequence-to-sequence models with Gaussian latent variables trained as variational auto-encoders. The trained VHRED models generate candidate responses as follows. First, a set of K model responses (here K = 20) is retrieved from a dataset using cosine similarity between the current dialogue history and the dialogue histories in the dataset, based on bag-of-words TF-IDF GloVe word embeddings. Then an approximation of the log-likelihood of each of the K responses is computed by VHRED, and the response with the highest log-likelihood is returned.

Bag-of-words Retrieval Models: The system contains three bag-of-words retrieval models based on TF-IDF GloVe word embeddings and Word2Vec embeddings. Similar to the VHRED models, these models retrieve the response with the highest cosine similarity.
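The retrieval step can be approximated with a small, dependency-free sketch. Note the substitution: it compares sparse TF-IDF bag-of-words vectors directly instead of TF-IDF GloVe or Word2Vec embeddings, and the three history/response pairs are invented sample data.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors for the documents plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1.0
           for tok, count in doc_freq.items()}
    vectors = [{tok: cnt * idf[tok] for tok, cnt in Counter(doc).items()}
               for doc in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    num = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    den = (math.sqrt(sum(w * w for w in u.values()))
           * math.sqrt(sum(w * w for w in v.values())))
    return num / den if den else 0.0


def retrieve(query, pairs):
    """Return the response whose stored dialogue history best matches the query."""
    histories = [history for history, _ in pairs]
    vectors, idf = tfidf_vectors(histories)
    # Words unseen in the dataset get weight 0 and do not affect the match.
    q_vec = {tok: cnt * idf.get(tok, 0.0)
             for tok, cnt in Counter(query.lower().split()).items()}
    best = max(range(len(pairs)), key=lambda i: cosine(q_vec, vectors[i]))
    return pairs[best][1]


pairs = [
    ("do you like movies", "I love science fiction films."),
    ("what is the weather today", "It looks sunny outside."),
    ("tell me a joke", "Why did the chicken cross the road?"),
]
answer = retrieve("what movies do you like", pairs)
```

Swapping the sparse vectors for averaged word embeddings would bring this closer to the models described above, at the cost of needing pre-trained vectors.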

Retrieval-based Logistic Regression

The system contains a response model, called BoWEscapePlan, which returns a response from a set of 35 topic-independent, generic pre-defined responses, such as “Could you repeat that again”, “I don’t know” and “Was that a question?”. Its main purpose is to maintain user engagement and keep the conversation going, when other models are unable to provide meaningful responses. This model uses a logistic regression classifier to select its response based on a set of higher-level features.

Search Engine-based Neural Networks

The system contains a deep classifier model, called LSTMClassifierMSMarco, which chooses its response from a set of search engine results. The system searches the web with the last user utterance as the query and retrieves the first 10 search snippets. The retrieved snippets are preprocessed by stripping trailing words, removing unnecessary punctuation and truncating to the last full sentence. The model uses a bidirectional LSTM to separately map the last dialogue utterance and the snippet to their own embedding vectors. The resulting two representations are concatenated and passed through an MLP to predict a scalar value between 0 and 1 indicating how appropriate the snippet is as a response to the utterance.

Generation-based Neural Networks

The system contains a generative recurrent neural network language model, called GRUQuestionGenerator, which can generate follow-up questions word by word, conditioned on the dialogue history. The input to the model consists of three components: a one-hot vector of the current word, a binary question label and a binary speaker label. The model contains two GRU layers and a softmax output layer.

Model Selection Policy and Architecture

After generating the candidate response set, the dialogue manager uses a model selection policy to select the response it returns to the user. The dialogue manager must select a response which increases the satisfaction of the user for the entire dialogue. It must make a trade-off between immediate and long-term user satisfaction. For example, suppose the user asks to talk about politics. If the dialogue manager chooses to respond with a political joke, the user may be pleased for one turn. Afterwards, however, the user may be disappointed with the system’s inability to debate political topics. If the dialogue manager instead chooses to respond with a short news story, the user may be less pleased for one turn. However, the news story may influence the user to follow up with factual questions, which the system may be more adept at handling. To make the trade-off between immediate and long-term user satisfaction, we consider selecting the appropriate response as a sequential decision making problem. The original paper describes five approaches to learning the model selection policy, all of which are evaluated with real-world users.

Action-value Parametrization

The use of an action-value function for selecting dialogue responses is closely related to prior work in which a model is learned to predict the quality of a dialogue system response.

Model Architecture

The diagram below represents the flow of a single round (one complete loop) of training with the Reinforcement Learning algorithm:

source

Chatbot Query and Response

Example 1 (A and B are both trained chatbots)

A: Can you do me a favor?

B: Anything.

A: I want you to bend a father.

B: Of course.

A: I’m so sorry I didn’t mean to be rude but.

B: Don’t get it at least it’s about as much as I can.

Example 2 (A and B are both trained chatbots)

A: I didn’t do it! I haven’t got a gun!

B: You liar. You still want to get your money back.

A: Something wrong with the truth.

B: I don’t want any money just a little I can’t take just out of this.

A: What’s wrong.

B: Listen to me. I’m scared now.

A: Ah you’ll need all the money I need your help.

12. Coding and Implementation

Dataset

In this post I’ll work with the Cornell_Movie-Dialogs_Corpus (link).

DESCRIPTION:

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters

- involves 9,035 characters from 617 movies

- in total 304,713 utterances

- movie metadata included:

  - genres

  - release year

  - IMDB rating

  - number of IMDB votes

- character metadata included:

  - gender (for 3,774 characters)

  - position on movie credits (3,321 characters)

Preprocessing

The Cornell_Movie-Dialogs_Corpus dataset is a natural language dataset and can’t be used in its exact form. It needs to be converted into a suitable data structure before it can be used for further computation and processing.

Tokenizing: The first step in pre-processing is to tokenize the sentences into words. For example, ‘Bob dropped the apple. Where is the apple?’ is tokenized to [‘Bob’, ‘dropped’, ‘the’, ‘apple’, ‘.’, ‘Where’, ‘is’, ‘the’, ‘apple’, ‘?’]

Splitting into stories, questions and answers: Next, the sentences are split into stories, questions and answers so that they can be fed to the proposed models.

Combining all the stories: All the stories are then combined up to the point at which the question was asked. This becomes the story for that particular question.

Indexing the stories, questions and answers: Finally, the questions and stories are indexed according to their time of occurrence and are then processed via a word2vec model. The answers are transformed to one-hot encoded vectors.
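The tokenizing, indexing and one-hot steps above might look roughly like this. It is a simplified sketch: words are mapped to integer ids rather than passed through a word2vec model, index 0 is reserved for padding by assumption, and the function names are made up for the example.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def build_vocab(token_lists):
    """Assign each distinct token an integer id, starting at 1 (0 = padding)."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vector = [0] * size
    vector[index] = 1
    return vector


tokens = tokenize("Bob dropped the apple. Where is the apple?")
vocab = build_vocab([tokens])
indexed = [vocab[token] for token in tokens]
# Encode the answer word "apple" as a one-hot vector over the vocabulary.
answer_vector = one_hot(vocab["apple"], len(vocab) + 1)
```

The indexed sequence is what would be fed to an embedding layer, and the one-hot vector is the training target for the answer.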

Creating the model

Now that we have inputs, parsing, evaluation and training it’s time to write code for our Dual LSTM neural network. Because we have different formats of training and evaluation data I’ve written a chatbot_model.py wrapper that takes care of bringing the data into the right format for us. It takes a model_impl argument, which is a function that actually makes predictions.

Defining Evaluation Metrics

TensorFlow already comes with many standard evaluation metrics that we can use. To use these metrics we need to create a dictionary that maps from a metric name to a function that takes the predictions and labels.

tf.metrics.accuracy(labels, predictions, weights=None, metrics_collections=None, updates_collections=None, name=None)
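The idea of a dictionary that maps metric names to functions can be shown without TensorFlow. The sketch below is framework-free and does not use the tf.metrics API; it assumes each prediction is a ranked list of candidate response ids and uses recall@k as the metric, where recall@1 is the same as top-1 accuracy. All names and the sample data are illustrative.

```python
def recall_at_k(labels, ranked_predictions, k):
    """Fraction of examples whose true label appears in the top k predictions."""
    hits = sum(int(label in ranked[:k])
               for label, ranked in zip(labels, ranked_predictions))
    return hits / len(labels)


def make_recall_metric(k):
    """Build a metric function with the common (labels, predictions) signature."""
    return lambda labels, ranked_predictions: recall_at_k(
        labels, ranked_predictions, k)


# Dictionary mapping a metric name to a function of (labels, predictions).
eval_metrics = {"recall_at_%d" % k: make_recall_metric(k) for k in (1, 2, 3)}

# Each row ranks three candidate responses; the true response has id 0.
labels = [0, 0, 0]
ranked_predictions = [[0, 3, 1], [2, 0, 1], [1, 2, 0]]
results = {name: metric(labels, ranked_predictions)
           for name, metric in eval_metrics.items()}
```

Keeping every metric behind the same two-argument signature is what lets the evaluation loop treat them uniformly.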

The code is available on my GitHub profile: github link

Below are some screenshots of the output:

screenshots

REFERENCES

  • Deep Learning for Chatbots (paper)
  • Dialogue Intent Classification with Long Short-Term Memory Networks (paper)
  • Sequence to Sequence Learning with Neural Networks (paper)
  • A Neural Conversational Model (paper)
  • Neural Machine Translation by Jointly Learning to Align and Translate (paper)
  • Effective Approaches to Attention-based Neural Machine Translation (paper)
  • Neural Approaches to Conversational AI (paper)
  • Neural Response Generation via GAN with an Approximate Embedding Layer (paper)
  • Task-oriented Conversational Agent Self-learning Based on Sentiment Analysis (paper)
  • Deep Reinforcement Learning for Dialogue Generation (paper)
  • Topic Aware Neural Response Generation (paper)
  • Response Selection with Topic Clues for Retrieval-based Chatbots (paper)
  • Bidirectional Recurrent Neural Networks as Generative Models (paper)
  • Adversarial-Learning-for-Generative-Conversational-Agents (link)
  • Few-Shot Generalization Across Dialogue Tasks (link)
  • Neural Networks for Text Correction and Completion in Keyboard Decoding (paper)