Chatbot | Retrieval-based Dialog System on the Ubuntu Dialog Corpus | LSTM | PyTorch

1) Idea

One of the top goals in artificial intelligence research is to enable a computer to hold natural, coherent conversations with humans - ideally in such a way that the computer tricks the human into thinking that it is talking to another human being. As of now, no chatbot can pass the Turing test, and research is still very far from a generative open-domain dialog system, i.e. one that can freely compose sentences to hold human-style conversations about all kinds of topics. Despite that, more and more companies use chatbots to create value, e.g. in technical customer support or as personal assistants. Often, these are closed-domain, retrieval-based chatbots: using some kind of heuristic, they pick an appropriate response from a fixed repository of responses predefined by humans within a narrowly specified range of topics. Taking inspiration from such retrieval-based chatbots, the goal of this project was to build a model that, at best, is able to retrieve the most appropriate response to a conversational input from a whole pool of candidate responses.

2) Corpus

In the paper “The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems”, the hypothesis is made that the lack of progress in recent years in building sophisticated dialog systems is due to the lack of sufficiently large datasets. The authors therefore compiled a new large corpus to exploit the opportunities of deep learning for dialog systems: the Ubuntu Dialog Corpus (UDC). It is a collection of chat logs from Ubuntu-internal chat rooms, in which users of the operating system seek and give Ubuntu-related technical advice. With around 930,000 dialogs, it is the largest freely available corpus of its kind, and it has several favourable characteristics: it targets a task-specific domain, namely technical support, and its conversations are two-way oriented, going back and forth between the two users at least three times. As such, it is much more representative of natural human dialog than, for example, common microblogging datasets such as those from Twitter or Weibo, in which the exchanges lack a natural conversational structure and length.

3) Data & task

The data comes fully tokenized, stemmed and lemmatized, and entities like names, locations, organizations, URLs, and system paths were replaced with special tokens.

[Figure: example of the preprocessed data with entity placeholder tokens]

Most importantly, the training set is separated into three columns: the context of the conversation (all previous utterances in the conversation up to the point at which the next response is to be found), a candidate response, and a label denoting whether that candidate response is a true response to the context (label = 1) or a response drawn randomly from elsewhere in the dataset (label = 0), with a 50%/50% distribution of positive and negative examples. Given this structure, the task is to classify whether an utterance is a true response to a given context utterance or not; it is thus a binary classification problem over context-response pairs. However, I did not strictly stick with the original train/validation/test setup: I split the train.csv into three parts (80%/10%/10%) to create my own training/validation/test files, which all share the same column structure as the original train.csv. In addition, I took a subsample of the original test.csv as a second test file with a different structure, as explained in greater detail in the testing section.

[Figure: sample rows of the training data (context, response, label)]
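As a rough illustration, the 80%/10%/10% split could be produced along the following lines. The column names follow the original train.csv; apart from split_testing.csv (referred to in the testing section), the file names are made up for this sketch:

```python
import pandas as pd

# Illustrative sketch: split the original train.csv (columns: Context, Utterance, Label)
# into training/validation/test files with an 80%/10%/10% ratio.
df = pd.read_csv("train.csv")

# Shuffle the rows before splitting so each part keeps the 50/50 label balance.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

n = len(df)
train_end = int(0.8 * n)
val_end = int(0.9 * n)

df.iloc[:train_end].to_csv("split_training.csv", index=False)
df.iloc[train_end:val_end].to_csv("split_validation.csv", index=False)
df.iloc[val_end:].to_csv("split_testing.csv", index=False)
```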

4) Model architecture

Python 3.5 and PyTorch 0.3.0 were used to implement a neural network that can be described as a dual encoder model. The main component is an encoder model composed of three layers:

  • Embedding layer: initialized with pre-trained GloVe vectors for those words for which they are available, otherwise randomly initialized (from a standard normal distribution), and fine-tuned during training. Dimension: vocabulary size x embedding dimension.
  • LSTM layer: unidirectional, single-layer LSTM with input-to-hidden weights initialized from a uniform distribution and hidden-to-hidden weights with orthogonal initialization, following the original paper’s recommendations. At each time step, one word vector of the input utterance is fed into the LSTM and the hidden state is updated. For this classification task, we are only interested in the last hidden state of each input sequence, which can be interpreted as a numerical summary of the input utterance.
  • Dropout layer: according to the PyTorch documentation and forum, the built-in dropout of the LSTM layer has no effect if num_layers = 1 (since it is by definition not applied to the last layer). Therefore, the built-in dropout was set to 0.0 and an additional dropout layer was added to the model such that it applies directly to the last hidden state of each input sequence.

[Figure: encoder architecture]
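A minimal sketch of such an encoder is shown below. It is written against a current PyTorch API rather than version 0.3.0, and class, argument and variable names are illustrative, not the ones used in this repository:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoder: embedding -> single-layer LSTM -> dropout on the last hidden state."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, glove_weights=None, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if glove_weights is not None:
            # rows covered by GloVe are copied in; the rest stay randomly initialized
            self.embedding.weight.data.copy_(glove_weights)
        # built-in LSTM dropout has no effect with num_layers=1, so it is kept at 0.0
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=1,
                            batch_first=True, dropout=0.0)
        self.dropout = nn.Dropout(dropout)

        # uniform input-to-hidden and orthogonal hidden-to-hidden initialization
        for name, param in self.lstm.named_parameters():
            if "weight_ih" in name:
                nn.init.uniform_(param, -0.01, 0.01)
            elif "weight_hh" in name:
                nn.init.orthogonal_(param)

    def forward(self, utterance):                      # utterance: (batch, seq_len) token ids
        embedded = self.embedding(utterance)           # (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(embedded)              # h_n: (1, batch, hidden_dim)
        return self.dropout(h_n.squeeze(0))            # last hidden state per sequence
```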

To obtain the dual encoder model, one instance of the encoder model is applied to the context utterance and subsequently to the response utterance of each training example. The two outputs (the last hidden state of the context input sequence, denoted c, and the last hidden state of the response input sequence, denoted r) are then used to calculate the probability that the pair is a valid one: a weight matrix M is initialized as another learnable parameter (in addition to the embedding weights and the LSTM weights) and used to map between c and r. This can be interpreted as a predictive approach: given some context c, multiplying c with M predicts what a possible response r’ could look like. The similarity of r’ to the actual response r is then measured with the dot product and converted to a probability with the sigmoid function.

[Figure: dual encoder scoring of a context-response pair]

A high similarity results in a high dot product and a sigmoid value close to 1. The model was trained by minimizing the binary cross entropy loss.
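Building on the encoder sketch above, the dual encoder scoring could look roughly like this (again an illustrative sketch, not the exact implementation):

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Sketch of the dual encoder: score a (context, response) pair via sigmoid(c M . r)."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        hidden = encoder.lstm.hidden_size
        # M maps a context summary c to a "predicted response" r' = c M
        self.M = nn.Parameter(torch.randn(hidden, hidden) * 0.01)

    def forward(self, context, response):
        c = self.encoder(context)            # (batch, hidden)
        r = self.encoder(response)           # (batch, hidden)
        r_pred = c @ self.M                  # predicted response representation r'
        score = (r_pred * r).sum(dim=1)      # dot product between r' and the actual r
        return torch.sigmoid(score)          # probability that the pair is a true pair

# Training would minimize binary cross entropy on these probabilities, e.g.:
# loss = nn.BCELoss()(model(context_batch, response_batch), label_batch.float())
```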

5) Training & validation results

[Figures: training and validation loss and accuracy curves]

These results are based on 10,000 training examples and 1,000 validation examples. While training loss and accuracy behaved as desired, the course of the validation loss and accuracy shows clearly that the model overfits heavily. The validation loss never decreases; even worse, it increases steadily. In parallel, the validation accuracy improves only slightly, is relatively unstable and then drops rapidly. Attempts to counter this overfitting by tuning high values for dropout and L2 weight decay showed limited effectiveness. Other attempts to prevent overfitting, such as drastically reducing the model’s hidden and embedding size or freezing the word embedding parameters (so that the LSTM weights and M remain the only learnable parameters), did not improve the validation results either. The best validation accuracy was achieved with the hyperparameter configuration shown in the table.

[Table: hyperparameter configuration with the best validation accuracy]
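Continuing the sketches above, the regularization attempts mentioned here (dropout, L2 weight decay, frozen embeddings) might be configured as follows; the class names come from the earlier sketches and all hyperparameter values are placeholders, not the ones actually used in this project:

```python
import torch.optim as optim

# Illustrative configuration of the regularization options described above.
model = DualEncoder(Encoder(vocab_size=50000, embedding_dim=100, hidden_dim=50,
                            glove_weights=None, dropout=0.5))

# Optionally freeze the embedding layer so only the LSTM weights and M are trained.
model.encoder.embedding.weight.requires_grad = False

# weight_decay adds an L2 penalty on the trainable parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=1e-3, weight_decay=1e-4)
```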

6) Test data & test results

Testing was done with two different approaches.

  • I used a subsample of my split_testing.csv, which has the same structure as the data used for training and validation (context - response - label). The appropriate metric here is accuracy: it simply measures the chance that the label (1 or 0) is classified correctly.
  • I used a subsample of the original test.csv, which has a different structure than the training and validation data:

[Figure: sample row of test.csv (context, ground truth response, 9 distractors)]

It contains 11 columns: 1 context utterance to which we want to find an appropriate response, 1 ground truth utterance which is the true appropriate response, and 9 distractor utterances which, together with the ground truth utterance, make up 10 candidate responses in total. For each candidate response the model computes the sigmoid score (the probability of that candidate being the ground truth response) and is challenged to rank the candidates accordingly. For this testing approach, the appropriate metric is recall at k, here for k = 5, 2, 1: the ranking is considered correct if the ground truth utterance is among the k highest-scored candidates. Given the above validation results, the test results are expectedly weak. The model basically performs no better than random guessing:

[Table: test results (accuracy and recall at k)]
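For illustration, recall at k for a single test row could be computed along these lines, assuming the dual encoder sketch above and that the context and the 10 candidates are already numericalized (1, seq_len) tensors, with the ground truth at index 0:

```python
import torch

def recall_at_k(model, context, candidates, k):
    """Rank the candidate responses for one context; return 1 if the ground truth
    (assumed to be candidates[0]) is among the k highest-scored candidates, else 0."""
    scores = []
    with torch.no_grad():
        for candidate in candidates:                   # 1 ground truth + 9 distractors
            scores.append(model(context, candidate).item())
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(0 in ranked[:k])                        # index 0 = ground truth

# Averaging recall_at_k over all test rows for k in (1, 2, 5) gives recall@1, @2, @5.
```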

7) Key findings

As explained before, the most distinctive characteristic of the UDC is its size, which makes it well suited to accelerate progress in dialog system research through deep learning. In their paper “Improved Deep Learning Baselines for Ubuntu Corpus Dialogs”, the authors set the benchmark performance for the dual encoder model architecture on the UDC dataset. To achieve good recall rates, they trained the model on at least 100,000 and up to the full 1,000,000 training examples available.

[Table: benchmark recall results from the paper]

This sheds some light on why training on 10,000 examples leads the model into an overfitting situation that cannot be regularized with the methods applied in this project. It implies that a large amount of training data is crucial for good performance in dialog systems and that only increasing the amount of training data might finally have the desired regularizing effect on the LSTM model developed in this project.

Taking a closer look at the benchmark test results, it is interesting that even with the maximum amount of training data, the recall for k = 1 is at 63.8% and has not yet reached a plateau. In other words: even with 1,000,000 training examples, in around one third of all cases a chatbot user, for instance a customer seeking technical advice, will not receive the most appropriate response to their question. This illustrates one of the biggest shortcomings of today’s chatbots - their need for vast amounts of data. In practice, only few companies can provide such amounts of data when they first start putting chatbots in dialog with their customers. That is why many chatbots are first released with mediocre performance and gather data to learn over time, frustrating and driving away many users in the process. The authors of the benchmark paper therefore propose three ways of enhancing the performance of dialog systems:

  • improved pre-processing approaches
  • sophisticated model ensembles (combinations of multiple model architectures)
  • extended models with memory (e.g. providing external sources of information from user manuals)
