Chatbot | Retrieval-based Dialog System on the Ubuntu Dialog Corpus | LSTM | PyTorch

1) Idea

One of the top goals in artificial intelligence research is to enable a computer to hold natural, coherent conversations with humans - ideally in such a way that the computer tricks the human into thinking that it is talking to another human being. As of now, no chatbot can pass the Turing test, and research is still very far from a generative open-domain dialog system, i.e. one that can freely compose sentences to hold human-style conversations about all kinds of topics. Despite that, more and more companies use chatbots to create value, e.g. in technical customer support or as personal assistants. Often, these are closed-domain, retrieval-based chatbots: using some kind of heuristic, they pick an appropriate response from a fixed repository of responses predefined by humans within a narrowly specified range of topics. Taking inspiration from such retrieval-based chatbots, the goal of this project was to build a model that, at best, is able to retrieve the most appropriate response to a conversational input from a whole pool of candidate responses.

2) Corpus

In the paper “The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems”, the hypothesis is made that the lack of progress in recent years in building sophisticated dialog systems is due to the lack of sufficiently large datasets. The authors therefore compiled a new large corpus to exploit the opportunities of deep learning for dialog systems: the Ubuntu Dialog Corpus (UDC). It is a collection of chat logs from Ubuntu-internal chat rooms, in which users of the operating system seek and give Ubuntu-related technical advice. With around 930,000 dialogs, it is the largest freely available corpus of its kind, and it has several favourable characteristics: it targets a task-specific domain, namely technical support, and its conversations are two-way oriented, going back and forth between the two users at least three times. As such, it is much more representative of natural human dialog than, for example, common microblogging datasets such as those from Twitter or Weibo, in which the exchanges lack a natural conversational structure and length.

3) Data & task

The data comes fully tokenized, stemmed and lemmatized, and entities like names, locations, organizations, URLs, and system paths were replaced with special tokens.

[Figure: example of the preprocessed data with entity placeholder tokens]

Most importantly, the training set is separated into three columns: the context of the conversation (all previous utterances in the conversation up to the point at which the next response is to be found), a candidate response, and a label denoting whether that candidate response is a true response to the context (label = 1) or a response drawn randomly from elsewhere in the dataset (label = 0), with a 50%/50% distribution of positive and negative examples. Given this structure, the task is to classify whether an utterance is a true response to a given context utterance or not; it is thus a binary classification problem over context-response pairs. However, I did not strictly stick with the original train/validation/test setup: I split the train.csv into three parts (80%/10%/10%) to create my own training/validation/test files, which all share the same column structure as the original train.csv. In addition, I took a subsample of the original test.csv as a second test file with a different structure, as explained in greater detail in the testing section.

[Figure: sample rows of the training data (context, response, label)]
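As a rough illustration, the 80%/10%/10% split could be produced along the following lines. The column names follow the original train.csv; apart from split_testing.csv (referred to in the testing section), the file names are made up for this sketch:

```python
import pandas as pd

# Illustrative sketch: split the original train.csv (columns: Context, Utterance, Label)
# into training/validation/test files with an 80%/10%/10% ratio.
df = pd.read_csv("train.csv")

# Shuffle the rows before splitting so each part keeps the 50/50 label balance.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

n = len(df)
train_end = int(0.8 * n)
val_end = int(0.9 * n)

df.iloc[:train_end].to_csv("split_training.csv", index=False)
df.iloc[train_end:val_end].to_csv("split_validation.csv", index=False)
df.iloc[val_end:].to_csv("split_testing.csv", index=False)
```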

4) Model architecture

Python 3.5 and PyTorch 0.3.0 were used to implement a neural network that can be described as a dual encoder model. The main component is an encoder model composed of three layers:

  • Embedding layer: initialized with pre-trained GloVe vectors for those words for which they are available, otherwise randomly initialized (from a standard normal distribution), and fine-tuned during training. Dimension: vocabulary size x embedding dimension.
  • LSTM layer: unidirectional, single-layer LSTM with input-to-hidden weights initialized from a uniform distribution and hidden-to-hidden weights with orthogonal initialization, following the original paper’s recommendations. At each time step, one word vector of the input utterance is fed into the LSTM and the hidden state is updated. For this classification task, we are only interested in the last hidden state of each input sequence, which can be interpreted as a numerical summary of the input utterance.
  • Dropout layer: according to the PyTorch documentation and forum, the built-in dropout of the LSTM layer has no effect if num_layers = 1 (since it is by definition not applied to the last layer). Therefore, the built-in dropout was set to 0.0 and an additional dropout layer was added to the model such that it applies directly to the last hidden state of each input sequence.

[Figure: encoder architecture]
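A minimal sketch of such an encoder is shown below. It is written against a current PyTorch API rather than version 0.3.0, and class, argument and variable names are illustrative, not the ones used in this repository:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoder: embedding -> single-layer LSTM -> dropout on the last hidden state."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, glove_weights=None, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if glove_weights is not None:
            # rows covered by GloVe are copied in; the rest stay randomly initialized
            self.embedding.weight.data.copy_(glove_weights)
        # built-in LSTM dropout has no effect with num_layers=1, so it is kept at 0.0
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=1,
                            batch_first=True, dropout=0.0)
        self.dropout = nn.Dropout(dropout)

        # uniform input-to-hidden and orthogonal hidden-to-hidden initialization
        for name, param in self.lstm.named_parameters():
            if "weight_ih" in name:
                nn.init.uniform_(param, -0.01, 0.01)
            elif "weight_hh" in name:
                nn.init.orthogonal_(param)

    def forward(self, utterance):                      # utterance: (batch, seq_len) token ids
        embedded = self.embedding(utterance)           # (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(embedded)              # h_n: (1, batch, hidden_dim)
        return self.dropout(h_n.squeeze(0))            # last hidden state per sequence
```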

To obtain the dual encoder model, one instance of the encoder model is applied to the context utterance and subsequently to the response utterance of each training example. The two outputs (the last hidden state of the context input sequence, denoted c, and the last hidden state of the response input sequence, denoted r) are then used to calculate the probability that the pair is a valid one: a weight matrix M is initialized as another learnable parameter (in addition to the embedding weights and the LSTM weights) and used to map between c and r. This can be interpreted as a predictive approach: given some context c, multiplying c with M predicts what a possible response r’ could look like. The similarity of r’ to the actual response r is then measured with the dot product and converted to a probability with the sigmoid function.

[Figure: dual encoder scoring of a context-response pair]

A high similarity results in a high dot product and a sigmoid value close to 1. The model was trained by minimizing the binary cross entropy loss.
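Building on the encoder sketch above, the dual encoder scoring could look roughly like this (again an illustrative sketch, not the exact implementation):

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Sketch of the dual encoder: score a (context, response) pair via sigmoid(c M . r)."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        hidden = encoder.lstm.hidden_size
        # M maps a context summary c to a "predicted response" r' = c M
        self.M = nn.Parameter(torch.randn(hidden, hidden) * 0.01)

    def forward(self, context, response):
        c = self.encoder(context)            # (batch, hidden)
        r = self.encoder(response)           # (batch, hidden)
        r_pred = c @ self.M                  # predicted response representation r'
        score = (r_pred * r).sum(dim=1)      # dot product between r' and the actual r
        return torch.sigmoid(score)          # probability that the pair is a true pair

# Training would minimize binary cross entropy on these probabilities, e.g.:
# loss = nn.BCELoss()(model(context_batch, response_batch), label_batch.float())
```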

5) Training & validation results

[Figures: training and validation loss and accuracy curves]

These results are based on 10,000 training examples and 1,000 validation examples. While training loss and accuracy behaved as desired, the course of the validation loss and accuracy shows clearly that the model overfits heavily. The validation loss never decreases; even worse, it increases steadily. In parallel, the validation accuracy improves only slightly, is relatively unstable and then drops rapidly. Attempts to counter this overfitting by tuning high values for dropout and L2 weight decay showed limited effectiveness. Other attempts to prevent overfitting, such as drastically reducing the model’s hidden and embedding size or freezing the word embedding parameters (so that the LSTM weights and M remain the only learnable parameters), did not improve the validation results either. The best validation accuracy was achieved with the hyperparameter configuration shown in the table.

[Table: hyperparameter configuration with the best validation accuracy]
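Continuing the sketches above, the regularization attempts mentioned here (dropout, L2 weight decay, frozen embeddings) might be configured as follows; the class names come from the earlier sketches and all hyperparameter values are placeholders, not the ones actually used in this project:

```python
import torch.optim as optim

# Illustrative configuration of the regularization options described above.
model = DualEncoder(Encoder(vocab_size=50000, embedding_dim=100, hidden_dim=50,
                            glove_weights=None, dropout=0.5))

# Optionally freeze the embedding layer so only the LSTM weights and M are trained.
model.encoder.embedding.weight.requires_grad = False

# weight_decay adds an L2 penalty on the trainable parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=1e-3, weight_decay=1e-4)
```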

6) Test data & test results

Testing was done with two different approaches.

  • I used a subsample of my split_testing.csv, which has the same structure as the data used for training and validation (context - response - label). The appropriate metric here is accuracy: it simply measures the chance that the label (1 or 0) is classified correctly.
  • I used a subsample of the original test.csv, which has a different structure than the training and validation data:

[Figure: sample row of test.csv (context, ground truth response, 9 distractors)]

It contains 11 columns: 1 context utterance to which we want to find an appropriate response, 1 ground truth utterance which is the true appropriate response, and 9 distractor utterances which, together with the ground truth utterance, make up 10 candidate responses in total. For each candidate response the model computes the sigmoid score (the probability of that candidate being the ground truth response) and is challenged to rank the candidates accordingly. For this testing approach, the appropriate metric is recall at k, here for k = 5, 2, 1: the ranking is considered correct if the ground truth utterance is among the k highest-scored candidates. Given the above validation results, the test results are expectedly weak. The model basically performs no better than random guessing:

[Table: test results (accuracy and recall at k)]
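For illustration, recall at k for a single test row could be computed along these lines, assuming the dual encoder sketch above and that the context and the 10 candidates are already numericalized (1, seq_len) tensors, with the ground truth at index 0:

```python
import torch

def recall_at_k(model, context, candidates, k):
    """Rank the candidate responses for one context; return 1 if the ground truth
    (assumed to be candidates[0]) is among the k highest-scored candidates, else 0."""
    scores = []
    with torch.no_grad():
        for candidate in candidates:                   # 1 ground truth + 9 distractors
            scores.append(model(context, candidate).item())
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(0 in ranked[:k])                        # index 0 = ground truth

# Averaging recall_at_k over all test rows for k in (1, 2, 5) gives recall@1, @2, @5.
```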

7) Key findings

As explained before, the most distinctive characteristic of the UDC is its size, which makes it well suited to accelerate progress in dialog system research through deep learning. In their paper “Improved Deep Learning Baselines for Ubuntu Corpus Dialogs”, the authors set the benchmark performance for the dual encoder model architecture on the UDC dataset. To achieve good recall rates, they trained the model on at least 100,000 and up to the full 1,000,000 training examples available.

[Table: benchmark recall results from the paper]

This sheds some light on why training on 10,000 examples leads the model into an overfitting situation that cannot be regularized with the methods applied in this project. It implies that a large amount of training data is crucial for good performance in dialog systems and that only increasing the amount of training data might finally have the desired regularizing effect on the LSTM model developed in this project.

Taking a closer look at the benchmark test results, it is interesting that even with the maximum amount of training data, the recall for k = 1 is at 63.8% and has not yet reached a plateau. In other words: even with 1,000,000 training examples, in around one third of all cases a chatbot user, for instance a customer seeking technical advice, will not receive the most appropriate response to their question. This illustrates one of the biggest shortcomings of today’s chatbots - their need for vast amounts of data. In practice, only few companies can provide such amounts of data when they first start putting chatbots in dialog with their customers. That is why many chatbots are first released with mediocre performance and gather data to learn over time, frustrating and driving away many users in the process. The authors of the benchmark paper therefore propose three ways of enhancing the performance of dialog systems:

  • improved pre-processing approaches
  • sophisticated model ensembles (combinations of multiple model architectures)
  • extended models with memory (e.g. providing external sources of information from user manuals)
