The seven methods described here are all summarized from reading the literature. Not all of them are mature, stable, or commercialized; the purpose is only to offer ideas for reference.
Rule-based methods are usually used when no training data is available. Because this approach is very different from the statistics-based methods that follow, it is numbered method zero.
A rule-based parsing system usually consists of two parts. One is the "rule base", whose parsing rules are usually written as a context-free grammar (CFG). The other is a synonym database (thesaurus), which records common synonyms for each standard word.
The whole parsing process is a context-free grammar reduction. First, automatic word segmentation is carried out. Then the words in the user question are reduced to standard words according to the thesaurus. Finally, the reduced question is compared with the parsing rules in the rule base. Once a comparison succeeds, the user question has been reduced to the standard question corresponding to that rule.
For example, suppose the thesaurus contains two records, "Failed: No Access, No Entry, Unsuccessful, Error" and "Login: Login, Login" (the two synonyms are distinct words in the original Chinese), and the rule base contains one rule, "Account Login Failed: [Account] [Login] [Failed]".
A user asks, "Why can't I log in to my account?" Assume the word segmentation is correct and yields "How can I | account | login | get on |". After the words are reduced to standard words, the question becomes "How did my account login fail?". Comparing this with the rule "Account Login Failed: [Account] [Login] [Failed]" succeeds, so the user question is reduced to the standard question "account login failed". The system then returns the standard answer stored for "account login failed" to the user, which completes the interaction.
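The reduce-and-match flow in this example can be sketched roughly as follows. This is a simplified illustration, not a full CFG parser: it assumes word segmentation has already produced a token list, treats a rule as a flat ordered sequence of standard words, and uses made-up English thesaurus entries (`THESAURUS`, `RULES`, `reduce_tokens` and `match_rule` are all hypothetical names).

```python
# Hypothetical thesaurus: standard word -> surface forms that reduce to it.
THESAURUS = {
    "account": {"account"},
    "login": {"login", "log-in", "sign-in"},
    "failed": {"failed", "unsuccessful", "error"},
}

# Hypothetical rule base: standard question -> ordered standard words.
RULES = {
    "account login failed": ["account", "login", "failed"],
}

# Reverse index: surface form -> standard word.
_REVERSE = {form: std for std, forms in THESAURUS.items() for form in forms}


def reduce_tokens(tokens):
    """Reduce tokens to standard words, dropping tokens not in the thesaurus."""
    return [_REVERSE[t] for t in tokens if t in _REVERSE]


def match_rule(tokens):
    """Return the standard question whose rule matches the reduced tokens, else None."""
    reduced = reduce_tokens(tokens)
    for standard_question, pattern in RULES.items():
        if reduced == pattern:
            return standard_question
    return None
```

For instance, `match_rule(["my", "account", "sign-in", "error"])` reduces the tokens to `["account", "login", "failed"]` and returns the standard question `"account login failed"`.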
This approach can solve the problem to a certain extent, but its shortcomings are serious. First, the "rule base" and "synonym base" must be built manually, which requires a huge, long-term investment of human resources. The possible expressions of a language are theoretically infinite, while the rules and synonyms anyone can think of are always limited. Moreover, as language evolves or the business changes, maintaining the rule base and synonym base also requires continuous human investment.
Second, writing the rule base requires rich experience and places extremely high demands on the writers. Because the parsing rules are quite abstract, conflicts between different rules are inevitable even when the writers are experienced (and worse when they are not): the same user question may be matched successfully by the parsing rules of several standard questions. In that case, selecting or ranking the candidate standard questions requires another system.
From another angle, we can regard the process of finding a standard question for a user question as a search process: a query goes in and documents come out.
We can therefore try to parse user questions with the retrieval models used in traditional search engines. "On the Basis of Search Engine (I)" mentioned that BM25 is the best retrieval model at present, so we will take the BM25 model as an example.
The calculation formula of BM25 model is as follows:
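A commonly cited form of BM25 that is consistent with the four factors and the symbols described in this section is sketched here. It assumes no relevance-feedback information is available, writes N for the total number of documents and n for the number of documents containing the term, and introduces avdl for the average document length, a symbol the text does not define.

```latex
\mathrm{score}(Q, D) = \sum_{t \in Q}
  \log\frac{N - n_t + 0.5}{n_t + 0.5}
  \cdot \frac{(k_1 + 1)\, f_t}{K + f_t}
  \cdot \frac{(k_2 + 1)\, qf_t}{k_2 + qf_t},
\qquad
K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)
```

The first factor is the IDF factor, the second combines the document term frequency with the document length (through K), and the third is the query term frequency factor.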
The BM25 formula integrates four factors: the IDF factor, document term frequency, the document length factor, and query term frequency. It uses three freely adjustable parameters (k1, k2, b) to adjust the weights of the combined factors.
Here N represents the total number of documents, n the number of documents containing the corresponding word, f the frequency of the word in the document, qf the frequency of the word in the query, and dl the document length.
There are three ways to use the BM25 model, depending on what is treated as a document: the standard questions alone; the standard questions together with their standard answers; or the sets of historical user questions that have been correctly matched to each standard question. In each case, the similarity between the documents and the user question is calculated with the formula, the candidates are sorted by similarity, and the standard question with the highest score is taken as the parsing result.
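As a rough sketch of the first option (standard questions as documents), the scoring and ranking could look like the code below. It assumes pre-tokenized, non-empty documents, the commonly cited BM25 form with no relevance information, and default parameter values (`k1=1.2`, `k2=100`, `b=0.75`) that are illustrative assumptions rather than values from the text.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, k2=100.0, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)                                 # N: total number of documents
    avdl = sum(len(d) for d in docs) / n_docs          # average document length
    doc_freq = Counter(t for d in docs for t in set(d))  # n: documents containing t
    query_freq = Counter(query)                        # qf: term frequency in the query
    scores = []
    for doc in docs:
        term_freq = Counter(doc)                       # f: term frequency in this document
        k = k1 * ((1 - b) + b * len(doc) / avdl)       # document length factor
        score = 0.0
        for term, qf in query_freq.items():
            f = term_freq[term]
            if f == 0:
                continue                               # only overlapping words contribute
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * ((k1 + 1) * f / (k + f)) * ((k2 + 1) * qf / (k2 + qf))
        scores.append(score)
    return scores


def best_match(query, docs):
    """Return the index of the highest-scoring document."""
    scores = bm25_scores(query, docs)
    return max(range(len(docs)), key=lambda i: scores[i])
```

Note that with this IDF form a word appearing in more than half of the documents gets a negative weight, which matters for small standard-question sets.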
I have not experimented with this idea. My guess is that although this method saves a lot of manpower, its performance in a closed-domain QA system like this one would be worse than that of the rule-based method above; methods based on a retrieval model should perform better in the open domain.
In addition, methods based on a traditional retrieval model have an inherent defect: the retrieval model can only handle words that overlap between the query and the document, and cannot handle semantic relatedness between words. The rule-based method above solves the semantic relatedness problem to some extent through its manually built thesaurus.
As mentioned above, the method based entirely on the retrieval model cannot deal with the semantic relevance of words.
To solve this problem to some extent, we can use methods such as LDA and SMT to mine synonymy relations between words from a corpus, and automatically build for each word a synonym table of appropriate size containing words whose degree of synonymy exceeds a threshold. When the query keywords are substituted into the retrieval model's formula, any synonyms of a keyword found in the document can be multiplied by a weight reflecting their degree of synonymy and then counted toward that keyword's term frequency.
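The weighted term-frequency idea can be sketched as below. The synonym table and its weights are assumed to have been mined beforehand (the mining step with LDA or SMT is not shown), and the function name and table layout are hypothetical.

```python
def weighted_term_frequency(term, doc_tokens, synonyms):
    """Term frequency of `term` in a document, counting synonyms at reduced weight.

    `synonyms` maps a keyword to {synonym: weight}, where each weight in (0, 1]
    reflects the degree of synonymy and is already above the chosen threshold.
    """
    weights = synonyms.get(term, {})
    tf = 0.0
    for token in doc_tokens:
        if token == term:
            tf += 1.0                 # exact match counts fully
        elif token in weights:
            tf += weights[token]      # synonym counts by its degree of synonymy
    return tf
```

The resulting value would replace the plain document term frequency f in the retrieval formula.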
LDA and SMT are introduced in "Talking about Intelligent Search and Conversational OS".
Simply put, LDA assigns words to different latent topics, and the similarity of two articles can be obtained by calculating the KL divergence (relative entropy) between their topic vectors θ. The SMT model comes from Microsoft; its purpose is to introduce a translation model into the traditional retrieval model, so as to improve the retrieval model's ability to handle semantically related word pairs. Baidu also uses this model to improve the quality of the results returned by its search engine.
Word embedding represents words in a distributed representation, that is, as word vectors in a low-dimensional vector space. With a distributed representation, the semantic relatedness of two words can be computed with cosine similarity. The counterpart is the one-hot representation, in which the dimension of a word vector equals the size of the vocabulary and the vectors of different words are orthogonal. The traditional set-of-words (SOW) and bag-of-words (BOW) models use the one-hot representation.
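The contrast between the two representations can be illustrated with a small sketch. The dense vectors below are made-up toy values, not trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, vocab_size):
    """One-hot vector: as long as the vocabulary, with a single 1."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


# Toy distributed vectors (illustrative values only).
dense = {
    "login": [0.9, 0.1],
    "sign-in": [0.8, 0.2],
    "refund": [0.1, 0.9],
}

# One-hot vectors of different words are orthogonal, so similarity is always 0.
print(cosine(one_hot(0, 3), one_hot(1, 3)))
# Distributed vectors can place related words close together.
print(cosine(dense["login"], dense["sign-in"]) > cosine(dense["login"], dense["refund"]))
```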
We can use deep learning to obtain distributed word vectors. For example, training an ordinary neural probabilistic language model yields word vectors, as does training a CBOW or Skip-gram model in the manner of word2vec. The neural probabilistic language model, CBOW and Skip-gram are all introduced in "Talking about Intelligent Search and Conversational OS".
Drawing on Baidu's approach, the idea of modeling with a DNN is as follows:
We need a set of positive and negative examples of (user question, standard question) pairs as the training corpus. Using the method above, both positive and negative examples are word-embedded and fed into the DNN, and the semantic difference between positive and negative examples is modeled with a pairwise ranking loss.
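The text does not say which pairwise ranking loss is used. A common choice is the margin-based hinge form sketched below, where `score_pos` and `score_neg` are assumed to be the network's matching scores for the positive pair and the negative pair of the same user question, and `margin` is a hypothetical hyperparameter.

```python
def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """Zero once the positive pair outscores the negative pair by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))


def mean_pairwise_loss(score_pairs, margin=1.0):
    """Average the loss over a batch of (score_pos, score_neg) tuples."""
    losses = [pairwise_hinge_loss(p, n, margin) for p, n in score_pairs]
    return sum(losses) / len(losses)
```

Minimizing this loss pushes the score of the correct standard question above the score of the incorrect one, which is what "modeling the semantic difference" amounts to here.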
The previous DNN-based method can solve the problem of semantic relatedness between words to some extent, but it does not properly handle short-distance dependencies within a sentence; for example, it cannot distinguish "A to B" from "B to A".
According to Baidu's evaluation results, CNN has a better performance in dealing with short-distance dependence.
The figure is ARC-1 from Dr. Li Hang's "Convolutional Neural Network Architectures for Matching Natural Language Sentences":
The basic idea of this method is to embed every word in the question and obtain a fixed-length word vector for each word. The question is then represented as a two-dimensional matrix in which each row is the word vector of the corresponding word. This matrix is convolved several times (the width of the convolution kernel equals the dimension of the word vector, and its height is mostly 2 to 5), and finally a one-dimensional feature vector is obtained. We process the user question and the standard question with the CNN at the same time, obtaining a feature vector for the user question and one for each standard question in the library. These two vectors are then concatenated and fed into a multilayer perceptron (MLP), which calculates the matching degree of the two questions.
In addition, it has been pointed out that if the two feature vectors are directly concatenated and fed into the MLP, the boundary information is lost, so we feed feature vector a, feature vector b and a^T b into the MLP together to calculate the similarity.
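A heavily simplified sketch of this pipeline is shown below. It is not the ARC-1 architecture itself: it uses a single convolution layer with ReLU and global max-pooling instead of the stacked layers in the paper, hand-written kernels instead of learned ones, and it reads "aTb" as the inner product of the two feature vectors, which is an interpretation.

```python
def conv_max_pool(word_vectors, kernels, height=2):
    """One convolution layer followed by global max-pooling.

    Each kernel spans `height` consecutive word vectors and their full
    dimension, so it holds height * dim weights. Returns one feature per kernel.
    """
    features = []
    for kernel in kernels:
        best = 0.0
        for start in range(len(word_vectors) - height + 1):
            # Flatten the window of `height` word vectors into one list.
            window = [x for vec in word_vectors[start:start + height] for x in vec]
            activation = max(0.0, sum(w * x for w, x in zip(kernel, window)))  # ReLU
            best = max(best, activation)
        features.append(best)
    return features


def mlp_input(a, b):
    """Concatenate feature vector a, feature vector b and their inner product."""
    inner = sum(x * y for x, y in zip(a, b))
    return a + b + [inner]
```

Both questions would be passed through `conv_max_pool` with the same kernels, and the result of `mlp_input` would then go to the MLP that outputs the matching degree.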
The structure of ARC-2 is also derived from the above paper by Dr. Li Hang:
The improvement of ARC-2 over ARC-1 is that ARC-2 lets the two sentences interact with each other before building a high-level abstract representation like ARC-1's final one, instead of first obtaining a separate high-level abstract representation of each sentence through its own CNN.
In the ARC-1 model, a feature map is just a column vector (a one-dimensional matrix), and several column vectors are combined to form the two-dimensional pattern in the ARC-1 schematic diagram. In ARC-2, a feature map becomes a two-dimensional matrix, and several two-dimensional matrices are stacked to form the three-dimensional pattern in the ARC-2 schematic diagram.
The subsequent convolution and pooling process is similar to CNN in CV. Similar to the previous method, 1D convolution involves the connection of two word vectors, and the previous method can also be used to avoid the loss of boundary information.
It has also been suggested that, in the ARC-2 structure, directly forming the input sentences from word vectors obtained by traditional word embedding is not the best scheme; a better scheme is to use the hidden states produced by an LSTM.
We can use the LSTM structure to train an RNN language model, as shown below (taking ordinary RNN as an example):
It can be seen from the figure that when the output is "E", the third component of the hidden layer vector is the largest; when the output is "L", the first component is the largest; and when the output is "O", the second component is the largest. We can use the hidden states of the RNN as distributed word vectors and feed them to the CNN (ARC-2) as input; in testing, this gives better results.
Highly reliable word segmentation results are the basic premise of the subsequent parsing steps.
In "Fundamentals of Natural Language Processing (II)", I introduced some classic word segmentation methods, but all of them were earlier research results. The CRF method is currently recognized as the most effective word segmentation algorithm.
The idea of the CRF method is very direct: word segmentation is treated as a sequence labeling problem, and each character in the sentence is labeled with its position within a word:
The process of CRF word segmentation is to label the position of each character, and then to form a word from each span running from a B label to the next E label and from each character labeled S. There are many public CRF-based word segmentation tools on the Internet.
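The last step, turning position labels into words, can be sketched as follows. The label sequence is assumed to come from an already trained CRF tagger (not shown). B, E and S are the labels named in the text; M for a character in the middle of a word is the usual fourth label in this scheme and is an assumption here.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character position labels.

    B = begins a word, M = middle of a word, E = ends a word, S = single-character word.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if current:               # tolerate an unfinished word before S
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            if current:               # tolerate a missing E before a new B
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        elif tag == "E":
            words.append(current + ch)
            current = ""
    if current:                       # flush a word left open at the end
        words.append(current)
    return words
```

For example, the characters of "我爱北京天安门" with labels "SSBEBME" are grouped into 我 / 爱 / 北京 / 天安门.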
At least four aspects can further improve the analysis quality on the basis of the existing model, including: problem standardization, user status, reinforcement learning and multiple rounds of dialogue.
The purpose of problem standardization is to have better fault tolerance for user input.
Simple cases include simplified/traditional Chinese standardization, full-width/half-width standardization, punctuation handling, and case standardization. More complicated cases include the correction of Chinese typos. Automatic error correction is widely used; it can greatly improve the user experience of the system and is very cost-effective.
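Several of the simple steps can be sketched with the Python standard library alone. This sketch covers full-width/half-width, case and punctuation handling; simplified/traditional conversion is left out because it needs a character mapping table that the standard library does not provide. Replacing punctuation with spaces is one possible policy, chosen here only for illustration.

```python
import unicodedata


def normalize_question(text):
    """Apply simple, order-dependent normalization steps to a user question."""
    # NFKC folds full-width letters, digits and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Case standardization.
    text = text.casefold()
    # Punctuation handling: replace any Unicode punctuation character with a space.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse runs of whitespace.
    return " ".join(text.split())
```

For example, `normalize_question("Ｌｏｇｉｎ　ＦＡＩＬＥＤ？！")` returns `"login failed"`.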
The common practice for typo correction is to train a noisy channel model.
We can extract features from the user's state and feed them to the neural network as additional input during both training and parsing.
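The noisy channel idea is to pick the correction c that maximizes P(c) · P(w | c) for the observed input w. The sketch below is a toy illustration on Latin-alphabet words rather than Chinese: the language model P(c) comes from hypothetical word counts, and the channel model P(w | c) is a crude assumption (a fixed probability `p_error` for any candidate one edit away) instead of a trained model.

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts)


def correct(word, counts, p_error=0.05):
    """Return the candidate maximizing P(candidate) * P(word | candidate)."""
    total = sum(counts.values())
    # Channel model: the word was typed correctly with probability 1 - p_error.
    candidates = {cand: p_error for cand in edits1(word)}
    candidates[word] = 1.0 - p_error
    best, best_score = word, 0.0
    for cand, channel in candidates.items():
        score = (counts.get(cand, 0) / total) * channel   # P(c) * P(w | c)
        if score > best_score:
            best, best_score = cand, score
    return best
```

With counts such as `{"login": 50, "logic": 5, "account": 30}`, the input "logim" is corrected to "login" because "login" is the more frequent of the two candidates one edit away.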
User states that can be considered at least include:
Secondly, we can adopt reinforcement learning and design a reasonable reward mechanism, so that the analysis system can update its strategy independently in the process of interacting with the environment.
Compared with ordinary supervised learning, reinforcement learning has two obvious advantages. First, the data needed to update the policy mainly comes from interaction with (sampling from) the environment, rather than from expensive manually labeled data. Second, the policies produced by reinforcement learning are updated iteratively and autonomously according to the reward mechanism, so they can contain innovative behavior instead of merely imitating the "standard" practices provided by humans.
Although QA question parsing has no notion of "strategy" or "innovative play" as games do, reinforcement learning can still save a lot of manual labeling cost in parsing optimization.
One of the core problems in the application of reinforcement learning method is the design of reward mechanism. In the design of reward mechanism in QA scenario, at least the following perspectives can be considered:
Multi-round dialogue technology can further improve the coherence of dialogue with users.
I tend to divide the multi-round dialogue into two scenarios: closed domain and open domain, and the realization ideas of different scenarios should be different.
The characteristic of multi-round dialogue in the closed-domain scenario is that the problems the system can solve form a finite set, and the purpose of multi-round dialogue is to guide users toward the problems we can solve.
The characteristic of multi-round dialogue in the open-domain scenario is that the problems the system needs to solve form an infinite set, and the purpose of multi-round dialogue is to understand the user's needs more accurately from the context.
Under this guiding principle, the core idea of multi-round dialogue in the closed domain should be "slot filling", while the core ideas of multi-round dialogue in the open domain are "context replacement" and "topic completion".
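For the closed-domain case, the control flow of slot filling can be sketched as below: the system keeps asking for whichever required slot is still missing, and answers once all slots are filled. The intent name and slot names are hypothetical, and extracting slot values from user utterances (the sequence labeling part) is not shown.

```python
# Hypothetical closed-domain task and its required slots, for illustration only.
REQUIRED_SLOTS = {
    "refund": ["order_id", "reason"],
}


def next_action(intent, filled_slots):
    """Decide the next dialogue step for an intent given the slots collected so far."""
    for slot in REQUIRED_SLOTS[intent]:
        if slot not in filled_slots:
            return ("ask", slot)      # guide the user toward a solvable question
    return ("answer", intent)         # all slots filled: the standard answer can be given
```

Each user turn would update `filled_slots`, and `next_action` would be called again until it returns an `"answer"` step.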
"Talking about Intelligent Search and Conversational OS" introduces that Baidu uses slot filling technology to do NLU, and uses "context replacement" and "topic completion" to improve its DuerOS dialogue ability.
Furthermore, the technical basis of slot filling, context replacement and topic completion is "sequence labeling". Here are two slides from Baidu:
According to Baidu's slides, it is commercially feasible to use a bidirectional LSTM + CRF for sequence labeling.
Choosing the right moment to hand the user over to a human agent is also one way to improve the overall performance of a QA system. The core problem is balancing user experience against cost: the earlier the handoff, the better the user experience, but the higher the cost.
The following is a simple approach used by Ant Financial's customer service. If the system gives the user the same answer three times in a row, the button for reaching a human agent is displayed. If the user asks about customer service twice in a row (such as "I want a human agent" and "What's your customer service phone number"), the button is also displayed.
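The two rules can be sketched as a small check over the recent dialogue history. Detecting a "customer service question" is reduced here to a naive keyword match on hypothetical phrases; a real system would presumably use its own question parser for that.

```python
# Hypothetical phrases that mark a request for customer service.
CUSTOMER_SERVICE_PHRASES = ("human agent", "customer service")


def asks_for_customer_service(message):
    """Naive keyword check standing in for real intent detection."""
    text = message.lower()
    return any(phrase in text for phrase in CUSTOMER_SERVICE_PHRASES)


def should_show_manual_button(system_answers, user_messages):
    """Apply the two handoff rules to the dialogue history (oldest first)."""
    # Rule 1: the same answer was given three times in a row.
    same_answer = len(system_answers) >= 3 and len(set(system_answers[-3:])) == 1
    # Rule 2: the last two user messages both ask about customer service.
    last_two = user_messages[-2:]
    asked_twice = len(last_two) == 2 and all(asks_for_customer_service(m) for m in last_two)
    return same_answer or asked_twice
```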
Another important part of the question answering system is the answer base.
The optimization of answer entry can be considered from at least three angles:
The diversity of answer forms is easy to understand. For example, Ma Xiao answers questions in a variety of forms, including text, links, pictures and videos.
Personalization was already touched on in the parsing optimization above (taking user state into account), and the same ideas can be applied to answer entry. We can provide different personalized answers for users with different registration times, payment amounts and access paths.
The third angle sounds abstract, but it is also easy to understand. Generally speaking, I grade the answers of a QA system into three levels: "map level", "navigation level" and "special car level":
According to the original scenario classification of man-machine dialogue systems, a QA system that provides "special car level" answers can be called a VPA.
For the optimization of the answer base, assuming answer entry is already complete (the answer forms are rich enough and personalized answers are provided for different users), there are at least two optimization points: a standard answer in the answer base is wrong, or the answers to some questions are missing from the answer base.
The reward mechanism designed for reinforcement learning in parsing optimization can also be used to find problems in the answer base, because it is often difficult to tell clearly whether a user's negative feedback is directed at the parsing system or at the answer itself.
In addition to finding problems from users' negative feedback, we should also have some preventive mechanisms to avoid these problems in advance for the above two optimization points.
Take the first point, "there is something wrong with the standard answer in the answer base". If the cause is not a mistake by the people entering the answers, the most likely cause is timeliness: we have given users an expired answer. To solve this problem, we can add a "temporary" label when entering an answer, indicating that the answer is highly time-sensitive and needs to be updated in time.
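One way the "temporary" label could be put to use is sketched below: each answer entry carries the label, and entries that are both temporary and older than some review window are surfaced for updating. The `entered_on` field and the `max_age_days` window are assumptions added for illustration; the text only specifies the label itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class AnswerEntry:
    question: str
    answer: str
    temporary: bool       # the "temporary" label: the answer is time-sensitive
    entered_on: date      # assumed field: when the answer was entered


def needs_review(entries, today, max_age_days=30):
    """Return temporary answers older than the (hypothetical) review window."""
    limit = timedelta(days=max_age_days)
    return [e for e in entries if e.temporary and today - e.entered_on > limit]
```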
As for the second point, "the answers to some questions are missing in the answer base", the greatest possibility comes from unexpected events and business changes. For example, the system service is down, a new version of the system is installed or some operational activities are organized. For these changes that may cause users doubts, we should prepare some FAQs in advance and enter them into the answer base.
In addition, when we input new questions and their standard answers, we need to pay attention to the adaptability of the new input questions with the original analysis system to avoid the situation that the new input questions are difficult to be analyzed by the analysis system. The methods that can be adopted include, for example, actively inputting some different questions as the initial training corpus while inputting new questions (the practice of Netease Qiyuyun customer service).