Question Answering Using Neural Networks
Google answers questions posed in queries and also returns URLs in response to queries.
Google shows answers when a searcher's intent is to get a question answered, and provides a list of links to URLs when a query may be best answered by a page listed in its index.
I wrote about Google meeting a searcher’s intent that way in Entity Seeking Queries and Semantic Dependency Trees.
Google has been working on answering questions directly in response to queries.
I have written about Google using answer passages, and about how Google may provide a direct answer to a question seeking a specific answer, rather than a string of links to pages that may provide answers.
One recent question-answering post I wrote about was Does Google Use Schema to Write Answer Passages for Featured Snippets?
I wrote many posts about related question answering patents, and they are at:
We don’t know which of these approaches the search engine uses. Still, I have seen more patent applications published and granted at Google that involve machine learning approaches using neural networks.
This patent appears different from those because it uses a word vectors approach to understand and answer questions.
You may remember that I wrote about those in the post Citations behind the Google Brain Word Vectors Approach. That post tells us about the algorithm behind RankBrain and how Google may identify words that are missing in queries based on the meanings of words that appear in those queries.
This new patent application describes a system that selects a text span from an input electronic document to answer an input question.
The patent provides a very brief summary of how it works:
By employing lightweight, i.e., computationally-efficient, models combined in a cascade to find the answer to an input question, the described systems can locate text in an input document that answers the input question.
In particular, the described systems can outperform more complex, less computationally efficient architectures. Thus, the described systems can answer received questions while consuming fewer computing resources, e.g., less memory and less processing power, than conventional approaches, which may be particularly helpful when the systems are in resource-constrained environments, e.g., on mobile devices.
In particular, the systems can meet state-of-the-art results on many question-answering tasks despite consuming many fewer computational resources than previous state-of-the-art systems, e.g., systems that use computationally-intensive recurrent neural networks to process document tokens, question tokens, or both.
This patent can be found at:
Selecting Answer Spans From Electronic Documents Using Neural Networks
Inventors: Thomas Mieczyslaw Kwiatkowski, Ankur P. Parikh, Swabha Swayamdipta
Filed Date: October 29, 2018
Publication Number: US20200265327
Publication Date: August 20, 2020
Applicants: Google LLC
We see references to a word vectors approach in the abstract for this patent.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, select a text span from an input electronic document that answers an input question.
One of the methods includes obtaining a respective first numeric representation of each of the text spans in the input document and, for each of the text spans:
Determining, for a segment that contains the text span, a question-aware segment vector
Determining, for the question, a segment-aware question vector
Processing the first numeric representation of the text span, the question-aware segment vector, and the segment-aware question vector using a second feedforward neural network to generate a second numeric representation of the text span
For each unique text span in the plurality of text spans:
Determining an aggregate representation for the unique text span
Determining, from the aggregate representation, a final score for the unique text span
Selecting a unique text span.
Two of the inventors listed on this patent are co-authors on two papers on question answering. The first is from 2017: Learning Recurrent Span Representations For Extractive Question Answering
The reading comprehension task that asks questions about a given evidence document is a central problem in natural language understanding.
Recent formulations of this task have typically focused on answer selection from a set of candidates pre-defined manually or through the use of an external NLP pipeline.
However, Rajpurkar et al. (2016) recently released the SQuAD dataset in which the answers can be arbitrary strings from the supplied text.
This paper focuses on this answer extraction task, presenting a novel model architecture that efficiently builds fixed-length representations of all spans in the evidence document with a recurrent network.
We show that scoring explicit span representations significantly improves performance over other approaches that factor the prediction into separate predictions about words or start and end markers.
Our approach improves upon the best-published results of Wang & Jiang (2016) by 5% and decreases the error of Rajpurkar et al.’s baseline by > 50%.
The second is from 2019: Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index
The abstract tells us that it is about:
Existing open-domain question answering (QA) models are not suitable for real-time usage because they need to process several long documents on-demand for every input query.
This paper introduces the query agnostic indexable representation of document phrases that can drastically speed up open-domain QA and allow us to reach longtail targets.
In particular, our dense-sparse phrase encoding effectively captures syntactic, semantic, and lexical information of the phrases and eliminates the pipeline filtering of context documents.
Leveraging optimization strategies, our model can be trained in a single 4-GPU server and serve the entire Wikipedia (up to 60 billion phrases) under 2TB with CPUs only.
Our experiments on SQuAD-Open show that our model is more accurate than DrQA (Chen et al., 2017) with 6000x reduced computational cost, which translates into at least 58x faster end-to-end inference benchmark on CPUs.
Returning to the patent, it describes a system that chooses a text span from an electronic document to answer a received question.
Once a text span is found to answer the question, the selected text span can be chosen as part of a response to the question.
The input question may have been a voice query, and the system can then provide a spoken utterance in response to the query.
A mobile device, like a smart speaker or another computing device interacting with the user with voice inputs, can receive a voice query spoken by the user and send the received query to the system (e.g., over a data communication network.)
The system can then identify a candidate electronic page that may contain the answer to the received query, select a text span from the page using the techniques described in this specification, and then send the text span to the computing device as part of a response to the voice query (i.e., as data representing a verbal utterance of the text span or as a text for conversion to speech at the computing device.)
In some cases, the user can identify the candidate page.
If the user submitted the voice query while viewing a given document using the computing device, the system could identify the given document as the candidate electronic document.
In some other cases, an external system (e.g., an Internet search engine) identifies the candidate electronic document according to the query and then provides the candidate electronic document to the system.
The system can receive the question as a text query and provide the text span for presentation on a user device as part of the response to the text query.
An Internet search engine can receive the text query, and the text span identified by the system can be used by the Internet search engine as part of the response to the search query (e.g., as a formatted presentation of content along with search results identified by the Internet search engine as being responsive to the query.)
Identifying Answering Text Spans to Help with Question Answering
The system may receive an input question and input electronic document and identify a text span from the electronic document that answers the question.
Both are tokenized (i.e., so that the text of both the input question and the electronic document is represented as a respective set of tokens.)
A token can be, e.g., a word, a phrase, or another n-gram selected from a vocabulary of possible tokens.
When an electronic document is received, the system identifies candidate text spans on the page.
The system can identify as a candidate text span each possible consecutive sequence of tokens in the page that includes fewer than a threshold number of tokens.
Because the same candidate text span can occur many times throughout the page, the system also identifies, from the candidate text spans in the document, a set of unique text spans (i.e., so that no text span in the set of unique text spans corresponds to any other text span in the set of unique text spans.)
The system may consider one text span to correspond to another if the two text spans are within a threshold edit distance.
Alternatively, the system may consider two text spans to correspond to each other if a named entity recognition system determines that they refer to the same entity.
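The candidate and unique span steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `candidate_spans` and `group_unique_spans` are hypothetical, and grouping by exact lowercase text is a simplification standing in for the edit-distance or named-entity matching the patent mentions.

```python
def candidate_spans(tokens, max_len):
    """Return (start, end) pairs for every run of fewer than max_len consecutive tokens.

    `end` is exclusive, so a span covers tokens[start:end].
    """
    spans = []
    for start in range(len(tokens)):
        # Span length is end - start, which stays strictly below max_len.
        for end in range(start + 1, min(start + max_len, len(tokens) + 1)):
            spans.append((start, end))
    return spans


def group_unique_spans(tokens, spans):
    """Map each distinct span text to the positions (mentions) where it occurs.

    Assumption: two spans "correspond" only when their lowercase text matches.
    """
    mentions = {}
    for start, end in spans:
        text = " ".join(tokens[start:end]).lower()
        mentions.setdefault(text, []).append((start, end))
    return mentions


tokens = "the cat saw the dog".split()
spans = candidate_spans(tokens, max_len=3)
unique = group_unique_spans(tokens, spans)
print(len(spans), len(unique))
print(unique["the"])
```

Here the word "the" appears twice in the page, so it yields two candidate spans but only one unique span with two mentions. Those mentions are what the last level of the cascade later aggregates.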
A Cascaded Machine Learning Question Answering Model
This system uses a cascaded machine learning system (i.e., a machine learning system having a cascaded model architecture) to select a text span from the set of unique text spans as the text span that answers the input question.
The cascaded model architecture uses three levels of machine learning models: level 1, level 2, and level 3.
This is a “cascade” because the models in each level of the cascade receive input from the outputs of models in previous levels of the cascade.
The model in the final level of the cascade, i.e., level 3, generates the final prediction of the machine learning system from the output of the model in the previous level, i.e., level 2.
Level 1 of the cascade operates on simple features of the question and the candidate text spans to generate a respective first numeric representation of each text span.
A numeric representation is an ordered collection of numeric values (e.g., a vector, a matrix, or higher-order tensor of floating-point values or quantized floating-point values.)
The models in level 1 operate only on embeddings from a dictionary of pre-trained token embeddings and, optionally, a binary question-word feature that indicates whether a given span contains a token from the question.
An embedding is a vector of numeric values in a fixed dimensional space.
Because the embeddings have been pre-trained, positions in the fixed dimensional space reflect similarities (e.g., semantic similarities) between the tokens that the embeddings represent.
For example, the embedding for the word “king” may be closer in the fixed dimensional space to the embedding for the word “queen” than to the embedding for the word “pawn.”
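To make the "closer in the embedding space" idea concrete, here is a minimal sketch. The three-dimensional vectors are made-up toy values, not real pre-trained embeddings, and cosine similarity is just one common way to measure closeness. The patent does not say which measure is used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings invented for illustration only.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "pawn": [0.1, 0.3, 0.9],
}

king_queen = cosine(toy_embeddings["king"], toy_embeddings["queen"])
king_pawn = cosine(toy_embeddings["king"], toy_embeddings["pawn"])
print(king_queen > king_pawn)
```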
The model in level 2 of the cascade uses the first numeric representations generated by level 1, along with an attention mechanism, to generate a respective second numeric representation for each candidate answer span. For each candidate span, the attention mechanism aligns question tokens with tokens in the document segment that contains the candidate span (e.g., the sentence, paragraph, or other group of tokens in the electronic document that contains the candidate span).
The model in level 3 receives the second numeric representations of the candidate text spans and aggregates information from all the candidate answer spans that occur many times throughout the document to determine a respective final score for each unique text span. Thus, the final score for a given unique text span measures how well the unique text span answers the question.
The system then selects a text span from the unique text spans based on the final scores. For example, the system can select the unique text span with the highest final score to answer the question.
To allow the cascaded machine learning system to score answer spans (i.e., so that the final scores generated by level 3 of the cascade can be used to identify answers to input questions), the system trains the machine learning models in the cascade on training data that includes labeled training examples.
That is, each labeled training example includes a question and electronic document pair that is labeled with data identifying the correct text span (i.e., the text span from the electronic document that best answers the question).
Level 1 of the cascade can include two models: a span+short context model and a question+span model that collectively generate the first numeric representation.
For any given text span, the span+short context model generates, as output, a span-context representation of the text span by operating on
- An initial representation of the left context of the text span in the input document
- An initial representation of the text span
- An initial representation of the right context of the text span in the input document
The initial representation of the text span is based on the pre-trained embeddings of the tokens in the text span.
The initial representation of the text span could be a bag-of-words embedding of the tokens in the text span (i.e., an average of the embeddings of the tokens in the text span).
Or the initial representation of the text span could be a concatenation of a bag-of-words embedding of the tokens in the text span and a question-word feature that indicates whether the text span includes any of the question tokens.
The question-word feature can be a binary feature (e.g., one that has the value 1 when the text span includes one or more question tokens and the value 0 when the text span does not include any of the question tokens).
The initial representation of the left context is a bag-of-words embedding of the tokens in the left context of the text span (i.e., an average of the embeddings of the K tokens to the immediate left of the text span in the input document).
The initial representation of the right context is a bag-of-words embedding of the tokens in the right context of the text span (i.e., an average of the embeddings of the K tokens to the immediate right of the text span in the input document).
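The three initial representations described above (left context, span plus question-word feature, right context) could be computed along these lines. This is a sketch, not the patent's implementation: the helper names and the tiny two-dimensional embedding table are hypothetical, and tokens missing from the table are simply skipped.

```python
def average(vectors, dim):
    """Bag-of-words embedding: element-wise mean of the vectors (zeros if empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def initial_representations(tokens, start, end, question_tokens, emb, dim, k):
    """Return (left, span, right) initial representations for tokens[start:end]."""

    def embed(toks):
        # Assumption: out-of-vocabulary tokens are ignored.
        return [emb[t] for t in toks if t in emb]

    left = average(embed(tokens[max(0, start - k):start]), dim)
    right = average(embed(tokens[end:end + k]), dim)
    span = average(embed(tokens[start:end]), dim)
    # Binary question-word feature: 1 if the span shares any token with the question.
    qwf = 1.0 if set(tokens[start:end]) & set(question_tokens) else 0.0
    return left, span + [qwf], right


# Toy embedding table invented for illustration.
emb = {"paris": [1.0, 0.0], "is": [0.0, 1.0], "in": [0.0, 1.0], "france": [1.0, 1.0]}
tokens = ["paris", "is", "in", "france"]
question = ["where", "is", "paris"]

left, span, right = initial_representations(tokens, 3, 4, question, emb, dim=2, k=2)
print(left, span, right)
```

The span "france" has no right context in this toy page, so its right-context vector falls back to zeros, and its question-word feature is 0 because "france" does not appear in the question.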
To generate the span-context representation of the text span, the model uses a feed-forward neural network to process a concatenation of
- The initial representation of the left context of the text span in the input document
- The initial representation of the text span
- The initial representation of the right context of the text span in the input document
In some implementations, the neural network is a two-layer feed-forward neural network with rectified linear unit (ReLU) activations.
In particular, the operations performed by the feed-forward neural network to generate a representation h from an input x can be expressed as:
$$h = \mathrm{ffnn}(x) = \mathrm{ReLU}\big(U\,\mathrm{ReLU}(Vx + a) + b\big)$$
where U and V are parameter matrices and a and b are parameter biases of the feed-forward network.
During training, the model is also configured to generate a score for the text span that, like the final score, measures how well the text span answers the question (this score is used as the input to a training loss term).
In particular, the model can generate the score by processing the span-context representation of the text span through a linear prediction layer that maps a vector to a single value.
In particular, the operations performed by the linear prediction layer to generate a value $\hat{y}$ from an input representation h can be expressed as:
$$\hat{y} = \mathrm{linear}(h) = w^{\top} h + z$$
where w and z are parameters of the linear prediction layer.
Using the score generated by the model for training is described in more detail below.
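As a rough illustration of the two building blocks just described, the sketch below implements a two-layer feed-forward network with ReLU activations and a linear prediction layer in plain Python. It assumes the form h = ReLU(U ReLU(Vx + a) + b) for the network and w·h + z for the prediction layer. The parameter values are arbitrary toy numbers, not trained weights.

```python
def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def ffnn(x, U, V, a, b):
    """Two-layer feed-forward network (assumed form: ReLU(U ReLU(Vx + a) + b))."""
    hidden = relu(add(matvec(V, x), a))
    return relu(add(matvec(U, hidden), b))


def linear(h, w, z):
    """Linear prediction layer mapping a vector to a single value."""
    return sum(wi * hi for wi, hi in zip(w, h)) + z


# Toy parameters chosen only so the arithmetic is easy to follow.
x = [1.0, -2.0]
V = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.5]
U = [[2.0, 0.0], [1.0, 1.0]]
b = [-1.0, 0.0]

h = ffnn(x, U, V, a, b)
score = linear(h, w=[0.5, -1.0], z=0.25)
print(h, score)
```

Every model in the cascade reuses this same pair of operations with its own parameters, which is part of what keeps the overall system lightweight.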
For any given text span, the question+span model operates on
(i) the initial representation of the text span and
(ii) an initial representation of the question to generate a question-span representation of the text span.
The model first generates a weight for each question token based on the embedding of each question token.
The model can generate a weight for a question token by first applying another feed-forward neural network, i.e., applying the ffnn operations, to the embedding of the question token to generate an initial representation of the question token and then applying another linear prediction layer to the initial representation of the question token.
The model can then generate the initial representation of the question by computing a weighted average of the embeddings of the question tokens, with the embedding for each question token being weighted by a normalized version of the computed weight.
Once the initial representation of the question has been generated, the model generates the question-span representation of the text span by applying another feed-forward neural network (i.e., applying the ffnn operations described above) to a concatenation of the initial representation of the text span and the initial representation of the question.
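The weighted average over question tokens can be sketched as follows. Two things here are assumptions rather than details from the patent: the weights are normalized with a softmax, and `score_fn` is a hypothetical stand-in for the feed-forward network plus linear prediction layer that produces each token's raw weight.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_representation(question_embs, score_fn):
    """Weighted average of question token embeddings.

    score_fn maps one token embedding to a raw (unnormalized) weight.
    """
    weights = softmax([score_fn(e) for e in question_embs])
    dim = len(question_embs[0])
    rep = [sum(w * e[i] for w, e in zip(weights, question_embs)) for i in range(dim)]
    return rep, weights


# Toy embeddings and a toy scoring function, for illustration only.
question_embs = [[2.0, 0.0], [0.0, 0.0]]
rep, weights = question_representation(question_embs, score_fn=sum)
print(rep, weights)
```

The first token receives the larger raw score, so it dominates the averaged question vector. That is the effect the learned weights are meant to have for the informative words in a question.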
While not used during inference, during training, the model is also configured to generate a score for the text span that, like the final score, measures how well the unique text span answers the question.
In particular, the model can generate the score by processing the question-span representation of the text span through another linear prediction layer.
The first numeric representation of the text span is a concatenation of the question-span representation and the span-context representation. It is provided as input to level 2 of the cascade.
Level 2 of the cascade includes a context attention model that, for a given text span, operates on the first numeric representation to generate the second numeric representation of the text span.
For a given text span, the context attention model generates
(i) for the segment that contains the text span in the input document, a question-aware segment vector based on a similarity between the question tokens in the question and the segment tokens in the segment that contains the text span and
(ii) for the question, a segment-aware question vector that is also based on the similarity between the question tokens in the question and the segment tokens in the segment that contains the text span.
To generate these two vectors, the model measures the similarity between every pair of question and segment embeddings (i.e., generates a respective similarity score between each question embedding and each segment embedding).
To generate the similarity score $\eta_{ij}$ for a given pair of question embedding $q_i$ and segment embedding $d_j$, the model performs the following operations:
$$\eta_{ij} = \mathrm{ffnn}(q_i)^{\top}\,\mathrm{ffnn}(d_j)$$
To generate the question-aware segment vector for the segment in the input document that contains the text span, the model then determines a respective attended vector for each segment token that accounts for the similarity of the segment token to the question tokens as reflected by the similarity scores and determines the question-aware segment vector from the attended vectors for the segment tokens.
To determine the segment-aware question vector for the question, the model determines a respective attended vector for each question token that measures the similarity of the question token to the segment tokens as reflected by the similarity scores and determines the segment-aware question vector from the attended vectors for the question tokens.
In particular, to generate the question-aware segment vector, each original segment embedding vector and its corresponding attended vector are concatenated and passed through another feed-forward net, and the resulting representations are summed to get the question-aware segment vector.
Similarly, each original question embedding vector and its corresponding attended vector are concatenated and passed through the same feed-forward net, and the resulting representations are summed to get the segment-aware question vector.
The model then processes a concatenation of the first numeric representation of the text span, the question-aware segment vector, the segment-aware question vector, and the question-span feature using another feed-forward neural network to generate the second numeric representation of the text span.
While not used during inference, during training, the model is also configured to generate a score for the text span that, like the final score, measures how well the unique text span answers the question.
In particular, the model can generate the score by processing the second numeric representation of the text span through another linear prediction layer.
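The attention step in level 2 can be sketched as below. Several details are assumptions made for illustration: attended vectors are computed as softmax-weighted sums of the other side's embeddings, and the `transform` and `combine` arguments are hypothetical placeholders (identity functions by default) for the feed-forward networks the patent describes.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def sum_vectors(vectors):
    return [sum(col) for col in zip(*vectors)]


def attention_vectors(q_embs, d_embs, transform=lambda v: v, combine=lambda v: v):
    """Return (question-aware segment vector, segment-aware question vector, sim)."""
    # sim[i][j]: similarity of question token i and segment token j.
    sim = [[dot(transform(q), transform(d)) for d in d_embs] for q in q_embs]
    n_q, n_d = len(q_embs), len(d_embs)
    # Attended vector per segment token: question embeddings weighted by similarity.
    seg_att = [weighted_sum(softmax([sim[i][j] for i in range(n_q)]), q_embs)
               for j in range(n_d)]
    # Attended vector per question token: segment embeddings weighted by similarity.
    q_att = [weighted_sum(softmax(sim[i]), d_embs) for i in range(n_q)]
    # Concatenate each embedding with its attended vector, transform, then sum.
    seg_vec = sum_vectors([combine(d + a) for d, a in zip(d_embs, seg_att)])
    q_vec = sum_vectors([combine(q + a) for q, a in zip(q_embs, q_att)])
    return seg_vec, q_vec, sim


q_embs = [[1.0, 0.0]]
d_embs = [[1.0, 0.0], [0.0, 1.0]]
seg_vec, q_vec, sim = attention_vectors(q_embs, d_embs)
print(sim, seg_vec, q_vec)
```

In a full system, the two output vectors would be concatenated with the first numeric representation of the span and passed through one more feed-forward network to produce the second numeric representation.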
Level 3 includes an aggregating many mentions model that receives the second numeric representations of the candidate answer spans and, based on those representations, aggregates information from all the candidate answer spans that occur many times throughout the document.
In particular, for each unique text span, the model processes the second numeric representations for each of the text spans that correspond to the unique text span using another feed-forward neural network to generate a respective transformed numeric representation for each of the text spans.
The model then determines the aggregate representation for the unique text span by summing the transformed numeric representations of the candidate text spans corresponding to the unique text span.
The model then generates the final score for the unique text span by processing the aggregate representation of the unique text span through another linear prediction layer.
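A minimal sketch of this aggregation step follows. The `transform` argument is a hypothetical placeholder for the feed-forward network applied to each mention, and the span texts, vectors, and parameters are toy values invented for the example.

```python
def aggregate_and_score(mention_reps, transform, w, z):
    """Score each unique span from the representations of all its mentions.

    mention_reps maps a unique span to a list of per-mention vectors.
    """
    scores = {}
    for span, reps in mention_reps.items():
        transformed = [transform(r) for r in reps]
        # Sum the transformed mention vectors into one aggregate representation.
        aggregate = [sum(col) for col in zip(*transformed)]
        # A linear prediction layer turns the aggregate into a single final score.
        scores[span] = sum(wi * ai for wi, ai in zip(w, aggregate)) + z
    return scores


mention_reps = {
    "paris": [[1.0, 0.5], [0.5, 0.5]],  # mentioned twice in the toy document
    "london": [[0.5, 1.0]],             # mentioned once
}
scores = aggregate_and_score(mention_reps, transform=lambda r: r, w=[1.0, -0.5], z=0.0)
best = max(scores, key=scores.get)
print(scores, best)
```

Because the mention vectors are summed, a span that appears several times in the page can accumulate evidence from each occurrence before the highest-scoring unique span is selected as the answer.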
While the architectures of the various feed-forward neural networks and the various linear prediction layers employed by the models are generally the same, each feed-forward neural network and linear prediction layer generally has different parameter values from every other neural network or prediction layer.
To determine these parameter values, the system trains the cascaded machine learning system on training data.
Training the Cascaded Neural Network System for Question Answering
For convenience, the process is described as being performed by a system of one or more computers located in one or more locations.
For example, a question answering system can perform the process.
This system can perform the process on multiple training examples to update the values of the parameters of a cascaded neural network system.
The system obtains a training example. This training example includes a training question and a training document and identifies a correct word span from the training document that best answers the question.
The system processes the training question and the training document using the cascaded neural network system to generate:
- (i) a final score for the unique word span corresponding to the correct word span
- (ii) scores from each of the level 1 and level 2 models for each mention of the correct word span in the training document
In particular, as described above, only the final score is used after training to select the best answer for the input question, but during training, each model is also configured to generate a respective score for each candidate word span in the training document.
The system determines an update to the parameters of the cascaded machine learning system by determining a gradient of a loss function with respect to the parameters.
The loss function includes terms that each depend on the scores generated by a corresponding model.
In particular, the loss function includes, for each of the level 1 and level 2 models, a respective loss term that depends on the scores that model assigns to the mentions of the correct word span in the training document and, for the level 3 model, a loss term that depends on the final score assigned to the unique word span corresponding to the correct word span.
In particular, the loss function can be the total negative log-likelihood of the correct answer spans under all the submodels.
For example, the loss function can be expressed as:
$$\ell = -\sum_{k=1}^{3} \lambda_k \log \sum_{s \in S^*} p^{(k)}(s \mid q, d) \;-\; \lambda_4 \log p^{(4)}(u \mid q, d)$$
where each $\lambda_k$ is a hyperparameter such that the $\lambda$s add to 1, $S^*$ is the set of all mentions of the correct answer span in the training document, $p^{(k)}(s \mid q, d)$ is the score assigned to a mention s in the set $S^*$ by the k-th of the three level 1 and level 2 models, and $p^{(4)}(u \mid q, d)$ is the final score assigned to the unique answer span u by the level 3 model.
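To show how such a weighted negative log-likelihood could be computed, here is a small sketch. It assumes that each earlier model's scores are probabilities and that the probabilities of all mentions of the correct span are summed before taking the logarithm. The function name `cascade_loss` and the numbers are illustrative, not taken from the patent.

```python
import math


def cascade_loss(mention_probs, final_prob, lambdas):
    """Weighted negative log-likelihood over the cascade's models.

    mention_probs[k] holds the probabilities that model k assigns to each
    mention of the correct span; final_prob is the last model's probability
    for the unique correct span; lambdas holds one weight per term.
    """
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("the lambda weights should add to 1")
    loss = 0.0
    for lam, probs in zip(lambdas[:-1], mention_probs):
        # Assumption: marginalize over all mentions of the correct span.
        loss -= lam * math.log(sum(probs))
    loss -= lambdas[-1] * math.log(final_prob)
    return loss


lambdas = [0.25, 0.25, 0.25, 0.25]
mention_probs = [[0.25, 0.25], [0.5], [0.1, 0.4]]
loss = cascade_loss(mention_probs, final_prob=0.5, lambdas=lambdas)
print(loss)
```

In this toy case every model puts half of its probability mass on the correct answer, so each term contributes the same amount and the loss equals log 2. The loss would fall to zero if every model were fully confident in the correct span.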
The system can determine the gradient with respect to each of the parameters using a machine learning training technique (e.g., backpropagation). It can then determine the update from the gradient by applying an update rule (e.g., an Adam update rule, an rmsprop update rule, or a stochastic gradient descent update rule) to the gradients.