.. ---------------------------------------------------------------------------
.. Copyright 2016-2018 Intel Corporation
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
.. ---------------------------------------------------------------------------

Aspect Based Sentiment Analysis (ABSA)
######################################

Overview
========
Aspect Based Sentiment Analysis is the task of co-extracting opinion terms and aspect terms
(opinion targets) and the relations between them in a given corpus.

Algorithm Overview
==================
Training: the training phase takes training data as input and outputs an opinion lexicon and an aspect lexicon.
The training flow consists of the following four main steps:

1. The first training step is text pre-processing, performed by Spacy_. This step includes
tokenization, part-of-speech tagging and sentence breaking (sketched after this list).

2. The second training step applies a dependency parser to the training
data. For this purpose we used the parser described in [1]_.
For more details regarding steps 1 & 2 see the :doc:`BIST <spacy_bist>` dependency parser.

3. The third step applies the bootstrap lexicon acquisition algorithm described in [2]_.
The algorithm uses a generic lexicon introduced by [3]_ as the initial step of the bootstrap process.

4. The last step applies an MLP based opinion term re-ranking and polarity estimation
algorithm. This step uses the word embedding similarities between each acquired term
and a set of generic opinion terms as features (also sketched after this list). A pre-trained re-ranking model is provided.
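
The two sketches below are illustrative only and are not the library's implementation. The first
shows the kind of spaCy pre-processing performed in step 1 (the model name ``en_core_web_sm`` is an
assumption made for the example):

.. code:: python

    # Illustrative pre-processing: tokenization, POS tagging and sentence breaking.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The food was wonderful and fresh. Staff were friendly.")

    for sent in doc.sents:                 # sentence breaking
        for token in sent:                 # tokenization
            print(token.text, token.pos_)  # part-of-speech tag

The second sketches how word-embedding similarities to a set of generic opinion terms can serve as
re-ranking features for step 4, assuming the embedding vectors are already loaded:

.. code:: python

    # Illustrative re-ranking features: cosine similarity between an acquired
    # candidate term and each term of a small generic opinion set.
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def rerank_features(candidate_vec, generic_opinion_vecs):
        return np.array([cosine(candidate_vec, v) for v in generic_opinion_vecs])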

Inference: the inference phase takes inference data as input, along with the opinion lexicon and aspect
lexicon generated by the training phase. The output of the inference phase is a list of aspect-opinion
pairs (along with their polarity and score) extracted from the inference data.
The inference approach is based on detecting syntactically related aspect-opinion pairs, as sketched below.
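
As a rough illustration of this idea (not the library's actual matching rules), an opinion term can
be paired with an aspect term found among its syntactic neighbours in the dependency parse. The toy
lexicons below stand in for the generated ones:

.. code:: python

    # Toy aspect-opinion pairing over a dependency parse (illustrative only).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    aspect_lex = {"food", "staff"}                    # stand-in aspect lexicon
    opinion_lex = {"wonderful", "fresh", "friendly"}  # stand-in opinion lexicon

    doc = nlp("The food was wonderful and fresh. Staff were friendly.")
    pairs = []
    for token in doc:
        if token.lemma_.lower() in opinion_lex:
            # look at the opinion word's head, children and siblings
            neighbours = [token.head] + list(token.children) + list(token.head.children)
            for nb in neighbours:
                if nb.lemma_.lower() in aspect_lex:
                    pairs.append((nb.text, token.text))
    print(pairs)  # e.g. [('food', 'wonderful'), ('Staff', 'friendly')]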


Flow
====
.. image:: assets/absa_flow.png

Training
========
Full code example is available at ``examples/absa/train.py``.
There are two training modes:

1. Providing training data in a raw text format. In this case the training flow will
apply the dependency parser to the data:

.. code:: bash

    python3 examples/absa/train.py --data=TRAINING_DATASET


Arguments:

``--data=TRAINING_DATASET`` - path to the input training dataset. Should point to a single raw text file with documents
separated by newlines, a single csv file containing one doc per line, or a directory containing one raw
text file per document.
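
For example, the "documents separated by newlines" layout can be produced as follows (the file name
and review texts are made up for illustration):

.. code:: python

    # Write a toy raw-text training file: one document per line.
    docs = [
        "The food was wonderful and fresh. Staff were friendly.",
        "Service was slow but the pasta was excellent.",
    ]
    with open("restaurant_reviews.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(docs))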

Optional arguments:

``--rerank-model=RERANK_MODEL`` - path to the re-rank model. By default, when running the training
for the first time, this model will be downloaded to ``~/nlp-architect/cache/absa/train/reranking_model``

Notes:

a. The generated opinion and aspect lexicons are written as csv files to
``~/nlp-architect/cache/absa/train/output/generated_opinion_lex_reranked.csv`` and to
``~/nlp-architect/cache/absa/train/output/generated_aspect_lex.csv``

b. In this mode the parsed data (jsons of ParsedDocument objects) is written to ``~/nlp-architect/cache/absa/train/parsed``

c. When running the training for the first time the system will download the
GloVe word embedding model (the user will be prompted for authorization) to
``~/nlp-architect/cache/absa/train/word_emb_unzipped`` (this may take a while)

d. For demonstration purposes we provide a sample of tripadvisor.co.uk restaurant reviews under the
`Creative Commons Attribution-Share-Alike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/>`__ (Copyright 2018 Wikimedia Foundation).
The dataset can be found at ``datasets/absa/tripadvisor_co_uk-travel_restaurant_reviews_sample_2000_train.csv``.

2. Providing parsed training data. In this case the training flow skips the parsing step:

.. code:: bash

    python3 examples/absa/train.py --parsed-data=PARSED_TRAINING_DATASET

Arguments:

``--parsed-data=PARSED_TRAINING_DATASET`` - path to the parsed format (jsons of ParsedDocument objects) of the training dataset.
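
After training completes, the generated lexicon files listed in note (a) above can be inspected with a
few lines of Python. The exact column layout of these csv files is not described here, so the sketch
below simply prints the first few raw rows of each file:

.. code:: python

    # Peek at the generated lexicons (prints raw csv rows; no column layout assumed).
    import csv
    from pathlib import Path

    out_dir = Path.home() / "nlp-architect/cache/absa/train/output"
    for name in ("generated_aspect_lex.csv", "generated_opinion_lex_reranked.csv"):
        with open(out_dir / name, newline="", encoding="utf-8") as f:
            for i, row in enumerate(csv.reader(f)):
                print(name, row)
                if i == 4:
                    break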

Inference
=========
Full code example is available at ``examples/absa/inference/inference.py``.
There are two inference modes:

1. Providing inference data in a raw text format:

.. code:: python

    inference = SentimentInference(ASPECT_LEX, OPINION_LEX)
    sentiment_doc = inference.run(doc="The food was wonderful and fresh. Staff were friendly.")

Arguments:

``ASPECT_LEX`` - path to the aspect lexicon (csv file) that was produced by the training phase.
The aspect lexicon may be manually edited to group alias aspect names (e.g. 'drinks' and 'beverages')
together: simply copy all alias names to the same line of the csv file.

``OPINION_LEX`` - path to the opinion lexicon (csv file) that was produced by the training phase.

``doc`` - the input sentence.

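Putting mode 1 together, a self-contained sketch might look like the following. The import path is an
assumption based on the nlp-architect package layout and may differ between versions, and the lexicon
paths point at the training outputs mentioned above:

.. code:: python

    # End-to-end inference sketch (import path and lexicon locations are assumptions).
    import os
    from nlp_architect.models.absa.inference.inference import SentimentInference

    ASPECT_LEX = os.path.expanduser(
        "~/nlp-architect/cache/absa/train/output/generated_aspect_lex.csv")
    OPINION_LEX = os.path.expanduser(
        "~/nlp-architect/cache/absa/train/output/generated_opinion_lex_reranked.csv")

    inference = SentimentInference(ASPECT_LEX, OPINION_LEX)
    sentiment_doc = inference.run(doc="The food was wonderful and fresh. Staff were friendly.")
    print(sentiment_doc)  # aspect-opinion pairs with their polarity and score
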
2. Providing parsed inference data (ParsedDocument format). In this case the parsing step is skipped:

.. code:: python

    inference = SentimentInference(ASPECT_LEX, OPINION_LEX, parse=False)
    doc_parsed = json.load(open('/path/to/parsed_doc.json'), object_hook=CoreNLPDoc.decoder)
    sentiment_doc = inference.run(parsed_doc=doc_parsed)


Inference - interactive mode
============================

The provided file ``examples/absa/inference/interactive.py`` enables using the generated lexicons in interactive mode:

.. code:: bash

    python3 interactive.py --aspects=ASPECT_LEX --opinions=OPINION_LEX


Arguments:

``--aspects=ASPECT_LEX`` - path to the aspect lexicon (csv file format)

``--opinions=OPINION_LEX`` - path to the opinion lexicon (csv file format)


References
==========

.. [1] `Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations <https://transacl.org/ojs/index.php/tacl/article/view/885/198>`__, Kiperwasser, E., & Goldberg, Y., Transactions of the Association for Computational Linguistics (2016), 4, 313-327.
.. [2] `Opinion Word Expansion and Target Extraction through Double Propagation <https://dl.acm.org/citation.cfm?id=1970422>`__, Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen, Computational Linguistics, volume 37(1), 2011.
.. [3] `Mining and Summarizing Customer Reviews <http://dx.doi.org/10.1145/1014052.1014073>`__, Minqing Hu and Bing Liu, Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), pp. 168-177, 2004.

.. _Spacy: https://spacy.io