Hello!
I found an AI-specific code smell in your project.
The smell is called: TensorArray Not Used
You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.
According to the paper, the smell is described as follows:
**Problem:** If the developer initializes an array with `tf.constant()` and then tries to assign new values to it inside a loop to make it grow, the code will raise an error. The error can be worked around with the low-level `tf.while_loop()` API, but coding this way is inefficient: many intermediate tensors are built in the process.

**Solution:** In TensorFlow 2, using `tf.TensorArray()` to grow an array inside a loop is the better solution for this kind of problem.

**Impact:** Efficiency, error-proneness
Example:
### TensorFlow
```diff
 import tensorflow as tf

 @tf.function
 def fibonacci(n):
     a = tf.constant(1)
     b = tf.constant(1)
-    c = tf.constant([1, 1])
+    c = tf.TensorArray(tf.int32, n)
+    c = c.write(0, a)
+    c = c.write(1, b)
     for i in range(2, n):
         a, b = b, a + b
-        c = tf.concat([c, [b]], 0)
+        c = c.write(i, b)
-    return c
+    return c.stack()
```
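Putting the `+` lines of the diff above together gives a self-contained version of the fix that you can run directly (`tf.TensorArray`, `write()`, and `stack()` are all standard TensorFlow 2 APIs):

```python
import tensorflow as tf

@tf.function
def fibonacci(n):
    # Use a TensorArray instead of growing a tf.constant with tf.concat:
    # each write is staged in place, so no chain of intermediate tensors
    # is built while tracing the loop.
    a = tf.constant(1)
    b = tf.constant(1)
    c = tf.TensorArray(tf.int32, size=n)
    c = c.write(0, a)
    c = c.write(1, b)
    for i in range(2, n):
        a, b = b, a + b
        c = c.write(i, b)  # write() returns the updated TensorArray
    return c.stack()       # stack() packs all entries into one tensor

print(fibonacci(7).numpy())  # the first 7 Fibonacci numbers: 1, 1, 2, 3, 5, 8, 13
```

Note that `write()` does not mutate the array in place; it returns a new handle, which is why each result is reassigned to `c`.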
You can find the code related to this smell in CLUE/baselines/models_pytorch/classifier_pytorch/transformers/tokenization_utils.py, lines 855 to 875 (commit 2ea9046):

```python
if add_special_tokens:
    sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
    token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
    encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
else:
    sequence = ids + pair_ids if pair else ids
    token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])

if return_tensors == 'tf' and is_tf_available():
    sequence = tf.constant([sequence])
    token_type_ids = tf.constant([token_type_ids])
elif return_tensors == 'pt' and is_torch_available():
    sequence = torch.tensor([sequence])
    token_type_ids = torch.tensor([token_type_ids])
elif return_tensors is not None:
    logger.warning("Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.".format(return_tensors))

encoded_inputs["input_ids"] = sequence
encoded_inputs["token_type_ids"] = token_type_ids

if max_length and len(encoded_inputs["input_ids"]) > max_length:
```
I also found instances of this smell in other files, such as:
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/bert/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/bert_wwm_ext/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/ernie/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/roberta_wwm_ext/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/roberta_wwm_large_ext/optimization_test.py#L26-L36 Line: 31
I hope this information is helpful!