How does equality comparison work for File and Directory values? Are equality relationships preserved outside and inside tasks? #720

@adamnovak

The table of comparison operators says you can compare File values with ==. It doesn't actually say you can compare Directory values, but nothing seems to explain why you wouldn't be able to, so that looks like an omission.

The rules about file localization say that two local File values that originate from the same "parent directory" need to be localized to a shared parent directory for the task, and that file basenames need to be preserved.

So say a workflow takes three input File values and sends them all to the same task. The task can use conditional if...then...else expressions to make the Bash code it executes depend on the WDL-level equality of the File values, as seen by the task command. The Bash code can also look at the path strings substituted in for the files and do Bash-level string comparisons on them. And the workflow can use conditional blocks or expressions to make what it executes depend on the equality relationships between the File values, as seen by the workflow.

Say I pass in these input paths to my execution engine:

{
    "wf_name.file_a": "/home/anovak/file1.txt",
    "wf_name.file_b": "/home/anovak/../anovak/file1.txt",
    "wf_name.file_c": "/home/anovak/file2.txt"
}

Or at workflow scope I say:

File file_a = "/home/anovak/file1.txt"
File file_b = "/home/anovak/../anovak/file1.txt"
File file_c = "/home/anovak/file2.txt"

Despite using different strings for the paths, file_a and file_b are the same underlying file. That file is in the same "parent directory" as file_c, as I interpret it, because multiple paths are being used to refer to one on-disk directory data structure, with one identifying device and inode.
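To make the device-and-inode reading concrete, here is a minimal Python sketch. It recreates the three example paths under a temporary directory (the real /home/anovak is not assumed to exist) and compares them both as strings and by identity. The names `same_underlying_file` and `compare_example_inputs` are hypothetical, and this only illustrates the OS-level notion of "same file". It is not a claim about how any WDL engine compares File values.

```python
import os
import tempfile


def same_underlying_file(path_a, path_b):
    """Return True when both paths resolve to the same device and inode."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)


def compare_example_inputs():
    """Rebuild the three example paths in a scratch directory and compare them."""
    with tempfile.TemporaryDirectory() as root:
        home = os.path.join(root, "anovak")  # stand-in for /home/anovak
        os.mkdir(home)
        for name in ("file1.txt", "file2.txt"):
            with open(os.path.join(home, name), "w") as handle:
                handle.write(name)

        file_a = os.path.join(home, "file1.txt")
        file_b = os.path.join(home, "..", "anovak", "file1.txt")
        file_c = os.path.join(home, "file2.txt")

        return {
            # The path strings differ even though the file is the same.
            "a_b_strings_equal": file_a == file_b,
            "a_b_same_file": same_underlying_file(file_a, file_b),
            "a_c_same_file": same_underlying_file(file_a, file_c),
            # The parents of file_b and file_c are one on-disk directory.
            "b_c_same_parent": same_underlying_file(
                os.path.dirname(file_b), os.path.dirname(file_c)
            ),
        }


results = compare_example_inputs()
```

Under these assumptions, the path strings for file_a and file_b differ while the identity check treats them as one file, which is the gap the rest of this issue is about.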

When localized for a task, then, file_a and file_c must have the same parent directory, file_b and file_c must have the same parent directory, and both file_a and file_b need to have the same basename. So file_a and file_b must be presented to the task as the same file: the engine can't download two different copies and present them both.

(There's also a rule that "Two inputs with the same basename must be located separately, to avoid name collision." But I read that as really referring to two distinct files being input, not two distinct input slots. Otherwise you could never pass the same file twice to a task if you were also passing a sibling of that file.)

So in the task, file_a == file_b should really be true, because these two variables refer to the same file. And at the Bash level, they must be substituted with the same string and be equal by a Bash string comparison, because "the absolute path to the localized file/directory is substituted" into the command, and for one file there is only one absolute path.
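The reading above can be sketched as a toy localizer in Python. Everything here is hypothetical: the `localize` function, the `/task/inputs` root, and the numbered subdirectories are made up for illustration, and lexical normalization with `posixpath.normpath` is only a stand-in for "same underlying file" (it ignores symlinks and hard links). It is a sketch of one way to satisfy the shared-parent and preserved-basename rules, not any engine's actual algorithm.

```python
import posixpath


def localize(inputs, task_root="/task/inputs"):
    """Map each input slot to a localized path.

    Files whose source paths share a (normalized) parent directory get the
    same localized parent, and every file keeps its basename.
    """
    localized_parents = {}  # normalized source parent -> localized parent
    localized = {}
    for slot, source in inputs.items():
        # Assumption: normalized path equality stands in for file identity.
        parent, basename = posixpath.split(posixpath.normpath(source))
        if parent not in localized_parents:
            index = len(localized_parents)
            localized_parents[parent] = posixpath.join(task_root, str(index))
        localized[slot] = posixpath.join(localized_parents[parent], basename)
    return localized


example = localize(
    {
        "file_a": "/home/anovak/file1.txt",
        "file_b": "/home/anovak/../anovak/file1.txt",
        "file_c": "/home/anovak/file2.txt",
    }
)
```

With these inputs, `file_a` and `file_b` come out as the same localized string and `file_c` lands beside them, so a Bash string comparison of the substituted values for `file_a` and `file_b` would succeed.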

But, is file_a == file_b true at workflow scope? Outside of a task, the files have not been localized, so nothing is constraining them to have any particular on-disk relationship, and in many implementations there might not really be any on-disk files when a comparison is made at workflow scope.

I think it is least confusing to guarantee that equality relationships between File values are always the same before and after the localization transformation. But this means that, at workflow scope, file_a, which was initialized from one string value, needs to be equal to file_b, which was initialized from a different string value. If the two File values compare equal, then at workflow scope do they coerce to equal String values? Or do they coerce to the nonequal String values used to initialize them?

Similar concerns apply for Directory, but with a Directory it's easier to get multiple paths to the same thing, and you can even do it while having the same string for the parent path, and the same basename:

{
    "wf_name.dir_a": "/home/anovak/dir1/",
    "wf_name.dir_b": "/home/anovak/dir1",
    "wf_name.dir_c": "/home/anovak/dir2/"
}
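A small Python sketch shows how the trailing-slash case behaves under purely lexical path handling. It uses `pathlib.PurePosixPath`, which drops a trailing slash but does not collapse `..`, and `posixpath.normpath`, which does both. Using these as a model of WDL path semantics is an assumption for illustration only, and the helper name `parent_and_basename` is hypothetical.

```python
import posixpath
from pathlib import PurePosixPath


def parent_and_basename(path):
    """Split a path string into (parent directory, basename) lexically."""
    pure = PurePosixPath(path)
    return str(pure.parent), pure.name


dir_a = "/home/anovak/dir1/"
dir_b = "/home/anovak/dir1"

raw_strings_equal = dir_a == dir_b
split_a = parent_and_basename(dir_a)
split_b = parent_and_basename(dir_b)

# The File example needs ".." collapsed, which PurePosixPath does not do.
file_b = "/home/anovak/../anovak/file1.txt"
pure_keeps_dotdot = str(PurePosixPath(file_b)) == file_b
normalized_file_b = posixpath.normpath(file_b)
```

Here the raw strings differ, yet both split into the parent `/home/anovak` and the basename `dir1`. Note that collapsing `..` lexically, as `normpath` does, can give the wrong answer when a path component is a symlink, so it is not a safe substitute for checking the on-disk identity.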

Labels

K-clarification (Kind): Clarifications regarding the WDL specification.
S03-pre-rfc-discussion (State): A discussion that happens before an RFC is proposed.
T-types (Topic): Issues related to the WDL type system.
