Commit 11d5cdf

Minor updates
1 parent 3bfcba6 commit 11d5cdf

3 files changed: +26 -16 lines changed
datasail/routine.py

Lines changed: 5 additions & 1 deletion
@@ -38,7 +38,7 @@ def tech2oneD(tech: str) -> tuple[str, str]:
     raise ValueError(f"Technique {tech} is not a two-dimensional technique.")


-def datasail_main(**kwargs) -> Optional[Tuple[Dict, Dict, Dict]]:
+def datasail_main(**kwargs) -> Optional[Tuple[Optional[Dict], Optional[Dict], Optional[Dict]]]:
     """
     Main routine of DataSAIL. Here the parsed input is aggregated into structures and then split and saved.

@@ -157,6 +157,10 @@ def datasail_main(**kwargs) -> Optional[Tuple[Dict, Dict, Dict]]:
                 map_[technique].append({})
             map_[technique][run].update(pre_map[one_d_tech])

+    if all(len(e_run) == 0 for e_techs in e_name_split_map.values() for e_run in e_techs) and \
+            all(len(f_run) == 0 for f_techs in f_name_split_map.values() for f_run in f_techs):
+        LOGGER.error("No assignments could be made for any technique! Please check your input data and values for cluster-numbers, delta, and epsilon.")
+        return None, None, None
     LOGGER.info("Store results")

     # infer interaction assignment from entity assignment if necessary and possible
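
The added guard can be read in isolation. The following is a minimal sketch, not the project's code: the `SplitMap` alias, the helper names `no_assignments` and `splits_or_raise`, and the example technique labels are assumptions made for illustration. It mirrors the nested `all(...)` check from the hunk above and shows one way a caller might handle the new `(None, None, None)` return value.

```python
from typing import Dict, List, Optional, Tuple

# Assumed shape of the split maps: technique -> list of runs -> {name: split}
SplitMap = Dict[str, List[Dict[str, str]]]
Result = Optional[Tuple[Optional[dict], Optional[dict], Optional[dict]]]


def no_assignments(e_map: SplitMap, f_map: SplitMap) -> bool:
    """Return True if every run of every technique is empty in both maps."""
    e_empty = all(len(run) == 0 for runs in e_map.values() for run in runs)
    f_empty = all(len(run) == 0 for runs in f_map.values() for run in runs)
    return e_empty and f_empty


def splits_or_raise(result: Result) -> Tuple[dict, dict, dict]:
    """Hypothetical caller-side guard for the (None, None, None) case."""
    if result is None or all(part is None for part in result):
        raise RuntimeError("No split found; consider relaxing epsilon or delta.")
    return result
```

Note that `all(...)` over an empty iterable is `True`, so in this sketch two maps without any techniques also count as "no assignments".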

datasail/sail.py

Lines changed: 4 additions & 9 deletions
@@ -119,14 +119,11 @@ def validate_args(**kwargs) -> Dict[str, object]:
         error("The filepath to the weights of the E-data is invalid.", 8, kwargs[KW_CLI])
     if kwargs[KW_E_STRAT] is not None and isinstance(kwargs[KW_E_STRAT], Path) and not kwargs[KW_E_STRAT].is_file():
         error("The filepath to the stratification of the E-data is invalid.", 11, kwargs[KW_CLI])
-    if kwargs[KW_E_SIM] is not None and isinstance(kwargs[KW_E_SIM], str) and kwargs[KW_E_SIM].lower() not in SIM_ALGOS:
-        kwargs[KW_E_SIM] = Path(kwargs[KW_E_SIM])
+    if kwargs[KW_E_SIM] is not None and (isinstance(kwargs[KW_E_SIM], Path) or kwargs[KW_E_SIM].lower() not in SIM_ALGOS):
         if not kwargs[KW_E_SIM].is_file():
             error(f"The similarity metric for the E-data seems to be a file-input but the filepath is invalid.",
                   9, kwargs[KW_CLI])
-    if kwargs[KW_E_DIST] is not None and isinstance(kwargs[KW_E_DIST], str) and \
-            kwargs[KW_E_DIST].lower() not in DIST_ALGOS:
-        kwargs[KW_E_DIST] = Path(kwargs[KW_E_DIST])
+    if kwargs[KW_E_DIST] is not None and (isinstance(kwargs[KW_E_DIST], Path) or kwargs[KW_E_DIST].lower() not in DIST_ALGOS):
         if not kwargs[KW_E_DIST].is_file():
             error(f"The distance metric for the E-data seems to be a file-input but the filepath is invalid.",
                   10, kwargs[KW_CLI])
@@ -142,13 +139,11 @@ def validate_args(**kwargs) -> Dict[str, object]:
         error("The filepath to the weights of the F-data is invalid.", 14, kwargs[KW_CLI])
     if kwargs[KW_E_STRAT] is not None and isinstance(kwargs[KW_E_STRAT], Path) and not kwargs[KW_E_STRAT].is_file():
         error("The filepath to the stratification of the E-data is invalid.", 20, kwargs[KW_CLI])
-    if kwargs[KW_F_SIM] is not None and isinstance(kwargs[KW_F_SIM], str) and kwargs[KW_F_SIM].lower() not in SIM_ALGOS:
-        kwargs[KW_F_SIM] = Path(kwargs[KW_F_SIM])
+    if kwargs[KW_F_SIM] is not None and (isinstance(kwargs[KW_F_SIM], Path) or kwargs[KW_F_SIM].lower() not in SIM_ALGOS):
         if not kwargs[KW_F_SIM].is_file():
             error(f"The similarity metric for the F-data seems to be a file-input but the filepath is invalid.",
                   15, kwargs[KW_CLI])
-    if kwargs[KW_F_DIST] is not None and isinstance(kwargs[KW_F_DIST], str) and \
-            kwargs[KW_F_DIST].lower() not in DIST_ALGOS:
+    if kwargs[KW_F_DIST] is not None and (isinstance(kwargs[KW_F_DIST], Path) or kwargs[KW_F_DIST].lower() not in DIST_ALGOS):
         if not kwargs[KW_F_DIST].is_file():
             error(f"The distance metric for the F-data seems to be a file-input but the filepath is invalid.",
                   16, kwargs[KW_CLI])
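
One thing a reviewer may want to check in the rewritten conditions: as displayed, a plain string that is not a known algorithm name also reaches the `.is_file()` call, and `str` has no such method. Whether such strings are converted to `Path` elsewhere cannot be seen from this diff. The sketch below shows one way to validate either input type. The helper name `resolve_metric` and the use of `ValueError` in place of the project's `error(...)` function are assumptions for illustration only.

```python
from pathlib import Path
from typing import Optional, Set, Union


def resolve_metric(value: Optional[Union[str, Path]],
                   known_algos: Set[str]) -> Optional[Union[str, Path]]:
    """Return an algorithm name unchanged, or a validated Path to a metric file."""
    if value is None:
        return None
    if isinstance(value, str):
        if value.lower() in known_algos:
            return value  # a built-in algorithm name, nothing to check on disk
        value = Path(value)  # any other string is treated as a file path
    if not value.is_file():
        # Stand-in for the error(...) call used in validate_args
        raise ValueError(f"Metric looks like a file input but {value} is not a file.")
    return value
```

With a helper like this, each of the four checks (E/F similarity and E/F distance) would reduce to a single call with the matching set of algorithm names.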

docs/faq.rst

Lines changed: 17 additions & 6 deletions
@@ -11,22 +11,22 @@ conference discussions, GitHub issues, or other occasions. If you don't find hel
 Theoretical and Conceptional Questions
 --------------------------------------

-Does training on DataSAIL splits produce better generalizing models?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+1. Does training on DataSAIL splits produce better generalizing models?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Yes, training on DataSAIL splits generally leads to better generalizing models. The DataSAIL splits are designed to reduce information leakage between splits.
 Therefore, when used for hyperparameter tuning, they help in selecting models (and their hyperparameters) that generalize better to unseen data.

-What are the limitations of DataSAIL?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+2. What are the limitations of DataSAIL?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The most time- and memory-consuming step in DataSAIL is the clustering of the data. For most datatypes, this is done by third-party programs such as FoldSeek,
 DIAMOND, or MASH. In that case, DataSAIL has no influence on the runtime and memory consumption. The user may provide their own command-line arguments to these
 programs.

 Practical Questions
 -------------------

-How can I relax the split constraints if DataSAIL fails to find a solution?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+1. How can I relax the split constraints if DataSAIL fails to find a solution?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Sometimes, DataSAIL is unable to solve the split problem and might output a message like:

 .. code-block:: shell
@@ -43,3 +43,14 @@ DataSAIL compiles your input into multiple variables and constraints that for a

 - If you are already on :code:`v1.2.0` or newer, you can set the :code:`epsilon` value to higher numbers. Default is :code:`0.05` but anything up to :code:`0.2`
   or :code:`0.3` is totally reasonable. If you use stratification, you also need to set :code:`delta` to a higher value as both values are connected in that scenario.
+
+2. DataSAIL shows a log message stating the found solution is :code:`optimal_inaccurate`. What does that mean?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This message means that the solver in DataSAIL found a solution, but the optimization did not finish and was terminated because of the timeout.
+Therefore, the solution is not guaranteed to be optimal, but it is still a valid solution that satisfies all constraints and is in most cases close to optimal.
+You can use that :code:`optimal_inaccurate` solution without problems.
+
+3. I set :code:`runs>1` but DataSAIL outputs the same splits each time. Why is that?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+When you set the :code:`runs` variable to values greater than :code:`1`, DataSAIL will shuffle the dataset between splitting rounds to run the optimization from different initializations.
+But since many datasets have a unique optimal solution, DataSAIL might find the same solution multiple times and output it multiple times.
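
The FAQ advice on raising :code:`epsilon`, combined with the new `(None, None, None)` return value in `datasail/routine.py`, suggests a simple retry pattern. The sketch below is an illustration under stated assumptions, not part of DataSAIL: `run_split` is a hypothetical user-supplied callable that wraps however DataSAIL is invoked for a given epsilon, and the escalation schedule simply follows the range the FAQ calls reasonable.

```python
from typing import Callable, Optional, Sequence, Tuple

Result = Optional[Tuple[Optional[dict], Optional[dict], Optional[dict]]]


def split_with_relaxation(
    run_split: Callable[[float], Result],
    epsilons: Sequence[float] = (0.05, 0.1, 0.2, 0.3),
) -> Tuple[float, Result]:
    """Try increasingly relaxed epsilon values until a split is returned.

    `run_split` is a hypothetical wrapper around the actual DataSAIL call;
    it is expected to return (None, None, None) or None when no split is found.
    """
    for eps in epsilons:
        result = run_split(eps)
        if result is not None and any(part is not None for part in result):
            return eps, result
    raise RuntimeError(f"No split found for any epsilon in {tuple(epsilons)}.")
```

When stratification is used, the FAQ notes that :code:`delta` is coupled to :code:`epsilon`, so a wrapper passed as `run_split` would likely need to raise both together.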
