Feat: add hyperparameter tuning #361
base: master
Conversation
This is great, thank you @LucasDedieu! I've left a few comments throughout your code.
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
classifier_dropout: 0.1
Are these parameters passed to the underlying transformer object? If so, can you add a comment to explain this, since these aren't documented in the eds.transformer object.
Yes, they are passed to the underlying transformer.
Their meaning is specified at the beginning of section 2.3. Do you think I should add additional details?
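For illustration, here is a minimal sketch of how such dropout values are typically forwarded as configuration overrides to a Hugging Face transformer. This uses the transformers library directly with an example model name, not the eds.transformer internals:

```python
from transformers import AutoConfig, AutoModel

# Illustration only: dropout values from the tuning config are assumed to be
# forwarded as overrides of the underlying transformer's configuration.
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    classifier_dropout=0.1,
)
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
```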
Coverage Report
Files without new missing coverage
273 files skipped due to complete coverage. Coverage failure: total of 97.93% is less than 98.06% ❌
… dummy trial to compute time, now using the first tuning trial
…ost. Now, at the end of the study, we check if there is GPU time left and add trials if possible, based on the EMA of past trials.
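Roughly, the idea described in this commit could look like the sketch below (names and values are hypothetical, not the actual edsnlp/tune.py code): keep an exponential moving average of past trial durations and only launch another trial if it is expected to fit in the remaining GPU-time budget.

```python
def should_add_trial(trial_durations, remaining_gpu_seconds, alpha=0.3):
    """Hypothetical helper: estimate the next trial's duration with an EMA
    of past trial durations and check that it fits in the remaining GPU time."""
    if not trial_durations:
        return remaining_gpu_seconds > 0
    ema = trial_durations[0]
    for duration in trial_durations[1:]:
        ema = alpha * duration + (1 - alpha) * ema
    return ema <= remaining_gpu_seconds


# With past trials of 900 s, 1000 s and 1100 s, the EMA is ~980 s,
# so one more trial fits into a remaining budget of 1800 s.
print(should_add_trial([900, 1000, 1100], remaining_gpu_seconds=1800))
```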
Description
The goal of this PR is to introduce a hyperparameter tuning script to EDS-NLP. This new feature enables users to optimize their model's hyperparameters by specifying either the available GPU hours or the desired number of trials. By doing so, users can efficiently find the optimal hyperparameters for training their models, leading to improved performance and efficiency.
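For context, these two budgeting modes map naturally onto Optuna's n_trials and timeout arguments. The sketch below is a self-contained illustration with a made-up objective, not the actual edsnlp/tune.py implementation:

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space; the real script tunes the model's training config.
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.3)
    # A real objective would train a model and return a validation metric.
    return (lr * 1000 - 0.1) ** 2 + dropout


study = optuna.create_study(direction="minimize")
# Either run a fixed number of trials...
study.optimize(objective, n_trials=20)
# ...or stop after a wall-clock budget, e.g. 2 GPU hours:
# study.optimize(objective, timeout=2 * 3600)
print(study.best_params)
```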
Changes
- edsnlp/tune.py: Implemented the tuning functionality.
- tests/tuning/: Added unit tests test_tuning.py and test_update_config.py for the tuning functionality.
- docs/tutorials/tuning.md: Created a new tutorial for hyperparameter tuning.
- docs/tutorials/index.md: Added a link to the new tuning tutorial.
- mkdocs.yml: Updated the navigation to include the new tuning tutorial.
- pyproject.toml: Updated dependencies to include Optuna.
Checklist