Releases: Lightning-AI/torchmetrics
Minor patch release
[0.10.1] - 2022-10-21
Fixed
- Fixed broken clone method for classification metrics (#1250)
- Fixed unintentional downloading of nltk.punkt when lsum not in rouge_keys (#1258)
- Fixed type casting in MAP metric between bool and float32 (#1150)
Contributors
@dreaquil, @SkafteNicki, @stancld
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Large changes to classifications
TorchMetrics v0.10 is now out, significantly changing the whole classification package. This blog post will go over the reasons why the classification package needs to be refactored, what it means for our end users, and finally, what benefits it gives. A guide on how to upgrade your code to the recent changes can be found near the bottom.
Why the classification metrics need to change
We have known for a long time that there were some underlying problems with how we initially structured the classification package. Essentially, classification tasks can be divided into either binary, multiclass, or multilabel, and determining what task a user is trying to run a given metric on is hard just based on the input. A package such as sklearn can do this only because it supports input in very specific formats (no multi-dimensional arrays and no support for both integer and probability/logit formats).
This meant that some metrics, especially for binary tasks, could have been calculating something different than expected if the user provided input in a shape other than the expected one. This goes against the core value of TorchMetrics: our users should, of course, be able to trust that the metric they are evaluating gives the expected result.
Additionally, classification metrics were missing consistency. For some metrics, num_classes=2 meant binary, and for others, num_classes=1 meant binary. You can read more about the underlying reasons for this refactor in this and this issue.
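To make the ambiguity concrete, here is a minimal pure-Python sketch (not TorchMetrics code; both helper functions are hypothetical names written for this illustration). The same integer predictions give a different precision depending on whether they are read as a binary task or as a two-class multiclass task with macro averaging:

```python
from typing import Sequence


def binary_precision(preds: Sequence[int], target: Sequence[int]) -> float:
    """Precision with class 1 treated as the only positive class."""
    tp = sum(1 for p, t in zip(preds, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, target) if p == 1 and t == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0


def multiclass_macro_precision(
    preds: Sequence[int], target: Sequence[int], num_classes: int
) -> float:
    """Per-class precision, averaged with equal weight for every class."""
    per_class = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(preds, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(preds, target) if p == c and t != c)
        per_class.append(tp / (tp + fp) if (tp + fp) else 0.0)
    return sum(per_class) / num_classes


# Identical input, two plausible interpretations of the task.
preds = [1, 1, 0, 0]
target = [1, 0, 0, 0]

as_binary = binary_precision(preds, target)                   # 0.5
as_multiclass = multiclass_macro_precision(preds, target, 2)  # 0.75
```

Nothing in the input itself says which of the two numbers the user wanted, which is why guessing the task from the data is fragile.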
The solution
The solution we went with was to split every classification metric into three separate metrics with the prefix binary_*, multiclass_* and multilabel_*. This solves a number of the above problems out of the box because it becomes easier for us to match our users' expectations for any given input shape. It additionally has some other benefits both for us as developers and for end users:
- Maintainability: by splitting the code into three distinctive functions, we are (hopefully) lowering the code complexity, making the codebase easier to maintain in the long term.
- Speed: by completely removing the auto-detection of task at runtime, we can significantly increase computational speed (more on this later).
- Task-specific arguments: by splitting into three functions, we also make it clearer which input arguments affect the computed result. Take Accuracy as an example: num_classes, top_k, and average are all arguments that have an influence if you are doing multiclass classification but do nothing for binary classification, and vice versa for the thresholds argument. The task-specific versions only contain the arguments that influence the given task.
There are many smaller quality-of-life improvements hidden throughout the refactor; here are our top 3:
Standardized arguments
The input arguments for the classification package are now much more standardized. Here are a few examples:
- Each metric now only supports arguments that influence the final result. This means that num_classes is removed from all binary_* metrics, is now required for all multiclass_* metrics, and is renamed to num_labels for all multilabel_* metrics.
- The ignore_index argument is now supported by ALL classification metrics and supports any value, not only values in the [0, num_classes] range (similar to torch loss functions).
- We added a new validate_args argument to all classification metrics to allow users to skip validation of inputs, making the computations faster. By default, we will still do input validation because it is the safest option for the user. Still, if you are confident that the input to the metric is correct, you can now disable this check for a potential speed-up (more on this later).
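The behaviour of ignore_index and validate_args can be sketched in plain Python. This is an illustration of the idea only, not the TorchMetrics implementation: the function name is hypothetical, it works on lists instead of tensors, and it returns the plain fraction of correct predictions rather than whatever averaging the library uses by default.

```python
from typing import Optional, Sequence


def multiclass_accuracy_sketch(
    preds: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: Optional[int] = None,
    validate_args: bool = True,
) -> float:
    """Fraction of correct predictions, skipping targets equal to ignore_index."""
    if validate_args:
        # Input checks cost time on every call; they can be switched off.
        if len(preds) != len(target):
            raise ValueError("preds and target must have the same length")
        allowed = set(range(num_classes))
        if any(p not in allowed for p in preds):
            raise ValueError("preds contains a label outside the class range")
        # ignore_index may be any value, including one outside the class range.
        if any(t not in allowed and t != ignore_index for t in target):
            raise ValueError("target contains an unexpected label")

    kept = [(p, t) for p, t in zip(preds, target) if t != ignore_index]
    if not kept:
        return 0.0
    return sum(1 for p, t in kept if p == t) / len(kept)


preds = [0, 1, 2, 1]
target = [0, 1, -1, 2]  # -1 marks a sample that should not be scored

acc = multiclass_accuracy_sketch(preds, target, num_classes=3, ignore_index=-1)
# Three samples are scored and two of them are correct.
```

Without ignore_index=-1 the same call raises a ValueError during validation, and with validate_args=False it silently scores the -1 target as a wrong prediction, which is why validation stays on by default.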
Constant memory implementations
Some of the most useful metrics for evaluating classification problems are metrics such as ROC, AUROC, AveragePrecision, etc., because they not only evaluate your model for a single threshold but a whole range of thresholds, essentially giving you the ability to see the trade-off between Type I and Type II errors. However, a big problem with the standard formulation of these metrics (which we have been using) is that they require access to all data for their calculation. Our implementation has been extremely memory-intensive for these kinds of metrics.
In v0.10 of TorchMetrics, all these metrics now have an argument called thresholds. By default, it is None and the metric will still save all targets and predictions in memory as you are used to. However, if this argument is instead set to a tensor, for example torch.linspace(0,1,100), the metric will instead use a constant-memory approximation by evaluating the metric under those provided thresholds.
Setting thresholds=None has an approximate memory footprint of O(num_samples), whereas using thresholds=torch.linspace(0,1,100) has an approximate memory footprint of O(num_thresholds). In this particular case, users will save memory when the metric is computed on more than 100 samples. Compared with modern machine learning, where evaluation is often done on thousands to millions of data points, this feature can save a lot of memory.
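The constant-memory idea can be sketched as follows. This is a simplified pure-Python illustration under our own assumptions, not the library code: the class name is hypothetical, and returning a precision of 1.0 when nothing is predicted positive is a convention chosen for this sketch. The point is that the state is a fixed number of counters per threshold, so its size does not grow with the number of samples seen.

```python
from typing import List, Sequence, Tuple


class BinnedBinaryCounts:
    """Keeps TP/FP/FN counters per threshold instead of storing every sample."""

    def __init__(self, thresholds: Sequence[float]) -> None:
        self.thresholds = list(thresholds)
        n = len(self.thresholds)
        self.tp = [0] * n
        self.fp = [0] * n
        self.fn = [0] * n

    def update(self, probs: Sequence[float], target: Sequence[int]) -> None:
        # Each batch only increments counters; the batch itself is discarded.
        for p, t in zip(probs, target):
            for i, thr in enumerate(self.thresholds):
                predicted_positive = p >= thr
                if predicted_positive and t == 1:
                    self.tp[i] += 1
                elif predicted_positive and t == 0:
                    self.fp[i] += 1
                elif not predicted_positive and t == 1:
                    self.fn[i] += 1

    def precision_recall(self) -> List[Tuple[float, float]]:
        curve = []
        for tp, fp, fn in zip(self.tp, self.fp, self.fn):
            precision = tp / (tp + fp) if (tp + fp) else 1.0  # sketch convention
            recall = tp / (tp + fn) if (tp + fn) else 0.0
            curve.append((precision, recall))
        return curve


metric = BinnedBinaryCounts(thresholds=[0.0, 0.5, 1.0])
metric.update(probs=[0.1, 0.8], target=[0, 1])  # first batch
metric.update(probs=[0.6, 0.4], target=[0, 1])  # second batch
curve = metric.precision_recall()
# curve == [(0.5, 1.0), (0.5, 0.5), (1.0, 0.0)]
```

The result is an approximation because the curve is only evaluated at the chosen thresholds, so a finer grid trades a little memory for a closer match to the exact curve.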
This also means that the Binned* metrics that currently exist in TorchMetrics are being deprecated as their functionality is now captured by this argument.
All metrics are faster (ish)
By splitting each metric into 3 separate metrics, we reduce the number of calculations needed. We, therefore, expected out-of-the-box that our new implementations would be faster. The table below shows the timings of different metrics with the old and new implementations (with and without input validation). Numbers in parentheses denote speed-up over old implementations.
The following observations can be made:
- Some metrics are a bit faster (1.3x), and others are much faster (4.6x) after the refactor!
- Disabling input validation can speed up things. For example, multiclass_confusion_matrix goes from a speedup of 3.36x to 4.81x when input validation is disabled. A clear advantage for users that are familiar with the metrics and do not need validation of their input at every update.
- If we compare binary with multiclass, the biggest speedup can be seen for multiclass problems.
- Every metric is faster except for the precision-recall curve, and that holds even for the new approximate binning method. This is a bit strange, as the non-approximation should be equally fast (it's the same code). We are actively looking into this.
[0.10.0] - 2022-10-04
Added
- Added a new NLP metric InfoLM (#915)
- Added Perplexity metric (#922)
- Added ConcordanceCorrCoef metric to regression package (#1201)
- Added argument normalize to LPIPS metric (#1216)
- Added support for multiprocessing of batches in PESQ metric (#1227)
- Added support for multioutput in PearsonCorrCoef and SpearmanCorrCoef (#1200)
Changed
- Classification refactor (#1054, #1143, #1145, #1151, #1159, #1163, #1167, #1175, #1189, #1197, #1215, #1195)
- Changed update in FID metric to be done in an online fashion to save memory (#1199)
- Improved performance of retrieval metrics (#1242)
- Changed SSIM and MSSSIM update to be online to reduce memory usage (#1231)
Fixed
- Fixed a bug in ssim when return_full_image=True where the score was still reduced (#1204)
- Fixed MPS support for:
- Fixed bug in ClasswiseWrapper such that compute gave wrong result (#1225)
- Fixed synchronization of empty list states (#1219)
Contributors
@Borda, @bryant1410, @geoffrey-g-delhomme, @justusschock, @lucadiliello, @nicolas-dufour, @Queuecumber, @SkafteNicki, @stancld
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
[0.9.3] - 2022-08-22
Added
- Added global option sync_on_compute to disable automatic synchronization when compute is called (#1107)
Fixed
- Fixed missing reset in ClasswiseWrapper (#1129)
- Fixed JaccardIndex multi-label compute (#1125)
- Fixed SSIM device propagation when gaussian_kernel is False, and added a test (#1149)
Contributors
@KeVoyer1, @krshrimali, @SkafteNicki
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
[0.9.2] - 2022-06-29
Fixed
- Fixed mAP calculation for areas with 0 predictions (#1080)
- Fixed bug where avg precision state and auroc state were not merged when using MetricCollections (#1086)
- Skip box conversion if no boxes are present in MeanAveragePrecision (#1097)
- Fixed inconsistency in docs and code when setting average="none" in AveragePrecision metric (#1116)
Contributors
@23pointsNorth, @kouyk, @SkafteNicki
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor PL compatibility patch
[0.9.1] - 2022-06-08
Added
- Added specific RuntimeError when metric object is on the wrong device (#1056)
- Added an option to specify own n-gram weights for BLEUScore and SacreBLEUScore instead of using uniform weights only (#1075)
Fixed
- Fixed aggregation metrics when input only contains zero (#1070)
- Fixed TypeError when providing superclass arguments as kwargs (#1069)
- Fixed bug related to state reference in metric collection when using compute groups (#1076)
Contributors
@jlcsilva, @SkafteNicki, @stancld
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Faster forward
Highlights
TorchMetrics v0.9 is now out, and it brings significant changes to how the forward method works. This blog post goes over these improvements and how they affect both users of TorchMetrics and users that implement custom metrics. TorchMetrics v0.9 also includes several new metrics and bug fixes.
Blog: TorchMetrics v0.9 — Faster forward
The Story of the Forward Method
Since the beginning of TorchMetrics, forward has served the dual purpose of calculating the metric on the current batch and accumulating in a global state. Internally, this was achieved by calling update twice, once for each purpose, which meant repeating the same computation. However, for many metrics, calling update twice is unnecessary to obtain both the local batch statistics and the global accumulation, because the global statistics are simple reductions of the local batch states.
In v0.9, we have finally implemented a logic that can take advantage of this and will only call update once before making a simple reduction. As you can see in the figure below, this can lead to a single call of forward being 2x faster in v0.9 compared to v0.8 of the same metric.
With the improvements to forward, many metrics have become significantly faster (up to 2x)
It should be noted that this change mainly benefits metrics (for example, the confusion matrix) where calling update is expensive.
We went through all existing metrics in TorchMetrics and enabled this feature for all appropriate metrics, which was almost 95% of all metrics. We want to stress that if you are using metrics from TorchMetrics, nothing has changed to the API, and no code changes are necessary.
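The difference between the two forward strategies can be sketched with a toy running-mean metric. This is a simplified pure-Python illustration and not the TorchMetrics internals: the class and method names are hypothetical, and the exact order of operations inside the library may differ. The state (total, count) reduces by summation, so the batch state can be merged into the global state without a second update call.

```python
from typing import Sequence


class MeanSketch:
    """Toy running mean whose state (total, count) is reduced by summation."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0
        self.update_calls = 0  # bookkeeping for the comparison below

    def update(self, values: Sequence[float]) -> None:
        self.update_calls += 1
        self.total += sum(values)
        self.count += len(values)

    def compute(self) -> float:
        return self.total / self.count

    def forward_two_updates(self, values: Sequence[float]) -> float:
        """Older style: one update for the global state, one for the batch."""
        self.update(values)                    # accumulate globally
        saved = (self.total, self.count)
        self.total, self.count = 0.0, 0
        self.update(values)                    # same work again, batch only
        batch_value = self.compute()
        self.total, self.count = saved         # restore the global state
        return batch_value

    def forward_one_update(self, values: Sequence[float]) -> float:
        """Newer style: one update on a fresh state, then a cheap reduction."""
        saved = (self.total, self.count)
        self.total, self.count = 0.0, 0
        self.update(values)                    # the only update call
        batch_value = self.compute()
        self.total += saved[0]                 # merge batch state into global
        self.count += saved[1]
        return batch_value


old, new = MeanSketch(), MeanSketch()
batches = [[1.0, 2.0, 3.0], [5.0]]
old_batch_values = [old.forward_two_updates(b) for b in batches]
new_batch_values = [new.forward_one_update(b) for b in batches]
# Both give batch values [2.0, 5.0] and a global mean of 2.75,
# but old.update_calls is 4 while new.update_calls is 2.
```

The saving only matters when update is the expensive part, and it only applies when the global state really is a simple reduction of per-batch states.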
[0.9.0] - 2022-05-31
Added
- Added RetrievalPrecisionRecallCurve and RetrievalRecallAtFixedPrecision to retrieval package (#951)
- Added class property full_state_update that determines whether forward should call update once or twice (#984, #1033)
- Added support for nested metric collections (#1003)
- Added Dice to classification package (#1021)
- Added support to segmentation type segm as IOU for mean average precision (#822)
Changed
- Renamed reduction argument to average in Jaccard score and added additional options (#874)
Removed
- Removed deprecated compute_on_step argument (#962, #967, #979, #990, #991, #993, #1005, #1004, #1007)
Fixed
- Fixed non-empty state dict for a few metrics (#1012)
- Fixed bug when comparing states while finding compute groups (#1022)
- Fixed torch.double support in stat score metrics (#1023)
- Fixed FID calculation for non-equal size real and fake input (#1028)
- Fixed case where KLDivergence could output Nan (#1030)
- Fixed deterministic for PyTorch<1.8 (#1035)
- Fixed default value for mdmc_average in Accuracy (#1036)
- Fixed missing copy of property when using compute groups in MetricCollection (#1052)
Contributors
@Borda, @burglarhobbit, @charlielito, @gianscarpe, @MrShevan, @phaseolud, @razmikmelikbekyan, @SkafteNicki, @tanmoyio, @vumichien
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
[0.8.2] - 2022-05-06
Fixed
- Fixed multi-device aggregation in PearsonCorrCoef (#998)
- Fixed MAP metric when using a custom list of thresholds (#995)
- Fixed compatibility between compute groups in MetricCollection and prefix/postfix arg (#1007)
- Fixed compatibility with future PyTorch 1.12 in safe_matmul (#1011, #1014)
Contributors
@ben-davidson-6, @Borda, @SkafteNicki, @tanmoyio
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
[0.8.1] - 2022-04-27
Changed
- Reimplemented the signal_distortion_ratio metric, which removed the absolute requirement of fast-bss-eval (#964)
Fixed
- Fixed "Sort currently does not support bool dtype on CUDA" error in MAP for empty preds (#983)
- Fixed BinnedPrecisionRecallCurve when thresholds argument is not provided (#968)
- Fixed CalibrationError to work on logit input (#985)
Contributors
@DuYicong515, @krshrimali, @quancs, @SkafteNicki
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Faster collection and more metrics!
We are excited to announce that TorchMetrics v0.8 is now available. The release includes several new metrics in the classification and image domains and some performance improvements for those working with metrics collections.
Metric collections just got faster
Common wisdom dictates that you should never evaluate the performance of your models using only a single metric but instead a collection of metrics. For example, it is common to simultaneously evaluate the accuracy, precision, recall, and f1 score in classification. In TorchMetrics, we have for a long time provided the MetricCollection object for chaining such metrics together for an easy interface to calculate them all at once. However, in many cases, the metrics in such a collection share some underlying computations, which until now were repeated for every metric in the collection. In TorchMetrics v0.8 we have introduced the concept of compute_groups to MetricCollection: by default, metrics that share the same underlying computations are auto-detected and grouped together.
Thus, if you are using MetricCollections in your code, upgrading to TorchMetrics v0.8 should automatically make your code run faster without any code changes.
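The idea behind compute groups can be sketched in plain Python. This is an illustration under simplifying assumptions, not the MetricCollection implementation: all names are hypothetical, and the grouping is hard-coded through a flag, whereas the library detects it automatically. Precision and recall both derive from the same TP/FP/FN counts, so one shared state only needs to be updated once per batch.

```python
from typing import Callable, Dict, Sequence


class StatScoresState:
    """Counts that binary precision and recall both depend on."""

    def __init__(self) -> None:
        self.tp = 0
        self.fp = 0
        self.fn = 0
        self.update_calls = 0  # bookkeeping for the comparison below

    def update(self, preds: Sequence[int], target: Sequence[int]) -> None:
        self.update_calls += 1
        for p, t in zip(preds, target):
            if p == 1 and t == 1:
                self.tp += 1
            elif p == 1 and t == 0:
                self.fp += 1
            elif p == 0 and t == 1:
                self.fn += 1


def precision(state: StatScoresState) -> float:
    return state.tp / (state.tp + state.fp)


def recall(state: StatScoresState) -> float:
    return state.tp / (state.tp + state.fn)


class CollectionSketch:
    def __init__(
        self,
        metrics: Dict[str, Callable[[StatScoresState], float]],
        use_compute_groups: bool,
    ) -> None:
        self.metrics = metrics
        if use_compute_groups:
            shared = StatScoresState()  # one state for the whole group
            self.states = {name: shared for name in metrics}
        else:
            self.states = {name: StatScoresState() for name in metrics}

    def _distinct_states(self):
        return list({id(s): s for s in self.states.values()}.values())

    def update(self, preds: Sequence[int], target: Sequence[int]) -> None:
        # Every distinct state object is updated exactly once per batch.
        for state in self._distinct_states():
            state.update(preds, target)

    def compute(self) -> Dict[str, float]:
        return {name: fn(self.states[name]) for name, fn in self.metrics.items()}

    def total_update_calls(self) -> int:
        return sum(s.update_calls for s in self._distinct_states())


metrics = {"precision": precision, "recall": recall}
grouped = CollectionSketch(metrics, use_compute_groups=True)
ungrouped = CollectionSketch(metrics, use_compute_groups=False)

preds, target = [1, 1, 0, 1], [1, 0, 1, 1]
grouped.update(preds, target)
ungrouped.update(preds, target)
# Same results from both collections, with one update call instead of two.
```

The more metrics share a state, the larger the saving, since the expensive counting step is done once per group rather than once per metric.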
Many exciting new metrics
TorchMetrics v0.8 includes several new metrics within the classification and image domain, both for the functional and modular API. We refer to the documentation for the full description of all metrics if you want to learn more about them.
- SpectralAngleMapper or SAM was added to the image package. This metric can calculate the spectral similarity between given reference spectra and estimated spectra.
- CoverageError was added to the classification package. This metric can be used when you are working with multi-label data. The metric works similarly to the sklearn counterpart and computes how far you need to go through ranked scores such that all true labels are covered.
- LabelRankingAveragePrecision and LabelRankingLoss were added to the classification package. Both metrics are used in multi-label ranking problems, where the goal is to give a better rank to the labels associated with each sample. Each metric gives a measure of how well your model is doing this.
- ErrorRelativeGlobalDimensionlessSynthesis or ERGAS was added to the image package. This metric can be used to calculate the accuracy of Pan sharpened images considering the normalized average error of each band of the resulting image.
- UniversalImageQualityIndex was added to the image package. This metric can assess the difference between two images, which considers three different factors when computed: loss of correlation, luminance distortion, and contrast distortion.
- ClasswiseWrapper was added to the wrapper package. This wrapper can be used in combination with metrics that return multiple values (such as classification metrics with the average=None argument). The wrapper will unwrap the result into a dict with a label for each value.
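As an illustration of the coverage error described in the list above, here is a minimal pure-Python sketch. It follows the usual definition (for each sample, count how many labels score at least as high as the lowest-scoring true label, then average over samples); the function name is hypothetical, and returning a depth of 0 for samples without any true label is an assumption of this sketch.

```python
from typing import Sequence


def coverage_error_sketch(
    scores: Sequence[Sequence[float]], target: Sequence[Sequence[int]]
) -> float:
    """Average depth in the ranked scores needed to cover all true labels."""
    depths = []
    for sample_scores, sample_target in zip(scores, target):
        true_scores = [s for s, t in zip(sample_scores, sample_target) if t == 1]
        if not true_scores:
            depths.append(0)  # assumption: nothing to cover for this sample
            continue
        lowest_true = min(true_scores)
        # Every label scored at least as high as the worst true label
        # must be walked through before all true labels are covered.
        depths.append(sum(1 for s in sample_scores if s >= lowest_true))
    return sum(depths) / len(depths)


scores = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
target = [[1, 0, 0], [0, 0, 1]]
result = coverage_error_sketch(scores, target)
# Sample 1 needs 2 labels, sample 2 needs all 3, so the average is 2.5.
```

A perfect ranker scores every true label above every false one, in which case the result equals the average number of true labels per sample.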
[0.8.0] - 2022-04-14
Added
- Added WeightedMeanAbsolutePercentageError to regression package (#948)
- Added new classification metrics:
- Added new image metric:
- Added support for MetricCollection in MetricTracker (#718)
- Added support for 3D image and uniform kernel in StructuralSimilarityIndexMeasure (#818)
- Added smart update of MetricCollection (#709)
- Added ClasswiseWrapper for better logging of classification metrics with multiple output values (#832)
- Added **kwargs argument for passing additional arguments to base class (#833)
- Added negative ignore_index for the Accuracy metric (#362)
- Added adaptive_k for the RetrievalPrecision metric (#910)
- Added reset_real_features argument to image quality assessment metrics (#722)
- Added new keyword argument compute_on_cpu to all metrics (#867)
Changed
- Made num_classes in jaccard_index a required argument (#853, #914)
- Added normalizer, tokenizer to ROUGE metric (#838)
- Improved shape checking of permutation_invariant_training (#864)
- Allowed reduction None (#891)
- MetricTracker.best_metric will now give a warning when computing on metrics that do not have a best (#913)
Deprecated
- Deprecated argument compute_on_step (#792)
- Deprecated passing in dist_sync_on_step, process_group, dist_sync_fn as direct arguments (#833)
Removed
- Removed support for versions of Lightning lower than v1.5 (#788)
- Removed deprecated functions, and warnings in Text (#773)
  - WER and functional.wer
- Removed deprecated functions and warnings in Image (#796)
  - SSIM and functional.ssim
  - PSNR and functional.psnr
- Removed deprecated functions, and warnings in classification and regression (#806)
  - FBeta and functional.fbeta
  - F1 and functional.f1
  - Hinge and functional.hinge
  - IoU and functional.iou
  - MatthewsCorrcoef
  - PearsonCorrcoef
  - SpearmanCorrcoef
- Removed deprecated functions, and warnings in detection and pairwise (#804)
  - MAP and functional.pairwise.manhatten
- Removed deprecated functions, and warnings in Audio (#805)
  - PESQ and functional.audio.pesq
  - PIT and functional.audio.pit
  - SDR and functional.audio.sdr and functional.audio.si_sdr
  - SNR and functional.audio.snr and functional.audio.si_snr
  - STOI and functional.audio.stoi
Fixed
- Fixed device mismatch for MAP metric in specific cases (#950)
- Improved testing speed (#820)
- Fixed compatibility of ClasswiseWrapper with the prefix argument of MetricCollection (#843)
- Fixed BestScore on GPU (#912)
- Fixed Lsum computation for ROUGEScore (#944)
Contributors
@ankitaS11, @ashutoshml, @Borda, @hookSSi, @justusschock, @lucadiliello, @quancs, @rusty1s, @SkafteNicki, @stancld, @vumichien, @weningerleon, @yassersouri
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
[0.7.3] - 2022-03-22
Fixed
- Fixed unsafe log operation in TweedieDevianceScore for power=1 (#847)
- Fixed bug in MAP metric related to either no ground truth or no predictions (#884)
- Fixed ConfusionMatrix, AUROC and AveragePrecision on GPU when running in deterministic mode (#900)
- Fixed NaN or Inf results returned by signal_distortion_ratio (#899)
- Fixed memory leak when using update method with tensor where requires_grad=True (#902)
Contributors
@mtailanian, @quancs, @SkafteNicki
If we forgot someone due to not matching commit email with GitHub account, let us know :]