Releases: Lightning-AI/torchmetrics
JOSS paper
[0.7.2] - 2022-02-10
Fixed
- Minor patches in JOSS paper.
Improve mAP performance
[0.7.1] - 2022-02-03
Changed
- Used `torch.bucketize` in calibration error when `torch>1.8` for faster computations (#769)
- Improve mAP performance (#742)
Fixed
- Fixed check for available modules (#772)
- Fixed Matthews correlation coefficient when the denominator is 0 (#781)
Contributors
@Borda, @ramonemiliani93, @SkafteNicki, @twsl
If we forgot someone due to not matching commit email with GitHub account, let us know :]
New NLP metrics and improved API
We are excited to announce that TorchMetrics v0.7 is now publicly available. This release is pretty significant: it includes several new metrics (mainly for NLP), naming and import changes, general improvements to the API, and some other great features. TorchMetrics now has over 60 metrics, and the package is more user-friendly than ever.
NLP metrics - Text package
The text package has been a part of TorchMetrics since v0.5. With the growing capability of language generation models, there is also a real need for reliable evaluation metrics. With several added metrics and a unified API, TorchMetrics makes the usage of various metrics even easier! TorchMetrics v0.7 newly includes a couple of machine translation metrics such as chrF, chrF++, Translation Edit Rate, and Extended Edit Distance. Furthermore, it also supports other metrics - Match Error Rate, Word Information Lost, Word Information Preserved, and SQuAD evaluation metrics. Last but not least, we also made it possible to evaluate the ROUGE score using multiple references.
Argument unification
Importantly, all text metrics assume the preds, target input order with these explicit keyword arguments. If different naming was used before v0.7, it is deprecated in v0.7 and will be completely removed in v0.8.
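As a minimal sketch of the unified call signature (assuming TorchMetrics v0.7+ and the functional text metrics; the example strings are purely illustrative):

```python
from torchmetrics.functional import char_error_rate, word_error_rate

preds = ["hello world"]             # model predictions
target = ["hello beautiful world"]  # references

# All text metrics take the predictions first and the references second.
cer = char_error_rate(preds=preds, target=target)
wer = word_error_rate(preds=preds, target=target)
print(cer, wer)
```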
Import and naming changes
TorchMetrics v0.7 brings both major and minor changes to how metrics should be imported. The import changes directly impact v0.7, meaning that you will most likely need to change the import statement for some specific metrics. All naming changes follow our standard deprecation process, meaning that in v0.7, any metric that is renamed will still work but raise a deprecation warning asking you to use the new metric name. From v0.8, the old metric names will no longer be available.
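For example, the F-score renames listed under Deprecated below translate into the following import changes (a sketch, assuming v0.7 or newer):

```python
# Before v0.7 (deprecated in v0.7, removed in v0.8):
# from torchmetrics import F1, FBeta
# from torchmetrics.functional import f1, fbeta

# From v0.7 onwards:
from torchmetrics import F1Score, FBetaScore
from torchmetrics.functional import f1_score, fbeta_score
```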
[0.7.0] - 2022-01-17
Added
- Added NLP metrics:
- Added `MultiScaleSSIM` into image metrics (#679)
- Added Signal to Distortion Ratio (`SDR`) to audio package (#565)
- Added `MinMaxMetric` to wrappers (#556)
- Added `ignore_index` to retrieval metrics (#676)
- Added support for multi references in `ROUGEScore` (#680)
- Added a default VSCode devcontainer configuration (#621)
Changed
- Scalar metrics will now consistently have additional dimensions squeezed (#622)
- Metrics having third party dependencies removed from global import (#463)
- Untokenized input for `BLEUScore` now stays consistent with all the other text metrics (#640)
- Arguments reordered for `TER`, `BLEUScore`, `SacreBLEUScore`, `CHRFScore`; the expected input order is now predictions first and target second (#696)
- Changed dtype of metric state from `torch.float` to `torch.long` in `ConfusionMatrix` to accommodate larger values (#715)
- Unified `preds`, `target` input argument naming across all text metrics (#723, #727)
  - `bert`, `bleu`, `chrf`, `sacre_bleu`, `wip`, `wil`, `cer`, `ter`, `wer`, `mer`, `rouge`, `squad`
Deprecated
- Renamed IoU -> Jaccard Index (#662)
- Renamed text WER metric: (#714)
  - `functional.wer` -> `functional.word_error_rate`
  - `WER` -> `WordErrorRate`
- Renamed correlation coefficient classes: (#710)
  - `MatthewsCorrcoef` -> `MatthewsCorrCoef`
  - `PearsonCorrcoef` -> `PearsonCorrCoef`
  - `SpearmanCorrcoef` -> `SpearmanCorrCoef`
- Renamed audio STOI metric: (#753, #758)
  - `audio.STOI` to `audio.ShortTimeObjectiveIntelligibility`
  - `functional.audio.stoi` to `functional.audio.short_time_objective_intelligibility`
- Renamed audio PESQ metrics: (#751)
  - `functional.audio.pesq` -> `functional.audio.perceptual_evaluation_speech_quality`
  - `audio.PESQ` -> `audio.PerceptualEvaluationSpeechQuality`
- Renamed audio SDR metrics: (#711)
  - `functional.sdr` -> `functional.signal_distortion_ratio`
  - `functional.si_sdr` -> `functional.scale_invariant_signal_distortion_ratio`
  - `SDR` -> `SignalDistortionRatio`
  - `SI_SDR` -> `ScaleInvariantSignalDistortionRatio`
- Renamed audio SNR metrics: (#712)
  - `functional.snr` -> `functional.signal_noise_ratio`
  - `functional.si_snr` -> `functional.scale_invariant_signal_noise_ratio`
  - `SNR` -> `SignalNoiseRatio`
  - `SI_SNR` -> `ScaleInvariantSignalNoiseRatio`
- Renamed F-score metrics: (#731, #740)
  - `functional.f1` -> `functional.f1_score`
  - `F1` -> `F1Score`
  - `functional.fbeta` -> `functional.fbeta_score`
  - `FBeta` -> `FBetaScore`
- Renamed Hinge metric: (#734)
  - `functional.hinge` -> `functional.hinge_loss`
  - `Hinge` -> `HingeLoss`
- Renamed image PSNR metrics (#732)
  - `functional.psnr` -> `functional.peak_signal_noise_ratio`
  - `PSNR` -> `PeakSignalNoiseRatio`
- Renamed audio PIT metric: (#737)
  - `functional.pit` -> `functional.permutation_invariant_training`
  - `PIT` -> `PermutationInvariantTraining`
- Renamed image SSIM metric: (#747)
  - `functional.ssim` -> `functional.structural_similarity_index_measure`
  - `SSIM` -> `StructuralSimilarityIndexMeasure`
- Renamed detection `MAP` to `MeanAveragePrecision` metric (#754)
- Renamed Fidelity & LPIPS image metrics: (#752)
  - `image.FID` -> `image.FrechetInceptionDistance`
  - `image.KID` -> `image.KernelInceptionDistance`
  - `image.LPIPS` -> `image.LearnedPerceptualImagePatchSimilarity`
Removed
- Removed `embedding_similarity` metric (#638)
- Removed argument `concatenate_texts` from `wer` metric (#638)
- Removed arguments `newline_sep` and `decimal_places` from `rouge` metric (#638)
Fixed
- Fixed MetricCollection kwargs filtering when no `kwargs` are present in update signature (#707)
Contributors
@ashutoshml, @Borda, @cuent, @Fariborzzz, @getgaurav2, @janhenriklambrechts, @justusschock, @karthikrangasai, @lucadiliello, @mahinlma, @mathemusician, @mona0809, @mrleu, @puhuk, @quancs, @SkafteNicki, @stancld, @twsl
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Fixing mAP on GPU
[0.6.2] - 2021-12-15
Fixed
- Fixed `torch.sort` currently does not support bool `dtype` on CUDA (#665)
- Fixed mAP properly checks if ground truths are empty (#684)
- Fixed initialization of tensors to be on the correct device for `MAP` metric (#673)
Contributors
@OlofHarrysson, @tkupek, @twsl
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Own mAP implementation
[0.6.1] - 2021-12-06
Changed
- Migrate MAP metrics from pycocotools to PyTorch (#632)
- Use `torch.topk` instead of `torch.argsort` in retrieval precision for speedup (#627)
Fixed
- Fix empty predictions in MAP metric (#594, #610, #624)
- Fix edge case of AUROC with `average=weighted` on GPU (#606)
- Fixed `forward` in compositional metrics (#645)
Contributors
@Callidior, @SkafteNicki, @tkupek, @twsl, @zuoxingdong
If we forgot someone due to not matching commit email with GitHub account, let us know :]
More metrics than ever
[0.6.0] - 2021-10-28
We are excited to announce that TorchMetrics v0.6 is now publicly available. TorchMetrics v0.6 does not focus on specific domains but adds a ton of new metrics to several domains, thus increasing the number of metrics in the repository to over 60! Not only has v0.6 added metrics within already covered domains, but we also add support for two new ones: pairwise metrics and detection.
https://devblog.pytorchlightning.ai/torchmetrics-v0-6-more-metrics-than-ever-e98c3983621e
Pairwise Metrics
TorchMetrics v0.6 offers a new set of metrics in its functional backend for calculating pairwise distances. Given a tensor X with shape [N,d] (N observations, each in d dimensions), a pairwise metric calculates an [N,N] matrix containing the metric value for every pair of rows of X.
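A minimal sketch of the functional pairwise interface (two of the four functions listed in the v0.6 changelog below); the random data is purely illustrative:

```python
import torch
from torchmetrics.functional import pairwise_cosine_similarity, pairwise_euclidean_distance

x = torch.randn(4, 16)  # N=4 observations, each in d=16 dimensions

# Both calls return an [N, N] matrix with one entry per pair of rows of x.
print(pairwise_cosine_similarity(x).shape)   # torch.Size([4, 4])
print(pairwise_euclidean_distance(x).shape)  # torch.Size([4, 4])
```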
Detection
TorchMetrics v0.6 now includes a detection package that provides the MAP metric. The implementation essentially wraps pycocotools, ensuring that we get the correct values, but with the benefit of now being able to scale to multiple devices (like any other metric in TorchMetrics).
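A hedged sketch of typical usage (pycocotools must be installed; the class was exposed as `MAP` in v0.6 and later renamed `MeanAveragePrecision`, so the import below assumes v0.7+; the boxes, scores and labels are made-up values):

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision()
preds = [dict(
    boxes=torch.tensor([[10.0, 10.0, 60.0, 60.0]]),  # xyxy format
    scores=torch.tensor([0.9]),
    labels=torch.tensor([0]),
)]
target = [dict(
    boxes=torch.tensor([[12.0, 12.0, 58.0, 58.0]]),
    labels=torch.tensor([0]),
)]
metric.update(preds, target)
print(metric.compute()["map"])
```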
New additions
- In the `audio` package, we have two new metrics: Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI). Both metrics can be used to assess speech quality.
- In the `retrieval` package, we also have two new metrics: R-precision and hit rate. R-precision corresponds to the recall at the R-th position of the query. The hit rate is the fraction of queries for which at least one relevant document is among the retrieved results.
- The `text` package also receives an update in the form of two new metrics: SacreBLEU score and character error rate. SacreBLEU provides a more systematic way of comparing BLEU scores across tasks. The character error rate is similar to the word error rate but instead evaluates whether a given algorithm has correctly predicted a sentence based on a character-by-character comparison.
- The `regression` package got a single new metric in the form of the Tweedie deviance score. Deviance scores are generally a better measure of fit than measures such as squared error when modelling data coming from highly skewed distributions.
- Finally, we have added five new metrics for simple aggregation: `SumMetric`, `MeanMetric`, `MinMetric`, `MaxMetric`, `CatMetric`. All five metrics take a single input (either native Python floats or `torch.Tensor`) and keep track of the sum, average, min, etc. These new aggregation metrics are especially useful in combination with `self.log` from Lightning if you want to log something other than the average of the metric you are tracking (see the sketch after this list).
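A small sketch of the aggregation metrics mentioned above (the values are illustrative):

```python
import torch
from torchmetrics import MeanMetric, MaxMetric

mean_loss, max_loss = MeanMetric(), MaxMetric()

for loss in [0.5, 0.3, torch.tensor(0.7)]:  # native floats and tensors both work
    mean_loss.update(loss)
    max_loss.update(loss)

print(mean_loss.compute())  # tensor(0.5000)
print(max_loss.compute())   # tensor(0.7000)
```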
Detail changes
Added
- Added audio metrics:
- Added Information retrieval metrics:
- Added NLP metrics:
- Added other metrics:
- Added `MAP` (mean average precision) metric to new detection package (#467)
- Added support for float targets in `nDCG` metric (#437)
- Added `average` argument to `AveragePrecision` metric for reducing multilabel and multiclass problems (#477)
- Added `MultioutputWrapper` (#510)
- Added metric sweeping:
- Added simple aggregation metrics: `SumMetric`, `MeanMetric`, `CatMetric`, `MinMetric`, `MaxMetric` (#506)
- Added pairwise submodule with metrics (#553)
  - `pairwise_cosine_similarity`
  - `pairwise_euclidean_distance`
  - `pairwise_linear_similarity`
  - `pairwise_manhatten_distance`
Changed
- `AveragePrecision` will now by default output the `macro` average for multilabel and multiclass problems (#477)
- `half`, `double`, `float` will no longer change the dtype of the metric states. Use `metric.set_dtype` instead (#493)
- Renamed `AverageMeter` to `MeanMetric` (#506)
- Changed `is_differentiable` from property to a constant attribute (#551)
- `ROC` and `AUROC` will no longer throw an error when either the positive or negative class is missing. Instead, a score of 0 is returned together with a warning
Deprecated
- Deprecated `torchmetrics.functional.self_supervised.embedding_similarity` in favour of new pairwise submodule
Removed
- Removed `dtype` property (#493)
Fixed
- Fixed bug in `F1` with `average='macro'` and `ignore_index!=None` (#495)
- Fixed bug in `pit` by using the returned first result to initialize device and type (#533)
- Fixed `SSIM` metric using too much memory (#539)
- Fixed bug where `device` property was not properly updated when the metric was a child of a module (#542)
Contributors
@an1lam, @Borda, @karthikrangasai, @lucadiliello, @mahinlma, @obus, @quancs, @SkafteNicki, @stancld, @tkupek
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Own NLP implementations
[0.5.1] - 2021-08-30
Added
- Added `device` and `dtype` properties (#462)
- Added `TextTester` class for robustly testing text metrics (#450)
Changed
- Added support for float targets in `nDCG` metric (#437)
Removed
- Removed `rouge-score` as dependency for text package (#443)
- Removed `jiwer` as dependency for text package (#446)
- Removed `bert-score` as dependency for text package (#473)
Fixed
- Fixed ranking of samples in `SpearmanCorrCoef` metric (#448)
- Fixed bug where compositional metrics were unable to sync because of type mismatch (#454)
- Fixed metric hashing (#478)
- Fixed `BootStrapper` metrics not working on GPU (#462)
- Fixed the semantic ordering of kernel height and width in `SSIM` metric (#474)
Contributors
@justusschock, @karthikrangasai, @kingyiusuen, @obus, @SkafteNicki, @stancld
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Text-related (NLP) metrics
[0.5.0] - 2021-08-09
This release includes general improvements to the library and new metrics within the NLP domain.
https://devblog.pytorchlightning.ai/torchmetrics-v0-5-nlp-metrics-f4232467b0c5
Natural language processing is arguably one of the most exciting areas of machine learning, with models such as BERT, RoBERTa, GPT-3, etc., really pushing what automated text translation, recognition, and generation systems are capable of.
With the introduction of these models, many metrics have been proposed that measure how well these models perform. TorchMetrics v0.5 includes 4 such metrics: BERT score, BLEU, ROUGE and WER.
Detail changes
Added
- Added Text-related (NLP) metrics:
- Added `MetricTracker` wrapper metric for keeping track of the same metric over multiple epochs (#238)
- Added other metrics:
- Added support in `nDCG` metric for target with values larger than 1 (#349)
- Added support for negative targets in `nDCG` metric (#378)
- Added `None` as reduction option in `CosineSimilarity` metric (#400)
- Allowed passing labels in (n_samples, n_classes) to `AveragePrecision` (#386)
Changed
- Moved `psnr` and `ssim` from `functional.regression.*` to `functional.image.*` (#382)
- Moved `image_gradient` from `functional.image_gradients` to `functional.image.gradients` (#381)
- Moved `R2Score` from `regression.r2score` to `regression.r2` (#371)
- Pearson metric now only stores 6 statistics instead of all predictions and targets (#380)
- Use `torch.argmax` instead of `torch.topk` when `k=1` for better performance (#419)
- Moved check for number of samples in R2 score to support single sample updating (#426)
Deprecated
- Rename `r2score` >> `r2_score` and `kldivergence` >> `kl_divergence` in `functional` (#371)
- Moved `bleu_score` from `functional.nlp` to `functional.text.bleu` (#360)
Removed
- Removed restriction that `threshold` has to be in (0,1) range to support logit input (#351, #401)
- Removed restriction that `preds` could not be bigger than `num_classes` to support logit input (#357)
- Removed module `regression.psnr` and `regression.ssim` (#382)
- Removed (#379):
  - function `functional.mean_relative_error`
  - `num_thresholds` argument in `BinnedPrecisionRecallCurve`
Fixed
- Fixed bug where classification metrics with `average='macro'` would lead to wrong result if a class was missing (#303)
- Fixed `weighted`, `multi-class` AUROC computation to allow for 0 observations of some class, as contribution to final AUROC is 0 (#376)
- Fixed that `_forward_cache` and `_computed` attributes are also moved to the correct device if metric is moved (#413)
- Fixed calculation in `IoU` metric when using `ignore_index` argument (#328)
Contributors
@BeyondTheProof, @Borda, @CSautier, @discort, @edwardclem, @gagan3012, @hugoperrin, @karthikrangasai, @paul-grundmann, @quancs, @rajs96, @SkafteNicki, @vatch123
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Fixing DDP sync
Multimedia - audio & image quality
Overview
https://devblog.pytorchlightning.ai/torchmetrics-v0-4-introducing-multimedia-metrics-e6380a3ad354
Audio
The first highlight of v0.4.0 is a set of 3 new metrics for evaluating audio data: scale-invariant signal-to-distortion ratio (SI-SDR), scale-invariant signal-to-noise ratio (SI-SNR), and signal-to-noise ratio (SNR). All these metrics take a predicted audio tensor and a target tensor, both with the shape [...,time], and calculate the metric over the time axis.
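A minimal sketch using the functional interface with random tensors; note these functions were exposed as `si_sdr`, `si_snr` and `snr` in v0.4 and carry the longer names used below from v0.7 onwards:

```python
import torch
from torchmetrics.functional import (
    scale_invariant_signal_distortion_ratio,  # `si_sdr` in v0.4
    signal_noise_ratio,                       # `snr` in v0.4
)

preds = torch.randn(2, 8000)   # [..., time]
target = torch.randn(2, 8000)

print(scale_invariant_signal_distortion_ratio(preds, target))
print(signal_noise_ratio(preds, target))
```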
Image
Version v0.4.0 also includes a completely new image package. Since its initial v0.2.0 release, TorchMetrics has had both PSNR and SSIM in its regression module, metrics that can be used to evaluate image quality.
With the image module, we are adding three new metrics for evaluating the quality of generative models (such as GANs): Inception score (IS), Fréchet inception distance (FID), and kernel inception distance (KID).
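A hedged sketch of FID usage (requires the `torch-fidelity` package; the class was exposed as `FID` in v0.4 and later renamed `FrechetInceptionDistance`, so the import below assumes a newer version; the random uint8 images simply stand in for real and generated samples):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=64)  # smaller Inception feature layer to keep the sketch light
real = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)

fid.update(real, real=True)   # accumulate statistics of real images
fid.update(fake, real=False)  # accumulate statistics of generated images
print(fid.compute())
```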
More Functionality
In addition to the new audio and image package, we also want to highlight a couple of features:
- Addition of MeanAbsolutePercentageError (MAPE) metric to the regression package. Useful in regression settings where you want to focus on the relative instead of absolute error.
- Addition of KLDivergence metric to the classification package. Useful for measuring the distance between probability distributions like the ones outputted in variational auto-encoders.
- Addition of CosineSimilarity metric to the regression package. Useful for calculating the angle between two embedding vectors in domains such as metric learning.
- As requested by multiple users, `Accuracy`, `Precision`, `Recall`, `FBeta`, `F1`, `StatScore`, `Hamming`, `ConfusionMatrix` now directly support that predictions can be unnormalized, e.g. logits from your model. No need to call `.softmax(dim=-1)` anymore (see the sketch after this list)!
- All modular metrics now have both `sync` and `sync_context` methods that allow the user full control over when metric states are synced. Note that we still automatically do this whenever calling the `compute` method.
- The `is_differentiable` property has been adopted by many more of our metrics!
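A quick sketch of the logit support mentioned in the list above (the plain `Accuracy()` constructor shown here matches the v0.4-v0.10 API; newer releases additionally expect a `task` argument):

```python
import torch
from torchmetrics import Accuracy

accuracy = Accuracy()
logits = torch.randn(10, 5)       # unnormalized scores for 5 classes
target = torch.randint(0, 5, (10,))

print(accuracy(logits, target))   # no .softmax(dim=-1) needed
```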
Thanks
Big thanks to all community members for their contributions and feedback.
A special thanks to @quancs for leading the development of the new audio package.
[0.4.0] - 2021-06-24
Added
- Added Cosine Similarity metric (#305)
- Added Specificity metric (#210)
- Added `add_metrics` method to `MetricCollection` for adding additional metrics after initialization (#221)
- Added pre-gather reduction in the case of `dist_reduce_fx="cat"` to reduce communication cost (#217)
- Added better error message for `AUROC` when `num_classes` is not provided for multiclass input (#244)
- Added support for unnormalized scores (e.g. logits) in `Accuracy`, `Precision`, `Recall`, `FBeta`, `F1`, `StatScore`, `Hamming`, `ConfusionMatrix` metrics (#200)
- Added `MeanAbsolutePercentageError` (`MAPE`) metric (#248)
- Added `squared` argument to `MeanSquaredError` for computing `RMSE` (#249)
- Added FID metric (#213)
- Added `is_differentiable` property to `ConfusionMatrix`, `F1`, `FBeta`, `Hamming`, `Hinge`, `IOU`, `MatthewsCorrcoef`, `Precision`, `Recall`, `PrecisionRecallCurve`, `ROC`, `StatScores` (#253)
- Added audio metrics: SNR, SI_SDR, SI_SNR (#292)
- Added Inception Score metric to image module (#299)
- Added KID metric to image module (#301)
- Added `sync` and `sync_context` methods for manually controlling when metric states are synced (#302)
- Added `KLDivergence` metric (#247)
Changed
- Forward cache is reset when `reset` method is called (#260)
- Improved per-class metric handling for imbalanced datasets for `precision`, `recall`, `precision_recall`, `fbeta`, `f1`, `accuracy`, and `specificity` (#204)
- Decorated `torch.jit.unused` to `MetricCollection` forward (#307)
- Renamed `thresholds` argument to binned metrics for manually controlling the thresholds (#322)
Deprecated
- Deprecated `torchmetrics.functional.mean_relative_error` (#248)
- Deprecated `num_thresholds` argument in `BinnedPrecisionRecallCurve` (#322)
Removed
- Removed argument `is_multiclass` (#319)
Fixed
- AUC can also support more dimensional inputs when all but one dimension are of size 1 (#242)
- Fixed `dtype` of modular metrics after reset has been called (#243)
- Fixed calculation in `matthews_corrcoef` to correctly match formula (#321)
Contributors
@AnselmC, @arvindmuralie77, @bhadreshpsavani, @Borda, @GiannisVagionakis, @hassiahk, @IgorHoholko, @johannespitz, @justusschock, @maximsch2, @pranjaldatta, @quancs, @simran2905, @SkafteNicki, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]