Quantitative comparison of uncertainty methods
This section describes how we compared the uncertainty estimation methods using the Corpus of Linguistic Acceptability (CoLA) dataset (Warstadt, Singh, and Bowman 2019). CoLA is a collection of sentences, drawn from examples in linguistics publications, each with a binary label indicating whether the sentence is acceptable. Sentences can be labeled unacceptable for a variety of reasons, including improper syntax, semantics, or morphology. There are two validation sets: one drawn from the same sources used to build the training set (in domain), and one drawn from sources not represented in the training set (out of domain). The following table summarizes the splits.
| Dataset | Total size | Positive | Negative |
|---|---|---|---|
| Training | 8551 | 6023 | 2528 |
| Validation (in domain) | 527 | 363 | 164 |
| Validation (out of domain) | 516 | 354 | 162 |
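One point worth noting from the table is the class balance: every split skews toward acceptable sentences, which is relevant when interpreting accuracy and calibration. A quick sanity check on the counts above:

```python
# Positive-class fraction in each CoLA split, computed from the counts
# in the table above. All splits skew roughly 70/30 toward "acceptable",
# so accuracy alone can be misleading and uncertainty estimates matter.
splits = {
    "train": (8551, 6023),
    "val_in_domain": (527, 363),
    "val_out_of_domain": (516, 354),
}
for name, (total, positive) in splits.items():
    print(f"{name}: {positive / total:.1%} positive")
```

The positive rate is nearly identical in the two validation sets (about 69%), so any performance gap between them reflects the domain shift rather than a change in label balance.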
The comparison uses a RoBERTa base architecture (Liu et al. 2019) with pretrained weights and a randomly initialized classification head with a single hidden layer. Hyperparameters largely follow those suggested in the RoBERTa paper, with a few minor modifications.
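To make the model setup concrete, here is a minimal NumPy sketch of a classification head with a single hidden layer of the kind described above. RoBERTa base produces 768-dimensional representations; the hidden width of 768, the tanh activation, and the initialization scale are illustrative assumptions, not details taken from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 768   # RoBERTa base hidden size
CLASSES = 2    # acceptable vs. unacceptable

# Randomly initialized head parameters (initialization scale is an assumption).
W1 = rng.normal(scale=0.02, size=(HIDDEN, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.02, size=(HIDDEN, CLASSES))
b2 = np.zeros(CLASSES)

def head(pooled):
    """Map a pooled encoder output to class logits via one hidden layer."""
    h = np.tanh(pooled @ W1 + b1)  # the single hidden layer
    return h @ W2 + b2

pooled = rng.normal(size=HIDDEN)  # stand-in for a pooled encoder output
logits = head(pooled)
print(logits.shape)  # (2,)
```

In the actual setup, the pooled vector would come from the pretrained encoder, and only the head starts from random weights; during fine-tuning both are updated.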