Quantitative comparison of uncertainty methods
This section describes how we compared the uncertainty estimation methods using the Corpus of Linguistic Acceptability (CoLA) dataset (Warstadt, Singh, and Bowman 2019). CoLA is a collection of sentences, drawn from examples in linguistics publications, each with a binary label indicating whether the sentence is acceptable. Sentences can be labeled unacceptable for a variety of reasons, including improper syntax, semantics, or morphology. There are two validation sets: one drawn from the same sources used to build the training set (in domain), and one drawn from sources not represented in the training set (out of domain). The following table summarizes the splits.
| Dataset | Total size | Positive | Negative |
|---|---|---|---|
| Training | 8551 | 6023 | 2528 |
| Validation (in domain) | 527 | 363 | 164 |
| Validation (out of domain) | 516 | 354 | 162 |
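One point worth noting from the table is the class balance: every split skews toward acceptable sentences, which is relevant when interpreting accuracy and calibration. A quick sanity check on the counts above:

```python
# Positive-class fraction in each CoLA split, computed from the counts
# in the table above. All splits skew roughly 70/30 toward "acceptable",
# so accuracy alone can be misleading and uncertainty estimates matter.
splits = {
    "train": (8551, 6023),
    "val_in_domain": (527, 363),
    "val_out_of_domain": (516, 354),
}
for name, (total, positive) in splits.items():
    print(f"{name}: {positive / total:.1%} positive")
```

The positive rate is nearly identical in the two validation sets (about 69%), so any performance gap between them reflects the domain shift rather than a change in label balance.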
The comparison uses a RoBERTa base architecture (Liu et al. 2019) with pretrained weights and a randomly initialized classification head with a single hidden layer. Hyperparameters largely follow those suggested in the RoBERTa paper, with a few minor modifications.
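To make the model setup concrete, here is a minimal NumPy sketch of a classification head with a single hidden layer of the kind described above. RoBERTa base produces 768-dimensional representations; the hidden width of 768, the tanh activation, and the initialization scale are illustrative assumptions, not details taken from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 768   # RoBERTa base hidden size
CLASSES = 2    # acceptable vs. unacceptable

# Randomly initialized head parameters (initialization scale is an assumption).
W1 = rng.normal(scale=0.02, size=(HIDDEN, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.02, size=(HIDDEN, CLASSES))
b2 = np.zeros(CLASSES)

def head(pooled):
    """Map a pooled encoder output to class logits via one hidden layer."""
    h = np.tanh(pooled @ W1 + b1)  # the single hidden layer
    return h @ W2 + b2

pooled = rng.normal(size=HIDDEN)  # stand-in for a pooled encoder output
logits = head(pooled)
print(logits.shape)  # (2,)
```

In the actual setup, the pooled vector would come from the pretrained encoder, and only the head starts from random weights; during fine-tuning both are updated.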