
When working with probabilistic models, predictions are expressed as full distributions rather than point estimates. To keep things simple, we focus on the case where the outcome \(Y\) to be predicted takes one of \(n\) possible values, labeled \([1:n] = \{1,2,\ldots,n\}\). A probabilistic forecast then takes the form of a vector \(\pi=(\pi_1,\pi_2,\ldots,\pi_n)\) in the probability simplex \(\Delta^{n-1}\). Each coordinate \(\pi_i\) represents the predicted probability of outcome \(i\). How should we evaluate the quality of such probabilistic forecasts?
A scoring rule assigns a numerical reward \(s(\pi,Y)\) to the probabilistic forecast \(\pi\) when outcome \(Y\) occurs. If the true distribution of the outcome \(Y\) is \(p\), the expected reward for reporting \(\pi\) is \[ S(\pi,p) \equiv \sum_{i=1}^n p_i \, s(\pi,i). \tag{1}\]
Although the function \(s: \Delta^{n-1} \times [1:n] \to \mathbb{R}\) is generally non-linear, the function \(S\) can be extended through Equation 1 to a function from \(\Delta^{n-1} \times \mathbb{R}^n\) to \(\mathbb{R}\), linear in its second argument, by interpreting \(p\) as a vector in \(\mathbb{R}^n\). The fact that \(S\) is linear in its second argument will prove very useful later. Furthermore, if one denotes by \(\delta_i = (0,\ldots,0,1,0,\ldots,0) \in \Delta^{n-1}\) the Dirac measure at \(i\), then the scoring rule can be recovered from the expected reward via
\[s(\pi,i) = S(\pi,\delta_i).\]
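As a small numerical illustration of Equation 1 and of the recovery \(s(\pi,i) = S(\pi,\delta_i)\), here is a minimal Python sketch; as a concrete rule it uses the logarithmic score \(s(\pi,i)=\log(\pi_i)\) discussed later in this text, and the function names are ours:

```python
import numpy as np

# Concrete scoring rule for illustration: the logarithmic score s(pi, i) = log(pi_i)
def s(pi, i):
    return np.log(pi[i])

def S(pi, p):
    """Expected reward S(pi, p) = sum_i p_i * s(pi, i), as in Equation 1."""
    return sum(p[i] * s(pi, i) for i in range(len(p)))

pi = np.array([0.5, 0.3, 0.2])
delta_1 = np.array([0.0, 1.0, 0.0])  # Dirac measure at outcome i = 1

# Recovering the scoring rule from the expected reward: s(pi, i) = S(pi, delta_i)
print(np.isclose(S(pi, delta_1), s(pi, 1)))  # True
```

Note that `S` is linear in its second argument `p` by construction, exactly as remarked above.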
The central requirement in the design of scoring rules is that a forecaster should have no incentive to misreport their beliefs. This means that if the forecaster’s belief about the distribution of \(Y\) is given by the probability distribution \(\pi \in\Delta^{n-1}\), then reporting \(\pi\) should maximize their expected reward. There are a number of situations in which such a design is desirable; for example, in market settings where agents are asked to provide probabilistic forecasts, it incentivizes truthful reporting of beliefs. A scoring rule is called proper if the mapping \(\pi \mapsto S(\pi, p)\) attains its maximum at \(\pi=p\). Formally, this means that for any two distributions \(p,\pi \in \Delta^{n-1}\):
\[ S(p,p) \ge S(\pi, p). \]
This condition ensures that the best strategy, in expectation, is to report one’s genuine probabilities. If the inequality is strict whenever \(\pi \ne p\), the scoring rule is called strictly proper. Proper scoring rules have a long history in statistics and decision theory. The natural question arises: what do proper scoring rules look like, and how can we construct them? What functional forms can we use for \(s(\pi,i)\) that ensure properness?
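Properness can be checked numerically. The following minimal sketch uses the logarithmic score \(s(\pi,i)=\log(\pi_i)\) (discussed below, and known to be proper): it samples random reports \(\pi\) from the simplex and verifies that none of them beats the truthful report \(p\) in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def S_log(pi, p):
    """Expected logarithmic score: S(pi, p) = sum_i p_i * log(pi_i)."""
    return np.sum(p * np.log(pi))

p = np.array([0.6, 0.3, 0.1])  # "true" distribution (arbitrary choice)

# No random report pi should beat the truthful report p in expectation.
for _ in range(1000):
    pi = rng.dirichlet(np.ones(3))  # random point in the simplex
    assert S_log(p, p) >= S_log(pi, p)
```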
For each distribution \(p\), define its self-expected score \[ H(p)=S(p,p)=\sum_{i=1}^n p_i \, s(p,i). \]
It is the average reward a forecaster receives when their reported distribution matches the true distribution. Crucially, the affine function \(p \mapsto S(\pi,p)\) describes a supporting hyperplane to the function \(H\) at the point \(\pi\): it is linear in \(p\), matches \(H\) at \(p=\pi\), and never exceeds it elsewhere. If \(H\) were known to be convex and differentiable, then by uniqueness of supporting hyperplanes to convex and differentiable functions, one could immediately write down a representation of \(S(\pi,p)\) in terms of \(H\). And \(H\) is indeed convex, since it is the pointwise maximum of the family of affine functions \(p \mapsto S(\pi,p)\) indexed by \(\pi\). Assuming differentiability to keep a few technicalities at bay, this shows that:
\[ \begin{align*} s(\pi,i) &= S(\pi,\delta_i) = S(\pi, \pi) + \left< \nabla H(\pi), \delta_i - \pi \right>\\ &= H(\pi) + \left< \nabla H(\pi), \delta_i - \pi \right>. \end{align*} \]
Without assuming differentiability, one can use subgradients instead of gradients to obtain a similar representation. This shows that proper scoring rules \(s(\pi,i)\) are in one-to-one correspondence with convex functions \(H(\pi)\) on the probability simplex \(\pi \in \Delta^{n-1}\). Similarly, strictly proper scoring rules correspond to strictly convex functions. Extension to continuous sample spaces is possible through the use of functional derivatives instead of gradients or subgradients; see (Gneiting and Raftery 2007) for details.
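This correspondence is constructive, and easy to sketch in code: given a convex differentiable \(H\) and its gradient, the representation above yields a proper scoring rule. A minimal Python sketch (the helper names are ours), checked against the Brier score defined below:

```python
import numpy as np

def score_from_H(H, grad_H, pi, i):
    """s(pi, i) = H(pi) + <grad H(pi), delta_i - pi>."""
    delta_i = np.zeros_like(pi)
    delta_i[i] = 1.0
    return H(pi) + grad_H(pi) @ (delta_i - pi)

# Example: H(p) = 0.5 * ||p||_2^2 recovers the Brier score
# s(pi, i) = pi_i - 0.5 * sum_j pi_j^2.
H = lambda p: 0.5 * np.sum(p ** 2)
grad_H = lambda p: p

pi = np.array([0.5, 0.3, 0.2])
for i in range(3):
    assert np.isclose(score_from_H(H, grad_H, pi, i),
                      pi[i] - 0.5 * np.sum(pi ** 2))
```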
Let us look at some examples of proper scoring rules defined through this correspondence:
Logarithmic Score: The logarithmic scoring rule is defined as \(s(\pi,i) = \log(\pi_i)\). The corresponding self-expected score is the negative Shannon entropy: \[ H(p) = \sum_{i=1}^n p_i \log(p_i). \] It is interesting to note that the logarithmic scoring rule is essentially the only local proper scoring rule, i.e. one of the type \(s(\pi,i) = F(\pi_i, i)\) for some function \(F\). In other words, the score assigned to outcome \(i\) depends only on the predicted probability \(\pi_i\) of that outcome, and not on the other predicted probabilities \(\pi_j\) for \(j \ne i\). Indeed, assuming \(F\) is smooth for simplicity, the condition of properness easily implies that \(\pi_i \, \partial_{\pi_i} F(\pi_i, i) = \alpha\) for some constant \(\alpha\) independent of \(i\). Integrating this relation gives that \(F(\pi_i, i) = \alpha \log(\pi_i) + \beta_i\), where necessarily \(\alpha>0\) for properness, and where the \(\beta_i\) are arbitrary constants.
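As a consistency check, the representation above does recover the logarithmic score from the negative entropy: since \(\nabla H(\pi)_j = \log(\pi_j) + 1\),

\[ \begin{align*} s(\pi,i) &= H(\pi) + \left< \nabla H(\pi), \delta_i - \pi \right> \\ &= \sum_{j=1}^n \pi_j \log(\pi_j) + \big(\log(\pi_i) + 1\big) - \sum_{j=1}^n \pi_j \big(\log(\pi_j) + 1\big) = \log(\pi_i), \end{align*} \]

where the last equality uses \(\sum_{j} \pi_j = 1\).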
Brier Score: The Brier scoring rule is given by \(s(\pi,i) = \pi_i - \tfrac12 \sum_{j=1}^n \pi_j^2\). The associated self-expected score is \[ H(p) = \frac12 \, \sum_{i=1}^n p_i^2. \]
Spherical Score: The spherical scoring rule is defined as \(s(\pi,i) = \frac{\pi_i}{\|\pi\|_2}\). The corresponding self-expected score is \[ H(p) = \|p\|_2. \]
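One can verify this example via the representation above: since \(\nabla H(\pi) = \pi/\|\pi\|_2\),

\[ s(\pi,i) = \|\pi\|_2 + \left< \frac{\pi}{\|\pi\|_2}, \delta_i - \pi \right> = \|\pi\|_2 + \frac{\pi_i}{\|\pi\|_2} - \|\pi\|_2 = \frac{\pi_i}{\|\pi\|_2}. \]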
Energy Score: Consider a distance function \(d: [1:n] \times [1:n] \to \mathbb{R}_+\). The energy scoring rule is defined through expected distances: \[ s(\pi,i) = - \mathbb{E}[d(X,i)] \] where \(X \sim \pi\). The associated self-expected score is \[ H(p) = -\mathbb{E}[d(X,X')] = -\sum_{i,j=1}^n p_i p_j \, d(i,j), \] where \(X,X' \sim p\) are independent. This function is convex in \(p\) if the distance matrix \(M_{i,j}=d(i,j)\) is negative semi-definite on the subspace of zero-sum vectors, i.e., if for all vectors \(z \in \mathbb{R}^n\) with \(\sum_{i=1}^n z_i = 0\), one has \(\sum_{i,j=1}^n z_i z_j \, d(i,j) \le 0\). Luckily, there are many such distances. For example, if the distance \(d\) is of the form \[d(i,j) = \|\varphi_i - \varphi_j\|_2^2\] for some (feature) vectors \(\varphi_1,\ldots,\varphi_n \in \mathbb{R}^m\), then the corresponding distance matrix is negative semi-definite on the subspace of zero-sum vectors. In fact, with a bit of algebra, one can check that for any \(0< \beta \le 2\), the distance defined as \[d(i,j) = \|\varphi_i - \varphi_j\|_2^\beta\] also leads to a negative semi-definite distance matrix on the subspace of zero-sum vectors. This includes, in particular, the case \(\beta=1\) which corresponds to the standard Euclidean distance.
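A quick numerical sanity check of the negative semi-definiteness claim for squared Euclidean feature distances; the feature vectors here are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative feature vectors phi_1, ..., phi_n in R^m
n, m = 5, 3
phi = rng.normal(size=(n, m))

# Distance matrix M[i, j] = d(i, j) = ||phi_i - phi_j||_2^2
diff = phi[:, None, :] - phi[None, :, :]
M = np.sum(diff ** 2, axis=-1)

# Check sum_{i,j} z_i z_j d(i,j) <= 0 for random zero-sum vectors z
for _ in range(1000):
    z = rng.normal(size=n)
    z -= z.mean()  # project onto the zero-sum subspace
    assert z @ M @ z <= 1e-9
```

Indeed, expanding \(\sum_{i,j} z_i z_j \|\varphi_i-\varphi_j\|_2^2\) with \(\sum_i z_i = 0\) leaves only the cross term \(-2\|\sum_i z_i \varphi_i\|_2^2 \le 0\), which is what the check above confirms numerically.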