Fitting a distribution with customized maximum likelihood

Introduction

When we perform distribution fitting using the MaximumLikelihoodFactory class, the default optimization solver sometimes fails to maximize the likelihood. This might be because the optimization solver is not appropriate in this particular case, or because the bounds of the problem are not properly set. In this example, we configure the optimization solver by two different methods.

  • The first method sets a specific solver among the NLopt solvers. Moreover, we customize the distribution parameters bounds. This helps the solver to avoid the part of the domain where the distribution cannot be built.

  • The second method sets the ResourceMap key that is used by the MaximumLikelihoodFactory in order to select the default solver.

import openturns as ot

Simulating a sample

In this example, we consider the TruncatedNormal distribution with default parameters and generate a large sample from it.

distribution = ot.TruncatedNormal()
print("Distribution=", distribution)
size = 10000
sample = distribution.getSample(size)
Distribution= TruncatedNormal(mu = 0, sigma = 1, a = -1, b = 1)

We can use the TruncatedNormalFactory class.

factory = ot.TruncatedNormalFactory()
fittedDistribution = factory.build(sample)
print("Fitted distribution=", fittedDistribution)
Fitted distribution= TruncatedNormal(mu = -0.0087698, sigma = 0.984923, a = -0.999836, b = 1)

But this method sometimes fails for specific distributions, for example when the sample size is very small or when the distribution has specific properties (e.g. some parameter is known beforehand). In these cases, general-purpose methods cannot be used, and the methods presented below may be relevant.

Defining the estimation problem

The truncated normal distribution has 4 parameters:

  • the mean and standard deviation parameters of the underlying Normal distribution,

  • the minimum and maximum of the underlying truncated distribution.

Estimating the parameters will be easier if two conditions are met:

  • the starting point (i.e., the initial values of the parameters) of the optimization is compatible with the sample,

  • the bounds are compatible with the distribution.

First, we define the starting point of the optimization, using statistics from the sample.

sample_min = sample.getMin()[0]
sample_max = sample.getMax()[0]
sample_mean = sample.computeMean()[0]
sigma_hat = sample.computeStandardDeviation()[0]
startingPoint = [sample_mean, sigma_hat, sample_min, sample_max]
print("startingPoint=", startingPoint)
startingPoint= [-0.002562574711278147, 0.5384068806033934, -0.9997360092906988, 0.9999009268469972]

Secondly, we set the bounds of the parameters. We assume that the minimum and maximum cannot be lower than -1 or greater than 1. The parameters have the following bounds:

  • \mu \in [-\infty, +\infty] ;

  • \sigma \in [0, +\infty] ;

  • a \in [-1, 1] ;

  • b \in [-1, 1].

But this may not give satisfactory results, because infinite parameters are impossible and a zero standard deviation is infeasible. Furthermore, we must have a < b. We assume that we know that the mean parameter cannot be greater than 10 and lower than -10. Similarly, the standard deviation cannot be greater than 10. Hence, we use the following bounds:

  • \mu \in [-10, +10] ;

  • \sigma \in [\epsilon, +10] ;

  • a \in [-1 - \epsilon, \bar{x} - \epsilon] ;

  • b \in [\bar{x} + \epsilon, 1 + \epsilon]

where \bar{x} is the sample mean and \epsilon is a small parameter.

epsilon = 1.0e-3
bounds_lower = [-10.0, epsilon, -1.0 - epsilon, sample_mean + epsilon]
bounds_upper = [10.0, 10.0, sample_mean - epsilon, 1.0 + epsilon]
print("bounds_lower=", bounds_lower)
print("bounds_upper=", bounds_upper)
bounds_lower= [-10.0, 0.001, -1.001, -0.0015625747112781468]
bounds_upper= [10.0, 10.0, -0.003562574711278147, 1.001]

The boolean False means unbounded and True means bounded.

finiteLowerBound = [True] * 4
finiteUpperBound = [True] * 4
interval = ot.Interval(bounds_lower, bounds_upper, finiteLowerBound, finiteUpperBound)

Setting the solver and bounds

The two methods that we suggest are based on the MaximumLikelihoodFactory class which uses the maximum likelihood estimator to estimate the parameters of the distribution depending on the sample. The first method is to use the setOptimizationAlgorithm() method. This sets the optimization algorithm as a replacement for the default optimization solver.

def printTruncatedNormalParameters(truncatedNormalDistribution):
    mu, sigma, a, b = truncatedNormalDistribution.getParameter()
    print(" ", truncatedNormalDistribution.getImplementation().getClassName())
    print("  mu= %.3f" % (mu))
    print("  sigma= %.4f" % (sigma))
    print("  a= %.2f" % (a))
    print("  b= %.2f" % (b))
    return None
factory = ot.MaximumLikelihoodFactory(distribution)
factory.setOptimizationBounds(interval)
distribution_MLE = factory.build(sample)
print("Fitted distribution with default solver:")
printTruncatedNormalParameters(distribution_MLE)
default_solver = factory.getOptimizationAlgorithm()
print("Default solver=", default_solver.getImplementation().getClassName())
solver = ot.NLopt("LN_COBYLA")
print("New solver=", solver.getClassName(), solver.getAlgorithmName())
solver.setStartingPoint(startingPoint)
factory.setOptimizationAlgorithm(solver)
print(
    "Updated solver=",
    factory.getOptimizationAlgorithm().getImplementation().getClassName(),
)
distribution_MLE = factory.build(sample)
print("Fitted distribution with new solver:")
printTruncatedNormalParameters(distribution_MLE)
Fitted distribution with default solver:
  TruncatedNormal
  mu= -0.000
  sigma= 1.0000
  a= -1.00
  b= 1.00
Default solver= TNC
New solver= NLopt LN_COBYLA
Updated solver= NLopt
Fitted distribution with new solver:
  TruncatedNormal
  mu= -0.016
  sigma= 0.5526
  a= -1.00
  b= 1.00

Using ResourceMap to set the solver

Another method is to use a special key of the ResourceMap, which defines the default solver used by MaximumLikelihoodFactory. In this case, we do not define the starting point.

ot.ResourceMap.SetAsString(
    "MaximumLikelihoodFactory-DefaultOptimizationAlgorithm", "LN_COBYLA"
)
factory = ot.MaximumLikelihoodFactory(distribution)
factory.setOptimizationBounds(interval)
print(
    "Default solver=",
    factory.getOptimizationAlgorithm().getImplementation().getClassName(),
)
distribution_MLE = factory.build(sample)
print("Fitted distribution:")
printTruncatedNormalParameters(distribution_MLE)
Default solver= NLopt
Fitted distribution:
  TruncatedNormal
  mu= -0.009
  sigma= 1.0016
  a= -1.00
  b= 1.00

Fitting a LogNormal with zero location parameter

We now consider the LogNormal distribution and present two different topics.

  • We show how to estimate the parameters of a LogNormal distribution which has a zero location parameter.

  • We compute the list of solvers which are compatible with a specific MLE problem.

We generate a large sample from the standard LogNormal distribution.

standardLogNormal = ot.LogNormal()
sample = standardLogNormal.getSample(1000)

If we estimate the parameters using LogNormalFactory, the algorithm does not know that the true value of the location parameter \gamma is zero. Hence, the estimator may not necessarily lead to an estimated parameter exactly equal to zero. In this example, we assume that we know that this parameter is zero. In this case, the simplest method is to use the setKnownParameter() method. Furthermore, we know that the absolute value of the mean of the underlying normal distribution cannot be greater than 5. Finally, we know that the standard deviation of the underlying normal distribution cannot be lower than, say, \epsilon = 10^{-4}.

logNormalFactoryWithZeroLocation = ot.MaximumLikelihoodFactory(standardLogNormal)
logNormalFactoryWithZeroLocation.setKnownParameter([0.0], [2])
lowerBound = [-5.0, 1.0e-4]
upperBound = [5.0, -1.0]
finiteLowerBound = [True, True]
finiteUpperBound = [True, False]
bounds = ot.Interval(lowerBound, upperBound, finiteLowerBound, finiteUpperBound)
logNormalFactoryWithZeroLocation.setOptimizationBounds(bounds)
fittedDistribution = logNormalFactoryWithZeroLocation.build(sample)
print("   fittedDistribution = ", fittedDistribution)
fittedDistribution =  LogNormal(muLog = 0.00970194, sigmaLog = 1.00033, gamma = 0)

See the list of solvers compatible with a given MLE problem

In the next script, we perform a loop over the algorithms from the NLopt class and select the algorithms which can solve the problem. This is done using try/except statements. Moreover, some solvers pretend to solve the problem, but do not make any change to the parameters. In this case, we do not select them.

defaultParameter = standardLogNormal.getParameter()
defaultSolver = logNormalFactoryWithZeroLocation.getOptimizationAlgorithm()
startingPoint = defaultSolver.getStartingPoint()
print("Default solver = ", defaultSolver.getImplementation().getClassName())
availableSolverList = []
differenceThreshold = 1.0e-8
for algorithmName in ot.NLopt.GetAlgorithmNames():
    print("- ", algorithmName)
    solver = ot.NLopt(algorithmName)
    solver.setStartingPoint(startingPoint)
    logNormalFactoryWithZeroLocation.setOptimizationAlgorithm(solver)
    try:
        fittedDistribution = logNormalFactoryWithZeroLocation.build(sample)
        print("   fittedDistribution = ", fittedDistribution)
        parameter = fittedDistribution.getParameter()
        differenceOfParameters = parameter - defaultParameter
        if differenceOfParameters.norm() > differenceThreshold:
            # Consider only solvers which make a difference after optimization
            availableSolverList.append(algorithmName)
    except TypeError:
        print("   Fail : Impossible to build with ", algorithmName)
Default solver =  NLopt
-  AUGLAG
   fittedDistribution =  LogNormal(muLog = 0, sigmaLog = 1, gamma = 0)
-  AUGLAG_EQ
   fittedDistribution =  LogNormal(muLog = 0, sigmaLog = 1, gamma = 0)
-  GD_MLSL
   Fail : Impossible to build with  GD_MLSL
-  GD_MLSL_LDS
   Fail : Impossible to build with  GD_MLSL_LDS
-  GN_CRS2_LM
   Fail : Impossible to build with  GN_CRS2_LM
-  GN_DIRECT
   Fail : Impossible to build with  GN_DIRECT
-  GN_DIRECT_L
   Fail : Impossible to build with  GN_DIRECT_L
-  GN_DIRECT_L_NOSCAL
   Fail : Impossible to build with  GN_DIRECT_L_NOSCAL
-  GN_DIRECT_L_RAND
   Fail : Impossible to build with  GN_DIRECT_L_RAND
-  GN_DIRECT_L_RAND_NOSCAL
   Fail : Impossible to build with  GN_DIRECT_L_RAND_NOSCAL
-  GN_DIRECT_NOSCAL
   Fail : Impossible to build with  GN_DIRECT_NOSCAL
-  GN_ESCH
   Fail : Impossible to build with  GN_ESCH
-  GN_ISRES
   Fail : Impossible to build with  GN_ISRES
-  GN_MLSL
   Fail : Impossible to build with  GN_MLSL
-  GN_MLSL_LDS
   Fail : Impossible to build with  GN_MLSL_LDS
-  GN_ORIG_DIRECT
   Fail : Impossible to build with  GN_ORIG_DIRECT
-  GN_ORIG_DIRECT_L
   Fail : Impossible to build with  GN_ORIG_DIRECT_L
-  G_MLSL
   Fail : Impossible to build with  G_MLSL
-  G_MLSL_LDS
   Fail : Impossible to build with  G_MLSL_LDS
-  LD_AUGLAG
   fittedDistribution =  LogNormal(muLog = 0, sigmaLog = 1, gamma = 0)
-  LD_AUGLAG_EQ
   fittedDistribution =  LogNormal(muLog = 0, sigmaLog = 1, gamma = 0)
-  LD_CCSAQ
   fittedDistribution =  LogNormal(muLog = 0, sigmaLog = 1, gamma = 0)
-  LD_LBFGS
   fittedDistribution =  LogNormal(muLog = 0.0158819, sigmaLog = 1.00312, gamma = 0)
-  LD_MMA
   fittedDistribution =  LogNormal(muLog = 0, sigmaLog = 1, gamma = 0)
-  LD_SLSQP
   fittedDistribution =  LogNormal(muLog = 0.0174256, sigmaLog = 1.00132, gamma = 0)
-  LD_TNEWTON
   fittedDistribution =  LogNormal(muLog = 0.0158698, sigmaLog = 1.00311, gamma = 0)
-  LD_TNEWTON_PRECOND
   fittedDistribution =  LogNormal(muLog = 0.0158698, sigmaLog = 1.00311, gamma = 0)
-  LD_TNEWTON_PRECOND_RESTART
   fittedDistribution =  LogNormal(muLog = 0.0158698, sigmaLog = 1.00311, gamma = 0)
-  LD_TNEWTON_RESTART
   fittedDistribution =  LogNormal(muLog = 0.0158698, sigmaLog = 1.00311, gamma = 0)
-  LD_VAR1
   fittedDistribution =  LogNormal(muLog = 0.0160269, sigmaLog = 1.00314, gamma = 0)
-  LD_VAR2
   fittedDistribution =  LogNormal(muLog = 0.0160269, sigmaLog = 1.00314, gamma = 0)
-  LN_AUGLAG
   fittedDistribution =  LogNormal(muLog = 0, sigmaLog = 1, gamma = 0)
-  LN_AUGLAG_EQ
   fittedDistribution =  LogNormal(muLog = 0, sigmaLog = 1, gamma = 0)
-  LN_BOBYQA
   fittedDistribution =  LogNormal(muLog = 0.0153223, sigmaLog = 1.00494, gamma = 0)
-  LN_COBYLA
   fittedDistribution =  LogNormal(muLog = 0.00970194, sigmaLog = 1.00033, gamma = 0)
-  LN_NELDERMEAD
   fittedDistribution =  LogNormal(muLog = 0.0128211, sigmaLog = 1.00282, gamma = 0)
-  LN_NEWUOA
   Fail : Impossible to build with  LN_NEWUOA
-  LN_SBPLX
   fittedDistribution =  LogNormal(muLog = 0.0143433, sigmaLog = 1.00252, gamma = 0)
print("Available solvers = ", len(availableSolverList))
for name in availableSolverList:
    print("- ", name)
Available solvers =  12
-  LD_LBFGS
-  LD_SLSQP
-  LD_TNEWTON
-  LD_TNEWTON_PRECOND
-  LD_TNEWTON_PRECOND_RESTART
-  LD_TNEWTON_RESTART
-  LD_VAR1
-  LD_VAR2
-  LN_BOBYQA
-  LN_COBYLA
-  LN_NELDERMEAD
-  LN_SBPLX

Pick one and try it.

algorithmName = "LD_SLSQP"
solver = ot.NLopt(algorithmName)
print("   New solver = ", solver.getClassName(), solver.getAlgorithmName())
solver.setStartingPoint(startingPoint)
logNormalFactoryWithZeroLocation.setOptimizationAlgorithm(solver)
fittedDistribution = logNormalFactoryWithZeroLocation.build(sample)
print("   fittedDistribution = ", fittedDistribution)
New solver =  NLopt LD_SLSQP
fittedDistribution =  LogNormal(muLog = 0.0174256, sigmaLog = 1.00132, gamma = 0)