fitmle(1) -- Fit a set of values with a power-law distribution
======

## SYNOPSIS

`fitmle` <data_in> [<tol> [TEST [<num\_test>]]]

## DESCRIPTION

`fitmle` fits the data points contained in the file <data_in> with a
power-law function P(k) ~ k^(-gamma), using the Maximum-Likelihood
Estimator (MLE). In particular, `fitmle` finds the exponent `gamma`
and the smallest of the input values, `k_min`, for which the
power-law behaviour holds. The second (optional) argument <tol> sets
the acceptable statistical error on the estimate of the exponent.
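
For reference, when `k_min` is held fixed, the continuous-case estimator
described by Clauset et al. has a closed form:
gamma = 1 + n / sum(ln(x_i / k_min)), with a standard error of roughly
(gamma - 1) / sqrt(n). The Python sketch below illustrates that formula
only. It is not the code used by `fitmle`, which also searches over
`k_min` and switches to a discrete fit for integer data. The function
name `mle_exponent` is hypothetical, and reading <tol> as a bound on this
standard error is an assumption.

```python
import math


def mle_exponent(values, k_min):
    """Continuous-case MLE of gamma for the values >= k_min.

    Hypothetical helper, not part of fitmle.
    Returns (gamma, standard_error, n_tail).
    """
    # Only the tail of the data (values >= k_min) enters the estimate.
    tail = [v for v in values if v >= k_min]
    n = len(tail)
    log_sum = sum(math.log(v / k_min) for v in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need at least one value strictly above k_min")
    gamma = 1.0 + n / log_sum
    # Leading-order standard error of the estimate.
    std_err = (gamma - 1.0) / math.sqrt(n)
    return gamma, std_err, n
```

Under this reading, a larger tail gives a smaller standard error, so a
stricter <tol> would tend to favour smaller values of `k_min`.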

If `TEST` is provided, the program associates a p-value with the
goodness of the fit, based on the Kolmogorov-Smirnov statistic
computed on <num_test> synthetic samples drawn from the best-fit
power-law function. If <num_test> is not provided, the test is based
on 100 synthetic samples.
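
The idea behind the test can be sketched in Python as follows. This is a
simplified illustration under stated assumptions, not the procedure
implemented by `fitmle`: it treats the data as continuous, keeps `k_min`
fixed instead of re-estimating it for each sample, and uses only the
values at or above `k_min`. All function names are hypothetical.

```python
import math
import random


def fit_gamma(tail, k_min):
    # Continuous-case MLE with k_min fixed (assumes some value > k_min).
    return 1.0 + len(tail) / sum(math.log(v / k_min) for v in tail)


def ks_statistic(tail, k_min, gamma):
    """Largest distance between the empirical and the power-law CDF."""
    xs = sorted(tail)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - (x / k_min) ** (1.0 - gamma)
        # Compare against the empirical CDF just before and at x.
        d = max(d, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return d


def sample_power_law(n, k_min, gamma, rng):
    # Inverse-transform sampling from P(x) ~ x^(-gamma), x >= k_min.
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0))
            for _ in range(n)]


def p_value(tail, k_min, num_test=100, seed=0):
    """Fraction of synthetic samples whose KS exceeds the observed one."""
    rng = random.Random(seed)
    gamma = fit_gamma(tail, k_min)
    ks_obs = ks_statistic(tail, k_min, gamma)
    larger = 0
    for _ in range(num_test):
        synth = sample_power_law(len(tail), k_min, gamma, rng)
        # Each synthetic sample is re-fitted before computing its KS.
        if ks_statistic(synth, k_min, fit_gamma(synth, k_min)) > ks_obs:
            larger += 1
    return gamma, ks_obs, larger / num_test
```

With `num_test` samples the p-value is resolved in steps of
1/`num_test`, which is why a larger <num_test> gives a finer estimate at
the cost of a longer run.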


## PARAMETERS

* <data_in>:
    Set of values to fit. If it is equal to `-` (dash), the set is read
    from STDIN.

* <tol>: 
    The acceptable statistical error on the estimation of the
    exponent. If omitted, it is set to 0.1.
    
* TEST:
    If the third parameter is `TEST`, the program computes an estimate
    of the p-value associated with the best fit, based on <num_test>
    synthetic samples of the same size as the input set.

* <num_test>:
    Number of synthetic samples to use for the estimation of the
    p-value of the best fit. If omitted, it is set to 100.

## OUTPUT

If `fitmle` is given fewer than three parameters (i.e., if `TEST` is
not specified), the output is a line in the format:

        gamma k_min ks

where `gamma` is the estimate for the exponent, `k_min` is the
smallest of the input values for which the power-law behaviour holds,
and `ks` is the value of the Kolmogorov-Smirnov statistic of the
best fit.

If `TEST` is specified, the output line also contains the estimate of
the p-value of the best fit:

        gamma k_min ks p-value

where `p-value` is based on <num_test> samples (or 100, if
<num_test> is not specified) of the same size as the input, drawn
from the best-fit power-law function.
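
Since both formats are whitespace-separated, the result line is easy to
consume from a script. The following Python sketch shows one way to do
it. The helper name `parse_fitmle_output` and the dictionary keys are
hypothetical, chosen here to mirror the field names above.

```python
def parse_fitmle_output(line):
    """Parse a 'gamma k_min ks [p-value]' line into a dict of floats.

    Hypothetical helper, not shipped with fitmle.
    """
    fields = line.split()
    if len(fields) not in (3, 4):
        raise ValueError("expected 3 or 4 fields, got %d" % len(fields))
    result = {
        "gamma": float(fields[0]),
        "k_min": float(fields[1]),
        "ks": float(fields[2]),
    }
    # The fourth field is present only when TEST was requested.
    if len(fields) == 4:
        result["p_value"] = float(fields[3])
    return result
```

The `k_min` field is parsed as a float rather than an integer so that
the same helper works for non-integer input values.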
 
## EXAMPLES

Let us assume that the file `AS-20010316.net_degs` contains the degree
sequence of the data set `AS-20010316.net` (the graph of the Internet
at the AS level in March 2001). The exponent of the best-fit power-law
distribution can be obtained by using:

        $ fitmle AS-20010316.net_degs
        Using discrete fit
        2.06165 6 0.031626
        $

where `2.06165` is the estimated value of the exponent `gamma`, `6` is
the minimum degree value for which the power-law behaviour holds, and
`0.031626` is the value of the Kolmogorov-Smirnov statistic of the
best fit. The program also reports that it used the discrete fitting
procedure, since all the values in `AS-20010316.net_degs` are
integers. This message is printed to STDERR.

It is possible to compute the p-value of the estimate by running:

        $ fitmle AS-20010316.net_degs 0.1 TEST
        Using discrete fit
        2.06165 6 0.031626 0.17
        $

which provides a p-value equal to 0.17, meaning that 17% of the
synthetic samples showed a value of the KS statistic larger than that
of the best fit. The estimation of the p-value here is based on 100
synthetic samples, since <num_test> was not provided. If we allow a
slightly larger value of the statistical error on the estimate of the
exponent `gamma`, we obtain different values of `gamma` and `k_min`,
and a much higher p-value:

        $ fitmle AS-20010316.net_degs 0.15 TEST 1000
        Using discrete fit
        2.0585 19 0.0253754 0.924
        $

Notice that in this case, the p-value of the estimate is equal to
0.924, and is based on 1000 synthetic samples.
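
The same invocations can be driven from a script. The Python sketch
below builds the argument list following the SYNOPSIS and feeds the
values through STDIN using `-`. It assumes that `fitmle` is on the
`PATH` and that the input holds one value per line (the input layout is
an assumption, not stated on this page). The names `fitmle_command` and
`run_fitmle` are hypothetical.

```python
import subprocess


def fitmle_command(data_in, tol=None, test=False, num_test=None):
    """Build the argument list: fitmle <data_in> [<tol> [TEST [<num_test>]]]."""
    cmd = ["fitmle", data_in]
    # TEST is positional, so <tol> must be given explicitly before it;
    # fall back to the documented default of 0.1.
    if test and tol is None:
        tol = 0.1
    if tol is not None:
        cmd.append(str(tol))
    if test:
        cmd.append("TEST")
        if num_test is not None:
            cmd.append(str(num_test))
    return cmd


def run_fitmle(values, tol=None, test=False, num_test=None):
    """Run fitmle on an in-memory sequence and return its result line."""
    cmd = fitmle_command("-", tol, test, num_test)
    # Assumption: one value per line on STDIN.
    data = "\n".join(str(v) for v in values) + "\n"
    proc = subprocess.run(cmd, input=data, capture_output=True,
                          text=True, check=True)
    # The "Using discrete fit" notice goes to STDERR, so STDOUT holds
    # only the result line.
    return proc.stdout.strip()
```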

## SEE ALSO

deg_seq(1), power_law(1)


## REFERENCES

* A\. Clauset, C. R. Shalizi, and M. E. J. Newman. "Power-law
  distributions in empirical data". SIAM Rev. 51 (2009), 661-703.

* V\. Latora, V. Nicosia, and G. Russo. "Complex Networks: Principles,
  Methods and Applications", Chapter 5. Cambridge University Press
  (2017).


## AUTHORS

(c) Vincenzo 'KatolaZ' Nicosia 2009-2017 `<v.nicosia@qmul.ac.uk>`.