Commit 53e9214

Merge pull request #42 from axiomhq/loglogbeta
Make HyperLogLog order independent
2 parents af9851f + 4750bc2 commit 53e9214

12 files changed

+625
-847
lines changed

README.md

Lines changed: 27 additions & 27 deletions
@@ -1,43 +1,43 @@
1-
HyperLogLog - an algorithm for approximating the number of distinct elements
2-
---
1+
# HyperLogLog - an algorithm for approximating the number of distinct elements
32

43
[![GoDoc](https://godoc.org/github.com/axiomhq/hyperloglog?status.svg)](https://godoc.org/github.com/axiomhq/hyperloglog) [![Go Report Card](https://goreportcard.com/badge/github.com/axiomhq/hyperloglog)](https://goreportcard.com/report/github.com/axiomhq/hyperloglog) [![CircleCI](https://circleci.com/gh/axiomhq/hyperloglog/tree/master.svg?style=svg)](https://circleci.com/gh/axiomhq/hyperloglog/tree/master)
54

6-
An improved version of [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) for the count-distinct problem, approximating the number of distinct elements in a multiset **using 33-50% less space** than other usual HyperLogLog implementations.
5+
An improved version of [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) for the count-distinct problem, approximating the number of distinct elements in a multiset. This implementation offers enhanced performance, flexibility, and simplicity while maintaining accuracy.
76

8-
This work is based on ["Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen"](http://cse.seu.edu.cn/PersonalPage/csqjxiao/csqjxiao_files/papers/INFOCOM17.pdf).
7+
## Note on Implementation History
98

10-
## Implementation
9+
The initial version of this work (tagged as v0.1.0) was based on ["Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen"](http://cse.seu.edu.cn/PersonalPage/csqjxiao/csqjxiao_files/papers/INFOCOM17.pdf). However, the current implementation has evolved significantly from this original basis, notably moving away from the tailcut method.
1110

12-
The core differences between this and other implementations are:
13-
* **use metro hash** instead of xxhash
14-
* **sparse representation** for lower cardinalities (like HyperLogLog++)
15-
* **loglog-beta** for dynamic bias correction medium and high cardinalities.
16-
* **4-bit register** instead of 5 (HLL) and 6 (HLL++), but most implementations use 1-byte registers out of convenience
11+
## Current Implementation
1712

18-
In general it borrows a lot from [InfluxData's fork](https://github.com/influxdata/influxdb/tree/master/pkg/estimator/hll) of [Clark Duvall's HyperLogLog++ implementation](https://github.com/clarkduvall/hyperloglog), but uses **50% less space**.
13+
The current implementation is based on the LogLog-Beta algorithm, as described in:
1914

20-
## Results
21-
A direct comparison with the [HyperLogLog++ implementation used by InfluxDB](https://github.com/influxdata/influxdb/tree/master/pkg/estimator/hll) yielded the following results:
15+
["LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting"](https://arxiv.org/pdf/1612.02284) by Jason Qin, Denys Kim, and Yumei Tung (2016).
2216

23-
| Exact | Axiom (8.2 KB) | Influx (16.39 KB) |
24-
| --- | --- | --- |
25-
| 10 | 10 (0.0% off) | 10 (0.0% off) |
26-
| 50 | 50 (0.0% off) | 50 (0.0% off) |
27-
| 250 | 250 (0.0% off) | 250 (0.0% off) |
28-
| 1250 | 1249 (0.08% off) | 1249 (0.08% off) |
29-
| 6250 | 6250 (0.0% off) | 6250 (0.0% off) |
30-
| 31250 | **31008 (0.7744% off)** | 31565 (1.0080% off) |
31-
| 156250 | **156013 (0.1517% off)** | 156652 (0.2573% off) |
32-
| 781250 | **782364 (0.1426% off)** | 775988 (0.6735% off) |
33-
| 3906250 | 3869332 (0.9451% off) | **3889909 (0.4183% off)** |
34-
| 10000000 | **9952682 (0.4732% off)** |9889556 (1.1044% off) |
17+
Key features of the current implementation:
18+
* **Metro hash** used instead of xxhash
19+
* **Sparse representation** for lower cardinalities (like HyperLogLog++)
20+
* **LogLog-Beta** for dynamic bias correction across all cardinalities
21+
* **8-bit registers** for convenience and simplified implementation
22+
* **Order-independent insertions and merging** for consistent results regardless of data input order
23+
* **Removal of tailcut method** for a more straightforward approach
24+
* **Flexible precision** allowing for 2^4 to 2^18 registers
3525

26+
This implementation is now more straightforward, efficient, and flexible, while remaining backwards compatible with previous versions. It provides a balance between precision, memory usage, speed, and ease of use.
27+
28+
## Precision and Memory Usage
29+
30+
This implementation allows for creating HyperLogLog sketches with arbitrary precision between 2^4 and 2^18 registers. The memory usage scales with the number of registers:
31+
32+
* Minimum (2^4 registers): 16 bytes
33+
* Default (2^14 registers): 16 KB
34+
* Maximum (2^18 registers): 256 KB
35+
36+
Users can choose the precision that best fits their use case, balancing memory usage against estimation accuracy.
3637

3738
## Note
3839
A big thank you to Prof. Shigang Chen and his team at the University of Florida who are actively conducting research around "Big Network Data".
3940

40-
4141
## Contributing
4242

4343
Kindly check our [contributing guide](https://github.com/axiomhq/hyperloglog/blob/main/Contributing.md) for how to propose bug fixes and improvements and how to submit pull requests to the project.
@@ -48,4 +48,4 @@ Kindly check our [contributing guide](https://github.com/axiomhq/hyperloglog/blo
4848

4949
Distributed under MIT License (`The MIT License`).
5050

51-
See [LICENSE](LICENSE) for more information.
51+
See [LICENSE](LICENSE) for more information.
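
The README changes above mention 8-bit registers, a precision range of 2^4 to 2^18 registers, and order-independent merging. As a rough sketch only (not the library's actual code), the Go snippet below shows how the size of a dense sketch follows from the precision, and why an element-wise maximum merge, the standard HyperLogLog merge rule, gives the same registers whichever sketch is merged into which. `denseBytes` and `mergeMax` are hypothetical names introduced for illustration, and the sketch assumes a plain dense array of one-byte registers, ignoring the sparse representation.

```go
package main

import "fmt"

// denseBytes returns the size in bytes of a dense sketch that stores
// 2^p one-byte registers. Hypothetical helper, not part of the library.
func denseBytes(p uint8) (int, error) {
	if p < 4 || p > 18 {
		return 0, fmt.Errorf("precision %d outside [4, 18]", p)
	}
	return 1 << p, nil
}

// mergeMax folds src into dst by keeping the larger value of each
// register pair. max is commutative and associative, so the result
// does not depend on the order in which sketches are merged.
func mergeMax(dst, src []uint8) error {
	if len(dst) != len(src) {
		return fmt.Errorf("register count mismatch: %d vs %d", len(dst), len(src))
	}
	for i, v := range src {
		if v > dst[i] {
			dst[i] = v
		}
	}
	return nil
}

func main() {
	// Sizes for the minimum, default, and maximum precision in the README.
	for _, p := range []uint8{4, 14, 18} {
		n, err := denseBytes(p)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("p=%d: %d bytes\n", p, n)
	}

	// Merge a into b and b into a; both orders yield the same registers.
	a := []uint8{1, 5, 0, 2}
	b := []uint8{3, 2, 0, 7}
	ab := append([]uint8(nil), a...)
	_ = mergeMax(ab, b)
	ba := append([]uint8(nil), b...)
	_ = mergeMax(ba, a)
	fmt.Println(ab, ba)
}
```

With one byte per register, the three sizes work out to 16 bytes, 16 KB, and 256 KB, matching the figures listed in the README.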

beta.go

Lines changed: 273 additions & 0 deletions
@@ -0,0 +1,273 @@
1+
package hyperloglog
2+
3+
import (
4+
"fmt"
5+
"math"
6+
)
7+
8+
var betaMap = map[uint8]func(float64) float64{
9+
4: beta4,
10+
5: beta5,
11+
6: beta6,
12+
7: beta7,
13+
8: beta8,
14+
9: beta9,
15+
10: beta10,
16+
11: beta11,
17+
12: beta12,
18+
13: beta13,
19+
14: beta14,
20+
15: beta15,
21+
16: beta16,
22+
17: beta17,
23+
18: beta18,
24+
}
25+
26+
func beta(p uint8, ez float64) float64 {
27+
f, ok := betaMap[p]
28+
if !ok {
29+
panic(fmt.Sprintf("invalid precision %d", p))
30+
}
31+
return f(ez)
32+
}
33+
34+
/*
35+
p=4
36+
[-0.582581413904517,-1.935300357560050,11.079323758035073,-22.131357446444323,22.505391846630037,-12.000723834917984,3.220579408194167,-0.342225302271235]
37+
*/
38+
func beta4(ez float64) float64 {
39+
zl := math.Log(ez + 1)
40+
return -0.582581413904517*ez +
41+
-1.935300357560050*zl +
42+
11.079323758035073*math.Pow(zl, 2) +
43+
-22.131357446444323*math.Pow(zl, 3) +
44+
22.505391846630037*math.Pow(zl, 4) +
45+
-12.000723834917984*math.Pow(zl, 5) +
46+
3.220579408194167*math.Pow(zl, 6) +
47+
-0.342225302271235*math.Pow(zl, 7)
48+
}
49+
50+
/*
51+
p=5
52+
[-0.7518999460733967,-0.9590030077748760,5.5997371322141607,-8.2097636999765520,6.5091254894472037,-2.6830293734323729,0.5612891113138221,-0.0463331622196545]
53+
*/
54+
func beta5(ez float64) float64 {
55+
zl := math.Log(ez + 1)
56+
return -0.7518999460733967*ez +
57+
-0.9590030077748760*zl +
58+
5.5997371322141607*math.Pow(zl, 2) +
59+
-8.2097636999765520*math.Pow(zl, 3) +
60+
6.5091254894472037*math.Pow(zl, 4) +
61+
-2.6830293734323729*math.Pow(zl, 5) +
62+
0.5612891113138221*math.Pow(zl, 6) +
63+
-0.0463331622196545*math.Pow(zl, 7)
64+
}
65+
66+
/*
67+
p=6
68+
[29.8257900969619634,-31.3287083337725925,-10.5942523036582283,-11.5720125689099618,3.8188754373907492,-2.4160130328530811,0.4542208940970826,-0.0575155452020420]
69+
*/
70+
func beta6(ez float64) float64 {
71+
zl := math.Log(ez + 1)
72+
return 29.8257900969619634*ez +
73+
-31.3287083337725925*zl +
74+
-10.5942523036582283*math.Pow(zl, 2) +
75+
-11.5720125689099618*math.Pow(zl, 3) +
76+
3.8188754373907492*math.Pow(zl, 4) +
77+
-2.4160130328530811*math.Pow(zl, 5) +
78+
0.4542208940970826*math.Pow(zl, 6) +
79+
-0.0575155452020420*math.Pow(zl, 7)
80+
}
81+
82+
/*
83+
p=7
84+
[2.8102921290820060,-3.9780498518175995,1.3162680041351582,-3.9252486335805901,2.0080835753946471,-0.7527151937556955,0.1265569894242751,-0.0109946438726240]
85+
*/
86+
func beta7(ez float64) float64 {
87+
zl := math.Log(ez + 1)
88+
return 2.8102921290820060*ez +
89+
-3.9780498518175995*zl +
90+
1.3162680041351582*math.Pow(zl, 2) +
91+
-3.9252486335805901*math.Pow(zl, 3) +
92+
2.0080835753946471*math.Pow(zl, 4) +
93+
-0.7527151937556955*math.Pow(zl, 5) +
94+
0.1265569894242751*math.Pow(zl, 6) +
95+
-0.0109946438726240*math.Pow(zl, 7)
96+
}
97+
98+
/*
99+
p=8
100+
[1.00633544887550519,-2.00580666405112407,1.64369749366514117,-2.70560809940566172,1.39209980244222598,-0.46470374272183190,0.07384282377269775,-0.00578554885254223]
101+
*/
102+
func beta8(ez float64) float64 {
103+
zl := math.Log(ez + 1)
104+
return 1.00633544887550519*ez +
105+
-2.00580666405112407*zl +
106+
1.64369749366514117*math.Pow(zl, 2) +
107+
-2.70560809940566172*math.Pow(zl, 3) +
108+
1.39209980244222598*math.Pow(zl, 4) +
109+
-0.46470374272183190*math.Pow(zl, 5) +
110+
0.07384282377269775*math.Pow(zl, 6) +
111+
-0.00578554885254223*math.Pow(zl, 7)
112+
}
113+
114+
/*
115+
p=9
116+
[-0.09415657458167959,-0.78130975924550528,1.71514946750712460,-1.73711250406516338,0.86441508489048924,-0.23819027465047218,0.03343448400269076,-0.00207858528178157]
117+
*/
118+
func beta9(ez float64) float64 {
119+
zl := math.Log(ez + 1)
120+
return -0.09415657458167959*ez +
121+
-0.78130975924550528*zl +
122+
1.71514946750712460*math.Pow(zl, 2) +
123+
-1.73711250406516338*math.Pow(zl, 3) +
124+
0.86441508489048924*math.Pow(zl, 4) +
125+
-0.23819027465047218*math.Pow(zl, 5) +
126+
0.03343448400269076*math.Pow(zl, 6) +
127+
-0.00207858528178157*math.Pow(zl, 7)
128+
}
129+
130+
/*
131+
p=10
132+
[-0.25935400670790054,-0.52598301999805808,1.48933034925876839,-1.29642714084993571,0.62284756217221615,-0.15672326770251041,0.02054415903878563,-0.00112488483925502]
133+
*/
134+
func beta10(ez float64) float64 {
135+
zl := math.Log(ez + 1)
136+
return -0.25935400670790054*ez +
137+
-0.52598301999805808*zl +
138+
1.48933034925876839*math.Pow(zl, 2) +
139+
-1.29642714084993571*math.Pow(zl, 3) +
140+
0.62284756217221615*math.Pow(zl, 4) +
141+
-0.15672326770251041*math.Pow(zl, 5) +
142+
0.02054415903878563*math.Pow(zl, 6) +
143+
-0.00112488483925502*math.Pow(zl, 7)
144+
}
145+
146+
/*
147+
p=11
148+
[-4.32325553856025e-01,-1.08450736399632e-01,6.09156550741120e-01,-1.65687801845180e-02,-7.95829341087617e-02,4.71830602102918e-02,-7.81372902346934e-03,5.84268708489995e-04]
149+
*/
150+
func beta11(ez float64) float64 {
151+
zl := math.Log(ez + 1)
152+
return -0.432325553856025*ez +
153+
-0.108450736399632*zl +
154+
0.609156550741120*math.Pow(zl, 2) +
155+
-0.0165687801845180*math.Pow(zl, 3) +
156+
-0.0795829341087617*math.Pow(zl, 4) +
157+
0.0471830602102918*math.Pow(zl, 5) +
158+
-0.00781372902346934*math.Pow(zl, 6) +
159+
0.000584268708489995*math.Pow(zl, 7)
160+
}
161+
162+
/*
163+
p=12
164+
[-3.84979202588598e-01,1.83162233114364e-01,1.30396688841854e-01,7.04838927629266e-02,-8.95893971464453e-03,1.13010036741605e-02,-1.94285569591290e-03,2.25435774024964e-04]
165+
*/
166+
func beta12(ez float64) float64 {
167+
zl := math.Log(ez + 1)
168+
return -0.384979202588598*ez +
169+
0.183162233114364*zl +
170+
0.130396688841854*math.Pow(zl, 2) +
171+
0.0704838927629266*math.Pow(zl, 3) +
172+
-0.0089589397146453*math.Pow(zl, 4) +
173+
0.0113010036741605*math.Pow(zl, 5) +
174+
-0.00194285569591290*math.Pow(zl, 6) +
175+
0.000225435774024964*math.Pow(zl, 7)
176+
}
177+
178+
/*
179+
p=13
180+
[-0.41655270946462997,-0.22146677040685156,0.38862131236999947,0.45340979746062371,-0.36264738324476375,0.12304650053558529,-0.01701540384555510,0.00102750367080838]
181+
*/
182+
func beta13(ez float64) float64 {
183+
zl := math.Log(ez + 1)
184+
return -0.41655270946462997*ez +
185+
-0.22146677040685156*zl +
186+
0.38862131236999947*math.Pow(zl, 2) +
187+
0.45340979746062371*math.Pow(zl, 3) +
188+
-0.36264738324476375*math.Pow(zl, 4) +
189+
0.12304650053558529*math.Pow(zl, 5) +
190+
-0.01701540384555510*math.Pow(zl, 6) +
191+
0.00102750367080838*math.Pow(zl, 7)
192+
}
193+
194+
/*
195+
p=14
196+
[-3.71009760230692e-01,9.78811941207509e-03,1.85796293324165e-01,2.03015527328432e-01,-1.16710521803686e-01,4.31106699492820e-02,-5.99583540511831e-03,4.49704299509437e-04]
197+
*/
198+
199+
func beta14(ez float64) float64 {
200+
zl := math.Log(ez + 1)
201+
return -0.371009760230692*ez +
202+
0.00978811941207509*zl +
203+
0.185796293324165*math.Pow(zl, 2) +
204+
0.203015527328432*math.Pow(zl, 3) +
205+
-0.116710521803686*math.Pow(zl, 4) +
206+
0.0431106699492820*math.Pow(zl, 5) +
207+
-0.00599583540511831*math.Pow(zl, 6) +
208+
0.000449704299509437*math.Pow(zl, 7)
209+
}
210+
211+
/*
212+
p=15
213+
[-0.38215145543875273,-0.89069400536090837,0.37602335774678869,0.99335977440682377,-0.65577441638318956,0.18332342129703610,-0.02241529633062872,0.00121399789330194]
214+
*/
215+
func beta15(ez float64) float64 {
216+
zl := math.Log(ez + 1)
217+
return -0.38215145543875273*ez +
218+
-0.89069400536090837*zl +
219+
0.37602335774678869*math.Pow(zl, 2) +
220+
0.99335977440682377*math.Pow(zl, 3) +
221+
-0.65577441638318956*math.Pow(zl, 4) +
222+
0.18332342129703610*math.Pow(zl, 5) +
223+
-0.02241529633062872*math.Pow(zl, 6) +
224+
0.00121399789330194*math.Pow(zl, 7)
225+
}
226+
227+
/*
228+
p=16
229+
[-0.37331876643753059,-1.41704077448122989,0.40729184796612533,1.56152033906584164,-0.99242233534286128,0.26064681399483092,-0.03053811369682807,0.00155770210179105]
230+
*/
231+
func beta16(ez float64) float64 {
232+
zl := math.Log(ez + 1)
233+
return -0.37331876643753059*ez +
234+
-1.41704077448122989*zl +
235+
0.40729184796612533*math.Pow(zl, 2) +
236+
1.56152033906584164*math.Pow(zl, 3) +
237+
-0.99242233534286128*math.Pow(zl, 4) +
238+
0.26064681399483092*math.Pow(zl, 5) +
239+
-0.03053811369682807*math.Pow(zl, 6) +
240+
0.00155770210179105*math.Pow(zl, 7)
241+
}
242+
243+
/*
244+
p=17
245+
[-0.36775502299404605,0.53831422351377967,0.76970289278767923,0.55002583586450560,-0.74575588261146941,0.25711835785821952,-0.03437902606864149,0.00185949146371616]
246+
*/
247+
func beta17(ez float64) float64 {
248+
zl := math.Log(ez + 1)
249+
return -0.36775502299404605*ez +
250+
0.53831422351377967*zl +
251+
0.76970289278767923*math.Pow(zl, 2) +
252+
0.55002583586450560*math.Pow(zl, 3) +
253+
-0.74575588261146941*math.Pow(zl, 4) +
254+
0.25711835785821952*math.Pow(zl, 5) +
255+
-0.03437902606864149*math.Pow(zl, 6) +
256+
0.00185949146371616*math.Pow(zl, 7)
257+
}
258+
259+
/*
260+
p=18
261+
[-0.36479623325960542,0.99730412328635032,1.55354386230081221,1.25932677198028919,-1.53325948209110163,0.47801042200056593,-0.05951025172951174,0.00291076804642205]
262+
*/
263+
func beta18(ez float64) float64 {
264+
zl := math.Log(ez + 1)
265+
return -0.36479623325960542*ez +
266+
0.99730412328635032*zl +
267+
1.55354386230081221*math.Pow(zl, 2) +
268+
1.25932677198028919*math.Pow(zl, 3) +
269+
-1.53325948209110163*math.Pow(zl, 4) +
270+
0.47801042200056593*math.Pow(zl, 5) +
271+
-0.05951025172951174*math.Pow(zl, 6) +
272+
0.00291076804642205*math.Pow(zl, 7)
273+
}

demo/go.mod

Lines changed: 0 additions & 11 deletions
This file was deleted.

demo/go.sum

Lines changed: 0 additions & 12 deletions
This file was deleted.
