A Tool to Detect Potential Data Leaks in Forecasting Competitions

tsdataleaks

Thiyanga S. Talagala

40th International Symposium on Forecasting

Forecasting competitions

  • Forecasting competitions have played a significant role in the advancement of forecasting practices.
  • Objectives

    • Examine the relative performance of forecasting methods under different circumstances.

    • Identify best practices and gain knowledge.

  • Organizing a competition:

    • Data collection

    • Data processing

    • Evaluate forecasts to rank the submissions.

What is data leakage?

Matching subsets

Matching block

Repeating patterns

Add a constant

Methodology

R package - tsdataleaks

devtools::install_github("thiyangt/tsdataleaks")
library(tsdataleaks)

Simulated dataset

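set.seed(2020)
x <- rnorm(12)  # block reused across series to plant leaks
y <- rnorm(13)  # block reused within series g
lst <- list(
  a = c(rnorm(11), x[1:6]),              # ends with x[1:6]
  b = rnorm(17),
  c = c(x, rnorm(7)),                    # starts with all of x
  d = rnorm(18),
  e = c(rnorm(12), x[1:7] + rep(5, 7)),  # ends with x[1:7] shifted by 5
  f = rnorm(20),
  g = c(y, rnorm(1), y[1:7]),            # ends with its own first 7 values
  h = c(rnorm(19)),
  i = c(x[1:10], rnorm(4), x)            # starts with x[1:10], ends with all of x
)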
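
As a quick sanity check, the leaks planted in the simulated list can be confirmed with base R alone. This is only a sketch: it rebuilds `lst` exactly as defined above, and `tail_h` is a hypothetical helper name, not part of tsdataleaks.

```r
set.seed(2020)
x <- rnorm(12)
y <- rnorm(13)
lst <- list(
  a = c(rnorm(11), x[1:6]),
  b = rnorm(17),
  c = c(x, rnorm(7)),
  d = rnorm(18),
  e = c(rnorm(12), x[1:7] + rep(5, 7)),
  f = rnorm(20),
  g = c(y, rnorm(1), y[1:7]),
  h = c(rnorm(19)),
  i = c(x[1:10], rnorm(4), x)
)

# Hypothetical helper: the last h observations of a series
tail_h <- function(s, h = 6) s[(length(s) - h + 1):length(s)]

# The last 6 values of a reappear at the start of c
a_in_c <- isTRUE(all.equal(tail_h(lst$a), lst$c[1:6]))

# The last 6 values of e are c[2:7] plus a constant of 5
e_in_c <- isTRUE(all.equal(tail_h(lst$e), lst$c[2:7] + 5))

# The last 6 values of g reappear earlier in g itself
g_in_g <- isTRUE(all.equal(tail_h(lst$g), lst$g[2:7]))

c(a_in_c = a_in_c, e_in_c = e_in_c, g_in_g = g_in_g)
```

All three checks should come back `TRUE`, which is the pattern `find_dataleaks` is expected to recover from this list.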

find_dataleaks

find_dataleaks(lst, h=6)
$a
   .id start end
3    c     1   6
5    e    13  18
9    i     1   6
10   i    15  20

$e
   .id start end
3    c     2   7
9    i     2   7
10   i    16  21

$g
  .id start end
7   g     2   7

$i
  .id start end
3   c     7  12
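
The slides do not spell out the matching rule in text, but the `cutoff` argument used later suggests a correlation threshold. Under that assumption, the search can be sketched in base R: take the last `h` observations of one series and slide them over another series, keeping the windows whose Pearson correlation reaches the cutoff. The function `find_matches` and its `tol` argument are hypothetical and are not the package's implementation.

```r
# Hypothetical helper: start positions of windows in `series` whose
# correlation with `target` is at least `cutoff` (up to a small tolerance).
find_matches <- function(target, series, cutoff = 1, tol = 1e-8) {
  h <- length(target)
  n <- length(series)
  if (n < h) return(integer(0))
  starts <- seq_len(n - h + 1)
  cors <- vapply(
    starts,
    function(s) cor(target, series[s:(s + h - 1)]),
    numeric(1)
  )
  # Constant windows give NA correlations and are skipped
  starts[!is.na(cors) & cors >= cutoff - tol]
}

# Toy data: `other` holds an exact copy of `target` at positions 3 to 8
# and a copy shifted by 5 at positions 10 to 15
target <- c(2, 5, 3, 8, 1, 7)
other <- c(10, 4, target, 9, target + 5)

matches <- find_matches(target, other)
matches
```

Correlation is unchanged when a constant is added, which is why the shifted copy is found as well. When a series is compared against itself, the window that coincides with its own tail would have to be excluded, as the `$g` entry in the output above suggests.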

viz_dataleaks

library(magrittr)  # provides %>% if it is not already attached
find_dataleaks(lst, h=6) %>%
  viz_dataleaks()

reason_dataleaks

f1 <- find_dataleaks(lst, h=6)
reason_dataleaks(lst, f1, h=6)
  series1 .id start end dist_mean dist_sd is.useful.leak       reason
1       a   c     1   6         0       0         useful  exact match
2       a   e    13  18        -5       0         useful add constant
3       a   i     1   6         0       0         useful  exact match
4       a   i    15  20         0       0         useful  exact match
5       e   c     2   7         5       0         useful add constant
6       e   i     2   7         5       0         useful add constant
7       e   i    16  21         5       0         useful add constant
8       g   g     2   7         0       0         useful  exact match
9       i   c     7  12         0       0         useful  exact match
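
The `dist_mean`, `dist_sd` and `reason` columns can be read as summaries of the element-wise difference between the tail of `series1` and the matched block. The sketch below reproduces that reading in base R. It is an assumption inferred from the output above, `classify_match` is a hypothetical name, and the `is.useful.leak` column is not covered.

```r
# Hypothetical helper: label a match from the mean and standard deviation
# of (tail of series1) minus (matched block).
classify_match <- function(tail_vals, block, tol = 1e-8) {
  d <- tail_vals - block
  m <- mean(d)
  s <- sd(d)
  if (s < tol && abs(m) < tol) {
    reason <- "exact match"   # identical values
  } else if (s < tol) {
    reason <- "add constant"  # constant offset between the two blocks
  } else {
    reason <- "Do not know"   # correlated, but not by a simple shift
  }
  data.frame(dist_mean = m, dist_sd = s, reason = reason)
}

x6 <- c(2, 5, 3, 8, 1, 7)

res <- rbind(
  classify_match(x6, x6),      # same block
  classify_match(x6, x6 + 5),  # block shifted up by 5
  classify_match(x6, 2 * x6)   # block rescaled
)
res
```

With this sign convention a block that sits 5 units above the tail gives `dist_mean = -5`, which agrees with row 2 of the output above (series `a` against `e`).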

Application

M1 Competition Yearly Series

library(Mcomp)
data("M1")
M1Y <- subset(M1, "yearly")
M1Y_x <- lapply(M1Y, function(temp){temp$x})
m1y_f1 <- find_dataleaks(M1Y_x, h=6, cutoff = 1)
m1y_f1
$YAF17
    .id start end
22 YAM6     9  14

$YAM6
     .id start end
16 YAF17    16  21

$YAM28
     .id start end
78 YAI21    16  21

$YAB3
    .id start end
18 YAM2    14  19

$YAB4
    .id start end
17 YAM1    15  20

$YAI21
     .id start end
43 YAM28    16  21

$YAG29
      .id start end
137 YAC15     6  11

Application: M1 competition yearly series

viz_dataleaks(m1y_f1)

YAM28 and YAI21

Application: M1 competition yearly series

reason_dataleaks(M1Y_x, m1y_f1, h=6)
  series1   .id start end dist_mean dist_sd is.useful.leak      reason
1   YAF17  YAM6     9  14       5.4     0.4     not useful Do not know
2    YAM6 YAF17    16  21      -5.4     0.4     not useful Do not know
3   YAM28 YAI21    16  21       0.0     0.0     not useful exact match
4    YAB3  YAM2    14  19       0.0     0.0         useful exact match
5    YAB4  YAM1    15  20       0.0     0.0         useful exact match
6   YAI21 YAM28    16  21       0.0     0.0     not useful exact match
7   YAG29 YAC15     6  11  -36815.7  6159.2         useful Do not know

Beware

M1 competition: YAB3

Time Series:
Start = 1969
End = 1987
Frequency = 1
[1] 5 8536 18971 32580 70608 86077 100525 121461 120460 157851
[11] 176392 219000 269540 367673 461099 484684 349699 499888 588478
Time Series:
Start = 1988
End = 1993
Frequency = 1
[1] 476087 504506 666222 706620 734453 910998

M1 competition: YAM2

Time Series:
Start = 1972
End = 1993
Frequency = 1
[1] 5 8536 18971 32850 70608 86077 100525 121461 120460 157851
[11] 176392 219000 269540 367673 461099 484684 349699 499888 588478 476008
[21] 504507 666224
Time Series:
Start = 1994
End = 1999
Frequency = 1
[1] 706620 734453 911262 1055800 1069070 1202480
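
The overlap between the two printouts can be checked directly. The sketch below hard-codes the values shown above and assumes that the first block printed for each series is its training part and the second block its held-out future values, following Mcomp's `x` / `xx` convention. The variable names are illustrative.

```r
# Values copied from the YAB3 and YAM2 printouts above
yab3_train <- c(5, 8536, 18971, 32580, 70608, 86077, 100525, 121461, 120460,
                157851, 176392, 219000, 269540, 367673, 461099, 484684,
                349699, 499888, 588478)
yab3_future <- c(476087, 504506, 666222, 706620, 734453, 910998)

yam2_train <- c(5, 8536, 18971, 32850, 70608, 86077, 100525, 121461, 120460,
                157851, 176392, 219000, 269540, 367673, 461099, 484684,
                349699, 499888, 588478, 476008, 504507, 666224)
yam2_future <- c(706620, 734453, 911262, 1055800, 1069070, 1202480)

# The last 6 training values of YAB3 equal YAM2[14:19], as find_dataleaks reported
tail_matches <- identical(tail(yab3_train, 6), yam2_train[14:19])

# The 6 values that follow the matched block in YAM2 (training, then future)
yam2_all <- c(yam2_train, yam2_future)
next_in_yam2 <- yam2_all[20:25]

# Compare them with the values competitors are asked to forecast for YAB3
comparison <- data.frame(
  yab3_future  = yab3_future,
  next_in_yam2 = next_in_yam2,
  diff         = yab3_future - next_in_yam2
)
max_rel_diff <- max(abs(comparison$diff) / comparison$yab3_future)
comparison
```

Under this reading, the first three of those following values sit inside YAM2's training part, so YAB3's first three future values are available, up to small differences, to anyone holding the training data.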

Discussion

  1. Organizers: avoid data leaks

  2. Competitors: detect data leaks

  3. Entire research community: forecast accuracy and evaluation

Future work: Use time series features to reduce the computational cost.

Thank you!

Slides available at:

https://thiyanga.netlify.app/talk/isf20-talk/

Email:

ttalagala@sjp.ac.lk

tsdataleaks

devtools::install_github("thiyangt/tsdataleaks")
library(tsdataleaks)

This work is licensed under a Creative Commons Attribution 4.0 International License.
