class: center, middle, inverse, title-slide

# A Tool to Detect Potential Data Leaks in Forecasting Competitions
## tsdataleaks
### Thiyanga S. Talagala
### 40th International Symposium on Forecasting
---
# Forecasting competitions

- Forecasting competitions have played a significant role in the advancement of forecasting practices.

- Objectives

    - Examine the relative performance of forecasting methods under different circumstances.

    - Identify best practices and gain knowledge.

- Organizing a competition:

    - Data collection

    - Data processing

    - Evaluate forecasts to rank the submissions.

---
# What is data leakage?

<img src="ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---
# What is data leakage?

<img src="ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
# What is data leakage?

<img src="ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
# What is data leakage?

<img src="ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
# What is data leakage?

<img src="ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---
# What is data leakage?

<img src="ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---
<!--Data leakages can be present in different formats-->

## Matching subsets

![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-7-1.png)<!-- -->![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-7-2.png)<!-- -->

---
## Matching block

![](img1.png)

---
## Matching block

![](img2.png)

---
## Repeating patterns

![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-11-1.png)<!-- -->![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-11-2.png)<!-- -->

---
### Add a constant

![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-12-1.png)<!-- -->![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-12-2.png)<!-- -->

---
background-image: url("m1.png")
background-size: contain

## Methodology

---
background-image: url("m2.png")
background-size: contain

## Methodology

---
background-image: url("m3.png")
background-size: contain

## Methodology

---
background-image: url("m4.png")
background-size: contain

## Methodology

---
background-image: url("m5.png")
background-size: contain

## Methodology

---
background-image: url("m6.png")
background-size: contain

## Methodology

---
background-image: url("m7.png")
background-size: contain

## Methodology

---
background-image: url("m8.png")
background-size: contain

## Methodology

---
background-image: url("m9.png")
background-size: contain

## Methodology

---
background-image: url("m10.png")
background-size: contain

## Methodology

---
background-image: url("m11.png")
background-size: contain

## Methodology

---
background-image: url(hexsticker.png)
background-size: 200px
background-position: 98% 50%

## R package - tsdataleaks

```r
devtools::install_github("thiyangt/tsdataleaks")
library(tsdataleaks)
```

--

### Simulated dataset

```r
set.seed(2020)
x <- rnorm(12)
y <- rnorm(13)
lst <- list(
  a = c(rnorm(11), x[1:6]),               # ends with x[1:6]
  b = rnorm(17),
  c = c(x,                                # starts with all of x
        rnorm(7)),
  d = rnorm(18),
  e = c(rnorm(12), x[1:7] + rep(5, 7)),   # ends with x[1:7] shifted by +5
  f = rnorm(20),
  g = c(y, rnorm(1), y[1:7]),             # ends with its own first 7 values
  h = c(rnorm(19)),
  i = c(x[1:10], rnorm(4), x))            # starts with x[1:10], ends with all of x
```

---
background-image: url("ts2.png")
background-size: contain

---
background-image: url("ts3.png")
background-size: contain

---
background-image: url("ts4.png")
background-size: contain

---
## find_dataleaks

.left-code[

```r
find_dataleaks(lst, h=6)
```

```
$a
   .id start end
3    c     1   6
5    e    13  18
9    i     1   6
10   i    15  20

$e
   .id start end
3    c     2   7
9    i     2   7
10   i    16  21

$g
  .id start end
7   g     2   7

$i
  .id start end
3   c     7  12
```
]

.right-plot[
![](ts4.png)
]

---
## viz_dataleaks

.pull-left[

```r
find_dataleaks(lst, h=6)
```

```
$a
   .id start end
3    c     1   6
5    e    13  18
9    i     1   6
10   i    15  20

$e
   .id start end
3    c     2   7
9    i     2   7
10   i    16  21

$g
  .id start end
7   g     2   7

$i
  .id start end
3   c     7  12
```
]

.pull-right[

```r
library(magrittr)  # provides %>%
find_dataleaks(lst, h=6) %>% viz_dataleaks()
```

![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]

---
## viz_dataleaks

.pull-left[

```r
find_dataleaks(lst, h=6)
```

```
$a
   .id start end
3    c     1   6
5    e    13  18
9    i     1   6
10   i    15  20

$e
   .id start end
3    c     2   7
9    i     2   7
10   i    16  21

$g
  .id start end
7   g     2   7

$i
  .id start end
3   c     7  12
```
]

.pull-right[

```r
find_dataleaks(lst, h=6) %>% viz_dataleaks()
```

![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

![](ts5.png)
]

---
## reason_dataleaks

```r
f1 <- find_dataleaks(lst, h=6)
reason_dataleaks(lst, f1, h=6)
```

```
  series1 .id start end dist_mean dist_sd is.useful.leak       reason
1       a   c     1   6         0       0         useful  exact match
2       a   e    13  18        -5       0         useful add constant
3       a   i     1   6         0       0         useful  exact match
4       a   i    15  20         0       0         useful  exact match
5       e   c     2   7         5       0         useful add constant
6       e   i     2   7         5       0         useful add constant
7       e   i    16  21         5       0         useful add constant
8       g   g     2   7         0       0         useful  exact match
9       i   c     7  12         0       0         useful  exact match
```

![](TSlist.png)

---
## Application: M1 competition yearly series

```r
library(Mcomp)
data("M1")
M1Y <- subset(M1, "yearly")
# Keep only the training part (x) of each yearly series
M1Y_x <- lapply(M1Y,
                function(temp){temp$x})
m1y_f1 <- find_dataleaks(M1Y_x, h=6, cutoff = 1)
m1y_f1
```

```
$YAF17
    .id start end
22 YAM6     9  14

$YAM6
     .id start end
16 YAF17    16  21

$YAM28
     .id start end
78 YAI21    16  21

$YAB3
    .id start end
18 YAM2    14  19

$YAB4
    .id start end
17 YAM1    15  20

$YAI21
     .id start end
43 YAM28    16  21

$YAG29
      .id start end
137 YAC15     6  11
```

---
## Application: M1 competition yearly series

```r
viz_dataleaks(m1y_f1)
```

![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-25-1.png)<!-- -->

---
## YAM28 and YAI21

.left-code[

```
$YAF17
    .id start end
22 YAM6     9  14

$YAM6
     .id start end
16 YAF17    16  21

$YAM28
     .id start end
78 YAI21    16  21

$YAB3
    .id start end
18 YAM2    14  19

$YAB4
    .id start end
17 YAM1    15  20

$YAI21
     .id start end
43 YAM28    16  21

$YAG29
      .id start end
137 YAC15     6  11
```
]

.pull-right[
![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-27-1.png)<!-- -->
]

---
## Application: M1 competition yearly series

```r
reason_dataleaks(M1Y_x, m1y_f1, h=6)
```

```
  series1   .id start end dist_mean dist_sd is.useful.leak      reason
1   YAF17  YAM6     9  14       5.4     0.4     not useful Do not know
2    YAM6 YAF17    16  21      -5.4     0.4     not useful Do not know
3   YAM28 YAI21    16  21       0.0     0.0     not useful exact match
4    YAB3  YAM2    14  19       0.0     0.0         useful exact match
5    YAB4  YAM1    15  20       0.0     0.0         useful exact match
6   YAI21 YAM28    16  21       0.0     0.0     not useful exact match
7   YAG29 YAC15     6  11  -36815.7  6159.2         useful Do not know
```

![](ts4.png)

---
## Application: M1 competition yearly series

```r
reason_dataleaks(M1Y_x, m1y_f1, h=6)
```

![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-31-1.png)<!-- -->

---

.pull-left[
![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-32-1.png)<!-- -->
]

.pull-right[
![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-33-1.png)<!-- -->
]

---
## Beware

**M1 competition: YAB3**

```
Time Series:
Start = 1969
End = 1987
Frequency = 1
 [1]      5   8536  18971  32580  70608  86077 100525 121461 120460 157851
[11] 176392 219000 269540 367673 461099 484684 349699 499888 588478
```

```
Time Series:
Start = 1988
End = 1993
Frequency = 1
[1] 476087 504506 666222 706620 734453 910998
```

**M1 competition: YAM2**

```
Time Series:
Start = 1972
End = 1993
Frequency = 1
 [1]      5   8536  18971  32850  70608  86077 100525 121461 120460 157851
[11] 176392 219000 269540 367673 461099 484684 349699 499888 588478 476008
[21] 504507 666224
```

```
Time Series:
Start = 1994
End = 1999
Frequency = 1
[1]  706620  734453  911262 1055800 1069070 1202480
```

---
background-image: url("beware.png")
background-size: contain

---
## Discussion

![](advantage.png)

1. Organizers: to avoid data leakages

2. Competitors: detect data leakages

3. Entire research community: forecast accuracy and evaluation

---
## Discussion

Future work: Use time series features to reduce the computational cost.

.pull-left[
![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-36-1.png)<!-- -->
]

.pull-right[
![](ISF_2020_Thiyanga_S_Talagala_files/figure-html/unnamed-chunk-37-1.png)<!-- -->
]

---
background-image: url(hexsticker.png)
background-size: 200px
background-position: 98% 6%

# Thank you!

### Slides available at: https://thiyanga.netlify.app/talk/isf20-talk/

### Email: ttalagala@sjp.ac.lk

### tsdataleaks

```r
devtools::install_github("thiyangt/tsdataleaks")
library(tsdataleaks)
```

![](https://i.creativecommons.org/l/by/4.0/88x31.png)

This work is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
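
---
## Sketch: matching the last h observations

A minimal base-R sketch of the idea behind the `find_dataleaks(lst, h=6)` output shown earlier. It assumes a match is judged by the Pearson correlation between the last `h` observations of one series and every length-`h` window of another; the `cutoff` argument and the "add constant" matches suggest this, but the slides do not state it in text. `match_tail()` is a hypothetical helper written for illustration and is not part of tsdataleaks.

```r
# Hypothetical helper (not a tsdataleaks function): slide the last h values
# of `target` over `candidate` and return the windows whose Pearson
# correlation with that tail reaches `cutoff`.
match_tail <- function(target, candidate, h, cutoff = 1) {
  tail_vals <- tail(target, h)
  n_windows <- length(candidate) - h + 1
  if (n_windows < 1) {
    # Candidate is shorter than h, so no window can be compared.
    return(data.frame(start = integer(0), end = integer(0)))
  }
  starts <- seq_len(n_windows)
  cors <- vapply(
    starts,
    function(s) cor(tail_vals, candidate[s:(s + h - 1)]),
    numeric(1)
  )
  # Small tolerance (an assumption) for floating-point error near the cutoff.
  hits <- starts[!is.na(cors) & cors >= cutoff - 1e-8]
  data.frame(start = hits, end = hits + h - 1)
}

set.seed(2020)
x <- rnorm(12)
a <- c(rnorm(11), x[1:6])      # ends with the first six values of x
e <- c(rnorm(12), x[1:6] + 5)  # same six values shifted by a constant

match_tail(a, x, h = 6)  # the window 1 to 6 of x is reported
match_tail(a, e, h = 6)  # the window 13 to 18 of e is reported
```

Correlation is unchanged by adding a constant, so the shifted copy in `e` is still found; telling an exact match from an "add constant" match then needs the differences between the two windows, as in the `dist_mean` and `dist_sd` columns of `reason_dataleaks`.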