I am curious if someone could clarify what type of source data is used to implement the PBO algorithm: is the input matrix M purely the returns data from the N trials (the model parameter configurations tested IS), or does it also include the respective OOS performance of each configuration?
After first reading the MLDP paper, I had assumed it was the latter, that we also need to input related OOS returns data since we are e.g. comparing the optimal Sharpe IS to the median OOS value in each CSCV combination. Additionally, Figure 1 (below) from the paper shows the CSCV process with M partitioned into IS and OOS sections.
However, when attempting to implement the algorithm using this and also R libraries, I see that only a single matrix M of returns data is input.
Also note that in the paper, they never speak of IS/OOS explicitly in describing the construction of M:
"First, we form a matrix M by collecting the performance series from the N trials. In particular, each column n = 1, ..., N represents a vector of profits and losses over t = 1, ..., T observations associated with a particular model configuration tried by the researcher"
Am I missing something? Perhaps the CSCV process derives 'synthetic' OOS data from the IS returns by means of sampling under IID assumptions? Or is it that we do need to include both IS and OOS returns data, and are supposed to e.g. join matrices of IS and OOS data into a single symmetrical matrix/df?
The Probability of Backtest Overfitting (PBO) algorithm takes a single matrix M of returns data as input. M has T rows and N columns: each column is the performance (P&L or returns) series of one of the N model configurations tried, and each row is one of the T observations. You do not supply a separate matrix of out-of-sample (OOS) performance, and you do not label any rows as in-sample (IS) or OOS in advance.
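As a rough illustration of that layout, here is a minimal sketch of assembling M from per-configuration return series. The helper name `build_performance_matrix` and the dict-of-series input are assumptions made for this example; they are not part of this library or the R packages.

```python
import numpy as np

def build_performance_matrix(trial_returns):
    """Stack N equal-length return series into a T x N matrix M.

    trial_returns: dict mapping a configuration label to its return
    series (hypothetical input format chosen for this sketch).
    """
    names = list(trial_returns)
    lengths = {len(trial_returns[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all trials must cover the same T observations")
    # One column per configuration, one row per observation.
    M = np.column_stack(
        [np.asarray(trial_returns[name], dtype=float) for name in names]
    )
    return M, names

# Three configurations, four observations each (made-up numbers).
trials = {
    "lookback=10": [0.01, -0.02, 0.03, 0.00],
    "lookback=20": [0.00, 0.01, -0.01, 0.02],
    "lookback=50": [-0.01, 0.00, 0.01, 0.01],
}
M, names = build_performance_matrix(trials)
```

Here `M.shape` is `(T, N)`, and there is no IS/OOS column or flag anywhere in the input.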
The Combinatorially Symmetric Cross-Validation (CSCV) process used to estimate the PBO generates the IS/OOS splits itself. It partitions the rows of M into an even number S of disjoint blocks of equal size, then forms every combination of S/2 blocks. For each combination, the chosen blocks are joined to form the IS set and the remaining S/2 blocks form the OOS set, so every block serves as IS in some combinations and as OOS in others. No model is refit and no data is resampled: within each combination, a performance statistic (e.g. the Sharpe ratio) is computed for every column on the IS rows, the best IS configuration is selected, and its relative rank among all N configurations on the OOS rows is converted to a logit. The PBO is the fraction of combinations in which the IS-optimal configuration ranks at or below the OOS median, i.e. in which the logit is not positive.
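For concreteness, here is a minimal sketch of CSCV as the PBO paper describes it: split the rows of M into S blocks, use every choice of S/2 blocks as IS and the complement as OOS, and take the logit of the OOS relative rank of the IS-best column. The function names, the default `S=8`, the non-annualised Sharpe ratio as the metric, and counting a logit of exactly zero as overfit are all assumptions of this sketch, not this library's API.

```python
from itertools import combinations

import numpy as np

def sharpe(block):
    """Column-wise, non-annualised Sharpe ratio (assumes non-zero std)."""
    return block.mean(axis=0) / block.std(axis=0, ddof=1)

def pbo_cscv(M, S=8, metric=sharpe):
    """Estimate PBO from a single T x N performance matrix M."""
    M = np.asarray(M, dtype=float)
    if S % 2 != 0:
        raise ValueError("S must be even")
    T, N = M.shape
    rows_per_block = T // S
    if rows_per_block < 2:
        raise ValueError("need at least 2 rows per block")
    # Drop trailing rows so that all S blocks have equal size.
    M = M[: rows_per_block * S]
    blocks = np.split(np.arange(rows_per_block * S), S)

    logits = []
    for is_ids in combinations(range(S), S // 2):
        oos_ids = [s for s in range(S) if s not in is_ids]
        is_rows = np.concatenate([blocks[s] for s in is_ids])
        oos_rows = np.concatenate([blocks[s] for s in oos_ids])

        is_perf = metric(M[is_rows])    # one value per configuration
        oos_perf = metric(M[oos_rows])

        best = int(np.argmax(is_perf))  # IS-optimal configuration
        # OOS rank of that configuration: 1 = worst, N = best.
        rank = int(np.sum(oos_perf <= oos_perf[best]))
        omega = rank / (N + 1)          # relative rank in (0, 1)
        logits.append(np.log(omega / (1 - omega)))

    logits = np.array(logits)
    # Treating a logit of exactly 0 as overfit is a convention chosen here.
    return float(np.mean(logits <= 0)), logits
```

Calling `pbo, logits = pbo_cscv(M, S=8)` on a T x N returns matrix yields one logit per combination (70 of them for S = 8). Note that the only input is M: the IS and OOS rows are picked inside the loop.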
So, to answer your question, you only need to provide a single matrix M of returns data as input to the PBO algorithm. The CSCV process splits its rows into IS and OOS sets internally and uses those splits to calculate the PBO.