Sample datasets

Overview

This vignette describes several scenarios for creating sample datasets tailored to the user's needs, for each of the supported models. The functions in the package PINstimation use two types of datasets: (1) a sequence of daily buys and sells, and (2) high-frequency trading data. This is why only two sample datasets are preloaded with the package, namely dailytrades and hfdata. Users can also generate simulated data using the function generatedata_mpin() for the PIN and MPIN models, and the function generatedata_adjpin() for the ADJPIN model. Below we provide some scenarios for creating sample datasets for the PIN, MPIN, and ADJPIN models.

Sample datasets for the PIN model

The PIN model is a multilayer PIN model with a single information layer. We can, therefore, use the function generatedata_mpin() with the argument layers set to 1 to generate sample data for the PIN model. Generically, this is done as follows:

generatedata_mpin(..., layers=1)

If the user would like to create a sample dataset for an infrequently traded stock, they can specify low values or ranges for the trade intensity rates. For instance, suppose the user suspects that an infrequently traded stock has average uninformed trading intensities, for both buys and sells, between 300 and 500. They can generate a single sample dataset for this scenario as follows:

pindata <- generatedata_mpin(layers=1, ranges = list(eps.b=c(300, 500), eps.s=c(300,500)), verbose = FALSE)

The details of the generated sample dataset can be displayed with the following code:

show(pindata)

## ----------------------------------
## Data series successfully generated
## ----------------------------------
## Simulation model     : MPIN model
## Number of layers : 1 layer(s)
## Number of trading days   : 60 days
## ----------------------------------
## Type object@data to get the simulated data
## 
##  Data simulation  
## 
## ===========  ==============  ============  =============
## Variables    Theoretical.    Empirical.    Aggregates.  
## ===========  ==============  ============  =============
## alpha        0.441347        0.383333      0.383333     
## delta        0.545978        0.521739      0.521739     
## mu           242             245.69        245.69       
## eps.b        428             422.16        422.16       
## eps.s        340             340.35        340.35       
## ----                                                    
## Likelihood   -               (589.731)     (589.731)    
## mpin         -               0.109936      0.109936     
## ===========  ==============  ============  =============
## 
## -------
## Running time: 0.009 seconds
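
As a cross-check on the summary table, the value in the mpin row can be recomputed by hand from the empirical parameters. The sketch below is base R only and assumes the standard PIN formula, expected informed trades divided by expected total trades; the variable names are illustrative and are not part of the package.

```r
# Empirical parameters copied from the summary table above
alpha <- 0.383333   # probability of an information event
mu    <- 245.69     # informed trading intensity
eps_b <- 422.16     # uninformed buy intensity
eps_s <- 340.35     # uninformed sell intensity

# Assumed standard PIN formula: informed order flow over total order flow
informed <- alpha * mu
pin <- informed / (informed + eps_b + eps_s)
print(round(pin, 6))
```

The result should agree with the reported value 0.109936 up to the rounding of the inputs.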

You can access the sequences of buys and sells through the slot @data of the object pindata.

show(pindata@data[1:10, ])

##      b   s
## 1  677 349
## 2  394 352
## 3  663 349
## 4  433 322
## 5  454 354
## 6  431 555
## 7  410 295
## 8  638 329
## 9  661 361
## 10 691 331
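
For intuition about what such a sequence represents, here is a minimal base R sketch of the data-generating process usually assumed by the PIN model: each day an information event occurs with probability alpha, it is bad news with probability delta, and buys and sells are Poisson counts whose rates are raised by mu on good-news and bad-news days respectively. The helper simulate_pin_days() is hypothetical and is not how generatedata_mpin() is implemented; the parameter values are the theoretical ones from the table above.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Hypothetical helper illustrating the assumed PIN data-generating process;
# it is NOT part of PINstimation.
simulate_pin_days <- function(days, alpha, delta, mu, eps_b, eps_s) {
  info_event <- runif(days) < alpha                 # information event today?
  bad_news   <- info_event & (runif(days) < delta)  # event is bad news
  good_news  <- info_event & !bad_news              # event is good news
  data.frame(
    b = rpois(days, eps_b + mu * good_news),  # informed buying on good news
    s = rpois(days, eps_s + mu * bad_news)    # informed selling on bad news
  )
}

sim <- simulate_pin_days(days = 60, alpha = 0.441347, delta = 0.545978,
                         mu = 242, eps_b = 428, eps_s = 340)
head(sim, 10)
```

Days with buys well above eps_b correspond to good-news days in this sketch, and days with sells well above eps_s to bad-news days, which is the same pattern visible in the generated data shown above.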

You can now use the dataset object pindata to check the accuracy of the different estimation functions by comparing the actual parameters of the sample dataset to the parameters returned by the estimation functions. Let us start by displaying the actual parameters of the sample dataset. These can be accessed through the slot @empiricals of the dataset object, which stores the empirical parameters computed from the generated sequences of buys and sells. Please refer to the documentation of generatedata_mpin() for more information.

actual <- unlist(pindata@empiricals)
show(actual)

##       alpha       delta          mu       eps.b       eps.s 
##   0.3833333   0.5217391 245.6936557 422.1632653 340.3541667

Estimate the PIN model using the function pin_ea(), and display the estimated parameters.

model <- pin_ea(data=pindata@data, verbose = FALSE)
estimates <- model@parameters 
show(estimates)

##       alpha       delta          mu       eps.b       eps.s 
##   0.3833334   0.5217368 245.2719559 423.0591290 339.6197723

Now calculate the absolute errors of the estimation method.

errors <- abs(actual - estimates)
show(errors)

##        alpha        delta           mu        eps.b        eps.s 
## 6.535650e-08 2.280731e-06 4.216998e-01 8.958637e-01 7.343943e-01
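
Because the parameters live on very different scales (probabilities versus trade intensities), it can help to look at relative errors alongside the absolute ones. The following base R sketch rebuilds the two named vectors from the values printed above, so it runs without the package; the relative-error step is an addition for illustration, not something the package computes.

```r
# Values copied from the printed output above
actual <- c(alpha = 0.3833333, delta = 0.5217391, mu = 245.6936557,
            eps.b = 422.1632653, eps.s = 340.3541667)
estimates <- c(alpha = 0.3833334, delta = 0.5217368, mu = 245.2719559,
               eps.b = 423.0591290, eps.s = 339.6197723)

# Named vectors are matched element by element
abs_errors <- abs(actual - estimates)

# Relative errors are unitless, hence comparable across parameters
rel_errors <- abs_errors / abs(actual)
print(signif(rel_errors, 3))
```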

Sample datasets for the MPIN model

In contrast to the PIN model, the MPIN model does not fix the number of information layers. We can, therefore, use the function generatedata_mpin() with the desired number of information layers to generate sample data for the MPIN model. If the argument layers is omitted, the default setting applies: the number of layers is randomly selected from the integers 1 to 5. Generically, this is done as follows:

generatedata_mpin(...)

If the user would like to create a sample dataset for a frequently traded stock with two information layers, they can set the argument layers to 2, and specify high values or ranges for the trade intensity rates. For instance, suppose the user suspects that a frequently traded stock has average uninformed trading intensities, for both buys and sells, between 12000 and 15000. They can generate a single sample dataset for this scenario as follows:

mpindata <- generatedata_mpin(layers=2, ranges = list(eps.b=c(12000, 15000), eps.s=c(12000,15000)), verbose = FALSE)

The details of the generated sample dataset can be displayed with the following code:

show(mpindata)

## ----------------------------------
## Data series successfully generated
## ----------------------------------
## Simulation model     : MPIN model
## Number of layers : 2 layer(s)
## Number of trading days   : 60 days
## ----------------------------------
## Type object@data to get the simulated data
## 
##  Data simulation  
## 
## ===========  ==================  ==================  =============
## Variables    Theoretical.        Empirical.          Aggregates.  
## ===========  ==================  ==================  =============
## alpha        0.234876, 0.315287  0.183333, 0.450000  0.633333     
## delta        0.586538, 0.016950  0.454545, 0.037037  0.157895     
## mu           1023, 2504          1043.98, 2502.22    2080.1       
## eps.b        14991               14978.75            14978.75     
## eps.s        13264               13255.54            13255.54     
## ----                                                              
## Likelihood   -                   (813.976)           (813.976)    
## mpin         -                   0.044579            0.044579     
## ===========  ==================  ==================  =============
## 
## -------
## Running time: 0.011 seconds
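
The Aggregates column and the mpin row of this table can be reproduced by hand from the per-layer empirical values. The base R sketch below assumes that the aggregate alpha is the sum of the layer probabilities, that the aggregate delta and mu are alpha-weighted averages, and that MPIN is the expected informed order flow over the expected total order flow. These rules are inferred from the numbers in the table rather than taken from the package source.

```r
# Per-layer empirical parameters copied from the summary table above
alpha <- c(0.183333, 0.450000)
delta <- c(0.454545, 0.037037)
mu    <- c(1043.98, 2502.22)
eps_b <- 14978.75
eps_s <- 13255.54

# Assumed aggregation rules (inferred from the table)
agg_alpha <- sum(alpha)                      # any-layer event probability
agg_delta <- sum(alpha * delta) / agg_alpha  # alpha-weighted bad-news probability
agg_mu    <- sum(alpha * mu) / agg_alpha     # alpha-weighted informed intensity

# MPIN: expected informed trades over expected total trades
informed <- sum(alpha * mu)
mpin <- informed / (informed + eps_b + eps_s)

print(round(c(alpha = agg_alpha, delta = agg_delta, mu = agg_mu, mpin = mpin), 6))
```

Up to rounding of the inputs, these should match the Aggregates column (0.633333, 0.157895, 2080.1) and the reported mpin value 0.044579.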

You can access the sequences of buys and sells through the slot @data of the object mpindata.

show(mpindata@data[1:10, ])

##        b     s
## 1  15034 13106
## 2  17532 13170
## 3  15946 13246
## 4  14905 13321
## 5  15047 13192
## 6  15176 13397
## 7  14916 13106
## 8  15057 13227
## 9  17238 13181
## 10 17575 13436

You can now use the dataset object mpindata to check the accuracy of the different estimation functions, namely mpin_ml() and mpin_ecm(), by comparing the empirical PIN value derived from the sample dataset to the PIN values returned by the estimation functions. Let us start by displaying the empirical PIN value obtained from the sample dataset. This value can be accessed through the slot @emp.pin of the dataset object, which stores the empirical PIN value computed from the generated sequences of buys and sells. Please refer to the documentation of generatedata_mpin() for more information.

actualmpin <- unlist(mpindata@emp.pin)
show(actualmpin)

##      MPIN 
## 0.0445794

Estimate the MPIN model using the functions mpin_ml() and mpin_ecm(), and display the estimated MPIN values.

model_ml <- mpin_ml(data=mpindata@data, verbose = FALSE)
model_ecm <- mpin_ecm(data=mpindata@data, verbose = FALSE)
mlmpin <- model_ml@mpin
ecmpin <- model_ecm@mpin
estimates <- setNames(c(mlmpin, ecmpin), c("ML", "ECM"))
show(estimates)

##         ML        ECM 
## 0.04466447 0.04465845

Now calculate the absolute errors of both estimation methods.

errors <- abs(actualmpin - estimates)
show(errors)

##           ML          ECM 
## 8.506868e-05 7.904695e-05

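The function generatedata_mpin() can also generate a data.series object that contains a collection of dataset objects. For instance, the user can generate a collection of three datasets, each with three information layers, and use it to check the accuracy of the MPIN estimation, both in detecting the number of layers and in estimating the MPIN value.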

size <- 3
collection <- generatedata_mpin(series = size, layers = 3, verbose = FALSE)
show(collection)

## ----------------------------------
## Simulated data successfully generated
## ----------------------------------
## Simulation model     : MPIN model
## Number of layers : 3 layer(s)
## Number of datasets   : 3 datasets
## Number of trading days   : 60 days
## ----------------------------------
## Type object@datasets to access the list of dataset objects
## 
##  Data simulation   
## 
## -------
## Running time: 0.022 seconds

accuracy <- devmpin <- 0
for (i in seq_len(size)) {
    sdata <- collection@datasets[[i]]
    model <- mpin_ml(sdata@data, xtraclusters = 3, verbose = FALSE)
    accuracy <- accuracy + (sdata@layers == model@layers)
    devmpin <- devmpin + abs(sdata@emp.pin - model@mpin)
}
cat('The accuracy of layer detection: ', paste0(accuracy*(100/size),"%.\n"), sep="")

## The accuracy of layer detection: 66.6666666666667%.

cat('The average error in MPIN estimates: ', devmpin/size, ".\n", sep="")

## The average error in MPIN estimates: 0.001239405.
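
The two summary statistics computed in the loop are a hit rate and a mean absolute deviation, so they can also be written in vectorized form once the per-dataset values are collected. The base R sketch below uses hypothetical values (they are not output of the package) chosen so that two of three layer counts are detected correctly, as in the run above; sprintf() is used to control the number of printed digits.

```r
# Hypothetical per-dataset values, for illustration only
true_layers     <- c(3, 3, 3)              # layers used in the simulation
detected_layers <- c(3, 2, 3)              # layers found by the estimator
emp_mpin        <- c(0.120, 0.095, 0.143)  # empirical MPIN of each dataset
est_mpin        <- c(0.121, 0.097, 0.142)  # estimated MPIN of each dataset

# Share of datasets whose number of layers is detected correctly
accuracy <- mean(detected_layers == true_layers)

# Mean absolute deviation between empirical and estimated MPIN
avg_dev <- mean(abs(emp_mpin - est_mpin))

cat(sprintf("The accuracy of layer detection: %.1f%%.\n", 100 * accuracy))
cat(sprintf("The average error in MPIN estimates: %.6f.\n", avg_dev))
```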

Sample datasets for the ADJPIN model

The AdjPIN model is an extension of the PIN model that includes the possibility of liquidity shocks. To obtain a sample dataset distributed according to the assumptions of the AdjPIN model, users can use the function generatedata_adjpin(). Generically, this is done as follows:

generatedata_adjpin(...)

If the user wants to create two sample datasets for a frequently traded stock, they can set the argument series to 2, and specify high values or ranges for the trade intensity rates. For instance, suppose the user suspects that a frequently traded stock has average uninformed trading intensities, for both buys and sells, between 10000 and 15000. They can generate the two sample datasets for this scenario as follows:

adjpindatasets <- generatedata_adjpin(series = 2, ranges = list(eps.b=c(10000, 15000),  eps.s=c(10000,15000)), verbose = FALSE)

The details of the generated sample data series can be displayed with the following code:

show(adjpindatasets)

## ----------------------------------
## Simulated data successfully generated
## ----------------------------------
## Simulation model     : AdjPIN model
## Model Restrictions   : Unrestricted model
## Number of datasets   : 2 datasets
## Number of trading days   : 60 days
## ----------------------------------
## Type object@datasets to access the list of dataset objects
## 
##  Data simulation   
## 
## -------
## Running time: 0.069 seconds

You can access the first dataset in adjpindatasets, and display its details, using this code:

adjpindata <- adjpindatasets@datasets[[1]]
show(adjpindata)

## ----------------------------------
## Data series successfully generated
## ----------------------------------
## Simulation model     : AdjPIN model
## Model Restrictions   : Unrestricted model
## Number of trading days   : 60 days
## ----------------------------------
## Type object@data to get the simulated data
## 
##  Data simulation  
## 
## ===========  ==============  ============
## Variables    Theoretical.    Empirical.  
## ===========  ==============  ============
## alpha        0.494418        0.383333    
## delta        0.408723        0.304348    
## theta        0.154106        0.189189    
## theta'       0.232658        0.217391    
## ----                                     
## eps.b        12871           12864.68    
## eps.s        14812           14816.58    
## mu.b         47341           47294.52    
## mu.s         51557           51307.28    
## d.b          48409           48593.52    
## d.s          48454           48353.23    
## ----                                     
## Likelihood                   (878.331)   
## adjPIN       0.343           0.283       
## PSOS         0.265           0.295       
## ===========  ==============  ============
## 
## -------
## Running time: 0.028 seconds
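
The adjPIN and PSOS rows of this table can be recomputed from the empirical parameters. The base R sketch below assumes the usual AdjPIN definitions: theta' is the shock probability on days with an information event and theta the shock probability on days without one, adjPIN is the share of expected informed trades, and PSOS is the share of expected shock-driven trades. The variable names are illustrative and are not part of the package.

```r
# Empirical parameters copied from the summary table above
alpha <- 0.383333; delta <- 0.304348
theta <- 0.189189; theta_prime <- 0.217391
eps_b <- 12864.68; eps_s <- 14816.58
mu_b  <- 47294.52; mu_s  <- 51307.28
d_b   <- 48593.52; d_s   <- 48353.23

# Expected informed order flow (sells on bad news, buys on good news)
informed <- alpha * (delta * mu_s + (1 - delta) * mu_b)

# Expected shock-driven order flow; assumes theta' applies on information
# days and theta on no-information days
shock <- (d_b + d_s) * (alpha * theta_prime + (1 - alpha) * theta)

total  <- informed + shock + eps_b + eps_s
adjpin <- informed / total
psos   <- shock / total
print(round(c(adjpin = adjpin, psos = psos), 4))
```

Up to rounding of the inputs, the results should agree with the empirical values 0.283 and 0.295 shown in the table.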

You can now use the dataset object adjpindata to check the accuracy of the different estimation methods, namely the ML and ECM algorithms, by comparing the empirical adjpin and psos values derived from the sample dataset to the adjpin and psos values returned by the estimation functions. Let us start by displaying the empirical adjpin and psos values obtained from the sample dataset. These values can be accessed through the slot @emp.pin of the dataset object, which stores the empirical adjpin and psos values computed from the generated sequences of buys and sells. Please refer to the documentation of generatedata_adjpin() for more information.

actualpins <- unlist(adjpindata@emp.pin)
show(actualpins)

##    adjpin      psos 
## 0.2832069 0.2952618

Estimate the AdjPIN model using adjpin(method="ML") and adjpin(method="ECM"), and display the estimated adjpin and psos values.

model_ml <- adjpin(data=adjpindata@data, method = "ML", verbose = FALSE)
model_ecm <- adjpin(data=adjpindata@data, method = "ECM", verbose = FALSE)
mlpins <- c(model_ml@adjpin, model_ml@psos)
ecmpins <- c(model_ecm@adjpin, model_ecm@psos)
estimates <- rbind(mlpins, ecmpins)
colnames(estimates) <- c("adjpin", "psos")
rownames(estimates) <- c("ML", "ECM")
show(estimates)

##        adjpin      psos
## ML  0.2835816 0.2950904
## ECM 0.2835771 0.2950969

Now calculate the absolute errors of both estimation methods.

errors <- abs(estimates - rbind(actualpins, actualpins))
show(errors)

##           adjpin         psos
## ML  0.0003747311 0.0001713981
## ECM 0.0003702344 0.0001649447
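
Stacking actualpins with rbind() works for two methods; with more rows, sweep() subtracts the same named vector from every row without repeating it by hand. The base R sketch below rebuilds the objects from the values printed above so that it runs without the package; the use of sweep() is a suggested alternative, not the vignette's own approach.

```r
# Values copied from the printed output above
estimates <- matrix(c(0.2835816, 0.2950904,
                      0.2835771, 0.2950969),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("ML", "ECM"), c("adjpin", "psos")))
actualpins <- c(adjpin = 0.2832069, psos = 0.2952618)

# sweep() subtracts actualpins from each row (MARGIN = 2 aligns by column)
errors <- abs(sweep(estimates, 2, actualpins))
print(errors)
```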

Getting help

If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub.