Clean univariate time-series data (cleanTS)

cleanTS() is the main function of the package and creates a cleanTS object. It performs all the data cleaning tasks, such as converting the timestamps to the proper format, imputing missing values, and handling outliers. It is a wrapper function that calls the other internal functions to perform the individual cleaning tasks.

Usage

cleanTS(
  data,
  date_format,
  imp_methods = c("na_interpolation", "na_locf", "na_ma", "na_kalman"),
  time = NULL,
  value = NULL,
  replace_outliers = TRUE
)

Arguments

data: A data frame containing the input data. By default, the first column is assumed to contain the timestamps and the second column the observations. If that is not the case, or if the data contains more than two columns, specify the names of the time and value columns using the time and value arguments.
date_format: Format of the timestamps used in the data. It uses lubridate formats. More than one format can be given as a vector of strings.
imp_methods: The imputation methods to be used.
time: Optional, the name of the column in the provided data to be used as the time column.
value: Optional, the name of the column in the provided data to be used as the value column.
replace_outliers: Boolean. If TRUE, the outliers found are replaced by NA and imputed using the given imputation methods.

Value

A cleanTS object which contains:

Cleaned data
Missing timestamps
Duplicate timestamps
Imputation errors
Outliers
Outlier imputation errors

Details

The first task is to check the input time series data for structural and data type-related errors. Since the functions need univariate time series data, the input data is checked for the number of columns. By default, the first column is considered to be the time column and the second column to be the observations. Alternatively, if the time and value arguments are given, those columns are used. The time column is converted to a POSIX object, and the value column is converted to a numeric type. The column names are also changed to time and value. All the data is converted to a data.table object.
This data is then passed to other functions that check for missing and duplicate timestamps. If duplicate timestamps are found, the observation values are compared. If the observations are the same, only one copy of that observation is kept. If the observations differ, it is not possible to determine the correct one, so the observation is set to NA.
The data is then passed to a function that finds and handles missing observations. The methods given in the imp_methods argument are compared, and the best ones are selected. The MCAR and MAR values are handled separately. After the best methods are found, imputation is performed using those methods. The user can also pass user-defined functions for comparison. A user-defined function should follow the same structure as the default functions: it should take a numeric vector containing missing values as input and return a numeric vector of the same length without missing values as output.
Once the missing values are handled, the data is checked for outliers. If the replace_outliers parameter is set to TRUE in the cleanTS() function, the outliers are replaced by NA and imputed using the procedure described for imputing missing values.
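
The duplicate-timestamp rule described above can be illustrated with a small base R sketch. This is not the package's internal code: the helper name resolve_duplicates() and the toy data are hypothetical and serve only to show the rule (identical copies collapse to one row, conflicting copies become NA).

```r
# Toy input with two duplicated timestamps (hypothetical data).
dat <- data.frame(
  time  = as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02",
                       "2020-01-03", "2020-01-03"), tz = "UTC"),
  value = c(1, 2, 2, 3, 9)
)

# Hypothetical helper, not part of cleanTS: keep one row per timestamp.
resolve_duplicates <- function(df) {
  times <- sort(unique(df$time))
  values <- vapply(seq_along(times), function(i) {
    v <- df$value[df$time == times[i]]
    # Copies that agree keep their value; conflicting copies become NA.
    if (length(unique(v)) == 1) v[1] else NA_real_
  }, numeric(1))
  data.frame(time = times, value = values)
}

out <- resolve_duplicates(dat)
print(out)
```

In this toy data, the two identical readings for 2020-01-02 collapse into a single row, while the conflicting readings for 2020-01-03 are set to NA and would be left for the imputation step.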
Finally, the function creates a cleanTS object which contains the cleaned data, missing timestamps, duplicate timestamps, imputation methods, MCAR imputation error, MAR imputation error, and outliers. If the outliers are replaced, the imputation errors for those imputations are also included. The cleanTS object is returned by the function.
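
As a minimal sketch of the contract that a user-defined imputation function has to satisfy (numeric vector with missing values in, numeric vector of the same length without missing values out), the example below fills gaps with the median of the observed values. The function name na_median_fill() is hypothetical, and the sketch deliberately does not show how such a function is handed to cleanTS(), since that is not specified in this section.

```r
# Hypothetical user-defined imputation function (not part of cleanTS).
# Input:  numeric vector that may contain NA.
# Output: numeric vector of the same length with no NA,
#         provided at least one value is observed.
na_median_fill <- function(x) {
  stopifnot(is.numeric(x))
  fill <- median(x, na.rm = TRUE)
  x[is.na(x)] <- fill
  x
}

x <- c(1, NA, 3, NA, 10)
filled <- na_median_fill(x)
print(filled)
```

A median fill ignores the time ordering of the series, so it is only meant to illustrate the input and output shape, not to compete with the default methods.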

Examples

if (FALSE) {
  # Convert sunspot.month to a data frame
  data <- timetk::tk_tbl(sunspot.month)
  print(data)

  # Randomly insert missing values to simulate missing value imputation
  set.seed(10)
  ind <- sample(nrow(data), 100)
  data$value[ind] <- NA

  # Perform cleaning
  cts <- cleanTS(data, date_format = "my", time = "index", value = "value")
  print(cts)
}