cleanTS() is the main function of the package; it creates a cleanTS object. It performs all the data cleaning tasks, such as converting the timestamps to the proper format, imputing missing values, and handling outliers. It is a wrapper function that calls the other internal functions to perform these tasks.
Usage
cleanTS(
  data,
  date_format,
  imp_methods = c("na_interpolation", "na_locf", "na_ma", "na_kalman"),
  time = NULL,
  value = NULL,
  replace_outliers = TRUE
)
Arguments
- data
A data frame containing the input data. By default, the first column is assumed to contain the timestamps and the second column the observations. If that is not the case, or if the data frame contains more than two columns, specify the names of the time and value columns using the time and value arguments.
- date_format
Format of the timestamps used in the data. It uses lubridate formats. More than one format can be supplied as a vector of strings.
- imp_methods
The imputation methods to be used.
- time
Optional. The name of the column in the provided data to be used as the time column.
- value
Optional. The name of the column in the provided data to be used as the value column.
- replace_outliers
Boolean. If TRUE, the outliers found are removed and imputed using the given imputation methods.
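As an illustration of supplying more than one format, the sketch below parses timestamps written in two different layouts. It assumes that date_format entries behave like the orders argument of lubridate::parse_date_time(), which this page does not state explicitly; the sample strings are made up for the example.

```r
library(lubridate)

# Hypothetical timestamps in two different layouts.
stamps <- c("01-2020", "2020-02-15")

# Assumption: date_format entries act like lubridate "orders".
# "my" = month-year, "ymd" = year-month-day.
parsed <- parse_date_time(stamps, orders = c("my", "ymd"))
print(parsed)
```

If the assumption holds, the same vector, c("my", "ymd"), could be passed as date_format for data that mixes these two layouts.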
Value
A cleanTS object which contains:
- Cleaned data
- Missing timestamps
- Duplicate timestamps
- Imputation errors
- Outliers
- Outlier imputation errors
Details
The first task is to check the input time series data for structural and data type-related errors. Since the functions need univariate time series data, the number of columns in the input data is checked. By default, the first column is taken to be the time column and the second column the observations. Alternatively, if the time and value arguments are given, those columns are used. The time column is converted to a POSIX object and the value column to a numeric type. The columns are also renamed to time and value, and the data is converted to a data.table object.
This data is then passed to other functions that check for missing and duplicate timestamps. If duplicate timestamps are found, the observation values are compared. If the observations are the same, only one copy of that observation is kept. If they differ, it is not possible to tell which one is correct, so the observation is set to NA.
The data is then passed to a function that finds and handles missing observations. The methods given in the imp_methods argument are compared and the best ones are selected. Values missing completely at random (MCAR) and values missing at random (MAR) are handled separately. After the best methods are found, imputation is performed using those methods. The user can also pass user-defined functions for comparison. A user-defined function should follow the same structure as the default functions: it should take a numeric vector containing missing values as input and return a numeric vector of the same length without missing values as output.
Once the missing values are handled, the data is checked for outliers. If the replace_outliers parameter is set to TRUE in the cleanTS() function, the outliers are replaced by NA and imputed using the same procedure as for missing values.
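The checks for duplicate and missing timestamps described above can be sketched in base R. This is only an illustration of the stated rules, not the package's internal code: the toy data, the helper name collapse_duplicates, and the assumption of a regular daily interval are all hypothetical.

```r
# Toy daily series (hypothetical data): 2020-01-02 and 2020-01-03 are
# duplicated, and 2020-01-04 is absent.
df <- data.frame(
  time = as.POSIXct(
    c("2020-01-01", "2020-01-02", "2020-01-02",
      "2020-01-03", "2020-01-03", "2020-01-05"),
    tz = "UTC"
  ),
  value = c(1, 2, 2, 3, 9, 5)
)

# Timestamps that occur more than once.
duplicate_ts <- unique(df$time[duplicated(df$time)])

# Identical values -> keep one copy; conflicting values -> NA.
collapse_duplicates <- function(v) {
  if (length(unique(v)) == 1) v[1] else NA_real_
}
deduped <- aggregate(df["value"], by = list(time = df$time),
                     FUN = collapse_duplicates)

# Assumption: a regular daily grid. Timestamps on the grid but not in
# the data are the missing ones.
grid <- seq(min(deduped$time), max(deduped$time), by = "day")
missing_ts <- grid[!grid %in% deduped$time]

print(deduped)
print(missing_ts)
```

Here the two agreeing observations for 2020-01-02 collapse to a single row, while the conflicting observations for 2020-01-03 become NA and are left for the imputation step.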
Finally, the function creates a cleanTS object, which contains the cleaned data, missing timestamps, duplicate timestamps, imputation methods, MCAR imputation error, MAR imputation error, and outliers. If the outliers are replaced, the imputation errors for those imputations are also included. The cleanTS object is returned by the function.
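A user-defined imputation function that satisfies the contract given in the Details (numeric vector with missing values in, numeric vector of the same length without missing values out) might look like the following minimal sketch. The name na_mean_fill and the choice of mean imputation are hypothetical and used only for illustration.

```r
# Hypothetical user-defined imputation method: replace every NA with the
# mean of the observed values.
na_mean_fill <- function(x) {
  stopifnot(is.numeric(x))
  if (all(is.na(x))) {
    stop("cannot impute: the vector has no observed values")
  }
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

x <- c(2, NA, 4, NA, 6)
filled <- na_mean_fill(x)
print(filled)
```

The output has the same length as the input and contains no NA, which is what the comparison of imputation methods requires of each candidate.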
Examples
if (FALSE) {
library(cleanTS)
# Convert sunspot.month to a data frame (requires the timetk package)
data <- timetk::tk_tbl(sunspot.month)
print(data)
# Randomly insert missing values to simulate missing value imputation
set.seed(10)
ind <- sample(nrow(data), 100)
data$value[ind] <- NA
# Perform cleaning
cts <- cleanTS(data, date_format = "my", time = "index", value = "value")
print(cts)
}