Accelerating Big Data Analysis—Data Cleansing & Standardization Strategy

March 05, 2021 by Dr. Fanqi Meng

Traditionally, data cleansing is defined as the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

In recent years, sensor networks have gained wide popularity in a variety of application scenarios, ranging from monitoring applications on manufacturing production lines to more sophisticated sensor deployments in research and development, such as ADAS and autonomous-driving tests in the automotive industry or wind tunnel turbine tests in the aerospace industry.

The metadata gathered during the generation of these massive sensor data sets plays an increasingly important role. It provides key attribute information so that a big data set can be strategically managed and prepared for analysis. Managed well, the big data set's value as a data asset increases for the entire engineering team, and the data can be leveraged for additional scenarios such as simulation.

Having a data cleansing and standardization strategy shared across the engineering organization can maximize the value of your data, saving time and improving accuracy.  

Metadata is data about data and can be regarded as the properties of the data. Once the data has been acquired from sensors, the associated metadata becomes equally important. In general, it is common to see these four types of metadata after the data acquisition process (a minimal sketch of them follows the list):

  1. Test-level static information – background information on the specific test, its overall structure, and the phenomena the sensor network is there to measure, which are typically defined before the beginning of the test.
  2. Sensor-level static information – background information such as the type and location of each sensor, which are typically defined before the beginning of the test.
  3. Sensor-level dynamic information – information on the status of each sensor during the test, e.g., whether it is energized or activated, its self-diagnostics status, etc.
  4. Dynamic data quality information – a measure of the quality of the data of a continuous variable.
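
To make these four categories concrete, here is a minimal Python sketch of what such metadata records might look like. The class and field names are hypothetical illustrations, not a Viviota or NI schema.

```python
from dataclasses import dataclass

@dataclass
class TestInfo:             # 1. test-level static information
    test_id: str
    test_article: str
    objective: str

@dataclass
class SensorInfo:           # 2. sensor-level static information
    sensor_id: str
    sensor_type: str        # e.g. "thermocouple", "accelerometer"
    location: str           # mounting position on the test article

@dataclass
class SensorStatus:         # 3. sensor-level dynamic information
    energized: bool
    activated: bool
    self_diagnostics: str   # e.g. "OK" or "DRIFT_WARNING"

@dataclass
class QualityFlag:          # 4. dynamic data quality information
    timestamp: float
    score: float            # e.g. 0.0 (bad) to 1.0 (good)
```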

The DataLook module of Viviota’s Time-to-Insight (TTI) software is based on NI’s DataFinder technology, an indexing service that parses any custom file format for descriptive information (metadata) and creates a database of that descriptive information from the target data files. This database is automatically updated as soon as a valid data file is created, deleted, or edited. Once the metadata is indexed with the help of DataPlugins, which map custom file formats onto the TDM model, a DataFinder search examines all of the metadata at the file, channel-group, and channel level based on user-specified search criteria.
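
The following Python sketch illustrates the indexing-service concept only; it is not the DataFinder API. It assumes a hypothetical `extract_metadata()` stand-in for a DataPlugin that yields (level, name, value) rows for each file.

```python
import os
import sqlite3

def extract_metadata(path):
    """Hypothetical stand-in for a DataPlugin that maps a custom file
    format onto the TDM model; returns (level, name, value) rows."""
    return [
        ("file", "Operator", "J. Smith"),
        ("group", "TestPhase", "WarmUp"),
        ("channel", "SensorType", "thermocouple"),
    ]

def build_index(root, db="index.sqlite"):
    """Walk a directory tree and index each file's metadata into a database."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS meta "
                "(path TEXT, level TEXT, name TEXT, value TEXT)")
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            con.executemany("INSERT INTO meta VALUES (?, ?, ?, ?)",
                            [(path,) + row for row in extract_metadata(path)])
    con.commit()
    return con

def search(con, level, name, value):
    """Return every indexed file whose metadata matches the criteria."""
    cur = con.execute("SELECT DISTINCT path FROM meta "
                      "WHERE level = ? AND name = ? AND value = ?",
                      (level, name, value))
    return [row[0] for row in cur]
```

A query such as `search(con, "channel", "SensorType", "thermocouple")` would then return every indexed file containing a matching channel, mirroring the idea of searching metadata rather than scanning the raw data files themselves.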

In order for the DataLook module to rapidly and efficiently find the needed data sets for analysis, the TTI workflow goes through the DataPrep module first, which is dedicated to data cleansing tasks. These tasks typically include the following (a brief sketch of each follows the list):

  1. Metadata standardization. Most of the test- and sensor-level static information is recorded as metadata by manual entry, which is prone to typos and to inconsistency in style and format. If a search is performed on this raw metadata, the result can be erroneous or incomplete. Metadata standardization improves search accuracy and completeness, and the DataPrep module provides both manual and dictionary-guided correction methods for the standardization process.
  2. Data filtering and correction. DataPrep takes advantage of the LabVIEW plugin architecture to allow rapid implementation and integration of clients’ application-specific data pre-processing methods.
  3. Metadata enrichment. DataPrep also allows data calculations based on customized formulas. The results can be either new derived data sets or additional metadata linked to the original data set (for example, the statistics of an analog data set). Another example of metadata enrichment is associating test setup pictures with the test data, which allows automated picture insertion into the analysis reports.
  4. Automation with Analysis Server. In addition to GUI-based interactive data cleansing, TTI also offers an Analysis Server-based configuration that automates the process. As new raw test files arrive in the designated directory, they can be cleansed automatically based on pre-defined analysis routines, ready for downstream searching and analysis.
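
For task 1, a dictionary-guided correction can be as simple as a synonym table that maps free-text entries onto canonical spellings. The table and metadata keys below are hypothetical, not DataPrep's actual configuration.

```python
CANONICAL = {
    "temp": "Temperature", "temperature": "Temperature",
    "accel": "Acceleration", "acceleration": "Acceleration",
    "press": "Pressure", "pressure": "Pressure",
}

def standardize(meta):
    """Map free-text metadata values onto canonical spellings."""
    out = {}
    for key, value in meta.items():
        canon = CANONICAL.get(str(value).strip().lower())
        out[key] = canon if canon is not None else value  # leave unknowns for manual review
    return out

# Three inconsistent manual entries collapse to one searchable value:
for raw in (" TEMP ", "temperature", "Temp"):
    print(standardize({"SensorType": raw}))   # {'SensorType': 'Temperature'}
```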
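For task 2, the plugin idea can be sketched as a registry of interchangeable pre-processing functions. DataPrep's real mechanism is the LabVIEW plugin architecture; this Python version only mirrors the pattern.

```python
FILTERS = {}

def register(name):
    """Decorator that registers a data pre-processing plugin by name."""
    def wrap(fn):
        FILTERS[name] = fn
        return fn
    return wrap

@register("despike")
def despike(samples, limit=100.0):
    # Hypothetical client-specific correction: drop obvious sensor spikes.
    return [s for s in samples if abs(s) <= limit]

def run(names, samples):
    """Apply the named plugins to the samples, in order."""
    for name in names:
        samples = FILTERS[name](samples)
    return samples

print(run(["despike"], [20.1, 999.0, 19.8]))   # -> [20.1, 19.8]
```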
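For task 3, formula-based enrichment might derive summary statistics from an analog channel and attach them as searchable metadata. The statistics and key names here are illustrative assumptions.

```python
import statistics

def enrich(channel_name, samples):
    """Derive summary statistics to attach as channel-level metadata."""
    return {
        f"{channel_name}.Mean":  statistics.fmean(samples),
        f"{channel_name}.Stdev": statistics.stdev(samples),
        f"{channel_name}.Min":   min(samples),
        f"{channel_name}.Max":   max(samples),
    }

print(enrich("Temperature", [20.1, 20.4, 19.8, 20.0]))
```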
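For task 4, the automation concept resembles a watch folder. TTI's Analysis Server uses its own pre-defined analysis routines; the polling loop below is only a stand-in for that idea, with a hypothetical drop directory and file extension.

```python
import time
from pathlib import Path

WATCH_DIR = Path("incoming")        # hypothetical drop directory for raw test files
WATCH_DIR.mkdir(exist_ok=True)
seen = set()

def cleanse(path):
    """Placeholder for the pre-defined routines: standardize, filter, enrich."""
    print(f"cleansing {path.name} ...")

while True:                         # poll for newly arrived raw test files
    for f in sorted(WATCH_DIR.glob("*.tdms")):
        if f not in seen:
            cleanse(f)
            seen.add(f)
    time.sleep(5)
```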

In general, the TTI software suite shortens time-consuming test data management tasks that once took days so they now happen in seconds, which improves efficiency and significantly reduces product time-to-market. Having ways to standardize data management increases the value of your data, as it can now be leveraged across teams and time.