Accelerating AI & ML Analysis with TTI’s Data Cleansing Technology

June 24, 2024 by Barry Hutt

The importance of data preparation has grown sharply with the rise of AI. Data comes from many sources and in many formats, including homegrown applications, SQL databases, files, sensors, video, and physics-driven analog data. Traditionally, data cleansing is defined as detecting and correcting (or removing) corrupt or inaccurate records from a dataset, table, or database. The challenge is to identify the incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replace, modify, or delete the dirty or coarse data.


In recent years, sensor networks have gained wide popularity in application scenarios ranging from monitoring manufacturing production lines to more sophisticated sensor deployments in research and development, such as autonomous driving in the automotive industry. The metadata gathered while these massive sensor data sets are generated plays an increasingly important role because it provides the key attributes and information needed to manage the data strategically and prepare it for analysis.

Metadata is data about data and can be regarded as the properties of the data. Once the data has been acquired, the associated metadata becomes equally important. In general, the following types of metadata are commonly available after the data acquisition process:

  1. Test-level static information – background information on the specific test, the overall structure, and the phenomena that the sensor network is there to measure, typically defined before the beginning of the test.
  2. Sensor-level static information – background information such as the type and location of each sensor, typically defined before the beginning of the test.
  3. Sensor-level dynamic information – information on the sensor's status during the test, such as whether it is energized or activated and its self-diagnostics status.
  4. Dynamic data quality information – a measure of the quality of the data of a continuous variable.
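
To make the four categories concrete, here is a minimal sketch of how they might be modeled as Python data classes. Every class name, field name, and sample value below is a hypothetical illustration chosen for this post; none of it is taken from the TTI or DataFinder data model.

```python
from dataclasses import dataclass, field


@dataclass
class StaticTestInfo:
    """Test-level static information, defined before the test begins."""
    test_name: str
    structure: str
    measured_phenomena: list


@dataclass
class SensorStaticInfo:
    """Sensor-level static information, defined before the test begins."""
    sensor_id: str
    sensor_type: str
    location: str


@dataclass
class SensorDynamicInfo:
    """Sensor status reported during the test."""
    sensor_id: str
    energized: bool
    self_diagnostics_ok: bool


@dataclass
class DataQualityInfo:
    """Quality measure for one continuous variable (hypothetical 0.0 to 1.0 scale)."""
    channel: str
    quality_score: float


@dataclass
class AcquisitionMetadata:
    """Groups the four metadata categories for one acquisition."""
    test: StaticTestInfo
    sensors: list = field(default_factory=list)
    status: list = field(default_factory=list)
    quality: list = field(default_factory=list)

    def active_sensor_ids(self):
        """IDs of sensors that are energized and pass self-diagnostics."""
        return [s.sensor_id for s in self.status
                if s.energized and s.self_diagnostics_ok]


# Hypothetical example values, for illustration only.
meta = AcquisitionMetadata(
    test=StaticTestInfo("brake-test", "front axle", ["vibration"]),
    sensors=[SensorStaticInfo("acc-1", "accelerometer", "left caliper"),
             SensorStaticInfo("tc-1", "thermocouple", "brake disc")],
    status=[SensorDynamicInfo("acc-1", True, True),
            SensorDynamicInfo("tc-1", True, False)],
    quality=[DataQualityInfo("accel_x", 0.97)],
)
print(meta.active_sensor_ids())
```

Keeping the static and dynamic categories in separate types mirrors the distinction above: the static records are written once before the test, while the status and quality records can be appended as the test runs.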

Viviota’s Time-to-Insight (TTI) software is based on NI’s DataFinder technology, an indexing service that parses custom file formats for descriptive information (metadata) and builds a database of that information from the target data files. This database is updated automatically whenever a valid data file is created, deleted, or edited. DataPlugins map custom file formats onto the TDM model so that their metadata can be indexed. Once the metadata is indexed, a DataFinder search examines the metadata at the file, channel group, and channel levels based on user-specified search criteria.

[Figure: Viviota ASAM diagram]
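
The three-level search idea can be sketched in plain Python. This is not the DataFinder API: the classes, the `search` function, the exact-match criteria, and the sample file names are all assumptions made for illustration, standing in for an index of file, channel group, and channel properties.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class ChannelGroup:
    name: str
    properties: dict = field(default_factory=dict)
    channels: list = field(default_factory=list)


@dataclass
class IndexedFile:
    path: str
    properties: dict = field(default_factory=dict)
    groups: list = field(default_factory=list)


def _matches(properties, criteria):
    """True if every criterion key is present with an equal value."""
    return all(properties.get(key) == value for key, value in criteria.items())


def search(index, file_criteria=None, group_criteria=None, channel_criteria=None):
    """Return paths of files that satisfy the criteria given at each level."""
    hits = []
    for indexed_file in index:
        if file_criteria and not _matches(indexed_file.properties, file_criteria):
            continue
        groups = indexed_file.groups
        if group_criteria:
            # Keep only the groups that match, and skip the file if none do.
            groups = [g for g in groups if _matches(g.properties, group_criteria)]
            if not groups:
                continue
        if channel_criteria and not any(
            _matches(c.properties, channel_criteria)
            for g in groups for c in g.channels
        ):
            continue
        hits.append(indexed_file.path)
    return hits


# Hypothetical index entries, for illustration only.
index = [
    IndexedFile("run_001.tdms", {"vehicle": "A"}, [
        ChannelGroup("chassis", {"rate_hz": 1000},
                     [Channel("accel_x", {"unit": "g"})]),
    ]),
    IndexedFile("run_002.tdms", {"vehicle": "B"}, [
        ChannelGroup("engine", {"rate_hz": 100},
                     [Channel("temp", {"unit": "degC"})]),
    ]),
]
print(search(index, channel_criteria={"unit": "g"}))
```

Because the search reads only the indexed properties, it never has to open the underlying data files, which is the reason an up-to-date metadata index makes lookups fast.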

For the TTI software to find the data sets needed for analysis rapidly and efficiently, the TTI workflow passes the data through a module dedicated to data cleansing tasks. The tasks typically include:

  1. Metadata standardization. Most of the test and sensor-level static information is recorded as metadata by manual entry, which is prone to typos and inconsistencies in style and format. A search performed on raw metadata can therefore return erroneous or incomplete results. The TTI data preparation module provides both manual and dictionary-guided correction methods for the standardization process.
  2. Data filtering and correction. TTI takes advantage of the LabVIEW plugin architecture to allow the rapid implementation and integration of clients’ application-specific data pre-processing methods.
  3. Metadata enrichment. TTI also allows data calculations based on customized formulas. The results can be either new derived data sets or additional metadata linked to the original data set (for example, the statistics of an analog data set). Another example of metadata enrichment is the association of test setup pictures with the test data, which allows the pictures to be inserted automatically into the analysis reports.
  4. Automation and report generation. In addition to GUI-based interactive data cleansing, TTI offers a Server that automates the process. As new raw test files arrive in the designated directory, they are cleansed automatically based on pre-defined analysis routines and are then ready for downstream searching and analysis.
  5. Automated event detection. Events detected automatically in the data become features for machine learning, which enable ML models to be used for predictive outcomes.
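
Three of the tasks above lend themselves to a short sketch: dictionary-guided standardization, statistics-based enrichment, and event detection. The code below is an illustration under stated assumptions, not TTI's implementation. The approved-term list, the fuzzy-match cutoff, and the choice of a simple threshold crossing as the "event" are all hypothetical; the fuzzy matching uses Python's standard `difflib` module.

```python
import difflib
import statistics

# Hypothetical dictionary of approved metadata values.
APPROVED_SENSOR_TYPES = ["accelerometer", "thermocouple", "strain gauge"]


def standardize(value, approved, cutoff=0.8):
    """Map a hand-typed value to its closest approved term, or None if none is close."""
    cleaned = " ".join(value.strip().lower().split())
    matches = difflib.get_close_matches(cleaned, approved, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def enrich_with_statistics(samples):
    """Derive summary statistics to attach to a data set as extra metadata."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
        "stdev": statistics.pstdev(samples),
    }


def detect_threshold_events(samples, threshold):
    """Return (start, end) index pairs for each run above threshold; end is exclusive."""
    events = []
    start = None
    for i, x in enumerate(samples):
        if x > threshold and start is None:
            start = i  # a run above the threshold begins here
        elif x <= threshold and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(samples)))  # run lasted to the end of the data
    return events


def event_features(events):
    """Summarize detected events as simple numeric features for an ML model."""
    return {
        "event_count": len(events),
        "samples_in_events": sum(end - begin for begin, end in events),
    }


signal = [0, 1, 5, 6, 1, 7]  # hypothetical analog samples
events = detect_threshold_events(signal, threshold=4)
print(standardize("  Acelerometer ", APPROVED_SENSOR_TYPES))
print(enrich_with_statistics(signal)["mean"])
print(events, event_features(events))
```

Returning `None` for values that match no approved term keeps the automatic step conservative: unmatched entries can be routed to manual correction instead of being silently rewritten.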

The TTI software suite shortens time-consuming test data management tasks from days to seconds, which improves efficiency and significantly reduces product time-to-market.