Quality Over Quantity

Big Data is considered the gold standard when it comes to AI. However, high-quality datasets lead to better insights than sheer amounts of data.

Artificial intelligence thrives on data, and neural networks in particular are insatiable. Feed them enough, and you eventually get the desired results. Or do you? Terms like »small data,« »little data,« or »smart data« are becoming more common. There may be no fixed definition behind them, but they share one goal: to lead machine learning models to useful insights even with small datasets and short training times. To make an AI system »intelligent,« what it needs is not necessarily abundant data but, above all, high-quality data. In simple terms: The most sophisticated algorithm is useless if data quality is poor.

The main reason for this shift in thinking is the rapidly growing energy consumption of AI models. While neural networks and deep learning are always computationally intensive, their power consumption depends heavily on the quantity and quality of data. New research approaches, including the so-called Green AI, are increasingly focused on methods to improve data quality and the balance between accuracy and (energy) efficiency of models. These developments not only benefit the environment but also companies that have limited training data available and still want to use it profitably.

© Fraunhofer IPK/Larissa Klassen
The EIBA assistance system for identifying used parts is multisensory: The sorting workstation has several cameras and a scale.

Navigating the Data Jungle

But what exactly characterizes high-quality training data? Data quality is influenced by numerous factors, but it depends crucially on the data collection method. To obtain high-quality data, an accurate capture process is essential. The next step is to find, within the acquired but still unsorted data, the subset that has the potential to become first-rate training data, meaning it contains precisely the information the AI needs for learning. The challenge is to filter these informative features and patterns out of a dataset in exactly the right proportion.

Used Part Identification by Means of AI

Researchers at Fraunhofer IPK have addressed this challenge in the EIBA project. Together with technology partners, they developed an AI-based assistance system that identifies used automotive components and assesses their condition, all without the need for QR codes or barcodes. The underlying need: Numerous used (industrial) parts end up in recycling yards each year. A more environmentally and economically sensible approach is remanufacturing, in which the worn-out component is restored to its original condition. However, this requires the product to be clearly identified, which is challenging when it is dirty, rusty, or overpainted. An additional difficulty is that many products differ only slightly from each other. The new assistance system makes the evaluation of used parts significantly easier.

Quick Start with Limited Data

The task of the Fraunhofer IPK team was to train neural networks and special machine vision algorithms to recognize used parts. In the data acquisition stage, the researchers chose a multimodal approach that intentionally draws on multiple data sources, because a single image is often insufficient for the AI to identify an object clearly. Humans face the same problem and solve it intuitively: We perceive the object, examine it from different angles, look for characteristic features, and incorporate additional information independent of color and shape. Inspired by this multisensory human perception, the solution developed at Fraunhofer IPK includes stereo cameras and a scale to capture weight and optical properties in 2D and 3D. In addition, business and delivery data from logistics and documentation processes are integrated.
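How weight and image information might complement each other can be illustrated with a small sketch. It is not the EIBA implementation: the function `rank_candidates`, the part IDs, the visual scores, and the weight tolerance are all hypothetical and only show the idea that a scale reading can settle cases in which two parts look almost identical.

```python
def rank_candidates(visual_scores, catalog_weights_kg, measured_kg, tolerance_kg=0.2):
    """Rank part candidates by visual score after a weight plausibility check.

    visual_scores:      {part_id: similarity score from an image model, 0..1}
    catalog_weights_kg: {part_id: expected weight in kg}
    measured_kg:        reading from the workstation scale
    tolerance_kg:       hypothetical allowance for dirt, rust, or missing pieces
    """
    plausible = {
        part_id: score
        for part_id, score in visual_scores.items()
        if abs(catalog_weights_kg[part_id] - measured_kg) <= tolerance_kg
    }
    # Best visual match first, restricted to candidates with a plausible weight.
    return sorted(plausible, key=plausible.get, reverse=True)


# Two look-alike parts: the image alone slightly favors A, the scale favors B.
scores = {"part_A": 0.48, "part_B": 0.47, "part_C": 0.05}
weights = {"part_A": 3.1, "part_B": 4.0, "part_C": 3.0}
ranking = rank_candidates(scores, weights, measured_kg=3.95)
```

In this toy case the image model cannot separate the two leading candidates, but only one of them matches the measured weight.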

Generating large datasets in advance, which would mean capturing all used parts optically, is time-consuming and costly for smaller companies. Data collection was therefore integrated into the ongoing operations of the application partner C-ECO, a service provider for the circular economy, using fixed cameras at workstations. About 200,000 image data points were collected this way in a first proof of concept. The AI had more than enough training data, but was it sufficient to achieve effective results? What the researchers had not anticipated was the often poor quality of the image data. In many shots, hands, coffee cups, or other utensils were in the picture, the part was cropped or shaded, or only the empty worktable was visible.

A significant portion of the data turned out to be not only unusable for training but potentially harmful to it, because the algorithm tried to learn to recognize objects in images even when these objects were partially hidden, merged with a messy background, or completely missing. This led to nonsensical correlations in the data, and at the same time, important classes or patterns could not be learned adequately.

There are many possible industrial use cases for AI-based image processing

First Step: Cleaning up the Data!

To overcome these new challenges, the researchers made a fundamental paradigm shift. They abandoned the principle of »more data yields better results« and replaced it with »meaningful data arrangement yields better results«. However, correcting each image pixel by pixel by hand would be an enormous effort. Therefore, the data experts at Fraunhofer IPK developed a method that uses AI and statistics to evaluate image quality. This allowed them to automatically pre-sort the flood of images by their suitability for training.
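The article does not describe the method in detail, so the following is only a minimal sketch of what a purely statistical pre-sort could look like. It assumes grayscale images given as nested lists of pixel values from 0 to 255; the function names and both thresholds are hypothetical and would have to be tuned on real data.

```python
from statistics import mean, pstdev


def quality_flags(image, min_brightness=40.0, min_contrast=15.0):
    """Return the reasons why a grayscale image looks unsuitable for training.

    `image` is a list of rows with pixel values 0..255.
    Both thresholds are illustrative assumptions.
    """
    pixels = [p for row in image for p in row]
    flags = []
    if mean(pixels) < min_brightness:
        flags.append("too dark")       # e.g. a heavily shaded part
    if pstdev(pixels) < min_contrast:
        flags.append("low contrast")   # e.g. only the empty worktable
    return flags


def presort(images):
    """Split {name: image} into usable names and rejected names with reasons."""
    usable, rejected = [], {}
    for name, image in images.items():
        flags = quality_flags(image)
        if flags:
            rejected[name] = flags
        else:
            usable.append(name)
    return usable, rejected


images = {
    "part": [[30, 200], [220, 40]],                # clear structure
    "empty": [[120] * 4 for _ in range(4)],        # uniform worktable
    "shaded": [[10, 12], [11, 9]],                 # dark and flat
}
usable, rejected = presort(images)
```

Simple statistics like these can only catch coarse defects such as darkness or an empty scene. Hands or coffee cups in the picture would need a learned model, which is where the AI part of such a method comes in.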

Specifically, the datasets were cleaned by removing incorrect, duplicate, unimportant, inaccurate, or incomplete values and were then brought into a statistically representative distribution. A dataset with high information diversity is created when all data classes are included and each class is represented as well as possible. The challenge lies in finding the right balance between data reduction and information gain: If too many data points are filtered out, the performance of the AI suffers.
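Two of these cleaning steps, removing exact duplicates and evening out the class distribution, can be sketched in a few lines. This is an illustration under simple assumptions, not the project's pipeline: samples are (label, raw bytes) pairs, duplicates are detected by hashing the bytes, and balancing simply caps each class at a hypothetical `per_class` limit.

```python
import hashlib
import random
from collections import Counter, defaultdict


def deduplicate(samples):
    """Drop samples whose raw bytes were already seen; keeps the first copy."""
    seen, unique = set(), []
    for label, data in samples:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((label, data))
    return unique


def balance(samples, per_class, seed=0):
    """Keep at most `per_class` randomly chosen samples of each label."""
    by_label = defaultdict(list)
    for label, data in samples:
        by_label[label].append((label, data))
    rng = random.Random(seed)          # fixed seed for reproducible subsets
    balanced = []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        balanced.extend(items[:per_class])
    return balanced


samples = [
    ("starter", b"img-a"), ("starter", b"img-a"),   # exact duplicate
    ("starter", b"img-b"), ("starter", b"img-c"),
    ("alternator", b"img-d"),
]
unique = deduplicate(samples)
balanced = balance(unique, per_class=2)
class_counts = Counter(label for label, _ in balanced)
```

Capping the large classes reduces data but cannot add information to the small ones, which is the trade-off between data reduction and information gain described above.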

AI Data Detectives: Finding Errors

Clustering is a method for organizing a large dataset into classes without prior knowledge of these classes. It is a form of unsupervised machine learning in which unlabeled data is grouped solely based on its »spatial similarity«: The assignment depends on how far a data point is from a so-called cluster center. Using this technique, the researchers managed to create groups with identical and similar visual patterns. This allowed them to identify »outliers« and data with redundant information content and remove them from the dataset.
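The distance-to-center idea can be sketched with k-means, one common center-based clustering method. The article does not name the algorithm actually used, so k-means, the two-dimensional toy feature vectors, and the distance threshold are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of numbers with deterministic initialization."""
    # Farthest-first initialization: start with the first point, then keep
    # adding the point that is farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iterations):
        # Assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers


def find_outliers(points, centers, max_distance):
    """Return points that lie farther than `max_distance` from every center."""
    return [
        p for p in points
        if min(math.dist(p, c) for c in centers) > max_distance
    ]


group_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
group_b = [(10, 10), (11, 10), (10, 11), (11, 11)]
points = group_a + group_b + [(5, 20)]      # last point fits neither group
centers = kmeans(points, k=2)
outliers = find_outliers(points, centers, max_distance=5.0)
```

In practice the feature vectors would come from the images themselves, for example from a neural network, and would have far more than two dimensions. The same distances can also reveal redundancy: points that lie almost on top of each other add little new information.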

Another applied technique, especially for data cleaning, is segmentation. To accurately identify image objects despite a complex background, certain features are extracted step by step from the data and continuously refined. Accordingly, the researchers initially identified visual differences in color, shape, and texture between the objects and the background. Then, they could separate all essential data points – objects or clear structures – from their (chaotic) surroundings.
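A strongly simplified sketch of the separation step is shown below. It uses brightness alone rather than color, shape, and texture, and it assumes that the background covers most of the image so that the median pixel value describes it. The function names and the tolerance are hypothetical.

```python
from statistics import median


def foreground_mask(image, tolerance=30):
    """Mark pixels that differ clearly from the estimated background.

    `image` is a list of rows with grayscale values 0..255. The background
    is estimated as the median value, which assumes it dominates the image.
    """
    background = median(p for row in image for p in row)
    return [[abs(p - background) > tolerance for p in row] for row in image]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the marked pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, marked in enumerate(row) if marked]
    if not rows:
        return None                     # nothing but background
    return min(rows), min(cols), max(rows), max(cols)


# Bright worktable (200) with a dark part (50) in rows 1-2, columns 2-3.
image = [
    [200, 200, 200, 200, 200],
    [200, 200, 50, 50, 200],
    [200, 200, 50, 50, 200],
    [200, 200, 200, 200, 200],
]
box = bounding_box(foreground_mask(image))
empty_table = [[200] * 5 for _ in range(4)]
empty_box = bounding_box(foreground_mask(empty_table))
```

An image for which no bounding box is found contains no object at all and can be dropped. For the others, the box can be used to crop the part out of its surroundings.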

Small Quantity – Big Impact

With the help of this automated data preprocessing method, the research team managed to isolate the most relevant image data and reduce the training dataset by 60 percent. This not only led to significantly more accurate predictions by the AI assistance system – over 98 percent of the used automotive parts were correctly identified in performance tests – but also reduced energy consumption. In the pre-sorted data, the algorithm can recognize patterns more quickly because it spends less time analyzing irrelevant information. This reduces both the training effort and the computing power required.

Last but not least, the focus is always on the people using the assistance system. The more precisely it works, the more motivated they are to feed it with new data. Through continuous digitalization and the simultaneous use and evaluation of data, a kind of AI life cycle is created: a cycle in which knowledge about each used part is constantly expanded, thereby continuously improving the AI application.

Project Success in Numbers

AI training data volume was reduced by 60 %.

The AI assistance system correctly identified over 98 % of all used parts.

For each correctly sorted and subsequently refurbished component, 8.8 kg of CO2 equivalents are saved.

Contact Press / Media

Oliver Heimann

Paul Koch