Navigating the Data Jungle
But what exactly characterizes high-quality training data? Data quality depends on many factors, but above all on how the data is collected: high-quality data requires an accurate capture process. The next step is to identify, within the acquired but still unsorted data, the subset that has the potential to become first-rate training data, meaning it contains precisely the information the AI needs for learning. The challenge is to extract these informative features and patterns from a data set in exactly the right proportion.
Used Part Identification by Means of AI
Researchers at Fraunhofer IPK have addressed this challenge in the EIBA project. Together with technology partners, they developed an AI-based assistance system that identifies used automotive components and assesses their condition, without the need for QR codes or barcodes. The motivation: numerous used (industrial) parts end up in recycling yards each year. A more environmentally and economically sensible approach is remanufacturing, in which the worn-out component is restored to its original condition. However, this requires the product to be clearly identified, which is difficult when it is dirty, rusty, or painted over. An additional difficulty is that many products differ only slightly from one another. The new assistance system makes the evaluation of used parts significantly easier.
Quick Start with Limited Data
The task of the Fraunhofer IPK team was to train neural networks and specialized machine vision algorithms to recognize used parts. In the data acquisition stage, the researchers chose a multimodal approach that deliberately draws on multiple data sources, because a single image is often insufficient for the AI to identify an object clearly. Humans, by comparison, perceive the object, examine it from different angles, look for characteristic features, and incorporate additional information that is independent of color and shape. Inspired by this multisensory human perception, the solution developed at Fraunhofer IPK includes stereo cameras and a scale to capture weight and optical properties in 2D and 3D. Business and delivery data from logistics and documentation processes are integrated as well.
Since it is time-consuming and costly for smaller companies to generate large data sets in advance (that is, to capture all used parts optically), data collection was integrated into the ongoing operations of the application partner C-ECO, a service provider for the circular economy, using fixed cameras at workstations. About 200,000 image data points were collected this way in a first proof of concept. In terms of sheer quantity, the AI had more than enough training data, but was the data good enough to achieve effective results? What the researchers had not anticipated was the often poor quality of the image data. Many shots showed hands, coffee cups, or other utensils; in others, the part was cropped or in shadow, or only the empty worktable was visible.
A significant portion of the data turned out to be not merely unusable for training but potentially harmful to it, because the algorithm tried to learn to recognize objects in images even when these objects were partially hidden, blended into a cluttered background, or missing entirely. This led to nonsensical correlations in the data, while important classes or patterns could not be learned adequately.
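To make the idea of screening out harmful images concrete, here is a minimal Python sketch. It is not the method used in the EIBA project, which the text does not describe; it only illustrates how simple rules could flag three of the problems mentioned above (empty worktable, cropped part, image in shadow). It assumes grayscale images given as nested lists of values from 0 to 255 and a worktable that is brighter than the part. The function name `screen_image` and all threshold values are hypothetical. Detecting hands or coffee cups would need a more capable approach than these rules.

```python
from dataclasses import dataclass
from typing import List, Optional

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


@dataclass(frozen=True)
class Thresholds:
    # All values are illustrative assumptions, not taken from the project.
    background_level: int = 200    # pixels at or above this count as worktable
    min_foreground: float = 0.05   # minimum share of pixels that belong to a part
    min_brightness: float = 60.0   # mean brightness below this counts as shaded


def screen_image(img: Image, t: Optional[Thresholds] = None) -> List[str]:
    """Return the reasons an image should be excluded; an empty list means keep it."""
    t = t or Thresholds()
    height, width = len(img), len(img[0])
    reasons: List[str] = []

    # Rule 1: an overall dark image is treated as shaded.
    total = sum(sum(row) for row in img)
    if total / (height * width) < t.min_brightness:
        reasons.append("shaded")

    # Foreground mask: everything darker than the worktable.
    mask = [[pixel < t.background_level for pixel in row] for row in img]
    share = sum(sum(row) for row in mask) / (height * width)

    if share < t.min_foreground:
        # Rule 2: almost no foreground means only the worktable is visible.
        reasons.append("empty")
    else:
        # Rule 3: foreground touching the image border suggests a cropped part.
        border = mask[0] + mask[-1] + [row[0] for row in mask] + [row[-1] for row in mask]
        if any(border):
            reasons.append("cropped")

    return reasons


def make_image(size: int, part_rows: range, part_cols: range, part_value: int = 100) -> Image:
    """Build a synthetic test image: bright worktable with a darker rectangular part."""
    return [
        [part_value if (r in part_rows and c in part_cols) else 255 for c in range(size)]
        for r in range(size)
    ]


if __name__ == "__main__":
    samples = {
        "centered part": make_image(8, range(2, 6), range(2, 6)),
        "cropped part": make_image(8, range(2, 6), range(0, 4)),
        "empty table": make_image(8, range(0), range(0)),
    }
    for name, image in samples.items():
        print(name, "->", screen_image(image) or "keep")
```

In practice, a filter like this would run before training, so that only images passing every rule enter the training set, and the thresholds would have to be tuned on a manually reviewed sample of the data.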