Realizing tracking of objects by making neural networks learn colorization of monochrome images

Tracking objects that appear in images is an important and fundamental problem for computer vision. However, teaching visually "tracking objects" and artificial intelligence (AI) visually is not practical, as it requires a large amount of labeled data sets for learning and so on. Under such circumstances, researcher group has announced a method which can accurately track objects in video even without labeled data set.

Google AI Blog: Self-Supervised Tracking via Video Colorization

Carl Vondrick et al. Developed a method of tracking objects using " monochrome image colorization ". In this method we use a convolution neural network to color monochrome images, but we restrict color copying from a single reference frame (the first frame of the image). Then, the network seems to automatically learn how to visually track the objects in the video with unsupervised learning . The important point is that you can track multiple objects without requiring a labeled dataset for learning in that "no training for tracking is necessary".

The following image shows that the neural network is tracking objects in monochrome video using this method in fact. The neural network recognizes colored objects in the image, and since one object is colored with one color, it is recognized that neural networks recognize objects of different colors as "different objects" . The neural network accurately recognizes the object even in a scene where two or more people are scrambled, such as a two-person organization who performs a pairing in a dojo.

Vondrick says "Because colors are consistent in time" about "why they thought about tracking objects using video colorization". This means that the color is considered to be based on the common fact that it does not change with the lapse of time, and "Consider a case where colors are excluded from the image and there are multiple objects of the same color By adding color-coding steps and color-coding, you can teach (to the neural network) "to track specific objects and areas", explaining the learning model.

Mr. Vondrick uses the Kinetics dataset when learning the neural network. We convert all but the first frame in the video to monochrome and train the convolution neural network to predict "the original color of the next frame". It seems that it was learning that it was expecting that the neural network can track the area in order to restore the original color accurately.

The "Reference Frame" on the left of the following is "the first frame in the image" which is a color reference, and this is not converted to monochrome. The left moving image is a monochrome converted image, and the coloring of the object is done with reference to the color of "the first frame in the image".

By raising this accuracy, it becomes as follows. The leftmost image below is the "first frame in the video" that was not monochrome converted, and the neural network colors the monochrome image referring to this frame. The middle image is monochrome converted input data. The image on the right is a monochrome image with AI colored. Looking at the image on the right, the state of the abdominal woman is very natural, it seems that she sees the image before it is converted to monochrome, but the color was attached to the neural created by the research group of Vondrick et al. It is a network.

Also, the neural network can track a pose by specifying a specific person as "keypoint" in the first frame. The image below shows that the neural network is tracking the posing of a human being reflected in a monochrome image. The red line indicates "pose recognized by the neural network".

The "tracking of objects using video colorization" developed by Google does not seem to be able to produce highly accurate results than models created using highly supervised learning, but based on optical flow It is possible to track objects more accurately than the latest method .

in Software, Posted by logu_ii