'PhotoDNA' that detects child pornography can reconstruct images from hashes, contrary to Microsoft's claim



Many companies, including Google, Facebook, and Twitter, use

PhotoDNA, an image identification technology developed by Microsoft to remove illegal content from the platform. PhotoDNA attaches a digital signature to illegal images to create a database, and uses the images on the Internet in the form of collating them with this database. Microsoft claims that images cannot be reconstructed from hashed illegal image data, but researchers have newly reported that they have succeeded in reconstructing images from PhotoDNA.

Inverting PhotoDNA
https://www.anishathalye.com/2021/12/20/ hence-photodna /

Professor Hany Farid, a researcher of digital image analysis at Microsoft Research and Dartmouth College, developed PhotoDNA in 2009 to detect illegal content, especially child pornography. PhotoDNA gives an image a 'unique digital signature' and checks to see if it matches the 'illegal image signature' contained in the database.

Microsoft has so far argued about PhotoDNA that 'PhotoDNA hashes are not reversible and cannot be used to recreate images.' However, Ash Athalye , a PhD student at the Massachusetts Institute of Technology, has successfully used machine learning to develop a 'ribosome ' that can invert the hash of PhotoDNA.

Below is an image created by the ribosome from a hash of PhotoDNA. The left column is part of the 144-byte PhotoDNA hash, the center column is the image reconstructed from the PhotoDNA hash using ribosomes, and the right column is the image from which the hash was based. The image reconstructed from the hash is in a blurred state, but it is quite close to the original image.



According to Athalye, the details of the PhotoDNA algorithm are not disclosed, but the algorithm is

reverse engineered based on public information. Also, in the past, the compiled library used for the hash calculation of PhotoDNA has been leaked.

On the other hand, there is not much research on PhotoDNA hashes because it is not open source. In one of the few studies, Microsoft collaborated in 2019 to investigate PhotoDNA's privacy-protecting capabilities and concluded that 'PhotoDNA is resistant to machine learning-based classification attacks.' Meanwhile, in 2021, research shows that PhotoDNA's Perceptual Hash function is unlikely to be robust enough to withstand new attacks. Ribosomes are one of the reversal attacks on the PhotoDNA hash function, validating Microsoft's claim that 'PhotoDNA hashes cannot be used to recreate images,' Athalye said.

Ribosomes are trained on various datasets, and if you reconstruct the hash from four datasets: 'CelebA ', ' COCO ', '100,000 images scraped from Reddit', and 'CelebA / COCO / Reddit combined'. It looks like the following. There is a difference in the perfection of the images reconstructed by the model, and it can be seen that the fourth column with the largest and most diverse data set is the best.



However, the above images are only part of the result, and not all images are well reconstructed. Some of them could not be reconstructed well like the image below.



From the above results, Athalye shows that, contrary to Microsoft's claim, it is possible to duplicate the thumbnail of the original image using PhotoDNA hashes. The ribosome code and trained model can be downloaded from the following.

GitHub --anishathalye / ribosome: Synthesize photos from PhotoDNA using machine learning ????
https://github.com/anishathalye/ribosome

in Note, Posted by darkhorse_log