PhotoDNA – Robust hashing that finds online child sexual abuse material
PhotoDNA is a hashing technology that is widely used to detect online child sexual abuse material. It differs from binary hashing technologies in that it calculates hash values based on the visual content of an image. Like all hashing technologies, PhotoDNA can only detect images that have been identified and classified as child sexual abuse material. However, unlike binary hashing technologies, robust hashing technologies can detect small variations in the classified image or video.
Microsoft started developing PhotoDNA in collaboration with Dartmouth College in 2009. It was a social responsibility project, and once it was ready for use it was donated to organisations such as the National Center for Missing & Exploited Children (NCMEC) and Project VIC. It is also used by online platforms such as OneDrive, Google Gmail, Twitter, Facebook and Adobe Systems, to name a few, and is incorporated into software by businesses such as NetClean that build software to detect child sexual abuse material.
We have previously mentioned PhotoDNA in our blogs on Project Arachnid’s web crawler, and on binary hashing technology. Below we describe how the software produces a hash, and how it contributes to the armoury of tools that work proactively to detect online child sexual abuse material.
A robust hash
A binary hash is created by a mathematical algorithm that transforms data of any size into a much shorter, fixed-length sequence. This shorter sequence represents the original data and becomes the file’s hash, or digital fingerprint. In the case of a binary hash, the smallest alteration to a file will generate a completely new output, i.e. a new hash.
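The behaviour described above can be sketched in a few lines of Python. The article does not name a particular binary hash algorithm, so SHA-256 from the standard library is used here purely as an assumption, and the byte strings are stand-ins for real file contents.

```python
import hashlib


def binary_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Stand-ins for file contents; real input would be the bytes of an image file.
original = b"example file contents" * 1000
altered = original[:-1] + b"S"  # change only the final byte

hash_original = binary_hash(original)
hash_altered = binary_hash(altered)

# The digest length is fixed, whatever the size of the input.
print(len(binary_hash(b"")), len(hash_original))

# A one-byte change yields a different digest; count the hex digits
# that still happen to agree between the two.
matching_digits = sum(a == b for a, b in zip(hash_original, hash_altered))
print(hash_original == hash_altered, matching_digits)
```

Because the two digests are unrelated, comparing binary hashes can only answer whether two files are byte-for-byte identical, not whether they are similar.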
In contrast to binary hashing, robust hashing uses several algorithms and looks at specific characteristics of the actual image rather than just the binary data of the image file. The result is a robust hash which consists of the output from all the algorithms combined.
Whereas two copies of the same image in different file formats will produce completely different binary hashes, the resulting robust hashes will be mathematically close. And, as with binary hashes, robust hashes cannot be reverse-engineered into the images that they represent.
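To make the contrast concrete, here is a minimal sketch that uses a toy robust hash. It is not the PhotoDNA algorithm, whose details are not given in this article; `toy_robust_hash` is a hypothetical stand-in that simply averages brightness over a grid, and SHA-256 is again an assumed choice of binary hash.

```python
import hashlib


def toy_robust_hash(pixels, grid=2):
    """Average the brightness of each cell in a grid x grid layout.

    `pixels` is a list of rows of grayscale values (0-255).
    Illustrative stand-in only, NOT the PhotoDNA algorithm.
    """
    height, width = len(pixels), len(pixels[0])
    cell_h, cell_w = height // grid, width // grid
    features = []
    for gy in range(grid):
        for gx in range(grid):
            cell = [
                pixels[y][x]
                for y in range(gy * cell_h, (gy + 1) * cell_h)
                for x in range(gx * cell_w, (gx + 1) * cell_w)
            ]
            features.append(sum(cell) / len(cell))
    return features


def as_bytes(pixels):
    """Flatten the pixel rows into raw bytes for binary hashing."""
    return bytes(value for row in pixels for value in row)


# A 4x4 grayscale "image" and a copy with one pixel nudged by one level,
# a rough stand-in for what re-saving an image might do.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [90, 90, 40, 40],
    [90, 90, 40, 40],
]
resaved = [row[:] for row in image]
resaved[0][0] += 1

# The binary hashes no longer match at all...
print(hashlib.sha256(as_bytes(image)).hexdigest()
      == hashlib.sha256(as_bytes(resaved)).hexdigest())

# ...while the toy robust hashes differ only slightly in one feature.
print(toy_robust_hash(image))
print(toy_robust_hash(resaved))
```

The feature vectors are averages, so many different images map to the same values; this also hints at why a hash of this kind cannot be turned back into the image.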
PhotoDNA can identify a specific image regardless of its binary data. This is made possible because PhotoDNA looks at the visual content of the image, instead of exact binary image data. By calculating the mathematical distance (the Euclidean distance) between two PhotoDNA hashes it is possible to verify that the two hash values represent two different versions of the same image. Even if an image is edited very slightly, such as resized or saved in a different format, PhotoDNA will still match against that image.
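The distance calculation mentioned above can be illustrated as follows. The four-element vectors are made-up values chosen for readability, not real PhotoDNA hashes, and `hash_distance` is a hypothetical helper name.

```python
import math


def hash_distance(hash_a, hash_b):
    """Euclidean distance between two equal-length hash vectors."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return math.dist(hash_a, hash_b)


# Invented example vectors (assumptions for illustration only).
reference = [12, 200, 34, 90]
resized_copy = [13, 198, 34, 91]   # slightly altered version
unrelated = [240, 15, 180, 3]      # visually different image

distance_to_copy = hash_distance(reference, resized_copy)
distance_to_unrelated = hash_distance(reference, unrelated)

print(distance_to_copy)
print(distance_to_unrelated)
```

A small distance suggests two versions of the same image, while a large distance suggests different visual content.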
The purpose of PhotoDNA is to find images with the same visual content. As PhotoDNA is developed to recognise slightly altered images, it is important to establish how similar the output data must be to ensure that a match is actually showing the same visual imagery.
In the case of a binary hash, where the output is built around the binary content of a file, a match is all but guaranteed to be correct. The probability that two different files would generate the same output (a hash collision) is so low as to be almost impossible.
PhotoDNA and similar types of algorithms that measure the characteristics of images cannot guarantee such a secure outcome. The probability that a hash collision has occurred is still microscopic in this case, but not as small as in the case of a binary hash.
To aid users, Microsoft has developed guidelines on how mathematically close two PhotoDNA hashes must be to establish that they show the same visual imagery. If the guidelines don’t allow enough variation, images with the same visual content will be missed; conversely, if they allow too wide an interpretation, images that differ in visual content might be identified as the same image.
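The trade-off can be sketched with a distance threshold. The thresholds and vectors below are invented for illustration; Microsoft's actual guideline values are not stated in this article, and `is_match` is a hypothetical helper.

```python
import math


def is_match(hash_a, hash_b, threshold):
    """Treat two hashes as the same image if they are close enough."""
    return math.dist(hash_a, hash_b) <= threshold


# Invented example vectors (assumptions, not real PhotoDNA hashes).
known = [12, 200, 34, 90]
candidates = {
    "same image, resized": [13, 198, 34, 91],
    "same image, heavier edit": [18, 192, 40, 96],
    "different image": [40, 180, 60, 70],
}

results = {}
for threshold in (5, 20, 60):
    results[threshold] = [
        name for name, candidate in candidates.items()
        if is_match(known, candidate, threshold)
    ]
    print(threshold, results[threshold])
```

With these made-up numbers, the strictest threshold misses the more heavily edited copy, the middle one finds both copies, and the loosest one also reports the different image as a match.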
Binary and robust hashing in perfect harmony
We are careful not to state that one technology is better than the next. The whole idea behind the Technical Model National Response is to raise awareness of the fact that a range of different technologies must be applied to ensure as robust and near-complete a response as possible to the problem of online child sexual abuse.
In the case of hashing one might, however, ask why one would use binary hashing when a product such as PhotoDNA can detect images that have been slightly changed.
The choice of technology always depends on the context and purpose of the search taking place. Depending on the search, one or the other, or the technologies combined, might be most effective.
Binary hashing may work better if one is looking for material in real time, as it is typically extremely quick. PhotoDNA has to analyse the complete image instead of only the binary data in the file, which, although quick, takes longer.
Binary hashing works well in systems where speed makes all the difference. However, in a different context, e.g. on a social platform, speed might not be of the essence; what matters instead is ensuring that as much material as possible is found.
NetClean’s product NetClean ProActive uses both a binary hash and PhotoDNA to detect material. If a binary match has been made, the software can be configured to initiate a PhotoDNA search to see if there is more child sexual abuse material on the computer. To further broaden the search, it is also possible to do scheduled or manual PhotoDNA searches.
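The general idea of combining the two approaches, a fast binary lookup that can trigger a broader robust search, might be sketched as below. This is not NetClean ProActive's implementation: `two_stage_scan`, its parameters, the toy robust hash and the sample data are all hypothetical, and SHA-256 is an assumed choice of binary hash.

```python
import hashlib
import math


def two_stage_scan(files, known_binary, known_robust, robust_hash,
                   threshold, escalate=True):
    """Return (binary_hits, robust_hits) for a dict of name -> file bytes."""
    # Stage 1: fast exact lookup of each file's binary hash.
    binary_hits = [
        name for name, data in files.items()
        if hashlib.sha256(data).hexdigest() in known_binary
    ]

    # Stage 2: only if configured, and only after a binary match.
    robust_hits = []
    if escalate and binary_hits:
        for name, data in files.items():
            if name in binary_hits:
                continue
            candidate = robust_hash(data)
            if any(math.dist(candidate, known) <= threshold
                   for known in known_robust):
                robust_hits.append(name)
    return binary_hits, robust_hits


def toy_robust_hash(data):
    """Stand-in robust hash: read the first four bytes as features."""
    return list(data[:4])


# Toy data: each "file" is just four bytes.
files = {
    "file_a.jpg": bytes([12, 200, 34, 90]),
    "file_b.png": bytes([13, 198, 34, 91]),
    "file_c.jpg": bytes([240, 15, 180, 3]),
}
known_binary = {hashlib.sha256(files["file_a.jpg"]).hexdigest()}
known_robust = [[12, 200, 34, 90]]

print(two_stage_scan(files, known_binary, known_robust,
                     toy_robust_hash, threshold=5))
```

In this toy run the exact lookup finds `file_a.jpg`, which then triggers the robust pass that also flags the near-identical `file_b.png`; with `escalate=False` only the binary hit would be reported.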
Detection technologies provide a proactive approach to fighting online child sexual abuse and are key to this fight. All businesses and organisations, not just IT platforms and social media actors, can install software that detects material. Only when the material is found can it be investigated, analysed and lead to the removal of children from harmful situations.
Hashing technologies are used in many different ways to stop child sexual abuse material
Hashing technologies are used in a number of ways by law enforcement, social media platforms, NGOs and businesses, and in combination with other technologies, to detect child sexual abuse material in the workplace, on social media platforms or on hosting sites. Hashing technologies are efficient and reliable.
The limitations of hashing technologies are also their strengths. As hashing technologies can only detect images that have been identified and classified as child sexual abuse material, they are unable to detect new or previously unseen material. At the same time, however, this ensures that they only detect material that has been classified by law enforcement professionals and nothing else.
About the Technical Model National Response
Inspired by the WeProtect Global Alliance Model, we have set out to develop an initiative that looks at technology. We call it the Technical Model National Response. It is an overview of the existing technologies that need to be applied by different sectors and businesses to effectively fight the spread of child sexual abuse material.