Hash values – Fingerprinting child sexual abuse material
This series of blog posts looks at technologies that can prevent online child sexual abuse material from being shared or downloaded. Here we look at what hashing is, and how this technology, which is frequently used by forensic investigators, is vital to the fight against child sexual abuse crime.
What are hash values?
A binary hash is created by a mathematical algorithm that transforms data of any size into much shorter fixed-length data. This shorter sequence represents the original data and becomes this file’s signature, or its hash value – often called its digital fingerprint.
The algorithm guarantees that the same input data always generates the same output data. Cryptographic hash functions are also designed so that the output cannot feasibly be reversed or traced back to the original input data. A feature of binary hashes is thus that the original data, e.g. an image, cannot be recreated from the hash value.
Most hash functions appear arbitrary in that they do not generate similar output data for files that are similar in content or, in the case of images, in appearance. There is no connection between what the files look like or contain and the hash value that the algorithm produces. The hash is also so sensitive that if the file is altered in any way, even by a single bit, the hash changes entirely. This is called the avalanche effect.
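The properties described above can be illustrated with a minimal sketch. It assumes SHA-256 from Python's standard library as the hash function; the helper names `sha256_hex` and `differing_bits` and the sample inputs are made up for this illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many bits differ between two hashes of equal length."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Fixed length: one byte and one million bytes both give a 64-character hash.
h_short = sha256_hex(b"a")
h_long = sha256_hex(b"a" * 1_000_000)
print(len(h_short), len(h_long))

# Deterministic: the same input always gives the same hash.
original = b"The quick brown fox"
h_original = sha256_hex(original)
print(h_original == sha256_hex(original))

# Avalanche effect: changing one letter changes the hash entirely.
altered = b"The quick brown fix"
h_altered = sha256_hex(altered)
print(h_original)
print(h_altered)
print(differing_bits(h_original, h_altered), "of 256 bits differ")
```

For a well-behaved hash function, roughly half of the output bits are expected to differ after even the smallest change to the input.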
The list of security threats that businesses and organisations face is long; it ranges from ransomware to phishing to theft of intellectual property. This is nothing new. All businesses and organisations know that they must take steps to protect themselves from cyber-attacks, and most invest heavily in IT security. Gateways, such as web gateways and e-mail gateways, firewalls and DNS (Domain Name System) servers are all used, together with filter solutions, to try to identify and stop harmful traffic.
One of the biggest concerns for IT security professionals is internal data breaches. Firewalls can be used to stop hackers from gaining access to sensitive data, but employees can still all too easily click on malicious links contained in phishing emails or visit websites that download malware or ransomware.
Why use hash values?
Because hashes are not reversible, they are used for things like verifying data and protecting passwords and other sensitive data.
There are several types of hash functions that are used worldwide to verify files. Two well-known cryptographic hash functions are MD5 and the SHA family. Their algorithms can be used for verification of data, i.e. if the hash calculated from a received document is exactly the same as the hash that accompanied the document when it was sent, then the file has not been altered on the way. This means that hashes can also be used for matching purposes, to identify identical files.
Hashes can also be used to protect stored passwords. What is typed into the password field is turned into a hash before it is compared against what is held in the database of hashed passwords. Therefore, if a database of passwords is hacked, the hacker will find hash values, and not the passwords in their original form. This means that things like passwords can be stored with relative safety, and documents can be secured against manipulation.
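File verification of this kind could look roughly like the sketch below. It assumes SHA-256; the function names `file_sha256` and `is_unaltered`, the file name and the file contents are hypothetical and chosen only for the example.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unaltered(path: str, published_hash: str) -> bool:
    """Compare the hash of the received file with the hash sent alongside it."""
    return file_sha256(path) == published_hash.lower()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document.txt")
    with open(path, "wb") as f:
        f.write(b"contract text, version 1")

    # The sender publishes this value together with the document.
    published_hash = file_sha256(path)
    intact = is_unaltered(path, published_hash)

    # Appending a single space is enough for the check to fail.
    with open(path, "ab") as f:
        f.write(b" ")
    tampered_detected = not is_unaltered(path, published_hash)

print(intact, tampered_detected)
```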
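The password check described above can be sketched as follows. Note the assumptions: the text does not name an algorithm, so this example uses PBKDF2 with a random salt per password, which is a common way of making stored password hashes harder to guess in bulk. The iteration count and the function names are illustrative choices, not a recommendation from this article.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 100_000  # assumed work factor, chosen only for this example


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, hash). Only these two values are stored, never the password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(attempt: str, salt: bytes, stored: bytes) -> bool:
    """Hash the typed password the same way and compare it with the stored hash."""
    candidate = hashlib.pbkdf2_hmac("sha256", attempt.encode("utf-8"), salt, ITERATIONS)
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(candidate, stored)


salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

Someone who steals the database in this sketch obtains only the salt and the hash, not the password itself.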
Hash collisions and broken hashes
As hash functions accept input of arbitrary length but produce output of a predefined length, there is inevitably a possibility that two different inputs produce the same output hash. This is called a hash collision, but depending on which hash function is used, the likelihood of this happening is extremely low. For modern cryptographic hash functions it is highly unlikely that a hash collision will occur by chance, and it is almost impossible to create one deliberately.
Another issue that can make hashing problematic is a broken algorithm. If the algorithm has been broken, you can, if you know the hash value, create the identical hash with different input data. To put it more simply: let’s say you know the hash value for a particular password, but don’t know the original password. If the hash algorithm is broken, it is possible to use different input data to create the same hash value and access the password-protected account. This makes some algorithms unsuitable for protecting passwords and similar security purposes. However, they can still be used for verification when transmitting files, for example to detect accidental corruption.
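Why collisions are unavoidable in principle, and why output length matters, can be shown with a deliberately weakened hash. The sketch below keeps only the first 2 bytes (16 bits) of a SHA-256 hash, so there are just 65,536 possible outputs and a collision must appear after at most 65,537 different inputs. The function `tiny_hash` is a toy made up for this illustration and is not a real hash function in use anywhere.

```python
import hashlib
from typing import Dict, Tuple


def tiny_hash(data: bytes) -> bytes:
    """Toy hash: only the first 16 bits of SHA-256 (illustration only)."""
    return hashlib.sha256(data).digest()[:2]


def find_collision() -> Tuple[bytes, bytes]:
    """Hash the inputs b"0", b"1", b"2", ... until two share a tiny hash."""
    seen: Dict[bytes, bytes] = {}
    counter = 0
    while True:
        data = str(counter).encode("ascii")
        h = tiny_hash(data)
        if h in seen:
            return seen[h], data
        seen[h] = data
        counter += 1


first, second = find_collision()
print(first, second, tiny_hash(first).hex())

# The full 256-bit hashes of the same two inputs are still different.
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

With the full 256-bit output the same search would have to cover an astronomically larger space, which is why accidental collisions are not a practical concern for modern functions.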
Hash values and child sexual abuse material
In digital forensics, cryptographic hash algorithms are used for file identification and evidence authentication. By creating databases of hashed child sexual abuse material, new material can quickly be matched against already known files.
When previously unknown images are found, they are processed and hashed. The hashes are then added to the database of known material. This cataloguing system allows police forces to allocate their resources better and reduce manual labour. Less time is spent categorising images and more time can be spent analysing previously unseen material and identifying victims.
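The cataloguing workflow above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of any real law enforcement system: the class `KnownHashDatabase`, the function `triage`, the file names and the placeholder byte strings are all hypothetical, and SHA-256 is assumed as the hash function.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class KnownHashDatabase:
    """Stores only hash values, never the files themselves."""

    def __init__(self) -> None:
        self._hashes: Set[str] = set()

    def add(self, data: bytes) -> str:
        file_hash = sha256_hex(data)
        self._hashes.add(file_hash)
        return file_hash

    def contains(self, data: bytes) -> bool:
        return sha256_hex(data) in self._hashes


def triage(files: Dict[str, bytes], db: KnownHashDatabase) -> Tuple[List[str], List[str]]:
    """Split files into already known ones and ones that need manual review."""
    known = [name for name, data in files.items() if db.contains(data)]
    unseen = [name for name, data in files.items() if not db.contains(data)]
    return known, unseen


db = KnownHashDatabase()
db.add(b"placeholder bytes of file A")
db.add(b"placeholder bytes of file B")

seized = {
    "a.jpg": b"placeholder bytes of file A",  # identical to a known file
    "c.jpg": b"placeholder bytes of file C",  # not seen before
}
known, unseen = triage(seized, db)
print(known, unseen)

# After an investigator has reviewed the new file, its hash joins the database.
db.add(seized["c.jpg"])
print(triage(seized, db))
```

Because only hash values are exchanged in a setup like this, organisations can compare what they hold without sending the material itself.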
Equally important, hashes save investigators from viewing the same material over and over again, reducing the mental toll on people who work in this field. Reducing sensitive and illegal images to hashes also furthers collaboration between the different organisations that work worldwide to stop the dissemination of online child sexual abuse material.
Different kinds of matching
There are also other technologies and algorithms for matching and identifying images that are not based on binary hashing. One example of such an algorithm is PhotoDNA, which matches images based on the visual information in the images. This means that PhotoDNA can find the same image even if it has been saved in a different file format. This contrasts with binary hashes, which only recognise identical files based on the binary information.
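The difference between the two kinds of matching can be illustrated with a toy example. Important: the code below is not PhotoDNA, whose algorithm is not described in this post. It uses a much simpler, well-known technique called an average hash, purely to show the idea of matching on visual information. The 8x8 grayscale "images" are synthetic and the function names are made up for this sketch.

```python
import hashlib
from typing import List

Image = List[List[int]]  # 8x8 grid of grayscale values, 0-255


def average_hash(pixels: Image) -> int:
    """64-bit visual hash: one bit per pixel, set if the pixel is brighter than the mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; a small distance means visually similar."""
    return bin(a ^ b).count("1")


def binary_hash(pixels: Image) -> str:
    """Ordinary binary hash of the raw pixel bytes."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


# A synthetic gradient image, a slightly brighter copy, and an inverted copy.
image = [[x * 32 + y * 4 for x in range(8)] for y in range(8)]
brighter = [[v + 3 for v in row] for row in image]
inverted = [[255 - v for v in row] for row in image]

# The binary hashes of the original and the brighter copy do not match...
print(binary_hash(image) == binary_hash(brighter))
# ...but their visual hashes are identical (distance 0).
print(hamming_distance(average_hash(image), average_hash(brighter)))
# A visually different image is far away (here every bit differs).
print(hamming_distance(average_hash(image), average_hash(inverted)))
```

Production systems use far more robust algorithms than this, but the principle is the same: small changes to an image should produce a nearby hash rather than a completely different one.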
Combining hash values and PhotoDNA
Both technologies work well on their own; however, combining the two increases the probability of finding material. One example of how they can be combined is NetClean ProActive, which is software deployed on work computers to detect child sexual abuse material. It uses its hash database first, and if the software finds a match on a computer, it can also start a PhotoDNA search. In practice this means that if an image is detected, a PhotoDNA search is run to find visually identical images in addition to the binary-identical images. Another example of this technology in action can be found in our blog post on web crawlers.
Detection is part of the solution
Detection of child sexual abuse material creates a good foundation for stopping the dissemination of online child sexual abuse material. However, as NetClean makes clear in the Technical Model National Response, these technologies, like all other technologies, are not in themselves a silver bullet. Online child sexual abuse material is shared across all tiers of the Internet, and it therefore stands to reason that a range of different tools should operate on all tiers, on networks and on devices. Governments and businesses can help stop the dissemination of online child sexual abuse material by investing in a comprehensive response as described in the Technical Model National Response.
About the Technical Model National Response
Inspired by the WeProtect Global Alliance Model we have set out to develop an initiative that looks at technology. We call it the Technical Model National Response. It is an overview of the existing technologies that need to be applied by different sectors and businesses to effectively fight the spread of child sexual abuse material.