Using crawling and hashing technologies to find child sexual abuse material – The Internet Watch Foundation

11 February, 2019 NetClean
In Business, Child protection, Technology

In our series on technologies that are used to stop child sexual abuse, we have written about how they work and what they are used for. Following our blogs on web crawlers and hash matching, this post looks at how the Internet Watch Foundation (IWF) uses these technologies.

The IWF Web Crawler

The IWF’s web crawler was developed around the same time as Project Arachnid’s. The two crawlers are built similarly: both crawl websites and push content into databases for verification and indexing, and both follow links on pages to see whether they can find more child sexual abuse material on other pages. Does the world need more than one crawler? Yes. It provides resilience: should there be an issue with any one crawler, others will pick up the work.

Using PhotoDNA, the IWF crawler scans images and matches them against the information held in the IWF database. If there is a match within a certain distance, the URL will be sent back to be analysed and verified by a human. If the image is verified as a child sexual abuse image, a notice is issued to the site host.
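The "match within a certain distance" step can be illustrated with a small sketch. PhotoDNA itself is proprietary, so this is not its algorithm: the example assumes hashes are plain integers compared by Hamming distance, and the function names and the threshold of 4 are hypothetical choices made only for illustration.

```python
from typing import Iterable, Optional


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def closest_known_hash(
    candidate: int, known_hashes: Iterable[int], max_distance: int
) -> Optional[int]:
    """Return the nearest known hash within max_distance, or None if there is none."""
    best: Optional[int] = None
    best_distance = max_distance + 1
    for known in known_hashes:
        distance = hamming_distance(candidate, known)
        if distance < best_distance:
            best, best_distance = known, distance
    return best


def needs_human_review(
    candidate: int, known_hashes: Iterable[int], max_distance: int = 4
) -> bool:
    """A near match is only a trigger for review; a person makes the final decision."""
    return closest_known_hash(candidate, known_hashes, max_distance) is not None
```

A looser threshold catches more altered copies but sends more false matches to the analysts, which is one reason the human verification step matters.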

Granular classification

Rather than just classifying an image as illegal or not, the IWF analysts grade images. The granularity fits the UK legal system, which classifies images on a scale from 1 to 5. It is also a way of future-proofing the database: if the IWF image database were to be used in another country, the material relevant to that country could be pulled out.

To make the crawler even more effective, the IWF is also working on ways to automate more of the review process by using an artificial intelligence (AI) classifier. The aim is to develop a classifier that returns the likelihood that a crawled image is child sexual abuse material.

The depth of the crawl

The IWF crawler is set to stop crawling after two URLs in depth if it has not found child sexual abuse material. Instead the crawler is sent off on new targeted searches several times daily. The new searches are informed either by intelligence gathered at the IWF or by information shared by the police and industry members.

There are several reasons for running the IWF crawler in this targeted way. One is resources. Another is that the IWF believes this is the most efficient way of finding new material: in its experience, the crawler rarely finds new child sexual abuse material once it has moved more than two links deep into sites where it has not found any.
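The depth rule can be sketched as a breadth-first crawl that gives up on a branch after two consecutive links without a hit. This is one reading of the rule, not the IWF's implementation: the function name, the `get_links` and `is_flagged` callbacks, and the choice to reset the counter after a hit are all assumptions, and the example runs on an in-memory link graph rather than the network.

```python
from collections import deque
from typing import Callable, Iterable, List


def targeted_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    is_flagged: Callable[[str], bool],
    max_clean_depth: int = 2,
) -> List[str]:
    """Return flagged URLs, abandoning branches with no hit in max_clean_depth links."""
    seed_list = list(seeds)
    seen = set(seed_list)
    # Each queue entry carries the number of links followed since the last hit.
    queue = deque((url, 0) for url in seed_list)
    flagged: List[str] = []
    while queue:
        url, clean_depth = queue.popleft()
        if is_flagged(url):
            flagged.append(url)
            clean_depth = 0  # assumption: a hit renews the depth budget
        if clean_depth >= max_clean_depth:
            continue  # two links deep without a hit: stop following this branch
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, clean_depth + 1))
    return flagged
```

With a chain of pages a, b, c, d, e where only c is flagged, the crawl reaches c at depth two, continues to d and e, and then stops. If only d were flagged, the crawl would stop at c and never reach it, which is the trade-off the targeted strategy accepts.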

With the introduction of GDPR, there are also new data protection laws that the IWF has to consider. GDPR is designed to harmonise data privacy laws across the EU. Following a human rights review by Lord Macdonald, former head of the Crown Prosecution Service in the UK, the IWF was advised that crawlers cannot go too deeply without specific reasons.

Also in compliance with the GDPR, the IWF does not store datasets but deletes the material after it has been reviewed. Similarly, because the laws on what is and is not illegal differ between countries, the IWF does not do automated notice and takedown.

Three types of crawls

In addition to setting the depth of the crawl, the IWF can configure the crawler to carry out searches of different scope. The most common strategy is the targeted crawl described above. The crawler can also be set to do a wider search on a certain group of URLs or IP addresses. As this method is slower and less precise, it is used only occasionally. Finally, there is a dark web crawler that can find links to images on the open web, or generate intelligence on what is stored on the dark web to see if those images are also stored on the open web.

Direct notices or via INHOPE

Notices are handled differently depending on the country where the site or site owner is located. If the provider is based in the UK or in the Commonwealth, a notice is issued directly to the provider. A direct notice is also sent out in countries where the IWF runs hotline portals, e.g. Kenya, Rwanda and Namibia. If the provider is based in an INHOPE country, the URL will be reported to the INHOPE database and sent to the relevant hotline.
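The routing rule above amounts to a lookup on the provider's country. A minimal sketch follows; the function name, the return labels and the country sets are hypothetical. The example sets are small illustrative subsets (the UK plus the three portal countries named above, and one INHOPE country), not real membership lists, and the source does not say what happens for countries in neither group.

```python
from typing import Set

# Illustrative subsets only, keyed by ISO country code.
EXAMPLE_DIRECT: Set[str] = {"GB", "KE", "RW", "NA"}
EXAMPLE_INHOPE: Set[str] = {"NL"}


def route_notice(
    country: str, direct_countries: Set[str], inhope_countries: Set[str]
) -> str:
    """Pick a notice channel for a verified URL based on the provider's country."""
    if country in direct_countries:
        return "direct_notice"  # UK, Commonwealth, or an IWF portal country
    if country in inhope_countries:
        return "inhope_report"  # passed on to the relevant national hotline
    return "unspecified"  # not covered by the description above
```

Checking the direct list first mirrors the order in which the text describes the rule.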

UK Takedown rate of 90%

The majority of UK content (90%) is taken down within two days of being found by the IWF, with almost half (45%) taken down within two hours.

When material is not taken down, the IWF proceeds to contact the hosting provider to motivate them to become part of the solution. When, occasionally, the IWF comes across hosting providers that deal solely in illegal material (not only child sexual abuse material but also other illegal material such as malware and fake pharmacies), these providers are reported to law enforcement.

The aim of the IWF crawler is to protect victims from revictimisation through images being distributed across the internet. Today, the majority of the material found by the IWF crawler is old material that is already widely distributed. In the future the aim is to identify new material before it can be further distributed.

The IWF Image Hash List

In an earlier blog post we looked at hashing technologies, and how they are used in the fight against child sexual abuse. The IWF also maintains a hash list that is used in several different ways. For perceptual hashing the IWF uses PhotoDNA hashes. The benefit of the PhotoDNA hash is that it will recognise images even if they are slightly altered. For other use cases, faster and more exact hashing algorithms are used.
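The difference between the two kinds of hash can be shown on a toy example. The sketch below uses SHA-256 as the exact hash and a simple "average hash" as a stand-in for a perceptual hash. It is not PhotoDNA, and the eight-pixel greyscale "image" is made up for illustration.

```python
import hashlib
from typing import List


def exact_hash(pixels: List[int]) -> str:
    """Cryptographic hash: any change to the bytes gives a different digest."""
    return hashlib.sha256(bytes(pixels)).hexdigest()


def average_hash(pixels: List[int]) -> int:
    """Toy perceptual hash: one bit per pixel, set if the pixel is brighter than the mean."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


original = [10, 200, 30, 220, 15, 210, 25, 205]  # greyscale values, 0 to 255
altered = [10, 198, 30, 220, 15, 210, 25, 205]   # one pixel slightly darker

exact_changed = exact_hash(original) != exact_hash(altered)
perceptual_same = average_hash(original) == average_hash(altered)
```

Here `exact_changed` and `perceptual_same` are both true: the small edit changes the exact hash but leaves the perceptual hash unchanged. Exact hashes suit fast lookups of identical files, while perceptual hashes tolerate re-encoding and minor edits.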

Sourced from the IWF and CAID

The hash list is sourced from intelligence gathered at the IWF and also includes hashes from the UK Child Abuse Image Database (CAID).

Several different uses

In addition to identifying child sexual abuse material and pushing back URLs for review and verification, the IWF uses the hash list to filter out already known and categorised images to ensure that analysts at the IWF don’t have to repeatedly review the same images.
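Filtering out already-categorised images is, in essence, a lookup against the hash list before anything reaches an analyst. A minimal sketch, assuming exact hashes stored as strings and a hash list that maps each hash to its recorded grade; the function name and data shapes are hypothetical.

```python
from typing import Dict, List, Tuple


def split_known_from_new(
    crawled: Dict[str, str], hash_list: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Separate crawled URLs into already-graded images and a review queue.

    crawled maps URL -> image hash; hash_list maps image hash -> recorded grade.
    """
    already_graded: Dict[str, str] = {}
    review_queue: List[str] = []
    for url, image_hash in crawled.items():
        grade = hash_list.get(image_hash)
        if grade is None:
            review_queue.append(url)  # unseen image: needs an analyst
        else:
            already_graded[url] = grade  # reuse the earlier assessment
    return already_graded, review_queue
```

Only the URLs in the review queue need an analyst's attention; the rest inherit the grade recorded when the image was first assessed.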

The hash list is also provided to IWF industry members so that they can scan their own systems. One example is Facebook, which, in addition to using its own list and an NGO sharing platform, uses the IWF hash list to identify child sexual abuse material on its platform.

Scan at upload – an aim for the future

One of the future aims of the IWF is to enable hosting platforms to scan images at the point of upload, instead of only scanning existing systems. This would limit the spread of the material and allow the platforms to be proactive rather than reactive. At present, however, this would require changes to legislation, as some countries require someone within a business or organisation to manually review all material to verify that it is illegal before it is referred to the national report centre.

Fighting child sexual abuse

Web crawlers and hash lists are two of the collaborative measures and online technologies used by the IWF to find online child sexual abuse material. In future blog posts we will be looking at how INHOPE works with notice and takedown, and how national systems such as CAID operate.

The Internet Watch Foundation (IWF) was set up in the UK in 1996 after a series of meetings between Government, Police and the Internet Industry. The aim was to establish an independent body that could receive, assess and trace public complaints about child sexual abuse on the Internet. It operates a hotline for anonymous tips, but it also takes a proactive role in searching the Internet for online child sexual abuse material.

About the Technical Model National Response

Inspired by the WeProtect Global Alliance Model, we have set out to develop an initiative that looks at technology. We call it the Technical Model National Response. It is an overview of the existing technologies that need to be applied by different sectors and businesses to effectively fight the spread of child sexual abuse material.
