How web crawlers can help find child sexual abuse material

24 September, 2018 NetClean

In our series on the Technical Model National Response and technologies that are used to find online child sexual abuse material, we now look at web crawlers and how they can be built to look for this illegal material. In this case, we specifically look at Project Arachnid – operated by the Canadian Centre for Child Protection.

What is a web crawler?

Web crawlers, also known as crawlers, robots, search bots or simply bots, are automated programs that search engines and other organisations use to, for example, find and index what’s new on the Internet.

There are many different types of web crawlers, but in general they all follow the same pattern of work. They visit a website, download its content and push it into a database that indexes the content, and finally visit all the hyperlinks that exist on the webpage to find new material to index.
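The pattern described above (download, index, follow links) can be sketched in a few lines of Python. This is a minimal illustration and not the code of any particular crawler: the `crawl` function, the `LinkExtractor` class and the in-memory `PAGES` dictionary are all hypothetical names introduced here, and the "web" is a small dictionary so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: download a page, index it, queue its links."""
    index = {}                    # url -> downloaded content (stand-in for a database)
    queue = deque(seed_urls)
    seen = set(seed_urls)
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page missing or unreachable
            continue
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # visit each URL only once
                seen.add(link)
                queue.append(link)
    return index


# Hypothetical in-memory "web" used instead of real HTTP requests.
PAGES = {
    "http://example.test/": '<a href="http://example.test/a">A</a> '
                            '<a href="http://example.test/b">B</a>',
    "http://example.test/a": '<a href="http://example.test/">home</a>',
    "http://example.test/b": '<a href="http://example.test/missing">gone</a>',
}

found = crawl(["http://example.test/"], PAGES.get)
print(sorted(found))
```

Starting from a single seed URL, the crawler reaches every page that is linked from a page it has already visited, which is why a crawler can keep running from one initial list of addresses.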

Traditional web crawlers are programmed to index written content. Although they might be able to read images, they are not programmed to recognise illicit photographed or filmed material, such as online child sexual abuse material. However, a web crawler built to look for specific fingerprints, or hash values, can be a useful tool when looking for this material.
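As a rough illustration of what looking for a hash value means, the sketch below computes a fingerprint for a file and checks it against a list of known fingerprints. It assumes a plain cryptographic hash (SHA-256) purely for simplicity; the names `fingerprint`, `is_known` and `hash_list` are hypothetical, and the byte strings are placeholders standing in for files.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_known(data: bytes, hashes: set) -> bool:
    """True if the file's fingerprint appears in the hash list."""
    return fingerprint(data) in hashes


# Hypothetical hash list built from one placeholder "file".
known_file = b"placeholder bytes of a previously classified file"
hash_list = {fingerprint(known_file)}

print(is_known(known_file, hash_list))            # True
print(is_known(known_file + b"\x00", hash_list))  # False
```

The second call shows the limitation of an exact hash: changing a single byte produces a completely different fingerprint, so an altered copy of a known file is no longer matched.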

The Arachnid crawler – how does it work?

Through the operation of its hotline, the Canadian Centre for Child Protection (C3P) has built a web crawler called Project Arachnid, which has the specific task of finding and removing online child sexual abuse material. It uses Microsoft’s PhotoDNA technology together with hashes (digital fingerprints) from lists generated by several organisations, the largest being NCMEC, the Royal Canadian Mounted Police (RCMP) and Interpol. This combination means that even if an image has been slightly altered, the crawler can still detect the fingerprint of child sexual abuse material.
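To give an intuition for how a fingerprint can survive small alterations, here is a toy example of a perceptual hash. It is not PhotoDNA, whose algorithm is proprietary; it uses a simple "average hash" as a stand-in, and every name in it (`average_hash`, `hamming_distance`, `matches`, the tiny 4x4 grids of grayscale values) is an assumption made for illustration.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(a, b):
    """Number of positions at which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def matches(pixels, known_hash, max_distance=2):
    """Treat an image as a match if its hash is close to a known hash."""
    return hamming_distance(average_hash(pixels), known_hash) <= max_distance


# Hypothetical 4x4 grayscale "images" (values 0-255).
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# The same picture, uniformly brightened by 10.
brightened = [[min(255, v + 10) for v in row] for row in original]
# An unrelated checkerboard pattern.
different = [
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
]

known = average_hash(original)
print(matches(brightened, known))  # True
print(matches(different, known))   # False
```

Because the hash records which areas are brighter than average and not the exact bytes, the brightened copy still produces the same bits, while the unrelated image differs in half of them.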

Project Arachnid started from a list of URLs that had been reported to the hotline as containing online child sexual abuse material. Since its initial kick-off in 2016, the web crawler has continued its search by following links from pages that it has already found. Although analysts flag new URLs to be added to Project Arachnid, the volume of online child sexual abuse material is so high that the web crawler has never needed to be restarted or sent in a new direction.

The Arachnid web crawler scans thousands of URLs per second. It scans the images at each URL and pushes what it recognises as child sexual abuse material into Project Arachnid’s classification system. The content is then triple-verified, that is, reviewed by three different analysts, to ensure that the image can be classified as child sexual abuse material. Once this classification has been made, a notice is sent to the hosting provider, requesting that the material be removed. As the final step, the hosting provider removes the material. For material that has gone through the triple verification, and that is publicly available on the Internet, take-down notices have so far seen a 98% success rate.

Crawling the Darknet

Project Arachnid can also crawl the Darknet, a space that is built to keep both publishers and visitors anonymous: publishers on the Darknet are hidden behind layers of encryption, and IP addresses are bounced through a series of worldwide nodes to conceal them. As a result, when the crawler finds online child sexual abuse images on the Darknet, notices are not sent, because the publisher is unknown.

Crawling the Darknet is still valuable, however, as many Darknet pages include links back to the open web. The content found on the Darknet is also pushed into Project Arachnid’s classification system, which helps to disrupt the material when it appears on the open web.

The major gain – limiting revictimisation

Project Arachnid’s aim is to remove content as quickly as possible to prevent revictimisation.

The fact that images are actively pursued and removed offers the victims of this crime relief. Knowing that there are specific technologies, organisations and NGOs working to remove material that could otherwise be shared again and again helps alleviate the feeling that the cycle of abuse is endless.

Technology heavily dependent on human resources

Web crawlers are efficient at finding online child sexual abuse material that is already known. They make a huge difference in tackling the spread of child sexual abuse material. The challenge that NGOs face when processing the information identified by crawlers is the ever-growing need for human resources to ensure that the material is viewed and categorised.

In the case of Project Arachnid, a notice is sent automatically for material that has been triple-verified and whose hosting provider is known. For all other material, three analysts have to classify each image that the web crawler has pushed into the system, to ensure that no mistakes are made. Project Arachnid is currently detecting more than 100,000 images per month that require analyst assessment, a number that has historically increased each month. Continued funding and engagement are key to the success of the project and to the removal of child sexual abuse material from the Internet.
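The routing rule described in this article (three analysts must confirm an image, and a notice can only go out when the hosting provider is known) can be written down as a small decision function. This is a sketch of the rule as the article describes it, not Project Arachnid’s actual implementation; the `Detection` class, the `next_step` function and the step names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

REQUIRED_CONFIRMATIONS = 3  # "triple-verified" in the article


@dataclass
class Detection:
    url: str
    # None when the publisher is unknown, for example on the Darknet.
    hosting_provider: Optional[str] = None
    # IDs of the analysts who confirmed the classification; a set,
    # so the same analyst is never counted twice.
    confirmed_by: Set[str] = field(default_factory=set)


def next_step(detection: Detection) -> str:
    """Decide what happens to a detection under the rule described above."""
    if len(detection.confirmed_by) < REQUIRED_CONFIRMATIONS:
        return "queue_for_analysts"
    if detection.hosting_provider is None:
        return "retain_in_classification_system"
    return "send_notice"


open_web = Detection("http://example.test/x", "Example Host", {"a1", "a2", "a3"})
pending = Detection("http://example.test/y", "Example Host", {"a1"})
darknet = Detection("http://example.onion/z", None, {"a1", "a2", "a3"})

print(next_step(open_web))  # send_notice
print(next_step(pending))   # queue_for_analysts
print(next_step(darknet))   # retain_in_classification_system
```

Only the first branch requires people, which is why the number of images waiting for analyst assessment, and not the speed of the crawler, is the limiting factor.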

Project Arachnid is currently detecting over 100,000 unique images per month that require analyst assessment

  • More than 52 billion images have been processed
  • Of these, 1.7 million images have been triggered for analyst review
  • 910,000 notices have been sent to providers

About the Technical Model National Response

Inspired by the WeProtect Global Alliance Model, we have set out to develop an initiative that looks at technology. We call it the Technical Model National Response. It is an overview of the existing technologies that need to be applied by different sectors and businesses to effectively fight the spread of child sexual abuse material.
