Keyword matching is widely used in everything from search engines to online advertising campaigns to media monitoring tools. In the search for child sexual abuse material, its function is to match words or phrases, in filenames or in text, that have been listed as suspicious and worth investigating.
In its simplest form, keyword matching uses lists of words, phrases or groupings that are matched directly against, for example, filenames, chat logs, documents or websites, to determine whether they are relevant.
In addition to exact matching, matches can also be case-insensitive. This means that the match is still made when the capitalisation differs, for example when capital letters are used.
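As a minimal sketch of exact and case-insensitive matching, the Python example below checks a filename against a keyword list using plain substring tests. The list contents, the function name `find_keywords` and the `ignore_case` parameter are hypothetical placeholders chosen for illustration; a real list would be curated by subject experts.

```python
# Hypothetical placeholder list, for illustration only.
KEYWORDS = ["placeholder", "sample term"]


def find_keywords(text, keywords=KEYWORDS, ignore_case=True):
    """Return the keywords that occur in `text` as plain substrings."""
    haystack = text.casefold() if ignore_case else text
    hits = []
    for keyword in keywords:
        needle = keyword.casefold() if ignore_case else keyword
        if needle in haystack:
            hits.append(keyword)
    return hits


# Case-insensitive matching finds the capitalised variant.
print(find_keywords("Holiday_PLACEHOLDER_01.jpg"))                     # ['placeholder']
# Exact (case-sensitive) matching does not.
print(find_keywords("Holiday_PLACEHOLDER_01.jpg", ignore_case=False))  # []
```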
The next level is fuzzy matching, which still finds a match when the text contains variations, whether made by mistake or on purpose. These include simple spelling mistakes, transposed letters, doubled letters, and character substitutions such as 4 for the letter A or 3 for E.
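One possible way to approximate fuzzy matching is sketched below. It normalises each token (lower-casing, reversing the 4/A and 3/E substitutions, collapsing repeated letters) and then compares it with each keyword using the similarity ratio from Python's standard `difflib` module. The function names, the placeholder keyword and the 0.8 threshold are assumptions made for this example, not a description of any particular product.

```python
import re
from difflib import SequenceMatcher

# Only the two substitutions named in the text; real tools may handle more.
SUBSTITUTIONS = str.maketrans({"4": "a", "3": "e"})


def normalise(token):
    """Lower-case, undo digit substitutions, collapse repeated characters."""
    token = token.casefold().translate(SUBSTITUTIONS)
    return re.sub(r"(.)\1+", r"\1", token)


def fuzzy_hits(text, keywords, threshold=0.8):
    """Return keywords that approximately match any token in `text`."""
    tokens = [normalise(t) for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    hits = set()
    for keyword in keywords:
        target = normalise(keyword)
        for token in tokens:
            # ratio() is 1.0 for identical strings and lower for dissimilar ones.
            if SequenceMatcher(None, token, target).ratio() >= threshold:
                hits.add(keyword)
                break
    return sorted(hits)


keywords = ["placeholder"]  # hypothetical placeholder term
print(fuzzy_hits("pl4c3holder_01.jpg", keywords))   # substituted characters
print(fuzzy_hits("plaecholder.jpg", keywords))      # transposed letters
print(fuzzy_hits("placceholder.jpg", keywords))     # doubled letter
```

The threshold is the main tuning choice in a sketch like this: a lower value catches more variations but also produces more false matches.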
The match can be further refined by assigning different values, or weights, to different words, and to words in relation to each other, for example when certain words appear together.
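The idea of weighting words and word relationships can be sketched as follows. Each keyword carries its own weight, and an extra bonus is added when specific keywords occur together. The terms, weights and bonus values here are hypothetical placeholders; in practice they would come from expert knowledge of the material.

```python
# Hypothetical weights per keyword (higher means more indicative).
WEIGHTS = {"alpha": 1.0, "bravo": 2.5, "charlie": 0.5}

# Hypothetical bonus applied when all keywords in a group appear together.
PAIR_BONUS = {frozenset({"alpha", "bravo"}): 3.0}


def score(text):
    """Sum the weights of the keywords found, plus any co-occurrence bonus."""
    lowered = text.casefold()
    found = {keyword for keyword in WEIGHTS if keyword in lowered}
    total = sum(WEIGHTS[keyword] for keyword in found)
    for group, bonus in PAIR_BONUS.items():
        if group <= found:  # every keyword in the group was found
            total += bonus
    return total


print(score("alpha_bravo.zip"))  # 6.5 (1.0 + 2.5 + 3.0 bonus)
print(score("charlie.txt"))      # 0.5
```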
Although not classified as keyword matching, textual analysis has been developed further with AI algorithms, which are used to analyse larger volumes of text, for example to produce semantic summaries, translations and spelling corrections. Keyword matching itself relies on the quality of the keyword lists, how words have been combined and how relationships between words have been scored.
Files containing child sexual abuse material are often named in specific ways, hence the importance of keyword matches to filenames. They are often combinations of words, scrambled words or very specific terms used by offenders to describe certain types of material.
Lists of known keywords can be used by law enforcement to triage and identify pertinent material, and by platform providers and businesses to highlight suspected files.
Modern web filters, which are used by most businesses, also use keyword matching in a number of ways to look at content and produce a probability score to determine how likely it is that a site contains certain content, and whether it should be blocked or not.
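How a filter might turn keyword evidence into a probability score and a block decision can be illustrated with a simple logistic model. This is a sketch under stated assumptions, not how any specific web filter works: the terms, weights, bias and threshold below are all hypothetical values.

```python
import math

TERM_WEIGHTS = {"alpha": 1.5, "bravo": 2.0}  # hypothetical term weights
BIAS = -3.0             # hypothetical: pages with no matching terms score low
BLOCK_THRESHOLD = 0.5   # hypothetical cut-off for blocking


def block_probability(page_text):
    """Map weighted keyword counts to a value between 0 and 1."""
    text = page_text.casefold()
    evidence = sum(weight * text.count(term)
                   for term, weight in TERM_WEIGHTS.items())
    # Logistic function: more evidence pushes the score towards 1.
    return 1.0 / (1.0 + math.exp(-(BIAS + evidence)))


def should_block(page_text):
    """Block the page when its probability reaches the threshold."""
    return block_probability(page_text) >= BLOCK_THRESHOLD


print(should_block("alpha bravo alpha"))  # True
print(should_block("nothing here"))       # False
```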
Strengths and limitations
Keyword matching is fast and requires very little processing power compared to the analysis of images. It is also quite easy to get started. Even a limited keyword list will provide value from the start, and the process of refining and building lists to make them better is straightforward.
However, doing keyword matching well is also highly complicated. The quality and value of keyword matching is directly related to the quality of the list that is used. This makes intelligence and deep knowledge of the subject necessary, and a great deal of time is needed to maintain a list for it to remain effective. As child sexual abuse material is rarely a prioritised area, many lists are lacking.
It is also important to note that a match does not automatically mean that the file contains child sexual abuse material. It is only an indication, and the file still needs to be reviewed.
Keyword matching is one of many technologies that can be applied by businesses to stop child sexual abuse material. In the last section of the NetClean Report 2019 we presented an overview of technologies and methods available to businesses to stop child sexual abuse material. The articles were a revision and abridgement of longer and more technically detailed articles, published here. In a series of blog posts we will compare the different technologies and show how they complement each other.