Many technologies work in tandem to stop child sexual abuse material (CSAM) from spreading online, and from being downloaded, stored and consumed on devices. In this series of articles, we have looked at a few of these: filter technologies, hashing technologies and AI. In this last item on the list, we look at keyword matching.
Keyword matching is widely used in search engines, online advertising campaigns and media monitoring tools, to name a few. In the search for CSAM, its function is to flag filenames or texts that contain words or phrases listed as suspicious and worth investigating.
In its simplest form, keyword matching uses lists of words, phrases or groupings that are matched directly against, for example, filenames, chat logs, documents or websites to identify whether they are relevant. In addition to exact matching, matches can also be case-insensitive. This means, for example, that the match will still be made even if capital letters are used.
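As a minimal sketch of case-insensitive matching, the Python below checks which entries of a list occur in a filename. The function name `find_keywords` and the watchlist entries are hypothetical, neutral placeholders chosen for illustration; they do not come from any real keyword list or tool.

```python
def find_keywords(text, keywords):
    """Return the listed keywords that occur in text, ignoring case."""
    folded = text.casefold()  # casefold() is a more thorough lower()
    return [kw for kw in keywords if kw.casefold() in folded]


# Placeholder entries only; a real list is curated by subject experts.
watchlist = ["flagged-term", "another phrase"]

print(find_keywords("Holiday_FLAGGED-TERM_2019.zip", watchlist))
# ['flagged-term']
```

Because both sides are folded to the same case before comparison, the capitalised filename still matches the lower-case list entry.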
The next level is fuzzy matching, which will match even if there are variations, whether made by mistake or on purpose. These include simple spelling mistakes, letters being switched around, double letters, and the letter A swapped for 4 or E for 3. The match can be further refined by attaching different values to different words, and to different words in relation to each other.
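One way this could look in practice is sketched below, using only the Python standard library. It undoes digit-for-letter swaps, collapses repeated letters, compares tokens with `difflib.SequenceMatcher`, and sums a weight per matched keyword. The names `normalize` and `fuzzy_score`, the 0.8 threshold, the weights, and the extra swaps beyond A/4 and E/3 are all assumptions made for illustration, and the keywords are neutral placeholders.

```python
import re
from difflib import SequenceMatcher

# Digit-for-letter swaps. 4 -> a and 3 -> e come from the text above;
# 0 -> o and 1 -> i are added here as assumptions.
SWAPS = str.maketrans({"4": "a", "3": "e", "0": "o", "1": "i"})


def normalize(token):
    """Fold case, undo digit swaps, and collapse repeated characters."""
    token = token.casefold().translate(SWAPS)
    return re.sub(r"(.)\1+", r"\1", token)


def fuzzy_score(text, weights, threshold=0.8):
    """Sum the weights of keywords that approximately match a token in text."""
    tokens = [normalize(t) for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    score = 0.0
    for keyword, weight in weights.items():
        target = normalize(keyword)
        # ratio() is 1.0 for identical strings and falls towards 0.0
        if any(SequenceMatcher(None, target, tok).ratio() >= threshold
               for tok in tokens):
            score += weight
    return score


# Placeholder keywords with hypothetical weights.
weights = {"placeholder": 2.0, "sample": 1.0}
print(fuzzy_score("pl4c3holder_notes.txt", weights))  # 2.0
```

A higher total suggests that a file deserves earlier attention. Scoring words in relation to each other, for instance adding a bonus when two terms appear together, would be an extension of the same idea.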
Although not classified as keyword matching, textual analysis with AI algorithms is a further development, used to analyze larger volumes of text for semantic summaries, translations and spelling correction, to name a few examples. Keyword matching itself relies on the quality of the keyword lists, how words have been combined and how relationships between words have been scored.
Files containing CSAM are often named in specific ways, hence the importance of matching keywords against filenames. The names are often combinations of words, scrambled words or very specific terms used by offenders to describe certain types of material. Lists of known keywords can be used by law enforcement to triage and identify important material, and by platform providers and businesses to highlight suspected files.
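To illustrate the triage idea, the sketch below walks a folder and reports files whose names contain a listed term, so that a reviewer can look at those first. The function name `triage`, the folder path and the list entry are hypothetical placeholders, and plain substring matching is assumed for brevity.

```python
from pathlib import Path


def triage(root, keywords):
    """Yield (path, matched keywords) for files whose names contain a listed term."""
    folded = [kw.casefold() for kw in keywords]
    for path in Path(root).rglob("*"):  # recurse through all subfolders
        if path.is_file():
            name = path.name.casefold()
            hits = [kw for kw in folded if kw in name]
            if hits:
                yield path, hits


# Hypothetical usage: list candidates for human review.
for path, hits in triage("evidence_folder", ["placeholder"]):
    print(path, hits)
```

The result is only a list of candidates; each highlighted file still has to be reviewed by a person.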
Keyword matching is fast and uses very little processing power compared to analysis of images. It is also quite easy to get started. Even a limited keyword list will provide value from the start, and the process of refining and building lists to make them better is straightforward.
However, doing it well is highly complicated. The value of keyword matching is directly related to the quality of the list that is used. This makes intelligence and deep knowledge of the subject necessary, and much time is needed to maintain a list for it to remain effective. It is also important to note that a match does not automatically mean that the file contains CSAM. It is only an indication, and the file still needs to be reviewed.