Research indicates that artificial intelligence image generators are being trained on photographs depicting the sexual abuse of children.


A recent report reveals that a dataset used to train well-known AI image-generating systems contains thousands of images depicting child sexual abuse. The report urges companies to address this harmful flaw in their technology.

Those images have made it easier for AI systems to generate realistic, explicit imagery of fake children and to transform social media photos of fully clothed teenagers into nudes, alarming schools and law enforcement agencies around the world.

Until recently, anti-abuse researchers assumed that the only way unchecked AI tools produced abusive imagery of children was by combining what they had learned from two separate categories of online images: adult pornography and benign photos of children.

The Stanford Internet Observatory discovered over 3,200 images in the LAION database that are believed to depict child sexual abuse. This database is used to train AI image-making technology, including Stable Diffusion. The watchdog organization, located at Stanford University, collaborated with the Canadian Centre for Child Protection and other charities focused on combatting abuse to identify the illegal content and report the original photo links to authorities. Of the images found, approximately 1,000 were confirmed to be illicit.

The response was swift. Ahead of the Wednesday publication of the Stanford Internet Observatory’s findings, LAION told The Associated Press that it was temporarily taking down its datasets.

In a statement, LAION, short for Large-scale Artificial Intelligence Open Network, said it has a strict policy against illegal content and, as a precaution, has taken the LAION datasets offline to make sure they are safe before republishing them.

Although the images account for only a tiny fraction of LAION’s index of roughly 5.8 billion images, the Stanford group says they are likely influencing the ability of AI tools to generate harmful outputs, and they reinforce the prior abuse of real victims who may appear in the data multiple times.

According to David Thiel, chief technologist at the Stanford Internet Observatory and author of the report, the problem is hard to fix because it traces back to a competitive field in which many generative AI projects were rushed to market and made widely accessible.

Thiel stated in an interview that using a complete scrape of the internet to create a dataset for training models should have been limited to research purposes and should not have been made publicly available without thorough scrutiny.

A prominent LAION user that helped shape the dataset is Stability AI, a London-based startup and the maker of the text-to-image model Stable Diffusion. Newer versions of Stable Diffusion have made it much harder to create harmful content, but an older version introduced last year, which Stability AI says it did not release, is still embedded in other apps and tools and, according to the Stanford report, remains the most widely used model for generating explicit imagery.

Lloyd Richardson, director of information technology at the Canadian Centre for Child Protection, which runs Canada’s hotline for reporting online sexual exploitation, said that model cannot be recalled because copies of it already sit on many people’s personal computers.

On Wednesday, Stability AI said it only provides filtered versions of Stable Diffusion and that, since taking over development of Stable Diffusion, it has taken proactive measures to prevent misuse.

The company said its filters keep harmful content from ever reaching the models, and that removing such material beforehand helps prevent the models from generating unsafe output.

LAION was the brainchild of a German researcher and teacher, Christoph Schuhmann, who told the AP earlier this year that part of the reason to make such a huge visual database publicly accessible was to ensure that the future of AI development isn’t controlled by a handful of powerful companies.

He said it would be safer and fairer to make AI development more democratic, allowing both the research community and the general public to benefit from it.

Much of LAION’s data comes from Common Crawl, a repository of data continuously scraped from the open internet. But Rich Skrenta, Common Crawl’s executive director, said it is LAION’s responsibility to scan and filter what it takes before making use of it.

LAION said it has developed rigorous filters to detect and remove illegal content before publishing its datasets and is continuing to improve them. The Stanford report acknowledged that LAION’s developers made some attempts to filter out explicit content involving minors, but said they could have done a better job had they consulted child safety experts earlier.
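To make that filtering step concrete, the sketch below shows one common pattern: dropping any row of a dataset’s metadata whose safety-classifier score exceeds a cutoff before the dataset is published or used. It is a minimal illustration in Python; the file names, the "unsafe_score" column and the 0.1 threshold are assumptions made for the example, not LAION’s actual schema, thresholds or tooling.

```python
import csv

# Hypothetical inputs: a metadata file with one row per image (URL, caption,
# and a classifier-assigned probability that the image is unsafe).
INPUT_FILE = "dataset_metadata.csv"
OUTPUT_FILE = "dataset_metadata.filtered.csv"
UNSAFE_THRESHOLD = 0.1  # placeholder cutoff, not a real-world value

def filter_metadata() -> None:
    kept = dropped = 0
    with open(INPUT_FILE, newline="") as src, open(OUTPUT_FILE, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Drop any row the safety classifier flagged above the threshold.
            if float(row["unsafe_score"]) >= UNSAFE_THRESHOLD:
                dropped += 1
                continue
            writer.writerow(row)
            kept += 1
    print(f"Kept {kept} rows, dropped {dropped} flagged rows")

if __name__ == "__main__":
    filter_metadata()
```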

Many text-to-image generators are derived in some way from the LAION database, though it is not always clear which ones. OpenAI, maker of DALL-E and ChatGPT, said it does not use LAION and has fine-tuned its models to refuse requests for sexual content involving minors.

Google built its text-to-image Imagen model using a LAION dataset, but decided against making the model public after a 2022 audit of the database uncovered a wide range of inappropriate content, including pornographic imagery, racist language and harmful societal stereotypes.

Cleaning up the data retroactively is difficult, so the Stanford Internet Observatory is calling for more drastic measures. One is for anyone who has built training sets from LAION-5B, named for the more than 5 billion image-text pairs it contains, to delete them or work with intermediaries to clean the material. Another is to effectively make an older version of Stable Diffusion disappear from all but the darkest corners of the internet.
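One way to act on that recommendation, for anyone holding a locally assembled training set, is to check each downloaded image against a blocklist of hashes supplied by a child-safety intermediary and delete the matches. The Python sketch below assumes a hypothetical folder of images and a hypothetical hash list; real takedown pipelines generally rely on perceptual hashes such as PhotoDNA or PDQ, which still match re-encoded copies, rather than the exact SHA-256 digests used here for simplicity.

```python
import csv
import hashlib
from pathlib import Path

# Hypothetical inputs for illustration only.
IMAGE_DIR = Path("training_images")              # locally downloaded images
BLOCKLIST_FILE = Path("known_abuse_hashes.txt")  # one hex digest per line
REPORT_FILE = Path("removed_images.csv")         # record of what was deleted

def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def purge_flagged_images() -> None:
    blocklist = {line.strip() for line in BLOCKLIST_FILE.read_text().splitlines() if line.strip()}
    removed = []
    for image_path in sorted(IMAGE_DIR.iterdir()):
        if image_path.is_file() and sha256_of_file(image_path) in blocklist:
            removed.append(image_path.name)
            image_path.unlink()  # delete the flagged file from the training set
    with REPORT_FILE.open("w", newline="") as f:
        csv.writer(f).writerows([[name] for name in removed])
    print(f"Removed {len(removed)} flagged images; details written to {REPORT_FILE}")

if __name__ == "__main__":
    purge_flagged_images()
```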

According to Thiel, legitimate platforms can stop offering a model for download if it is frequently used to generate abusive images and has no safeguards in place to block them.

As an example, Thiel pointed to CivitAI, a platform popular with people creating AI-generated pornography, which he said lacks adequate safeguards against producing images of children. The report also calls on the AI company Hugging Face, which distributes the training data for these models, to implement better methods for reporting and removing links to abusive material.

Hugging Face said it regularly works with regulators and child safety groups to identify and remove abusive material. CivitAI said it has strict policies on generating images of children, has recently rolled out updates to provide more safeguards, and is committed to adapting and expanding its policies as the technology evolves.

The report from Stanford raises concerns about the use of children’s photos in AI systems without their family’s consent, citing protections under the federal Children’s Online Privacy Protection Act.

According to Rebecca Portnoff, the head of data science at Thorn, a non-profit fighting child sexual abuse, their research has revealed that the number of AI-generated images used by abusers is relatively low but steadily increasing.

Developers can head off those harms by making sure the datasets they use to build AI models are free of abusive material, and Portnoff says there are also ways to mitigate harmful uses of models after they are already in circulation.

Technology companies and child protection groups currently assign videos and images a “hash,” a unique digital fingerprint, to track and take down known child abuse material. According to Portnoff, the same concept could be applied to AI models that are being misused.

That is not happening yet, she said, but in her view it is something that can and should be done.
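Applied to models rather than images, the same idea amounts to fingerprinting a checkpoint file and comparing it against a published list of flagged models. The sketch below is a simplified illustration; the blocklist, hash value and file name are placeholders, and as Portnoff notes, no such shared registry exists today.

```python
import hashlib
from pathlib import Path

# Placeholder blocklist: SHA-256 fingerprints of model checkpoints that a
# hypothetical clearinghouse has flagged as widely used for abuse imagery.
FLAGGED_MODEL_HASHES = {
    "3f79bb7b435b05321651daefd374cdc681dc06faa65e374e38337b88ca046dea",
}

def fingerprint(checkpoint: Path) -> str:
    """Compute the SHA-256 fingerprint of a model checkpoint file."""
    digest = hashlib.sha256()
    with checkpoint.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_flagged(checkpoint: Path) -> bool:
    """Return True if the checkpoint matches a known flagged model."""
    return fingerprint(checkpoint) in FLAGGED_MODEL_HASHES

if __name__ == "__main__":
    path = Path("old_model.ckpt")  # placeholder path to a downloaded checkpoint
    status = "matches a flagged model" if is_flagged(path) else "is not on the blocklist"
    print(f"{path} {status}")
```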

Source: wral.com