If there is one thing almost everybody knows about AI tools, it is that they require large amounts of data to train. For now, that data comes from the internet, a practice that is itself under debate. AI tools like ChatGPT have already been accused by many of using the work of authors, artists and publications without their consent. Now, it turns out, some companies are keen to sell their users’ data to these AI firms.
According to a Gizmodo report citing 404 Media, Automattic, the parent company of platforms such as WordPress and Tumblr, is in discussions to sell content from its sites to AI firms like Midjourney and OpenAI for training purposes.
While the specific details of the arrangement remain unclear, Automattic is emphasising to users that they will have the option to opt out of having their data used to train AI at any time.
According to the 404 Media report, there is internal disagreement within Automattic, with concerns raised about private content that was inadvertently scraped for AI training, contrary to the company’s intended practices. Adding complexity to the situation, advertising content not owned by Automattic, including materials from a previous Apple Music campaign, has reportedly found its way into the training dataset.
The report adds that, in the wake of the situation, a product manager at the company has started taking his photos down from Tumblr to ensure that they are not used to train AI.
As for the company’s promise to let users opt out of having their data shared, Automattic is set to unveil a new feature for that purpose, the report suggests.
In a blog post, the company outlined how this new feature will “give users more control” over their content.
“We’re doing a number of things at WordPress.com and Tumblr to give you more control over the content you’ve created,” the blog post says, as it talks about launching a setting to “discourage crawling by AI companies.”
The company says in the post, “We currently block, by default, major AI platform crawlers—including ones from the biggest tech companies—and update our lists as new ones launch.”
Describing an existing option for search engines, the post added, “We have a setting to discourage search engines from indexing a site on WordPress.com and Tumblr. This signals to search engines not to crawl that content or include it in search results.”
AI companies can also be restricted from using users’ content for training purposes. The post adds, “We have added similar settings to WordPress.com and Tumblr to discourage crawling by AI companies. If you already discourage search engine indexing, this is automatically enabled. We will share only public content that’s hosted on WordPress.com and Tumblr from sites that haven’t opted out.”
The company also noted in the post that it is actively collaborating with specific AI entities, ensuring alignment with community priorities such as attribution, opt-outs, and user control. All partnerships, the company said, will adhere to opt-out preferences, with additional efforts planned to provide regular updates to partners regarding individuals who opt out. The company also said that it aims to facilitate the removal of content from past sources and future training as requested by users.