TikTok’s parent company has launched a web scraping tool capable of gobbling up the world’s online data 25 times faster than OpenAI
ByteDance seems eager to make up for lost time, scouring the web for the data it needs to train its generative AI models.
According to research from Kasada, a company that specializes in managing bots for companies with online data, the China-based parent company of video app TikTok released a web crawler, or scraper bot, dubbed Bytespider, around April. The bot’s existence was also confirmed by Dark Visitors, which monitors scraper bots.
The research shows that ByteDance’s bot has quickly become one of the most aggressive scrapers on the internet, if not the most aggressive. It is collecting data at many times the rate of other large companies, such as Google, Meta, Amazon, OpenAI and Anthropic, which use their own scraper bots to help create and improve large language models or large multimodal models, known as LLMs or LMMs.
Sam Crowther, CEO of Kasada, said that since Bytespider came on the scene, it has collected data at about 25 times the rate of GPTBot, which scrapes data for OpenAI’s ChatGPT platform and underlying models. Bytespider is scraping at about 3,000 times the rate of ClaudeBot, from Anthropic, the company that operates the Claude platform.
According to Kasada, as the months passed, Bytespider became even more aggressive. Data shows a spike in crawling activity from Bytespider over the past six weeks.
Representatives for TikTok and ByteDance did not respond to emails seeking comment.
ByteDance’s aggressive scraping comes despite the possibility of TikTok being banned in the US in the coming months. President Joe Biden signed a law requiring ByteDance to either sell TikTok, due to national security concerns, or shut it down.
Bytespider, like the bots of OpenAI and Anthropic, does not respect robots.txt, the research shows. Robots.txt is a short text file of directives that publishers can place on a website that, while not legally binding in any way, is supposed to signal to scraper bots that they may not take data from that website.
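To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler would consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `may_fetch` helper, the `SomeOtherBot` name and the example.com URLs are all illustrative assumptions, not taken from Kasada's research or from any real publisher's site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve (an assumption for
# illustration): block two named AI scrapers, allow everyone else.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"
    for bot in ("Bytespider", "GPTBot", "SomeOtherBot"):
        verdict = "allowed" if may_fetch(bot, page) else "disallowed"
        print(f"{bot}: {verdict}")
```

Note that the check runs entirely inside the crawler. Nothing on the publisher's side enforces the answer, which is why a bot that skips this step can keep collecting pages regardless of what the file says.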
Web scraping has been around for decades, mainly because search engines crawl the web to index links to websites. But the emergence of generative AI tools has added a new dimension and made the practice a source of lawsuits and controversy. The people and organizations whose works were scraped claim that their copyrights are being violated in the process. All of the models underlying generative AI tools are trained on large amounts of online data, practically anything available on the web, especially written information. Tech companies use scraper bots to copy it all for free and include it in their data sets.
“It seems like they’re trying their best to catch up,” Crowther said of the aggressive scraping being done by Bytespider. Just last year, ByteDance was reportedly so far behind in the generative AI race that it was using OpenAI’s technology to help build its own LLM, which is against OpenAI’s terms of service. Earlier this year, ByteDance released a chat-based LLM named Doubao, but work on that model should have been completed before Bytespider accumulated more recent training data.
According to a person familiar with the company, it is “clear” that ByteDance is working on a new LLM. As for what ByteDance plans to do with the new LLM, a person familiar with the company’s ambitions said one goal involves search functionality for TikTok.
Last week, TikTok released an update to its current search function focused on keywords for advertising, essentially allowing advertisers to search for words that are trending on TikTok in real time. It allows marketers to create ads with relevant keywords that are likely to get them on the screens of more users.
A new AI model with data on more recent internet trends and topics could expand and improve TikTok’s search environment even further, according to the person familiar with the company’s ambitions.
“Given the audience and usage, TikTok as a search environment, a completely biddable space with keywords and topics, will be interesting to many people currently spending a lot of money on Google,” the person said.
Are you a TikTok or ByteDance employee, or do you have insight or tips to share? Contact Kali Hays securely through Signal at +1-949-280-0267 or at [email protected].