Joined 2 years ago
Cake day: July 7th, 2023


  • Some details. One of the major players doing the tar pit strategy is Cloudflare. They’re a giant in networking and infrastructure, and they use AI (the more traditional kind, not LLMs) ubiquitously to detect bots. So it is an arms race, but one where both sides have massive incentives.

    Nonsense is indeed detectable, but that misunderstands the purpose: economics. Scraping bots are used because they’re a cheap way to get training data. If a nonzero portion of that training data is poisonous, you have to spend increasingly many resources to filter it out. The better the nonsense, the harder it is to detect. Cloudflare is known to use small LLMs to generate the nonsense, so telling it apart requires systems at least that complex.

    So in short, the tar pit with garbage data decreases the average value of scraped data for bots that ignore do-not-scrape instructions.
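To put rough numbers on that economic argument, here is a toy model in Python. It is only a sketch: the function name, the parameters, and every number in it are made up for illustration and are not measurements from Cloudflare or any scraper. It assumes each clean page is worth a fixed amount, each poisoned page that slips through costs a fixed amount, and filtering has a per-page cost and catches some fraction of the poison.

```python
def net_value_per_page(poison_fraction, clean_value=1.0, poison_harm=1.0,
                       filter_cost=0.0, filter_recall=0.0):
    """Expected net value of one scraped page, in arbitrary units.

    All parameters are hypothetical:
      poison_fraction: share of scraped pages that are tar pit nonsense
      clean_value:     value of one genuine page
      poison_harm:     cost of one nonsense page that reaches training
      filter_cost:     per-page cost of running a filter
      filter_recall:   share of nonsense pages the filter removes
    """
    clean_share = 1.0 - poison_fraction
    surviving_poison = poison_fraction * (1.0 - filter_recall)
    return clean_share * clean_value - surviving_poison * poison_harm - filter_cost


if __name__ == "__main__":
    # Compare no poison, unfiltered poison, and poison with a costly filter.
    scenarios = {
        "no tar pit": net_value_per_page(0.0),
        "10% poison, no filter": net_value_per_page(0.1),
        "10% poison, filtered": net_value_per_page(
            0.1, filter_cost=0.05, filter_recall=0.9),
    }
    for name, value in scenarios.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the scraper is worse off than the no-tar-pit baseline whether it filters or not: either it eats the poison or it pays for a filter, and better nonsense pushes the filter cost up.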





  • I use Kagi. They provide access to all the main models in a chat interface and have a mode that feeds search engine results to them. It’s mostly replaced search engines for me. For programming work I find the models very useful with unfamiliar tools and libraries: I can ask what I want to do and it’ll generally tell me how, correctly. Importantly, the search engine mode has citations. $25 a month, but worth it.