Machine learning is inevitable for web scraping
By Aleksandras Šulženko
Machine learning recently experienced a revival of public interest with the launch of ChatGPT. While it has always produced some interesting results, such extensive chat functionalities have caught more attention than any previous machine learning accomplishment.
Businesses and researchers, however, have been working with these technologies for decades. Most large businesses, ranging from ecommerce platforms to AI research organizations, already use machine learning as part of their value proposition.
With the availability of data and the increasingly easy development of models, machine learning is becoming more accessible to all businesses and even solo entrepreneurs. As such, the technology will soon become far more widespread.
Web scraping’s unintentional effects
Automated bots are an inevitable part of the internet landscape. Search engines rely on them to find, analyze, and index new websites. Travel fare aggregators rely on similar automation to collect data and provide services to their customers. Many other businesses also run bots at various stages of their value-creating processes.
All of these processes make data gathering on the internet inevitable. Unfortunately, processing bot requests takes bandwidth and server resources, just as serving any regular internet user does. Unlike regular users, however, bots will never become consumers of a business's products, so the traffic they generate, while not malicious, carries little value for the website.
Couple that with the fact that some actors run malicious bots that actively degrade the user experience, and it is no surprise that many website administrators build various anti-automation measures into their websites. Differentiating between legitimate and malicious traffic is already difficult; differentiating between harmless and malicious bot traffic is considerably more troublesome.
So, to maintain a high-quality user experience, website owners implement anti-bot measures. At the same time, people running automation scripts look for ways to circumvent such measures, making it a constant cat-and-mouse game.
As the game continues, both sides start using more sophisticated technologies, including various implementations of machine learning algorithms. These are especially useful to website owners, as detecting bots through static-rule-based systems can be difficult.
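To make the contrast between static rules and learned detection concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two session features (requests per minute and timing jitter), the made-up training data, the 60-requests-per-minute limit, and the function names. A tiny logistic regression stands in for "machine learning" here; real anti-bot systems are far more elaborate and are not described in this article.

```python
import math

# Hypothetical per-session features: (requests per minute, timing jitter in seconds).
# Labels: 1 = bot, 0 = human. All values are invented for illustration.
TRAINING = [
    ((120.0, 0.05), 1),
    ((90.0, 0.10), 1),
    ((20.0, 0.02), 1),  # slow bot: low rate, but machine-regular timing
    ((15.0, 0.04), 1),
    ((25.0, 2.50), 0),
    ((10.0, 4.00), 0),
    ((30.0, 1.80), 0),
    ((18.0, 3.20), 0),
]


def static_rule(session, max_rate=60.0):
    """Static rule: flag a session only when its request rate exceeds a fixed limit."""
    rate, _jitter = session
    return rate > max_rate


def _scale(session):
    """Bring both features into a similar numeric range before training."""
    rate, jitter = session
    return (rate / 100.0, jitter)


def _sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(data, epochs=2000, lr=0.5):
    """Fit a two-feature logistic regression with plain stochastic gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for session, label in data:
            x = _scale(session)
            err = _sigmoid(w[0] * x[0] + w[1] * x[1] + b) - label
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
            b -= lr * err
    return w, b


WEIGHTS, BIAS = train_logistic(TRAINING)


def learned_rule(session):
    """Learned rule: flag a session when the model's bot probability is at least 0.5."""
    x = _scale(session)
    return _sigmoid(WEIGHTS[0] * x[0] + WEIGHTS[1] * x[1] + BIAS) >= 0.5
```

In this toy setup, a slow but perfectly regular session such as `(12.0, 0.03)` slips past `static_rule` because its rate stays under the limit, while `learned_rule` flags it because the model has picked up on timing regularity. The same mechanism also flags a harmless scraper with similar timing, which is the problem the article goes on to describe.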
While web scraping largely stands at the sidelines of these battles, scrapers still get hit by the same bans because websites do not invest much into differentiating between bots. As the practice has become more popular over the years, the impact has been rising in tandem.
As such, web scraping has unintentionally pushed businesses to develop more sophisticated anti-bot technologies that are intended to catch malicious actors. Unfortunately, the same net works equally well on scraping scripts.
Conclusion
Web scraping has unintentionally caused significant leaps in website security and machine learning development. It has also made gathering large training datasets from the web much easier. As the industry continues to work towards further optimization, machine learning models will become an integral part of data acquisition.
With these changes occurring, machine learning will inevitably have to be applied to web scraping to improve efficiency across the board and minimize the risk of losing access to data. So, web scraping itself pushes others to develop improved machine learning models, creating a feedback loop.
https://www.techradar.com/opinion/machine-learning-is-inevitable-for-web-scraping