A frequent approach found across most of the protections against information harvesting is to make it difficult for an automated system to navigate the application by somehow distinguishing automated from human activity. Most of these protections understand automated navigation as the process of fetching a page, parsing its contents, and extracting the target URLs before starting the process over. Some additionally fingerprint what is called the expected navigation flow and behavior, aiming to detect abnormal activity.
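The fetch-parse-extract loop these protections assume can be sketched in a few lines of standard-library Python; the names (`extract_links`, `crawl`) and the breadth-first strategy are our own illustration, not code from the talk:

```python
# Minimal sketch of the classic fetch-parse-extract crawl loop that
# such protections target. Names and structure are illustrative only.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(html, base_url):
    """Parse a page and return its outgoing links as absolute URLs."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]

def crawl(seed, limit=10):
    """Fetch, parse, extract, and start over -- the pattern defenses watch for."""
    queue, seen = [seed], set()
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", errors="replace")
        queue.extend(extract_links(html, url))
    return seen
```

A crawler in this style never executes JavaScript or keeps realistic session state, which is exactly what gives the fingerprinting techniques above something to detect.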
In recent years, test-driven development (TDD) tools have provided a novel and practical way to interact with web browsers programmatically, enabling developers and testers to take advantage of the browser's power through easy-to-write automation scripts when developing and testing web applications. In this talk we will show how test-driven development can be used to write a new generation of web crawlers capable of using the most powerful tool available for the purpose: the web browser. We also present a target-based solution that works in a real-world scenario.
The techniques described in the talk will shed some light on how information can be harvested by driving a browser natively, as a user would. They make use of Selenium WebDriver, a suite of tools for automating web browsers, together with Python and Mozilla Firefox.
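As a sketch of the idea rather than the exact code presented in the talk, the loop below drives Firefox through Selenium WebDriver and harvests links from the fully rendered DOM; the function names and the `same_origin` filter are our own assumptions, and the third-party `selenium` package plus geckodriver must be installed:

```python
# Hedged sketch: drive Firefox natively via Selenium WebDriver and harvest
# links from the rendered page, as a user's browser would see it.
# Requires the third-party `selenium` package and geckodriver; helper and
# function names are illustrative, not taken from the talk.
from urllib.parse import urlparse

def same_origin(base, url):
    """Keep the crawl on the target site (scheme and host must match)."""
    b, u = urlparse(base), urlparse(url)
    return (b.scheme, b.netloc) == (u.scheme, u.netloc)

def browser_crawl(seed, limit=10):
    """Fetch pages with a real browser so scripts, cookies and redirects
    behave exactly as they would for a human visitor."""
    from selenium import webdriver  # third-party; imported lazily here

    driver = webdriver.Firefox()
    queue, seen = [seed], set()
    try:
        while queue and len(seen) < limit:
            url = queue.pop(0)
            if url in seen or not same_origin(seed, url):
                continue
            driver.get(url)  # full page load, JavaScript executed
            seen.add(driver.current_url)
            for anchor in driver.find_elements("tag name", "a"):
                href = anchor.get_attribute("href")
                if href:
                    queue.append(href)
    finally:
        driver.quit()
    return seen
```

Because the browser itself handles JavaScript, cookies, redirects and timing, the traffic this produces is far harder to separate from genuine human navigation than that of a raw HTTP crawler.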
We conclude that the techniques analyzed, aimed at limiting information harvesting, were not effective at stopping a web crawler built on the premises presented here. Additional mitigations are discussed as a simple way to make the application flow less predictable and more robust against information harvesting.
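One mitigation in that spirit, sketched here under our own assumptions rather than as a scheme prescribed by the talk, is to tie every navigation URL to a per-session secret so that valid links cannot be predicted, enumerated, or replayed across sessions:

```python
# Illustrative sketch of one possible mitigation (our assumption, not a
# scheme from the talk): sign each link with a per-session secret so the
# application flow cannot be predicted or enumerated out of band.
import hashlib
import hmac
import secrets

def new_session_key():
    """A fresh random secret, stored server-side for one session."""
    return secrets.token_bytes(32)

def sign_path(session_key, path):
    """Append an HMAC of the path; the app emits only signed links."""
    tag = hmac.new(session_key, path.encode(), hashlib.sha256).hexdigest()
    return f"{path}?t={tag}"

def verify_path(session_key, signed):
    """Reject any URL whose token does not match this session's key."""
    path, _, tag = signed.partition("?t=")
    expected = hmac.new(session_key, path.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)
```

A crawler holding a valid session can still follow the links it is shown, so this does not stop harvesting outright; it only removes the ability to guess or precompute URLs, making the navigation flow less predictable, which is the property the conclusion argues for.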