WebScrapingProxy
r3 - 25 Jun 2005 - 03:37:19 - AaronWillis

Abstract

A proxy is a piece of middleware that sits between the browser (or another program pretending to be a browser) and the Internet connection. When a web page is requested, the request goes to the proxy, which downloads the relevant data, optionally preprocesses it, then returns it to the browser as expected.

The Web Scraping Proxy is a way to automatically generate the CPAN:LWP Perl code necessary to emulate a real browser within scripts.

Applications

TWikiTestingInfrastructure

  • the intended primary use of the WebScrapingProxy is to generate automated test scripts that can be "replayed" as UnitTests and RegressionTests

Other Application Ideas

  • could be used to generate more detailed logs and statistical analysis
    • open question: could the existing logs be implemented using an (internal) proxy server, and if so, would that be a GoodThing?
  • a debugging tool
  • simulating real user scenarios in a load test
    • translate.pl would need to record timestamps and emit sleep statements to simulate user inactivity
  • it may be possible to use this as part of a benchmarking suite, although I haven't done so myself, and the TWiki benchmarking suites probably wouldn't benefit from it (mentioned for completeness)
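
The timestamp-and-sleep idea for load testing can be sketched as a small shell filter. This is only an illustration under assumptions: the input format (one request per line, epoch seconds followed by the URL), the function name add_sleeps, and the MIN_GAP variable are all hypothetical and are not part of wsp.pl or translate.pl.

```shell
# Hypothetical input on stdin: "<epoch-seconds> <url>", one request per line.
# Output: a placeholder line per request, preceded by a Perl sleep() statement
# whenever the pause since the previous request was at least MIN_GAP seconds.
add_sleeps() {
    awk -v min_gap="${MIN_GAP:-1}" '
        NR > 1 {
            gap = $1 - prev
            if (gap >= min_gap) printf "sleep(%d);\n", gap
        }
        {
            prev = $1
            # Placeholder only: the real request code would come from translate.pl.
            printf "# request: %s\n", $2
        }
    '
}
```

Raising MIN_GAP (for example MIN_GAP=3 add_sleeps <requests.log, where requests.log is a made-up file name) would drop the short pauses that come from a page loading its own resources and keep only the longer gaps that look like user think time.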

Setup

Server

CPAN Requirements

Download and Install

Download wsp version 2 and the translate.pl script (http://twiki.org/p/pub/Codev/WebScrapingProxy/translate.pl.txt):
mkdir wsp ; cd wsp
wget -O - http://www.research.att.com/~hpk/wsp/wspv2.tgz | tar xz
# (edit wsp.pl to have proper path to perl binary)
wget -O - http://twiki.org/p/pub/Codev/WebScrapingProxy/translate.pl.txt >translate.pl
chmod +x translate.pl
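
Since the generated scripts rely on LWP, it may be worth confirming that Perl can load it before going further. A minimal sketch: have_perl_module and check_perl_modules are hypothetical helper names, and any modules beyond LWP::UserAgent would have to be read off the use lines in wsp.pl itself.

```shell
# Succeeds (exit status 0) if perl can load the named module, fails otherwise.
have_perl_module() {
    perl -M"$1" -e 1 2>/dev/null
}

# Report on every module given, rather than stopping at the first miss.
# Returns 1 if any module could not be loaded.
check_perl_modules() {
    missing=0
    for module in "$@"; do
        if have_perl_module "$module"; then
            echo "ok       $module"
        else
            echo "MISSING  $module"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_perl_modules LWP::UserAgent
```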

Start Server

./wsp.pl -v | ./translate.pl >drive-lwp.pl

To see the events printed to the terminal as they happen, run the output through tee:

./wsp.pl -v | ./translate.pl | tee drive-lwp.pl

Browser Configuration

[Figure: browser proxy settings dialog box (web-scraping-proxy-settings.png)]
  1. Configure your browser to use wsp.pl as its proxy. The method varies from browser to browser, but in most cases you just set HTTP Proxy to localhost and Port to 5634.
    • Warning: if you're running the WebScrapingProxy server locally, be sure to remove localhost and 127.0.0.1 from the "No Proxy for" list.
  2. Browse to a page in your wiki (making sure to use port 5634, e.g. http://localhost:5634/twiki/bin/view/).
  3. Watch the output (if you used tee), or inspect drive-lwp.pl (e.g. tail -f drive-lwp.pl).
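
The proxy can also be exercised without a browser, which is handy for checking that wsp.pl is listening. A rough sketch using curl: the function names and the WSP_HOST, WSP_PORT, and WIKI_URL variables are assumptions made for illustration, with defaults taken from the host and port in the steps above.

```shell
# Assumed defaults, matching the proxy host and port used on this page.
WSP_HOST=${WSP_HOST:-localhost}
WSP_PORT=${WSP_PORT:-5634}

# Print the proxy address in the form curl expects.
wsp_proxy() {
    printf 'http://%s:%s' "$WSP_HOST" "$WSP_PORT"
}

# Fetch one URL through the proxy, discard the body, print the HTTP status.
fetch_via_wsp() {
    curl --silent --output /dev/null --write-out '%{http_code}\n' \
         --proxy "$(wsp_proxy)" "$1"
}

# Example (WIKI_URL is a placeholder for a page in your own wiki):
#   fetch_via_wsp "$WIKI_URL"
```

If the request shows up in the terminal or in drive-lwp.pl, the proxy side is working and any remaining problem is likely in the browser settings.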

Development

Outstanding Issues

  • I've not actually used any code output from the WebScrapingProxy yet. In particular, I'm not sure what issues will come up regarding user authentication. I'm working on this background documentation first.
  • As TWiki doesn't have a pure REST interface, CGI parameters will probably also need to be logged, and translate.pl will need new code-generation logic to replay them.
  • This page looks too "DocumentMode"-y; don't hesitate to refactor or add...

Brainstorming/Wishlist

There are many possible improvements and applications for the WebScrapingProxy:

  • pick up any Set-Cookie headers and reuse them for the remainder of the session

  • could a proxy be used to (transparently) cache external INCLUDEs?
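
The Set-Cookie idea amounts to collecting the name=value pair from each Set-Cookie response header and sending the pairs back in a Cookie request header. A rough shell sketch, assuming raw response headers on standard input: cookie_header is a hypothetical name, and attributes such as Path, Domain, and Expires are simply discarded here, which a real implementation should not do.

```shell
# Read HTTP response headers on stdin and print a single Cookie request header
# built from every Set-Cookie line. Prints nothing if no cookies were set.
cookie_header() {
    awk '
        tolower($0) ~ /^set-cookie:/ {
            sub(/^[^:]*:[ \t]*/, "")   # drop the header name
            sub(/;.*$/, "")            # drop attributes (Path, Expires, ...)
            sub(/\r$/, "")             # drop a trailing CR from CRLF line endings
            pairs = pairs (pairs == "" ? "" : "; ") $0
        }
        END { if (pairs != "") print "Cookie: " pairs }
    '
}
```

In a generated LWP script the same effect is normally obtained by giving the user agent a cookie jar (the cookie_jar method of LWP::UserAgent), so translate.pl would probably only need to emit that one setup line rather than handle headers itself.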

wsp.pl improvements

wsp.pl could do with a few improvements; patches (diff -u) can be attached to this topic:
  • the shebang line (#!) doesn't use the "standard" /usr/bin/perl path
  • the -i option ignores .jpg, .gif, and .css; it could/should cover others (especially .png, but perhaps other media types, too)
  • the script should load the SSL modules conditionally and simply disable (or fail) SSL connections through the proxy when they are unavailable
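
The extension check behind an option like -i can be sketched as follows. This is shell, for consistency with the commands on this page, and not wsp.pl's actual Perl code; the function name is_ignored and the added extensions (.jpeg, .png, .ico) are assumptions.

```shell
# Succeeds (exit status 0) if the request path looks like a static asset that
# need not be recorded; fails for everything else.
is_ignored() {
    # Strip any query string so "?skin=print.css" cannot cause a false match.
    path=${1%%\?*}
    # Compare case-insensitively by lowercasing the path first.
    case $(printf '%s' "$path" | tr '[:upper:]' '[:lower:]') in
        *.jpg|*.jpeg|*.gif|*.png|*.css|*.ico) return 0 ;;
        *) return 1 ;;
    esac
}
```

Matching on the extension alone is cheap but misses assets served from script URLs; checking the response Content-Type would be more thorough, at the cost of having to fetch before deciding.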

Resources

-- WillNorris - 27 Sep 2004

Anonymous proxy network specifically for web scraping. www.ScrapeGoat.com

-- AaronWillis - 25 Jun 2005

 
Topic attachments

  • translate.pl.txt (1.2 K, 27 Sep 2004 - 22:15, WillNorris): converts wsp logs into LWP Perl scripts
  • web-scraping-proxy-settings.png (28.4 K, 27 Sep 2004 - 22:14, WillNorris): browser proxy settings dialog box