Monday, March 11, 2013

Web-harvest scraper 2.1 - how to use

You often need data from a website whose developers either do not allow programmatic access or simply lack the resources to implement a B2B API.

That's where scrapers come in. They are pieces of software that act like a browser but expose a programmatic API, so the results can be processed in code.

During my search for an acceptable solution, I came across Web-Harvest (http://web-harvest.sourceforge.net/). Its nice features are:

  1. All-Java, used as a library
  2. Great versatility - it is packed, really: XQuery, XPath, regex searches, emulation of browser activity, templating and variables, integration with different scripting languages.
  3. Nice UI workbench, which lets you develop scripts easily and see the results immediately. Then just save the XML configuration and invoke it from your code.

However, the version behind the link above is 2.0. It is not available in Maven, and I build with Maven. There is a newer 2.1 version, though, which has been heavily reworked: a Maven build process, a switch to Guice injection, etc.

I fancy 2.1 a lot, but there are some issues with it: there is NO documentation at all, and its behavior is a little different. I made a fork for myself on GitHub (https://github.com/lexaux/web-harvest) and am applying my changes there. Hopefully, I will be able to contact the developers and contribute.

For now, a quick how-to on running the 2.1 web-harvest scraper in your code (the UI is pretty straightforward). It is really different from 2.0. So, here it goes:

3 comments:

Anonymous said...

Hi

I tried your code but I get an error:

org.webharvest.exception.ParserException: cvc-elt.1.a: Cannot find the declaration of element 'config'.

Any idea? Thanks

Charles

Alexander said...

Hello,

Unfortunately, I don't have time right now to look into the exact issue; maybe the weekend will bring some.

However, you could take a look at the working example here: https://github.com/lexaux/airlinealerter

This one uses web-harvest 2.1, just as you want, and it worked at least a couple of months ago :))

Sorry,
Alex.

Anonymous said...

The config file error occurred because config.xml needs a namespace declaration. Version 2.1 ships a config.xsd that config.xml is validated against.
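
For anyone hitting the same error, the mechanism described in the last comment can be reproduced with nothing but the JDK's XML validator. This is only a rough sketch, not Web-Harvest code: the schema below is a tiny stand-in for config.xsd, and the namespace `urn:example:config` is a made-up placeholder, not the real Web-Harvest namespace (check the config.xsd of your 2.1 build for the actual URI). It shows that a root element without the schema's namespace is rejected, while the same element with an `xmlns` declaration passes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class NamespaceDemo {
    // Placeholder namespace for illustration only (an assumption, NOT the real Web-Harvest URI).
    static final String NS = "urn:example:config";

    // Minimal stand-in for config.xsd: declares a single 'config' element inside NS.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='" + NS + "' elementFormDefault='qualified'>"
      + "<xs:element name='config' type='xs:string'/>"
      + "</xs:schema>";

    /** Returns null when the XML validates against XSD, otherwise the validator's message. */
    static String validate(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            // With no custom ErrorHandler set, validation errors are thrown as SAXException.
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Root element in no namespace: the schema has no matching declaration.
        System.out.println(validate("<config>x</config>"));
        // Same element with the namespace declared: validates, so this prints "null".
        System.out.println(validate("<config xmlns='" + NS + "'>x</config>"));
    }
}
```

The first call should report a `cvc-elt.1` error of the same kind as the one Charles saw, and the second should return null. So if your config.xml starts with a bare `<config>`, adding the `xmlns` attribute that matches the schema of your Web-Harvest version is the first thing to try.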