
Web Scraping

Hello, I very recently found the Rebol language and I'd like to ask whether this tool is well suited to web scraping. I want to write a program that:
1. Connects to a webpage: http://www.oanda.com/currency/historical-rates
2. Fills 2 date fields
3. Fills 2 text fields
4. Clicks the Get Table button
5. Waits for the requested data
6. Captures the numeric data just after "High:" into a variable
7. Captures the numeric data just after "Low:" into a second variable.
Do you think this is possible with Rebol 2 or 3?
Any hint would be welcome.

posted by:   Francisco       24-Aug-2010/7:30:56-7:00

Hi, Rebol can do this very easily.
You need to POST your data as below (note that I'm too lazy to do the whole job; just get the idea):
t: read/custom http://www.oanda.com/currency/historical-rates [POST "exch2=usd&date1=06/07/08&date=09/07/08&..."]
Add any other form elements in place of the "...", and t will hold the server response.
Then parse t. Here is an example of parse:
t: {
     <TD BGCOLOR="#DFDFDF"><FONT FACE="Verdana" size="2">  High:</FONT></TD>
     <TD BGCOLOR="#EFEFEF"><FONT FACE="Verdana" size="2"> 0.79060 </FONT></TD>
     <TD BGCOLOR="#DFDFDF"><FONT FACE="Verdana" size="2">  Low:</FONT></TD>
     <TD BGCOLOR="#EFEFEF"><FONT FACE="Verdana" size="2"> 0.77410 </FONT></TD>
}
>> parse/all t [some [thru {<FONT FACE="Verdana" size="2">} copy v to {<} (print v)]]
;result: prints "  High:", " 0.79060 ", "  Low:" and " 0.77410 " in turn
Hope this will help. Sorry for my laziness.

posted by:   Endo       24-Aug-2010/8:11:11-7:00
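For readers coming from other languages, the same thru/copy/to extraction idea can be sketched with a regular expression in Python (the sample HTML is copied from the post above):

```python
import re

# Same idea as the parse rule above: repeatedly skip to the Verdana FONT
# open tag, then copy everything up to the next "<".
t = '''
<TD BGCOLOR="#DFDFDF"><FONT FACE="Verdana" size="2">  High:</FONT></TD>
<TD BGCOLOR="#EFEFEF"><FONT FACE="Verdana" size="2"> 0.79060 </FONT></TD>
<TD BGCOLOR="#DFDFDF"><FONT FACE="Verdana" size="2">  Low:</FONT></TD>
<TD BGCOLOR="#EFEFEF"><FONT FACE="Verdana" size="2"> 0.77410 </FONT></TD>
'''
values = [m.strip() for m in re.findall(r'<FONT FACE="Verdana" size="2">([^<]*)<', t)]
print(values)  # ['High:', '0.79060', 'Low:', '0.77410']
```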

@Nick: my post looks so weird because it has some HTML parts. Could you fix that?

posted by:   Endo       24-Aug-2010/8:13:58-7:00

Hello Endo:
Thanks for your help.
Could you please post the part of the string that clicks the "Get Table" button?
In another .NET development platform I had to use:
Uri,String: javascript:document.getElementById('SUBMIT').click();

posted by:   Francisco       24-Aug-2010/9:11:52-7:00

Hi, I don't simulate clicking the Get Table button; instead, I send an HTTP request and get the result.
Open the page and look at its source: you'll see the <form ...> tag and other form elements like <input name=date1 ...>, so you can get the necessary names of the form elements and fill them in your request:
t: read/custom http://example.com [POST {date1="010203"&currency="usd"}] ;etc..
So your request goes to the server and t will hold the server response; then you can parse t.
Also look at this document: http://www.rebol.org/art-display-article.r?article=x60w

posted by:   Endo       24-Aug-2010/10:35:11-7:00
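Hunting for those field names in the page source can itself be automated. A minimal sketch in Python's standard HTML parser (the HTML snippet below is invented for illustration, not the real Oanda form):

```python
from html.parser import HTMLParser

# Collect the name attribute of every <input> tag, so you know which
# fields to include in the POST body.
class InputNames(HTMLParser):
    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            for key, value in attrs:
                if key == "name":
                    self.names.append(value)

# Made-up stand-in for the real page source:
page = '<form method="POST"><input name="date1"><input name="date"><input name="exch2"></form>'
p = InputNames()
p.feed(page)
print(p.names)  # ['date1', 'date', 'exch2']
```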

I'm still reading about the parser... (serious stuff!). But as a preliminary test, I've used the following line in the Rebol console:
t: read/custom http://www.oanda.com/currency/historical-rates [POST "exch2=eur&expr2=usd&date1=01/01/10&date=01/02/10"]
After that, I sent t to the editor:
editor t
I found the raw data from the first screen (the query form) there, but the result screen is missing.
Do you see anything wrong in the code sent to the web server?

posted by:   Francisco       24-Aug-2010/14:43:02-7:00

Here's an example string that works:
     t: read/custom http://www.oanda.com/currency/historical-rates [POST {lang=en&result=1&date1=08%2F18%2F10&date=08
To get that, instead of wading through the source, I saved the page, changed POST to GET in the source, opened the edited page, clicked the "Get Table" button, and copied the string from the URL bar :)

posted by:   Nick       24-Aug-2010/17:21:54-7:00
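The %2F pairs in that string are just URL-encoded slashes: "/" is not safe in a form body, so it gets percent-encoded. A Python sketch of building such a body from a dict, which does the encoding for you (the second date is invented for illustration):

```python
from urllib.parse import urlencode

# Field names taken from the thread; the "date" value is a made-up example.
params = {"lang": "en", "result": "1", "date1": "08/18/10", "date": "08/25/10"}
body = urlencode(params)
print(body)  # lang=en&result=1&date1=08%2F18%2F10&date=08%2F25%2F10
```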

Since you're communicating with REBOLers here, a sure-fire way to send HTML code, with no worry about formatting errors, is to just post the compressed string:
     editor decompress #{

posted by:   Nick       24-Aug-2010/17:32:43-7:00

In this situation, you can use REBOL nicely for your needs. If you ever need to do GUI-level page scraping (actually driving a browser and clicking buttons), AutoIt is a really useful tool (Windows only).

posted by:   Nick       24-Aug-2010/23:07:50-7:00

First of all, you should include all the form elements, because they may be needed by the server:
p: [POST "exch2=eur&expr2=usd&date1=01/01/10&date=01/02/10&result=1&date_fmt=dd/mm/yy&exch=EUR&expr=USD&margin_fixed=0&name=format=HTML"]
url: http://www.oanda.com/currency/historical-rates
editor read/custom url p
print read/custom url p
This will get the correct results. Parsing is a little bit difficult at first glance, but then you realize that it is really powerful.

posted by:   Endo       25-Aug-2010/2:57:10-7:00

Small fix:
p: [POST {exch2=eur&expr2=usd&date1=01/01/10&date=01/02/10&result=1&date_fmt=dd/mm/yy&exch=EUR&expr=USD&margin_fixed=0&format=HTML}]
There was a "name=format=HTML"; it should be "format=HTML".

posted by:   Endo       25-Aug-2010/2:59:37-7:00
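For comparison, Endo's corrected POST can be sketched in Python by building the body from a dict and wrapping it in a request object (field names and values copied from the post above; the request is deliberately not sent here, since that would hit the live server):

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {
    "exch2": "eur", "expr2": "usd",
    "date1": "01/01/10", "date": "01/02/10",
    "result": "1", "date_fmt": "dd/mm/yy",
    "exch": "EUR", "expr": "USD",
    "margin_fixed": "0", "format": "HTML",
}
body = urlencode(params).encode()
req = Request("http://www.oanda.com/currency/historical-rates",
              data=body, method="POST")
print(req.get_method())  # POST
```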

Endo & Nick: Thanks a lot... I'm still "processing" the parsing section of the tutorials.

posted by:   Francisco       25-Aug-2010/9:27:33-7:00

For what you want to do with parse, it's really not too complex. To extract strings from a larger block of text, using start and end search strings, memorize this syntax:
parse (your-text-block) [
     any [thru {start} copy variable to {end} (actions with variable)] to end
]
Here's an example from http://re-bol.com/examples.txt :
REBOL []
code: {
     view layout [
         btn {some text}
         btn {some more text}
     ]
}
strings: copy []
parse code [
     any [thru "{" copy a-string to "}" (append strings a-string)] to end
]
foreach str strings [
     if true = request rejoin [{Change "} str {"?}] [
         replace/all code str request-text/title rejoin [
             {Change "} str {" to:}
         ]
     ]
]
editor code

posted by:   Nick       25-Aug-2010/10:16:05-7:00
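The skip-copy-collect loop in that example maps directly onto a non-greedy regular expression; a minimal Python equivalent (same sample strings as above):

```python
import re

# Same pattern as the parse rule: repeatedly skip to "{", copy up to "}",
# and collect each match.
code = """
view layout [
    btn {some text}
    btn {some more text}
]
"""
strings = re.findall(r"\{(.*?)\}", code)
print(strings)  # ['some text', 'some more text']
```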

Thanks for your support; with your help I could load the variables high and low with the correct numeric values taken from the webpage. There is still a lot more to do, as this is the core of a financial forecasting tool that has to retrieve the weekly ranges for a number of weeks in the past and, after some calculations, apply the results to the current live data.
In my book the retrieval part of the problem was the hardest, and with your help and Rebol it has been a piece of cake.

posted by:   Francisco       25-Aug-2010/19:53:45-7:00

By the way, if you need other values instead of high & low, you can use this:
get the result in CSV format and parse it (<PRE>(take-this-part)</PRE>); t will be:
t: {08/20/2010,0.77960
     ...}
b: map-each v extract next parse t none 2 [to-decimal v]
This will get all the decimal values for the selected week:
>> sort b
== [0.7796 0.7853 0.7871 0.7871 0.788 0.7904 0.7908]
>> first b
== 0.7796
>> last b
== 0.7908
>> first maximum-of b
== 0.7908
>> first minimum-of b
== 0.7796
etc. Hope this will help.

posted by:   Endo       26-Aug-2010/3:23:19-7:00
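The extract/map-each trick above boils down to "keep the numeric column of the date,value pairs". A Python sketch of the same idea (only the first date and a few of the values come from the thread; the other dates are invented to fill out sample data):

```python
# Split the CSV into lines, keep the value after each comma, and sort,
# so the first element is the week's low and the last is the high.
t = """08/20/2010,0.77960
08/21/2010,0.78530
08/22/2010,0.79040"""
values = sorted(float(line.split(",")[1]) for line in t.splitlines())
print(values)        # [0.7796, 0.7853, 0.7904]
print(values[0])     # low
print(values[-1])    # high
```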

"take this part" is take the text inside PRE tags.

posted by:   Endo       26-Aug-2010/3:25:35-7:00