XQuery/Wikipedia Lookup

Page scraping is one way to retrieve a specific fact from a web page, provided the page's structure is stable.

Here the task is to use Wikipedia to find the Latin name for a bird, given its common name.

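declare namespace h = "http://www.w3.org/1999/xhtml";

(: common name of the bird, passed as a request parameter (eXist request module) :)
let $name := request:get-parameter("name",())
(: escape-uri() comes from a pre-Recommendation draft of XQuery;
   iri-to-uri() is the standard function with the same effect as escape-uri(...,false()) :)
let $url := iri-to-uri(concat("http://en.wikipedia.org/wiki/",$name))
(: doc() only succeeds if the page is returned as well-formed XML :)
let $page := doc($url)
(: the value is the second cell of the row that has a cell reading 'Genus:' or 'Species:' :)
let $genus := $page//h:tr[h:td[. ='Genus:']]/h:td[2]
let $species := $page//h:tr[h:td[. ='Species:']]/h:td[2]
(: the binomial name is the bold text in the rows following the 'Binomial name' header row :)
let $binomial := string($page//h:tr[h:th//h:a[.='Binomial name']]/following-sibling::h:tr//h:b)
return 
   <bird name="{$name}" genus="{$genus}" species="{$species}" binomial="{$binomial}"/>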

Assuming the page follows the Bird page format, the paths that locate the required data are fairly complex XPath expressions. For example, the genus is the second cell of a table row whose first cell reads 'Genus:'.
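
To see how the row lookup behaves without fetching anything, the same paths can be run against a small inline fragment. This is only a sketch: the table below is a hypothetical, hand-written stand-in for a bird infobox, and the real Wikipedia markup may well differ.

```xquery
xquery version "1.0";
declare namespace h = "http://www.w3.org/1999/xhtml";

(: Hypothetical fragment standing in for the infobox of a bird page;
   it is an assumption for illustration, not a copy of real Wikipedia markup. :)
let $page :=
  <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
      <table>
        <tr><td>Genus:</td><td><i>Cygnus</i></td></tr>
        <tr><td>Species:</td><td><i>C. atratus</i></td></tr>
      </table>
    </body>
  </html>

(: pick the row containing a cell that reads 'Genus:', then take that row's second cell :)
let $genus := $page//h:tr[h:td[. = 'Genus:']]/h:td[2]
let $species := $page//h:tr[h:td[. = 'Species:']]/h:td[2]
return
  <bird genus="{$genus}" species="{$species}"/>
```

The predicate h:tr[h:td[. = 'Genus:']] filters rows by the text of a cell, and the trailing h:td[2] steps to the value cell of the matching row. Placing a node inside an attribute value uses its string value, so the markup around the name (here the i element) is dropped.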

Example: Black Swan (Wikipedia)

The script often fails because:

the name is ambiguous, for example Thrush (Wikipedia)
the name is too broad, for example Kiwi (Wikipedia)
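
One way to make such failures visible is to test whether anything was found before building the result. The following is a minimal sketch, not part of the original script: local:lookup and the not-found element are hypothetical names, and the two inline pages are hand-written stand-ins for a bird page and a page without a binomial name.

```xquery
xquery version "1.0";
declare namespace h = "http://www.w3.org/1999/xhtml";

(: Hypothetical helper: returns <bird> when a binomial name is found in the page,
   and <not-found> otherwise. :)
declare function local:lookup($name as xs:string, $page as node()?) as element() {
  (: take only the first bold element, so that string() never receives
     more than one item :)
  let $binomial := normalize-space(string(
      ($page//h:tr[h:th//h:a[. = 'Binomial name']]/following-sibling::h:tr//h:b)[1]))
  return
    if ($binomial ne '')
    then <bird name="{$name}" binomial="{$binomial}"/>
    else <not-found name="{$name}"/>
};

(: Hypothetical stand-in for a page in Bird page format :)
let $bird-page :=
  <html xmlns="http://www.w3.org/1999/xhtml"><body><table>
    <tr><th><a>Binomial name</a></th></tr>
    <tr><td><b>Cygnus atratus</b></td></tr>
  </table></body></html>
(: Hypothetical stand-in for an ambiguous or overly broad page with no infobox :)
let $other-page :=
  <html xmlns="http://www.w3.org/1999/xhtml"><body><p>Thrush may refer to:</p></body></html>
return
  <results>{
    local:lookup("Black_Swan", $bird-page),
    local:lookup("Thrush", $other-page)
  }</results>
```

With these two fragments the first call yields a bird element and the second a not-found element. The check only reports that the expected structure was missing; it cannot tell an ambiguous name from one that is too broad.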

It is not hard to see that more semantic markup with ontological relationships would be preferable to these uncertain contortions.

This article is issued from Wikibooks. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.