Processing HTML and XML

Introduction

Automating xml and html processing is an important goal of Xillio Content Tools. You can crawl and scrape websites, extract exactly the parts of content you need from pages, APIs or feeds, and let robots build new xml or clean up and change html. This article aims to give an overview of the possibilities on this subject and to explain how to use the built-in html/xml functionality.

Variable types

Html and xml each have their own type in Xillio Content Tools, although they can be used in largely the same way. Html pages are of type NODE, and the xml type is simply called XML.
Figure: html and xml variable types in the debugger view

Scraping webpages or xml documents

For data collection or website migration you might need robots that extract complete pages or specific parts from html or xml documents.

Loading html or xml from the web

Loading a web document is very simple in itself. Open a new robot and put the following code in it:

html_page = loadpage("http://www.google.com");
xml_doc = loadpage("http://www.omdbapi.com/?t=back+to+the+future&y=&plot=full&r=xml");
log(html_page);
Now put a breakpoint on the second line, press play and look at the debug panel. You should see the NODE variable. Note that the preview has two tabs: Source and Web. The Web tab is only meant to give you a quick glance at which page you got. Don't rely on it for debugging, since it is rendered with a different engine than the internal page variable. The Source tab is the one you need.
Now take another step (with the step-in button or F9) and you should see the XML variable. This shows that the type returned by the loadpage() function depends on the content found at the specified url.
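
To make the idea of a content-dependent result more concrete, here is a rough sketch in Python (not Xill) of a loader that returns a different kind of object depending on what it is given. It is only an analogy: how loadpage() actually decides between NODE and XML is not described in this article, and the load_document function, the TitleParser class and the content-type rule below are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical stand-in for an html page object: keeps the <title> text."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only collect text that appears between <title> and </title>.
        if self._in_title:
            self.title += data


def load_document(content_type, body):
    """Return ("XML", element) or ("NODE", parser) based on the content type.

    Assumption: anything that mentions xml but not html is treated as xml,
    everything else as html. The real loadpage() may decide differently.
    """
    lowered = content_type.lower()
    if "xml" in lowered and "html" not in lowered:
        return "XML", ET.fromstring(body)
    parser = TitleParser()
    parser.feed(body)
    parser.close()
    return "NODE", parser


# Small in-memory documents, so the sketch runs without network access.
html_body = "<html><head><title>Example</title></head><body><p>Hi</p></body></html>"
xml_body = '<root response="True"><movie title="Back to the Future"/></root>'

html_kind, html_doc = load_document("text/html; charset=utf-8", html_body)
xml_kind, xml_doc = load_document("text/xml", xml_body)
print(html_kind, html_doc.title)
print(xml_kind, xml_doc.find("movie").get("title"))
```

The caller uses the same function for both documents and only finds out afterwards which kind of object came back, which is the same situation you see in the debug panel after each step.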

There's a lot more to navigating with Xill, like the click() and input() functions. This is beyond the scope of this article, but you can read about it in the web navigation tutorial.

Opening xml or html from disk

Loading from the local file system can be done by using loadxml,

Extracting information

To extract

Crawling

Processing XML

Building or changing xml/html