GrabQL, a query language for data scraping

Some time ago I wrote several PHP console applications that grabbed data from other sites. My clients needed to collect information from different sources and render it in a certain way, and that is when I started thinking about a more general solution: a query language for Web scraping that would make this kind of task easy to implement.

Unfortunately, I was too busy and didn’t have a chance to carry on with this personal project. But I have an update: I have defined the basics of the GrabQL language and I am currently working on its implementation. Soon I’ll start committing my code to GitHub and sharing it with the developer community.

Concepts

The concepts are pretty simple:

  • SQL-like syntax: very common and easy to understand
  • Fast
  • Flexible: I want to use regular expressions and XPath, and be able to query a static document or a URL
  • Modular: I want to load extensions on demand, and keep the core light
  • Portable: I want to run the query in a PHP application or using a stand-alone console application

Basic implementation

The basic implementation of GrabQL will include a SELECT command:

select [xpath | regular expression]
to [variable | output]
from [source]
where [condition]
order by [order]
limit [from], [to]
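To give an idea of how little machinery this template needs, here is a minimal Python sketch that splits a query written one clause per line into its parts. It is only an illustration, not the GrabQL runtime: the names `parse_select` and `EXAMPLE_QUERY` are hypothetical, and the one-clause-per-line rule and the integer pair for `limit` are assumptions made for the sketch.

```python
import re

# Clause keywords taken from the SELECT template above.
CLAUSE_RE = re.compile(
    r"^(select|to|from|where|order by|limit)\s+(.*)$", re.IGNORECASE
)

# Hypothetical query that follows the clause order of the template.
EXAMPLE_QUERY = """
select //author
to results
from http://foo.com/books-inventory.xml
where first-name like 'Jo%'
order by last-name
limit 0, 1
"""


def parse_select(query: str) -> dict:
    """Split a one-clause-per-line query into a {keyword: argument} dict."""
    clauses = {}
    for raw_line in query.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = CLAUSE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognised clause: {line!r}")
        keyword = match.group(1).lower()
        if keyword in clauses:
            raise ValueError(f"duplicate clause: {keyword!r}")
        clauses[keyword] = match.group(2).strip()
    if "select" not in clauses or "from" not in clauses:
        raise ValueError("a query needs at least 'select' and 'from'")
    if "limit" in clauses:
        # Assumption: limit is always a pair of integers, "from, to".
        first, last = (int(part) for part in clauses["limit"].split(","))
        clauses["limit"] = (first, last)
    return clauses
```

Calling `parse_select(EXAMPLE_QUERY)` returns a dictionary keyed by clause name, which a later stage could hand to an XPath or regular expression engine.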

E.g., imagine the following XML file (based on an MSDN sample):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
  <book style="autobiography">
    <author>
      <first-name>Joe</first-name>
      <last-name>Bob</last-name>
      <award>Trenton Literary Review Honorable Mention</award>
    </author>
    <price>12</price>
  </book>
  <book style="textbook">
    <author>
      <first-name>Joan</first-name>
      <last-name>Williams</last-name>
      <publication>Selected Short Stories of
        <first-name>Joan</first-name>
        <last-name>Williams</last-name>
      </publication>
    </author>
    <editor>
      <first-name>Britney</first-name>
      <last-name>Bob</last-name>
    </editor>
    <price>55</price>
  </book>
  <book style="autobiography">
    <author>
       <first-name>Michael</first-name>
       <last-name>Smith</last-name>
       <price>33</price>
    </author>
  </book>
</bookstore>

What I want to run is something like:

# Destination variable
var results = {}

# Query using XPath, "first-name" field starting with 'Jo'
select *
from http://foo.com/books-inventory.xml
to results
where '//author'
filter by first-name like 'Jo%'
order by last-name
limit 0, 1

# Format results using JSON format
echo results::json

and get the matching authors, sorted by last name, as a JSON array:

[
    {
        "first-name": "Joe",
        "last-name": "Bob",
        "award": "Trenton Literary Review Honorable Mention"
    },
    {
        "first-name": "Joan",
        "last-name": "Williams",
        "publication": "Selected Short Stories of Joan Williams"
    }
]
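To make the intended semantics concrete, here is a rough Python sketch of what such a query is meant to compute, using only the standard library. It is not GrabQL itself: the function names are hypothetical, the XML is embedded as a string instead of being fetched from the URL, `like 'Jo%'` is reduced to a prefix test, and `limit 0, 1` is assumed to mean positions 0 through 1 inclusive.

```python
import json
import xml.etree.ElementTree as ET

# Copy of the sample document, without the XML declaration and stylesheet line.
SAMPLE_XML = """
<bookstore specialty="novel">
  <book style="autobiography">
    <author>
      <first-name>Joe</first-name>
      <last-name>Bob</last-name>
      <award>Trenton Literary Review Honorable Mention</award>
    </author>
    <price>12</price>
  </book>
  <book style="textbook">
    <author>
      <first-name>Joan</first-name>
      <last-name>Williams</last-name>
      <publication>Selected Short Stories of
        <first-name>Joan</first-name>
        <last-name>Williams</last-name>
      </publication>
    </author>
    <editor>
      <first-name>Britney</first-name>
      <last-name>Bob</last-name>
    </editor>
    <price>55</price>
  </book>
  <book style="autobiography">
    <author>
       <first-name>Michael</first-name>
       <last-name>Smith</last-name>
       <price>33</price>
    </author>
  </book>
</bookstore>
"""


def author_to_dict(author):
    """Map each direct child tag to its whitespace-normalised text content."""
    return {
        child.tag: " ".join("".join(child.itertext()).split())
        for child in author
    }


def run_query(xml_text, prefix, order_key, first, last):
    """Select //author, filter by first-name prefix, sort, then slice."""
    root = ET.fromstring(xml_text)
    rows = [author_to_dict(a) for a in root.iter("author")]  # like //author
    rows = [r for r in rows if r.get("first-name", "").startswith(prefix)]
    rows.sort(key=lambda r: r.get(order_key, ""))
    return rows[first:last + 1]  # assumption: "to" is inclusive


results = run_query(SAMPLE_XML, "Jo", "last-name", 0, 1)
print(json.dumps(results, indent=4))
```

With these assumptions the sketch yields the same two records as the JSON above: Joe Bob first, then Joan Williams.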

What else?

An important note: when I first thought about GrabQL, I wasn’t yet aware of YQL (Yahoo! Query Language). It is certainly an interesting project, but it has some limitations (e.g. registering the app to get an access key, a cap on the number of daily calls, etc.). I want my Web scraper to be open source and flexible, so that I can add features like:

  • managing a login cookie
  • using a list of proxies for intensive Web scraping
  • saving results to a MongoDB collection
  • and so on!
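As a rough illustration of the first two items, here is how a login cookie and a rotating proxy list could be handled with Python's standard `urllib.request` and `http.cookiejar` modules. This is a sketch of the idea, not part of GrabQL: `make_opener` and `rotating_openers` are hypothetical names, the proxy addresses a caller passes in are placeholders, and saving to MongoDB is left out because it needs a third-party driver.

```python
import urllib.request
from http.cookiejar import CookieJar
from itertools import cycle


def make_opener(jar, proxy=None):
    """Build an opener that stores cookies in `jar` and optionally uses a proxy."""
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if proxy is not None:
        # Route both plain and TLS traffic through the same proxy address.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
    return urllib.request.build_opener(*handlers)


def rotating_openers(jar, proxies):
    """Yield openers forever, cycling through `proxies` round-robin.

    Every opener shares the same cookie jar, so a session cookie obtained
    through one proxy is sent again on requests made through the next one.
    """
    for proxy in cycle(proxies):
        yield make_opener(jar, proxy)
```

A scraper would log in once with the first opener (for example with `opener.open(login_url, data=...)`, where `login_url` is whatever the target site uses) and then take a fresh opener from the generator for each subsequent request.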

I’ve already implemented the runtime that will handle the GrabQL code, and I’m going to push it soon. I’m still working on the definition of the language and its syntax, but I think I’m close to a first release!

Ambitious project? Well, it is, but I’m sure it will result in something extremely useful.