Using XPath to query parliamentary documents

In this tutorial you will learn how parliamentary documents are structured in XML, and you will learn the basics of querying them with XPath expressions.

Parliamentary proceedings in XML

XML is a format for adding structure to text. To see how this works, open h-tk-20092010-41-4027.xml in Firefox. You will see tags, written like this: <docinfo>, mixed with normal text. You will also see little minus signs on the left. Click on one to close that part of the file. An XML file is a nested structure which resembles a tree. By clicking on the plus and minus signs, you hide or bring back parts of the tree.

Click on the minus sign next to <root>. Now open the root again by clicking on the plus. Root has three "children": docinfo, meta, and proceedings. Close them all by clicking on them.

Now you can look at the three bits of information one by one. Docinfo just contains some technical information. Close it quickly. Meta contains metadata about the document itself, described with Dublin Core. Open and close some of the children of meta. Find the source XML from which this file was created by looking for the dc:source element. Take that URL and paste it into a new tab in your browser. Again, open and close some of the elements.

The proceedings element contains the interesting stuff. It contains the meeting notes of one day of parliament. The proceedings are divided into topics. Inspect the first one. Topics in turn are divided into scenes. The first topic has just one scene below it. The second topic has several scenes. Scenes themselves are subdivided into speeches and stage-directions. If you study a speech you can see who gave the speech, sometimes at what time, etc. You can also read the different paragraphs the speech is made of.
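
To make the nesting concrete, here is a minimal sketch in Python (standard library only) that parses a tiny, invented document with the same element names and prints its tree as an indented outline. The sample text and the attribute values are assumptions for illustration only; the real file is much larger and contains more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Invented miniature document: only the element names come from the tutorial.
SAMPLE = """
<root>
  <docinfo/>
  <meta/>
  <proceedings>
    <topic>
      <scene speaker="Pechtold" party="D66">
        <stage-direction>De heer Pechtold (D66)</stage-direction>
        <speech speaker="Pechtold" party="D66">
          <p>Eerste alinea.</p>
          <p>Tweede alinea.</p>
        </speech>
      </scene>
    </topic>
  </proceedings>
</root>
"""


def outline(elem, depth=0):
    """Return the element names of a tree, one per line, indented by depth."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE.strip())
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The outline shows the same picture as the plus and minus signs in the browser: every level of indentation is one step deeper into the tree.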

Using XPath to query XML documents

Now we will see how you can query XML documents using XPath. Carefully follow the steps outlined here.

  1. Download h-tk-20092010-41-4027Styled.xml to your own computer and save it.
  2. Download QueryProceedingsByXPath.xsl to your computer and save it in the same directory as the XML file you just saved.
  3. Open the XML file in a browser. You see "Result of expression" and then, in typewriter font, the text of the first scene from the proceedings we looked at earlier.
  4. What is going on here? Open the file QueryProceedingsByXPath.xsl in a text editor like Notepad. You see something that already looks a bit familiar: a lot of less-than and greater-than signs.
  5. You see the following bit of information:
                        <xsl:variable name="myxpath" select="//scene[1]"/>
                        
                        <xsl:template match="/">
                            <h3 name='search'>Result of expression </h3>
                            <tt>
                                <xsl:copy-of select="$myxpath"/>
                            </tt>
                        </xsl:template>
                    

    The most important thing for you to remember is the variable named myxpath. That is your XPath query. In the xsl:template below it, the query in myxpath is executed on the proceedings XML file, and the result is copied to the screen. Your browser does all that for you!

  6. The XPath asks for the first scene in the document. And yes, that is what we see.
  7. Your turn: ask for the second scene. You do that as follows:
    1. In Notepad, change the [1] in the definition of the variable myxpath into, well, [2].
    2. Save the file. Do not change the name.
    3. Reload the XML file in your browser. Pressing Ctrl+R usually does that for you.
    4. That's it!
  8. You see that the result starts with De heer Pechtold (D66) De heer Pechtold D66 . In English, "De heer Pechtold" means "Mr. Pechtold", and D66 is the name of his party. Go to the XML file and find this text in the second scene. It is in a stage-direction. We do not want to see that; we only want the text spoken in the speeches of the second scene. Easy: change your query to //scene[2]//p.
  9. Now change your query so that you only see the stage-direction, and not the text of the speeches.
  10. Can you get rid of that annoying duplication? Well, there are two stage-directions...
  11. Let's go to the next scene. You can of course ask for it with //scene[3]. You could also ask for it by asking for the scenes of speaker "Halsema", like this: //scene[@speaker = 'Halsema']. Be careful to use single quotes in your XPaths in these examples.
  12. Look at the result. You see the line Mevrouw Halsema (GroenLinks) Mevrouw Halsema GroenLinks twice. That means Halsema has two scenes. We can check that with the query count(//scene[@speaker = 'Halsema']).
  13. Change your query to only get the first scene of Halsema. Why does //scene[1][@speaker = 'Halsema'] not work?
  14. How many speeches does this scene contain? Write a query which gives the speech starting with Het woord is aan mevrouw Kant.
  15. Now ask for the first speech by changing scene to speech. Ask for the first stage-direction. Ask for all scenes by removing the square brackets. Ask for the metadata.
  16. Below you find more examples. Experiment with them. Do the XPath tutorial at ZVON, and learn more. Remember, if you want to see the value of attributes, you must wrap your query in the data() function, as in data($document//speech/@speaker).
  17. What is going on? If you look at the page source of the XML file in your browser (usually with Ctrl+U), you see the same big XML file. But on the screen you see only a part of it: the part that you queried for. How does this work? If you look carefully at the XML file h-tk-20092010-41-4027Styled.xml (with View Page Source) you see on the second line the following instruction:
    <?xml-stylesheet type="text/xsl" href="QueryProceedingsByXPath.xsl"?>
    This is a command to the browser to process the file according to the instructions in QueryProceedingsByXPath.xsl. And that is the file you were editing all the time. Thus you instructed the browser to show only parts of the file. If you want to see all, use the XPath query /.
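
If you want to try the queries from the steps above outside the browser, the following is a minimal sketch in Python using only the standard library. It rests on some assumptions: the sample document is invented (only the element and attribute names come from this tutorial), and Python's ElementTree supports just a subset of XPath. Paths must start with a dot, as in .//scene, and functions such as count() are not available, so len() stands in for count() here.

```python
import xml.etree.ElementTree as ET

# Invented miniature of a proceedings file with one topic and four scenes.
SAMPLE = """
<proceedings>
  <topic>
    <scene speaker="Voorzitter">
      <speech speaker="Voorzitter"><p>Het woord is aan de heer Pechtold.</p></speech>
    </scene>
    <scene speaker="Pechtold" party="D66">
      <stage-direction>De heer Pechtold (D66)</stage-direction>
      <speech speaker="Pechtold" party="D66"><p>Eerste alinea.</p><p>Tweede alinea.</p></speech>
    </scene>
    <scene speaker="Halsema" party="GroenLinks">
      <stage-direction>Mevrouw Halsema (GroenLinks)</stage-direction>
      <speech speaker="Halsema" party="GroenLinks"><p>Eerste bijdrage.</p></speech>
    </scene>
    <scene speaker="Halsema" party="GroenLinks">
      <stage-direction>Mevrouw Halsema (GroenLinks)</stage-direction>
      <speech speaker="Halsema" party="GroenLinks"><p>Tweede bijdrage.</p></speech>
    </scene>
  </topic>
</proceedings>
"""

root = ET.fromstring(SAMPLE.strip())


def text_of(elem):
    """Return all text inside an element, with whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


# //scene[2]//p : the paragraphs of the second scene, without the stage-direction
second_scene_paragraphs = [text_of(p) for p in root.findall(".//scene[2]//p")]

# //scene[@speaker='Halsema'] : all scenes of one speaker
halsema_scenes = root.findall(".//scene[@speaker='Halsema']")

# count(//scene[@speaker='Halsema']) : len() replaces the XPath count() function
halsema_count = len(halsema_scenes)

# //scene[1][@speaker='Halsema'] : predicates are applied from left to right,
# so [1] first keeps the first scene of its parent, and only then is the
# speaker of that one scene checked.
wrong_order = root.findall(".//scene[1][@speaker='Halsema']")

print(second_scene_paragraphs)
print(halsema_count)
print(wrong_order)
```

Note that a position like [2] is counted among the scene elements that share the same parent. In this sample there is only one topic, so the second scene of the topic is also the second scene of the document.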

Examples

If you try these examples yourself, put $document in front of the query.

XPath query                             Result
//scene                                 Return all scenes
//scene[5]                              Return the fifth scene
//scene[@party='D66']                   Return all scenes by a speaker of party 'D66'
//scene[@party='D66']//speech           Return all speeches within those scenes
//speech[p[contains(.,'oorlog')]]       Return all speeches with a paragraph containing the string 'oorlog'
data($document//speech/@speaker)        Return the names of the speakers of all speeches
//speech[.//stage-direction]            Return all speeches which contain a stage-direction
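
Some of these examples can be imitated with a short script. The sketch below uses Python's standard library on an invented sample document; the speeches and their texts are assumptions made up for illustration. Because ElementTree does not implement contains() or nested paths inside predicates, plain Python filters take the place of those two queries here.

```python
import xml.etree.ElementTree as ET

# Invented sample: only the element and attribute names come from the tutorial.
SAMPLE = """
<proceedings>
  <topic>
    <scene speaker="Pechtold" party="D66">
      <speech speaker="Pechtold" party="D66">
        <p>Wij spreken vandaag over de oorlog.</p>
      </speech>
      <speech speaker="Voorzitter">
        <p>Dank u wel.</p>
        <stage-direction>(Applaus)</stage-direction>
      </speech>
    </scene>
    <scene speaker="Halsema" party="GroenLinks">
      <speech speaker="Halsema" party="GroenLinks">
        <p>Een korte vraag.</p>
        <p>Gaat dit debat over oorlog of over vrede?</p>
      </speech>
    </scene>
  </topic>
</proceedings>
"""

root = ET.fromstring(SAMPLE.strip())

# //scene[@party='D66']//speech : speeches inside scenes of party D66
d66_speeches = root.findall(".//scene[@party='D66']//speech")

# Speeches with at least one paragraph containing 'oorlog'.
# A Python filter replaces the XPath contains() function.
war_speeches = [
    speech
    for speech in root.iter("speech")
    if any("oorlog" in "".join(p.itertext()) for p in speech.findall("p"))
]

# //speech[.//stage-direction] : speeches that contain a stage-direction.
# Again a Python filter replaces the predicate.
speeches_with_direction = [
    speech
    for speech in root.iter("speech")
    if speech.find(".//stage-direction") is not None
]

print(len(d66_speeches))
print([s.get("speaker") for s in war_speeches])
print([s.get("speaker") for s in speeches_with_direction])
```

The second filter checks every paragraph of a speech, so a speech also matches when the word appears only in a later paragraph.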

Working with attributes

Above we asked for XML elements, like speeches and scenes. We also want to ask for the values of attributes, to answer a question like "Who speaks during the second scene?". The XPath expression for this is simple: //scene[2]//speech/@speaker. The rule is that if you want the value of an attribute you put the @ sign in front of its name.

Caveat. However, if you put this expression in our stylesheet as before, you see nothing. Try it. This is because the browser treats attributes differently from elements. To solve this, replace the template in QueryProceedingsByXPath.xsl by this one:

            <xsl:template match="/">
                <h3 name="search">Result of expression </h3>
                <tt>
                    <xsl:copy-of select="$myxpath"/>
                </tt>
                
                <h3 name="search">Result of attribute expression </h3>
                <tt>
                    <xsl:for-each select='$myxpath'>
                        <xsl:value-of select="."/><br/>
                    </xsl:for-each>
                </tt>
            </xsl:template>
        
As you can see, we use other XSLT instructions (xsl:for-each together with xsl:value-of) to show the values of a list of attributes.

Now experiment with asking for attribute values.
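
The same idea, selecting nodes first and then printing a value for each of them, can be sketched in Python with the standard library. This is an illustration under assumptions: the sample document is invented, and ElementTree cannot select attribute nodes directly with /@speaker, so the script selects the speech elements and reads the attribute from each one.

```python
import xml.etree.ElementTree as ET

# Invented sample with two scenes; the names and texts are for illustration only.
SAMPLE = """
<proceedings>
  <topic>
    <scene speaker="Voorzitter">
      <speech speaker="Voorzitter"><p>Ik open de vergadering.</p></speech>
    </scene>
    <scene speaker="Pechtold" party="D66">
      <speech speaker="Pechtold"><p>Eerste vraag.</p></speech>
      <speech speaker="Voorzitter"><p>Gaat u verder.</p></speech>
      <speech speaker="Pechtold"><p>Tweede vraag.</p></speech>
    </scene>
  </topic>
</proceedings>
"""

root = ET.fromstring(SAMPLE.strip())

# //scene[2]//speech/@speaker : who speaks during the second scene?
# Select the speech elements, then read the speaker attribute of each one.
speakers = [speech.get("speaker") for speech in root.findall(".//scene[2]//speech")]

# Remove repeated names while keeping the order of first appearance.
unique_speakers = list(dict.fromkeys(speakers))

for name in speakers:
    print(name)
print(unique_speakers)
```

The loop plays the role of xsl:for-each, and printing each name corresponds to xsl:value-of.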