php xml manipulation

PHP's bundled libraries make me sad sometimes, like earlier tonight.

I've been working on the blog engine, trying to figure out how to get various bits of markup working that I don't want to have to write by hand every post¹, and it requires doing XML manipulation in PHP.

if you take the time to look around on the internet (and I did), you'll see a lot of people who want to manipulate HTML or XML in PHP, and all the replies recommend things like SimpleXML, which can't remove or change tags, or DOM, which also didn't work for my needs². once those options are exhausted, every thread ends with "use str_replace" or "use preg_replace". sigh. what's the point of having these libraries if they're not actually useful?
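
to make the DOM complaint from footnote 2 concrete, here is a minimal sketch, assuming the stock DOM extension (ext-dom) is loaded. the fragment and variable names are made up for illustration and are not from the blog engine:

```php
<?php
// made-up fragment with two top-level nodes and no root element
$fragment = '<p>foo</p><div>lala</div>';

// keep libxml parser warnings from being printed
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($fragment);
$out = $doc->saveHTML();

// $out now carries a doctype plus <html> and <body> wrappers that were
// never part of $fragment
echo $out;
```

the round trip keeps the original tags but wraps them in a full document, which is the opposite of "output as close as possible to the input".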

I want to do this right, damnit, and I'm going to use a tool that parses my pseudo-HTML fragments into a tree and allows me to add, change, or remove tags at will!

luckily, as I was resigning myself to writing a library from scratch, I discovered simplehtmldom. unlike other libraries I've tried using recently, simplehtmldom worked right out of the box with no issues whatsoever. code example follows.

this code is pulled directly from the footnotes parser of this blog engine, rev 0:

$content = "<p><ref>foo</ref></p><div>lala</div>"; // the post contents

$uid = "blah"; // some unique id so footnotes don't collide


// parse the post into a DOM tree
$html = str_get_html($content);

// find all ref tags
$refs = $html->find("ref");
// sigh. no library is perfect. no matches should = empty array.
if($refs === null) $refs = array();

$footnotes = array(); // footnote texts, in order of first appearance
$labels = array();    // label => index into $footnotes

foreach($refs as $ref) {
    if($ref->label != null) {
        // labelled reference
        if(!isset($labels[$ref->label])) {
            /* if the label doesn't exist on record yet, create it with a
             * placeholder that is flagged later if no text ever shows up. */
            $labels[$ref->label] = count($footnotes);
            $footnotes[] = "label:".$ref->label;
        }
        // look the index up by label, so it still resolves after the
        // placeholder has been replaced by the real footnote text
        $lndes = $labels[$ref->label];
        if($ref->innertext) {
            /* if a labelled ref has innertext, it is not itself a
             * footnote marker and should not be displayed. */
            // save the footnote text
            $footnotes[$lndes] = $ref->innertext;
            // destroy the tag
            $ref->outertext = '';
            continue;
        }
    } else {
        /* unlabelled footnotes must contain their foottext within
         * the ref tags.
         */
        $lndes = count($footnotes);
        $footnotes[] = $ref->innertext;
    }
    // number the marker after its own footnote, not the running count
    $num = $lndes + 1;
    $ref->outertext = "<sup><a href=\"#".$uid."/fn$num\">[$num]</a></sup>";
}
// oh yeah, nested <ref> tags: don't.

// write the (possibly) modified DOM back to text.
$content = $html->save();

// now that we have the footnotes, emit them at the end of the post.
if(count($footnotes) > 0) {
    $content .= "<div class=\"footnotes\">";
    foreach($footnotes as $footnum => $foottext) {
        if(strpos($foottext, "label:") === 0) {
            trigger_error("unmatched footnote $foottext", E_USER_ERROR);
        }
        $content .= "<a name=\"".$uid."/fn".($footnum + 1)."\"></a><sup>"
                 .($footnum + 1)."</sup> ".$foottext."<br />";
    }
    $content .= "</div>";
}

// get around a php memory leak involving circular references.
$html->clear();
unset($html);
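
the numbering rules in that loop are easier to see without the DOM in the way. here is a rough sketch of the intended behaviour, assuming each ref is modelled as a plain array instead of a simplehtmldom node. number_footnotes and the 'label'/'text' keys are hypothetical names for this example, not part of the blog engine or of simplehtmldom:

```php
<?php
// hypothetical helper: each ref is array('label' => string|null, 'text' => string)
function number_footnotes(array $refs) {
    $footnotes = array(); // footnote texts; null until a text is seen
    $labels = array();    // label => index into $footnotes
    $markers = array();   // per ref: footnote number to display, or null

    foreach ($refs as $ref) {
        $label = isset($ref['label']) ? $ref['label'] : null;
        $text = isset($ref['text']) ? $ref['text'] : '';
        if ($label !== null) {
            // first sighting of a label reserves a footnote slot
            if (!isset($labels[$label])) {
                $labels[$label] = count($footnotes);
                $footnotes[] = null;
            }
            $i = $labels[$label];
            if ($text !== '') {
                // a labelled ref with text defines the footnote, shows no marker
                $footnotes[$i] = $text;
                $markers[] = null;
                continue;
            }
        } else {
            // an unlabelled ref always gets a fresh footnote
            $i = count($footnotes);
            $footnotes[] = $text;
        }
        $markers[] = $i + 1;
    }
    return array($markers, $footnotes);
}

list($markers, $footnotes) = number_footnotes(array(
    array('label' => 'a', 'text' => ''),        // marker for "a"
    array('label' => null, 'text' => 'plain'),  // unlabelled footnote
    array('label' => 'a', 'text' => 'shared'),  // text for "a", no marker
    array('label' => 'a', 'text' => ''),        // second marker for "a"
));
```

both markers for "a" should point at the same footnote number, and the ref that only supplies the text shows no marker at all.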

less painful than I expected it would be! hooray!

¹ especially footnotes
² DOM only accepts properly formed XML or HTML with an all-containing root node and will only create output with a doctype and a single root node. I wanted something that would create output as close as possible to the input.

posted on wednesday september 9th, 2009 at 1:58 (PDT)