Thoughts on filtering include()d HTML
Oct. 23rd, 2012 04:34 pm
Consider an HTML/XML file as a tree of nodes:
* root node
  * node A
    * subnode 1
  * node B
    * subnode 2
      * subnode 2a
    * subnode 3
  * node C
    * subnode 5
With these properties:
- Any node may be loaded from a separate resource.
- External resources may load further child nodes from other resources.
- Each resource may be a different file type.
To enable these properties, there is a container environment which loads the resources and filters them.
The multiple filtering problem
Any filter may be run multiple times on the same data, for three reasons: each load operation may cause a filter to be run; loaded resources may themselves load other resources; and a filter working on a plain file or chunk of HTML/XML data may not be able to distinguish loaded child content from the parent's own content.
The problems with running a filter more than once are obvious:
- Wasted CPU time in refiltering the child data over and over.
- A badly written filter could corrupt the data on a second run.
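The second problem is easy to reproduce. Below is a minimal sketch in Python, assuming a made-up in-memory resource table (`RESOURCES`), a made-up `{child}` include marker, and a loader called `naive_load`; none of these names come from a real system. The filter is plain HTML escaping, which is not safe to run twice.

```python
import html

# Hypothetical in-memory "resources"; a real container would read files.
RESOURCES = {
    "child": "Tom & Jerry",
    "parent": "Intro {child} Outro",
}

def escape_filter(text):
    """A filter that is not idempotent: it escapes HTML special characters."""
    return html.escape(text)

def naive_load(name):
    """Load a resource, splice in its child, then filter the combined chunk."""
    text = RESOURCES[name]
    if "{child}" in text:
        # The child is filtered by its own load, then again with the parent.
        text = text.replace("{child}", naive_load("child"))
    return escape_filter(text)

result = naive_load("parent")
print(result)  # Intro Tom &amp;amp; Jerry Outro
```

The child's `&` becomes `&amp;` on its own load and `&amp;amp;` when the parent's chunk is filtered, because by then the parent cannot tell the child's text from its own.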
Example
My personal website wraps an output-buffering function around PHP's include() to capture the included content in a variable for filtering. This method gives the parent no way to tell which parts of the child's data have already been filtered. The parent node receives a single chunk of HTML that combines, in the same context:
- prior HTML
- included data (filtered)
- later HTML
- other included data (filtered or not)
Potential solutions
- The most obvious solution is to create a tree structure of nodes, where each node contains:
  - subtrees of this structure (children)
  - data and toString()
  - a list of filters applied so far
- If the child node is the same format as the final complete result (for example, if both are HTML files), then the filter may be delayed until the full file is completely built, and then run once. However, if a child node has its own child node, and both are of a type that differs from the final result and is expected to be filtered before being added to the tree, the problem remains.
- A potential alternative is to have separate filtering-function contexts depending on the current resource type. For instance, resource type A may use one filter to load a child resource of type C, while resource type B may use a different filter to load the same resource. However, this leads to the same double-filtering problem if there is a loop in the chain of descent, for example types A-B-C-A.
- One may also limit the purpose of filtering to a final translation to the result HTML/XML. This defines the problem away at the cost of reducing the potential usefulness of the filtering concept.
- One may demand that filtering functions have no effect on a second run on the same data, but this does not solve the problem. It only makes programmers' work more difficult.
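The first option in the list, a tree whose nodes remember which filters they have already seen, could look roughly like the sketch below. It is written in Python and all of it is assumption: the `Node` class, the `parts` layout that interleaves literal text with child nodes, and the use of a set of filter names where the list above says "list of filters applied so far".

```python
import html
from dataclasses import dataclass, field
from typing import Callable, List, Set, Union

Filter = Callable[[str], str]

@dataclass
class Node:
    # Literal text and child nodes, kept in document order.
    parts: List[Union[str, "Node"]] = field(default_factory=list)
    # Names of filters already run on this node's own text.
    applied: Set[str] = field(default_factory=set)

    def apply(self, name: str, fn: Filter) -> None:
        """Run fn on this node's own text unless it already ran, then recurse."""
        if name not in self.applied:
            self.parts = [p if isinstance(p, Node) else fn(p) for p in self.parts]
            self.applied.add(name)
        for p in self.parts:
            if isinstance(p, Node):
                p.apply(name, fn)

    def to_string(self) -> str:
        """Flatten the tree back into one chunk of text."""
        return "".join(
            p.to_string() if isinstance(p, Node) else p for p in self.parts
        )

child = Node(parts=["Tom & Jerry"])
child.apply("escape", html.escape)    # filtered when the child is loaded
parent = Node(parts=["Fish & ", child, " chips"])
parent.apply("escape", html.escape)   # the child's text is skipped this time
print(parent.to_string())  # Fish &amp; Tom &amp; Jerry chips
```

One limitation of this sketch: text appended to a node after a filter has been recorded is never filtered, so a real version would need to track filters per piece of text or forbid late additions.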
This feels like it's more complicated than it needs to be.
I get the sense that this is a common problem with known solutions, but nothing comes to mind. I decided to change the definition of filtering into translating-to-HTML while running the final HTML filter once at the end, and will think about (i.e., won't get around to) using a tree structure in a future version.
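As a rough illustration of that decision, here is a sketch in Python. The translator table, the two resource types, and the whitespace-trimming `final_filter` are all placeholders invented for the example, not the code my site actually runs.

```python
import html

def text_to_html(raw):
    """Translate plain text into an HTML paragraph."""
    return "<p>" + html.escape(raw) + "</p>"

def html_to_html(raw):
    """HTML resources are already in the target format."""
    return raw

# Hypothetical mapping from resource type to its translator.
TRANSLATORS = {"txt": text_to_html, "html": html_to_html}

def load(resource_type, raw):
    """Translate one resource to HTML; no general-purpose filter runs here."""
    return TRANSLATORS[resource_type](raw)

def final_filter(page):
    """Stand-in for the final HTML filter; it only trims surrounding whitespace."""
    return page.strip()

def build_page(resources):
    """Translate every resource, join the results, then filter exactly once."""
    body = "\n".join(load(kind, raw) for kind, raw in resources)
    return final_filter(body)

page = build_page([("html", "<h1>Title</h1>"), ("txt", "Tom & Jerry")])
print(page)
```

Each resource passes through exactly one translator for its own type, and the only filter that sees the combined page runs once, so nothing is filtered twice.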
The thought of building a DOM-like tree on the server brought back to mind an old idea about potentially sending binary DOM components to the client to avoid the cost of serializing and deserializing the data. Internal binary formats probably use large chunks of memory so serializing to text is likely to reduce transport time, or to not be significantly worse when compressed. A standard binary format would become just another serialization that the client would need to translate to its native format. CPUs have gotten fast enough that the time spent parsing HTML is no longer significant. All in all, that was a bad idea.