2005-08-06 | ものがたり

the “true” checklists: XML Performance

I knew Microsoft was publishing something under the title Patterns & Practices, but I did not know they had gone as far as writing about XML performance.

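To begin with, I do not know any of the authors, but the content is terrible. I feel sorry for anyone who takes it as truth and ends up having to write sluggish code, so here is a proper checklist.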

Design Considerations
- Avoid XML wherever possible.
- Avoid processing large documents.
- Avoid validation. XmlValidatingReader is 2-3x slower than XmlTextReader.
- Avoid DTD, especially IDs and entity references.
- Use streaming interfaces.
- Consider hard-coded processing, including validation.
- Shorten node names.
- Consider sharing a NameTable, but only when the names are likely to be really common. As more and more irrelevant names accumulate, it becomes slower and slower.

Parsing XML
- Use XmlTextReader and avoid validating readers.
- When a node is required, consider using XmlDocument.ReadNode() rather than loading the entire document with Load().
- Set the XmlResolver property to null on some XmlReaders to avoid access to external resources.
- Make full use of MoveToContent() and Skip(). They avoid extraneous name creation. However, the benefit almost disappears when you use XmlValidatingReader.
- Avoid accessing Value for Text/CDATA nodes wherever possible.

Validating XML
- Avoid extraneous validation.
- Consider caching schemas.
- Avoid identity constraint usage. Not only because it stores key/fields for the entire document, but also because the keys are boxed.
- Avoid extraneous strong typing. It results in calls to XmlSchemaDatatype.ParseValue(). On the other hand, it can also let you avoid accessing the Value string.

Writing XML
- Write output directly wherever possible.
- To save documents, XmlTextWriter without indentation is better than TextWriter/Stream/file output (all of which indent), unless the output is meant for human reading.

DOM Processing
- Avoid InnerXml. It internally creates XmlTextReader/XmlTextWriter. InnerText is fine.
- Avoid PreviousSibling. XmlDocument is very inefficient for backward traversal.
- Append nodes as soon as possible. Adding a big subtree results in a longer extraneous pass to check ID attributes.
- Prefer FirstChild/NextSibling and avoid accessing ChildNodes. Accessing it creates an XmlNodeList, which is not instantiated initially.

XPath Processing
- Consider using XPathDocument, but only when you need the entire document. With XmlDocument you can use ReadNode(), but there is no equivalent for XPathDocument.
- Avoid preceding-sibling and preceding axes queries, especially over XmlDocument. They would result in sorting, and for XmlDocument they need access to PreviousSibling.
- Avoid // (descendant). The returned nodes are most likely to be irrelevant.
- Avoid position(), last() and positional predicates (especially something like foo[last()-1]).
- Compile the XPath string to an XPathExpression and reuse it for frequent queries.
- Don't run XPath queries frequently. They are costly since they always have to Clone() XPathNavigators.

XSLT Processing
- Reuse (cache) XslTransform objects.
- Avoid id() and key() in XSLT. They can return all kinds of nodes, which prevents node-type based optimization.
- Avoid document(), especially with a dynamic argument.
- ~~Push~~ Pull style queries are usually better than template matching.
- Minimize output size. More importantly, minimize input.
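
The checklist above is about .NET's System.Xml, but the "use streaming interfaces" and "avoid processing large documents" items are not .NET specific. As a rough, language-neutral illustration (not the .NET API, and not taken from the Microsoft guide), here is a minimal sketch using Python's standard `xml.etree.ElementTree.iterparse`. The sample document, the `order`/`total` element names and the `sum_totals` helper are all made up for this example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; a real document would be streamed from a file.
SAMPLE = (
    b"<orders>"
    b"<order id='1'><total>10</total></order>"
    b"<order id='2'><total>32</total></order>"
    b"</orders>"
)


def sum_totals(stream) -> int:
    """Sum every <total> value without keeping finished subtrees around."""
    total = 0
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "order":
            total += int(elem.findtext("total"))
            # Drop the children and text of the processed <order>;
            # only an empty shell stays attached to the root.
            elem.clear()
    return total


print(sum_totals(io.BytesIO(SAMPLE)))  # prints 42
```

The point is only that each record is handled as it arrives and then discarded, instead of building the whole tree first and querying it afterwards.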

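Some of this overlaps with the MS one, but some of it says the exact opposite. It might be interesting, so I will post it on monogatari later. Not today, since I have already written about DTLL there.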

Memo: building monodoc with gtk-sharp 2.0

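The documentation browser is built in the mono-tools module, so use that.
Edit the configure.in in there and enable the line that does not run PKG_CHECK_MODULES for GTK_SHARP (it is commented out). Enable the `if test` line that follows it as well. Then, in the part that runs PKG_CHECK_MODULES for gecko-sharp, change gecko-sharp to gecko-sharp-2.0. In short, just do the same thing as this patch.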

DtllReader 0.4

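Updated to match v0.4 for now. It still has only the SOM, so it is of no use beyond reading, writing and editing from code. I remembered why I only got as far as the SOM: when I tried to implement it concretely as strongly typed, property binding caused backtracking, which was a pain, or rather, was no good performance-wise. With weak typing, on the other hand, a comparison by Unicode code point makes "1235" larger than "12345"...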
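
To make that last complaint concrete, here is a tiny sketch in Python, whose `str` comparison happens to go code point by code point. It has nothing to do with DtllReader's actual code; the variable names are made up for this example.

```python
a, b = "12345", "1235"

# Weakly typed: both values are plain strings, compared code point by
# code point. The first difference is at index 3, where "4" (U+0034)
# sorts before "5" (U+0035), so "1235" comes out as the larger value.
weak_a_is_smaller = a < b

# Strongly typed: once both values are parsed as integers, 12345 > 1235.
typed_a_is_smaller = int(a) < int(b)

print(weak_a_is_smaller)   # True
print(typed_a_is_smaller)  # False
```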