Post about Xidel

~lucidiot 2023-12-18 14:53:10 +01:00
parent 091a085dec
commit 28c36c080c
1 changed files with 45 additions and 0 deletions


@@ -2091,6 +2091,51 @@ because astronomers need coffee to go through the night -->
<p>Just the <a href="https://m.highwaysengland.co.uk/feeds/rss/UnplannedEvents.xml" target="_blank">incidents-everywhere</a> feed gets a new item from every few minutes to <strong>every few seconds</strong>, which is the fastest update rate I have ever seen on a feed. I guess those feeds really can only be used by software for further processing.</p>
]]></description>
</item>
<item>
<title>Xidel, the Swiss Army knife of XML and JSON processing</title>
<pubDate>Mon, 18 Dec 2023 14:47:48 +0100</pubDate>
<guid isPermaLink="false">xidel</guid>
<category domain="https://envs.net/~lucidiot/rsrsss/">Tool</category>
<link>https://www.videlibri.de/xidel.html</link>
<description><![CDATA[
<p>You may have noticed that I have started posting the occasional home-grown <abbr title="Outline Processor Markup Language">OPML</abbr> file, to provide <a href="http://opml.org/spec2.opml#subscriptionLists" target="_blank">subscription lists</a> and share many feeds at once. Some of the files include hundreds of feeds, and no, I did not write those files entirely by hand.</p>
<p>The first OPML file to be generated automatically was for the feeds of the <a href="https://envs.net/~lucidiot/rsrsss/feed.xml#nhc-cphc" target="_blank">National Hurricane Center</a>. It initially used <a href="https://tildegit.org/lucidiot/rsrsss/src/commit/773c10f95655ae0650d6b72da5eb7c5d9a02bab9/bin/build_nhc_opml" target="_blank">a shell script</a> that combined some JavaScript code run via Node.js with a call to <a href="https://blacksmoke16.github.io/oq/" target="_blank">oq</a>, a wrapper around <a href="https://jqlang.github.io/jq/" target="_blank">jq</a> that can convert between <abbr title="YAML Ain't Markup Language">YAML</abbr>, <abbr title="JavaScript Object Notation">JSON</abbr> and <abbr title="eXtensible Markup Language">XML</abbr>.</p>
<p>The JavaScript code retrieved the <abbr title="National Hurricane Center">NHC</abbr>'s <a href="https://www.nhc.noaa.gov/aboutrss.shtml" target="_blank">RSS feeds list</a> page and <a href="https://stackoverflow.com/a/1732454/5990435" target="_blank">parsed it using a regular expression</a>. It then generated a JSON representation of an OPML file, which oq converted to XML almost as described in <a href="https://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html" target="_blank">this article from 2006</a>. That's how I was used to generating feeds within my <a href="https://tilde.town/~lucidiot/itsb/" target="_blank"><abbr title="International Transport Safety Bureau">ITSB</abbr></a> project.</p>
<p>But that did not mean I was really happy with this. I do not like having a lot of dependencies in my projects, particularly those that can be heavy or restrictive in terms of CPU architectures. Ideally, I would like to be able to do almost everything on Windows XP, since one of my many other niche interests is in older Windows systems.</p>
<h3>XProc and XQuery</h3>
<p>While going through <a href="https://en.wikipedia.org/wiki/Category:XML-based_standards" target="_blank">the XML-based standards category on Wikipedia</a> to look for potentially interesting namespaces for RSS feeds, I stumbled upon <a href="https://xproc.org/" target="_blank">XProc</a>. XProc is an XML-based language that lets you define pipelines, particularly to process XML data. This reminded me of <a href="https://tildegit.org/lucidiot/itsb/src/commit/01c48a495059c54022b768e02a199fa8b5474077/itsb.xml" target="_blank">the main XML file of ITSB</a>, which holds both the contents of its homepage and the instructions to generate the hundreds of feeds it serves. A series of <abbr title="eXtensible Stylesheet Language Transformations">XSLT</abbr> stylesheets turns that file into either the HTML homepage, an OPML file containing all the feeds, or a Bash script that can be executed to generate all of the feeds.</p>
<p>XProc looked like an interesting path to rewrite ITSB entirely and make it go beyond only generating feeds for transport accident investigation reports, which is something I have been thinking about for a while. However, the only mature implementations of XProc appear to be in Java, which is a hard no in most of my projects. Searching for <kbd>xproc</kbd> on GitHub made me find <a href="https://github.com/xquery/xproc.xq" target="_blank">xproc.xq</a>, an XProc implementation that relied on a Java implementation of another strange language, <a href="https://www.w3.org/TR/xquery-31/" target="_blank">XQuery</a>.</p>
<p>XQuery, the <em>XML Query Language</em>, is an extension of <a href="https://www.w3.org/TR/xpath-31/" target="_blank">XPath</a>, a language that you probably encountered if you have been working with XML for a while. XPath is used within XSLT, and it's often one of the quickest ways to extract something from an XML document with most XML libraries if you don't want to deal with the complexity of converting between the XML paradigm and your programming language's. You can even use it within your browser, with the <a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate" target="_blank"><code>Document.evaluate()</code></a> method of the <abbr title="Document Object Model">DOM</abbr> <abbr title="Application Programming Interface">API</abbr>.</p>
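<p>To make the "quickest way to extract something" point concrete, here is a minimal sketch, written in Python rather than XQuery, using the standard library's <code>xml.etree.ElementTree</code>, which only supports a subset of XPath. The sample feed and the <code>titles_in_category</code> helper are made up for this illustration.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, invented for this sketch.
RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><category>Tool</category></item>
    <item><title>Second post</title><category>Feed</category></item>
    <item><title>Third post</title><category>Tool</category></item>
  </channel>
</rss>"""


def titles_in_category(xml_text, category):
    """Return the titles of all items that have the given category."""
    root = ET.fromstring(xml_text)
    # The predicate [category='...'] keeps only the <item> elements that
    # have a <category> child with exactly that text.
    path = f".//item[category='{category}']/title"
    return [title.text for title in root.findall(path)]


print(titles_in_category(RSS, "Tool"))
```

<p>With the sample feed above, this should print the titles of the first and third items. A single path expression replaces what would otherwise be a couple of nested loops over the parsed tree.</p>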
<p>While XPath is mostly meant to give a succinct way to describe a filter on XML data, XQuery goes beyond that and allows iterating on and processing that XML. You can probably rewrite any XSLT stylesheet as an XQuery script, and it will probably be easier to read. XQuery allows declaring functions, using variables, etc. and even provides a syntax that reminds me of <abbr title="Structured Query Language">SQL</abbr> and <abbr title="SPARQL Protocol And RDF Query Language">SPARQL</abbr>:</p>
<figure>
<pre>for $galaxy in //galaxy, $planet in $galaxy//planet
group by $galaxy, $type as xs:string := $planet/@type
where count($planet) > 3
group by $type
let $count := count($galaxy)
stable order by $count descending
count $i
return &lt;type id="{$i}" name="{$type}" count="{$count}" /&gt;
</pre>
<figcaption>Example of an XQuery <abbr title="For, Let, Where, Order by, Return">FLWOR</abbr> query, pronounced <em>flower</em></figcaption>
</figure>
<p>This example, in an imaginary XML document holding the universe, computes how many galaxies have more than three planets of a given type, then returns those planet types, starting with the type with the most galaxies, and including a unique identifier. The ordering is marked as stable, forcing the XQuery implementation to order any types that have the same count in the same order on every execution. This is quite a complicated expression, but you can do way worse in XQuery. This would probably be doable in XSLT, but it would definitely be very painful.</p>
<p>The XProc implementation I found in XQuery was using some <a href="https://www.progress.com/marklogic/server" target="_blank">MarkLogic Server</a> extensions. MarkLogic Server is a document-oriented database that lets you run either XQuery or JavaScript to query its data, and it is proprietary, so I was definitely not interested in trying to use it. A fun thing to note, however, is that xproc.xq also provided unit tests via <a href="https://robwhitby.github.io/xray/" target="_blank">xray</a>. You know you've got a strong query language when you can have a unit testing framework for it!</p>
<p>I went looking for an open-source XQuery implementation that does not require Java, and that could provide enough vendor-specific extensions to replace those used within xproc.xq. I first found <a href="http://www.zorba.io/home" target="_blank">Zorba</a>, basically MarkLogic Server but open-source, and it provided an impressive number of extensions. You can even <a href="http://www.zorba.io/documentation/latest/modules/connectors/sqlite" target="_blank">interact with SQLite databases</a> in it, to use a query language within a query language. Unfortunately, Zorba is an incredible mess to build, so I quickly gave up trying to package it for Alpine Linux and tried to find something else.</p>
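<p>For readers who do not speak XQuery, here is a rough sketch of the same computation in Python rather than XQuery, assuming a <code>universe</code> document where each <code>planet</code> carries a <code>type</code> attribute. The sample data and the <code>make_galaxy</code> and <code>planet_types</code> helpers are made up for this illustration, and it returns plain dictionaries instead of <code>type</code> elements.</p>

```python
import xml.etree.ElementTree as ET
from collections import Counter


def make_galaxy(name, **planet_counts):
    """Build a hypothetical <galaxy> holding the given number of planets per type."""
    planets = "".join(
        f'<planet type="{ptype}"/>' * n for ptype, n in planet_counts.items()
    )
    return f'<galaxy name="{name}">{planets}</galaxy>'


# Made-up universe: three galaxies with various planet types.
UNIVERSE = (
    "<universe>"
    + make_galaxy("A", rocky=4, gas=4)
    + make_galaxy("B", gas=4, rocky=2)
    + make_galaxy("C", ice=5)
    + "</universe>"
)


def planet_types(universe_xml, threshold=3):
    """Count, per planet type, the galaxies holding more than `threshold` planets of it."""
    root = ET.fromstring(universe_xml)
    galaxies_per_type = Counter()
    for galaxy in root.iter("galaxy"):
        # First "group by": count the planets of each type in this galaxy.
        per_type = Counter(p.get("type") for p in galaxy.iter("planet"))
        for ptype, n in per_type.items():
            if n > threshold:  # the "where" clause
                # Second "group by": one more galaxy for this type.
                galaxies_per_type[ptype] += 1
    # sorted() is stable, so types with equal counts keep their first-seen order.
    ranked = sorted(galaxies_per_type.items(), key=lambda kv: kv[1], reverse=True)
    # The "count $i" clause becomes a 1-based position in the sorted results.
    return [
        {"id": i, "name": name, "count": count}
        for i, (name, count) in enumerate(ranked, start=1)
    ]


for row in planet_types(UNIVERSE):
    print(row)
```

<p>With this sample data, the gas type comes first with two galaxies, followed by rocky and ice with one galaxy each, in the order they were first seen. The Python version needs two counters and an explicit sort where the FLWOR expression chains clauses.</p>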
<h3>Xidel</h3>
<p>I gave up for a little while, then stumbled upon <a href="https://www.videlibri.de/xidel.html" target="_blank">Xidel</a>, an open-source tool written in Pascal that supports applying CSS selectors, XPath queries, XQuery scripts, as well as JSONiq, a JSON equivalent of XPath and XQuery that got merged into XPath and XQuery 3.1. It can make HTTP requests, parse HTML (not just XHTML), submit HTML forms found in pages, interact with the filesystem, run other processes, and more.</p>
<p>It could allow me to merge every single one of the tools that I use in ITSB into just one dependency. And that dependency is just one statically compiled executable that I can easily download automatically in scripts if I need it. And Pascal can be compiled on <a href="https://wiki.freepascal.org/Platform_list" target="_blank">a lot of platforms</a>. Xidel does work on Windows XP!</p>
<p>I started playing around with it, and very quickly decided to rewrite my NHC OPML generator with it to drop the Node.js, oq and jq dependencies. That is how I ended up with <a href="https://tildegit.org/lucidiot/rsrsss/src/commit/db7bc7d82802c219245ad7d3f40188ce67111b82/xquery/opml/noaa/nhc.xqy" target="_blank">the current implementation of the NHC generator</a>.</p>
<p>I then <a href="https://tildegit.org/lucidiot/rsrsss/commit/79daa5d69032be42fba8036421a3327db5d28945" target="_blank">rewrote the CSS sprites generator</a>. I use a single image for all of the icons displayed in the web version of RSRSSS, and some CSS to take just a portion of the image each time to get one icon at a time. I also took the time to optimize the CSS.</p>
<p>That sent me <a href="https://status.cafe/statuses/33023" target="_blank">on a roll</a>, and I started writing a whole bunch of XQuery scripts, including one to <a href="https://tildegit.org/lucidiot/rsrsss/commit/0f0ed2084b59002b9ab1afb8d38f5941ff7fa4a6" target="_blank">use the W3C Feed Validation Service from the command line</a>, and started looking for websites that would give me good OPML files to make.</p>
<p>That's how there has been a wave of OPML files coming in recently. I just want more excuses to write in XQuery! If I find the motivation to work on ITSB again, I will definitely be introducing Xidel in it and start slowly converting everything to it. I have also considered using it as a static site generator, and for a few other projects.</p>
<p>In some email exchanges, I have dubbed Xidel the <em>Overwhelmingly Powerful Mother Of All Legendary XML/HTML/JSON Processor of Doom</em> due to how impressed I was with how versatile both it and XQuery are.</p>
<p>So, if you find yourself trying to extract data from HTML, XML or JSON documents, do check out <a href="https://www.videlibri.de/xidel.html" target="_blank">Xidel</a>. It might not be as trendy as other tools like jq, but it is a lot more powerful.</p>
]]></description>
</item>
</channel>
<access:restriction relationship="allow" />
</rss>