very belatedly add README

This commit is contained in:
Alexis Marie Wright 2022-04-15 09:10:57 -04:00
parent f085b6bcb5
commit ad04f87299
* Lexie's RSS Monster (for Cial)
** /notices code/ OwO what's this?
This is a tool for generating an RSS feed from the content of https://www.inhuman-comic.com.
You can see a demo of its output (/not/ live updating!) [[https://lexie.space/rssmonster-tmp/][here]].
** Tell me more!
It starts from the Archives page, which helpfully has a complete list of comic pages and splashes published to date, and reads each listed page to find the actual comic image. From this it produces an RSS 2.0 XML feed file which any compliant RSS reader - and given the age of that standard, that should be most of them! - can consume.
The feed XML from this tool isn't /quite/ compliant with RSS 2.0, which does not specify any means of including arbitrary metadata in feed items. I'm cheating a little by using HTML5-style ~data-~ prefixed attributes on the ~<item />~ element in order to capture a few pieces of metadata for the comic page a given item represents.
Capturing metadata this way is useful because, with it, the feed XML file can also serve as a cache of previously retrieved comic pages. When the script starts up, it checks whether a feed XML file is already present, and if so reads in that file and uses it to identify already-fetched pages. This way, the script only has to look at pages it hasn't seen before, which reduces both server load and runtime.
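For illustration, a cached feed item might look something like the following. (The ~data-~ attribute names and values here are made up for the example - the real names are whatever the script emits - but the shape is the point: standard RSS 2.0 elements, plus extra attributes a compliant reader will simply ignore.)
#+begin_src xml :eval never
<item data-page-number="536" data-arc-title="An Arc Title">
  <title>page 536 // "Title of the page"</title>
  <link>http://www.inhuman-comic.com/comic536.php</link>
  <guid>http://www.inhuman-comic.com/comic536.php</guid>
</item>
#+end_src
Because readers ignore attributes they don't understand, the feed stays usable even though it bends the RSS 2.0 spec a little.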
*There are some caveats to be aware of.*
*** Site structure
Because the site isn't generated from any kind of content-management system, the script has to rely on the structure of the site itself in order to find the information it needs for the feed. This is /basically okay/ because the site is consistently structured - otherwise this script would not have been possible at all. *But the site needs to /stay/ consistently structured for the script to keep working.*
HTML parsing is surprisingly flexible, but not bulletproof, and not all of the information this script uses is encoded in the site's HTML - some of it, notably page numbers and titles, has to come from the plain(ish) text of the Archives page. *So if you change how you structure the Archives page, or the comic pages, this script is likely to break.*
*** Error handling
The script can emit considerable debugging information to its output, and tries to be pretty good about explaining why things break when they do. The script also tries hard to ensure that, if anything /does/ break, the worst that happens is that the feed XML isn't updated - the idea is, it may not gain new pages right away, but it will at least stay correct with what it has.
That said, *I can't guarantee all errors will be caught.* It is a good idea to add the RSS feed to your own feed reader, so that you can keep an eye on how it's behaving when you add new content. If something seems to be going wrong, you should be able to get the script's output via email (or however cronjob output is handled on your host), and that will give you some idea what's wrong.
You can always also run the script by hand to observe its output directly. That will look something like this if everything's good:
#+begin_src sh :eval never
/usr/local/bin/php -f rssmonster.php
2022-04-15T16:44:56+00:00 [info] Read 448717 bytes from /tmp/feed.xml
2022-04-15T16:44:56+00:00 [info] Read 757 items from /tmp/feed.xml
2022-04-15T16:44:56+00:00 [warn] Got HTTP 404 from http://www.inhuman-comic.com/11backcover.php; skipping
2022-04-15T16:44:56+00:00 [warn] Got HTTP 404 from http://www.inhuman-comic.com/coimc536.php; skipping
2022-04-15T16:44:57+00:00 [info] Finished fetching 758 pages
2022-04-15T16:44:57+00:00 [info] writing 449526 bytes to /tmp/feed.xml
#+end_src
(On the first run, it will look a /lot/ longer, since there'll be a lot more pages the script has to fetch - it doesn't have a feed XML to work from yet, so it has to look at everything listed on the Archives page to build one.)
** ok that's cool and all but how do i use it tho
Fair question!
*** Prerequisites
You need:
- a PHP 7.3 or newer installation, built with SimpleXML support
- a location on the filesystem where the script can write the feed XML file
- that location to also be servable as part of the website, so that RSS readers can find and use the feed XML
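You can sanity-check the first requirement from a shell. This one-liner is just a convenience sketch - it assumes ~php~ is on your ~PATH~; adjust the binary path for your host if needed:
#+begin_src sh :eval never
# Prints an OK message if PHP is 7.3+ and the SimpleXML extension is loaded.
# Assumes `php` is on PATH; use the full path to your PHP binary otherwise.
php -r 'exit(PHP_VERSION_ID >= 70300 && extension_loaded("simplexml") ? 0 : 1);' \
  && echo "PHP 7.3+ with SimpleXML: OK" \
  || echo "PHP 7.3+ with SimpleXML: missing"
#+end_src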
*** Setting up the RSS monster
1. On the site server, clone the repository (or download and unpack the zip file) in a directory where you can run PHP code.
2. Check the values in ~config/default.php~ and update them as necessary.
- In particular, you will certainly need to update ~feedPath~, since that's where the completed feed XML will be placed. This needs to be somewhere that's included in your web server config, so that you can visit the feed XML file in the browser.
3. Test-run the script: ~php -f rssmonster.php~
- If all's well, this will emit a feed file at the location specified in ~feedPath~.
4. Add a cron job to periodically run the script.
- How frequently it runs will determine how quickly new pages posted to the website are picked up and included in the feed. Every four hours is probably enough.
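For example, a crontab entry for an every-four-hours run might look like this. (The paths are placeholders - use your actual PHP binary and wherever you cloned the script.)
#+begin_src crontab :eval never
# min hour dom mon dow  command
0 */4 * * * /usr/local/bin/php -f /path/to/rssmonster/rssmonster.php
#+end_src
By default, cron mails the script's output to the crontab's owner, which is one way the error reports mentioned earlier can reach you.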
Once you've got the RSS monster running happily, you'll need to set up the website to make the RSS feed XML available to visitors.
*** Setting up the website
This is as simple as adding an HTML ~link~ tag to the ~head~ of each page on the site. The tag might look like this:
#+begin_src html :eval never
<link
href="Your Feed URL Here"
rel="alternate"
type="application/rss+xml"
title="Your Feed Title Here" />
#+end_src
*Note that you'll need to set ~href~ and ~title~ in this tag.* For ~title~, that'll be the name you want RSS readers to show for your site. For ~href~, that needs to be the relative URL corresponding to the location of the feed XML file on your server. I don't know what that needs to be! But your server admin will.
Ideally, you want to add this tag to /every/ page on your site, and to the page template (if any). That's important because RSS readers with browser integrations will be able to show there's a feed available, no matter what page of the site someone happens to be looking at. It also helps with RSS readers that /don't/ have a browser integration; with the tag on every page, someone who wants to follow your work via RSS can drop any URL on the site into their reader's "Add Feed" dialog, and it will work as expected.
** What could possibly go wrong?
Oh, a few things! Here's some troubleshooting advice.
*** I changed a page that is already in the feed, but the feed isn't updating to match
This is expected! Since the existing feed XML is used as a cache of known pages, the script will not re-check a page that already appears in the feed XML file.
*Fix:* Delete the existing feed XML file. This will force the script to re-check all pages on the site, getting the newest version of each.
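In shell terms, that looks like this - assuming the ~/tmp/feed.xml~ location from the demo output above; substitute your configured ~feedPath~:
#+begin_src sh :eval never
# Remove the cached feed so the next run re-fetches every page.
# -f keeps this from erroring if the file is already gone.
rm -f /tmp/feed.xml
# Then either run the script by hand, or wait for the next cron run:
# php -f rssmonster.php
#+end_src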
*** The script doesn't seem to be breaking, but new pages don't show up in the feed
You might have changed the HTML structure of the Archives page.
The script looks in the Archives page for the following tag structure:
#+begin_src html :eval never
<div class="textcontent">
<ul> <!-- representing an arc -->
<a href="comic#.php">page #</a> // "Title of the page"<br>
<a href="comic#.php">page #</a> // "Title of the page"<br>
</ul>
</div>
#+end_src
If the page no longer has this shape, then the script won't be able to parse it and find comic links. That may result in errors from the script, but it may /also/ result in the script silently failing to add new pages to the XML (and failing to add /any/ pages to the XML if you regenerate from scratch).
*Fix:* Modify either the Archives page, or the script, such that the script is able to understand the page.
(Mostly you would need to make these changes in [[lib/fetch-site-content.php][~lib/fetch-site-content.php~]].)
*** New items are added, but their images aren't showing up
You might have changed the HTML of the affected comic pages.
Similar to the above, the script expects to find the following HTML structure in a comic page:
#+begin_src html :eval never
<div class="page">
<img src="comic#.png" />
</div>
#+end_src
If this structure isn't present, then the image won't be found. This /should/ produce an error in the script's output, but it's possible there may be cases where it would silently fail.
*Fix:* Update either the script or the page, as above. (Probably easier to modify the page in this case, since the script isn't really set up to handle lots of variation in the way the comic pages are structured. It /should/ be ok with changes that aren't too drastic, and the Archive page parsing is considerably more sensitive - still, better to avoid the problem entirely if possible, by keeping page HTML consistent.)
*** New items are added, but their titles or something aren't correct
You might have changed how the Archives page is structured.
Outside of the comic image itself, all information for a given comic page comes from its entry in the Archives page. As mentioned above, that needs to stay consistent in order to be correctly parsed. The fix is the same, too: change either the page or the script such that the script can parse the page.
*** None of that helped or something else is going wrong
Something else broke. :) Consult your server admin for advice - I'm happy to provide support with the script implementation itself, but there's not much I can do about the way the server is set up. That said, if your server admin needs to talk to me, they can hit me up via fedi ([[https://tilde.zone/@alexis][@alexis@tilde.zone]]) or email ([[mailto:lexie@alexis-marie-wright.me][lexie@alexis-marie-wright.me]]).