rssmonster

Alexis Marie Wright b3df0ce3d7 fix feed generator value		2022-04-27 00:55:56 -04:00
config	debug logging off by default	2022-04-26 23:45:28 -04:00
lib	fix feed generator value	2022-04-27 00:55:56 -04:00
templates	fix feed generator value	2022-04-27 00:55:56 -04:00
vendor	deps for php 7.3.31	2022-04-26 23:33:37 -04:00
README.org	writing style	2022-04-16 22:11:49 -04:00
composer.json	deps for php 7.3.31	2022-04-26 23:33:37 -04:00
composer.lock	deps for php 7.3.31	2022-04-26 23:33:37 -04:00
rssmonster.php	Use existing feed as cache when present	2022-04-08 00:26:10 -04:00

README.org

Lexie's RSS Monster (for Cial)

notices code OwO what's this?

This is a tool for generating an RSS feed from the content of https://www.inhuman-comic.com.

You can see a demo of its output (not live updating!) here.

Tell me more!

It starts from the Archives page, which helpfully has a complete list of comic pages and splashes published to date, and reads each listed page to find the actual comic page image. From this it produces an RSS 2.0 XML feed file which any compliant RSS reader (which, given the age of that standard, should be most of them!) can consume.

The feed XML from this tool isn't quite compliant with RSS 2.0, which has no provision for arbitrary attributes on feed items (its only extension mechanism is namespaced elements). I'm cheating a little by using HTML5-style data- prefixed attributes on the <item /> element in order to capture a few pieces of metadata for the comic page a given item represents.

Capturing metadata this way is useful because, with it, the feed XML file can also serve as a cache of previously retrieved comic pages. When the script starts up, it checks whether a feed XML file is already present, and if so reads in that file and uses it to identify already-fetched pages. This way, the script only has to look at pages it hasn't seen before, which reduces both server load and runtime.
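As a rough illustration of the cache idea (in Python rather than the script's own PHP), the sketch below reads an existing feed, collects the page URLs recorded on its items, and works out which pages from the Archives list still need fetching. The attribute name data-source-url, the sample feed, and the function names are all made up for this example; they are not taken from the script.

```python
import xml.etree.ElementTree as ET

# Hypothetical existing feed. "data-source-url" is an assumed attribute
# name used only for this illustration.
EXISTING_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item data-source-url="https://example.com/comic1.php">
      <title>page 1</title>
    </item>
    <item data-source-url="https://example.com/comic2.php">
      <title>page 2</title>
    </item>
  </channel>
</rss>
"""


def known_urls(feed_xml):
    """Return the set of page URLs already recorded on <item> elements."""
    root = ET.fromstring(feed_xml)
    return {
        item.get("data-source-url")
        for item in root.iter("item")
        if item.get("data-source-url")
    }


def pages_to_fetch(archive_urls, feed_xml):
    """Keep only the archive URLs that the existing feed has not seen."""
    seen = known_urls(feed_xml)
    return [url for url in archive_urls if url not in seen]


archive = [
    "https://example.com/comic1.php",
    "https://example.com/comic2.php",
    "https://example.com/comic3.php",
]
print(pages_to_fetch(archive, EXISTING_FEED))
```

Only the third URL is left to fetch here, which is the whole point: pages already in the feed cost nothing on later runs.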

There are some caveats to be aware of.

Site structure

Because the site isn't generated from any kind of content-management system, the script has to rely on the structure of the site itself in order to find the information it needs for the feed. This is basically okay because the site is consistently structured. Otherwise this script would not have been possible at all. But the site needs to stay consistently structured for the script to keep working.

HTML parsing is surprisingly flexible, but not bulletproof, and not all of the information this script uses is encoded in the site's HTML. Some of it, notably page numbers and titles, has to come from the plain(ish) text of the Archives page. So if you change how you structure the Archives page, or the comic pages, this script is likely to break.

Dates

Put simply: There aren't any! I wasn't able to find anything in the site content that identified a publication date for each of the comic pages.

That's not a problem for the site, but it is potentially a problem for RSS readers. If we don't provide something for a date, it isn't safe to assume all readers will be able to show the feed items in the correct order.

This script deals with that by making up a date for each item, based on when the script is run. That should work OK, but may behave a little strangely if there's an existing feed XML file, and more than one new item has been posted on the site since the last time the script was run. In that case, the easiest fix is probably to delete the feed XML file.

(Note also that, because the script does have to make up dates as it goes along, deleting the feed XML file will change all the dates in it. This shouldn't be a problem; RSS readers have other ways to know whether two items in different versions of a feed are the same.)
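To make the idea of made-up dates concrete, here is a minimal Python sketch of one way to derive them from the run time. It is an assumption for illustration, not a description of what the script does: in particular, the one-second offset per item (so that several new items keep their Archives order) and the function name are invented here.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def synthetic_pub_dates(new_titles, run_time):
    """Pair each new item with a made-up RFC 822 date based on run_time.

    Items are assumed to be listed oldest first. Each later item gets a
    date one second newer, so sorting by date preserves the list order.
    """
    return [
        (title, format_datetime(run_time + timedelta(seconds=offset)))
        for offset, title in enumerate(new_titles)
    ]


# A fixed run time keeps the example reproducible; a real run would use
# datetime.now(timezone.utc) instead.
run_time = datetime(2022, 4, 15, 16, 44, 56, tzinfo=timezone.utc)
for title, pub_date in synthetic_pub_dates(["page 757", "page 758"], run_time):
    print(title, "->", pub_date)
```

The dates say nothing about when a page was really published; they only give feed readers something consistent to sort by.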

Error handling

The script can emit considerable debugging information to its output, and tries to be pretty good about explaining why things break when they do. The script also tries hard to ensure that, if anything does break, the worst that happens is that the feed XML isn't updated. The idea is, it may not gain new pages right away, but it will at least stay correct with what it has.
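One common way to get that kind of guarantee (shown here in Python, and not necessarily how this script does it) is to write the new feed to a temporary file and only swap it into place once it has been written completely. The function name and the file mode below are assumptions for the sake of the example.

```python
import os
import tempfile


def write_feed_atomically(feed_path, feed_xml):
    """Replace feed_path with feed_xml, or leave the old file untouched."""
    directory = os.path.dirname(os.path.abspath(feed_path))
    # Create the temporary file next to the target so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(feed_xml)
        # mkstemp creates owner-only files; loosen the mode so a web
        # server running as another user can still read the feed.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, feed_path)
    except BaseException:
        # Something broke: discard the partial file and re-raise. The
        # previous feed, if any, is still in place.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this shape, a crash partway through a run leaves readers with the previous feed rather than a truncated one.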

That said, I can't guarantee all errors will be caught. It is a good idea to add the RSS feed to your own feed reader, so that you can keep an eye on how it's behaving when you add new content. If something seems to be going wrong, you should be able to get the script's output via email (or however cronjob output is handled on your host), and that will give you some idea what's wrong.

You can always also run the script by hand to observe its output directly. That will look something like this if everything's good:

/usr/local/bin/php -f rssmonster.php
2022-04-15T16:44:56+00:00 [info] Read 448717 bytes from /tmp/feed.xml
2022-04-15T16:44:56+00:00 [info] Read 757 items from /tmp/feed.xml
2022-04-15T16:44:56+00:00 [warn] Got HTTP 404 from http://www.inhuman-comic.com/11backcover.php; skipping
2022-04-15T16:44:56+00:00 [warn] Got HTTP 404 from http://www.inhuman-comic.com/coimc536.php; skipping
2022-04-15T16:44:57+00:00 [info] Finished fetching 758 pages
2022-04-15T16:44:57+00:00 [info] writing 449526 bytes to /tmp/feed.xml

(On the first run, it will look a lot longer, since there'll be a lot more pages the script has to fetch. It doesn't have a feed XML to work from yet, so it has to look at everything listed on the Archives page to build one.)

ok that's cool and all but how do i use it tho

Fair question!

Prerequisites

You need:

- a PHP 7.3 or newer installation, built with SimpleXML support
- a location on the filesystem where the script can write the feed XML file
- for that location to be servable as part of the website, so that RSS readers can find and use the feed XML

Setting up the RSS monster

1. On the site server, clone the repository (or download and unpack the zip file) in a directory where you can run PHP code.
2. Check the values in config/default.php and update them as necessary.
   - In particular, you will certainly need to update feedPath, since that's where the completed feed XML will be placed. This needs to be somewhere that's included in your web server config, so that you can visit the feed XML file in the browser.
3. Test-run the script: php -f rssmonster.php
   - If all's well, this will emit a feed file at the location specified in feedPath.
4. Add a cron job to periodically run the script.
   - How frequently it runs will determine how quickly new pages posted to the website are picked up and included in the feed. Every four hours is probably enough.

Once you've got the RSS monster running happily, you'll need to set up the website to make the RSS feed XML available to visitors.

Setting up the website

This is as simple as adding an HTML link tag to the head of each page on the site. The tag might look like this:

<link
  href="Your Feed URL Here"
  rel="alternate"
  type="application/rss+xml"
  title="Your Feed Title Here" />

Note that you'll need to set href and title in this tag. For title, that'll be the name you want RSS readers to show for your site. For href, that needs to be the relative URL corresponding to the location of the feed XML file on your server. I don't know what that needs to be! But your server admin will.

Ideally, you want to add this tag to every page on your site, and to the page template (if any). That's important because RSS readers with browser integrations will be able to show there's a feed available, no matter what page of the site someone happens to be looking at. It also helps with RSS readers that don't have a browser integration; with the tag on every page, someone who wants to follow your work via RSS can drop any URL on the site into their reader's "Add Feed" dialog, and it will work as expected.
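For the curious, this is roughly what a feed reader does with that tag when someone hands it an arbitrary page URL. The Python sketch below is an illustration only: the class name, the sample page, and the example.com URLs are all made up, and real readers are more forgiving about messy HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FeedLinkFinder(HTMLParser):
    """Collect RSS feeds advertised via <link rel="alternate"> tags."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.feeds = []  # list of (title, absolute feed URL) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rels and mime == "application/rss+xml" and href:
            # A relative href is resolved against the page it appears on.
            self.feeds.append((attr.get("title"), urljoin(self.page_url, href)))


PAGE = """<html><head>
<link
  href="/feed.xml"
  rel="alternate"
  type="application/rss+xml"
  title="Example Comic" />
</head><body></body></html>"""

finder = FeedLinkFinder("https://example.com/comic12.php")
finder.feed(PAGE)
print(finder.feeds)
```

Because the href is resolved against whichever page the reader was given, the same tag works from any page on the site.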

What could possibly go wrong?

Oh, a few things! Here's some troubleshooting advice.

I changed a page that is already in the feed, but the feed isn't updating to match

This is expected! Since the existing feed XML is used as a cache of known pages, the script will not re-check a page that already appears in the feed XML file.

Fix: Delete the existing feed XML file. This will force the script to re-check all pages on the site, getting the newest version of each.

The script doesn't seem to be breaking, but new pages don't show up in the feed

You might have changed the HTML structure of the Archives page.

The script looks in the Archives page for the following tag structure:

  <div class="textcontent">
    <ul> <!-- representing an arc -->
      <a href="comic#.php">page #</a> // "Title of the page"<br>
      <a href="comic#.php">page #</a> // "Title of the page"<br>
    </ul>
  </div>

If the page no longer has this shape, then the script won't be able to parse it and find comic links. That may result in errors from the script, but it may also result in the script silently failing to add new pages to the XML (and failing to add any pages to the XML if you regenerate from scratch).

Fix: Modify either the Archives page, or the script, such that the script is able to understand the page.

(Mostly you would need to make these changes in lib/fetch-site-content.php.)
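To show why the shape of the Archives page matters so much, here is a small Python sketch of a parser for the structure above. It is not the code in lib/fetch-site-content.php; the class name, the field names, and the sample titles are invented for illustration. Note how the title is only recognizable as "the text between a link and the next <br>", which is exactly the kind of assumption that breaks when the page layout changes.

```python
from html.parser import HTMLParser


class ArchiveParser(HTMLParser):
    """Pull (href, label, title) entries out of the Archives structure."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # > 0 while inside <div class="textcontent">
        self.entries = []     # finished entries, in page order
        self.current = None   # entry currently being assembled
        self.in_link = False  # True between <a ...> and </a>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "div":
            classes = (attr.get("class") or "").split()
            if self.div_depth or "textcontent" in classes:
                self.div_depth += 1
        elif self.div_depth and tag == "a" and attr.get("href"):
            self._finish()
            self.current = {"href": attr["href"], "label": "", "title": ""}
            self.in_link = True
        elif self.div_depth and tag == "br":
            self._finish()  # a <br> ends the current entry

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "div" and self.div_depth:
            self.div_depth -= 1
            if not self.div_depth:
                self._finish()

    def handle_data(self, data):
        if self.current is not None:
            # Text inside the link is the label; text after it is the title.
            self.current["label" if self.in_link else "title"] += data

    def _finish(self):
        if self.current is None:
            return
        entry = self.current
        entry["label"] = entry["label"].strip()
        # Drop the leading "//" separator and the surrounding quotes.
        entry["title"] = entry["title"].strip().lstrip("/").strip().strip('"')
        self.entries.append(entry)
        self.current = None


SAMPLE = """<div class="textcontent">
  <ul> <!-- representing an arc -->
    <a href="comic1.php">page 1</a> // "First steps"<br>
    <a href="comic2.php">page 2</a> // "Second thoughts"<br>
  </ul>
</div>"""

parser = ArchiveParser()
parser.feed(SAMPLE)
for entry in parser.entries:
    print(entry)
```

Move the links out of the textcontent div, or drop the <br> tags, and this sketch quietly returns fewer or stranger entries without raising an error, much like the failure described above.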

New items are added, but their images aren't showing up

You might have changed the HTML of the affected comic pages.

Similar to the above, the script expects to find the following HTML structure in a comic page:

  <div class="page">
    <img src="comic#.png" />
  </div>

If this structure isn't present, then the image won't be found. This should produce an error in the script's output, but it's possible there may be cases where it would silently fail.

Fix: Update either the script or the page, as above. (Probably easier to modify the page in this case, since the script isn't really set up to handle lots of variation in the way the comic pages are structured. It should be okay with changes that aren't too drastic; the Archives page parsing is considerably more sensitive. Still, better to avoid the problem entirely if possible, by keeping page HTML consistent.)
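As an illustration of the lookup described here, the Python sketch below finds the first image inside a div with class "page" and fails loudly when there isn't one. It is a stand-in written for this README, not the script's own code; the names and the example.com URLs are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ComicImageFinder(HTMLParser):
    """Remember the first <img> found inside <div class="page">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0     # > 0 while inside <div class="page">
        self.image_src = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "div":
            classes = (attr.get("class") or "").split()
            if self.div_depth or "page" in classes:
                self.div_depth += 1
        elif tag == "img" and self.div_depth and self.image_src is None:
            self.image_src = attr.get("src")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def find_comic_image(page_url, html):
    """Return the absolute image URL, or raise if the structure is missing."""
    finder = ComicImageFinder()
    finder.feed(html)
    if not finder.image_src:
        raise ValueError(f'no <img> inside <div class="page"> at {page_url}')
    return urljoin(page_url, finder.image_src)


SAMPLE_PAGE = """<html><body>
<img src="banner.png">
<div class="page">
  <img src="comic12.png" />
</div>
</body></html>"""

print(find_comic_image("https://example.com/comic12.php", SAMPLE_PAGE))
```

Images outside the div (like the banner in the sample) are ignored, so renaming the class or moving the image out of the div is enough to make the lookup fail.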

New items are added, but their titles or something aren't correct

You might have changed how the Archives page is structured.

Outside of the comic image itself, all information for a given comic page comes from its entry in the Archives page. As mentioned above, that needs to stay consistent in order to be correctly parsed. The fix is the same, too: change either the page or the script such that the script can parse the page.

Dates aren't in the right order in the feed file

More than one item might have been posted on the site since the last time the script ran. See the discussion of date handling above for details. (And let me know about it! If that is a real problem, then it's one worth fixing.)

Fix: Delete the existing feed file and let the script create a new one from scratch.

None of that helped or something else is going wrong

Something else broke. :) Consult your server admin for advice. I'm happy to provide support with the script implementation itself, but there's not much I can do about the way the server is set up. That said, if your server admin needs to talk to me, they can hit me up via fedi (@alexis@tilde.zone) or email (lexie@alexis-marie-wright.me).