blah blah blah generator for cosmic.voyage verse from prose

terris Station 3d3a35a711 added mark8e.py for epic poetry lookalike		2019-11-20 14:21:02 -05:00
corpus/prose	heebie jeebies	2019-11-14 23:50:15 -05:00
samples	don't even know what changes i made. oops	2019-11-20 14:10:29 -05:00
mark8.py	don't even know what changes i made. oops	2019-11-20 14:10:29 -05:00
mark8e.py	added mark8e.py for epic poetry lookalike	2019-11-20 14:21:02 -05:00
markchainer.py	heebie jeebies	2019-11-14 23:50:15 -05:00
readme.md	heebie jeebies	2019-11-14 23:50:15 -05:00
sedtest.txt	heebie jeebies	2019-11-14 23:50:15 -05:00
sedtest.txt.std	heebie jeebies	2019-11-14 23:50:15 -05:00
stdtxt.sh	heebie jeebies	2019-11-14 23:50:15 -05:00

readme.md

* * * * * * * * * * * * * * * * * * * * * * * *
*    ___                _   __  _____    ___  * 
*   / _ )_______ ____ _(_) /  |/  / /__ ( _ ) *
*  / _  / __/ _ `/ _ `/ / / /|_/ /  '_// _  | *
* /____/_/  \_,_/\_, /_/ /_/  /_/_/\_(_)___/  * 
*               /___/                         * 
* * * * * * * * * * * * * * * * * * * * * * * *

This is a work-in-progress project that uses Markov chains to generate verses that look like poetry, for the ship stjörnuvagn Bragi on https://cosmic.voyage/

It is presented here without text sources or Markov chains because you can get the sources from Project Gutenberg like I did. Besides, I don't want to distribute Gutenberg texts without the license verbiage (I had to remove it before generating the models). Support Project Gutenberg! Great old texts are not just for mining, they are also for reading. https://www.gutenberg.org/

Requirements

I did all this in a virtualenv, and installed the following packages with pip3:

markovify

nltk

language-check - installs LanguageTool, which requires Java

Included

mark8.py - the main generator proof of concept
markchainer.py - generates models from text files already processed by:
stdtxt.sh - sed pipeline to clean up the text (numbers, blank lines, underscores, brackets)
samples/mark8test.txt - rough-looking samples produced by rough-looking code during debugging.
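
As a rough illustration of the kind of cleanup stdtxt.sh is described as doing (numbers, blank lines, underscores, brackets), here is a Python sketch. The actual sed rules are not reproduced in this readme, so every pattern below is an assumption, and clean_text is a hypothetical name rather than anything in the repository.

```python
import re


def clean_text(raw: str) -> str:
    """Approximate a text cleanup pass; the real stdtxt.sh rules may differ."""
    # Assumption: "brackets" means bracketed notes such as [Footnote 1].
    text = re.sub(r"\[[^\]]*\]", "", raw)
    # Assumption: "numbers" means any run of digits (line or verse numbers).
    text = re.sub(r"\d+", "", text)
    # Assumption: underscores are Gutenberg-style emphasis markers.
    text = text.replace("_", "")
    # Collapse runs of spaces/tabs and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Drop blank lines and end the result with a single newline.
    return "\n".join(line for line in lines if line) + "\n"


if __name__ == "__main__":
    print(clean_text("12 _Sing_, O goddess [1]\n\n\nthe anger of Achilles\n"), end="")
```

Doing this in one function keeps the rules easy to compare against the sed pipeline, but it is only a stand-in; stdtxt.sh remains the script the procedure relies on.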

Procedure

1. Download some large text files from Project Gutenberg or from https://www.archive.org
   1a. Alternatively, build your own large corpus through other means (web scraping, downloading corpus archives, etc.).
2. Trim each text file as needed, so that it contains only the kind of material you want to generate text from.
3. Use iconv or other means to make sure the texts all share the same encoding (UTF-8 and ASCII were tested).
4. Run stdtxt.sh on the main input files. This should produce something like inputfile.txt.std
5. supernice python3 markchainer.py (this looks in './corpus/prose/' for *.std files and generates a model for each; the models end up in './corpus/prose/chains', named something like inputfile.txt.std.mkdch)
6. supernice python3 mark8.py >> output.txt
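
To give a feel for what "generate a model, then generate text from it" means, here is a minimal word-level Markov chain using only the Python standard library. It is not how markovify or the scripts in this repository work internally; build_chain and generate are hypothetical names, and the state size of 2 is an assumption chosen for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, state_size=2):
    """Map each run of `state_size` words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return dict(chain)


def generate(chain, max_words=12, rng=None):
    """Walk the chain from a random state until it dead-ends or hits max_words."""
    if not chain:
        return ""
    rng = rng or random.Random()
    state = rng.choice(sorted(chain))  # sorted() keeps seeded runs repeatable
    out = list(state)
    while len(out) < max_words:
        followers = chain.get(tuple(out[-len(state):]))
        if not followers:
            break  # no known continuation for the current state
        out.append(rng.choice(followers))
    return " ".join(out)


if __name__ == "__main__":
    corpus = "the sea is wide and the sea is deep and the sky is wide"
    print(generate(build_chain(corpus), max_words=8))
```

A real corpus needs far more text than this toy string before the output starts to resemble verse, which is why the procedure begins with large text files.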

supernice is just a bash alias:

alias supernice='nice -n 19 ionice -c 3'

...to help reduce load on the server from running this toy. The markovify package (especially when using nltk) can consume a lot of resources, even more so when combined with language-check/LanguageTool, so the python scripts are slowed down further using time.sleep().
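
The same idea can be sketched from inside Python. This is an assumption about how one might throttle a generator loop, not a copy of what mark8.py does; lower_priority and throttled are hypothetical names. os.nice covers only the CPU niceness half of the alias; the standard library has no equivalent of ionice.

```python
import os
import time


def lower_priority(increment=19):
    """Raise this process's niceness (Unix only); returns the new niceness or None."""
    if hasattr(os, "nice"):
        return os.nice(increment)
    return None


def throttled(items, delay=0.5, sleep=time.sleep):
    """Yield items one at a time, pausing `delay` seconds after each one."""
    for item in items:
        yield item
        sleep(delay)


if __name__ == "__main__":
    lower_priority()
    for verse in throttled(["first line", "second line"], delay=0.1):
        print(verse)
```

Passing the sleep function in as a parameter is only a convenience for trying the loop without actually waiting.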