# README

`mill.py v1.2.0`

Markdown interface for [llama.cpp](//github.com/ggerganov/llama.cpp).


## Requirements

1. [Python 3.x](//python.org) (tested on `3.11`)
2. [llama.cpp](//github.com/ggerganov/llama.cpp) (tested on `b1860`)

Developed and tested on Linux. I believe it could also work on Windows or Mac.


## Features

1. Lets you interact with `llama.cpp` using Markdown
2. Enables you to use almost every `llama.cpp` option
3. Makes no assumptions about what model you want to use
4. Lets you change any option at any point in the document
5. Caches prompts automatically
6. Streams output
7. Runs in a CLI environment as well as a CGI environment
8. Reads input document from `stdin`, writes output document to `stdout`
9. Lets you add support for any other language (i.e. other than Markdown) or
   LLM engine through Python modules


## Example

Contents of `hello.md`:

    ## Variables

    ```mill-llm
    --model
    mixtral-8x7b-instruct-v0.1.Q5_0.gguf
    ```
    
    ```mill-llm
    --ctx-size
    0
    ```
    
    ```mill-llm
    --keep
    -1
    ```
    
    ```mill
    message template
    

    Me:
    
    > [INST]  [/INST]
    
    Bot:
    
    >
    ```
    
    ```mill
    prompt indent
    >
    ```
    
    ## Chat
    
    ```mill
    prompt start
    ```
    
    Me:
    
    > [INST] Hello, how are you? [/INST]
    
    Bot:
    
    >

Command:

```bash
export MILL_LLAMACPP_MAIN=path/to/llama.cpp/main /path/to/mill_cli.py <hello.md
```

Output:

    ## Variables

    ```mill-llm
    --model
    mixtral-8x7b-instruct-v0.1.Q5_0.gguf
    ```
    
    ```mill-llm
    --ctx-size
    0
    ```
    
    ```mill-llm
    --keep
    -1
    ```
    
    ```mill
    message template


    Me:

    > [INST]  [/INST]

    Bot:
    
    >
    ```
    
    ```mill
    prompt indent
    >
    ```
    
    ## Chat
    
    ```mill
    prompt start
    ```
    
    Me:
    
    > [INST] Hello, how are you? [/INST]
    
    Bot:
    
    > Hello! I'm just a computer program, so I don't have feelings, but I'm here to help you with any questions you have to the best of my ability. Is there something specific you would like to know or talk about?</s>
    
    Me:
    
    > [INST]  [/INST]
    
    Bot:
    
    >


## CLI install + usage

1. Clone the Git repo or download a release tarball and unpack it.
2. Set the environment variable `MILL_LLAMACPP_MAIN` to the path of
   `llama.cpp/main` or your wrapper around it.
3. Pipe your Markdown document to `mill_cli.py`.

```bash
export MILL_LLAMACPP_MAIN=/path/to/llama.cpp/main
python /path/to/mill_cli.py <document.md
```

The result printed on stdout is the original document with the generated text
from the LLM added.

Like any CLI tool, you can also use it through SSH:

```bash
cat document.md | ssh <host> \
    "MILL_LLAMACPP_MAIN=/path/to/llama.cpp/main python /path/to/mill_cli.py" \
    2>/dev/null
```

Use the command-line arguments to select a different language or LLM engine.
You can use `-h` for a usage description.


## CGI install + usage

1. Clone the Git repo or download a release tarball and unpack it.
2. Set the environment variable `MILL_LLAMACPP_MAIN` to the path of
   `llama.cpp/main` or your wrapper around it.
3. Start your CGI web server.

```bash
mkdir -pv public_html/cgi-bin
cp -v mill_cgi.py public_html/cgi-bin
cp -v mill.py public_html/cgi-bin
cp -v mill_readme.py public_html/cgi-bin
cp -v mill_lang_markdown.py public_html/cgi-bin
cp -v mill_llm_llama_cpp.py public_html/cgi-bin
cp -v mill_example_markdown_llama_cpp.py public_html/cgi-bin
chmod +x public_html/cgi-bin/mill_cgi.py
export MILL_LLAMACPP_MAIN=/path/to/llama.cpp/main
python -m http.server --cgi -d public_html
```

`mill.py` doesn't come with a web interface, but it should work well with
generic HTTP tools.  Here is an example `curl` invocation:

```bash
cat document.md | curl -s -N -X POST --data-binary @- \
    --dump-header /dev/null http://host/path/to/cgi-bin/mill_cgi.py
```

On Android, I can recommend [HTTP
Shortcuts](https://github.com/Waboodoo/HTTP-Shortcuts). You can for example use
it to send your phone's clipboard directly to the CGI tool and copy the HTTP
response automatically back to the clipboard.

Use the `language` and `llm_engine` query-string parameters to select a
different language or LLM engine.


## Markdown tutorial

This section describes the Markdown language module of `mill.py`.

`mill.py` is controlled with variables embedded in the Markdown document.

In general, variables take the form

    ```variable-type [reset]
    name
    [value]
    ```

Variables are assigned to in fenced code blocks. The syntax follows the
CommonMark spec as much as possible. The first line inside the block is the
name of the variable.  The name can contain spaces in principle. The text of
the block from the second line onward is the value.  Nothing prevents you from
having a multi-line value. It depends on the variable whether or not this makes
sense.  The value of a block with only a variable name is the empty string.

Variables are either syntax variables or LLM variables. The distinction is made
based on the variable type contained in the info string.  Syntax variables have
type `mill` and are handled directly by `mill.py` while LLM variables have
other types and are passed on to the LLM-engine module.

Syntax variables and LLM variables exist in two different namespaces. The
namespace is implied by the variable type. If the `reset` flag is given, then
the variable value must be absent. The variable is reset to its default value.
If the variable has no default value, then it ceases to exist in the namespace
until it is assigned to again.

`mill.py` parses the text in a single pass from top-to-bottom and then calls
the LLM at the end. Some syntax variables affect input parsing.  Assignments to
a variable overwrite any existing value. For LLM variables, the final value of
a variable is the value passed on to the LLM.

The following two subsections explain variables in more detail. For each
variable, the default value is given as the value.


### Syntax variables

The following variables are syntax variables.


    ```mill
    prompt start
    ```

The `prompt start` variable marks the start of the prompt.  `mill.py` excludes
_everything_ before the last occurrence of this variable from the prompt.
_However_, if this variable does not exist, then `mill.py` considers that the
potential prompt starts at the beginning of the document.  In other words,
omitting it is the same as putting it at the very start of the document.

When the prompt size exceeds the LLM's context limit, you can either move the
`prompt start` variable down or create another one.  The value of this variable
doesn't matter.  It's only its position in the document that counts.


    ```mill
    prompt indent
       >
    ```

The value of the `prompt indent` variable must be (at most) one line. It's a
line prefix.  Only blocks for which the lines start with this prefix are
considered to be part of the prompt. These blocks are called _prompt indent
blocks_ throughout the tutorial. The `prompt indent` variable affects input
parsing. For each line of input, the most recent value of this variable is used
to identify prompt indent blocks.

Technically, you can set `prompt indent` to the empty string. _No variables are
parsed in a prompt indent block._ So, in this situation, if the prompt starts
before the setting, then all the text below the assignment is considered to be
part of the prompt, and any variable assignments below the setting are ignored.


    ```mill
    message template
    ```

The `message template` variable contains the template for each message. When
`mill.py` responds to a prompt, the value of this variable is added at the end
of the output of the LLM.

Note that `mill.py` does not add extra newlines to the output of the LLM in
general. You can add blank lines at the start of the message template instead.
This is by design. Some models are sensitive to newlines, so the user should be
able to control newlines.


### LLM variables

There are three different variable types for LLM variables:

1. `mill-llm`
2. `mill-llm-file`
3. `mill-llm-b64-gz-file`

The first type simply assigns the value to the name.

For some LLM engines (like `llama.cpp`), it's useful to pass arguments via a
file. This can be done using the second and third variable types. For example,
you can pass a grammar via either `--grammar` or `--grammar-file`.  However,
grammars can contain tokens that `mill.py` does not know how to shell-escape.
In that case, you have to use `--grammar-file`. The next paragraph explains how
to use it.

To pass an argument via a file, use `mill-llm-file` or `mill-llm-b64-gz-file`.
The former is for text data, the latter for binary data. The value is stored in
a temporary file. The name of the temporary file subsequently becomes the new
value of the variable.  Binary data must be a base64 representation of a
gzipped file. The file is uncompressed by `mill.py` before passing it to the
LLM. The base64 data can be split across multiple lines. The newlines are
removed in that case.


### Prompt construction

The algorithm to construct the entire prompt is simple and can be stated in one
line: _concatenate the text of all the prompt indent blocks below the last
prompt start._

The text of a prompt indent block does not include the prompt indent for each
line. Everything else is included, even newlines, with one exception: the
newline that ends the block is excluded.


## `llama.cpp` tutorial

This section describes the `llama.cpp` LLM-engine module of `mill.py`.


### LLM variables

`suppress eos`

Some models perform better if the EOS is part of the prompt. `llama.cpp` models
have a setting `add_eos_token` that seems to mean 'please add the EOS to the
generated text.' `mill.py` respects this setting, and adds the EOS in that case
if the model generates it, _unless_ you declare the LLM variable `suppress eos`
in the document. In that case `mill.py` will not add the EOS token if the model
generates it.

Other LLM variables are simply passed on to `llama.cpp` as command-line
arguments. A variable with an empty value is passed without a value (i.e. as a
flag).  There are a couple of LLM variables that are reserved for `mill.py`, so
you cannot use them. These are:

- `--file` for the input prompt
- `--prompt-cache` for the prompt cache to use
- `--prompt-cache-all` so that generated text is also cached

Using these variables results in an error.


### Environment variables

Apart from LLM variables, there are also a few environment variables that
influence the behavior of the `llama.cpp` module.


`MILL_LLAMACPP_MAIN`

This variable is required and must be set to the path of `llama.cpp/main`. It
can be your own script too. As long as:

1. The script can accept arguments that are passed from `mill.py`.
2. The standard output consists of the input prompt followed by the generated
   text.
3. The error output contains error output generated by `llama.cpp`. This is
   used by `mill.py` to extract some settings from the model's metadata
   such as BOS and EOS tokens and whether or not to add them in the right
   places.


`MILL_LLAMACPP_CACHE_DIR`

Path to the directory where the prompt caches are stored. By default this is
the OS's temporary-files directory. Note: prompt caches can be large files and
`mill.py` does not automatically clean them. You can recognize the files by the
extension `.promptcache`.


`MILL_LLAMACPP_TIMEOUT`

The maximum number of seconds to wait for the `llama.cpp/main` process to
complete.  Default is 600.


### Prompt caching

For each invocation, a prompt cache is generated. `mill.py` searches for a
matching prompt cache after parsing.


## Adding support for other languages

To add support for another language:

1. Create a new Python module named `mill_lang_<language_id>` where all
   non-alphanumeric characters of `language_id` are replaced by underscores.
2. Implement a `parse` function similar to the one in `mill_lang_markdown.py`.
3. Add a docstring to the module. This docstring serves as the module's README.
4. Put your module anywhere on the Python path of `mill.py`.
5. When using the CLI interface, pass the `-l <language_id>` argument.
6. When using the CGI interface, pass the `language=<language_id>` query-string
   parameter.

If the environment variable `MILL_DEFAULT_LANGUAGE` is set to `<language_id>`,
`mill.py` uses the language by default.


## Adding support for other LLM engines

Adding support for another LLM engine is similar to adding support for another
language:

1. Create a new Python module named `mill_llm_<llm_id>` where all
   non-alphanumeric characters of `llm_id` are replaced by underscores.
2. Implement a `generate` function similar to the one in
   `mill_llm_llama_cpp.py`.
3. Add a docstring to the module. This docstring serves as the module's README.
4. Put your module anywhere on the Python path of `mill.py`.
5. When using the CLI interface, pass the `-e <llm_id>` argument.
6. When using the CGI interface, pass the `llm_engine=<llm_id>` query-string
   parameter.

If the environment variable `MILL_DEFAULT_LLM` is set to `<llm_id>`, `mill.py`
uses the LLM engine by default.


## Adding example documentation

It's possible to add example documentation for uses of specific combinations of
`language_id` and `llm_id`.

1. Create a new Python module named `mill_example_<language_id>_<llm_id>`.
2. Create a global `example` variable in it and give it a string value. This
   value is printed in the README below the 'Features' list.
3. Create a global `runnable_example` variable in it and give it a string
   value. This value is printed at the end of the README.

The `example` variable is pure documentation. On the other hand, the intent for
the `runnable_example` variable is to have some text that can be executed by
`mill.py`. It should turn the README into an executable document.


## Runnable example

```mill-llm
--model
mixtral-8x7b-instruct-v0.1.Q5_0.gguf
```

```mill-llm
--ctx-size
0
```

```mill-llm
--keep
-1
```

```mill
message template


Me:

> [INST]  [/INST]

Bot:

>
```

```mill
prompt indent
>
```

You can pipe this README to `mill.py`.

The prompt is empty, so `mill.py` will respond by adding the message
template. Also in this case, since we didn't specify the prompt start,
`mill.py` will include all prompt indent blocks.

When the prompt is not empty, the prompt is sent to the LLM and the
generated text is appended. Every newline output by the LLM introduces
a prompt indent. Finally, the message template is added.