# mill.py v1.1

Markdown interface for llama.cpp.
## Requirements

- Python 3.x (tested on 3.11)
- llama.cpp (tested on b1860)

Developed and tested on Linux. I believe it should also work on Windows or Mac.
## Features

- Lets you interact with `llama.cpp` using Markdown
- Enables you to use almost every `llama.cpp` option
- Makes no assumptions about what model you want to use
- Lets you change any option at any point in the document
- Caches prompts automatically
- Streams output
- Runs in a CLI environment as well as a CGI environment
- Reads input document from `stdin`, writes output document to `stdout`
- Lets you add support for any other language or LLM through Python modules
## Example

Contents of `hello.md`:
## Variables
```mill-llm
--model
mixtral-8x7b-instruct-v0.1.Q5_0.gguf
```
```mill-llm
--ctx-size
0
```
```mill-llm
--keep
-1
```
```mill
message template
Me:
> [INST] [/INST]
Bot:
>
```
```mill
prompt indent
>
```
## Chat
```mill
prompt start
```
Me:
> [INST] Hello, how are you? [/INST]
Bot:
>
Command:

```
export MILL_LLAMACPP_MAIN=/path/to/llama.cpp/main
python /path/to/mill_cli.py <hello.md
```
Output:
## Variables
```mill-llm
--model
mixtral-8x7b-instruct-v0.1.Q5_0.gguf
```
```mill-llm
--ctx-size
0
```
```mill-llm
--keep
-1
```
```mill
message template
Me:
> [INST] [/INST]
Bot:
>
```
```mill
prompt indent
>
```
## Chat
```mill
prompt start
```
Me:
> [INST] Hello, how are you? [/INST]
Bot:
> Hello! I'm just a computer program, so I don't have feelings, but I'm here to help you with any questions you have to the best of my ability. Is there something specific you would like to know or talk about?</s>
Me:
> [INST] [/INST]
Bot:
>
## Adding support for other languages

Markdown support is included. To add another language:

- Create a new Python module named `mill_lang_languageid` where all non-alphanumeric characters of `languageid` are replaced by underscores.
- Implement a `parse` function similar to the one in `mill_lang_markdown.py`.
- Put your module anywhere on the Python path of `mill.py`.
- When using the CLI interface, pass the `languageid` to the `-l` argument.
- When using the CGI interface, pass the `languageid` to the `language` query-string parameter.
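As a sketch of what such a module might look like: the authoritative `parse` contract is whatever `mill_lang_markdown.py` implements, so check that file before writing your own module. The stub below is purely hypothetical and only shows the general shape, scanning a document for fenced blocks:

```python
# mill_lang_mylang.py -- hypothetical sketch of a language module.
# The real parse() contract is defined by mill_lang_markdown.py;
# the signature used here is an assumption for illustration.
import re

# Matches a fenced code block: the info string on the fence line,
# then the body up to the closing fence.
_FENCE = re.compile(r"^```([^\n]*)\n(.*?)^```$", re.MULTILINE | re.DOTALL)

def parse(text):
    """Yield (info_string, body) pairs for each fenced block in `text`."""
    for match in _FENCE.finditer(text):
        yield match.group(1).strip(), match.group(2)
```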
## Adding support for other LLMs + using other LLMs

`llama.cpp` support is included. Adding support for other LLMs is similar to adding support for other languages:

- Create a new Python module named `mill_llm_llmid` where all non-alphanumeric characters of the `llmid` part are replaced by underscores.
- Implement a `generate` function similar to the one in `mill_llm_llama_cpp.py`.
- Put your module anywhere on the Python path of `mill.py`.
- When using the CLI interface, pass the `llmid` to the `-e` argument.
- When using the CGI interface, pass the `llmid` to the `llm_engine` query-string parameter.
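For illustration only: the authoritative `generate` contract is whatever `mill_llm_llama_cpp.py` implements, and both the signature and the `variables` argument below are assumptions. A minimal LLM module stub might look like:

```python
# mill_llm_echo.py -- hypothetical sketch of an LLM module.
# The real generate() contract is defined by mill_llm_llama_cpp.py;
# the signature used here is an assumption for illustration.

def generate(prompt, variables):
    """Return generated text for `prompt`.

    `variables` is assumed to hold the final LLM variable values from
    the document, e.g. {"--keep": "-1"}. A real module would run a
    model; this stub just echoes the prompt back.
    """
    return "You said: " + prompt.strip()
```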
## CLI install + usage

- Clone the Git repo or download these files:
  1. `mill_cli.py`
  2. `mill.py`
  3. `mill_lang_markdown.py`
  4. `mill_llm_llama_cpp.py`
  5. `mill_example_markdown_llama_cpp.py`
- Put files 2-5 on the Python path of `mill_cli.py`. Easy solution: put all files in the same folder.
- Set the environment variable `MILL_LLAMACPP_MAIN` to the path of `llama.cpp/main` or your wrapper around it.
- Pipe your Markdown document to `mill_cli.py`.
```
export MILL_LLAMACPP_MAIN=/path/to/llama.cpp/main
python /path/to/mill_cli.py <document.md
```

The result printed on `stdout` is the original document with the generated text from the LLM added.

Like any CLI tool, you can also use it through SSH:

```
cat document.md | ssh <host> \
    "MILL_LLAMACPP_MAIN=/path/to/llama.cpp/main python /path/to/mill_cli.py" \
    2>/dev/null
```
Use the command-line arguments to select a different language or LLM. You can use `-h` for a usage description.
## CGI install + usage

- Clone the Git repo or download these files:
  1. `mill_cgi.py`
  2. `mill.py`
  3. `mill_lang_markdown.py`
  4. `mill_llm_llama_cpp.py`
  5. `mill_example_markdown_llama_cpp.py`
- Put files 2-5 on the Python path of `mill_cgi.py`. Easy solution: put all files in the same folder.
- Set the environment variable `MILL_LLAMACPP_MAIN` to the path of `llama.cpp/main` or your wrapper around it.
- Start your CGI web server.
```
mkdir -pv public_html/cgi-bin
cp -v mill_cgi.py public_html/cgi-bin
cp -v mill.py public_html/cgi-bin
cp -v mill_lang_markdown.py public_html/cgi-bin
cp -v mill_llm_llama_cpp.py public_html/cgi-bin
cp -v mill_example_markdown_llama_cpp.py public_html/cgi-bin
chmod +x public_html/cgi-bin/mill_cgi.py
export MILL_LLAMACPP_MAIN=/path/to/llama.cpp/main
python -m http.server --cgi -d public_html
```
`mill.py` doesn't come with a web interface, but it should work well with generic HTTP tools. Here is an example `curl` invocation:

```
cat document.md | curl -s -N -X POST --data-binary @- \
    --dump-header /dev/null http://host/path/to/cgi-bin/mill_cgi.py
```
On Android, I can recommend HTTP Shortcuts. You can for example use it to send your phone's clipboard directly to the CGI tool and copy the HTTP response automatically back to the clipboard.
Use the `language` and `llm_engine` query-string parameters to select a different language or LLM.
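For example, a sketch of the same `curl` call with both parameters set explicitly. The ids `markdown` and `llama.cpp` are assumptions based on the module-naming convention described above (`mill_lang_markdown.py`, `mill_llm_llama_cpp.py`); check the module names you actually installed:

```
cat document.md | curl -s -N -X POST --data-binary @- \
    "http://host/path/to/cgi-bin/mill_cgi.py?language=markdown&llm_engine=llama.cpp"
```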
## Markdown tutorial

`mill.py` is controlled with variables embedded in the Markdown document. In general, variables take the form:
```variable-type [reset]
name
value
```
Variables are assigned in fenced code blocks. The syntax follows the CommonMark spec pretty closely. The first line inside the block is the name of the variable. The name can contain spaces in principle. The text of the block from the second line on is the value. Nothing prevents you from having a multi-line value. It depends on the variable whether or not this makes sense. The value of a block with only a variable name is the empty string.
Variables are either syntax variables or LLM variables. This distinction is made based on the variable type contained in the info string. Syntax variables have type `mill` and are handled directly by `mill.py`, while LLM variables have other types and are passed on to the LLM module. More on that later.
Syntax variables and LLM variables exist in two different namespaces. The namespace is implied by the variable type. If the `reset` flag is given, then the variable value must be absent. The variable is reset to its default value. If the variable has no default value, then it ceases to exist in the namespace until it is assigned to again.
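For example, this block resets `prompt indent` to its default value; per the rule above, no value line may follow the name:

```mill reset
prompt indent
```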
`mill.py` parses the text in a single pass from top to bottom and then calls the LLM at the end. Some syntax variables affect input parsing. Assignments to a variable overwrite the previous value. For LLM variables, the final value of the variable is the value passed on to the LLM.
The following two subsections explain the variables in more detail. For each variable, the default value is shown as the block's value.

### Syntax variables

The following variables are syntax variables.
```mill
prompt start
```
The `prompt start` variable marks the start of the prompt. `mill.py` excludes everything before the last occurrence of this variable from the prompt. However, if this variable does not exist, then `mill.py` considers the potential prompt to start at the beginning of the document. In other words, omitting it is the same as putting it at the very start of the document.

When the prompt becomes too big for the LLM's context, you can either move the `prompt start` variable down or create another one. The value of this variable doesn't matter; only its position in the document counts.
```mill
prompt indent
>
```
The value of the `prompt indent` variable must be (at most) one line. It's a line prefix. Only blocks whose lines start with this prefix are considered to be part of the prompt. These blocks are called prompt indent blocks throughout the tutorial. The `prompt indent` variable affects input parsing. For each line of input, the most recent value of this variable is used to identify prompt indent blocks.

Technically, you can set `prompt indent` to the empty string. No variables are parsed in a prompt indent block. So, in this situation, if the prompt starts before the assignment, then all the text below the assignment is considered to be part of the prompt.
```mill
message template
```
The `message template` variable contains the template for each message. When `mill.py` responds to a prompt, the value of this variable is added at the end of the output of the LLM.

Note that `mill.py` in general does not add extra newlines to the output of the LLM. You can add blank lines at the start of the message template instead. This is by design: some models are sensitive to newlines, so the user should be able to control them.
### LLM variables

There are three different variable types for LLM variables:

- `mill-llm`
- `mill-llm-file`
- `mill-llm-b64-gz-file`

The first type simply assigns the value to the name.
For some LLMs (like `llama.cpp`), it's useful to pass arguments via a file, which can be done using the second and third variable types. For example, you can pass a grammar via either `--grammar` or `--grammar-file`. However, grammars can contain tokens that `mill.py` does not know how to shell-escape. In that case, you have to use `--grammar-file`. The next paragraph explains how to use it.
To pass an argument via a file, use `mill-llm-file` or `mill-llm-b64-gz-file`. The former is for text data, the latter for binary data. The value is stored in a temporary file. The name of the temporary file subsequently becomes the new value of the variable. Binary data must be a base64 representation of a gzipped file. The file is uncompressed by `mill.py` before passing it to the LLM. The base64 data can be split across multiple lines; the newlines are removed in that case.
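To produce a value for a `mill-llm-b64-gz-file` block from binary data, a small helper like this works (plain Python, independent of `mill.py` itself):

```python
import base64
import gzip
import textwrap

def to_b64_gz(data: bytes, width: int = 76) -> str:
    """Gzip `data` and return base64 text, wrapped across lines.

    mill.py accepts base64 split over multiple lines, so wrapping
    keeps the document readable.
    """
    encoded = base64.b64encode(gzip.compress(data)).decode("ascii")
    return "\n".join(textwrap.wrap(encoded, width))

def from_b64_gz(text: str) -> bytes:
    """Inverse: join the lines, base64-decode, and gunzip."""
    return gzip.decompress(base64.b64decode("".join(text.split())))
```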
### Prompt construction
The algorithm to construct the entire prompt is simple and can be stated in one line: concatenate the text of all the prompt indent blocks below the last prompt start.
The text of a prompt indent block does not include the prompt indent for each line. Everything else is included, even newlines, with one exception: the newline that ends the block is excluded.
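The rule above can be sketched in Python. This is illustrative only: it assumes the line list, the indent string, and the index of the line after the last `prompt start` are already known, and it skips details of block parsing:

```python
# Illustrative sketch of the prompt-construction rule described above.

def build_prompt(lines, indent, start_index):
    """Concatenate the prompt indent blocks below the last prompt start."""
    blocks, current = [], []
    for line in lines[start_index:]:
        if line.startswith(indent):
            # The indent prefix itself is not part of the prompt.
            current.append(line[len(indent):])
        elif current:
            # A non-indented line ends the current block; the newline
            # that ends the block is excluded.
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return "".join(blocks)
```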
## `llama.cpp` tutorial

### LLM variables

#### `suppress eos`
Some models perform better if the EOS is part of the prompt. `llama.cpp` models have a setting `add_eos_token` that seems to mean 'please add the EOS to the generated text.' `mill.py` respects this setting and, in that case, adds the EOS if the model generates it, unless you declare the LLM variable `suppress eos` in the document. In that case `mill.py` will not add the EOS token even if the model generates it.
Other LLM variables are simply passed on to `llama.cpp` as command-line arguments. A variable with an empty value is passed without a value (i.e. as a flag). There are a couple of argument variables that are reserved for `mill.py`, so you cannot use them. These are:

- `--file` for the input prompt
- `--prompt-cache` for the prompt cache to use
- `--prompt-cache-all` so that generated text is also cached

Using these variables results in an error.
### Environment variables

Apart from the LLM variables that come from the language module, there are some environment variables that influence the behavior of `mill.py`.
#### MILL_LLAMACPP_MAIN

This variable is required and must be set to the path of `llama.cpp/main`. It can be your own script too, as long as:

- The script can accept arguments that are passed from `mill.py`.
- The standard output consists of the input prompt followed by the generated text.
- The error output contains the error output generated by `llama.cpp`. This is used by `mill.py` to extract some settings from the model's metadata, such as BOS and EOS tokens and whether or not to add them in the right places.
#### MILL_LLAMACPP_CACHE_DIR

Path to the directory where the prompt caches are stored. By default this is the OS's temporary-files directory. Note that prompt caches can be large files and `mill.py` does not automatically clean them up. You can recognize the files by the extension `.promptcache`.
#### MILL_LLAMACPP_TIMEOUT

The maximum number of seconds to wait for the `llama.cpp/main` process to complete. Default is 600.
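Putting the three together, a shell profile entry might look like this (the paths and values shown are illustrative, not defaults):

```shell
export MILL_LLAMACPP_MAIN=/path/to/llama.cpp/main
export MILL_LLAMACPP_CACHE_DIR=/var/tmp/mill-caches
export MILL_LLAMACPP_TIMEOUT=1200
```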
### Prompt caching

For each invocation, a prompt cache is generated. `mill.py` searches for a matching prompt cache after parsing.
## Runnable example

```mill-llm
--model
mixtral-8x7b-instruct-v0.1.Q5_0.gguf
```

```mill-llm
--ctx-size
0
```

```mill-llm
--keep
-1
```

```mill
message template
Me:
> [INST] [/INST]
Bot:
>
```

```mill
prompt indent
>
```
You can pipe this README to `mill.py`.

The prompt is empty, so `mill.py` will respond by adding the message template. Also in this case, since we didn't specify the prompt start, `mill.py` will include all prompt indent blocks.
When the prompt is not empty, the prompt is sent to the LLM and the generated text is appended. Every newline output by the LLM introduces a prompt indent. Then the message template is added.