mill.py/mill_lang_markdown.py

# mill.py, Markdown interface for llama.cpp
# Copyright (C) 2024 unworriedsafari <unworriedsafari@tilde.club>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program.  If not, see <https://www.gnu.org/licenses/>

"""
## Markdown tutorial

This section describes the Markdown language module of `mill.py`.

`mill.py` is controlled with variables embedded in the Markdown document.

In general, variables take the form

    ```variable-type [reset]
    name
    [value]
    ```

Variables are assigned in fenced code blocks. The syntax follows the
CommonMark spec as closely as possible. The first line inside the block is the
name of the variable; the name may contain spaces. The text of the block from
the second line onward is the value. A value may span multiple lines, though
whether that makes sense depends on the variable. A block that contains only a
variable name has the empty string as its value.
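
As a rough illustration of the block layout described above, here is a minimal
sketch. `parse_block` and `FENCE` are hypothetical names, not part of
`mill.py`, and the sketch assumes an unindented block that is fenced with
exactly three backticks and has a non-empty info string.

```python
import os

# Hypothetical helper, not part of mill.py.
FENCE = '`' * 3


def parse_block(lines):
    # lines: the lines of one block, fences included, without line endings.
    info = lines[0][len(FENCE):].split()
    variable_type = info[0]
    reset = 'reset' in info[1:]
    name = lines[1].strip()

    # Everything between the name and the closing fence is the value.
    value_lines = []
    for line in lines[2:]:
        if line.startswith(FENCE):
            break
        value_lines.append(line)

    return variable_type, reset, name, os.linesep.join(value_lines)


# A block that contains only a name yields the empty string as its value.
print(parse_block([FENCE + 'mill', 'prompt start', FENCE]))
```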

Variables are either syntax variables or LLM variables. The distinction is made
based on the variable type contained in the info string.  Syntax variables have
type `mill` and are handled directly by `mill.py` while LLM variables have
other types and are passed on to the LLM-engine module.

Syntax variables and LLM variables exist in two different namespaces. The
namespace is implied by the variable type. If the `reset` flag is given, then
the variable value must be absent. The variable is reset to its default value.
If the variable has no default value, then it ceases to exist in the namespace
until it is assigned to again.
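
The reset behavior can be pictured with a plain dictionary per namespace. This
is only a sketch: `DEFAULTS`, `assign`, `reset` and `lookup` are hypothetical
names, and the only default assumed here is the `prompt indent` value shown
later in this tutorial.

```python
# Hypothetical defaults table, for illustration only.
DEFAULTS = {'prompt indent': '   >'}


def assign(namespace, name, value):
    # An assignment overwrites any existing value.
    namespace[name] = value


def reset(namespace, name):
    # Dropping the entry makes later lookups fall back to the default, if any.
    namespace.pop(name, None)


def lookup(namespace, name):
    # Returns None when the variable has neither a value nor a default.
    return namespace.get(name, DEFAULTS.get(name))


syntax_vars = {}
assign(syntax_vars, 'prompt indent', '>>')
reset(syntax_vars, 'prompt indent')
print(repr(lookup(syntax_vars, 'prompt indent')))
```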

`mill.py` parses the text in a single pass from top to bottom and then calls
the LLM at the end. Some syntax variables affect input parsing. An assignment
to a variable overwrites any existing value. For LLM variables, the final value
of a variable is the value passed on to the LLM.

The following two subsections explain variables in more detail. Each variable
is shown in a block whose value is the variable's default value.


### Syntax variables

The following variables are syntax variables.


    ```mill
    prompt start
    ```

The `prompt start` variable marks the start of the prompt.  `mill.py` excludes
_everything_ before the last occurrence of this variable from the prompt.
_However_, if this variable does not exist, then `mill.py` considers that the
potential prompt starts at the beginning of the document.  In other words,
omitting it is the same as putting it at the very start of the document.

When the prompt size exceeds the LLM's context limit, you can either move the
`prompt start` variable down or create another one.  The value of this variable
doesn't matter.  It's only its position in the document that counts.


    ```mill
    prompt indent
       >
    ```

The value of the `prompt indent` variable must be (at most) one line. It's a
line prefix.  Only blocks for which the lines start with this prefix are
considered to be part of the prompt. These blocks are called _prompt indent
blocks_ throughout the tutorial. The `prompt indent` variable affects input
parsing. For each line of input, the most recent value of this variable is used
to identify prompt indent blocks.

Technically, you can set `prompt indent` to the empty string. Every line then
starts with the prefix, and _no variables are parsed in a prompt indent block._
So, if the prompt starts before the assignment, all the text below the
assignment is considered to be part of the prompt, and any variable assignments
below it are ignored.
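
The prefix test itself is small. The sketch below uses a hypothetical helper,
`strip_prompt_indent`, that is not part of `mill.py`. It shows how a line is
classified and why an empty prefix matches every line.

```python
def strip_prompt_indent(line, prompt_indent='   >'):
    # Returns the prompt text of the line, or None if the line does not
    # belong to a prompt indent block.
    if line.startswith(prompt_indent):
        return line[len(prompt_indent):]
    return None


print(repr(strip_prompt_indent('   > Hello')))     # part of the prompt
print(repr(strip_prompt_indent('Just a note')))    # not part of the prompt
print(repr(strip_prompt_indent('Just a note', '')))  # empty prefix matches
```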


    ```mill
    message template
    ```

The `message template` variable contains the template for each message. When
`mill.py` responds to a prompt, the value of this variable is added at the end
of the output of the LLM.

Note that `mill.py` does not add extra newlines to the output of the LLM in
general. You can add blank lines at the start of the message template instead.
This is by design. Some models are sensitive to newlines, so the user should be
able to control newlines.


### LLM variables

There are three different variable types for LLM variables:

1. `mill-llm`
2. `mill-llm-file`
3. `mill-llm-b64-gz-file`

The first type simply assigns the value to the name.

For some LLM engines (like `llama.cpp`), it's useful to pass arguments via a
file. This can be done using the second and third variable types. For example,
you can pass a grammar via either `--grammar` or `--grammar-file`.  However,
grammars can contain tokens that `mill.py` does not know how to shell-escape.
In that case, you have to use `--grammar-file`. The next paragraph explains how
to use it.

To pass an argument via a file, use `mill-llm-file` or `mill-llm-b64-gz-file`.
The former is for text data, the latter for binary data. The value is stored in
a temporary file, and the name of that file becomes the new value of the
variable. Binary data must be the base64 representation of a gzipped file;
`mill.py` decompresses it before passing the file to the LLM. The base64 data
can be split across multiple lines, in which case the newlines are removed.
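
To produce a value for a `mill-llm-b64-gz-file` block, and to see roughly what
happens to it afterwards, consider the following sketch. The helper names
`encode_for_document` and `decode_to_temp_file` are hypothetical, and the line
width of 76 is an arbitrary choice; only the gzip-then-base64 format comes from
the description above.

```python
import base64
import gzip
import os
import tempfile


def encode_for_document(data, width=76):
    # gzip first, then base64, then split across lines for readability.
    encoded = base64.standard_b64encode(gzip.compress(data)).decode('ascii')
    return [encoded[i:i + width] for i in range(0, len(encoded), width)]


def decode_to_temp_file(value_lines):
    # Join the lines (dropping the newlines between them) before decoding.
    raw = gzip.decompress(base64.standard_b64decode(''.join(value_lines)))
    with tempfile.NamedTemporaryFile(delete=False) as fp:
        fp.write(raw)
    # The path of the temporary file stands in for the variable's new value.
    return fp.name


lines = encode_for_document(b'some binary payload', width=16)
path = decode_to_temp_file(lines)
with open(path, 'rb') as fp:
    print(fp.read())
os.remove(path)
```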


### Prompt construction

The algorithm to construct the entire prompt is simple and can be stated in one
line: _concatenate the text of all the prompt indent blocks below the last
prompt start._

The text of a prompt indent block does not include the prompt indent for each
line. Everything else is included, even newlines, with one exception: the
newline that ends the block is excluded.
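
The rule above can be sketched in a few lines. This is a simplified
illustration, not the actual parser: `build_prompt` is a hypothetical name, a
`prompt start` block is represented by the hypothetical `PROMPT_START` marker,
input lines are given without line endings, and the prompt indent is assumed to
stay fixed.

```python
import os

# Hypothetical stand-in for a prompt start block.
PROMPT_START = object()


def build_prompt(items, prompt_indent='   >'):
    blocks = []     # one list of stripped lines per prompt indent block
    current = None  # the block being collected, if any
    for item in items:
        if item is PROMPT_START:
            # Everything before the last prompt start is excluded.
            blocks, current = [], None
        elif item.startswith(prompt_indent):
            if current is None:
                current = []
                blocks.append(current)
            current.append(item[len(prompt_indent):])
        else:
            # A line without the prefix ends the current block.
            current = None

    # Newlines inside a block are kept; the one ending a block is dropped.
    return ''.join(os.linesep.join(block) for block in blocks)


document = [
    '   >An old question',
    PROMPT_START,
    '   >Hello',
    '   >world',
    'A note outside the prompt.',
    '   >!',
]
print(repr(build_prompt(document)))
```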
"""

import base64, contextlib, gzip, os, re, sys, tempfile


def parse(input_lines):
    return Language(input_lines)


class Language(contextlib.AbstractContextManager):
    _default_prompt_indent    = '   >'


    def __init__(self, input_lines):
        self._input_lines = input_lines


    def __enter__(self):
        self.prompt = ''
        self.llm_vars = {}
        self.returncode = 0
        self._syntax_vars = {}
        self._temp_files = []

        complete_prompt = ''

        # The stripping and adding of newlines is a bit complicated. This is
        # the result of some trial and error with/without prompt indents.
        last_line_in_prompt = False
        var_update_lines = 0
        for idx, line in enumerate(self._input_lines):
            prompt_indent = self._syntax_vars.get(
                'prompt indent',
                self._default_prompt_indent)

            if last_line_in_prompt and prompt_indent:
                print()

            # Still inside last variable update
            if var_update_lines:
                print(line, end='')
                var_update_lines -= 1
                continue

            current_line_in_prompt = line.startswith(prompt_indent)

            if not current_line_in_prompt:
                namespace, updated_variable, var_update_lines = \
                    self._var_parse(idx)

                if namespace is self._syntax_vars:
                    if updated_variable == 'prompt start':
                        if 'prompt start' in self._syntax_vars:
                            self.prompt = ''
                            print(f'[DEBUG] Prompt start: {idx}', file=sys.stderr)
                        else:
                            self.prompt = complete_prompt

                    elif updated_variable == 'prompt indent':
                        if len(namespace.get(updated_variable, '').split(os.linesep)) > 1:
                            raise SyntaxError(f'line {idx+4}: value for prompt indent must be at most one line')

                if var_update_lines:
                    var_update_lines -= 1

                print(line, end='')

                last_line_in_prompt = False

            else:
                new_part = ''

                if last_line_in_prompt and prompt_indent:
                    new_part += os.linesep

                if prompt_indent and line.endswith(os.linesep):
                    print(line[:-len(os.linesep)], end='')
                    new_part += line[len(prompt_indent):-len(os.linesep)]
                else:
                    print(line, end='')
                    new_part += line[len(prompt_indent):]

                self.prompt += new_part
                complete_prompt += new_part

                last_line_in_prompt = True

            sys.stdout.flush()

        return self


    def __exit__(self, exc_type, exc_value, traceback):
        for f in self._temp_files:
            os.remove(f)
        self._temp_files = []
        return None


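    def _var_parse(self, start_idx):
        # Try to parse and apply a variable block whose opening fence is at
        # start_idx. Returns (namespace, variable name, number of lines in the
        # block), or ({}, '', 0) if the line does not open a variable block.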
        input_lines = self._input_lines[start_idx:]
        if not input_lines:
            return {}, '', 0

        # Do we have an opening code fence?
        opening_fence = input_lines[0].lstrip(' ')
        indent_len = len(input_lines[0]) - len(opening_fence)
        if indent_len > 3:
            return {}, '', 0

        # Determine fence string
        fence_string = opening_fence[:3]
        if fence_string not in ['```', '~~~']:
            return {}, '', 0

        while len(fence_string) < len(opening_fence) and \
            opening_fence[len(fence_string)] == fence_string[0]:
            fence_string += fence_string[0]

        # Determine variable type
        info_string = opening_fence[len(fence_string):].strip()
        variable_types = ['mill-llm-file',
                          'mill-llm-b64-gz-file',
                          'mill-llm',
                          'mill']
        variable_type = [t for t in variable_types if \
                         info_string.split(' ')[0] == t]

        if not variable_type:
            return {}, '', 0
        else:
            variable_type = variable_type[0]

        namespace = self._syntax_vars if variable_type == 'mill' \
                                      else self.llm_vars

        # Determine variable name
        variable_name = input_lines[1].strip() if len(input_lines) >= 2 else ''
        if not variable_name:
            raise SyntaxError(f'line {start_idx+2}: expected variable name')

        # Gather variable value
        variable_value = ''
        num_block_lines = 2

        for idx, line in enumerate(input_lines[2:], start=3):
            # Strip indentation from line
            for i in range(0,indent_len):
                if line.startswith(' '):
                    line = line[1:]

            if line.startswith(fence_string):
                num_block_lines = idx
                break

            variable_value += line.rstrip() \
                if variable_type == 'mill-llm-b64-gz-file' \
                else line

        if variable_value.endswith(os.linesep):
            variable_value = variable_value[:-len(os.linesep)]

        if 'reset' in info_string.split(' '):
            if variable_value:
                raise SyntaxError(f'line {start_idx+3}: value specified in reset (not allowed)')
            if variable_name in namespace:
                del namespace[variable_name]
        else:
            # Handle file variables
            if variable_type == 'mill-llm-b64-gz-file':
                variable_value = gzip.decompress(
                    base64.standard_b64decode(variable_value))
            elif variable_type == 'mill-llm-file':
                variable_value = variable_value.encode('utf-8')

            if variable_type in ['mill-llm-file',
                                 'mill-llm-b64-gz-file']:
                with tempfile.NamedTemporaryFile(delete=False) as fp:
                    fp.write(variable_value)
                    fp.flush()
                    variable_value = fp.name
                self._temp_files += [variable_value]

            namespace[variable_name] = variable_value

            print(f'[DEBUG] {variable_name}: {variable_value}', file=sys.stderr)
            print(f'[DEBUG] temp files: {self._temp_files}', file=sys.stderr)

        return namespace, variable_name, num_block_lines


    def print_message_template(self):
        message_template = self._syntax_vars.get('message template', '   >')
        lines = message_template.split(os.linesep)
        for idx, line in enumerate(lines):
            if idx != len(lines)-1:
                line += os.linesep
            print(line, end='')

        sys.stdout.flush()


    def print_generated_text(self, generated_text):
        if not generated_text:
            return

        # True but irrelevant
        # self.prompt += generated_text

        prompt_indent = self._syntax_vars.get('prompt indent',
                                              self._default_prompt_indent)
        lines = generated_text.split(os.linesep)
        for idx, line in enumerate(lines):
            if idx != 0:
                print(os.linesep + prompt_indent, end='')
            print(line, end='')

        sys.stdout.flush()