diff --git a/lib-htmgem.php b/lib-htmgem.php index 8b1305f..d0a0cf6 100644 --- a/lib-htmgem.php +++ b/lib-htmgem.php @@ -254,6 +254,9 @@ class GemtextTranslate_html { # Adds no-break space to stick the (EM/EN dashes) to words : aaaaaa – bb. ==> aaaaaa –$bb. $text = mb_ereg_replace("([—–]) ([^.]+)\.", "\\1".self::NARROW_NO_BREAK_SPACE."\\2.", $text); + + # Replaces several spaces (0x20) by only one + $text = preg_replace("/ +/", " ", $text); } } diff --git a/tests/files_with_html/text.gmi.html b/tests/files_with_html/text.gmi.html index 0e4e6c0..423d5b0 100644 --- a/tests/files_with_html/text.gmi.html +++ b/tests/files_with_html/text.gmi.html @@ -9,7 +9,7 @@

 

v0.14.3, November 29th 2020

 

-

This is an increasingly less rough sketch of an actual spec for Project Gemini. Although not finalised yet, further changes to the specification are likely to be relatively small. You can write code to this pseudo-specification and be confident that it probably won't become totally non-functional due to massive changes next week, but you are still urged to keep an eye on ongoing development of the protocol and make changes as required.

+

This is an increasingly less rough sketch of an actual spec for Project Gemini. Although not finalised yet, further changes to the specification are likely to be relatively small. You can write code to this pseudo-specification and be confident that it probably won't become totally non-functional due to massive changes next week, but you are still urged to keep an eye on ongoing development of the protocol and make changes as required.

 

This is provided mostly so that people can quickly get up to speed on what I'm thinking without having to read lots and lots of old phlog posts and keep notes.

 

@@ -17,26 +17,26 @@

 

1 Overview

 

-

Gemini is a client-server protocol featuring request-response transactions, broadly similar to gopher or HTTP. Connections are closed at the end of a single transaction and cannot be reused. When Gemini is served over TCP/IP, servers should listen on port 1965 (the first manned Gemini mission, Gemini 3, flew in March '65). This is an unprivileged port, so it's very easy to run a server as a "nobody" user, even if e.g. the server is written in Go and so can't drop privileges in the traditional fashion.

+

Gemini is a client-server protocol featuring request-response transactions, broadly similar to gopher or HTTP. Connections are closed at the end of a single transaction and cannot be reused. When Gemini is served over TCP/IP, servers should listen on port 1965 (the first manned Gemini mission, Gemini 3, flew in March '65). This is an unprivileged port, so it's very easy to run a server as a "nobody" user, even if e.g. the server is written in Go and so can't drop privileges in the traditional fashion.

 

1.1 Gemini transactions

 

-

There is one kind of Gemini transaction, roughly equivalent to a gopher request or a HTTP "GET" request. Transactions happen as follows:

+

There is one kind of Gemini transaction, roughly equivalent to a gopher request or a HTTP "GET" request. Transactions happen as follows:

 

-

C: Opens connection

-

S: Accepts connection

+

C: Opens connection

+

S: Accepts connection

C/S: Complete TLS handshake (see section 4)

-

C: Validates server certificate (see 4.2)

-

C: Sends request (one CRLF terminated line) (see section 2)

-

S: Sends response header (one CRLF terminated line), closes connection

-

under non-success conditions (see 3.1 and 3.2)

-

S: Sends response body (text or binary data) (see 3.3)

-

S: Closes connection

-

C: Handles response (see 3.4)

+

C: Validates server certificate (see 4.2)

+

C: Sends request (one CRLF terminated line) (see section 2)

+

S: Sends response header (one CRLF terminated line), closes connection

+

under non-success conditions (see 3.1 and 3.2)

+

S: Sends response body (text or binary data) (see 3.3)

+

S: Closes connection

+

C: Handles response (see 3.4)

 

1.2 Gemini URI scheme

 

-

Resources hosted via Gemini are identified using URIs with the scheme "gemini". This scheme is syntactically compatible with the generic URI syntax defined in RFC 3986, but does not support all components of the generic syntax. In particular, the authority component is allowed and required, but its userinfo subcomponent is NOT allowed. The host subcomponent is required. The port subcomponent is optional, with a default value of 1965. The path, query and fragment components are allowed and have no special meanings beyond those defined by the generic syntax. Spaces in gemini URIs should be encoded as %20, not +.

+

Resources hosted via Gemini are identified using URIs with the scheme "gemini". This scheme is syntactically compatible with the generic URI syntax defined in RFC 3986, but does not support all components of the generic syntax. In particular, the authority component is allowed and required, but its userinfo subcomponent is NOT allowed. The host subcomponent is required. The port subcomponent is optional, with a default value of 1965. The path, query and fragment components are allowed and have no special meanings beyond those defined by the generic syntax. Spaces in gemini URIs should be encoded as %20, not +.

 

2 Gemini requests

 

@@ -46,7 +46,7 @@

 

<URL> is a UTF-8 encoded absolute URL, including a scheme, of maximum length 1024 bytes.

 

-

Sending an absolute URL instead of only a path or selector is effectively equivalent to building in a HTTP "Host" header. It permits virtual hosting of multiple Gemini domains on the same IP address. It also allows servers to optionally act as proxies. Including schemes other than "gemini" in requests allows servers to optionally act as protocol-translating gateways to e.g. fetch gopher resources over Gemini. Proxying is optional and the vast majority of servers are expected to only respond to requests for resources at their own domain(s).

+

Sending an absolute URL instead of only a path or selector is effectively equivalent to building in a HTTP "Host" header. It permits virtual hosting of multiple Gemini domains on the same IP address. It also allows servers to optionally act as proxies. Including schemes other than "gemini" in requests allows servers to optionally act as protocol-translating gateways to e.g. fetch gopher resources over Gemini. Proxying is optional and the vast majority of servers are expected to only respond to requests for resources at their own domain(s).

 

3 Gemini responses

 

@@ -72,7 +72,7 @@

 

3.2 Status codes

 

-

Gemini uses two-digit numeric status codes. Related status codes share the same first digit. Importantly, the first digit of Gemini status codes do not group codes into vague categories like "client error" and "server error" as per HTTP. Instead, the first digit alone provides enough information for a client to determine how to handle the response. By design, it is possible to write a simple but feature complete client which only looks at the first digit. The second digit provides more fine-grained information, for unambiguous server logging, to allow writing comfier interactive clients which provide a slightly more streamlined user interface, and to allow writing more robust and intelligent automated clients like content aggregators, search engine crawlers, etc.

+

Gemini uses two-digit numeric status codes. Related status codes share the same first digit. Importantly, the first digit of Gemini status codes do not group codes into vague categories like "client error" and "server error" as per HTTP. Instead, the first digit alone provides enough information for a client to determine how to handle the response. By design, it is possible to write a simple but feature complete client which only looks at the first digit. The second digit provides more fine-grained information, for unambiguous server logging, to allow writing comfier interactive clients which provide a slightly more streamlined user interface, and to allow writing more robust and intelligent automated clients like content aggregators, search engine crawlers, etc.

 

The first digit of a response code unambiguously places the response into one of six categories, which define the semantics of the <META> line.

 

@@ -80,63 +80,63 @@

 

Status codes beginning with 1 are INPUT status codes, meaning:

 

-

The requested resource accepts a line of textual user input. The <META> line is a prompt which should be displayed to the user. The same resource should then be requested again with the user's input included as a query component. Queries are included in requests as per the usual generic URL definition in RFC3986, i.e. separated from the path by a ?. Reserved characters used in the user's input must be "percent-encoded" as per RFC3986, and space characters should also be percent-encoded.

+

The requested resource accepts a line of textual user input. The <META> line is a prompt which should be displayed to the user. The same resource should then be requested again with the user's input included as a query component. Queries are included in requests as per the usual generic URL definition in RFC3986, i.e. separated from the path by a ?. Reserved characters used in the user's input must be "percent-encoded" as per RFC3986, and space characters should also be percent-encoded.

 

3.2.2 2x (SUCCESS)

 

Status codes beginning with 2 are SUCCESS status codes, meaning:

 

-

The request was handled successfully and a response body will follow the response header. The <META> line is a MIME media type which applies to the response body.

+

The request was handled successfully and a response body will follow the response header. The <META> line is a MIME media type which applies to the response body.

 

3.2.3 3x (REDIRECT)

 

Status codes beginning with 3 are REDIRECT status codes, meaning:

 

-

The server is redirecting the client to a new location for the requested resource. There is no response body. <META> is a new URL for the requested resource. The URL may be absolute or relative. The redirect should be considered temporary, i.e. clients should continue to request the resource at the original address and should not performance convenience actions like automatically updating bookmarks. There is no response body.

+

The server is redirecting the client to a new location for the requested resource. There is no response body. <META> is a new URL for the requested resource. The URL may be absolute or relative. The redirect should be considered temporary, i.e. clients should continue to request the resource at the original address and should not performance convenience actions like automatically updating bookmarks. There is no response body.

 

3.2.4 4x (TEMPORARY FAILURE)

 

Status codes beginning with 4 are TEMPORARY FAILURE status codes, meaning:

 

-

The request has failed. There is no response body. The nature of the failure is temporary, i.e. an identical request MAY succeed in the future. The contents of <META> may provide additional information on the failure, and should be displayed to human users.

+

The request has failed. There is no response body. The nature of the failure is temporary, i.e. an identical request MAY succeed in the future. The contents of <META> may provide additional information on the failure, and should be displayed to human users.

 

3.2.5 5x (PERMANENT FAILURE)

 

Status codes beginning with 5 are PERMANENT FAILURE status codes, meaning:

 

-

The request has failed. There is no response body. The nature of the failure is permanent, i.e. identical future requests will reliably fail for the same reason. The contents of <META> may provide additional information on the failure, and should be displayed to human users. Automatic clients such as aggregators or indexing crawlers should not repeat this request.

+

The request has failed. There is no response body. The nature of the failure is permanent, i.e. identical future requests will reliably fail for the same reason. The contents of <META> may provide additional information on the failure, and should be displayed to human users. Automatic clients such as aggregators or indexing crawlers should not repeat this request.

 

3.2.6 6x (CLIENT CERTIFICATE REQUIRED)

 

Status codes beginning with 6 are CLIENT CERTIFICATE REQUIRED status codes, meaning:

 

-

The requested resource requires a client certificate to access. If the request was made without a certificate, it should be repeated with one. If the request was made with a certificate, the server did not accept it and the request should be repeated with a different certificate. The contents of <META> (and/or the specific 6x code) may provide additional information on certificate requirements or the reason a certificate was rejected.

+

The requested resource requires a client certificate to access. If the request was made without a certificate, it should be repeated with one. If the request was made with a certificate, the server did not accept it and the request should be repeated with a different certificate. The contents of <META> (and/or the specific 6x code) may provide additional information on certificate requirements or the reason a certificate was rejected.

 

3.2.7 Notes

 

-

Note that for basic interactive clients for human use, errors 4 and 5 may be effectively handled identically, by simply displaying the contents of <META> under a heading of "ERROR". The temporary/permanent error distinction is primarily relevant to well-behaving automated clients. Basic clients may also choose not to support client-certificate authentication, in which case only four distinct status handling routines are required (for statuses beginning with 1, 2, 3 or a combined 4-or-5).

+

Note that for basic interactive clients for human use, errors 4 and 5 may be effectively handled identically, by simply displaying the contents of <META> under a heading of "ERROR". The temporary/permanent error distinction is primarily relevant to well-behaving automated clients. Basic clients may also choose not to support client-certificate authentication, in which case only four distinct status handling routines are required (for statuses beginning with 1, 2, 3 or a combined 4-or-5).

 

-

The full two-digit system is detailed in Appendix 1. Note that for each of the six valid first digits, a code with a second digit of zero corresponds is a generic status of that kind with no special semantics. This means that basic servers without any advanced functionality need only be able to return codes of 10, 20, 30, 40 or 50.

+

The full two-digit system is detailed in Appendix 1. Note that for each of the six valid first digits, a code with a second digit of zero corresponds is a generic status of that kind with no special semantics. This means that basic servers without any advanced functionality need only be able to return codes of 10, 20, 30, 40 or 50.

 

The Gemini status code system has been carefully designed so that the increased power (and correspondingly increased complexity) of the second digits is entirely "opt-in" on the part of both servers and clients.

 

3.3 Response bodies

 

-

Response bodies are just raw content, text or binary, ala gopher. There is no support for compression, chunking or any other kind of content or transfer encoding. The server closes the connection after the final byte, there is no "end of response" signal like gopher's lonely dot.

+

Response bodies are just raw content, text or binary, ala gopher. There is no support for compression, chunking or any other kind of content or transfer encoding. The server closes the connection after the final byte, there is no "end of response" signal like gopher's lonely dot.

 

-

Response bodies only accompany responses whose header indicates a SUCCESS status (i.e. a status code whose first digit is 2). For such responses, <META> is a MIME media type as defined in RFC 2046.

+

Response bodies only accompany responses whose header indicates a SUCCESS status (i.e. a status code whose first digit is 2). For such responses, <META> is a MIME media type as defined in RFC 2046.

 

-

Internet media types are registered with a canonical form. Content transferred via Gemini MUST be represented in the appropriate canonical form prior to its transmission except for "text" types, as defined in the next paragraph.

+

Internet media types are registered with a canonical form. Content transferred via Gemini MUST be represented in the appropriate canonical form prior to its transmission except for "text" types, as defined in the next paragraph.

 

-

When in canonical form, media subtypes of the "text" type use CRLF as the text line break. Gemini relaxes this requirement and allows the transport of text media with plain LF alone (but NOT a plain CR alone) representing a line break when it is done consistently for an entire response body. Gemini clients MUST accept CRLF and bare LF as being representative of a line break in text media received via Gemini.

+

When in canonical form, media subtypes of the "text" type use CRLF as the text line break. Gemini relaxes this requirement and allows the transport of text media with plain LF alone (but NOT a plain CR alone) representing a line break when it is done consistently for an entire response body. Gemini clients MUST accept CRLF and bare LF as being representative of a line break in text media received via Gemini.

 

-

If a MIME type begins with "text/" and no charset is explicitly given, the charset should be assumed to be UTF-8. Compliant clients MUST support UTF-8-encoded text/* responses. Clients MAY optionally support other encodings. Clients receiving a response in a charset they cannot decode SHOULD gracefully inform the user what happened instead of displaying garbage.

+

If a MIME type begins with "text/" and no charset is explicitly given, the charset should be assumed to be UTF-8. Compliant clients MUST support UTF-8-encoded text/* responses. Clients MAY optionally support other encodings. Clients receiving a response in a charset they cannot decode SHOULD gracefully inform the user what happened instead of displaying garbage.

 

-

If <META> is an empty string, the MIME type MUST default to "text/gemini; charset=utf-8". The text/gemini media type is defined in section 5.

+

If <META> is an empty string, the MIME type MUST default to "text/gemini; charset=utf-8". The text/gemini media type is defined in section 5.

 

3.4 Response body handling

 

-

Response handling by clients should be informed by the provided MIME type information. Gemini defines one MIME type of its own (text/gemini) whose handling is discussed below in section 5. In all other cases, clients should do "something sensible" based on the MIME type. Minimalistic clients might adopt a strategy of printing all other text/* responses to the screen without formatting and saving all non-text responses to the disk. Clients for unix systems may consult /etc/mailcap to find installed programs for handling non-text types.

+

Response handling by clients should be informed by the provided MIME type information. Gemini defines one MIME type of its own (text/gemini) whose handling is discussed below in section 5. In all other cases, clients should do "something sensible" based on the MIME type. Minimalistic clients might adopt a strategy of printing all other text/* responses to the screen without formatting and saving all non-text responses to the disk. Clients for unix systems may consult /etc/mailcap to find installed programs for handling non-text types.

 

4 TLS

 

@@ -146,27 +146,27 @@

 

4.1 Version requirements

 

-

Servers MUST use TLS version 1.2 or higher and SHOULD use TLS version 1.3 or higher. TLS 1.2 is reluctantly permitted for now to avoid drastically reducing the range of available implementation libraries. Hopefully TLS 1.3 or higher can be specced in the near future. Clients who wish to be "ahead of the curve MAY refuse to connect to servers using TLS version 1.2 or lower.

+

Servers MUST use TLS version 1.2 or higher and SHOULD use TLS version 1.3 or higher. TLS 1.2 is reluctantly permitted for now to avoid drastically reducing the range of available implementation libraries. Hopefully TLS 1.3 or higher can be specced in the near future. Clients who wish to be "ahead of the curve MAY refuse to connect to servers using TLS version 1.2 or lower.

 

4.2 Server certificate validation

 

-

Clients can validate TLS connections however they like (including not at all) but the strongly RECOMMENDED approach is to implement a lightweight "TOFU" certificate-pinning system which treats self-signed certificates as first- class citizens. This greatly reduces TLS overhead on the network (only one cert needs to be sent, not a whole chain) and lowers the barrier to entry for setting up a Gemini site (no need to pay a CA or setup a Let's Encrypt cron job, just make a cert and go).

+

Clients can validate TLS connections however they like (including not at all) but the strongly RECOMMENDED approach is to implement a lightweight "TOFU" certificate-pinning system which treats self-signed certificates as first- class citizens. This greatly reduces TLS overhead on the network (only one cert needs to be sent, not a whole chain) and lowers the barrier to entry for setting up a Gemini site (no need to pay a CA or setup a Let's Encrypt cron job, just make a cert and go).

 

-

TOFU stands for "Trust On First Use" and is public-key security model similar to that used by OpenSSH. The first time a Gemini client connects to a server, it accepts whatever certificate it is presented. That certificate's fingerprint and expiry date are saved in a persistent database (like the .known_hosts file for SSH), associated with the server's hostname. On all subsequent connections to that hostname, the received certificate's fingerprint is computed and compared to the one in the database. If the certificate is not the one previously received, but the previous certificate's expiry date has not passed, the user is shown a warning, analogous to the one web browser users are shown when receiving a certificate without a signature chain leading to a trusted CA.

+

TOFU stands for "Trust On First Use" and is public-key security model similar to that used by OpenSSH. The first time a Gemini client connects to a server, it accepts whatever certificate it is presented. That certificate's fingerprint and expiry date are saved in a persistent database (like the .known_hosts file for SSH), associated with the server's hostname. On all subsequent connections to that hostname, the received certificate's fingerprint is computed and compared to the one in the database. If the certificate is not the one previously received, but the previous certificate's expiry date has not passed, the user is shown a warning, analogous to the one web browser users are shown when receiving a certificate without a signature chain leading to a trusted CA.

 

This model is by no means perfect, but it is not awful and is vastly superior to just accepting self-signed certificates unconditionally.

 

4.3 Client certificates

 

-

Although rarely seen on the web, TLS permits clients to identify themselves to servers using certificates, in exactly the same way that servers traditionally identify themselves to the client. Gemini includes the ability for servers to request in-band that a client repeats a request with a client certificate. This is a very flexible, highly secure but also very simple notion of client identity with several applications:

+

Although rarely seen on the web, TLS permits clients to identify themselves to servers using certificates, in exactly the same way that servers traditionally identify themselves to the client. Gemini includes the ability for servers to request in-band that a client repeats a request with a client certificate. This is a very flexible, highly secure but also very simple notion of client identity with several applications:

 

 

-

Gemini requests will typically be made without a client certificate. If a requested resource requires a client certificate and one is not included in a request, the server can respond with a status code of 60, 61 or 62 (see Appendix 1 below for a description of all status codes related to client certificates). A client certificate which is generated or loaded in response to such a status code has its scope bound to the same hostname as the request URL and to all paths below the path of the request URL path. E.g. if a request for gemini:example.com/foo returns status 60 and the user chooses to generate a new client certificate in response to this, that same certificate should be used for subsequent requests to gemini:example.com/foo, gemini:example.com/foo/bar/, gemini:example.com/foo/bar/baz, etc., until such time as the user decides to delete the certificate or to temporarily deactivate it. Interactive clients for human users are strongly recommended to make such actions easy and to generally give users full control over the use of client certificates.

+

Gemini requests will typically be made without a client certificate. If a requested resource requires a client certificate and one is not included in a request, the server can respond with a status code of 60, 61 or 62 (see Appendix 1 below for a description of all status codes related to client certificates). A client certificate which is generated or loaded in response to such a status code has its scope bound to the same hostname as the request URL and to all paths below the path of the request URL path. E.g. if a request for gemini:example.com/foo returns status 60 and the user chooses to generate a new client certificate in response to this, that same certificate should be used for subsequent requests to gemini:example.com/foo, gemini:example.com/foo/bar/, gemini:example.com/foo/bar/baz, etc., until such time as the user decides to delete the certificate or to temporarily deactivate it. Interactive clients for human users are strongly recommended to make such actions easy and to generally give users full control over the use of client certificates.

 

5 The text/gemini media type

 

@@ -174,17 +174,17 @@

 

In the same sense that HTML is the "native" response format of HTTP and plain text is the native response format of gopher, Gemini defines its own native response format - though of course, thanks to the inclusion of a MIME type in the response header Gemini can be used to serve plain text, rich text, HTML, Markdown, LaTeX, etc.

 

-

Response bodies of type "text/gemini" are a kind of lightweight hypertext format, which takes inspiration from gophermaps and from Markdown. The format permits richer typographic possibilities than the plain text of Gopher, but remains extremely easy to parse. The format is line-oriented, and a satisfactory rendering can be achieved with a single pass of a document, processing each line independently. As per gopher, links can only be displayed one per line, encouraging neat, list-like structure.

+

Response bodies of type "text/gemini" are a kind of lightweight hypertext format, which takes inspiration from gophermaps and from Markdown. The format permits richer typographic possibilities than the plain text of Gopher, but remains extremely easy to parse. The format is line-oriented, and a satisfactory rendering can be achieved with a single pass of a document, processing each line independently. As per gopher, links can only be displayed one per line, encouraging neat, list-like structure.

 

Similar to how the two-digit Gemini status codes were designed so that simple clients can function correctly while ignoring the second digit, the text/gemini format has been designed so that simple clients can ignore the more advanced features and still remain very usable.

 

5.2 Parameters

 

-

As a subtype of the top-level media type "text", "text/gemini" inherits the "charset" parameter defined in RFC 2046. However, as noted in 3.3, the default value of "charset" is "UTF-8" for "text" content transferred via Gemini.

+

As a subtype of the top-level media type "text", "text/gemini" inherits the "charset" parameter defined in RFC 2046. However, as noted in 3.3, the default value of "charset" is "UTF-8" for "text" content transferred via Gemini.

 

-

A single additional parameter specific to the "text/gemini" subtype is defined: the "lang" parameter. The value of "lang" denotes the natural language or language(s) in which the textual content of a "text/gemini" document is written. The presence of the "lang" parameter is optional. When the "lang" parameter is present, its interpretation is defined entirely by the client. For example, clients which use text-to-speech technology to make Gemini content accessible to visually impaired users may use the value of "lang" to improve pronunciation of content. Clients which render text to a screen may use the value of "lang" to determine whether text should be displayed left-to-right or right-to-left. Simple clients for users who only read languages written left-to-right may simply ignore the value of "lang". When the "lang" parameter is not present, no default value should be assumed and clients which require some notion of a language in order to process the content (such as text-to-speech screen readers) should rely on user-input to determine how to proceed in the absence of a "lang" parameter.

+

A single additional parameter specific to the "text/gemini" subtype is defined: the "lang" parameter. The value of "lang" denotes the natural language or language(s) in which the textual content of a "text/gemini" document is written. The presence of the "lang" parameter is optional. When the "lang" parameter is present, its interpretation is defined entirely by the client. For example, clients which use text-to-speech technology to make Gemini content accessible to visually impaired users may use the value of "lang" to improve pronunciation of content. Clients which render text to a screen may use the value of "lang" to determine whether text should be displayed left-to-right or right-to-left. Simple clients for users who only read languages written left-to-right may simply ignore the value of "lang". When the "lang" parameter is not present, no default value should be assumed and clients which require some notion of a language in order to process the content (such as text-to-speech screen readers) should rely on user-input to determine how to proceed in the absence of a "lang" parameter.

 

-

Valid values for the "lang" parameter are comma-separated lists of one or more language tags as defined in RFC4646. For example:

+

Valid values for the "lang" parameter are comma-separated lists of one or more language tags as defined in RFC4646. For example: