Compiled Qt translations

The compiled Qt translation file (*.qm) is generated by Qt Linguist and holds all the translation data that a Qt application can use for a single language.

I have written a Kaitai Struct YAML schema for this format.

Conventions used in this document

When left unspecified, a number is signed.
When left unspecified, a string is encoded in UTF-8.
Strings with a defined size may contain null bytes. Strings without a defined size are null-terminated.

Structure

The file starts with a 16-byte magic header, followed by a sequence of blocks. The number of blocks is not stored anywhere; blocks are read one after the other until the end of the file.

+-------+---------+---------+-...-+---------+
| Magic | Block 1 | Block 2 |     | Block N |
+-------+---------+---------+-...-+---------+

The magic header is as follows:

3C B8 64 18 CA EF 9C 95 CD 21 1C BF 60 A1 BD DD
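A reader can reject non-QM input early by comparing the first 16 bytes of the file against this sequence. The following is a minimal sketch; the names `QM_MAGIC` and `has_qm_magic` are illustrative and do not come from Qt.

```python
# The 16-byte magic header listed above.
QM_MAGIC = bytes.fromhex("3CB86418CAEF9C95CD211CBF60A1BDDD")


def has_qm_magic(data: bytes) -> bool:
    """Return True if data starts with the 16-byte QM magic header."""
    return data[:len(QM_MAGIC)] == QM_MAGIC
```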

Block

Block:
+-----+--------------+----------------+
| Tag | Block length | Block contents |
+-----+--------------+----------------+

Tag (unsigned byte)

One of the following:

0x2F: Contexts block
0x42: Hashes block
0x69: Messages block
0x88: Numerus Rules block
0x96: Dependencies block
0xA7: Language block

Block length (unsigned int32)

The size of the block's contents, measured in bytes.

Block contents

The contents of each block depend on the tag.

There should only be one of each block tag in a single file. There should always be a Hashes block.
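The tag, length and contents layout can be walked with a simple loop. This is a sketch rather than a reference implementation: the function name `iter_blocks` is illustrative, and the big-endian byte order of the length field is an assumption (it is what Qt's reader uses, but it is not stated in the conventions above).

```python
import struct

MAGIC_SIZE = 16  # size of the magic header that precedes the first block


def iter_blocks(data: bytes):
    """Yield (tag, contents) for each block that follows the magic header."""
    pos = MAGIC_SIZE
    while pos < len(data):
        if pos + 5 > len(data):
            raise ValueError("truncated block header")
        # One unsigned byte for the tag, then an unsigned 32-bit length.
        # Assumption: the length is stored big-endian.
        tag, length = struct.unpack_from(">BI", data, pos)
        pos += 5
        if pos + length > len(data):
            raise ValueError("block length exceeds file size")
        yield tag, data[pos:pos + length]
        pos += length
```

The loop stops only when the end of the data is reached, which mirrors the fact that the block count is not stored in the file.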

Contexts block

When the QM file has been generated using lrelease -compress, the messages in the file are compressed by their common prefixes: their hash, their hash and context, or their hash, context and source text. The context prefix will be stored in a hash table in the Contexts block, and the context and source text will only be mentioned in the attributes of the first message that has this context or source text.

This block cannot exceed 131072 bytes in size; if this limit is exceeded, lrelease acts as if -compress was not set and saves the contexts in the Messages block instead.

Block contents (Contexts block):
+------------+--------------+
| Hash table | Context pool |
+------------+--------------+

Hash table

The hash table maps a hash of the context to an offset where the context might be found in the context pool.

Hash table:
+--------+----------+----------+-...-+----------+
| Length | Offset 1 | Offset 2 |     | Offset N |
+--------+----------+----------+-...-+----------+

Length (unsigned int16): Number of offsets in the hash table.
Offset (unsigned int16): Offset, in bytes, within the context pool, at which the search for the context's string should start. Note that the context string will probably not be found exactly at this offset; it will be found further away. All offsets should be multiples of 2. An offset of 0 means this hash does not exist in this file.

Note that the hash table's size may exceed the actual number of contexts, resulting in many offsets being set to zero.
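Splitting the Contexts block into its hash table and its context pool only requires reading the length and then that many offsets. The sketch below does just that and makes no attempt to resolve a context name; `parse_context_hash_table` is an illustrative name, and big-endian byte order is an assumption.

```python
import struct


def parse_context_hash_table(contents: bytes):
    """Split Contexts block contents into (offsets, pool)."""
    # Assumption: all 16-bit values are stored big-endian.
    (length,) = struct.unpack_from(">H", contents, 0)
    offsets = list(struct.unpack_from(f">{length}H", contents, 2))
    # Everything after the table is the context pool.
    pool = contents[2 + 2 * length:]
    return offsets, pool
```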

Context pool

Context pool:
+--------+-----------+-----------+-...-+-----------+
| 0x0000 | Context 1 | Context 2 |     | Context N |
+--------+-----------+-----------+-...-+-----------+

As offset 0 in the hash table means that the context does not exist in this file, the context at offset 0 in the context pool is set to 0x0000.

Context

Context:
+--------+---------+---------+
| Length | Context | Padding |
+--------+---------+---------+

Length (unsigned byte): The length of the context string in bytes.
Context (string): The context name, truncated to at most 255 bytes (the largest value the Length field can hold).
Padding (optional unsigned byte): An extra null byte (0x00) may be added to ensure the size of this whole block is a multiple of 2.

Hashes block

The Hashes block holds pointers to the messages stored in the Messages block, indexed by hash.

Block contents (Hashes block):
+--------+--------+-...-+--------+
| Hash 1 | Hash 2 |     | Hash N |
+--------+--------+-...-+--------+

Hash

Hash:
+------+--------+
| Hash | Offset |
+------+--------+

Hash (unsigned int32): A hash of the bytes represented by the concatenation of the source text and of the comment strings of a single message. This can be used for faster lookup of translations, since the source text and comment are defined in the source code.
Offset (unsigned int32): Offset, in bytes, of the start of the message designated by this hash, starting from the beginning of the contents of the Messages block.
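Since every entry is a fixed 8 bytes, the whole block can be decoded in one pass. A minimal sketch, assuming big-endian byte order; `parse_hashes_block` is an illustrative name.

```python
import struct


def parse_hashes_block(contents: bytes):
    """Return a list of (hash, offset) pairs from Hashes block contents."""
    if len(contents) % 8 != 0:
        raise ValueError("Hashes block size is not a multiple of 8 bytes")
    # Each entry is two unsigned 32-bit integers: the hash, then the
    # offset of the message within the Messages block contents.
    return list(struct.iter_unpack(">II", contents))
```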

Messages block

Block contents (Messages block):
+-------------+-------------+-...-+-------------+
| Attribute 1 | Attribute 2 |     | Attribute N |
+-------------+-------------+-...-+-------------+

There is no exact structure for a message: attributes should be read into a list until an End attribute is reached, meaning all the attributes in this list are part of the message.

Messages are usually looked up using the Hashes block first, rather than reading through the Messages block sequentially.

Attribute

Attributes have no official name; they have been named attributes as they are the various properties of a message.

Attribute:
+-----+--------------------+
| Tag | Attribute contents |
+-----+--------------------+

Tag (unsigned byte)

One of the following:

0x01: End
0x02: Source text (UTF-16)
0x03: Translation
0x04: Context (UTF-16)
0x05: Hash (obsolete)
0x06: Source text
0x07: Context
0x08: Comment
0x09: Unknown (obsolete)

Attribute contents

The contents of each attribute depend on the tag.

There should only be one Comment attribute.
There should only be one of either a Context or a Context (UTF-16) attribute.
There should only be one of either a Source text or a Source text (UTF-16) attribute.
There may be zero or more Translation attributes.
There must be one End attribute.

End attribute

Attributes with the End tag signify the end of the message. They have no contents.

Source text (UTF-16) attribute

Attribute contents (Source text (UTF-16) attribute):
+--------+-------------+
| Length | Source text |
+--------+-------------+

Length (int32): Length of the string, in bytes. Should always be a multiple of 2, unless it is negative, which indicates an empty string.
Source text (UTF-16 string): The source text for this message. If the translations are ID-based, this will be the ID of this translation, and the context and comment will always be empty.

Translation attribute

Attribute contents (Translation attribute):
+--------+-------------+
| Length | Translation |
+--------+-------------+

Length (int32): Length of the string, in bytes. Should always be a multiple of 2, unless it is negative, which indicates an empty string.
Translation (UTF-16 string): The translated text for this message.

Context (UTF-16) attribute

Attribute contents (Context (UTF-16) attribute):
+--------+---------+
| Length | Context |
+--------+---------+

Length (int32): Length of the string, in bytes. Should always be a multiple of 2, unless it is negative, which indicates an empty string.
Context (UTF-16 string): Name of the context in which this message appears. This is usually a Qt class name.

Hash attribute

Attribute contents (Hash attribute):
+------+
| Hash |
+------+

Hash (unsigned int32): Hash of the message. This attribute is obsolete; the hash is now only stored in the separate Hashes block.

Source text attribute

Attribute contents (Source text attribute):
+--------+-------------+
| Length | Source text |
+--------+-------------+

Length (unsigned int32): Length, in bytes, of the source text.
Source text (string): The source text for this message. If the translations are ID-based, this will be the ID of this translation, and the context and comment will always be empty.

Context attribute

Attribute contents (Context attribute):
+--------+---------+
| Length | Context |
+--------+---------+

Length (unsigned int32): Length, in bytes, of the context.
Context (string): Name of the context in which this message appears. This is usually a Qt class name.

Comment attribute

Attribute contents (Comment attribute):
+--------+---------+
| Length | Comment |
+--------+---------+

Length (unsigned int32): Length, in bytes, of the comment.
Comment (string): A comment left by the developer on this message, meant for disambiguation.

https://doc.qt.io/qt-6/i18n-source-translation.html#disambiguation

Unknown obsolete attribute

Attribute contents (Unknown obsolete attribute):
+------+
| Byte |
+------+

Byte (unknown, 1 byte): No definition known.

This attribute is not found in Qt 2.1.1, and can be found in Qt 2.2.0 as "Obsolete 1". It is now known as "Obsolete 2", because the Hash attribute became "Obsolete 1". I cannot find any other versions between those two version numbers that could tell what this attribute was for.
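Putting the attribute layouts together, a single message can be read from the Messages block contents by consuming attributes until an End attribute appears. This sketch relies on several assumptions: the tag values 0x01 to 0x09 are assigned in the order of the tag list, following the enum in Qt's qtranslator.cpp; integers are big-endian; and UTF-16 strings are big-endian as well. The name `read_message` and the `TAG_*` constants are illustrative.

```python
import struct

# Assumption: tag values follow the order of the tag list, starting at 1.
TAG_END, TAG_SOURCE_TEXT_16, TAG_TRANSLATION, TAG_CONTEXT_16 = 1, 2, 3, 4
TAG_HASH, TAG_SOURCE_TEXT, TAG_CONTEXT, TAG_COMMENT, TAG_OBSOLETE_2 = 5, 6, 7, 8, 9

UTF16_TAGS = {TAG_SOURCE_TEXT_16, TAG_TRANSLATION, TAG_CONTEXT_16}
UTF8_TAGS = {TAG_SOURCE_TEXT, TAG_CONTEXT, TAG_COMMENT}


def read_message(messages: bytes, offset: int):
    """Read attributes from offset until End; return (attributes, next_offset)."""
    attributes = []
    pos = offset
    while True:
        tag = messages[pos]
        pos += 1
        if tag == TAG_END:
            return attributes, pos
        if tag in UTF16_TAGS:
            (length,) = struct.unpack_from(">i", messages, pos)  # signed
            pos += 4
            length = max(length, 0)  # a negative length means an empty string
            value = messages[pos:pos + length].decode("utf-16-be")
            pos += length
        elif tag in UTF8_TAGS:
            (length,) = struct.unpack_from(">I", messages, pos)  # unsigned
            pos += 4
            value = messages[pos:pos + length].decode("utf-8")
            pos += length
        elif tag == TAG_HASH:
            (value,) = struct.unpack_from(">I", messages, pos)
            pos += 4
        elif tag == TAG_OBSOLETE_2:
            value = messages[pos]  # single byte of unknown meaning
            pos += 1
        else:
            raise ValueError(f"unknown attribute tag 0x{tag:02X}")
        attributes.append((tag, value))
```

The `offset` argument would typically come from an entry of the Hashes block, and the returned `next_offset` allows reading the Messages block sequentially instead.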

Numerus Rules block

Defines the rules for automatic pluralization of names in the translation language.

Block contents (Numerus Rules block):
+------------------+------------------+-...-+------------------+
| Rule component 1 | Rule component 2 |     | Rule component N |
+------------------+------------------+-...-+------------------+

Rule component (unsigned byte)

Either an integer, an arithmetic operator with optional flags, a logical operator or a rule separator.

The following arithmetic operators are defined:

0x01: Equality operator. Followed by one integer X, means "the value is equal to X".
0x02: Less than operator. Followed by one integer X, means "the value is less than X".
0x03: Less than or equal operator. Followed by one integer X, means "the value is less than or equal to X".
0x04: Between operator. Followed by two integers X and Y, means "the value is between X and Y".

The following flags can be applied to the arithmetic operators:

0x08: Not.
0x10: Modulo 10. Get the remainder of the division of the value by 10 before applying the operator.
0x20: Modulo 100. Get the remainder of the division of the value by 100 before applying the operator.
0x40: Leading 1000. Meaning is unclear.
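As an illustration of how an operator and its flags combine, the sketch below evaluates a single condition against a value. It assumes that the flags are OR-ed into the operator byte, that the low three bits hold the operator, and that the Between bounds are inclusive; none of this is stated above. The Leading 1000 flag is deliberately left unhandled, and `eval_condition` is an illustrative name.

```python
OP_EQ, OP_LT, OP_LEQ, OP_BETWEEN = 0x01, 0x02, 0x03, 0x04
FLAG_NOT, FLAG_MOD_10, FLAG_MOD_100, FLAG_LEAD_1000 = 0x08, 0x10, 0x20, 0x40
OP_MASK = 0x07  # assumption: the low three bits hold the operator


def eval_condition(opcode: int, operands, value: int) -> bool:
    """Evaluate one arithmetic operator, with its flags, against value."""
    if opcode & FLAG_LEAD_1000:
        raise NotImplementedError("Leading 1000 is not handled in this sketch")
    # Reduce the value first, if a modulo flag is set.
    if opcode & FLAG_MOD_10:
        value %= 10
    elif opcode & FLAG_MOD_100:
        value %= 100
    op = opcode & OP_MASK
    if op == OP_EQ:
        result = value == operands[0]
    elif op == OP_LT:
        result = value < operands[0]
    elif op == OP_LEQ:
        result = value <= operands[0]
    elif op == OP_BETWEEN:
        # Assumption: both bounds are inclusive.
        result = operands[0] <= value <= operands[1]
    else:
        raise ValueError(f"unknown operator 0x{op:02X}")
    # The Not flag inverts the outcome of the comparison.
    return not result if opcode & FLAG_NOT else result
```

For instance, `eval_condition(FLAG_MOD_10 | OP_BETWEEN, [2, 4], 23)` reduces 23 to 3 and then checks that 3 lies between 2 and 4.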

The following logical operators are defined:

0xFD: And.
0xFE: Or.

And binds more tightly than Or, as in most programming languages: "A and B or C and D" means "(A and B) or (C and D)". This follows the nested loops of the numerusHelper function in Qt's qtranslator.cpp.

Finally, the rule separator is defined:

0xFF: New rule.

The numerus rules are applied to a numeric value to determine which pluralization form the name associated with this value should take. Each rule is applied one after the other, and maps to a different pluralization form. The number of pluralization forms depends on the language.

With N pluralization forms, there should be N-1 rules. The rules are tested in order: if the first rule matches, the first pluralization form is picked; if the second rule matches, the second pluralization form is picked; and so on. If no rule matches, the last pluralization form is picked. For English, for instance, the single rule "the value is equal to 1" selects the singular form, and every other value falls through to the plural form.

Dependencies block

Block contents (Dependencies block):
+--------------+--------------+-...-+--------------+
| Dependency 1 | Dependency 2 |     | Dependency N |
+--------------+--------------+-...-+--------------+

Dependency (string): The name of a file in the same directory as this one that this file depends on.

Language block

Block contents (Language block):
+---------------+
| Language code |
+---------------+

Language code (string): Holds the language code of the translation file.

References

Qt Linguist Manual
Writing Source Code for Translation
Source code of the QM reader and writer of Qt Linguist
Source code of the QTranslator, which reads from QM files to perform the translations within apps
The Qt archive, to examine the sources of older versions of Qt. Try reading src/corelib/kernel/qtranslator.cpp or tools/linguist/shared/qm.cpp.
