mu/038---literal_strings.cc


//: Allow instructions to mention literals directly.
//:
//: This layer will transparently move them to the global segment (assumed to
//: always be the second segment).
5001 - drop the :(scenario) DSL I've been saying for a while[1][2][3] that adding extra abstractions makes things harder for newcomers, and adding new notations doubly so. And then I notice this DSL in my own backyard. Makes me feel like a hypocrite. [1] https://news.ycombinator.com/item?id=13565743#13570092 [2] https://lobste.rs/s/to8wpr/configuration_files_are_canary_warning [3] https://lobste.rs/s/mdmcdi/little_languages_by_jon_bentley_1986#c_3miuf2 The implementation of the DSL was also highly hacky: a) It was happening in the tangle/ tool, but was utterly unrelated to tangling layers. b) There were several persnickety constraints on the different kinds of lines and the specific order they were expected in. I kept finding bugs where the translator would silently do the wrong thing. Or the error messages sucked, and readers may be stuck looking at the generated code to figure out what happened. Fixing error messages would require a lot more code, which is one of my arguments against DSLs in the first place: they may be easy to implement, but they're hard to design to go with the grain of the underlying platform. They require lots of iteration. Is that effort worth prioritizing in this project? On the other hand, the DSL did make at least some readers' life easier, the ones who weren't immediately put off by having to learn a strange syntax. There were fewer quotes to parse, fewer backslash escapes. Anyway, since there are also people who dislike having to put up with strange syntaxes, we'll call that consideration a wash and tear this DSL out. --- This commit was sheer drudgery. Hopefully it won't need to be redone with a new DSL because I grow sick of backslashes.
2019-03-13 01:56:55 +00:00
void test_transform_literal_string() {
  run(
      "== code 0x1\n"
      "b8/copy \"test\"/imm32\n"
      "== data 0x2000\n" // need an empty segment
  );
  CHECK_TRACE_CONTENTS(
      "transform: -- move literal strings to data segment\n"
      "transform: adding global variable '__subx_global_1' containing \"test\"\n"
      "transform: instruction after transform: 'b8 __subx_global_1'\n"
  );
}
//: We don't rely on any transforms running in previous layers, but this layer
//: knows about labels and global variables and will emit them for previous
//: layers to transform.
:(after "Begin Transforms")
// Begin Level-3 Transforms
Transform.push_back(transform_literal_strings);
// End Level-3 Transforms
:(before "End Globals")
int Next_auto_global = 1;
:(code)
void transform_literal_strings(program& p) {
  trace(3, "transform") << "-- move literal strings to data segment" << end();
  if (p.segments.empty()) return;
  segment& code = *find(p, "code");
  segment& data = *find(p, "data");
  for (int i = 0; i < SIZE(code.lines); ++i) {
    line& inst = code.lines.at(i);
    for (int j = 0; j < SIZE(inst.words); ++j) {
      word& curr = inst.words.at(j);
      if (curr.data.at(0) != '"') continue;
      ostringstream global_name;
      global_name << "__subx_global_" << Next_auto_global;
      ++Next_auto_global;
      add_global_to_data_segment(global_name.str(), curr, data);
      curr.data = global_name.str();
    }
    trace(99, "transform") << "instruction after transform: '" << data_to_string(inst) << "'" << end();
  }
}
void add_global_to_data_segment(const string& name, const word& value, segment& data) {
  trace(99, "transform") << "adding global variable '" << name << "' containing " << value.data << end();
  // emit label
  data.lines.push_back(label(name));
  // emit size for size-prefixed array
  data.lines.push_back(line());
  emit_hex_bytes(data.lines.back(), SIZE(value.data)-/*skip quotes*/2, 4/*bytes*/);
  // emit data byte by byte
  data.lines.push_back(line());
  line& curr = data.lines.back();
  for (int i = /*skip start quote*/1; i < SIZE(value.data)-/*skip end quote*/1; ++i) {
    char c = value.data.at(i);
    curr.words.push_back(word());
    curr.words.back().data = hex_byte_to_string(c);
    curr.words.back().metadata.push_back(string(1, c));
  }
}
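//: A minimal standalone sketch of the layout emitted above: a 4-byte size
//: followed by one hex byte per character. It assumes the size is written
//: least-significant byte first (as x86 expects); this file doesn't show what
//: emit_hex_bytes does. hex_byte and size_prefixed_bytes are hypothetical
//: names used only for illustration; they are not part of this layer.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: render one byte as two lowercase hex digits.
std::string hex_byte(unsigned char b) {
  char buf[3];
  std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(b));
  return std::string(buf);
}

// Hypothetical helper: lay out a quoted literal as a 4-byte length
// (ASSUMPTION: little-endian) followed by its characters as hex bytes.
std::vector<std::string> size_prefixed_bytes(const std::string& quoted) {
  std::vector<std::string> out;
  if (quoted.size() < 2) return out;  // not a quoted literal; emit nothing
  const std::string payload = quoted.substr(1, quoted.size()-2);  // skip both quotes
  std::uint32_t n = static_cast<std::uint32_t>(payload.size());
  for (int i = 0; i < 4; ++i) {  // size, least-significant byte first
    out.push_back(hex_byte(static_cast<unsigned char>(n & 0xff)));
    n >>= 8;
  }
  for (char c : payload)  // then the data, byte by byte
    out.push_back(hex_byte(static_cast<unsigned char>(c)));
  return out;
}
```

//: Under that assumption, size_prefixed_bytes("\"test\"") yields the words
//: 04 00 00 00 74 65 73 74.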
//: Within strings, whitespace is significant. So we need to redo our instruction
//: parsing.
void test_instruction_with_string_literal() {
  parse_instruction_character_by_character(
      "a \"abc  def\" z\n" // two spaces inside string
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: a\n"
      "parse2: word: \"abc  def\"\n"
      "parse2: word: z\n"
  );
  // no other words
  CHECK_TRACE_COUNT("parse2", 3);
}
:(before "End Line Parsing Special-cases(line_data -> l)")
if (line_data.find('"') != string::npos) { // can cause false-positives, but we can handle them
  parse_instruction_character_by_character(line_data, l);
  continue;
}
:(code)
void parse_instruction_character_by_character(const string& line_data, vector<line>& out) {
  if (line_data.find('\n') != string::npos && line_data.find('\n') != line_data.size()-1) {
    raise << "parse_instruction_character_by_character: should receive only a single line\n" << end();
    return;
  }
  // parse literals
  istringstream in(line_data);
  in >> std::noskipws;
  line result;
  result.original = line_data;
  // add tokens (words or strings) one by one
  while (has_data(in)) {
    skip_whitespace(in);
    if (!has_data(in)) break;
    char c = in.get();
    if (c == '#') break; // comment; drop rest of line
    if (c == ':') break; // line metadata; skip for now
    if (c == '.') {
      if (!has_data(in)) break; // comment token at end of line
      if (isspace(in.peek()))
        continue; // '.' followed by space is comment token; skip
    }
    result.words.push_back(word());
    if (c == '"') {
      // string literal; slurp everything between quotes into data
      ostringstream d;
      d << c;
      while (has_data(in)) {
        in >> c;
        if (c == '\\') {
          in >> c;
          if (c == 'n') d << '\n';
          else if (c == '"') d << '"';
          else if (c == '\\') d << '\\';
          else {
            raise << "parse_instruction_character_by_character: unknown escape sequence '\\" << c << "'\n" << end();
            return;
          }
          continue;
        } else {
          d << c;
        }
        if (c == '"') break;
      }
      result.words.back().data = d.str();
      // slurp metadata
      ostringstream m;
      while (!isspace(in.peek()) && has_data(in)) { // peek can sometimes trigger eof(), so do it first
        in >> c;
        if (c == '/') {
          if (!m.str().empty()) result.words.back().metadata.push_back(m.str());
          m.str("");
        }
        else {
          m << c;
        }
      }
      if (!m.str().empty()) result.words.back().metadata.push_back(m.str());
    }
    else {
      // not a string literal; slurp all characters until whitespace
      ostringstream w;
      w << c;
      while (!isspace(in.peek()) && has_data(in)) { // peek can sometimes trigger eof(), so do it first
        in >> c;
        w << c;
      }
      parse_word(w.str(), result.words.back());
    }
    trace(99, "parse2") << "word: " << to_string(result.words.back()) << end();
  }
  if (!result.words.empty())
    out.push_back(result);
}
void skip_whitespace(istream& in) {
  while (true) {
    if (has_data(in) && isspace(in.peek())) in.get();
    else break;
  }
}

void skip_comment(istream& in) {
  if (has_data(in) && in.peek() == '#') {
    in.get();
    while (has_data(in) && in.peek() != '\n') in.get();
  }
}

line label(string s) {
  line result;
  result.words.push_back(word());
  result.words.back().data = (s+":");
  return result;
}

// helper for tests
void parse_instruction_character_by_character(const string& line_data) {
  vector<line> out;
  parse_instruction_character_by_character(line_data, out);
}
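//: A minimal standalone sketch of the tokenizing idea above: split on
//: whitespace, except inside double quotes, and resolve backslash escapes
//: while copying. split_keeping_strings is a hypothetical name. Unlike the
//: parser in this layer, the sketch ignores comments and comment tokens,
//: doesn't separate metadata, and lets any unknown escaped character stand
//: for itself instead of raising an error.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true if c is whitespace (safe for negative chars).
static bool is_space(char c) {
  return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical tokenizer: whitespace separates tokens unless it sits
// between double quotes.
std::vector<std::string> split_keeping_strings(const std::string& line) {
  std::vector<std::string> tokens;
  std::size_t i = 0;
  while (i < line.size()) {
    if (is_space(line[i])) { ++i; continue; }
    std::string tok;
    if (line[i] == '"') {
      tok += line[i++];  // opening quote
      while (i < line.size()) {
        char c = line[i++];
        if (c == '\\' && i < line.size()) {
          char e = line[i++];
          tok += (e == 'n') ? '\n' : e;  // ASSUMPTION: unknown escapes pass through
          continue;  // an escaped quote must not close the string
        }
        tok += c;
        if (c == '"') break;  // closing quote
      }
    }
    // Rest of a plain word, or whatever is glued to a closing quote
    // (e.g. /imm32); both end at the next whitespace.
    while (i < line.size() && !is_space(line[i])) tok += line[i++];
    tokens.push_back(tok);
  }
  return tokens;
}
```

//: For example, "a \"abc  def\" z\n" splits into three tokens, and the middle
//: one keeps both of its inner spaces.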
void test_parse2_comment_token_in_middle() {
  parse_instruction_character_by_character(
      "a . z\n"
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: a\n"
      "parse2: word: z\n"
  );
  CHECK_TRACE_DOESNT_CONTAIN("parse2: word: .");
  // no other words
  CHECK_TRACE_COUNT("parse2", 2);
}

void test_parse2_word_starting_with_dot() {
  parse_instruction_character_by_character(
      "a .b c\n"
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: a\n"
      "parse2: word: .b\n"
      "parse2: word: c\n"
  );
}

void test_parse2_comment_token_at_start() {
  parse_instruction_character_by_character(
      ". a b\n"
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: a\n"
      "parse2: word: b\n"
  );
  CHECK_TRACE_DOESNT_CONTAIN("parse2: word: .");
}

void test_parse2_comment_token_at_end() {
  parse_instruction_character_by_character(
      "a b .\n"
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: a\n"
      "parse2: word: b\n"
  );
  CHECK_TRACE_DOESNT_CONTAIN("parse2: word: .");
}

void test_parse2_word_starting_with_dot_at_start() {
  parse_instruction_character_by_character(
      ".a b c\n"
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: .a\n"
      "parse2: word: b\n"
      "parse2: word: c\n"
  );
}

void test_parse2_metadata() {
  parse_instruction_character_by_character(
      ".a b/c d\n"
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: .a\n"
      "parse2: word: b /c\n"
      "parse2: word: d\n"
  );
}

void test_parse2_string_with_metadata() {
  parse_instruction_character_by_character(
      "a \"bc def\"/disp32 g\n"
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: a\n"
      "parse2: word: \"bc def\" /disp32\n"
      "parse2: word: g\n"
  );
}

void test_parse2_string_with_metadata_at_end() {
  parse_instruction_character_by_character(
      "a \"bc def\"/disp32\n"
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: a\n"
      "parse2: word: \"bc def\" /disp32\n"
  );
}

void test_parse2_string_with_metadata_at_end_of_line_without_newline() {
  parse_instruction_character_by_character(
      "68/push \"test\"/f" // no newline, which is how calls from parse() will look
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: 68 /push\n"
      "parse2: word: \"test\" /f\n"
  );
}
//: Make sure slashes inside strings don't trigger adding stuff from inside the
//: string to metadata.
void test_parse2_string_containing_slashes() {
  parse_instruction_character_by_character(
      "a \"bc/def\"/disp32\n"
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: \"bc/def\" /disp32\n"
  );
}
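//: A minimal standalone sketch of the property this test checks: slashes
//: split off metadata only after the closing quote. parsed_word and
//: split_metadata are hypothetical names, not the types used in this layer,
//: and the sketch assumes metadata itself never contains a double quote.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a word: its data plus slash-separated metadata.
struct parsed_word {
  std::string data;
  std::vector<std::string> metadata;
};

// Hypothetical helper: split one already-isolated token into data and
// metadata, ignoring any slashes inside a quoted string.
parsed_word split_metadata(const std::string& token) {
  parsed_word result;
  std::size_t start = 0;  // where searching for '/' may begin
  if (!token.empty() && token[0] == '"') {
    // ASSUMPTION: the last quote in the token closes the string.
    std::size_t close = token.rfind('"');
    if (close == 0) {  // unterminated string; don't split at all
      result.data = token;
      return result;
    }
    start = close + 1;
  }
  std::size_t slash = token.find('/', start);
  result.data = token.substr(0, slash);  // whole token if there is no slash
  while (slash != std::string::npos) {
    std::size_t next = token.find('/', slash+1);
    std::size_t len = (next == std::string::npos) ? std::string::npos : next-slash-1;
    std::string m = token.substr(slash+1, len);
    if (!m.empty()) result.metadata.push_back(m);  // skip empty segments
    slash = next;
  }
  return result;
}
```

//: With this sketch, the token "bc/def"/disp32 (quotes included) keeps
//: "bc/def" as its data and gets the single metadata entry disp32.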
void test_instruction_with_string_literal_with_escaped_quote() {
  parse_instruction_character_by_character(
      "\"a\\\"b\"\n" // escaped quote inside string
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: \"a\"b\"\n"
  );
  // no other words
  CHECK_TRACE_COUNT("parse2", 1);
}

void test_instruction_with_string_literal_with_escaped_backslash() {
  parse_instruction_character_by_character(
      "\"a\\\\b\"\n" // escaped backslash inside string
  );
  CHECK_TRACE_CONTENTS(
      "parse2: word: \"a\\b\"\n"
  );
  // no other words
  CHECK_TRACE_COUNT("parse2", 1);
}