Validate ~/.linkulator/linkulator.data file contents #3

Closed
opened 2019-11-21 13:13:24 +00:00 by cmccabe · 15 comments
Owner

Initially, the only validation is to check that each row of data has exactly four pipes (|). Need to also validate the content in each data field. Other validation checks?

cmccabe added the
this release
label 2019-11-21 13:18:12 +00:00
Collaborator

There are a lot of things we would want to remove when reading the data file. Escape sequences, control characters, newline, carriage return, terminal bell, colours, text formatting...

This was covered not long ago in Bombadillo, so perhaps @sloum can provide some help with this one.

You can see [here](https://stackoverflow.com/questions/14693701/how-can-i-remove-the-ansi-escape-sequences-from-a-string-in-python) that it gets complicated. The recommended answer seems to be:

ansi_escape_8bit = re.compile(br'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
result = ansi_escape_8bit.sub(b'', somebytesvalue)

There are some issues with just using this as-is:

  1. It relies on bytes, but we currently read text and rely on `\n` as a record delimiter.
  2. It doesn't handle tabs, newlines, and probably other things we would want to remove.

It can be glued together to do something like:

  1. Open file
  2. Read line
  3. If the line is valid (has 4 pipes)
  4. Convert line to bytes
  5. Do the regex thing
  6. Remove tab, newline etc.
  7. Convert the line back to utf8

It doesn't seem right though, and I'm not familiar with this enough to suggest a better answer.

Author
Owner

Yes, that is pretty complex. But I do think it is important to work through the risks and mitigation steps. My biggest fear is that linkulator could be used by a malicious user to inject destructive instructions that are run as the calling user. I think this is avoided because of how Python handles strings, but I'm not certain about this. The next concern is far smaller, that the malicious user simply injects codes that disrupt display of linkulator data.

Collaborator

You could try scanning to remove items. In Bombadillo the file gets read in, then rune by rune (character by character) I go through and have rules for what to do with each character type via a for loop and a switch. It writes acceptable characters to a buffer, ditches non-acceptable characters, and modifies things that need modification (we convert tabs to 4 spaces). Not sure if that style of method would be good in this situation or not. Also not sure if Python's way of handling strings will be performant enough to make all of the copying and looping of string data function how we'd need it. Does Python provide something for creating buffered strings in a more performant way than `"test" + "123"`?
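A rough Python sketch of that scan-style approach (the usual Python answer to buffered string building is to append pieces to a list and `''.join()` them once at the end, since repeated `+` copies the string each time; the tab rule mirrors Bombadillo's):

```python
def scan_line(line):
    """Scan character by character: keep printable characters,
    expand tabs to 4 spaces, ditch everything else."""
    buf = []  # list-append plus join avoids quadratic string concatenation
    for ch in line:
        if ch == '\t':
            buf.append('    ')   # convert tabs to 4 spaces
        elif ch.isprintable():
            buf.append(ch)       # acceptable character
        # control characters (\r, \n, ESC, ...) are dropped; note a bare
        # ESC is removed but the printable tail of an ANSI sequence
        # (e.g. "[31m") would survive, so this alone is not enough
    return ''.join(buf)
```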

Collaborator

Yeah that seems like a much better approach.

I did some research seeing how this might work in Python, but haven't found a good answer.

It's a bit confusing as Python might present reading files in a simple, seemingly-inefficient manner, but under the hood it might handle it very efficiently. Or, maybe it doesn't. [This method](https://rosettacode.org/wiki/Read_a_file_character_by_character/UTF8#Python) uses yield for efficiency (with hopefully excellent unicode support), and I've read that joining strings, or using f-strings, is also very fast.

It probably is not worth trying to optimise this too early though. The first approach might be simplistic, and we can improve upon it later.

Collaborator

Agreed. No need to optimize before it is needed :)

Author
Owner

Shall I mark this 'Future Release' or do either of you think there is important progress to be made before this first release?

Collaborator

I think this is important for a first release to avoid someone breaking things.

Author
Owner

Ok, keep as first release for now.

Another option for a simplistic input sanitizer is to whitelist characters for now. If we're assuming that we're just on English language servers initially, then we might be able to get away with allowing these:
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
`~!@#$%^&*()_-+=[]{};:'"<>,./?
[space] and [tab]

This can always be swapped out for a larger character set or an entirely different method of sanitization later.
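A whitelist like that is easy to express with a set and a generator; a minimal sketch (the keep-set mirrors the characters listed above, which notably omit the pipe, presumably because it is the record delimiter):

```python
import string

# Mirrors the whitelist above: letters, digits, the listed punctuation,
# plus space and tab. The pipe | is deliberately not included.
ALLOWED = set(
    string.ascii_letters
    + string.digits
    + "`~!@#$%^&*()_-+=[]{};:'\"<>,./?"
    + " \t"
)

def sanitize(text):
    """Drop any character not in the whitelist."""
    return ''.join(ch for ch in text if ch in ALLOWED)
```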

Collaborator

If you are just wanting the above printable ascii the following very simple regex can work:

/^[ -~]+$/

That does not include tab or newline, but does include space. Tab and newline could be easily added as could carriage return.
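In Python terms that check might look like the following (tab added as suggested; `\A`/`\Z` are used instead of `^`/`$` because `$` would also match just before a trailing newline):

```python
import re

# Printable ASCII (space through tilde) plus tab, anchored over the
# whole string
PRINTABLE = re.compile(r'\A[ -~\t]+\Z')

def is_clean(line):
    """True if the line is entirely printable ASCII, space, or tab."""
    return bool(PRINTABLE.match(line))
```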

asdf self-assigned this 2019-12-04 00:42:58 +00:00
Author
Owner

Thanks sloum. That's perfect.

Newline and carriage returns bring up a good question. A change from the original linkulator is that replies are now one-liners, but in the original, users could open up their editor of choice and type a long reply. As long as users feel compelled to craft thoughtful, concise comments, I actually really like the short reply approach. But if it prevents people from articulating bigger ideas, then maybe long replies are another feature we need to consider bringing back.

Collaborator

It's an interesting idea; it looks like it's possible to request multiple lines, with CTRL+D acting as the terminator.

Also for some reason my linkulator.data sometimes has records terminated with a carriage return only, and sometimes with a newline. No idea what is going on there.

I've been looking at this one, have had one attempt at refactoring that failed. I'll keep at this over the next couple of days.

Author
Owner

@asdf, it looks like there is still a
on line 281 of the main linkulator file, where it writes the reply to file. I would fix this, but you may already be in that area on your checkout, so I'll leave it to you.

Collaborator

This is now implemented by #60 in `data.py`'s `wash_line()`. Tabs, newlines and other similar characters are removed by the `BAD_CHARS` regex, and escape sequences are removed by `ESCAPE_CHARS`.

It would be good to do this in one pass, but I'm not that experienced with regex. I know @sloum has been working on things like this recently, any advice?
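One-pass washing is usually just a matter of joining the two patterns with alternation. The real `BAD_CHARS` and `ESCAPE_CHARS` live in `data.py`; the patterns below are stand-ins to show the shape:

```python
import re

# Stand-in patterns -- the real ones are defined in data.py
ESCAPE_CHARS = r'\x1b[@-_][0-?]*[ -/]*[@-~]'  # ANSI escape sequences
BAD_CHARS = r'[\x00-\x1f]'                    # tabs, newlines, other controls

# Alternation: at each position the escape branch is tried first, so a
# full ANSI sequence is consumed before its ESC byte can match alone
WASH = re.compile(f'(?:{ESCAPE_CHARS})|(?:{BAD_CHARS})')

def wash_line(line):
    """Remove escape sequences and control characters in one pass."""
    return WASH.sub('', line)
```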

Author
Owner

@sloum, when you have some time, can you take a look at this and comment on whether you think it's good now or if we need some final tweaks? I *think* this might be our last issue before release -- assuming we don't come up with a replacement for the name "linkulator".

Collaborator

The validation works ok. There are likely other issues with it, and there are other improvements in file reading to be done. This can be improved at a later date if required.

asdf closed this issue 2020-01-09 04:01:59 +00:00