Validate ~/.linkulator/linkulator.data file contents #3

Closed
opened 2019-11-21 13:13:24 +00:00 by cmccabe · 15 comments
Owner

Initially, the only validation is to check that each row of data has exactly four pipes (|). Need to also validate the content in each data field. Other validation checks?

cmccabe added the
this release
label 2019-11-21 13:18:12 +00:00
Collaborator

There are a lot of things we would want to remove when reading the data file. Escape sequences, control characters, newline, carriage return, terminal bell, colours, text formatting...

This was covered not long ago in Bombadillo, so perhaps @sloum can provide some help with this one.

You can see [here](https://stackoverflow.com/questions/14693701/how-can-i-remove-the-ansi-escape-sequences-from-a-string-in-python) that it gets complicated. The recommended answer seems to be:

ansi_escape_8bit = re.compile(br'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
result = ansi_escape_8bit.sub(b'', somebytesvalue)

There are some issues with just using this as-is:

  1. It relies on bytes, but we currently read text and rely on `\n` as a record delimiter.
  2. It doesn't handle tabs, newlines, and probably other things we would want to remove.

It can be glued together to do something like:

  1. Open file
  2. Read line
  3. If the line is valid (has 4 pipes)
  4. Convert line to bytes
  5. Do the regex thing
  6. Remove tab, newline etc.
  7. Convert the line back to utf8

It doesn't seem right though, and I'm not familiar with this enough to suggest a better answer.

Author
Owner

Yes, that is pretty complex. But I do think it is important to work through the risks and mitigation steps. My biggest fear is that linkulator could be used by a malicious user to inject destructive instructions that are run as the calling user. I think this is avoided because of how Python handles strings, but I'm not certain about this. The next concern is far smaller, that the malicious user simply injects codes that disrupt display of linkulator data.

Collaborator

You could try scanning to remove items. In Bombadillo the file gets read in, then rune by rune (character by character) I go through and have rules for what to do with each character type via a for loop and a switch. It writes acceptable characters to a buffer, ditches non-acceptable characters, and modifies things that need modification (we convert tabs to 4 spaces). Not sure if that style of method would be good in this situation or not. Also not sure if Python's way of handling strings will be performant enough to make all of the copying and looping of string data function how we'd need it. Does Python provide something for creating buffered strings in a more performant way than `"test" + "123"`?
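A rough Python sketch of that scan-style approach (the usual Python answer to buffered string building is to append pieces to a list and `''.join()` them once at the end, since repeated `+` copies the string each time; the tab rule mirrors Bombadillo's):

```python
def scan_line(line):
    """Scan character by character: keep printable characters,
    expand tabs to 4 spaces, ditch everything else."""
    buf = []  # list-append plus join avoids quadratic string concatenation
    for ch in line:
        if ch == '\t':
            buf.append('    ')   # convert tabs to 4 spaces
        elif ch.isprintable():
            buf.append(ch)       # acceptable character
        # control characters (\r, \n, ESC, ...) are dropped; note a bare
        # ESC is removed but the printable tail of an ANSI sequence
        # (e.g. "[31m") would survive, so this alone is not enough
    return ''.join(buf)
```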

Collaborator

Yeah that seems like a much better approach.

I did some research seeing how this might work in Python, but haven't found a good answer.

It's a bit confusing as Python might present reading files in a simple, seemingly-inefficient manner, but under the hood it might handle it very efficiently. Or, maybe it doesn't. [This method](https://rosettacode.org/wiki/Read_a_file_character_by_character/UTF8#Python) uses yield for efficiency (with hopefully excellent unicode support), and I've read that joining strings, or using f-strings, is also very fast.

It probably is not worth trying to optimise this too early though. The first approach might be simplistic, and we can improve upon it later.

Collaborator

Agreed. No need to optimize before it is needed :)

Author
Owner

Shall I mark this 'Future Release' or do either of you think there is important progress to be made before this first release?

Collaborator

I think this is important for a first release to avoid someone breaking things.

Author
Owner

Ok, keep as first release for now.

Another option for a simplistic input sanitizer is to whitelist characters for now. If we're assuming that we're just on English language servers initially, then we might be able to get away with allowing these:
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
`~!@#$%^&*()_-+=[]{};:'"<>,./?
[space] and [tab]

This can always be swapped out for a larger character set or an entirely different method of sanitization later.
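A whitelist like that is easy to express with a set and a generator; a minimal sketch (the keep-set mirrors the characters listed above, which notably omit the pipe, presumably because it is the record delimiter):

```python
import string

# Mirrors the whitelist above: letters, digits, the listed punctuation,
# plus space and tab. The pipe | is deliberately not included.
ALLOWED = set(
    string.ascii_letters
    + string.digits
    + "`~!@#$%^&*()_-+=[]{};:'\"<>,./?"
    + " \t"
)

def sanitize(text):
    """Drop any character not in the whitelist."""
    return ''.join(ch for ch in text if ch in ALLOWED)
```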

Collaborator

If you are just wanting the above printable ascii the following very simple regex can work:

/^[ -~]+$/

That does not include tab or newline, but does include space. Tab and newline could be easily added as could carriage return.
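In Python terms that check might look like the following (tab added as suggested; `\A`/`\Z` are used instead of `^`/`$` because `$` would also match just before a trailing newline):

```python
import re

# Printable ASCII (space through tilde) plus tab, anchored over the
# whole string
PRINTABLE = re.compile(r'\A[ -~\t]+\Z')

def is_clean(line):
    """True if the line is entirely printable ASCII, space, or tab."""
    return bool(PRINTABLE.match(line))
```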

asdf self-assigned this 2019-12-04 00:42:58 +00:00
Author
Owner

Thanks sloum. That's perfect.

Newline and carriage returns bring up a good question. A change from the original linkulator is that replies are now one-liners, but in the original, users could open up their editor of choice and type a long reply. As long as users feel compelled to craft thoughtful, concise comments, I actually really like the short reply approach. But if it prevents people from articulating bigger ideas, then maybe long replies are another feature we need to consider bringing back.

Collaborator

It's an interesting idea; it looks like it's possible to request multiple lines, with CTRL+D acting as the terminator.

Also for some reason my linkulator.data sometimes has records terminated with a carriage return only, and sometimes with a newline. No idea what is going on there.

I've been looking at this one, have had one attempt at refactoring that failed. I'll keep at this over the next couple of days.

Author
Owner

@asdf, it looks like there is still a
on line 281 of the main linkulator file, where it writes the reply to file. I would fix this, but you may already be in that area on your checkout, so I'll leave it to you.

Collaborator

This is now implemented by #60 in `data.py`'s `wash_line()`. Tabs, newlines and other similar characters are removed by the `BAD_CHARS` regex, and escape sequences are removed by `ESCAPE_CHARS`.

It would be good to do this in one pass, but I'm not that experienced with regex. I know @sloum has been working on things like this recently, any advice?
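One-pass washing is usually just a matter of joining the two patterns with alternation. The real `BAD_CHARS` and `ESCAPE_CHARS` live in `data.py`; the patterns below are stand-ins to show the shape:

```python
import re

# Stand-in patterns -- the real ones are defined in data.py
ESCAPE_CHARS = r'\x1b[@-_][0-?]*[ -/]*[@-~]'  # ANSI escape sequences
BAD_CHARS = r'[\x00-\x1f]'                    # tabs, newlines, other controls

# Alternation: at each position the escape branch is tried first, so a
# full ANSI sequence is consumed before its ESC byte can match alone
WASH = re.compile(f'(?:{ESCAPE_CHARS})|(?:{BAD_CHARS})')

def wash_line(line):
    """Remove escape sequences and control characters in one pass."""
    return WASH.sub('', line)
```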

Author
Owner

@sloum, when you have some time, can you take a look at this and comment on whether you think it's good now or if we need some final tweaks? I *think* this might be our last issue before release -- assuming we don't come up with a replacement for the name "linkulator".

Collaborator

The validation works ok. There are likely other issues with it, and there are other improvements in file reading to be done. This can be improved at a later date if required.

asdf closed this issue 2020-01-09 04:01:59 +00:00