Validate ~/.linkulator/linkulator.data file contents #3
Labels
No Label
bug
compatibility
documentation
duplicate
enhancement
future release
help wanted
invalid
non-code
question
refactor
testing
this release
wontfix
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: cmccabe/linkulator2#3
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Initially, the only validation is to check that each row of data has exactly four pipes (|). Need to also validate the content in each data field. Other validation checks?
There are a lot of things we would want to remove when reading the data file. Escape sequences, control characters, newline, carriage return, terminal bell, colours, text formatting...
This was covered not long ago in Bombadillo, so perhaps @sloum can provide some help with this one.
You can see here that it gets complicated. The recommended answer seems to be:
There are some issues with just using this as-is:
/n
as a record delimiter.It can be glued together to do something like:
It doesn't seem right though, and I'm not familiar with this enough to suggest a better answer.
Yes, that is pretty complex. But I do think it is important to work through the risks and mitigation steps. My biggest fear is that linkulator could be used by a malicious user to inject destructive instructions that are run as the calling user. I think this is avoided because of how Python handles strings, but I'm not certain about this. The next concern is far smaller, that the malicious user simply injects codes that disrupt display of linkulator data.
You could try scanning to remove items. In Bombadillo the file gets read in, then rune by rune (character by character) I go through and have rules for what to do with each character type via a for loop and a switch. It writes acceptable characters to a buffer, ditches non-acceptable characters, and modifies things that need modification (we convert tabs to 4 spaces). Not sure if that style method would be good in this situation or not. Also not sure if Python's way of handling strings will be performant enough to make all of the copying and looping of string data functino how we'd need it. Does python provide something for creating buffered strings in a more performant way than
"test" + "123"
?Yeah that seems like a much better approach.
I did some research seeing how this might work in Python, but haven't found a good answer.
It's a bit confusing as Python might present reading files in a simple, seemingly-inefficient manner, but under the hood it might handle it very efficiently. Or, maybe it doesn't. This method uses yield for efficiency (with hopefully excellent unicode support), and I've read that joinging strings, or using fstrings, is also very fast.
It probably is not worth trying to optimise this too early though. The first approach might be simplistic, and we can improve upon it later.
Agreed. No need to optimize before it is needed :)
Shall I mark this 'Future Release' or do either of you think there is important progress to be made before this first release?
I think this is important for a first release to avoid someone breaking things.
Ok, keep as first release for now.
Another option for a simplistic input sanitizer is to whitelist characters for now. If we're assuming that we're just on English language servers initially, then we might be able to get away with allowing these:
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
`~!@#$%^&*()_-+=[]{};:'"<>,./?
[space] and [tab]
This can always be swapped out for a larger character set or an entirely different method of sanitization later.
If you are just wanting the above printable ascii the following very simple regex can work:
That does not include tab or newline, but does include space. Tab and newline could be easily added as could carriage return.
Thanks sloum. That's perfect.
Newline and carriage returns bring up a good question. A change from the original linkulator is that replies are now one-liners, but in the original, users could open up their editor of choice and type a long reply. As long as users feel compelled to craft thoughtful, concise comments, I actually really like the short reply approach. But if it prevents people from articulating bigger ideas, then maybe long replies are another feature we need to consider bringing back.
It's an interesting idea, it looks like it's possible to request multiple lines. CTRL+D can act as the terminator.
Also for some reason my linkulator.data sometimes has records terminated with a carriage return only, and sometimes with a newline. No idea what is going on there.
I've been looking at this one, have had one attempt at refactoring that failed. I'll keep at this over the next couple of days.
@asdf, it looks like there is still a
on line 281 of the main linkulator file; where it writes the reply to file. I would fix this, but you may already be in that area on your checkout, so I'll leave it to you.
asdf referenced this issue2019-12-14 02:51:31 +00:00
This is now implemented by #60 in
data.py
wash_line()
. tabs, newlines and other similar characters are removed by the BAD_CHARS regex, and escape sequences removed by ESCAPE_CHARS.It would be good to do this in one pass, but am not that experienced with regex. I know @sloum has been working on things like this recently, any advice?
@sloum, when you have some time, can you take a look at this and comment on whether you think it's good now or if we need some final tweaks? i think this might be our last issue before release -- assuming we don't come up with a replacement for the name "linkulator".
The validation works ok. There are likely other issues with it, and there are other improvements in file reading to be done. This can be improved at a later date if required.