Allow alternate /home paths #25

Closed
opened 2019-11-26 20:11:49 +00:00 by cmccabe · 17 comments
Owner

Not all systems keep user home directories in /home/username.

SDF/MA keeps them in /meta/[a-z]/username
SDF/Arpa keeps them in /sdf/arpa/[a-z][a-z]/[a-z]/username

We're already using glob() to aggregate all linkulator.data files, and we should be able to feed it a slightly more general pattern (configurable by the user) to catch the files in places other than /home/username/.linkulator.

cmccabe added the
this release
label 2019-11-26 20:11:49 +00:00
Collaborator

In each of these cases, users on a server would not need to have more than one setting, correct?

For example, the two systems mentioned are both standalone systems. One server has this configured as /meta/*/*/.linkulator/ and the other as /sdf/arpa/*/*/*/*/.linkulator/ and that is all that is required?

If this is the case, this could be a variable configurable by whoever performs the installation. Documentation could support this action. It also means that there's no need for per-user configuration for this setting.

Is that all correct?
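
To make the idea concrete, here is a minimal sketch of a single installation-wide setting. The names `HOME_GLOB` and `find_data_files` are hypothetical and are not taken from linkulator's code; only the `.linkulator/linkulator.data` layout comes from the discussion above.

```python
import glob
import os

# Hypothetical setting: one glob pattern per installation, chosen by the admin.
# For example "/home/*" on a conventional system or "/meta/*/*" on SDF/MA.
HOME_GLOB = "/home/*"


def find_data_files(home_glob=HOME_GLOB):
    """Return every linkulator.data file under the matching home directories."""
    pattern = os.path.join(home_glob, ".linkulator", "linkulator.data")
    return sorted(glob.glob(pattern))
```

With this shape, users never need their own setting: whoever installs the program edits one value.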

Author
Owner

Yes, those are correct. I realized we have one more step to account for though, and that is extracting the username from the filepath. We do it currently in this line:
file_owner = filename.split("/")[2]

I think we can just change it to the following to accommodate any directory depth:
file_owner = filename.split("/")[-1]

Are there any edge cases where this would break down?

Collaborator

The following will get a list of users and their home directories by looking at the passwd database:

import pwd

# pwd.getpwall() returns one entry per account in the passwd database.
for entry in pwd.getpwall():
    print("Username: " + entry.pw_name + " directory: " + entry.pw_dir)

Need to do some investigation on portability, but works on my computer and rawtext club.

Edit: It is available on all Unix versions: https://docs.python.org/3.8/library/pwd.html
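
Building on the snippet above, this is a rough sketch of how the passwd database could be used to locate data files without any path configuration. The function name `data_files_from_passwd` and its `entries` parameter are assumptions made for illustration, and it has not been tested on the systems discussed here.

```python
import pwd
from pathlib import Path


def data_files_from_passwd(entries=None):
    """Map usernames to their linkulator.data file, for accounts that have one."""
    if entries is None:
        entries = pwd.getpwall()
    found = {}
    for entry in entries:
        candidate = Path(entry.pw_dir) / ".linkulator" / "linkulator.data"
        try:
            if candidate.is_file():
                found[entry.pw_name] = candidate
        except OSError:
            # Unreadable home directories are skipped rather than aborting.
            continue
    return found
```

Calling `data_files_from_passwd()` with no arguments reads the real passwd database. The cost is one filesystem check per account, whether or not that account uses linkulator.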

Author
Owner

I think it may be more specific to stick with the [-1] approach because, rather than looking for a list of all usernames on the system, we are trying to extract usernames of linkulator users from the file globbing function.

linkulator_files = glob.glob("/home/*/.linkulator/linkulator.data")
---this one has a full filepath of all linkulator.data files, including usernames.

Later, we do this to extract the usernames associated with each linkulator.data file:

for filename in linkulator_files:
    file_owner = filename.split("/")[2]

But since username will only be element [2] in a home dir scheme like /home/username, we need to generalize it more. I think username will always be one field left of the right side of the split string, so [-1]. Or is it [-2] or even [-3] because of the linkulator directory name and the linkulator.data file name?

Author
Owner

asdf, take a look at #27. I think I got it working there. It looks like [-3] was the right spacing from the right side of the split() string.
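
A small sketch of why [-3] is the right index regardless of directory depth. The helper name `owner_from_path` and the username `alice` are made up for illustration; the path layouts are the ones mentioned earlier in this issue.

```python
def owner_from_path(filename):
    """Return the home directory name that owns a linkulator.data file."""
    # ".../username/.linkulator/linkulator.data" splits so that
    # [-1] is "linkulator.data", [-2] is ".linkulator", [-3] is the username.
    return filename.split("/")[-3]


paths = [
    "/home/alice/.linkulator/linkulator.data",
    "/meta/a/alice/.linkulator/linkulator.data",
    "/sdf/arpa/tz/a/alice/.linkulator/linkulator.data",
]
owners = [owner_from_path(p) for p in paths]
```

Counting from the right works because the tail of the path is fixed, while the number of leading directories varies by system.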

Collaborator

It is more specific, and probably faster, but requires the administrator make a configuration change to support differing home path conventions. My suggestion would (in theory) work without any special configuration, but iterating over each home directory to check if the data file exists might be slower. I get the impression glob is very efficient.

I'm happy to go with your approach though, and it looks fine as it is. My only suggestions are:

  1. We specify the paths in the config module and call them from there. This will allow easy configuration for an administrator.
  2. Instead of using split, there are methods in Path and PurePath designed for path operations like this. They both work the same way though, so it's not really that big of a deal.

I'll do a PR with these proposals to explain it better.

Edit: See #28 for my proposal
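
For reference, a minimal sketch of what point 2 could look like with `pathlib`. The helper name `owner_from_purepath` is hypothetical, and this is not necessarily the code proposed in #28.

```python
from pathlib import PurePosixPath


def owner_from_purepath(filename):
    """Return the name of the directory that contains .linkulator/."""
    # parents[0] is ".../username/.linkulator"; parents[1] is ".../username".
    return PurePosixPath(filename).parents[1].name
```

`PurePosixPath` does no filesystem access, so it is a drop-in replacement for the string split: both give the same result on the paths discussed here.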

Author
Owner

Good points. I had not thought about the outcome of an admin needing to change the configuration. But ok, let's do as you suggested with points 1 and 2.

Collaborator

This should now be complete. Home directory path can be amended in config.py and the process is documented.

This might not be the normal way to handle customisation, but it should be usable for now.

Let me know if there are any other issues!

asdf self-assigned this 2019-12-02 01:37:57 +00:00
Author
Owner

Re-opening this issue for one specific topic.

I looked at the home directory structure on grex.org and I'm wondering if even our generalized approach will work with it. Grex's home directories are in a format like this: /[a-z]/[a-z]/username ...where the first letter is on the same level as the other root-level directories.

It looks like this: ls -l /

drwxr-xr-x 37 root wheel 512 May 31 2017 a
drwxr-xr-x 2 root wheel 512 May 13 2017 afs
drwxr-xr-x 2 root wheel 512 Mar 24 2018 altroot
lrwxr-xr-x 1 root wheel 4 May 31 2017 b -> /u/b
drwxr-xr-x 2 root wheel 1024 Mar 24 2018 bin
-rw-r--r-- 1 root wheel 82784 Apr 19 2018 boot
-rwx------ 1 root wheel 12167368 Oct 13 13:37 bsd
-rwx------ 1 root wheel 12176620 Oct 4 23:00 bsd.booted
-rw-r--r-- 1 root wheel 12263770 Apr 19 2018 bsd.mp
-rw-r--r-- 1 root wheel 8874425 Apr 19 2018 bsd.rd
drwxr-xr-x 35 root wheel 512 May 31 2017 c
drwxr-xr-x 20 root wheel 512 May 24 2017 cyberspace
lrwxr-xr-x 1 root wheel 4 May 31 2017 d -> /u/d
drwxr-xr-x 3 root wheel 42496 Oct 13 13:36 dev
lrwxr-xr-x 1 root wheel 4 May 31 2017 e -> /u/e
drwxr-xr-x 82 root wheel 4608 Dec 4 15:34 etc
lrwxr-xr-x 1 root wheel 4 May 31 2017 f -> /u/f
lrwxr-xr-x 1 root wheel 4 May 31 2017 g -> /u/g
lrwxr-xr-x 1 root wheel 4 May 31 2017 h -> /u/h
drwxr-xr-x 2 root wheel 512 Mar 24 2018 home
lrwxr-xr-x 1 root wheel 4 May 31 2017 i -> /u/i
lrwxr-xr-x 1 root wheel 4 May 31 2017 j -> /u/j
lrwxr-xr-x 1 root wheel 4 May 31 2017 k -> /u/k
lrwxr-xr-x 1 root wheel 4 May 31 2017 l -> /u/l
lrwxr-xr-x 1 root wheel 4 May 31 2017 m -> /u/m
drwxr-xr-x 11 root wheel 512 Mar 24 2018 mnt
lrwxr-xr-x 1 root wheel 4 May 31 2017 n -> /u/n
lrwxr-xr-x 1 root wheel 4 May 31 2017 o -> /u/o
lrwxr-xr-x 1 root wheel 4 May 31 2017 p -> /u/p
lrwxr-xr-x 1 root wheel 4 May 31 2017 q -> /u/q
lrwxr-xr-x 1 root wheel 4 May 31 2017 r -> /u/r
drwx------ 8 root wheel 512 Mar 24 2018 root
lrwxr-xr-x 1 root wheel 4 May 31 2017 s -> /u/s
drwxr-xr-x 2 root wheel 1536 Nov 18 2018 sbin
drwxr-xr-x 2 root wheel 512 Aug 17 2011 stand
drwxr-xr-x 13 root wheel 512 May 16 2017 suid
lrwxrwx--- 1 root wheel 11 Mar 24 2018 sys -> usr/src/sys
lrwxr-xr-x 1 root wheel 4 May 31 2017 t -> /u/t
drwxrwxrwt 19 root wheel 1024 Dec 5 05:53 tmp
drwxr-xr-x 32 root wheel 512 May 31 2017 u
drwxr-xr-x 19 root wheel 512 Apr 19 2018 usr
lrwxr-xr-x 1 root wheel 4 Jun 1 2017 v -> /u/v
drwxr-xr-x 46 root wheel 1024 Mar 24 2018 var
drwxr-xr-x 35 root wheel 512 Oct 15 2018 w
-rw------- 1 root wheel 8587408 Apr 18 2019 webauthd.core
drwxr-xr-x 34 root wheel 512 Oct 18 2017 x
drwxr-xr-x 33 root wheel 512 Apr 18 2019 y
drwxr-xr-x 40 root wheel 512 May 31 2017 z

...so maybe our generalized approach will still work, using /*/*/*/.linkulator/ as the glob path. But I'm wondering whether this will slow it down significantly, because it adds a ton of potential directories to the search path.

I'm not actually sure there is a solution here. Maybe we should just test it out?
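
One way to get a feel for the difference before running it on a real system is to count how many directories each pattern expands to. The sketch below builds a throwaway tree in a temporary directory. The directory names are invented stand-ins for a root filesystem, and `count_candidates` is a hypothetical helper.

```python
import glob
import os
import tempfile


def count_candidates(root, home_glob):
    """Count the directories a home-directory glob expands to under root."""
    return len(glob.glob(os.path.join(root, home_glob.lstrip("/"))))


with tempfile.TemporaryDirectory() as root:
    # One real home directory plus a few system directories at the same depth.
    for rel in ("c/m/cmccabe", "usr/lib/python3", "usr/share/man",
                "etc/ssh/keys", "var/log/old"):
        os.makedirs(os.path.join(root, rel))
    broad = count_candidates(root, "/*/*/*")
    narrow = count_candidates(root, "/[a-z]/[a-z]/*")
```

Here the broad pattern matches every third-level directory, while the bracketed pattern only descends into single-letter directories. On a real system the gap would be far larger.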

cmccabe reopened this issue 2019-12-05 12:30:05 +00:00
Author
Owner

Ok, I tested on grex using /*/*/*/ as the home dir path, and it totally choked. The asterisks in the first two slots mean that it is searching through a ton of unnecessary and huge directories, so I also tried /[a-z]/[a-z]/* as the path, and it takes about a minute and a half to run:

grex$ time ./linkulator 
 ----------
 LINKULATOR
 ----------

 ID#  Category                 
   1  pubnixes (1)

Enter category ID or q to quit: q

Thank you for linkulating.  Goodbye.

real    1m25.894s
user    0m0.660s
sys     0m2.700s

It's not quite as bad on SDF, but still unacceptably slow (more than 20 seconds):

@sdf $ time ./linkulator 
 ----------
 LINKULATOR
 ----------

 ID#  Category                 
   1  pubnixes (1)

Enter category ID or q to quit: q

Thank you for linkulating.  Goodbye.

real    0m21.273s
user    0m0.651s
sys     0m1.868s

So unless there is a way to optimize linkulator's search of home directory paths on larger systems, we may just need to accept that it is designed for tiny systems.

But... this gives me an idea for a future enhancement. Coming soon as a new Issue. (See issue #45)

Author
Owner

I tested on tilde.team which has 400+ users in the standard /home/username directory structure; on tilde.town which has over 2000 users; and on tilde.club which has about 1800 users. In each of these cases, traversing the /home dir tree was super fast. Of course, this does not mean it would be fast if each user had linkulator.data, but it's hard to test that.

Collaborator

If the glob pattern you've specified is slow, can you try to validate how a similar operation performs in the shell? For example:

time ls /[a-z]/[a-z]/...
time cat /[a-z]/[a-z]/...

Which version of Python is python3 on each of these systems?
Also, what is the actual operating system?
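
To compare like with like, the glob call can also be timed on its own from Python, separately from the rest of the program. This is only a sketch; `time_glob` is a hypothetical helper and the pattern should be swapped for the one used on each system.

```python
import glob
import time


def time_glob(pattern):
    """Return (seconds elapsed, matches) for a single glob.glob() call."""
    start = time.perf_counter()
    matches = glob.glob(pattern)
    return time.perf_counter() - start, matches


elapsed, matches = time_glob("/home/*/.linkulator/linkulator.data")
print(f"{len(matches)} file(s) in {elapsed:.3f}s")
```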

Author
Owner

Good questions.

SDF is NetBSD 8.1 with Python 3.6.9

Grex is OpenBSD 6.3 with Python 3.6.4

tilde.team is Ubuntu 18.04 with Python 3.6.9

tilde.town is Ubuntu 19.04 with Python 3.7.3

tilde.club is Fedora 30 with Python 3.7.5

I'll time those operations as soon as I have more time.

Author
Owner

The three tildes were pretty much the same, so I just included town (the biggest) here. The results are the same or slower than in Python.

tilde.town $ time ls /home/*/.linkulator/linkulator.data
...
real	0m0.015s
user	0m0.002s
sys	0m0.013s

@sdf $ time ls /sdf/arpa/*/*/*/.linkulator/linkulator.data
...
real    0m20.093s
user    0m0.031s
sys     0m1.318s

grex$ time ls /[a-z]/[a-z]/*/.linkulator/linkulator.data
...
real    2m12.662s
user    0m0.050s
sys     0m2.310s
Collaborator

OK, seems to me that the performance is directly related to globbing.

I wonder if there are other options, like parallel processing, or if we are just at a real system limit.

Author
Owner

That would be an interesting challenge. There is a multiprocessing module for Python: https://docs.python.org/3.8/library/multiprocessing.html Maybe we should leave this for future consideration, though, since our initial usage target is just these small systems.

But also, although SDF has 8 CPUs, Grex has only 1. So it looks like we are at our true system limit on Grex.

So I'll close this one out for now, and we can open another one in the future if we want to tackle parallelization.
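
For whenever that future issue gets opened, here is a rough, untested-on-real-systems sketch of what parallel globbing might look like. It uses `concurrent.futures.ThreadPoolExecutor` rather than the multiprocessing module linked above, on the assumption that the work is mostly waiting on the filesystem (the timings show very little user CPU time). The function name `parallel_find` and its parameters are hypothetical, and whether this helps at all depends on the storage behind the home directories.

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_find(top_glob, rest, workers=8):
    """Expand top_glob first, then glob each resulting subtree in its own task."""
    tops = glob.glob(top_glob)
    # Escape each concrete directory so only `rest` is treated as a pattern.
    patterns = [os.path.join(glob.escape(top), rest) for top in tops]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(glob.glob, patterns)
    return sorted(path for chunk in results for path in chunk)
```

On Grex this could be called as `parallel_find("/[a-z]", "[a-z]/*/.linkulator/linkulator.data")`, which splits the search into one task per top-level letter.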

Author
Owner

Not to beat a dead horse, but I also tested using the 'find' command:

grex$ time find /[a-z]/[a-z]/*/.linkulator/linkulator.data
/c/m/cmccabe/.linkulator/linkulator.data

real    1m35.063s
user    0m0.110s
sys     0m3.070s

So that IS faster than ls, but still not nearly fast enough.
