Add capsules_to_ignore list and functionality

The capsules_to_ignore.txt list keeps capsules that serve the same content from being placed in the orbit more than once. We had an issue with LEO where cd.ax would send you to gemini.is, which has the same content as cd.ax, so when you tried to move on you were inadvertently sent to gemini.is again. To prevent this, I'll try to traverse the orbit manually and find these automatic loops, and when I find one, I'll add the responsible capsule to capsules_to_ignore.txt.
Robert Miles 2020-11-18 00:27:31 +00:00
parent 7c33a822dc
commit d6fd5e5400
2 changed files with 10 additions and 0 deletions

capsules_to_ignore.txt Normal file

@@ -0,0 +1,3 @@
# cd.ax is available at 3 separate domains. don't include them all
gemini.is
jw.rs


@@ -134,6 +134,13 @@ def grab_content(url,redirect_num=0):
return header.decode("utf-8"), "text/plain"
CAPSULES_IN_ORBIT = set(determine_capsule(urllib.parse.urlparse(url)) for url in URLS)
try:
    with open("capsules_to_ignore.txt") as f:
        for l in f:
            l=l.strip()
            if l and not l.startswith("#"):
                CAPSULES_IN_ORBIT.add(l)
except:
    pass  # assumed no-op; the handler body does not appear in this view
modified_orbit = False
backlinks_url = "?".join([BACKLINKS, urllib.parse.quote(MAIN_PAGE)])
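
The effect of the change can be sketched in isolation. This is a minimal illustration, not the script itself: it assumes that a capsule already present in CAPSULES_IN_ORBIT is skipped when new capsules are considered, which the diff implies but does not show. The helper names load_ignored_capsules and admit_capsules, and the candidate example.org, are hypothetical.

```python
import io


def load_ignored_capsules(lines):
    """Collect capsule names from ignore-file lines, skipping blanks and # comments."""
    ignored = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            ignored.add(line)
    return ignored


def admit_capsules(candidates, capsules_in_orbit):
    """Return the candidates not yet in the set, recording each one as it is admitted.

    Assumption: this mirrors how the real script treats CAPSULES_IN_ORBIT.
    """
    admitted = []
    for capsule in candidates:
        if capsule not in capsules_in_orbit:
            capsules_in_orbit.add(capsule)
            admitted.append(capsule)
    return admitted


# Same contents as the capsules_to_ignore.txt added in this commit.
ignore_file = io.StringIO(
    "# cd.ax is available at 3 separate domains. don't include them all\n"
    "gemini.is\n"
    "jw.rs\n"
)

# cd.ax is already in the orbit; its mirrors are seeded as if they were too.
capsules_in_orbit = {"cd.ax"}
capsules_in_orbit |= load_ignored_capsules(ignore_file)

# example.org is a made-up candidate used only for illustration.
new_capsules = admit_capsules(["gemini.is", "example.org", "jw.rs"], capsules_in_orbit)
print(new_capsules)  # ['example.org']
```

Seeding the existing set, rather than keeping a separate ignore list, means any check that already consults CAPSULES_IN_ORBIT rejects the mirrors without further changes.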