mu/browse-slack/convert_slack.py

# Import JSON from a Slack admin export into a disk image Mu can load.
#
# Dependencies: python, wget, awk, sed, netpbm
#
# Step 1: download a Slack archive and unpack it to some directory
#
# Step 2: download user avatars to subdirectory images/ and convert them to PPM in subdirectory images/ppm/
#   grep image_72 . -r |grep -v users.json |awk '{print $3}' |sort |uniq |sed 's/?.*//' |sed 's,\\,,g' |sed 's/"//' |sed 's/",$//' > images.list
#   mkdir images
#   cd images
#   wget -i ../images.list --wait=0.1
#   # fix some lying images
#   for f in $(file *.jpg |grep PNG |sed 's/:.*//'); do mv -i $f $(echo $f |sed 's/\.jpg$/.png/'); done
#   #
#   mkdir ppm
#   for f in *.jpg; do jpegtopnm $f |pnmtoplainpnm > ppm/$(echo $f |sed 's/\.jpg$//').ppm; done
#   for f in *.png; do pngtopnm $f |pnmtoplainpnm > ppm/$(echo $f |sed 's/\.png$//').ppm; done
#
# (Depending on your OS, you may need to replace pnmtoplainpnm with `pnmtopnm -plain`. Some places also have a pnm2pnm.
# I don't understand it either.)
#
# Step 3: construct a disk image out of the archives and avatars
#   cd ..  # go back to the top-level archive directory
#   dd if=/dev/zero of=data.img count=201600  # ~100MB (201600 blocks of 512 bytes)
#   python path/to/convert_slack.py > data.out 2> data.err
#   dd if=data.out of=data.img conv=notrunc
# Currently this process yields errors for ~300 items (~70 posts and their comments)
# on the Future of Coding group (https://futureofcoding.org/community). Those items
# are dropped from the output and logged to data.err.
#
# Notes on input format:
#   Redundant 'type' field that's always 'message'. Probably an "enterprise" feature.

from sys import stderr
import json
from os import listdir
from os.path import isfile, join, basename, splitext
from urllib.parse import urlparse
import traceback

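def look_up_ppm_image(url):
    # Map an avatar URL to its converted file under images/ppm/ and return the
    # plain-PPM text, or None if no converted file exists for that avatar.
    file_root = splitext(basename(urlparse(url).path))[0]
    filename = f"images/ppm/{file_root}.ppm"
    if isfile(filename):
        with open(filename) as f:
            return f.read()
    return None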

user_idx = {}
with open('users.json') as f:
    for idx, user in enumerate(json.load(f)):
        if 'real_name' not in user:
            user['real_name'] = ''
        print(f"({json.dumps(user['id'])} \"@{user['name']}\" {json.dumps(user['real_name'])} [{look_up_ppm_image(user['profile']['image_72']) or ''}])")
        user_idx[user['id']] = idx

def by(item):
    # Return the author's index. Bot messages with a 'username' are attributed
    # to that name, which gets the next free index the first time it appears.
    if item.get('subtype') == 'bot_message' and 'username' in item:
        federated_user = item['username']
        if federated_user not in user_idx:
            user_idx[federated_user] = len(user_idx)
        return user_idx[federated_user]
    return user_idx[item['user']]

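item_idx = {}  # ts of each emitted item -> its index in the output
def parent(item):
    # Return the index of the thread's root post, or -1 for a top-level post.
    # Raises KeyError if the root post was dropped; the main loop catches that.
    if 'thread_ts' in item and item['thread_ts'] != item['ts']:
        # comment
        return item_idx[item['thread_ts']]
    else:
        return -1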

items = []
with open('channels.json') as f:
    channels = json.load(f)
for channel in channels:
    for filename in sorted(listdir(channel['name'])):
        with open(join(channel['name'], filename)) as f:
            for item in json.load(f):
                item['channel_name'] = channel['name']
                items.append(item)

# Emit items oldest first so a comment's parent index always refers to an earlier record.
idx = 0
for item in sorted(items, key=lambda item: item['ts']):
    try:
        print(f"({json.dumps(item['ts'])} {parent(item)} {json.dumps(item['channel_name'])} {by(item)} {json.dumps(item['text'])})")
        item_idx[item['ts']] = idx
        idx += 1  # only increment when actually used and no exception raised
    except KeyError:
        traceback.print_exc(file=stderr)
        stderr.write(repr(item)+'\n')
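
# A minimal sketch of the item record format emitted above, using made-up sample
# data. `format_item` is a hypothetical helper, not part of this script; it only
# mirrors the print() in the main loop so the field order
# (ts, parent index, channel, author index, text) is easy to see.
# Strings go through json.dumps, so embedded quotes arrive escaped.

```python
import json

def format_item(item, parent_idx, author_idx):
    # Hypothetical helper: builds one item record in the same shape as the
    # print() call in the converter's main loop.
    return (f"({json.dumps(item['ts'])} {parent_idx} "
            f"{json.dumps(item['channel_name'])} {author_idx} "
            f"{json.dumps(item['text'])})")

# Made-up sample data: a top-level post and a reply in its thread.
post = {'ts': '1628308800.000100', 'channel_name': 'general', 'text': 'hello'}
reply = {'ts': '1628308900.000200', 'thread_ts': '1628308800.000100',
         'channel_name': 'general', 'text': 'hi "there"'}

post_record = format_item(post, -1, 0)    # -1 marks a top-level post
reply_record = format_item(reply, 0, 1)   # 0 is the index of the post above
print(post_record)
print(reply_record)
```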