It's readme time

2021-10-21 09:03:02 -05:00 · 2021-10-21 09:03:02 -05:00 · 4d8eca466a
parent 6b5bdadeb5
commit 4d8eca466a
1 changed files with 94 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,94 @@
+# UsenetSearch
+
+## What
+
+UsenetSearch will index the subject of every usenet post, allowing you to
+later search through the indexed results. In other words, it's basically
+a usenet search engine.
+
+## Why
+
+I got interested in indexing usenet because I was disappointed with the
+existing options. There are horrible things like newznab, nntmux, etc.,
+all written in PHP, and seemingly depending on the entire body of software
+ever written by mankind in the history of the world. I actually tried
+to get nntmux running on my BSD server, and failed of course. Why became
+immediately obvious as soon as I looked at the code: it's PHP that often
+shells out to Linux-only shell commands. Ugh. I was so appalled
+I had to go and write my own thing. After all, how hard can it be :)
+NNTP is a relatively simple plain-text protocol.
+
+## How hard can it be ?!
+
+Well, the difficulty with usenet indexing is not so much in talking to
+the NNTP server. It's very much a problem of scale. Last I checked,
+there were 111216 newsgroups on usenet, totalling 491702480329 posts.
+That's 491 billion. According to Wikipedia, as of 2021, an additional
+171 million posts are added every day. So even if you used, say,
+100 bytes per post (which is already unrealistically small if you
+care to tokenize subjects), you'd need 491702480329 x 100 bytes, or
+about 44.7 TiB (49 TB), just to store the tokenized hashes. (The real
+number is more likely between 400 and 500 TiB to index all of usenet.)
+
+Well, now we've got an interesting problem on our hands! How do we
+search through that much data quickly and efficiently? How do we index
+as fast as we possibly can? How can we optimize this as much as possible
+without compromising search result quality? These are all really really
+fun problems to solve, which is why I've been working on this.
+
+## Goals
+
+I wanted to solve these problems (or at least try to), but I didn't want to
+make another newznab. I wanted to do this with as few dependencies as I
+could possibly get away with. And indeed, at the time of writing, the only
+thing UsenetSearch links against is OpenSSL. No other libraries needed. No
+database setup needed. You just need a modern C++ compiler and OpenSSL.
+That's it.
+
+## Building
+
+Should be pretty standard:
+
+    mkdir build && cd build
+    cmake ..
+    make
+
+## Configuring
+
+There's an example configuration file in the root project folder. Copy it and
+edit it with your NNTP server connection details and whatever else you want to
+change. I did my best to document it as well as I could.
+
+## Running
+
+To index, run:
+    ./usenetindexd -c /path/to/config 
+
+To search, run:
+    ./usenetfind -c /path/to/config -s 'some search string' -n maxresults
+For example:
+    ./usenetfind -c /etc/usenetsearch.conf -s 'itanium compiler' -n 5
+
+There's also a dbdump executable that lets you dump the .db files in the
+database directory. All the executables allow you to pass -h to get help.
+
+## ToDo
+
+At the time of writing this readme, you can use it to index and search. But
+there's a lot of work left to do:
+
+* Automatically reconnect to the NNTP server when disconnected.
+* Implement an HTTP server with a REST API compatible with other existing tools
+  so you can point a web frontend at it and get search results.
+* Implement word stemming.
+* Implement a scheduler, make it a proper daemon that forks and continuously
+  indexes in the background.
+* Implement support for indexing over multiple connections simultaneously.
+* More configurable logging.
+* Implement database repair functionality/tools.
+* Still have to compile this on platforms other than Linux and make sure it
+  works.
+* Further optimize storage efficiency; I did my best. Further optimizations
+  are probably going to hurt search speed. But I'm sure there's more that can
+  be done that I haven't thought of yet.
+