It's readme time

John Sennesael 2021-10-21 09:03:02 -05:00
parent 6b5bdadeb5
commit 4d8eca466a
1 changed files with 94 additions and 0 deletions

# UsenetSearch
## What
UsenetSearch will index the subject of every usenet post, allowing you to
later search through the indexed results. In other words, it's basically
a usenet search engine.
## Why
I got interested in indexing usenet because I was disappointed with the
existing options. There are horrible things like newznab, nntmux, etc.,
all written in PHP and seemingly dependent on the entire body of software
ever written by mankind in the history of the world. I actually tried
to get nntmux running on my BSD server and failed, of course. Why became
immediately obvious as soon as I looked at their code: it's PHP,
often shelling out to Linux-only shell commands. Ugh. I was so appalled
I had to go and write my own thing. After all, how hard can it be :)
NNTP is a relatively simple plain-text protocol.
## How hard can it be ?!
Well, the difficulty with usenet indexing is not so much in talking to
the NNTP server. It's very much a problem of scale. Last I checked,
there were 111216 newsgroups on usenet, totalling 491702480329
posts. That's roughly 491 billion. According to Wikipedia, as of 2021, an
additional 171 million posts are added every day. So even if you used, say,
100 bytes per post (which is already unrealistically small if
you care to tokenize subjects), you'd need at least
44 TiB (about 49 TB) just to store the tokenized hashes. The real number
for indexing all of usenet is probably somewhere between 400 and 500 TiB.
Well, now we've got an interesting problem on our hands! How do we
search through that much data quickly and efficiently? How do we index
as fast as we possibly can? How can we optimize this as much as possible
without compromising search result quality? These are all really
fun problems to solve, which is why I've been working on this.
## Goals
I wanted to solve these problems (or at least try to), but I didn't want to
make another newznab. I wanted to do this with as few dependencies as I
could possibly get away with. And indeed, at the time of writing, the only
thing UsenetSearch links against is OpenSSL. No other libraries needed. No
database setup needed. You just need a modern C++ compiler and OpenSSL.
That's it.
## Building
Should be pretty standard:

    mkdir build && cd build
    cmake ..
    make

## Configuring
There's an example configuration file in the root project folder. Copy that and
edit it with your NNTP server connection details and whatever else you want to
change. I did my best to document it as well as I could.
## Running
To index, run:

    ./usenetindexd -c /path/to/config

To search, run:

    ./usenetfind -c /path/to/config -s 'some search string' -n maxresults

For example:

    ./usenetfind -c /etc/usenetsearch.conf -s 'itanium compiler' -n 5

There's also a dbdump executable that lets you dump the .db files in the
database directory. All the executables allow you to pass -h to get help.
## ToDo
At the time of writing this readme, you can use it to index and search. But
there's a lot of work left to do:

* Automatically reconnect to the NNTP server when disconnected.
* Implement an HTTP server with a REST API compatible with other existing tools
  so you can point a web frontend at it and get search results.
* Implement word stemming.
* Implement a scheduler and make it a proper daemon that forks and continuously
  indexes in the background.
* Implement support for indexing over multiple connections simultaneously.
* More configurable logging.
* Implement database repair functionality/tools.
* Still have to compile this on platforms other than Linux and make sure it
  works.
* Further optimize storage efficiency. I did my best, and further optimizations
  are probably going to hurt search speed, but I'm sure there's more that can
  be done that I haven't thought of yet.