# UsenetSearch
## What
UsenetSearch will index the subject of every usenet post, allowing you to
later search through the indexed results. In other words, it's basically
a usenet search engine.
## Why
I got interested in indexing usenet because I was disappointed with the
existing options. There are horrible things like newznab, nntmux, etc.,
all written in PHP, and they seem to depend on the entire body of software
ever written by mankind in the history of the world. I actually tried
to get nntmux running on my BSD server, and of course I failed. Why became
immediately obvious as soon as I looked at their code. It's PHP code,
often shelling out to Linux-only shell commands. Ugh. I was so appalled
I had to go and write my own thing. After all, how hard can it be :)
NNTP is a relatively simple plain-text protocol.
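
To give a feel for just how simple, here's an illustrative exchange using
commands from RFC 3977 (the group name and article numbers are made up,
and the tab-separated overview fields are abbreviated):

```
[C] GROUP comp.lang.c++
[S] 211 1234 3000 4233 comp.lang.c++
[C] OVER 3000-3002
[S] 224 Overview information follows
[S] 3000   Re: some subject   <author>   <date>   ...
[S] .
[C] QUIT
[S] 205 Connection closing
```

The second field of each overview line is the subject, which is all an
indexer like this one really needs.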
## How hard can it be?!
Well, the difficulty with usenet indexing is not so much in talking to
the NNTP server. It's very much a problem of scale. Last I checked,
there were 111216 newsgroups on usenet totalling 491702480329
posts. That's 491 billion. According to Wikipedia, as of 2021, an
additional 171 million posts are added every day. So even if you used,
say, 100 bytes per post (which is already unrealistically small if
you actually care to tokenize subjects), you'd need at least
44 TiB just to store the tokenized hashes. (The real number is
probably somewhere between 400 and 500 TiB to index all of usenet.)
Well, now we've got an interesting problem on our hands! How do we
search through that much data quickly and efficiently? How do we index
as fast as we possibly can? How can we optimize this as much as possible
without compromising search result quality? These are all really,
really fun problems to solve, which is why I've been working on this.
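
For the curious, here's a minimal sketch of what "tokenizing subjects into
hashes" could look like. The hash choice (FNV-1a) and the function names
are mine for illustration; they aren't necessarily what UsenetSearch
actually does internally:

```cpp
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// 64-bit FNV-1a hash of a string (illustrative choice of hash).
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Split a subject line into lowercase words and hash each one.
std::vector<std::uint64_t> tokenize_subject(const std::string& subject) {
    std::istringstream in(subject);
    std::vector<std::uint64_t> hashes;
    std::string word;
    while (in >> word) {
        for (char& c : word)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        hashes.push_back(fnv1a(word));
    }
    return hashes;
}
```

Storing fixed-size hashes instead of the words themselves gives you
constant-size index keys, which is what makes lookups at this scale
tractable at all.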
## Goals
I wanted to solve these problems (or at least try to), but I didn't want to
make another newznab. I wanted to do this with as few dependencies as I
could possibly get away with. And indeed, at the time of writing, the only
thing UsenetSearch links against is OpenSSL. No other libraries needed. No
database setup needed. You just need a modern C++ compiler and OpenSSL.
That's it.
## Building
Should be pretty standard:

```
mkdir build && cd build
cmake ..
make
```
## Configuring
There's an example configuration file in the root project folder. Copy that
and edit it with your NNTP server connection details and whatever else you
want to change. I did my best to document it as well as I could.
## Running
To index, run:

```
./usenetindexd -c /path/to/config
```
To search, run:

```
./usenetfind -c /path/to/config -s 'some search string' -n maxresults
```
For example:

```
./usenetfind -c /etc/usenetsearch.conf -s 'itanium compiler' -n 5
```
There's also a `dbdump` executable that lets you dump the `.db` files in the
database directory. All the executables allow you to pass `-h` to get help.
## ToDo
At the time of writing this readme, you can use it to index and search. But
there's a lot of work left to do:

* Automatically reconnect to the NNTP server when disconnected.
* Implement an HTTP server with a REST API compatible with other existing
  tools, so you can point a web frontend at it and get search results.
* Implement word stemming.
* Implement a scheduler; make it a proper daemon that forks and continuously
  indexes in the background.
* Implement support for indexing over multiple connections simultaneously.
* More configurable logging.
* Implement database repair functionality/tools.
* Still have to compile this on platforms other than Linux and make sure it
  works.
* Further optimize storage efficiency. I did my best, and further
  optimizations are probably going to hurt search speed, but I'm sure there's
  more that can be done that I haven't thought of yet.