# UsenetSearch

Simple, sane usenet indexer & search daemon written in C++, with minimal dependencies. (Work in progress.)

## What

UsenetSearch indexes the subject of every usenet post, allowing you to search through the indexed results later. In other words, it's basically a usenet search engine.

## Why

I got interested in indexing usenet because I was disappointed with the existing options. There are horrible things like newznab, nntmux, etc., all written in PHP and seemingly depending on the entire body of software ever written by mankind in the history of the world. I actually tried to get nntmux running on my BSD server, and failed, of course. Why became immediately obvious as soon as I looked at their code: it's PHP, often shelling out to Linux-only commands. Ugh. I was so appalled I had to go and write my own thing. After all, how hard can it be? :) NNTP is a relatively simple plain-text protocol.
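
To give a feel for how simple: here's an abbreviated, hypothetical session (the server name and article numbers are made up; see RFC 3977 for the actual grammar):

    S: 200 news.example.com NNTP Service Ready
    C: GROUP comp.lang.c++
    S: 211 4567 3000234 3004800 comp.lang.c++
    C: HEAD 3004800
    S: 221 3004800 <some-id@example.com>
    S: Subject: Re: itanium compiler question
    S: .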

## How hard can it be?!

Well, the difficulty with usenet indexing is not so much in talking to the NNTP server; it's very much a problem of scale. Last I checked, there were 111,216 newsgroups on usenet, totaling 491,702,480,329 posts. That's 491 billion. According to Wikipedia, as of 2021, an additional 171 million posts are added every day. So even at, say, 100 bytes per post (which is already unrealistically small if you actually care to tokenize subjects), you'd need at least 44 TiB just to store the tokenized hashes. (The real number to index all of usenet is more likely somewhere between 400 and 500 TiB.)
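
To make the back-of-envelope arithmetic explicit, here it is as a few lines of C++, using the 2021 figures quoted above:

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        const std::uint64_t posts = 491702480329ULL; // ~491 billion posts (2021)
        const std::uint64_t bytes_per_post = 100;    // optimistic lower bound
        const double total = static_cast<double>(posts) * bytes_per_post;
        std::printf("%.1f TiB\n", total / (1ULL << 40)); // prints: 44.7 TiB
    }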

Well, now we've got an interesting problem on our hands! How do we search through that much data quickly and efficiently? How do we index as fast as we possibly can? How can we optimize this as much as possible without compromising search result quality? These are all really really fun problems to solve, which is why I've been working on this.
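
For a taste of the core idea, here is a minimal sketch of tokenizing a subject line and hashing each token. Everything in it is illustrative: the whitespace tokenizer is naive, and SHA-256 is assumed only because OpenSSL is already a dependency. It is not UsenetSearch's actual tokenizer, hash choice, or on-disk format.

    // Illustration only -- not UsenetSearch's actual scheme.
    // Build (assumed): g++ sketch.cpp -lcrypto
    #include <openssl/sha.h>
    #include <cstdio>
    #include <sstream>
    #include <string>

    int main()
    {
        std::istringstream subject("Re: itanium compiler question");
        std::string token;
        while (subject >> token) // naive whitespace tokenizer
        {
            unsigned char md[SHA256_DIGEST_LENGTH];
            SHA256(reinterpret_cast<const unsigned char*>(token.data()),
                   token.size(), md);
            // A fixed-size digest per token (32 bytes here) is what makes
            // the storage arithmetic above predictable.
            std::printf("%-8s -> %02x%02x%02x%02x...\n",
                        token.c_str(), md[0], md[1], md[2], md[3]);
        }
    }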

## Goals

I wanted to solve these problems (or at least try to), but I didn't want to make another newznab. I wanted to do this with as few dependencies as I could possibly get away with. And indeed, at the time of writing, the only thing UsenetSearch links against is OpenSSL. No other libraries needed, no database setup needed. You just need a modern C++ compiler and OpenSSL. That's it.

## Building

Should be pretty standard:

    mkdir build && cd build
    cmake ..
    make

## Configuring

There's an example configuration file in the root of the project; copy it and edit it with your NNTP server connection details and whatever else you want to change. I did my best to document it as well as I could.
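
For example, assuming you want the config in /etc (any path works; you pass it with -c as shown below):

    cp usenetsearch.example.conf /etc/usenetsearch.conf
    $EDITOR /etc/usenetsearch.conf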

## Running

To index, run:

    ./usenetindexd -c /path/to/config 

To search, run:

    ./usenetfind -c /path/to/config -s 'some search string' -n maxresults

For example:

    ./usenetfind -c /etc/usenetsearch.conf -s 'itanium compiler' -n 5

There's also a dbdump executable that lets you dump the .db files in the database directory. All of the executables accept -h for help.

## ToDo

At the time of writing this README, you can use it to index and to search. But there's a lot of work left to do:

- Automatically reconnect to the NNTP server when disconnected.
- Implement an HTTP server with a REST API compatible with other existing tools, so you can point a web frontend at it and get search results.
- Implement word stemming.
- Implement a scheduler; make it a proper daemon that forks and continuously indexes in the background.
- Implement support for indexing over multiple connections simultaneously.
- More configurable logging.
- Implement database repair functionality/tools.
- Still have to compile this on platforms other than Linux and make sure it works.
- Further optimize storage efficiency. I did my best, and further optimizations will probably hurt search speed, but I'm sure there's more that can be done that I haven't thought of yet.