diff --git a/README.md b/README.md
new file mode 100644
index 0000000..494d5ba
--- /dev/null
+++ b/README.md
@@ -0,0 +1,94 @@
+# UsenetSearch
+
+## What
+
+UsenetSearch will index the subject of every usenet post, allowing you to
+later search through the indexed results. In other words, it's basically
+a usenet search engine.
+
+## Why
+
+I got interested in indexing usenet because I was disappointed with the
+existing options. There are horrible things like newznab, nntmux, etc.,
+all written in PHP, that seemingly depend on the entire body of software
+ever written by mankind in the history of the world. I actually tried
+to get nntmux running on my BSD server and failed, of course. Why became
+immediately obvious as soon as I looked at their code. It's PHP code,
+often shelling out to Linux-only shell commands. Ugh. I was so appalled
+I had to go and write my own thing. After all, how hard can it be :)
+NNTP is a relatively simple plain text protocol.
+
+## How hard can it be ?!
+
+Well, the difficulty with usenet indexing is not so much in talking to
+the NNTP server. It's very much a problem of scale. Last I checked,
+there were 111216 newsgroups on usenet, totalling 491702480329
+posts. That's 491 billion. According to Wikipedia, as of 2021, an
+additional 171 million posts are added every day. So even if you used, say,
+100 bytes per post (which is already unrealistically small if
+you actually care to tokenize subjects), that'd mean you'd need at least
+44 TiB (about 49 TB) just to store the tokenized hashes. (The real number
+is probably between 400 and 500 TiB to index all of usenet.)
+
+Well, now we've got an interesting problem on our hands! How do we
+search through that much data quickly and efficiently? How do we index
+as fast as we possibly can? How can we optimize this as much as possible
+without compromising search result quality? These are all really, really
+fun problems to solve, which is why I've been working on this.
+
+## Goals
+
+I wanted to solve these problems (or at least try to), but I didn't want to
+make another newznab. I wanted to do this with as few dependencies as I
+could possibly get away with. And indeed, at the time of writing, the only
+thing UsenetSearch links against is OpenSSL. No other libraries needed. No
+database setup needed. You just need a modern C++ compiler and OpenSSL.
+That's it.
+
+## Building
+
+Should be pretty standard:
+
+    mkdir build && cd build
+    cmake ..
+    make
+
+## Configuring
+
+There's an example configuration file in the root project folder; copy that and
+edit it with your NNTP server connection details and whatever else you want to
+change. I did my best to document it as well as I could.
+
+## Running
+
+To index, run:
+    ./usenetindexd -c /path/to/config
+
+To search, run:
+    ./usenetfind -c /path/to/config -s 'some search string' -n maxresults
+for example:
+    ./usenetfind -c /etc/usenetsearch.conf -s 'itanium compiler' -n 5
+
+There's also a dbdump executable that lets you dump the .db files in the
+database directory. All the executables allow you to pass -h to get help.
+
+## ToDo
+
+At the time of writing this readme, you can use it to index and search. But
+there's a lot of work left to do:
+
+* Automatically reconnect to the NNTP server when disconnected.
+* Implement an HTTP server with a REST API compatible with other existing tools
+  so you can point a web frontend at it and get search results.
+* Implement word stemming.
+* Implement a scheduler and make it a proper daemon that forks and continuously
+  indexes in the background.
+* Implement support for indexing over multiple connections simultaneously.
+* More configurable logging.
+* Implement database repair functionality/tools.
+* Still have to compile this on platforms other than Linux and make sure it
+  works.
+* Further optimize storage efficiency; I did my best. Further optimizations
+  are probably going to hurt search speed. But I'm sure there's more that can
+  be done that I haven't thought of yet.
+
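
The README's "100 bytes per post" storage estimate can be reproduced with a few lines of arithmetic. The following is only a back-of-the-envelope sketch, not code from UsenetSearch: the constant and function names (`kTotalPosts`, `kPostsPerDay`, `kBytesPerPost`, `indexBytes`, `toTiB`) are made up for illustration, and the input figures are simply the ones quoted in the README.

```cpp
#include <cstdint>

// Figures quoted in the README. The 100 bytes per post is the README's own
// illustrative lower bound, not a measured value.
constexpr std::uint64_t kTotalPosts   = 491702480329ULL; // posts on usenet
constexpr std::uint64_t kPostsPerDay  = 171000000ULL;    // new posts per day (2021)
constexpr std::uint64_t kBytesPerPost = 100ULL;          // assumed index cost per post

constexpr std::uint64_t kBytesPerTiB = 1ULL << 40;       // 2^40 bytes

// Bytes needed to index `posts` posts at `bytesPerPost` bytes each.
constexpr std::uint64_t indexBytes(std::uint64_t posts, std::uint64_t bytesPerPost)
{
    return posts * bytesPerPost;
}

// Convert a byte count to tebibytes (TiB).
constexpr double toTiB(std::uint64_t bytes)
{
    return static_cast<double>(bytes) / static_cast<double>(kBytesPerTiB);
}
```

With these inputs, `indexBytes(kTotalPosts, kBytesPerPost)` is 49170248032900 bytes, which `toTiB` turns into roughly 44.7 TiB, and `indexBytes(kPostsPerDay, kBytesPerPost)` is 17100000000 bytes, so the index would grow by about 17.1 GB per day at the same per-post cost.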