# UsenetSearch

## What

UsenetSearch will index the subject of every usenet post, allowing you to
later search through the indexed results. In other words, it's basically
a usenet search engine.

## Why

I got interested in indexing usenet because I was disappointed with the
existing options. There are horrible things like newznab, nntmux, etc.,
all written in PHP, which seemingly depend on the entire body of software
ever written by mankind in the history of the world. I actually tried to
get nntmux running on my BSD server, and failed, of course. Why became
immediately obvious as soon as I looked at their code: it's PHP code,
often shelling out to Linux-only shell commands. Ugh. I was so appalled
I had to go and write my own thing. After all, how hard can it be :)
NNTP is a relatively simple plain-text protocol.

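Since NNTP really is that simple, the subjects an indexer needs arrive as
tab-separated overview lines (the XOVER/OVER response format). Here's a
minimal Python sketch of pulling the subject out of one such line; this is
purely illustrative, not UsenetSearch's actual parser:

```python
# Sketch: parsing one line of an NNTP XOVER/OVER response.
# Fields are tab-separated, in the standard overview order:
# article number, subject, from, date, message-id, references, bytes, lines.

def parse_xover_line(line: str) -> dict:
    fields = line.rstrip("\r\n").split("\t")
    keys = ["number", "subject", "from", "date",
            "message_id", "references", "bytes", "lines"]
    return dict(zip(keys, fields))

# A made-up overview line, as a server might return it:
sample = ("12345\tRe: itanium compiler\tbob@example.com\t"
          "01 Jan 2021 00:00:00 GMT\t<abc@news>\t\t2048\t17")
overview = parse_xover_line(sample)
print(overview["subject"])  # prints "Re: itanium compiler"
```

The subject field is the only one an indexer like this cares about; the rest
can be discarded on the spot.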
## How hard can it be ?!

Well, the difficulty with usenet indexing is not so much in talking to
the NNTP server. It's very much a problem of scale. Last I checked,
there were 111216 newsgroups on usenet totalling 491702480329 posts.
That's 491 billion. According to Wikipedia, as of 2021, an additional
171 million posts are added every day. So even if you used, say, 100
bytes per post (which is actually already unrealistically small if you
actually care to tokenize subjects), that'd mean you'd need at least
44 TiB - just to store the tokenized hashes. (The real number is
probably more likely between 400 and 500 TiB to index all of usenet.)

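The arithmetic above is easy to sanity-check. A quick sketch, where the
100-bytes-per-post figure is the (admittedly optimistic) assumption from
the text:

```python
# Back-of-envelope check of the storage estimate above.
posts = 491_702_480_329          # total usenet posts, per the text
bytes_per_post = 100             # assumed tokenized-hash footprint
total_bytes = posts * bytes_per_post

tib = total_bytes / 2**40        # tebibytes
print(f"{tib:.1f} TiB")          # roughly 44.7 TiB, just for the hashes

# Daily growth at 171 million new posts/day, same footprint:
daily_gib = 171_000_000 * bytes_per_post / 2**30
print(f"{daily_gib:.1f} GiB/day")
```

So even the optimistic estimate grows by double-digit GiB every day, which
is why storage efficiency dominates the design.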
Well, now we've got an interesting problem on our hands! How do we
search through that much data quickly and efficiently? How do we index
as fast as we possibly can? How can we optimize this as much as possible
without compromising search result quality? These are all really, really
fun problems to solve, which is why I've been working on this.

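For a sense of what "searching that much data" involves: the textbook
structure for this kind of subject search is an inverted index mapping
tokens to posting lists. A toy in-memory sketch follows; UsenetSearch's
actual on-disk hashed-token format is its own design, and this dict would
obviously never scale to 491 billion posts:

```python
# Toy inverted index over post subjects: token -> set of article numbers.
from collections import defaultdict

index: dict[str, set[int]] = defaultdict(set)

def add_post(number: int, subject: str) -> None:
    for token in subject.lower().split():
        index[token].add(number)

def search(query: str) -> set[int]:
    # Intersect posting lists: keep posts containing every query token.
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = index[tokens[0]].copy()
    for token in tokens[1:]:
        result &= index[token]
    return result

add_post(1, "Re: itanium compiler flags")
add_post(2, "itanium servers for sale")
add_post(3, "gcc compiler questions")
print(search("itanium compiler"))  # prints {1}: only post 1 has both tokens
```

Indexing is one set-insert per token; a multi-word search is one set
intersection per extra token, which is what makes the structure fast.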
## Goals

I wanted to solve these problems (or at least try to), but I didn't want
to make another newznab. I wanted to do this with as few dependencies as
I could possibly get away with. And indeed, at the time of writing, the
only thing UsenetSearch links against is OpenSSL. No other libraries
needed. No database setup needed. You just need a modern C++ compiler
and OpenSSL. That's it.

## Building

Should be pretty standard:

    mkdir build && cd build
    cmake ..
    make

## Configuring

There's an example configuration file in the root project folder; copy
that and edit it with your NNTP server connection details and whatever
else you want to change. I did my best to document it as well as I could.

## Running

To index, run:

    ./usenetindexd -c /path/to/config

To search, run:

    ./usenetfind -c /path/to/config -s 'some search string' -n maxresults

For example:

    ./usenetfind -c /etc/usenetsearch.conf -s 'itanium compiler' -n 5

There's also a dbdump executable that lets you dump the .db files in the
database directory. All the executables allow you to pass -h to get help.

## ToDo

At the time of writing this readme, you can use it to index and search.
But there's a lot of work left to do:

* Automatically reconnect to the NNTP server when disconnected.
* Implement an HTTP server with a REST API compatible with other existing
  tools, so you can point a web frontend at it and get search results.
* Implement word stemming.
* Implement a scheduler; make it a proper daemon that forks and
  continuously indexes in the background.
* Implement support for indexing over multiple connections simultaneously.
* More configurable logging.
* Implement database repair functionality/tools.
* Still have to compile this on platforms other than Linux and make sure
  it works.
* Further optimize storage efficiency. I did my best, but further
  optimizations are probably going to hurt search speed. Still, I'm sure
  there's more that can be done that I haven't thought of yet.