Although I'm segueing to our last major topic, which is that if you want to have more control and transparency over what's happening, you need to have more local code and control rather than using an opaque platform such as chatGPT. That's not quite possible yet, but we're getting there.
The big thing that happened in the "local LLM install" revolution was the release of the llama model weights. (As a reminder, by weights we mean the values of all the parameters that go into the neural network. If you know the shape of the neural network and you know the final weights, you have everything you need to replicate the trained network yourself.) Facebook had trained its own large language model, meant to be a competitor to gpt3.5 and gpt4, then released the weights to researchers for study, and those researchers took to sharing them with each other more efficiently by using bittorrent to spread the files. So, very quickly, almost everyone who wanted to experiment with a gpt4-like model was messing with llama. Llama came in a variety of sizes, too: a 7 billion parameter version, a 13 billion parameter version, a 33 billion parameter version, and a 65 billion parameter version.
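(To make the "shape plus weights" idea concrete, here's a toy sketch in pytorch---nothing to do with llama specifically, just an illustration that a saved set of weights plus a matching architecture is enough to reconstruct the trained model.)

    # Toy example: the "weights" are just the numbers stored in the model's
    # parameters; the "shape" is the architecture definition in code.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    # ...imagine training happens here...
    torch.save(model.state_dict(), "weights.pt")  # share just the numbers

    # Anyone with the same architecture can load those weights and get an
    # identical model back, no training required.
    clone = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    clone.load_state_dict(torch.load("weights.pt"))
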
One of the things that happened almost immediately is that someone figured out how to run these models, albeit slowly, entirely on the CPU rather than the GPU.
https://github.com/ggerganov/llama.cpp
(As another reminder: the calculations involved in running a large neural network are many, many multiplications and additions that can be done in parallel---that is, at the same time---so it's much faster to run them on a graphics card, since that's literally the kind of math graphics cards have been designed to do quickly.) Why is this a big deal? Comparatively few people have the expensive array of multiple graphics cards that's needed to run even a 13 billion parameter model: the amount of video RAM needed to store the model will run you at least $1,500, if not much more. Running something like a 65 billion parameter model entirely on GPUs requires the kind of setup that costs a good $10k minimum. Meanwhile, buying the equivalent amount of "normal" RAM is a few hundred dollars.
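(If you want to see where those numbers come from, the arithmetic is simple: at the usual 16-bit precision each weight takes 2 bytes, so just holding the model in memory looks like this. It ignores overhead like activations, so treat it as a floor rather than an exact figure.)

    # Back-of-the-envelope memory needed just to hold the weights at 16-bit
    # precision (2 bytes per parameter). Overhead comes on top of this.
    BYTES_PER_WEIGHT = 2

    for billions in (7, 13, 65):
        gb = billions * 1e9 * BYTES_PER_WEIGHT / 1e9
        print(f"{billions}B parameters -> about {gb:.0f} GB of memory")

    # 7B  -> ~14 GB  (just barely fits on a high-end consumer GPU)
    # 13B -> ~26 GB  (more VRAM than a single 24 GB card has)
    # 65B -> ~130 GB (multiple expensive GPUs, or a lot of ordinary RAM)
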
So, yes, running llamas and llama-byproducts entirely on the CPU is definitely worth it when it comes to making these things accessible for experimenters without "I have the budget for an entire datacenter" kinda money.
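(If you'd rather poke at this from Python than build and drive the C++ tools directly, there are community bindings; here's a rough sketch using the llama-cpp-python package. The model path is a placeholder---you'd point it at whatever converted, quantized llama file you actually have on disk.)

    # Sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
    # The model path below is a placeholder for a converted/quantized llama file.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")  # runs on the CPU
    result = llm(
        "Q: Why would anyone run a language model on a CPU? A:",
        max_tokens=64,
        stop=["Q:"],
    )
    print(result["choices"][0]["text"])
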
Beyond running LLMs on the CPU, there's also an entire world opening up around quantization of LLM model weights. The idea of quantization is a little technical but I'm going to try and give the gist. So all data on a computer, fundamentally, comes down to a sequence of ones and zeroes. That is, I think, mostly common knowledge. What's maybe not as well known is that for numbers that aren't "whole" numbers---ones that can have stuff to the right of the decimal point---the amount of space you devote to each number determines how fine-grained you can make the distinctions between numbers. If you allowed an arbitrary number of digits to the right of the decimal point, storing a single number could take an infinite amount of space, so you have to choose how much precision to keep.
The trick of quantization is that you can take the weights, which were made at a higher precision, and convert them to a lower precision. By doing this you can reduce the amount of RAM/VRAM needed to run an LLM by a factor of 2, 4, or even 8. Reducing the precision does change the weights, though, since on some level you are---albeit cleverly---throwing away information after a certain number of decimal places. How much does this affect the quality of the text generated?
Can you guess what the answer is? If you said "we don't really know!" you'd be right!
I cannot stress enough that this world is not only not "science," it isn't even "engineering" yet.
But, running a 65B parameter model with the precision cleverly cut in half is probably better than not running the model at all!
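(If you want to see the basic mechanics of the trick, here's a crude sketch with numpy: squash 32-bit floats down to 8-bit integers and back, and look at how much precision gets lost. Real schemes, like the 4-bit formats llama.cpp supports, are cleverer about it, but the basic trade is the same.)

    # Crude illustration of quantization: 32-bit floats -> 8-bit integers -> back.
    import numpy as np

    weights = np.random.randn(5).astype(np.float32)

    # Map the range of the weights onto the range an 8-bit integer can hold.
    scale = np.abs(weights).max() / 127
    quantized = np.round(weights / scale).astype(np.int8)  # 1 byte each instead of 4
    restored = quantized.astype(np.float32) * scale        # what you compute with later

    print("original:", weights)
    print("restored:", restored)
    print("max error:", np.abs(weights - restored).max())
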
Beyond the original llama, though, we've also seen a bunch of llama-descendants that have come from fine-tuning llama on various examples of chat-like question and answer pairs. The first of these was alpaca, a cool project by researchers at stanford where they used chatGPT to generate the question/answer pairs for training. The problem with alpaca is that it kind of violated OpenAI's terms of service. Instead, what's really taken off is Vicuna (https://huggingface.co/lmsys, https://lmsys.org/blog/2023-03-30-vicuna/), which is an alpaca-like that does not violate OpenAI's terms of service and thus doesn't run into legal trouble for existing. I've tried using Vicuna, via the FastChat framework (https://github.com/lm-sys/FastChat), and it's a little wonky but kinda fun. Everything in this space is still very experimental and is research quality software (derogatory).
Here's a summary of how all the various GPT-likes compare to each other on non-trivial tasks:
https://medium.com/@marcotcr/exploring-chatgpt-vs-open-source-models-on-slightly-harder-tasks-aa0395c31610
Spoilers: basically nothing compares to gpt4, but on some things other models almost get to the level of gpt3.5. That's still pretty exciting, in my opinion.
So what do you do with models once you run them locally? I prefaced this whole discussion with a promise that running things locally would let you extend and use these models more transparently than gpt4 + plugins.
Enter langchain (https://python.langchain.com/en/latest/index.html). This is going to be most exciting for the programmers in our cohort, but the basic concept is that it's a framework for writing applications built on top of large language models, rather than just using an LLM as a simple prompt -> answer generator. You can extend the LLM with a kind of long term memory, integrate it into other programs, even automate processes where you feed the outputs of prompts back in as new prompts to take advantage of some of those tricks I talked about above that we're using with gpt4.
Langchain can work with OpenAI, via the paid API, or it can work with locally installed models.
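(Just to give a flavor, here's roughly what a minimal langchain program looks like, adapted from its docs; the library changes fast, so take this as a sketch rather than gospel.)

    # Minimal langchain sketch, adapted from its docs (the API may have shifted).
    # Using OpenAI here requires the OPENAI_API_KEY environment variable.
    from langchain.llms import OpenAI
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain

    prompt = PromptTemplate(
        input_variables=["topic"],
        template="Explain {topic} in two sentences for a newsletter audience.",
    )

    # Swap OpenAI(...) for a wrapper around a locally running model to keep
    # everything on your own machine.
    chain = LLMChain(llm=OpenAI(temperature=0.7), prompt=prompt)
    print(chain.run(topic="quantization of neural network weights"))
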
I haven't personally done anything with langchain yet but you can bet I'm hoping to dig into it when I have the time, and I'll definitely report back the results.