Example of locally running GPT4All, a 4GB, llama.cpp-based large language model (LLM), under langchain, in a Jupyter notebook running a Python 3.10 kernel.
Tested on a mid-2015 16GB MacBook Pro, concurrently running Docker (a single container running a separate Jupyter server) and Chrome (with approx. 40 open tabs).
gpt4all model:
#https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized.bin
llama.cpp 7B model:
#%pip install pyllama
#!python3.10 -m llama.download --model_size 7B --folder llama/
gpt4all model:
#%pip install pyllamacpp
#!pyllamacpp-convert-gpt4all ./gpt4all-main/chat/gpt4all-lora-quantized.bin llama/tokenizer.model ./gpt4all-main/chat/gpt4all-lora-q-converted.bin
GPT4ALL_MODEL_PATH = "./gpt4all-main/chat/gpt4all-lora-q-converted.bin"
langchain Demo
Example of running a prompt using langchain.
#https://python.langchain.com/en/latest/ecosystem/llamacpp.html
#%pip uninstall -y langchain
#%pip install --upgrade git+https://github.com/hwchase17/langchain.git
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
template = """
Question: {question}
Answer: Let's think step by step.
"""
prompt = PromptTemplate(template=template, input_variables=["question"])
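As a rough picture of what filling the template amounts to, here is a minimal plain-Python sketch that substitutes a question into the `{question}` placeholder. The names `sketch_template` and `fill_prompt` are made up for illustration and are not part of langchain; the real `PromptTemplate` also validates its input variables.

```python
# Plain-Python stand-in for filling a prompt template (no langchain required).
# `sketch_template` and `fill_prompt` are hypothetical names for illustration.
sketch_template = """
Question: {question}
Answer: Let's think step by step.
"""


def fill_prompt(template: str, question: str) -> str:
    """Substitute the question into the {question} placeholder."""
    return template.format(question=question)


filled = fill_prompt(sketch_template, "What is 2 + 2?")
print(filled)
```

The filled string is what ultimately gets sent to the model as the prompt text.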
%%time
llm = LlamaCpp(model_path=GPT4ALL_MODEL_PATH)
llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 512.00 MB
CPU times: user 572 ms, sys: 711 ms, total: 1.28 s Wall time: 1.42 s
llm_chain = LLMChain(prompt=prompt, llm=llm)
%%time
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)
CPU times: user 5min 2s, sys: 4.17 s, total: 5min 6s Wall time: 43.7 s
'1) The year Justin Bieber was born (2005):\n2) Justin Bieber was born on March 1, 1994:\n3) The Buffalo Bills won Super Bowl XXVIII over the Dallas Cowboys in 1994:\nTherefore, the NFL team that won the Super Bowl in the year Justin Bieber was born is the Buffalo Bills.'
Another example...
template = """
Question: {question}
Answer:
"""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
%%time
question = "What is a relational database and what is ACID in that context?"
llm_chain.run(question)
CPU times: user 14min 37s, sys: 5.56 s, total: 14min 42s Wall time: 2min 4s
"A relational database is a type of database management system (DBMS) that stores data in tables where each row represents one entity or object (e.g., customer, order, or product), and each column represents a property or attribute of the entity (e.g., first name, last name, email address, or shipping address).\n\nACID stands for Atomicity, Consistency, Isolation, Durability:\n\nAtomicity: The transaction's effects are either all applied or none at all; it cannot be partially applied. For example, if a customer payment is made but not authorized by the bank, then the entire transaction should fail and no changes should be committed to the database.\nConsistency: Once a transaction has been committed, its effects should be durable (i.e., not lost), and no two transactions can access data in an inconsistent state. For example, if one transaction is in progress while another transaction attempts to update the same data, both transactions should fail.\nIsolation: Each transaction should execute without interference from other concurrently executing transactions, thereby ensuring its properties are applied atomically and consistently. For example, two transactions cannot affect each other's data"
We can also use the model to generate embeddings.
%%time
#https://abetlen.github.io/llama-cpp-python/
#%pip uninstall -y llama-cpp-python
#%pip install --upgrade llama-cpp-python
from langchain.embeddings import LlamaCppEmbeddings
llama_embeddings = LlamaCppEmbeddings(model_path=GPT4ALL_MODEL_PATH)
llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 512.00 MB
CPU times: user 548 ms, sys: 795 ms, total: 1.34 s Wall time: 1.36 s
%%time
text = "This is a test document."
query_result = llama_embeddings.embed_query(text)
CPU times: user 9.71 s, sys: 1.5 s, total: 11.2 s Wall time: 1.59 s
%%time
doc_result = llama_embeddings.embed_documents([text])
CPU times: user 10.4 s, sys: 59.7 ms, total: 10.4 s Wall time: 1.47 s
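Embedding vectors like these are typically compared using cosine similarity. The following is a minimal pure-Python sketch of that comparison; the short vectors are made-up stand-ins (the real embeddings from this model would be much longer lists of floats), and `cosine_similarity` is a helper written for illustration rather than a langchain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Identical directions score close to 1 and orthogonal vectors score 0, which is the basis for ranking documents against a query.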
Example document query using the example from the langchain docs.
The idea is to run the query against a document source to retrieve some relevant context, and use that as part of the prompt context.
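To make the idea concrete, here is a toy sketch of the retrieve-then-prompt pattern in plain Python. It swaps in a bag-of-words vector for the model embeddings and uses two made-up text chunks, so it only illustrates the shape of the approach; `embed`, `similarity` and `build_prompt` are hypothetical helper names, not langchain APIs.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (NOT the model embeddings)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question, chunks):
    """Pick the chunk most similar to the question and put it in the prompt."""
    q_vec = embed(question)
    best = max(chunks, key=lambda c: similarity(q_vec, embed(c)))
    return f"Context: {best}\n\nQuestion: {question}\nAnswer:"


# Made-up chunks for illustration only.
chunks = [
    "The president nominated Judge Ketanji Brown Jackson to the Supreme Court.",
    "We need to secure the border and fix the immigration system.",
]
prompt_text = build_prompt(
    "What did the president say about Ketanji Brown Jackson", chunks
)
print(prompt_text)
```

The model then sees the retrieved context alongside the question, rather than the question alone.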
#https://python.langchain.com/en/latest/use_cases/question_answering.html
template = """
Question: {question}
Answer:
"""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
A naive prompt gives an irrelevant answer:
%%time
query = "What did the president say about Ketanji Brown Jackson"
llm_chain.run(query)
CPU times: user 58.3 s, sys: 3.59 s, total: 1min 1s Wall time: 9.75 s
'\nAnswer: The Pittsburgh Steelers'
Now let's try with a source document.
#!wget https://raw.githubusercontent.com/hwchase17/langchainjs/main/examples/state_of_the_union.txt
from langchain.document_loaders import TextLoader
# Ideally....
loader = TextLoader('./state_of_the_union.txt')
However, creating the embeddings is quite slow, so I'm going to use a fragment of the text:
#ish via chatgpt...
def search_context(src, phrase, buffer=100):
    with open(src, 'r') as f:
        txt = f.read()
    words = txt.split()
    index = words.index(phrase)
    start_index = max(0, index - buffer)
    end_index = min(len(words), index + buffer + 1)
    return ' '.join(words[start_index:end_index])
fragment = './fragment.txt'
with open(fragment, 'w') as fo:
    _txt = search_context('./state_of_the_union.txt', "Ketanji")
    fo.write(_txt)
!cat $fragment
Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. We can do both. At our border, we’ve installed new technology like cutting-edge
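One caveat with a word-window search of this kind: `list.index` only matches a whitespace-delimited token exactly, so a phrase followed by punctuation (for example "Jackson," or "Jackson.") would raise a `ValueError`. The sketch below is one possible variant that works on a string rather than a file, matches any token containing the phrase, and returns `None` when nothing matches. The function name `context_window` and the sample sentence are made up for illustration.

```python
def context_window(text, phrase, buffer=100):
    """Return up to `buffer` words either side of the first token containing
    `phrase`, or None if no token contains it."""
    words = text.split()
    for i, word in enumerate(words):
        if phrase in word:
            start = max(0, i - buffer)
            end = min(len(words), i + buffer + 1)
            return " ".join(words[start:end])
    return None


# Made-up sample sentence for illustration.
sample = (
    "One of our nation's top legal minds is Judge Jackson, "
    "who will continue the legacy."
)
snippet = context_window(sample, "Jackson", buffer=2)
print(snippet)
```

Substring matching is more forgiving but can also over-match (searching for "Jack" would hit "Jackson,"), so the exact-match version may still be preferable when the token is known to appear cleanly.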
loader = TextLoader('./fragment.txt')
#%pip install chromadb
from langchain.indexes import VectorstoreIndexCreator
Generate an index from the knowledge source text:
%%time
# Time: ~0.5s per token
# NOTE: "You must specify a persist_directory on creation to persist the collection."
# TO DO: How do we load in an already generated and persisted index?
index = VectorstoreIndexCreator(embedding=llama_embeddings,
                                vectorstore_kwargs={"persist_directory": "db"}
                                ).from_loaders([loader])
Using embedded DuckDB with persistence: data will be stored in: db
CPU times: user 2 µs, sys: 2 µs, total: 4 µs Wall time: 7.87 µs
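Since generating embeddings is the slow step, one library-agnostic way to avoid recomputing them is to cache each vector on disk, keyed by a hash of the text. The sketch below is not how Chroma persists or reloads its index; it is a separate, simplified technique, and `cached_embed` and `fake_embed` are hypothetical names (in the notebook, something like `llama_embeddings.embed_query` could be passed in as `embed_fn`).

```python
import hashlib
import json
import os
import tempfile


def cached_embed(text, embed_fn, cache_dir):
    """Return the embedding for `text`, computing it only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    vector = embed_fn(text)
    with open(path, "w") as f:
        json.dump(vector, f)
    return vector


# Stand-in for a slow embedding call; records how often it is invoked.
calls = []


def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 1.0]


cache_dir = tempfile.mkdtemp()
first = cached_embed("This is a test document.", fake_embed, cache_dir)
second = cached_embed("This is a test document.", fake_embed, cache_dir)
print(first == second, len(calls))
```

The second call is served from disk, so the expensive embedding function runs only once per distinct text.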
%time
pass
# The following errors...
#index.query(query, llm=llm)
# With the full SOTU text, I got:
# Error: llama_tokenize: too many tokens;
# Also getting:
# ValueError: Requested tokens exceed context window of 512
# If we do get past that,
# NotEnoughElementsException
# For the latter, somehow need to set something like search_kwargs={"k": 1}
CPU times: user 2 µs, sys: 2 µs, total: 4 µs Wall time: 10 µs
It seems the retriever expects, by default, 4 result documents. I can't see how to pass in a lower limit (a single result document is acceptable in this case), so we need to roll our own chain...
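The underlying issue appears to be asking for more results than the index holds. As a minimal sketch of the general fix, a top-k lookup can clamp `k` to the number of available items; `top_k` here is a made-up helper operating on a plain list of similarity scores, not the Chroma or langchain retriever.

```python
def top_k(scores, k=4):
    """Return the indices of the k highest scores, best first.
    k is clamped so that asking for more results than exist is not an error."""
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Only two "documents" are indexed, but the default asks for four.
print(top_k([0.2, 0.9], k=4))
# Asking for a single best match.
print(top_k([0.2, 0.9, 0.5], k=1))
```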
%%time
# Roll our own....
#https://github.com/hwchase17/langchain/issues/2255
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# Again, we should persist the db and figure out how to reuse it
docsearch = Chroma.from_documents(texts, llama_embeddings)
Using embedded DuckDB without persistence: data will be transient
CPU times: user 5min 59s, sys: 1.62 s, total: 6min 1s Wall time: 49.2 s
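For intuition about what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-width character chunker. It is not the algorithm used by `RecursiveCharacterTextSplitter` (which tries to split on separators such as paragraph and line breaks first); `chunk_text` is a hypothetical helper written only to show how the two parameters interact.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into fixed-width character chunks.
    Consecutive chunks share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


pieces = chunk_text("a" * 1200, chunk_size=500, chunk_overlap=0)
print([len(p) for p in pieces])

overlapping = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(overlapping[:2])
```

Smaller chunks mean more embedding calls but shorter context per retrieved document, which matters here given the 512 token context window.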
%%time
# Just getting a single result document from the knowledge lookup is fine...
MIN_DOCS = 1
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS}))
CPU times: user 861 µs, sys: 2.97 ms, total: 3.83 ms Wall time: 7.09 ms
What do we get in response to our original query now?
%%time
print(query)
qa.run(query)
What did the president say about Ketanji Brown Jackson
CPU times: user 7min 39s, sys: 2.59 s, total: 7min 42s Wall time: 1min 6s
' The president honored Justice Stephen Breyer and acknowledged his service to this country before introducing Justice Ketanji Brown Jackson, who will be serving as the newest judge on the United States Court of Appeals for the District of Columbia Circuit.'
%%time
query = "Identify three things the president said about Ketanji Brown Jackson"
qa.run(query)
CPU times: user 10min 20s, sys: 4.2 s, total: 10min 24s Wall time: 1min 35s
' The president said that she was nominated by Barack Obama to become the first African American woman to sit on the United States Court of Appeals for the District of Columbia Circuit. He also mentioned that she was an Army veteran, a Constitutional scholar, and is retiring Justice of the United States Supreme Court.'
%%time
query = """
Identify three things the president said about Ketanji Brown Jackson. Provide the answer in the form:
- ITEM 1
- ITEM 2
- ITEM 3
"""
qa.run(query)
CPU times: user 12min 31s, sys: 4.24 s, total: 12min 35s Wall time: 1min 45s
"\n\nITEM 1: President Trump honored Justice Breyer for his service to this country, but did not specifically mention Ketanji Brown Jackson.\n\nITEM 2: The president did not identify any specific characteristics about Justice Breyer that would be useful in identifying her.\n\nITEM 3: The president did not make any reference to Justice Breyer's current or past judicial rulings or cases during his speech."