Weaviate
This notebook covers how to get started with the Weaviate vector store in LangChain, using the langchain-weaviate package.
Weaviate is an open-source vector database. It allows you to store data objects and vector embeddings from your favorite ML-models, and scale seamlessly into billions of data objects.
To use this integration, you need to have a running Weaviate database instance.
Minimum versions
This module requires Weaviate 1.23.7 or higher. However, we recommend you use the latest version of Weaviate.
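As a quick sanity check, you can read the server version from a connected client. This is a minimal sketch; it assumes a v4 client created with weaviate.connect_to_local() as shown in the next section, and that the metadata returned by get_meta() includes the server version.
import weaviate

# Sketch: verify that the Weaviate server version meets the minimum requirement
weaviate_client = weaviate.connect_to_local()
print(weaviate_client.get_meta()["version"])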
Connecting to Weaviate
In this notebook, we assume that you have a local instance of Weaviate running on http://localhost:8080 and port 50051 open for gRPC traffic. So, we will connect to Weaviate with:
weaviate_client = weaviate.connect_to_local()
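If your local instance uses non-default ports, connect_to_local also accepts explicit host and port arguments. This is just a sketch showing the default values:
# Equivalent connection with explicit host and ports (default values shown)
weaviate_client = weaviate.connect_to_local(host="localhost", port=8080, grpc_port=50051)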
Other deployment options
Weaviate can be deployed in many different ways, such as using Weaviate Cloud Services (WCS), Docker, or Kubernetes.
If your Weaviate instance is deployed in another way, read more here about the different ways to connect to Weaviate. You can use different helper functions or create a custom instance.
Note that you require a v4 client API, which will create a weaviate.WeaviateClient object.
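For example, a deployment that is not local could be reached with connect_to_custom. This is a minimal sketch; the host names and ports below are placeholders for your own deployment:
# Sketch: connect to a remotely deployed Weaviate instance (placeholder hosts and ports)
weaviate_client = weaviate.connect_to_custom(
    http_host="your-weaviate-host",  # placeholder
    http_port=8080,
    http_secure=False,
    grpc_host="your-weaviate-host",  # placeholder
    grpc_port=50051,
    grpc_secure=False,
)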
Authentication
Some Weaviate instances, such as those running on WCS, have authentication enabled, such as API key and/or username+password authentication.
Read the client authentication guide for more information, as well as the in-depth authentication configuration page.
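For example, connecting to a WCS instance with API key authentication might look like the sketch below; the WEAVIATE_URL and WEAVIATE_API_KEY environment variables are placeholders for your own credentials:
import os
import weaviate

# Sketch: connect to Weaviate Cloud Services (WCS) with an API key (placeholder credentials)
weaviate_client = weaviate.connect_to_wcs(
    cluster_url=os.environ["WEAVIATE_URL"],  # placeholder environment variable
    auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_API_KEY"]),  # placeholder environment variable
)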
Installation
# install package
# %pip install -Uqq langchain-weaviate
# %pip install openai tiktoken langchain
Environment Setup
This notebook uses the OpenAI API through OpenAIEmbeddings. We suggest obtaining an OpenAI API key and exporting it as an environment variable with the name OPENAI_API_KEY.
Once this is done, your OpenAI API key will be read automatically. If you are new to environment variables, read more about them here or in this guide.
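For example, you can set the variable from within the notebook; getpass prompts for the key interactively so it is not stored in the notebook itself:
import getpass
import os

# Prompt for the OpenAI API key if it is not already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")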
Usage
Find objects by similarity
Here is an example of how to find objects by similarity to a query, from data import to querying the Weaviate instance.
Step 1: Data import
First, we will create data to add to Weaviate by loading and chunking the contents of a long text file.
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
Now, we can import the data.
To do so, connect to the Weaviate instance and use the resulting weaviate_client object. For example, we can import the documents as shown below:
import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore
weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)
Step 2: Perform the search
We can now perform a similarity search. This will return the most similar documents to the query text, based on the embeddings stored in Weaviate and an equivalent embedding generated from the query text.
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
# Print the first 100 characters of each result
for i, doc in enumerate(docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:100] + "...")
Document 1:
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...
Document 2:
And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of ...
Document 3:
Vice President Harris and I ran for office with a new economic vision for America.
Invest in Ameri...
Document 4:
A former top litigator in private practice. A former federal public defender. And from a family of p...
You can also add filters, which will either include or exclude results based on the filter conditions. (See more filter examples.)
from weaviate.classes.query import Filter
for filter_str in ["blah.txt", "state_of_the_union.txt"]:
    search_filter = Filter.by_property("source").equal(filter_str)
    filtered_search_results = db.similarity_search(query, filters=search_filter)
    print(len(filtered_search_results))
    if filter_str == "state_of_the_union.txt":
        assert len(filtered_search_results) > 0  # There should be at least one result
    else:
        assert len(filtered_search_results) == 0  # There should be no results
0
4
It is also possible to provide k, which is the upper limit of the number of results to return.
search_filter = Filter.by_property("source").equal("state_of_the_union.txt")
filtered_search_results = db.similarity_search(query, filters=search_filter, k=3)
assert len(filtered_search_results) <= 3