Support embedding similarity search for GraphRAG #2200
base: main

Conversation
@@ -145,6 +149,7 @@ def upsert_entities(self, entities: Iterator[Vertex]) -> None:
    "_document_id": "0",
    "_chunk_id": "0",
    "_community_id": "0",
    "_embedding": entity.get_prop("embedding"),
Reviewer: Why does `get_prop` here still use `embedding` instead of `_embedding`?
Author: When the memory graph is built, the vector key is `embedding`. The vector field in the schema was later renamed to `_embedding` so that it can be filtered by the white_list when data is read back from TuGraph, preventing the returned payload from getting too large. Chunks are handled the same way.
create_vector_index_query = (
    f"CALL db.addVertexVectorIndex("
    f'"{GraphElemType.ENTITY.value}", "_embedding", '
    "{dimension: 512})"
)
Reviewer: Should the vector dimension be a configuration option?
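A minimal sketch of how the dimension could be lifted into config instead of the hard-coded 512, following the pydantic Field style used elsewhere in this thread (the `VectorIndexConfig` class and helper name are hypothetical):

```python
from pydantic import BaseModel, Field


class VectorIndexConfig(BaseModel):
    # Hypothetical config: the dimension must match the embedding
    # model's output size (512 for text_embedding_v3 as used below).
    dimension: int = Field(default=512, description="Vector index dimension")


def build_create_index_query(label: str, config: VectorIndexConfig) -> str:
    # Interpolate the configured dimension instead of the literal 512.
    return (
        f"CALL db.addVertexVectorIndex("
        f'"{label}", "_embedding", '
        f"{{dimension: {config.dimension}}})"
    )
```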
@@ -189,6 +203,7 @@ def upsert_chunks(self, chunks: Iterator[Union[Vertex, ParagraphChunk]]) -> None:
    "id": self._escape_quotes(chunk.vid),
    "name": self._escape_quotes(chunk.name),
    "content": self._escape_quotes(chunk.get_prop("content")),
    "_embedding": chunk.get_prop("embedding"),
Reviewer: Same question here: why does `get_prop` still use `embedding` instead of `_embedding`?
Author: Same as the handling for entities.
@@ -42,6 +42,10 @@ def __init__(self, graph_store: TuGraphStore):
    # Create the graph
    self.create_graph(self.graph_store.get_config().name)

    # vector index create control
    self._chunk_vector_index = False
    self._entity_vector_index = False
Reviewer: Are upsert_vector and upsert_chunk called repeatedly? Why is this variable needed?
Author: Yes, both upserts are called repeatedly, so these flags ensure the index-creation statement is only issued once.
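A minimal sketch of the guard pattern being described (the helper names are hypothetical); the flag simply makes index creation idempotent across repeated upsert calls:

```python
def upsert_chunks(self, chunks) -> None:
    # upsert_chunks is invoked once per batch, so guard index creation
    # with an instance flag to avoid re-issuing the DDL statement.
    if not self._chunk_vector_index:
        self._run_query(create_chunk_vector_index_query)  # hypothetical helper
        self._chunk_vector_index = True
    for chunk in chunks:
        self._run_query(build_chunk_upsert_query(chunk))  # hypothetical helper
```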
similarity_search = (
    f"CALL db.vertexVectorKnnSearch("
    f"'{GraphElemType.ENTITY.value}','_embedding', {vector}, "
    "{top_k:2, hnsw_ef_search:10})"
)
Reviewer: Should top_k be a configuration option? It is hard-coded to 2 here.
Reviewer: What does the hnsw_ef_search parameter specify?
Author: I think a large top_k could hurt result quality: in the original design one keyword maps to one header, whereas here one vector already maps to two headers. ef_search is a parameter required by HNSW at search time; the build phase also takes parameters, but we use the defaults, so they are not written out here.
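A sketch of lifting both knobs into parameters (defaults mirror the hard-coded values above; larger hnsw_ef_search trades query latency for recall):

```python
from typing import List


def build_knn_search_query(
    label: str,
    vector: List[float],
    top_k: int = 2,
    hnsw_ef_search: int = 10,
) -> str:
    # top_k bounds how many vertices the vector search returns;
    # hnsw_ef_search controls the breadth of the HNSW traversal at query time.
    return (
        f"CALL db.vertexVectorKnnSearch("
        f"'{label}', '_embedding', {vector}, "
        f"{{top_k:{top_k}, hnsw_ef_search:{hnsw_ef_search}}})"
    )
```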
@@ -560,10 +586,28 @@ def explore(
        rel = f"<-[r:{GraphElemType.RELATION.value}*{depth_string}]-"
    else:
        rel = f"-[r:{GraphElemType.RELATION.value}*{depth_string}]-"

    if all(isinstance(item, str) for item in subs):
Reviewer: To tell whether subs is a List[str] or a List[List[float]], do we have to traverse the entire list?
Author: With the earlier approach that did not look inside, both are plain lists and cannot be told apart, so we have to inspect the elements to distinguish them.
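If subs is guaranteed to be homogeneous (all keywords or all vectors, never mixed), checking only the first element avoids the full scan — a sketch under that assumption:

```python
from typing import List, Union


def contains_keywords(subs: Union[List[str], List[List[float]]]) -> bool:
    # Under the homogeneity assumption, the first element decides the
    # type of the whole list; an empty list is treated as keywords.
    return not subs or isinstance(subs[0], str)
```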
if similar_search_enabled:
    keywords: List[List[float]] = []
    vector = await self._garph_embedder.embed(text)
    keywords.append(vector)
Reviewer: This variable should no longer be called keywords, right?
Reviewer: Or rename keywords?
Author: Some of the comments still need follow-up confirmation.
@@ -348,8 +371,13 @@ async def asimilar_search_with_scores(

    if document_graph_enabled:
        keywords_for_document_graph = keywords
        for vertex in subgraph.vertices():
            keywords_for_document_graph.append(vertex.name)
    if similar_search_enabled:
Reviewer: When both triplet_graph and document_graph are enabled, the document graph query should not use vectors; it should keep the original approach of querying directly from vertex.name. Only when document_graph alone is enabled should the vector be used.
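A sketch of the branching the reviewer is asking for (variable names follow the quoted diff; `triplet_graph_enabled` is assumed; this is an illustration, not the PR's code):

```python
if document_graph_enabled:
    if triplet_graph_enabled:
        # Both graphs enabled: keep the original behavior and query the
        # document graph from entity names found in the triplet subgraph.
        subs_for_document_graph = [v.name for v in subgraph.vertices()]
    elif similar_search_enabled:
        # Document graph only: fall back to the query vector.
        subs_for_document_graph = [vector]
```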
Reviewer: Good job, and thanks for your PR. I have reviewed most of it and left some comments. Please fix them, thanks!
resp = dashscope.TextEmbedding.call(
    model=dashscope.TextEmbedding.Models.text_embedding_v3,
    input=text,
    dimension=512,
)
Reviewer: It is better not to hard-code the embedding model; that is not in line with extensibility. The solution should be to encapsulate `embedding_fn` (an LLM embedding model, not dashscope's model) instead of calling a package directly. Refer to `embedding_fn: Optional[Embeddings] = Field(...)` from BuiltinKnowledgeGraphConfig.
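A hedged sketch of the suggested encapsulation — injecting an Embeddings implementation instead of calling the dashscope package directly (the wrapper class, the import path, and the aembed_query method name are assumptions modeled on the referenced config field):

```python
from typing import List, Optional

from dbgpt.core.interface.embeddings import Embeddings  # assumed import path


class GraphEmbedder:  # hypothetical wrapper
    def __init__(self, embedding_fn: Optional[Embeddings] = None):
        # Any model implementing the Embeddings interface can be plugged
        # in, so the transformer is not tied to dashscope.
        self._embedding_fn = embedding_fn

    async def embed(self, text: str) -> List[float]:
        # Delegate to the injected model rather than a hard-coded SDK call.
        return await self._embedding_fn.aembed_query(text)
```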
async def batch_embed(
    self,
    texts: List[str],
) -> List[List[float]]:
Reviewer: Add a parameter called `batch_size`, and do not forget to modify the config variable consistently. Refer to `_triplet_extraction_batch_size`.
for text in texts:
    vector = await self._embed(text)
    results.extend(vector)
Reviewer: Handle the task in batch/parallel (note: asyncio). Also, this should be `append` rather than `extend`.
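A minimal sketch combining both comments — a batch_size parameter plus asyncio.gather for concurrency within each batch (the default value is an assumption; it should come from config, as with _triplet_extraction_batch_size):

```python
import asyncio
from typing import List


async def batch_embed(
    self,
    texts: List[str],
    batch_size: int = 20,  # assumed default; wire this to config
) -> List[List[float]]:
    results: List[List[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        # Embed the whole batch concurrently instead of awaiting one by one.
        vectors = await asyncio.gather(*(self._embed(text) for text in batch))
        # vectors is a list of List[float], so extend is correct here; in
        # the original per-text loop it must be append, not extend.
        results.extend(vectors)
    return results
```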
from dbgpt.rag.transformer.base import EmbedderBase

logger = logging.getLogger(__name__)


-class Text2Vector(EmbedderBase):
+class Text2Vector(EmbedderBase, ABC):
Reviewer: Is it an abstract class?
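For context on the question — in Python, inheriting from ABC only blocks instantiation if abstract members exist; a generic illustration:

```python
from abc import ABC, abstractmethod


class EmbedderBase(ABC):
    @abstractmethod
    async def embed(self, text: str) -> list:
        """Subclasses must provide the actual embedding call."""


# Text2Vector(EmbedderBase, ABC) is only effectively abstract if it (or
# a base) still has unimplemented @abstractmethod members; adding ABC to
# the bases by itself does not prevent instantiation.
```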
if vertex.get_prop("vertex_type") == GraphElemType.CHUNK.value:
    text = vertex.get_prop("content")
    vector = await self._embed(text)
    vertex.set_prop("embedding", vector)
Reviewer: Use `_embedding`, since it is a prop that is not visible to users (it's implicit).
similar_search_enabled: bool = Field(
    default=False,
    description="Enable the similarity search",
)
Reviewer: Move this code to after `document_graph_enabled`.
Suggested change:
-similar_search_enabled: bool = Field(
-    default=False,
-    description="Enable the similarity search",
-)
+similarity_search_enabled: bool = Field(
+    default=False,
+    description="Enable the similarity search",
+)
self._similar_search_enabled = (
    os.environ["SIMILAR_SEARCH_ENABLED"].lower() == "true"
    if "SIMILAR_SEARCH_ENABLED" in os.environ
    else config.similar_search_enabled
)
Reviewer: Use similarity, not similar.
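Applying the rename consistently might look like this (the env var spelling is an assumption following the suggested naming):

```python
self._similarity_search_enabled = (
    os.environ["SIMILARITY_SEARCH_ENABLED"].lower() == "true"
    if "SIMILARITY_SEARCH_ENABLED" in os.environ
    else config.similarity_search_enabled
)
```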
@@ -244,6 +256,8 @@ async def _aload_triplet_graph(self, chunks: List[Chunk]) -> None:
    if not graphs_list:
        raise ValueError("No graphs extracted from the chunks")

    graphs_list = await self._garph_embedder.batch_embed(graphs_list)
Reviewer: Using batch_size may be better, and `_garph_embedder` has a spelling error.
similar_search_enabled = self._similar_search_enabled

if similar_search_enabled:
    keywords: List[List[float]] = []
Reviewer: It is a list of vectors, which cannot be called `keywords`; use `subs`.
    keywords_for_document_graph.append(vector)
else:
    for vertex in subgraph.vertices():
        keywords_for_document_graph.append(vertex.name)
Reviewer: Note: I have not reviewed this part yet. In the description, change #2196 to Fix #2196.
    keywords_for_document_graph.append(vertex.name)
if similar_search_enabled:
    for vertex in subgraph.vertices():
        vector = await self._garph_embedder.embed(vertex.name)
Reviewer: In the document graph used for tracing source chunks, only entity names can be used; embeddings of entity names cannot.
@@ -363,6 +391,7 @@ async def asimilar_search_with_scores(
    limit=self._knowledge_graph_chunk_search_top_size,
    search_scope="document_graph",
)
Reviewer: We need an additional vector search over chunk.content, then trace back all the upstream paths of the matched chunks to obtain a chunk-based document graph, and merge it with the document graph produced by the triplet subgraph search.
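A rough sketch of that flow; every method name here is hypothetical, this only illustrates the shape of the request:

```python
# 1. Vector search over chunk content, in addition to the entity search.
chunk_hits = await store.search_similar_chunks(vector, topk)  # hypothetical
# 2. Trace all upstream paths (document/chunk hierarchy) of each hit.
chunk_doc_graph = await store.trace_upstream(chunk_hits)      # hypothetical
# 3. Merge with the document graph from the triplet subgraph search.
document_graph = merge_graphs(triplet_doc_graph, chunk_doc_graph)
```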
async def batch_embed(
    self,
    texts: List[str],
) -> List[List[float]]:
Reviewer: Returning a `List[...]` may be better.
async def batch_embed(
    self,
    graphs_list: List[List[Graph]],
) -> List[List[Graph]]:
Reviewer: Use:

async def batch_embed(
    self,
    graphs_list: List[Graph],
) -> List[Graph]:
Fix #2196

Description

Now supports embedding similarity search for GraphRAG. Keyword search can still be used alongside it.

How to use

Set SIMILAR_SEARCH_ENABLED=True in .env (the default is False).
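For example, in .env (the variable name is taken from the description above; note that reviewers suggested renaming it to SIMILARITY_SEARCH_ENABLED):

```
SIMILAR_SEARCH_ENABLED=True
```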