
Support embedding similarity search for GraphRAG #2200

Open · wants to merge 8 commits into base: main
Conversation

@Jant1L commented Dec 13, 2024

Fix #2196

Description

GraphRAG now supports embedding similarity search. Keyword search remains available alongside it.

How to use

Set SIMILAR_SEARCH_ENABLED=True in .env (the default is False).
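For reference, a minimal .env snippet (the variable name comes from this PR; the surrounding file is illustrative):

# Enable vector similarity search for GraphRAG (defaults to False)
SIMILAR_SEARCH_ENABLED=True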

@@ -145,6 +149,7 @@ def upsert_entities(self, entities: Iterator[Vertex]) -> None:
    "_document_id": "0",
    "_chunk_id": "0",
    "_community_id": "0",
    "_embedding": entity.get_prop("embedding"),


Why does the get_prop here still use embedding instead of _embedding?

Author: When building the memory graph, the vector's key is embedding. The vector field in the schema was later renamed to _embedding to avoid overly long payloads when fetching data back from TuGraph: the underscore-prefixed name can be filtered by the white_list, preventing the returned data from getting too large. Chunks are handled the same way.
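A minimal sketch of the renaming described above, with the dict shape assumed from the diff excerpts in this PR:

# The in-memory graph stores the vector under "embedding"; on write to
# TuGraph the property becomes "_embedding", which the white_list then
# filters out when data is read back.
entity_props = {
    "id": self._escape_quotes(entity.vid),
    "name": self._escape_quotes(entity.name),
    "_embedding": entity.get_prop("embedding"),  # key renamed on write
}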

create_vector_index_query = (
    f"CALL db.addVertexVectorIndex("
    f'"{GraphElemType.ENTITY.value}", "_embedding", '
    "{dimension: 512})"
)


Should the vector dimension be a configuration option?
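One possible shape for such an option, a sketch only, reusing the Field(...) pattern that appears elsewhere in this PR (the name embedding_dimension is hypothetical):

embedding_dimension: int = Field(
    default=512,
    description="Dimension of the _embedding vector index",
)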


@@ -189,6 +203,7 @@ def upsert_chunks(self, chunks: Iterator[Union[Vertex, ParagraphChunk]]) -> None
    "id": self._escape_quotes(chunk.vid),
    "name": self._escape_quotes(chunk.name),
    "content": self._escape_quotes(chunk.get_prop("content")),
    "_embedding": chunk.get_prop("embedding"),


Why does the get_prop here still use embedding instead of _embedding?

Author: Same as the handling for entities.

@@ -42,6 +42,10 @@ def __init__(self, graph_store: TuGraphStore):
    # Create the graph
    self.create_graph(self.graph_store.get_config().name)

    # Flags ensuring each vector index is created only once
    self._chunk_vector_index = False
    self._entity_vector_index = False


Are upsert_vector and upsert_chunk called repeatedly? Why is this variable needed?

Author: Both upserts are called repeatedly, so this flag is used to make sure the index-creation statement is not executed more than once.
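A sketch of the guard being described (the conn.run executor call is an assumption; the flag name comes from the diff):

# Create the entity vector index at most once across repeated upserts.
if not self._entity_vector_index:
    self.graph_store.conn.run(create_vector_index_query)  # assumed executor API
    self._entity_vector_index = True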

similarity_search = (
    f"CALL db.vertexVectorKnnSearch("
    f"'{GraphElemType.ENTITY.value}','_embedding', {vector}, "
    "{top_k:2, hnsw_ef_search:10})"
)


Should top_k be a configuration option? It is hard-coded to 2 here.


What does the hnsw_ef_search parameter specify?

Author: I think a large top_k could hurt retrieval quality; in the original design one keyword maps to one header, whereas here one vector effectively maps to two headers. ef_search is a parameter that HNSW search requires; index construction takes parameters as well, but we use the defaults, so they are not written out here.
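If top_k were made configurable as suggested, the query might be built like this (a sketch; the top_k variable is a hypothetical config value):

similarity_search = (
    f"CALL db.vertexVectorKnnSearch("
    f"'{GraphElemType.ENTITY.value}','_embedding', {vector}, "
    f"{{top_k:{top_k}, hnsw_ef_search:10}})"  # top_k from config instead of a hard-coded 2
)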

@@ -560,10 +586,28 @@ def explore(
    rel = f"<-[r:{GraphElemType.RELATION.value}*{depth_string}]-"
else:
    rel = f"-[r:{GraphElemType.RELATION.value}*{depth_string}]-"

if all(isinstance(item, str) for item in subs):


Does determining whether subs is List[str] or List[List[float]] require traversing the entire list?

Author: With the original approach that did not inspect the elements, the two cases could not be told apart, since both are plain list types, so we have to look inside the list here.
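A sketch of the distinction under discussion: at runtime both variants are plain lists, so the element type has to be inspected. Assuming a homogeneous list, checking only the first element avoids a full traversal:

# subs is either List[str] (keywords) or List[List[float]] (vectors).
if subs and isinstance(subs[0], str):
    ...  # keyword search path
else:
    ...  # vector similarity search path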

if similar_search_enabled:
    keywords: List[List[float]] = []
    vector = await self._garph_embedder.embed(text)
    keywords.append(vector)


This variable should not be called keywords anymore, right?


Or rename keywords?

@SonglinLyu left a comment

Some of the comments below need follow-up confirmation.

@@ -348,8 +371,13 @@ async def asimilar_search_with_scores(

if document_graph_enabled:
    keywords_for_document_graph = keywords
    for vertex in subgraph.vertices():
        keywords_for_document_graph.append(vertex.name)
if similar_search_enabled:


When both triplet_graph and document_graph are enabled, the document-graph query should probably not use vectors but keep the original method of querying directly from vertex.name; vectors should be used only when document_graph alone is enabled.
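A sketch of the suggested control flow (assumed code, not the PR's; _garph_embedder is the attribute name used in this PR):

if document_graph_enabled:
    if triplet_graph_enabled:
        # Reuse the entity names found by the triplet search.
        subs = [vertex.name for vertex in subgraph.vertices()]
    elif similar_search_enabled:
        # Fall back to vectors only when the document graph is the sole path.
        subs = [await self._garph_embedder.embed(text)]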

@Appointat (Contributor) left a comment

Good job, and thanks for your PR. I have reviewed most of it and left some comments. Please fix them, thanks!

Comment on lines +38 to +41
resp = dashscope.TextEmbedding.call(
    model = dashscope.TextEmbedding.Models.text_embedding_v3,
    input = text,
    dimension = 512)

It is better not to hard-code the embedding model; that is not in line with extensibility.

The solution should be to encapsulate embedding_fn (an LLM embedding model, not dashscope's model) instead of calling the package directly. Refer to embedding_fn: Optional[Embeddings] = Field( ... from BuiltinKnowledgeGraphConfig.
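A sketch of the suggested encapsulation (the import path and the aembed_query method on Embeddings are assumptions; the point is injecting the model rather than importing dashscope):

from typing import List

from dbgpt.core import Embeddings  # assumed import path
from dbgpt.rag.transformer.base import EmbedderBase


class Text2Vector(EmbedderBase):
    """Embeds text via an injected embedding function."""

    def __init__(self, embedding_fn: Embeddings):
        self._embedding_fn = embedding_fn

    async def _embed(self, text: str) -> List[float]:
        # aembed_query is assumed to exist on the Embeddings interface.
        return await self._embedding_fn.aembed_query(text)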

Comment on lines +25 to +28
async def batch_embed(
    self,
    texts: List[str],
) -> List[List[float]]:

Add a parameter called batch_size, and don't forget to update the config variable consistently.
Refer to: _triplet_extraction_batch_size.

Comment on lines +31 to +33
for text in texts:
    vector = await self._embed(text)
    results.extend(vector)

Handle the tasks in batch/in parallel (note: asyncio).

Also, this should be append rather than extend.
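A sketch combining the two points, concurrency via asyncio and per-text append semantics (batch_size handling is omitted here):

import asyncio
from typing import List


async def batch_embed(self, texts: List[str]) -> List[List[float]]:
    # One embedding task per text, awaited concurrently; gather preserves
    # input order, so the i-th vector corresponds to texts[i].
    vectors = await asyncio.gather(*(self._embed(text) for text in texts))
    return list(vectors)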


from dbgpt.rag.transformer.base import EmbedderBase

logger = logging.getLogger(__name__)


- class Text2Vector(EmbedderBase):
+ class Text2Vector(EmbedderBase, ABC):

Is it an abstract class?

if vertex.get_prop("vertex_type") == GraphElemType.CHUNK.value:
    text = vertex.get_prop("content")
    vector = await self._embed(text)
    vertex.set_prop("embedding", vector)

Use _embedding, since it is a prop not visible to the users (it's implicit).

Comment on lines +82 to 86
similar_search_enabled: bool = Field(
    default=False,
    description="Enable the similarity search",
)


Move the code to after document_graph_enabled.

Suggested change
- similar_search_enabled: bool = Field(
-     default=False,
-     description="Enable the similarity search",
- )
+ similarity_search_enabled: bool = Field(
+     default=False,
+     description="Enable the similarity search",
+ )

Comment on lines +147 to +151
self._similar_search_enabled = (
    os.environ["SIMILAR_SEARCH_ENABLED"].lower() == "true"
    if "SIMILAR_SEARCH_ENABLED" in os.environ
    else config.similar_search_enabled
)

Use similarity, not similar.

@@ -244,6 +256,8 @@ async def _aload_triplet_graph(self, chunks: List[Chunk]) -> None:
    if not graphs_list:
        raise ValueError("No graphs extracted from the chunks")

    graphs_list = await self._garph_embedder.batch_embed(graphs_list)
@Appointat (Contributor) commented Dec 16, 2024

Using batch_size may be better, and _garph_embedder has a spelling error.

similar_search_enabled = self._similar_search_enabled

if similar_search_enabled:
    keywords: List[List[float]] = []

It is a list of vectors, which cannot be called keywords; use subs.

    keywords_for_document_graph.append(vector)
else:
    for vertex in subgraph.vertices():
        keywords_for_document_graph.append(vertex.name)

Note: I have not reviewed yet.

@SonglinLyu commented Dec 20, 2024

Change #2196 in the description to Fix #2196.

    keywords_for_document_graph.append(vertex.name)
if similar_search_enabled:
    for vertex in subgraph.vertices():
        vector = await self._garph_embedder.embed(vertex.name)
Collaborator: In the document graph used for tracing source chunks, only entity names can be used; the embeddings of entity names cannot be used.

@@ -363,6 +391,7 @@ async def asimilar_search_with_scores(
    limit=self._knowledge_graph_chunk_search_top_size,
    search_scope="document_graph",
)

Collaborator: We need an additional vector search over chunk.content, and to trace back all upstream paths of those chunks to obtain a chunk-based document graph, then merge it with the document graph produced by the triplet subgraph search.
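A sketch of what the additional chunk search might look like, mirroring the entity KNN query quoted earlier (the GraphElemType.CHUNK usage and the top_k value are assumptions):

chunk_similarity_search = (
    f"CALL db.vertexVectorKnnSearch("
    f"'{GraphElemType.CHUNK.value}','_embedding', {vector}, "
    "{top_k:5, hnsw_ef_search:10})"  # hypothetical top_k
)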

async def batch_embed(
    self,
    texts: List[str],
) -> List[List[float]]:
Collaborator: Returning List[] may be better.

async def batch_embed(
    self,
    graphs_list: List[List[Graph]],
) -> List[List[Graph]]:
Collaborator: Use:
async def batch_embed(
    self,
    graphs_list: List[Graph],
) -> List[Graph]:

Successfully merging this pull request may close these issues:

[Feature][GraphRAG] Enhance GraphRAG search by vector index