feat(server): workspace embedding improve (#12022)

fix AI-10
fix AI-109
fix PD-2484

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

- **New Features**
  - Added a method to check if a document requires embedding, improving embedding efficiency.
  - Enhanced document embeddings with enriched metadata, including title, summary, creation/update dates, and author information.
  - Introduced a new type for document fragments with extended metadata fields.

- **Improvements**
  - Embedding logic now conditionally processes only documents needing updates.
  - Embedding content now includes document metadata for more informative context.
  - Expanded and improved test coverage for embedding scenarios and workspace behaviors.
  - Event emission added for workspace embedding updates on client version mismatch.
  - Job queueing enhanced with prioritization and explicit job IDs for better management.
  - Job queue calls updated to include priority and context identifiers in a structured format.

- **Bug Fixes**
  - Improved handling of ignored documents in embedding matches.
  - Fixed incorrect document ID assignment in embedding job queueing.

- **Tests**
  - Added and updated snapshot and behavioral tests for embedding and workspace document handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
darkskygit
2025-05-23 10:16:14 +00:00
parent 262f1a47a4
commit 2a80fbb993
9 changed files with 326 additions and 54 deletions

@@ -175,6 +175,55 @@ export class CopilotWorkspaceConfigModel extends BaseModel {
    };
  }
  @Transactional()
  async checkDocNeedEmbedded(workspaceId: string, docId: string) {
    // NOTE: check whether the document needs (re-)embedding.
    // The query collects the document's snapshot and pending updates, then
    // flags the document when any of the following holds:
    // 1. no embedding exists for the document yet
    // 2. the snapshot or an update is newer than the embedding
    // 3. the embedding is older than 10 minutes
    // The conditions are OR-ed, so any one of them triggers re-embedding.
    const result = await this.db.$queryRaw<{ needs_embedding: boolean }[]>`
      SELECT
        EXISTS (
          WITH docs AS (
            SELECT
              s.workspace_id,
              s.guid AS doc_id,
              s.updated_at
            FROM
              snapshots s
            WHERE
              s.workspace_id = ${workspaceId}
              AND s.guid = ${docId}
            UNION
            ALL
            SELECT
              u.workspace_id,
              u.guid AS doc_id,
              u.created_at AS updated_at
            FROM
              "updates" u
            WHERE
              u.workspace_id = ${workspaceId}
              AND u.guid = ${docId}
          )
          SELECT
            1
          FROM
            docs
            LEFT JOIN ai_workspace_embeddings e
              ON e.workspace_id = docs.workspace_id
              AND e.doc_id = docs.doc_id
          WHERE
            e.updated_at IS NULL
            OR docs.updated_at > e.updated_at
            OR e.updated_at < NOW() - INTERVAL '10 minutes'
        ) AS needs_embedding;
    `;
    return result[0]?.needs_embedding ?? false;
  }
  // ================ embeddings ================
  async checkEmbeddingAvailable(): Promise<boolean> {
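
To make the predicate in `checkDocNeedEmbedded` easier to reason about, here is a minimal in-memory sketch of the same decision. It is not part of the commit: `needsEmbedding`, `STALE_AFTER_MS`, and the array-based inputs are hypothetical names chosen for illustration. It assumes the timestamps have already been loaded from `snapshots`, `"updates"`, and `ai_workspace_embeddings`.

```typescript
// Hypothetical mirror of the SQL WHERE clause; not the project's API.
// Mirrors INTERVAL '10 minutes' from the query.
const STALE_AFTER_MS = 10 * 60 * 1000;

function needsEmbedding(
  changes: Date[], // snapshots.updated_at and updates.created_at for the doc
  embeddings: Date[], // ai_workspace_embeddings.updated_at rows for the doc
  now: Date = new Date()
): boolean {
  // No snapshot or update rows: the EXISTS subquery selects nothing.
  if (changes.length === 0) return false;
  // LEFT JOIN matched no embedding row, so e.updated_at IS NULL.
  if (embeddings.length === 0) return true;
  const cutoff = now.getTime() - STALE_AFTER_MS;
  // EXISTS is true if any joined (change, embedding) pair satisfies
  // "change newer than embedding" OR "embedding older than the cutoff".
  return embeddings.some(
    e =>
      e.getTime() < cutoff ||
      changes.some(c => c.getTime() > e.getTime())
  );
}
```

Because the three conditions are OR-ed, an embedding older than ten minutes is reported as needing work whenever the document has at least one snapshot or update row, even if nothing changed since the last embedding. If the ten-minute window is meant purely as a throttle, combining it with the "newer change" condition using AND may be closer to that intent, but the source does not say which is intended.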