Skip to content

Latest commit

 

History

History
59 lines (33 loc) · 3.27 KB

File metadata and controls

59 lines (33 loc) · 3.27 KB

Development Notes

What works

  • Usage of open-source models available on HuggingFace, served locally.

  • Agentic workflow - with tool callings (both local tools and from multiple MCP servers).

  • Tool Features:

    • Video understanding - able to query and ask any questions related to video content.

    • Transcripting (speech-to-text) - able to extract speech from video and generate transcript formatted in SRT.

  • Safety (via system prompt):

    • Prompt injection mitigation - explicitly instruct model to not treating video content as command

    • Guardrailing - model to answer only the questions/perform only the tasks related to the video analysis capability

    • HITL / clarification - model to ask for clarification if unsure about instructions.

  • Chat interface to interact with agent.

  • Communication of chat interface (frontend - React & Tauri) with agent (backend - Python).

  • Persistent chat histories - both restoring the messages to chat interface and agent’s state (to resume conversations).

  • A simple C# launcher - that launch the builds (.exe) of both backend and frontend.

What doesn't (fully) work

  • OpenVINO model as orchestrator (while able to serve OpenVINO model locally on CPU/iGPU, having issue on tool bindings - see technical comments).

  • Tool Features:

    • Nice looking reports - expect to have proper sections and visual elements
  • Persistent storage for video uploaded and files generated by agent.

Potential improvement

  • Alternative workflow that generate full context of video, including frames, transcript, scene audio descriptions, and timestamps, that can be useful for downstream processes, e.g.

    • Question of video content at specific timestamp
    • OCR/object detection with bounding box can be performed on video frames
    • Nice looking report that generated from full context and visual elements taken from key video frames (images).
  • Structure output from agent, that UI can take and render respective controls e.g. preview or download transcript, images, reports etc.

  • Sending of video data from frontend directly to backend via gRPC - currently it’s by uploading to another temp local file server.

Challenges:

  • Hardware resource constraints leading to high model latency - long feedback loops during testing / verify code changes or prompt changes.

  • Time constraint for getting a working POC (see technical comments below)

Technical comments

  • No LangChain integration for openvino_genai.LLMPipeline (issue), while the model can be run using HuggingFacePipeline and wrap with ChatHuggingFace, there is issue with the tool binding, and not asynchronous invocation support (required for invoking tools from MCP servers). Expecting significant testing & debugging needed for implement a custom wrapper, resorted to serve the model using local Ollama instead for the mean time.

  • Planned the workflow that generate full context from video (see Potential Improvement) serve for downstream tasks. However decided there are too many things to implement and resorted to use an omni model for video understanding.