fikrikarim/parlor
On-device, real-time multimodal AI. Have natural voice and vision conversations with an AI that runs entirely on your machine. Powered by Gemma 3n E2B and Kokoro.
Projects and tools that deal with multimodal data, combining modalities like text, images, and speech.
Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses | 400+ Papers | Perception, Cognition, Planning, Interaction, Agentic System
GEMS: Agent-Native Multimodal Generation with Memory and Skills
Official skills for the GLM family of models.
🦞 OpenClaw visual management panel with built-in AI assistant (tool calling + vision + multimodal + i18n in 11 languages), one-click install
MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high fidelity and high expressiveness in complex real-world scenarios, covering stable long-form speech, multi-speaker dialogue, voice/character design, environmental sound effects, and real-time streaming TTS.
🔍 LLM Application Development in Practice, Part 1: A Full-Stack Guide to RAG. Read online: https://datawhalechina.github.io/all-in-rag/
A framework for efficient model inference with omni-modality models
[CVPR 2026] Official Implementation of UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Put your wardrobe in rows. Self-hosted AI-powered wardrobe management app.
An all-in-one AI solution compatible with any known AI service on the planet
✔ (Completed) Extremely comprehensive deep learning notes covering Tudui's PyTorch tutorials, Mu Li's Dive into Deep Learning, Andrew Ng's Deep Learning, and Dafei's LLM Agent course
Use PEFT or full-parameter training to run CPT/SFT/DPO/GRPO on 600+ LLMs (Qwen3.5, DeepSeek-R1, GLM-5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, Phi4, ...) (AAAI 2025).
A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon.
Fast Multimodal LLM on Mobile Devices
日本語LLMまとめ - Overview of Japanese LLMs
[ECCV 2024 Best Paper Candidate & TPAMI 2025] PointLLM: Empowering Large Language Models to Understand Point Clouds
MOVA: Towards Scalable and Synchronized Video–Audio Generation
Paddle Multimodal Integration and eXploration, supporting mainstream multimodal tasks, including end-to-end large-scale multimodal pretraining models and a diffusion model toolbox, with high performance and flexibility.
NEO Series: Native Vision-Language Models from First Principles
A Toolbox for MultiModal Recommendation. Integrating 10+ Models...
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
⚡ Self-hostable YesCaptcha-compatible captcha solver built with FastAPI, Playwright, and OpenAI-compatible multimodal models.
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Local surveillance + AI vision — LAN-based smartphone-powered AI monitoring framework with structured event output for data acquisition and analysis.
RAI is a vendor-agnostic agentic framework for Physical AI robotics, using ROS 2 tools to perform complex actions, defined scenarios, free interface execution, log summaries, voice interaction, and more.
📱 ClawApp — mobile chat client for the OpenClaw AI agent | streaming conversations · image sending/receiving · tool calling · PWA + APK
Self-hosted OpenClaw gateway + agent runtime in .NET (NativeAOT-friendly)
State-of-the-art CLIP/SigLIP embedding models finetuned for the fashion domain. +57% increase in evaluation metrics vs FashionCLIP 2.0.
Auto-updated paper list
Interactively browse multimodal tabular data
Multimodal document parser for high quality data understanding and extraction
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
☁️ Build multimodal AI applications with cloud-native stack
Run agents that work for you based on what you do. AI finally knows what you are doing.
Datasets, Transforms and Models specific to Computer Vision
A comprehensive list of PyTorch-related content on GitHub, such as different models, implementations, helper libraries, tutorials, etc.
Hung-yi Lee's Deep Learning Tutorial (recommended by Prof. Hung-yi Lee 👍, known as the "Apple Book" 🍎). PDF download: https://github.com/datawhalechina/leedl-tutorial/releases
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
A toolkit for making real-world machine learning and data analysis applications in C++
Your new Mentor for Data Science E-Learning.
Advanced AI Explainability for computer vision. Support for CNNs, Vision Transformers, Classification, Object detection, Segmentation, Image similarity and more.
A collection of resources and papers on Diffusion Models
A collaboration friendly studio for NeRFs
🐍 Geometric Computer Vision Library for Spatial AI
Refine high-quality datasets and visual AI models
An open source SDK for logging, storing, querying, and visualizing multimodal and multi-rate data
Content aware image resize library
A Python Library for Outlier and Anomaly Detection on Tabular, Text, and Image Data
SeaTunnel is a multimodal, high-performance, distributed tool for massive data integration.
Mobile-Agent: The Powerful GUI Agent Family
Solve Visual Understanding with Reinforced VLMs
Open-source framework for building AI-powered apps in JavaScript, Go, and Python, built and used in production by Google
A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines
High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
A Next-Generation Training Engine Built for Ultra-Large MoE Models
Align Anything: Training All-modality Model with Feedback
Curated tutorials and resources for Large Language Models, AI Painting, and more.
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓
The most accurate document search and store for building AI apps
MTEB: Massive Text Embedding Benchmark
An extensible, state of the art columnar file format. Formerly at @spiraldb, now an Incubation Stage project at LFAI&Data, part of the Linux Foundation.
Easily compute clip embeddings and build a clip retrieval system with them
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.
GenAI Processors is a lightweight Python library that enables efficient, parallel content processing.
Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in pytorch
[ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ.
This repository is the official implementation of Disentangling Writer and Character Styles for Handwriting Generation (CVPR 2023)
Famous Vision Language Models and Their Architectures
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
library supporting NLP and CV research on scientific papers
A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.
This repository provides programs to build Retrieval Augmented Generation (RAG) code for Generative AI with LlamaIndex, Deep Lake, and Pinecone leveraging the power of OpenAI and Hugging Face models for generation and evaluation.
Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
[VLDB' 25] ChatTS: Understanding, Chat, Reasoning about Time Series with TS-MLLM
Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
GraTAG — Production AI Search via Graph-Based Query Decomposition and Triplet-Aligned Generation with Rich Multimodal Representations
Janus-Series: Unified Multimodal Understanding and Generation Models
Instant neural graphics primitives: lightning fast NeRF and more
Drench yourself in Deep Learning, Reinforcement Learning, Machine Learning, Computer Vision, and NLP by learning from these exciting lectures!!
A MNIST-like fashion product database. Benchmark :point_down:
Collections of CVPR 2017–2024 papers, code, interpretations, and livestreams, curated by the Jishi (CVMart) team
Low-code framework for building custom LLMs, neural networks, and other AI models
Production ready toolkit to run AI locally
Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing and product brainstorming, with cleaned-up canonical references under the /Resources folder.