EmbeddingFramework
Modular • Extensible • Production-Ready
📚 Documentation
A modular, extensible, and production-ready Python framework for working with embeddings, vector databases, and cloud storage providers.
Designed for AI, NLP, and semantic search applications, EmbeddingFramework provides a unified API to process, store, and query embeddings across multiple backends.
✨ Features
🔹 Multi-Vector Database Support
- ChromaDB – Local and persistent vector storage.
- Milvus – High-performance distributed vector database.
- Pinecone – Fully managed vector database service.
- Weaviate – Open-source vector search engine.
🔹 Cloud Storage Integrations
- AWS S3 – Store and retrieve embeddings or documents.
- Google Cloud Storage (GCS) – Scalable object storage.
- Azure Blob Storage – Enterprise-grade cloud storage.
🔹 Embedding Providers
- OpenAI Embeddings – State-of-the-art embedding generation.
- Easily extendable to other providers.
🔹 File Processing & Preprocessing
- Automatic file type detection.
- Text extraction from multiple formats, including .txt, .pdf, .docx, .csv, .xls, and .xlsx.
- Preprocessing utilities for cleaning and normalizing text.
- Intelligent text splitting for optimal embedding performance.
- Large dataset handling for Excel files with efficient chunking to preserve embedding context.
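To make the text-splitting idea concrete, here is a minimal sketch of a fixed-size splitter with overlap between neighboring chunks. The function name `split_text` and its parameters are assumptions made for illustration; they are not taken from the framework's API, and the framework's own splitter may work differently.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that context spanning
    a chunk boundary is not lost. Hypothetical helper, not part of
    EmbeddingFramework.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A character-based splitter is the simplest variant; splitting on sentence or paragraph boundaries usually preserves meaning better, at the cost of more logic.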
🔹 Utilities
- Retry logic for robust API calls.
- File utilities for safe and efficient I/O.
- Modular architecture for easy extension.
📦 Installation & Setup
# Basic installation
pip install embeddingframework
# With development dependencies
pip install embeddingframework[dev]
⚡ Quick Start Example
from embeddingframework.adapters.openai_embedding_adapter import OpenAIEmbeddingAdapter
from embeddingframework.adapters.vector_dbs import ChromaDBAdapter
# Initialize embedding provider
embedding_provider = OpenAIEmbeddingAdapter(api_key="YOUR_OPENAI_API_KEY")
# Initialize vector database
vector_db = ChromaDBAdapter(persist_directory="./chroma_store")
# Generate embeddings
embeddings = embedding_provider.embed_texts(["Hello world", "EmbeddingFramework is awesome!"])
# Store embeddings
vector_db.add_texts(["Hello world", "EmbeddingFramework is awesome!"], embeddings)
📂 Project Structure
embeddingframework/
│
├── adapters/ # Vector DB & storage adapters
│ ├── base.py
│ ├── chromadb_adapter.py
│ ├── milvus_adapter.py
│ ├── pinecone_adapter.py
│ ├── weaviate_adapter.py
│ ├── storage/ # Cloud storage adapters
│
├── processors/ # File processing logic
├── utils/ # Helper utilities
└── tests/ # Test suite
🧪 Testing
pytest --maxfail=1 --disable-warnings -q
With coverage:
pytest --cov=embeddingframework --cov-report=term-missing
🔄 CI/CD
This project includes a GitHub Actions workflow (.github/workflows/python-package.yml) for:
- Automated testing with coverage.
- Version bumping & changelog generation.
- PyPI publishing.
- GitHub release creation.
📜 License
This project is licensed under the MIT License – see the LICENSE file for details.
🤝 Contributing
Contributions, issues, and feature requests are welcome!
Feel free to check the issues page.
- Fork the repository.
- Create a new branch (feature/my-feature).
- Commit your changes.
- Push to your branch.
- Open a Pull Request.
🌟 Why EmbeddingFramework?
- Unified API – Work with multiple vector DBs and storage providers seamlessly.
- Extensible – Add new adapters with minimal effort.
- Production-Ready – Built with scalability and reliability in mind.
- Developer-Friendly – Clean, modular, and well-documented codebase.
📖 Full Documentation Overview
The sections below cover installation, supported backends, file processing, utilities, the CLI, and an end-to-end example.
1️⃣ Introduction
EmbeddingFramework is designed to simplify the integration of embeddings, vector databases, and cloud storage into AI-powered applications. It provides:
- A unified API for multiple backends.
- Extensible architecture for adding new providers.
- Production-ready reliability with retries, error handling, and modular design.
2️⃣ Installation
pip install embeddingframework
pip install embeddingframework[dev] # For development
3️⃣ Supported Vector Databases
| Database | Type | Key Features |
|---|---|---|
| ChromaDB | Local | Persistent storage, lightweight |
| Milvus | Distributed | High-performance, scalable |
| Pinecone | Managed | Fully hosted, easy to scale |
| Weaviate | Open-source | Semantic search, hybrid queries |
4️⃣ Cloud Storage Integrations
EmbeddingFramework supports:
- AWS S3
- Google Cloud Storage
- Azure Blob Storage
Example:
from embeddingframework.adapters.storage.s3_storage_adapter import S3StorageAdapter
storage = S3StorageAdapter(bucket_name="my-bucket")
storage.upload_file("local.txt", "remote.txt")
5️⃣ Embedding Providers
Currently supported:
- OpenAI Embeddings
- Easily extendable to HuggingFace, Cohere, etc.
Example:
from embeddingframework.adapters.openai_embedding_adapter import OpenAIEmbeddingAdapter
provider = OpenAIEmbeddingAdapter(api_key="YOUR_KEY")
embeddings = provider.embed_texts(["Hello", "World"])
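Adding another provider presumably comes down to exposing the same `embed_texts` method used above. The sketch below assumes exactly that and nothing more: the `EmbeddingProvider` protocol and the `HashEmbeddingProvider` class are hypothetical names, and the framework's real base class (if any) may require more than this.

```python
import hashlib
from typing import Protocol


class EmbeddingProvider(Protocol):
    """Assumed interface: mirrors the embed_texts call used in the examples."""

    def embed_texts(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbeddingProvider:
    """Toy provider that derives a fixed-size vector from a SHA-256 digest.

    Useful only for offline tests; the vectors carry no semantic meaning.
    """

    def __init__(self, dimensions: int = 8) -> None:
        # A SHA-256 digest has 32 bytes, which caps the vector size.
        if not 1 <= dimensions <= 32:
            raise ValueError("dimensions must be between 1 and 32")
        self.dimensions = dimensions

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Scale each byte into the range [0.0, 1.0].
            vectors.append([byte / 255.0 for byte in digest[: self.dimensions]])
        return vectors


provider: EmbeddingProvider = HashEmbeddingProvider(dimensions=4)
embeddings = provider.embed_texts(["Hello", "World"])
```

A deterministic stand-in like this can be handy in unit tests, since it needs no API key and no network access.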
6️⃣ File Processing
EmbeddingFramework provides a robust and extensible file processing pipeline that can handle a wide variety of file formats and sizes. This includes:
- Automatic File Type Detection – The framework automatically determines the file type and routes it to the appropriate parser.
- Text Extraction – Supports extracting text from:
  - .txt – Plain text files
  - .pdf – PDF documents
  - .docx – Microsoft Word documents
  - .csv – Comma-separated values
  - .xls / .xlsx – Microsoft Excel spreadsheets (including multi-sheet workbooks)
- Preprocessing Utilities – Cleans and normalizes extracted text for better embedding quality (e.g., removing stopwords, normalizing whitespace).
- Intelligent Text Splitting – Splits large documents into smaller, context-friendly chunks for optimal embedding performance.
- Large Dataset Handling for Excel – Efficiently processes large Excel files by:
  - Reading all sheets in the workbook.
  - Converting each row into a string representation.
  - Chunking rows into manageable segments to avoid exceeding embedding context limits.
  - Applying quality filters to remove empty or low-value chunks.
This design ensures that even massive datasets can be processed without memory overload or loss of semantic context.
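The Excel steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the framework's implementation: the workbook is modeled as a dict of sheet name to rows (something like `pandas.read_excel(path, sheet_name=None)` could produce comparable data), and `row_to_text`, `chunk_workbook`, and `rows_per_chunk` are hypothetical names. `min_quality_length` borrows its name from the advanced example in this section, but its exact semantics there are not documented, so the length check below is a guess.

```python
Row = list[object]


def row_to_text(row: Row) -> str:
    """Join the non-empty cells of a row into one line of text."""
    cells = [str(cell).strip() for cell in row if cell is not None]
    return " | ".join(cell for cell in cells if cell)


def chunk_workbook(
    sheets: dict[str, list[Row]],
    rows_per_chunk: int = 2000,
    min_quality_length: int = 50,
) -> list[str]:
    """Turn every sheet into text chunks of at most rows_per_chunk rows."""
    if rows_per_chunk <= 0:
        raise ValueError("rows_per_chunk must be positive")

    chunks = []
    for sheet_name, rows in sheets.items():
        lines = [row_to_text(row) for row in rows]
        lines = [line for line in lines if line]  # drop rows with no content
        for start in range(0, len(lines), rows_per_chunk):
            body = "\n".join(lines[start:start + rows_per_chunk])
            # Quality filter: skip chunks too short to be worth embedding.
            if len(body) >= min_quality_length:
                chunks.append(f"[sheet: {sheet_name}]\n{body}")
    return chunks
```

Prefixing each chunk with its sheet name is one way to keep some context once rows are separated from their workbook.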
Example:
from embeddingframework.processors.file_processor import FileProcessor
processor = FileProcessor()
# Process a PDF
pdf_text = processor.process_file("document.pdf")
# Process a large Excel file with multiple sheets
excel_text = processor.process_file("large_dataset.xlsx")
# Process a CSV file
csv_text = processor.process_file("data.csv")
# Process a DOCX file
docx_text = processor.process_file("report.docx")
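How the framework detects file types internally is not documented here. One common approach, shown below as an assumption, is to dispatch on the file extension. The `PARSERS` table and `detect_parser` function are hypothetical and only illustrate the routing step.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a parser label.
PARSERS = {
    ".txt": "text",
    ".pdf": "pdf",
    ".docx": "docx",
    ".csv": "csv",
    ".xls": "excel",
    ".xlsx": "excel",
}


def detect_parser(path: str) -> str:
    """Return the parser label for a path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return PARSERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Extension-based detection is cheap but trusts the file name; inspecting the file's leading bytes is more robust when names are unreliable.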
Advanced Usage:
# Asynchronous processing with custom chunk sizes and quality filters
import asyncio
from embeddingframework.processors.file_processor import FileProcessor

processor = FileProcessor()

async def process_files():
    await processor.process_file_async(
        "large_dataset.xlsx",
        chunk_size=2000,
        text_chunk_size=1000,
        merge_target_size=3000,
        parallel=True,
        min_quality_length=50
    )

asyncio.run(process_files())
7️⃣ Utilities
- Retry logic
- File utilities
- Preprocessing helpers
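The retry utility's actual interface is not shown in this README. As a rough idea of what retry logic for API calls usually looks like, here is a minimal decorator with exponential backoff. The name `retry` and its parameters are assumptions for illustration only.

```python
import functools
import time


def retry(max_attempts=3, base_delay=0.5, exceptions=(Exception,)):
    """Retry the wrapped function on the given exceptions.

    The delay doubles after each failed attempt (exponential backoff).
    Hypothetical helper, not the framework's own retry utility.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


# Usage: a call that fails twice before succeeding.
calls = {"count": 0}


@retry(max_attempts=3, base_delay=0.01, exceptions=(ConnectionError,))
def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky_call()
```

Restricting `exceptions` to transient errors matters in practice, since retrying on something like an authentication failure only delays the error.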
8️⃣ CLI Usage
EmbeddingFramework includes a CLI:
embeddingframework --help
9️⃣ Advanced Configurations
- Custom vector DB adapters
- Custom embedding providers
- Batch processing
- Async support
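As an illustration of the first item, a custom vector DB adapter might look like the in-memory sketch below. It assumes an adapter only needs the `add_texts(texts, embeddings)` method seen in the examples; the `query` method, the class name, and the cosine-similarity ranking are hypothetical additions, and the real base class in `adapters/base.py` may define a different contract.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorAdapter:
    """Hypothetical adapter that keeps texts and vectors in Python lists."""

    def __init__(self) -> None:
        self._texts: list[str] = []
        self._embeddings: list[list[float]] = []

    def add_texts(self, texts: list[str], embeddings: list[list[float]]) -> None:
        if len(texts) != len(embeddings):
            raise ValueError("texts and embeddings must have the same length")
        self._texts.extend(texts)
        self._embeddings.extend(embeddings)

    def query(self, embedding: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        """Return the top_k stored texts ranked by cosine similarity."""
        scored = [
            (text, cosine_similarity(embedding, vector))
            for text, vector in zip(self._texts, self._embeddings)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

A brute-force scan like this is fine for tests and small corpora; the supported databases exist precisely because it does not scale.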
🔟 End-to-End Example
from embeddingframework.adapters.openai_embedding_adapter import OpenAIEmbeddingAdapter
from embeddingframework.adapters.vector_dbs import ChromaDBAdapter
provider = OpenAIEmbeddingAdapter(api_key="KEY")
db = ChromaDBAdapter(persist_directory="./store")
texts = ["AI is amazing", "EmbeddingFramework is powerful"]
embeddings = provider.embed_texts(texts)
db.add_texts(texts, embeddings)
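For corpora much larger than the two texts above, it is usually safer to embed and store in batches so that a single request stays within provider limits. The helper below is a sketch; `batched` is a hypothetical name, and a suitable batch size depends on the embedding provider.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Intended use with the objects from the end-to-end example (not run here):
# for batch in batched(texts, 100):
#     db.add_texts(batch, provider.embed_texts(batch))
```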
📊 Feature Matrix
| Feature | Supported |
|---|---|
| Multi-DB Support | ✅ |
| Cloud Storage | ✅ |
| File Processing | ✅ |
| Retry Logic | ✅ |
| CLI | ✅ |
| Async | ✅ |
📚 Learn More
For the full documentation, visit:
👉 EmbeddingFramework Docs