
EmbeddingFramework

Sathishkumar Nagarajan
  • Residence:
    India
  • City:
    Chennai
  • Mail:
    mail@sathishai.com

January 2, 2026

2:45 am

Sathishkumar

EmbeddingFramework

Modular • Extensible • Production-Ready
A Python framework for embeddings, vector databases, and cloud storage providers.


📚 Documentation

Documentation
Rich UI Docs
Explore Features
API Reference

 


A modular, extensible, and production-ready Python framework for working with embeddings, vector databases, and cloud storage providers.
Designed for AI, NLP, and semantic search applications, EmbeddingFramework provides a unified API to process, store, and query embeddings across multiple backends.


✨ Features

🔹 Multi-Vector Database Support

  • ChromaDB – Local and persistent vector storage.
  • Milvus – High-performance distributed vector database.
  • Pinecone – Fully managed vector database service.
  • Weaviate – Open-source vector search engine.

🔹 Cloud Storage Integrations

  • AWS S3 – Store and retrieve embeddings or documents.
  • Google Cloud Storage (GCS) – Scalable object storage.
  • Azure Blob Storage – Enterprise-grade cloud storage.

🔹 Embedding Providers

  • OpenAI Embeddings – State-of-the-art embedding generation.
  • Easily extendable to other providers.

🔹 File Processing & Preprocessing

  • Automatic file type detection.
  • Text extraction from multiple formats, including .txt, .pdf, .docx, .csv, .xls, and .xlsx.
  • Preprocessing utilities for cleaning and normalizing text.
  • Intelligent text splitting for optimal embedding performance.
  • Large dataset handling for Excel files with efficient chunking to preserve embedding context.

🔹 Utilities

  • Retry logic for robust API calls.
  • File utilities for safe and efficient I/O.
  • Modular architecture for easy extension.

📦 Installation & Setup

# Basic installation
pip install embeddingframework

# With development dependencies
pip install embeddingframework[dev]

⚡ Quick Start Example

from embeddingframework.adapters.openai_embedding_adapter import OpenAIEmbeddingAdapter
from embeddingframework.adapters.vector_dbs import ChromaDBAdapter

# Initialize embedding provider
embedding_provider = OpenAIEmbeddingAdapter(api_key="YOUR_OPENAI_API_KEY")

# Initialize vector database
vector_db = ChromaDBAdapter(persist_directory="./chroma_store")

# Generate embeddings
embeddings = embedding_provider.embed_texts(["Hello world", "EmbeddingFramework is awesome!"])

# Store embeddings
vector_db.add_texts(["Hello world", "EmbeddingFramework is awesome!"], embeddings)

📂 Project Structure

embeddingframework/
│
├── adapters/                # Vector DB & storage adapters
│   ├── base.py
│   ├── chromadb_adapter.py
│   ├── milvus_adapter.py
│   ├── pinecone_adapter.py
│   ├── weaviate_adapter.py
│   └── storage/             # Cloud storage adapters
│
├── processors/              # File processing logic
├── utils/                    # Helper utilities
└── tests/                    # Test suite

🧪 Testing

pytest --maxfail=1 --disable-warnings -q

With coverage:

pytest --cov=embeddingframework --cov-report=term-missing

🔄 CI/CD

This project includes a GitHub Actions workflow (.github/workflows/python-package.yml) for:

  • Automated testing with coverage.
  • Version bumping & changelog generation.
  • PyPI publishing.
  • GitHub release creation.

📜 License

MIT License

This project is licensed under the MIT License – see the LICENSE file for details.


🤝 Contributing

Contributions, issues, and feature requests are welcome!
Feel free to check the issues page.

  1. Fork the repository.
  2. Create a new branch (feature/my-feature).
  3. Commit your changes.
  4. Push to your branch.
  5. Open a Pull Request.

🌟 Why EmbeddingFramework?

  • Unified API – Work with multiple vector DBs and storage providers seamlessly.
  • Extensible – Add new adapters with minimal effort.
  • Production-Ready – Built with scalability and reliability in mind.
  • Developer-Friendly – Clean, modular, and well-documented codebase.

📖 Full Documentation Overview

Below is a comprehensive, end-to-end guide covering all features, usage patterns, and advanced configurations of EmbeddingFramework.

1️⃣ Introduction

EmbeddingFramework is designed to simplify the integration of embeddings, vector databases, and cloud storage into AI-powered applications. It provides:

  • A unified API for multiple backends.
  • Extensible architecture for adding new providers.
  • Production-ready reliability with retries, error handling, and modular design.

2️⃣ Installation

pip install embeddingframework
pip install embeddingframework[dev]  # For development

3️⃣ Supported Vector Databases

Database    Type          Key Features
--------    -----------   -------------------------------
ChromaDB    Local         Persistent storage, lightweight
Milvus      Distributed   High-performance, scalable
Pinecone    Managed       Fully hosted, easy to scale
Weaviate    Open-source   Semantic search, hybrid queries

4️⃣ Cloud Storage Integrations

EmbeddingFramework supports:

  • AWS S3
  • Google Cloud Storage
  • Azure Blob Storage

Example:

from embeddingframework.adapters.storage.s3_storage_adapter import S3StorageAdapter
storage = S3StorageAdapter(bucket_name="my-bucket")
storage.upload_file("local.txt", "remote.txt")

5️⃣ Embedding Providers

Currently supported:

  • OpenAI Embeddings
  • Easily extendable to HuggingFace, Cohere, etc.

Example:

from embeddingframework.adapters.openai_embedding_adapter import OpenAIEmbeddingAdapter
provider = OpenAIEmbeddingAdapter(api_key="YOUR_KEY")
embeddings = provider.embed_texts(["Hello", "World"])
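
A custom provider might look like the sketch below. It assumes the only contract a provider must satisfy is the embed_texts(texts) call shown above; the framework's actual base class is not shown here. HashingEmbeddingProvider is a hypothetical name, not part of EmbeddingFramework. It builds vectors with a simple hashing trick so the example runs offline, and the vectors carry no real semantic meaning.

```python
import hashlib
import math
from typing import List


class HashingEmbeddingProvider:
    """Hypothetical provider (an assumption, NOT part of EmbeddingFramework).

    Mirrors the embed_texts() call shown above, but derives vectors from
    token hashes so the example needs no API key or network access.
    """

    def __init__(self, dimensions: int = 16) -> None:
        if dimensions <= 0:
            raise ValueError("dimensions must be positive")
        self.dimensions = dimensions

    def _embed_one(self, text: str) -> List[float]:
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            # Hash each token to a stable bucket index.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[bucket] += 1.0
        # L2-normalize so vectors can be compared by dot product.
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        """Return one fixed-length vector per input text."""
        return [self._embed_one(text) for text in texts]


provider = HashingEmbeddingProvider(dimensions=16)
embeddings = provider.embed_texts(["Hello", "World"])
print(len(embeddings), len(embeddings[0]))  # prints: 2 16
```

Because the class exposes the same method name as the OpenAI adapter, it could in principle be swapped in wherever a provider is expected, provided the framework relies only on that method.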

6️⃣ File Processing

EmbeddingFramework provides a robust and extensible file processing pipeline that can handle a wide variety of file formats and sizes. This includes:

  • Automatic File Type Detection – The framework automatically determines the file type and routes it to the appropriate parser.
  • Text Extraction – Supports extracting text from:
    • .txt – Plain text files
    • .pdf – PDF documents
    • .docx – Microsoft Word documents
    • .csv – Comma-separated values
    • .xls / .xlsx – Microsoft Excel spreadsheets (including multi-sheet workbooks)
  • Preprocessing Utilities – Cleans and normalizes extracted text for better embedding quality (e.g., removing stopwords, normalizing whitespace).
  • Intelligent Text Splitting – Splits large documents into smaller, context-friendly chunks for optimal embedding performance.
  • Large Dataset Handling for Excel – Efficiently processes large Excel files by:
    • Reading all sheets in the workbook.
    • Converting each row into a string representation.
    • Chunking rows into manageable segments to avoid exceeding embedding context limits.
    • Applying quality filters to remove empty or low-value chunks.

This design ensures that even massive datasets can be processed without memory overload or loss of semantic context.
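
The Excel steps listed above (row to string, chunking, quality filtering) can be sketched in plain Python. This is an illustration under assumptions, not the framework's implementation: row_to_text, chunk_rows, max_chars, and the dictionary row format are hypothetical, and rows are passed in as dictionaries rather than read from a real workbook.

```python
from typing import Dict, Iterable, List


def row_to_text(row: Dict[str, object]) -> str:
    """Render one row as 'column: value' pairs, skipping empty cells."""
    parts = [f"{col}: {val}" for col, val in row.items() if val not in (None, "")]
    return "; ".join(parts)


def chunk_rows(
    rows: Iterable[Dict[str, object]],
    max_chars: int = 200,
    min_quality_length: int = 10,
) -> List[str]:
    """Group row strings into chunks of roughly max_chars characters.

    A single row longer than max_chars is kept whole in its own chunk.
    Chunks shorter than min_quality_length are dropped as low value.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for row in rows:
        line = row_to_text(row)
        if not line:
            continue  # skip fully empty rows
        # Start a new chunk when this row would exceed the size budget.
        if current and current_len + len(line) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return [chunk for chunk in chunks if len(chunk.strip()) >= min_quality_length]


rows = [
    {"name": "Ada", "role": "Engineer"},
    {"name": "", "role": None},  # empty row, filtered out
    {"name": "Grace", "role": "Admiral"},
]
chunks = chunk_rows(rows, max_chars=30)
print(chunks)
```

Keeping rows whole inside a chunk is one way to preserve context: each embedded chunk then contains complete records rather than fragments of a row.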

Example:

from embeddingframework.processors.file_processor import FileProcessor

processor = FileProcessor()

# Process a PDF
pdf_text = processor.process_file("document.pdf")

# Process a large Excel file with multiple sheets
excel_text = processor.process_file("large_dataset.xlsx")

# Process a CSV file
csv_text = processor.process_file("data.csv")

# Process a DOCX file
docx_text = processor.process_file("report.docx")
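
The calls above pass different file types to one method, which implies some form of routing by type. A minimal sketch of extension-based routing follows, assuming a simple lookup table; PARSERS, read_plain_text, read_csv_as_text, and route_file are hypothetical names, and only .txt and .csv are covered so the example stays within the standard library.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict


def read_plain_text(path: Path) -> str:
    """Return the file contents unchanged."""
    return path.read_text(encoding="utf-8")


def read_csv_as_text(path: Path) -> str:
    """Flatten each CSV row into one comma-separated line of text."""
    with path.open(newline="", encoding="utf-8") as handle:
        return "\n".join(", ".join(row) for row in csv.reader(handle))


# Hypothetical registry mapping file extensions to parser functions.
PARSERS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain_text,
    ".csv": read_csv_as_text,
}


def route_file(path_str: str) -> str:
    """Pick a parser from the file extension (case-insensitive)."""
    path = Path(path_str)
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    return parser(path)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.CSV"
    sample.write_text("name,role\nAda,Engineer\n", encoding="utf-8")
    csv_text = route_file(str(sample))

print(csv_text)
```

A registry like this makes adding a format a one-line change, which fits the extensibility goal described earlier. Detection based on file content rather than extension would need a different approach.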

Advanced Usage:

# Asynchronous processing with custom chunk sizes and quality filters
import asyncio

from embeddingframework.processors.file_processor import FileProcessor

processor = FileProcessor()

async def process_files():
    await processor.process_file_async(
        "large_dataset.xlsx",
        chunk_size=2000,
        text_chunk_size=1000,
        merge_target_size=3000,
        parallel=True,
        min_quality_length=50,
    )

asyncio.run(process_files())
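
The text-splitting step is not specified in detail in this guide. One common approach, sketched here as an assumption, packs whole sentences greedily into chunks of bounded size so that no sentence is cut unless it is longer than the limit by itself. split_text and max_chars are illustrative names; how the framework's text_chunk_size and merge_target_size map onto this logic is not confirmed by this document.

```python
import re
from typing import List


def split_text(text: str, max_chars: int = 1000) -> List[str]:
    """Greedily pack sentences into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that exceeds the budget.
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


text = "AI is amazing. EmbeddingFramework is powerful. It splits text."
chunks = split_text(text, max_chars=50)
print(chunks)
# prints: ['AI is amazing. EmbeddingFramework is powerful.', 'It splits text.']
```

Splitting on sentence boundaries is a simple heuristic. Production splitters often add overlap between chunks or count tokens instead of characters.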

7️⃣ Utilities

  • Retry logic
  • File utilities
  • Preprocessing helpers
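
The retry utility's interface is not documented here, so the following is a generic sketch of retry with exponential backoff rather than the framework's own helper. The function name retry and its parameters (attempts, base_delay, retry_on, sleep) are assumptions chosen for illustration.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[Exception], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or attempts are exhausted.

    The wait doubles after each failure: base_delay, 2x, 4x, and so on.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Demo: a call that fails twice, then succeeds. Delays are recorded
# instead of slept so the example finishes instantly.
calls = {"count": 0}
delays: List[float] = []


def flaky_embed() -> List[float]:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return [0.1, 0.2]


result = retry(flaky_embed, attempts=5, base_delay=0.5, sleep=delays.append)
print(result, calls["count"], delays)  # prints: [0.1, 0.2] 3 [0.5, 1.0]
```

Restricting retry_on to transient errors (timeouts, rate limits) avoids retrying failures that will never succeed, such as an invalid API key.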

8️⃣ CLI Usage

EmbeddingFramework includes a CLI:

embeddingframework --help

9️⃣ Advanced Configurations

  • Custom vector DB adapters
  • Custom embedding providers
  • Batch processing
  • Async support

🔟 End-to-End Example

from embeddingframework.adapters.openai_embedding_adapter import OpenAIEmbeddingAdapter
from embeddingframework.adapters.vector_dbs import ChromaDBAdapter

provider = OpenAIEmbeddingAdapter(api_key="KEY")
db = ChromaDBAdapter(persist_directory="./store")

texts = ["AI is amazing", "EmbeddingFramework is powerful"]
embeddings = provider.embed_texts(texts)
db.add_texts(texts, embeddings)

📊 Feature Matrix

Feature            Supported
----------------   ---------
Multi-DB Support   Yes
Cloud Storage      Yes
File Processing    Yes
Retry Logic        Yes
CLI                Yes
Async              Yes

📚 Learn More

For the full documentation, visit:
👉 EmbeddingFramework Docs

Posted in R&D Labs
© 2025 All Rights Reserved.