
Embedding Vectors Guide: Vector Fundamentals for AI

Learn the fundamentals of embeddings and vectors for AI applications, from core concepts to real-world use with semantic search and RAG.

AI Unlocked Team
17/01/2568

Embeddings are at the heart of modern AI applications: they let computers capture the meaning of text and find related information.

What Are Embeddings?

The Basic Concept

Embedding = converting text into numbers

"Hello world" → [0.1, 0.3, -0.2, 0.8, ...]

Why convert?
- Computers work with numbers, not raw text
- Similarity between texts can be computed
- Enables search, clustering, and classification

How It Works

Text → Embedding Model → Vector

"I love programming"
→ [0.12, -0.45, 0.78, 0.33, ...]

"I enjoy coding"
→ [0.14, -0.42, 0.75, 0.35, ...]

Both sentences have similar meanings
→ their vectors end up close together

Vector Dimensions

Dimension = the number of values in a vector

text-embedding-3-small: 1536 dimensions
text-embedding-3-large: 3072 dimensions

More dimensions mean:
- Higher accuracy
- More storage
- Slower computation
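
As a back-of-the-envelope illustration of the storage cost (an illustrative calculation only; 4 bytes per value assumes float32 storage):

```python
def storage_bytes(n_vectors, dimensions, bytes_per_value=4):
    """Approximate storage for n float32 vectors of the given dimension."""
    return n_vectors * dimensions * bytes_per_value

# One million vectors: text-embedding-3-small vs text-embedding-3-large
small = storage_bytes(1_000_000, 1536)
large = storage_bytes(1_000_000, 3072)
print(f"{small / 1e9:.1f} GB vs {large / 1e9:.1f} GB")  # 6.1 GB vs 12.3 GB
```

Doubling the dimensions doubles both the storage and the per-comparison work, so the accuracy gain has to justify it.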

Creating Embeddings

OpenAI Embeddings

from openai import OpenAI

client = OpenAI()

# Single text
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")  # 1536

# Multiple texts
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Text 1", "Text 2", "Text 3"]
)

embeddings = [d.embedding for d in response.data]

Voyage AI Embeddings

import voyageai

vo = voyageai.Client()

# Create embeddings
result = vo.embed(
    ["Hello world", "Another text"],
    model="voyage-3",
    input_type="document"
)

embeddings = result.embeddings

Local Embeddings (Sentence Transformers)

from sentence_transformers import SentenceTransformer

# Load model (runs locally)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
texts = ["Hello world", "Another text"]
embeddings = model.encode(texts)

print(f"Shape: {embeddings.shape}")  # (2, 384)

Cosine Similarity

import numpy as np

def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example
vec1 = [0.1, 0.2, 0.3]
vec2 = [0.15, 0.25, 0.28]
vec3 = [-0.5, -0.6, -0.7]

print(cosine_similarity(vec1, vec2))  # ~0.99 (similar)
print(cosine_similarity(vec1, vec3))  # ~-0.97 (opposite direction)

Finding Similar Documents

from openai import OpenAI
import numpy as np

client = OpenAI()

# Documents to search
documents = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning uses algorithms",
    "Cats are cute animals",
    "Deep learning is a subset of ML"
]

# Create embeddings for documents
doc_embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
).data

# Query
query = "AI and neural networks"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding

# Find most similar
similarities = [
    cosine_similarity(query_embedding, doc.embedding)
    for doc in doc_embeddings
]

# Sort by similarity
results = sorted(
    zip(documents, similarities),
    key=lambda x: x[1],
    reverse=True
)

for doc, score in results[:3]:
    print(f"{score:.3f}: {doc}")
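
With many documents, the Python loop above becomes the bottleneck; the same ranking can be computed in one matrix operation with NumPy. A sketch (the function name is mine, and toy 2-D vectors stand in for real embeddings):

```python
import numpy as np

def rank_by_similarity(query_embedding, doc_embeddings):
    """Cosine similarity of one query vector against a matrix of document vectors."""
    docs = np.asarray(doc_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    return np.argsort(scores)[::-1], scores  # indices from most to least similar

# Toy vectors standing in for real embeddings
order, scores = rank_by_similarity([1.0, 0.1], [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(order)  # most similar document index first
```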

Vector Databases

Why Vector DBs?

Problems with brute-force search:
- Very slow once the dataset grows
- O(n) work per query

Vector databases:
- Use indexing algorithms (HNSW, IVF) for approximate nearest-neighbor search
- Sub-linear query time
- Scale to millions of vectors

Managed Services:
- Pinecone - Fully managed, easy to use
- Weaviate Cloud - Feature-rich
- Qdrant Cloud - High performance

Self-Hosted:
- Chroma - Simple, lightweight
- Qdrant - High performance
- Milvus - Enterprise scale
- pgvector - PostgreSQL extension

Using Chroma

import chromadb
from chromadb.utils import embedding_functions

# Create client
client = chromadb.Client()

# Use OpenAI embeddings (assumes OPENAI_API_KEY is set in the environment)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    model_name="text-embedding-3-small"
)

# Create collection
collection = client.create_collection(
    name="my_documents",
    embedding_function=openai_ef
)

# Add documents
collection.add(
    documents=[
        "Python programming guide",
        "JavaScript for beginners",
        "Machine learning basics"
    ],
    ids=["doc1", "doc2", "doc3"],
    metadatas=[
        {"topic": "python"},
        {"topic": "javascript"},
        {"topic": "ml"}
    ]
)

# Query
results = collection.query(
    query_texts=["AI tutorials"],
    n_results=2
)

print(results['documents'])
print(results['distances'])

Using pgvector

-- Enable extension
CREATE EXTENSION vector;

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

-- Insert with embedding
INSERT INTO documents (content, embedding)
VALUES ('Hello world', '[0.1, 0.2, ...]');

-- Search by similarity
SELECT content, embedding <=> '[0.15, 0.25, ...]' AS distance
FROM documents
ORDER BY distance
LIMIT 5;
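
The `<=>` operator above is pgvector's cosine distance, i.e. `1 - cosine similarity`, so smaller values mean more similar. A quick NumPy sketch of what it computes:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance as pgvector's <=> operator computes it: 1 - cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_distance([0.1, 0.2], [0.1, 0.2]))  # ≈ 0 for identical directions
```

pgvector also offers `<->` (Euclidean distance) and `<#>` (negative inner product); pick the operator that matches how your embedding model was trained.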

# Python with psycopg2
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://...")
register_vector(conn)

cur = conn.cursor()

# Insert
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    ("Hello world", embedding)
)

# Search
cur.execute("""
    SELECT content, embedding <=> %s AS distance
    FROM documents
    ORDER BY distance
    LIMIT 5
""", (query_embedding,))

results = cur.fetchall()

Use Cases

1. Semantic Search

# Traditional keyword search:
# "programming" does not match "coding"

# Semantic search with embeddings:
# "programming" matches "coding" because the meanings are close

def semantic_search(query, documents, top_k=5):
    # Get query embedding (get_embedding() is assumed to wrap one of the APIs above)
    query_emb = get_embedding(query)

    # Calculate similarities
    scores = []
    for doc in documents:
        doc_emb = get_embedding(doc['text'])
        score = cosine_similarity(query_emb, doc_emb)
        scores.append((doc, score))

    # Return top results
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

2. RAG (Retrieval Augmented Generation)

def rag_answer(question, knowledge_base):
    # 1. Find relevant documents
    relevant_docs = semantic_search(question, knowledge_base, top_k=3)

    # 2. Create context
    context = "\n".join([doc['text'] for doc, _ in relevant_docs])

    # 3. Generate answer with LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Answer based on context:\n{context}"
            },
            {"role": "user", "content": question}
        ]
    )

    return response.choices[0].message.content

3. Document Clustering

from sklearn.cluster import KMeans

# Get embeddings for all documents
embeddings = [get_embedding(doc) for doc in documents]

# Cluster
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(embeddings)

# Group documents by cluster
for doc, cluster in zip(documents, clusters):
    print(f"Cluster {cluster}: {doc[:50]}...")

4. Duplicate Detection

def find_duplicates(documents, threshold=0.95):
    # Embed each document once, instead of re-embedding inside the loop
    embeddings = [get_embedding(doc) for doc in documents]
    duplicates = []

    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            similarity = cosine_similarity(embeddings[i], embeddings[j])

            if similarity > threshold:
                duplicates.append((i, j, similarity))

    return duplicates

Best Practices

1. Chunking Strategy

# Don't embed entire documents
# Split into meaningful chunks

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(long_document)
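
If pulling in a dependency is not worth it, a minimal fixed-size character chunker with overlap can be a few lines (a sketch; unlike the splitter above, it has no awareness of sentence or paragraph boundaries):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks, each overlapping the previous."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 1000, chunk_size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 100]
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk.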

2. Batch Processing

# Don't create embeddings one by one
# Batch for efficiency

def batch_embed(texts, batch_size=100):
    embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        embeddings.extend([d.embedding for d in response.data])

    return embeddings

3. Caching

import hashlib

# Module-level cache; persists across calls without the
# mutable-default-argument pitfall
_embedding_cache = {}

def get_embedding_cached(text):
    # Key the cache on a hash of the text
    key = hashlib.md5(text.encode()).hexdigest()

    if key not in _embedding_cache:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        _embedding_cache[key] = response.data[0].embedding

    return _embedding_cache[key]

4. Dimensionality Reduction

# Reduce dimensions for storage/speed
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
    dimensions=512  # Reduce from 1536
)
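
The `dimensions` parameter handles the shortening server-side. If you instead truncate a full-length embedding yourself (as OpenAI's embeddings guide describes for the text-embedding-3 models), re-normalize it to unit length so cosine similarity still behaves sensibly. A sketch:

```python
import numpy as np

def truncate_embedding(embedding, dims):
    """Keep the first `dims` values and re-normalize to unit length."""
    v = np.asarray(embedding, dtype=float)[:dims]
    return v / np.linalg.norm(v)

shortened = truncate_embedding([0.6, 0.8, 0.0, 0.0], dims=2)
print(np.linalg.norm(shortened))  # ~1.0
```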

Summary

Embedding Concepts:

  1. Text → Vector: convert text into numbers
  2. Similarity: compare how alike two texts are
  3. Vector DB: store and search vectors
  4. Semantic Search: search by meaning
  5. RAG: combine retrieval with an LLM

Best Practices:

  • Chunk documents appropriately
  • Batch embedding requests
  • Use caching
  • Choose right dimensions

Use Cases:

  • Semantic search
  • RAG systems
  • Document clustering
  • Duplicate detection
  • Recommendation systems


Written by

AI Unlocked Team