12/18/2023

Learn how to use AI-powered features like image description and speech transcription in MarkItDown

Using AI Features in MarkItDown

MarkItDown integrates with Large Language Models to provide advanced features like image description and speech transcription. This guide explains how to set up and use these AI-powered features.

Setting Up AI Integration

To use AI features, you'll need an OpenAI API key. Here's how to set it up:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from your environment
# Image description needs a model that accepts image input, such as gpt-4o
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
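
As a small safeguard, the check below fails early with a clear message when the key is missing, rather than surfacing an authentication error on the first conversion. It is a minimal sketch: `require_api_key` and `build_converter` are hypothetical helper names, not part of MarkItDown, and the model name is only a default that you would replace with whichever vision-capable model your account offers.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before using AI features")
    return key


def build_converter(model="gpt-4o"):
    """Build a MarkItDown instance wired to an OpenAI client (hypothetical helper)."""
    # Imported lazily so the key check above works without the packages installed.
    from markitdown import MarkItDown
    from openai import OpenAI

    client = OpenAI(api_key=require_api_key())
    return MarkItDown(llm_client=client, llm_model=model)
```

Calling `build_converter()` once at startup keeps the key check and the client construction in one place.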

Image Description

When processing images, MarkItDown can ask the configured LLM to generate a description and include it in the converted output:

result = md.convert("example.jpg")
print(result.text_content)  # Includes AI-generated image description

Depending on the model and the image, the description may cover:

  • Detailed visual description
  • Object identification
  • Text recognition (OCR)
  • Scene understanding
  • Contextual information
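
To describe several images in one pass, a loop like the following may help. It is a sketch under assumptions: `describe_images` is a hypothetical helper, the extension set is illustrative, and the converter is any object exposing `convert(path)` whose result has a `text_content` attribute, as `md` does above.

```python
from pathlib import Path

# Illustrative set of extensions; adjust to the formats you actually process.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def describe_images(converter, paths):
    """Map each image path to the text the converter returns for it."""
    descriptions = {}
    for raw in paths:
        path = Path(raw)
        # Skip anything that does not look like an image by its extension.
        if path.suffix.lower() in IMAGE_EXTENSIONS:
            descriptions[str(path)] = converter.convert(str(path)).text_content
    return descriptions
```

For example, `describe_images(md, ["example.jpg", "notes.txt"])` would send only `example.jpg` to the model.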

Speech Transcription

For audio files, MarkItDown can transcribe speech to text:

result = md.convert("audio_file.mp3")
print(result.text_content)  # Includes transcribed text

Depending on the transcription backend in use, the output may include:

  • Multi-language support
  • Speaker diarization
  • Timestamp markers
  • Punctuation and formatting
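
When a folder mixes audio with other files, it can be useful to check the file type before requesting a transcript. The sketch below guesses the type from the file name with the standard `mimetypes` module. `is_audio` and `transcribe` are hypothetical helpers, and the converter is assumed to behave like `md` above.

```python
import mimetypes


def is_audio(path):
    """Guess from the file name whether `path` is an audio file."""
    mime, _ = mimetypes.guess_type(str(path))
    return mime is not None and mime.startswith("audio/")


def transcribe(converter, path):
    """Return the transcript, or None for non-audio files and empty results."""
    if not is_audio(path):
        return None
    text = converter.convert(str(path)).text_content
    return text.strip() or None
```

Returning `None` instead of raising lets a batch job skip unsupported files and continue.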

Best Practices

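  1. Choose a model suited to the task; image description needs a model that accepts image input
  2. Account for rate limits and per-request API costs, especially in batch jobs
  3. Check file sizes before conversion; large images and long recordings can take longer and cost more
  4. Cache results when processing the same files multiple times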
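
Points 3 and 4 can be combined in a thin wrapper. The following is a minimal sketch, not a MarkItDown feature: `CachedConverter` and `MAX_BYTES` are hypothetical names, the size limit is an arbitrary example value, and the cache lives in memory only. Results are keyed on a SHA-256 hash of the file contents, so a renamed copy of the same file does not trigger a second API call.

```python
import hashlib
from pathlib import Path

MAX_BYTES = 20 * 1024 * 1024  # Illustrative limit, not a MarkItDown or API constant.


class CachedConverter:
    """Wrap a converter and reuse results for files with identical contents."""

    def __init__(self, converter, max_bytes=MAX_BYTES):
        self._converter = converter
        self._max_bytes = max_bytes
        self._cache = {}

    def convert_text(self, path):
        path = Path(path)
        size = path.stat().st_size
        if size > self._max_bytes:
            raise ValueError(f"{path} is {size} bytes; limit is {self._max_bytes}")
        key = hashlib.sha256(path.read_bytes()).hexdigest()
        if key not in self._cache:
            # Only a cache miss reaches the wrapped converter, and so the API.
            self._cache[key] = self._converter.convert(str(path)).text_content
        return self._cache[key]
```

With the converter from the setup section, `CachedConverter(md).convert_text("example.jpg")` would call the model on the first request and return the stored text afterwards. Persisting the cache to disk is left out to keep the sketch short.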

For more information about API configuration and advanced settings, refer to our documentation.