German Vocabulary Learning System
This system automatically generates German vocabulary blog posts and podcast episodes from word lists.
Features
- Automated Content Generation: Processes German word lists to create comprehensive learning content
- Etymology & Memory Techniques: AI-generated word origins and Japanese phonetic memory aids
- Multi-Speed Audio: Slow, normal, and fast pronunciation practice
- SSML Audio Generation: High-quality text-to-speech with language switching
- Podcast Integration: Automatic RSS feed updates for podcast distribution
- File Monitoring: Watches for new word files and auto-generates content
Quick Start
1. Setup Dependencies
pip install -r requirements.txt
2. Environment Variables
export ANTHROPIC_API_KEY="your_anthropic_api_key"
export AZURE_SPEECH_KEY="your_azure_speech_key"
export AZURE_SPEECH_REGION="your_azure_region"
3. Add Word File
Create a text file in data/words/ with German words (one per line), for example data/words/my_words.txt
Example content:
lernen
verstehen
sprechen
hören
lesen
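As a rough illustration of the one-word-per-line format, a word file could be read as below. This is only a sketch: `load_words` is a hypothetical helper, not a function taken from generate_vocab_content.py, and skipping blank lines and duplicates is an assumption about sensible behavior. Reading as UTF-8 matters for words with umlauts such as hören.

```python
from pathlib import Path


def load_words(path):
    """Return the words in a one-word-per-line file, in order.

    Hypothetical helper: blank lines are skipped and repeated words
    are kept only once (both are assumptions, not documented behavior).
    """
    seen = set()
    words = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words
```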
4. Generate Content
Manual Generation:
python generate_vocab_content.py
Automatic Monitoring:
python watch_words.py
Output Files
Blog Posts
- Location: _posts/YYYY-MM-DD-german-vocab-episode-X.md
- Format: Jekyll-compatible markdown with front matter
- Content: Vocabulary entries with etymology, memory techniques, examples, and quiz
Audio Files
- Location: assets/audio/episode-X.mp3
- Format: MP3 with Japanese/German language switching
- Features: Multiple speaking speeds, pronunciation practice, quiz section
SSML Files (Debug)
- Location: data/ssml/episode-X-ssml.xml
- Purpose: Debug SSML markup for TTS generation
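The three naming patterns above can be expressed as a small helper. This is a sketch only: `episode_paths` is a hypothetical name, and how the real generator picks the episode number or date is not described here.

```python
from datetime import date


def episode_paths(episode, day=None):
    """Build the output paths for one episode (hypothetical helper).

    `day` defaults to today's date; the patterns follow the locations
    listed in this README.
    """
    day = date.today() if day is None else day
    return {
        "post": f"_posts/{day.isoformat()}-german-vocab-episode-{episode}.md",
        "audio": f"assets/audio/episode-{episode}.mp3",
        "ssml": f"data/ssml/episode-{episode}-ssml.xml",
    }
```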
Content Structure
Each vocabulary entry follows this format:
## 単語1: lernen
**意味**: 学ぶ
**語源**: 中高ドイツ語の「lernen」から派生...
**覚え方**: 「レルネン」→「レール音」→電車のレール音を聞いて学ぶ
**ゆっくり**: lernen... lernen... lernen
**普通**: lernen, lernen, lernen
**早口**: lernen-lernen-lernen
**例文トレーニング**:
1. Ich möchte Deutsch lernen. (ドイツ語を学びたいです。)
2. Kinder lernen schnell. (子供たちは早く学びます。)
3. Wir lernen jeden Tag etwas Neues. (私たちは毎日何か新しいことを学びます。)
**ゆっくり**: lernen... lernen... lernen
**普通**: lernen, lernen, lernen
**早口**: lernen-lernen-lernen
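The three speed lines in the entry differ only in the separator between repetitions. A minimal sketch of producing them, assuming a hypothetical helper `speed_drill_lines` (the actual formatting lives in format_vocabulary_content(), whose signature is not shown here):

```python
def speed_drill_lines(word, repeats=3):
    """Return the slow / normal / fast practice lines for one word.

    Hypothetical helper that mirrors the sample entry in this README.
    """
    slow = "... ".join([word] * repeats)
    normal = ", ".join([word] * repeats)
    fast = "-".join([word] * repeats)
    return [
        f"**ゆっくり**: {slow}",
        f"**普通**: {normal}",
        f"**早口**: {fast}",
    ]
```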
SSML Features
The audio generation uses advanced SSML markup:
- Language Switching: <lang xml:lang="ja-JP"> and <lang xml:lang="de-DE">
- Speaking Rates: Slow (0.75), Normal (1.0), Fast (1.25)
- Breaks: Strategic pauses for comprehension
- Quiz Timing: 3-second pauses for quiz answers
Example SSML:
<speak>
<lang xml:lang="ja-JP">今日のドイツ語単語は</lang>
<lang xml:lang="de-DE">lernen</lang>
<lang xml:lang="ja-JP">です。</lang>
<break time="1s"/>
<lang xml:lang="ja-JP">ゆっくり:</lang>
<prosody rate="0.75">
<lang xml:lang="de-DE">lernen</lang>
</prosody>
</speak>
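The example above could be assembled programmatically along these lines. This is a sketch under assumptions: `lang_span` and `build_word_ssml` are hypothetical names, and the markup simply mirrors the README example. A real Azure request may need additional attributes on the root element and a voice selection, which the example above does not show.

```python
from xml.sax.saxutils import escape

# Rates as listed in this README.
RATES = {"slow": "0.75", "normal": "1.0", "fast": "1.25"}


def lang_span(lang, text):
    """Wrap text in a <lang> element, escaping XML special characters."""
    return f'<lang xml:lang="{lang}">{escape(text)}</lang>'


def build_word_ssml(word):
    """Build SSML that introduces a word, pauses, then repeats it slowly."""
    parts = [
        "<speak>",
        lang_span("ja-JP", "今日のドイツ語単語は"),
        lang_span("de-DE", word),
        lang_span("ja-JP", "です。"),
        '<break time="1s"/>',
        lang_span("ja-JP", "ゆっくり:"),
        f'<prosody rate="{RATES["slow"]}">',
        lang_span("de-DE", word),
        "</prosody>",
        "</speak>",
    ]
    return "\n".join(parts)
```

Escaping the text keeps the output well-formed even if a word or sentence contains characters such as `&` or `<`, which is one possible cause of the SSML syntax problems mentioned under Troubleshooting.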
GitHub Actions Integration
The system includes automated workflows:
vocabulary-pipeline.yml
- Trigger: Push to data/words/*.txt
- Actions:
  - Generate vocabulary content
  - Create audio files
  - Update podcast RSS
  - Auto-commit changes
Environment Secrets Required
- ANTHROPIC_API_KEY: For content generation
- AZURE_SPEECH_KEY: For audio generation
- AZURE_SPEECH_REGION: Azure region for Speech Services
File Watcher
The watch_words.py script provides real-time monitoring:
- Watches the data/words/ directory
- Processes new .txt files automatically
- Generates content and audio
- Auto-commits to git (optional)
Usage
python watch_words.py
Features:
- File change detection
- Duplicate processing prevention
- Error handling and retry logic
- Environment validation
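To make change detection, duplicate prevention, and retry concrete, here is a minimal polling sketch using only the standard library. It is not the code in watch_words.py, which may well use a different mechanism; `file_key`, `find_pending`, `watch`, and the `max_polls` parameter are all hypothetical.

```python
import time
from pathlib import Path


def file_key(path):
    """Identify a file version by name and modification time."""
    return (path.name, path.stat().st_mtime_ns)


def find_pending(directory, processed):
    """Return .txt files whose current version has not been processed."""
    return [
        path
        for path in sorted(Path(directory).glob("*.txt"))
        if file_key(path) not in processed
    ]


def watch(directory, handle, interval=2.0, max_polls=None):
    """Poll a directory and call handle(path) once per new file version.

    A file is marked as processed only after handle() succeeds, so a
    failed file is retried on the next poll.
    """
    processed = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_pending(directory, processed):
            try:
                handle(path)
                processed.add(file_key(path))
            except Exception as exc:  # keep the watcher alive on errors
                print(f"Failed to process {path}: {exc}")
        polls += 1
        time.sleep(interval)
```

Keying on modification time as well as name means an edited word file is picked up again, while an untouched one is never processed twice.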
Podcast RSS Integration
The system automatically updates podcast.rss with:
- Vocabulary episodes
- Regular German learning content
- iTunes-compatible metadata
- Episode numbering and descriptions
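As an illustration of what an RSS update involves, the sketch below appends one episode item to a feed using the standard library. It is an assumption-laden example, not the project's implementation: `add_episode_item` is a hypothetical function, the URL in the usage note is a placeholder, and iTunes-specific tags are left out.

```python
import xml.etree.ElementTree as ET


def add_episode_item(rss_xml, title, audio_url, size_bytes, description=""):
    """Append an <item> with an MP3 enclosure to the feed's <channel>.

    Takes and returns the feed as a string (hypothetical helper).
    """
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("RSS document has no <channel> element")
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "description").text = description
    ET.SubElement(
        item,
        "enclosure",
        url=audio_url,
        length=str(size_bytes),
        type="audio/mpeg",
    )
    return ET.tostring(root, encoding="unicode")
```

A call might look like `add_episode_item(feed_text, "Episode 1", "https://example.com/assets/audio/episode-1.mp3", 2_000_000)`. Note that ElementTree rewrites namespace prefixes it does not know about, so a feed that already uses iTunes tags would likely need `ET.register_namespace` before parsing.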
Directory Structure
/
├── data/
│ ├── words/ # Word list files (.txt)
│ └── ssml/ # Generated SSML files (debug)
├── _posts/ # Generated blog posts
├── assets/
│ └── audio/ # Generated MP3 files
├── .github/
│ └── workflows/
│ └── vocabulary-pipeline.yml
├── generate_vocab_content.py # Main content generator
├── watch_words.py # File watcher
├── requirements.txt # Python dependencies
└── podcast.rss # Updated RSS feed
Troubleshooting
Common Issues
- API Key Errors
  - Verify ANTHROPIC_API_KEY is set
  - Check API key permissions
- Audio Generation Fails
  - Verify AZURE_SPEECH_KEY and AZURE_SPEECH_REGION are set
  - Check Azure Speech Services subscription status
- File Watcher Not Responding
  - Check file permissions in data/words/
  - Verify required directories exist
- Empty Audio Files
  - Check SSML syntax in debug files
  - Verify Azure Speech Services quotas and limits
Debug Commands
# Test content generation
python generate_vocab_content.py
# Check environment
python -c "import os; print('ANTHROPIC_API_KEY' in os.environ)"
# Test Azure Speech Services setup
python -c "import azure.cognitiveservices.speech as speechsdk; print('Azure Speech Services ready')"
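The environment check above only covers one variable. A slightly fuller sketch that reports every missing variable is shown below; `missing_env_vars` is a hypothetical helper, and treating an empty value as missing is an assumption.

```python
import os

# The three variables named in this README.
REQUIRED_VARS = ("ANTHROPIC_API_KEY", "AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set")
```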
Customization
Modify Content Format
Edit format_vocabulary_content() in generate_vocab_content.py
Adjust Audio Settings
Modify voice selection and audio config in generate_audio_file()
Change File Patterns
Update glob patterns in watch_words.py and workflow files
Performance Notes
- Rate Limits: Built-in delays between API calls
- File Size: Audio files typically 1-3MB per episode
- Generation Time: ~2-3 minutes per 5-word episode
- Concurrent Processing: Single-threaded to respect API limits
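The combination of single-threaded processing and delays between API calls can be sketched as follows. The function name, the one-second default, and the injectable `sleep` parameter are assumptions for illustration; the actual delay used by the generator is not stated here.

```python
import time


def process_sequentially(words, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(word) for each word in order, pausing between calls.

    No pause is added before the first call. `sleep` can be replaced
    in tests to avoid real waiting.
    """
    results = []
    for index, word in enumerate(words):
        if index > 0:
            sleep(delay_seconds)
        results.append(fetch(word))
    return results
```

With five words and a one-second delay this adds four seconds of waiting, a small share of the 2-3 minutes quoted above, so most of the generation time presumably goes to the API and speech calls themselves.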
Contributing
- Test changes with small word files first
- Verify SSML output in debug files
- Check audio quality before committing
- Update documentation for new features