Published: Stabilizing Document Vector Inference with Doc2Vec
I wrote a tech blog post for the DSOC department at Sansan, Inc., where I was interning. Sharing it here as well.
Unsupervised document classification with Doc2Vec: Using a Doc2Vec model pre-trained on Japanese Wikipedia, I implemented and evaluated a method to assign arbitrary labels (baseball, soccer, basketball, etc.) to articles from the Livedoor News corpus (Sports Watch) without any annotation. Classification is achieved by comparing document vectors and word vectors in the same space using cosine similarity.
Stability analysis of infer_vector(): Doc2Vec.infer_vector() produces different results on each call because it runs a mini training pass internally, making the epochs parameter (default=5) directly tied to accuracy and stability. I ran experiments across documents of varying lengths — measuring the mean and variance of cosine similarity over 100 trials of two independent inferences on the same document — and found the following:
- Increasing epochs brings the cosine similarity closer to 1 and improves stability
- Cosine similarity exceeds 0.99 around epochs=30, with diminishing returns beyond that
- Shorter documents require more epochs to stabilize; for documents with 50+ words, around 30 is sufficient
- Conclusion: setting epochs to roughly 20–50 depending on document length is effective in practice
For details, see Stabilizing Document Vector Inference with Doc2Vec — Sansan Builders Box and the experiment repository.