Stabilizing Document Vector Inference with Doc2Vec

Category: Tech Blog
Tags: #Machine Learning #NLP
Published: 2019-04-11

I wrote a tech blog post for the DSOC department at Sansan, Inc., where I was interning. Sharing it here as well.

Unsupervised document classification with Doc2Vec: Using a Doc2Vec model pre-trained on Japanese Wikipedia, I implemented and evaluated a method that assigns arbitrary labels (baseball, soccer, basketball, etc.) to articles from the Livedoor News corpus (Sports Watch) without any annotation. Because Doc2Vec embeds documents and words in the same vector space, an article can be classified by inferring its document vector and picking the label word whose vector has the highest cosine similarity to it.

Stability analysis of infer_vector(): Doc2Vec.infer_vector() returns a different vector on each call because it runs a small internal training pass from a random initialization, so its epochs parameter (default 5) directly affects both accuracy and stability. I ran experiments across documents of varying lengths, measuring the mean and variance of the cosine similarity between two independent inferences of the same document over 100 trials, and found the following:

  • Increasing epochs brings cosine similarity closer to 1 and improves stability
  • Cosine similarity exceeds 0.99 around epochs=30, with diminishing returns beyond that
  • Shorter documents require more epochs to stabilize; for documents with 50+ words, around 30 is sufficient
  • Conclusion: Setting epochs to roughly 20–50 depending on document length is effective in practice

For details, see Stabilizing Document Vector Inference with Doc2Vec — Sansan Builders Box and the experiment repository.
