Published: Stabilizing Document Vector Inference with Doc2Vec
I wrote a tech blog post for the DSOC department at Sansan, Inc., where I was interning. Sharing it here as well.
Unsupervised document classification with Doc2Vec: Using a Doc2Vec model pre-trained on Japanese Wikipedia, I implemented and evaluated a method to assign arbitrary labels (baseball, soccer, basketball, etc.) to articles from the Livedoor News corpus (Sports Watch) without any annotation. Classification is achieved by comparing document vectors and word vectors in the same space using cosine similarity.
Stability analysis of infer_vector(): Doc2Vec.infer_vector() produces different results on each call because it runs a mini training pass internally, making the epochs parameter (default=5) directly tied to accuracy and stability. I ran experiments across documents of varying lengths — measuring the mean and variance of cosine similarity over 100 trials of two independent inferences on the same document — and found the following:
- Increasing epochs brings the cosine similarity closer to 1 and improves stability
- Cosine similarity exceeds 0.99 around epochs=30, with diminishing returns beyond that
- Shorter documents require more epochs to stabilize; for documents with 50+ words, around 30 is sufficient
- Conclusion: setting epochs to roughly 20–50 depending on document length is effective in practice
For details, see Stabilizing Document Vector Inference with Doc2Vec — Sansan Builders Box and the experiment repository.