glove.c
Port of GloVe Embeddings in C

The program below shows how to use glove.c with pretrained embeddings taken from the StanfordNLP/GloVe repository. The embeddings are derived from the Wikipedia 2014 + Gigaword 5 datasets, comprising 6B tokens and a 400K-word vocabulary, with 50 dimensions per vector.
See src/main.c
Compile the program with glove.c and libmath (the C math library, linked with -lm on most Unix-like toolchains).
See examples/java
See examples/python
glove.c uses a hashtable with open-chaining to get near-constant access times for all embeddings, at the expense of extra storage overhead.
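To make the trade-off concrete, here is a minimal sketch of a chained hashtable that maps a word to its embedding. It is not the actual glove.c implementation: the names `table`, `entry`, `table_new`, `table_put`, `table_get` and `table_free`, as well as the choice of the djb2 string hash, are assumptions made for illustration. Each bucket holds a linked list of entries, so a lookup costs one hash computation plus a short list walk, while every entry pays for an extra `next` pointer and its own allocations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types for illustration; not the glove.c API. */
typedef struct entry {
    char *word;          /* owned copy of the key */
    float *vec;          /* owned copy of the embedding */
    struct entry *next;  /* next entry in the same bucket (the chain) */
} entry;

typedef struct {
    entry **buckets;     /* array of chain heads */
    size_t n_buckets;
    size_t dim;          /* number of floats per embedding */
} table;

/* djb2 string hash (an assumed choice; any reasonable hash works). */
static unsigned long hash_str(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

table *table_new(size_t n_buckets, size_t dim) {
    assert(n_buckets > 0 && dim > 0);
    table *t = malloc(sizeof *t);
    if (!t) return NULL;
    t->buckets = calloc(n_buckets, sizeof *t->buckets);
    if (!t->buckets) { free(t); return NULL; }
    t->n_buckets = n_buckets;
    t->dim = dim;
    return t;
}

/* Prepends a new entry to its bucket's chain. Returns 0 on success,
   -1 on allocation failure. Duplicate keys are not checked. */
int table_put(table *t, const char *word, const float *vec) {
    entry *e = malloc(sizeof *e);
    if (!e) return -1;
    e->word = malloc(strlen(word) + 1);
    e->vec = malloc(t->dim * sizeof *e->vec);
    if (!e->word || !e->vec) {
        free(e->word); free(e->vec); free(e);
        return -1;
    }
    strcpy(e->word, word);
    memcpy(e->vec, vec, t->dim * sizeof *e->vec);
    size_t i = hash_str(word) % t->n_buckets;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    return 0;
}

/* Walks the chain for the word's bucket; returns NULL if absent. */
const float *table_get(const table *t, const char *word) {
    for (entry *e = t->buckets[hash_str(word) % t->n_buckets]; e; e = e->next)
        if (strcmp(e->word, word) == 0) return e->vec;
    return NULL;
}

void table_free(table *t) {
    if (!t) return;
    for (size_t i = 0; i < t->n_buckets; i++) {
        entry *e = t->buckets[i];
        while (e) {
            entry *next = e->next;
            free(e->word); free(e->vec); free(e);
            e = next;
        }
    }
    free(t->buckets);
    free(t);
}
```

With a bucket count on the order of the vocabulary size, chains stay short on average, which is where the near-constant access time comes from.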
The steps for training a GloVe model on a custom corpus are provided in the official GitHub repository. Once training is started by executing the demo.sh script, we see the following output written to the console,
After training is complete, the vectors.txt file can be found in the root directory of the project. Along with vectors.txt, we also need the vector size and vocab size from the console output, as given above. These three parameters are passed to the glove_create function, which returns an instance of glove that lets us look up embeddings for words.
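The vector size matters because each line of vectors.txt holds a word followed by that many space-separated floating-point components. The sketch below parses one such line; it is an illustration of the file format only, not the code inside glove.c, and `parse_vector_line` is a hypothetical helper name. The actual signature of glove_create should be taken from the repository sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of glove.c.
   Parses "word v1 v2 ... v_dim" into `word` and `vec`.
   Returns 0 on success, -1 if the line is malformed, the word does not
   fit in `word_cap` bytes, or fewer than `dim` components are present.
   Extra trailing components are ignored. */
int parse_vector_line(const char *line, char *word, size_t word_cap,
                      float *vec, size_t dim) {
    assert(line && word && vec && word_cap > 0);

    /* The word is everything up to the first space. */
    const char *sp = strchr(line, ' ');
    if (!sp) return -1;
    size_t len = (size_t)(sp - line);
    if (len == 0 || len >= word_cap) return -1;
    memcpy(word, line, len);
    word[len] = '\0';

    /* strtof skips leading whitespace and reports where it stopped. */
    const char *p = sp;
    for (size_t i = 0; i < dim; i++) {
        char *end;
        float v = strtof(p, &end);
        if (end == p) return -1;  /* no number could be read */
        vec[i] = v;
        p = end;
    }
    return 0;
}
```

A loader would call this once per line, up to the vocab size, and store each result in the lookup table. Passing a vector size that does not match the file makes the parse fail or silently truncate vectors, which is why the value reported during training has to be carried over exactly.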