Word Meaning and Word2vec Trailhead Badge help
I'm working on the Word Meaning and Word2vec badge, and the Hands-on: Construct examples for each W2V variant exercise is taking forever (it's already been running for 2 hours). Has anyone else gotten through this? How long did this part take?
My best guess is that I have something wrong, but I'm not sure what, since I've gotten no error messages. When I stopped it, it looks like it's still in the while loop. Can anyone provide some guidance on where I'm wrong and what I might want to look at to get back on track?
while True:
    # TODO: select a random sentence index using random.randint and get that
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0,len(numericalized_sentences)-1)
    sentence = numericalized_sentences[sentence_idx]
    # TODO: Select a random window index using random.randint
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0,len(sentence)-1)
    window = sentence[window_idx:k]
    if len(window) <= n//2:
        continue
Thanks!
Lynda
def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
    examples = []
    while True:
        # TODO: select a random sentence index using random.randint and get that
        # sentence. Be careful to avoid indexing errors.
        sentence_idx = random.randint(0,len(numericalized_sentences)-1)
        sentence = numericalized_sentences[sentence_idx]
        # TODO: Select a random window index using random.randint
        # and obtain that window of size n. Be careful to avoid indexing errors.
        window_idx = random.randint(0,len(sentence)-1)
        window = sentence[window_idx:n]
        if len(window) <= n//2:
            continue
        # TODO: Get the center word and the context words
        center_word = window[int(round(len(window)/2))]
        context_words = window
        context_words.remove(center_word)
        # TODO: Create examples using the guidelines above
        if sg: # if Skip-Gram
            context_word = context_words[random.randint(0, len(context_words)-1)]
            example = [center_word, context_word]
        else: # if CBOW
            example = [context_words, center_word]
        if len(window) < n:
            continue
        if k > 0: # if doing negative sampling
            samples = [random.randint(0, len(vocabulary.index_to_word)-1)
                       for _ in range(k)]
            example.append(samples)
        examples.append(example)
        if len(examples) >= num_examples:
            break
    return examples
Any help on where I'm off and what I should consider changing would be most welcome!
Lynda
I noticed this point too. Did you solve it?
Thanks,
Iago Breno
Hi. No, I have not, though I did get the windowing part fixed; I tested that with an actual index. I'm still having problems figuring out how to get the sentence index, since numericalized_sentences is a nested list. This is what I have now:
while True:
    # TODO: select a random sentence index using random.randint and get that - need to fix this - doesn't work because numericalized_sentences is nested vector - do I just choose from numericalized_sentences[0] - nope, that fails for the sst_vocab with the random.seed provided, do I need to flatten first? how?
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0,len(numericalized_sentences)-1)
    sentence = numericalized_sentences[sentence_idx]
    # TODO: Select a random window index using random.randint - this works now (so keep) (pretty sure the commented out version works better)
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0,len(vocabulary.sentences[sentence])-1)
    window = list((vocabulary.sentences[sentence])[i] for i in range(window_idx, window_idx+n, 1))
    #window = (vocabulary.sentences[sentence])[window_idx:(window_idx+n)] --this version might be better
    if len(window) <= n//2:
        continue
Any guidance you can provide would be appreciated!
Lynda
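To the nested-list question above: numericalized_sentences is already a list of lists of word indices (that is what numericalize produces per sentence), so no flattening is needed. One index picks a sentence, and the window can then be sliced from that sentence. A toy illustration, assuming that shape:

import random

numericalized_sentences = [[4, 17, 2, 9, 31, 6], [8, 3, 22]]   # toy data: each inner list is one numericalized sentence
n = 5

sentence_idx = random.randint(0, len(numericalized_sentences) - 1)
sentence = numericalized_sentences[sentence_idx]   # one indexing step gives a flat list of word indices
window = sentence[1:1 + n]                         # slicing that list gives a window; no flattening required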
I was actually having some problems too. I believe I made some progress. My code is similar to yours, but there is something wrong because I am facing this error:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable `--NotebookApp.iopub_data_rate_limit`.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
I am not able to even test the output because of this. Did you solve the problem? I would appreciate any help.
Thanks,
Iago
I still haven't figured it out, though I have gotten the error you're seeing when I tried to grab a "sentence" from the vocab with too big a window, or tried to calculate len on some of the lists. I did find this (https://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/) but I need to find some time to look it over more; it looks very similar to what the workbook has us doing.
Lynda
This is what I currently have:
def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
    examples = []
    while True:
        # TODO: select a random sentence index using random.randint and get that
        # sentence. Be careful to avoid indexing errors.
        sentence_idx = random.randint(0,len(numericalized_sentences)-1)
        sentence = numericalized_sentences[sentence_idx]
        # TODO: Select a random window index using random.randint
        # and obtain that window of size n. Be careful to avoid indexing errors.
        window_idx = random.randint(0,len(sentence)-1)
        #window = vocabulary.index_to_word[window_idx][:(window_idx+n)] #this version might be better
        window = list(vocabulary.index_to_word[i] for i in range(window_idx,window_idx+n,1))
        if len(window) <= n//2:
            continue
The current window code gets me a KeyError, but if I run it with actual values in a prior cell, e.g. print(list(sst_vocab.index_to_word[i] for i in range(2323,2328,1))), I actually get what I'm expecting. I also know that this could give me an error if the range values exceed the max index. I tried slicing as well, but then I get a hash error, so I can't slice sst_vocab.index_to_word. I don't know if you or anyone else can take this further - this is what I got after being at a PyLadies meeting this evening and getting some help there (but apparently not enough).
Lynda
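The two errors described above are consistent with index_to_word being a dict keyed by integer indices: looking up an index that doesn't exist raises a KeyError, and slicing a dict raises a TypeError because a slice isn't hashable (the "hash error"). The window would therefore come from the numericalized sentence (a plain list) rather than from the vocabulary mapping. A small sketch under that assumption:

index_to_word = {0: 'the', 1: 'movie', 2: 'was', 3: 'great'}   # assumed shape: dict of index -> word

# index_to_word[99]     # would raise KeyError: 99 (index not in the vocabulary)
# index_to_word[0:2]    # would raise TypeError: unhashable type: 'slice' (the "hash error")

sentence = [0, 1, 2, 3]  # a numericalized sentence is a plain list of ints
window = sentence[1:4]   # [1, 2, 3] -- list slicing works and never raises IndexError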
Is the example model list format correct?
# TODO: Get the center word and the context words
context_words = list(window)
center_word = list([context_words.pop(len(window)//2)])
if sg: # if
    context_word = context_words[random.randint(0, len(context_words)-1)]
    example = [center_word, [context_word]]
else: # if CBOW
    example = [context_words,center_word]
if len(window) < n:
    continue
en_wiki_5sgfs_examples: [[[7487], [1921]], [[1918], [1918]]]
en_wiki_5cbowfs_examples: [[[881, 837, 17, 20348], [516]], [[8377, 6, 952, 7183], [4795]]]
en_wiki_5sg15ns_examples: [[[7210], [1776], [44199, 43751, 32134, 21466, 218, 57183, 11483, 49492, 9158, 864, 41347, 58762, 13374, 5752, 12158]], [[1760], [5393], [38247, 56444, 62511, 34776, 61511, 4816, 39989, 45018, 68376, 63302, 27113, 69084, 41322, 1644, 52197]]]
en_wiki_5cbow15ns_examples: [[[16546, 2563, 1956, 184], [24789], [68237, 54984, 49089, 66855, 4173, 23784, 10827, 63819, 34326, 22298, 43896, 44160, 51274, 9606, 59869]], [[72, 316, 506, 222], [903], [2137, 24780, 11554, 47646, 1681, 46126, 30032, 53178, 69729, 65668, 7828, 37709, 64851, 30588, 63414]]]
def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
    examples = []
    while True:
        # TODO: select a random sentence index using random.randint and get that
        # sentence. Be careful to avoid indexing errors.
        sentence_idx = random.randint(0, len(numericalized_sentences)-1)
        sentence = numericalized_sentences[sentence_idx]
        # TODO: Select a random window index using random.randint
        # and obtain that window of size n. Be careful to avoid indexing errors.
        window_idx = random.randint(0, len(sentence)-1)
        window = sentence[window_idx:(window_idx+n)]
        if len(window) <= n//2:
            continue
        # TODO: Get the center word and the context words
        context_words = list(window)
        center_word = list([context_words.pop(len(window)//2)])
        if sg: # if
            context_word = context_words[random.randint(0, len(context_words)-1)]
            example = [center_word, [context_word]]
        else: # if CBOW
            example = [context_words,center_word]
        #example.append(center_word)
        if len(window) < n:
            continue
        if k > 0: # if doing negative sampling
            samples = [random.randint(0, len(vocabulary.index_to_word)-1)
                       for _ in range(k)]
            example.append(samples)
        examples.append(example)
        if len(examples) >= num_examples:
            break
    return examples
It is constructing all samples fine. In the next step, I am getting an error during negative sampling. I will go through the materials mentioned by Lynda.
Thanks
Then on the kNS model the error I got was: RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 3 and 2 at /pytorch/aten/src/TH/generic/THTensorMath.c:3577. The fix was changing each example from
[[830, 2433, 19, 2530], [1439], [1195, 2027, 1607, 2206, 1656, 1489, 2056, 2574, 1710, 1116, 1374, 1843, 2950, 1448, 611]]
to
[[830, 2433, 19, 2530], 1439, [1195, 2027, 1607, 2206, 1656, 1489, 2056, 2574, 1710, 1116, 1374, 1843, 2950, 1448, 611]]
i.e. storing the center word as a plain int rather than a one-element list.
The code for construct_examples
def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
    examples = []
    while True:
        # TODO: select a random sentence index using random.randint and get that
        # sentence. Be careful to avoid indexing errors.
        sentence_idx = random.randint(0, len(numericalized_sentences)-1)
        sentence = numericalized_sentences[sentence_idx]
        # TODO: Select a random window index using random.randint
        # and obtain that window of size n. Be careful to avoid indexing errors.
        window_idx = random.randint(0, len(sentence)-1)
        window = sentence[window_idx:window_idx+n]
        if len(window) <= n//2:
            continue
        # TODO: Get the center word and the context words
        context_words = list(window)
        center_word = context_words.pop(len(window)//2)
        if sg: # if
            context_word = context_words[random.randint(0, len(context_words)-1)]
            example = [center_word, context_word]
        else: # if CBOW
            example = [context_words]
            example.append(center_word)
        if len(window) < n:
            continue
        if k > 0: # if doing negative sampling
            samples = [random.randint(0, len(vocabulary.index_to_word)-1)
                       for _ in range(k)]
            example.append(samples)
        examples.append(example)
        if len(examples) >= num_examples:
            break
    return examples
Now the kNS models are working fine. One of the answer choices for Quiz 10 matches, but Quiz 9 still does not have any match, and I'm not sure why.
It takes around 5 minutes to generate all 8 examples.
def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
    examples = []
    while True:
        # TODO: select a random sentence index using random.randint and get that
        # sentence. Be careful to avoid indexing errors.
        sentence_idx = random.randint(0,len(numericalized_sentences)-1)
        sentence = numericalized_sentences[sentence_idx]
        # TODO: Select a random window index using random.randint
        # and obtain that window of size n. Be careful to avoid indexing errors.
        window_idx = random.randint(0,len(sentence)-1)
        window = sentence[window_idx:(window_idx+(n-1))]
        if len(window) <= n//2:
            continue
        # TODO: Get the center word and the context words
        center_word = window[int(len(window)//2)]
        context_words = list([window.pop(len(window)//2)])
Were you ever able to get an answer for question 9? I'm tempted to just keep moving but since these exercises build on one another I'm not sure that is wise.
Thanks,
Tom
def numericalize(sentence, vocabulary):
    # TODO: Implement
    ret_numericalized = []
    for idx,word in enumerate(sentence):
        ret_numericalized.append(vocabulary.word_to_index[word])
    return (ret_numericalized)
Constructing English Wikipedia examples for 5CBOW-FS model
I'm going to let it run for a bit longer.
if len(window) < n:
    continue

It seems like len(window) will always be less than n, and thus you never append anything to the examples list. This results in the 2 CBOW example lists being empty. However, I feel like you have to assume the provided code is correct.
I took a break from this one to attempt the GloVe and Word Vectors for Sentiment Analysis project, but it also has a numericalize method to implement.
However, while looking at the code, I noticed the Vocabulary class not only has an index_to_word value that I was looking at, but the opposite, WORD_TO_INDEX!
Rather than use a double-nested loop that iterates through the entire vocabulary, it became a one-liner!
Talk about massive overthinking on my part.
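The one-liner itself isn't shown above, but it is presumably just a comprehension over that mapping. A minimal sketch, assuming word_to_index is a dict from word to integer index:

def numericalize(sentence, vocabulary):
    # Look each word up directly in the vocabulary's word -> index mapping.
    return [vocabulary.word_to_index[word] for word in sentence]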
@LABBOO - I believe I found the solution to your original problem.
The problem line is the one that pops the center word out of the window: because context_words is the same list as window, using pop on it reduces window's length, so the len(window) < n check is true every time and the loop never appends anything.
I got around that by creating a copy of the window and using pop on that, as in the sketch below.
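A minimal sketch of that workaround (copying before popping), so the original window keeps its full length and the later len(window) < n check can behave as intended:

window = [4, 17, 2, 9, 31]                          # example window of size n = 5
context_words = list(window)                        # copy the window instead of aliasing it
center_word = context_words.pop(len(window) // 2)   # removes the center word from the copy only
# len(window) is still 5, so 'if len(window) < n: continue' no longer skips every example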
UPDATE: Why is there no edit function?!
While I was able to complete constructing the samples after making sure the window length matched n, there's still something wrong with my code: the output that provides the information for Quiz Question 9 matched none of the listed answers, so my window index construction is most likely incorrect.
I tried a fix for that, and while it did change the results for the tensors for Quiz Question 9, I'm still not seeing a number that matches up with what's in the choices.
Back at it, I guess.
When I spoke with some of the top badge earners, they mentioned doing the same. I've yet to speak with anyone who was able to complete the entire workbook.
So I still can't get #9 but for #10 I'm getting the correct answer. This is what I'm using:
def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
    examples = []
    while True:
        # TODO: select a random sentence index using random.randint and get that
        # sentence. Be careful to avoid indexing errors.
        sentence_idx = random.randint(0,len(numericalized_sentences)-1)
        sentence = numericalized_sentences[sentence_idx]
        # TODO: Select a random window index using random.randint
        # and obtain that window of size n. Be careful to avoid indexing errors.
        window_idx = random.randint(0,len(sentence)-1)
        window = sentence[window_idx:(window_idx+n)]
        if len(window) <= n//2:
            continue
        # TODO: Get the center word and the context words
        center_word = window[int(len(window)//2)]
        context_words = window
        context_words.remove(center_word)
        # TODO: Create examples using the guidelines above
        if sg: # if Skip-Gram
            context_word = context_words[random.randint(0, len(context_words)-1)]
            example = [center_word,context_word]
        else: # if CBOW
            example = [context_words,center_word]
        #if len(window) < n: Note: I changed this line because this is where I kept hanging up
        if len(window) >= n:
            continue
        if k > 0: # if doing negative sampling
            samples = [random.randint(0, len(vocabulary.index_to_word)-1)
                       for _ in range(k)]
            example.append(samples)
        examples.append(example)
        if len(examples) >= num_examples:
            break
    return examples
So I'm trying to work on the next part (Hands-on: Define the Word2Vec models), but my brain is tired, so I'm going to give it a rest until my PyLadies group on Sunday afternoon.
The topic looks a bit old, but let me ask whether you found out the answer for Quiz Question 9.
First, I faced the same issue with the infinite loop, and it was solved. Please pay attention to the center_word: we retrieve it with the pop() method, which returns the item at the given index and removes it from the list. So context_words is window with the center word removed. For CBOW that means
the condition len(window) < n is always true, because one element was removed and len(window) is never greater than n-1. To fix the loop I changed that code; it then ran successfully, and my answer for Question 9 is 39.
I'm just wondering if you have the same result.
Thanks,
Vasilina
Did anyone have luck with this trail?
Vasilina, I too got 39 for question 9. Following is my code snippet for that:
sentence_idx = random.randint(0,len(numericalized_sentences)-1)
sentence = numericalized_sentences[sentence_idx]
# TODO: Select a random window index using random.randint
# and obtain that window of size n. Be careful to avoid indexing errors.
window_idx = random.randint(0,len(sentence)-1)
window = sentence[window_idx:window_idx+n]
if len(window) <= (n//2):
    continue
# TODO: Get the center word and the context words
center_word = sentence[window_idx+(n//2)]
context_words = sentence[window_idx:window_idx+(n//2)] + sentence[window_idx+(n//2)+1:window_idx+n]
I am struggling with "Word2VecModel(nn.Module)" and "def train(...)". Can someone please kindly share the code? The hints and instructions given in the code are not clear and aren't helpful.
Thanks in anticipation!!!
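As a rough orientation only (not the workbook's expected solution), a generic skip-gram model with a full-softmax output and a bare-bones training loop could look something like the sketch below; the embedding size, optimizer, batching, and the assumption that each example is a [center_word, context_word] pair of ints are all illustrative choices:

import torch
import torch.nn as nn
import torch.optim as optim

class Word2VecModel(nn.Module):
    # Illustrative skip-gram variant with full softmax (no negative sampling).
    def __init__(self, vocab_size, embedding_dim=100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, center_ids):
        # center_ids: LongTensor of shape (batch,)
        return self.output(self.embeddings(center_ids))   # (batch, vocab_size) logits

def train(model, examples, epochs=1, batch_size=512, lr=0.01):
    # examples: list of [center_word, context_word] pairs of ints (skip-gram, full softmax)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            centers = torch.tensor([ex[0] for ex in batch], dtype=torch.long)
            contexts = torch.tensor([ex[1] for ex in batch], dtype=torch.long)
            optimizer.zero_grad()
            loss = criterion(model(centers), contexts)
            loss.backward()
            optimizer.step()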
For Quiz Question 9, I also found this: the last index in the last tensor printed out above for the batched examples is 39. This does not match any of the suggestions. How did you find the correct answer for Quiz Question 9? Thanks a lot :-)
Best,
Gauthier