LABBOO

Word Meaning and Word2vec Trailhead Badge help

I'm working on the Word Meaning and Word2vec badge, and the Hands-on: Construct examples for each W2V variant step is taking forever (it's already been running for 2 hours). Has anyone else gotten through this? How long did this part take?

My best guess is that I have something wrong, but I'm not sure what, since I've gotten no error messages. When I stopped it, it looked like it was still in the while loop. Can anyone provide some guidance on where I'm wrong and what I might want to look at to get back on track?

while True:
    # TODO: select a random sentence index using random.randint and get that
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0,len(numericalized_sentences)-1)
    sentence = numericalized_sentences[sentence_idx]
    # TODO: Select a random window index using random.randint
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0,len(sentence)-1)
    window = sentence[window_idx:k]
    
    if len(window) <= n//2:
      continue

Thanks!
Lynda

 
LABBOO
Update: I changed the k above to an n, but it's still running really long. Here's the full code I'm currently running:

def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
  examples = []
  while True:
    # TODO: select a random sentence index using random.randint and get that
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0,len(numericalized_sentences)-1)
    sentence = numericalized_sentences[sentence_idx]
    # TODO: Select a random window index using random.randint
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0,len(sentence)-1)
    window = sentence[window_idx:n]
    
    if len(window) <= n//2:
      continue
      
    # TODO: Get the center word and the context words 
    center_word = window[int(round(len(window)/2))]
    context_words = window
    context_words.remove(center_word)
    
    # TODO: Create examples using the guidelines above
    if sg: # if Skip-Gram
      context_word = context_words[random.randint(0, len(context_words)-1)]
      example = [center_word, context_word]
    else: # if CBOW
      example = [context_words, center_word]
      if len(window) < n:
        continue
      
    if k > 0: # if doing negative sampling
      samples = [random.randint(0, len(vocabulary.index_to_word)-1) 
                 for _ in range(k)]
      example.append(samples)
      
    examples.append(example)
    if len(examples) >= num_examples:
      break
  
  return examples

Any help on where I'm off and what I should consider changing would be most welcome!
Lynda
Iago Breno Araujo
Hi, Lynda

I noticed this point too. Did you solve it?

Thanks,

Iago Breno
LABBOO
Iago,
Hi. No, I have not, though I did get the windowing part fixed; I tested that with an actual index. I'm still having problems figuring out how to get the sentence index, since numericalized_sentences is a nested list. This is what I have now:

  while True:
    # TODO: select a random sentence index using random.randint and get that
    # sentence. Be careful to avoid indexing errors.
    # Need to fix this - doesn't work because numericalized_sentences is a nested
    # vector. Do I just choose from numericalized_sentences[0]? Nope, that fails for
    # the sst_vocab with the random.seed provided. Do I need to flatten first? How?
    sentence_idx = random.randint(0,len(numericalized_sentences)-1)
    sentence = numericalized_sentences[sentence_idx]

    # TODO: Select a random window index using random.randint
    # and obtain that window of size n. Be careful to avoid indexing errors.
    # This works now (so keep). (Pretty sure the commented-out version works better.)
    window_idx = random.randint(0,len(vocabulary.sentences[sentence])-1)
    window = list((vocabulary.sentences[sentence])[i] for i in range(window_idx, window_idx+n, 1))
    #window = (vocabulary.sentences[sentence])[window_idx:(window_idx+n)]   --this version might be better
        
    if len(window) <= n//2:
      continue

Any guidance you can provide would be appreciated!
Lynda
Iago Breno Araujo
Hi, Lynda, sorry for the late reply.

I was actually having some problems too. I believe I made some progress. My code is similar to yours, but there is something wrong because I am facing this error:

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable `--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

I am not able to even test the output because of this. Did you solve the problem? I would appreciate any help.

Thanks,

Iago
LABBOO
Iago,
I still haven't figured it out. The error you're getting, though, I've gotten when I tried to grab a "sentence" from the vocab with too big a window, or tried to calculate len on some of the lists. I did find this (https://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/), but I need to find some time to look it over more; it looks very similar to what the workbook has us doing.

Lynda
Iago Breno Araujo
Lynda, thank you very much. I understand. OK, I will take a look at the link and continue to work on the workbook to figure it out. I will get in touch with you with any updates. Iago
LABBOO
Iago,  an update on where I'm at with this and where I'm stuck:

This is what I currently have:
def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
  examples = []
  while True:
    # TODO: select a random sentence index using random.randint and get that 
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0,len(numericalized_sentences)-1)
    sentence = numericalized_sentences[sentence_idx]
    
    # TODO: Select a random window index using random.randint 
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0,len(sentence)-1)
   #window = vocabulary.index_to_word[window_idx][:(window_idx+n)]   #this version might be better
    window = list(vocabulary.index_to_word[i] for i in range(window_idx,window_idx+n,1))
        
    if len(window) <= n//2:
      continue

The current window value I have gets me a KeyError, but if I run it with actual values from a prior run's window, e.g. print(list(sst_vocab.index_to_word[i] for i in range(2323,2328,1))), I actually get what I'm expecting. I know this will potentially give me an error if the range values exceed the max index. I was also trying slicing, but then I get a hash error, so I can't slice sst_vocab.index_to_word.
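The hash error makes me suspect index_to_word is a plain dict keyed by int rather than a list (just a guess on my part), which would explain both errors. A minimal sketch, assuming that's what it is:

index_to_word = {0: 'the', 1: 'cat', 2: 'sat'}   # stand-in for sst_vocab.index_to_word
print(index_to_word[1])       # fine: 'cat'
# index_to_word[0:2]          # TypeError: unhashable type: 'slice' -- the "hash error"
# index_to_word[99]           # KeyError: 99 -- an index past the max key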

Don't know if you or anyone else can take this further. This is what I got after being in a PyLadies meeting this evening and getting some help there (but apparently not enough).
Lynda
Iago Breno Araujo
Lynda, I was obtaining the window in this way:

  while True:
    # TODO: select a random sentence index using random.randint and get that
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0, len(numericalized_sentences)-1)
    sentence = vocabulary.sentences[sentence_idx]
    # TODO: Select a random window index using random.randint
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0, len(sentence)-1)
    window = sentence[window_idx:(window_idx+n)]

    if len(window) <= n//2:
      continue
LABBOO
Iago, my apologies for taking so long to reply.  I'll test this out this weekend.  Totally hoping this helps move me along some more on this!  If we're both still struggling with it after this weekend, maybe we could do a zoom and see if working together helps?  Let me know if you'd be interested in that.
Iago Breno Araujo
Lynda, no problem. Great! Did you make some progress? I did not work on it this weekend, but I tested it a few times and it was taking too long to process the second example. Yes, I think it would be great to work on this together!
LABBOO
Hi Iago. Not really, though I did find a number of sites to read up on that maybe I'll find something from (I'm hopeful). I don't know what time zone you're in, but I have quite a bit of flexibility/open time this weekend. Let me know if you have some this weekend and I can open a Zoom. It's easier to reach me via email at labboo.gc@gmail.com since I can access that from my phone and not just the computer. Hope to hear back from you soon. In the meantime, if you also want to search through some of the sites, here are a few:
  • https://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/
  • http://www.xiaoliangbai.com/2017/03/22/word-embedding-implemented-in-python
  • https://github.com/tobigithub/tensorflow-deep-learning/wiki/word2vec-example
  • http://pythonfiddle.com/word2vec/
PixLaser
Guys, I am also stuck at the same place. I am able to construct all the examples lists and construct minibatches for the FS models, but I'm failing at the kNS models.
Is the example list format correct?
      
    # TODO: Get the center word and the context words 
    context_words = list(window)
    center_word = list([context_words.pop(len(window)//2)])
    if sg: # if 
      context_word = context_words[random.randint(0, len(context_words)-1)]
      example = [center_word, [context_word]]
    else: # if CBOW
      example = [context_words,center_word]
      if len(window) < n:
        continue



en_wiki_5sgfs_examples:  [[[7487], [1921]], [[1918], [1918]]]
en_wiki_5cbowfs_examples: [[[881, 837, 17, 20348], [516]], [[8377, 6, 952, 7183], [4795]]]
en_wiki_5sg15ns_examples: [[[7210], [1776], [44199, 43751, 32134, 21466, 218, 57183, 11483, 49492, 9158, 864, 41347, 58762, 13374, 5752, 12158]],  [[1760], [5393], [38247, 56444, 62511, 34776, 61511, 4816, 39989, 45018, 68376, 63302, 27113, 69084, 41322, 1644, 52197]]]
en_wiki_5cbow15ns_examples: [[[16546, 2563, 1956, 184], [24789], [68237, 54984, 49089, 66855, 4173, 23784, 10827, 63819, 34326, 22298, 43896, 44160, 51274, 9606, 59869]], [[72, 316, 506, 222], [903], [2137, 24780, 11554, 47646, 1681, 46126, 30032, 53178, 69729, 65668, 7828, 37709, 64851, 30588, 63414]]]
 
Iago Breno Araujo
Hi, Lynda. I understand, but was the KeyError solved? I'm in São Paulo, BRST zone. I also have quite a bit of flexibility/open time this weekend. I'll get in contact with you by email. I will take a look at the links. Thank you.
Iago Breno Araujo
Hi, Elangovan. Actually, I think you are more advanced. When I try to construct examples by running the second cell in the section Hands-on: Construct examples for each W2V variant, it hangs on the second case:

Constructing English Wikipedia examples for 5SG-FS model
Constructing English Wikipedia examples for 5CBOW-FS model

Could you please help us solve this point? Thanks
PixLaser
Sure, Iago. I started on it with your and Lynda's help in this thread. My cell for constructing the samples is:


def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
  examples = []
  while True:
    # TODO: select a random sentence index using random.randint and get that
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0, len(numericalized_sentences)-1)   
    sentence = numericalized_sentences[sentence_idx]
    # TODO: Select a random window index using random.randint
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0, len(sentence)-1)
    window = sentence[window_idx:(window_idx+n)]

    
    if len(window) <= n//2:
      continue
      
    # TODO: Get the center word and the context words 
    context_words = list(window)
    center_word = list([context_words.pop(len(window)//2)])
    if sg: # if 
      context_word = context_words[random.randint(0, len(context_words)-1)]
      example = [center_word, [context_word]]
    else: # if CBOW
      example = [context_words,center_word]
      #example.append(center_word)
      if len(window) < n:
        continue
      
    if k > 0: # if doing negative sampling
      samples = [random.randint(0, len(vocabulary.index_to_word)-1) 
                 for _ in range(k)]
      example.append(samples)
      
    examples.append(example)
    if len(examples) >= num_examples:
      break
  
  return examples


It is constructing all the samples fine. In the next step, I am getting an error during negative sampling. I will go through the materials mentioned by Lynda.
Thanks
LABBOO
Elangovan, how long did it take to construct the samples? Using your code, the 2nd sample (the 5CBOW-FS) has been running for quite a while...
LABBOO
Elangovan, once I changed to use your context & center words, I'm able to generate the samples, but in the next step, once I get the result, none of the values match any of the choices for Question 9 - so I think there's still something not correct.

Then on the kNS Model the error I got was: RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 3 and 2 at /pytorch/aten/src/TH/generic/THTensorMath.c:3577
PixLaser
Yes, I saw those errors too. The kNS model error went away after changing the shape of the example list from
[[830, 2433, 19, 2530], [1439], [1195, 2027, 1607, 2206, 1656, 1489, 2056, 2574, 1710, 1116, 1374, 1843, 2950, 1448, 611]] to
[[830, 2433, 19, 2530], 1439, [1195, 2027, 1607, 2206, 1656, 1489, 2056, 2574, 1710, 1116, 1374, 1843, 2950, 1448, 611]]
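A rough sketch of why the nesting depth matters - I'm assuming here that the minibatch code stacks the center words with torch.tensor / torch.cat, which is a guess on my part:

import torch

# center words kept as one-element lists -> a 2-D batch
print(torch.tensor([[1439], [903]]).shape)   # torch.Size([2, 1])
# center words kept as plain ints -> a 1-D batch
print(torch.tensor([1439, 903]).shape)       # torch.Size([2])
# concatenating/stacking tensors whose dimensions differ like this is the kind
# of thing that raises "Tensors must have same number of dimensions"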

The code for construct_examples:

def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
  examples = []
  while True:
    # TODO: select a random sentence index using random.randint and get that
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0, len(numericalized_sentences)-1)   
    sentence = numericalized_sentences[sentence_idx]
    # TODO: Select a random window index using random.randint
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0, len(sentence)-1)
    window = sentence[window_idx:window_idx+n]

    
    if len(window) <= n//2:
      continue
      
    # TODO: Get the center word and the context words 
    context_words = list(window)
    center_word = context_words.pop(len(window)//2)
    if sg: # if 
      context_word = context_words[random.randint(0, len(context_words)-1)]
      example = [center_word, context_word]
    else: # if CBOW
      example = [context_words]
      example.append(center_word)
      if len(window) < n:
        continue
      
    if k > 0: # if doing negative sampling
      samples = [random.randint(0, len(vocabulary.index_to_word)-1) 
                 for _ in range(k)]
      example.append(samples)
      
    examples.append(example)
    if len(examples) >= num_examples:
      break
  
  return examples

Now the kNS models are working fine. One of the answer choices for Quiz 10 matches, but Quiz 9 still doesn't have any match. Not sure why?
It takes around 5 minutes to generate all 8 example sets.
LABBOO
I think I'm closer, but the Wiki CBOW generation just keeps running (> 15 minutes), so something still isn't right, and I think it's related to the center/context word. Here's what I have to that point:

def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
  examples = []
  while True:
    # TODO: select a random sentence index using random.randint and get that 
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0,len(numericalized_sentences)-1)
    sentence = numericalized_sentences[sentence_idx]
       
    # TODO: Select a random window index using random.randint 
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0,len(sentence)-1)
    window = sentence[window_idx:(window_idx+(n-1))]
        
    if len(window) <= n//2:
      continue
      
    # TODO: Get the center word and the context words 
    center_word = window[int(len(window)//2)]
    context_words = list([window.pop(len(window)//2)])
   
Tom Flynn 19
Hi Labboo,

Were you ever able to get an answer for question 9? I'm tempted to just keep moving but since these exercises build on one another I'm not sure that is wise.

Thanks,

Tom
Tom Flynn 19
Does anyone have a good reference for working on Hands-on: Define the Word2Vec models?  There are plenty of Word2Vec resources out there but I haven't found one yet that is helpful with this exercise.

Thanks,

Tom
Dave Paradise
I've been stuck on this one too. I've implemented the numericalize method, but attempting to run it for the wiki/SST vocabs takes a long, LONG time. What would be a more efficient implementation?
 
def numericalize(sentence, vocabulary):
  # TODO: Implement
  numericalized = []
  
  for word in sentence:
    for i in range(len(vocabulary.index_to_word)):
      if (word == vocabulary.index_to_word[i]):
        numericalized.append(str(i))
        break
  
  return numericalized
  pass

 
Tom Flynn 19
Mine is also taking a long time. I have it running right now to see if it completes. My numericalize function is a little different from yours, but I'm not sure it's correct. I am using the word to go directly to the index via vocabulary.word_to_index[word]. I also got rid of "pass", but I don't think that really matters:

def numericalize(sentence, vocabulary):
  # TODO: Implement
  ret_numericalized = []
  
  for idx,word in enumerate(sentence):
      ret_numericalized.append(vocabulary.word_to_index[word])

  return (ret_numericalized)




 
Dave Paradise
Oh, mine is definitely wrong. It's been running since before my last post.
Tom Flynn 19
Mine has been going for about an hour. It seems to be stuck on the 2nd one:

Constructing English Wikipedia examples for 5CBOW-FS model

I'm going to let it run for a bit longer.


 
Tom Flynn 19
I gave up on mine as well. I am now focused on the CBOW branch, particularly this section:

      if len(window) < n:
         continue

It seems like len(window) will always be less than n, and thus you never append anything to the examples list. This results in the 2 CBOW examples lists being empty. However, I feel like you have to assume the provided code is correct.
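For what it's worth, a short window by itself shouldn't hang the loop; when the random start lands near the end of a sentence, the slice just comes up short and that draw gets skipped. A quick sketch (toy sentence, n=5):

n = 5
sentence = list(range(20))       # a 20-token numericalized sentence
window = sentence[18:18 + n]     # random start near the end
print(len(window))               # 2, so "len(window) < n" just skips this draw

So if nothing ever gets appended, I suspect something else is shortening the window after it's taken.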
Dave Paradise
I figured it out!

I took a break from this one to attempt the GloVe and Word Vectors for Sentiment Analysis project, but it also has a numericalize method to implement.

However, while looking at the code, I noticed the Vocabulary class not only has an index_to_word value that I was looking at, but the opposite, WORD_TO_INDEX!

Rather than use a double-nested loop that iterates through the entire vocabulary, it became a one-liner!
 
def numericalize(sentence, vocabulary):
  numericalized = []
  
  for word in sentence:
    numericalized.append(vocabulary.word_to_index[word])
  
  return numericalized
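
The same thing can also be written as an actual one-liner with a list comprehension (same behavior, just more compact):

def numericalize(sentence, vocabulary):
  return [vocabulary.word_to_index[word] for word in sentence]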

Talk about massive overthinking on my part.
Tom Flynn 19
It certainly is easy to overthink things at times. My numericalize is very close to that, but then I got stuck further on... haven't looked at it for a couple of days.
Dave Paradise

@LABBOO - I believe I found the solution to your original problem.

This line-
if len(window) < n:
        continue

was creating an infinite loop, because when you create context_words and use pop on window, it reduces window's length to less than n every time.

I got around that by creating a copy of the window and using pop on that.
# TODO: Get the center word and the context words
    window_index = int(len(window)//2)
    center_word = window[window_index]
    window_copy = list(window)
    context_words = list([window_copy.pop(window_index)])
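
A quick illustration of the aliasing that was biting me - plain assignment shares the same list object, while list() makes an independent copy:

window = [10, 11, 12, 13, 14]
alias = window                  # same list object, not a copy
alias.pop(2)
print(len(window))              # 4 -> window shrank too, so len(window) < n from then on

window = [10, 11, 12, 13, 14]
window_copy = list(window)      # independent copy
window_copy.pop(2)
print(len(window))              # 5 -> window is untouched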

UPDATE: Why is there no edit function?!

While I was able to complete constructing the samples after making sure the window length matched n, there's still something wrong with my code: the output that provides the information for Quiz Question 9 didn't match any of the listed answers, so my window index construction is most likely incorrect.

 
Dave Paradise
Found part of the problem. context_words was simply creating a list containing the center_word rather than all of the words around it.

To fix that:
window_index = int(len(window)//2)
center_word = window[window_index]
context_words = list(window)
context_words.pop(window_index)
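On a toy window, for example, that gives:

window = [10, 11, 12, 13, 14]
window_index = int(len(window)//2)   # 2
center_word = window[window_index]   # 12
context_words = list(window)
context_words.pop(window_index)
print(center_word, context_words)    # 12 [10, 11, 13, 14]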
While that did change the results for the tensors for Quiz question 9, I'm still not seeing a number that matches up with what's in the choices.
Back at it, I guess.
Dave Paradise
Ended up giving up and brute-forcing the rest of the answers. It just wasn't worth spending more time on it than on some Superbadges, just for 100 points.

When speaking with some of the top badge earners, they mentioned doing the same. I've yet to speak with anyone who was able to complete the entire workbook.
Tom Flynn 19
I hear you.  I gave up on this one too.
LABBOO
OK, worked a bit on this today.  I haven't given up (though I did brute force the answers with another account, I still want to get this one legit).

So I still can't get #9 but for #10 I'm getting the correct answer.  This is what I'm using:

def construct_examples(numericalized_sentences, vocabulary, num_examples=int(1e6), n=5, sg=True, k=0):
  examples = []
  while True:
    # TODO: select a random sentence index using random.randint and get that 
    # sentence. Be careful to avoid indexing errors.
    sentence_idx = random.randint(0,len(numericalized_sentences)-1)
    sentence = numericalized_sentences[sentence_idx]
    
    # TODO: Select a random window index using random.randint 
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0,len(sentence)-1)
    window = sentence[window_idx:(window_idx+n)]
        
    if len(window) <= n//2:
      continue
      
    # TODO: Get the center word and the context words 
    center_word = window[int(len(window)//2)]
    context_words = window
    context_words.remove(center_word)
    
    # TODO: Create examples using the guidelines above
    if sg: # if Skip-Gram
      context_word = context_words[random.randint(0, len(context_words)-1)]
      example = [center_word,context_word]
    else: # if CBOW
      example = [context_words,center_word]
      #if len(window) < n:  Note: I changed this line because this is where I kept hanging up
      if len(window) >= n:
        continue
      
    if k > 0: # if doing negative sampling
      samples = [random.randint(0, len(vocabulary.index_to_word)-1) 
                 for _ in range(k)]
      example.append(samples)
      
    examples.append(example)
    if len(examples) >= num_examples:
      break
  
  return examples

So I'm trying to work on the next part (Hands-on: Define the Word2Vec models), but my brain is tired, so I'm going to give it a rest until my PyLadies group on Sunday afternoon.
Tom Flynn 19
This is where I am at. None of the options for 9 matched what I have but I have an answer for 10. 
Vasilina Veretennikova 8
Hello guys,

the topic looks a bit old, but let me ask whether you found the answer for Quiz Question 9.

Firstly, I faced the same issue with the infinite loop, and it was solved. Please pay attention to center_word. We retrieve it with the pop method. The pop() method returns the item at the given index and removes it from the list:
# TODO: Get the center word and the context words 
    center_word_idx = len(window)//2
    center_word = window.pop(center_word_idx)
    context_words = window
So context_words is window with the center word removed. But for CBOW we have this code:
else: # if CBOW
      example = [context_words,center_word]
      if len(window) < n:
        continue

The condition len(window) < n is always true, because one element was removed and len(window) is therefore at most n-1. To fix the loop I changed the code to:
else: # if CBOW
      example = [context_words,center_word]
      if len(window) < n-1:
        continue
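
A tiny illustration of why, with n=5:

n = 5
window = [10, 11, 12, 13, 14]
center_word = window.pop(len(window)//2)   # pop removes 12 from window itself
print(len(window))                         # 4, i.e. n-1
# so "len(window) < n" is always true after the pop,
# while "len(window) < n-1" only skips genuinely short windows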
The code ran successfully, and my answer for Question 9 is 39.

I'm just wondering if you have the same result. 

Thanks,
Vasilina
Atif Razzaq
Hello Everyone

Did anyone have luck with this trail?

Vasilina, I too got 39 for question 9. Following is my code snippet for that:
   
    sentence_idx = random.randint(0,len(numericalized_sentences)-1)
    sentence = numericalized_sentences[sentence_idx]
    # TODO: Select a random window index using random.randint
    # and obtain that window of size n. Be careful to avoid indexing errors.
    window_idx = random.randint(0,len(sentence)-1)
    window = sentence[window_idx:window_idx+n]
    if len(window) <= (n//2):
      continue
      
    # TODO: Get the center word and the context words 
    center_word = sentence[window_idx+(n//2)]
    context_words = sentence[window_idx:window_idx+(n//2)] + sentence[window_idx+(n//2)+1:window_idx+n]
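
For example, with n=5 on a toy numericalized sentence, that snippet gives:

n = 5
sentence = [10, 11, 12, 13, 14, 15, 16]
window_idx = 1
center_word = sentence[window_idx + (n//2)]                           # sentence[3] -> 13
context_words = (sentence[window_idx:window_idx + (n//2)]
                 + sentence[window_idx + (n//2) + 1:window_idx + n])  # [11, 12] + [14, 15]
print(center_word, context_words)   # 13 [11, 12, 14, 15]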


I am struggling with "Word2VecModel(nn.Module)" and "def train(...)". Can someone please kindly share the code? The hints and instructions given in the code are not clear and aren't very helpful.

Thanks in anticipation!
Gauthier Muguerza
Hello,
For Quiz Question 9, I also get this: the last index in the last tensor printed out above for the batched examples is 39. This does not match any of the suggestions. How did you find the correct answer for Quiz Question 9? Thanks a lot :-)
Best,
Gauthier