Text Processing and Transformation

Aryan performs the following steps to process the spoken text.

Step 1: Convert Text to Lowercase and Remove Duplicates

  • The speech text is first converted to lowercase, and then repeated occurrences of a word are removed.
    • Human Text: "Can you Please Tell me about Queen Queen Victoria"
    • Transformed Text: "can you please tell me about queen victoria"
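
This page does not show how Step 1 is implemented. The following is a minimal sketch, assuming that "repeating occurrence" means the same word repeated back to back and that splitting on whitespace is enough at this stage. The function name lowercase_and_dedupe is hypothetical and is not taken from Aryan's code.

```python
def lowercase_and_dedupe(text):
    """Lowercase the text and drop immediately repeated words."""
    words = text.lower().split()
    result = []
    for word in words:
        # Keep the word only if it differs from the previously kept word.
        if not result or result[-1] != word:
            result.append(word)
    return " ".join(result)


print(lowercase_and_dedupe("Can you Please Tell me about Queen Queen Victoria"))
# can you please tell me about queen victoria
```

Only consecutive repeats are dropped in this sketch, so a word that legitimately appears twice in different places of the sentence is kept.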

Step 2: Word Tokenization

  • The transformed text from Step 1 is used as input for word tokenization, in which the text is split into tokens using NLTK word tokenization.
  • Input Text: "can you please tell me about queen victoria"
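  • tokenize_text = ['can', 'you', 'please', 'tell', 'me', 'about', 'queen', 'victoria']

    from nltk.tokenize import word_tokenize

    # Requires the NLTK tokenizer data to be downloaded beforehand.
    tokenize_text = word_tokenize(text)
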
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

Step 3: Remove Stopwords

  • Stop words are English words that do not add much meaning to a sentence and can safely be ignored without sacrificing its meaning, for example "the", "he", and "have". NLTK already provides such words in its stopwords corpus.
  • We create a list and store in it all the words that are not stop words.
  • Input tokenize_text = ['can' , 'you' , 'please', 'tell', 'me', 'about', 'queen', 'victoria' ]
  • Output = ['please', 'tell', 'queen', 'victoria' ]
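
    from nltk.corpus import stopwords

    # Requires the NLTK stopwords corpus to be downloaded beforehand.
    stop_words = set(stopwords.words("english"))
    filtered_list = []
    for word in tokenize_text:
        # casefold() makes the comparison case-insensitive.
        if word.casefold() not in stop_words:
            filtered_list.append(word)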

Step 4: Untokenize Text

  • Next, the text is untokenized, that is, restored to its original sentence format.
  • An untokenize function is created, and regular expressions are used to bring the text back to its original format.
  • Input = ['please', 'tell', 'queen', 'victoria' ]
  • Output = "please tell queen victoria"

    import re

    def untokenize(words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
        text = ' '.join(words)
        # Restore quotation marks and ellipses.
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
        # Remove the space just inside parentheses.
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        # Attach punctuation to the preceding word, mid-text and at the end of the text.
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        # Rejoin contractions and "cannot".
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
            "can not", "cannot")
        step6 = step5.replace(" ` ", " '")
        return step6.strip()

    autotransformedtext = untokenize(filtered_list)

Step 5: Manual Text Transformation

  • The automatically transformed text is still not in the desired form, so a manual text transformation is applied to refine it further.

    transformedtext = manualTextTransformation(autotransformedtext, text)

  • We remove common words that are not stop words to clean the text further. These words could instead be added to a refined stop word list.
  • Aryan also found that "no" was removed when the text was filtered with the stop words, so "no" is added back to the text. This lets Aryan recognize the response "no" to questions such as "Do you want to search anything else?"
  • The text is trimmed and returned to the function that called the text transformation.

    def manualTextTransformation(text, textorg):
        # Pad with spaces so that only whole words are matched below.
        text = ' ' + text + ' '
        text = text.replace(" could ", ' ')
        text = text.replace(" please ", ' ')
        text = text.replace(" tell ", ' ')
        text = text.replace(" okay ", ' ')
        text = text.replace(" us ", ' ')
        text = text.replace(" would ", ' ')
        text = text.replace(" give ", ' ')
        text = text.replace(" get ", ' ')
        text = text.replace(" detail ", ' ')
        text = text.replace(" hello ", ' ')
        text = text.replace(" want ", ' ')
        # Trim before prepending so that "no" is not followed by a double space.
        text = text.strip()
        # subStrCheck is a helper that is not shown on this page.
        if subStrCheck(textorg, "no"):
            text = ("no " + text).strip()
        return text

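
The subStrCheck helper is not shown on this page. If it performs a plain substring test, it would also match "no" inside words such as "know" or "not". The following is a minimal sketch of a whole-word check, offered as one possible approach rather than Aryan's actual implementation. The name contains_word is hypothetical.

```python
import re


def contains_word(text, word):
    """Return True if `word` occurs in `text` as a whole word, ignoring case."""
    # \b matches a word boundary, so "no" does not match inside "know" or "not".
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_word("No thanks", "no"))      # True
print(contains_word("I do not know", "no"))  # False
```
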
Avoid manual text transformation as much as possible. It is used here only for the specific words, identified during testing, that Aryan needs handled for further processing. Aryan's use of NLTK needs to improve so that the manual text transformation can be removed altogether.
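
One way to move in that direction is to fold the manually removed words into the stop word set and to exclude "no" from it, so that Step 3 does the whole job. The following is a minimal sketch under those assumptions. The names EXTRA_STOP_WORDS, KEEP_WORDS, build_stop_words, and filter_tokens are hypothetical, and the short base list is only a stand-in for stopwords.words("english") so that the sketch runs without NLTK data.

```python
# Extra words taken from the replace() calls in manualTextTransformation.
EXTRA_STOP_WORDS = {"could", "please", "tell", "okay", "us", "would",
                    "give", "get", "detail", "hello", "want"}
# Words that must survive filtering even though they are standard stop words.
KEEP_WORDS = {"no"}


def build_stop_words(base_stop_words):
    """Combine a base stop word collection with the extras, minus the keep list."""
    return (set(base_stop_words) | EXTRA_STOP_WORDS) - KEEP_WORDS


def filter_tokens(tokens, stop_words):
    """Return the tokens that are not in the stop word set."""
    return [t for t in tokens if t.casefold() not in stop_words]


# Stand-in for stopwords.words("english") (assumption, for illustration only).
base = ["can", "you", "me", "about", "no", "i", "do", "not"]
stop_words = build_stop_words(base)

tokens = ["can", "you", "please", "tell", "me", "about", "queen", "victoria"]
print(filter_tokens(tokens, stop_words))
# ['queen', 'victoria']
print(filter_tokens(["no"], stop_words))
# ['no']
```

Unlike manualTextTransformation, this sketch leaves "no" in its original position instead of moving it to the front of the text, so any code that expects the text to start with "no" would need to be adjusted.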