# Text Processing and Transformation

## Steps Aryan Performs to Process Spoken Text

### Step 1: Convert Text to Lowercase and Remove Duplicates

* The speech text is first converted to lowercase, and then repeated occurrences of a word are removed.
  * Human Text: "Can you Please Tell me about Queen Queen Victoria"
  * Transformed Text: "can you please tell me about queen victoria"
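
A minimal sketch of this step, assuming that "repeating occurrence" means consecutive repeated words (as in the "Queen Queen" example above) and that words are separated by whitespace. The function name `lowercase_and_dedupe` is hypothetical and is not taken from Aryan's code:

```python
def lowercase_and_dedupe(text):
    """Lowercase the text and drop words that repeat the previous word."""
    words = text.lower().split()
    deduped = []
    for word in words:
        # Assumption: only consecutive repeats count as duplicates
        if not deduped or deduped[-1] != word:
            deduped.append(word)
    return " ".join(deduped)


print(lowercase_and_dedupe("Can you Please Tell me about Queen Queen Victoria"))
# can you please tell me about queen victoria
```

Non-adjacent repeats are left untouched in this sketch, since removing them could change the meaning of the sentence.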

### Step 2: Word Tokenization

* The text produced by Step 1 is the input for word tokenization, in which the text is split into tokens using **NLTK** word tokenization.
* Input text: "can you please tell me about queen victoria"
* Output: tokenize\_text = \['can', 'you', 'please', 'tell', 'me', 'about', 'queen', 'victoria']

```python
from nltk.tokenize import word_tokenize
tokenize_text = word_tokenize(text)
```

{% hint style="info" %}
[NLTK](https://www.nltk.org/) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to [over 50 corpora and lexical resources](http://nltk.org/nltk_data/) such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.
{% endhint %}

### Step 3: Remove Stopwords

* Stopwords are English words that add little meaning to a sentence and can safely be ignored without changing what the sentence means, for example "the", "he", and "have". NLTK already provides a list of such words in its `stopwords` corpus.
* We build a list containing every token that is not a stopword.
* Input: tokenize\_text = \['can', 'you', 'please', 'tell', 'me', 'about', 'queen', 'victoria']
* Output: filtered\_list = \['please', 'tell', 'queen', 'victoria']

```python
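from nltk.corpus import stopwords

# English stopword list from the NLTK stopwords corpus
stop_words = set(stopwords.words("english"))
filtered_list = []

# Keep only the tokens that are not stopwords
for word in tokenize_text:
    if word.casefold() not in stop_words:
        filtered_list.append(word)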
```

### Step 4: Untokenize Text

* Next, the tokens are untokenized, that is, joined back into sentence format.
* An `untokenize` function does this, using regular expressions to restore punctuation and spacing.
* Input: \['please', 'tell', 'queen', 'victoria']
* Output: "please tell queen victoria"

```python
import re

def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

# Call the function only after it has been defined
autotransformedtext = untokenize(filtered_list)
```

### Step 5: Manual Text Transformation

* If the automatically transformed text is still not the desired one, manual text transformation is applied to transform it further.

```python
transformedtext = manualTextTransformation(autotransformedtext,text)
```

* Common words that are not stopwords are removed to clean the text further. These words can always be added to a refined stopword list instead.
* Aryan also found that "no" is removed during stopword filtering, so "no" is added back to the front of the text. This lets Aryan recognize a "no" reply to questions such as "Do you want to search anything else?"
* The text is trimmed and returned to the function that called the text transformation.

```python
def manualTextTransformation(text, textorg):
    # Pad with spaces so that only whole words are matched
    text = ' ' + text + ' '

    # Common filler words that the stopword list does not cover
    filler_words = ("could", "please", "tell", "okay", "us", "would",
                    "give", "get", "detail", "hello", "want")
    for word in filler_words:
        text = text.replace(' ' + word + ' ', ' ')

    # "no" is a stopword, so restore it if the original text contained it.
    # subStrCheck is a helper defined elsewhere in Aryan (not shown on this page).
    if subStrCheck(textorg, "no"):
        text = "no " + text

    # Trim leading and trailing whitespace
    return text.strip()
```
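
The `subStrCheck` helper is not shown on this page, so its exact behavior is unknown. If it is a plain substring test, inputs such as "know" or "not" would also count as containing "no". The following is a hedged sketch of a whole-word check that could serve the same purpose. The name `contains_word` is hypothetical and is not Aryan's actual helper:

```python
import re


def contains_word(text, word):
    """Return True if `word` occurs in `text` as a whole word (case-insensitive)."""
    # \b anchors the match at word boundaries, so "no" does not match inside "know"
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_word("no thank you", "no"))    # True
print(contains_word("i do not know", "no"))   # False
```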

{% hint style="warning" %}
Avoid manual text transformation as much as possible. We only use it for specific text, identified during testing, that Aryan needs for further processing. Aryan still needs to improve its NLTK skills so that manual text transformation can be removed altogether.
{% endhint %}
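
One possible way to reduce the manual step is to fold the filler words into a refined stopword set and to exclude "no" from it, so that a single filtering pass covers both cases. This is a sketch of that alternative, not Aryan's current implementation. The names `EXTRA_STOP_WORDS`, `KEEP_WORDS`, `build_stop_words`, and `filter_tokens` are hypothetical, and the small `base` set is only a stand-in for `set(stopwords.words("english"))` so that the example runs without NLTK data:

```python
# Filler words taken from the manual replacements in Step 5
EXTRA_STOP_WORDS = {"could", "please", "tell", "okay", "us", "would",
                    "give", "get", "detail", "hello", "want"}
# Words that must survive filtering even though they are standard stopwords
KEEP_WORDS = {"no"}


def build_stop_words(base_stop_words):
    """Return the base stopwords plus the extras, minus the words to keep."""
    return (set(base_stop_words) | EXTRA_STOP_WORDS) - KEEP_WORDS


def filter_tokens(tokens, stop_words):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token.casefold() not in stop_words]


# Stand-in for set(stopwords.words("english")); an assumption for this sketch
base = {"can", "you", "me", "about", "no", "do", "i"}
refined = build_stop_words(base)

tokens = ["can", "you", "please", "tell", "me", "about", "queen", "victoria"]
print(filter_tokens(tokens, refined))   # ['queen', 'victoria']
print(filter_tokens(["no"], refined))   # ['no']
```

Unlike the manual transformation, this keeps "no" at its original position in the sentence instead of moving it to the front, so any downstream code that expects a leading "no" would need to be adjusted.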

