Removing duplicate paragraphs with Edit Pad Pro or Notepad++

notepad, regex, sublime-text-3

I have a .docx file containing multiple-choice questions (MCQs) in the format shown below. Many of the MCQs are duplicated, and I would like to know whether a regex can be written to detect all the duplicates.

I have Edit Pad Pro 7, Notepad++, PowerGREP and Sublime Text. Every regex I have tried so far removes duplicates line by line, which deletes options from other questions even when the questions themselves don't match.

In short, I need a regex that deletes a duplicate MCQ only when the whole MCQ matches, not individual lines or sentences.

I am a novice with respect to regex, so please excuse any inadequacies.

Lichen planus occurs most frequently on the?
A.  buccal mucosa.
B.  tongue.
C.  floor of the mouth.
D.  gingiva.

In the absence of “Hanks balanced salt solution”, what is the most appropriate media to transport an avulsed tooth?
A.  Saliva.
B.  Milk.
C.  Saline.
D.  Tap water.

Which of the following is the most likely cause of osteoporosis, glaucoma, hypertension and peptic ulcers in a 65 year old with Crohn’s disease?
A.  Uncontrolled diabetes.
B.  Systemic corticosteroid therapy.
C.  Chronic renal failure.
D.  Prolonged NSAID therapy.
E.  Malabsorption syndrome.

Lichen planus occurs most frequently on the?
A. buccal mucosa.
B. tongue.
C. floor of the mouth.
D. gingiva.

Expected result:

Lichen planus occurs most frequently on the?
A.  buccal mucosa.
B.  tongue.
C.  floor of the mouth.
D.  gingiva.

In the absence of “Hanks balanced salt solution”, what is the most appropriate media to transport an avulsed tooth?
A.  Saliva.
B.  Milk.
C.  Saline.
D.  Tap water.

Which of the following is the most likely cause of osteoporosis, glaucoma, hypertension and peptic ulcers in a 65 year old with Crohn’s disease?
A.  Uncontrolled diabetes.
B.  Systemic corticosteroid therapy.
C.  Chronic renal failure.
D.  Prolonged NSAID therapy.
E.  Malabsorption syndrome.

Best Answer

  • Ctrl+H
  • Find what: (([^?]+\?\R(?:.+\.\R)+)[\s\S]+?)\2
  • Replace with: $1
  • check Wrap around
  • check Regular expression
  • DO NOT CHECK . matches newline
  • Replace all
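
To try the pattern outside the editor, here is a minimal Python sketch of the same search and replace. It rests on a few assumptions: Python's `re` module has no `\R`, so `(?:\r?\n)` stands in for it; the function name `remove_duplicate_mcqs` and the sample text are made up for illustration; and the loop is there because a block that occurs three or more times may need more than one pass.

```python
import re

# Python's re module does not support \R, so (?:\r?\n) is used instead
# (assumption: the text only contains \n or \r\n line breaks).
DUPLICATE_MCQ = re.compile(r"(([^?]+\?\r?\n(?:.+\.\r?\n)+)[\s\S]+?)\2")


def remove_duplicate_mcqs(text: str) -> str:
    """Repeat the replacement until the pattern no longer matches."""
    while True:
        new_text, count = DUPLICATE_MCQ.subn(r"\1", text)
        if count == 0:
            return new_text
        text = new_text


# Hypothetical sample: the first block is repeated at the end.
sample = (
    "Q one?\n"
    "A.  yes.\n"
    "B.  no.\n"
    "\n"
    "Q two?\n"
    "A.  red.\n"
    "B.  blue.\n"
    "\n"
    "Q one?\n"
    "A.  yes.\n"
    "B.  no.\n"
)

result = remove_duplicate_mcqs(sample)
print(result)
```

As in the editor, the backreference requires the duplicate to be character-for-character identical, including the spacing after `A.` and the line break after the last option.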

Explanation:

(           : start group 1
  (         : start group 2
    [^?]+   : 1 or more of any character that is not "?" (line breaks included)
    \?      : a question mark
    \R      : any kind of line break
    (?:     : start non-capturing group
      .+    : 1 or more of any character except a line break
      \.    : a dot
      \R    : any kind of line break
    )+      : end group, must appear 1 or more times
  )         : end group 2
  [\s\S]+?  : 1 or more of any character (line breaks included), not greedy
)           : end group 1
\2          : another occurrence of group 2

Replacement:

$1          : content of group 1
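
If the duplicates differ only in spacing (for example `A.  buccal mucosa.` versus `A. buccal mucosa.`), the backreference `\2` will not match them. A different technique, not the regex above, is to split the text on blank lines and compare whole blocks after collapsing whitespace. The sketch below assumes MCQs are separated by at least one blank line; `dedupe_blocks` is a hypothetical helper name.

```python
import re


def dedupe_blocks(text: str) -> str:
    """Keep the first copy of each blank-line-separated block."""
    text = text.replace("\r\n", "\n")          # normalise Windows line breaks
    blocks = re.split(r"\n\s*\n", text.strip())
    seen = set()
    kept = []
    for block in blocks:
        # Collapse every run of whitespace so spacing differences are ignored.
        key = " ".join(block.split())
        if key and key not in seen:
            seen.add(key)
            kept.append(block.strip())
    return "\n\n".join(kept) + "\n"


# Hypothetical sample: the last block repeats the first with different spacing.
sample_blocks = (
    "Q one?\nA.  yes.\n"
    "\n"
    "Q two?\nA.  red.\n"
    "\n"
    "Q one?\nA. yes.\n"
)

deduped = dedupe_blocks(sample_blocks)
print(deduped)
```

This keeps the first occurrence of each MCQ in its original order and leaves the text of the surviving blocks unchanged.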