Differences

This shows you the differences between two versions of the page.

public:nnels:etext:regex [2018/07/11 22:31]
leah.brochu
public:nnels:etext:regex [2024/05/29 18:37]
rachel.osolen
Line 18: Line 18:
       * If you wanted to remove the hyphen from "BB-8" you would enter ''\1\2'' (i.e., the two groups with nothing between them) into the Replace field. Or, if you wanted to change the hyphen to a space, you would enter ''\1 \2'' (i.e., the two groups with a space between them) into the Replace field.       * If you wanted to remove the hyphen from "BB-8" you would enter ''\1\2'' (i.e., the two groups with nothing between them) into the Replace field. Or, if you wanted to change the hyphen to a space, you would enter ''\1 \2'' (i.e., the two groups with a space between them) into the Replace field.
       * Another example: ''(John) (Smith)'' replaced by ''\2 \1'' (note the spaces in the search and replace strings) – will produce ''Smith John''       * Another example: ''(John) (Smith)'' replaced by ''\2 \1'' (note the spaces in the search and replace strings) – will produce ''Smith John''
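The group-and-backreference idea above is not specific to Word. As a minimal sketch, the same replacements can be tried in Python, whose ''re'' module also accepts ''\1'' and ''\2'' in the replacement string. The function names are hypothetical, and the find pattern ''(BB)-(8)'' is an assumption made for illustration; it is only one way to capture the two groups.

```python
import re

# Assumed find pattern for illustration: group 1 is "BB", group 2 is "8".
BB8 = re.compile(r"(BB)-(8)")


def remove_hyphen(text: str) -> str:
    # Keep both groups and put nothing between them.
    return BB8.sub(r"\1\2", text)


def hyphen_to_space(text: str) -> str:
    # Keep both groups and put a space between them.
    return BB8.sub(r"\1 \2", text)


def swap_names(text: str) -> str:
    # Group 1 is "John", group 2 is "Smith"; the replacement reverses them.
    return re.sub(r"(John) (Smith)", r"\2 \1", text)


if __name__ == "__main__":
    print(remove_hyphen("BB-8 rolled away"))
    print(hyphen_to_space("BB-8 rolled away"))
    print(swap_names("John Smith"))
```

Trying a pattern on a short sample string like this can be a quick way to check what each group captures before running the replacement in Word.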
 + 
 +<note tip>Word has a lot of options to find letters (^$) and numbers (^#) when using the non-regex [[public:nnels:etext:find-and-replace|Find & Replace]], but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off.
 +</note>
  
-====Tips==== +<note tip>Many of the codes for special characters (e.g. page break) are under the "Special..." button. 
- +</note>
-[[https://support.office.com/en-ca/article/Find-and-replace-text-and-other-data-in-your-Word-2010-files-c6728c16-469e-43cd-afe4-7708c6c779b7?ui=en-US&rs=en-CA&ad=CA#__toc282774574|Using wildcards in Microsoft Word]] (this is similar to regular expressions, but Word has a lot of its own syntax) +
-  +
-  * Word has a lot of options to find letters (^$) and numbers (^#) but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off.+
  
-  * A lot of the codes for special characters (e.g. page break) are under the "Special..." button. +<note>If you discover a solution to a problem that is not on this page, please contact the Production Coordinator. They can teach you how to add your own solutions by updating this wiki page!</note>
-{{:public:nnels:regex.png?400|}} +
-==== In LibreOffice & OpenOffice ==== +
-Make sure that the ''Regular expressions'' box is checked on the Alternative Find & Replace dialog for all of the search and replace actions below.+
  
-[[https://help.libreoffice.org/Common/List_of_Regular_Expressions|Regular expressions in LibreOffice]] +=====Problems and Solutions Using Regular Expressions=====
-[[https://wiki.openoffice.org/wiki/Documentation/How_Tos/Regular_Expressions_in_Writer|Regular Expressions in OpenOffice]]+
  
-===== Conversion Fixes ===== +In this section you will find examples of different ways to use ''Find and Replace'' to help you with some common reformatting issues.
-The following fixes assume you are using Word, unless otherwise stated.+
  
-<note>Contribute your problems and regex solutions below. Attach your screenshots of both the problem and solution.</note>+<note tip>If you don't see the solution to your problem on this page, go to the [[public:nnels:etext:find-and-replace|Using Find & Replace]] page. If you still can't find it, then try writing your own regex, or using a wildcard for find and replace.</note>
  
 ---- ----
Line 77: Line 72:
  
 <WRAP center round box 80%> <WRAP center round box 80%>
-**PROBLEM:** OCR converted some "1" digits to "i/I" letters, resulting in dates like "i984" or numbers like "3i".+**PROBLEM**: Stray hyphens break up a single word in the middle of a line (not a word split over two lines). 
 + 
 +**SOLUTION**: Replace with the same text minus the hyphen. 
 + 
 +Find: ''([a-z])-([a-z])'' 
 + 
 +Replace with: ''\1\2'' 
 + 
 +Using a-z restricts what it finds to lowercase. 
 + 
 +You will likely have to do it again for lines that end with a comma, and possibly en and em dash. Look through your document for patterns of anything else it might have missed. 
 +</WRAP> 
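For anyone who wants to test this pattern outside Word first, here is a minimal Python sketch of the same find and replace. The function name ''join_hyphenated'' is hypothetical. Because the pattern only matches a hyphen between two lowercase letters, something like "BB-8" is left alone; it will, however, also join legitimate compounds such as "well-known", so the results still need a manual check.

```python
import re

# A hyphen with a lowercase letter directly on each side.
HYPHEN_INSIDE_WORD = re.compile(r"([a-z])-([a-z])")


def join_hyphenated(text: str) -> str:
    # Put the two captured letters back together without the hyphen.
    return HYPHEN_INSIDE_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    print(join_hyphenated("There was trou-ble when BB-8 arrived."))
```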
 + 
 +---- 
 + 
 +<WRAP center round box 80%> 
 +**PROBLEM:** OCR converted some "1" digits to "i/I" letters, resulting in dates like "i984" or numbers like "3I".
  
 **SOLUTION:** Replace "i/I"s that come immediately before or after a number with "1"s. This will be done in two steps. **SOLUTION:** Replace "i/I"s that come immediately before or after a number with "1"s. This will be done in two steps.
Line 90: Line 101:
  
 ---- ----
 +
  
 <WRAP center round box 80%> <WRAP center round box 80%>
-**PROBLEM**: There are extra paragraph breaks. We want to keep the real paragraph breaks and remove the fake extra paragraph breaks.   
  
-**SOLUTION**: Use MS Word's find and replace to remove the extra paragraph breaks using special Word symbols.+**PROBLEM:** OCR did not recognize spaces around quotation marks.  
 +  * Example A: As one of Montgomery's British staff officers later put ''it,"I'' feel Monty was astonishing in his relationship with all the Dominion troops. 
 +  * Example B: The "nasty little ''troublemaker,"as'' Montgomery was widely known in the British army... 
 +This problem has an added complexity: the same pattern needs two different fixes. 
 +  * Example A will need to say: ... later put ''it, "I'' feel Monty... (i.e., comma, space, quotation mark) 
 +  * Example B will need to say: The "nasty little troublemaker''," as'' Montgomery... (i.e., comma, quotation mark, space) 
 + 
 +**SOLUTIONS:** 
 +Example A:\\  
 + 
 +Find: ''([,])(["])([A-z])''\\  
 +Replace: ''\1 \2\3'' 
 + 
 +Example B: 
 + 
 +Find: ''([,])(["])([A-z])''\\  
 +Replace: ''\1\2 \3''
  
-Find: ''^p^p'' (you can also search for more than 2 paragraph breaks, i.e. ''^p^p^p'')+Notes 
 +  * You will **not** be able to use "replace all" in this situation. You will need to keep hitting ''Find Next'' and replacing the pattern with the appropriate solution. 
 +  * You will also need to re-do this, searching for periods instead of commas.
  
-Replace with: ''^p'' 
 </WRAP> </WRAP>
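The two fixes for the quotation mark problem can be sketched in Python to show why "replace all" does not work here: both functions search for the same three groups and differ only in where the space goes, so a person has to decide match by match which one applies. The function names are hypothetical, and ''[A-Za-z]'' is used in place of ''[A-z]'' as an assumption, because in Python ''[A-z]'' would also match a few punctuation characters.

```python
import re

# Group 1: comma, group 2: straight double quote, group 3: the next letter.
COMMA_QUOTE_LETTER = re.compile(r'([,])(["])([A-Za-z])')


def space_before_quote(text: str) -> str:
    # Example A: the quote opens after the comma, so the space goes before it.
    return COMMA_QUOTE_LETTER.sub(r"\1 \2\3", text)


def space_after_quote(text: str) -> str:
    # Example B: the quote closes after the comma, so the space goes after it.
    return COMMA_QUOTE_LETTER.sub(r"\1\2 \3", text)


if __name__ == "__main__":
    print(space_before_quote('later put it,"I feel Monty'))
    print(space_after_quote('little troublemaker,"as Montgomery'))
```

To handle periods as well, the first group could be widened to ''[,.]'' in this sketch; that variant is likewise an assumption and not part of the documented solution.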
  
 ---- ----
 +
  
 <WRAP center round box 80%> <WRAP center round box 80%>
-**PROBLEM**: There are newlines/line breaks (↵) instead of paragraph marks (¶).+**PROBLEM**: There are extra paragraph breaks. We want to keep the real paragraph breaks and remove the fake extra paragraph breaks 
  
-**SOLUTION**: Find and remove all line breaks and replace with a single paragraph break.+**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]] 
 +</WRAP>
  
-Find: ''^m''+----
  
-Replace with: ''^p''+<WRAP center round box 80%> 
 +**PROBLEM**: There are newlines/line breaks (↵) instead of paragraph marks (¶).
  
-In LibreOffice, replace all ''\n'' with ''\p'' to convert them to paragraphs.+**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]]
 </WRAP> </WRAP>
  
Line 121: Line 152:
 ''231(paragraph break)MacG_9781770494220_5p_all_r1.indd 231(paragraph break)10/27/14 11:56 AM(paragraph break)'' ''231(paragraph break)MacG_9781770494220_5p_all_r1.indd 231(paragraph break)10/27/14 11:56 AM(paragraph break)''
  
-**SOLUTION**: Without using wildcards: +**SOLUTION**: See: [[public:nnels:etext:find-and-replace|Find & Replace]]
- +
-Find:  ''^#^#^#^pMacG_9781770494220_5p_all_r1.indd ^#^#^#^p10/27/14 11:56 AM^p'' +
- +
-Replace with: nothing. If you're doing a paginated title, replace with page breaks. +
- +
-To match 2-digit page numbers, remove one ''^#'' from the beginning and one after ''.indd'', then run the search again. Remove one more of each for single-digit page numbers. The following screenshot is an example with a 1-digit page number (see below), followed by the command used to isolate all such instances.  +
- +
-<WRAP center round box 60%> +
- +
-{{:nnels:documentation:content:production:screen_shot_2015-08-06_at_6.10.55_pm.png?300|}} +
- +
-Find: ^#^pMacG_9781770494220_5p_all_r1.indd ^#^p10/27/14 11:56 AM^p +
-</WRAP> +
- +
-You will also need to do it with the leading ^#^p to catch the footer text that do not have any page numbers with it.+
 </WRAP> </WRAP>
  
Line 152: Line 168:
   * ''\p.+\s+[0-9OoIil]{1,3}\p'' ### Detect bad line breaks ###   * ''\p.+\s+[0-9OoIil]{1,3}\p'' ### Detect bad line breaks ###
   * ''[^\."?!]$''   * ''[^\."?!]$''
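As a rough illustration of the last pattern, the Python sketch below lists every line whose final character is not a period, double quotation mark, question mark, or exclamation mark. Such lines are only candidates for bad line breaks: headings and list items will show up too. The function name ''suspicious_lines'' is hypothetical, and checking the text one line at a time is an assumption about how the pattern is meant to be applied.

```python
import re

# The last character of the line is anything except . " ? !
NO_END_PUNCTUATION = re.compile(r'[^\."?!]$')


def suspicious_lines(text: str) -> list[str]:
    # Return the lines that do not end in sentence-final punctuation.
    return [line for line in text.splitlines() if NO_END_PUNCTUATION.search(line)]


if __name__ == "__main__":
    sample = 'He went to the\nstore.\n"Why?"\nChapter 2'
    for line in suspicious_lines(sample):
        print(line)
```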
 +
 +
 +[[public:nnels:etext:start|Return to main eText Page]]
  
public/nnels/etext/regex.txt · Last modified: 2024/05/29 20:30 by rachel.osolen