Differences

This shows you the differences between two versions of the page.

--- public:nnels:etext:regex [2017/10/01 16:48]
sabina.iseli-otto Page moved from public:nnels:regex to public:nnels:public:nnels:etext:regex
+++ public:nnels:etext:regex [2017/11/02 18:46]
farrah.little
@@ Line 1: / Line 1: @@
+====== Regular Expressions ======
+Regular expressions (aka regex) are useful for finding and replacing patterns of text: for example, replacing headers/footers with page breaks (or simply removing them), or removing the line breaks that commonly appear mid-word or mid-sentence when text is converted from a PDF.
+====Tips====
+[[https://support.office.com/en-ca/article/Find-and-replace-text-and-other-data-in-your-Word-2010-files-c6728c16-469e-43cd-afe4-7708c6c779b7?ui=en-US&rs=en-CA&ad=CA#__toc282774574|Using wildcards in Microsoft Word]] (this is similar to regular expressions, but Word has a lot of its own syntax)
+  * Word has a lot of options to find letters (^$) and numbers (^#) but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off.
+  * A lot of the codes for special characters (e.g. page break) are under the "Special..." button.
+{{:public:nnels:regex.png?400|}}
+==== In LibreOffice & OpenOffice ====
+Make sure that the ''Regular expressions'' box is checked on the Alternative Find & Replace dialog for all of the search and replace actions below.
+[[https://help.libreoffice.org/Common/List_of_Regular_Expressions|Regular expressions in LibreOffice]]
+[[https://wiki.openoffice.org/wiki/Documentation/How_Tos/Regular_Expressions_in_Writer|Regular Expressions in OpenOffice]]
+===== Conversion Fixes =====
+The following fixes assume you are using Word, unless otherwise stated.
+<note>Contribute your problems and regex solutions below. Attach your screenshots of both the problem and solution.</note>
+----
+**PROBLEM**: Each line ends with a paragraph break.
+**SOLUTION**: There is no single solution to this, but the typical approach is to search for a letter and a space (i.e. a line that does not end in a period), followed by a paragraph break, followed by a letter, and to replace it with the same text minus the paragraph break.
+In Word, this will only work with wildcards turned on.
+Find: ''([A-z] )^13([A-z])''
+Replace with: ''\1\2''
+This looks for the pattern: any-letter space paragraph-break any-letter
+The parentheses are used to group what it finds, so \1 refers to the first "any-letter" group and \2 refers to the second "any-letter" group.
+In this way, you are putting back exactly what it found minus the paragraph break.
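+Outside Word, the same find-and-replace can be tried in a script. The following is a minimal Python sketch, not part of the Word workflow: it assumes the text has been exported to plain text with ''\n'' standing in for the paragraph break, and the function name ''join_broken_lines'' is made up for illustration. It uses ''[A-Za-z]'' because in Python ''[A-z]'' would also match the punctuation characters that sit between ''Z'' and ''a''.

```python
import re


def join_broken_lines(text: str) -> str:
    """Remove a line break that falls between 'letter + space' and a letter."""
    # Group 1: a letter and the space after it (the end of the broken line).
    # Group 2: the first letter of the following line.
    # The replacement puts both groups back and leaves out the line break.
    return re.sub(r"([A-Za-z] )\n([A-Za-z])", r"\1\2", text)


sample = "The quick brown \nfox jumps.\nA new paragraph."
print(join_broken_lines(sample))
```

+The break after "jumps." is kept because the character before it is a period, not a letter followed by a space.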
+----
+**PROBLEM**: Hyphenated words that break over two lines.
+**SOLUTION**: Replace with the same text minus the hyphen.
+Find: ''([a-z])-^13([a-z])''
+Replace with: ''\1\2''
+Using ''[a-z]'' restricts matches to lowercase letters.
+You will likely have to repeat the search for lines that end with a comma, and possibly with an en dash or em dash. Look through your document for any other patterns it might have missed.
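+As an illustration only, the hyphen fix can be sketched in Python. This assumes plain text with ''\n'' in place of the paragraph break; ''rejoin_hyphenated'' is a hypothetical name, not something Word provides.

```python
import re


def rejoin_hyphenated(text: str) -> str:
    """Join a word split as 'lowercase letter, hyphen, line break, lowercase letter'."""
    # Both groups are limited to lowercase letters, so a hyphen followed by a
    # capitalised word on the next line is left alone for manual review.
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


sample = "conver-\nsion of a well-\nKnown text"
print(rejoin_hyphenated(sample))
```

+Note that this also joins genuinely hyphenated words (such as "well-known") when the line happens to break at the hyphen, so the results still need a read-through.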
+----
+**PROBLEM**: There are extra paragraph breaks. We want to keep the real paragraph breaks and remove the fake extra paragraph breaks.
+**SOLUTION**: Use MS Word's find and replace to remove the extra paragraph breaks using special Word symbols.
+Find: ''^p^p'' (you can also search for more than 2 paragraph breaks, i.e. ''^p^p^p'')
+Replace with: ''^p''
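+For comparison, here is a small Python sketch of the same clean-up. It assumes plain text where each paragraph break is a single ''\n''; the function name is hypothetical. The quantifier ''{2,}'' matches a run of two or more breaks, so runs of any length are collapsed in one pass.

```python
import re


def collapse_paragraph_breaks(text: str) -> str:
    """Replace every run of two or more line breaks with a single one."""
    return re.sub(r"\n{2,}", "\n", text)


sample = "First paragraph.\n\n\nSecond paragraph.\n\nThird paragraph.\nFourth."
print(collapse_paragraph_breaks(sample))
```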
+----
+**PROBLEM**: There are newlines/line breaks (↵) instead of paragraph marks (¶).
+**SOLUTION**: Replace each line break with a paragraph break.
+Find: ''^l'' (manual line break; ''^m'' is Word's code for a manual page break)
+Replace with: ''^p''
+In LibreOffice, replace all ''\n'' with ''\p'' to convert them to paragraphs.
+----
+**PROBLEM**: Running headers. In the example below, the first three digits and the three digits after the filename are the page number:
+''231(paragraph break)MacG_9781770494220_5p_all_r1.indd 231(paragraph break)10/27/14 11:56 AM(paragraph break)''
+**SOLUTION**: Without using wildcards:
+Find:  ''^#^#^#^pMacG_9781770494220_5p_all_r1.indd ^#^#^#^p10/27/14 11:56 AM^p''
+Replace with: nothing. If you're doing a paginated title, replace with page breaks.
+To catch 2-digit page numbers, remove one ''^#'' from the beginning and one after ''.indd'', then run the search again; remove one more of each for single-digit page numbers. The screenshot below shows an example with a 1-digit page number, followed by the search used to find all such instances.
+<WRAP center round box 60%>
+{{:nnels:documentation:content:production:screen_shot_2015-08-06_at_6.10.55_pm.png?300|}}
+Find: ''^#^pMacG_9781770494220_5p_all_r1.indd ^#^p10/27/14 11:56 AM^p''
+</WRAP>
+You will also need to run the search without the leading ''^#^p'' to catch footer text that does not have a page number in front of it.
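+With true regular expressions, the separate searches for 3-, 2- and 1-digit page numbers can be folded into one pattern. The following Python sketch is an illustration under assumptions, not the Word procedure: it treats the text as plain text with ''\n'' for paragraph breaks, reuses the filename and timestamp from the example above, and the names ''FOOTER'' and ''strip_footers'' are invented here.

```python
import re

FOOTER = re.compile(
    r"^(?:\d{1,3}\n)?"                               # optional line holding only the page number
    r"MacG_9781770494220_5p_all_r1\.indd \d{1,3}\n"  # filename followed by the page number
    r"10/27/14 11:56 AM\n",                          # timestamp line
    re.MULTILINE,                                    # let ^ match at the start of every line
)


def strip_footers(text: str, replacement: str = "") -> str:
    """Remove each footer block; pass a marker as replacement for paginated titles."""
    return FOOTER.sub(replacement, text)


sample = (
    "End of page text.\n"
    "231\nMacG_9781770494220_5p_all_r1.indd 231\n10/27/14 11:56 AM\n"
    "Next page text.\n"
    "MacG_9781770494220_5p_all_r1.indd 7\n10/27/14 11:56 AM\n"
    "Last page.\n"
)
print(strip_footers(sample))
```

+''\d{1,3}'' matches one to three digits, and the optional first group covers footers with no page number line in front of them.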
+In LibreOffice:
+  * Verso (left hand)
+  * ''\p[0-9OoIil]{1,3}\s+.+\p''
+    * taken piece-by-piece, this means:
+    * ''\p'' : a paragraph marker
+    * ''[0-9OoIil]{1,3}'' : between one and three numbers or "number like" symbols. (OCR programs often mistake ''o'' or ''O'' for ''0'' and ''I'', ''i'', or ''l'' for ''1''.)
+    * ''\s+'' : one or more whitespace characters (spaces, tabs, etc.)
+    * ''.+'' : one or more of any character
+    * ''\p'' : a final paragraph marker
+  * Recto (right hand)
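+  * ''\p.+\s+[0-9OoIil]{1,3}\p''
+==== Detect bad line breaks ====
+  * ''[^\."?!]$'' : finds any paragraph whose last character is not a period, quotation mark, question mark, or exclamation mark, which often indicates a line that was broken mid-sentence.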
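+The LibreOffice patterns above (verso, recto, and the bad-line-break check) can be tried out on exported plain text. This is a rough Python sketch under stated assumptions: each paragraph is one line, so the ''\p'' markers are replaced by testing whole lines, ''[ \t]+'' stands in for ''\s+'' to keep a match on one line, and ''review_lines'' is a hypothetical helper. It only flags lines for review and removes nothing.

```python
import re

VERSO = re.compile(r"[0-9OoIil]{1,3}[ \t]+.+")   # page number, then header text
RECTO = re.compile(r".+[ \t]+[0-9OoIil]{1,3}")   # header text, then page number
BAD_BREAK = re.compile(r'[^."?!]$')              # last character is not . " ? !


def review_lines(text: str) -> list:
    """Return (line number, label, line) for every line worth a manual look."""
    flagged = []
    for number, line in enumerate(text.split("\n"), start=1):
        if VERSO.fullmatch(line) or RECTO.fullmatch(line):
            flagged.append((number, "running header?", line))
        elif BAD_BREAK.search(line):
            flagged.append((number, "bad line break?", line))
    return flagged


sample = (
    "12 CHAPTER TITLE\n"
    "The sentence starts here and\n"
    "continues on the next line.\n"
    "BOOK TITLE 13\n"
    '"Done!"'
)
for hit in review_lines(sample):
    print(hit)
```

+Because the "number like" letters are allowed, ordinary sentences such as "I am here." also match the verso pattern, so every hit should be checked by eye before anything is deleted.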
