This shows you the differences between two versions of the page.
public:nnels:etext:regex [2017/04/03 22:36] farrah.little [In LibreOffice] |
public:nnels:etext:regex [2024/05/29 20:30] rachel.osolen |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Regular Expressions ====== | + | ====== |
Regular expressions (aka regex) are useful for replacing patterns of text, such as headers/ | Regular expressions (aka regex) are useful for replacing patterns of text, such as headers/ | ||
- | ====Tips==== | + | With regex, you can define patterns of text in a number of different ways, but the most commonly used ones for our purposes are **Ranges** and **Groups**. For more information about others, you can take a look at [[https://wordmvp.com/FAQs/General/UsingWildcards.htm|this helpful webpage]]: |
- | + | * Ranges | |
- | [[https://support.office.com/en-ca/article/Find-and-replace-text-and-other-data-in-your-Word-2010-files-c6728c16-469e-43cd-afe4-7708c6c779b7? | + | * Square brackets are always used in pairs and are used to identify //specific characters// |
+ | * [A-Z] will find any upper case letter; | ||
+ | * [a-z] will find any lower case letter; | ||
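+ | * [A-z] will find any letter (upper or lower case); in ASCII-based regex engines this range also matches the six punctuation characters that sit between "Z" and "a", so [A-Za-z] is the safer way to match letters only; | ||
+ | * [0-9] will find any single digit; | ||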
+ | * [abc] will find any of the letters a, b, or c. | ||
+ | * [F] will find upper case “F” | ||
+ | * [Fred] will find " | ||
+ | * Groups | ||
+ | * Round brackets are used in pairs to enclose //groups//. For example: | ||
+ | * '' | ||
+ | * They must be used in pairs and are addressed by number in the replacement. In the replace field, \1 represents the first group, \2 represents the second group, and so on. For example: | ||
+ | * If you wanted to remove the hyphen from " | ||
+ | * Another example: '' | ||
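The range and group ideas above can be tried outside Word. The following is a minimal sketch using Python's standard `re` module, not Word's wildcard syntax; the sample strings and variable names are made up for illustration, and Word's own Find/Replace behaviour may differ in details.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "Fred paid $42 for 3 e-books"

# Ranges: each bracket expression matches ONE character from the set.
upper = re.findall(r"[A-Z]", sample)      # single upper-case letters
digits = re.findall(r"[0-9]", sample)     # single digits
abc = re.findall(r"[abc]", "cab fare")    # any of a, b, or c

# [A-z] covers every ASCII code from "A" to "z", which includes
# punctuation such as "[", "]" and "_" that sits between "Z" and "a".
a_to_z = re.findall(r"[A-z]", "a_[Z]")
letters_only = re.findall(r"[A-Za-z]", "a_[Z]")

# Groups: round brackets capture text; \1 and \2 refer back to the
# first and second group in the replacement.
swapped = re.sub(r"([A-Za-z]+), ([A-Za-z]+)", r"\2 \1", "Smith, Jane")

print(upper)         # ['F']
print(digits)        # ['4', '2', '3']
print(abc)           # ['c', 'a', 'b', 'a']
print(a_to_z)        # ['a', '_', '[', 'Z', ']']
print(letters_only)  # ['a', 'Z']
print(swapped)       # Jane Smith
```

Note that in Python the group references are written `\1` and `\2` in the replacement string, which matches the numbering described above.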
- | * Word has a lot of options to find letters (^$) and numbers (^#) but these only work with the wildcard option //off// (which it is by default). Only turn the wildcard option on if you're using regex options. Read the info page carefully on when things apply with the wildcard option on/off. | + | <note tip>Word has a lot of options to find letters (^$) and numbers (^#) when using the non-regex [[public: |
+ | </ | ||
+ | |||
+ | <note tip>A lot of the codes for special characters (e.g. page break) are under the " | ||
+ | </ | ||
+ | |||
+ | =====Problems and Solutions Using Regular Expressions===== | ||
+ | |||
+ | In this section you will find examples of different ways to use '' | ||
- | * A lot of the codes for special characters (e.g. page break) are under the " | + | <note tip>If you don't see the solution to your problem on this page, go to the [[public: |
- | {{:public: | + | |
- | ==== In LibreOffice & OpenOffice ==== | + | |
- | Make sure that the '' | + | |
- | [[https:// | + | < |
- | [[https://wiki.openoffice.org/ | + | |
- | ===== Conversion Fixes ===== | + | ---- |
- | The following fixes assume you are using Word, unless otherwise stated. | + | |
+ | <WRAP center round box 80%> | ||
**PROBLEM**: | **PROBLEM**: | ||
- | **SOLUTION**: | + | **SOLUTION**: |
In Word, this will only work with wildcards turned on. | In Word, this will only work with wildcards turned on. | ||
Line 29: | Line 46: | ||
Replace with: '' | Replace with: '' | ||
- | This looks for the pattern: any-letter space paragraph-break any-letter | + | This looks for the pattern: |
The parentheses are used to group what it finds, so \1 refers to the first " | The parentheses are used to group what it finds, so \1 refers to the first " | ||
In this way, you are putting back exactly what it found minus the paragraph break. | In this way, you are putting back exactly what it found minus the paragraph break. | ||
+ | </ | ||
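The "letter, space, paragraph break, letter" idea described above can be sketched in Python. This is an illustration under assumptions: the paragraph break is represented by `\n`, the pattern is a stand-in written for Python's `re` module rather than the Word Find/Replace strings used on this page, and the function name `join_broken_paragraphs` is hypothetical.

```python
import re

def join_broken_paragraphs(text: str) -> str:
    """Remove a paragraph break that falls in the middle of a sentence.

    Matches: a letter, a space, a newline, then another letter.
    The two letters are captured as groups 1 and 2 and put back with
    only the space between them, so nothing but the break is lost.
    """
    return re.sub(r"([A-Za-z]) \n([A-Za-z])", r"\1 \2", text)

# Hypothetical sample: the first break is mid-sentence, the second is real.
broken = "the quick \nbrown fox.\nNext paragraph"
fixed = join_broken_paragraphs(broken)
print(repr(fixed))  # 'the quick brown fox.\nNext paragraph'
```

The break after "fox." is left alone because the character before it is a period, not a letter followed by a space.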
+ | |||
+ | ---- | ||
+ | <WRAP center round box 80%> | ||
**PROBLEM**: | **PROBLEM**: | ||
Line 46: | Line 67: | ||
You will likely have to do it again for lines that end with a comma, and possibly en and em dash. Look through your document for patterns of anything else it might have missed. | You will likely have to do it again for lines that end with a comma, and possibly en and em dash. Look through your document for patterns of anything else it might have missed. | ||
+ | </ | ||
- | **PROBLEM**: | + | ---- |
- | **SOLUTION**: Use MS Word's find and replace to remove the extra paragraph breaks using special Word symbols. | + | <WRAP center round box 80%> |
+ | **PROBLEM**: Hyphens that wrongly split a single word (within a line, not over two lines). | ||
- | Find: '' | + | **SOLUTION**: Replace with the same text minus the hyphen. |
- | Replace with: '' | + | Find: '' |
- | **PROBLEM**: There are newlines/ | + | Replace with: '' |
- | **SOLUTION**: | + | Using a-z restricts what it finds to lowercase. |
- | Find: '' | + | You will likely have to do it again for lines that end with a comma, and possibly en and em dash. Look through your document for patterns of anything else it might have missed. |
+ | </ | ||
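As a rough illustration of "replace with the same text minus the hyphen", here is a Python sketch. It assumes a lowercase-hyphen-lowercase pattern written for Python's `re` module, which is not necessarily the exact Find string used in Word; the sample sentence and variable names are invented.

```python
import re

# Hypothetical sample: one bad hyphen, one proper name, one real compound.
text = "They were to-gether with Anne-Marie, a well-known pair."

# List the candidates first so each one can be reviewed by eye.
candidates = re.findall(r"[a-z]+-[a-z]+", text)

# Put back the letters on either side (\1 and \2) without the hyphen.
# Restricting both sides to a-z skips "Anne-Marie" (capital M).
dehyphenated = re.sub(r"([a-z])-([a-z])", r"\1\2", text)

print(candidates)    # ['to-gether', 'well-known']
print(dehyphenated)  # They were together with Anne-Marie, a wellknown pair.
```

The output shows why a blanket "replace all" is risky: the legitimate compound "well-known" is merged too, so it is safer to step through the matches one at a time.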
- | Replace with: '' | + | ---- |
- | <del>We have to convert the double paragraph breaks into something else unique, remove the single paragraph breaks and then convert the unique characters that were double paragraph breaks into new single paragraph breaks. It is best to do this at the beginning of the text correction stage as it appears to mess with existing formatting styles. | + | <WRAP center round box 80%> |
- | - Find and replace all double paragraphs | + | **PROBLEM:** OCR converted some " |
- | * initiate a find for, ^p^p | + | |
- | - Replace with a unique symbol or code, eg, ' xswedc ' | + | |
- | | + | |
- | - Find and replace all remaining single paragraphs, find = ^p, replace = [single keyboard space] | + | |
- | - Find and replace all the double paragraphs you previously changed into a special symbol | + | |
- | - Find and remove all line breaks, change into double or single paragraphs instead (find = ^m, replace = ^p )</ | + | |
+ | **SOLUTION: | ||
+ | - | ||
+ | - Find: '' | ||
+ | - Replace: '' | ||
+ | - | ||
+ | - Find: '' | ||
+ | - Replace: '' | ||
+ | </ | ||
+ | |||
+ | ---- | ||
+ | |||
+ | |||
+ | <WRAP center round box 80%> | ||
+ | |||
+ | **PROBLEM: | ||
+ | * Example A: As one of Montgomery' | ||
+ | * Example B: The "nasty little '' | ||
+ | This problem has an added complexity; the pattern has two different solutions: | ||
+ | * Example A will need to say: ... later put '' | ||
+ | * Example B will need to say: The "nasty little troublemaker''," | ||
- | ===== Running Headers ===== | + | **SOLUTIONS: |
- | ==== In Word ==== | + | Example |
- | Example, where the first three numbers and the three numbers after the filename is the page number: 231(paragraph break)MacG_9781770494220_5p_all_r1.indd 231(paragraph break)10/ | + | |
- | Without using wildcards, you can look for the pattern: ^# | + | Find: '' |
+ | Replace: '' | ||
- | Replace it with nothing. (Or replace it with page breaks if you're doing a paginated title. Refer above or to the wildcard reference page on the syntax.) | + | Example B: |
- | You will need to remove one of the ^# at the beginning and after the .indd to remove it for 2 digit page numbers, and one last time for single digit page numbers. The following screenshot is an example with a 1-digit page number | + | Find: '' |
- | <WRAP center round box 60%> | + | Replace: '' |
- | {{:nnels: | + | Notes: |
+ | * You will **not** be able to use " | ||
+ | * You will also need to re-do this, searching for periods instead of commas. | ||
- | Command: ^# | ||
</ | </ | ||
+ | ---- | ||
+ | <WRAP center round box 80%> | ||
+ | **PROBLEM**: | ||
- | You will also need to do it with the leading ^#^p to catch the footer text that does not have any page numbers | + | **SOLUTION**: |
+ | </ | ||
+ | |||
+ | ---- | ||
+ | |||
+ | <WRAP center round box 80%> | ||
+ | **PROBLEM**: | ||
+ | |||
+ | **SOLUTION**: | ||
+ | </ | ||
+ | |||
+ | ---- | ||
+ | |||
+ | <WRAP center round box 80%> | ||
+ | **PROBLEM**: | ||
+ | '' | ||
+ | |||
+ | **SOLUTION**: | ||
+ | </ | ||
- | ==== In LibreOffice | + | In LibreOffice: |
* Verso (left hand) | * Verso (left hand) | ||
Line 108: | Line 168: | ||
* '' | * '' | ||
* '' | * '' | ||
+ | |||
+ | |||
+ | [[public: | ||