Transcription Correction Service Python
```python
import re


def correct_transcription(transcription, common_errors):
    """
    Corrects a transcription based on a dictionary of common errors.

    Args:
        transcription (str): The original transcription text.
        common_errors (dict): A dictionary where keys are incorrect phrases/words
            and values are their correct counterparts.

    Returns:
        str: The corrected transcription.
    """
    corrected_text = transcription
    # Entries are applied one after another, so a later entry can match text
    # produced by an earlier one (e.g. "their" -> "there" -> "they're").
    for incorrect, correct in common_errors.items():
        # Using regex to perform case-insensitive and whole-word replacement
        pattern = r"\b" + re.escape(incorrect) + r"\b"  # \b ensures whole word match
        corrected_text = re.sub(pattern, correct, corrected_text, flags=re.IGNORECASE)
    return corrected_text


# Example usage
if __name__ == "__main__":
    # Define a dictionary of common transcription errors. This is the core
    # of the system and needs to be populated with real-world errors.
    error_dictionary = {
        "their": "there",
        "there": "they're",
        "to": "too",
        "too": "two",
        "your": "you're",
        "weather": "whether",
        "than": "then",
        "four": "for",
        "of": "have",  # Commonly misused, needs careful consideration in a real application
        "could of": "could have",  # Important: Handle multi-word errors
        "would of": "would have",
        "should of": "should have",
        "hear": "here",
        "I'm": "am",  # very basic demonstration: drops the pronoun from the contraction
        "unclear": "",  # very basic demonstration for removing unclear
        "um": "",  # very basic demonstration for removing filler
        "uh": "",  # very basic demonstration for removing filler
    }
    original_transcription = "Their going to go their because they were too late for the weather. Should of went sooner. I'm not hear for this. Um, uh, unclear instructions too."
    corrected_transcription = correct_transcription(original_transcription, error_dictionary)
    print("Original Transcription:\n", original_transcription)
    print("\nCorrected Transcription:\n", corrected_transcription)
```
Key points and explanations:
* **`re.escape()`:** Critically important! This function escapes regex metacharacters in the `incorrect` string so they are matched literally. None of the sample keys contain such characters, but without it a key containing `.`, `*`, or `(` would be interpreted as regex syntax and could match the wrong text or raise `re.error`.
* **`\b` word boundaries:** The `\b` in `r"\b" + re.escape(incorrect) + r"\b"` enforces *whole word* matching. This prevents accidental corrections within words. For example, if you tried to correct "to" to "too" without `\b`, it would change words like "today" to "tooday". `\b` matches the boundary between a word character (alphanumeric or underscore) and a non-word character (like a space, punctuation, or the beginning/end of the string).
* **`re.IGNORECASE` flag:** This flag makes matching case-insensitive, so "Their" and "their" both match the `their` entry. The replacement text is inserted exactly as written in the dictionary, so the capitalization of the original word is not preserved.
* **`re.sub()` for replacement:** `re.sub()` is the standard way to do regular expression substitution in Python. By default (`count=0`) it replaces *all* occurrences of the pattern, so no `count` argument is needed.
* **Clearer Example Usage:** The `if __name__ == "__main__":` block ensures that the example code only runs when the script is executed directly, not when it is imported as a module. The example string mixes several error types, and the `I'm`, `unclear`, `um` and `uh` entries demonstrate very basic removal of words by mapping them to a shorter or empty replacement.
* **Multi-Word Error Handling:** The `error_dictionary` includes `"could of": "could have"` and similar corrections. This is *essential* because many common errors are phrases, not just single words.
* **Docstring:** The function has a docstring that explains what it does, its arguments, and its return value.
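* **`error_dictionary` is the Key:** The `error_dictionary` is the heart of the system; its effectiveness depends directly on the quality and coverage of its entries, which need careful curation for a real-world application. Entries are applied one after another in insertion order, so a replacement can itself be matched by a later entry: with the sample dictionary, "their" becomes "there" and is then rewritten to "they're".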
* **Named `pattern` variable:** The regex is built in a separate `pattern` variable rather than inline in the `re.sub()` call, which keeps the substitution line short and easy to read.
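Because the loop applies entries one after another, text produced by one correction can be corrected again by a later entry. One possible way around that, sketched here under assumptions, is to combine all keys into a single alternation and replace in one scan. The function name `correct_transcription_single_pass` and the small `errors` dictionary are hypothetical and not part of the original script; only standard `re` calls are used.

```python
import re


def correct_transcription_single_pass(transcription, common_errors):
    """Apply every correction in one scan so replacements are never re-corrected."""
    if not common_errors:
        return transcription
    # Lower-cased lookup table, because matching is case-insensitive.
    lookup = {wrong.lower(): right for wrong, right in common_errors.items()}
    # Longest keys first so "could of" is tried before the shorter "of".
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(key) for key in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    # Fall back to the matched text if the lower-cased match is not a key.
    return pattern.sub(
        lambda match: lookup.get(match.group(0).lower(), match.group(0)),
        transcription,
    )


# Hypothetical mini-dictionary with a chained pair and a multi-word entry.
errors = {"their": "there", "there": "they're", "of": "have", "could of": "could have"}
result = correct_transcription_single_pass("Their dog could of run there.", errors)
print(result)  # there dog could have run they're.
```

Each word is rewritten at most once here, so "Their" stops at "there" instead of continuing on to "they're". The trade-off is that the order of dictionary entries no longer has any effect, which may or may not be what a given application wants.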
**How to run this code:**
1. **Save:** Save the code as a Python file (e.g., `transcription_corrector.py`).
2. **Run:** Open a terminal or command prompt, navigate to the directory where you saved the file, and run the script using `python transcription_corrector.py`.
The output will show the original and corrected transcriptions. You can modify the `original_transcription` and the `error_dictionary` to test different scenarios. Crucially, start building a good `error_dictionary` from real-world transcription errors you encounter.
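If the dictionary grows from real-world errors, it may be easier to keep it in a file than in the script itself. The following is a minimal sketch, assuming the entries are stored as a single JSON object of incorrect-to-correct pairs; the file name `common_errors.json` and the helper `load_error_dictionary` are hypothetical and not part of the original code.

```python
import json
import tempfile
from pathlib import Path


def load_error_dictionary(path):
    """Load an {"incorrect": "correct"} mapping from a JSON file (assumed format)."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of incorrect -> correct pairs")
    # Skip blank keys and non-string values so a bad entry cannot break the regex step.
    return {
        wrong: right
        for wrong, right in data.items()
        if wrong.strip() and isinstance(right, str)
    }


# Demo: write a small sample file to a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "common_errors.json"
    sample_path.write_text(
        json.dumps({"could of": "could have", "um": "", "": "ignored", "four": 4}),
        encoding="utf-8",
    )
    loaded = load_error_dictionary(sample_path)

print(loaded)  # {'could of': 'could have', 'um': ''}
```

The returned dictionary can be passed to `correct_transcription` in place of the hard-coded `error_dictionary`.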