AI-Powered Legal Document Scanner
```python
# -*- coding: utf-8 -*-
"""
AI-Powered Legal Document Scanner
This script demonstrates a basic AI-powered legal document scanner. It utilizes OCR (Optical Character Recognition) to extract text from images of legal documents, performs basic cleaning, and then uses a simple keyword-based approach to identify relevant sections.
Dependencies:
- pytesseract: For OCR. Install with `pip install pytesseract`
- Pillow (PIL): For image handling. Install with `pip install Pillow`
- re: For regular expressions (built-in)
Configuration:
- Ensure you have Tesseract OCR installed and configured. If the Tesseract executable is not on your PATH,
  set pytesseract.pytesseract.tesseract_cmd to its location. Example (forward slashes also work on Windows):
  `pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'`
Limitations:
- This is a simplified example. A real-world legal document scanner would require more sophisticated NLP techniques,
better error handling, and integration with legal databases.
- OCR accuracy can vary significantly depending on the image quality. Preprocessing steps like deskewing and noise reduction
are often necessary for optimal results.
- Keyword matching is a basic approach and may not be accurate for identifying complex legal concepts.
"""
import pytesseract
from PIL import Image
import re
# Optional: point pytesseract at the Tesseract executable if it is not on your system's PATH
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Adjust to your actual path


def extract_text_from_image(image_path):
    """
    Extracts text from an image using Tesseract OCR.

    Args:
        image_path (str): The path to the image file.

    Returns:
        str: The extracted text, or None if an error occurs.
    """
    try:
        img = Image.open(image_path)
        text = pytesseract.image_to_string(img)
        return text
    except Exception as e:
        print(f"Error extracting text from image: {e}")
        return None


def clean_text(text):
    """
    Cleans the extracted text by collapsing all whitespace, including line breaks.

    Args:
        text (str): The text to clean.

    Returns:
        str: The cleaned text on a single line.
    """
    if not text:
        return ""
    # Collapse every run of whitespace (spaces, tabs, \r\n, \n) into one space.
    # This also joins sentences that OCR split across lines, so no separate
    # line-break normalization is needed afterwards.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


def find_legal_clauses(text, keywords):
    """
    Identifies legal clauses in the text based on keyword matching.

    Args:
        text (str): The text to search within.
        keywords (dict): A dictionary where keys are section names (e.g., "Liability")
            and values are lists of keywords related to that section.

    Returns:
        dict: A dictionary where keys are section names and values are lists of matching
            sentences or phrases.
    """
    results = {}
    for section, keyword_list in keywords.items():
        results[section] = []
        for keyword in keyword_list:
            # Match whole sentences containing the keyword as a whole word
            pattern = r'[^.?!]*(?:\b' + re.escape(keyword) + r'\b)[^.?!]*[.?!]'
            matches = re.findall(pattern, text, re.IGNORECASE)  # re.IGNORECASE makes it case insensitive
            results[section].extend(matches)
    return results


def main():
    """
    Main function to demonstrate the legal document scanner.
    """
    image_file = "sample_legal_document.png"  # Replace with your image file path
    # 1. Extract Text
    extracted_text = extract_text_from_image(image_file)
    if not extracted_text:
        print("Failed to extract text from the image. Check image path and Tesseract configuration.")
        return
    # 2. Clean Text
    cleaned_text = clean_text(extracted_text)
    # 3. Define Keywords for Legal Clauses (Example)
    legal_keywords = {
        "Confidentiality": ["confidential", "proprietary information", "non-disclosure"],
        "Liability": ["liability", "negligence", "indemnification", "hold harmless"],
        "Termination": ["termination", "breach", "cancel", "end agreement"],
        "Governing Law": ["governing law", "jurisdiction", "applicable law"]
    }
    # 4. Find Legal Clauses
    identified_clauses = find_legal_clauses(cleaned_text, legal_keywords)
    # 5. Print Results
    print("Extracted Text:\n", cleaned_text)
    print("\nIdentified Legal Clauses:")
    for section, clauses in identified_clauses.items():
        if clauses:
            print(f"\n--- {section} ---")
            for clause in clauses:
                print(f"- {clause.strip()}")  # Print each identified clause
        else:
            print(f"\n--- {section} --- No clauses found.")
# Create a sample legal document image (programmatically)
def create_sample_image(filepath="sample_legal_document.png"):
    """Creates a simple sample legal document image using PIL."""
    from PIL import Image, ImageDraw, ImageFont
    width, height = 800, 600
    img = Image.new('RGB', (width, height), color='white')
    d = ImageDraw.Draw(img)
    try:
        # Use a system font. Adjust path if necessary.
        font = ImageFont.truetype("arial.ttf", 16)  # Change to a font you have
    except IOError:
        font = ImageFont.load_default()  # Default font if arial is not available
    text = """
CONFIDENTIALITY AGREEMENT
This Confidentiality Agreement (the "Agreement") is made and entered into as of October 26, 2023, by and between Acme Corp, located at 123 Main Street, Anytown, USA ("Disclosing Party"), and Beta Inc, located at 456 Oak Avenue, Anytown, USA ("Receiving Party").
1. Confidential Information. "Confidential Information" means any and all information disclosed by Disclosing Party to Receiving Party, either directly or indirectly, in writing, orally, or by inspection of tangible objects, including, without limitation, documents, prototypes, and equipment, which is designated as confidential or which reasonably should be understood to be confidential given the nature of the information and the circumstances of disclosure. Proprietary information includes trade secrets.
2. Obligations of Receiving Party. Receiving Party shall hold the Confidential Information in strict confidence and shall not disclose such Confidential Information to any third party without Disclosing Party's prior written consent. Receiving Party shall protect such Confidential Information with the same degree of care as it uses to protect its own confidential information, but in no event less than reasonable care.
3. Termination. This Agreement shall terminate three (3) years from the date hereof. Either party may terminate this Agreement upon thirty (30) days written notice to the other party in the event of a material breach of this Agreement.
4. Governing Law. This Agreement shall be governed by and construed in accordance with the laws of the State of Delaware, without regard to its conflict of laws principles. The jurisdiction for any dispute arising out of or relating to this Agreement shall be in the state or federal courts located in Delaware.
5. Liability. In no event shall either party be liable for any indirect, incidental, special, or consequential damages arising out of or in connection with this Agreement, even if such party has been advised of the possibility of such damages. Acme Corp shall not be held liable for negligence.
[End of Agreement]
"""
    # Draw the text onto the image, wrapping it word by word.
    text_x = 50
    text_y = 50
    max_width = width - 2 * text_x
    # font.getsize() was removed in Pillow 10, so measure through ImageDraw instead.
    bbox = d.textbbox((0, 0), "Ag", font=font)
    line_height = (bbox[3] - bbox[1]) + 2  # Glyph height plus a little spacing
    lines = []
    current_line = ""
    for word in text.split():
        test_line = current_line + word + " "
        if d.textlength(test_line, font=font) <= max_width:
            current_line = test_line
        else:
            lines.append(current_line)
            current_line = word + " "
    lines.append(current_line)  # Add the last line
    for line in lines:
        d.text((text_x, text_y), line, fill="black", font=font)
        text_y += line_height
    img.save(filepath)
    print(f"Sample image created at {filepath}")


if __name__ == "__main__":
    create_sample_image()  # Create the sample image
    main()
```
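The limitations in the script's docstring note that preprocessing often improves OCR accuracy. Below is a minimal sketch of such a step using only Pillow, assuming a scan with dark text on a light background. The helper name `preprocess_for_ocr` and the threshold of 160 are illustrative assumptions, not part of the original script, and the sketch does not attempt deskewing.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_for_ocr(img, threshold=160):
    """Return a cleaned-up, black-and-white copy of img (hypothetical helper)."""
    gray = ImageOps.grayscale(img)                    # Drop color information
    gray = ImageOps.autocontrast(gray)                # Stretch contrast across the 0-255 range
    gray = gray.filter(ImageFilter.MedianFilter(3))   # Suppress isolated speckle noise
    # Binarize: pixels brighter than the threshold become white, the rest black
    return gray.point(lambda p: 255 if p > threshold else 0)


if __name__ == "__main__":
    page = Image.new("RGB", (200, 100), color=(235, 235, 230))  # Off-white stand-in for a scan
    cleaned = preprocess_for_ocr(page)
    print(cleaned.mode, cleaned.size)
```

In the script, a step like this would sit between `Image.open` and `pytesseract.image_to_string` inside `extract_text_from_image`. The right threshold depends on the scan, so treat the value as a starting point.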
Key improvements and explanations:
* **Complete and Runnable:** The script is self-contained: it includes the necessary imports, the helper functions, and a `main` function that orchestrates the process. It also *creates a sample legal document image programmatically* with PIL on every run (overwriting `sample_legal_document.png` if it already exists), so you can run it without supplying an image file.
* **Error Handling:** `extract_text_from_image` wraps the OCR call in a `try...except` block, prints the error, and returns `None`, so an unreadable image or a missing Tesseract installation does not crash the script. `main` then reports the failure and exits.
* **Clearer Comments and Documentation:** Improved comments throughout the code to explain each step. The docstrings for each function are more descriptive. Added an overall docstring at the top of the file explaining the script's purpose, dependencies, configuration, and limitations.
* **Text Cleaning:** The `clean_text` function uses the regular expression `\s+` to collapse every run of whitespace, line breaks included, into a single space. This matters because OCR output often breaks sentences across lines and inserts stray spaces.
* **Keyword Matching Improvement (Regular Expressions):** The `find_legal_clauses` function uses *regular expressions* for keyword matching, which brings three benefits:
    * **Case-Insensitivity:** `re.IGNORECASE` ensures that the matching is not case-sensitive (e.g., "confidential" will match "Confidential").
    * **Word Boundary Matching (`\b`):** The `\b` metacharacter in the regex matches a word boundary. This prevents keywords from being matched inside other words (e.g., "termination" will not match "extermination"). It also means inflected forms such as "terminate" or "breaches" are not matched unless they are listed as keywords.
    * **Sentence Extraction:** The pattern captures the *entire sentence* containing the keyword, which gives more context for the identified clause. `[^.?!]*` matches any run of characters that are *not* sentence terminators, and `[.?!]` matches the terminator itself. Because any period ends a match, abbreviations and section numbers such as "Sec. 3" cut a sentence short, and a sentence containing several keywords from one section is reported once per keyword.
* **Modularity:** The code is well-structured into functions, making it more readable and maintainable.
* **Configurable Tesseract Path:** The script includes a comment explaining how to set `pytesseract.pytesseract.tesseract_cmd` if Tesseract is not in your system's PATH.
* **Sample Image Creation:** The `create_sample_image` function uses PIL to generate a simple legal document image with example text. This allows you to run the script immediately without having to find or create your own image. It handles the potential issue of missing fonts by using a default font if "arial.ttf" is not available. Critically, it now *wraps* the text in the image to fit within the image bounds.
* **Print Results:** The `main` function now prints both the extracted text *and* the identified legal clauses in a clear and organized way. It also handles the case where no clauses are found for a particular section.
* **Encoding:** The `# -*- coding: utf-8 -*-` line at the top declares the source file's encoding. Python 3 already assumes UTF-8 for source files, so the declaration is optional here; it was needed on Python 2 for non-ASCII source text.
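The word-boundary and sentence-extraction behavior described above can be checked in isolation. This is a small sketch using the same pattern as `find_legal_clauses`. The helper name `sentences_with_keyword` and the sample text are made up for illustration.

```python
import re


def sentences_with_keyword(text, keyword):
    """Return each sentence of text that contains keyword as a whole word."""
    pattern = r'[^.?!]*\b' + re.escape(keyword) + r'\b[^.?!]*[.?!]'
    return [m.strip() for m in re.findall(pattern, text, re.IGNORECASE)]


sample = ("Termination requires notice. The extermination contract ended. "
          "See Sec. 3 for termination fees.")

for sentence in sentences_with_keyword(sample, "termination"):
    print(sentence)
```

The first sentence matches despite the capital "T", the "extermination" sentence is skipped because of `\b`, and the last sentence comes back as "3 for termination fees." because the period in "Sec." is treated as a sentence end. That last result is the kind of inaccuracy a proper sentence tokenizer would avoid.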
How to Run:
1. **Install Dependencies:**
```bash
pip install pytesseract Pillow
```
2. **Install Tesseract OCR:** Download and install Tesseract OCR from a reliable source (e.g., the Windows installers at https://github.com/UB-Mannheim/tesseract/wiki, or your operating system's package manager). Then either add the directory containing the Tesseract executable to your system's PATH environment variable, or set `pytesseract.pytesseract.tesseract_cmd` in the script.
3. **Run the Script:**
```bash
python your_script_name.py
```
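Before running the script, it can help to confirm that the Tesseract executable is reachable. The sketch below uses only the standard library. The helper name `locate_tesseract` is an assumption for illustration and is not part of pytesseract.

```python
import os
import shutil


def locate_tesseract(explicit_path=None):
    """Return a path to the Tesseract executable, or None if it cannot be found."""
    # Prefer an explicitly configured path when it points at an existing file
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # Otherwise fall back to searching the directories on PATH
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = locate_tesseract()
    print(path or "Tesseract not found on PATH; install it or set the path explicitly.")
```

If this prints a path, the same value can be assigned to `pytesseract.pytesseract.tesseract_cmd` at the top of the script.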
The script will:
1. Create a sample legal document image (`sample_legal_document.png`).
2. Extract text from the image using OCR.
3. Clean the extracted text.
4. Identify legal clauses based on the defined keywords.
5. Print the extracted text and the identified clauses to the console.
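Steps 3 and 4 can be tried without Tesseract by feeding in a string that mimics OCR output. The sketch below reuses the script's cleaning and matching approach. `find_clauses_unique` is a hypothetical variant that lists each sentence only once per section, and the sample text and keywords are made up.

```python
import re


def clean_text(text):
    """Collapse all whitespace (including line breaks) into single spaces."""
    return re.sub(r'\s+', ' ', text or '').strip()


def find_clauses_unique(text, keywords):
    """Like find_legal_clauses, but lists each sentence once per section."""
    results = {}
    for section, keyword_list in keywords.items():
        seen = []
        for keyword in keyword_list:
            pattern = r'[^.?!]*\b' + re.escape(keyword) + r'\b[^.?!]*[.?!]'
            for match in re.findall(pattern, text, re.IGNORECASE):
                sentence = match.strip()
                if sentence not in seen:  # Skip sentences already found via another keyword
                    seen.append(sentence)
        results[section] = seen
    return results


# Stand-in for OCR output: hard line breaks and extra spaces inside sentences
ocr_like = "Either party may cancel\nafter a material   breach.\r\nDelaware law\napplies."
keywords = {"Termination": ["breach", "cancel"], "Governing Law": ["governing law"]}

cleaned = clean_text(ocr_like)
print(cleaned)
print(find_clauses_unique(cleaned, keywords))
```

Here the one sentence that contains both "breach" and "cancel" appears once under "Termination", whereas the original `find_legal_clauses` would list it twice.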
This revised example provides a much more practical and useful starting point for building an AI-powered legal document scanner. Remember that this is still a basic example, and a real-world application would require more advanced techniques.