feat: save duplicates

2025-05-21 22:58:45 +02:00
parent f9043dd988
commit 888949b142
3 changed files with 1923 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # IWD Archive Lister

-This script scans the `main/` and `iw4x/` folders under a specified root directory for `.iwd` files (which are ZIP archives). For each `.iwd` file found, it extracts the list of files inside the archive using `7z` and writes the output to a `.txt` file in a folder called `out/`.
+The [list-iwd.sh](list-iwd.sh) script scans the `main/` and `iw4x/` folders under a specified root directory for `.iwd` files (which are ZIP archives). For each `.iwd` file found, it extracts the list of files inside the archive using `7z` and writes the output to a `.txt` file in a folder called `out/`.

 Each `.iwd` file gets its own `.txt` file in the `out/` directory, with the same base name (e.g., `iw_00.iwd` -> `out/iw_00.iwd.txt`).

@@ -22,3 +22,21 @@ sudo apt install p7zip-full
 ```

 Where `<root_directory>` is the path that contains both `main/` and `iw4x/` subfolders.
+
+# IWD Archive Duplicate Finder
+
+The [show-duplicates.py](show-duplicates.py) Python script scans all `.txt` files in the `out/` directory (the per-archive listings previously generated from the `.iwd` archives) and identifies file entries that appear in **more than one archive**.
+
+It prints the results to the console and saves a full report to `out/duplicates/result.txt`.
+
+## What It Does
+
+- Reads every `.txt` file in the `out/` folder.
+- Detects which entries, compared line by line, appear in **multiple** `.txt` files (i.e. are shared between archives).
+- Writes a detailed list of these duplicates to `out/duplicates/result.txt`.
+
+Each line of the report shows the duplicated entry followed by the `.txt` files (one per archive) it appears in, in the form `<entry> -> in: <file1>, <file2>`.
+
+## Requirements
+
+- Python 3.x
--- a/out/duplicates/result.txt
+++ b/out/duplicates/result.txt
--- a/show-duplicates.py
+++ b/show-duplicates.py
@@ -0,0 +1,34 @@
+import os
+from collections import defaultdict
+
+# Folder containing the .txt files
+out_folder = "out"
+duplicates_folder = os.path.join(out_folder, "duplicates")
+result_file_path = os.path.join(duplicates_folder, "result.txt")
+
+os.makedirs(duplicates_folder, exist_ok=True)
+
+# Map each line to a set of files that contain it
+line_to_files = defaultdict(set)
+
+# Iterate over all .txt files in the out folder
+for filename in os.listdir(out_folder):
+    if filename.endswith(".txt") and filename != "result.txt":
+        path = os.path.join(out_folder, filename)
+        with open(path, "r", encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    line_to_files[line].add(filename)
+
+# Open result file for writing
+with open(result_file_path, "w", encoding="utf-8") as result_file:
+    print("Duplicate lines found in multiple files:\n")
+    result_file.write("Duplicate lines found in multiple files:\n\n")
+    for line, files in sorted(line_to_files.items()):
+        if len(files) > 1:
+            info = f"{line} -> in: {', '.join(sorted(files))}"
+            print(info)
+            result_file.write(info + "\n")
+
+print(f"\nResults saved to: {result_file_path}")
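
The script above runs everything at module level, which makes the duplicate detection hard to exercise in isolation. As a minimal sketch, not part of this commit, the same logic could be wrapped in functions. The names `find_duplicates` and `format_report` are hypothetical, and the sample entries in the demo are made up for illustration.

```python
import os
import tempfile
from collections import defaultdict


def find_duplicates(out_folder):
    """Return {line: sorted list of .txt files} for lines found in 2+ files.

    Hypothetical helper mirroring the loop in show-duplicates.py.
    """
    line_to_files = defaultdict(set)
    for filename in os.listdir(out_folder):
        path = os.path.join(out_folder, filename)
        # os.listdir is not recursive, so out/duplicates/result.txt is never
        # visited; the isfile check also skips any directory ending in ".txt".
        if not (filename.endswith(".txt") and os.path.isfile(path)):
            continue
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    line_to_files[line].add(filename)
    # Keep only lines shared by more than one file, in sorted order.
    return {
        line: sorted(files)
        for line, files in sorted(line_to_files.items())
        if len(files) > 1
    }


def format_report(duplicates):
    """Render each duplicate in the same '<entry> -> in: <files>' form."""
    return [f"{line} -> in: {', '.join(files)}" for line, files in duplicates.items()]


if __name__ == "__main__":
    # Demo with two made-up listings that share one entry.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "iw_00.iwd.txt"), "w", encoding="utf-8") as f:
            f.write("images/a.iwi\nimages/b.iwi\n")
        with open(os.path.join(tmp, "iw_01.iwd.txt"), "w", encoding="utf-8") as f:
            f.write("images/a.iwi\nsound/c.wav\n")
        for row in format_report(find_duplicates(tmp)):
            print(row)  # images/a.iwi -> in: iw_00.iwd.txt, iw_01.iwd.txt
```

Returning a plain dictionary keeps the console output and the `result.txt` write as thin callers of one function, so both always report the same set of duplicates.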