Commit 6585785

max-ibragimow and Max Ibragimov authored
Release 1.6.0
* [TTDB-831] fixed communication with pg_dump and pg_restore
  Added:
  - added a postgres version definition to use suitable pg_dump versions
  - run option `--config` for configuration file with paths to specific versions of pg_dump/pg_restore utils
  Updated:
  - moved all modules with pg_anon modes into package `modes`
  - actualized pyproject.toml
  - tests. Function `anon_funcs.random_inn()` in tests replaced cause this function can't guarantee values unique
* [TTDB-831] pg_anon modes refactoring
  Updated:
  - modes `init` and `create-dict` packed into classes
  - class `MainRoutine` decomposed on small methods
  - class `PgAnonResult` now using only in `MainRoutine` and includes methods for updating statuses
  - actualized tests
* [TTDB-831] Improving regexp searching
  Added:
  - for `skip_rules` and `include_rules` searching by `schema_mask` and `table_mask`
  Updated:
  - search in `data_regex.rules` for searching in text including `\n`
  - search in `data_const.partial_constants` now case-insensitive
  - actualized tests and README.md
* [TTDB-831] Optimization for reading data from database
* [TTDB-831] Excluded `view-fields` and `view-data` modes from checking postgres utils
* [TTDB-831] Updated phone regexp in examples and tests
* [TTDB-831] Added postgres version requirement into README.md
---------
Co-authored-by: Max Ibragimov <maxim.ibragimov@tantorlabs.ru>
1 parent 73163cc commit 6585785

29 files changed (+1356 -1013 lines changed)

README.md

Lines changed: 63 additions & 24 deletions
Original file line number | Diff line number | Diff line change
@@ -221,6 +221,8 @@ It uses the following tools:
221221
- PostgreSQL [`pg_dump`](https://www.postgresql.org/docs/current/app-pgdump.html) tool for dumping the database structure.
222222
- PostgreSQL [`pg_restore`](https://www.postgresql.org/docs/current/app-pgrestore.html) tool for restoring the database structure.
223223

224+
**Warning:** Requires PostgreSQL 14 or higher
225+
224226
## Installation Guide
225227

226228
### Preconditions
@@ -266,6 +268,7 @@ Installation processes slightly differ depending on your operating system.
266268
## Testing
267269

268270
To test `pg_anon`, you need to have a local database installed. This section covers the installation of postgres and running the test suite.
271+
Your operating system must also have the `en_US.UTF-8` locale available, because the tests create the database in this locale.
269272

270273
### Setting Up PostgreSQL
271274

@@ -335,6 +338,7 @@ set TEST_DB_HOST=127.0.0.1
335338
set TEST_DB_PORT=5432
336339
set TEST_SOURCE_DB=test_source_db
337340
set TEST_TARGET_DB=test_target_db
341+
set TEST_CONFIG=/path/to/pg_anon/tests/config.yaml
338342
```
339343

340344
## Usage
@@ -347,12 +351,13 @@ python pg_anon.py --help
347351

348352
Common pg_anon options:
349353

350-
| Option | Description |
351-
|---------------------------------|----------------------------------------------------------------------|
352-
| `--debug` | Enable debug mode (default false) |
353-
| `--verbose` | Configure verbose mode: [info, debug, error] (default info) |
354-
| `--db-connections-per-process` | Amount of connections for IO operations for each process (default 4) |
355-
| `--processes` | Amount of processes for multiprocessing operations (default 4) |
354+
| Option | Description |
355+
|--------------------------------|-----------------------------------------------------------------------------------|
356+
| `--debug` | Enable debug mode (default false) |
357+
| `--verbose` | Configure verbose mode: [info, debug, error] (default info) |
358+
| `--db-connections-per-process` | Amount of connections for IO operations for each process (default 4) |
359+
| `--processes` | Amount of processes for multiprocessing operations (default 4) |
360+
| `--config` | Path to the config file in which the paths to the `pg_dump` and `pg_restore` utilities can be specified. |
356361

357362
Database configuration options:
358363

@@ -409,7 +414,7 @@ python pg_anon.py --mode=create-dict \
409414
| `--output-sens-dict-file` | Output file with sensitive fields will be saved to this value |
410415
| `--output-no-sens-dict-file` | Output file with not sensitive fields will be saved to this value (Optional) |
411416
| `--scan-mode` | defines whether to scan all data or only part of it ["full", "partial"] (default "partial") |
412-
| `--scan-partial-rows` | In `--scan-mode partial` defines amount of rows to scan (default 10000) |
417+
| `--scan-partial-rows` | With `--scan-mode partial`, defines the number of rows to scan (default 10000). The actual row count can be smaller after duplicate values are removed |
413418

414419
#### Requirements for input --meta-dict-file (metadict):
415420

@@ -428,24 +433,24 @@ var = {
428433
},
429434
"skip_rules": [ # List of schemas, tables, and fields to skip
430435
{
431-
# possibly some schema or table contains a lot of data that is not worth scanning. Skipped objects will not be automatically included in the resulting dictionary. Masks are not supported in this object.
432-
"schema": "schm_mask_ext_exclude_2", # Schema specification is mandatory
433-
"table": "card_numbers", # Optional. If there is no "table", the entire schema will be skipped.
436+
# Some schemas or tables may contain a lot of data that is not worth scanning. Skipped objects will not be automatically included in the resulting dictionary.
437+
"schema": "schm_mask_ext_exclude_2", # Use "schema" for exact name matching or "schema_mask" for regexp matching. One of them is required.
438+
"table": "card_numbers", # Optional. Use "table" for exact name matching or "table_mask" for regexp matching. If there is neither "table" nor "table_mask", the entire schema will be skipped.
434439
"fields": ["val_skip"] # Optional. If there are no "fields", the entire table will be skipped.
435440
}
436441
],
437442
"include_rules": [ # List of schemas, tables, and fields which will be scanned
438443
{
439444
# You may need to scan only specific fields, or to debug some functions on a specific field
440-
"schema": "schm_other_2", # Required. Schema specification is mandatory
441-
"table": "tbl_test_anon_functions", # Optional. If there is no "table", the entire schema will be included.
442-
"fields": ["fld_5_email"] # Optional. If there are no "fields", the entire table will be included.
445+
"schema": "schm_other_2", # Use "schema" for exact name matching or "schema_mask" for regexp matching. One of them is required.
446+
"table": "tbl_test_anon_functions", # Optional. Use "table" for exact name matching or "table_mask" for regexp matching. If there is neither "table" nor "table_mask", the entire schema will be included.
447+
"fields": ["fld_5_email"] # Optional. If there are no "fields", the entire table will be included.
443448
}
444449
],
445450
"data_regex": { # List of regular expressions to search for sensitive data
446451
"rules": [
447-
"""[A-Za-z0-9]+([._-][A-Za-z0-9]+)*@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+""", # email
448-
"7?[\d]{10}" # phone 7XXXXXXXXXX
452+
r"""[A-Za-z0-9]+([._-][A-Za-z0-9]+)*@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+""", # email
453+
r"^(7?\d{10})$", # phone 7XXXXXXXXXX
449454
]
450455
},
451456
"data_const": {
@@ -545,15 +550,15 @@ var = {
545550

546551
Possible options in mode=dump:
547552

548-
| Option | Description |
549-
|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
550-
| `--prepared-sens-dict-file` | Input file or file list with sensitive fields, which was obtained in previous use by option `--output-sens-dict-file` or prepared manually |
551-
| `--dbg-stage-1-validate-dict` | Validate dictionary, show the tables and run SQL queries without data export (default false) |
552-
| `--dbg-stage-2-validate-data` | Validate data, show the tables and run SQL queries with data export in prepared database (default false) |
553-
| `--dbg-stage-3-validate-full` | Makes all logic with "limit" in SQL queries (default false) |
554-
| `--clear-output-dir` | In dump mode clears output dict from previous dump or another files. (default true) |
555-
| `--pg-dump` | Path to the `pg_dump` Postgres tool (default `/usr/bin/pg_dump`). |
556-
| `--output-dir` | Output directory for dump files. (default "") |
553+
| Option | Description |
554+
|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
555+
| `--prepared-sens-dict-file` | Input file or file list with sensitive fields, which was obtained in previous use by option `--output-sens-dict-file` or prepared manually |
556+
| `--dbg-stage-1-validate-dict` | Validate dictionary, show the tables and run SQL queries without data export (default false) |
557+
| `--dbg-stage-2-validate-data` | Validate data, show the tables and run SQL queries with data export in prepared database (default false) |
558+
| `--dbg-stage-3-validate-full` | Makes all logic with "limit" in SQL queries (default false) |
559+
| `--clear-output-dir` | In dump mode clears output dict from previous dump or another files. (default true) |
560+
| `--pg-dump` | Path to the `pg_dump` Postgres tool (default `/usr/bin/pg_dump`). |
561+
| `--output-dir` | Output directory for dump files. (default "") |
557562

558563
### Run restore mode
559564

@@ -721,6 +726,40 @@ from (
721726
]
722727
```
723728

729+
### Configuring pg_anon
730+
731+
To specify the `pg_dump` and `pg_restore` utilities, use the `--pg-dump` and `--pg-restore` parameters.
732+
For advanced configuration, use `--config`. This parameter accepts a YAML file in the following format:
733+
```yaml
734+
pg-utils-versions:
735+
<postgres_major_version>:
736+
pg_dump: "/path/to/<postgres_major_version>/pg_dump"
737+
pg_restore: "/path/to/<postgres_major_version>/pg_restore"
738+
<another_postgres_major_version>:
739+
pg_dump: "/path/to/<another_postgres_major_version>/pg_dump"
740+
pg_restore: "/path/to/<another_postgres_major_version>/pg_restore"
741+
default:
742+
pg_dump: "/path/to/default_postgres_version/pg_dump"
743+
pg_restore: "/path/to/default_postgres_version/pg_restore"
744+
```
745+
746+
For example, a config for PostgreSQL 15 and PostgreSQL 17 can be specified as follows:
747+
748+
```yaml
749+
pg-utils-versions:
750+
15:
751+
pg_dump: "/usr/lib/postgresql/15/bin/pg_dump"
752+
pg_restore: "/usr/lib/postgresql/15/bin/pg_restore"
753+
17:
754+
pg_dump: "/usr/lib/postgresql/17/bin/pg_dump"
755+
pg_restore: "/usr/lib/postgresql/17/bin/pg_restore"
756+
default:
757+
pg_dump: "/usr/lib/postgresql/17/bin/pg_dump"
758+
pg_restore: "/usr/lib/postgresql/17/bin/pg_restore"
759+
```
760+
761+
If the current PostgreSQL version does not match any version listed in this config, the `pg_dump` and `pg_restore` paths from the `default` section are used. For example, if `pg_anon` is run with this config against PostgreSQL 16, it uses `pg_dump` 17 and `pg_restore` 17 from the `default` section.
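
The lookup described above can be sketched in Python as follows. This is a minimal illustration, not the project's code: the helper name `resolve_pg_utils` and the `CONFIG` dictionary are hypothetical, and `CONFIG` only mimics what a YAML file of this shape looks like after parsing (PyYAML loads an unquoted key such as `17` as an integer).

```python
from typing import Optional, Tuple

# Hypothetical parsed config; the paths are illustrative only.
CONFIG = {
    "pg-utils-versions": {
        17: {
            "pg_dump": "/usr/lib/postgresql/17/bin/pg_dump",
            "pg_restore": "/usr/lib/postgresql/17/bin/pg_restore",
        },
        "default": {
            "pg_dump": "/usr/lib/postgresql/17/bin/pg_dump",
            "pg_restore": "/usr/lib/postgresql/17/bin/pg_restore",
        },
    }
}


def resolve_pg_utils(config: dict, pg_version: str) -> Optional[Tuple[str, str]]:
    """Return (pg_dump, pg_restore) for the server version, or None if unset."""
    major = int(pg_version.split(".")[0])
    versions = config.get("pg-utils-versions") or {}
    # Prefer an exact major-version entry, otherwise fall back to "default".
    utils = versions.get(major) or versions.get("default")
    if not utils:
        return None
    pg_dump, pg_restore = utils.get("pg_dump"), utils.get("pg_restore")
    if not pg_dump or not pg_restore:
        raise ValueError("pg_dump and pg_restore paths must both be specified")
    return pg_dump, pg_restore


# PostgreSQL 16 has no entry of its own, so the "default" section applies.
print(resolve_pg_utils(CONFIG, "16.4"))
```

Returning `None` when nothing matches leaves the caller free to keep whatever was passed through `--pg-dump` and `--pg-restore`.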
762+
724763
### Debug stages in dump and restore modes
725764

726765
#### Debug stages:

pg_anon/common/db_queries.py

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -106,10 +106,13 @@ def get_data_from_field_query(field_info: FieldInfo, limit: int = None, conditio
106106
query_limit = get_limit_query(limit)
107107

108108
query = f"""
109-
SELECT distinct(substring(\"{field_info.column_name}\"::text, 1, 8196))
110-
FROM \"{field_info.nspname}\".\"{field_info.relname}\"
111-
{query_condition}
112-
{query_limit}
109+
SELECT distinct t1._field
110+
FROM (
111+
SELECT (substring(\"{field_info.column_name}\"::text, 1, 8196)) as _field
112+
FROM \"{field_info.nspname}\".\"{field_info.relname}\"
113+
{query_condition}
114+
{query_limit}
115+
) as t1
113116
"""
114117

115118
return query
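
The reshaped query applies the limit inside the subquery and deduplicates afterwards, which is why the README notes that the actual row count can be smaller than `--scan-partial-rows`. The sketch below illustrates the difference under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the table `t` and column `val` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val TEXT)")
conn.executemany(
    "INSERT INTO t (val) VALUES (?)",
    [("a",), ("a",), ("a",), ("b",), ("c",)],
)

# Old shape: DISTINCT runs over the whole table, then LIMIT trims the result.
old = conn.execute("SELECT DISTINCT val FROM t LIMIT 3").fetchall()

# New shape: LIMIT picks 3 rows first, then DISTINCT deduplicates only those.
new = conn.execute(
    "SELECT DISTINCT t1._field "
    "FROM (SELECT val AS _field FROM t LIMIT 3) AS t1"
).fetchall()

# The old query always yields 3 values here; the new one yields at most 3.
print(len(old), len(new))
```

The new form lets the database stop reading after the limited number of rows instead of deduplicating the entire column, at the cost of possibly returning fewer distinct values.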

pg_anon/common/db_utils.py

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
import re
12
from typing import Dict, List
23

34
import asyncpg
@@ -200,7 +201,13 @@ async def run_query_in_pool(pool: Pool, query: str):
200201
logger.info(f"Execute query: {query}")
201202
except Exception as e:
202203
logger.error("Exception in run_query_in_pool:\n" + exception_helper())
203-
raise Exception(f"Can't execute query: {query}")
204+
raise RuntimeError(f"Can't execute query: {query}")
204205

205206
logger.info(f"<================ Finished query {query}")
206207

208+
209+
async def get_pg_version(connection_params: ConnectionParams, server_settings: Dict = SERVER_SETTINGS):
210+
db_conn = await create_connection(connection_params, server_settings=server_settings)
211+
pg_version = await db_conn.fetchval("select version()")
212+
await db_conn.close()
213+
return re.findall(r"(\d+\.\d+)", str(pg_version))[0]
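
As a rough illustration of what the regular expression in `get_pg_version` extracts, the sketch below applies the same pattern to a sample string. The sample is only an assumed example of what `select version()` might return; the helper name `parse_pg_version` is hypothetical.

```python
import re


def parse_pg_version(version_string: str) -> str:
    """Return the first 'major.minor' token found in the version string."""
    return re.findall(r"(\d+\.\d+)", version_string)[0]


# Assumed example output of `select version()`; real strings vary by build.
sample = "PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 12.2.0, 64-bit"

version = parse_pg_version(sample)
major = int(version.split(".")[0])
print(version, major)
```

The major number derived this way is what a per-version lookup in the config would key on. Note that the pattern requires a dotted number, so a version string without one before other dotted tokens would not be parsed as intended.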

pg_anon/common/dto.py

Lines changed: 24 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
import json
2+
import time
23
from dataclasses import dataclass
34
from typing import Optional, Callable, Dict, List
45

@@ -9,7 +10,29 @@ class PgAnonResult:
910
params = None # JSON
1011
result_code = ResultCode.UNKNOWN
1112
result_data = None
12-
elapsed = None
13+
start_time = None
14+
end_time = None
15+
_elapsed = None
16+
17+
def start(self):
18+
self.start_time = time.time()
19+
20+
def fail(self):
21+
self.end_time = time.time()
22+
self.result_code = ResultCode.FAIL
23+
24+
def complete(self):
25+
self.end_time = time.time()
26+
self.result_code = ResultCode.DONE
27+
28+
@property
29+
def elapsed(self):
30+
if not self._elapsed:
31+
if self.start_time is None or self.end_time is None:
32+
return None
33+
34+
self._elapsed = round(self.end_time - self.start_time, 2)
35+
return self._elapsed
1336

1437

1538
@dataclass

pg_anon/common/multiprocessing_utils.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ async def init_process(name: str, ctx, target_func: Callable, tasks: List, *args
1414

1515
p = aioprocessing.AioProcess(
1616
target=target_func,
17-
args=(name, ctx, queue, tasks, *args),
17+
args=(name, queue, tasks, *args),
1818
kwargs=kwargs,
1919
)
2020
p.start()

pg_anon/common/utils.py

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,12 +3,14 @@
33
import hashlib
44
import json
55
import os.path
6+
import pathlib
67
import re
78
import subprocess
89
import sys
910
import traceback
1011
from typing import List, Optional, Dict, Union
1112

13+
import yaml
1214
from pkg_resources import parse_version as version
1315

1416
from pg_anon.common.db_utils import get_fields_list
@@ -56,6 +58,7 @@ def f(*args, **kwargs):
5658
func(*args, **kwargs)
5759
except:
5860
print(exception_helper(show_traceback=True))
61+
raise
5962

6063
return f
6164

@@ -327,3 +330,14 @@ def get_folder_size(folder_path: str) -> int:
327330

328331
def simple_slugify(value: str):
329332
return re.sub(r'\W+', '-', value).strip('-').lower()
333+
334+
335+
def read_yaml(file_path: str) -> Dict:
336+
path = pathlib.Path(file_path)
337+
if path.suffix not in ('.yml', '.yaml'):
338+
raise ValueError("File must be .yml or .yaml")
339+
340+
with open(os.path.abspath(file_path), "r") as file:
341+
data = yaml.safe_load(file)
342+
343+
return data

pg_anon/context.py

Lines changed: 31 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5,18 +5,18 @@
55
from pg_anon.common.constants import ANON_UTILS_DB_SCHEMA_NAME, SERVER_SETTINGS
66
from pg_anon.common.dto import ConnectionParams
77
from pg_anon.common.enums import VerboseOptions, AnonMode, ScanMode
8-
from pg_anon.common.utils import (
9-
exception_handler,
10-
parse_comma_separated_list,
11-
)
8+
from pg_anon.common.utils import exception_handler, parse_comma_separated_list, read_yaml
129

1310

1411
class Context:
1512
@exception_handler
1613
def __init__(self, args):
1714
self.current_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
1815
self.args = args
16+
self.config = read_yaml(args.config) if args.config else None
1917
self.pg_version = None
18+
self.pg_dump = args.pg_dump
19+
self.pg_restore = args.pg_restore
2020
self.validate_limit = "LIMIT 100"
2121
self.meta_dictionary_obj: Dict = {}
2222
self.prepared_dictionary_obj: Dict = {}
@@ -220,9 +220,11 @@ def get_arg_parser():
220220
default=False,
221221
)
222222
parser.add_argument(
223-
"--db-host",
224-
type=str,
223+
"--config",
224+
help="Path to configuration file of pg_anon in YAML",
225+
default=None,
225226
)
227+
parser.add_argument("--db-host", type=str)
226228
parser.add_argument("--db-port", type=str, default="5432")
227229
parser.add_argument("--db-name", type=str, default="default")
228230
parser.add_argument("--db-user", type=str, default="default")
@@ -408,3 +410,26 @@ def get_arg_parser():
408410
help="Appends suffix for connection name. Just for comfortable automation",
409411
)
410412
return parser
413+
414+
def set_postgres_version(self, pg_version: str):
415+
self.pg_version = pg_version
416+
if not self.config:
417+
return
418+
419+
pg_major_version = int(pg_version.split('.')[0])
420+
421+
utils_versions = self.config.get('pg-utils-versions')
422+
pg_utils = utils_versions.get(pg_major_version)
423+
if not pg_utils:
424+
pg_utils = utils_versions.get('default')
425+
if not pg_utils:
426+
return
427+
428+
pg_dump = pg_utils.get('pg_dump')
429+
pg_restore = pg_utils.get('pg_restore')
430+
431+
if not pg_dump or not pg_restore:
432+
raise ValueError("Incorrect config: both pg_dump and pg_restore paths must be specified")
433+
434+
self.pg_dump = pg_dump
435+
self.pg_restore = pg_restore
