string_spanish
shortfx.fxString.string_spanish
Spanish Language String Functions.
This module provides specialized functions for processing Spanish language strings, including phonetic reductions, NIF/CIF/NIE validation, and text normalization according to Spanish language rules.
Key Features:
- Spanish stop words removal
- Phonetic reduction algorithms
- NIF/CIF/NIE validation with checksums
- NIF format parsing and standardization
- Accent removal and text normalization
- Spanish-specific character handling

Functions
fix_spanish(input_string: str | None, additional_allowed_chars: str = '') -> str | None
Cleans an input string by replacing specific characters and filtering out any characters not explicitly allowed.
This function handles common Spanish character replacements and ensures that only alphanumeric characters, Spanish-specific characters, spaces, and any user-defined additional characters remain in the string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_string | str \| None | The string to be processed. If None, the function returns None. | required |
| additional_allowed_chars | str | A string containing any extra characters that should be considered valid. | '' |

Returns:
| Type | Description |
|---|---|
| str \| None | The cleaned string, or None if the input was None. |

Raises:
| Type | Description |
|---|---|
| TypeError | If input_string is not a string or None, or if additional_allowed_chars is not a string. |
Example

    >>> fix_spanish("Hello, 'world'§ with ¥ and other €!")
    "Hello, 'world'º with Ñ and other"
    >>> fix_spanish("Hello, 'world'§ with ¥ and other €!", "€")
    "Hello, 'world'º with Ñ and other €"

Source code in shortfx/fxString/string_spanish.py
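The replace-then-filter behaviour described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `fix_spanish_sketch`, `REPLACEMENTS`, and `BASE_ALLOWED` are hypothetical names, the two replacements are inferred only from the documented example, and the whitelist and the final `strip()` are assumptions chosen so that the example output is reproduced.

```python
import string

# Assumed replacement table, inferred only from the documented example.
REPLACEMENTS = {"§": "º", "¥": "Ñ"}

# Assumed whitelist: ASCII letters and digits, Spanish letters, the space,
# and the punctuation that survives in the documented example.
BASE_ALLOWED = set(string.ascii_letters + string.digits + "áéíóúüñÁÉÍÓÚÜÑºª ,'")


def fix_spanish_sketch(text, additional_allowed_chars=""):
    """Replace known characters, then drop everything not whitelisted."""
    if text is None:
        return None
    if not isinstance(text, str) or not isinstance(additional_allowed_chars, str):
        raise TypeError("both arguments must be strings")
    allowed = BASE_ALLOWED | set(additional_allowed_chars)
    replaced = "".join(REPLACEMENTS.get(ch, ch) for ch in text)
    filtered = "".join(ch for ch in replaced if ch in allowed)
    # Trimming is an assumption made to match the documented output.
    return filtered.strip()
```

Building the allowed set once per call keeps the membership test constant-time for every character, so the whole pass stays linear in the input length.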
is_valid_cif(cif_value: str) -> bool
Validates a Spanish CIF (Código de Identificación Fiscal) format and control character.
CIF validation is highly complex, depending on the first letter of the CIF (which denotes the type of legal entity) and involves different checksum calculations leading to either a digit or a letter as a control character.
This is a simplified placeholder. A complete and accurate CIF validation requires implementing precise rules for each CIF type (A, B, C, D, E, F, G, H, J, N, P, Q, R, S, W, V).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cif_value | str | The CIF string to validate (e.g., "A12345678"). | required |

Returns:
| Type | Description |
|---|---|
| bool | True if the CIF is valid, False otherwise. |

Raises:
| Type | Description |
|---|---|
| TypeError | If 'cif_value' is not a string. |
| ValueError | If 'cif_value' does not match the basic CIF regex format. |
Example of use

    >>> is_valid_cif("A12345678")  # This will likely return False with current placeholder logic
    False
    >>> # A real, correct CIF validation requires detailed implementation.

Source code in shortfx/fxString/string_spanish.py
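For readers who need more than the placeholder, the sketch below shows the commonly described CIF control computation. It is not the library's code: `is_valid_cif_sketch` and `cif_control_digit` are hypothetical names, the set of entity letters is taken from the list above, and the split between letter-only and digit-only entity types is an assumption based on the usual description of the rules. Unlike the documented function, the sketch returns False instead of raising ValueError on a format mismatch.

```python
import re

# Entity letters as listed in the section above; the control character is
# either a digit or one of the letters J, A..I.
CIF_PATTERN = re.compile(r"^([ABCDEFGHJNPQRSVW])(\d{7})([0-9A-J])$")
CONTROL_LETTERS = "JABCDEFGHI"


def cif_control_digit(digits: str) -> int:
    """Return the control value (0-9) for the seven central digits."""
    total = 0
    for position, char in enumerate(digits, start=1):
        value = int(char)
        if position % 2 == 1:
            # Odd positions: double the digit and add the digits of the result.
            doubled = value * 2
            total += doubled // 10 + doubled % 10
        else:
            total += value
    return (10 - total % 10) % 10


def is_valid_cif_sketch(cif: str) -> bool:
    match = CIF_PATTERN.match(cif.strip().upper())
    if match is None:
        return False
    entity, digits, control = match.groups()
    expected = cif_control_digit(digits)
    if entity in "PQRSNW":  # assumed: these types use a letter
        return control == CONTROL_LETTERS[expected]
    if entity in "ABEH":  # assumed: these types use a digit
        return control == str(expected)
    return control in (str(expected), CONTROL_LETTERS[expected])
```

With this sketch, "A12345678" is rejected because the seven digits 1234567 yield a control value of 4, not 8.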
is_valid_dni(dni_value: str) -> bool
Validates a Spanish DNI (Documento Nacional de Identidad) format and control letter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dni_value | str | The DNI string to validate (e.g., "12345678A"). | required |

Returns:
| Type | Description |
|---|---|
| bool | True if the DNI is valid, False otherwise. |

Raises:
| Type | Description |
|---|---|
| TypeError | If 'dni_value' is not a string. |
| ValueError | If 'dni_value' does not match the basic DNI regex format. |
Example of use

    >>> is_valid_dni("12345678Z")  # Example valid DNI (not real)
    True
    >>> is_valid_dni("00000000T")  # Invalid by specific rule
    False
    >>> is_valid_dni("12345678B")  # Invalid control letter
    False
    >>> is_valid_dni("1234567A")  # Incorrect length
    False

Source code in shortfx/fxString/string_spanish.py
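The control-letter check behind these examples is the standard modulo-23 lookup. The sketch below illustrates it under stated assumptions: `is_valid_dni_sketch` is a hypothetical name, it returns False on a format mismatch rather than raising, and it does not implement the special rule that rejects "00000000T" in the example above.

```python
import re

# Official lookup table: the letter is indexed by (number mod 23).
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
DNI_PATTERN = re.compile(r"^(\d{8})([A-Z])$")


def is_valid_dni_sketch(dni: str) -> bool:
    if not isinstance(dni, str):
        raise TypeError("dni must be a string")
    match = DNI_PATTERN.match(dni)
    if match is None:
        return False  # wrong length or shape
    number, letter = match.groups()
    return DNI_LETTERS[int(number) % 23] == letter
```

For "12345678Z", 12345678 mod 23 is 14 and the letter at index 14 of the table is Z, which is why the first example is accepted.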
is_valid_nie(nie_value: str) -> bool
Validates a Spanish NIE (Número de Identificación de Extranjero) format and control letter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| nie_value | str | The NIE string to validate (e.g., "X1234567B"). | required |

Returns:
| Type | Description |
|---|---|
| bool | True if the NIE is valid, False otherwise. |

Raises:
| Type | Description |
|---|---|
| TypeError | If 'nie_value' is not a string. |
| ValueError | If 'nie_value' does not match the basic NIE regex format. |
Example of use

    >>> is_valid_nie("X0000000T")  # Example valid NIE (not real)
    True
    >>> is_valid_nie("Y1234567X")  # Example valid NIE (not real)
    True
    >>> is_valid_nie("Z7654321H")  # Example valid NIE (not real)
    True
    >>> is_valid_nie("A1234567B")  # Invalid starting letter
    False
    >>> is_valid_nie("X1234567C")  # Invalid control letter
    False

Source code in shortfx/fxString/string_spanish.py
nif_letter(p_dni: str) -> str
Calculates and appends the control letter to a Spanish DNI/NIE numeric part.
This function supports both DNI (8 digits) and NIE (starts with X, Y, or Z followed by 7 digits).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| p_dni | str | The numeric part of the DNI or the full NIE (e.g., '12345678', 'X1234567'). Spaces and dots will be ignored. | required |

Returns:
| Type | Description |
|---|---|
| str | The full DNI/NIE with the calculated control letter appended. Returns the original input if the input is invalid (e.g., too short, contains invalid characters, or cannot be converted to a valid number for calculation). |
Source code in shortfx/fxString/string_spanish.py
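No example is given for this function, so here is a minimal sketch of the documented behaviour. `nif_letter_sketch` is a hypothetical name and the code is an illustration, not the library's source. It relies on the standard rule that an NIE prefix X, Y, or Z is replaced by 0, 1, or 2 before the modulo-23 lookup.

```python
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}


def nif_letter_sketch(value: str) -> str:
    """Append the control letter, or return the input unchanged if invalid."""
    # Spaces and dots are ignored, as described for p_dni.
    cleaned = value.replace(" ", "").replace(".", "").upper()
    # Swap an NIE prefix for its digit; a DNI keeps its first digit.
    numeric = NIE_PREFIX.get(cleaned[:1], cleaned[:1]) + cleaned[1:]
    if len(cleaned) != 8 or not (numeric.isascii() and numeric.isdigit()):
        return value
    return cleaned + DNI_LETTERS[int(numeric) % 23]


print(nif_letter_sketch("12345678"))   # 12345678Z
print(nif_letter_sketch("X1234567"))   # X1234567L
```

Both shapes have eight characters before the letter, so a single length check covers the DNI and NIE cases.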
nif_padding(p_nif: Optional[str]) -> Optional[str]
Attempts to format an incomplete Spanish identification number by padding with zeros.
Description
Takes a potentially incomplete NIF/NIE/CIF and tries to complete it by adding leading zeros to the numeric portion. This function does not validate the control digit but ensures the length and numeric padding are correct for subsequent validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| p_nif | str | The identification number to format. | required |

Returns:
| Type | Description |
|---|---|
| Optional[str] | Padded and formatted identification number if it can be processed, None if the input is invalid or cannot be padded. |
Example

    >>> nif_padding("123456Z")
    "00123456Z"
    >>> nif_padding("X1234L")
    "X0001234L"
    >>> nif_padding("123Z")
    "00000123Z"
    >>> nif_padding("123456789")
    "123456789"
    >>> nif_padding("invalid")
    "INVALID"

Cost: O(L) where L is the length of the input string.
Source code in shortfx/fxString/string_spanish.py
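The padding rule can be sketched with a single regular expression and `str.zfill`. This is an illustration under stated assumptions, not the library's code: `nif_padding_sketch` and `SHAPE` are hypothetical names, only DNI and NIE shapes are handled (CIF shapes are left out), and inputs that do not fit either shape are returned upper-cased, which mirrors the last two documented examples.

```python
import re

# Optional NIE prefix, a run of digits, and a trailing control letter.
SHAPE = re.compile(r"^([XYZ]?)(\d+)([A-Z])$")


def nif_padding_sketch(nif):
    if nif is None:
        return None
    cleaned = nif.strip().upper()
    match = SHAPE.match(cleaned)
    if match is None:
        return cleaned  # nothing recognisable to pad
    prefix, digits, letter = match.groups()
    width = 7 if prefix else 8  # an NIE has 7 digits, a DNI has 8
    if len(digits) > width:
        return cleaned
    # Left-pad the numeric part with zeros up to the expected width.
    return prefix + digits.zfill(width) + letter
```

The control letter is carried through untouched, so the result still has to be passed to a validator.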
nif_parse(nif: Optional[str]) -> Optional[str]
Validates if a Spanish identification number has the correct format.
Description
Checks if the provided string matches any of the valid NIF/NIE/CIF patterns and validates the control digit. If the initial validation fails, it attempts to pad the NIF using nif_padding and re-validates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| nif | str | The identification number to validate. | required |

Returns:
| Type | Description |
|---|---|
| Optional[str] | Validated identification number if correct, None if invalid. |
Example

    >>> nif_parse("12345678Z")
    "12345678Z"
    >>> nif_parse("01234567L")  # Example with leading zero (stays "01234567L")
    "01234567L"
    >>> nif_parse("1234567L")  # Example needing padding (becomes "01234567L")
    "01234567L"
    >>> nif_parse("X1234567L")
    "X1234567L"
    >>> nif_parse("invalid")
    None

Cost: O(L) where L is the length of the NIF string due to regex matching and string manipulation.
Source code in shortfx/fxString/string_spanish.py
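The validate, pad, and re-validate flow described above can be sketched as follows. The code is a hedged illustration rather than the library's source: `nif_parse_sketch`, `_is_valid`, and `_pad` are hypothetical names, and only DNI and NIE shapes are covered, so CIF inputs always yield None here.

```python
import re

LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}
FULL = re.compile(r"^([XYZ]|\d)(\d{7})([A-Z])$")   # complete DNI or NIE
SHORT = re.compile(r"^([XYZ]?)(\d+)([A-Z])$")      # possibly missing zeros


def _is_valid(candidate: str) -> bool:
    match = FULL.match(candidate)
    if match is None:
        return False
    head, digits, letter = match.groups()
    number = int(NIE_PREFIX.get(head, head) + digits)
    return LETTERS[number % 23] == letter


def _pad(candidate: str) -> str:
    match = SHORT.match(candidate)
    if match is None:
        return candidate
    prefix, digits, letter = match.groups()
    return prefix + digits.zfill(7 if prefix else 8) + letter


def nif_parse_sketch(nif):
    if nif is None:
        return None
    candidate = nif.strip().upper()
    if _is_valid(candidate):
        return candidate
    # Second chance: pad with zeros, then validate again.
    padded = _pad(candidate)
    return padded if _is_valid(padded) else None
```

Padding happens only after the first validation fails, so well-formed inputs are returned without modification.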
reduce_spanish_letters(input_string: str, strength: int) -> str | None
Reduces phonetic and orthographic variations in Spanish strings based on a specified strength level for similarity matching or normalization.
The function applies a series of character replacements to simplify the string, simulating common phonetic mergers or orthographic conventions in Spanish, and attempts to preserve the original casing style.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_string | str \| None | The string to parse. | required |
| strength | int | The level of phonetic/orthographic reduction to apply. 0: Only basic normalization (remove accents, to uppercase). 1: Basic phonetic reductions (e.g., silent H, RR->R, B/V merger). 2: Stronger phonetic reductions (e.g., C/Z to S, X to S). 3: Most aggressive reductions (e.g., Ñ->N, W->V, G->J). | required |

Returns:
| Type | Description |
|---|---|
| str \| None | The processed string with reduced letters and restored casing, or None if the input was None. |

Raises:
| Type | Description |
|---|---|
| TypeError | If 'input_string' is not a string. |
| ValueError | If 'strength' is not within the valid range [0, 3]. |
Example of use

    >>> reduce_spanish_letters("Coche", 0)
    'COCHE'
    >>> reduce_spanish_letters("Guerrero", 1)
    'GERERO'
    >>> reduce_spanish_letters("Excelente", 2)  # X->S, C->S
    'ESCELENTE'
    >>> reduce_spanish_letters("Ñandú", 3)
    'NANDU'
    >>> reduce_spanish_letters("México", 2)  # assuming _remove_accents handled 'é'
    'MESICO'
    >>> reduce_spanish_letters("Bogotá", 1)
    'VOGOTA'
    >>> reduce_spanish_letters("Gigante", 3)  # G->J (aggressive)
    'JIJANTE'
    >>> reduce_spanish_letters(None, 1)
    None
    >>> reduce_spanish_letters("Árbol", 0)  # Assuming _remove_accents handled accents
    'ARBOL'

Source code in shortfx/fxString/string_spanish.py
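The cumulative strength levels can be pictured as layered replacement tables applied after accent removal. The sketch below is an assumption-laden illustration, not the library's code: `reduce_letters_sketch`, `strip_accents`, and `LEVELS` are hypothetical names, the replacement pairs are only the ones named in the strength description, casing is not restored, and the sketch does not reproduce every documented output (for example, it leaves the silent U of "Guerrero" in place).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks but keep Ñ, which is a letter of its own."""
    out = []
    for ch in text:
        if ch in "ñÑ":
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


# Assumed replacement pairs per level; each level also applies the lower ones.
LEVELS = {
    1: [("RR", "R"), ("H", ""), ("B", "V")],
    2: [("Z", "S"), ("X", "S"), ("CE", "SE"), ("CI", "SI")],
    3: [("Ñ", "N"), ("W", "V"), ("G", "J")],
}


def reduce_letters_sketch(text: str, strength: int) -> str:
    if not 0 <= strength <= 3:
        raise ValueError("strength must be between 0 and 3")
    result = strip_accents(text).upper()
    for level in range(1, strength + 1):
        for old, new in LEVELS[level]:
            result = result.replace(old, new)
    return result
```

Keeping Ñ out of the decomposition step matters because NFD would otherwise split it into N plus a combining tilde, turning it into N at every strength level instead of only at level 3.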
remove_spanish_stop_words(input_string: str) -> str | None
Removes common Spanish articles, conjunctions, and prepositions (stop words) from a text string.
This function processes the input string by:
1. Validating the input type.
2. Matching stop words case-insensitively (via re.IGNORECASE); the case of the remaining words is preserved.
3. Using a pre-compiled regular expression to efficiently find and replace all defined stop words with a single space.
4. Normalizing multiple spaces to a single space and stripping leading/trailing whitespace to ensure a clean output.

Note: The list of stop words is configurable within the _SPANISH_STOP_WORDS constant. This function focuses on a basic set and does not include all possible prepositions or conjunctions, nor does it handle complex grammatical nuances or Internationalized Domain Names (IDNs) if characters beyond basic Latin alphabet are present in the text and need special handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_string | str | The text string from which to remove stop words. | required |

Returns:
| Type | Description |
|---|---|
| str \| None | The processed string with stop words removed and normalized spaces, or None if the input was None. |

Raises:
| Type | Description |
|---|---|
| TypeError | If the input 'input_string' is not a string. |
Example of use

    >>> remove_spanish_stop_words("El coche de la casa es grande y azul")
    'coche casa es grande azul'
    >>> remove_spanish_stop_words("Un perro y un gato")
    'perro gato'
    >>> remove_spanish_stop_words("La historia de El Cid")
    'historia Cid'
    >>> remove_spanish_stop_words("Con la mano en el corazon")  # 'Con' is not in list
    'Con mano corazon'
    >>> remove_spanish_stop_words(None)
    None
    >>> remove_spanish_stop_words("A, e, o, u.")  # This highlights that it removes the word boundary matches, not general characters.
    ','
    >>> remove_spanish_stop_words(123)
    Traceback (most recent call last):
        ...
    TypeError: Input 'input_string' must be a string.

Cost: O(n), where n is the length of the input string (regex operations)
Source code in shortfx/fxString/string_spanish.py
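The regex-based approach described above can be sketched as follows. The word list is an assumption made for illustration (the actual contents of _SPANISH_STOP_WORDS are not shown on this page), and `remove_stop_words_sketch` and `STOP_WORDS` are hypothetical names.

```python
import re

# Assumed stop-word list; the library's own constant may differ.
STOP_WORDS = ("el", "la", "los", "las", "un", "una", "unos", "unas",
              "de", "del", "y", "e", "o", "u", "a", "en")

# One alternation wrapped in word boundaries, compiled once at import time.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b",
    re.IGNORECASE,
)


def remove_stop_words_sketch(text):
    if text is None:
        return None
    if not isinstance(text, str):
        raise TypeError("Input 'input_string' must be a string.")
    without_stop_words = _PATTERN.sub(" ", text)
    # Collapse the gaps left behind and trim both ends.
    return re.sub(r"\s+", " ", without_stop_words).strip()
```

The `\b` anchors are what stop "de" from being removed inside a word such as "grande"; only whole-word matches are replaced.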
validate_spanish_nif(nif_type: str, nif_value: str) -> bool
Validates a Spanish NIF (DNI, NIE, or CIF) based on its type and value.
This is a dispatcher function that calls specific validation functions for DNI, NIE, or CIF based on the provided 'nif_type'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| nif_type | str | The type of NIF to validate ('DNI', 'NIE', or 'CIF'). | required |
| nif_value | str | The NIF string to validate. | required |

Returns:
| Type | Description |
|---|---|
| bool | True if the NIF is valid for its specified type, False otherwise. |

Raises:
| Type | Description |
|---|---|
| TypeError | If 'nif_type' or 'nif_value' are not strings. |
| ValueError | If 'nif_type' is not one of 'DNI', 'NIE', or 'CIF'. |
Example of use

    >>> validate_spanish_nif("DNI", "12345678Z")  # Valid DNI (example)
    True
    >>> validate_spanish_nif("NIE", "X0000000T")  # Valid NIE (example)
    True
    >>> validate_spanish_nif("CIF", "A12345678")  # Will likely be False due to placeholder CIF logic
    False
    >>> validate_spanish_nif("DNI", "12345678X")  # Invalid DNI (wrong letter)
    False
    >>> validate_spanish_nif("DNI", None)
    Traceback (most recent call last):
        ...
    TypeError: NIF value must be a string.
    >>> validate_spanish_nif("INVALID", "123")
    Traceback (most recent call last):
        ...
    ValueError: Invalid NIF type. Expected 'DNI', 'NIE', or 'CIF'.

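A dispatcher of this kind is often written as a dictionary lookup. The sketch below illustrates that pattern and is not the library's code: `validate_nif_sketch`, `VALIDATORS`, `_check_dni`, and `_check_nie` are hypothetical names, and the CIF entry is a stand-in that always returns False, mirroring the placeholder behaviour described for is_valid_cif.

```python
import re

LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}


def _check_dni(value: str) -> bool:
    if re.fullmatch(r"\d{8}[A-Z]", value) is None:
        return False
    return LETTERS[int(value[:8]) % 23] == value[8]


def _check_nie(value: str) -> bool:
    if re.fullmatch(r"[XYZ]\d{7}[A-Z]", value) is None:
        return False
    number = int(NIE_PREFIX[value[0]] + value[1:8])
    return LETTERS[number % 23] == value[8]


# Map each supported type to its validator; the CIF entry is a stand-in.
VALIDATORS = {
    "DNI": _check_dni,
    "NIE": _check_nie,
    "CIF": lambda value: False,
}


def validate_nif_sketch(nif_type: str, nif_value: str) -> bool:
    if not isinstance(nif_type, str) or not isinstance(nif_value, str):
        raise TypeError("NIF type and value must be strings.")
    try:
        validator = VALIDATORS[nif_type]
    except KeyError:
        raise ValueError("Invalid NIF type. Expected 'DNI', 'NIE', or 'CIF'.") from None
    return validator(nif_value)
```

Adding a new document type then only requires a new entry in the mapping, with no change to the dispatcher itself.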