+++
title = "Snippet: Correctly capitalize names in PHP"
description = "How to correctly capitalize and normalize names in PHP with this simple snippet"

[taxonomies]
categories = ["snippet", "blog"]
tags = ["php", "snippet"]

[extra]
zenn_applause = true
comments = [
    {url = "https://www.reddit.com/r/laravel/comments/cefz8o/poc_snippet_to_correctly_capitalize_names_in_php/", name = "Reddit"},
    {url = "https://lobste.rs/s/klpksc/poc_snippet_correctly_capitalize_names", name = "Lobsters"},
]
+++

When building websites with any kind of user registration, it's fascinating what people enter in name fields: no casing, Random CASING, a dozen spaces    between     words, or nospacingatall. Seeing this always irritates me; I'd like things to be nice and consistent.

{{ fit_image(path="blog/2019-07-17_snippet-correctly-capitalize-names-in-php/banner.png", url="/blog/snippet-correctly-capitalize-names-in-php/banner.png") }}

It appears that correctly normalizing name capitalization is an unsolvable puzzle. There is no consistency in name casing, or in any kind of name formatting for that matter. See "Falsehoods programmers believe about names".

I always wonder how big social networks handle this.

Okay, so this isn't solvable. But at least I could try to make it better. I came across this wonderful PHP snippet for name capitalization a while back, but it had a few shortcomings. For instance, it didn't correctly case a person's last name on its own (needed when storing first and last names separately). I love challenges like this and decided to improve it. Here is my take on it:

```php
<?php

/**
 * Normalize the given (partial) name of a person.
 *
 * - re-capitalize, take last name inserts into account
 * - remove excess white spaces
 *
 * Snippet from: https://timvisee.com/blog/snippet-correctly-capitalize-names-in-php
 *
 * @param string $name The input name.
 * @return string The normalized name.
 */
function name_case($name) {
    // A list of properly cased parts
    $CASED = [
        "O'", "l'", "d'", 'St.', 'Mc', 'the', 'van', 'het', 'in', "'t", 'ten',
        'den', 'von', 'und', 'der', 'de', 'da', 'of', 'and', 'the', 'III', 'IV',
        'VI', 'VII', 'VIII', 'IX',
    ];

    // Trim whitespace sequences to one space, append space to properly chunk
    $name = preg_replace('/\s+/', ' ', $name) . ' ';

    // Break name up into parts split by name separators
    $parts = preg_split('/( |-|O\'|l\'|d\'|St\\.|Mc)/i', $name, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Chunk parts, use $CASED or uppercase first, remove unfinished chunks
    $parts = array_chunk($parts, 2);
    $parts = array_filter($parts, function($part) {
            return sizeof($part) == 2;
        });
    $parts = array_map(function($part) use($CASED) {
            // Extract to name and separator part
            list($name, $separator) = $part;

            // Use specified case for separator if set
            $cased = current(array_filter($CASED, function($i) use($separator) {
                return strcasecmp($i, $separator) == 0;
            }));
            $separator = $cased ? $cased : $separator;

            // Choose specified part case, or uppercase first as default
            $cased = current(array_filter($CASED, function($i) use($name) {
                return strcasecmp($i, $name) == 0;
            }));
            return [$cased ? $cased : ucfirst(strtolower($name)), $separator];
        }, $parts);
    $parts = array_map(function($part) {
            return implode($part);
        }, $parts);
    $name = implode($parts);

    // Trim and return normalized name
    return trim($name);
}
```
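
To see why the trailing space and the chunking matter, here is a small sketch of the intermediate values for one input. It reuses the pattern from the snippet; the helper name `split_name_parts` is made up for this illustration and is not part of the snippet itself.

```php
<?php

// Hypothetical helper, for illustration only: returns the [name, separator]
// pairs that the snippet works on internally.
function split_name_parts(string $name): array {
    // Same whitespace normalization and split pattern as in the snippet
    $name = preg_replace('/\s+/', ' ', $name) . ' ';
    $parts = preg_split('/( |-|O\'|l\'|d\'|St\\.|Mc)/i', $name, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Pair each name fragment with the separator that follows it and
    // drop the trailing fragment that has no separator
    $pairs = array_chunk($parts, 2);
    return array_values(array_filter($pairs, function ($pair) {
        return count($pair) == 2;
    }));
}

echo json_encode(split_name_parts('john   mclaren')), "\n";
// [["john"," "],["","mc"],["laren"," "]]
```

Note how `mc` ends up as a separator with an empty name in front of it. That is what lets the list of cased parts turn it into `Mc` while `laren` gets the default uppercase-first treatment.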

Here is a better version for use with Laravel. This variant is more concise and takes a functional approach using Laravel collections:

```php
<?php

/**
 * Normalize the given (partial) name of a person.
 *
 * - re-capitalize, take last name inserts into account
 * - remove excess white spaces
 *
 * Snippet from: https://timvisee.com/blog/snippet-correctly-capitalize-names-in-php
 *
 * @param string $name The input name.
 * @return string The normalized name.
 */
function name_case($name) {
    // A list of properly cased parts
    $CASED = collect([
        "O'", "l'", "d'", 'St.', 'Mc', 'the', 'van', 'het', 'in', "'t", 'ten',
        'den', 'von', 'und', 'der', 'de', 'da', 'of', 'and', 'the', 'III', 'IV',
        'VI', 'VII', 'VIII', 'IX',
    ]);

    // Trim whitespace sequences to one space, append space to properly chunk
    $name = preg_replace('/\s+/', ' ', $name) . ' ';

    // Break name up into parts split by name separators
    $parts = preg_split('/( |-|O\'|l\'|d\'|St\\.|Mc)/i', $name, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Chunk parts, use $CASED or uppercase first, remove unfinished chunks
    $name = collect($parts)
        ->chunk(2)
        ->filter(function($part) {
            return $part->count() == 2;
        })
        ->mapSpread(function($name, $separator = null) use($CASED) {
            // Use specified case for separator if set
            $cased = $CASED->first(function($i) use($separator) {
                return strcasecmp($i, $separator) == 0;
            });
            $separator = $cased ?? $separator;

            // Choose specified part case, or uppercase first as default
            $cased = $CASED->first(function($i) use($name) {
                return strcasecmp($i, $name) == 0;
            });
            return [$cased ?? ucfirst(strtolower($name)), $separator];
        })
        ->map(function($part) {
            return implode($part);
        })
        ->join('');

    // Trim and return normalized name
    return trim($name);
}
```

Of course, this function fulfills the truth table presented with the original snippet:

| Input | Becomes |
| --- | --- |
| michael o'carrol | Michael O'Carrol |
| lucas l'amour | Lucas l'Amour |
| george d'onofrio | George d'Onofrio |
| william stanley iii | William Stanley III |
| UNITED STATES OF AMERICA | United States of America |
| t. von lieres und wilkau | T. von Lieres und Wilkau |
| paul van der knaap | Paul van der Knaap |
| jean-luc picard | Jean-Luc Picard |
| JOHN MCLAREN | John McLaren |
| hENRIC vIII | Henric VIII |
| VAsco da GAma | Vasco da Gama |

It neatly handles some previously problematic cases as well. Brilliant!

| Input | Original snippet | This snippet |
| --- | --- | --- |
| `van der knaap` | `Van der Knaap` | `van der Knaap` |
| `l'amour` | `L'Amour` | `l'Amour` |
| `von lieres    UND wilkau` | `Von Lieres    und Wilkau` | `von Lieres und Wilkau` |
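
Tables like these translate naturally into a small regression check. The following is only a sketch: `failing_cases` is a hypothetical helper name, and it takes the function under test as a callable so it can be pointed at `name_case` from this post or at any other normalizer.

```php
<?php

/**
 * Hypothetical helper: run a normalizer over input => expected pairs.
 *
 * @param callable $normalize The function under test, e.g. 'name_case'.
 * @param array    $cases     Map of input name to expected output.
 * @return array The failing cases, keyed by input.
 */
function failing_cases(callable $normalize, array $cases): array {
    $failures = [];
    foreach ($cases as $input => $expected) {
        $actual = $normalize($input);
        if ($actual !== $expected) {
            $failures[$input] = ['expected' => $expected, 'actual' => $actual];
        }
    }
    return $failures;
}

// A few rows taken from the tables in this post
$cases = [
    'paul van der knaap' => 'Paul van der Knaap',
    'jean-luc picard' => 'Jean-Luc Picard',
    'JOHN MCLAREN' => 'John McLaren',
    'van der knaap' => 'van der Knaap',
];

// Only runs when name_case() from this post has been loaded
if (function_exists('name_case')) {
    print_r(failing_cases('name_case', $cases));
}
```

An empty array means every row passed. Adding a row for each newly reported name is a cheap way to keep the list of cased parts honest.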

Normalizing names with a function like this makes it impossible for some people to enter their name as formatted on their ID. Depending on the audience you serve, this is a risk you may be able to accept, but it will never be perfect. You could always use this to suggest formatting improvements to the user, allowing them to choose what's right.
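
The suggestion approach could look roughly like this. It is a minimal sketch, not part of the snippet: `suggest_name` is a hypothetical helper, and the stand-in normalizer is only there so the example runs on its own. In practice you would pass `'name_case'` instead.

```php
<?php

/**
 * Hypothetical helper: suggest a normalized spelling instead of forcing it.
 *
 * @param string   $input     The name exactly as the user typed it.
 * @param callable $normalize The normalizer to use, e.g. 'name_case'.
 * @return string|null The suggestion, or null when nothing would change.
 */
function suggest_name(string $input, callable $normalize): ?string {
    $suggestion = $normalize($input);
    return $suggestion === $input ? null : $suggestion;
}

// Stand-in normalizer (an assumption for this sketch), much cruder than name_case()
$standIn = function (string $name): string {
    return ucwords(strtolower(trim(preg_replace('/\s+/', ' ', $name))));
};

var_dump(suggest_name('JOHN   smith', $standIn)); // "John Smith"
var_dump(suggest_name('John Smith', $standIn));   // NULL
```

When the helper returns `null` there is nothing to ask. Otherwise the form can show something like "Did you mean John Smith?" and store whichever version the user picks.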


Using numbers to identify people would be a more rational choice, except when you're called Pi. /s

{{ fit_image(path="blog/2019-07-17_snippet-correctly-capitalize-names-in-php/beagle-boys.png") }}

Feel free to use and share.

Special thanks to Armand Niculescu, for the snippet this was inspired by!