<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: strings</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/strings.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-05-08T14:23:13+00:00</updated><author><name>Simon Willison</name></author><entry><title>Tagged Pointer Strings (2015)</title><link href="https://simonwillison.net/2024/May/8/tagged-pointer-strings-2015/#atom-tag" rel="alternate"/><published>2024-05-08T14:23:13+00:00</published><updated>2024-05-08T14:23:13+00:00</updated><id>https://simonwillison.net/2024/May/8/tagged-pointer-strings-2015/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html"&gt;Tagged Pointer Strings (2015)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Mike Ash digs into a fascinating implementation detail of macOS.&lt;/p&gt;
&lt;p&gt;Tagged pointers provide a way to embed a literal value in a pointer reference. Objective-C pointers on macOS are 64-bit, providing plenty of space for representing entire values. If the least significant bit is 1 (the pointer is a 64-bit odd number) then the pointer is "tagged" and represents a value, not a memory reference.&lt;/p&gt;
&lt;p&gt;Here's where things get really clever. Storing an integer value up to 60 bits is easy. But what about strings?&lt;/p&gt;
&lt;p&gt;There's enough space for three UTF-16 characters, with 12 bits left over. But if the string fits ASCII we can store 7 characters.&lt;/p&gt;
&lt;p&gt;Drop everything except &lt;code&gt;a-z A-Z.0-9&lt;/code&gt; and we need 6 bits per character, allowing 10 characters to fit in the pointer.&lt;/p&gt;
&lt;p&gt;Apple take this a step further: if the string contains just &lt;code&gt;eilotrm.apdnsIc ufkMShjTRxgC4013&lt;/code&gt; ("b" is apparently uncommon enough to be ignored here) each character needs only 5 bits, so they can store 11 characters in those 60 bits!&lt;/p&gt;
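&lt;p&gt;To make the 5-bits-per-character idea concrete, here is a minimal Python sketch. The &lt;code&gt;pack&lt;/code&gt; and &lt;code&gt;unpack&lt;/code&gt; helpers and the bit layout (payload, then a 4-bit length, then the tag bit) are assumptions made for illustration, not Apple's actual layout.&lt;/p&gt;

```python
# Toy model of a tagged-pointer string. The layout here is hypothetical:
# 55 bits of payload, 4 bits of length, and a tag bit of 1 at the bottom.
ALPHABET = "eilotrm.apdnsIc ufkMShjTRxgC4013"  # 32 symbols, so 5 bits each
MAX_CHARS = 11  # 11 * 5 = 55 bits of payload


def pack(text):
    """Return an odd integer encoding text, or None if it does not fit."""
    if len(text) not in range(MAX_CHARS + 1):
        return None
    if any(ch not in ALPHABET for ch in text):
        return None
    payload = 0
    for ch in text:
        # Multiplying by 32 shifts the payload left by 5 bits.
        payload = payload * 32 + ALPHABET.index(ch)
    # Append the 4-bit length, then the tag bit that makes the value odd.
    return (payload * 16 + len(text)) * 2 + 1


def unpack(word):
    """Recover the string from an integer produced by pack."""
    if word % 2 == 0:
        raise ValueError("not a tagged value")
    length = (word // 2) % 16
    payload = word // 32
    chars = []
    for _ in range(length):
        payload, index = divmod(payload, 32)
        chars.append(ALPHABET[index])
    return "".join(reversed(chars))


print(unpack(pack("hello")))  # hello
print(pack("bob"))            # None, because "b" is not in the alphabet
```

&lt;p&gt;Storing the length separately is what lets strings that start with "e" (index 0) round-trip correctly in this sketch.&lt;/p&gt;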

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/5417dx/storing_data_pointers#c_noslq0"&gt;Lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/objective-c"&gt;objective-c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/strings"&gt;strings&lt;/a&gt;&lt;/p&gt;



</summary><category term="c"/><category term="objective-c"/><category term="strings"/></entry><entry><title>datasette-jellyfish</title><link href="https://simonwillison.net/2019/Mar/9/datasette-jellyfish/#atom-tag" rel="alternate"/><published>2019-03-09T18:29:13+00:00</published><updated>2019-03-09T18:29:13+00:00</updated><id>https://simonwillison.net/2019/Mar/9/datasette-jellyfish/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-jellyfish"&gt;datasette-jellyfish&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I learned about a handy Python library called Jellyfish which implements approximate and phonetic matching of strings: Soundex, Metaphone, Porter stemming, Levenshtein distance and more. I’ve built a simple Datasette plugin which wraps the library and makes each of those algorithms available as a SQL function.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/strings"&gt;strings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="strings"/><category term="datasette"/></entry><entry><title>String length - Rosetta Code</title><link href="https://simonwillison.net/2019/Feb/22/string-length-rosetta-code/#atom-tag" rel="alternate"/><published>2019-02-22T15:27:31+00:00</published><updated>2019-02-22T15:27:31+00:00</updated><id>https://simonwillison.net/2019/Feb/22/string-length-rosetta-code/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.rosettacode.org/wiki/String_length"&gt;String length - Rosetta Code&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Calculating the length of a string is surprisingly difficult once Unicode is involved. Here's a fascinating illustration of how that problem can be attacked in dozens of different programming languages. From that page: the string &lt;code&gt;"J̲o̲s̲é̲"&lt;/code&gt; (&lt;code&gt;"J\x{332}o\x{332}s\x{332}e\x{301}\x{332}"&lt;/code&gt;) has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8.
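&lt;p&gt;Those three numbers can be checked with a short Python sketch. The grapheme count here is only an approximation that skips combining marks using the standard library; a full implementation would follow the Unicode text segmentation rules.&lt;/p&gt;

```python
import unicodedata

# The example string, written with explicit escapes for the combining marks.
text = "J\u0332o\u0332s\u0332e\u0301\u0332"

code_points = len(text)                  # Python strings count code points
utf8_bytes = len(text.encode("utf-8"))   # size once encoded as UTF-8

# Rough grapheme estimate (an assumption that works for this string):
# count only the code points that are not combining marks.
graphemes = sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(graphemes, code_points, utf8_bytes)  # 4 9 14
```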

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/jeffsonstein/status/1098927304124841984"&gt;@jeffsonstein&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/programming-languages"&gt;programming-languages&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/strings"&gt;strings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/unicode"&gt;unicode&lt;/a&gt;&lt;/p&gt;



</summary><category term="programming-languages"/><category term="strings"/><category term="unicode"/></entry><entry><title>String types in Python 3</title><link href="https://simonwillison.net/2007/Oct/9/strings/#atom-tag" rel="alternate"/><published>2007-10-09T02:08:13+00:00</published><updated>2007-10-09T02:08:13+00:00</updated><id>https://simonwillison.net/2007/Oct/9/strings/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://pyside.blogspot.com/2007/10/string-types-in-python-3.html"&gt;String types in Python 3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
bytes are now immutable (just like the bytestrings they are replacing) and a new mutable buffer type has been introduced.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/buffers"&gt;buffers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bytes"&gt;bytes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bytestrings"&gt;bytestrings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python3"&gt;python3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/strings"&gt;strings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/unicode"&gt;unicode&lt;/a&gt;&lt;/p&gt;



</summary><category term="buffers"/><category term="bytes"/><category term="bytestrings"/><category term="python"/><category term="python3"/><category term="strings"/><category term="unicode"/></entry><entry><title>How should JSON strings be represented in Erlang?</title><link href="https://simonwillison.net/2007/Sep/14/lshift/#atom-tag" rel="alternate"/><published>2007-09-14T08:17:05+00:00</published><updated>2007-09-14T08:17:05+00:00</updated><id>https://simonwillison.net/2007/Sep/14/lshift/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.lshift.net/blog/2007/09/13/how-should-json-strings-be-represented-in-erlang"&gt;How should JSON strings be represented in Erlang?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Erlang’s poor support for strings makes this a surprisingly tricky question.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/erlang"&gt;erlang&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/strings"&gt;strings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tonygarnockjones"&gt;tonygarnockjones&lt;/a&gt;&lt;/p&gt;



</summary><category term="erlang"/><category term="json"/><category term="strings"/><category term="tonygarnockjones"/></entry></feed>