Four Things to Remember about java.lang.String

Hi,
Usually this (and my old) blog is more Hibernate, Spring and other JEE/Server side technologies. Still on my day to day work (and mainly when I lecture) I see a lot of misunderstanding/misconception about basic Java tasks. I decided to drop some notes about such subjects from time to time - and today four issues about java.lang.String:
  • Strings are immutable - a very basic one, just a reminder 
  • Strings Equality, Normalizer, and Collator - an overview of strings equality in a Locale sensitive  environments (including accented forms)
  • The substring() method might cause memory leaks - in some scenarios the usage of String.substring() method can cause memory leaks
  • String.intern() - how to improve strings equality performance and memory footprint using String.intern()

A String Object is Immutable

This is very basic, but still: everybody knows to say that a String object is immutable - but not everybody fully understands the meaning of it. Sometime students draw the next code fragment as a proof that a String object
can be modified:

String str = "a";
System.out.println (str); // Prints out: a

str = str + "b"; // (Line #3)
// And here comes the "proof":
System.out.println (str); // Prints out: ab ; therefore str was changed

And indeed when we first print the value of str we get "a" and on the second print we get "ab" but...this is not the same object. Line #3 in the sample above actually instantiates a new String object and updates the str reference to point to the new object - the original object wasn't changed. Here is an easy way to check it:

String str = "a";
String str1 = str;
System.out.println (str); // Prints "a"
System.out.println (str == str1); // Prints TRUE (references are equal)

str = str + "b"; 
System.out.println (str); // Prints "ab"
System.out.println (str == str1); // Prints FALSE (references are not equal)

 

Strings Equality, Normalizer, and Collator


ImportantNotice: I'm just giving a brief explanation of the problem and an introduction to some possible remediation. For more details I recommend on:

As with any Java object we should check for strings equality using the equals() method, this method performs binary comparison between two string objects. Another option is to use the equalsIgnoreCase() method which ignores case differences between the two string objects.

String str1 = "abc";
String str2 = "aBc";
System.out.println (str1.equals(str2)); // Prints FALSE
System.out.println (str1.equalsIgnoreCase(str2)); // Prints TRUE
Normalizer
So far strings comparison seems easy but some languages and Locales have more sophisticated rules for characters equality. For start we have the technical issue that by the Unicode standard some symbols can be represented in multiple forms. One sample is the the a-acute (á) character, it can be encoded either as "\u00C1" (one character) or as"\u0041\u0301" (combining sequence). To the user both are a-acute but not for the equals() method:

// I am setting the Locale since I use German characters, it doesn't have 
// any affect on the equals() and equalsIgnoreCase() methods
Locale.setDefault(Locale.GERMANY);
String str = "\u00C1";
System.out.println("Encoded as one character length is " + str.length()); // Length is 1
String str2 = "\u0041\u0301";
System.out.println("Encoded as combining sequence length is " + str2.length()); // Length is 2

System.out.println(str.equals(str2)); // Prints FALSE
System.out.println(str.equalsIgnoreCase(str2)); // Prints FALSE (this is not a case differance!)

In the user perspective str equals str2 but technically they are not the same (even the string lengths are differ). If we want to compare strings which can include different binary representations of our symbols we should compare their normalized form as defined by the Unicode standard. If strings are transformed into their normalized forms then canonical-equivalent ones will have precisely the same binary representation. Since JSE 1.6 we can normalize our strings using the java.text.Normalizer class. Adding the following to the sample above will solve the problem:

str = Normalizer.normalize(str, Form.NFC);
str2 = Normalizer.normalize(str2, Form.NFC);
System.out.println(str.equals(str2)); // This time the output is TRUE
 
Collator
Normalization is good for binary (Locale insensitive) comparison but sometimes we need to perform Locale sensitive comparisons. A Locale sensitive comparison should, on top of the different representations of the same symbol, consider the meaning of the different symbols. It should be able to consider different accents and to be case insensitive - all of this can be done using the java.text.Collator class. Here is a sample use of the Collator:

Collator collator = Collator.getInstance(Locale.FRENCH);
collator.setStrength(Collator.SECONDARY);
collator.setDecomposition(Collator.FULL_DECOMPOSITION);
collator.compare("ä", "À");

The collator properties are: Locale, strength and decomposition.

Strength
The strength level configures which differences should be considered significant in comparisons. There are four strength levels: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL (each level includes the preceding levels in it). The translation of strength levels into comparison rules is Locale dependent
(this is why we need the Locale property) usually it should follow these rules:
  • PRIMARY - A primary different is usually in the base letters ("a" vs "b").
  • SECONDARY - a secondary different is usually accented forms of the same base letter ("a" vs "ä")
  • TERTIARY - is usually for case differences ("a" vs "A")
  • IDENTICAL - A common example is for control characters ("\u0001" vs "\u0002") to be considered equal at the PRIMARY, SECONDARY, and TERTIARY levels but different at the IDENTICAL level
Comparing two strings using a Collator which was set to the primary strength will ignore any accent (secondary strength), case (tertiary strength) and control characters (identical strength) differences but will not ignore any base letters difference. On the other if the Collator strength was set to secondary it will not ignore accented forms and any base letters differences (since the secondary level includes the primary level). However, it would still ignore any case and control characters differences (since tertiary and identical strength levels are weaker than secondary). The easiest way to understand it is with an example:

// The strings to sort. It is important to notice the use of a, ä, and Ä. Also
// notice the mixing of lower and upper case letters: a, Ä, a, ä, Ä. We will
// see soon that with different strength levels sometimes the case is
// ignored and sometimes it is considered
String[] baseStrings = {"abc", "Äbc", "abc", "bbc", "äbc", "Äbc"};

// Simple sort using the natural ordering (no Collator)
String[] strings = baseStrings.clone();
Arrays.sort(strings);
System.out.println("Natural ordering: " + Arrays.deepToString(strings));

Collator collator = Collator.getInstance(Locale.GERMAN);

// Using the PRIMARY strength - case and accent are ignored
collator.setStrength(Collator.PRIMARY);
strings=baseStrings.clone();
Arrays.sort(strings, collator);
System.out.println("Primary collator (case and accent are ignored): " +
 Arrays.deepToString(strings));

// Using the SECONDARY strength - only case is ignored
collator.setStrength(Collator.SECONDARY)
collator.setStrength(Collator.SECONDARY);
strings=baseStrings.clone();
Arrays.sort(strings, collator);
System.out.println("Secondary collator (case is ignored): " +
 Arrays.deepToString(strings));

// Using the TERTIARY strength - only control characters are ignored
collator.setStrength(Collator.TERTIARY);
strings=baseStrings.clone();
Arrays.sort(strings, collator);
System.out.println("Tertiary collator (control characters are ignored): " +
Arrays.deepToString(strings));

And the output will be (I manually added the line numbers):

(1) Natural ordering: [abc, abc, bbc, Äbc, Äbc, äbc]
(2) Primary collator (case and accent are ignored): [abc, Äbc, abc, äbc, Äbc, bbc]
(3) Secondary collator (case is ignored): [abc, abc, Äbc, äbc, Äbc, bbc]
(4) Tertiary collator (control characters are ignored): [abc, abc, äbc, Äbc, Äbc, bbc]

In line #2 (illustrating a primary Collator) we can see that the case and accent are ignored but the difference between 'a' and 'b' was taken into account (abc, Äbc, abc, äbc, Äbc, bbc). Line #3 (illustrating a secondary Collator)  shows that the case is ignored but not the accent - strings starting with 'a' are preceding strings starting with 'ä' or 'Ä' but the difference between 'Ä' and 'ä' is ignored  (Äbc, äbc, Äbx). The tertiary Collator (line #4) ignores only control characters (which I don't have in here) so the array is actually sorted while considering all of the differences: letters, accented forms, and case. Another thing to notice in the example above is that the Collator also implements the Comparator interface so it can be used as an argument to the sort method.

Decomposition
The decomposition property defines whether the characters in the compared strings should be normalized (since 1.6 using the Normalizer discussed earlier in this post) and which normalized form to use. The default Collator implementation in the JDK is java.text.RuleBasedCollator which builds its rules in such a way that it can, in most cases, compare different forms of accented characters without normalization (which is more efficient). However there are combinations in which the Collator will fail to compare string unless the strings are normalized. These scenarios are discussed in the java.text.RuleBasedCollator javadocs.

The substring() Method Might Cause Memory Leak

Take a look on the following 'program':


// Assuming our input was 1000 strings each in the length of 1K. 
// I simulate the input by building such strings
String[] strings = new String[1000];
char[] chars = new char[1024];

Arrays.fill(chars, 'a');
for (int t=0; t<strings.length; t++) {
 strings[t] = new String(chars);
}

// Actually for our processing we need only the first character in
// each string
String[] shortStrings = new String[strings.length];
for (int t=0; t<strings.length; t++) {
 shortStrings[t] = strings[t].substring(0,1);
}

// Release the strings (we don't need the initial set of 
// long strings).
strings = null;


while (true) {
 // Do something forever using the shortStrings
 doWork (shortStrings);
}


The reasonable programmer would think that he did his best to process the large input, extract from it only the needed elements and release the references to the input so memory can be reclaimed. In practice almost no memory will be reclaimed in here! Here is the substring() method code and the constructor it uses to build the returned instance (taken from the String class):



public String substring(int beginIndex, int endIndex) {
 if (beginIndex < 0) {
 throw new StringIndexOutOfBoundsException(beginIndex);
 }
 if (endIndex > count) {
 throw new StringIndexOutOfBoundsException(endIndex);
 }
 if (beginIndex > endIndex) {
 throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
 }
 return ((beginIndex == 0) && (endIndex == count)) ? this :
 new String(offset + beginIndex, endIndex - beginIndex, value);
 }

// And the constructor called in the blue line above looks like that
String(int offset, int count, char value[]) {
 this.value = value;
 this.offset = offset;
 this.count = count;
}


As we can see the newly created string shares the char[] with the original string, and as such even if we
remove any reference to the original string its internal characters array (which is the bulk of the data in the String class!) is referenced by the new string preventing it from being reclaimed by the garbage collector. The reason for building the string class that way is since strings are immutable we can share their underlying characters array and gain two important advantages:

  1. Performance - since we don't copy the array
  2. In some cases - obviously not ours - we can actually reduce the memory footprint. Think of the case which needs to use both the original and the substring.
By the way: the same behavior exactly happens with String.subSequence() and StringBuffer/StringBuilder's substring(0 and subSequence() methods.


There is a solution for that, in the case that the original string is temporary we can replace originalString.substring(start, end) with new String(originalString.substring(start, end))

Intern - Improving Memory utilization and Performance for Strings Equality


First a short opening about string instances, here is a short code fragment and its output:

// Sample strings - s1, s2, and s5 are all "aaa"
String s1 = "aaa";
String s2 = "a" + "aa";
String s3 = "a";
String s4 = "aa";
String s5 = s3 + s4;

System.out.println(String.format("s1==s2? %s" , s1==s2));
System.out.println(String.format("s1.equals(s2)? %s" , s1.equals(s2)));
System.out.println(String.format("s1==s5? %s" , s1==s5));
System.out.println(String.format("s1.equals(s5)? %s" , s1.equals(s5)));
System.out.println(String.format("\"aaa\" == s1? %s" , "aaa" == s1));
System.out.println(String.format("\"aaa\" == s5? %s" , "aaa" == s5));


//
// The output
//
s1==s2? true
s1.equals(s2)? true
s1==s5? false
s1.equals(s5)? true
"aaa" == s1? true
"aaa" == s5? false

Although s1, s2, s5, and the "aaa" literal are all equals to “aaa" when comparing them using the == operator we get inconsistent results. Sometimes such a comparison returns true (meaning that the two operands are referencing the same object) and sometimes false (the operands do not refer to the same string instance). The reason for that is how references and objects were defined:
  • s1 - this is a reference to the "aaa" literal
  • s2 - this is also a reference to the "aaa" literal. The "a" + "aa" expression is parsed by the compiler which replaces it with the "aaa" literal (constant folding)
  • s5 - this is assigned value of s3 + s4 which is equals to "aaa" but since we are using references to "a" and "aa" and not the actual literals the compiler must generate a code which calculates the expression at
    runtime. This creates a new instance of "aaa" which is not the "aaa" literal.

All the string literals in our code are stored in a special table in the JVM (the strings table) and are shared between their references in the class. However strings that were constructed at runtime are not stored on that table - therefore s1, s2, and "aaa" are all references to the same memory location while s5 referencing a different location. This difference has a big performance affect on the equals() method. When comparing strings  the equals() method does the following:
  1. Check to see if this object is the same instance as the one it is compared to (using the == operator) - if the two references are equal then the strings are equal. This is the fastest way to identify equal strings
  2. If the references are not to the same instance we have to compare the size of the string objects
  3. If both are in the same size we have to compare character by character until we find a difference or we reach to the end of the internal character array (in that case the strings are equal)
As we can see comparing strings includes many computing steps (O(n)). If the string objects are not shared and a program needs to compare strings again and again we might suffer a big performance penalty, it would
have been helpful if we could be sure that our instances are shared. Java supports that through the intern() method:


s5 = s5.intern();


When the method is invoked a lookup is performed on the strings table. If the method finds an equal string at the table a reference to that string is returned. Otherwise, the string is added to the table and a reference to it is returned.


Interning strings can save memory and improve performance in the case of repeatable comparisons of strings as illustrated by the following example (the comments in the example explain the logic):



// Generate two random arrays of strings. Since this algorithm
// generates 5000 strings with only 26 available permutations we must have
// some strings that are equals between the different arrays
String[] strings1 = new String[5000];
String[] strings2 = new String[strings1.length];

char[] chars = new char[1024];

Random random = new Random();
for (int t=0; t<strings1.length; t++) {
 Arrays.fill(chars, (char)('a' + random.nextInt('z'-'a')));
 strings1[t] = new String(chars);
 Arrays.fill(chars, (char)('a' + random.nextInt('z'-'a')));
 strings2[t] = new String(chars);
}
System.out.println("All strings were generated");

// Now I compare the strings which were not interned. for each string in
// strings1[] I look for matching strings in string2[]
System.out.println("Comparing none interned strings");

long start = System.currentTimeMillis();
for (int t=0; t<strings1.length; ++t) {
 for (int z=0; z<strings2.length; ++z) {
 strings1[t].equals(strings2[z]);
 }
}
long end = System.currentTimeMillis();
System.out.println("Total time without intern:" + (end-start) + " milliseconds");

// Interning. First I intern all of the strings in the two arrays then I compare
// the arrays as done in the previous step. In this step I also sample the
// intern operation time since we have to see how much does it overload.
System.out.println("Interning");

start = System.currentTimeMillis();
for (int t=0; t<strings1.length; ++t) {
 strings1[t] = strings1[t].intern();
 strings2[t] = strings2[t].intern();
}

// Comparing the strings in both arrays
for (int t=0; t<strings1.length; ++t) {
 for (int z=0; z<strings2.length; ++z) {
 strings1[t].equals(strings2[z]);
 }
}
end = System.currentTimeMillis();

System.out.println("Total time with (and including) intern:" + (end-start) + " milliseconds");


The different between the two iterations is amazing; comparing the none interned strings takes 8.4 seconds while comparing the interned strings takes less than a second (0.7) including the intern operation time!. In
an algorithm which performs repeated searches it is strongly recommend to intern our strings.

Update: I posted a follow up explaining strings concatenating. You can find it in here.

Comments

Unknown said…
Nice post!

Tested your last example and its performance is quite different on my mashine:

All strings were generated
Comparing none interned strings
Total time without intern:11 milliseconds
Interning
Total time with (and including) intern:47 milliseconds

cheers ;)
Eyal Lupu said…
Hi Christian,

Thanks for your feedback.


This is interesting, just to double check I tested it again. This time on a different platform (a strong Linux machine rather than my old laptop).

The difference is not as extreme as it is on my laptop but still: without interning it took 1927 ms with (and including) interning it took 330 ms.

Are you using my code as is?


Eyal
Anonymous said…
Good concise post.
You can add one myth to you list
In previous Java versions,
String concatenation was a lot slower than StringBuffer.append(), and old timers still prefer to use StringBuffer.append().
But modern JVMs have highly optimized string concatenation implementations, so now the situation is reverse. StringBuffer.append can at times be less efficient than concatenation.
Anonymous said…
Here's another one:

I had an application that used new String( byte[] buf ) quiet often, and I found that most of my threads were stuck waiting for CoderResult.get(), which is a synchronized method. This caused much waiting and blocking until we realized what was happening.
Anonymous said…
Sorry Bhaskar, you are exactly wrong. Your myth is a myth.

Even with a Java 6 SDK, String concatenation with the + operator is more than 100 times slower than StringBuilder/Buffer.append().
Nice post. Thanks a lot for sharing.
Patrick said…
>Anonymous wrote: Even with a Java 6 SDK, String concatenation with the + operator is more than 100 times slower than StringBuilder/Buffer.append()


String concatenation with + is turned into a StringBuilder/Buffer.append() by the compiler. The 'slowness' only occurs in the case of doing + inside a loop since that converts the StringBuilder back into a String on each iteration.

Here is the JavaDoc about '+' "The Java language provides special support for the string concatenation operator ( + ), and for conversion of other objects to strings. String concatenation is implemented through the StringBuilder(or StringBuffer) class and its append method. "
Anonymous said…
Excellent post man! congratulations! This was really useful!
SvS said…
do you take JIT in account?
Just run your test like that
nonIntern()
intern()
nonIntern()
intern()
nonIntern()
intern()
And check results - they'l be different from ones you post there. Then swap nonIntern and intern comparision so that innert goes first - you will once again get different results (at least for the first pass).
Anonymous said…
Thanks a lot. Can you elaborate more on StringBuilder vs. StringBuffer vs. the "+" operator.
Eyal Lupu said…
Hi SvS
About the JIT:
The difference you have seen has nothing to do with the JIT. If you swap the order of the methods invocation (running the intern version and then the none intern one) actually your none-intern function is using the same array that already interned by the intern function - therefore you are actually running twice with interned strings. You can easily see it by cloning the arrays before invoking the methods.


In general the JIT has nothing to do in here - the difference between the executions is that one does O(n) operations and the other O(1). The difference is not in the domain of native vs interpreted code.


Eyal
Anonymous said…
Exciting. Will defintely be back


kuplyu viagru
Anonymous said…
Helpful blog, bookmarked the website with hopes to read more!
Unknown said…
Very helpful post.

Thanks,
Udi
grails cookbook said…
I love the concept of immutable objects. save lots of headaches
Anonymous said…
I hope I will able to learn java language

Vigrx Oil Exposed

Popular posts from this blog

New in Spring MVC 3.1: CSRF Protection using RequestDataValueProcessor

Hibernate Exception - Simultaneously Fetch Multiple Bags

Hibernate Derived Properties - Performance and Portability