Much ado about NULL: An introduction to virtual memory
Here at Ksplice, we’re always keeping a very close eye on vulnerabilities that are being announced in Linux. And in the last half of last year, it was very clear that NULL pointer dereference vulnerabilities were the current big thing. Brad Spengler made it abundantly clear to anyone who was paying the least bit attention that these vulnerabilities, far more than being mere denial of service attacks, were trivially exploitable privilege escalation vulnerabilities. Some observers even dubbed 2009 the year of the kernel NULL pointer dereference.
If you’ve ever programmed in C, you’ve probably run into a NULL pointer dereference at some point. But almost certainly, all it did was crash your program with the dreaded “Segmentation Fault”. Annoying, and often painful to debug, but nothing more than a crash. So how is it that this simple programming error becomes so dangerous when it happens in the kernel? Inspired by all the fuss, this post will explore a little bit of how memory works behind the scenes on your computer. By the end of today’s installment, we’ll understand how to write a C program that reads and writes to a NULL pointer without crashing. In a future post, I’ll take it a step further and go all the way to showing how an attacker would exploit a NULL pointer dereference in the kernel to take control of a machine!
What’s in a pointer?
There’s nothing fundamentally magical about pointers in C (or assembly, if that’s your thing). A pointer is just an integer, that (with the help of the hardware) refers to a location somewhere in that big array of bits we call a computer’s memory. We can write a C program to print out a random pointer:
Which, if you run it on my machine, prints:
The argv pointer = 1680681096
(Pointers are conventionally written in hexadecimal, which would make that 0x642d2888, but that’s just a notational thing. They’re still just integers.)
NULL is only slightly special as a pointer value: if we look in stddef.h, we can see that it’s just defined to be the pointer with value 0. The only thing really special about NULL is that, by convention, the operating system sets things up so that NULL is an invalid pointer, and any attempts to read or write through it lead to an error, which we call a segmentation fault. However, this is just convention; to the hardware, NULL is just another possible pointer value.
But what do those integers actually mean? We need to understand a little bit more about how memory works in a modern computer. In the old days (and still on many embedded devices), a pointer value was literally an index into all of the memory on those little RAM chips in your computer:
This was true for every program, including the operating system itself. You can probably guess what goes wrong here: suppose that Microsoft Word is storing your document at address 700 in memory. Now, you’re browsing the web, and a bug in Internet Explorer causes it to start scribbling over random memory and it happens to scribble over memory around address 700. Suddenly, bam, Internet Explorer takes Word down with it. It’s actually even worse than that: a bug in IE can even take down the entire operating system.
This was widely regarded as a bad move, and so all modern hardware supports, and operating systems use, a scheme called virtual memory. What this means it that every program running on your computer has its own namespace for pointers (from 0 to 232-1, on a 32-bit machine). The value 700 means something completely different to Microsoft Word and Internet Explorer, and neither can access the other’s memory. The operating system is in charge of managing these so-called address spaces, and mapping different pieces of each program’s address space to different pieces of physical memory.
mmap(2)
One feature of this setup is that while each process has its own 232 possible addresses, not all of them need to be valid (correspond to real memory). In particular, by default, the NULL or 0 pointer does not correspond to valid memory, which is why accessing it leads to a crash.
Because each application has its own address space, however, it is free to do with it as it wants. For instance, you’re welcome to declare that NULL should be a valid address in your application. We refer to this as “mapping” the NULL page, because you’re declaring that that area of memory should map to some piece of physical memory.
On Linux (and other UNIX) systems, the function call used for mapping regions of memory is mmap(2). mmap is defined as:
void *mmap(void *addr, size_t length, int prot, int flags,
int fd, off_t offset);
Let’s go through those arguments in order (All of this information comes from the man page):
addr
This is the address where the application wants to map memory. If MAP_FIXED is not specified in flags, mmap may select a different address if the selected one is not available or inappropriate for some reason.
length
The length of the region the application wants to map. Memory can only be mapped in increments of a “page”, which is 4k (4096 bytes) on x86 processors.
prot
Short for “protection”, this argument must be a combination of one or more of the values PROT_READ, PROT_WRITE, PROT_EXEC, or PROT_NONE, indicating whether the application should be able to read, write, execute, or none of the above, the mapped memory.
flags
Controls various options about the mapping. There are a number of flags that can go here. Some interesting ones are MAP_PRIVATE, which indicates the mapping should not be shared with any other process, MAP_ANONYMOUS, which indicates that the fd argument is irrelevant, and MAP_FIXED, which indicates that we want memory located exactly at addr.
fd
The primary use of mmap is not just as a memory allocator, but in order to map files on disk to appear in a process’s address space, in which case fd refers to an open file descriptor to map. Since we just want a random chunk of memory, we’re going pass MAP_ANONYMOUS in flags, which indicates that we don’t want to map a file, and fd is irrelevant.
offset
This argument would be used with fd to indicate which portion of a file we wanted to map.
mmap returns the address of the new mapping, or MAP_FAILED if something went wrong.
If we just want to be able to read and write the NULL pointer, we’ll want to set addr to 0 and length to 4096, in order to map the first page of memory. We’ll need PROT_READ and PROT_WRITE to be able to read and write, and all three of the flags I mentioned. fd and offset are irrelevant; we’ll set them to -1 and 0 respectively.
Putting it all together, we get the following short C program, which successfully reads and writes through a NULL pointer without crashing!
(Note that most modern systems actually specifically disallow mapping the NULL page, out of security concerns. To run the following example on a recent Linux machine at home, you’ll need to run # echo 0 > /proc/sys/vm/mmap_min_addr as root, first.)
#include
#include
int main() {
int *ptr = NULL;
if (mmap(0, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
== MAP_FAILED) {
perror("Unable to mmap(NULL)");
fprintf(stderr, "Is /proc/sys/vm/mmap_min_addr non-zero?\n");
return 1;
}
printf("Dereferencing my NULL pointer yields: %d\n", *ptr);
*ptr = 17;
printf("Now it's: %d\n", *ptr);
return 0;
}
Next time, we’ll look at how a process can not only map NULL in its own address space, but can also create mappings in the kernel’s address space. And, I’ll show you how this lets an attacker use a NULL dereference in the kernel to take over the entire machine. Stay tuned!
The 1st International Longest Tweet Contest
Posted in Programming on March 25th, 2010 by Keith Winstein – 30 Comments
How much information is in a tweet?
There are a lot of snarky answers to that. Let’s talk mathematically, where information is measured in bits: How many bits can be expressed in a tweet?
It’s kind of fun to try to figure out how to cram in the most information. Our current in-house record is 4.2 kilobits (525 bytes) per tweet, but this can definitely be bested. Twitter’s 140-character limit has been under assault for some time, but nobody has decisively anointed a winner.
To that end, announcing the 1st International Longest Tweet Contest, along the lines of the International Obfuscated C Code Contest. The goal: fit the most bits of information into a tweet. There will be glorious fame and a T-shirt for the winner. More on that later. But first, a dialog:
Ben Bitdiddle: How big can a tweet get? SMS is limited to 160 characters of 7-bit ASCII, or 1,120 bits. Since Twitter based their limit on SMS but with 20 characters reserved for your name, a tweet is probably limited to 140 × 7 = 980 bits.
Alyssa P. Hacker: Not quite — despite its SMS roots, Twitter supports Unicode, and the company says its 140-character limit is based on Unicode characters, not ASCII. So it’s a lot more than 980 bits.
Ben: Ok, Unicode is a 16-bit character set, so the answer is 140 × 16 = 2,240 bits.
Alyssa: Well, since that Java documentation was written, Unicode has expanded to cover more of the world’s languages. These days there are 1,112,064 possible Unicode characters (officially “Unicode scalar values”) that can be represented, including in the UTF-8 encoding that Twitter and most of the Internet uses. That makes Unicode about 20.1 bits per character, not 16. (Since log2 1,112,064 ≈ 20.1.)
Ben: Ok, if each character can take on one of 1,112,064 possible values, we can use that to figure out how many total different tweets there can ever be. And from that, we’ll know how many bits can be encoded into a tweet. But how?
Alyssa: It’s easy! We just calculate the total number of different tweets that can ever be received. The capacity of a tweet, in bits, is just the base-2 logarithm of that number. According to my calculator here, 1,112,064 to the 140th power is about 28.7 quattuordecillion googol googol googol googol googol googol googol googol. That’s the number of distinct 140-character messages that can be sent. Plus we have to add in the 139-character messages and 138-character messages, etc. Taking the log, I get 2,811 bits per tweet.
Ben: We’d get almost the same answer if we just multiplied 20.1 bits per character times 140 characters. Ok, 2.8 kilobits per tweet.
Alyssa: I just discovered a problem. There aren’t that many distinct tweets! For example, I can’t tell the difference between the single character ‘<’ and the four characters ‘<’. I also can’t send a tweet that contains nothing but some null characters. So we have to deflate our number a little bit. It’s tricky because we don’t know Twitter’s exact limitations; it’s a black box.
Ben: But the answer will still be roughly 2.8 kilobits. Can we do better?
Alyssa: By golly, I think we can! Turns out Unicode is basically identical to an international standard, called the Universal Multi-Octet Coded Character Set, known as UCS or ISO/IEC 10646. But the two standards weren’t always the same — in the 90s, there was a lot of disagreement between the computer companies who backed Unicode and proposed a simple 16-bit character set, and the standards mavens behind UCS, who wanted to accommodate more languages than 65,536 characters would allow. The two sides eventually compromised on the current 20.1-bit character set, but some historical differences still linger between the two “official” specifications — including in the definition of the UTF-8 encoding that Twitter uses. Check this out:
Alyssa: Although Unicode’s UTF-8 can only encode the 1,112,064 Unicode scalar values, UCS’s version of UTF-8 is a lot bigger — it goes up to 31 bits!
Ben: So, when Twitter says they use UTF-8, which version are they using? Unicode, or UCS?
Alyssa: Hmm, that’s a good question. Let’s write a program to test. I’ll send Twitter a really big UCS character in UTF-8, represented in octal like this. This is way outside the bounds of well-formed Unicode.
Ben: Hey, that kind of worked! When we fetch it back using Twitter’s JSON API, we get this JSON fragment:
Alyssa: It’s the same thing we sent! The character (really an unassigned code position) is represented in UTF-8, in octal with backslashes in ASCII. That means Twitter “characters” are actually 31 bits, not 20.1 bits. The capacity of a tweet is actually a lot bigger than the 2.8 kilobits we calculated earlier.
Ben: Should we worry that the text only shows up in the JSON API, not XML or HTML, and these high characters only last a few days on Twitter’s site before vanishing?
Alyssa: Nope. As long as we had this exchange in our conversation, people on the Internet won’t complain about those issues.
Using Alyssa’s insight that Twitter actually supports 31-bit UCS characters (not just Unicode), we can come up with a simple program to send 4.2-kilobit tweets using only code positions that aren’t in Unicode. That way there’s no ambiguity between these crazy code positions, which Twitter represents as backslashed octal, and legitimate Unicode, which Twitter sends literally. These tweets have a text field that’s a whopping 3,360 bytes long — but in ASCII octal UTF-8, so they only represent 525 bytes of information in the end.
The sender program reads the first 525 bytes of standard input or a file, slices it into 30-bit chunks, and sends it to Twitter using 31-bit UCS code positions. The high bit of each character is set to 1 to avoid ever sending a valid Unicode UTF-8 string, which Twitter might treat ambiguously. It outputs the ID of the tweet it posted. You’ll have to fill in your own username and password to test it.
The receiver program does the opposite — give it an ID on the command line, and it will retrieve and decode a “megatwit” encoded by the sender.
Here’s an example of how to use the programs and verify that they can encode at least one arbitrary 4,200-bit message:
But this is just the start -- we're not the only ones interested in really long tweets. (Structures, strictures...) Others have found an apparent loophole in how Twitter handles some URLs, allowing them to send tweets with a text field up to 532 bytes long (how much information can be coded this way isn't clear). Maxitweet has come up with a clever way to milk the most out of 140 Unicode characters without requiring a decoder program, at least at its lower compression levels. There are definitely even better ways to cram as many bits as possible into a tweet.
So here is the challenge for the 1st International Longest Tweet Contest:
Come up with a strategy for encoding arbitrary binary data into a single tweet, along the lines of the sample programs described here.
Implement a sender and receiver for the strategy in a computer programming language of your choice. We recommend you choose a language likely to be runnable by a wide variety of readers. At your option, you may provide a Web site or other public implementation for others to test your coding scheme with arbitrary binary input.
Write a description, in English, for how your coding scheme works. Explain how many bits per tweet it achieves, and justify your calculation. The explanation must be convincing because it is intractable to prove conclusively that a certain capacity is achievable, even if a program successfully sends and receives many test cases.
Send your entry, consisting of #2 and #3, to megatwit@ksplice.com before Sunday, April 11, 2010, 23h59 UTC.
Based on the English explanations of the coding schemes (#3), we'll select finalists from among the entrants. In mid-April, we'll post the finalists' submissions on the blog. In the spirit of Twitter, we'll let the community assess, criticize, test, and pick apart the finalists' entries.
Based on the community's feedback, we'll choose at least one winner. This will generally be the person whose scheme for achieving the highest information encoded per tweet is judged most convincing.
The winner will receive notoriety and fame on this blog, and a smart-looking Ksplice T-shirt in the size of your choice. You will be known the world over as the Reigning Champion of the International Longest Tweet Contest until there is another contest. Legally we can't promise this, but there's a chance Stephen Colbert will have you on as a result of winning this prestigious contest.
By entering the contest, you are giving Ksplice permission to post your entry if you are selected as a finalist. The contest is void where prohibited. The rules are subject to change at our discretion. Employees of Ksplice aren't eligible because they already have T-shirts. Judging will be done by Ksplice in its sole discretion.
That's it. Happy tweeting!
Hello from a libc-free world! (Part 1)
Posted in computer architecture on March 16th, 2010 by Jessica McKellar – 104 Comments
As an exercise, I want to write a Hello World program in C simple enough that I can disassemble it and be able to explain all of the assembly to myself.
This should be easy, right?
This adventure assumes compilation and execution on a Linux machine. Some familiarity with reading assembly is helpful.
Here’s our basic Hello World program:
Let’s compile it and get a bytecount:
Yikes! Where are 11 Kilobytes worth of executable coming from? objdump -t hello gives us 79 symbol-table entries, most of which we can blame on our using the standard library.
So let’s stop using it. We won’t use printf so we can get rid of our include file:
What? That barely changed anything!
The problem is that gcc is still using standard library startup files when linking. Want proof? We’ll compile with -nostdlib, which according to the gcc man page won’t “use the standard system libraries and startup files when linking. Only the files you specify will be passed to the linker”.
Well, it’s just a warning; let’s check it anyway:
That looks pretty good! We got our bytecount down to a much more reasonable size (an order of magnitude smaller!)…
…at the expense of segfaulting when it runs. Hrmph.
For fun, let’s get our program to be actually runnable before digging into the assembly.
So what is this _start entry symbol that appears to be required for our program to run? Where is it usually defined if you’re using libc?
From the perspective of the linker, by default _start is the actual entry point to your program, not main. It is normally defined in the crt1.o ELF relocatable. We can verify this by linking against crt1.o and noting that _start is now found (although we develop other problems by not having defined other necessary libc startup symbols):
Hurrah, our compilation problems go away! However, we still segfault. Why? Let’s compile with debugging information and take a look in gdb. We’ll set a breakpoint at main and step through until the segfault:
D’oh! Let’s save a detailed examination of the assembly for later, but in brief: when we return from the callq to main we hit some nops and run right back into main. Since we re-entered main without putting a return instruction pointer on the stack as part of the standard prologue for calling a function, the second call to retq tries to pop a bogus return instruction pointer off the stack and jump to it and we bomb out. We need an exit strategy.
Literally. After the return from callq, push 1, the syscall number for SYS_exit, into %eax, and because we want to say that we’re exiting cleanly, put a status of 0, SYS_exit’s only argument, into %ebx. Then make the interrupt to drop into the kernel with int $0x80.
Success! It compiles, it runs, and if we step through this new version under gdb it even exits normally.
Hello from a libc-free world!
Stay tuned for Part 2, where we’ll walk through the parts of the executable in earnest and watch what happens to it as we add complexity, in the process understanding more about x86 linking and calling conventions and the structure of an ELF binary.
reference: http://blog.ksplice.com/2010
If you’ve ever programmed in C, you’ve probably run into a NULL pointer dereference at some point. But almost certainly, all it did was crash your program with the dreaded “Segmentation Fault”. Annoying, and often painful to debug, but nothing more than a crash. So how is it that this simple programming error becomes so dangerous when it happens in the kernel? Inspired by all the fuss, this post will explore a little bit of how memory works behind the scenes on your computer. By the end of today’s installment, we’ll understand how to write a C program that reads and writes to a NULL pointer without crashing. In a future post, I’ll take it a step further and go all the way to showing how an attacker would exploit a NULL pointer dereference in the kernel to take control of a machine!
What’s in a pointer?
There’s nothing fundamentally magical about pointers in C (or assembly, if that’s your thing). A pointer is just an integer, that (with the help of the hardware) refers to a location somewhere in that big array of bits we call a computer’s memory. We can write a C program to print out a random pointer:
#include
int main(int argc, char **argv) {
printf("The argv pointer = %d\n", argv);
return 0;
}
Which, if you run it on my machine, prints:
The argv pointer = 1680681096
(Pointers are conventionally written in hexadecimal, which would make that 0x642d2888, but that’s just a notational thing. They’re still just integers.)
NULL is only slightly special as a pointer value: if we look in stddef.h, we can see that it’s just defined to be the pointer with value 0. The only thing really special about NULL is that, by convention, the operating system sets things up so that NULL is an invalid pointer, and any attempts to read or write through it lead to an error, which we call a segmentation fault. However, this is just convention; to the hardware, NULL is just another possible pointer value.
But what do those integers actually mean? We need to understand a little bit more about how memory works in a modern computer. In the old days (and still on many embedded devices), a pointer value was literally an index into all of the memory on those little RAM chips in your computer:
This was true for every program, including the operating system itself. You can probably guess what goes wrong here: suppose that Microsoft Word is storing your document at address 700 in memory. Now, you’re browsing the web, and a bug in Internet Explorer causes it to start scribbling over random memory and it happens to scribble over memory around address 700. Suddenly, bam, Internet Explorer takes Word down with it. It’s actually even worse than that: a bug in IE can even take down the entire operating system.
This was widely regarded as a bad move, and so all modern hardware supports, and operating systems use, a scheme called virtual memory. What this means it that every program running on your computer has its own namespace for pointers (from 0 to 232-1, on a 32-bit machine). The value 700 means something completely different to Microsoft Word and Internet Explorer, and neither can access the other’s memory. The operating system is in charge of managing these so-called address spaces, and mapping different pieces of each program’s address space to different pieces of physical memory.
mmap(2)
One feature of this setup is that while each process has its own 232 possible addresses, not all of them need to be valid (correspond to real memory). In particular, by default, the NULL or 0 pointer does not correspond to valid memory, which is why accessing it leads to a crash.
Because each application has its own address space, however, it is free to do with it as it wants. For instance, you’re welcome to declare that NULL should be a valid address in your application. We refer to this as “mapping” the NULL page, because you’re declaring that that area of memory should map to some piece of physical memory.
On Linux (and other UNIX) systems, the function call used for mapping regions of memory is mmap(2). mmap is defined as:
void *mmap(void *addr, size_t length, int prot, int flags,
int fd, off_t offset);
Let’s go through those arguments in order (All of this information comes from the man page):
addr
This is the address where the application wants to map memory. If MAP_FIXED is not specified in flags, mmap may select a different address if the selected one is not available or inappropriate for some reason.
length
The length of the region the application wants to map. Memory can only be mapped in increments of a “page”, which is 4k (4096 bytes) on x86 processors.
prot
Short for “protection”, this argument must be a combination of one or more of the values PROT_READ, PROT_WRITE, PROT_EXEC, or PROT_NONE, indicating whether the application should be able to read, write, execute, or none of the above, the mapped memory.
flags
Controls various options about the mapping. There are a number of flags that can go here. Some interesting ones are MAP_PRIVATE, which indicates the mapping should not be shared with any other process, MAP_ANONYMOUS, which indicates that the fd argument is irrelevant, and MAP_FIXED, which indicates that we want memory located exactly at addr.
fd
The primary use of mmap is not just as a memory allocator, but in order to map files on disk to appear in a process’s address space, in which case fd refers to an open file descriptor to map. Since we just want a random chunk of memory, we’re going pass MAP_ANONYMOUS in flags, which indicates that we don’t want to map a file, and fd is irrelevant.
offset
This argument would be used with fd to indicate which portion of a file we wanted to map.
mmap returns the address of the new mapping, or MAP_FAILED if something went wrong.
If we just want to be able to read and write the NULL pointer, we’ll want to set addr to 0 and length to 4096, in order to map the first page of memory. We’ll need PROT_READ and PROT_WRITE to be able to read and write, and all three of the flags I mentioned. fd and offset are irrelevant; we’ll set them to -1 and 0 respectively.
Putting it all together, we get the following short C program, which successfully reads and writes through a NULL pointer without crashing!
(Note that most modern systems actually specifically disallow mapping the NULL page, out of security concerns. To run the following example on a recent Linux machine at home, you’ll need to run # echo 0 > /proc/sys/vm/mmap_min_addr as root, first.)
#include
#include
int main() {
int *ptr = NULL;
if (mmap(0, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
== MAP_FAILED) {
perror("Unable to mmap(NULL)");
fprintf(stderr, "Is /proc/sys/vm/mmap_min_addr non-zero?\n");
return 1;
}
printf("Dereferencing my NULL pointer yields: %d\n", *ptr);
*ptr = 17;
printf("Now it's: %d\n", *ptr);
return 0;
}
Next time, we’ll look at how a process can not only map NULL in its own address space, but can also create mappings in the kernel’s address space. And, I’ll show you how this lets an attacker use a NULL dereference in the kernel to take over the entire machine. Stay tuned!
The 1st International Longest Tweet Contest
Posted in Programming on March 25th, 2010 by Keith Winstein – 30 Comments
How much information is in a tweet?
There are a lot of snarky answers to that. Let’s talk mathematically, where information is measured in bits: How many bits can be expressed in a tweet?
It’s kind of fun to try to figure out how to cram in the most information. Our current in-house record is 4.2 kilobits (525 bytes) per tweet, but this can definitely be bested. Twitter’s 140-character limit has been under assault for some time, but nobody has decisively anointed a winner.
To that end, announcing the 1st International Longest Tweet Contest, along the lines of the International Obfuscated C Code Contest. The goal: fit the most bits of information into a tweet. There will be glorious fame and a T-shirt for the winner. More on that later. But first, a dialog:
Ben Bitdiddle: How big can a tweet get? SMS is limited to 160 characters of 7-bit ASCII, or 1,120 bits. Since Twitter based their limit on SMS but with 20 characters reserved for your name, a tweet is probably limited to 140 × 7 = 980 bits.
Alyssa P. Hacker: Not quite — despite its SMS roots, Twitter supports Unicode, and the company says its 140-character limit is based on Unicode characters, not ASCII. So it’s a lot more than 980 bits.
Ben: Ok, Unicode is a 16-bit character set, so the answer is 140 × 16 = 2,240 bits.
Alyssa: Well, since that Java documentation was written, Unicode has expanded to cover more of the world’s languages. These days there are 1,112,064 possible Unicode characters (officially “Unicode scalar values”) that can be represented, including in the UTF-8 encoding that Twitter and most of the Internet uses. That makes Unicode about 20.1 bits per character, not 16. (Since log2 1,112,064 ≈ 20.1.)
Ben: Ok, if each character can take on one of 1,112,064 possible values, we can use that to figure out how many total different tweets there can ever be. And from that, we’ll know how many bits can be encoded into a tweet. But how?
Alyssa: It’s easy! We just calculate the total number of different tweets that can ever be received. The capacity of a tweet, in bits, is just the base-2 logarithm of that number. According to my calculator here, 1,112,064 to the 140th power is about 28.7 quattuordecillion googol googol googol googol googol googol googol googol. That’s the number of distinct 140-character messages that can be sent. Plus we have to add in the 139-character messages and 138-character messages, etc. Taking the log, I get 2,811 bits per tweet.
Ben: We’d get almost the same answer if we just multiplied 20.1 bits per character times 140 characters. Ok, 2.8 kilobits per tweet.
Alyssa: I just discovered a problem. There aren’t that many distinct tweets! For example, I can’t tell the difference between the single character ‘<’ and the four characters ‘<’. I also can’t send a tweet that contains nothing but some null characters. So we have to deflate our number a little bit. It’s tricky because we don’t know Twitter’s exact limitations; it’s a black box.
Ben: But the answer will still be roughly 2.8 kilobits. Can we do better?
Alyssa: By golly, I think we can! Turns out Unicode is basically identical to an international standard, called the Universal Multi-Octet Coded Character Set, known as UCS or ISO/IEC 10646. But the two standards weren’t always the same — in the 90s, there was a lot of disagreement between the computer companies who backed Unicode and proposed a simple 16-bit character set, and the standards mavens behind UCS, who wanted to accommodate more languages than 65,536 characters would allow. The two sides eventually compromised on the current 20.1-bit character set, but some historical differences still linger between the two “official” specifications — including in the definition of the UTF-8 encoding that Twitter uses. Check this out:
Alyssa: Although Unicode’s UTF-8 can only encode the 1,112,064 Unicode scalar values, UCS’s version of UTF-8 is a lot bigger — it goes up to 31 bits!
Ben: So, when Twitter says they use UTF-8, which version are they using? Unicode, or UCS?
Alyssa: Hmm, that’s a good question. Let’s write a program to test. I’ll send Twitter a really big UCS character in UTF-8, represented in octal like this. This is way outside the bounds of well-formed Unicode.
$ perl -Mbytes -we 'print pack "U", (2**31 - 5)' | od -t o1
0000000 375 277 277 277 277 273
Ben: Hey, that kind of worked! When we fetch it back using Twitter’s JSON API, we get this JSON fragment:
“text”:”\\375\\277\\277\\277\\277\\273″
Alyssa: It’s the same thing we sent! The character (really an unassigned code position) is represented in UTF-8, in octal with backslashes in ASCII. That means Twitter “characters” are actually 31 bits, not 20.1 bits. The capacity of a tweet is actually a lot bigger than the 2.8 kilobits we calculated earlier.
Ben: Should we worry that the text only shows up in the JSON API, not XML or HTML, and these high characters only last a few days on Twitter’s site before vanishing?
Alyssa: Nope. As long as we had this exchange in our conversation, people on the Internet won’t complain about those issues.
Using Alyssa’s insight that Twitter actually supports 31-bit UCS characters (not just Unicode), we can come up with a simple program to send 4.2-kilobit tweets using only code positions that aren’t in Unicode. That way there’s no ambiguity between these crazy code positions, which Twitter represents as backslashed octal, and legitimate Unicode, which Twitter sends literally. These tweets have a text field that’s a whopping 3,360 bytes long — but in ASCII octal UTF-8, so they only represent 525 bytes of information in the end.
The sender program reads the first 525 bytes of standard input or a file, slices it into 30-bit chunks, and sends it to Twitter using 31-bit UCS code positions. The high bit of each character is set to 1 to avoid ever sending a valid Unicode UTF-8 string, which Twitter might treat ambiguously. It outputs the ID of the tweet it posted. You’ll have to fill in your own username and password to test it.
The receiver program does the opposite — give it an ID on the command line, and it will retrieve and decode a “megatwit” encoded by the sender.
Here’s an example of how to use the programs and verify that they can encode at least one arbitrary 4,200-bit message:
$ dd if=/dev/urandom of=random.bits bs=525 count=1
1+0 records in
1+0 records out
525 bytes (525 B) copied, 0.0161392 s, 32.5 kB/s
$ md5sum random.bits
a7da09e59d1b6807e78aac7004e6ba41 random.bits
$ ./megatwit-send < random.bits
11037181699
$ ./megatwit-get 11037181699 | md5sum
a7da09e59d1b6807e78aac7004e6ba41 -
But this is just the start -- we're not the only ones interested in really long tweets. (Structures, strictures...) Others have found an apparent loophole in how Twitter handles some URLs, allowing them to send tweets with a text field up to 532 bytes long (how much information can be coded this way isn't clear). Maxitweet has come up with a clever way to milk the most out of 140 Unicode characters without requiring a decoder program, at least at its lower compression levels. There are definitely even better ways to cram as many bits as possible into a tweet.
So here is the challenge for the 1st International Longest Tweet Contest:
Come up with a strategy for encoding arbitrary binary data into a single tweet, along the lines of the sample programs described here.
Implement a sender and receiver for the strategy in a computer programming language of your choice. We recommend you choose a language likely to be runnable by a wide variety of readers. At your option, you may provide a Web site or other public implementation for others to test your coding scheme with arbitrary binary input.
Write a description, in English, for how your coding scheme works. Explain how many bits per tweet it achieves, and justify your calculation. The explanation must be convincing because it is intractable to prove conclusively that a certain capacity is achievable, even if a program successfully sends and receives many test cases.
Send your entry, consisting of #2 and #3, to megatwit@ksplice.com before Sunday, April 11, 2010, 23h59 UTC.
Based on the English explanations of the coding schemes (#3), we'll select finalists from among the entrants. In mid-April, we'll post the finalists' submissions on the blog. In the spirit of Twitter, we'll let the community assess, criticize, test, and pick apart the finalists' entries.
Based on the community's feedback, we'll choose at least one winner. This will generally be the person whose scheme for achieving the highest information encoded per tweet is judged most convincing.
The winner will receive notoriety and fame on this blog, and a smart-looking Ksplice T-shirt in the size of your choice. You will be known the world over as the Reigning Champion of the International Longest Tweet Contest until there is another contest. Legally we can't promise this, but there's a chance Stephen Colbert will have you on as a result of winning this prestigious contest.
By entering the contest, you are giving Ksplice permission to post your entry if you are selected as a finalist. The contest is void where prohibited. The rules are subject to change at our discretion. Employees of Ksplice aren't eligible because they already have T-shirts. Judging will be done by Ksplice in its sole discretion.
That's it. Happy tweeting!
Hello from a libc-free world! (Part 1)
Posted in computer architecture on March 16th, 2010 by Jessica McKellar – 104 Comments
As an exercise, I want to write a Hello World program in C simple enough that I can disassemble it and be able to explain all of the assembly to myself.
This should be easy, right?
This adventure assumes compilation and execution on a Linux machine. Some familiarity with reading assembly is helpful.
Here’s our basic Hello World program:
jesstess@kid-charlemagne:~/c$ cat hello.c
#include
int main()
{
printf("Hello World\n");
return 0;
}
Let’s compile it and get a bytecount:
jesstess@kid-charlemagne:~/c$ gcc -o hello hello.c
jesstess@kid-charlemagne:~/c$ wc -c hello
10931 hello
Yikes! Where are 11 Kilobytes worth of executable coming from? objdump -t hello gives us 79 symbol-table entries, most of which we can blame on our using the standard library.
So let’s stop using it. We won’t use printf so we can get rid of our include file:
jesstess@kid-charlemagne:~/c$ cat hello.c
int main()
{
char *str = "Hello World";
return 0;
}
Recompiling and checking the bytecount:
jesstess@kid-charlemagne:~/c$ gcc -o hello hello.c
jesstess@kid-charlemagne:~/c$ wc -c hello
10892 hello
What? That barely changed anything!
The problem is that gcc is still using standard library startup files when linking. Want proof? We’ll compile with -nostdlib, which according to the gcc man page won’t “use the standard system libraries and startup files when linking. Only the files you specify will be passed to the linker”.
jesstess@kid-charlemagne:~/c$ gcc -nostdlib -o hello hello.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 00000000004000e8
Well, it’s just a warning; let’s check it anyway:
jesstess@kid-charlemagne:~/c$ wc -c hello
1329 hello
That looks pretty good! We got our bytecount down to a much more reasonable size (an order of magnitude smaller!)…
jesstess@kid-charlemagne:~/c$ ./hello
Segmentation fault
…at the expense of segfaulting when it runs. Hrmph.
For fun, let’s get our program to be actually runnable before digging into the assembly.
So what is this _start entry symbol that appears to be required for our program to run? Where is it usually defined if you’re using libc?
From the perspective of the linker, by default _start is the actual entry point to your program, not main. It is normally defined in the crt1.o ELF relocatable. We can verify this by linking against crt1.o and noting that _start is now found (although we develop other problems by not having defined other necessary libc startup symbols):
# Compile the source files but don't link
jesstess@kid-charlemagne:~/c$ gcc -Os -c hello.c
# Now try to link
jesstess@kid-charlemagne:~/c$ ld /usr/lib/crt1.o -o hello hello.o
/usr/lib/crt1.o: In function `_start':
/build/buildd/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:106: undefined reference to `__libc_csu_fini'
/build/buildd/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:107: undefined reference to `__libc_csu_init'
/build/buildd/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:113: undefined reference to `__libc_start_main'
This check conveniently also tells us where _start lives in the libc source: sysdeps/x86_64/elf/start.S for this particular machine. This delightfully well-commented file exports the _start symbol, sets up the stack and some registers, and calls __libc_start_main. If we look at the very bottom of csu/libc-start.c we see the call to our program’s main:
/* Nothing fancy, just call the function. */
result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
and down the rabbit hole we go.
So that’s what _start is all about. Conveniently, we can summarize what happens between _start and the call to main as “set up a bunch of stuff for libc and then call main”, and since we don’t care about libc, let’s just export our own _start symbol that just calls main and link against that:
jesstess@kid-charlemagne:~/c$ cat stubstart.S
.globl _start
_start:
call main
Compiling and running with our stub _start assembly file:
jesstess@kid-charlemagne:~/c$ gcc -nostdlib stubstart.S -o hello hello.c
jesstess@kid-charlemagne:~/c$ ./hello
Segmentation fault
Hurrah, our compilation problems go away! However, we still segfault. Why? Let’s compile with debugging information and take a look in gdb. We’ll set a breakpoint at main and step through until the segfault:
jesstess@kid-charlemagne:~/c$ gcc -g -nostdlib stubstart.S -o hello hello.c
jesstess@kid-charlemagne:~/c$ gdb hello
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
(gdb) break main
Breakpoint 1 at 0x4000f4: file hello.c, line 3.
(gdb) run
Starting program: /home/jesstess/c/hello
Breakpoint 1, main () at hello.c:5
5 char *str = "Hello World";
(gdb) step
6 return 0;
(gdb) step
7 }
(gdb) step
0x00000000004000ed in _start ()
(gdb) step
Single stepping until exit from function _start,
which has no line number information.
main () at helloint.c:4
4 {
(gdb) step
Breakpoint 1, main () at helloint.c:5
5 char *str = "Hello World";
(gdb) step
6 return 0;
(gdb) step
7 }
(gdb) step
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000001 in ?? ()
(gdb)
Wait, what? Why are we running through main twice? …It’s time to look at the assembly:
jesstess@kid-charlemagne:~/c$ objdump -d hello
hello: file format elf64-x86-64
Disassembly of section .text:
00000000004000e8 <_start>:
4000e8: e8 03 00 00 00 callq 4000f0
4000ed: 90 nop
4000ee: 90 nop
4000ef: 90 nop
00000000004000f0 :
4000f0: 55 push %rbp
4000f1: 48 89 e5 mov %rsp,%rbp
4000f4: 48 c7 45 f8 03 01 40 movq $0x400103,-0x8(%rbp)
4000fb: 00
4000fc: b8 00 00 00 00 mov $0x0,%eax
400101: c9 leaveq
400102: c3 retq
D’oh! Let’s save a detailed examination of the assembly for later, but in brief: when we return from the callq to main we hit some nops and run right back into main. Since we re-entered main without putting a return instruction pointer on the stack as part of the standard prologue for calling a function, the second call to retq tries to pop a bogus return instruction pointer off the stack and jump to it and we bomb out. We need an exit strategy.
Literally. After the return from callq, push 1, the syscall number for SYS_exit, into %eax, and because we want to say that we’re exiting cleanly, put a status of 0, SYS_exit’s only argument, into %ebx. Then make the interrupt to drop into the kernel with int $0x80.
jesstess@kid-charlemagne:~/c$ cat stubstart.S
.globl _start
_start:
call main
movl $1, %eax
xorl %ebx, %ebx
int $0x80
jesstess@kid-charlemagne:~/c$ gcc -nostdlib stubstart.S -o hello hello.c
jesstess@kid-charlemagne:~/c$ ./hello
jesstess@kid-charlemagne:~/c$
Success! It compiles, it runs, and if we step through this new version under gdb it even exits normally.
Hello from a libc-free world!
Stay tuned for Part 2, where we’ll walk through the parts of the executable in earnest and watch what happens to it as we add complexity, in the process understanding more about x86 linking and calling conventions and the structure of an ELF binary.
reference: http://blog.ksplice.com/2010
0 komentar: