Extra Pepperoni


Wednesday, August 18 2010

SystemImager & SALI

We use SystemImager to maintain (rebuild) our small HPC clusters. Conceptually it's very simple:

  1. Build a node (the 'golden client') just the way you want it.
  2. si_prepareclient: Run rsyncd on the node, accessible to the 'image server'.
  3. si_getimage on the image server copies the entire node into a directory, and analyzes it to produce a script that will recreate the image (with exclusions for files which should differ between nodes).
  4. si_updateclient on a target node fetches the script from the image server; the script configures the target (disk partitioning, etc.) and fetches the image contents, making the target match the golden client.
  5. If the node is dead or brand-new, there's a DHCP/PXE/TFTP process for bootstrapping far enough to run the script and then match the golden client.

Once the SI system is all set up, it's quick & easy to rebuild nodes. Unfortunately there are several complications:

  • The DHCP & TFTP dependencies are somewhat complicated, so bringing up SI without breaking anything is tricky. TFTP & pxelinux are not terribly well documented.
  • The "Latest Stable Release" is SystemImager 4.0.2 from December 2007. One of the key components of SystemImager is a generic kernel & Linux initrd (initial RAMdisk) which include a default set of drivers. But the release is so old that it cannot handle current hardware. There are several newer development versions but they're not fully baked and choosing between them is confusing.
  • SI doesn't yet support grub2 or ext4, which are required for large disks (GPT partition tables).

The workaround I got from the very helpful folks on sisuite-users@ was to use SALI, a modern kernel/initrd pair for SystemImager. Unfortunately SALI's a bit different -- in the process of adding grub2 support, they broke compatibility with the scripts that SI generates. Here's a quick recap of the steps I used (mostly from sisuite-users@) to use SALI:

  • Drop the 2 SALI files into the TFTP directory (normally /var/lib/tftpboot/ or /tftpboot/).
  • Specify the SALI files in /var/lib/tftpboot/pxelinux.cfg/default or equivalent.
  • Add a couple lines to /etc/dhcpd.conf.
  • Set SCRIPTNAME= in pxelinux.cfg/default.
  • In the script created by SI:
    • Change DISK_SIZE entries to "DISK_SIZE=$(get_disksize $DISK0)".
    • Remove -v1 from mkswap arguments.
    • Add -I 128 to mke2fs for the /boot FS.
    • Remove "-o defaults" from mount commands.
    • SystemImager's final line in the script is "shutdown -r now", which fails on SALI. Use reboot until SALI 1.3, which should support shutdown.
  • On our newer cluster, SALI does bizarre things with console redirection. I had to type into the (virtual VGA) console, while output appeared on the serial console. The serial console recognized and echoed my input, but did not execute it.
  • (Not SALI related): Make sure the scripts (normally in /var/lib/systemimager/scripts) are executable -- SI left mine non-executable for some reason.
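
The script edits above are mechanical enough to automate. Here's a minimal Python sketch, assuming the script's lines look roughly like the made-up sample below (real SystemImager scripts will differ, so check the patterns against your own script before trusting it). The function name adapt_for_sali and the boot_device parameter are my own inventions for illustration; only the replacement text comes from the steps above.

```python
import re


def adapt_for_sali(script_text, boot_device):
    """Apply the SALI compatibility edits to an SI-generated script.

    boot_device is the device string that identifies the /boot filesystem
    in the mke2fs line (an assumption: you have to look it up yourself).
    """
    out = []
    for line in script_text.splitlines():
        # Replace every DISK_SIZE assignment with the SALI helper call.
        line = re.sub(r"^(\s*)DISK_SIZE=.*$",
                      r"\1DISK_SIZE=$(get_disksize $DISK0)", line)
        # Drop -v1 from mkswap arguments.
        if re.search(r"\bmkswap\b", line):
            line = re.sub(r"\s+-v1\b", "", line)
        # Add -I 128 to the mke2fs call for the /boot filesystem only.
        if (re.search(r"\bmke2fs\b", line) and boot_device in line
                and " -I " not in line):
            line = re.sub(r"\bmke2fs\b", "mke2fs -I 128", line, count=1)
        # Remove a bare "-o defaults" from mount commands
        # (other option lists are left alone).
        if re.search(r"\bmount\b", line):
            line = re.sub(r"\s+-o\s+defaults(?=\s|$)", "", line)
        # SALI (before 1.3) cannot run shutdown, so use reboot instead.
        if line.strip() == "shutdown -r now":
            line = line.replace("shutdown -r now", "reboot")
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical sample lines, not copied from a real SI script.
sample = """\
DISK_SIZE=12345
mkswap -v1 /dev/sda2
mke2fs -q /dev/sda1
mke2fs -q /dev/sda3
mount /dev/sda3 /a -t ext3 -o defaults
shutdown -r now
"""

print(adapt_for_sali(sample, "/dev/sda1"), end="")
```

I'd diff the output against the original script before putting it on the image server, since a pattern that misses is silent.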

Tuesday, August 17 2010

Sending a Mac away

I have to send my MacBook Pro to Apple for service again, so it's time to review my list of Sensitive Data: Things to Delete and other preparation for giving up physical control of a Mac. Unfortunately last month my MacBook Pro completely died, and I didn't have a chance to do any of this. The Genius asked for my password, and I just laughed at her. She explained they'd probably replace the hard drive with a new install if they couldn't get in, and I said I'd deal with that, but suggested they just use the installer to reset the password to something they liked. As it turned out, they apparently decided not to bother -- I got the MBP back with some security settings changed, so perhaps Apple techs have a different tool that grants them access.

Before Shipment

  1. Make a backup. I use SuperDuper for these, in addition to automatic CrashPlan & Time Capsule backups.
  2. Test the backup!
  3. Remove sensitive files for all active accounts (including root if relevant):
    • ~/Library/Keychains/
    • ~/.ssh/ (except authorized_keys)
    • Password wallets (assuming you're not using something like 1Password on Dropbox)
    • Any sensitive email (location depends on client -- might be ~/Library/Mail/; I don't do this -- I have a lot of mail, and it's not generally sensitive)
  4. Log out of any sensitive services, such as Dropbox.
  5. Unauthorize iTunes.
  6. For each browser/user: clear history, cookies, & cache.
  7. Create an apple user, and make it an administrator. Give it a simple password (don't forget to write it on a note for the tech -- you don't want to wait a couple extra days while they ask for the password!).
  8. Set autologin for the apple account.
  9. Change any passwords, if worried Apple might decrypt them (don't forget sudo passwd root).

After Return

If the motherboard has changed, the serial number & MACs will change.

  1. Reverse all the above.
  2. Re-enable MobileMe sync.
  3. Update any static DHCP assignments.
  4. Re-pair Remote.app
  5. Re-pair Bluetooth or anything else confused by the MAC changes.

Friday, July 30 2010

amavisd-new hates yum -- solution: RPMForge

Today I patched www.reppep.com, and it broke email once again. As on several previous occasions, perl modules were broken, amavisd-new was throwing misleading errors on startup, and I had to reinstall Scalar-List-Utils to get rid of complaints about Compress::Zlib.

This time, however, I decided to upgrade amavisd-new in hopes the new version would be smarter about the (bogus) perl module complaints at startup. I also tried using yum to install some of the perl module dependencies, which entailed reinstalling spamassassin. Alas, amavisd-new-2.6.4 is no smarter, but either amavisd-new or spamassassin introduced a new dependency on Mail::DKIM, which requires the Crypt-OpenSSL-Random perl module. I tried getting them through cpan, but it kept choking -- apparently Crypt-OpenSSL-Random requires the openssl-devel RPM on CentOS, but isn't smart enough to throw a clear error demanding it.

I never did figure out where Mail::DKIM was enabled, or how to disable it, but I seem to have found a much better solution.

amavisd-new is not in the base RHEL (or CentOS) repositories, so the CentOS wiki recommends getting it from RPMForge. This turned out to be pleasantly simple, and should prevent yum from breaking it in the future. Here's hoping, anyway!

Wednesday, July 7 2010

TV is much different for kids in the post-DVR age

We limit Julia to one or two "half-hour" shows per day. They're nominally 30 minutes, but really about 22 each after ads. When I was growing up, ads were annoying interruptions in the shows we wanted to watch, although inasmuch as they were effective they did make us want to BUY.

Julia's never watched TV without a DVR, though. Ads are still louder and as attractive as the advertisers can make them, and the interruption annoyance is minimized by the DVR's fast forward. Julia tends to choose to watch them, as well as the bumpers at beginning and end, because she doesn't want to give up any of her limited TV-watching time.

Today Julia realized that if she skipped the ads, and watched for 30 minutes, she could see extra TV (more than one 22-minute episode). I'm somewhat unhappy because this means she'll be watching more TV, but more pleased because she figured out how to maximize her time, and as a bonus she'll watch fewer ads (she was watching about 30 minutes per episode already, and I'd much rather Julia see a third of another show than watch 8 minutes of ads). Now we just need to ensure she is able to stop at 30 minutes.

Tuesday, June 29 2010

The NYC Taxi & Limousine Commission is selling us to advertisers

Last week and this week I took taxis. For a few years now, NYC taxis have had screens aimed at the back seats. After an annoying advertisement for the TLC, the OFF button shows up.

But now after tapping OFF, rather than just showing a relatively static ad for the people who make the screens, the screen flips back 'ON' to show TV-style advertising. This week I got to watch a stabbing and an ad for a fight channel. This means I cannot even sit normally in the cab without seeing advertising. Even if I sit at an uncomfortable angle to avoid staring at the screen, the sound is still on.

This makes me want to keep our daughter out of taxis -- and myself too!

Here's the complaint I sent the TLC Commissioner (they direct all complaints to the 311 switchboard, which doesn't have access to accept complaints about the TLC itself).

The seat-back monitors did not shut off before, and now they turn themselves back on after OFF is touched.

It's incredibly aggravating to be a captive audience for advertising. This is worse because the sound turns on. Today I had to watch one black man stab another, and see ads for a fighting channel. Last week, despite looking away from the screen, I was still stuck listening to ads.

This is awful.

Sunday, June 27 2010

The iPhone 4 Camera

Some time ago -- probably back when my iPhone 3G was the hot 'new' phone -- I was annoyed that my trusty Canon SD800IS didn't have GPS for photo tagging. Accurate dates on photos are very useful, as I am painfully reminded every time I combine photos from multiple cameras of the same event (DST often varies, but camera clocks just aren't very accurate). GPS coordinates are less important but also quite handy.

I realized, however, that rather than my next dedicated pocket camera having GPS, more likely the iPhone would eventually have a competitive camera, and the iPhone of course already has decent GPS and accurate time (generally from the cellular network).

The iPhone 3GS camera was much improved, but still completely inadequate for me. Lots of people, including Amy, use cellphone cameras exclusively (she's about to move up from my old 3G to my now-old 3GS), but they tend to look at the photos on other phones, and/or post them at sites like Facebook that never show high-res photos anyway.

For me, the only real problems with the 3GS were its camera and its battery life. The battery was normally okay, but recently I'd get to work in the morning, after watching videos for about an hour, and find the battery down to 60% charge. I expect to get it replaced under AppleCare this week, and think Amy will be happy with it. But the camera was useless except for Twitter, and the very rare occasions when I needed a camera but didn't have my SD800IS.

So I was very happy to see Derek Powazek say Apple had done a good job with the iPhone 4 camera.

I took some comparison shots Saturday night in low light, and I'm happy with the iPhone's pictures. I'll probably keep carrying the SD800IS on weekends, since I already have it and the habit -- and it does take somewhat better photos, with more room for cropping. But now pictures I post to Twitter will be decent, and I'll be more inclined to take pictures during the week without my Canon. In the future, when the SD800IS dies, I won't need to replace it.

The SD800IS still has advantages over my T1i for video, though -- it's much more forgiving for focus, and AF on the T1i is loud, which is not a factor on the SD800IS.

Friday, June 25 2010

Femme Totale with Molly Does Not Approve

Tonight Molly Does Not Approve hosted Femme Totale, inviting several dancers to perform to their original music. After throwing out more than half my photos (pole dancing lighting is unfriendly to photography) and trimming down 10gb of video, here's what remains.

Photos

This show made clear that Festivus is the pole dancer's holiday. We saw a celebration, with Feats of Strength, all orchestrated around Festivus Poles. Fortunately Molly managed to transmute the Airing of Grievances into a song, "Stop Stealing My Shit" -- much more fun than watching Frank Costanza lay into George.

Photographic lesson of the evening: use a 'fast' lens -- give up composing in favor of more photons. Second lesson: don't take stills while filming -- the video freezes are too disruptive.

Friday, June 11 2010

The Sun Also Sets

Update 2010/07/20: We had another failure; after replacing several components the system now appears fine again. The new lesson learned is that apparently the ILOM updates the "System Information: Components" inventory during boot. If the system won't boot or hasn't booted yet, ILOM just shows old (incorrect) information. Additionally, ILOM power readings are unreliable. The old ILOM didn't show any power consumption when the system was running, and the new ILOM (with latest firmware) looks different but still doesn't show 'Actual Power'.

ILOM Power Consumption: Actual Power


We have had serious hardware and service problems with a Sun server recently. Unfortunately, while the hardware problems can be chalked up to incredibly bad luck, other problems indicate serious corporate and support flaws at Oracle.

Prologue

We bought a new high-end server (X4540) a year ago, and a 2-hour onsite service contract. After installing it, we discovered the system only saw half its RAM. I called Sun, and they sent out a Field Engineer with a new motherboard. Unfortunately the replacement motherboard didn't work. After 3 days of parts replacement -- a second replacement motherboard, some RAM, and a replacement CPU -- they were unable to get either replacement motherboard to boot. They did eventually get the original motherboard to see all the RAM, though, so we resumed using it.

A little while later, the server became inaccessible. A reboot cleared the problem temporarily, and we discovered the problem was a bad patch breaking Sun's ipf firewall. After a couple weeks of requesting the fix (as a patch), I removed the bad patch and the firewall worked again.

April

In April, this X4540 lost a disk, which should have resulted in an automatic ZFS rebuild onto a hot spare. It did not, and the filesystem problems cascaded to disable about 20 dependent systems. I called support Thursday night, asking why the hot spares had not been utilized, and was told the problem was almost certainly a bad disk coupled with a bad disk controller on the motherboard.

Friday morning, an FE brought a new motherboard; unfortunately it didn't work. He got another motherboard and CPU, but the system still wouldn't boot. The daytime phone rep didn't know what was going on, so he escalated to another phone rep who told me (condescendingly) that he knew a lot about the X4500 & X4540 hardware, but it turned out he didn't actually know the basic component configuration. This third phone rep insisted that a bad CPU was causing all our problems, including phantom DIMMs reported in empty slots, etc. He also insisted we needed new DIMMs, a new CPU, and a couple more disks. The whole process -- mostly waiting on hold -- took long enough to kill my phone's battery.

Saturday I met another FE back at the machine room to pick up the parts, which were due by 9am. We got some of them by 10am, but others didn't arrive until later. We resumed the parts replacement dance, and again spent several hours on the phone (I brought my charger this time!), fortunately with a different phone rep (#4, for those of you counting along at home). This gent noticed that the system reported 0V coming into the motherboard, and lots of other voltages were off. At the end of the day, we agreed we needed a new chassis, as the X4540 routes power through the power supplies, into the power distribution board, through the chassis, and into the motherboard (so a fault in any major component can screw up CPU power input, and thus everything). The chassis was the only component we hadn't replaced yet. The phone reps, however, explained that the chassis could not be on-site until Monday morning. So much for our 2-hour SLA -- our Regional Service Manager explained it means an FE will be on-site within 2 hours, but they make no commitment at all on parts delivery. I asked the support reps how we could replace our lemon (at this time, refusing to boot from their fifth motherboard) which they were unable to repair, and was told the service organization could not authorize a replacement. So I called our sales rep, who referred the question back to a counterpart in the service arm.

Sunday nothing happened -- they were unable to provide a replacement chassis.

Monday morning, the second FE and I met at the machine room with a Senior System Engineer to assist and supervise. At this point (including the earlier RAM problems), we had had a complete failure to handle RAID recovery, 4 'bad' motherboards, 3 'bad' drives, and 2 'bad' CPUs. They were escalating internally, and the chassis was due at 9am. At 10am, the FE called the distribution center to ask where the chassis was, and was told it was 'almost there'. At about 11:45 the courier arrived, bringing a few small components, but not the required chassis. Someone in the warehouse had sent the wrong box. The courier explained that it would take 60-90 minutes to get us the chassis, because it took him that long to drive to the machine room from the warehouse -- meaning he left after 10am. So not only did the warehouse send the wrong part, but they sent it after the delivery time; when they told us at 10am that the delivery was nearby, it hadn't even left yet. More calls, and someone explained that the chassis was not available -- they would have to send one from Boston, and it couldn't arrive Monday at all.

Tuesday they sent back an FE with 2 SSEs and the chassis, and the system came up. This ended the outage that had disabled 20 machines since Thursday night.

May/June

A month later, we received some disk alerts, apparently because we were supposed to mark the ZFS pool as repaired, but we were unaware of this and Sun didn't tell us about it.

On the next reboot, Solaris started logging errors that both the boot disks were offline (while running from these same disks). Eventually I was told that this was due to a bug in the kernel patch, which I backed out.

After rebooting we started seeing errors from another disk. When I asked the Sun case owner how to fail over to the hot spare until we could physically swap it out, he eventually sent me an unhelpful snippet from the manual page. Our SSE actually sent me a separate document with the correct command.

New policy: no more Solaris patching. Between this bug and the patch which broke networking, clearly Sun no longer tests patches adequately, and we cannot trust them.

After the disk replacement, the system once more sees only 32gb. We are ordering a replacement storage system from another company, and will avoid breathing on this X4540 until we can migrate off it. It's clearly not trustworthy, and Sun is clearly incapable of supporting it.

Recap

Over two incidents I spent about 6 days at the machine room, well over 20 hours on the phone (much of it on hold), and watched Sun replace 4 motherboards, at least 2 CPUs, several RAM sticks (although they never just sent a full set of 16 4gb DIMMs), a PDB, and a chassis. These are all the components of an X4540. The chassis should have been replaced Friday or Saturday, but only arrived Tuesday.

Lesson

On this one system, we have seen multiple failures of multiple different types.

  • Undiagnosed failure (apparently in the chassis), which prevented 4 motherboards from working.
  • SATA controller failure (the first I've ever heard of).
  • Automatic ZFS hot spares didn't fail over.
  • A 'backline' phone tech was completely wrong, and obnoxious.
  • Warehouse staff failed to send the right part, failed to deliver parts on time on all 3 days, and lied about courier/delivery status.
  • Warehouse stocking is inadequate -- it took us 3 days to get a part.
  • Support escalation was a complete failure. It took about 3 weeks before I got any response from management other than "I'll get back to you."
  • In less than 18 months, this system has experienced 2 major hardware incidents, encompassing over a week of downtime. ZFS hot sparing has not yet worked, but has instead failed twice.
  • We have twice installed recommended patches with serious flaws, once making the system entirely unusable.
  • We have had entirely too many problems appear after reboots. Perhaps there is a disk scanning process that is automatically started after rebooting, but the result is that we do not trust this machine, and are afraid to reboot it.

Oracle's support is a mess. I feel like an idiot for buying this system.

Check contract SLAs carefully. I believed that this support level included parts availability within 4 hours (EMC, at least, used to make a big deal out of their 4-hour parts availability in NYC, for instance), but Sun makes no commitment for timeliness of parts replacement.

Tuesday, June 8 2010

Brooklyn Blogfest 2010

I went to the 5th Annual Brooklyn Blogfest. Aside from the annoying cheerleading about how Brooklyn is the 'bloggiest', it was pretty good.

  • Best: Smartmom telling Spike Lee (declared non-blogger) what to do, and Spike shutting her down (repeatedly).
  • Second best: Marty Markowitz ambushing Spike, who was clearly underwhelmed.
  • Saddest: Spike explaining (twice) that he cannot live in Brooklyn, because people won't leave his family alone.
  • Oddest: The whole thing was sponsored by Absolut, pushing their new Absolut Brooklyn, co-designed by Spike Lee.
  • Oddest coincidence: Sitting in front of Andrea Bernstein (who sat in front of us last night) before she went up to run the panel -- on how and why Brooklyn is the best place to live & blog.

Met in line: @foodculturist

They showed a lovely photo montage assembled by Adrian Kinloch, which unfortunately hasn't been posted yet. Some of the photos were from Visual Stenographers, one of whom was a panelist.

I joined the "Eclectic" BoF, because there was no computer/nerd group. Fortunately the group was pretty interesting. Next year's slogan for the group: "Meta is Bettah!"

The program claimed Dave Winer would run a BoF(?) called "NYC blog directory" in the same tiny room as my group, but I didn't see him.

Friday, May 28 2010

Control arrows in Mac bash

I've been annoyed for some time that the extremely handy bash keyboard shortcuts Control-left arrow and Control-right arrow, which move by word in Linux, don't work in Mac OS X. Today I finally got aggravated enough to do some googling, and pieced together the answer.

First, bash normally defines both Control-left arrow and Esc,B as move left one word; likewise both Control-right arrow and Esc,F are defined as move right one word.

The fix is simply to tell Terminal to send Esc and then b when Control-left arrow is typed, and Esc then f for Control-right arrow. I could probably figure out what "[5D" means in Terminal's preferences and configure bash on my Macs to jump by word on that input, but this way I only have to configure 2 Macs, and it works on remote Solaris boxes as well.

Note that bash considers / to be a word delimiter, so these move through paths by directory.
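
To make the path behavior concrete, here's a small Python sketch of how I understand readline's word motion: a "word" is a run of letters and digits, so / (like a space) is just a gap between words. The function names mirror readline's backward-word and forward-word commands, but this is my own approximation for illustration, not bash's actual code.

```python
def forward_word(line, pos):
    """Return the cursor position after moving right one word (Esc, F)."""
    n = len(line)
    # Skip any delimiters (spaces, slashes, ...) under the cursor.
    while pos < n and not line[pos].isalnum():
        pos += 1
    # Then run to the end of the word.
    while pos < n and line[pos].isalnum():
        pos += 1
    return pos


def backward_word(line, pos):
    """Return the cursor position after moving left one word (Esc, B)."""
    # Skip any delimiters immediately left of the cursor.
    while pos > 0 and not line[pos - 1].isalnum():
        pos -= 1
    # Then run back to the start of the word.
    while pos > 0 and line[pos - 1].isalnum():
        pos -= 1
    return pos


# Walk left through a path, starting from the end of the line.
line = "cd /var/lib/tftpboot"
pos = len(line)
stops = []
while pos > 0:
    pos = backward_word(line, pos)
    stops.append(line[pos:])
print(stops)
```

Starting from the end of that line, the cursor stops at the start of tftpboot, then lib, then var, and finally cd, which is the directory-by-directory movement described above.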


Before

Terminal preferences: before

The change

Terminal preferences: changing shortcut

After

Terminal preferences: after

Monday, May 17 2010

iPad & iPhone apps

Friends & relatives have lately asked what apps I use on the iPad & iPhone. Here are my current apps and a few notes. See also February 2009 & August 2008.

Screenshots: pPad apps, pages 1-5

  1. Apple apps, assorted (mostly built-in)
  2. Video & electronic books/readers
  3. Miscellany
  4. Games & toys
  5. Drawing & game overflow

iPad apps are fewer and simpler. The three of us share the iPad, so I don't keep personal apps like 1Password & Dropbox on it.

I have a simple scheme for organizing apps. Paid apps tend to go at the top of the screen, as they are often more functional than the free "Lite" versions, and they were important enough to pay for. I haven't tried all of these yet, but thought I might. A few (NPR & NYT Editors' Choice, for example) I don't personally use, but show off when people ask about the iPad.

Note that on both iPad & iPhone, the same cluster appears: Tweetie/Twitterrific, NetNewsWire, Instapaper, & Kindle (sorted from most timely content to most long-running). These are the four apps I use most, although the ebook reader changes as I read books in different formats. I am looking forward to comparing iBooks on iPhone to Kindle.app. On the iPad, I spend most of my time in Dock apps; the remainder is mostly in page 1 & 2 apps, with 4 & 5 largely for Julia, although Amy and I do play games.


Screenshots: cPhone apps, pages 1-7

I've had an iPhone for 3 years, and there are still considerably more iPhone apps than iPad apps. Each iPad page shows up to 20 apps + 6 in the Dock (we have 82 installed), while the iPhone shows 16 + 4 in the Dock (I have 121 installed). On the iPhone, I spend most of my time on page 1, most of the rest on page 2, and fairly little on the other pages.

Facebook is annoying -- I rarely run it, but when I do there's often some message that activates the message counter. It's obnoxious that I cannot turn it off, but successful inasmuch as it prompts me to go back in and attempt to clear that pending message (which doesn't necessarily work -- apparently a Facebook bug).

Monday, May 3 2010

The iPad "pPad" cannot be synced. An unknown error occurred (-50).

Tonight I plugged in my iPad, and got this error:

The iPad "pPad" cannot be synced. An unknown error occurred (-50).

Error -50

  • I tried a different USB port and a different cable, but no luck.
  • Interestingly, my iPhone was fine -- no complaints.
  • Eventually, after a bunch of Google herring, I found a couple people who got this error while trying to synch photos to an iDevice.
  • I turned off all Photo synching, and the error went away.
  • Then I re-enabled Photo synch, and enabled my photo albums one by one until I found the problematic album.
  • Unfortunately it contained 584 photos, but being able to cause or avoid the problem meant I was going to fix it.
  • I created 2 new albums, A & B.
  • I copied all photos from the problematic album to A, leaving B empty.
  • I left the real (problematic) album deselected, and enabled synching for B.
  • iTunes was happy.
  • I added photos, one event or fraction of an event at a time, from A to B, and deleted them from A.
  • Eventually iTunes complained -50 again, and I started moving photos from B back to A until it was able to synch again.
  • I now knew which photo was tripping up iTunes. I don't see anything special about it -- it's just some flowers from my SD800 -- but it's here in case someone can figure out why iTunes/iPad is allergic to it.

troublesome flowers
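
In hindsight, adding photos a batch at a time was slower than it needed to be. Below is a minimal Python sketch of a bisection variant (not what I actually did), assuming exactly one photo is bad and that a sync fails if and only if the bad photo is in the album. The names find_bad_item and sync_ok are hypothetical; in real life sync_ok means "put these photos in album B, sync, and see whether iTunes complains."

```python
def find_bad_item(items, sync_ok):
    """Locate the single item that makes sync_ok(subset) return False."""
    candidates = list(items)
    if not candidates:
        raise ValueError("no items to search")
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # If the first half syncs cleanly, the culprit is in the second.
        candidates = second if sync_ok(first) else first
    return candidates[0]


# Simulated album of 584 photos with one (made-up) troublemaker.
photos = [f"IMG_{i:04d}.jpg" for i in range(584)]
bad = photos[417]
attempts = 0


def sync_ok(subset):
    """Stand-in for a real sync attempt; counts how many we needed."""
    global attempts
    attempts += 1
    return bad not in subset


print(find_bad_item(photos, sync_ok), "found after", attempts, "syncs")
```

With 584 photos this takes at most 10 sync attempts, since each attempt halves the list of suspects.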

Tuesday, March 30 2010

Dell DRAC: Technical and Support Limitations

Update 2010/04/23: Our keyboard deafness problems (both via USB and over DRAC) were apparently due to problems with the DRAC card itself. They remained after a downgrade to v1.4.5, but went away with replacement of the DRAC card.

I have spent a considerable amount of time trying to fix a Dell PowerEdge R900 server since it started reporting vague but serious memory errors last week. Since it's quite awkward to physically visit, I have been making extensive use of DRAC 5 (Dell Remote Access Card). Unfortunately, DRAC has a slew of poorly documented and poorly understood restrictions, which vary between hardware (DRAC 5) and software (v1.4.5/v1.5.1) versions.

Dell recommends Windows and supports certain flavors of Linux, but the farther you get away from the ActiveX control for IE under Windows, the more difficult things become. Since we use CentOS (unsupported) and don't have any Windows hosts on the private network with the DRAC interfaces, we have to jump through a few hoops to connect.

The compatibilities are complicated enough that I put together a table of what the different DRAC versions and components require (these are the issues I've encountered -- there are probably more in the release notes, and undocumented), with workarounds where available:

DRAC Compatibility

DRAC v1.4.5
  • Firefox: Incompatible with 64-bit Firefox. [1]
  • Firefox: Serves ActiveX plugins to Firefox/Linux & Firefox/Mac. [2]
  • Keyboard: Input is garbled with Mac keyboard (even running Firefox on Linux via X11 from Mac). [3]

DRAC v1.5.1
  • Java: 32-bit Sun JRE 1.6.0-11 or earlier (-18 is current).
  • Keyboard: Remote & USB keyboards are both unreliable (may be a local problem, rather than DRAC). [4]
  • Stability: Login errors.
  • Stability: Plugins crash more often than run successfully.

vmcli v1.5.1
  • Security: Requires password on command line. [5]

Note .jnlp files do not launch javaws by default (not Dell's fault). [6]

Notes:
  1. Workaround: rpm -e --allmatches firefox; yum install firefox.i386
  2. By default, DRAC attempts to serve up the ActiveX controls to Firefox/Linux. The workaround is to manually specify 'Java' rather than 'Native' for VKVM & VM.
  3. Direct login from a Mac browser shows the screen properly but typing produces the wrong characters. X11 forwarding from a Mac through a Linux system is the same. Workarounds include running Firefox in a Linux VM and tunneling X11 through ssh, or connecting to Firefox running on the Linux host via VNC.
  4. I originally suspected this was an unrelated problem that coincidentally appeared after upgrading to DRAC 5 v1.5.1. It was apparently the fault of a bad DRAC card.
  5. Partial workaround: precede vmcli with a space, to avoid recording the DRAC password in bash history (this only works if HISTCONTROL includes ignorespace or ignoreboth). This does not help with ps sniffing, though.
  6. Workaround: specify a suitable javaws (such as /usr/java/default/javaws/javaws) from a compatible 32-bit Sun JRE. Additionally, saving .jnlp files for later use without redownloading in Firefox does not work. We do this on Solaris ILOM, as it avoids having to launch Firefox, login, and redownload a fresh .jnlp file for every connection.

Unfortunately, this all is poorly supported and understood by Dell. Phone techs can do 3 things: tell you to update (BIOS, DRAC, etc.), provide update URLs, and help work through the release notes for incompatibilities. Anything else is hit-and-miss. Specifically, mention of "CentOS Linux" tends to end support; X11 forwarding causes confusion and generally exceeds their knowledge and ability to assist. There are back-end folks they can escalate to, but I never managed to escalate successfully -- I had to give up after a half-hour on hold. But I did get (via a painfully long and circuitous route through our sales team) support from a helpful gentleman on the OpenManage team, who didn't have answers but was helpful in verifying problems.

To me, the moral is that Dell doesn't do software well (I haven't used their other software, so I don't know if this really generalizes, but I'm sticking to it unless I see a counterexample). It's such a small part of their business that it's not really surprising, but still a problem when you need a piece of Dell software to work, or to figure out what's wrong.


See also http://www.dell.com/downloads/global/solutions/DellRemoteAccessController5Security.Pdf.

Sunday, March 21 2010

Ubuntu: Java in Firefox

Julia's netbook runs Ubuntu Netbook Remix 9.04 (UNR's 9.10 installation process has changed, and doesn't work as well). Her teacher recently recommended http://arcytech.org/java/, which offers a bunch of educational games. Unfortunately, UNR doesn't include Sun Java, and getting it running was non-trivial.

The short version: I needed to enable the multiverse & universe repositories in synaptic, and then "apt-get install sun-java6-plugin". I initially installed just the JRE, but Firefox needs the plugin too. I also put the JVM's path in /etc/jvm, but I'm not sure if that mattered.

Interestingly, I had a similar problem at work last week -- CentOS 5 systems naturally don't come with Sun Java, but installing the JVM was easy. For both CentOS & Ubuntu, most of the documentation on installing Java (including Sun's) stops at getting the JVM installed, and neglects the Firefox plugin. On CentOS, I just dropped a symlink to /usr/java/default/... into the right directory under /usr/lib/mozilla.

Be sure you install for the correct version of Firefox (some of our systems have bits of several different versions); if you're not sure, link the plugin into ~/.mozilla.

Friday, February 26 2010

Grid Engine foolishness

So I'm trying to set up an SGE head for testing with Amazon Web Services. I had previously groused about problems with their installer, and got a couple responses from Daniel Templeton, but 2 versions later it's still stupid.

Today's issue: the RPM installs in /gridware/sge (fine), but the installer doesn't work unless you put the SGE software in the desired directory before running the installer. This is not the way to do it, guys! So I'd install the RPM, move all the files it just installed into the desired destination directory, run the installer, then move the original files back to /gridware/sge (hopefully without disturbing the actually installed version), and then remove the RPMs, I guess, if they don't do anything useful. Okay, I give up -- I accept that Sun's SGE RPMs are worthless. I give in -- I'll install from source. Oh, and cleaning up I noticed that the RPMs install unclaimed files, so deleting the RPMs leaves cruft behind. The person who built the RPMs must really hate their job.

And... the source tarballs are tarred, gzipped, and zipped. Who was the genius behind this? We can hope that Oracle will kill whatever insane Sun website policy required such redundant packaging, but I'm not holding my breath -- more hoping that SGE continues to exist after Oracle notices it.
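
For what it's worth, the double wrapping can be undone in one pass. This is only a sketch: the function name and paths are made up for illustration, and it assumes the download is a .zip holding one or more .tar.gz (or .tgz) files, as described above.

```python
import io
import tarfile
import zipfile
from pathlib import Path


def unpack_zipped_tarball(zip_path, dest_dir):
    """Extract every .tar.gz/.tgz member of a .zip into dest_dir.

    Hypothetical helper, not part of SGE. Returns the names of the
    extracted tar members.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as outer:
        for name in outer.namelist():
            if not name.endswith((".tar.gz", ".tgz")):
                continue  # skip anything that is not a gzipped tarball
            # Read the inner tarball into memory; fine for a sketch,
            # but a very large tarball should go to disk first.
            inner_bytes = io.BytesIO(outer.read(name))
            with tarfile.open(fileobj=inner_bytes, mode="r:gz") as inner:
                inner.extractall(dest)
                extracted.extend(inner.getnames())
    return extracted
```

Only unpack archives you trust this way, since extractall() writes whatever paths the tarball contains.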

Tuesday, February 23 2010

Saga: Dell Memory Diagnostics

We have a couple of Dell R900s: 4 sockets, 24 Xeon cores, & 128gb RAM. One of them started reporting RAM & processor errors in December, so I called Dell. The rep explained it might be spurious, due to a BIOS bug. Not that there was any known issue, but Dell naturally hoped I could fix the problem with a software upgrade, so they wouldn't need to replace any hardware. I upgraded the BIOS, and it shut up for a couple months.

Last week the front panel went amber again, and the System Event Log started recording RAM errors in one memory board (the system has 4 boards, each with 8 DIMM slots: a total of 32 4gb DIMMs).

Non-critical    02/17/2010 14:58:11 Mem CRC Err: Memory sensor, transition to non-critical from OK ( Memory Board D ) was asserted
Unknown 02/17/2010 11:47:03 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:47:02 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Non-Recoverable 02/17/2010 11:47:02 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:47:02 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:47:01 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Non-Recoverable 02/17/2010 11:47:01 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:47:01 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:47:00 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:47:00 I/O Fatal Err: Unknown sensor, unknown event
Unknown 02/17/2010 11:47:00 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:47:00 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:47:00 I/O Fatal Err: Unknown sensor, unknown event
Unknown 02/17/2010 11:46:59 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:46:59 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:46:59 I/O Fatal Err: Unknown sensor, unknown event
Unknown 02/17/2010 11:46:59 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:46:58 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
OK  02/17/2010 11:50:02 CPU3 Status: Processor sensor for CPU3, IERR was deasserted
OK  02/17/2010 11:50:02 CPU2 Status: Processor sensor for CPU2, IERR was deasserted
Critical    02/17/2010 11:49:47 CPU3 Status: Processor sensor for CPU3, IERR was asserted
Critical    02/17/2010 11:49:47 CPU2 Status: Processor sensor for CPU2, IERR was asserted
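
A log like this is easier to digest as a tally. Here is a rough Python sketch, assuming each SEL line is "severity, date, time, description" separated by whitespace. That is my reading of the export above, not a documented Dell format, and the function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: "<severity> <MM/DD/YYYY HH:MM:SS> <description>"
SEL_LINE = re.compile(
    r"^(?P<severity>\S+)\s+"
    r"(?P<when>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<text>.+)$"
)
# Memory errors name the board in parentheses, e.g. "( Memory Board D )"
BOARD = re.compile(r"\(\s*Memory Board (?P<board>[A-Z])\s*\)")


def parse_sel(text):
    """Return (severity, timestamp, description) tuples for matching lines."""
    events = []
    for line in text.splitlines():
        m = SEL_LINE.match(line.strip())
        if not m:
            continue  # ignore blank or unrecognized lines
        when = datetime.strptime(m.group("when"), "%m/%d/%Y %H:%M:%S")
        events.append((m.group("severity"), when, m.group("text")))
    return events


def summarize(events):
    """Count events per severity and per memory board mentioned."""
    by_severity = Counter(sev for sev, _, _ in events)
    by_board = Counter()
    for _, _, text in events:
        found = BOARD.search(text)
        if found:
            by_board[found.group("board")] += 1
    return by_severity, by_board
```

Feeding it the saved log text, e.g. summarize(parse_sel(open("sel.txt").read())) with a hypothetical sel.txt, shows at a glance which board the memory errors point at and how many non-recoverable CPU events piled up.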

I called Dell, and was told I'd need to run "Dell 32 Bit Diags" to isolate the bad component. Unfortunately it's only available as a Windows self-extracting executable, which can generate a floppy .img file or a CD-ROM .iso file; Dell's tool can also copy the diagnostics to a flash drive. I hate that Dell both assumes that everybody runs Windows, and helps ensure that by requiring Windows to manage Dell machines. Fortunately I have an XP VM.

So I swapped the suspect memory board from slot D into slot C and ran the diagnostics. I was told to erase the SEL and run the included mpmemory.exe. It was supposed to take half an hour, but actually took about 2 1/2 hours for each run. Additionally, the diagnostics showed an unclear warning that the (DOS-based) diagnostics are not compatible with console redirection (presumably because these hosts have serial consoles configured). Fortunately we bought DRAC for this machine, and that seems to work fine.

To boot into the diagnostics, I checked the "Boot Order" section of the R900 BIOS. Surprisingly, although it does show VIRTUAL FLASH, I was unable to find a USB FLASH entry. For some reason Dell configures USB flash as a virtual hard drive, so I had to change the "Hard Disk Boot Order" to prefer flash to the RAID controller -- this got me a DOS-based menu and let me run mpmemory.exe.

Disturbingly, Dell's memory diagnostic triggered the memory error but was not able to detect it. mpmemory returned a clean bill of health, but the SEL recorded errors on memory board C (the suspect card in a different slot, so the motherboard itself is fine).

Non-critical    02/23/2010 20:56:56 Mem CRC Err: Memory sensor, transition to non-critical from OK ( Memory Board C ) was asserted
Non-critical    02/23/2010 15:27:27 Mem CRC Err: Memory sensor, transition to non-critical from OK ( Memory Board C ) was asserted
Non-critical    02/23/2010 15:27:27 Mem CRC Err: Memory sensor, transition to non-critical from OK ( Memory Board C ) was asserted

Unfortunately the diagnostics failed to isolate an individual DIMM, and I don't have the time to keep reconfiguring the RAM (across all 4 memory cards, which apparently need to match each other) to do a binary search, running 150 minutes per test, to isolate the faulty DIMM or slot -- worse, I'd have to visit the server and reconfigure it after each round. Fortunately, Dell acknowledged the absurdity of running 5+ hours of tests (it could easily have taken over 20 hours to find the right DIMM). They sent a new card with 8 DIMMs (2 types, at least one refurbished). I swapped the replacement parts in and reran the test, which failed. Apparently nobody at Dell had ever seen this particular error (generated by a Dell proprietary diagnostic) -- not comforting. I ran it again and got a complete lockup -- this time apparently a common occurrence on Dell multi-processor systems. It turned out I had been given an old version of the diagnostics.

I got the new version, ran it twice, and saw no further errors in the SEL. Hopefully I won't have to think about that R900 for a while, but diagnosing it is so awkward -- it looks like the 256gb max configuration would take 5 hours for each pass!
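
The numbers in this saga are easy to sanity-check. The sketch below is back-of-the-envelope arithmetic only, under assumptions of mine: every test round costs one full 2.5-hour pass, the fault is a single DIMM on the suspect 8-slot board, and pass time scales linearly with installed RAM. The helper names are hypothetical.

```python
import math

BOARDS = 4            # memory boards in the R900
SLOTS_PER_BOARD = 8   # DIMM slots per board
DIMM_GB = 4           # size of each DIMM
PASS_HOURS = 2.5      # observed time for one mpmemory run on this machine

total_gb = BOARDS * SLOTS_PER_BOARD * DIMM_GB


def bisect_rounds(candidates):
    """Rounds a binary search needs to single out one bad part."""
    return math.ceil(math.log2(candidates))


def hours(rounds, per_pass=PASS_HOURS):
    """Total test time if every round costs one full pass."""
    return rounds * per_pass


def pass_hours(ram_gb, measured_gb=total_gb, measured_hours=PASS_HOURS):
    """Estimated time per pass, assuming it scales linearly with RAM."""
    return measured_hours * ram_gb / measured_gb


print(total_gb)                                 # 128
print(hours(bisect_rounds(SLOTS_PER_BOARD)))    # 7.5  (binary search, 3 rounds)
print(hours(SLOTS_PER_BOARD))                   # 20.0 (one DIMM at a time)
print(pass_hours(256))                          # 5.0
```

Even the optimistic binary search is most of a working day, before counting trips to the server room. Testing the 8 DIMMs one at a time is one way to arrive at a figure like 20 hours.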

Friday, February 19 2010

Twitter Post Types

Several people have asked me about types of tweets recently, so here's the rundown:

  1. If the first letter of a tweet is d or D, it's a 'direct message' -- follow with space and name, like "d mscrochety Hey, babe!". This only goes to the recipient, and nobody else can see it. Screwing up DMs, and making unintentionally public comments, has a storied history already. If you try to DM someone who doesn't follow you, you'll get an error.

  2. '@ mentions': Any time a username appears in a (public) tweet after an @, that user sees the message, even if they don't normally follow the poster.

  3. '@ messages' or '@ replies': Anything with first character @: These are public (visible on your page & in searches), but not sent to your followers unless they also follow the recipient specified after the initial @. So if I tweet "@mscrochety Hi there", my followers who don't know @mscrochety won't see it, but people who follow both of us will. If you are addressing a person but still want others to see your tweet, put an extra character such as a space before the @ to convert it to a 'mention' seen by all your followers.

  4. Public tweets: If the first character is anything but @ or d, anybody who follows the poster sees them. They also show up on the poster's feed page (e.g., http://twitter.com/reppep) and in searches.

  5. Hashtags: To make it easier for people to find your posts, you can use tags which start with the # character to identify topics. These are popular for people in a venue to collect their tweets, or for a larger meta-conversation. Hashtags should not contain spaces or punctuation, and are normally lower-case.
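
The rules above boil down to a few checks on the start of the tweet. Here is a small Python sketch of them as I've described them. It is my own illustration, not anything Twitter publishes, and the function names and regular expressions are assumptions.

```python
import re

DM = re.compile(r"^[dD] \S+")     # "d name message": d, a space, then a name
MENTION = re.compile(r"@\w+")     # @username anywhere in the text
HASHTAG = re.compile(r"#\w+")     # no spaces or punctuation inside a tag


def classify(tweet):
    """Return the most specific type for a tweet, per the rules above."""
    if DM.match(tweet):
        return "direct message"   # private, seen only by the recipient
    if tweet.startswith("@"):
        return "reply"            # public, but only pushed to mutual followers
    if MENTION.search(tweet):
        return "mention"          # public, and the named user sees it too
    return "public"


def hashtags(tweet):
    """List the hashtags in a tweet, lower-cased for easy comparison."""
    return [tag.lower() for tag in HASHTAG.findall(tweet)]


print(classify("d mscrochety Hey, babe!"))   # direct message
print(classify("@mscrochety Hi there"))      # reply
print(classify(" @mscrochety Hi there"))     # mention
print(classify("Don't panic"))               # public
```

Note that the leading space in the third example is deliberately not stripped, since that one character is what turns a reply into a mention.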

When you want to echo what someone else said, in current clients (including the twitter.com web interface), you can use the retweet button. This shares the original tweet with your followers, even if they don't follow the original poster. There's an older convention of starting with 'RT '.

For Twitter applications, including photo and video integration, see twittereye.

Wednesday, February 10 2010

RightScale / Amazon EC2 Notification Confusion

We're looking at running a Sun Grid Engine HPC cluster on Amazon Elastic Compute Cloud. Unfortunately I couldn't find a CentOS 5 / SGE Amazon Machine Image, so I have been looking at RightScale's public AMIs.

Monday at 3:02pm (EST), I booted a RightScale AMI. Apparently I forgot to shut it down, which entails a small charge from Amazon, but their pricing is low enough that it's not a big deal. The oddity, however, showed up the next morning.

Subject: New instance operational: 'i-7e1c0716'
Date: Tue, 9 Feb 2010 09:12:57 +0000
From: notifier@my.rightscale.com
To: pepper@***

Your instance i-7e1c0716 created at 2010-02-09 09:12:57 is now available. This image booted in less than a minute.

It can be reached at ec2-72-44-61-196.compute-1.amazonaws.com or it can be managed in RightScale by logging in and going to:
https://my.rightscale.com/clouds/1/ec2_instances/*******

--

notifier@my.rightscale.com
http://my.rightscale.com

Note: You are receiving this email because you are the account owner at 
RightScale.com and have registered under this address (pepper@reppep.com). 
Return to RightScale.com to edit your communication preferences.

I wondered if someone had subverted my RightScale or AWS account, but it looks like RightScale just didn't notice the instance was running until 4:12am (EST) the next day -- a bit over 13 hours after it actually booted. I haven't used RightScale enough to know if this is typical, but it's certainly odd.

pepper@teriyaki:~$ ec2-describe-instances 
[Deprecated] Xalan: org.apache.xml.res.XMLErrorResources_en_US
RESERVATION r-6e452e06  290259848388    default
INSTANCE    i-7e1c0716  ami-f8b35e91    ec2-72-44-61-196.compute-1.amazonaws.com    domU-12-31-39-0B-28-A7.compute-1.internal   running *** 0       m1.small    2010-02-08T20:00:54+0000    us-east-1a  aki-a71cf9ce    ari-a51cf9cc        monitoring-disabled 72.44.61.196    10.214.47.85            instance-store      
pepper@teriyaki:~/Sites$ ssh ec2-72-44-61-196.compute-1.amazonaws.com
Last login: Mon Feb  8 15:02:30 2010 from 140.163.***.***
     ___   _        __   __   ____            __    
    / _ \ (_)___ _ / /  / /_ / __/____ ___ _ / /___ 
   / , _// // _ `// _ \/ __/_\ \ / __// _ `// // -_)
  /_/|_|/_/ \_, //_//_/\__//___/ \__/ \_,_//_/ \__/ 
           /___/                                                 

Welcome to a public Amazon EC2 image brought to you by RightScale!

********************************************************************
********************************************************************
***       Your EC2 Instance is now operational.                  ***
***       All of the configuration has completed.                ***
***       Please check /var/log/install for details.             ***
********************************************************************
********************************************************************
[root@domU-12-31-39-0B-28-A7 ~]# w
 10:13:47 up 1 day, 19:11,  1 user,  load average: 0.00, 0.00, 0.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    140.163.***.***  10:13    0.00s  0.00s  0.00s w
[root@domU-12-31-39-0B-28-A7 ~]# date
Wed Feb 10 10:14:01 EST 2010
[root@domU-12-31-39-0B-28-A7 ~]# last
root     pts/0        140.163.***.***  Wed Feb 10 10:13   still logged in   
root     pts/0        140.163.***.***  Mon Feb  8 15:02 - 17:53  (02:50)    
reboot   system boot  2.6.21.7-2.fc8xe Mon Feb  8 15:02         (1+19:14)   

wtmp begins Mon Feb  8 15:02:22 2010
[root@domU-12-31-39-0B-28-A7 ~]# 
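
Mixing UTC and EST timestamps makes the gap easy to misjudge, so here is a quick check in Python. It uses only the two timestamps shown above: the launch time from ec2-describe-instances and the Date header of the notification. EST is modeled as a fixed UTC-5 offset, which is my assumption and holds for February.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5), "EST")  # fixed offset, no DST handling

# Launch time reported by ec2-describe-instances (UTC)
launched = datetime(2010, 2, 8, 20, 0, 54, tzinfo=timezone.utc)
# Date header of the RightScale notification email (UTC)
notified = datetime(2010, 2, 9, 9, 12, 57, tzinfo=timezone.utc)

delay = notified - launched

print(launched.astimezone(EST).strftime("%a %H:%M %Z"))  # Mon 15:00 EST
print(notified.astimezone(EST).strftime("%a %H:%M %Z"))  # Tue 04:12 EST
print(delay)                                             # 13:12:03
```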

Wednesday, February 3 2010

TidBITS: Zombie Authors Threaten Fiction Ebook Market, from the Grave!

TidBITS just published my latest article, a consideration of current and future trends in the (e)book marketplace. Adam found some cool CC pics to accompany the article (the whole 'zombie' theme was his idea too).

http://db.tidbits.com/article/10979

It's particularly timely during the iPad countdown, and after last weekend's Amazon/Macmillan standoff. Not sure if this is the most links ever in a TidBITS article, but I consider it a credible attempt.

Friday, January 22 2010

Pen v keyboard v Newton v Graffiti v Treo v iPhone

A nice comparison of data entry performance. Typing on the iPhone is okay, but I much prefer a normal USB keyboard. I never did much entry using Rosetta or Graffiti. The Treo keyboard was decent, but not great. The BlackBerry disadvantage: noisy during meetings!

That said, my worst data entry method is definitely paper -- serious readability concerns! ;)

http://hardware.slashdot.org/story/10/01/22/0812222/Pen-vs-Keyboard-vs-Touch-vs-Everything-Else
