Peru

From CSL Wiki

Quickstart - Measuring Link Quality

  • Start a ping
  • Look at the web page! It shows more useful information than the ping alone.

Pinging

  • Use broadcast pings only to find neighbors!
  • The components are already running a once-a-second ping
  • Additional unicast pings you send will be added to the measurements
  • You can use unicast pings with a large packet to determine link quality
ping -s 1400 10.0.0.112

Linkinfo Web Page

  • Connect to the web page using your browser. For example for node 103, go to:
http://10.0.0.103:8080/
  • The linkinfo table is up at the top, but click the link below the table to just see the linkinfo
  • It takes 120-180 seconds for the linkinfo to get a stable reading
    • If you send large ping packets in addition it will go a little faster
  • The columns are as follows:
    • Neighbor: the neighboring node
    • Status/Gen: status is the assumed status of the node / gen is whether we are generating probe traffic to this node
      • This value is meant as a quick idea of what is going on. It is more important to note whether it is changing and how stable it is
      • You want it to be ACTIVE ... if it jumps between STARTUP and ACTIVE or ASYMM and ACTIVE, then the link is really poor
      • The 'Gen' indicates we are currently sending probes to this node (Y means yes, N means no)
    • Last Heard: how long ago we heard any information from this node about us
      • If this value is high and repeatedly goes high, then that means the link is flaky since the two nodes can not continuously successfully communicate
    • RSSI (dBm): the signal strength of each received packet from this node. This value is a running average and running standard deviation
    • Silence (dBm): the ambient noise level before a packet was received. This value is a running average and running standard deviation
      • The silence value is important to watch during deployment. The higher it is, the worse the surrounding 'noise' is and the more difficult it will be to receive packets.
    • SNR: the rssi - silence
      • We have only recently started displaying this information so we do not have a good 'feel' for what the RSSI, Silence, and SNR values should be for a good site.
      • When setting up a site, pay attention to the rssi, silence value, and snr and how it changes. It is good to keep track of this mentally so you can gain an understanding of what good values are.
    • Rate In/Out: The data rate to and from this node. 10 -> 1Mbps, 20 -> 2 Mbps, 55 -> 5.5 Mbps, 110 -> 11 Mbps
      • The data rate can tell you a lot about whether a link is flaky or not.
      • If the link is at 1 or 2 Mbps, there is enough packet loss to force the rate down. This does not necessarily mean the link is unusable, but you should proceed with caution or do your best to adjust the antennas.
    • Conn In/Out: The packet delivery ratios as a percentage for sent and received packets
      • Combining the packet delivery ratios and the data rate can also tell you a lot about a link
      • If you see a connectivity of less than 70% in both directions (In and Out), the link is going to perform very poorly.
      • If the connectivity is high but the data rate is at 1 or 2 Mbps, there could be significant loss on the link, especially with more traffic
        • It is difficult to tell in this situation, so the best thing to do is to send some more pings by hand or readjust the antennas

The CDCC and you

Deployment checklist

  • Internal CDCC cable connections
    • Verify all of these are properly plugged in
      • Both ends of serial cable
      • Both ends of Ethernet cable
      • Power connector
      • Pigtail connector to wifi card
      • Wifi card connection to stargate and screws are holding it in place
      • CF card is present
      • Daughter card on stargate is not loose

Rebooting / Powering off the CDCC

Open box

  • Press and hold the black sideways facing button on the stargate
  • The three lights will all turn solid and you can let go of the button
  • Once the three lights turn off, wait 5 seconds and you can safely unplug the power
  • The lights will come on shortly and eventually start flashing: this indicates the stargate is booting up again
    • If you have not unplugged the power yet, you can still safely do so
  • You may have to press the white reset button to get things to boot properly if the lights do not come back on

Through the web page

  • The main CDCC web page has a shutdown/reboot button
  • Click the button to properly shutdown/reboot the node
  • Wait until the page refreshes. It should display a note which tells you to unplug the power in 5 seconds

Through a console

  • Through the serial console, ssh, or rbsh do the following
  • Type
killall duiker
  • Wait a minute until duiker can properly clean up. You can run the following a few times to see if it is still running
ps awx | grep duiker
  • Once duiker has stopped type
    • killall emrun
    • shutdown -r now
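
The "wait until duiker has stopped" step above can be scripted instead of rerunning ps by hand. This is only a minimal sketch: wait_for_exit is a hypothetical helper (not part of the CDCC software), and it assumes a ps that accepts `-e -o comm=`, which the ps on the stargate may not.

```shell
# Hypothetical helper: poll until no process named $1 is left, or give up
# after $2 seconds (default 60). Returns 0 if the process is gone, 1 on timeout.
wait_for_exit() {
    name=$1
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        # 'ps -e -o comm=' prints bare command names, so the match cannot
        # hit the grep itself (unlike 'ps awx | grep duiker').
        if ! ps -e -o comm= 2>/dev/null | grep -qx "$name"; then
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Usage mirroring the steps above (commented out so the sketch is safe to source):
# killall duiker
# wait_for_exit duiker 60 && killall emrun && shutdown -r now
```

If the helper times out, duiker is still cleaning up; wait longer rather than forcing the shutdown.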

Understanding the CDCC web page

Main Page

  • The main page shows you all the status information about the software on the CDCC
    • There are links to access the duiker page and the linkinfo page
    • There is a button to properly reboot the CDCC

Linkinfo display

  • Please see the information in the Linkinfo Page section below

Diskinfo display

Data directory /opt/data/ contains 0 files, which are lined for deletion
Filemover directory /opt/filemover/ contains 0 files, which are lined for deletion
Xfer directory /opt/xfer/ contains 0 files, which are lined for deletion
Total diskspace = 0.97GB
Free diskspace threshold = 25.00% which is 0.24GB
Free diskspace delete threshold = 5.00% which is 0.05GB
Free diskspace = 44.43% which is 0.43GB
  • The most important lines are the 'Total diskspace' and the last 'Free Diskspace' line
    • These tell you how much space the CF card has and how much of it is left
  • Other useful information is the number of files in each of the directories (see the section about the CF card structure below) and the stop thresholds
  • The 'Free diskspace threshold' is the free-space level below which this node will stop accepting data from neighboring nodes
    • This tries to ensure that the locally generated data has priority over the data from other nodes
  • The 'Free diskspace delete threshold' is the free-space level at which data on the node will be deleted to maintain the threshold
    • This makes sure that the CF card will never become full and there is always room for the newest locally generated data
    • The data is sorted by date created and the oldest is deleted. No preference is given to locally or remotely generated data: the oldest goes first
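
To make the relationship between the three percentages concrete, here is a small sketch that reads diskinfo-style text and reports which regime the node is in. disk_state and its three labels are hypothetical names, and the exact comparison at the boundary (strictly below the threshold) is an assumption, not something the diskmanager documents.

```shell
# Hypothetical helper: classify a node from diskinfo text on stdin.
disk_state() {
    awk '
        /^ *Free diskspace threshold =/        { accept = $5 + 0 }  # e.g. 25.00%
        /^ *Free diskspace delete threshold =/ { del    = $6 + 0 }  # e.g. 5.00%
        /^ *Free diskspace =/                  { free   = $4 + 0 }  # e.g. 44.43%
        END {
            if (free < del)         print "deleting-oldest"
            else if (free < accept) print "not-accepting"
            else                    print "accepting"
        }'
}

# With the display shown above (44.43% free, thresholds 25% and 5%):
printf '%s\n' \
    'Free diskspace threshold = 25.00% which is 0.24GB' \
    'Free diskspace delete threshold = 5.00% which is 0.05GB' \
    'Free diskspace = 44.43% which is 0.43GB' | disk_state
```

On a CDCC the same text could be piped in from the disk status output instead of printf.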

Timeinfo display

wlan0 - 10.0.0.7:6945
Mode:    DISK - 1197353978.186051 - Tue Dec 11 06:19:38 2007
Next disk write: 36.67
Next time recheck: 581.67
  • This displays information about the timekeeper system. For the most part you can ignore this
  • The timekeeper system attempts to make sure the stargate system time is always current
    • It does this by trying to get the time from a q330, the time from neighbors, or the most recent time saved to disk (done once a minute)
    • The displayed information shows this (DUIKER for q330 time, UDP for network time, DISK for disk based time)
  • This is important because stargates forget what time it is when they reboot
    • Having current time on the stargate, even if it is only accurate to within a few minutes is much more useful than having no time.
  • NOTE: This module may slow down the startup of the software system
    • This is because the timekeeper attempts to find the most current time before letting the rest of the software start up
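
The number in the Mode line is seconds.microseconds since 1970, followed by the same instant as text. A quick way to check one against the other is sketched below; epoch_to_utc is a hypothetical helper and it assumes GNU date (the `-d @SECONDS` form), which may not match the date command on the stargate.

```shell
# Hypothetical helper: convert a seconds.microseconds timestamp to UTC text.
# Assumes GNU date; other date implementations use different flags.
epoch_to_utc() {
    secs=${1%%.*}          # drop the microseconds part
    LC_ALL=C date -u -d "@$secs" '+%a %b %e %H:%M:%S %Y'
}

epoch_to_utc 1197353978.186051
```

For the value in the display above this gives Tue Dec 11 06:19:38 2007, which suggests the text in the Mode line is UTC.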

SinkTree display

Node 7: The sink is 0, there is no next hop
node sink h  tett       sett       fdr     rdr     lett       rate time     stat    next
  • This shows the currently possible and currently selected paths to the sink nodes
  • The most important columns are node, sink, h (hops), tett, time, stat, next
    • node: The next hop node
    • sink: The sink this node is sending to
    • h: The number of hops away this node is from the sink
    • tett: Total path ETT. The ETT is the metric for each link. Add them all up and you get the total ETT. The lower the ETT, the better!
    • time: The last time we heard anything from this node
    • stat: Status of this node. Active means it can be considered as the next hop
    • next: The path data through this node will take
  • The best next hop is chosen by the lowest total ETT (tett).
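
The selection rule can be illustrated with a few lines of awk. This is a sketch only: best_next_hop is a hypothetical helper, the column positions are taken from the header line above (node in column 1, tett in column 4, stat in column 11), and the sample rows are invented for illustration rather than copied from a real node.

```shell
# Hypothetical helper: pick the Active neighbor with the lowest total ETT.
# Lines whose first field is not a node number (headers, notes) are skipped.
best_next_hop() {
    awk '$1 ~ /^[0-9]+$/ && tolower($11) == "active" {
             if (best == "" || $4 + 0 < min) { min = $4 + 0; best = $1 }
         }
         END { if (best != "") print best }'
}

# Invented rows: node 15 has the lowest tett but is not Active, so 9 wins.
best_next_hop <<'EOF'
node sink h  tett  sett  fdr  rdr  lett  rate time stat   next
12   0    1  4.20  4.20  0.95 0.97 4.20  110  3    Active 12->0
9    0    2  2.75  1.10  0.99 0.98 1.65  110  1    Active 9->3->0
15   0    1  1.90  1.90  0.60 0.55 1.90  20   240  Dead   15->0
EOF
```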

Recent Transfers display

-- Outputdir:		 ClientTimeout 45000, ServerTimeout 40000

				 ---- Sending ----

				 ---- Receiving ----
  • This will show live incoming and outgoing transfers as well as recently completed/failed transfers in the last few minutes
  • You may occasionally see a file go to 100% and then fail. This means that the transfer finished, but the final 'goodbye packets' got lost. The code should be able to resume the transfer and finish properly. If it does not, let Martin know.

Recent sysmanlog display

  • This is mostly for debugging
  • This shows the last 1.5 hours of logs which are being bundled and eventually sent to the main raid


Duiker Page

  • The duiker page shows the current status output of duiker
  • You can set the serial number and the site location
  • You can also initiate unlock/lock/center commands from here
  • If you do not see a bunch of status information below the unlock command box, that means DUIKER IS NOT RUNNING!


Linkinfo Page

  • Use linkinfo to determine if the chosen location is good enough for a deployment
  • You can ping by hand, but always come back and look at the linkinfo. It shows information that is much richer than a simple ping

How it works

  • The linkinfo module is aware of all the neighbors because it is constantly collecting information from any data traffic in the network
  • Once it is aware of a neighbor, it attempts to send probe packets to the neighbor
    • The probe packets are sent once a second
    • Every ten seconds, the ratio of successfully delivered probes to sent probes is folded into an EWMA (exponentially weighted moving average) which tracks the packet delivery ratio
    • If the node happens to be generating other traffic on that particular link at a rate greater than 10 packets every 10 seconds, the node will stop generating probe traffic
  • Using the probe packets and any other network traffic, the node is also able to determine what the most used data rate is
    • 802.11b can send packets at various data rates: 1Mbps, 2Mbps, 5.5Mbps, 11 Mbps
    • The higher the data rate, the faster the data can be sent between the two nodes
    • The drawback to a higher data rate is that with a poor link, it has a lower probability to get through
    • The 802.11b card will try to pick a rate based on whether it is successfully sending packets at the various rates
  • The packet delivery ratios along with the data rate are used to compute the ETT (estimated transmission time) for the given link.
  • The ETT is used to determine the best links to use to get to the sink.
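
The two calculations described above can be sketched as follows. The text does not give the smoothing weight or the exact ETT formula used by linkinfo, so both are assumptions here: alpha defaults to 0.2, and ETT uses the commonly published definition (expected transmissions 1/(forward ratio × reverse ratio), multiplied by packet bits over bandwidth). ewma_update and ett_ms are hypothetical helper names, and the 1400-byte default simply matches the large pings used elsewhere on this page.

```shell
# Hypothetical helper: fold one 10-second sample into a running delivery ratio.
#   $1 = previous EWMA (0..1), $2 = new sample (0..1), $3 = alpha (assumed 0.2)
ewma_update() {
    awk -v prev="$1" -v sample="$2" -v alpha="${3:-0.2}" \
        'BEGIN { printf "%.4f\n", alpha * sample + (1 - alpha) * prev }'
}

# Hypothetical helper: textbook ETT in milliseconds.
#   $1 = forward delivery ratio, $2 = reverse delivery ratio,
#   $3 = rate code as shown by linkinfo (10, 20, 55, 110), $4 = packet bytes
ett_ms() {
    awk -v f="$1" -v r="$2" -v code="$3" -v bytes="${4:-1400}" 'BEGIN {
        if (f <= 0 || r <= 0) { print "inf"; exit }   # link delivers nothing
        mbps = code / 10                              # 110 -> 11 Mbps
        printf "%.3f\n", (1 / (f * r)) * (bytes * 8) / (mbps * 1e6) * 1000
    }'
}

ewma_update 1.0 0.5 0.2     # one bad interval pulls a perfect link down a little
ett_ms 1 1 110 1400         # perfect link at 11 Mbps
ett_ms 0.5 1 110 1400       # half the probes lost in one direction
```

The point of the sketch is the shape of the metric: halving a delivery ratio doubles the ETT, and so does halving the data rate, which is why both columns matter when judging a link.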

The output

  • The columns are as follows:
    • Neighbor: the neighboring node
    • Status/Gen: status is the assumed status of the node / gen is whether we are generating probe traffic to this node
      • The status can show a couple of different states.
      • It attempts to determine the state from the delivery ratio as well as the time we last heard from the node and the time since the node has reported something recent about us
      • This value is meant as a quick idea of what is going on. It is more important to note whether it is changing and how stable it is. The states are:
        • UNKNOWN: we know there might be a node there but we have not heard anything telling us it is a CDCC
        • STARTUP: there is a node there but we are still trying to collect information about it
        • ACTIVE: All signs point to 'Yes, this node is active and we are determining the link quality'
        • ASYMM: We know it is a CDCC and we can hear it, but it does not look like it knows we exist
        • DEAD: Dead. Having lots of trouble sending or receiving packets to this node
      • If you see a node jumping between states such as STARTUP and ACTIVE and DEAD... that means the link is probably pretty flaky. Try adjusting the antenna
      • The 'Gen' indicates we are currently sending probes to this node (Y means yes, N means no)
    • Last Heard: how long ago we heard any information from this node about us
      • Each node is aware of when it last heard some information about another node
      • The nodes report to each other when they last heard information from each other
      • If this value is high and repeatedly goes high, then that means the link is flaky since the two nodes can not continuously successfully communicate
    • RSSI (dBm): the signal strength of each received packet from this node. This value is a running average
    • Silence (dBm): the ambient noise level before a packet was received. This value is a running average
      • The silence value is important to watch during deployment. The higher it is, the worse the surrounding 'noise' is and the more difficult it will be to receive packets.
    • SNR: the RSSI minus the silence
      • We have only recently started displaying this information so we do not have a good 'feel' for what the RSSI, Silence, and SNR values should be for a good site.
      • In our past deployments we have seen that you can have a good link even if the SNR is poor; however, it is always best to try to set up sites with good SNR.
      • When setting up a site, pay attention to the rssi, silence value, and snr and how it changes. It is good to keep track of this mentally so you can gain an understanding of what good values are.
    • Rate In/Out: The data rate to and from this node. 10 -> 1Mbps, 20 -> 2 Mbps, 55 -> 5.5 Mbps, 110 -> 11 Mbps
      • The data rate can tell you a lot about whether a link is flaky or not.
      • If the link is at 1 or 2 Mbps, there is enough packet loss to force the rate down. This does not necessarily mean the link is unusable, but you should proceed with caution or do your best to adjust the antennas.
    • Conn In/Out: The packet delivery ratios as a percentage for sent and received packets
      • Combining the packet delivery ratios and the data rate can also tell you a lot about a link
      • If you see a connectivity of less than 70% in both directions (In and Out), the link is going to perform very poorly.
      • If the connectivity is high but the data rate is at 1 or 2 Mbps, there could be significant loss on the link, especially with more traffic
        • It is difficult to tell in this situation, so the best thing to do is to send some more pings by hand or readjust the antennas
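
The rules of thumb for the Rate and Conn columns can be written down as a small check. This is a sketch, not part of linkinfo: rate_mbps and assess_link are hypothetical helpers, the connectivity values are assumed to be whole percentages, and the case where only one direction is below 70% is not covered by the guidance above, so the sketch does not flag it.

```shell
# Hypothetical helper: translate a linkinfo rate code into Mbps.
rate_mbps() {
    case "$1" in
        10)  echo 1 ;;
        20)  echo 2 ;;
        55)  echo 5.5 ;;
        110) echo 11 ;;
        *)   echo unknown; return 1 ;;
    esac
}

# Hypothetical helper: apply the rules of thumb to one neighbor.
#   $1/$2 = Conn In/Out (whole percent), $3/$4 = Rate In/Out (rate codes)
assess_link() {
    conn_in=$1 conn_out=$2 rate_in=$3 rate_out=$4
    if [ "$conn_in" -lt 70 ] && [ "$conn_out" -lt 70 ]; then
        echo poor        # under 70% both ways: will perform very poorly
    elif [ "$rate_in" -le 20 ] || [ "$rate_out" -le 20 ]; then
        echo caution     # stuck at 1 or 2 Mbps: send more pings, adjust antennas
    else
        echo ok
    fi
}

assess_link 95 90 20 110
```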

Important notes!!!

  • It takes at least 120-180 seconds for the linkinfo to get a stable reading!
    • You are more than welcome (encouraged in fact) to go and send additional pings between nodes (linkinfo will take these into consideration as readings as long as they are large packets!)
      • BUT... Please do not only use the pings to make a decision about a link! Use the data rate and the SNR as well. Good links are crucial for this system to work!
    • If you are going to send pings yourself, use the ping command with the -s 1400 flag added. Use broadcast pings only to determine what neighbors are available
ping -s 1400 10.0.0.112
  • You should _never_ see a high data rate and a low connectivity percentage.
    • This is because if there are connectivity problems, the 802.11 card will lower the data rate automatically
    • If you do see this, wait 2 minutes and see what happens. Try to send a few pings.


Talking with the CDCC over WIRELESS

  • Set up the wireless or serial on the deployment laptop
    • Run the script 'seismic_me' with sudo or as root
    • If the script does not exist, enter the following commands as root
      • To enter the commands as root, either type 'su -' and enter the root password (then just type in the commands), or add 'sudo' in front of each command, which will ask for the current user's password the first time only
iwconfig wlan0 mode Ad-Hoc
iwconfig wlan0 channel 11
prism2_param wlan0 pseudo_ibss 1
iwconfig wlan0 essid perunet
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
# yes, do the line above twice
  • Wirelessly ping the CDCC from the laptop (we will assume you are next to box 103)
ping 10.0.0.103
  • If ping fails, go to serial connection instructions below
  • If the ping works, attempt to connect to the CDCC's web page with the url:
http://10.0.0.103:8080/
  • This will bring up a webpage that shows the status of the CDCC
  • The section above will explain how to understand what you see

Talking with the CDCC over SERIAL or SSH

  • These instructions are for you if
    • You are plugged in directly to a stargate over serial
    • You are ssh'ing into a stargate from a laptop or another stargate
  • ssh is the preferred method because the terminal is a lot nicer (you will understand once you try), but serial may sometimes be necessary


Serial setup

  • You need to have minicom properly set up
  • It is too annoying to explain here. Talk to Igor, Martin, or Vinayak about how to do this.

SSH setup

  • You can set up keys to work with ssh
  • The deployment laptops should have these keys on them already and there should be a script that lets you just connect to the cdcc no questions asked
  • If you do not have the keys, obtain them from Martin
    • He will send you a tar.gz file
    • extract the contents of the file with 'tar zxvf peru-keys.tar.gz'
    • then, 'mkdir ~/ssh-peru' and copy the id_rsa to the ~/ssh-peru directory with 'cp id_rsa ~/ssh-peru'

over ethernet

  • To set up your ethernet connection, run the following commands as root or with sudo
    • To enter the commands as root, either type 'su -' and enter the root password (then just type in the commands), or add 'sudo' in front of each command, which will ask for the current user's password the first time only
ifconfig eth0 netmask 255.255.255.0 broadcast 192.168.100.255 192.168.100.95 up
ifconfig eth0 netmask 255.255.255.0 broadcast 192.168.100.255 192.168.100.95 up
# yes enter the command twice
  • You can check if everything is configured correctly by typing ifconfig and looking for the eth0 interface
  • Once the ethernet is configured, type:
ssh root@192.168.100.100
# the CDCC's Ethernet address _always_ ends in 100

# Or if you have the ssh keys, type
ssh -i ~/ssh-peru/id_rsa root@192.168.100.100

over wireless

  • To set up the wireless connection, run the 'seismic_me' script, OR run the following commands as root or with sudo
    • To enter the commands as root, either type 'su -' and enter the root password (then just type in the commands), or add 'sudo' in front of each command, which will ask for the current user's password the first time only
iwconfig wlan0 mode Ad-Hoc
iwconfig wlan0 channel 11
prism2_param wlan0 pseudo_ibss 1
iwconfig wlan0 essid perunet
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
# yes, do the line above twice
  • You can check if everything is configured correctly by typing ifconfig and looking for the wlan0 interface
  • Once the wireless is configured, type:
ssh root@10.0.0.103
# Make sure to replace the 103 with the ID of the CDCC

# Or if you have the ssh keys, type
ssh -i ~/ssh-peru/id_rsa root@10.0.0.103

Using the console

  • There is a lot of useful information available in the console.

CF card layout

  • The CF card is mounted to /opt on the stargate
  • It is important to understand the layout so you can find information quickly and not mess things up :D
  • If you forget this information, there is a file in /opt which shows it all again... you can just cat README.txt to get it
# cat /opt/README.txt
bin       - extra programs
conf      - station name, and anything else to be read by any apps
cron      - all the cron directories
data      - duiker will place data here
dts       - hidden runtime information for dts... do not delete
duiker    - duiker binarys and configuration files
emstar    - all our code
filemover - any files in here will be moved to the next hop WITH a dts header
log       - system log will go here if turned on
.log      - for the systemmanager temp info... do not delete!
startup   - put scripts to run on startup here
tmp       - play here
xfer      - any files in here will be moved to the next hop without a dts header
  • Do your best not to add extra files or directories directly to /opt. If you have to do stuff, use the tmp subdirectory

Checking various status information

Disk usage and file counts

  • This is important to do to understand how much data is currently on the node. The data is on the compact flash card, which is mounted to /opt
  • There are a variety of ways to do this. Below you see the quickest ways, complete with examples
    • Indented text shows the output of the command
# You can ask the software running about the diskspace
cat /dev/diskmanager/status 
    Data directory /opt/data/ contains 0 files, which are lined for deletion
    Filemover directory /opt/filemover/ contains 0 files, which are lined for deletion
    Xfer directory /opt/xfer/ contains 0 files, which are lined for deletion
    Total diskspace = 0.97GB
    Free diskspace threshold = 25.00% which is 0.24GB
    Free diskspace delete threshold = 5.00% which is 0.05GB
    Free diskspace = 10.43% which is 0.10GB

# you can ask the OS about the diskspace... the thing to look for is /opt since that is where the CF card is mounted.
df -h
    Filesystem            Size  Used Avail Use% Mounted on
    rootfs                 30M   16M   14M  54% /
    /dev/root              30M   16M   14M  54% /
    /dev/hda1             996M  893M  103M  90% /opt

# You can see how many files are in each of the data directories
# data is where local duiker data goes first, filemover is where in-transit data and local data end up,
#   and xfer is where logs and anything extra in transit goes
ls /opt/data | wc
ls /opt/filemover | wc
ls /opt/xfer | wc
# The first number in the output is the number of files

Duiker status

# Check if duiker is running. If you see two entries (one of them being 'grep duiker' itself), duiker is running
ps awx | grep duiker
     3723 ?        S      8:37 ./duiker
    30520 pts/0    S      0:00 grep duiker

# Alternatively, you can do the following, which shows a lot of status information
cat /dev/duiker/status

# Check if duiker is collecting data. Run the following command a few times in a row. Look to see if one of the packets files is
#   increasing in size. The fifth column (the one right before the date) is what you look at
ls -l /opt/duiker/*.packets
    -rw-------    1 root     root       208616 Jan 21 18:58 /opt/duiker/20080121190026.TO.LECS.bundle_q330_packets.packets
ls -l /opt/duiker/*.packets
    -rw-------    1 root     root       212616 Jan 21 18:58 /opt/duiker/20080121190026.TO.LECS.bundle_q330_packets.packets
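
Comparing sizes by eye works, but the same check can be sketched as a helper. is_growing is hypothetical and the 10-second default wait is an assumption; pick a wait long enough for duiker to write at least once.

```shell
# Hypothetical helper: return 0 if FILE grows within WAIT seconds (default 10),
# 1 if it does not, 2 if the file cannot be read.
is_growing() {
    file=$1
    wait_s=${2:-10}
    before=$(wc -c < "$file") || return 2
    sleep "$wait_s"
    after=$(wc -c < "$file") || return 2
    [ "$after" -gt "$before" ]
}

# Usage (commented out so the sketch is safe to source):
# for f in /opt/duiker/*.packets; do
#     is_growing "$f" 10 && echo "$f is growing"
# done
```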

Software status

  • A lot of this is similar to the information shown on the web page
  • To see the sinktree information, type the following. See the information above about the web page to understand what this shows
sinkstatus
# Or, you can type
cat /dev/dts/sink_status
  • To see the recent transfers, type
xfers
# Or, you can type
cat /dev/xfer/status
  • To see the linkinfo, type the following. Note this is an advanced version of what is on the webpage. Try to use
cat /dev/linkinfo/status-wlan0
# Look for the percentages at the end of the line indicating each neighbor, and look for the data rates
  • You can check the status of all the software with the following command
    • The thing to look for here is that everything is 'running'. If you see something is 'looping' or 'waiting', record the output and let Martin know
status
# or type
cat /dev/emrun/status
  • Also, if you do see something funny, run the following and record the output for Martin
cat /dev/emrun/last_msg

Using rbsh

  • rbsh will let you talk to one or more nodes simultaneously without having to log in with ssh
  • It is like being ssh'ed to one or more nodes all at once
  • It is useful for quickly checking the status of a node and it is highly recommended over using ssh!
  • It is on every single cdcc as well as on the deployment laptops
  • NOTE: rbsh is 'best effort'. It will try its best to send commands and get responses from nodes, but it does not guarantee any reliability!!!

Setting up for rbsh

  • If you are on a deployment laptop, you need to set up the wireless for this to work
  • To set up the wireless connection, run the 'seismic_me' script, OR run the following commands as root or with sudo
    • To enter the commands as root, either type 'su -' and enter the root password (then just type in the commands), or add 'sudo' in front of each command, which will ask for the current user's password the first time only
iwconfig wlan0 mode Ad-Hoc
iwconfig wlan0 channel 11
prism2_param wlan0 pseudo_ibss 1
iwconfig wlan0 essid perunet
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
# yes, do the line above twice

Running rbsh

  • To run rbsh at the command line on the laptop or the CDCC run the following
rbsh -b wlan0
  • Once it starts, press 'enter' a few times, and you should see something like
[4] rbsh 4> 
Node=0.0.0.13, reply to seqno=4: Exit status 0 
Node=0.0.0.140, reply to seqno=4: Exit status 0 
Node=0.0.0.129, reply to seqno=4: Exit status 0 
Node=0.0.0.126, reply to seqno=4: Exit status 0 
[4] rbsh 5>
  • Pressing enter sends an empty command to all the nodes. It is a quick way to see what other nodes you can talk to

Entering commands

  • To enter a command, type it in, and press enter
    • When you press enter, nothing happens right away, because rbsh is giving you the option to enter more commands
    • Once you have entered all the commands you want to send, just push enter to submit a blank line and issue the commands
  • The responses should come back within a few seconds
  • Here is an example:
[7] rbsh 5> df -h
[7] rbsh 5> 
Node=0.0.0.140, reply to seqno=5: Exit status 0 
Filesystem            Size  Used Avail Use% Mounted on
/dev/mtdblock2         30M   22M  8.2M  73% /
... [cut]
Node=0.0.0.7, reply to seqno=5: Exit status 0 
Filesystem            Size  Used Avail Use% Mounted on
rootfs                 30M   16M   14M  54% /
/dev/hda1             996M  896M  100M  90% /opt
170 [1 missed]   
[6] rbsh 6> 
  • Note the '170 [1 missed]' right at the bottom: this means rbsh thinks it did not receive a response from a node it had talked to previously. This is useful to look out for.

checking or aborting commands

  • rbsh also lets you check the commands you have entered so far or abort them. Typing help shows what is available:
[6] rbsh 6> help
Prompt format: [X] rbsh Y>, where
  X is the number of nodes that replied to the last request
  Y is the sequence number of the next request
Local commands:
  help:    prints this message
  abort:   discards current command
  check:   prints current command
  delta:   prints out any missing/added nodes since last command
  sorted:  prints out sorted list of node IDs
  exit:    exits
[6] rbsh 6> 
  • To exit rbsh, type exit

Talking to only certain nodes

  • To only talk to one node or to ignore certain nodes, exit rbsh and rerun it with the --ignore or --dest flags followed by a comma separated list of node id's.
  • For instance, to send commands only to nodes 140 and 129, do the following:
rbsh -b wlan0 --dest 140,129
  • To send commands to all nodes except 140 and 129, use the --ignore flag instead

Using dts

Manual filemover

  • Edit /etc/conf/filemover-sg.conf so the destination IP and user are correct. It should look something like this:
NEXT_HOP=192.168.100.6
FILEMOVER_DIRS="/opt/filemover /opt/xfer"
SSH_KEYS=/opt/.ssh/id_rsa
USER=uclanet
  • The SSH_KEYS above should point to the private key of the transmitting host
    • If you are on a CDCC, it is /root/.ssh/id_rsa ... note /root/.ssh soft links to /opt/.ssh/ !!!!
  • Append the public key to authorized_keys
    • If you are on a CDCC the public key could be /etc/conf/filemover_idrsa or /opt/.ssh/id_rsa.pub or /root/.ssh/id_rsa.pub
    • authorized_keys is in ~/.ssh/authorized_keys
  • Copy /opt/bin/filemover-sg to /opt/cron/cron.hourly
cp /opt/bin/filemover-sg /opt/cron/cron.hourly/filemoversg
# NOTE: The filename MUST NOT contain any -'s or .'s
  • This will make the files be copied every hour to the remote host
  • To make this happen only once a day, use cron.daily or set up a crontab file in /opt/cron/cron.d/ ... ask Martin how to do this.
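
The filename rule from the NOTE above is easy to get wrong when copying scripts by hand. A tiny sketch of a check, where valid_cron_name is a hypothetical helper that encodes only the rule stated on this page (no dashes, no dots):

```shell
# Hypothetical helper: return 0 if NAME is safe to drop into a cron.* directory
# under the rule above (no '-' and no '.' anywhere in the name).
valid_cron_name() {
    case "$1" in
        ""|*[-.]*) return 1 ;;
        *)         return 0 ;;
    esac
}

valid_cron_name filemoversg  && echo "filemoversg: ok"
valid_cron_name filemover-sg || echo "filemover-sg: rename before copying"
```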

Upgrading a live CDCC

  • Get the latest CF image!
  • The basic idea is like this:
    • Stop cron
    • Stop all the processes you can. If you can not stop some, prevent them from starting up and reboot.
    • Save some configuration files
    • Format card if files are corrupt OR delete certain files to ensure new versions
    • Extract the tarfile, save logs

FORMAT CARD VERSION

/etc/init.d/cron stop
/opt/startup/duikstart stop
/opt/startup/dts stop

sleep 10

killall bozohttpd emrun sinktree dis_service emproxy \
systemmanager mhsyncf timekeeper udpd dts linkinfo \
diskmanager filemover tcpxfer

sleep 10

killall bozohttpd emrun sinktree dis_service emproxy \
systemmanager mhsyncf timekeeper udpd dts linkinfo \
diskmanager filemover tcpxfer

sleep 5

killall -9 bozohttpd emrun sinktree dis_service emproxy \
systemmanager mhsyncf timekeeper udpd dts linkinfo \
diskmanager filemover tcpxfer

sleep 2

# 'exit 1' below assumes these lines are run as a script; if pasting by hand, stop here instead
ps awx |grep -E "(linkinfo|systemman|timekeep|dts|diskman|filemover|tcpxfer|emrun|mhsyncf)" \
| grep -q -v grep
if [ $? -eq 0 ]; then echo "Stuff still running. Try kill by hand"; exit 1; fi

ps awx |grep duiker | grep -q -v grep
if [ $? -eq 0 ]; then echo "duiker still running. wait 30 seconds and try this again"; exit 1; fi

remount rw
cp /opt/duiker/duiker.conf /root/
cp -r /opt/.log /root
cd /    # leave /opt so it can be unmounted
umount /opt
echo "Card should have unmounted successfully"
echo "If not, rm /opt/startup/*  then reboot, and try to unmount again"

echo "Do these next commands by hand"
# mkreiserfs /dev/hda1
# mount /opt
# cd /opt/
# # copy cfimage to the card
# tar zxvf peru-CFcard_200*
# cp -r /root/.log /opt/
# cp /root/duiker.conf /opt/duiker/duiker.conf
# sync

# echo "RESTART CRON "
# /etc/init.d/cron start
# remount ro

# # reboot, wait 10 mins, or do this:
# /opt/startup/duikstart start
# sleep 5
# /opt/startup/dts start



Current Bugs

  • Timekeeper: If duiker and dts start at the same time, duiker actually creates the time device, so timekeeper thinks it has duiker time
    • Leave duiker as is (in future fix so only creates device once has time)
    • Have timekeeper recheck when it has duiker time as well
  • Linkinfo
    • add and test refractory setting for status client
    • add rate limiting to probing

Current Feature requests

  • Make console linkinfo simpler and similar to web version
  • add uptime command line command
  • Add 'version' command on login and as command to see CF and root versions
  • command line gps log file inspector
  • last 24 hours of gps locks in a file
  • scripts for lock/unlock/center - Also add 'unlock+center' script
    • Add info to the log when this is done
  • something to be able to check sensor lock/unlock state

Web Page

  • Note box
  • Add CF and root version info to web page
  • Make shutdown button smaller and move it out of the easy-to-click spot
  • Put system uptime near top
  • add rbsh interface and dtsh interface
  • add /dev/emrun/last_msg output

Log output

  • All entries have the following information with them:
time   - 1197934497.044046
node   - 114
seqno  - 2354
  • time is in seconds.microseconds since 1970 (it is the standard struct timeval)
    • When making the table, please include two time columns: one called systime and the other called tstime. Set systime to the above value
  • node is between 1 and 255
  • seqno is a 64 bit unsigned int
    • seqnos are mostly unique per node (two different nodes can and will have the same seqnos)
    • BUT, as you will see below, a single node can use the same seqno a few times in certain situations
      • The first is for reporting multiple data events at the same time. An unfortunate consequence of how things are reported.
      • The second is if the seqno gets reset. We will know when this happens since the system is aware of it.
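A minimal sketch of how the common header could be read before loading an entry into a table, assuming the three values arrive as strings. The names `LogHeader` and `parse_header` are made up for this example and are not part of the node software. The time is split on the dot instead of being converted to a float, so the microseconds are kept exactly; treating a short fraction as a decimal fraction of a second is an assumption.

```python
from typing import NamedTuple


class LogHeader(NamedTuple):
    sec: int    # seconds since 1970 (tv_sec of the struct timeval)
    usec: int   # microseconds (tv_usec)
    node: int   # 1..255
    seqno: int  # 64 bit unsigned


def parse_header(time_s: str, node_s: str, seqno_s: str) -> LogHeader:
    sec_s, _, usec_s = time_s.partition(".")
    # Assumption: the part after the dot is a decimal fraction of a second,
    # so it is right-padded with zeros to six digits.
    usec = int(usec_s.ljust(6, "0")[:6]) if usec_s else 0
    node = int(node_s)
    seqno = int(seqno_s)
    if not 1 <= node <= 255:
        raise ValueError(f"node out of range: {node}")
    if not 0 <= seqno < 2**64:
        raise ValueError(f"seqno out of range: {seqno}")
    return LogHeader(int(sec_s), usec, node, seqno)


hdr = parse_header("1197934497.044046", "114", "2354")
print(hdr)
```

Because a node can reuse a seqno, the pair (node, seqno) on its own is probably not a safe unique key for a table.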

Note: all formats can be changed into whatever before being inserted into db


Log information

  • This is provided for every log file processed
  • Each node generates one log file an hour
  • It is useful to keep track of this information since it attempts to show what software version is running on the node!
node              - node
start             - seconds.usec representation of time the log started
startasc          - ascii representation of time the log started. Format is: %Y%m%d%H%M%S
end               - seconds.usec representation of time the log ended
endasc            - ascii representation of time the log ended
fs                - file system version - given as the ascii time the fs was created, formatted like the above
cf                - creation date of software on the CF card - ascii time format as above
processtime       - time log file was processed
proccesstime_asc  - ascii time log file was processed
# example: 182 1198019849.654704 20071218231729 1198020149.658536 20071218232229 200712171818 0 1199486688.58 20080104224448
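As a rough sketch, the example line above can be split into the named fields and the ascii times cross-checked against the seconds values. The helper names `parse_log_info` and `asc_to_epoch` are invented for this example. Reading the ascii times as UTC is an assumption; it happens to agree with the example line, but that is the only evidence for it here.

```python
import calendar
import time

# Field names in the order listed above (spelling as on this page).
FIELDS = ["node", "start", "startasc", "end", "endasc",
          "fs", "cf", "processtime", "proccesstime_asc"]


def parse_log_info(line: str) -> dict:
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def asc_to_epoch(asc: str) -> int:
    """Convert a %Y%m%d%H%M%S string to seconds since 1970, read as UTC."""
    return calendar.timegm(time.strptime(asc, "%Y%m%d%H%M%S"))


example = ("182 1198019849.654704 20071218231729 1198020149.658536 "
           "20071218232229 200712171818 0 1199486688.58 20080104224448")
rec = parse_log_info(example)
start_sec = int(rec["start"].split(".")[0])
print(asc_to_epoch(rec["startasc"]) == start_sec)  # True for this example
```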

q330 status information

  • We get status reports every 10 mins about the q330
  • They come as three separate messages
    • The time is different between the group of three messages by about 10-20 milliseconds
    • The seqno for each of the three messages is unique
    • We can do things so that the time is all the same and these get put into the DB as one entry into one table
# global info
clockqual       - clock quality - displayed as hex number... stored as 16 bit unsigned int (uint16_t)
minsinceloss    - minutes since loss (of gps) - stored as uint16_t
secoffset       - seconds offset - uint
micsecoffset    - microseconds offset - uint
totalsec        - total time in seconds - uint
powsec          - power on time in seconds - uint
lastsync        - time of last resync - uint
vco             - current vco - uint16_t
miscin          - misc inputs - displayed as hex - uint16_t
site_code       - optional - 6 chars max
# example: 0x51 10119 249436455 999996 137585117 76251468 249436453 2081 0x00 

# gps info
powtime         -  power on time - uint16_t
powind          -  power on indicator - uint16_t 
numsatuse       -  number of satellites in use - uint16_t
numsatrange     -  number of satellites in range - uint16_t
gpstime         -  gps time string - 10 characters max
gpsdate         -  gps date string - 12 characters max
gpsfix          -  gps fix string - 6 chars max
gpsheight       -  gps height string - 12 chars max
lat             -  latitude string - 14 chars max
lon             -  longitude string - 14 chars max
lastgood1pps    -  time of last good 1PPS signal - uint
site_code       - optional - 6 chars max
# example: 36 1 0 12 "23:16:26" "17/12/2007" "NONE" "124.1M" "3404.1748N" "11826.5046W" 250642564
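The gps line mixes bare integers with quoted strings, so a plain whitespace split is fragile. The sketch below uses `shlex.split` from the Python standard library to handle the quotes. The helper names are invented, and reading lat/lon as degrees and decimal minutes (ddmm.mmmm plus a hemisphere letter) is an assumption based on how the example values look.

```python
import shlex

GPS_FIELDS = ["powtime", "powind", "numsatuse", "numsatrange", "gpstime",
              "gpsdate", "gpsfix", "gpsheight", "lat", "lon", "lastgood1pps"]


def parse_gps(line: str) -> dict:
    values = shlex.split(line)  # removes the double quotes around strings
    if len(values) < len(GPS_FIELDS):
        raise ValueError(f"expected at least {len(GPS_FIELDS)} fields")
    rec = dict(zip(GPS_FIELDS, values))
    if len(values) > len(GPS_FIELDS):
        rec["site_code"] = values[len(GPS_FIELDS)]  # optional trailing field
    return rec


def to_degrees(coord: str) -> float:
    """Assumed format: (d)ddmm.mmmm followed by N, S, E or W."""
    hemi = coord[-1]
    value = float(coord[:-1])
    degrees = int(value // 100)
    minutes = value - degrees * 100
    result = degrees + minutes / 60.0
    return -result if hemi in "SW" else result


example = ('36 1 0 12 "23:16:26" "17/12/2007" "NONE" "124.1M" '
           '"3404.1748N" "11826.5046W" 250642564')
rec = parse_gps(example)
lat = to_degrees(rec["lat"])
lon = to_degrees(rec["lon"])
print(rec["gpsfix"], round(lat, 4), round(lon, 4))
```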

# power and temp info ( all are uint16_t's )
boomone          - channel one boom position
boomtwo          - channel two boom position
boomthree        - channel three boom position
possup           - positive power supply (10mv incr)
inpow            - input power supply (150mv incr)
systemp          - system temperature (C)
maincurr         - main current (1ma incr)
gpsantcurr       - gps antenna current (1ma incr)
site_code        - optional - 6 chars max
# example: 13 89 89 546 99 27 60 0 
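The increments listed above can be applied when the values are loaded. This is a small sketch with invented names (`parse_power`, the `_mv` and `_ma` suffixes); it keeps everything as integers in millivolts and milliamps to avoid float rounding, and leaves the boom positions as raw values since no unit is given for them.

```python
POWER_FIELDS = ["boomone", "boomtwo", "boomthree", "possup",
                "inpow", "systemp", "maincurr", "gpsantcurr"]


def parse_power(line: str) -> dict:
    values = line.split()
    if len(values) < len(POWER_FIELDS):
        raise ValueError(f"expected at least {len(POWER_FIELDS)} fields")
    raw = {name: int(v) for name, v in zip(POWER_FIELDS, values)}
    rec = {
        "boomone": raw["boomone"],
        "boomtwo": raw["boomtwo"],
        "boomthree": raw["boomthree"],
        "possup_mv": raw["possup"] * 10,    # 10 mV increments
        "inpow_mv": raw["inpow"] * 150,     # 150 mV increments
        "systemp_c": raw["systemp"],        # degrees C
        "maincurr_ma": raw["maincurr"],     # 1 mA increments
        "gpsantcurr_ma": raw["gpsantcurr"], # 1 mA increments
    }
    if len(values) > len(POWER_FIELDS):
        rec["site_code"] = values[len(POWER_FIELDS)]  # optional
    return rec


power = parse_power("13 89 89 546 99 27 60 0")
print(power["possup_mv"], power["inpow_mv"])  # 5460 14850
```

For the example line this works out to a 5.46 V positive supply and a 14.85 V input supply.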

# command info
cmd             - one of: unlock, lock, center
site_code       - optional - 6 chars max
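One way the three q330 status messages could be folded into a single database row is to group messages from the same node that arrive within a short window and give the row the earliest of the three times. The sketch below is only an illustration: the tuple layout, the kind names ("global", "gps", "power"), the 0.1 second window and the sample messages are all assumptions made for this example.

```python
def group_q330(messages, window=0.1):
    """messages: iterable of (time, node, kind, payload) tuples."""
    groups = []
    open_group = {}  # node -> group currently being filled
    for t, node, kind, payload in sorted(messages, key=lambda m: m[0]):
        g = open_group.get(node)
        # Start a new group if none is open, the gap is too large,
        # or this kind of message was already seen in the open group.
        if g is None or t - g["time"] > window or kind in g["parts"]:
            g = {"time": t, "node": node, "parts": {}}
            groups.append(g)
            open_group[node] = g
        g["parts"][kind] = payload
    return groups


# Hypothetical messages: one report at t=100 and the next 10 mins later.
msgs = [
    (100.000, 114, "global", "0x51 10119"),
    (100.012, 114, "gps", "36 1 0 12"),
    (100.020, 114, "power", "13 89 89"),
    (700.000, 114, "global", "0x51 10129"),
]
groups = group_q330(msgs)
print(len(groups), sorted(groups[0]["parts"]))
```

The three seqnos in a group are different, so a combined row would need to keep either all three or just the first.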

Link information

  • Every 10 mins information about the links is reported
  • Each report contains information about multiple nodes so it is split up
    • Because it is split up, multiple lines being put into the database will have the same seqno and time
host     - the neighboring node - an ip address
stat     - the status - one word either: UNKNOWN STARTUP ACTIVE ASYMM DEAD Inval
rss      - ewma of receive signal strength - float
rssdev   - running stddev of rss - float
sil      - ewma of silence value (do rss - sil to get snr) - float
sildev   - running stddev of sil - float
recvr    - most common recv data rate for the last 10 seconds - integer < 255 - divide by 10 to get actual data rate
sendr    - most common send data rate for the last 10 seconds - integer < 255 - divide by 10 to get actual data rate
succp    - success percentage of incoming packets - float - 0.00 to 100.00
succewma - success percentage of outgoing packets - float - 0.00 to 100.00
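The two derived values mentioned above (snr from rss and sil, and the actual data rates) can be computed as follows. This is a sketch with an invented helper name, and the input numbers are made up for illustration; they are not taken from a real report.

```python
VALID_STAT = {"UNKNOWN", "STARTUP", "ACTIVE", "ASYMM", "DEAD", "Inval"}


def link_summary(stat: str, rss: float, sil: float,
                 recvr: int, sendr: int) -> dict:
    if stat not in VALID_STAT:
        raise ValueError(f"unexpected status: {stat}")
    return {
        "stat": stat,
        "snr": rss - sil,          # rss - sil, as described above
        "recv_rate": recvr / 10,   # stored value divided by 10
        "send_rate": sendr / 10,
    }


# Hypothetical values for one neighbor.
summary = link_summary("ACTIVE", rss=-70.0, sil=-95.0, recvr=110, sendr=55)
print(summary)
```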

filemover information

  • When a file is received it generates one of these messages
  • We can get this information from elsewhere and I think it may be a more reliable source and easier to work with
file    - the filename (see example below)
dst     - the node which received the file - ip address
src     - the node which sent the file - ip address
xtime   - time in seconds the transfer took
btime   - time in seconds the transfer took including the wait in between retries 
ret     - number of retries
bw      - approximate bandwidth
size    - file size
# example: /opt/filemover/20070430040711.TO.LECS.bundle_q330_packets.tar.gz.dts 10.0.0.97 10.0.0.182 9351 9351 0 274.812744 

path information

  • Every 10 mins information about the path to the sink is reported
changes - number of times path has changed since the last report
ett     - full path metric to sink - value ranges from 0.00001 to a large integer
path    - a variable length string or 0. For example <182<192<149
# example: 16 0.001122 <182
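A short sketch of splitting a path record into its parts, with the path string turned into a list of node numbers. The function name is invented for this example, and the list keeps the hops in the order they appear in the string without assuming which end is the sink.

```python
def parse_path_record(line: str) -> dict:
    changes_s, ett_s, path_s = line.split()
    # The path is either "0" (no path) or a string like "<182<192<149".
    if path_s == "0":
        hops = []
    else:
        hops = [int(n) for n in path_s.split("<") if n]
    return {"changes": int(changes_s), "ett": float(ett_s), "hops": hops}


record = parse_path_record("16 0.001122 <182")
print(record)
print(parse_path_record("3 0.5 <182<192<149")["hops"])  # [182, 192, 149]
```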

disk status/deletion information

  • Every 10 mins the disk space is reported
freeds   - free disk space (in megabytes)
freedsp  - free disk space percentage
usedds   - used disk space (in megabytes)
useddsp  - used disk space percentage
total    - total disk space (in megabytes)
thresh   - threshold at which no files are accepted
dthresh  - threshold at which files are deleted
# example: 767.72 78.55 209.68 21.4 977.40 25.00 5.00
  • Whenever a file is deleted, a message is added to the log
file     - the filename
# example: 20070430070711.TO.LECS.bundle_q330_packets.tar.gz.dts
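The disk space report carries redundant values (free plus used should equal total, and the percentages follow from the megabyte values), which makes a cheap sanity check possible before inserting a row. The sketch below uses invented names, and the tolerance of 0.1 is an arbitrary choice to absorb rounding in the reported numbers.

```python
DISK_FIELDS = ["freeds", "freedsp", "usedds", "useddsp",
               "total", "thresh", "dthresh"]


def parse_disk(line: str) -> dict:
    values = [float(v) for v in line.split()]
    if len(values) != len(DISK_FIELDS):
        raise ValueError(f"expected {len(DISK_FIELDS)} fields")
    return dict(zip(DISK_FIELDS, values))


def is_consistent(rec: dict, tol: float = 0.1) -> bool:
    sums_ok = abs(rec["freeds"] + rec["usedds"] - rec["total"]) <= tol
    pct_ok = abs(100.0 * rec["freeds"] / rec["total"] - rec["freedsp"]) <= tol
    return sums_ok and pct_ok


disk = parse_disk("767.72 78.55 209.68 21.4 977.40 25.00 5.00")
print(is_consistent(disk))  # True for the example line
```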

reboot and seqno information

  • There are three system messages that show up in the logs
  • They have no extra information beyond the time, the node, and the seqno that all the nodes report
SYSMAN_REBOOT - The node experienced a reboot. This message happens when a node starts up
BUTTON_RESET - The button to reboot the node was pressed
SYSMAN_RESET_SEQ - The CF card was probably switched because a seqno was found but it was for a different node. So we reset the seqno.
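Since SYSMAN_RESET_SEQ marks the point where a node's seqnos start over, one possible way to keep entries distinguishable in the database is to count resets per node and store that count next to the seqno. This is only a sketch of the idea: the tuple layout, the name `tag_generations`, the `DATA` placeholder message and the sample entries are assumptions made for this example. It also does not separate entries that share a seqno because several data events were reported together.

```python
from collections import defaultdict


def tag_generations(entries):
    """entries: iterable of (time, node, seqno, message), in time order."""
    generation = defaultdict(int)  # node -> number of resets seen so far
    tagged = []
    for t, node, seqno, message in entries:
        if message == "SYSMAN_RESET_SEQ":
            generation[node] += 1
        tagged.append((node, generation[node], seqno, t, message))
    return tagged


# Hypothetical entries for node 114: seqno 5 appears before and after a reset.
entries = [
    (1000.0, 114, 5, "DATA"),
    (2000.0, 114, 0, "SYSMAN_RESET_SEQ"),
    (3000.0, 114, 5, "DATA"),
]
tagged = tag_generations(entries)
keys = [row[:3] for row in tagged]
print(keys)  # [(114, 0, 5), (114, 1, 0), (114, 1, 5)]
```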

Data processing

  • In addition to this we have log messages for all the data files that get processed on the RAID. This is basically a timestamp and the filename.