
clab knowledge

to know, is to know that you know nothing. that is the meaning of true knowledge. -- socrates

meta

  • purpose: this repository is meant to share knowledge (well, and pieces of information too) that is not suitable for sharing elsewhere, among members of clab.
  • editing: use the "markdown" format to edit this file. help can be found at http://stackoverflow.com/editing-help . feel free to create a new file if need be, but make sure to link to it from this one for easy access. the [GFL guidelines](https://github.com/brendano/gfl_syntax/blob/master/guidelines/guidelines.md) are a good example of a nicely written .md file, in case you're not familiar with Markdown syntax.

notes about writing

  • a paper needs to be organized. here are some tricks for checking the global structure of your paper: read the section headings; do they tell the story you want? read the first sentence or two of each section, one after the next; does the sequence tell your paper's story? this advice becomes more and more important as your paper gets longer.

are you writing a proposal?

papers for summer 2015

  • 2015-05-29: Transition-Based Dependency Parsing with Stack Long Short-Term Memory, by Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah Smith. ACL 2015 (attached to an email that Chris sent to the group on 2015-05-22)
  • 2015-06-12: [A Joint Model for Entity Analysis: Coreference, Typing, and Linking](http://www.eecs.berkeley.edu/~gdurrett/papers/durrett-klein-tacl2014.pdf) by Greg Durrett and Dan Klein. TACL 2014
  • 2015-06-19: [A Recurrent Latent Variable Model for Sequential Data](http://arxiv.org/abs/1506.02216) by Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. arXiv
  • 2015-06-26: [Generative Adversarial Nets](http://arxiv.org/pdf/1406.2661v1.pdf) by Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. arXiv
  • 2015-07-03: [Structured Training for Neural Network Transition-Based Parsing](http://www.petrovi.de/data/acl15.pdf) by David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. ACL 2015

Ideas for papers?

papers for spring 2015

papers for fall 2014

papers for summer reading group 2014

papers for spring 2014

papers we recently read

computational resources

allegro

  • software packages of general interest should be installed at allegro:/opt/tools, and a modulefile (learn more about Environment Modules [here](http://modules.sourceforge.net/)) should be added at allegro:/opt/modulefiles so that other people can find the package by executing module avail; see the sketch after this list
  • corpora can be found at allegro:/usr1/corpora, allegro:/cab1/corpora, allegro:/mal1/corpora, or allegro:/mal2/corpora
  • when something goes wrong with allegro, email [email protected] or call SCS Operations at 412-268-2608 (open 24x7)
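a minimal sketch of the install-plus-modulefile workflow ("mytool" and its version are placeholders, not an actual package on allegro):

# install the package under the shared tools directory
./configure --prefix=/opt/tools/mytool/1.0 && make && make install
# then write a modulefile at /opt/modulefiles/mytool/1.0 that prepends
# /opt/tools/mytool/1.0/bin to PATH; once it exists, anyone can run:
module avail             # mytool/1.0 now shows up in the listing
module load mytool/1.0   # puts the tool on your PATH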

trestles

login:

software:

find already installed software packages by running module avail

if you can't find a dependency you need by module avail, you can either

  • install it yourself in your home directory (see the sketch after this list).
  • ask [email protected] to install it for you (especially if it's a generally useful tool); expect delays. also, they don't always agree :-/

running jobs:

trivial jobs (e.g., compiling, computing BLEU) can be run directly on the trestles login machine.

use qsub, specifying one of two queues: -q normal, which gives exclusive access to nodes, or -q shared, which gives shared access to nodes (e.g., it allows you to request only 8 of the 32 cores on a node; this is useful for running a debugger on the same node).

you can run interactive jobs as follows: qsub -I -q normal -l nodes=1:ppn=32,walltime=48:00:00. 48 hours is the maximum you can request for an interactive job. to avoid losing your job if the login server closes your ssh connection (e.g., due to long inactivity), log in to a specific login server and start a /home/diag/glock/user/screen/bin/screen session before submitting the qsub -I job. for non-interactive work, a batch script avoids tying the job to your ssh session altogether; see the sketch below.
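a minimal batch script sketch (the job name and ./job.sh are placeholders; the queue and resource flags are the ones described above):

#!/bin/bash
#PBS -N myjob                              # job name (placeholder)
#PBS -q normal                             # exclusive node access, as above
#PBS -l nodes=1:ppn=32,walltime=48:00:00   # one full node for up to 48 hours
cd $PBS_O_WORKDIR                          # start in the directory you submitted from
./job.sh                                   # placeholder for your actual command

submit it with qsub myjob.pbs and check its status with qstat -u $USER.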

hardware configurations:

trestles consists of about 10K cores: 324 nodes, each with 32 cores and 64GB of RAM (as well as 120GB of flash disk, which is much slower than RAM but much faster than regular disk).

where to write files?

  • /home/$USER: backed up; max 10GB; use for your source code and binaries, not for data. note: you would think each user gets a home directory by default, but I had to email [email protected] to request one!
  • /scratch/$USER/$PBS_JOBID: local flash disk, so orders of magnitude faster I/O than regular/network disk; use for temporary files during the job's runtime only; it is wiped as soon as the job finishes.
  • /oasis/project/nsf/cmu134/$USER/: use for data. a sketch of a job body that stages data through the fast scratch disk follows.
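a hypothetical job body using all three locations (input.txt and ./process are placeholders):

TMP=/scratch/$USER/$PBS_JOBID                        # fast local flash disk
cp /oasis/project/nsf/cmu134/$USER/input.txt $TMP/   # stage input to scratch
cd $TMP
./process input.txt > output.txt                     # work on the fast disk
cp output.txt /oasis/project/nsf/cmu134/$USER/       # copy results back before the job ends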

official documentation

stampede

login:

  • create an account at https://portal.xsede.org/
  • send your username to Noah/Chris, they’ll put in the request to add you
  • it takes at least 30 minutes for the user information to be updated
  • ssh [email protected]
  • gsissh stampede

available resources:

  • $HOME : 5GB, backed up
  • $WORK : 400GB, not backed up, permanent; the alias cdw takes you there
  • $SCRATCH : 2PB, not backed up, high-speed, purged after 30 days; the alias cds takes you there
  • the cluster contains 6,400+ nodes with 32GB RAM each, plus 16 nodes with 1TB RAM
  • max run time (for the normal queue): 48h

queues:

DO.run script:

#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run a job
# on TACC's Stampede system.
#----------------------------------------------------
#SBATCH -J test_job       # Job name
#SBATCH -o test_job.o%j   # Name of stdout output file (%j expands to jobId)
#SBATCH -e test_job.e%j   # Name of stderr output file (%j expands to jobId)
#SBATCH -p normal      # Submit to the 'normal' or 'development' queue
#SBATCH -N 1              # Total number of nodes requested (16 cores/node)
#SBATCH -n 1                # Total number of mpi tasks requested
#SBATCH -t 00:30:00         # Run time (hh:mm:ss) - 0.5 hours

# Run the job
./job.sh
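
  • to submit the script: sbatch DO.run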
  • to kill a job: scancel <jobId>

applications:

  • to see all available modules: module spider
  • boost: boost/1.51.0
  • loading required modules and installing software (locally or on the cluster) works the same as on trestles; a short example follows
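a short example, using the boost version listed above:

module spider boost          # search the available modules for boost
module load boost/1.51.0     # load it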

documentation:

useful tools

linux

  • find the total/used/available space on various disks and mount points by running df -h
  • check which processes are using shared memory by running lsof | grep /run/shm

current courses to look out for:

nonlinear optimization (18-799)

five sections:

  • Xavier - GHC :: 4307 - Tue, Thu - 09:00AM to 10:20AM
  • Ramakrishnan - GHC :: 4101 - Mon, Wed - 04:30PM to 05:50PM
  • Cai - GHC :: 4215 - Tue, Thu - 03:00PM to 04:20PM
  • Mengshoel - INI :: DEC - Tue, Thu - 12:00PM to 01:20PM
  • Sankaranarayanan - BH :: A35 - Tue, Thu - 09:00AM to 10:20AM

http://users.isr.ist.utl.pt/~jxavier/NonlinearOptimization18799-2011.html

how to write fast code (18-645)

Lane; Chong - BH :: A51 - Mon - 03:30PM to 04:20PM

http://users.ece.cmu.edu/~pueschel/teaching/18-645-CMU-spring08/course.html

machine translation (11-731)

http://demo.clab.cs.cmu.edu/sp2014-11731/

corpora

LDC corpora

from: Alex Rudnicky [email protected]

to: LTI-faculty-all [email protected], "[email protected]" [email protected]

date: Fri, Jan 24, 2014 at 5:28 PM

subject: LDC holdings at LTI

This is a periodic reminder to all that CMU has a (more-or-less) complete collection of Linguistic Data Consortium (LDC) corpora available to everyone, for educational or research purposes, within the University. Due to licensing, the collection is only accessible from CMU IP addresses. Go to http://www.speech.cs.cmu.edu/inner/LDC/LDC/table.html

Angela Luck [email protected] is the official librarian for the collection. You should contact her for lending and other issues.

LDC corpora were originally focused on the needs of the speech community, but over time have come to include materials of interest to the text, video and other communities. Until a few years ago acquisition was directly subsidized by the Speech Group (but available to all). More recently this role has been transferred to the LTI, which assesses ongoing contracts that use LDC corpora to meet the annual dues for our LDC membership (btw, we’re a charter member!). The Speech Group continues to pay for the server on which the data are kept.

The collection is not complete. One reason is that in the early days, the LDC allowed members only a fixed number of corpora per year, and we acquired only those that were relevant to ongoing projects. If you need one of the missing ones, be prepared to contribute the cost from your project.

Before disk storage became cheap, we only had CDs of corpora. People borrowed these; not everyone returned them. This is the other reason some corpora are missing. In many cases we have the borrower's name; it's in the square brackets at the end of the entry description. Feel free to hunt them down or otherwise vocally bring this issue up in their presence. The end goal is to get the disc(s) back into the collection so that the data are available to all.

We have some other, non-LDC corpora in the collection. Most of these are in speech (since that's nominally my field). If you have corpora that might be of interest to others (believe me, they will be), please feel free to contribute copies to this collection.

other places to look for corpora already acquired

  • allegro:/usr0/corpora/
  • allegro:/usr1/corpora/
  • allegro:/mal1/corpora/
  • allegro:/mal2/corpora/ (largest collection)
  • allegro:/cab1/corpora/
