To know is to know that you know nothing; that is the meaning of true knowledge. -- Socrates
- purpose: this repository shares, among members of clab, knowledge (well, and pieces of information too) that is not suitable for sharing elsewhere.
- editing: use the Markdown format to edit this file. help can be found at http://stackoverflow.com/editing-help . feel free to create a new file if need be, but make sure to link to it from this one for easy access. the [GFL guidelines](https://github.com/brendano/gfl_syntax/blob/master/guidelines/guidelines.md) are a good example of a nicely written .md file, in case you're not familiar with Markdown syntax.
- a paper needs to be organized. here are some tricks for checking the global structure of your paper: read the section headings; do they tell the story you want? read the first sentence or two of each section, one after the next; does the sequence tell your paper's story? this advice becomes more important as your paper gets longer.
- 2015-05-29: Transition-Based Dependency Parsing with Stack Long Short-Term Memory, by Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah Smith. ACL 2015 (attached to an email Chris sent to the group on 2015-05-22)
- 2015-06-12: [A Joint Model for Entity Analysis: Coreference, Typing, and Linking](http://www.eecs.berkeley.edu/~gdurrett/papers/durrett-klein-tacl2014.pdf) by Greg Durrett and Dan Klein. TACL
- 2015-06-19: [A Recurrent Latent Variable Model for Sequential Data](http://arxiv.org/abs/1506.02216) by Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. arXiv.
- 2015-06-26: [Generative Adversarial Nets](http://arxiv.org/pdf/1406.2661v1.pdf) by Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. arXiv.
- 2015-07-03: [Structured Training for Neural Network Transition-Based Parsing](http://www.petrovi.de/data/acl15.pdf) by David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. ACL 2015
Ideas for papers?
- TBD: Towards AI-complete question answering: A set of prerequisite toy tasks by Weston, Mikolov, Bordes, Chopra.
- 2015-02-06: Deep Canonical Correlation Analysis by Andrew et al. ICML 2013
- 2015-02-21: Target Language Adaptation of Discriminative Transfer Parsers Täckström, McDonald, Nivre, NAACL'13
- 2014-09-04: Reducing the sampling complexity of topic models by Li et al. KDD 2014
- 2014-09-11: Dynamic Topic Adaptation for Phrase-based MT Hasler, Blunsom, Koehn, and Haddow; Polylingual Topic Models Mimno et al.
- 2014-10-09: Climbing the Tower of Babel: Unsupervised Multilingual Learning by B. Snyder, R. Barzilay, ICML'10
- 2014-11-11: A Theory of Learning with Similarity Functions by Balcan, Blum, Srebro. Machine Learning'08
- 2014-12-04: Search-Aware Tuning for Machine Translation by Lemao Liu and Liang Huang. EMNLP'14
- 2014-12-11: Domain Adaptation under Target and Conditional Shift by Zhang et al. ICML'13
- 2014-05-22: [Dependency grammar induction via bitext projection constraints](http://dl.acm.org/citation.cfm?id=1687931) K. Ganchev, J. Gillenwater, and B. Taskar, 2009. ACL.
- 2014-05-29: Two Step CCA: A new spectral method for estimating vector models of words Dhillon et al. (Yes, this is monolingual, but yes, we will discuss a multilingual extension)
- 2014-06-05: [Syntactic Transfer Using a Bilingual Lexicon](http://www.eecs.berkeley.edu/~gdurrett/DurPauKle_emnlp12.pdf) G. Durrett, A. Pauls, and D. Klein. EMNLP'12.
- 2014-06-12: [Unsupervised Induction of Cross-lingual Semantic Relations](http://www.aclweb.org/anthology/D13-1064) M. Lewis and M. Steedman. EMNLP'13
- 2014-06-19: A New Approach to Lexical Disambiguation of Arabic Text R. Shah et al.
- 2014-07-03: Accurate Language Identification of Twitter Messages Lui and Baldwin.
- 2014-07-17: Empirical Comparison of Features and Tuning for Phrase-based Machine Translation S. Green, D. Cer, and C. Manning
- 2014-08-14: Linear Mixture Models for Robust Machine Translation by Marine Carpuat, Cyril Goutte and George Foster
- 2014-08-21: Natural Language Processing (Almost) from Scratch by Collobert et al. The Journal of Machine Learning Research. 2011
- 2014-02-04: [Structured Sparsity in Structured Prediction](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.6262&rep=rep1&type=pdf). 2011. A. F. T. Martins, N. A. Smith, P. M. Q. Aguiar, and M. A. T. Figueiredo
- 2014-02-11: [Convolution Kernels for Natural Language](http://books.nips.cc/papers/files/nips14/AA58.pdf). 2002. Michael Collins and Nigel Duffy
- 2014-03-04: [Large Margin Classification Using the Perceptron Algorithm](http://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf)
- 2014-03-18: [Unsupervised Induction of Cross-lingual Semantic Relations](http://aclweb.org/anthology/D/D13/D13-1064.pdf). Lewis and Steedman, 2013. EMNLP.
- 2014-03-25: [Efficient Inference for Distributions on Permutations](http://papers.nips.cc/paper/3183-efficient-inference-for-distributions-on-permutations.pdf). Huang, Guestrin, and Guibas, 2007. NIPS.
- 2014-04-01: [Combining labeled and unlabeled data with co-training](http://www.u.arizona.edu/~echan3/539/BlumMitchell98.pdf). Blum and Mitchell, 1998. Conference on Computational Learning Theory.
- 2014-04-08: [Optimal Beam Search for Machine Translation](http://people.csail.mit.edu/srush/optbeam.pdf). Rush, Chang, and Collins.
- 2014-04-15: [A Convolutional Neural Network for Modelling Sentences](http://nal.co/papers/Kalchbrenner_DCNN_ACL14). Kalchbrenner, Grefenstette, and Blunsom. 2014. ACL.
- 2013-10-22: [Adaptor grammars for learning non-concatenative morphology](http://aclweb.org/anthology/D/D13/D13-1034.pdf). 2013. Jan Botha and Phil Blunsom. In proc. of EMNLP.
- 2013-11-26: [Diverse M-Best Solutions in Markov Random Fields](https://filebox.ece.vt.edu/~dbatra/papers/MBestModes.pdf). 2012. Batra et al. In proc. of Computer Vision–ECCV.
- 2013-12-03: [A fast and simple algorithm for training neural probabilistic language models](http://www.stats.ox.ac.uk/~teh/research/compling/MniTeh2012a.pdf). 2012. Mnih and Teh. In proc. of ICML.
- software packages of general interest should be installed at `allegro:/opt/tools`, and a modulefile (learn more about Environment Modules [here](http://modules.sourceforge.net/)) should be added at `allegro:/opt/modulefiles` so that other people can find the package by executing `module avail`. a sketch of this workflow follows below.
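a minimal sketch of that workflow, assuming a hypothetical tool `mytool` that builds with the usual configure/make steps (the name, version, and file names are placeholders, not an actual installed tool):

```bash
# hypothetical example: install "mytool" 1.0 under /opt/tools
./configure --prefix=/opt/tools/mytool/1.0
make && make install

# write a matching modulefile (a small Tcl script read by the `module` command)
mkdir -p /opt/modulefiles/mytool
cat > /opt/modulefiles/mytool/1.0 <<'EOF'
#%Module1.0
prepend-path PATH            /opt/tools/mytool/1.0/bin
prepend-path LD_LIBRARY_PATH /opt/tools/mytool/1.0/lib
EOF

# now `module avail` lists mytool/1.0, and `module load mytool/1.0` activates it
```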
- corpora can be found at `allegro:/usr1/corpora`, `allegro:/cab1/corpora`, `allegro:/mal1/corpora`, or `allegro:/mal2/corpora`
- when something goes wrong with allegro, email [email protected] or call SCS Operations at 412-268-2608 (Open 24 x 7)
- logging in to trestles, one of three ways:
  - `ssh [email protected]`, providing your xsede.org portal password, followed by `gsissh trestles`
  - `ssh [email protected]`, providing your trestles-specific password. email [email protected] to obtain your autogenerated trestles password, then change it at https://passive.sdsc.edu/
  - `ssh [email protected]`, providing your trestles-specific password. this allows you to specify which of trestles' three login servers to use.
- find already-installed software packages by running `module avail`
- if you can't find a dependency you need via `module avail`, you can either (see the sketch below for the first option):
  - install it yourself in your home directory, or
  - ask [email protected] to install it for you (especially if it's a generally useful tool); expect delays. also, they don't always agree :-/
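a minimal sketch of a home-directory install, assuming a standard autotools-style package (`somelib-1.2` is a placeholder name):

```bash
# hypothetical example: build a missing dependency into $HOME/local
mkdir -p $HOME/local
tar xzf somelib-1.2.tar.gz
cd somelib-1.2
./configure --prefix=$HOME/local
make -j8 && make install

# make the locally installed binaries/libraries visible
# (add these lines to ~/.bashrc to make them permanent)
export PATH=$HOME/local/bin:$PATH
export LD_LIBRARY_PATH=$HOME/local/lib:$LD_LIBRARY_PATH
```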
- trivial jobs (e.g. compiling, computing BLEU) can be run directly on the trestles login machine.
- otherwise, use `qsub`, specifying one of two queues: `-q normal`, which gives exclusive access to nodes, or `-q shared`, which gives shared access to nodes (e.g. allows you to request only 8 of the 32 cores on a node; useful for running a debugger on the same node). a sample batch script is sketched below.
- you can run interactive jobs as follows: `qsub -I -q normal -l nodes=1:ppn=32,walltime=48:00:00`. 48 is the maximum number of hours you can request for interactive jobs. to avoid losing your job if the login server closes your ssh connection (e.g. due to long inactivity), log in to a specific login server and start a `/home/diag/glock/user/screen/bin/screen` session before submitting the `qsub -I` job.
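for non-interactive jobs, a batch script submitted with `qsub` might look like the sketch below. this mirrors the stampede `DO.run` example later in this file; the job name and `job.sh` are placeholders, and the directives are standard PBS syntax:

```bash
#!/bin/bash
#PBS -N test_job            # job name
#PBS -q normal              # 'normal' (exclusive) or 'shared'
#PBS -l nodes=1:ppn=32      # one node, all 32 cores
#PBS -l walltime=48:00:00   # max run time (hh:mm:ss)
#PBS -o test_job.out        # stdout file
#PBS -e test_job.err        # stderr file

# qsub starts jobs in $HOME; go back to the submission directory
cd $PBS_O_WORKDIR
./job.sh
```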
- trestles consists of about 10K cores; each node has 32 cores and 64GB of RAM (as well as 120GB of flash disk, which is much slower than RAM but much faster than regular disk).
- disk spaces:
  - `/home/$USER` (backed up; max 10GB; use for your source code/binaries; don't use for data). note: you would think each user gets a home directory by default, but I had to email [email protected] to request one!
  - `/scratch/$USER/$PBS_JOBID` (local flash disk, so orders of magnitude faster I/O than regular/network disk space; use for temporary files during the job's runtime only; it's wiped as soon as the job finishes; see the staging sketch below)
  - `/oasis/project/nsf/cmu134/$USER/` for data
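since the scratch space is wiped when the job ends, a common pattern inside a job script is to stage data through it and copy results back before finishing. a sketch (the file names and `job.sh` are placeholders):

```bash
# stage input onto the fast local flash disk
SCRATCH_DIR=/scratch/$USER/$PBS_JOBID
cp /oasis/project/nsf/cmu134/$USER/input.txt $SCRATCH_DIR/

# run against local scratch for fast temporary I/O
cd $SCRATCH_DIR
./job.sh input.txt > output.txt

# copy results back to permanent storage before the job finishes
cp output.txt /oasis/project/nsf/cmu134/$USER/
```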
- create an account at https://portal.xsede.org/
- send your username to Noah/Chris, they’ll put in the request to add you
- it takes at least 30 minutes for the user information to be updated
- `ssh [email protected]`
- `gsissh stampede`
- `$HOME`: 5GB, backed up
- `$WORK`: 400GB, not backed up, permanent; jump there with `cdw`
- `$SCRATCH`: 2PB, high-speed, not backed up, purged after 30 days; jump there with `cds`
- the cluster contains 6,400+ nodes with 32GB RAM each, plus 16 nodes with 1TB RAM
- Max run time (for normal queue): 48h
- available queues: https://portal.xsede.org/web/xup/tacc-stampede#running-table1
- to see all queued jobs: `showq`; to see only your own: `showq | grep $USER`
- interactive shell: `srun -p development -t 0:30:00 -n 32 --pty /bin/bash -l`
- to run a job: `sbatch ./DO.run`
a sample `DO.run` script:

```bash
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run a job
# on TACC's Stampede system.
#----------------------------------------------------
#SBATCH -J test_job       # Job name
#SBATCH -o test_job.o%j   # Name of stdout output file (%j expands to jobId)
#SBATCH -e test_job.e%j   # Name of stderr output file (%j expands to jobId)
#SBATCH -p normal         # Submit to the 'normal' or 'development' queue
#SBATCH -N 1              # Total number of nodes requested (16 cores/node)
#SBATCH -n 1              # Total number of MPI tasks requested
#SBATCH -t 00:30:00       # Run time (hh:mm:ss) - 0.5 hours

# Run the job
./job.sh
```
- to kill a job: `scancel <jobId>`
- to see all available modules: `module spider`
- boost: `boost/1.51.0`
- loading required modules, and installing software locally or on the cluster, works the same as on trestles; see the example below.
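for example, to load the boost version listed above (only `boost/1.51.0` comes from this file; the other commands are the standard module workflow also mentioned above):

```bash
module spider boost        # see which boost versions exist on the cluster
module load boost/1.51.0   # load the version mentioned above
module list                # confirm what is currently loaded
```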
- https://portal.xsede.org/web/xup/tacc-stampede
- `man slurm`
- `man sbatch`
- find the total/used/available space on various disks and mount points by running `df -h`
- check which processes are using shared memory by running `lsof | grep /run/shm`
five sections:
- Xavier - GHC 4307 - Tue, Thu - 09:00AM to 10:20AM
- Ramakrishnan - GHC 4101 - Mon, Wed - 04:30PM to 05:50PM
- Cai - GHC 4215 - Tue, Thu - 03:00PM to 04:20PM
- Mengshoel - INI DEC - Tue, Thu - 12:00PM to 01:20PM
- Sankaranarayanan - BH A35 - Tue, Thu - 09:00AM to 10:20AM
http://users.isr.ist.utl.pt/~jxavier/NonlinearOptimization18799-2011.html
Lane; Chong - BH A51 - Mon - 03:30PM to 04:20PM
http://users.ece.cmu.edu/~pueschel/teaching/18-645-CMU-spring08/course.html
http://demo.clab.cs.cmu.edu/sp2014-11731/
from: Alex Rudnicky [email protected]
to: LTI-faculty-all [email protected], "[email protected]" [email protected]
date: Fri, Jan 24, 2014 at 5:28 PM
subject: LDC holdings at LTI
This is a periodic reminder to all that CMU has a (more-or-less) complete collection of Linguistic Data Consortium (LDC) corpora available to everyone, for educational or research purposes, within the University. Due to licensing, the collection is only accessible from CMU IP addresses. Go to http://www.speech.cs.cmu.edu/inner/LDC/LDC/table.html
Angela Luck [email protected] is the official librarian for the collection. You should contact her for lending and other issues.
LDC corpora were originally focused on the needs of the speech community, but over time have come to include materials of interest to the text, video and other communities. Until a few years ago acquisition was directly subsidized by the Speech Group (but available to all). More recently this role has been transferred to the LTI, which assesses ongoing contracts that use LDC corpora to meet the annual dues for our LDC membership (btw, we’re a charter member!). The Speech Group continues to pay for the server on which the data are kept.
The collection is not complete. One reason is that in the early days, the LDC allowed members only a fixed number of corpora per year, and we acquired only those that were relevant to ongoing projects. If you need one of the missing ones, you should be prepared to contribute the cost from your project.
Before disk storage became cheap, we only had CDs of corpora. People borrowed these; not everyone turned them in. This is the other reason some corpora are missing. In many cases we have the borrower's name; it's in square brackets at the end of the entry description. Feel free to hunt them down or otherwise vocally bring this issue up in their presence. The end goal is to get the disc(s) back into the collection so that the data are available to all.
We have some other corpora in the collection. Most of these are in speech (since that's nominally my field). If you have corpora that might be of interest to others (believe me, they will be), please feel free to contribute copies to this collection.
- `allegro:/usr0/corpora/`
- `allegro:/usr1/corpora/`
- `allegro:/mal1/corpora/`
- `allegro:/mal2/corpora/` (largest collection)
- `allegro:/cab1/corpora/`