Code For Cash finds freelance programming jobs, largely by scraping websites. To match these jobs to our subscribers, we have to apply some labeling. Currently, human workers do the labeling, but to save money we have to make the move to machine learning.
Here’s how I handled the initial migration to fastText:
- I took the labeled data from our MySQL database and dumped it into a TSV file.
mysql -ucfcjobs -pPASSWORD -hOURDATABASE.us-east-1.rds.amazonaws.com cfcjobs -e 'select * from gig_opportunity_meta' > gom_table.tsv
- I wrote a Python script to normalize the job ad text (combining title and description), lowercasing it, removing punctuation, and removing stopwords. I then wrote one lead per line, prefixed with the labels I already knew.
(Here’s the script)
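As a rough idea of what the normalization step looks like, here is a minimal sketch. The column names (`labels`, `title`, `description`), file names, and the tiny inline stopword list are assumptions for illustration; our real script uses NLTK's English stopword list.

```python
import csv
import string

# Stand-in stopword list for this sketch; the real script uses NLTK's.
STOPWORDS = {"a", "an", "the", "and", "or", "to", "for", "of", "in", "on", "is"}

def normalize(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def convert(tsv_in="gom_table.tsv", txt_out="normalized_gom"):
    """Write one lead per line: __label__ prefixes first, then the text,
    which is the input format fastText's supervised mode expects."""
    with open(tsv_in, newline="") as f, open(txt_out, "w") as out:
        for row in csv.DictReader(f, delimiter="\t"):
            labels = " ".join("__label__" + l for l in row["labels"].split(","))
            text = normalize(row["title"] + " " + row["description"])
            out.write(labels + " " + text + "\n")
```

The `__label__` prefix is fastText's default marker for supervised training labels; everything else on the line is treated as input text.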
- I then ran fastText to create a model:
./fasttext supervised -input normalized_gom -output model_gom -lr 1.0 -epoch 25 -wordNgrams 3 -bucket 200000 -dim 50 -loss hs
This created the model file, model_gom.bin.
- I used EC2 to create an Amazon Linux micro instance (free tier). I then installed git (sudo yum install git), installed a C++ compiler (sudo yum groupinstall "Development Tools"), downloaded fastText, and compiled it with "make".
- I then used scp to copy the compiled fastText binary back to my local machine, and moved this Amazon Linux build into a folder called lambda. I also copied model_gom.bin into that folder.
- I then installed nltk locally into the lambda directory: pip install nltk -t /Users/zackburt/fastText/lambda. I also ran nltk.download() to download nltk_data into a local directory.
- I then wrote a serverless script that could ingest the job ad text and return a string with the predicted labels. Here it is: https://gist.github.com/zackster/649555aa3d4e6d6b046627d93490b0d6
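The gist has the full details; the core of such a handler looks roughly like the sketch below. The function names, the event shape (`event["text"]`), and the top-3 label count are assumptions for illustration, not the exact gist, and the NLTK-based normalization step is omitted here.

```python
import json
import subprocess

def parse_labels(stdout):
    """fastText prints e.g. '__label__python __label__remote'; strip the prefix."""
    return [l.replace("__label__", "") for l in stdout.split()]

def predict_labels(text, model="model_gom.bin", binary="./fasttext"):
    """Pipe one normalized job ad into the bundled fastText binary on stdin
    ('-') and ask for the top 3 labels."""
    out = subprocess.run(
        [binary, "predict", model, "-", "3"],
        input=text + "\n", capture_output=True, text=True, check=True,
    )
    return parse_labels(out.stdout)

def handler(event, context):
    # Lambda entry point: job ad text in, predicted labels out.
    labels = predict_labels(event["text"])
    return {"statusCode": 200, "body": json.dumps({"labels": labels})}
```

Shelling out to the precompiled binary is what makes the Amazon Linux build from the previous step necessary: the binary bundled in the zip must match Lambda's execution environment, not your laptop's.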
- Then, from within the lambda directory, I ran "zip -r ~/fasttext-lambda.zip *". This created a zip archive in my home directory. It's very important that when you zip up serverless components for AWS Lambda, your archive's root isn't a folder; i.e., when unzipped, the contents should land in the current directory rather than in a subfolder. Amazon's error messages don't yet cover this subtle point, and failing to put your handler file at the root of the zip (NOT inside a folder at the root) will result in confusion.
- I then used s3cmd to upload the zip archive to an S3 bucket.
- Once the file was up in S3, I created a new Lambda function and set NLTK_DATA=./nltk_data as an environment variable in the Lambda configuration. I also maxed out the memory setting: Lambda allocates CPU in proportion to memory, so the function runs about 3 times faster, even though it never uses more than about 350MB, and even at scale we should stay within the free tier. Just a little tip I learned from ServerlessConf.