Migrate Mercurial code hosting from Bitbucket to your server in 9 steps using Docker

Atlassian is dropping support for Mercurial on the popular Bitbucket service. Here is a proof of concept that uses a Docker container as a separate environment where you can self-host your code, using basic Mercurial features without bells and whistles.

To do so, a Docker container based on the popular and lightweight jdeathe/centos-ssh image will be used. This example assumes a remote server with the Docker service up and running.

1. Generate a public/private key pair

Create a new key pair to authenticate to the new container. Protect it with a passphrase so it can be deployed on external servers safely. In this example, an Ed25519 (EdDSA) key is used.

ssh-keygen -t ed25519 -C "Key for xxx at xxx on xxx"

2. Choose keys and passwords

Choose a name for your new container here:

export SSHCONTAINER=mycodehosting.example.org

Create a new file named .env with the following content, filling in:

  • the content of the generated .pub file in the SSH_AUTHORIZED_KEYS row
  • a strong password to switch from hg to root via sudo su - root
  • your timezone
SSH_AUTHORIZED_KEYS=*******PASTE PUB KEY here ***********
SSH_CHROOT_DIRECTORY=%h
SSH_INHERIT_ENVIRONMENT=false
SSH_PASSWORD_AUTHENTICATION=false
SSH_SUDO=ALL=(ALL) ALL
SSH_USER=hg
SSH_USER_FORCE_SFTP=false
SSH_USER_HOME=/home/%u
SSH_USER_ID=500:500
SSH_USER_PASSWORD=*******STRONG PASSWORD HERE (without ")***********
SSH_USER_PASSWORD_HASHED=false
SSH_USER_PRIVATE_KEY=
SSH_USER_SHELL=/bin/bash
SYSTEM_TIMEZONE=********YOUR TIMEZONE HERE e.g. Europe/Rome***********

This configuration:

  • Allows connection using the private key generated before
  • Disables password authentication
  • Sets the default user name to hg
  • Allows the hg user to gain root via sudo (there will be only the hg user)
  • Sets the server to your preferred timezone

3. Create the centos-ssh container

In the same directory where the .env file resides, create the new container:

docker run -d \
  --name $SSHCONTAINER \
  -p 12120:22 \
  --env-file .env \
  -v /opt/path/to/some/host/dir:/home \
  jdeathe/centos-ssh

This command will:

  • create a detached container named $SSHCONTAINER
  • expose the container on host port 12120. If you want to limit it to localhost only, use 127.0.0.1:12120:22; otherwise the port will be bound to 0.0.0.0, because Docker sets up its own iptables rules. You can also disable iptables management in Docker.
  • map the whole container /home directory to a new directory created by root on the host, /opt/path/to/some/host/dir

Note: do not use ACLs (e.g. setfacl) on /opt/path/to/some/host/dir, or the .ssh directory will break (e.g. "Bad owner or permissions" errors).

4. Install Mercurial on the container

Now install Mercurial and its dependencies on the container. You can log in as root using docker:

docker exec -it $SSHCONTAINER bash

or save this script, make it executable with chmod a+x, and launch it:

#!/bin/bash
set -e
# install the build dependencies needed to compile Mercurial
docker exec -it -u root $SSHCONTAINER yum install -y python36-devel python36-setuptools gcc
# install Mercurial itself via pip3
docker exec -it -u root $SSHCONTAINER /usr/bin/pip3 install mercurial

Restart the container:

docker container restart $SSHCONTAINER

Then check that Mercurial runs for the hg user:

docker exec -it -u hg $SSHCONTAINER hg --version

If the container is running smoothly, you can update it to always restart on reboot or when the Docker service restarts:

docker container update $SSHCONTAINER --restart always

Then check that the policy is applied:

docker container inspect $SSHCONTAINER | grep -B0 -A3 RestartPolicy

5. Log in to the container directly

Now, from your local machine, you can connect directly to the container using SSH, without caring about the host.

By default, an iptables rule is created by Docker to allow connections from outside. You still have to specify the port and the user name in .ssh/config, like this:

Host mycodehosting.example.org
    Hostname mycodehosting.example.org
    User hg
    Port 12120
    PreferredAuthentications publickey
    IdentityFile /home/chirale/.ssh/id_ed25519_mycodehosting_example_org

This configuration is useful when you create a subdomain exclusively to host code: you then associate a port and a user name with it to obtain a Mercurial URL like this:

ssh://hg@mycodehosting.example.org/test/project

where the test and project directories live directly inside the /home/hg directory of the container, i.e. on the host at /opt/path/to/some/host/dir/hg/test/project. Unlike Bitbucket, you can nest a project under as many directory levels as you want.

6. Create a test repo

Create a test repository inside this container. You can access it from anywhere with the above SSH configuration using:

ssh mycodehosting.example.org

Then you can:

cd repo/
ls
mkdir alba
cd alba/
hg init
hg verify
checking changesets
checking manifests
crosschecking files in changesets and manifests
checking files
checked 0 changesets with 0 changes to 0 files
cat > README.txt
Read
this
CTRL+D
hg addremove
adding README.txt
hg commit -m "First flight"
abort: no username supplied
(use 'hg config --edit' to set your username)
hg config --edit
hg commit -m "First flight"

7. Clone the test repo

Then, from anywhere, you can clone the repo:

hg clone ssh://mycodehosting.example.org/repo/alba

You can commit and push now.
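
For example, a minimal edit-commit-push cycle from the fresh clone (the README change is just an illustration):

cd alba
echo "more documentation" >> README.txt
hg commit -m "Second flight"
hg push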

If you log in to mycodehosting.example.org after pushing, you will see that no new file appears in the working directory. You simply have to run

hg update

to get it. Note that you don't have to run update every time you push new commits on alba to mycodehosting.example.org: all changes are recorded in the repository, they are just not yet reflected in the directory structure inside the container.

If this is a problem for you, you can automate the update every time hg receives a new changeset, for example with the supervisor service shipped with centos-ssh, or with a Mercurial hook as sketched below.
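
A minimal sketch of the hook approach, assuming the repository lives at /home/hg/repo/alba inside the container: add a changegroup hook to the server-side copy so the working directory updates after every push.

# /home/hg/repo/alba/.hg/hgrc inside the container
[hooks]
# run hg update after every group of incoming changesets (e.g. a push)
changegroup = hg update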

Compare these two hg summary outputs:

hg summary
parent: 1:9f51cd87d912 tip
 Second flight
branch: default
commit: (clean)
update: (current)

hg summary
missing pager command 'less', skipping pager
parent: 1:9f51cd87d912
 Second flight
branch: default
commit: (clean)
update: 1 new changesets (update)

The first shows update: (current) (nothing to update), the second shows update: 1 new changesets (update).

8. Migrate the code from Bitbucket to the self-hosted server

From the container, logged in as the hg user, temporarily import your SSH key to download the repository from the old Bitbucket location, following the Bitbucket docs, then:

cd ~
mkdir typeofproject
cd typeofproject
hg clone ssh://hg@bitbucket.org/yourbbuser/youroldbbrepo

Then you can alter the directory as you like:

  • edit the .hg/hgrc file, changing parameters as you like (see the example below)
  • rename the youroldbbrepo directory
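
On existing local clones of the old repository, for example, you can point the default path to the new location by editing .hg/hgrc; the typeofproject/youroldbbrepo path below simply mirrors the layout created above:

[paths]
default = ssh://hg@mycodehosting.example.org/typeofproject/youroldbbrepo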

Remember that the SSH keys and config needed to access Bitbucket, if any, have to be stored temporarily on the container (permissions should be 600). You can remove these keys when the migration is done.

After a test clone you can drop the Bitbucket repo.

9. Find your flow

With a self-hosted solution you have to maintain the service. This is a relatively simple solution to set up and maintain.

If you are comfortable with the old Bitbucket commit web display, you can use PyCharm to see a similarly nice commit tree.

Tested on release 2.6.1 with centos:7.6.1810.

 

Guide to migrate a Drupal website to Django after the release of Drupal 8

I have maintained a news website written in Drupal since 2007. It is now a Drupal 6 site; before that it was a Drupal 5. I have made many Drupal 7 installations over the years and attended three local Drupal conventions. This is a guide on how to abandon Drupal if you already know some basics of Django and Python.

Drupal on LAMP: lessons learned

  • PHP is for (not so) fast development, but maintainability can be a pain.
  • Drupal tries to overcome PHP limits, with mixed results.
  • Apache cannot stand heavy traffic without an accelerator like Varnish and time-consuming ad-hoc configurations. If traffic increases enough, Apache cannot stand it at all.
  • Drupal contrib modules are a mix of high-quality tools (like Webform or Views Datasource) and badly written projects. The more modules are enabled, the more the project loses in maintainability. This is not so evident if you have never seen any other open source project.

This is not the one true picture, just my experience over these 8 years. I feel like a more confident Python programmer than PHP programmer, despite having spent less than one-third of those years working with it. At the end of the article I cite articles written by programmers who feel the same uneasiness as mine when working on PHP and Drupal after trying other tools.

Django experiences

In the last few years, with Drupal still paying most of my bills, I used the Django MVC framework, written in Python, for three projects: an e-mail application, a real estate catalog and a custom-made CRM. One of these is a port of something written in PHP on Drupal 5. In all three projects I was very happy with the maintainability, the clearness of the code and the high-level, well-written packages I found while exploring it, like Tastypie and many Python packages found in the "cheese shop" (PyPI).

Even considering that I am the only developer of these, I haven't experienced the frustration I feel on Drupal when trying to make something work as I designed it, or when trying to fix code I wrote some time ago. I know that a CMS is at a higher level than a framework; simply, some projects are not suited for Drupal, and I feel more comfortable with Python than with PHP these days.

At the time I write this, Drupal 8 is out as a Release Candidate. I have done migrations from 5 to 6 and from 6 to 7 on some websites in the past. Migrating to a new major version is not a science, it's a sort of mystical art. According to the Drupal announcement, Drupal 6 will become unsupported 3 months after Drupal 8 is out, since only the current and previous versions are supported: 8.x and 7.x once 8 is released. Keeping a Drupal 6 site running after that term will be risky.

Choosing the stack

Back to the news website I maintain: the choice is between a platform I already know well, which has proved stable and maintainable for a small/one-person team, and another I would have to learn. Django is the natural choice: it avoids the problems I have listed above and lets me reuse the solutions from past Django projects, exploring new tools in the meanwhile.

Here are the choices I made:

I decided to use Gunicorn because it is very easy to run and maintain for a Django project and you don't have to make WSGI run on nginx. Nginx sits in front of Gunicorn, serving static files and forwarding the right requests to it. Memcached is used inside Django and stores cached pages from views in volatile memory, avoiding a read from the database every time a page is requested. I avoid Varnish, even if it is a very good tool, because I want to keep the stack as simple as I can, and I'm confident nginx and Memcached will speed up the website enough. Now it is time to rewrite the Drupal-hosted website as a Django application.
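
As a rough sketch of the caching part (the 127.0.0.1:11211 location and the article_detail view are assumptions for the example, not the site's actual code), Memcached and per-view caching can be wired into Django like this:

# settings.py: point Django's cache framework at a local Memcached instance
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}

# views.py: cache the rendered page of a view for 15 minutes
from django.views.decorators.cache import cache_page

@cache_page(60 * 15)
def article_detail(request, article_id):
    ...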

Write the E-R model

If you are here, you probably have a running Drupal website you want to port to Django. Browse it like a user, then open your Content types list to identify the Entities and the Relationships, as the E-R model suggests. If your website has been running for a long time, you probably want to redesign some parts, adding, removing or merging entities into one another.

Take my news website for example. I have 15 content types + 12 vocabularies (27 entities) on Drupal. After rewriting the E-R model I have 14 models (entities), including the core ones. On the database side this translates into 199 tables for Drupal and 25 for Django, since Django usually turns an entity property into a database column. I trashed some entities and merged 4 entities into one.

From entities to models: understanding relationships

When you establish a relation between your redesigned entities you can have N:1 relations, N:N relations and 1:1 relations. A Drupal node "Article" that accepts a single term from a vocabulary named "Cheese type" translates into an N:1 relationship between the model Article (N) and the model CheeseType (1). This is the simple case, since you can translate it into a ForeignKey field on your model; in the example below, Article gets a ForeignKey field named author referencing the Author model.

from django.db import models
from tinymce import models as tinymce_models
# Authors
class Author(models.Model):
    alias       = models.CharField(max_length=100)
    name        = models.CharField(max_length=100, null=True, blank=True)
    surname     = models.CharField(max_length=100, null=True, blank=True)
# Articles
class Article(models.Model):
    author      = models.ForeignKey('Author')
    title       = models.CharField(max_length=250,null=False, blank=False)
    body        = tinymce_models.HTMLField(blank=True, default='')
# Attachments to an Article
class Attachment(models.Model):
    article       = models.ForeignKey('Article', blank=True, null=True)
    file          = models.FileField(upload_to='attachment_dir', max_length=255, blank=True, null=True)
    description   = models.TextField(null=True, blank=True)
    weight        = models.PositiveSmallIntegerField()

In the case of a list of attachments to an Article, you have a 1:N relationship between the Article model (1) and the Attachment model (N). Since the relationship is reversed, in the default Django admin interface you cannot see the attachments on the article page as-is: you have to create an Attachment and then choose, from a dropdown, the article to attach it to.

For this case, Django provides a handy administration interface called an inline to include entities in a reversed relationship (see the sketch below). This approach fixes by design something that in the Drupal world costs a lot of effort, with dozens of modules like Field Collection or workarounds like the one I wrote about in the past, and it keeps your E-R design aligned with your models. Plus, a list of all Attachments is available for free.
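
A minimal sketch of such an inline, assuming the example models above live in an app named myapp:

# admin.py: edit Attachments directly on the Article admin page
from django.contrib import admin
from myapp.models import Article, Attachment

class AttachmentInline(admin.TabularInline):
    model = Attachment
    extra = 1  # show one empty attachment row by default

class ArticleAdmin(admin.ModelAdmin):
    inlines = [AttachmentInline]

admin.site.register(Article, ArticleAdmin)

With this, each Article edit page shows its Attachments as editable rows.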

Exporting the data from Drupal

JSON is a pretty good interchange format: very fast to encode and decode, and very well supported. I'm fascinated by the YAML format, but since I have to export thousands of articles I need pure speed and solid import/export modules on both the Django and the Drupal side.

There are many export modules in the Drupal world. I'm very fond of Views Datasource, and here is how I used it:

  1. Install Views Json (part of Views Datasource): it is available for Drupal 6 and 7 and very solid
  2. Create a new view with your published nodes with the JSON Data style
    1. Field output: Normal
    2. Without Plain text (you need HTML)
    3. Json data format: Simple
    4. Without Views API mode
    5. application/json as Mime type
    6. Remove all parent / children tag name so you will have only arrays and objects
  3. Choose a path for your view
  4. Limit the view to a large number of elements, e.g. 1000
  5. Sort by node id, ascendent
  6. Add an exposed filter “greater than” Nid with a custom Filter identifier (e.g. nid)
  7. Add any field you need to import and any filter you need to limit the results
  8. Avoid caching the view
  9. Limit access to the view if you don't want to expose sensitive contents (optional)
  10. Install a plugin like JsonView (Chrome) or JsonView (Firefox) to look at the data in your browser

You will get something like this:

{
  "": [
    {
      "": {
        "nid": "30004",
        "domainsourceid": "1",
        "nodepath": "http://example.com/path/here",
        "postdate": "2014-09-17T22:18:42+0200",
        "nodebody": "HTML TEXT HERE",
        "nodetype": "drupal type",
        "nodetitle": "Title here",
        "nodeauthor": "monty",
        "nodetags": "Drupal, basketball, paintball"
      }
    },
    ...
  ]
}

Now you can reach the view appending ?nid=0 to your path. This means that any node with an id greater than 0 will be listed. With nid=0, a maximum of 1000 elements are listed. To get the other nodes you simply have to take the nid from the last record (e.g. 2478) and use it as the value of the nid parameter, obtaining something like http://example.com/myview?nid=2478.

Try it in your browser, simulating what the procedure will do for you: check the response size and adapt the number of elements (#4 above) accordingly, to avoid overloading your server, hitting the timeout, or simply storing too much data in memory when parsing. When the view response is empty, you have listed all nodes matching your filters and the parsing is complete.

In this example I've talked about nodes, but you can do the same with files, using fid as the id to pass as a parameter and to sort your rows. In the case of files you have to move the file contents as well, but it's pretty simple to import them into a custom model in Django, as sketched below.
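
As a rough sketch, assuming the Attachment model from above and a hypothetical file URL and article id (in practice both come from the exported files view), importing a single file could look like this:

import urllib
from django.core.files.base import ContentFile
from myapp.models import Article, Attachment

# hypothetical values: in practice they come from a row of the files view
file_url = 'http://mydrupalwebsite.example.com/sites/default/files/report.pdf'
art = Article.objects.get(pk=30004)

filedata = urllib.urlopen(file_url).read()
att = Attachment(article=art, description='Imported from Drupal', weight=0)
# save the downloaded bytes into the FileField; upload_to decides the final path
att.file.save(file_url.split('/')[-1], ContentFile(filedata), save=True)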

Importing data to Django

Django comes with nice export (dumpdata) and import (loaddata) commands. I have used the YAML format a lot to migrate and back up data from models, but JSON and XML are other supported formats you can try. However, in this migration I chose a custom admin (management) command to do the job. It's fast: in less than 10 minutes the procedure imported 15k+ articles, writing some logging information to a custom model on both error and success.
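
A minimal skeleton of such a management command, a sketch only and assuming Django 1.8+ style argument parsing (the import_articles.py file name is hypothetical):

# myapp/management/commands/import_articles.py (hypothetical name)
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Import Drupal nodes from the JSON view as Article objects'

    def add_arguments(self, parser):
        # node id to start from; read below as options['start'] (a list, hence .pop())
        parser.add_argument('start', nargs=1)

    def handle(self, *args, **options):
        sid = int(options['start'].pop())
        # ... the import loop shown below goes here ...

It would then be run with something like python manage.py import_articles 0.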

All the import code in my case, comments and imports included, is about 300 lines of Python. The core of the import loop for nodes destined to become Articles is this:

import json, urllib
# ...
sid = int(options['start'].pop())
reading = True
while reading:
    url = "http://mydrupalwebsite.example.com/myview?nid=%d" % (sid,)
    print url
    response = urllib.urlopen(url)
    data = json.loads(response.read())
    data = data['']
    # no data received, empty view result, quit
    if not data:
        reading = False
        break
    for n, record in enumerate(data):
        # remember the last node id read: the next request will start from it
        sid = int(record['']['nid'])
        # ... do something with data ...

In this cycle, sid is initially the start argument passed to the admin command via the command line. Then sid is set to the last record read, so when the records of the current response are exhausted a new request to myview is made, starting from the last element read.

All input and output is UTF-8 in my case. The exported strings are HTML-escaped and you have to unescape them before saving in Django:

from django.core.exceptions import ValidationError
from myapp.models import Article, Author
import HTMLParser
hp = HTMLParser.HTMLParser()
authors = Author.objects.all()
...
for n, record in enumerate(data):
    try:
        art = Article(
            title = hp.unescape(record['']['nodetitle']),
            body = record['']['nodebody'],
            author = authors.get(alias=record['']['nodeauthor'])
        )
        # run the same validation as an admin interface submit would
        art.full_clean()
        art.save()
    except ValidationError as e:
        # cannot save the element: e holds all the error data, which you can
        # save into a custom log model or print to screen
        pass
    except:
        # any other exception
        pass

In the Article() constructor a new article is declared. The title in the JSON source is named nodetitle: it is unescaped and assigned to the title CharField of Article. The nodebody is set as it is, since the destination field is a TextField holding HTML. The username nodeauthor from the JSON is used as a key to associate the already imported user with the ForeignKey field author, where the username is saved as Author.alias.

Performance gains

Here is the download time graph from Google Search Console after some months:

Download time graph, Google Search Console

You can clearly see the difference in speed, expressed in milliseconds, between 2015 (old Drupal 6 platform) and 2016 (new Django platform).

Conclusion

These are the very basics of how to prepare a migration from Drupal to Django using the Views Datasource module and a custom admin command. I described why I chose Django for this migration after years of Drupal development, suggesting some tools to do the job and introducing some basic concepts for Drupal developers who want to try Django.

I have read about Drupal enthusiasts who suffer the same uneasiness as mine after long-time Drupal / PHP development. I talk about the reasons to leave Drupal in another post.

Epilogue

  • Python was awarded Programming Language of the Year in 2018. The last year PHP received the award was 2004.
  • In October 2016 Django (Software) surpassed Drupal (Software) on Google Trends. Django has gained 4 points since then, while Drupal has lost 2, continuing its decline in popularity on Google search.

    Django vs Drupal on Google Trends. Django surpassed Drupal in October 2016.