Author: Nguyễn Đức Xinh

Shift_JIS vs UTF-8: Fixing Encoding Errors and Mojibake in Legacy Source Code

Why Does Text in COBOL and VB.NET Files Show as Garbage?

When opening .pco, .vb, or .ini files from legacy systems (especially those originating in Japan), you may see garbled characters like:

�T�u�V�X�e���敪�@�FRZ0
�T�u�V�X�e�����@�@�F��v�h�^�e

instead of readable Japanese. Missing Japanese language support or fonts is not the cause. The cause is a character encoding mismatch: the editor decodes the bytes with a different encoding than the one used to save the file.

What Is Encoding and Why Does It Matter?

Encoding is the mapping between bytes in a file and the characters displayed. The same byte sequence can represent different text depending on which encoding is used.

| Bytes (hex) | Read as Shift_JIS | Read as UTF-8 |
|---|---|---|
| 82 A0 82 B3 | あさ (morning) | Mojibake |
| E3 81 82 | Mojibake | あ |

When the editor reads a file with the wrong encoding, it misinterprets the bytes and displays mojibake (文字化け): garbled or corrupted characters.
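
The effect can be reproduced in a few lines of Python. This is a minimal sketch that assumes Python's built-in `cp932` and `utf-8` codecs; the helper name `decode_as` is only illustrative. It decodes the Shift_JIS bytes for あさ (82 A0 82 B3) and the UTF-8 bytes for あ (E3 81 82) under both encodings, substituting U+FFFD (shown as �) for bytes that do not fit.

```python
# The Shift_JIS bytes for あさ and the UTF-8 bytes for あ.
sjis_bytes = bytes.fromhex("82A082B3")
utf8_bytes = bytes.fromhex("E38182")


def decode_as(data: bytes, encoding: str) -> str:
    """Decode bytes, substituting U+FFFD for anything that does not fit."""
    return data.decode(encoding, errors="replace")


if __name__ == "__main__":
    samples = (("Shift_JIS bytes", sjis_bytes), ("UTF-8 bytes", utf8_bytes))
    for label, data in samples:
        for encoding in ("cp932", "utf-8"):
            print(f"{label} read as {encoding}: {decode_as(data, encoding)!r}")
```

Only the matching pairs (Shift_JIS bytes read as cp932, UTF-8 bytes read as utf-8) come out as readable Japanese.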

Common Encodings in Development

Shift_JIS (CP932)

  • Shift_JIS, in its Microsoft variant CP932 (code page 932, also called Windows-31J), is the standard legacy encoding for Windows in Japan.
  • Widely used in legacy systems: COBOL, VB6, VB.NET, mainframe.
  • Pros: good compatibility with older Windows, fewer bytes than UTF-8 for Japanese.
  • Cons: not internationally standardized, harder to share across multilingual systems.

UTF-8

  • UTF-8 is the standard encoding for the web and modern software.
  • Supports all languages in a single encoding.
  • Default in VSCode and Cursor, and the encoding that Git tooling and GitHub expect when displaying text.
  • Most new files are created in UTF-8.

Encoding Comparison

| Criteria | Shift_JIS (CP932) | UTF-8 |
|---|---|---|
| Character range | Japanese, Latin | Full Unicode |
| Bytes/char (Hiragana) | 2 bytes | 3 bytes |
| Bytes/char (ASCII) | 1 byte | 1 byte |
| Environment | Windows JP, legacy | Web, modern tools |
| BOM | Not standard | Optional (UTF-8 BOM) |
| Compatibility | Declining | Growing |
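
The byte counts in the comparison can be checked directly. A small sketch, assuming Python's `cp932` codec as the stand-in for Shift_JIS; `byte_lengths` is a hypothetical helper name.

```python
def byte_lengths(text: str) -> dict:
    """Return the encoded size of text in CP932 and in UTF-8."""
    return {encoding: len(text.encode(encoding)) for encoding in ("cp932", "utf-8")}


if __name__ == "__main__":
    for sample in ("あ", "A", "データ"):
        print(sample, byte_lengths(sample))
```

Hiragana and Katakana take 2 bytes each in CP932 and 3 bytes each in UTF-8, while ASCII takes 1 byte in both.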

What Causes Mojibake in Practice

Typical Scenario

  1. File is saved as Shift_JIS (e.g., program.pco, MainModule.vb).
  2. Editor (Cursor, VSCode) opens the file as UTF-8 by default.
  3. Shift_JIS bytes are decoded as UTF-8 → characters display incorrectly.

Concrete Example

COBOL file comments:

Correct content (Shift_JIS):

*     サブシステム区分   : XX
*     サブシステム名     : 会計システム
*     プログラムID     : PG001
*     プログラム名       : データチェック

When opened with wrong encoding (UTF-8):

*     �T�u�V�X�e���敪�@�FXX
*     �T�u�V�X�e�����@�@�F����V�X�e
*     �v���O�����h�c�@�@�FPG001
*     �v���O�������@�@�@�F�f�[�^�`�F�b�N
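
The garbled output above can be simulated. This is a sketch under the assumption that the file was saved as CP932 and the editor decoded it as UTF-8 with replacement characters; `simulate_mojibake` is an illustrative name, not part of any tool.

```python
def simulate_mojibake(text: str, saved_as: str = "cp932", opened_as: str = "utf-8") -> str:
    """Encode text the way the file was saved, then decode it the way the editor guessed."""
    return text.encode(saved_as).decode(opened_as, errors="replace")


if __name__ == "__main__":
    # Katakana lead bytes (0x83) are invalid in UTF-8 and become U+FFFD,
    # while trail bytes that happen to be ASCII (T, u, V, X, e) survive.
    print(simulate_mojibake("サブシステム区分"))
    print(simulate_mojibake("PG001"))  # ASCII is unaffected
```

This explains why ASCII fragments such as PG001 stay readable while the Japanese text around them does not.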

How to Fix Mojibake in Cursor / VSCode

Step 1: Reopen with Correct Encoding

  1. Click the current encoding (e.g. UTF-8) in the status bar (bottom-right).
  2. Select "Reopen with Encoding".
  3. Choose "Japanese (Shift JIS)" or type 932 / shiftjis.
  4. The file will display Japanese correctly.

Step 2: Convert to UTF-8 (Optional)

After confirming the content is correct:

  1. Click the encoding in the status bar again and select "Save with Encoding".
  2. Select "UTF-8".
  3. The file is converted to UTF-8, suitable for Git and modern tools.

Step 3: Configure Default Encoding for the Project

Add .vscode/settings.json:

{
  "files.encoding": "shiftjis",
  "files.autoGuessEncoding": true
}

  • files.encoding: default encoding used when reading and writing files.
  • files.autoGuessEncoding: tries to detect the encoding when a file is opened (may be inaccurate for small files).

Note: For legacy projects that are entirely Shift_JIS, use "shiftjis". For new or mixed projects, use "utf8" and apply Shift_JIS only when needed.

Step 4: Handle Files by Type

Some file types need special handling:

| File type | Common encoding | Notes |
|---|---|---|
| .pco, .COB, .CBL | Shift_JIS | Pro*COBOL, mainframe |
| .vb | Shift_JIS | VB.NET legacy |
| .ini | Shift_JIS | Windows JP config |
| .xml | Shift_JIS or UTF-8 | Check declaration |
| Message.XML | Shift_JIS | Message resource |
| config.xml | Shift_JIS | DB / app config |

For XML files, check the first line:

<?xml version="1.0" encoding="Shift_JIS"?>

If you see encoding="Shift_JIS" or encoding="Windows-31J", the file uses CP932.
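
For many XML files this check can be scripted. The sketch below assumes the declaration is written in an ASCII-compatible encoding (true for Shift_JIS and UTF-8, not for UTF-16); the function name `declared_xml_encoding` is hypothetical.

```python
import re

# Matches the encoding attribute inside an XML declaration at the start of a file.
_XML_DECL = re.compile(rb'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_xml_encoding(raw: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    if raw.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 BOM if present
        raw = raw[3:]
    match = _XML_DECL.match(raw)
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    sample = b'<?xml version="1.0" encoding="Shift_JIS"?><root/>'
    print(declared_xml_encoding(sample))
```

Reading only the first few hundred bytes of each file in binary mode is enough, since the declaration must appear on the first line.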

EUC-JP and Other Japanese Encodings

Besides Shift_JIS, you may encounter EUC-JP (Unix/Linux in Japan) or ISO-2022-JP (email):

| Encoding | Environment | Characteristics |
|---|---|---|
| Shift_JIS (CP932) | Windows JP | Most common in legacy |
| EUC-JP | Unix, Linux JP | Server, mainframe |
| ISO-2022-JP | Email | Escape sequences |
| UTF-8 | Modern | Unicode standard |

If mojibake persists after trying Shift_JIS, try EUC-JP and ISO-2022-JP. Tools like file (Unix) or chardet (Python) can help detect encoding:

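import chardet  # third-party: pip install chardet

# Read raw bytes; detection works on bytes, not on decoded text
with open('program.pco', 'rb') as f:
    result = chardet.detect(f.read())
# Example output only; the reported confidence varies with file size and content
print(result)  # e.g. {'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}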
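
When chardet is not available, the manual "try each encoding" approach can be sketched with the standard library alone. The candidate list and the name `try_decodings` are assumptions for illustration; the function decodes strictly and records which candidates succeed.

```python
CANDIDATES = ("utf-8", "cp932", "euc_jp", "iso2022_jp")


def try_decodings(data: bytes, candidates=CANDIDATES) -> dict:
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = data.decode(encoding)  # strict: raises on invalid bytes
        except UnicodeDecodeError:
            results[encoding] = None
    return results


if __name__ == "__main__":
    data = "会計システム".encode("euc_jp")
    for encoding, text in try_decodings(data).items():
        print(encoding, "->", text if text is not None else "(cannot decode)")
```

More than one candidate can decode without raising (some EUC-JP byte sequences are also valid CP932, for instance), so a person still needs to read the results and pick the one that is real Japanese.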

Case Study: Legacy Migration Projects

In many VB.NET + COBOL migration projects from Japan, the source code typically uses Shift_JIS (codepage 932):

  • VB.NET: Hundreds of .vb files, .xml, .ini configs
  • COBOL: Many .pco files, COPY .COB, .CBL files
  • Note: Deploy docs or README often mention encoding

When developers open files in Cursor/VSCode with the default UTF-8, Japanese comments and messages show mojibake. Common solutions:

  1. Add .vscode/settings.json with "files.encoding": "shiftjis" for the source directory.
  2. Document encoding in project documentation.
  3. New files (Markdown, new configs): use UTF-8.
  4. Legacy files: open with Shift_JIS, consider converting during migration.
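
Before choosing settings, it can help to list which files are actually not UTF-8. The following is a rough sketch, assuming that any file with one of the legacy suffixes that fails strict UTF-8 decoding is probably Shift_JIS; the suffix set and function names are illustrative, not taken from a specific project.

```python
from pathlib import Path

# Assumed set of legacy file types; adjust to the project at hand.
LEGACY_SUFFIXES = {".pco", ".cob", ".cbl", ".vb", ".ini", ".xml"}


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8_files(root, suffixes=LEGACY_SUFFIXES):
    """Return files under root whose bytes are not valid UTF-8 (likely legacy-encoded)."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            if not is_valid_utf8(path.read_bytes()):
                hits.append(path)
    return hits


if __name__ == "__main__":
    for path in find_non_utf8_files("."):
        print(path)
```

Pure-ASCII files pass the UTF-8 check, so they are not reported; they read the same under either encoding.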

Encoding Tools

iconv (command line)

Convert encoding via terminal:

# Shift_JIS → UTF-8
# (use -f CP932 if the file contains Windows-only characters such as ① or ㈱)
iconv -f SHIFT_JIS -t UTF-8 input.pco > output_utf8.pco

# UTF-8 → Shift_JIS
iconv -f UTF-8 -t SHIFT_JIS input_utf8.pco > output_sjis.pco

PowerShell (Windows)

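# Read the Shift_JIS file. "-Encoding Default" means the system ANSI code page,
# which is CP932 only on Japanese Windows (Windows PowerShell 5.1).
# On PowerShell 6.2 or later, pass the code page directly: -Encoding 932
$content = Get-Content -Path "program.pco" -Encoding Default

# Write as UTF-8 (Windows PowerShell 5.1 adds a BOM; PowerShell 7 does not)
$content | Out-File -FilePath "program_utf8.pco" -Encoding UTF8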

Python Script

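# convert_encoding.py
def convert_file(input_path, output_path, from_enc='cp932', to_enc='utf-8'):
    """Re-encode a text file; cp932 is Python's name for Windows Shift_JIS."""
    # newline='' keeps the original line endings (CRLF stays CRLF)
    with open(input_path, 'r', encoding=from_enc, newline='') as f:
        content = f.read()
    with open(output_path, 'w', encoding=to_enc, newline='') as f:
        f.write(content)

convert_file('program.pco', 'program_utf8.pco')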
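
A conversion can be sanity-checked by converting back and comparing bytes. This is a hedged sketch, not part of the script above: `converts_losslessly` is a hypothetical helper, and it assumes the source is CP932 and the target is UTF-8.

```python
def converts_losslessly(original: bytes, from_enc: str = "cp932", to_enc: str = "utf-8") -> bool:
    """True if original decodes cleanly and survives a round trip through to_enc."""
    try:
        text = original.decode(from_enc)   # strict decode of the legacy bytes
        converted = text.encode(to_enc)    # what the converted file would contain
        restored = converted.decode(to_enc).encode(from_enc)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return restored == original


if __name__ == "__main__":
    sample = "*     プログラム名 : データチェック\r\n".encode("cp932")
    print(converts_losslessly(sample))
```

A False result is a prompt to inspect the file by hand rather than proof of data loss: CP932 assigns a few characters to more than one byte sequence, so such files may not round-trip byte for byte.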

Strategy for Legacy Codebases

Option 1: Keep Shift_JIS

Suitable when:

  • You don't want to modify many files.
  • Build/deploy still relies on Shift_JIS.
  • Tools/IDE have good encoding support.

Editor config:

{
  "files.encoding": "shiftjis"
}

Option 2: Convert to UTF-8

Suitable when:

  • You want to standardize on UTF-8.
  • Using Git, CI/CD, cloud.
  • Working with a multilingual team.

Process:

  1. Convert files one by one: iconv or "Save with Encoding".
  2. Ensure build still works (compiler supports UTF-8).
  3. Update .gitattributes if needed.
  4. Inform the team about the encoding change.
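
For more than a handful of files, step 1 can be scripted. The sketch below is one possible approach under stated assumptions: sources are CP932, files that already decode as UTF-8 are left alone, and a `.bak` copy is kept next to each converted file. The function name `convert_tree` and the default suffixes are illustrative.

```python
import shutil
from pathlib import Path


def convert_tree(root, suffixes=(".pco", ".vb"), from_enc="cp932", to_enc="utf-8"):
    """Convert matching files in place, keeping a .bak copy of each original.

    Returns (converted, skipped): files rewritten, and files left untouched
    because they were already valid to_enc or failed to decode as from_enc.
    """
    converted, skipped = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        raw = path.read_bytes()
        try:
            raw.decode(to_enc)           # already in the target encoding?
            skipped.append(path)
            continue
        except UnicodeDecodeError:
            pass
        try:
            text = raw.decode(from_enc)  # strict: refuse bytes that do not fit
        except UnicodeDecodeError:
            skipped.append(path)
            continue
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_bytes(text.encode(to_enc))  # byte I/O keeps line endings as-is
        converted.append(path)
    return converted, skipped
```

Run it on a clean Git working tree so the result can be reviewed with git diff and reverted if the build breaks; the `.bak` files should be deleted or ignored before committing.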

Option 3: Hybrid

  • New files: UTF-8.
  • Legacy files: Shift_JIS until converted.
  • Use files.autoGuessEncoding and verify important files manually.

.gitattributes and Encoding

To have Git handle encoding correctly:

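# Files already converted to UTF-8: plain text, no re-encoding needed
*.vb text

# Files kept as Shift_JIS in the working tree: Git stores them as UTF-8
# internally and converts on checkout and commit
*.pco text working-tree-encoding=Shift_JIS

# Optional: custom diff driver (must be defined as diff.cobol.* in Git config)
*.pco diff=cobol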

Note: Converting encoding can cause large diffs. Separate the "convert encoding" commit from logic changes.

Mojibake Checklist

  • [ ] Confirm the file's original encoding (Shift_JIS, EUC-JP, UTF-8).
  • [ ] Try "Reopen with Encoding" with the appropriate encoding.
  • [ ] Verify the content displays correctly.
  • [ ] Decide: keep the old encoding or convert to UTF-8.
  • [ ] If converting: use "Save with Encoding" or iconv.
  • [ ] Update settings.json for the project if needed.
  • [ ] Test build and run the application after changing encoding.
  • [ ] Update documentation/README about encoding in the project.

UTF-8 BOM: When Do You Need It?

BOM (Byte Order Mark) is a special byte sequence at the start of a file to indicate encoding. For UTF-8:

  • UTF-8 with BOM: the file starts with the bytes EF BB BF; some Windows tools require it.
  • UTF-8 without BOM: standard for web, Git, and many compilers.

| Context | Recommendation |
|---|---|
| Web, API, JSON | UTF-8 without BOM |
| Windows batch scripts | ANSI (system code page); a BOM can break the first line in cmd.exe |
| C#, VB.NET source | Usually UTF-8 without BOM |
| Excel CSV exported from JP | Shift_JIS or UTF-8 BOM |
In Cursor/VSCode, "Save with Encoding" lets you choose "UTF-8 with BOM" or "UTF-8" (without BOM). For source code, prefer UTF-8 without BOM.
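
Whether a file carries a BOM can be checked from its first three bytes. A minimal sketch using the standard library constant `codecs.BOM_UTF8`; the two helper names are illustrative.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


if __name__ == "__main__":
    with_bom = "あ".encode("utf-8-sig")  # the utf-8-sig codec writes a BOM
    print(has_utf8_bom(with_bom), strip_utf8_bom(with_bom) == "あ".encode("utf-8"))
```

When reading text, Python's `utf-8-sig` codec accepts files with or without a BOM and drops it if present, which is convenient for scripts that process files from mixed sources.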

Impact on Build and Runtime

COBOL Compiler (Pro*COBOL)

Pro*COBOL typically expects source in the system encoding. On Japanese Windows, the default is Shift_JIS. If converting to UTF-8, check whether the compiler supports NCHARSET(UTF8) or equivalent.

VB.NET / MSBuild

VB.NET and MSBuild support various encodings. UTF-8 .vb files usually build fine. Be careful with *.resx when changing encoding, as it can affect resources.

Database and Config

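Config files such as App.config and *.xml are read by the .NET runtime, so the bytes must match the encoding the loader expects. XmlDocument and XDocument take the encoding from the XML declaration unless you pass a reader with an explicit encoding, so when you convert an XML file to UTF-8, also change encoding="Shift_JIS" in its first line to encoding="UTF-8".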

FAQ

Q: Do I need to install the Japanese language pack?
A: Not necessarily. Mojibake is due to encoding mismatch, not missing fonts or language. Fonts like "MS Gothic", "Yu Gothic" are often available or can be installed via the OS.

Q: Does converting to UTF-8 corrupt the file?
A: No, if done correctly. Always back up the files before a bulk conversion.

Q: Git diff is full of encoding changes and hard to read?
A: Use a separate commit for encoding conversion, and another for logic changes. Use git diff -w to ignore whitespace when reviewing.

Q: How do I know what encoding a file uses?
A: Try "Reopen with Encoding": if you pick the right one, the text displays correctly. Or use file -i (file -I on macOS) or chardet (Python).

Q: Is iconv available on macOS/Linux?
A: Yes. On Windows you can use WSL, Git Bash, or PowerShell as in the examples above.

Troubleshooting: Common Errors

"Unable to decode"

The editor cannot recognize the encoding. Try in order: Shift_JIS, EUC-JP, Windows-1252 (for Latin), UTF-8.

Partially correct, partially wrong

The file may use mixed encodings (e.g. some lines saved as Shift_JIS and others as UTF-8 after edits in different tools). Use a hex editor to find where the encoding changes and fix it manually or with a script.
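
Locating the switch points can be scripted line by line. This sketch assumes the file mixes UTF-8 and CP932 only; `classify_lines` is a hypothetical helper that labels each line with the first encoding that decodes it strictly.

```python
def classify_lines(data: bytes, encodings=("utf-8", "cp932")):
    """Label each line with the first encoding that decodes it strictly."""
    labels = []
    for line in data.splitlines():
        if line.isascii():
            labels.append("ascii")  # reads the same under both candidates
            continue
        for encoding in encodings:
            try:
                line.decode(encoding)
                labels.append(encoding)
                break
            except UnicodeDecodeError:
                continue
        else:
            labels.append("unknown")  # none of the candidates fit
    return labels


if __name__ == "__main__":
    sample = (b"MOVE A TO B.\n"
              + "* 日本語\n".encode("utf-8")
              + "* 日本語\n".encode("cp932"))
    for number, label in enumerate(classify_lines(sample), start=1):
        print(number, label)
```

Splitting on line breaks is safe here because the bytes 0x0A and 0x0D never occur inside a multi-byte Shift_JIS or UTF-8 character. UTF-8 is tried first since Shift_JIS text rarely happens to be valid UTF-8.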

Build fails after converting

The compiler or runtime still expects the old encoding. Check compiler docs, or keep the original encoding until the tool supports UTF-8.

Japanese displays as squares or ?

Missing CJK font support. Install Japanese fonts (MS Gothic, Yu Gothic, Noto Sans JP) for the OS and IDE.

Summary

Mojibake occurs when the file's encoding differs from the encoding the editor uses. For Japanese legacy source (COBOL, VB.NET), Shift_JIS files are often opened as UTF-8.

Quick fixes:

  1. Reopen with Encoding → select Shift_JIS (CP932).
  2. Save with Encoding → UTF-8 (if standardizing).
  3. Configure .vscode/settings.json for the project.

Best practice: Use UTF-8 for all new files; convert legacy files gradually when possible. Always test build and runtime after changing encoding. Document encoding in the README or project docs so teammates avoid mojibake when opening files for the first time.